Searching through combined vector embeddings

Let’s say I have posts (like Facebook), and every post can have zero, one, or many comments. I want to enable the user to search semantically through all the comments and get the posts that are most relevant to the search.

I plan to do frequent web scraping of comments on posts. I am trying to design the best approach, and I see two options at the moment:

  1. go to a post, combine the text of all the comments into one string via concatenation, then create a single vector embedding from that text and store it in Pinecone
  2. go to a post, create one vector embedding for each comment, and store them all in Pinecone

With the first approach, when the user searches for something, they search through all the comments in one go, and it’s easy to find the most relevant post. However, every time new comments arrive, concatenating them with the old ones, removing the old vector embedding, and creating a new one could be cumbersome.

With the second approach, once a new comment is posted, a new vector embedding is created and added to the vector store. However, when the user searches for something, the search doesn’t consider a post’s comments as a whole; it matches individual comments (in other words, it returns the most relevant comment rather than the most relevant post).

My question is: assuming the second approach is feasible, how can I search over vector embeddings combined on the fly (grouped by post ID), so that I get the best post instead of the best comment in isolation?

Hi @zlatko.suslevski, and welcome to the Pinecone forums!

Thanks for your question.

My mind immediately goes to your second proposal. I think it’s the more robust approach: if you attempt to concatenate a long comment thread into a single string, that approach breaks down pretty quickly as soon as you have an active discussion (the concatenated text keeps growing, and a single embedding over many comments dilutes the meaning of any one of them).

Conversely, indexing individual comments as you propose lets you use Metadata as the bridge between each comment’s text and the original post it belongs to.

For example, as the Metadata guide I linked shows, you could do something like this:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("pinecone-index")

# Each vector represents a single comment; the postID in its metadata
# ties the comment back to the post it came from
index.upsert(
  vectors=[
    {
      "id": "A",
      "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
      "metadata": {"postID": 3046, "category": "tech"}
    },
    {
      "id": "B",
      "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
      "metadata": {"postID": 201, "category": "humor"}
    }
  ]
)

If you do this during upsert, you’ll always know the postID of a comment when you retrieve it from Pinecone.
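For instance, here’s a minimal sketch of what retrieval could look like. The query vector below is just a placeholder matching the 8-dimensional toy vectors above; in practice you’d embed the user’s search text with the same model you used for the comments:

query_vector = [0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15]

# Ask for metadata back with each match so every result carries its postID
results = index.query(
  vector=query_vector,
  top_k=5,
  include_metadata=True,
)

for match in results.matches:
  print(match.id, match.score, match.metadata["postID"])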

Within your application code, you could then take the top N (3 or 5, say) results, perform a simple analysis to see which postID occurs most frequently, and use that one.
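Continuing the sketch above, that analysis could be as simple as tallying postIDs with collections.Counter (the names here are purely illustrative):

from collections import Counter

# Count how often each postID appears among the top matches,
# then pick the post whose comments matched the query most often
post_counts = Counter(match.metadata["postID"] for match in results.matches)
best_post_id, hits = post_counts.most_common(1)[0]
print(f"Most relevant post: {best_post_id} ({hits} matching comments)")

You could also weight each postID by its matches’ similarity scores instead of using a raw count, so a single highly relevant comment isn’t outvoted by several weak ones.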

I hope this helps, and let me know if you have any follow-up questions!

Best,
Zack

Thanks for your answer and the example, @ZacharyProser! It’s crystal clear from your explanation that the second approach is the one to go with. Great idea to rank the results by count after grouping them by postID!

One question, though: wouldn’t the “top N” parameter cut off the result set, ignoring the posts whose comments come later in the output? Does it work like LIMIT in SQL? If that’s the case, I’m afraid the group counts might be skewed because not all the data was processed, since I’m grouping by postID manually after the results have already been returned by Pinecone.