Let’s say I have posts (like Facebook), and every post can have zero, one or many comments. I want to enable the user to search semantically through all the comments and get the posts that are most relevant to the search.
I plan to do frequent web-scraping of comments on posts. I am trying to design the best approach and I see two at the moment:
- go to a post, combine the text of all its comments into one string by concatenation, then create a single vector embedding from that text and store it in Pinecone
- go to a post, create one vector embedding for each comment and store them all in Pinecone (both approaches are sketched below)
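Roughly what I have in mind for the two approaches, as a minimal sketch (I’m assuming the current Pinecone Python client; `embed_text`, the index name and the id scheme are placeholders for whatever embedding model and naming I end up using):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("posts-comments")  # hypothetical index name


def embed_text(text: str) -> list[float]:
    """Placeholder for the actual embedding model call."""
    raise NotImplementedError


def store_post_as_single_vector(post_id: str, comments: list[str]) -> None:
    # Approach 1: one vector per post, built from all comments concatenated.
    combined = "\n".join(comments)
    index.upsert(vectors=[{
        "id": f"post-{post_id}",
        "values": embed_text(combined),
        "metadata": {"post_id": post_id},
    }])


def store_comments_individually(post_id: str, comments: list[str]) -> None:
    # Approach 2: one vector per comment, tagged with the post it belongs to.
    index.upsert(vectors=[{
        "id": f"post-{post_id}-comment-{i}",
        "values": embed_text(comment),
        "metadata": {"post_id": post_id},
    } for i, comment in enumerate(comments)])
```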
With the first approach, a search runs against all of a post’s comments in one go, so it’s easy to find the most relevant post. However, every time new comments are scraped they have to be concatenated with the old ones, the old vector embedding removed and a new one created, which could be cumbersome.
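Concretely, every scraping pass would have to redo the whole post (again using the hypothetical helpers above):

```python
def refresh_post_vector(post_id: str, old_comments: list[str], new_comments: list[str]) -> None:
    # Re-concatenate everything and re-embed; upserting with the same id
    # replaces the previous vector, so no explicit delete is needed.
    store_post_as_single_vector(post_id, old_comments + new_comments)
```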
With the second approach, once a new comment is posted, a new vector embedding is created and added to the vector store. However, when the user searches for something, the search doesn’t consider a post’s comments as a whole but matches against individual comments (in other words, it returns the most relevant comment rather than the most relevant post).
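That is, a query comes back at the comment level, something like (same hypothetical helpers as above):

```python
def search_comments(query: str, top_k: int = 10):
    # Approach 2 search: each match is an individual comment vector,
    # so the best hit is the best comment, not the best post.
    results = index.query(
        vector=embed_text(query),
        top_k=top_k,
        include_metadata=True,
    )
    return results.matches  # each match has .id, .score and .metadata
```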
My question is: assuming the second approach is feasible, how can I search across vector embeddings combined on the fly (grouped by post id) so that I get the best post rather than the best comment in isolation?
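To make the desired result concrete, this is the post-level ranking I’m after, sketched naively by grouping comment matches by post id on the client side and aggregating their scores (I don’t know whether this is the right way to do it, hence the question; the aggregation rule here is just an example):

```python
from collections import defaultdict


def search_posts(query: str, top_k_comments: int = 50):
    # Query on comment vectors, then group matches by post_id and aggregate
    # scores per post, so the ranking is at the post level. Using max here,
    # but mean or sum would be equally plausible choices.
    results = index.query(
        vector=embed_text(query),
        top_k=top_k_comments,
        include_metadata=True,
    )
    scores_by_post = defaultdict(list)
    for match in results.matches:
        scores_by_post[match.metadata["post_id"]].append(match.score)

    ranked = sorted(
        ((post_id, max(scores)) for post_id, scores in scores_by_post.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked  # [(post_id, aggregated_score), ...], best post first
```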