How to improve similarity results with a large number of vectors

admin1 · July 10, 2023, 11:54am

I’m ingesting PDFs and storing them in Pinecone. When the vector set is small, similarity search results are fairly accurate. But as I get up to 600 PDFs, the similarity search, even with highly specific queries from a user, returns data from many vectors seemingly unrelated to the question. I’m guessing that as the number of vectors increases, more and more of them end up similar to any given query.

Is there a way to increase the chances of choosing the right vectors? I’ve thought of adding data, such as the name of the company, directly to the text of each chunk before it is embedded. Is there something people do to help with this situation?

admin1 · July 10, 2023, 12:21pm

To be clear, I do know about metadata and have included a set with each chunk uploaded. But I’m not sure how that would help the similarity search. Is preprocessing necessary to extract, say, the company name from the user’s query and then filter the similarity search on the metadata with that information? That seems less than optimal with the user free to type anything, including typos. Is there a similarity search on the metadata?
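
For illustration, the preprocessing idea could look roughly like the sketch below: fuzzy-match the words of the query against the list of company names already stored as metadata, so that typos still resolve to a known name, then turn the match into a metadata filter. The `company` metadata key, the `KNOWN_COMPANIES` list, and the 0.8 cutoff are all assumptions made up for this example. The `{"company": {"$eq": ...}}` shape follows Pinecone's metadata filter syntax, but the query call itself is left out.

```python
import difflib

# Hypothetical list of company names written to chunk metadata at ingestion time.
KNOWN_COMPANIES = ["Acme Corp", "Globex", "Initech", "Umbrella"]


def guess_company(query, companies=KNOWN_COMPANIES, cutoff=0.8):
    """Return the known company that best matches a span of the query, or None."""
    # Lowercase and drop trailing punctuation so "Initech?" still matches.
    words = [w.strip(".,?!") for w in query.lower().split()]
    best_name, best_score = None, cutoff
    for name in companies:
        n = len(name.split())
        # Compare the name against every n-word window of the query.
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            score = difflib.SequenceMatcher(None, window, name.lower()).ratio()
            if score >= best_score:
                best_name, best_score = name, score
    return best_name


def build_filter(query):
    """Build a metadata filter dict, or None when no company is recognized."""
    company = guess_company(query)
    return {"company": {"$eq": company}} if company else None
```

When `build_filter` returns `None`, the query would simply run unfiltered, so a question that names no company behaves the same as before.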

dra · July 11, 2023, 4:18am

Hi @admin1

Thank you for raising an interesting topic.
What similarity scores are you getting for those unrelated results? From what I’ve verified, a score of 0.83 or higher indicates a reasonably high degree of similarity.
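
As a rough sketch of how that threshold could be applied, the function below drops matches that score under a cutoff before they reach the prompt. It assumes each match is a dict with `id` and `score` keys, which mirrors the shape of a Pinecone query result but is built by hand here. The 0.83 default is just the value from my own checks; the right cutoff will likely depend on the embedding model and the distance metric.

```python
def filter_by_score(matches, min_score=0.83):
    """Keep only matches at or above min_score, highest score first."""
    kept = [m for m in matches if m["score"] >= min_score]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


# Hand-written sample data standing in for a query response.
sample_matches = [
    {"id": "chunk-12", "score": 0.79},
    {"id": "chunk-7", "score": 0.91},
    {"id": "chunk-3", "score": 0.84},
]
relevant = filter_by_score(sample_matches)
```

If nothing clears the cutoff, the list comes back empty, which can be a useful signal to tell the user that no relevant passage was found.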