I’m ingesting PDFs and storing their embeddings in Pinecone. When the vector set is small, similarity search results are fairly accurate. But once I get up to around 600 PDFs, the similarity search, even with highly specific user queries, returns data from many vectors that seem unrelated to the question. I’m guessing that as the number of vectors increases, more and more of them end up looking similar to any given query.
Is there a way to increase the chances of retrieving the right vectors? I’ve thought of adding data, such as the name of the company, directly to the text of each chunk before it is embedded. Is there something people commonly do to help with this situation?
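
To make the idea concrete, here is a minimal sketch of what I have in mind, written in plain Python with no Pinecone calls at all. Everything in it is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, `build_index` and `search` are hypothetical helpers, and the company names and chunk texts are made up. It shows two things together: prepending the company name to each chunk before embedding, and keeping the company as separate metadata so candidates can be narrowed before ranking by similarity.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Embed each chunk with the company name prepended, and keep the
    company as metadata alongside the vector (hypothetical layout)."""
    index = []
    for chunk in chunks:
        text_to_embed = f"{chunk['company']}: {chunk['text']}"
        index.append({
            "vector": toy_embed(text_to_embed),
            "company": chunk["company"],
            "text": chunk["text"],
        })
    return index


def search(index, query, top_k=2, company=None):
    """Optionally restrict candidates by company, then rank by similarity."""
    candidates = [r for r in index if company is None or r["company"] == company]
    q = toy_embed(query)
    ranked = sorted(candidates, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:top_k]


# Made-up sample data: two companies with identical wording in one chunk.
chunks = [
    {"company": "Acme", "text": "annual revenue grew ten percent"},
    {"company": "Globex", "text": "annual revenue grew ten percent"},
    {"company": "Acme", "text": "new office opened in Berlin"},
]
index = build_index(chunks)
results = search(index, "annual revenue", top_k=2, company="Acme")
```

In this toy data the Acme and Globex revenue chunks score identically for the query "annual revenue", which is roughly the problem I’m seeing at scale. Restricting the candidates to one company before ranking is the only thing that separates them here. Whether this carries over to a real embedding model and a real index is exactly what I’m unsure about.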