Does merely including metadata (source, keyword, summary) improve the performance of a semantic search, or would I still need to come up with a filter? If the latter, is there a way to come up with the filters dynamically (generated from user’s query, for example)?
I am using Pinecone to store my vectors and I am looking to improve accuracy of the search.
Additionally, what improvements can be made to further enhance the accuracy? I have heard about hybrid search, but could never find a resource that evaluates its accuracy.
Metadata filtering is applied before the vector search (citation here).
From what I understand, simply adding metadata will not “improve” the cosine (or other) distance between your query vector and anything stored in your pod. Filtering simply reduces the set of vectors to search against, which can result in faster queries in a sufficiently large vector space. So in that respect it can help.
Improving accuracy really comes down to how you store the embedded data. Like anything LLM/ML related, it’s garbage in, garbage out. You may not be chunking data with sufficient overlap, or you might have messy data that doesn’t semantically match your query. Semantic similarity relies on the prompt being pretty similar to the content being searched; overcoming the ambiguity of language is a big challenge.
Example Query: “I like dogs - tell me about dogs”
Vectorspace: 10,000 chunks of text about all types of animals.
Chunk 1: “Dogs are man’s best friend, there are many types of dogs like [cont.]”
Chunk 2 (from the same document but much further down the page): “They also tend to eat kibble and get the zoomies at night”.
Result: Chunk 1 is semantically a close match. Chunk 2, not so much. What is “they”? It’s not quite this simple in practice, but in general you may know that two chunks are related while the index doesn’t. That is where saving chunks with metadata can help: you could filter on just “animal”: “dog”, which could improve results.
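For illustration, a minimal sketch of that kind of filtered Pinecone query. The index name, the “animal” metadata field, and the embed() helper are all assumptions, and the calls follow the older pinecone-client / pre-1.0 openai SDK style:

```python
import openai
import pinecone

# Older pinecone-client style; newer clients use `Pinecone(api_key=...)` instead.
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
index = pinecone.Index("animals-demo")  # hypothetical index name

def embed(text: str) -> list[float]:
    # Hypothetical helper; any embedding model works here (pre-1.0 openai SDK shown).
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return resp["data"][0]["embedding"]

query = "I like dogs - tell me about dogs"
results = index.query(
    vector=embed(query),
    top_k=5,
    filter={"animal": {"$eq": "dog"}},  # only search vectors tagged as animal=dog
    include_metadata=True,
)
```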
The preparation of data is pretty important for large vector spaces, in my experience. Text chunking is a tricky task, and using libraries to do it abstracts a lot away at the cost of observability. Could this be related to your issue?
You could use an LLM to extract keywords from a chunk or document prior to insertion, and dynamically extract keywords from the query as well (given you know which keys do/don’t already exist in the vector database), then apply an $in filter on the Pinecone query, along the lines of the sketch below.
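A rough sketch of that idea, reusing index and embed() from the snippet above; the prompt, model name, and the “keywords” metadata field on each chunk are assumptions:

```python
import json

def extract_keywords(user_query: str) -> list[str]:
    # Hypothetical prompt/model; you'd want to validate the parsed output in practice.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f'Return a JSON list of 1-5 lowercase keywords for: "{user_query}"',
        }],
    )
    return json.loads(resp["choices"][0]["message"]["content"])

user_query = "I like dogs - tell me about dogs"
keywords = extract_keywords(user_query)  # e.g. ["dogs"]

results = index.query(
    vector=embed(user_query),
    top_k=5,
    filter={"keywords": {"$in": keywords}},  # match chunks tagged with any of these
    include_metadata=True,
)
```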
I don’t have any direct experience with the sparse_vectors part of Pinecone, so I cannot comment on that. Hope this helps, even though you may already know all of this information.
Adding to what @tim said, I can confirm that simply adding metadata to your vectors will have no impact on your retrieval. Retrieval is purely based on the vectors you have added.
As for improving your results, there are many things you can do. In order from simplest and usually most effective to more complex (and often less impactful), you could try:
Try a different embedding model. OpenAI’s text-embedding-ada-002 is great and easy to use, but it actually isn’t the best-performing retrieval/embedding model (by a long shot). Check out this leaderboard — ada-002 is right down at position 14 (at the time of writing).
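For example, swapping in an open-source model from that leaderboard via sentence-transformers might look like the sketch below; the specific model is just an illustration (e5 models expect "query:"/"passage:" prefixes at encode time).

```python
from sentence_transformers import SentenceTransformer

# One example model from the MTEB leaderboard; pick whichever ranks well for retrieval.
model = SentenceTransformer("intfloat/e5-large-v2")

query_vec = model.encode("query: how do I train my dog?", normalize_embeddings=True)
doc_vecs = model.encode(
    ["passage: Dogs are man's best friend ...", "passage: They also eat kibble ..."],
    normalize_embeddings=True,
)
```

Note that a different model usually means a different vector dimensionality, so you would need to re-embed your data into a new Pinecone index.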
Chunking methodology — my rule of thumb is to start with ~300-token chunks and a ~20-40 token overlap (see the sketch after the two points below). That is just a starting point, though. There are many things to consider with chunk sizes; primarily think about:
(1) Minimizing chunk size, i.e. minimizing the amount of text you feed into the LLM’s context window; if you put in too much text, the LLM performs worse (see this article for more info).
(2) Do the chunks make sense to you as individual pieces of text? Do they contain enough information to be meaningful, but not so much that they could carry many possible meanings? Remember that each chunk must be compressed into a single meaningful vector embedding.
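As a concrete starting point, here is a token-based chunker along those lines, using tiktoken’s cl100k_base (the tokenizer used by ada-002); the sizes are just the rule of thumb above, not fixed answers:

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 30) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    # Slide a ~300-token window over the text, overlapping ~30 tokens each time.
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```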
You can return more contexts (e.g. top_k=30) and then rerank them with a reranking model like this one from Cohere; you then pass the top 3 (for example) to your LLM as context.
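A sketch of that over-retrieve-then-rerank step with Cohere’s Python client, reusing index and embed() from earlier; the rerank model name and the “text” metadata field are assumptions (check Cohere’s docs for current model names):

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

query = "how do I stop my dog barking at night?"
results = index.query(vector=embed(query), top_k=30, include_metadata=True)
docs = [m.metadata["text"] for m in results.matches]  # assumes chunk text stored in metadata

reranked = co.rerank(model="rerank-english-v2.0", query=query, documents=docs, top_n=3)
top_contexts = [docs[r.index] for r in reranked.results]  # pass these to the LLM
```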
Hybrid search — as you mentioned, hybrid search is another option. The idea is that dense vectors (i.e. ada-002 and the other models in the leaderboard linked above) are great at capturing semantic meaning, but perform less well on words the models haven’t seen often or on queries that rely on specific keywords. For example, if I search for “how do I perform backpropagation for neural networks in TF?”, a semantic search would identify that I want to do backprop for neural networks using a machine learning library — it might not care too much about which ML library (although it will likely lean towards TF, i.e. TensorFlow). So we may end up with results about the same technique in other ML libraries like PyTorch or Flax. By using hybrid search we can give greater importance to keywords while still considering semantic meaning, so we should be able to tune it to return TF-specific instructions only. You can see an example of hybrid search using BM25 here — I used a multi-modal dataset in that example, so just ignore the image part and replace it with your current dense text-embedding setup.
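As a sketch of how that can look on Pinecone (assuming an index created with the dotproduct metric and the pinecone-text package for BM25; corpus, embed() and index are as in the earlier snippets, and alpha is the dense/sparse weighting):

```python
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # corpus = list of your chunk texts

def hybrid_scale(dense, sparse, alpha: float):
    # Convex combination: alpha=1.0 is pure dense/semantic, alpha=0.0 is pure keyword.
    return (
        [v * alpha for v in dense],
        {"indices": sparse["indices"], "values": [v * (1 - alpha) for v in sparse["values"]]},
    )

query = "how do I perform backpropagation for neural networks in TF?"
dense_q, sparse_q = hybrid_scale(embed(query), bm25.encode_queries(query), alpha=0.3)

results = index.query(vector=dense_q, sparse_vector=sparse_q, top_k=5, include_metadata=True)
```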
Another thing you can do is ask your LLM (or agent, if you’re using one) to generate multiple queries to search with rather than just one. Say something like “write three search queries that tackle the user’s query from different perspectives”. Once you retrieve these three sets of results, you can merge them and use the reranking model mentioned above to identify the most relevant contexts.
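Sketched out, reusing index, embed() and the Cohere client co from above; the prompt and model are assumptions:

```python
def generate_queries(user_query: str, n: int = 3) -> list[str]:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write {n} search queries, one per line, that tackle this "
                       f"question from different perspectives: {user_query}",
        }],
    )
    return resp["choices"][0]["message"]["content"].strip().splitlines()

user_query = "how do I stop my dog barking at night?"
seen, merged = set(), []
for q in generate_queries(user_query):
    for m in index.query(vector=embed(q), top_k=10, include_metadata=True).matches:
        if m.id not in seen:  # de-duplicate across the merged result sets
            seen.add(m.id)
            merged.append(m.metadata["text"])

reranked = co.rerank(model="rerank-english-v2.0", query=user_query, documents=merged, top_n=3)
```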
If you want to start using metadata filtering with search (this helps narrow down the search space), you could try LangChain’s self-querying retriever.
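A minimal sketch with the older, pre-split langchain package (module paths have since moved); the metadata field, descriptions, and “text” key are made up, and it also needs the lark package installed:

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.vectorstores import Pinecone as PineconeStore

# Wrap the existing Pinecone index; "text" is the metadata key holding chunk text.
vectorstore = PineconeStore(index, OpenAIEmbeddings().embed_query, "text")

retriever = SelfQueryRetriever.from_llm(
    llm=OpenAI(temperature=0),
    vectorstore=vectorstore,
    document_contents="Short passages about animals",  # hypothetical description
    metadata_field_info=[
        AttributeInfo(name="animal", description="Which animal the chunk is about", type="string"),
    ],
)
docs = retriever.get_relevant_documents("tell me about dogs")
```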
There are more complicated, expensive things you can try too, but I’d try all of the above first. Other options: