Metadata filtering is applied before the vector search (citation here).
From what I understand, simply adding metadata will not “improve” the cosine (or other) distance between your query vector and anything stored in your pod. It only narrows the set of vectors that are searched, which can make queries faster in a sufficiently large vector space. In that sense it can help.
Improving accuracy really comes down to how you store the embedded data. Like anything LLM/ML related - garbage in, garbage out. You may not be chunking data with sufficient overlap, or you might have messy data that does not semantically match your query. Semantic similarity relies on the prompt being pretty similar to the content being searched, and overcoming the ambiguity of language is a big challenge.
Example Query: “I like dogs - tell me about dogs”
Vectorspace: 10,000 chunks of text about all types of animals.
Chunk 1: “Dogs are man’s best friend, there are many types of dogs like [cont.]”
Chunk 2 (from the same document but much further down the page): “They also tend to eat kibble and get the zoomies at night”.
Result: Chunk 1 is semantically a close match. Chunk 2, not so much. What is “they”? It’s not quite this simple, but in general you may know that two chunks are related while the index doesn’t. That is where saving chunks with metadata can help. You could filter on just “animal”: “dog”, which could improve results.
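To make the dog example concrete, here is a rough sketch in plain Python (no Pinecone involved). The 2-d vectors, ids and the `search` helper are all made up for illustration; real embeddings would come from your embedding model. It only shows the idea: the filter changes which chunks are candidates, not how any individual chunk scores.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, chosen by hand for the example)
chunks = [
    {"id": "dog-1", "vector": [0.9, 0.1], "metadata": {"animal": "dog"}},  # "Dogs are man's best friend..."
    {"id": "dog-2", "vector": [0.3, 0.7], "metadata": {"animal": "dog"}},  # "They also tend to eat kibble..."
    {"id": "cat-1", "vector": [0.6, 0.4], "metadata": {"animal": "cat"}},
]

def search(query_vector, chunks, top_k, metadata_filter=None):
    """Filter on exact metadata matches first, then rank what is left by cosine similarity."""
    candidates = chunks
    if metadata_filter:
        candidates = [
            c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in metadata_filter.items())
        ]
    scored = [(c["id"], cosine(query_vector, c["vector"])) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

query = [1.0, 0.0]  # stands in for the embedded dog query
unfiltered = search(query, chunks, top_k=2)
filtered = search(query, chunks, top_k=2, metadata_filter={"animal": "dog"})
```

Without the filter the cat chunk outranks the vague “they” chunk. With the filter the second dog chunk makes it into the top 2, even though its own score has not changed at all.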
In my experience, preparing the data is pretty important for large vector spaces. Text chunking is a tricky task, and libraries that do it for you abstract a lot away at the cost of observability. Could this be related to your issue?
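If you want to see exactly what your chunks look like, a hand-rolled chunker is not much code. This is only a minimal sketch that splits on whitespace and counts words; `chunk_words`, its defaults and the word-based sizing are my own assumptions, and a real pipeline would more likely count tokens or split on sentences.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Printing the output for a document you know well is a quick way to spot chunks like the “they” one above, where the subject was left behind in an earlier chunk.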
You could possibly use an LLM to extract keywords for a chunk or document prior to insertion and store them as metadata. You could then extract keywords from the query dynamically as well and, given you know which keys do/don’t already exist in the vector database, do a $in filter on the Pinecone query.
I don’t have any direct experience with the sparse_vectors part of Pinecone, so I cannot comment on that. Hope this helps, even though you may already know all of this information.
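Roughly what I have in mind, as a sketch only. `extract_keywords` is a crude stand-in for the LLM call (it just drops a few stopwords), and the `keywords` metadata field and the `known_keywords` set are names I made up. The only Pinecone-specific part is the shape of the filter dict with `$in`.

```python
import re

# Tiny stopword list, just enough for the example query
STOPWORDS = {"i", "like", "tell", "me", "about", "the", "a", "an", "and", "of"}

def extract_keywords(text):
    """Stand-in for an LLM keyword extractor: lowercase words minus stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_keyword_filter(query_keywords, known_keywords, field="keywords"):
    """Keep only keywords that already exist in the index and wrap them in a $in filter."""
    matched = sorted({k.lower() for k in query_keywords} & {k.lower() for k in known_keywords})
    if not matched:
        return None  # nothing to filter on, so fall back to an unfiltered query
    return {field: {"$in": matched}}

# Keywords you stored as metadata at insertion time (hypothetical values)
known_keywords = {"dogs", "cats", "kibble"}

query_text = "I like dogs - tell me about dogs"
metadata_filter = build_keyword_filter(extract_keywords(query_text), known_keywords)
# metadata_filter is {"keywords": {"$in": ["dogs"]}}
```

You would then pass the result along with the query, something like `index.query(vector=query_embedding, top_k=5, filter=metadata_filter, include_metadata=True)`, where `index` and `query_embedding` come from your own setup. Returning `None` when nothing matches matters, because filtering on keywords that don’t exist in the index would just give you zero results.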