I have embedded around 500,000 characters of text. However, when I query it, the results do not contain anything relevant. For example, I have indexed the Wikipedia page for Beatrice Deloitte Davis, which definitely gives 24 May as the date she died.
Even if I query Pinecone with topK=100, it still does not return that text. With topK=1000, it does. How do I fix this so that it returns relevant data for a lower topK (for example, 5-15)?
Unfortunately, as with many things in software/tech, the answer here is going to be “it depends”.
It depends on what embedding model you’re using and how well suited it is to the data you’re indexing, i.e. how well it preserves the “meaning” of the data.
It also depends on what your query is, where it comes from, and how closely it matches the source content. Semantic search doesn’t just look for the query to match some snippet of the content; it measures how close the query’s embedding is to the embedding of each indexed piece as a whole, so a small detail inside a large piece of text can easily get drowned out. In some applications, it may make sense to pre-process the user query into a search string that is more likely to match the right content. For example, in some cases I’ve seen improved search performance when using an LLM to first distill a “topic” or “subject” from the query, and then using that subject to enrich the user query.
And it depends on how the source data is split into meaningful, searchable pieces, typically called chunking, which is itself an entire developing field of study. 500,000 characters probably comes out to somewhere around 100k-125k tokens (at a rough 4 characters per token for English text), which is far too large to preserve meaning effectively unless it is split into much smaller pieces. Most chunking strategies will suggest a chunk size somewhere in the 300 to 500 token range. Much has already been written about chunking strategies, but here is a good place to start to learn more: Chunking Strategies for LLM Applications | Pinecone.
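To make the chunking idea a bit more concrete, here is a rough sketch of a fixed-size chunker with overlap. It is only an illustration under a few assumptions: the function name `chunk_text` and its parameters are made up for this example, and it counts whitespace-separated words as a stand-in for tokens. In a real pipeline you would probably count tokens with your embedding model's own tokenizer and split on sentence or paragraph boundaries instead.

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into overlapping chunks of at most `chunk_size` words.

    Words are used here as a crude approximation of tokens (an assumption
    made for this sketch, not how any particular embedding model counts).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and upserted as its own vector, so a query about one date only has to match a few hundred words rather than a whole article. The overlap is there so that a sentence cut at a chunk boundary still appears intact in one of the two neighbouring chunks.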
Hope this helps!
Thanks for the thorough response. Can you elaborate on “first distill a ‘topic’ or ‘subject’ from the query”? For example, based on my question, if I query “What happened on 24 May?” with topK=100, the returned data does not contain the date at all. However, for the same query with topK=1000, the date is in one of the results.
Also, thanks for mentioning the chunking strategy. I will look into that further and see if it helps.
Yeah, for example, let’s say you had a customer support chatbot for a product, and you’ve indexed all of the product content, specifications, etc. in Pinecone.
If a user asked a question like, “How large is it?”, then what they’re asking for are the product dimensions or size.
The problem is that if you just directly take the user query “how large is it” and run it against your vector embeddings, you may or may not find the piece of content where the dimensions are specified.
Alternatively, you can use an LLM to discern the subject matter of the question, with a prompt consisting of instructions like the following, but tailored for your use case:
You are an expert in discerning the primary subject of a question based on a previous conversation history.
Summarize the primary subject matter of the question using 20 or fewer words that identify and describe the subject, including relevant details.
Question: How large is it?
Summarize the subject matter of the question.
This might output something like: “dimensions of the product”.
If you perform your vector query using this string, there’s a higher likelihood of finding the piece of content that contains the answer to what the user is really asking about.
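As a rough sketch of how that could be wired up in code: the snippet below builds the prompt from the question and any conversation history, passes it to an LLM, and combines the returned subject with the original question. Everything named here (`build_subject_prompt`, `enrich_query`, and the `llm` parameter) is hypothetical and only for illustration; `llm` is assumed to be any function you supply that takes a prompt string and returns the model's text reply, so no particular LLM or Pinecone client API is implied.

```python
from typing import Callable, Sequence

# Prompt wording follows the example instructions above; tailor it to your use case.
PROMPT_TEMPLATE = (
    "You are an expert in discerning the primary subject of a question "
    "based on a previous conversation history.\n"
    "Summarize the primary subject matter of the question using 20 or fewer "
    "words that identify and describe the subject, including relevant details.\n"
    "Conversation history:\n{history}\n"
    "Question: {question}\n"
    "Summarize the subject matter of the question."
)


def build_subject_prompt(question: str, history: Sequence[str] = ()) -> str:
    """Fill the prompt template with the question and any prior turns."""
    history_text = "\n".join(history) if history else "(none)"
    return PROMPT_TEMPLATE.format(history=history_text, question=question)


def enrich_query(
    question: str,
    llm: Callable[[str], str],  # hypothetical: your own wrapper around an LLM call
    history: Sequence[str] = (),
) -> str:
    """Return the question with the LLM-distilled subject appended."""
    subject = llm(build_subject_prompt(question, history)).strip().strip('"')
    # Fall back to the raw question if the model returned nothing usable.
    return f"{question} {subject}" if subject else question
```

You would then embed the string returned by `enrich_query` (or just the subject on its own, if that works better for your data) and use that embedding for the vector query instead of the raw user question.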
Oh gotcha! Thanks a lot for this!