Ok, I think I’m getting a better handle on what’s going on. Here are some suggestions and ideas that together may help.
Application context and misalignment with semantic search
Are you just building a semantic search application? Or are you also using OpenAI to generate a response on top of the returned chunks?
It looks like a lot of the data you are returning is numerical, which usually isn’t a great match for semantic search. Embedding models often tokenize numbers into arbitrary pieces, so numerical values don’t get represented well in the embedding space. You may need an embedding model that is specialized for financial text if you aim to ask a ton of queries along those lines.
However, if you want to stick with semantic search, there are ways to address this, which I’ll get into below.
Query Issues
A good way to diagnose a semantic search, RAG, or AI application is to ask: could a human do this with the same tools and context?
Let’s look at your query:
“Analyze the top 3 DeFi opportunities on ZKsync ranked by APY”
I’m assuming your Pinecone instance contains strings like the one you showed, where it’s a bunch of crypto stuff with money and APY data.
The desired outcome is to return the highest (?) APY opportunities on ZKsync.
In order to satisfy this query, we probably need the following:
- knowledge of what ZKsync, APY, and DeFi are
- an understanding of the dataset you have uploaded and what it is/where it comes from
- once that is achieved, we need to find all statements inside the vdb that mention APY and opportunities, and then order them from lowest to highest APY
- then, we need to return the highest ones to you
First, it’s important to point out that your embedding models might not even be parsing the important tokens (the chain name and numerical values) correctly, which could explain the “wrong” results.
The bigger issue, though, is that the tools we have don’t match the task we need to do. This would be easiest with a structured database that can parse the opportunities and APY values and simply sort them.
Instead, we are trying to find the semantic similarity between your query and the chunked texts, which is a much harder way of doing the above.
Not to mention that this won’t accomplish the last step, since semantic similarity does not necessarily correspond to ordering by APY.
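To make the contrast concrete, here’s a minimal sketch of the structured approach, assuming your chunks contain an “APY: 14.2%”-style figure somewhere in the string (the chunk texts below are made up for illustration): extract the number with a regex and sort on it, which is exactly the operation a pure vector index can’t do.

```python
import re

# Hypothetical chunk strings, loosely modeled on the kind of data described.
chunks = [
    "SyncSwap USDC/ETH pool on ZKsync -- APY: 8.4%, TVL: $12M",
    "Velocore ZK/ETH farm on ZKsync -- APY: 21.7%, TVL: $3M",
    "Maverick USDC vault on ZKsync -- APY: 14.2%, TVL: $7M",
]

APY_RE = re.compile(r"APY:\s*([\d.]+)\s*%")

def apy_of(chunk: str) -> float:
    """Pull the APY figure out of a chunk, or -inf if none is present."""
    m = APY_RE.search(chunk)
    return float(m.group(1)) if m else float("-inf")

# Structured ranking: sort by the parsed number, highest first.
top3 = sorted(chunks, key=apy_of, reverse=True)[:3]
```

Once the APY lives in a real column (or metadata field), “top 3 by APY” becomes a sort, not a similarity problem.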
It’s important to note that Cohere’s reranker, as mentioned above, could address this problem, since they trained their model to do exactly this kind of task.
But, it’ll only rerank the top K results you pass it, so even then it’s possible to miss information.
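To see why top-K reranking can still miss things, here’s a toy simulation (all numbers invented): retrieval keeps only the top 2 chunks by semantic similarity, then the survivors get reranked by APY. If the highest-APY chunk had a mediocre similarity score, it never reaches the reranker at all.

```python
# Each tuple: (chunk text, semantic similarity score, true APY).
# Hypothetical values, chosen to show the failure mode.
candidates = [
    ("pool A -- APY: 9%",  0.91, 9.0),
    ("pool B -- APY: 12%", 0.88, 12.0),
    ("pool C -- APY: 30%", 0.55, 30.0),  # best APY, but weak similarity
]

TOP_K = 2

# Stage 1: retrieval keeps only the top-K by similarity score.
retrieved = sorted(candidates, key=lambda c: c[1], reverse=True)[:TOP_K]

# Stage 2: rerank the survivors by APY.
reranked = sorted(retrieved, key=lambda c: c[2], reverse=True)

best_after_rerank = reranked[0]
# Pool C (30% APY) was dropped in stage 1, so reranking can't recover it.
```

The reranker did its job perfectly; the damage was done upstream by retrieval, which is why fixing retrieval (or the data model) matters more than the reranker alone.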
Final recommendations
Try to:
- reconsider the tasks you want to use a vector db and semantic search for
- if you still want to do vector search, use Cohere’s reranker to enable financial reranking
- consider hybrid search, sparse search, and structured parsing of your strings to extract the data you want to rank by
- use metadata to separate the chains if you know that data in advance
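On the metadata point: Pinecone lets you attach a metadata dict to each vector and filter on it at query time (e.g. `filter={"chain": {"$eq": "zksync"}}`), so only one chain’s vectors are ever scored. Here’s a pure-Python sketch of the idea, with made-up records standing in for your index:

```python
# Hypothetical records: each vector is stored alongside a metadata dict,
# mirroring how Pinecone upserts pair an id/vector with metadata.
records = [
    {"id": "v1", "chain": "zksync",   "text": "SyncSwap pool, APY: 8.4%"},
    {"id": "v2", "chain": "arbitrum", "text": "GMX vault, APY: 15.0%"},
    {"id": "v3", "chain": "zksync",   "text": "Velocore farm, APY: 21.7%"},
]

def filter_by_chain(records, chain):
    """Apply the metadata filter first, so similarity search
    only ever sees candidates from the requested chain."""
    return [r for r in records if r["chain"] == chain]

zksync_only = filter_by_chain(records, "zksync")
```

This keeps other chains from polluting the candidate set before similarity scores are even computed.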
Hope this helps!
-Arjun