Splitting data across namespaces

Hey guys,

I’m using Pinecone DB and I’ve embedded/batchUpserted 3 documents sourced from 3 different APIs, which often provide similar types of data.

The issue is that I’m currently using a single large namespace in my index to store all the vectors from these documents. However, when I embed a query into a vector, the results aren’t as accurate as I’d expect. I’m not entirely sure how the similarity metric is being applied, but the vector query doesn’t seem to return the most relevant vector IDs with their corresponding metadata.

Could it be that my setup is suboptimal? Should I change my embedding model? Should I consider dividing the data into 3 separate namespaces, or is there another approach I should explore?

Thanks for your answers!

Hi paul1, and welcome to the forum!

Great question! It sounds like the problem you’re facing is that irrelevant chunks from one or two of the other documents you’ve upserted are showing up in your results.

It’s a bit hard to tell what specifically could be happening here. Since relevance is estimated via semantic similarity (cosine distance) between the query and the document chunks you used, it can depend on the content of the documents and the query sent, as well as the embedding model used.

Can you follow up with a description of your documents, the query you used, and the chunks returned that are not useful?

If I were to guess, it could just be that the query type doesn’t align well with semantic search, or that the document chunks are so similar that their distances in vector space from the given query aren’t distinguishable enough to correlate with relevance. In that case, yes, you’d need to examine:

  • embedding model used
  • chunks used
  • input query: is it descriptive or vague, long or short, does it contain a question or just a few words, etc.

If you know the document you are asking a question about in advance, you should use metadata filters to restrict your search to that document, and expose that option to end users. That will certainly help.

Separating the documents by namespace is an option, but then you have to specify a namespace when querying, which is a drawback. I’d suggest metadata filters instead for more flexibility.
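For illustration, a metadata-filtered query might look roughly like this with the Pinecone Node.js SDK (the index name, field names, and surrounding code are placeholders based on what you’ve described, not your actual setup):

    import { Pinecone } from '@pinecone-database/pinecone';

    const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
    const index = pc.index('defi-index'); // hypothetical index name

    // queryVector is the embedding of the user's query, produced elsewhere
    async function searchOneSource(queryVector: number[]) {
      return index.query({
        vector: queryVector,
        topK: 15,
        includeMetadata: true,
        // restrict the search to one source/document instead of splitting namespaces
        filter: { source: { $eq: 'defillama' } },
      });
    }

If you went with namespaces instead, you’d have to pick the namespace up front (e.g. index.namespace('merkl').query(...) in the same SDK, if I remember it right), which is exactly the drawback mentioned above.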

Let us know the answers to the above, happy to help!

Sincerely,
Arjun

Hey Arjun,

Thanks for your help, here is an example:

Embedding model: OpenAI
My query, which is what I’m embedding, is: “Analyze the top 3 DeFi opportunities on ZKsync ranked by APY”.

I’m picking 15 chunks from 2 APIs (Merkl, DeFiLlama) with metadata matches like this:

"{
      apy: 3.95039,
      chain: 'zksync era',
      protocol: 'zkswap-v2',
      source: 'defillama',
      text: 'zkswap-v2 ZK-WETH pool has $2828839.00 TVL with 3.95% APY.',
      tvl: 2828839,
      type: 'defi'
    },
    {
      apy: 3.95039,
      chain: 'zksync era',
      protocol: 'zkswap-v2',
      source: 'defillama',
      text: 'zkswap-v2 ZK-WETH pool has $2828839.00 TVL with 3.95% APY.',
      tvl: 2828839,
      type: 'defi'
    },
    {
      apy: 0.043,
      chain: 'zksync era',
      protocol: 'aave-v3',
      source: 'defillama',
      text: 'aave-v3 ZK pool has $29313719.00 TVL with 0.04% APY.',
      tvl: 29313719,
      type: 'defi'
    },
    {
      apy: 35.96816144428015,
      chain: 'zksync',
      protocol: 'PancakeSwap',
      source: 'merkl',
      text: 'Provide liquidity to PancakeSwapV3 ZK-WETH 0.25% on PancakeSwap has $2894303.21 TVL with 35.97% APY.',
      tvl: 2894303.213721852,
      type: 'defi'
    },
    {
      apy: 25.07699015125618,
      chain: 'zksync',
      protocol: 'SyncSwap',
      source: 'merkl',
      text: 'Provide liquidity to SyncSwap ZK-WETH on SyncSwap has $10042617.73 TVL with 25.08% APY.',
      tvl: 10042617.72566038,
      type: 'defi'
    },"

In the end, it isn’t ranking the dataset well by APY, and it misses better opportunities elsewhere in the index when picking the 15 chunks.

Maybe the chunks returned from my 2 APIs are too similar?

Ok, I think I’m getting a better handle on what’s going on. Here are some suggestions and ideas that together may help.

Application context and misalignment with semantic search

Are you just building a semantic search application? Or are you also using OpenAI to generate a response on top of the returned chunks?

It looks like a lot of the data you are returning is numerical, which usually isn’t a great match for semantic search. This can come down to tokenization issues: numerical data often isn’t represented well by embedding models. You may need to use an embedding model specialized for financial text if you plan to ask a lot of queries along those lines.

However, there are ways to address this if you want to stick with semantic search; I’ll get to those in the recommendations at the end.

Query Issues

A good way to diagnose a semantic search, RAG, or AI application is to ask: could a human do this with the same tools and context?

Let’s look at your query:

“Analyze the top 3 DeFi opportunities on ZKsync ranked by APY”,

I’m assuming your Pinecone instance contains strings like the one you showed, where it’s a bunch of crypto stuff with money and APY data.

The desired outcome is to return the highest (?) APY opportunities on zksync.

In order to satisfy this query, we probably need the following:

  • knowledge of what ZKsync, APY, and DeFi are
  • an understanding of the dataset you have uploaded and what it is/where it comes from
  • once that is achieved, we need to find all statements inside the vector DB that mention APY and opportunities, and order them from lowest to highest APY
  • then, we need to return the highest ones to you.

First, it’s important to point out that your embedding models might not even be parsing the important tokens (the chain name and numerical values) correctly, which could explain the “wrong” results.

The bigger issue, though, is that the tools we have don’t match the task we need to do. This would be easiest with a structured database that can parse out the opportunities and APY and simply sort them.
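For comparison, once the fields are structured, your query is literally just a filter and a sort. A quick sketch, assuming records shaped like the metadata you pasted:

    interface Opportunity {
      apy: number;
      chain: string;
      protocol: string;
      source: string;
      tvl: number;
      text: string;
    }

    // "Top 3 DeFi opportunities on ZKsync ranked by APY" as a plain filter + sort,
    // which semantic similarity alone is not guaranteed to reproduce.
    function topOpportunities(records: Opportunity[], n = 3): Opportunity[] {
      return records
        .filter((r) => r.chain.toLowerCase().startsWith('zksync'))
        .sort((a, b) => b.apy - a.apy)
        .slice(0, n);
    }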

Instead, we are trying to find the semantic similarity between your query and the chunked texts, which is a more roundabout way of doing the above.

Not to mention that this will not accomplish the last goal, since semantic similarity does not necessarily correspond to ordering by APY.

It’s important to note that Cohere’s reranker could help address this problem, as they trained their model to do precisely this kind of relevance ranking.

But, it’ll only rerank the top K results you pass it, so even then it’s possible to miss information.
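Roughly, with the cohere-ai Node SDK it would look something like the sketch below (this is from memory, so double-check their current rerank signature and model names):

    import { CohereClient } from 'cohere-ai';

    const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

    // Reranking only reorders the candidates you pass in; anything the vector
    // search missed in the first pass cannot be recovered here.
    async function rerankChunks(query: string, chunkTexts: string[]) {
      const response = await cohere.rerank({
        model: 'rerank-english-v3.0',
        query,
        documents: chunkTexts,
        topN: 3,
      });
      // each result points back into chunkTexts by index
      return response.results.map((r) => chunkTexts[r.index]);
    }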

Final recommendations

Try to:

  • reconsider the tasks you want to use a vector db and semantic search for
  • if you still want to do vector search, use Cohere’s reranker to enable financial reranking
  • consider hybrid search, sparse search, and structured parsing of your strings to extract the data you want to rank by
  • use metadata to separate the chains if you know that data in advance (see the sketch after this list)
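To make the last two bullets concrete, here’s a rough sketch of upserting with the parsed numbers stored as metadata and then filtering on them at query time. Again, this assumes the Pinecone Node.js SDK, a hypothetical index name, and example values; in practice the numbers would come from parsing your source strings:

    import { Pinecone } from '@pinecone-database/pinecone';

    const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
    const index = pc.index('defi-index'); // hypothetical index name

    // Upsert a chunk with the parsed numbers stored as metadata, not only in the text.
    async function upsertOpportunity(id: string, values: number[], text: string) {
      await index.upsert([
        {
          id,
          values, // the embedding of `text`, produced by your OpenAI call
          // example values; these would come from parsing the source string
          metadata: { chain: 'zksync', source: 'merkl', apy: 35.97, tvl: 2894303.21, text },
        },
      ]);
    }

    // At query time, only consider chunks on the right chain above an APY floor,
    // then sort the survivors by their `apy` metadata field in your own code.
    async function queryHighApy(queryVector: number[]) {
      return index.query({
        vector: queryVector,
        topK: 50,
        includeMetadata: true,
        filter: { chain: { $eq: 'zksync' }, apy: { $gte: 10 } },
      });
    }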

Hope this helps!
-Arjun


Thanks a lot for the detailed answer, and that’s exactly the point. Numerical data isn’t really a good match for semantic search, as I understand it.

Do you maybe have a numerical-data-specific embedding model in mind?

And yes, we have built a homemade re-ranking model to sort by APY (maybe less efficient than Cohere’s), but if we already missed information during the similarity search, re-ranking won’t solve the issue.

On the other hand, we will try to remove the APY, TVL and other numerical values from the embedded text and only put them into metadata fields, so the semantic similarity results will hopefully be better, with the numbers available in metadata.

Hmmm… the best option might be to write a regex to grab the APY, like finding the percents, decimal places, and the keyword APY, and insert that into metadata. Then you can use semantic search/reranking to capture sentiment or ranking that is described there.

It seems like a lot of these queries have very structured, common formats and could benefit from this metadata parsing.
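Something along these lines is what I mean; the exact patterns are guesses based on the two text formats you pasted, so treat them as a starting point:

    // Pull APY and TVL out of strings like
    // "zkswap-v2 ZK-WETH pool has $2828839.00 TVL with 3.95% APY."
    // "Provide liquidity to SyncSwap ZK-WETH on SyncSwap has $10042617.73 TVL with 25.08% APY."
    function parseOpportunity(text: string): { apy?: number; tvl?: number } {
      const apyMatch = text.match(/([\d.]+)%\s*APY/i);
      const tvlMatch = text.match(/\$([\d.]+)\s*TVL/i);
      return {
        apy: apyMatch ? parseFloat(apyMatch[1]) : undefined,
        tvl: tvlMatch ? parseFloat(tvlMatch[1]) : undefined,
      };
    }

    // parseOpportunity('aave-v3 ZK pool has $29313719.00 TVL with 0.04% APY.')
    // -> { apy: 0.04, tvl: 29313719 }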

As for numerical-data-specific embedding models, I don’t know of any off the top of my head, outside of something simple like a set of regexes/Named Entity Recognition models that can identify numbers and concepts, convert them to metadata, and then sort the whole dataset rather than querying over it.