Hybrid search - skewing heavily towards larger text chunks

Hi Pinecone community,

I’ve been using Pinecone for the past month or so and have really enjoyed it. I recently came across this article by James Briggs and decided to try hybrid search to see whether it would improve results for what we’re building.

For context, the flow I’m validating is…

  • Ingest a bunch of documents from our customers (PDFs, .docx, etc.)
  • Extract blocks (sentences, tables, etc.)
  • Create dense vectors (OpenAI embeddings) and sparse vectors (BERT WordPiece tokenizer) and upsert to an S1 dotproduct Pinecone index
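
For reference, the last step looks roughly like the sketch below. It assumes the sparse values are raw token counts and uses hard-coded, hypothetical token ids and a three-number stand-in for the embedding; the id scheme and the helper name `to_sparse_vector` are also just for illustration.

```python
from collections import Counter


def to_sparse_vector(token_ids):
    """Turn a list of token ids into an indices/values pair of token counts."""
    counts = Counter(token_ids)
    indices = sorted(counts)
    return {"indices": indices, "values": [float(counts[i]) for i in indices]}


# Hypothetical WordPiece ids for one extracted block; a real pipeline would
# get these from the tokenizer instead of hard-coding them.
block_token_ids = [2023, 2003, 1037, 3231, 2023, 3231, 3231]

record = {
    "id": "doc1-block0",        # hypothetical id scheme
    "values": [0.1, 0.2, 0.3],  # stand-in for the real dense embedding
    "sparse_values": to_sparse_vector(block_token_ids),
}
print(record["sparse_values"])
```

Note that with raw counts, a longer block produces larger sparse values simply because it has more tokens.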

I compared hybrid search (alpha=0.9, i.e. 90% semantic, 10% keyword) against pure semantic search, and the hybrid results skew very heavily towards large text blocks. I tried alpha=0.99 too and got a similar result. I’m not sure how much of this comes from moving to the “dotproduct” metric rather than “cosine”, but “dotproduct” is the only metric Pinecone supports for hybrid search today. My hunch is that any amount of keyword matching favors larger chunks, because a longer snippet simply contains more occurrences of the query tokens, so more uniform chunking (based on the number of tokens per chunk) might help.
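
Here is a minimal sketch of that hunch with made-up numbers. It assumes the hybrid score is a convex combination (alpha times the dense dot product plus 1 - alpha times the sparse dot product), that sparse values are raw token counts, and that the dense similarity is bounded by 1. The dense similarities 0.90 and 0.70 and the token lists are hypothetical.

```python
from collections import Counter


def sparse_counts(tokens):
    """Sparse vector as {token: count}; raw counts are an assumption."""
    return dict(Counter(tokens))


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(value * b.get(key, 0) for key, value in a.items())


def hybrid_score(dense_sim, sparse_sim, alpha):
    """Convex combination: alpha weights dense, (1 - alpha) weights sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * dense_sim + (1 - alpha) * sparse_sim


query = sparse_counts(["refund", "policy"])
short_chunk = sparse_counts(["refund", "policy", "applies"])
long_chunk = sparse_counts(["refund", "policy"] * 40 + ["filler"] * 200)

short_sparse = sparse_dot(query, short_chunk)  # few keyword hits
long_sparse = sparse_dot(query, long_chunk)    # many keyword hits

# Hypothetical dense similarities: the short chunk is the better semantic match.
short_total = hybrid_score(0.90, short_sparse, alpha=0.99)
long_total = hybrid_score(0.70, long_sparse, alpha=0.99)

print(short_sparse, long_sparse)
print(short_total, long_total)
```

Under these assumptions the sparse dot product is unbounded and grows with chunk length, so even a 1% keyword weight can outrank a clearly better semantic match. Printing the dense and sparse contributions separately for a few real queries would be one way to check whether this is what is happening.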

I’m looking for any high-level guidance on how to debug this. I’m also wondering whether my setup is OK; I’m not sure whether mixing OpenAI embeddings with BERT WordPiece causes any issues either. Thanks!

+1, I would also like to know the answer to this. I’m especially confused about using dotproduct with OpenAI’s ada embedding model, since it appears cosine should be used (but hybrid doesn’t support it).
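
On the dotproduct versus cosine point, a quick sanity check may help: for vectors scaled to unit length, the two give the same number. OpenAI documents its embeddings as normalized to length 1, so the dense side should be unaffected by the metric change. The sketch below uses small made-up vectors, not real embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# Made-up vectors standing in for a query and a document embedding.
q = normalize([0.3, -1.2, 0.8])
d = normalize([0.5, -0.9, 1.1])

print(dot(q, d), cosine(q, d))
```

The same does not hold for the sparse vectors unless they are also normalized, which is why the sparse side is the more likely place for length effects to show up.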