Relation between Hybrid Index, Sparse-Dense Index and Metadata Filtering

pramodbiligiri · August 21, 2023, 10:19am

I want to query some Pinecone data with more of a focus on the actual data than on semantic search. After reading the docs, I have the following questions:

1. Does Pinecone store the raw data for a text record by default, or only when you explicitly push it, say by tokenizing it and inserting it into the metadata field? Are SPLADE embeddings an alternative to pushing tokens to the metadata field?

2a. Is it preferable to use the new Sparse-Dense embeddings from Feb 2023 (Pinecone blog post titled “Introducing support for sparse-dense embeddings”) compared to the Hybrid Search feature from Oct 2022 (Pinecone blog post titled “Introducing the hybrid index to enable keyword-aware semantic search”)?

2b. Is it preferable to use a Sparse-Dense index compared to Metadata Filtering? One forum answer recommends exactly that: Question about searching for indexes with a keyword in their metadata? - #4 by Cory_Pinecone (“We have a new sparse/dense vector index under development, it should be in public preview in the next couple of weeks. I think using that approach will be much more efficient in the long run for finding specific vectors that match a keyword, rather than filtering on metadata”).

3. Does Metadata Filtering always use Single-Stage Filtering (see “The Missing WHERE Clause in Vector Search | Pinecone”)?

igiloh · September 3, 2023, 10:37am

Hi @pramodbiligiri, thank you for your question!

1. No, Pinecone does not store the raw data or the text record, only the id and the vector values (both dense and sparse). As you mentioned, many customers choose to store the raw data in a metadata field (either tokenized or as plain text), but this is entirely optional. Note that by default all metadata fields are indexed for filtering, so it is recommended to set the indexed metadata fields explicitly and to exclude the field used for raw data.
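
As a rough sketch of that recommendation: with the 2023-era Python client on a pod-based index, selective metadata indexing is set through the `metadata_config` argument of `pinecone.create_index`. The index name, the toy dimension, and the field names (`category`, `year`, `raw_text`) below are made up for illustration, and the calls that need a live account are left commented out.

```python
# Hypothetical field names: only "category" and "year" are indexed for
# filtering; "raw_text" is stored with the record but is not indexed.
metadata_config = {"indexed": ["category", "year"]}

record = {
    "id": "doc-1",
    "values": [0.1, 0.2, 0.3],  # dense vector (toy dimension of 3)
    "metadata": {
        "category": "news",
        "year": 2023,
        "raw_text": "Pinecone adds support for sparse-dense embeddings.",
    },
}

# Fields that are stored but cannot be used in a filter under this config.
unindexed = sorted(set(record["metadata"]) - set(metadata_config["indexed"]))
print(unindexed)

# With the 2023-era client this would look roughly like (not executed here):
# import pinecone
# pinecone.init(api_key="...", environment="...")
# pinecone.create_index("example-index", dimension=3,
#                       metadata_config=metadata_config)
# pinecone.Index("example-index").upsert(vectors=[record])
```

The raw text still comes back with query results when metadata is requested; it just cannot be used inside a filter.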

2a. The Hybrid Index feature was released as a preview (beta) and never reached general availability. You can achieve the same functionality using sparse values, which can be generated with Pinecone’s open-source library pinecone-text.

2b. Sparse values are usually used for a different purpose than metadata filtering. A metadata filter is a “hard” boolean filter, applied before vector proximity (nearest neighbors) is even considered. For example, it can restrict results to specific categories, or to integer values below a given threshold.
A sparse vector, by contrast, represents the data in some very large space, such as the space of all words in a vocabulary. The similarity between two vectors can then be measured by the dot product of their representations in this space. For example, the output of word-counting algorithms like TF-IDF and BM25 can be represented as sparse vectors.
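
To make the contrast concrete, here is a small, self-contained sketch in plain Python. It is not Pinecone's implementation: the tiny vocabulary, the documents, and the helper names (`sparse_tf`, `sparse_dot`, `passes`) are invented for illustration, raw term counts stand in for TF-IDF/BM25 weights, and the toy filter evaluator handles only the `$eq` and `$lt` operators. The filter removes records outright; the sparse dot product only ranks whatever is left.

```python
from collections import Counter


def sparse_tf(text, vocab):
    """Encode text as a sparse term-count vector in indices/values form."""
    counts = Counter(vocab[tok] for tok in text.lower().split() if tok in vocab)
    indices = sorted(counts)
    return {"indices": indices, "values": [float(counts[i]) for i in indices]}


def sparse_dot(a, b):
    """Dot product of two sparse vectors; only shared indices contribute."""
    b_map = dict(zip(b["indices"], b["values"]))
    return sum(v * b_map.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


def passes(metadata, flt):
    """Toy boolean filter supporting only $eq and $lt (an assumption)."""
    ops = {"$eq": lambda x, y: x == y, "$lt": lambda x, y: x < y}
    return all(
        field in metadata and ops[op](metadata[field], arg)
        for field, cond in flt.items()
        for op, arg in cond.items()
    )


vocab = {"sparse": 0, "dense": 1, "index": 2, "filter": 3, "keyword": 4}
docs = [
    {"id": "a", "text": "sparse dense index",
     "metadata": {"category": "blog", "year": 2023}},
    {"id": "b", "text": "keyword filter index",
     "metadata": {"category": "blog", "year": 2022}},
    {"id": "c", "text": "sparse sparse keyword",
     "metadata": {"category": "docs", "year": 2021}},
]

flt = {"category": {"$eq": "blog"}, "year": {"$lt": 2024}}
query = sparse_tf("sparse index", vocab)

# Step 1: hard filter. Records that fail are never scored.
candidates = [d for d in docs if passes(d["metadata"], flt)]

# Step 2: soft ranking of the survivors by sparse dot product.
ranked = sorted(
    candidates,
    key=lambda d: sparse_dot(query, sparse_tf(d["text"], vocab)),
    reverse=True,
)
for d in ranked:
    print(d["id"], sparse_dot(query, sparse_tf(d["text"], vocab)))
```

Document `c` scores as well as `a` on the keyword match, yet it never appears in the results because it fails the category filter. That is the sense in which a metadata filter is a hard constraint, while sparse values only influence ranking.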