Best Practices for Exact Text Search in Pinecone?

To my knowledge, Pinecone doesn’t support text search or substring matching. To identify records that contain a specific piece of text, we may need to maintain a separate mapping, such as:

```python
id_to_text = {
    "id-1": "this is some text",
    "id-2": "this is some other text",
    # ...
}
```

Is this considered good practice? If so, given that this mapping can grow quite large, what are some effective strategies for persisting it?

If not, how do people typically handle exact text search for debugging purposes when using Pinecone?

Hey there – your premise is just a bit to the side of what’s happening under the hood.

Pinecone actually supports two types of text search:

  1. Lexical (keyword) search using sparse vectors. This allows for exact word matching, where query terms are scored independently and summed, with the most similar records scored highest.
  2. Semantic search using dense vectors. This finds records that are semantically similar to the query text, even if they don’t contain the exact same words.

For example, here’s how to perform a lexical search in Python:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index(host="INDEX_HOST")

results = index.search(
    namespace="example-namespace",
    query={
        "inputs": {"text": "What is AAPL's outlook, considering both product launches and market conditions?"},
        "top_k": 3,
    },
    fields=["chunk_text", "quarter"],
)
```

The response includes the matching text and scores:

```python
{'result': {'hits': [{'_id': 'vec2',
                      '_score': 10.77734375,
                      'fields': {'chunk_text': "Analysts suggest that AAPL's "
                                               'upcoming Q4 product launch '
                                               'event might solidify its '
                                               'position in the premium '
                                               'smartphone market.',
                                 'quarter': 'Q4'}}]}}
```
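Since the original question was about exact-text checks for debugging, it's also worth noting that the returned `fields` can be filtered client-side. A small sketch over a response shaped like the one above (the `hits_containing` helper is my own illustration, not a Pinecone API):

```python
def hits_containing(response, substring):
    """Return the ids of hits whose chunk_text contains substring verbatim."""
    hits = response["result"]["hits"]
    return [h["_id"] for h in hits
            if substring in h["fields"].get("chunk_text", "")]

# A response dict shaped like the search output shown above.
response = {"result": {"hits": [
    {"_id": "vec2", "_score": 10.78,
     "fields": {"chunk_text": "Analysts suggest that AAPL's upcoming Q4 "
                              "product launch event might solidify its "
                              "position in the premium smartphone market.",
                "quarter": "Q4"}},
]}}

hits_containing(response, "Q4 product launch")  # -> ["vec2"]
```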

Therefore, maintaining a separate text mapping isn’t necessary - you can use Pinecone’s built-in search capabilities for both exact keyword matching and semantic similarity search, depending on your needs.

Thank you! Just to confirm—if my application requires semantic search and I also want lexical search for debugging, does that mean I need to create two separate Pinecone indexes: one dense for semantic search and one sparse for lexical search?

You’ve got options –

Dense indexes store dense vectors for semantic search, while sparse indexes store sparse vectors for lexical/keyword search.

  • However, Pinecone also offers hybrid search, which allows you to combine both semantic and lexical search capabilities. With hybrid search, you can search both dense and sparse indexes, combine the results, and use Pinecone’s hosted reranking models to assign unified relevance scores.

The key differences between these approaches are:

  1. Dense vectors (semantic search):
  • Represent the meaning and relationships of text
  • Used for finding semantically similar content
  2. Sparse vectors (lexical search):
  • Represent the words or phrases in a document
  • Better for exact keyword matches
  • More predictable and precise for exact matching
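To make the distinction concrete, here is a toy sketch of the two record shapes. The numbers are made up for illustration; real values come from an embedding model:

```python
# Dense vector: a fixed-length list of floats encoding meaning.
# Real dense embeddings are typically hundreds to thousands of dimensions.
dense_record = {
    "id": "vec1",
    "values": [0.12, -0.45, 0.88, 0.05],
}

# Sparse vector: only the non-zero dimensions, as (index, weight) pairs.
# Indices correspond to tokens present in the text; weights score them.
sparse_record = {
    "id": "vec1",
    "sparse_values": {
        "indices": [10, 45, 16],
        "values": [0.5, 0.5, 0.2],
    },
}
```

Because a sparse vector lists the tokens themselves, a query term either matches a stored index or it doesn’t, which is why sparse search behaves so predictably for exact keyword matching.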

The choice between using separate indexes or hybrid search will depend on your specific needs for debugging and production use cases.

Thanks so much for the explanations!

After reading hybrid search sample code, it looks like two separate indexes are used—correct? From a database population standpoint, both options seem to involve maintaining two indexes. Is this the correct understanding?

```python
dense_index = pc.Index(host="DENSE_INDEX_HOST")
sparse_index = pc.Index(host="SPARSE_INDEX_HOST")
```

Yes, for serverless indexes, Pinecone recommends using separate dense-only and sparse-only indexes for more flexible and accurate hybrid search. This is demonstrated in the code example you referenced, which shows two separate index instances being used for hybrid search (see the Query data page in the Pinecone docs).

The workflow involves:

  1. Searching the dense index for semantic matches
  2. Searching the sparse index for lexical/keyword matches
  3. Merging and deduplicating the results

This approach allows you to combine the strengths of both semantic and keyword searching while maintaining separate optimized indexes for each type of search.
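The merge-and-deduplicate step can be sketched in plain Python. This is a toy min-max normalization with a keep-the-best-score dedupe, not Pinecone’s hosted reranking, and the `merge_hits` name is mine:

```python
def merge_hits(dense_hits, sparse_hits, top_k=3):
    """Merge hits from a dense and a sparse search, keeping each _id once.

    Dense and sparse scores live on different scales (cosine-like vs.
    summed term weights), so each result set is min-max normalized
    before merging. Production code would typically use a reranking
    model instead of this toy normalization.
    """
    def normalize(hits):
        if not hits:
            return []
        scores = [h["_score"] for h in hits]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero on ties
        return [{**h, "_score": (h["_score"] - lo) / span} for h in hits]

    merged = {}
    for hit in normalize(dense_hits) + normalize(sparse_hits):
        prev = merged.get(hit["_id"])
        if prev is None or hit["_score"] > prev["_score"]:
            merged[hit["_id"]] = hit  # keep the best score per id
    ranked = sorted(merged.values(), key=lambda h: h["_score"], reverse=True)
    return ranked[:top_k]
```

For example, `merge_hits([{"_id": "a", "_score": 0.9}, {"_id": "b", "_score": 0.4}], [{"_id": "b", "_score": 12.0}, {"_id": "c", "_score": 3.0}])` deduplicates `b` and returns the hits ranked `a`, `b`, `c`.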

For pod-based indexes, there is an alternative option to use a single hybrid index that can store both sparse and dense vectors together, though this is limited to s1 and p1 pod types using the dotproduct metric.
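As a rough illustration of that single-index option, one query carries both vector types at once. The sketch below only shows the shape of the keyword arguments you would pass to `index.query(...)`; the numbers are placeholders, not real embeddings:

```python
# Keyword arguments for a single-index hybrid query on a pod-based
# index using the dotproduct metric. Placeholder values throughout.
hybrid_query = {
    "top_k": 3,
    "vector": [0.1, 0.2, 0.3],          # dense part of the query
    "sparse_vector": {                  # sparse part of the same query
        "indices": [10, 45, 16],
        "values": [0.5, 0.5, 0.2],
    },
    "include_metadata": True,
}
# e.g. results = index.query(**hybrid_query)
```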

Hope that helps!

PS - I’d also recommend checking out our YouTube channel for video webinar content on doing hybrid search.