To my knowledge, Pinecone doesn’t support text search or substring matching. To identify records that contain a specific piece of text, we may need to maintain a separate mapping, such as:
id_to_text = {
    "id-1": "this is some text",
    "id-2": "this is some other text",
    # …
}
Is this considered good practice? If so, given that this mapping can grow quite large, what are some effective strategies for persisting it?
If not, how do people typically handle exact text search for debugging purposes when using Pinecone?
Hey there – your premise is just a bit off from what’s happening under the hood.
Pinecone actually supports two types of text search:
Lexical (keyword) search using sparse vectors. This allows for exact word matching, where query terms are scored independently and summed, with the most similar records scored highest.
Semantic search using dense vectors. This finds records that are semantically similar to the query text, even if they don’t contain the exact same words.
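To make the "scored independently and summed" idea concrete, here is a toy sketch in plain Python. It is not Pinecone's actual scoring: the whitespace tokenizer, the raw term-count weight, and the names `tokenize` and `lexical_score` are all assumptions made for illustration only.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real sparse models tokenize more carefully).
    return text.lower().split()


def lexical_score(query, document):
    # Score each distinct query term on its own (here: how often it occurs
    # in the document), then sum the per-term scores.
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[term] for term in set(tokenize(query)))


docs = {
    "id-1": "this is some text",
    "id-2": "this is some other text about text search",
}
query = "text search"

# Rank record IDs from highest to lowest score.
ranked = sorted(docs, key=lambda doc_id: lexical_score(query, docs[doc_id]), reverse=True)
print(ranked)
```

Note that in this sketch matching happens per whole token, so a partial word such as "tex" contributes nothing to the score.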
For example, here’s how to perform a lexical search in Python:
python
from pinecone import Pinecone

# Connect to the index by its host
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index(host="INDEX_HOST")

# Search with a text query and return the top 3 hits with the listed fields
results = index.search(
    namespace="example-namespace",
    query={
        "inputs": {"text": "What is AAPL's outlook, considering both product launches and market conditions?"},
        "top_k": 3
    },
    fields=["chunk_text", "quarter"]
)
The response includes the matching text and scores:
python
{'result': {'hits': [{'_id': 'vec2',
                      '_score': 10.77734375,
                      'fields': {'chunk_text': "Analysts suggest that AAPL's "
                                               'upcoming Q4 product launch '
                                               'event might solidify its '
                                               'position in the premium '
                                               'smartphone market.',
                                 'quarter': 'Q4'}}]}}
Therefore, maintaining a separate text mapping isn’t necessary: you can use Pinecone’s built-in search capabilities for both exact keyword matching and semantic similarity search, depending on your needs.
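For debugging, if you want to confirm that a returned record really contains an exact piece of text, one option is to check the returned fields locally. The sketch below assumes the response can be read like the nested dictionary shown above; `hits_containing` and `sample_response` are hypothetical names, and the sample text is shortened from the example response.

```python
def hits_containing(response, needle, field="chunk_text"):
    # Return the IDs of hits whose text field contains `needle` as an exact substring.
    hits = response["result"]["hits"]
    return [hit["_id"] for hit in hits if needle in hit["fields"].get(field, "")]


# Hand-written stand-in for a search response, mirroring the structure shown above.
sample_response = {
    "result": {
        "hits": [
            {
                "_id": "vec2",
                "_score": 10.77734375,
                "fields": {
                    "chunk_text": "Analysts suggest that AAPL's upcoming Q4 product launch event "
                                  "might solidify its position in the premium smartphone market.",
                    "quarter": "Q4",
                },
            }
        ]
    }
}

print(hits_containing(sample_response, "Q4 product launch"))
```

This only filters the hits that the search already returned, so it is a sanity check on results rather than a substring search over the whole index.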