I am integrating Pinecone’s vector database into our PHP Laravel application to enhance search functionality, with a focus on context-aware results. We use OpenAI’s text-embedding-3-large model to convert text into vector embeddings.
To give you a better understanding of our scenario, consider two templates:
- Template One Tags: diwali, special, exhibition, poster, posters, festival, festivals, divali, dewali, deepavali, deepawali, red, elegant
- Template Two Tags: diwali, festival, lights, festivals, deepawali, deepavali, greetings, wish, wishes, card, greeting, design, celebration, dipawali, दीपावली, dewali, deewali, dipavali, दिवाळी, divali, dipawli, wishfully, festivel, featival, dipabali, fastivel, depavali, lighted
Could you provide a detailed explanation of how Pinecone’s vector search evaluates and ranks results in this context? Specifically, I am interested in understanding why one template might rank higher than the other based on its semantic relevance to a search query like “diwali.” What factors does Pinecone consider in its ranking process?
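For context on the kind of ranking behaviour I mean: my understanding is that Pinecone scores each stored vector against the query vector with a similarity metric (cosine, dot product, or Euclidean, chosen at index creation) and returns the nearest matches in descending score order. A minimal, self-contained sketch of cosine-similarity ranking is below; the 4-dimensional vectors and the `template_one`/`template_two` names are made-up illustrations, not real embeddings (text-embedding-3-large produces 3072-dimensional vectors):

```python
# Hypothetical illustration of similarity-based ranking as done by a vector
# database: the item whose embedding points closest to the query embedding
# (highest cosine similarity) ranks first. All vectors here are toy values.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; values chosen so that template_two
# happens to sit closer to the query vector than template_one does.
query = [0.9, 0.1, 0.2, 0.1]            # stand-in for the embedded query "diwali"
template_one = [0.7, 0.5, 0.1, 0.3]     # stand-in for the poster/exhibition tags
template_two = [0.85, 0.15, 0.25, 0.1]  # stand-in for the greeting/wishes tags

scores = {
    "template_one": cosine_similarity(query, template_one),
    "template_two": cosine_similarity(query, template_two),
}
# Results come back sorted by similarity, highest first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these toy numbers, `template_two` scores higher and would be returned first; in the real system the ordering depends entirely on how the embedding model places each template’s tag text relative to the query.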
Your insights will be invaluable as we aim to optimize our application’s search capabilities to better align with user intents.
Thank you for your time and assistance.