Hybrid search with custom scoring

gxkok · February 16, 2023, 6:07am

Hi, I wonder if it is possible to do hybrid search with custom scoring. A document could have an attribute that is used in scoring and the aggregate score of a document could be the vector distance/similarity (of the document vector with the query vector) + the value of an attribute of the document.

For example, given a query vector, the score of a document could be the cosine similarity of the document vector with the query vector (say, 0.6) plus the value of an attribute in the document (say, 1.0), so the aggregate score of the document is 0.6 + 1.0 = 1.6.

Is this possible in Pinecone?

Cory_Pinecone · February 17, 2023, 6:00pm

Hi @gxkok,

I’ll follow up later today with a longer answer, but the short one is yes, there are ways to do this in Pinecone. For one, I’d recommend using metadata to store the value you’re calculating as a float. Then add that value to the relevance score returned in the query.

I’ll post a longer answer later with an example of how you could do this.
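
For reference, a minimal sketch of the approach described above, assuming the query results arrive as a list of matches that each carry a similarity `score` and a `metadata` dictionary. The metadata field name `boost` and the helper `rerank` are hypothetical, and the matches are hard-coded here rather than fetched from an index. Because the index picks its top results by similarity alone, it may help to request more matches than you need and trim after re-ranking.

```python
# Sketch: add a numeric metadata value to each match's similarity score,
# then sort by the combined score. "boost" is a hypothetical field name.

def rerank(matches, field="boost"):
    rescored = []
    for m in matches:
        # Missing metadata or a missing field contributes 0.0 to the score.
        bonus = float(m.get("metadata", {}).get(field, 0.0))
        rescored.append({**m, "combined_score": m["score"] + bonus})
    # Highest combined score first.
    return sorted(rescored, key=lambda m: m["combined_score"], reverse=True)

# Hard-coded stand-ins for query results (similarity score + metadata).
matches = [
    {"id": "a", "score": 0.8, "metadata": {"boost": 0.1}},
    {"id": "b", "score": 0.6, "metadata": {"boost": 1.0}},
]

ranked = rerank(matches)
for m in ranked:
    print(m["id"], round(m["combined_score"], 3))
```

Here document `b` has the lower similarity (0.6) but ends up ranked first once its attribute value (1.0) is added, which mirrors the 0.6 + 1.0 example in the original question.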

Cory_Pinecone · February 17, 2023, 6:42pm

Hi @gxkok,

Now that I have time for a longer answer, I have some clarifying questions for you. What is the source of the document attribute you’re referring to? Is this something that is a line of text literally in the document itself? Or is it determined from some value of the overall document? If you could shed some light on the source of the value I can give a more detailed example of how you could do this.

Thanks.

gxkok · February 18, 2023, 12:07am

When I say document, I am actually referring to a sample in a dataset, or a row in a table. The attributes are just numbers, either integers or floats. An example could be a product with the attributes rating and number sold. I want the search ranking to reflect semantic relevance, which is handled by ANN search, and also to be a function of the rating and the number sold.

Cory_Pinecone · February 18, 2023, 6:43pm

Gotcha. Yes, in that case the answer is to extract those values when you’re running the data through your model and add them as metadata to the vector you’ll upsert to Pinecone.

It’s not practical to read a value of a field in a source document from the embeddings created from that document. There are some methods to reconstruct a source from embeddings, but they are computationally expensive in the best case. Storing the value(s) you want to add to the relevance score as metadata is your best course of action.
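
A minimal sketch of what that could look like for the product example, assuming each row has a text field to embed plus the two numeric attributes. The `id` / `values` / `metadata` layout follows the record shape Pinecone's upsert accepts, but the row fields, `to_record`, and `placeholder_embed` are hypothetical, and the embedding function is only a stand-in for a real model.

```python
# Sketch: build one record per product row, keeping the numeric attributes
# in metadata next to the embedding instead of inside it.

def placeholder_embed(text, dim=512):
    """Stand-in for a real embedding model (hypothetical); returns a fixed-size vector."""
    return [float(len(text) % 7)] * dim

def to_record(row, embed=placeholder_embed):
    return {
        "id": str(row["id"]),
        "values": embed(row["description"]),  # the embedding itself
        "metadata": {
            # Stored as plain numbers so they can be read back at query time.
            "rating": float(row["rating"]),
            "number_sold": int(row["number_sold"]),
        },
    }

rows = [
    {"id": 1, "description": "wireless mouse", "rating": 4.5, "number_sold": 1200},
    {"id": 2, "description": "usb hub", "rating": 3.9, "number_sold": 300},
]

records = [to_record(r) for r in rows]
# With a real index, these records would be passed to the client's upsert call.
```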

gxkok · February 18, 2023, 10:19pm

I lost you there. What do you mean by adding them as metadata to the vector? The vector is a latent representation of something, and it is used for KNN vector search. How do I add these additional numbers to the vector?

gxkok · February 18, 2023, 10:23pm

Is this what you mean? If my embeddings have 512 dimensions, do I concatenate these 2 numbers to make them 514-dimensional?

Cory_Pinecone · February 21, 2023, 5:50pm

Pinecone supports adding metadata to your vectors to help with prefiltering or to perform functions like what you’re working with here, where there is additional data that has to be tracked alongside the pure vector representation of the data.
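
As a rough sketch of such a function for the rating and number-sold example: the embedding keeps its original dimensionality, and the two numbers are read from each match's metadata after the query returns. Everything specific here is an assumption for illustration, including the helper name `combined_score`, the weights, and the choice to log-scale the number sold so that large sales counts do not swamp the similarity score.

```python
import math

# Sketch: combine similarity with two metadata attributes.
# Weights and the log scaling are illustrative assumptions, not recommended values.

def combined_score(match, w_rating=0.1, w_sold=0.05):
    md = match.get("metadata", {})
    rating = float(md.get("rating", 0.0))
    sold = float(md.get("number_sold", 0))
    # log1p(sold) grows slowly, so 5000 sales is not 500x more influential than 10.
    return match["score"] + w_rating * rating + w_sold * math.log1p(sold)

# Hard-coded stand-ins for query results returned with their metadata.
matches = [
    {"id": "p1", "score": 0.82, "metadata": {"rating": 3.0, "number_sold": 10}},
    {"id": "p2", "score": 0.78, "metadata": {"rating": 4.8, "number_sold": 5000}},
]

ranked = sorted(matches, key=combined_score, reverse=True)
print([m["id"] for m in ranked])
```

The weights control how much the attributes can override semantic similarity, so they would need tuning against your own data.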

tiago · June 28, 2024, 3:53pm

Good morning Cory,
In my use case, I want to be able to chat with websites.
But I want to be able to give more importance to some websites than others. After including the “importance” in the metadata, what is the best way of accomplishing this?
Thanks!