Hybrid search with custom scoring

Hi, I wonder if it is possible to do hybrid search with custom scoring. A document could have an attribute that is used in scoring, and the aggregate score of a document could be the vector distance/similarity (between the document vector and the query vector) plus the value of that attribute.

For example, given a query vector, if the cosine similarity between the document vector and the query vector is 0.6 and the value of an attribute in the document is 1.0, then the aggregate score of the document is 0.6 + 1.0 = 1.6.

Is this possible in Pinecone?

Hi @gxkok,

I’ll follow up later today with a longer answer, but the short one is yes, there are ways to do this in Pinecone. For one, I’d recommend using metadata to store the value you’re calculating as a float. Then add that value to the relevance score returned by the query.
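As a minimal sketch of that idea in Python: the matches below are hand-written stand-ins shaped like query results (an id, a similarity score, and a metadata dict), and the metadata key `boost` and the helper `rerank_by_boost` are hypothetical names chosen for illustration. The addition and re-sorting happen in your own code after the query returns.

```python
from typing import Any


def rerank_by_boost(matches: list[dict[str, Any]], key: str = "boost") -> list[dict[str, Any]]:
    """Add a numeric metadata value to each similarity score and re-sort."""
    rescored = []
    for match in matches:
        # Treat a missing attribute as 0.0 so the match keeps its plain similarity score.
        boost = float(match.get("metadata", {}).get(key, 0.0))
        rescored.append({**match, "aggregate_score": match["score"] + boost})
    # Highest aggregate score first.
    return sorted(rescored, key=lambda m: m["aggregate_score"], reverse=True)


# Stand-in results; real ones would come back from a query that includes metadata.
sample_matches = [
    {"id": "doc-a", "score": 0.9, "metadata": {"boost": 0.2}},
    {"id": "doc-b", "score": 0.6, "metadata": {"boost": 1.0}},
]
reranked = rerank_by_boost(sample_matches)
print([(m["id"], round(m["aggregate_score"], 2)) for m in reranked])
```

Here `doc-b` reproduces the example from the question (0.6 + 1.0) and moves ahead of `doc-a`, even though `doc-a` has the higher raw similarity.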

I’ll post a longer answer later with an example of how you could do this.

Hi @gxkok,

Now that I have time for a longer answer, I have some clarifying questions for you. What is the source of the document attribute you’re referring to? Is this something that is a line of text literally in the document itself? Or is it determined from some value of the overall document? If you could shed some light on the source of the value I can give a more detailed example of how you could do this.

Thanks.

When I say document, I am actually referring to a sample in a dataset, or a row in a table. The attributes are just numeric values (integers or floats). An example could be a product with the attributes rating and number sold. I want the search ranking to be semantically relevant, which is done using ANN, and also to be a function of rating and number sold.

Gotcha. Yes, in that case the answer is to extract those values when you’re running the data through your model and add them as metadata to the vector you’ll upsert to Pinecone.
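A rough sketch of what those records might look like, assuming a product table with `rating` and `num_sold` columns. The `embed` function is a hypothetical placeholder for your model (it just returns a fixed-length dummy vector), and the field names and the 512-dimension size are illustrative. As far as I know, a list of dicts with `id`, `values`, and `metadata` keys is a shape the Pinecone Python client accepts for upserts, but check the client docs for your version.

```python
EMBEDDING_DIM = 512  # assumed model output size


def embed(text: str) -> list[float]:
    """Hypothetical placeholder for a real embedding model."""
    return [0.0] * EMBEDDING_DIM


def build_records(rows: list[dict]) -> list[dict]:
    """Pair each row's embedding with its numeric attributes as metadata."""
    records = []
    for row in rows:
        records.append({
            "id": row["id"],
            "values": embed(row["description"]),  # the vector itself is unchanged
            "metadata": {
                "rating": float(row["rating"]),
                "num_sold": int(row["num_sold"]),
            },
        })
    return records


products = [
    {"id": "p1", "description": "wireless mouse", "rating": 4.5, "num_sold": 1200},
    {"id": "p2", "description": "usb keyboard", "rating": 3.9, "num_sold": 300},
]
records = build_records(products)
# With a real index object, these records would be passed to its upsert method.
```

The numeric attributes travel next to the vector, not inside it, so the embedding keeps whatever dimension your model produces.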

It’s not practical to read a value of a field in a source document from the embeddings created from that document. There are some methods to reconstruct a source from embeddings, but they are computationally expensive in the best case. Storing the value(s) you want to add to the relevance score as metadata is your best course of action.

I lost you there. What do you mean by adding them as metadata to the vector? The vector is a latent representation of something, which is used for KNN vector search. How do I add these additional numbers to the vector?

Is this what you mean? If my embeddings are 512 dimensions, then I concatenate these two numbers to make them 514 dimensions?

Pinecone supports adding metadata to your vectors to help with prefiltering or to perform functions like what you’re working with here, where there is additional data that has to be tracked beyond the pure vector representation. Metadata is stored alongside the vector rather than inside it, so your embeddings stay at 512 dimensions.
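To make the re-scoring side concrete, here is a hedged sketch that combines the similarity score with functions of `rating` and `num_sold` read from metadata. The weights, the log scaling of sales, and the names `custom_score` and `top_k_rescored` are assumptions for illustration, not a Pinecone feature, and the candidates are hand-written stand-ins for query results. Because this re-ranking runs in your own code, it can only reorder what the query returned, so it usually makes sense to fetch more candidates than you finally need.

```python
import math


def custom_score(similarity: float, rating: float, num_sold: int,
                 w_rating: float = 0.1, w_sold: float = 0.05) -> float:
    """Similarity plus weighted rating and log-scaled sales (weights are arbitrary)."""
    return similarity + w_rating * rating + w_sold * math.log1p(num_sold)


def top_k_rescored(matches: list[dict], k: int) -> list[dict]:
    """Re-rank over-fetched candidates by the custom score and keep the best k."""
    ranked = sorted(
        matches,
        key=lambda m: custom_score(
            m["score"], m["metadata"]["rating"], m["metadata"]["num_sold"]
        ),
        reverse=True,
    )
    return ranked[:k]


# Stand-in candidates: three fetched, two kept after re-ranking.
candidates = [
    {"id": "p1", "score": 0.80, "metadata": {"rating": 3.0, "num_sold": 10}},
    {"id": "p2", "score": 0.75, "metadata": {"rating": 4.8, "num_sold": 5000}},
    {"id": "p3", "score": 0.60, "metadata": {"rating": 4.0, "num_sold": 200}},
]
best = top_k_rescored(candidates, k=2)
print([m["id"] for m in best])
```

With these sample numbers, the best raw-similarity match (`p1`) drops out of the top two once rating and sales are factored in, which is the kind of effect the custom score is meant to produce.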