I am embedding a contact list .csv file with multiple columns (first_name, last_name, title, industry, location) using the text-embedding-ada-002 engine from OpenAI.
To return contacts based on semantic search sentences such as “find me all the managers in the hospitality industry”, ChatGPT recommended embedding each column individually and then combining the per-column embedding arrays into one big array to be uploaded to Pinecone and queried.
Thus I currently have an index with 7680 dimensions (5 columns x 1536 dimensions per embedding), and when I search the db I repeat a single search phrase embedding (e.g. “find me all the managers in the hospitality industry”) 5 times in one big embedding array so that its dimensions match the 7680 of the index.
However, this approach seems to produce vectors that are unnecessarily large, and the results aren’t quite accurate (e.g. teachers are returned near the top of the list when I search for professors, which is directionally correct but doesn’t put the most accurate results first).
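A small sketch of why the concatenated layout may add size without adding precision: assuming every embedding is unit length, the cosine similarity between the concatenated record and the query repeated 5 times works out to the plain average of the five per-column similarities, so a strong match on title is diluted by weak matches on the name columns. The random 8-dimension vectors below are stand-ins for real 1536-dimension embeddings, not actual model output.

```python
import math
import random

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

random.seed(0)
DIM = 8          # stand-in for 1536
N_COLUMNS = 5    # first_name, last_name, title, industry, location

# Random unit vectors standing in for the per-column embeddings and the query.
columns = [unit([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(N_COLUMNS)]
query = unit([random.gauss(0, 1) for _ in range(DIM)])

record_vec = [x for col in columns for x in col]  # concatenated record, 5 * DIM values
query_vec = query * N_COLUMNS                     # query repeated 5 times

concat_score = cosine(record_vec, query_vec)
mean_score = sum(cosine(col, query) for col in columns) / N_COLUMNS

print(f"concatenated: {concat_score:.6f}  mean of columns: {mean_score:.6f}")
```

The two printed numbers agree up to floating-point error, which suggests the 7680-dimension index is ranking by an unweighted average over all five columns.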
Is this the wrong way to perform embeddings in this case and are there ways to make the results more accurate? Thank you very much.
Very clever @Jasper! Definitely, a better way to think about the source data than just raw columnar data.
I think that’s one of the biggest tripping points for people who are new to vector databases. The traditional methods of data storage (relational, key-value, columnar) all depend on the data being in a specific order with assigned values. But with vectors, it’s not the individual components of the data that matter, but the set’s overall value. Adding the additional context of “<name> is a <title> at <location>” transforms the raw source into something that a human or AI would more naturally process. And that allows for a better vector representation of the data as a whole.
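As a rough sketch of that idea, each CSV row can be turned into one natural-language sentence before embedding, so every contact ends up with a single 1536-dimension vector. The exact template wording and the sample row are assumptions made up for illustration; the column names come from the original question. The embedding call itself is left out so the snippet runs without an API key.

```python
import csv
import io

def row_to_sentence(row):
    """Render one contact row as a sentence (hypothetical template wording)."""
    name = f"{row['first_name']} {row['last_name']}".strip()
    return (
        f"{name} is a {row['title']} in the {row['industry']} industry, "
        f"located in {row['location']}."
    )

def sentences_from_csv(text):
    """Parse CSV text and return one sentence per contact."""
    reader = csv.DictReader(io.StringIO(text))
    return [row_to_sentence(row) for row in reader]

# Made-up sample data with the same columns as the contact list.
SAMPLE_CSV = """first_name,last_name,title,industry,location
Ana,Lopez,Manager,hospitality,Madrid
"""

sentences = sentences_from_csv(SAMPLE_CSV)
for sentence in sentences:
    # Each sentence would be sent to the embedding model once, and the
    # resulting vector stored together with the raw columns as metadata.
    print(sentence)
```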
If I want to filter out irrelevant results that would naturally show up near the end of the returned data (i.e. if topK is large enough), would it come down to trial and error to find a similarity score cutoff that works well? (For example, if I found that results scoring below 0.75 start returning contacts from Canada even though I asked for the US, I would set the cutoff at 0.75 when making the search.)
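The trial and error can be made a bit more systematic by hand-labeling a few returned matches as relevant or not and picking the cutoff that classifies most of them correctly. This is only a sketch: the scores and labels below are invented, and `best_cutoff` and `apply_cutoff` are hypothetical helper names, not part of any library.

```python
def best_cutoff(labeled):
    """Pick a score cutoff from hand-labeled results.

    labeled: list of (score, is_relevant) pairs.
    Each observed score is tried as a cutoff; the one that classifies the
    most items correctly wins (ties go to the lower cutoff).
    """
    best, best_correct = None, -1
    for cutoff in sorted({score for score, _ in labeled}):
        correct = sum((score >= cutoff) == relevant for score, relevant in labeled)
        if correct > best_correct:
            best, best_correct = cutoff, correct
    return best

def apply_cutoff(matches, min_score):
    """Keep only matches whose similarity score reaches the cutoff."""
    return [m for m in matches if m["score"] >= min_score]

# Invented scores for one query, labeled by hand as relevant (True) or not.
labeled = [(0.86, True), (0.81, True), (0.78, True), (0.74, False), (0.70, False)]
cutoff = best_cutoff(labeled)

matches = [{"id": f"contact-{i}", "score": score} for i, (score, _) in enumerate(labeled)]
kept = apply_cutoff(matches, cutoff)
print(cutoff, [m["id"] for m in kept])
```

A cutoff tuned on one query may not transfer to others, so it is probably worth labeling results from several different searches before settling on a value.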
Search is now more accurate overall with only 1536 dimensions compared to the previous 7680.
However, some items that shouldn’t be in the top results still appear there.
For example, the search “find me all the teachers in france” returns a contact from the United States as the second result with a score of 0.8012. The “location_country” of this contact is United States; however, the job title is “Enseignant”, which is French for teacher, and that may have pushed the result higher up the list.
I’m not sure whether the results are already as accurate as existing embedding models allow, or whether I would need a hybrid of vector + keyword search to get more accurate results
(i.e. use vector search to identify the closest categories among the csv/database columns that exist, then run a traditional search filter using the returned categories).
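A minimal sketch of that hybrid idea, assuming the raw columns are stored as metadata next to each vector: the exact-match part of the request (country) becomes a metadata filter, and the vector score only ranks what survives it. The filter uses the operator-style dictionary that Pinecone accepts in the `filter` argument of `index.query`; the local `matches_filter` helper is a hypothetical stand-in covering only `$eq` and `$in`, so the snippet runs without a database. The French contact and its score are invented; the US contact mirrors the example above.

```python
def build_filter(country=None, industry=None):
    """Build a metadata filter from exact-match constraints."""
    flt = {}
    if country:
        flt["location_country"] = {"$eq": country}
    if industry:
        flt["industry"] = {"$eq": industry}
    return flt

def matches_filter(metadata, flt):
    """Local stand-in for server-side filtering ($eq and $in only)."""
    for field, condition in flt.items():
        for op, value in condition.items():
            if op == "$eq" and metadata.get(field) != value:
                return False
            if op == "$in" and metadata.get(field) not in value:
                return False
    return True

# Pretend these came back from a vector search for "teachers in france".
results = [
    {"id": "c1", "score": 0.8250, "metadata": {"title": "Teacher", "location_country": "France"}},
    {"id": "c2", "score": 0.8012, "metadata": {"title": "Enseignant", "location_country": "United States"}},
]

flt = build_filter(country="France")
filtered = [r for r in results if matches_filter(r["metadata"], flt)]
print([r["id"] for r in filtered])

# With a real index the filter would be applied server-side, e.g.:
#   index.query(vector=query_vec, top_k=10, filter=flt, include_metadata=True)
```

The open question is how to get “France” out of the free-text query in the first place; a keyword lookup against the known column values is one simple option.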
I am no expert, but I read somewhere that you can add information to the metadata, which can then be used for non-semantic querying.
Vector databases are specifically designed for unstructured data and yet provide some of the functionality you’d expect from a traditional relational database. They can execute CRUD operations (create, read, update, and delete) on the vectors they store, provide data persistence, and filter queries by metadata. When you combine vector search with database operations, you get a powerful tool with many applications.
Source:
I have a similar problem where it’s not accurate at traditional database querying:
Example: if I wanted to find the user with the most posts.
It doesn’t return the data in order, because a record’s overall value (its vectorised position) is not based on the numerical value.
Solution 1:
I read a solution to this is to base vectorised values in relation to the numerical values, but this seems unintuitive and non-extendable.
Solution 2:
Store the numerical values as metadata and filter the data in the traditional sense.
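A rough sketch of Solution 2, assuming each record carries its numeric value in metadata: a question like “user with the most posts” is answered by an ordinary sort over that field, with no embedding involved. The field name `post_count`, the helper `top_by` and the sample records are hypothetical.

```python
def top_by(records, field, n=1):
    """Return the n records with the largest numeric metadata value."""
    return sorted(records, key=lambda r: r["metadata"][field], reverse=True)[:n]

# Invented records; in practice these would be fetched with their metadata.
users = [
    {"id": "u1", "metadata": {"post_count": 12}},
    {"id": "u2", "metadata": {"post_count": 57}},
    {"id": "u3", "metadata": {"post_count": 3}},
]

most_active = top_by(users, "post_count")
print(most_active[0]["id"])
```

For range conditions rather than a maximum, Pinecone-style filters also take numeric comparisons such as `{"post_count": {"$gte": 10}}`.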
I’m creating this using Langchain js, and Solution 2 requires hardcoding the metadata filter. I don’t know if this is the best solution, but I feel nervous about allowing a prompt’s response to set this filter, and it adds an additional API call.
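One way to reduce that risk, sketched here in Python although the same idea should carry over to Langchain js, is to treat the model’s proposed filter as untrusted input and check it against an allowlist of fields, operators and value types before it reaches the database. The field names and the `validate_filter` helper are assumptions for illustration.

```python
import json

# Fields the model may filter on, with the value types each one accepts.
ALLOWED_FIELDS = {
    "post_count": (int, float),
    "location_country": (str,),
}
ALLOWED_OPS = {"$eq", "$gt", "$gte", "$lt", "$lte"}

def validate_filter(raw_json):
    """Parse a model-proposed filter and reject anything off the allowlist.

    Raises ValueError on invalid JSON, unknown fields, unknown operators,
    or values of the wrong type. Returns the parsed filter otherwise.
    """
    data = json.loads(raw_json)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("filter must be a JSON object")
    for field, condition in data.items():
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"field not allowed: {field}")
        if not isinstance(condition, dict) or not condition:
            raise ValueError(f"condition for {field} must be a non-empty object")
        for op, value in condition.items():
            if op not in ALLOWED_OPS:
                raise ValueError(f"operator not allowed: {op}")
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, ALLOWED_FIELDS[field]):
                raise ValueError(f"bad value type for {field}")
    return data

safe = validate_filter('{"post_count": {"$gte": 10}}')
print(safe)
```

This does not remove the extra API call, but it limits what a bad or manipulated response can do to the query.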