How to embed csv data with multiple columns for querying in vector db?

bryan_1 · March 23, 2023, 9:43pm

Hi,

I am embedding a contact list .csv file with multiple columns (first_name, last_name, title, industry, location) using the text-embedding-ada-002 engine from OpenAI.

To return contacts based on semantic search sentences such as “find me all the managers in the hospitality industry”, ChatGPT recommended embedding each column individually and then combine each column’s embedding array into one big array to be uploaded to Pinecone and queried.

Thus I currently have an index with 7680 dimensions (5x each column’s 1536 dimension embedding) and when I search the db I add a single search phrase embedding (ie. find me all the managers in the hospitality industry) 5 times into one big embedding array so the dimensions match the database that is 7680.

However this approach seems to produce dimensions that are unnecessarily large and the results aren’t quite accurate (ie. returning teachers closer to the top of list when I search for professors, which while directionally correct doesn’t return the most accurate results at the top of list first)

Is this the wrong way to perform embeddings in this case and are there ways to make the results more accurate? Thank you very much.

Jasper · March 24, 2023, 8:03am

Hi!

Instead of embedding every column I would go about it in a different way. Lets say the row is like:

John, Doe, manager, hospitality, NYC

From each line in .csv file I would then generate a new list fo sentences in a way:

{first_name} {last_name} is a {title} in {industry} from {location}.

so the first row would look like:

John Doe is a manager in hospitality from NYC.

Then I would embed these sentences instead all of the columns and have vectors of 1536. Questions would be embed the same and then do a “normal” query

Good luck and let us know if you find a good way to do it.

Cory_Pinecone · March 24, 2023, 3:46pm

Very clever @Jasper! Definitely, a better way to think about the source data than just raw columnar data.

I think that’s one of the biggest tripping points for people who are new to vector databases. The traditional methods of data storage (relational, key-value, columnar) all depend on the data being in a specific order with assigned values. But with vectors, it’s not the individual components of the data that matter, but the set’s overall value. Adding the additional context of “ is a at <location” transforms the raw source into something that a human or AI would more naturally process. And that allows for a better vector representation of the data as a whole.

bryan_1 · March 24, 2023, 8:29pm

Thanks @Jasper and @Cory_Pinecone!

I will try embedding the columns as a sentence.

If I want to filter out irrelevant results that would naturally show up near the end of the returned data (ie. if topK is large enough), would it come down to trail and error to find a similarity score cutoff that works well? (ie. if found that results < 0.75 will start returning contacts from Canada even though asked for US then set the filter cutoff at 0.75 when making the search)

bryan_1 · March 25, 2023, 12:39am

An update after making the embedding change:

Search now more accurate overall with only 1536 dimensions compared to the previous 7680.
However some items that shouldn’t be in the top results appear there.

For example, the search “find me all the teachers in france” returns a contact from the United States as the second result with a score of 0.8012. The “location_country” of this contact is United States, however the job title is “Enseignant” which is French for teacher which may have caused the result to be placed higher up the list.

Not sure if the results are as accurate as they can currently be with existing embedding libraries and I would need to do a hybrid of vector + keyword search to get more accurate results.
(ie. use vector search to identify closest categories based on the csv/database columns that exist and run a traditional search filter using the returned categories)

sam.morgan · June 1, 2023, 8:39pm

@bryan_1

I am no expert, but I read somewhere that you can add information to the metadata, that can then be used for non-semantical querying.

Vector databases are specifically designed for unstructured data and yet provide some of the functionality you’d expect from a traditional relational database. They can execute CRUD operations (create, read, update, and delete) on the vectors they store, provide data persistence, and filter queries by metadata. When you combine vector search with database operations, you get a powerful tool with many applications.

Source:

I have a similar problem where it’s not accurate at traditional database querying:

Example: If I wanted the find the user with the most posts.

It doesn’t filter the data in order as the overall value (Vectorised Position) is not based on the numerical value.

Solution 1:
I read a solution to this is to base vectorised values in relation to the numerical values, but this seems unintuitive and non-extendable.

Solution 2:
Embed the Numerical values as metadata and Fitler the data from a traditional sense.

Im creating this using Langchain js, and Solution 2 requires hardcoding the filter of metadata. I don’t know if this is the best solution, but I feel nervous about allowing a prompts response to set this filter+ it adds an additional Api call.

@Jasper
Is this the best solution?

Jasper · June 5, 2023, 5:42am

Hi @sam.morgan!

Yes! Meatadata is the place to store additional data, you want to use for filtering data when querying your searches.

While I am still not completely sure what you want to query (you have to query something - vector database needs a vector query one way or the other), I think COUNT() is not supported.

You can store this data in Pinecone, but I think it would be a long way around I guess. How I would do it.

For each user, I would create something in the lines of: ‘User with username Jasper has 10 posts.’. I would embedd this data and upsert it directly into Pinecone with an ID I would hand pick and know. I would then add some metadata { username, num_posts, datetime }. Okey, all good and well so far.

Every time a user would create a new post I would query the index with ‘How many posts does Jasper have?’ and set the top-k to 3. I would check each results metadata it the username is indeed the one I seek. Then just update the vector with new value, embedd again and update the index.

Next comes the querying. Here you could have a problem and would have to test how good the solution is. The biggest problem is, how to get the one with the MOST posts. Here you will have to see if querying actually returns results with results that make sense.

All in all, I think for your usecase vector database doesn’t provide what you want. Or I am not understanding your usecase completely (which is also possible ). Either way, let me know if I missed the point of your problem or if it helped you think it through.

Best, J

sam.morgan · June 5, 2023, 9:10am

Hey J,

I don’t know what I want, so I want to catch ‘every’ use case of a human trying to find data from a database (I’ve started small with a CSV, but want to scale on successful solutions), Im using my companies Slack Analytics as a proof of concept and to explore its limitations.

So looking into Langchain possibilities,

And you are right I can query with a filter of metadata
for example:

I can iterate with whatever the top messagesPosted value from the results from top-k

R: Ian = 5000
R: jef = 200
R: eli = 500
Q: MessagesPosted : {‘$gt’:5000}
R: Paul = 12200
…
Q: MessagesPosted : {‘$gt’:12200}
R: Paul = 144000
…
Q: MessagesPosted : {‘$gt’:144000}
…
…
But this seems very un-optimised and when if this was a SQL database I could just have 1 SQL query to return the row which contains the highest MessagesPosted

What I’ve concluded is that:
I’m trying to use a vectorised database just like it was a traditional database when I should see it more as a semantical database with semantical queries (Is the query/value subjective?). It will cover some of my ‘every’ use cases.

The next Solution for me to test:

Take the prompt and decide between

Extracting from a Vectorised DB or
Extracting from a SQL DB via Text to SQL

Then use those results as a reference for answering the prompt.

ausaf.ahmad · June 20, 2023, 9:42am

So what did you decide to go with? The Vectorised DB approach or Text to SQL?
I’m working on a similar problem and am kind of in the same situation as you. Especially as my data has a heck of a lot more numerical values. What would be the best method to move forward?

borjaobeso · June 30, 2023, 2:30pm

Could you fine-tune GPT to re-interpret the results of your vector query? Use the questions you would normally ask with examples of the expected answers. Then, when you tell the fine tuned model to interpret the results of the query for “find me all the teachers in france” it would know to ignore the “Enseignant” because you included an example in your fine-tuning data?

RobbieTheRobot · September 16, 2023, 8:59am

I have a similar issue, where I have a records with multiple numbers and text fields, and considering fine-tuning a free / open source model. Could you please let me know how this may be done?

vanya203 · September 28, 2024, 4:57am

yes, Pinecone is not very good in multi-field searches.
Combining many vectors in one - is not a solution. You cannot combine these vectors into a larger vector, as the combination does not reflect a meaningful representation of the original text’s semantics.
You need to apply a compound-typed search. Pinecone is an engine/index; you need to build the combination logic on top of it.
In Azure, you can do Search on multiple fields and even provide weights to the columns.

Here is an article about testing and comparing the approach: Showdown of Smart Search Systems: Pinecone vs. AI Search | by Pavlo Hrechko | Medium

tinuagrawal1097 · December 17, 2024, 6:45am

simmilar questions i have also

I am currently facing an issue. I have finance-related data in CSV format, containing multiple features (around 100), including company information, shareholder information, and profit-loss data. I created embeddings using Google Generative Embedding and used the Pinecone vector database. However, when I retrieve data based on MMR (Maximum Margin Retrieval) instead of similarity, the results are not accurate. I would like to ask you: in which format should I prepare my data to ensure accurate embedding and retrieval of answers?