When upserting, where to insert the text

premise: noob here

Hi, so I am building a database that contains legal documents (for example, several articles of approximately 400 words each).
Right now I am inserting the raw text of these documents in the metadata. As you might know, the max metadata size per vector is 40 KB, which is not good if I have to store many words.
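
For what it's worth, a ~400-word article is comfortably under the 40 KB limit; you can check the serialized size of a metadata payload before upserting. A minimal sketch (the 40 KB figure is the per-vector metadata limit mentioned above):

```javascript
// Rough size check for a metadata payload before upserting.
const METADATA_LIMIT_BYTES = 40 * 1024; // 40 KB per-vector metadata limit

function metadataSizeBytes(metadata) {
  // Metadata travels as JSON, so measure the serialized form in UTF-8.
  return Buffer.byteLength(JSON.stringify(metadata), "utf8");
}

// A ~400-word article: 400 short words is only a few KB.
const article = Array(400).fill("clause").join(" ");
const size = metadataSizeBytes({ text: article });
console.log(size < METADATA_LIMIT_BYTES); // comfortably under the limit
```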

The application I want to build is a simple semantic search engine.

  1. User inputs the query
  2. The query gets vectorized
  3. I query Pinecone with those vectors
  4. I extract the documents’ raw text from the metadata of the returned matches.

I don’t know if I am doing this right.


Hi!
I’ve built a very similar, if not the same, process. 400 words shouldn’t be a problem for metadata, but even so, look at the documents you have and try to discern whether you can split them up somehow. If you have law documents, split them by the articles of that law and insert each article as its own document. If you have court cases, check whether it would even be good to push the whole text: maybe each paragraph deals with a different aspect of the case, so should it all be joined into one vector? Consider splitting them up into smaller, meaningful documents.
Good luck.
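
For example, splitting on blank lines is often a reasonable first pass. A sketch, assuming your documents separate articles or paragraphs with blank lines (adjust the delimiter to your actual structure):

```javascript
// Split a legal document into paragraph-sized chunks, one vector each.
// Assumes blank lines separate articles/paragraphs.
function splitIntoChunks(text) {
  return text
    .split(/\n\s*\n/)        // blank line = chunk boundary
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

const doc = "Article 1. First rule.\n\nArticle 2. Second rule.\n\n";
console.log(splitIntoChunks(doc)); // ["Article 1. First rule.", "Article 2. Second rule."]
```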

You left out the part about getting embeddings for your documents. The way I did it was to break documents into paragraphs and get embeddings for each one. Then I format my paragraph data and embeddings into an object I can send to Pinecone:

```js
{ vectors: [{ id: 'item_0', metadata: { text: 'The text I want returned' }, values: [TheEmbeddings] }] }
```

And then, yeah, query this with the vectorized user prompt.
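
Putting it together, a payload of that shape can be built per paragraph. A sketch with a placeholder embedding function (a real setup would call your embedding model and then the client's upsert method):

```javascript
// Build an upsert payload of the shape shown above, one vector per paragraph.
// fakeEmbed is a placeholder; swap in a real embedding model.
function fakeEmbed(text) {
  return [text.length % 7, text.length % 11]; // NOT a real embedding
}

function buildUpsertPayload(paragraphs) {
  return {
    vectors: paragraphs.map((text, i) => ({
      id: `item_${i}`,
      values: fakeEmbed(text),
      metadata: { text }, // raw text comes back with query results
    })),
  };
}

const payload = buildUpsertPayload(["First paragraph.", "Second paragraph."]);
console.log(payload.vectors[0].id); // "item_0"
```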

Same question here - from the docs it seems like storing text in the metadata is not the way to do it:

> High cardinality consumes more memory: Pinecone indexes metadata to allow for filtering. If the metadata contains many unique values — such as a unique identifier for each vector — the index will consume significantly more memory. Consider using selective metadata indexing to avoid indexing high-cardinality metadata that is not needed for filtering.

I’m assuming text qualifies as high cardinality. I haven’t seen any example showing where to put the text! The only thing I can assume is that it’s supposed to live in some other DB and we retrieve it using the ID? Seems roundabout… so I’m hoping there’s another answer…?

Hi @bigrig

Indexing huge chunks of text is not recommended, as it will slow down filtering and searching (as written in your post). But will you be using the full text for filtering? In my case I use the text metadata for what you said last: avoiding another DB with the id and text.

What you can do is selective metadata indexing when creating the index: call create_index with the metadata_config object and select for indexing only the metadata fields you know you will need for filtering (see Manage indexes). If you won’t need any filtering… well yeah :smiley:
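
For example, a metadata_config fragment like the following (the field names here are just illustrative) tells Pinecone to index only those fields for filtering; a large text field would still be stored and returned with query results, just not indexed:

```json
{
  "metadata_config": {
    "indexed": ["category", "year"]
  }
}
```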

Hope this helps


I’ve thought about the same thing: no metadata at all. Only index the vectors, and keep another database that contains the corresponding text, linked via the ID created for the vector database.

So when you query the Pinecone database you only retrieve the ID, and then perform a query on a classical database with that ID to retrieve the text itself.
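
That two-step lookup can be sketched with an in-memory map standing in for the classical database (a real setup would run a SQL query keyed on the returned IDs):

```javascript
// Stand-in for the classical database: id -> full document text.
const docStore = new Map([
  ["item_0", "Article 1. Full text of the first article..."],
  ["item_1", "Article 2. Full text of the second article..."],
]);

// Pretend this came back from the vector query (ids + scores only, no metadata).
const matches = [
  { id: "item_1", score: 0.92 },
  { id: "item_0", score: 0.87 },
];

// Second step: resolve ids to full documents.
const results = matches.map((m) => ({ ...m, text: docStore.get(m.id) }));
console.log(results[0].text); // "Article 2. Full text of the second article..."
```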

I’m thinking along the same lines. Keep the semantic search index with just the embedding and source metadata, and use a separate DB for document retrieval, assuming the full text is not needed every time a search is performed. It would also provide an extra layer of security for sensitive docs.


Aha! @Jasper, selective metadata indexing is the solution for me. I wasn’t aware of that config option. That’ll keep the memory profile low and avoid the secondary lookup. Although also a good point, @Nils, about keeping data proliferation low.