When upserting, where to insert the text

premise: noob here

Hi, so I am building a database that contains legal documents (for example it has several articles of approximately 400 words each).
Right now I am inserting the raw text of these documents into the metadata. As you might know, the maximum metadata size per vector is 40 KB, which is a problem if I have to store long texts.

The application I want to build is a simple semantic search engine.

  1. User inputs the query
  2. The query gets vectorized
  3. I query Pinecone with those vectors
  4. I extract the metadata content to retrieve the documents’ raw text.
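
The four steps above can be sketched roughly as follows. This is only an illustration: `embed`, `upsert`, `search`, and the in-memory `index` are hypothetical stand-ins (a bag-of-words counter instead of a real embedding model, a dict instead of Pinecone), so it shows the shape of the flow and not any real client API.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stand-in for the vector index: id -> (vector, metadata).
index = {}

def upsert(doc_id, text):
    # The raw text travels with the vector as metadata.
    index[doc_id] = (embed(text), {"text": text})

def search(query, top_k=1):
    # Steps 1-2: vectorize the query. Step 3: rank stored vectors by similarity.
    query_vector = embed(query)
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vector, kv[1][0]), reverse=True)
    # Step 4: read the raw text back out of the metadata.
    return [(doc_id, meta["text"]) for doc_id, (_, meta) in ranked[:top_k]]

upsert("art-1", "The tenant must pay rent monthly")
upsert("art-2", "The employer must provide safe working conditions")
results = search("when must the tenant pay rent")
```

With a real setup, `embed` would call whichever embedding model you use and `index` would be the hosted vector index, but the order of the steps stays the same.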

I don’t know if I am doing this right.

I’ve built a very similar, if not identical, process. 400 words shouldn’t be a problem for metadata, but even so, look at the documents you have and see whether you can split them up somehow. If you have law documents, split them by article and insert each article as its own document. If you have court cases, check whether it even makes sense to push the whole text. Maybe each paragraph covers a different aspect of the case; should it all be joined into one vector? Could you split them up into smaller, meaningful documents?
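
Here is a minimal sketch of the splitting idea, assuming each article starts on its own line with "Article <number>" (adjust the pattern to your documents). The function names, the record layout, and the ID scheme are made up for illustration. The size check only measures the text itself, as a rough proxy for the 40 KB metadata limit mentioned in the question.

```python
import re

MAX_METADATA_BYTES = 40 * 1024  # the 40 KB per-vector limit from the question

def split_into_articles(law_text):
    # Assumption: every article begins at the start of a line with "Article <number>".
    # Splitting on a zero-width match needs Python 3.7 or newer.
    parts = re.split(r"^(?=Article\s+\d+)", law_text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

def to_records(law_id, law_text):
    # Build one record per article, each small enough to carry its text as metadata.
    records = []
    for number, article in enumerate(split_into_articles(law_text), start=1):
        size = len(article.encode("utf-8"))
        if size > MAX_METADATA_BYTES:
            raise ValueError(f"article {number} is {size} bytes; split it further")
        records.append({
            "id": f"{law_id}-art-{number}",  # hypothetical ID scheme
            "metadata": {"law": law_id, "text": article},
        })
    return records

sample = "Article 1\nThe tenant must pay rent monthly.\nArticle 2\nThe landlord must maintain the property."
records = to_records("tenancy-law", sample)
```

Each record would then be embedded and upserted on its own, so a search hit points at one article and not at the whole law.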
Good luck.