Hey guys - This might be elementary, and I apologize for that. Anyway, I have queried my index for matching vectors and I get an API response like the one in the screenshot below:
How, then, do I take that vector and “translate” it into readable, meaningful text to supply to GPT in my prompt? Simply put, how do I take a vector and make it readable text?
Thanks again guys! - Jon
Ok, I think I understand this after further research here but correct me if I am wrong…
When upserting the initial data, it sounds like the “id” is where, as a best practice, I should put the actual text representation of the “vector” that I want to use in my prompts?
Can anyone confirm this? Thanks!
If you want text attached to the vector, you need to add the text you vectorized to the vector’s metadata when upserting it. Be careful, as metadata is size-limited and does not support longer texts. You cannot reverse the vector back to the original text.
Hope this helps
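To make the metadata approach concrete, here is a minimal sketch in plain Python. It only builds the record you would upsert and reads the text back out of a query-style response; it does not call Pinecone itself. The helper names (build_record, texts_from_matches), the "text" metadata key, and the 10,000-byte budget are all assumptions for illustration, so check the real metadata limit for your index. With the Pinecone Python client you would also need to ask for metadata in the query (I believe via include_metadata=True), otherwise the matches come back without it.

```python
import json

# Hypothetical safety budget, NOT Pinecone's documented limit; look that up for your index.
MAX_METADATA_BYTES = 10_000


def build_record(record_id, embedding, text, max_bytes=MAX_METADATA_BYTES):
    """Build an upsert record that keeps the source text next to its vector."""
    metadata = {"text": text}
    # Estimate the metadata size as UTF-8 encoded JSON.
    size = len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))
    if size > max_bytes:
        raise ValueError(f"metadata is {size} bytes, over the {max_bytes} byte budget")
    return {"id": record_id, "values": list(embedding), "metadata": metadata}


def texts_from_matches(matches):
    """Pull the stored text out of each match that carries it."""
    return [m["metadata"]["text"] for m in matches if "text" in m.get("metadata", {})]


# A toy 3-dimensional embedding stands in for a real one.
record = build_record("doc1-chunk0", [0.1, 0.2, 0.3], "Pinecone stores vectors, not text.")

# A hand-made stand-in for a query response that includes metadata.
fake_response = {
    "matches": [{"id": record["id"], "score": 0.92, "metadata": record["metadata"]}]
}

# This string is what you would paste into the GPT prompt as context.
context = "\n\n".join(texts_from_matches(fake_response["matches"]))
print(context)
```

The id stays a short unique key; the readable text lives in metadata and comes back with each match.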
@Jasper Thanks man!
Another quick question… What is the ideal “character” or token count when chunking the text down before vectorizing it?
I wish I knew. It’s difficult to determine, because different domains, topics, and documents have different ways of saying things. Compare a legal document and a manual: the legal document will go all the way to the next galaxy and back to say a “very simple” thing, whereas the manual will usually describe things quickly. Can we split both documents the same way? Not really.
The next problem is also information dilution. The more text you take, the more the information in it gets diluted when you create an embedding from it and the harder it will be to use it well.
I’ve found that it worked best when I used the structured, logical parts of the document to guide me: paragraphs, chapters, etc. If a paragraph was too long to upsert to Pinecone, I split it into pieces of at most 4k characters, and that worked well for me (you can roughly estimate how much text fits in metadata by assuming 1 character = 1 byte, though I would not advise using all of the space provided).
I know it is not a real answer, but I hope it helps
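A rough sketch of the paragraph-first splitting described above, in plain Python. It assumes paragraphs are separated by blank lines and uses the 4k-character cap from the post; the function names (chunk_by_paragraph, split_long) are made up for illustration. Note that the 1 character = 1 byte estimate only holds for ASCII text, so the last line measures the real UTF-8 size.

```python
import re


def split_long(paragraph, max_chars):
    """Cut an oversized paragraph into pieces, breaking at a space when one is available."""
    pieces = []
    rest = paragraph.strip()
    while len(rest) > max_chars:
        # Look for the last space that keeps the piece within max_chars.
        cut = rest.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no space found: fall back to a hard cut
        pieces.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        pieces.append(rest)
    return pieces


def chunk_by_paragraph(text, max_chars=4000):
    """Use blank-line-separated paragraphs as chunks; split only the ones that are too long."""
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
        else:
            chunks.extend(split_long(para, max_chars))
    return chunks


sample = "Short para.\n\n" + "word " * 2000
chunks = chunk_by_paragraph(sample)

# Actual UTF-8 size of each chunk, in bytes (equals the character count only for ASCII).
sizes = [len(c.encode("utf-8")) for c in chunks]
print(len(chunks), max(sizes))
```

Splitting on chapters or headings instead would follow the same pattern with a different separator.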
Start with a chunk size of around 300-400 words and fine-tune from there.
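If you want to try the word-count starting point, a minimal sketch could look like this. The function name and the default of 350 words (the middle of the suggested range) are assumptions; it splits on whitespace and ignores paragraph boundaries, so treat it as a baseline to tune against your own retrieval results.

```python
def chunk_by_words(text, words_per_chunk=350):
    """Split text into consecutive chunks of at most words_per_chunk whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


sample = " ".join(f"w{i}" for i in range(1000))
word_chunks = chunk_by_words(sample)
print([len(c.split()) for c in word_chunks])
```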