ELI5 - need basic help with our first project

wheelie.tips · May 8, 2023, 4:15pm

Hello everyone,

This is our first project, and it involves 20k text files, each containing a story. Our goal is to convert these text files into embeddings, add metadata specifying the source and category of each story, and upload the data to Pinecone. Once it is uploaded, we aim to use models like Koala 13b, GPT-4-x-alpaca-13b-native-4bit-128g, and others with our private Pinecone data to develop chatbots and generate content.

So in the first step we need to create an index for the data. We were told that a vector_size=100 should be OK for our needs. Is that true? And if so, should the index dimension also be set to 100?

The next question is which metric setting is best for our needs. Should we use the default, cosine?

Since our goal is to have open-source LLMs working with our private data, is there anything we must take into account when creating the index, and later on when embedding and upserting the data?

We would be grateful for any suggestions, tutorials, or other valuable information to help us bring this project to life!

Looking forward to your insights and guidance!
Best regards,
WT

rschwabco · October 17, 2023, 6:11pm

Hi @wheelie.tips and sorry for the late reply,

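Regarding the vector_size or dimensions: the index dimension has to match the output size of the embedding model you use, so it is set by the model rather than chosen freely. A value of 100 is only correct if your model actually outputs 100-dimensional vectors. In most semantic search applications, it’s recommended to use a dedicated embedding model. For example, OpenAI’s text-embedding-ada-002 produces vectors with 1536 dimensions, while many open-source sentence-embedding models produce 384 or 768 dimensions.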
For the most part, you should use the same metric that was used to train the model you’re using. In most cases, cosine or dotproduct should work for the popular models.
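To illustrate why either metric tends to work, here is a minimal sketch in plain Python (no Pinecone calls): once vectors are scaled to unit length, the dot product and the cosine similarity give the same score. The vectors `a` and `b` and the helper names are made up for the example.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical toy vectors, standing in for real embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)                        # cosine on the raw vectors
dot_unit = dot(normalize(a), normalize(b))   # dot product on unit vectors

print(cos_ab, dot_unit)
```

If your embedding model already returns unit-length vectors, the two metrics rank results identically. If it does not, they can rank results differently, which is why following the model's documentation is the safer choice.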
If you’re dealing with private data you don’t want to share, consider using an embedding model under your control (as opposed to third-party providers). Data sent to Pinecone is a) embedded and b) encrypted at rest, which should be sufficient for any private data you may have. Pinecone is both SOC 2 and HIPAA compliant.
You should head over to our examples as well as the learning center for in-depth tutorials on a variety of topics that would help you with your project. Feel free to reach out to us on this forum with more questions or for guidance.
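As a rough sketch of the preparation step described in the question (embeddings plus source and category metadata), the snippet below builds records and checks that each vector's length matches the index dimension before anything is uploaded. Everything here is an assumption for illustration: `EXPECTED_DIM`, `make_record`, `batched`, the sample stories and the tiny 4-dimensional vectors are hypothetical, and the record layout (`id`, `values`, `metadata`) reflects my understanding of what the Pinecone Python client accepts, so please check it against the current docs.

```python
# Hypothetical value: must equal both the embedding model's output size
# and the dimension the index was created with.
EXPECTED_DIM = 4

def make_record(story_id, embedding, source, category, expected_dim=EXPECTED_DIM):
    """Build one upsert record, rejecting vectors of the wrong length."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"{story_id}: got {len(embedding)} dimensions, expected {expected_dim}"
        )
    return {
        "id": story_id,
        "values": list(embedding),
        "metadata": {"source": source, "category": category},
    }

def batched(records, batch_size):
    """Yield the records in consecutive chunks of at most batch_size."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Made-up sample data; real vectors would come from your embedding model.
stories = [
    ("story-1", [0.1, 0.2, 0.3, 0.4], "archive-a.txt", "fiction"),
    ("story-2", [0.5, 0.1, 0.0, 0.2], "archive-b.txt", "news"),
    ("story-3", [0.3, 0.3, 0.3, 0.1], "archive-c.txt", "fiction"),
]

records = [make_record(*story) for story in stories]
batches = list(batched(records, batch_size=2))

# Assumption: with the Pinecone client, each batch would then be sent with
# something like index.upsert(vectors=batch). That call is not made here.
for batch in batches:
    print(len(batch), [r["id"] for r in batch])
```

Storing the source and category as metadata this way should let you filter queries by those fields later. Uploading in batches rather than one vector at a time is the usual approach for a corpus of around 20k items.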

Good luck!