ELI5 - need basic help with our first project

Hello everyone,

Our first project involves 20k different stories stored as text files. Our goal is to convert these text files into embeddings, add metadata specifying the source and category of each story, and upload the data to Pinecone. Once uploaded, we aim to use models like Koala 13b, GPT-4-x-alpaca-13b-native-4bit-128g, and others with our private Pinecone data to develop chatbots and generate content.
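To make the plan concrete, here is a rough sketch of the record shape we have in mind: one id, one vector, and a metadata dict holding the source and category of each story. The id / values / metadata layout is our understanding of what a Pinecone upsert expects. `fake_embed` is only a stand-in for a real embedding model, `build_records` and `batched` are hypothetical helper names of ours, and the sample stories are made up. Nothing here calls Pinecone itself.

```python
import hashlib
from typing import Dict, Iterator, List

DIMENSION = 100  # assumption: equals the embedding size and the index dimension


def fake_embed(text: str, dim: int = DIMENSION) -> List[float]:
    """Placeholder for a real embedding model: a deterministic pseudo-vector."""
    values: List[float] = []
    counter = 0
    while len(values) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        values.extend(b / 255.0 for b in digest)
        counter += 1
    return values[:dim]


def build_records(stories: List[Dict[str, str]]) -> List[dict]:
    """Turn each story into an id + vector + metadata record."""
    records = []
    for story in stories:
        records.append({
            "id": story["id"],
            "values": fake_embed(story["text"]),
            "metadata": {
                "source": story["source"],
                "category": story["category"],
            },
        })
    return records


def batched(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size chunks so 20k records are not sent in one request."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up sample data, for illustration only.
stories = [
    {"id": "story-0001", "text": "Once upon a time...", "source": "archive_a", "category": "fairy_tale"},
    {"id": "story-0002", "text": "It was a dark night.", "source": "archive_b", "category": "mystery"},
    {"id": "story-0003", "text": "The ship left at dawn.", "source": "archive_a", "category": "adventure"},
]

records = build_records(stories)
batches = list(batched(records, 2))
```

Each batch would then be handed to the Pinecone client's upsert call. We left that part out because we have not settled on a client version yet.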

So, as a first step, we need to create an index for the data. We were told that vector_size=100 should be OK for our needs. Is that true? And if so, should the index's dimension also be set to 100?
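Our current understanding, which we'd like confirmed, is that the index dimension is not a free tuning knob: it has to equal the length of the vectors our embedding model actually produces, so 100 is only right if the model outputs 100 numbers per story. A small check like the one below is what we would run before upserting. `check_dimension` is our own hypothetical helper, not part of any library, and the vectors are dummy data.

```python
from typing import List, Sequence

INDEX_DIMENSION = 100  # the value we were told to use


def check_dimension(vectors: Sequence[Sequence[float]], expected: int) -> List[int]:
    """Return the positions of vectors whose length differs from the index dimension."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected]


# Dummy vectors: two of the right size, one from a hypothetical 384-dimensional model.
vectors = [[0.0] * 100, [0.0] * 100, [0.0] * 384]
bad = check_dimension(vectors, INDEX_DIMENSION)
print(bad)  # [2]
```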

The next question is which metric setting is best for our needs. Should we use the default, cosine?
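For reference, this is how we understand the three metrics Pinecone offers (cosine, dotproduct, euclidean), written out in plain Python on toy vectors. The function names are ours rather than a library's. The one thing we think we have right is that once vectors are scaled to unit length, cosine and dot product give the same score.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: List[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine(a: List[float], b: List[float]) -> float:
    """Angle-based similarity: ignores vector length."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a: List[float], b: List[float]) -> float:
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# Toy vectors, for illustration only.
query = [1.0, 2.0, 3.0]
story = [2.0, 4.0, 6.5]

cos_score = cosine(query, story)
dot_on_unit = dot(normalize(query), normalize(story))
```

If that is right, the choice seems to matter mainly when the embedding model does not normalize its output, and we would appreciate a correction if we have misunderstood.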

Since our goal is to have open-source LLMs working with our private data, is there anything we must take into account when creating the index, and later when creating the embeddings and upserting the data?

We would be grateful for any suggestions, tutorials, or other valuable information to help us bring this project to life!

Looking forward to your insights and guidance!
Best regards,