Transcripts Insertion tip

Hello,

I am setting up a project and testing the platform, and I have a question.

I have a significant number of raw .txt files with transcripts that can reach 80,000 words each.

Would it be better to insert these files as they are, or to split them and send them paragraph by paragraph?

I can also export the source files as JSON.

In addition, I can include the start and end time of each transcript in its file.

Could these be stored as metadata, along with the creator of each document?

Thanks

It would probably be better to send the transcripts in paragraphs to avoid pushing too much data in one go. I’m guessing you would be creating embeddings through OpenAI or a similar provider? Embedding models usually have a token limit per input as well, so an 80,000-word transcript would likely exceed it. Breaking the text into smaller chunks also tends to return more relevant context at query time, because each vector represents a narrower piece of the transcript.
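
To make the chunking idea concrete, here is a minimal sketch in Python. It assumes paragraphs in the .txt files are separated by blank lines, and it uses a word count as a rough stand-in for the model's token limit. The function name `chunk_paragraphs` and the 300-word default are just illustrative choices, not anything Pinecone or OpenAI prescribes.

```python
def chunk_paragraphs(text: str, max_words: int = 300) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_words words.

    Assumption: word count approximates token count; tune max_words for your model.
    """
    chunks: list[str] = []
    current: list[str] = []
    for para in text.split("\n\n"):
        words = para.split()
        # Slice paragraphs that are longer than max_words on their own.
        for i in range(0, len(words), max_words):
            piece = words[i:i + max_words]
            # Start a new chunk if adding this piece would exceed the limit.
            if current and len(current) + len(piece) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "first paragraph here\n\nsecond one\n\na third longer paragraph"
    for chunk in chunk_paragraphs(sample, max_words=4):
        print(chunk)
```

Short paragraphs get merged until the limit is reached, and an oversized paragraph is cut into several pieces, so no chunk goes over `max_words`. If you need exact token counts, you would swap the word count for your embedding model's tokenizer.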

You can use metadata to store the start and end time of each transcript and the creator of each document, then query/filter on those fields as needed. Pinecone has some pretty good docs on this!
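
As a rough illustration of the metadata part, the sketch below builds one record per chunk with the creator and the start/end times attached. The field names (`creator`, `start_time`, `end_time`, `chunk_index`, `text`) and the helper `build_records` are hypothetical, and the times are assumed to be Unix timestamps stored as numbers so that range filters can work on them. The embedding values and the actual upsert call are left out on purpose; check the Pinecone docs for the exact client calls.

```python
def build_records(doc_id: str, chunks: list[str], creator: str,
                  start_time: int, end_time: int) -> list[dict]:
    """Build one record per chunk, each carrying the transcript-level metadata.

    Assumption: you add the embedding vector under "values" before upserting.
    """
    return [
        {
            "id": f"{doc_id}-{i}",  # unique ID per chunk
            "metadata": {
                "creator": creator,
                "start_time": start_time,  # assumed Unix seconds
                "end_time": end_time,      # assumed Unix seconds
                "chunk_index": i,
                "text": chunk,
            },
        }
        for i, chunk in enumerate(chunks)
    ]


# A metadata filter in the MongoDB-style syntax that Pinecone queries accept:
# only chunks created by "alice" whose transcript started at or after the timestamp.
example_filter = {
    "creator": {"$eq": "alice"},
    "start_time": {"$gte": 1700000000},
}

if __name__ == "__main__":
    records = build_records("transcript-001", ["hello there", "second chunk"],
                            creator="alice", start_time=1700000000,
                            end_time=1700003600)
    print(records[0]["id"], records[0]["metadata"]["creator"])
```

Storing the chunk text in the metadata is optional. It is convenient for showing results, but it counts toward the per-record metadata size limit, so very large chunks may be better kept elsewhere and looked up by ID.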