Create embeddings for different sections of a document

NeueDocs · July 15, 2024, 11:43pm

Hey y’all,

My team and I are aiming to create embeddings for different sections of documents that are not uniformly formatted.

We’re experimenting with several Python scripts that divide documents into sections based on paragraphs, headings, or custom criteria.
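
To give a rough idea of what that kind of splitting looks like, here is a simplified sketch rather than our actual scripts. It assumes Markdown-style `#` headings and blank lines between paragraphs, and `split_sections` is just an illustrative name:

```python
import re
from typing import Dict, List

# Assumption: headings look like Markdown ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(text: str) -> List[Dict[str, object]]:
    """Group paragraphs under the most recent heading seen above them."""
    # Start with a placeholder section for any text before the first heading.
    sections: List[Dict[str, object]] = [{"heading": None, "paragraphs": []}]
    buffer: List[str] = []

    def flush() -> None:
        # Close the paragraph being collected and attach it to the current section.
        if buffer:
            sections[-1]["paragraphs"].append(" ".join(buffer))
            buffer.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            sections.append({"heading": match.group(2), "paragraphs": []})
        elif line.strip():
            buffer.append(line.strip())
        else:
            # A blank line ends the current paragraph.
            flush()
    flush()

    # Drop the placeholder if nothing came before the first heading.
    return [s for s in sections if s["heading"] is not None or s["paragraphs"]]
```

Each section's paragraphs could then be joined and sent to an embedding model, with the heading kept as metadata. Documents without this kind of heading markup would need a different rule in place of the `HEADING` pattern.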

Just curious to know if anyone has a great workflow or a smooth way to programmatically solve the problem of dividing a document into sections, depending on the structure and content of the document?

ZacharyProser · July 16, 2024, 7:12pm

Hi @NeueDocs, and welcome to the Pinecone community forums!

Thanks for your question.

Have you taken a look at Unstructured.io? They have a REST API but also an open source offering that can be used for free: the Unstructured-IO/unstructured repository on GitHub, which provides open source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Hope that helps!

Best,
Zack