Optimizing document structure for searches

Hi,
I currently have large PDFs of technical documents, and I’m going to have to rewrite them in a more ‘Pinecone-friendly’ format. I’ve been told Markdown or HTML works well, but I think I’ll still have issues with data being taken out of context.

Below is a snippet of the docs. If you look at some of the paragraphs separately, you’d have no idea which heading they belong to. Short of putting the heading on every paragraph and making sure each one is smaller than my chunk size, is there a good way to get around this?

ZONE TEMPERATURE CONTROLLER INTERFACES

GENERAL:
The Air Conditioning System Controller (ACSC) communicates with other systems through hardware interfaces.

SDAC:
System data information is transmitted to the System Data Acquisition Concentrator (SDAC) via ARINC buses for system monitoring. Temperature, valve position, and other data are used for warnings and display.

DMU:
The ACSCs send system main status data to the Data Management Unit (DMU) for maintenance monitoring functions. The ACSC sends trim-air Pressure Regulating Valve (PRV) position, pack flow, water extractor and pack compressor discharge temperatures, BYPass valve, and ram air inlet flap positions to the DMU.

In HTML it would still cause the same issue.
*Edit: of course, it’s being rendered as HTML here, but you get the drift.

ZONE TEMPERATURE CONTROLLER INTERFACES

GENERAL:

The Air Conditioning System Controller (ACSC) communicates with other systems through hardware interfaces.

SDAC:

System data information is transmitted to the System Data Acquisition Concentrator (SDAC) via ARINC buses for system monitoring. Temperature, valve position, and other data are used for warnings and display.

DMU:

The ACSCs send system main status data to the Data Management Unit (DMU) for maintenance monitoring functions. The ACSC sends trim-air Pressure Regulating Valve (PRV) position, pack flow, water extractor and pack compressor discharge temperatures, BYPass valve, and ram air inlet flap positions to the DMU.

Keep in mind that this is a tiny snippet. There’s generally one heading per page, and some paragraphs are a lot larger too.

Hi @Dearlove,

Thanks for your question!

For what it’s worth, we’ve been discussing internally how to demonstrate different chunking strategies better and to make it easier to choose which approach will work best for your use case.

In the meantime, I wonder if you have heard of these options?

  • Frameworks like LangChain, which is available in both Python and TypeScript, have document loaders for Markdown, CSV, and many other data formats. This can be a nice start and provide a uniform interface for interacting with documents later in your pipeline.
  • We have a Learn article on chunking strategies for LLM applications that is quite comprehensive. It discusses different methods for chunking and how to determine which one is best suited to your use case.
  • I also wonder if you might benefit from using metadata, which allows you to attach arbitrary data to each vector you upsert. For example, if you always want to be able to reference which document or heading the vector result relates to, you could attach that information to your vectors as metadata.
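
To make the metadata idea a bit more concrete, here is a minimal sketch in plain Python (no LangChain or Pinecone calls). It assumes your documents follow the pattern in your snippet: section titles are upper-case lines, sub-headings are upper-case lines ending in a colon, and each paragraph sits on one line. The `Chunk` class, the helper names, the `max_chars` limit, and the `acsc-manual.pdf` filename are all made up for illustration, so treat this as a starting point rather than a recommended implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str  # text to embed, with the heading context prepended
    metadata: dict = field(default_factory=dict)


def is_subheading(line: str) -> bool:
    # Assumption: sub-headings look like "SDAC:" (upper case, trailing colon).
    return line.endswith(":") and line[:-1].isupper()


def is_section_title(line: str) -> bool:
    # Assumption: section titles are upper-case lines without a trailing colon.
    return line.isupper() and not line.endswith(":")


def split_long(text: str, max_chars: int) -> list[str]:
    # Group whole sentences into pieces of at most max_chars characters.
    # A single sentence longer than max_chars is kept whole.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pieces, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        pieces.append(current)
    return pieces


def chunk_document(raw: str, source: str, max_chars: int = 500) -> list[Chunk]:
    chunks, section, subsection = [], "", ""
    for line in (l.strip() for l in raw.splitlines()):
        if not line:
            continue
        if is_subheading(line):
            subsection = line[:-1]
        elif is_section_title(line):
            section, subsection = line, ""
        else:
            # Every piece carries its headings both in the text and as metadata.
            context = " > ".join(p for p in (section, subsection) if p)
            for piece in split_long(line, max_chars):
                chunks.append(Chunk(
                    text=f"{context}\n{piece}" if context else piece,
                    metadata={"source": source, "section": section,
                              "subsection": subsection},
                ))
    return chunks


sample = """ZONE TEMPERATURE CONTROLLER INTERFACES

GENERAL:
The Air Conditioning System Controller (ACSC) communicates with other systems through hardware interfaces.

SDAC:
System data information is transmitted to the System Data Acquisition Concentrator (SDAC) via ARINC buses for system monitoring. Temperature, valve position, and other data are used for warnings and display.
"""

chunks = chunk_document(sample, source="acsc-manual.pdf", max_chars=150)
for chunk in chunks:
    print(chunk.metadata["subsection"], "|", chunk.text)
```

Prepending the heading path to the embedded text keeps each chunk meaningful on its own, while the `metadata` dictionary is the kind of payload you could attach to each vector when you upsert, so results can be traced back to their document and heading.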

I hope that helps, and let me know if you have any additional questions!