Chunking Strategies for LLM Applications

In the context of building LLM-related applications, chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector database once we embed it. In this blog post, we’ll explore whether and how chunking helps improve efficiency and accuracy in LLM-related applications.

As we know, any content that we index in Pinecone needs to be embedded first. The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible while it remains semantically relevant.
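
To make the idea concrete, here is a minimal, framework-agnostic sketch in Python of the simplest strategy, fixed-size chunking with overlap; the chunk size and overlap values are illustrative assumptions rather than recommendations from the article:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size, character-based chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks


# Each chunk would then be embedded and upserted into the index separately.
document = "..."  # the raw text of a document
chunks = chunk_text(document)
```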


This is a companion discussion topic for the original entry at https://www.pinecone.io/learn/chunking-strategies/

I’ve built a “talk with your books” chatbot using this chunking approach and Pinecone. It’s a fork of mayooear/gpt4-pdf-chatbot-langchain on GitHub (GPT4 & LangChain Chatbot for large PDF docs).

A big problem here is that the bot cannot summarize a book or any large document because it only sees 1-2 chunks at a time. It can answer questions as long as the answer is fully contained inside a chunk, but not beyond that. Are there any approaches to chunking that would allow the model to “digest” a large document?


Hi, one way is to make smaller chunks and concatenate the top-k results before feeding them to the LLM. Even this approach is limited by the LLM’s context length. If the document is very large, I suppose the only current method is to fine-tune the LLM on it.
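
As a rough sketch of that idea, the snippet below assumes you already have the top-k matches back from the vector database as (score, chunk text) pairs and concatenates them until a token budget is reached; the budget and the 4-characters-per-token estimate are assumptions for illustration:

```python
def build_context(matches: list[tuple[float, str]], max_tokens: int = 3000) -> str:
    """Concatenate the highest-scoring chunks until a rough token budget is hit."""
    parts, used = [], 0
    for score, chunk in sorted(matches, key=lambda m: m[0], reverse=True):
        est_tokens = len(chunk) // 4  # crude estimate: ~4 characters per token
        if used + est_tokens > max_tokens:
            break
        parts.append(chunk)
        used += est_tokens
    return "\n\n".join(parts)


# The returned string is then placed in the prompt ahead of the user's question.
```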


Thanks, both of those approaches sound good!

How do you think your model would perform with summarization instead of chunking?

Very informative, thank you!

It would have been great if you could give examples of performance tuning based on different text characteristics and different chunking methods, sizes, and overlaps.

Can anyone point me to any scientific peer-reviewed journal articles, whitepapers, etc. on this topic? I’m writing a paper on the subject myself, but so far I have not found any closely related previous work.