Chunking
Caskada does NOT provide built-in chunking utilities.
Instead, we offer examples that you can implement yourself. This approach gives you more flexibility and control over your project's dependencies and functionality.
Below are example implementations of commonly used text chunking approaches.
Text chunking is more of a micro-optimization compared to the Flow Design.
We recommend starting with naive chunking and optimizing later.
Example Code Samples
1. Naive (Fixed-Size) Chunking
Splits text into chunks of a fixed number of characters, ignoring sentence or semantic boundaries.
def fixed_size_chunk(text: str, chunk_size: int = 100) -> list[str]:
    """Splits text into fixed-size chunks based on character count."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i : i + chunk_size])
    return chunks
# Example:
# text = "This is a sample text to demonstrate fixed-size chunking."
# chunks = fixed_size_chunk(text, 20)
# print(chunks)
# Output: ['This is a sample tex', 't to demonstrate fix', 'ed-size chunking.']

function fixedSizeChunk(text: string, chunkSize: number = 100): string[] {
  /** Splits text into fixed-size chunks based on character count. */
  const chunks: string[] = []
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize))
  }
  return chunks
}
// Example:
// const text = "This is a sample text to demonstrate fixed-size chunking.";
// const chunks = fixedSizeChunk(text, 20);
// console.log(chunks);
// Output: [ 'This is a sample tex', 't to demonstrate fix', 'ed-size chunking.' ]

However, sentences are often cut awkwardly, losing coherence.
2. Sentence-Based Chunking
Groups a fixed number of sentences together. Requires a sentence tokenizer library (e.g., nltk for Python).
However, this approach might not handle very long sentences or paragraphs well.
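As a rough illustration, here is a minimal sketch using nltk's sent_tokenize (the function name sentence_chunk and the max_sentences parameter are our own choices for this example; nltk's punkt tokenizer data must be downloaded first):

import nltk
from nltk.tokenize import sent_tokenize

# One-time setup (assumption: punkt data not yet installed): nltk.download('punkt')

def sentence_chunk(text: str, max_sentences: int = 3) -> list[str]:
    """Groups a fixed number of sentences into each chunk."""
    sentences = sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunks.append(" ".join(sentences[i : i + max_sentences]))
    return chunks

# Example:
# text = "First sentence. Second sentence. Third sentence. Fourth sentence."
# print(sentence_chunk(text, 2))
# Output: ['First sentence. Second sentence.', 'Third sentence. Fourth sentence.']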
3. Other Chunking
Paragraph-Based: Split text by paragraphs (e.g., newlines); see the sketch after this list. Large paragraphs can create big chunks.
Semantic: Use embeddings or topic modeling to chunk by semantic boundaries.
Agentic: Use an LLM to decide chunk boundaries based on context or meaning.
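As referenced above, a minimal sketch of paragraph-based chunking that splits on blank lines (the function name paragraph_chunk is our own for this illustration):

def paragraph_chunk(text: str) -> list[str]:
    """Splits text into chunks at blank lines (paragraph boundaries)."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if p]

# Example:
# text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
# print(paragraph_chunk(text))
# Output: ['First paragraph.', 'Second paragraph.', 'Third paragraph.']

Unlike fixed-size chunking, this keeps related sentences together, but very long paragraphs still produce large chunks.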