chunkipy.text_chunker

Public chunker classes and data models exposed by chunkipy.text_chunker.

class chunkipy.text_chunker.BaseOverlapTextChunker(chunk_size=None, size_estimator=None, overlap_ratio=0.0)[source]

Bases: BaseTextChunker, ABC

Base class for chunkers that assemble chunks with overlap from text parts.

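Parameters:
  • chunk_size, size_estimator – Same as for BaseTextChunker.

  • overlap_ratio (float) – Controls how much of the previous chunk is repeated as overlap at the start of the next one. Defaults to 0.0.

chunk(text)[source]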

Chunk text by splitting first and then assembling chunk objects.

Return type:

Chunks

Parameters:

text (str)

abstract split_text(text)[source]

Split text into parts consumed by overlap-aware chunk assembly.

Return type:

Generator[TextPart, None, None]

Parameters:

text (str)
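The split-then-assemble flow described for chunk and split_text can be sketched as follows. This is a minimal illustration, not chunkipy's implementation: the TextPart stand-in, the assemble function, and the rule that overlap is the longest tail of the previous content whose size stays within overlap_ratio * chunk_size are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TextPart:
    # Stand-in for the documented TextPart(size, text); not imported from chunkipy.
    size: int
    text: str


def assemble(
    parts: Iterable[TextPart], chunk_size: int, overlap_ratio: float = 0.0
) -> List[Tuple[List[TextPart], List[TextPart]]]:
    """Group parts into (overlap, content) pairs (hypothetical helper).

    Assumption: the overlap carried into a chunk is the longest tail of the
    previous chunk's content whose total size is <= overlap_ratio * chunk_size.
    """
    overlap_budget = int(chunk_size * overlap_ratio)
    chunks: List[Tuple[List[TextPart], List[TextPart]]] = []
    overlap: List[TextPart] = []
    content: List[TextPart] = []
    used = 0  # combined size of overlap + content in the chunk being built

    for part in parts:
        # Close the current chunk when the next part would not fit.
        # A part is always accepted into an empty chunk, even if oversized.
        if content and used + part.size > chunk_size:
            chunks.append((overlap, content))
            tail: deque = deque()
            tail_size = 0
            for prev in reversed(content):
                if tail_size + prev.size > overlap_budget:
                    break
                tail.appendleft(prev)
                tail_size += prev.size
            overlap, content, used = list(tail), [], tail_size
        content.append(part)
        used += part.size

    if content:
        chunks.append((overlap, content))
    return chunks


parts = [TextPart(1, c) for c in "abcdefgh"]
result = assemble(parts, chunk_size=4, overlap_ratio=0.5)
texts = ["".join(p.text for p in ov + ct) for ov, ct in result]
```

With eight single-unit parts, chunk_size=4 and overlap_ratio=0.5, each chunk after the first starts with the last two parts of the previous chunk's content.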

class chunkipy.text_chunker.BaseTextChunker(chunk_size=None, size_estimator=None)[source]

Bases: ABC

Base class for all chunker implementations.

Parameters:
  • chunk_size (int) – Maximum size allowed for a single chunk in the units defined by size_estimator.

  • size_estimator (BaseSizeEstimator) – Strategy used to measure text size. Defaults to WordSizeEstimator.

abstract chunk(text)[source]

Chunk the provided text and return the result as a Chunks collection.

Return type:

Chunks

Parameters:

text (str)

class chunkipy.text_chunker.Chunk(overlap=<factory>, content=<factory>)[source]

Bases: object

Single chunk returned by a text chunker.

A chunk is composed of two ordered collections:

  • overlap: text parts repeated from the previous chunk to preserve context

  • content: text parts that are unique to the current chunk

The text and size properties are computed over the combined text_parts view.

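Parameters:
  • overlap (Overlap) – Text parts repeated from the previous chunk.

  • content (TextParts) – Text parts unique to the current chunk.

content: TextParts
overlap: Overlap

property size: int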

Return the total size of all TextPart objects in text_parts.

Returns:

The total size of all TextPart objects.

Return type:

int

property text: str

Return the full text of the chunk, formed by joining the text of every TextPart in text_parts.

Returns:

The full text of the chunk, concatenated from all text parts.

Return type:

str

property text_parts: TextParts

Return a combined ordered view of overlap and content text parts.
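The relationship between overlap, content, and the three derived properties can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the library's source: a plain deque and list replace Overlap and TextParts, and the parts are assumed to be joined with no separator.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class TextPart:
    # Stand-in for the documented TextPart(size, text).
    size: int
    text: str


@dataclass
class Chunk:
    # Stand-in mirroring the documented fields; both default to empty collections.
    overlap: Deque[TextPart] = field(default_factory=deque)
    content: List[TextPart] = field(default_factory=list)

    @property
    def text_parts(self) -> List[TextPart]:
        # Overlap first, then content, preserving order within each.
        return list(self.overlap) + self.content

    @property
    def size(self) -> int:
        return sum(part.size for part in self.text_parts)

    @property
    def text(self) -> str:
        # Assumption: parts carry their own whitespace and are joined as-is.
        return "".join(part.text for part in self.text_parts)


chunk = Chunk(
    overlap=deque([TextPart(1, "fox ")]),
    content=[TextPart(1, "jumps "), TextPart(1, "over")],
)
print(chunk.text)  # fox jumps over
print(chunk.size)  # 3
```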

class chunkipy.text_chunker.Chunks(iterable=(), /)[source]

Bases: list[Chunk]

List-like collection of Chunk objects returned by chunkers.

get_all_text()[source]

Return the serialized text for every chunk.

Returns:

A list of strings, where each string is the full text of a chunk.

Return type:

List[str]

get_all_text_parts()[source]

Return the text parts for every chunk.

Returns:

A list of per-chunk TextParts collections.

Return type:

List[TextParts]

class chunkipy.text_chunker.FixedSizeTextChunker(chunk_size=None, size_estimator=None, overlap_ratio=0.0)[source]

Bases: BaseOverlapTextChunker

Chunk text into fixed-size slices using the configured size estimator.

Each segment emitted by size_estimator.segment is treated as a unit of size 1 during chunk assembly.

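Parameters:
  • chunk_size, size_estimator – Same as for BaseTextChunker.

  • overlap_ratio (float) – Controls how much of the previous chunk is repeated as overlap at the start of the next one. Defaults to 0.0.

split_text(text)[source]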

Split the provided text into segments using the size estimator. Each segment is yielded as a TextPart of size 1.

Parameters:

text (str) – The text to be split.

Yields:

TextPart – One segment of the text, with size 1.

Return type:

Generator[TextPart, None, None]
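The "every segment has size 1" behaviour can be sketched as below. The word_segments function is a hypothetical stand-in for size_estimator.segment, whose real signature is not shown here, and TextPart is redefined locally rather than imported.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class TextPart:
    # Stand-in for the documented TextPart(size, text).
    size: int
    text: str


def word_segments(text: str) -> List[str]:
    # Hypothetical segmenter: one segment per word, keeping trailing whitespace.
    # Leading whitespace before the first word is dropped.
    return re.findall(r"\S+\s*", text)


def split_text(
    text: str, segment: Callable[[str], List[str]] = word_segments
) -> Iterator[TextPart]:
    # Every segment counts as exactly one unit, whatever its length.
    for seg in segment(text):
        yield TextPart(size=1, text=seg)


parts = list(split_text("one two  three"))
```

Because each part has size 1, chunk_size in this scheme effectively counts segments, so a chunk holds at most chunk_size segments.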

class chunkipy.text_chunker.Overlap[source]

Bases: TextPartsMixin, deque[TextPart]

Deque-like collection used to carry overlap between consecutive chunks.

class chunkipy.text_chunker.RecursiveTextChunker(chunk_size=None, size_estimator=None, overlap_ratio=0.0, text_splitters=None)[source]

Bases: BaseOverlapTextChunker

Chunk text by recursively applying increasingly fine-grained splitters.

The chunker tries each splitter in order until a text part fits within the configured chunk_size. Custom splitters are attempted before the default fallback splitters.

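Parameters:
  • chunk_size, size_estimator – Same as for BaseTextChunker.

  • overlap_ratio (float) – Controls how much of the previous chunk is repeated as overlap at the start of the next one. Defaults to 0.0.

  • text_splitters – Optional custom splitters, attempted before the default fallback splitters.

split_text(text)[source]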

Split the provided text into parts using the configured text splitters. Splitters are applied recursively, each finer than the last, until every part fits within chunk_size as measured by the size estimator.

Parameters:

text (str) – The text to be split.

Yields:

TextPart – A piece of the text together with its estimated size.

Return type:

Generator[TextPart, None, None]
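The recursive strategy can be sketched as follows. This is an illustration only: the splitter functions, the word-count estimator, and the choice to yield an oversized part unchanged once the splitters run out are assumptions for the example, not the library's documented behaviour.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence


@dataclass
class TextPart:
    # Stand-in for the documented TextPart(size, text).
    size: int
    text: str


Splitter = Callable[[str], List[str]]


def count_words(text: str) -> int:
    # Hypothetical size estimator: whitespace-separated word count.
    return len(text.split())


def by_sentence(text: str) -> List[str]:
    # Hypothetical coarse splitter: break after ., ! or ?
    return re.findall(r"[^.!?]+[.!?]*\s*", text)


def by_word(text: str) -> List[str]:
    # Hypothetical fine splitter: one piece per word.
    return re.findall(r"\S+\s*", text)


def split_recursive(
    text: str,
    chunk_size: int,
    splitters: Sequence[Splitter],
    estimate: Callable[[str], int] = count_words,
) -> Iterator[TextPart]:
    size = estimate(text)
    # Stop when the text fits, or when no finer splitter is left.
    if size <= chunk_size or not splitters:
        if text:
            yield TextPart(size, text)
        return
    first, rest = splitters[0], splitters[1:]
    for piece in first(text):
        yield from split_recursive(piece, chunk_size, rest, estimate)


text = "One two three. Four five six seven eight."
parts = list(split_recursive(text, chunk_size=3, splitters=[by_sentence, by_word]))
```

Here the first sentence fits within three words and is kept whole, while the second is too long and falls through to the word-level splitter.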

class chunkipy.text_chunker.TextPart(size, text)[source]

Bases: object

Represents a fragment or segment of a complete text, along with its size as measured by the size estimator in use.

Parameters:
  • size (int) – The size of the text based on the SizeEstimator used.

  • text (str) – The text of the segment.

size: int
text: str

Modules

base_overlap_text_chunker

base_text_chunker

data_models

fixed_size

recursive