chunkipy

Public package exports for chunkipy.

This module exposes the main chunker classes and data models that are intended to be imported directly from chunkipy.

class chunkipy.BaseLanguageDetector

Bases: ABC

Base class for strategies that detect the language of a text.

abstract detect(text)

Detect the language code for the given text.

Return type: str
Parameters: text (str)
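
A custom detector only needs to implement detect. The sketch below is self-contained: rather than importing chunkipy, it declares a local stand-in for the abstract base that mirrors the documented interface. The KeywordLanguageDetector name and its stop-word heuristic are hypothetical and only illustrate the subclassing pattern.

```python
from abc import ABC, abstractmethod


class BaseLanguageDetector(ABC):
    """Local stand-in mirroring the documented interface (not the chunkipy class)."""

    @abstractmethod
    def detect(self, text: str) -> str:
        """Return the language code for the given text."""


class KeywordLanguageDetector(BaseLanguageDetector):
    """Hypothetical detector that guesses a language from a few stop words."""

    STOP_WORDS = {
        "en": {"the", "and", "is"},
        "it": {"il", "che", "è"},
    }

    def detect(self, text: str) -> str:
        words = set(text.lower().split())
        # Score each language by how many of its stop words appear in the text.
        scores = {lang: len(words & stops) for lang, stops in self.STOP_WORDS.items()}
        # On a tie, max() returns the first language in insertion order.
        return max(scores, key=scores.get)
```

Because the base class is abstract, instantiating it directly raises TypeError; only subclasses that define detect can be created.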

class chunkipy.BaseTextChunker(chunk_size=None, size_estimator=None)

Bases: ABC

Base class for all chunker implementations.

Parameters:

chunk_size (int) – Maximum size allowed for a single chunk in the units defined by size_estimator.
size_estimator (BaseSizeEstimator) – Strategy used to measure text size. Defaults to WordSizeEstimator.

abstract chunk(text)

Chunk the provided text and return the result as a Chunks collection.

Return type: Chunks
Parameters: text (str)

class chunkipy.Chunk(overlap=<factory>, content=<factory>)

Bases: object

Single chunk returned by a text chunker.

A chunk is composed of two ordered collections:

overlap: text parts repeated from the previous chunk to preserve context
content: text parts that are unique to the current chunk

The text and size properties are computed over the combined text_parts view.

Parameters:

overlap (Overlap)
content (TextParts)

content: TextParts

overlap: Overlap

property size: int

Total size of all TextPart objects in text_parts, overlap included.

Returns: The total size of all TextPart objects.
Return type: int

property text: str

Full text of the chunk, built by joining the text value of every TextPart in text_parts.

Returns: The full text of the chunk, concatenated from all text parts.
Return type: str

property text_parts: TextParts

Return a combined ordered view of overlap and content text parts.
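
The relationship between overlap, content, text_parts, size, and text can be sketched with simplified stand-ins. This is not the chunkipy implementation: plain deque and list replace the Overlap and TextParts containers, and joining parts without a separator is an assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class TextPart:
    size: int
    text: str


@dataclass
class Chunk:
    """Stand-in for the documented Chunk, using simplified container types."""

    overlap: Deque[TextPart] = field(default_factory=deque)
    content: List[TextPart] = field(default_factory=list)

    @property
    def text_parts(self) -> List[TextPart]:
        # Overlap parts come first, followed by the parts unique to this chunk.
        return list(self.overlap) + list(self.content)

    @property
    def size(self) -> int:
        return sum(part.size for part in self.text_parts)

    @property
    def text(self) -> str:
        # Assumption: parts are joined without a separator.
        return "".join(part.text for part in self.text_parts)


chunk = Chunk(
    overlap=deque([TextPart(size=2, text="previous tail. ")]),
    content=[TextPart(size=2, text="New content. "), TextPart(size=1, text="More.")],
)
```

Here chunk.size counts the overlap as well as the content, so a chunk can report a larger size than its unique content alone.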

class chunkipy.Chunks(iterable=(), /)

Bases: list[Chunk]

List-like collection of Chunk objects returned by chunkers.

get_all_text()

Return the full text of every chunk.

Returns: A list of strings, where each string is the full text of a chunk.
Return type: List[str]

get_all_text_parts()

Return the text parts for every chunk.

Returns: A list of per-chunk TextParts collections.
Return type: List[TextParts]
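
Since Chunks is a list subclass, ordinary list operations apply alongside the helper methods. The sketch below uses local stand-ins rather than the chunkipy classes; FakeChunk is hypothetical and exposes only the text attribute that get_all_text reads.

```python
from typing import List


class FakeChunk:
    """Hypothetical stand-in exposing only a text attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


class Chunks(list):
    """Stand-in for the documented list subclass."""

    def get_all_text(self) -> List[str]:
        # One string per chunk, in chunk order.
        return [chunk.text for chunk in self]


chunks = Chunks([FakeChunk("first chunk"), FakeChunk("second chunk")])
all_text = chunks.get_all_text()
```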

class chunkipy.FastTextLanguageDetector(model_path, label_prefix='__label__')

Bases: BaseLanguageDetector

Detect language codes using a FastText language identification model.

The detector expects a path to a model compatible with the FastText Python bindings, such as Facebook’s lid.176.bin.

Parameters:

model_path (str)
label_prefix (str)

detect(text)

Return the top predicted FastText language code for text.

Return type: str
Parameters: text (str)
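
FastText language identification models emit labels such as __label__en, which is presumably why the detector takes a label_prefix. The helper below is a hypothetical sketch of how a prefix could be stripped to obtain a bare language code; it is not taken from chunkipy and does not load a model.

```python
def label_to_language_code(label: str, label_prefix: str = "__label__") -> str:
    """Strip label_prefix from a FastText-style label, if present."""
    if label_prefix and label.startswith(label_prefix):
        return label[len(label_prefix):]
    # Labels without the prefix are returned unchanged.
    return label
```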

class chunkipy.FixedSizeTextChunker(chunk_size=None, size_estimator=None, overlap_ratio=0.0)

Bases: BaseOverlapTextChunker

Chunk text into fixed-size slices using the configured size estimator.

Each segment emitted by size_estimator.segment is treated as a unit of size 1 during chunk assembly.

Parameters:

chunk_size (int)
size_estimator (BaseSizeEstimator)
overlap_ratio (float)

split_text(text)

Split the provided text into segments using the size estimator. Each segment is assigned a size of 1.

Parameters: text (str) – The text to be split.
Yields: TextPart – A piece of text together with its size.
Return type: Generator[TextPart, None, None]
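
The fixed-size strategy can be pictured as a sliding window over unit-size segments. The function below is a rough sketch, not chunkipy code: it assumes the overlap is int(chunk_size * overlap_ratio) trailing segments of the previous chunk, and the library may compute or round the overlap differently.

```python
from typing import List


def fixed_size_chunks(
    segments: List[str], chunk_size: int, overlap_ratio: float = 0.0
) -> List[List[str]]:
    """Group unit-size segments into chunks of at most chunk_size segments."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: the overlap length is the truncated product.
    overlap = int(chunk_size * overlap_ratio)
    if overlap >= chunk_size:
        raise ValueError("overlap_ratio must leave room for new content")
    step = chunk_size - overlap  # new segments contributed by each chunk
    chunks: List[List[str]] = []
    for start in range(0, len(segments), step):
        chunks.append(segments[start:start + chunk_size])
        if start + chunk_size >= len(segments):
            break  # the last window already reached the end of the input
    return chunks


words = "one two three four five six seven".split()
result = fixed_size_chunks(words, chunk_size=4, overlap_ratio=0.5)
```

With chunk_size=4 and overlap_ratio=0.5, each chunk after the first repeats the last two segments of its predecessor.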

class chunkipy.LangdetectLanguageDetector

Bases: BaseLanguageDetector

Detect language codes using the optional langdetect dependency.

detect(text)

Return the ISO-like language code detected by langdetect.

Return type: str
Parameters: text (str)

class chunkipy.Overlap

Bases: TextPartsMixin, deque[TextPart]

Deque-like collection used to carry overlap between consecutive chunks.

class chunkipy.RecursiveTextChunker(chunk_size=None, size_estimator=None, overlap_ratio=0.0, text_splitters=None)

Bases: BaseOverlapTextChunker

Chunk text by recursively applying increasingly fine-grained splitters.

The chunker tries each splitter in order until a text part fits within the configured chunk_size. Custom splitters are attempted before the default fallback splitters.

Parameters:

chunk_size (int)
size_estimator (BaseSizeEstimator)
overlap_ratio (float)
text_splitters (List[BaseTextSplitter])

split_text(text)

Split the provided text into parts that fit within chunk_size, as measured by the size estimator. The configured text splitters are applied recursively, one after another, until each part fits.

Parameters: text (str) – The text to be split.
Yields: TextPart – A piece of text together with its estimated size.
Return type: Generator[TextPart, None, None]
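
The recursive strategy can be sketched in a few lines. This is an illustration under stated assumptions, not the chunkipy implementation: splitters are modeled as plain functions, size is a whitespace word count, and the by_sentence and by_word splitters are hypothetical (they also discard the delimiters they split on, which a real splitter may preserve).

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Splitter = Callable[[str], List[str]]


def word_count(text: str) -> int:
    return len(text.split())


def split_recursively(
    text: str,
    chunk_size: int,
    splitters: Sequence[Splitter],
    size_of: Callable[[str], int] = word_count,
) -> Iterator[Tuple[int, str]]:
    """Yield (size, text) pairs, splitting further only when a piece is too big."""
    size = size_of(text)
    if size <= chunk_size or not splitters:
        # The piece fits, or no finer splitter is left: emit it as-is.
        yield size, text
        return
    first, rest = splitters[0], splitters[1:]
    for piece in first(text):
        if piece.strip():
            yield from split_recursively(piece, chunk_size, rest, size_of)


def by_sentence(text: str) -> List[str]:
    return text.split(". ")


def by_word(text: str) -> List[str]:
    return text.split()


parts = list(
    split_recursively(
        "Short one. This sentence is much longer than the limit",
        chunk_size=3,
        splitters=[by_sentence, by_word],
    )
)
```

The first sentence fits within the limit and is emitted whole, while the second is passed to the finer word splitter.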

class chunkipy.TextPart(size, text)

Bases: object

Represents a fragment of a complete text, along with its size as measured by the size estimator.

Parameters:

size (int) – The size of the text, in the units of the size estimator used.
text (str) – The text of the segment.

size: int

text: str

Modules

`language_detectors`	Public language detector classes exposed by `chunkipy.language_detectors`.
`size_estimators`	Public size estimator classes exposed by `chunkipy.size_estimators`.
`text_chunker`	Public chunker classes and data models exposed by `chunkipy.text_chunker`.
`text_splitters`	Public text splitter classes exposed by `chunkipy.text_splitters`.
`utils`	Shared utility helpers used across chunkipy internals and optional extras.