chunkipy.text_splitters.semantic.base_semantic_text_splitter
Classes
BaseSemanticTextSplitter – Base class for semantic text splitters.
- class chunkipy.text_splitters.semantic.base_semantic_text_splitter.BaseSemanticTextSplitter(text_limit=None)[source]
Bases:
BaseTextSplitter
Base class for semantic text splitters. This class extends BaseTextSplitter and provides a framework for splitting text into semantic parts.
The text_limit attribute helps control the input size for semantic models, which may fail on long texts. It limits the amount of text processed at once; it does not affect the splitting logic itself, only the size of each piece of text passed to the _split method. For example, if your text is 3500 characters and text_limit is set to 1000, the text will be split into 4 parts of at most 1000 characters each before being passed to the _split method.
- Parameters:
text_limit (int) – The maximum length of text to be processed at once. If None, defaults to a large value (DEFAULT_TEXT_LIMIT).
- Raises:
NotImplementedError – If the _split method is not implemented in a subclass.
- DEFAULT_TEXT_LIMIT = 1000000
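To make the text_limit behavior concrete, here is a minimal sketch of the pre-chunking described above. This is an illustrative reimplementation, not the library's actual code; the helper name prechunk is hypothetical.

```python
def prechunk(text, text_limit):
    """Break text into consecutive pieces of at most text_limit characters.

    Illustrative only: mirrors the documented behavior where a 3500-char
    text with text_limit=1000 yields 4 pieces before semantic splitting.
    """
    return [text[i:i + text_limit] for i in range(0, len(text), text_limit)]

pieces = prechunk("x" * 3500, 1000)
print([len(p) for p in pieces])  # [1000, 1000, 1000, 500]
```

Note that this pre-chunking is purely size-based; the semantic rules are only applied afterwards, to each piece individually.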
- split(text)[source]
Split the given text into text parts based on semantic rules. This method overrides the split method from BaseTextSplitter and delegates the actual splitting to _split. Large texts are first broken into smaller chunks based on the text_limit attribute, so the text is split into manageable parts while preserving semantic meaning.
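The subclass contract implied above (override _split; split handles text_limit pre-chunking and raises NotImplementedError otherwise) can be sketched as follows. This is a self-contained, hypothetical reimplementation for illustration; names mirror the documented API, and the SentenceSplitter subclass and its naive period-based rule are assumptions, not part of chunkipy.

```python
class BaseSemanticTextSplitter:
    """Sketch of the documented base class, not the library's actual code."""

    DEFAULT_TEXT_LIMIT = 1_000_000

    def __init__(self, text_limit=None):
        # If None, fall back to a large default, as the docs describe.
        self.text_limit = text_limit or self.DEFAULT_TEXT_LIMIT

    def split(self, text):
        # Pre-chunk by size, then apply the subclass's semantic rules
        # to each piece of at most text_limit characters.
        parts = []
        for i in range(0, len(text), self.text_limit):
            parts.extend(self._split(text[i:i + self.text_limit]))
        return parts

    def _split(self, text):
        raise NotImplementedError("Subclasses must implement _split")


class SentenceSplitter(BaseSemanticTextSplitter):
    """Hypothetical subclass with a naive sentence-based rule."""

    def _split(self, text):
        # Split on periods and restore them; a real semantic splitter
        # would use a model or tokenizer here.
        return [s.strip() + "." for s in text.split(".") if s.strip()]


print(SentenceSplitter(text_limit=100).split("First sentence. Second one."))
# ['First sentence.', 'Second one.']
```

Calling split on the base class (or any subclass that does not override _split) raises NotImplementedError, matching the Raises section above.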