StanzaSentenceTextSplitter
Description
StanzaSentenceTextSplitter detects the input language and tokenizes the text
into sentences via Stanza pipelines. It is useful when you need robust sentence
segmentation for multilingual texts and want to plug sentence-level splits into
RecursiveTextChunker.
Note
Install the optional dependency first:
pip install "chunkipy[stanza]"
API / Documentation
- class chunkipy.text_splitters.semantic.sentences.StanzaSentenceTextSplitter(text_limit=None, language_detector=None)[source]
- Bases: BaseSemanticTextSplitter

Sentence splitter that uses Stanza for semantic text splitting. It splits text into sentences based on the language detected in the input, and supports multiple languages by loading the appropriate Stanza model for the detected language.
- Parameters:
  text_limit (int)
  language_detector (BaseLanguageDetector | None)
- text_limit
  The maximum length of text to process at once. If None, DEFAULT_LIMIT from the base class is applied.
- Type:
  int
- langdetect_stanza_mapping = {'af': 'af', 'ar': 'ar', 'bg': 'bg', 'bn': None, 'ca': 'ca', 'cs': 'cs', 'cy': None, 'da': 'da', 'de': 'de', 'el': 'el', 'en': 'en', 'es': 'es', 'et': 'et', 'fa': 'fa', 'fi': 'fi', 'fr': 'fr', 'gu': None, 'he': 'he', 'hi': 'hi', 'hr': 'hr', 'hu': 'hu', 'id': 'id', 'it': 'it', 'ja': 'ja', 'kn': None, 'ko': 'ko', 'lt': 'lt', 'lv': 'lv', 'mk': None, 'ml': None, 'mr': 'mr', 'ne': None, 'nl': 'nl', 'no': 'no', 'pa': None, 'pl': 'pl', 'pt': 'pt', 'ro': 'ro', 'ru': 'ru', 'sk': 'sk', 'sl': 'sl', 'so': None, 'sq': None, 'sv': 'sv', 'sw': None, 'ta': 'ta', 'te': 'te', 'th': None, 'tl': None, 'tr': 'tr', 'uk': 'uk', 'ur': 'ur', 'vi': 'vi', 'zh-cn': 'zh-hans', 'zh-tw': 'zh-hant'}
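Since langdetect_stanza_mapping maps langdetect language codes to Stanza model codes, with None marking languages langdetect can identify but Stanza cannot segment, resolving a detected code comes down to a plain dictionary lookup. A minimal sketch using a subset of the mapping shown above (the helper function name is hypothetical, not part of the library):

```python
# Subset of StanzaSentenceTextSplitter.langdetect_stanza_mapping:
# langdetect codes -> Stanza model codes (None = no Stanza model available).
langdetect_stanza_mapping = {
    "en": "en",
    "zh-cn": "zh-hans",  # simplified Chinese uses a distinct Stanza code
    "zh-tw": "zh-hant",  # traditional Chinese likewise
    "bn": None,          # Bengali: detected by langdetect, not segmentable here
}


def resolve_stanza_code(detected):
    """Return the Stanza model code for a langdetect code, or None when
    the detected language has no Stanza mapping (hypothetical helper)."""
    return langdetect_stanza_mapping.get(detected)


print(resolve_stanza_code("zh-cn"))  # -> zh-hans
print(resolve_stanza_code("bn"))     # -> None
```

Note that most codes map to themselves; the Chinese variants are the main case where langdetect and Stanza disagree on naming.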
Example
This example is included in examples/chunkers/recursive/prebuilt_stanza_text_splitter.py.
from chunkipy import RecursiveTextChunker
from chunkipy.utils import MissingDependencyError


if __name__ == "__main__":

    with open("examples/texts/napoleon.txt", "r") as file:
        text = file.read()

    try:
        from chunkipy.text_splitters.semantic.sentences import StanzaSentenceTextSplitter

        stanza_text_splitter = StanzaSentenceTextSplitter()

        text_chunker = RecursiveTextChunker(
            chunk_size=200,
            overlap_ratio=0.25,
            text_splitters=[stanza_text_splitter],
        )

        chunks = text_chunker.chunk(text)
        print(f"Got {len(chunks)} chunks")
        print(f"Here are the text_parts: {chunks.get_all_text_parts()}")
    except MissingDependencyError as e:
        print(f"Error: {e}")
More examples are available under examples/chunkers/recursive/.