StanzaSentenceTextSplitter

Description

StanzaSentenceTextSplitter detects the language of the input text and tokenizes it into sentences using Stanza pipelines. It is useful when you need robust sentence segmentation for multilingual text and want to feed sentence-level splits into RecursiveTextChunker.

Note

Install the optional dependency first:

pip install "chunkipy[stanza]"

API / Documentation

class chunkipy.text_splitters.semantic.sentences.StanzaSentenceTextSplitter(text_limit=None, language_detector=None)[source]

Bases: BaseSemanticTextSplitter

Sentence splitter that uses Stanza for semantic text splitting. It detects the language of the input text and loads the corresponding Stanza model, so multiple languages are supported.

Parameters:
text_limit (int)

The maximum length of text to process at once. If None, the base class's DEFAULT_LIMIT is used.

language_detector

Optional language detector to use instead of the default one.
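As a rough illustration of what text_limit bounds (a hypothetical sketch, not the library's actual implementation), a splitter can hand over-long input to an NLP pipeline in fixed-size windows:

```python
# Hypothetical sketch: bounding how much text is processed in one call.
# This is NOT chunkipy's actual code, only an illustration of the idea.
def process_in_windows(text, text_limit):
    """Cut text into consecutive windows of at most text_limit characters."""
    return [text[i:i + text_limit] for i in range(0, len(text), text_limit)]

windows = process_in_windows("a" * 25, text_limit=10)
print([len(w) for w in windows])  # [10, 10, 5]
```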

langdetect_stanza_mapping = {'af': 'af', 'ar': 'ar', 'bg': 'bg', 'bn': None, 'ca': 'ca', 'cs': 'cs', 'cy': None, 'da': 'da', 'de': 'de', 'el': 'el', 'en': 'en', 'es': 'es', 'et': 'et', 'fa': 'fa', 'fi': 'fi', 'fr': 'fr', 'gu': None, 'he': 'he', 'hi': 'hi', 'hr': 'hr', 'hu': 'hu', 'id': 'id', 'it': 'it', 'ja': 'ja', 'kn': None, 'ko': 'ko', 'lt': 'lt', 'lv': 'lv', 'mk': None, 'ml': None, 'mr': 'mr', 'ne': None, 'nl': 'nl', 'no': 'no', 'pa': None, 'pl': 'pl', 'pt': 'pt', 'ro': 'ro', 'ru': 'ru', 'sk': 'sk', 'sl': 'sl', 'so': None, 'sq': None, 'sv': 'sv', 'sw': None, 'ta': 'ta', 'te': 'te', 'th': None, 'tl': None, 'tr': 'tr', 'uk': 'uk', 'ur': 'ur', 'vi': 'vi', 'zh-cn': 'zh-hans', 'zh-tw': 'zh-hant'}
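The mapping above translates langdetect language codes into Stanza model names; entries whose value is None are languages langdetect can identify but for which no Stanza model is wired up. A minimal sketch of how such a lookup behaves (using a copied subset of the mapping so the snippet runs standalone):

```python
# Subset of langdetect_stanza_mapping, copied here so the snippet is standalone.
langdetect_stanza_mapping = {
    "en": "en",
    "zh-cn": "zh-hans",
    "zh-tw": "zh-hant",
    "bn": None,
}

def stanza_model_for(detected_code):
    # Returns the Stanza model name for a langdetect code,
    # or None when no Stanza model is available for that language.
    return langdetect_stanza_mapping.get(detected_code)

print(stanza_model_for("zh-cn"))  # zh-hans
print(stanza_model_for("bn"))    # None
```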

Example

This example is included in examples/chunkers/recursive/prebuilt_stanza_text_splitter.py.

from chunkipy import RecursiveTextChunker
from chunkipy.utils import MissingDependencyError


if __name__ == "__main__":

    with open("examples/texts/napoleon.txt", "r") as file:
        text = file.read()

    try:
        from chunkipy.text_splitters.semantic.sentences import StanzaSentenceTextSplitter

        stanza_text_splitter = StanzaSentenceTextSplitter()

        text_chunker = RecursiveTextChunker(
            chunk_size=200,
            overlap_ratio=0.25,
            text_splitters=[stanza_text_splitter]
        )

        chunks = text_chunker.chunk(text)
        print(f"Got {len(chunks)} chunks")
        print(f"Text parts: {chunks.get_all_text_parts()}")
    except MissingDependencyError as e:
        print(f"Error: {e}")

More examples are available under examples/chunkers/recursive/.