StanzaSentenceTextSplitter
Description
StanzaSentenceTextSplitter detects the input language and tokenizes the text
into sentences via Stanza pipelines. It is useful when you need robust sentence
segmentation for multilingual texts and want to plug sentence-level splits into
RecursiveTextChunker.
Note
Install the optional dependency first:
pip install "chunkipy[stanza]"
API / Documentation
- class chunkipy.text_splitters.semantic.sentences.StanzaSentenceTextSplitter(text_limit=None, language_detector=None)[source]
- Bases: BaseSemanticTextSplitter

Sentence splitter that uses Stanza for semantic text splitting. It splits text into sentences based on the language detected in the input, and supports multiple languages by loading the appropriate Stanza model for the detected language.
- Parameters:
  text_limit (int)
  language_detector (BaseLanguageDetector | None)
- text_limit
  The maximum length of text to process at once. If None, DEFAULT_LIMIT from the base class is applied.
- Type:
  int
- langdetect_stanza_mapping = {'af': 'af', 'ar': 'ar', 'bg': 'bg', 'bn': None, 'ca': 'ca', 'cs': 'cs', 'cy': None, 'da': 'da', 'de': 'de', 'el': 'el', 'en': 'en', 'es': 'es', 'et': 'et', 'fa': 'fa', 'fi': 'fi', 'fr': 'fr', 'gu': None, 'he': 'he', 'hi': 'hi', 'hr': 'hr', 'hu': 'hu', 'id': 'id', 'it': 'it', 'ja': 'ja', 'kn': None, 'ko': 'ko', 'lt': 'lt', 'lv': 'lv', 'mk': None, 'ml': None, 'mr': 'mr', 'ne': None, 'nl': 'nl', 'no': 'no', 'pa': None, 'pl': 'pl', 'pt': 'pt', 'ro': 'ro', 'ru': 'ru', 'sk': 'sk', 'sl': 'sl', 'so': None, 'sq': None, 'sv': 'sv', 'sw': None, 'ta': 'ta', 'te': 'te', 'th': None, 'tl': None, 'tr': 'tr', 'uk': 'uk', 'ur': 'ur', 'vi': 'vi', 'zh-cn': 'zh-hans', 'zh-tw': 'zh-hant'}
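Since langdetect_stanza_mapping maps langdetect language codes to Stanza model codes, with None marking languages langdetect can identify but Stanza cannot segment, resolving a detected code comes down to a plain dictionary lookup. A minimal sketch using a subset of the mapping shown above (the helper function name is hypothetical, not part of the library):

```python
# Subset of StanzaSentenceTextSplitter.langdetect_stanza_mapping:
# langdetect codes -> Stanza model codes (None = no Stanza model available).
langdetect_stanza_mapping = {
    "en": "en",
    "zh-cn": "zh-hans",  # simplified Chinese uses a distinct Stanza code
    "zh-tw": "zh-hant",  # traditional Chinese likewise
    "bn": None,          # Bengali: detected by langdetect, not segmentable here
}


def resolve_stanza_code(detected):
    """Return the Stanza model code for a langdetect code, or None when
    the detected language has no Stanza mapping (hypothetical helper)."""
    return langdetect_stanza_mapping.get(detected)


print(resolve_stanza_code("zh-cn"))  # -> zh-hans
print(resolve_stanza_code("bn"))     # -> None
```

Note that most codes map to themselves; the Chinese variants are the main case where langdetect and Stanza disagree on naming.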
Example
This example is included in examples/chunkers/recursive/prebuilt_stanza_text_splitter.py.
from chunkipy import RecursiveTextChunker
from chunkipy.utils import MissingDependencyError


if __name__ == "__main__":

    with open("examples/texts/napoleon.txt", "r") as file:
        text = file.read()

    try:
        from chunkipy.text_splitters.semantic.sentences import StanzaSentenceTextSplitter

        stanza_text_splitter = StanzaSentenceTextSplitter()

        text_chunker = RecursiveTextChunker(
            chunk_size=200,
            overlap_ratio=0.25,
            text_splitters=[stanza_text_splitter],
        )

        chunks = text_chunker.chunk(text)
        print(f"Got {len(chunks)} chunks")
        print(f"Here are the text_parts: {chunks.get_all_text_parts()}")
    except MissingDependencyError as e:
        print(f"Error: {e}")
More examples are available under examples/chunkers/recursive/.