WordSizeEstimator

Description

WordSizeEstimator measures size by counting words and segments text at word boundaries. It is the default estimator for chunkers and is a practical choice for most NLP pipelines where word units are more meaningful than raw characters. It requires no optional dependency.

API / Documentation

class chunkipy.size_estimators.WordSizeEstimator[source]

Bases: BaseSizeEstimator

Size estimator that counts the number of words in the text.

estimate_size(text)[source]

Estimate the size of the given text by counting the number of words.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in words.

Return type:

int

segment(text)[source]

Generate words from the given text using a regular expression.

Parameters:

text (str) – The text to analyze.

Yields:

str – A segment representing one word of the text.

Return type:

Generator[str, None, None]
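To illustrate the relationship between segment() and estimate_size(), here is a minimal stand-in sketch: it assumes word boundaries are found with a simple \w+ pattern (the regex chunkipy actually uses is not shown in this documentation and may differ), and it shows that the estimated size is simply the number of segments yielded.

```python
import re

# Hypothetical stand-in pattern; chunkipy's real word regex may differ.
WORD_RE = re.compile(r"\w+")


def segment(text):
    """Yield word tokens from the text, one match at a time."""
    for match in WORD_RE.finditer(text):
        yield match.group(0)


def estimate_size(text):
    """Estimated size in words == number of yielded segments."""
    return sum(1 for _ in segment(text))


text = "Chunkipy estimates by words in this simple sentence."
print(list(segment(text)))
print(estimate_size(text))  # 8 words
```

Because estimate_size() is defined in terms of segment(), the two methods always agree: the size of a text equals the length of the list of segments it produces.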

Example

This example is included in examples/size_estimators/word_size_estimator.py.

from chunkipy.size_estimators import WordSizeEstimator


if __name__ == "__main__":
    text = "Chunkipy estimates by words in this simple sentence."
    estimator = WordSizeEstimator()

    print(f"Estimated size: {estimator.estimate_size(text)}")
    print("Segments:", list(estimator.segment(text)))