WordSizeEstimator

Description

WordSizeEstimator measures size by counting words and segments text at word boundaries. It is the default estimator for chunkers and is a practical choice for most NLP pipelines where word units are more meaningful than raw characters. It requires no optional dependency.

API / Documentation

class chunkipy.size_estimators.WordSizeEstimator[source]

Bases: BaseSizeEstimator

Size estimator that counts the number of words in the text.

estimate_size(text)[source]

Estimate the size of the given text by counting the number of words.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in words.

Return type:

int

segment(text)[source]

Generate words from the given text using a regular expression.

Parameters:

text (str) – The text to analyze.

Yields:

str – A segment representing one word of the text.

Return type:

Generator[str, None, None]
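To illustrate the relationship between segment() and estimate_size(), here is a minimal stand-in sketch: it assumes word boundaries are found with a simple \w+ pattern (the regex chunkipy actually uses is not shown in this documentation and may differ), and it shows that the estimated size is simply the number of segments yielded.

```python
import re

# Hypothetical stand-in pattern; chunkipy's real word regex may differ.
WORD_RE = re.compile(r"\w+")


def segment(text):
    """Yield word tokens from the text, one match at a time."""
    for match in WORD_RE.finditer(text):
        yield match.group(0)


def estimate_size(text):
    """Estimated size in words == number of yielded segments."""
    return sum(1 for _ in segment(text))


text = "Chunkipy estimates by words in this simple sentence."
print(list(segment(text)))
print(estimate_size(text))  # 8 words
```

Because estimate_size() is defined in terms of segment(), the two methods always agree: the size of a text equals the length of the list of segments it produces.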

Example

This example is included in examples/size_estimators/word_size_estimator.py.

from chunkipy.size_estimators import WordSizeEstimator


if __name__ == "__main__":
    text = "Chunkipy estimates by words in this simple sentence."
    estimator = WordSizeEstimator()

    print(f"Estimated size: {estimator.estimate_size(text)}")
    print("Segments:", list(estimator.segment(text)))