CharSizeEstimator

Description

CharSizeEstimator measures text size using character count. It is the simplest estimator when you want deterministic sizing independent of language tokenization rules. It also provides character-level segmentation that works without any external dependency.

API / Documentation

class chunkipy.size_estimators.CharSizeEstimator

Bases: BaseSizeEstimator

Size estimator that counts the number of characters in the text.

estimate_size(text)

Estimate the size of the given text by counting the number of characters.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in characters.

Return type:

int
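"Characters" can mean different things in practice. The sketch below does not use chunkipy; it defines a hypothetical stand-in, `estimate_size_sketch`, on the assumption that the estimator counts characters the way Python's built-in `len` does on a `str`, that is, as Unicode code points rather than bytes. Check the library source if the distinction matters for your size limits.

```python
def estimate_size_sketch(text: str) -> int:
    """Hypothetical stand-in (not chunkipy code): count Unicode code points."""
    return len(text)


ascii_text = "Chunkipy"
accented_text = "caf\u00e9"  # "café" with a precomposed é

print(estimate_size_sketch(ascii_text))     # 8
print(estimate_size_sketch(accented_text))  # 4 code points
print(len(accented_text.encode("utf-8")))   # 5 bytes in UTF-8
```

Under this assumption, a character-based size limit is not a byte limit: non-ASCII text can occupy more bytes than its estimated size suggests.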

segment(text)

Segment the given text into individual characters.

Parameters:

text (str) – The text to analyze.

Yields:

str – A segment, representing a char of the text.

Return type:

Generator[str, None, None]
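Because the return type is a generator, segments are produced lazily and can be consumed a few at a time. The following sketch illustrates that pattern with a hypothetical stand-in, `segment_sketch`, which is an assumption about the behavior described above and not the library's implementation.

```python
from itertools import islice
from typing import Generator


def segment_sketch(text: str) -> Generator[str, None, None]:
    """Hypothetical stand-in (not chunkipy code): yield one character at a time."""
    for char in text:
        yield char


text = "Chunkipy estimates by characters."

# Take only the first ten segments without materializing the whole sequence.
first_ten = list(islice(segment_sketch(text), 10))
print(first_ten)

# In this sketch, joining all segments reproduces the input exactly.
print("".join(segment_sketch(text)) == text)
```

Using `islice` avoids building a full list when only a prefix is needed, which matters more for long inputs than for this short string.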

Example

This example is included in examples/size_estimators/char_size_estimator.py.

from chunkipy.size_estimators import CharSizeEstimator


if __name__ == "__main__":
    text = "Chunkipy estimates by characters."
    estimator = CharSizeEstimator()

    print(f"Estimated size: {estimator.estimate_size(text)}")
    print("First 10 segments:", list(estimator.segment(text))[:10])