chunkipy.size_estimators

Public size estimator classes exposed by chunkipy.size_estimators.

class chunkipy.size_estimators.BaseSizeEstimator[source]

Bases: ABC

Base class for strategies that measure the size of a text and split it into segments.

abstract estimate_size(text)[source]

Estimate the size of the given text.

Parameters:

text (str) – The text to estimate the size of.

Returns:

Estimated size in units defined by the concrete estimator.

Return type:

int

segment(text)[source]

Segment the text into smaller parts so that each part can be measured individually during size estimation.

Parameters:

text (str) – The text to be divided into smaller parts.

Yields:

str – A segment of the text for estimation.

Raises:

NotImplementedError – If a subclass does not implement this method.

Return type:

Generator[str, None, None]
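The interface above can be illustrated with a small stand-in. The sketch below does not import chunkipy: SizeEstimatorSketch is a hypothetical ABC that mirrors only the two documented methods, and LineSizeEstimator is an invented subclass that measures size in lines. Nothing beyond the documented method names, return types, and the NotImplementedError is assumed about the real base class.

```python
from abc import ABC, abstractmethod
from typing import Generator


class SizeEstimatorSketch(ABC):
    """Hypothetical stand-in mirroring the documented interface (not chunkipy code)."""

    @abstractmethod
    def estimate_size(self, text: str) -> int:
        """Return the size of `text` in estimator-specific units."""

    def segment(self, text: str) -> Generator[str, None, None]:
        """Yield segments of `text`; subclasses are expected to override this."""
        raise NotImplementedError("segment() must be implemented by a subclass")


class LineSizeEstimator(SizeEstimatorSketch):
    """Invented example: one unit of size per line."""

    def estimate_size(self, text: str) -> int:
        # Count segments so that estimate_size and segment stay in agreement.
        return sum(1 for _ in self.segment(text))

    def segment(self, text: str) -> Generator[str, None, None]:
        # splitlines() drops line terminators and yields nothing for "".
        yield from text.splitlines()


estimator = LineSizeEstimator()
sample = "first line\nsecond line\nthird line"
line_count = estimator.estimate_size(sample)
lines = list(estimator.segment(sample))
```

Deriving estimate_size from segment is one possible design choice; it keeps the two methods consistent at the cost of iterating over the text.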

class chunkipy.size_estimators.CharSizeEstimator[source]

Bases: BaseSizeEstimator

Size estimator that counts the number of characters in the text.

estimate_size(text)[source]

Estimate the size of the given text by counting the number of characters.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in characters.

Return type:

int

segment(text)[source]

Segment the given text into individual characters.

Parameters:

text (str) – The text to analyze.

Yields:

str – A segment representing a single character of the text.

Return type:

Generator[str, None, None]
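As a rough illustration of character-based sizing, the standalone sketch below counts and yields characters using plain Python string operations. The function names are hypothetical, and treating a "character" as a Unicode code point is an assumption; the library's own implementation may differ.

```python
from typing import Generator


def estimate_size_in_chars(text: str) -> int:
    # len() on a str counts Unicode code points, not bytes.
    return len(text)


def segment_chars(text: str) -> Generator[str, None, None]:
    # Iterating over a str yields one code point at a time.
    yield from text


sample = "na\u00efve"  # "naive" with a precomposed i-diaeresis
char_count = estimate_size_in_chars(sample)
byte_count = len(sample.encode("utf-8"))
chars = list(segment_chars(sample))
```

Here char_count and byte_count differ because the accented letter occupies two bytes in UTF-8, which is worth keeping in mind when a size limit is expressed in bytes rather than characters.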

class chunkipy.size_estimators.OpenAISizeEstimator(encoding='cl100k_base')[source]

Bases: BaseSizeEstimator

Estimate size using a tiktoken encoding compatible with OpenAI models.

Parameters:

encoding (str) – Name of the tiktoken encoding to use. Defaults to 'cl100k_base'.

estimate_size(text)[source]

Estimate the size of the given text using OpenAI’s tokenization.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in tokens.

Return type:

int

segment(text)[source]

Generate token segments from the given text using OpenAI’s tokenization.

Parameters:

text (str) – The text to segment.

Yields:

str – A single token as segmented by the tokenizer.

Return type:

Generator[str, None, None]
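Token-based sizing can be sketched with the tiktoken package directly. The helpers below are hypothetical and are not taken from chunkipy; they only rely on tiktoken.get_encoding, Encoding.encode, and Encoding.decode. Decoding each token id on its own is an assumption about how string segments might be produced.

```python
from typing import Generator, Protocol, Sequence


class TokenEncoding(Protocol):
    """Minimal shape of an encoding object, as provided by tiktoken.Encoding."""

    def encode(self, text: str) -> list: ...

    def decode(self, tokens: Sequence[int]) -> str: ...


def load_encoding(name: str = "cl100k_base") -> TokenEncoding:
    # Imported lazily so the helpers below can be used without tiktoken installed.
    import tiktoken

    return tiktoken.get_encoding(name)


def estimate_size_in_tokens(text: str, enc: TokenEncoding) -> int:
    # The size is the number of token ids the encoding produces.
    return len(enc.encode(text))


def segment_tokens(text: str, enc: TokenEncoding) -> Generator[str, None, None]:
    for token_id in enc.encode(text):
        # Decoding one id at a time can produce a replacement character
        # when a multi-byte character is split across several tokens.
        yield enc.decode([token_id])
```

With tiktoken installed, calling load_encoding() and passing the result to estimate_size_in_tokens would give the token count under that encoding. Passing the encoding in explicitly also makes the helpers easy to exercise with a stub.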

class chunkipy.size_estimators.WordSizeEstimator[source]

Bases: BaseSizeEstimator

Size estimator that counts the number of words in the text.

estimate_size(text)[source]

Estimate the size of the given text by counting the number of words.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in words.

Return type:

int

segment(text)[source]

Generate words from the given text using a regular expression.

Parameters:

text (str) – The text to analyze.

Yields:

str – A segment representing a single word of the text.

Return type:

Generator[str, None, None]
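Regex-based word segmentation can be sketched as follows. The pattern used by the library is not shown in this reference, so the `\S+` pattern (a word is a maximal run of non-whitespace characters) and the function names below are assumptions made for illustration only.

```python
import re
from typing import Generator

# Assumed pattern: a word is a maximal run of non-whitespace characters.
WORD_PATTERN = re.compile(r"\S+")


def segment_words(text: str) -> Generator[str, None, None]:
    # finditer scans lazily, so long texts are not split into a list up front.
    for match in WORD_PATTERN.finditer(text):
        yield match.group()


def estimate_size_in_words(text: str) -> int:
    # Count the segments without materializing them.
    return sum(1 for _ in segment_words(text))


sample = "Hello, world!  Chunk me."
words = list(segment_words(sample))
word_count = estimate_size_in_words(sample)
```

Under this assumed pattern, punctuation stays attached to the neighboring word and repeated whitespace does not produce empty segments; a different pattern such as `\w+` would give different counts.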

Modules

base_size_estimator

char_size_estimator

openai_size_estimator

word_size_estimator