OpenAI Size Estimator

Description

OpenAISizeEstimator measures text size with tiktoken encodings, making chunk boundaries closer to LLM token budgets than plain words or characters. Use it when your downstream model has token limits and you need a more realistic size metric for prompt construction.
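To see why plain word or character counts can misstate an LLM budget, here is a quick pure-Python comparison (no tiktoken required; the closing comment describes typical subword-tokenizer behavior, not an exact count):

```python
text = "Token-aware estimation with tiktoken."

# Naive size metrics: raw characters and whitespace-delimited words.
char_count = len(text)          # counts every character, including punctuation
word_count = len(text.split())  # counts whitespace-separated words

print(f"chars: {char_count}, words: {word_count}")
# Subword tokenizers such as cl100k_base typically split hyphenated or rare
# words into several tokens, so the token count usually falls somewhere
# between the word count and the character count.
```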

Note

Install the optional dependency first:

pip install "chunkipy[tiktoken]"

API / Documentation

class chunkipy.size_estimators.OpenAISizeEstimator(encoding='cl100k_base')[source]

Bases: BaseSizeEstimator

Estimate size using a tiktoken encoding compatible with OpenAI models.

Parameters:

encoding (str)

estimate_size(text)[source]

Estimate the size of the given text using OpenAI’s tokenization.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in tokens.

Return type:

int

segment(text)[source]

Generate token segments from the given text using OpenAI’s tokenization.

Parameters:

text (str) – The text to segment.

Yields:

str – A single token as segmented by the tokenizer.

Return type:

Generator[str, None, None]
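The two methods above define the size-estimator contract: an integer size and a generator of segments. A minimal pure-Python stand-in (a hypothetical whitespace-based estimator, not part of chunkipy) shows the same shape without the tiktoken dependency:

```python
from typing import Generator


class WordSizeEstimator:
    """Hypothetical estimator mirroring the estimate_size/segment interface."""

    def estimate_size(self, text: str) -> int:
        # Size as the number of whitespace-separated words.
        return len(text.split())

    def segment(self, text: str) -> Generator[str, None, None]:
        # Yield one segment per word, analogous to yielding one token.
        yield from text.split()


estimator = WordSizeEstimator()
print(estimator.estimate_size("Token-aware estimation with tiktoken."))  # 4
print(list(estimator.segment("a b c")))  # ['a', 'b', 'c']
```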

Example

This example is included in examples/size_estimators/openai_size_estimator.py.

from chunkipy.size_estimators import OpenAISizeEstimator
from chunkipy.utils import MissingDependencyError


if __name__ == "__main__":
    text = "Token-aware estimation with tiktoken."

    try:
        estimator = OpenAISizeEstimator()
        print(f"Estimated token size: {estimator.estimate_size(text)}")
    except MissingDependencyError as error:
        print(error)