OpenAI Size Estimator

Description

OpenAISizeEstimator measures text size with tiktoken encodings, making chunk boundaries closer to LLM token budgets than plain words or characters. Use it when your downstream model has token limits and you need a more realistic size metric for prompt construction.
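To see why plain word or character counts can misstate an LLM budget, here is a quick pure-Python comparison (no tiktoken required; the closing comment describes typical subword-tokenizer behavior, not an exact count):

```python
text = "Token-aware estimation with tiktoken."

# Naive size metrics: raw characters and whitespace-delimited words.
char_count = len(text)          # counts every character, including punctuation
word_count = len(text.split())  # counts whitespace-separated words

print(f"chars: {char_count}, words: {word_count}")
# Subword tokenizers such as cl100k_base typically split hyphenated or rare
# words into several tokens, so the token count usually falls somewhere
# between the word count and the character count.
```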

Note

Install the optional dependency first:

pip install "chunkipy[tiktoken]"

API / Documentation

class chunkipy.size_estimators.OpenAISizeEstimator(encoding='cl100k_base')[source]

Bases: BaseSizeEstimator

Estimate size using a tiktoken encoding compatible with OpenAI models.

Parameters:

encoding (str)

estimate_size(text)[source]

Estimate the size of the given text using OpenAI’s tokenization.

Parameters:

text (str) – The text to estimate the size of.

Returns:

The estimated size of the text in tokens.

Return type:

int

segment(text)[source]

Generate token segments from the given text using OpenAI’s tokenization.

Parameters:

text (str) – The text to segment.

Yields:

str – A single token as segmented by the tokenizer.

Return type:

Generator[str, None, None]
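The two methods above define the size-estimator contract: an integer size and a generator of segments. A minimal pure-Python stand-in (a hypothetical whitespace-based estimator, not part of chunkipy) shows the same shape without the tiktoken dependency:

```python
from typing import Generator


class WordSizeEstimator:
    """Hypothetical estimator mirroring the estimate_size/segment interface."""

    def estimate_size(self, text: str) -> int:
        # Size as the number of whitespace-separated words.
        return len(text.split())

    def segment(self, text: str) -> Generator[str, None, None]:
        # Yield one segment per word, analogous to yielding one token.
        yield from text.split()


estimator = WordSizeEstimator()
print(estimator.estimate_size("Token-aware estimation with tiktoken."))  # 4
print(list(estimator.segment("a b c")))  # ['a', 'b', 'c']
```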

Example

This example is included in examples/size_estimators/openai_size_estimator.py.

from chunkipy.size_estimators import OpenAISizeEstimator
from chunkipy.utils import MissingDependencyError


if __name__ == "__main__":
    text = "Token-aware estimation with tiktoken."

    try:
        estimator = OpenAISizeEstimator()
        print(f"Estimated token size: {estimator.estimate_size(text)}")
    except MissingDependencyError as error:
        print(error)