Tokenizer

interface Tokenizer

A public interface for tokenizing and de-tokenizing text. The primary operations are encode, which converts text to a sequence of integers (tokens), and decode, which converts a sequence of integers back to text.

The companion object provides methods to obtain a Tokenizer instance for a given encoding, selected either by encoding name or by model name.

Types

object Companion

Functions

abstract fun decode(token: Int): String

Decodes a single token into the text it represents.

abstract fun decode(tokens: List<Int>): String

Decodes the given sequence of integers (tokens) back into text based on the underlying encoding scheme.

abstract fun encode(text: String, allowedSpecial: Set<String> = emptySet(), disallowedSpecial: Set<String> = setOf("all")): List<Int>

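Encodes a string into a list of tokens. The allowedSpecial and disallowedSpecial parameters control how special tokens that appear in the text are handled.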

abstract fun encodeSingleToken(text: String): Int

Encodes text that corresponds to a single token and returns that token's value.
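
The sketch below illustrates how the four functions relate to one another. It is not the library's implementation: the interface is re-declared locally from the signatures listed on this page so the example compiles on its own, and CharTokenizer is a hypothetical toy class that treats each character as one token (its UTF-16 code unit) and ignores both special-token parameters. A real encoding would map multi-character pieces to tokens and would normally be obtained through the companion object.

```kotlin
// Local copy of the documented signatures, so this sketch is self-contained.
interface Tokenizer {
    fun decode(token: Int): String
    fun decode(tokens: List<Int>): String
    fun encode(
        text: String,
        allowedSpecial: Set<String> = emptySet(),
        disallowedSpecial: Set<String> = setOf("all")
    ): List<Int>
    fun encodeSingleToken(text: String): Int
}

// Hypothetical toy implementation (an assumption for illustration only):
// every character is one token, and the token value is its UTF-16 code unit.
class CharTokenizer : Tokenizer {
    override fun decode(token: Int): String = token.toChar().toString()

    override fun decode(tokens: List<Int>): String =
        tokens.joinToString(separator = "") { decode(it) }

    // The special-token parameters are accepted but ignored in this toy version.
    override fun encode(
        text: String,
        allowedSpecial: Set<String>,
        disallowedSpecial: Set<String>
    ): List<Int> = text.map { it.code }

    override fun encodeSingleToken(text: String): Int {
        require(text.length == 1) { "Text must correspond to exactly one token" }
        return text[0].code
    }
}

fun main() {
    val tokenizer: Tokenizer = CharTokenizer()

    val tokens = tokenizer.encode("hi!")
    println(tokens)                            // [104, 105, 33]
    println(tokenizer.decode(tokens))          // hi!
    println(tokenizer.decode(104))             // h
    println(tokenizer.encodeSingleToken("h"))  // 104
}
```

The point of the example is the round trip: decode(encode(text)) returns the original text, and encodeSingleToken is the inverse of the single-token decode overload for text that maps to exactly one token.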