`langml.tokenizer`

LangML Tokenizer

WPTokenizer: WordPiece Tokenizer
SPTokenizer: SentencePiece Tokenizer

Wraps:

tokenizers.BertWordPieceTokenizer
sentencepiece.SentencePieceProcessor

This module does not expose every function of the underlying tokenizers. For full functionality, use the raw tokenizer directly (see Tokenizer.raw_tokenizer).

Module Contents

Classes

`Encoding`	Product of tokenizer encoding
`SpecialTokens`
`Tokenizer`	Base Tokenizer
`SPTokenizer`	SentencePiece Tokenizer
`WPTokenizer`	WordPiece Tokenizer

class langml.tokenizer.Encoding(ids: Union[numpy.ndarray, List[int]], segment_ids: Union[numpy.ndarray, List[int]], tokens: List[str])[source]

Product of tokenizer encoding

ids[source]

segment_ids[source]

tokens[source]

class langml.tokenizer.SpecialTokens[source]

PAD = [PAD][source]

UNK = [UNK][source]

MASK = [MASK][source]

CLS = [CLS][source]

SEP = [SEP][source]

__contains__(self, token: str) → bool[source]

Check whether the input token is one of the special tokens.

Parameters

token (-) – str

Returns

bool

tokens(self) → List[str][source]
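The membership check and token listing above can be illustrated with a small stand-in class. This is a minimal sketch, not langml's implementation: the class name `SpecialTokensSketch` is hypothetical, and it assumes `tokens` is a plain method returning the five markers listed above.

```python
from typing import List


class SpecialTokensSketch:
    """Illustrative stand-in for SpecialTokens (hypothetical, not langml's code)."""

    PAD = "[PAD]"
    UNK = "[UNK]"
    MASK = "[MASK]"
    CLS = "[CLS]"
    SEP = "[SEP]"

    def tokens(self) -> List[str]:
        # Collect the five marker strings defined on the class.
        return [self.PAD, self.UNK, self.MASK, self.CLS, self.SEP]

    def __contains__(self, token: str) -> bool:
        # Enables the `token in special_tokens` membership test.
        return token in self.tokens()


special_tokens = SpecialTokensSketch()
print("[CLS]" in special_tokens)  # True
print("hello" in special_tokens)  # False
```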

class langml.tokenizer.Tokenizer(vocab_path: str, lowercase: bool = False)[source]

Base Tokenizer

enable_truncation(self, max_length: int, strategy: str = 'post')[source]

Parameters

max_length (-) – int, maximum length after truncation
strategy (-) – str, optional, truncation strategy, options: post or pre, default post
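The two strategies can be pictured on a plain token list. The sketch below is illustrative only: the helper `truncate` is hypothetical, and it assumes that post drops tokens from the end and pre drops them from the start, which this page does not state explicitly.

```python
from typing import List


def truncate(tokens: List[str], max_length: int, strategy: str = "post") -> List[str]:
    """Cut `tokens` down to at most `max_length` items.

    Assumption: 'post' drops items from the end, 'pre' drops them from the start.
    """
    if strategy not in ("post", "pre"):
        raise ValueError("strategy must be 'post' or 'pre'")
    if max_length <= 0:
        return []
    if len(tokens) <= max_length:
        return list(tokens)
    # Keep the head for 'post', the tail for 'pre'.
    return tokens[:max_length] if strategy == "post" else tokens[-max_length:]


print(truncate(["a", "b", "c", "d", "e"], 3, "post"))  # ['a', 'b', 'c']
print(truncate(["a", "b", "c", "d", "e"], 3, "pre"))   # ['c', 'd', 'e']
```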

tokens_mapping(self, sequence: str, tokens: List[str]) → List[Tuple[int, int]][source]

Map each token to its character span in the original sequence. Tokens may contain special marks, e.g., ##, ▁, and [UNK]. Use this function to recover the raw text that corresponds to a token or a range of tokens.

Parameters

sequence (-) – str, the input sequence
tokens (-) – List[str], tokens of the input sequence

Returns

List[Tuple[int, int]]

Examples:

    >>> sequence = 'I like watermelons'
    >>> tokens = ['[CLS]', '▁i', '▁like', '▁water', 'mel', 'ons', '[SEP]']
    >>> mapping = tokenizer.tokens_mapping(sequence, tokens)
    >>> start_index, end_index = 3, 5
    >>> print("current token", tokens[start_index: end_index + 1])
    current token ['▁water', 'mel', 'ons']
    >>> print("raw token", sequence[mapping[start_index][0]: mapping[end_index][1]])
    raw token watermelons

Reference: https://github.com/bojone/bert4keras
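To make the idea of a token-to-span mapping concrete, here is a simplified, self-contained sketch. It is not langml's or bert4keras's implementation: the function names are hypothetical, special tokens are assumed to map to an empty span, and matching is a plain left-to-right search on the lowercased text.

```python
from typing import List, Tuple

SPECIAL = {"[PAD]", "[UNK]", "[MASK]", "[CLS]", "[SEP]"}


def strip_marker(token: str) -> str:
    """Remove a leading WordPiece ('##') or SentencePiece ('▁') marker."""
    if token.startswith("##"):
        return token[2:]
    if token.startswith("▁"):
        return token[1:]
    return token


def tokens_mapping_sketch(sequence: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return one (start, end) character span per token, end exclusive."""
    text = sequence.lower()
    mapping = []
    offset = 0
    for token in tokens:
        if token in SPECIAL:
            # Assumption: special tokens cover no characters of the input.
            mapping.append((offset, offset))
            continue
        piece = strip_marker(token).lower()
        start = text.find(piece, offset)
        if start < 0:
            # Piece not found: fall back to an empty span at the cursor.
            mapping.append((offset, offset))
            continue
        end = start + len(piece)
        mapping.append((start, end))
        offset = end
    return mapping


sequence = "I like watermelons"
tokens = ["[CLS]", "▁i", "▁like", "▁water", "mel", "ons", "[SEP]"]
mapping = tokens_mapping_sketch(sequence, tokens)
print(sequence[mapping[3][0]: mapping[5][1]])  # watermelons
```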

encode(self, sequence: str, pair: Optional[str] = None, return_array: bool = False) → Encoding[source]

Parameters

sequence (-) – str, input sequence
pair (-) – str, optional, pair sequence, default None
return_array (-) – bool, optional, whether to return numpy array, default False

Returns

Encoding object
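The three fields of an Encoding are easiest to read side by side. The sketch below assumes the common BERT layout, [CLS] first [SEP] pair [SEP], with segment id 0 for the first sequence and 1 for the pair. The `VOCAB` dictionary, `EncodingSketch`, and `encode_sketch` are hypothetical, and whitespace splitting stands in for real subword tokenization.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EncodingSketch:
    ids: List[int]
    segment_ids: List[int]
    tokens: List[str]


# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5, "hi": 6}


def encode_sketch(sequence: str, pair: Optional[str] = None) -> EncodingSketch:
    first = ["[CLS]"] + sequence.lower().split() + ["[SEP]"]
    second = (pair.lower().split() + ["[SEP]"]) if pair is not None else []
    tokens = first + second
    # Unknown words fall back to the [UNK] id.
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    # Assumption: segment 0 covers the first sequence, segment 1 the pair.
    segment_ids = [0] * len(first) + [1] * len(second)
    return EncodingSketch(ids=ids, segment_ids=segment_ids, tokens=tokens)


encoding = encode_sketch("hello world", pair="hi there")
print(encoding.tokens)       # ['[CLS]', 'hello', 'world', '[SEP]', 'hi', 'there', '[SEP]']
print(encoding.ids)          # [2, 4, 5, 3, 6, 1, 3]
print(encoding.segment_ids)  # [0, 0, 0, 0, 1, 1, 1]
```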

encode_batch(self, inputs: Union[List[str], List[Tuple[str, str]], List[List[str]]], padding: bool = True, padding_strategy: str = 'post', return_array: bool = False) → Encoding[source]

Parameters

inputs (-) – Union[List[str], List[Tuple[str, str]], List[List[str]]], list of texts or list of text pairs.
padding (-) – bool, optional, whether to pad sequences, default True
padding_strategy (-) – str, optional, options: post or pre, default post
return_array (-) – bool, optional, whether to return numpy array, default False

Returns

Encoding object
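Batch padding can be sketched on plain id lists. This is an illustration under stated assumptions, not langml's code: the helper `pad_batch` is hypothetical, the pad id is taken to be 0, and post is assumed to append padding while pre prepends it.

```python
from typing import List


def pad_batch(batch: List[List[int]], pad_id: int = 0,
              padding_strategy: str = "post") -> List[List[int]]:
    """Pad every id list in `batch` to the length of the longest one."""
    if padding_strategy not in ("post", "pre"):
        raise ValueError("padding_strategy must be 'post' or 'pre'")
    max_len = max((len(ids) for ids in batch), default=0)
    padded = []
    for ids in batch:
        filler = [pad_id] * (max_len - len(ids))
        # 'post' appends the filler, 'pre' prepends it.
        padded.append(ids + filler if padding_strategy == "post" else filler + ids)
    return padded


print(pad_batch([[1, 2, 3], [4]]))                          # [[1, 2, 3], [4, 0, 0]]
print(pad_batch([[1, 2, 3], [4]], padding_strategy="pre"))  # [[1, 2, 3], [0, 0, 4]]
```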

stem(self, token)[source]

sequence_lower(self, sequence: str) → str[source]

Lowercase the sequence, leaving special tokens unchanged.

Parameters

sequence (-) – str

Returns

str
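One way to lowercase text while protecting special tokens is to split on them first. The sketch below is an assumption about the approach rather than langml's implementation, and `sequence_lower_sketch` is a hypothetical name.

```python
import re

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[MASK]", "[CLS]", "[SEP]"]

# The capturing group makes re.split keep the special tokens in its output.
_SPECIAL_PATTERN = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def sequence_lower_sketch(sequence: str) -> str:
    parts = _SPECIAL_PATTERN.split(sequence)
    # Lowercase everything except the chunks that are special tokens.
    return "".join(part if part in SPECIAL_TOKENS else part.lower() for part in parts)


print(sequence_lower_sketch("Hello [MASK] World"))  # hello [MASK] world
```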

sequence_truncating(self, max_token_length: int, tokens: List[str], pair_tokens: Optional[List[str]] = None) → Tuple[List[str], Optional[List[str]]][source]

Truncate the sequence.

Parameters

max_token_length (-) – int, maximum token length
tokens (-) – List[str], input tokens
pair_tokens (-) – Optional[List[str]], optional, input pair tokens, default None

Returns

Tuple[List[str], Optional[List[str]]]
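Truncating a pair needs a rule for which side to shorten. The sketch below shows one plausible rule, not necessarily the one langml uses: it reserves room for [CLS] and one [SEP] per sequence, then repeatedly drops the last token of the longer list. The function name `truncate_pair` and both of these choices are assumptions.

```python
from typing import List, Optional, Tuple


def truncate_pair(max_token_length: int, tokens: List[str],
                  pair_tokens: Optional[List[str]] = None
                  ) -> Tuple[List[str], Optional[List[str]]]:
    tokens = list(tokens)  # copy so the caller's lists are not mutated
    pair_tokens = list(pair_tokens) if pair_tokens is not None else None
    # Assumption: reserve slots for [CLS] plus one [SEP] per sequence.
    reserved = 2 if pair_tokens is None else 3
    budget = max(max_token_length - reserved, 0)
    while len(tokens) + len(pair_tokens or []) > budget:
        # Shorten whichever list is currently longer (ties go to `tokens`).
        if pair_tokens and len(pair_tokens) > len(tokens):
            pair_tokens.pop()
        else:
            tokens.pop()
    return tokens, pair_tokens


print(truncate_pair(8, ["a", "b", "c", "d", "e", "f"], ["x", "y"]))
# (['a', 'b', 'c'], ['x', 'y'])
```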

raw_tokenizer(self) → object[source]

Return the raw tokenizer, i.e. the underlying tokenizers.BertWordPieceTokenizer or sentencepiece.SentencePieceProcessor object.

abstract tokenize(self, sequence: str) → List[str][source]

abstract decode(self, ids: List[int], skip_special_tokens: bool = True) → List[str][source]

abstract get_vocab_size(self) → int[source]

abstract id_to_token(self, idx: int) → str[source]

abstract token_to_id(self, token: str) → int[source]

abstract get_vocab(self) → Dict[source]
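The abstract methods above form a small contract: tokenize text, convert between tokens and ids, and decode ids back to tokens. The toy class below sketches that contract with whitespace tokenization and an in-memory vocabulary. `ToyTokenizer` is hypothetical, does not subclass langml's Tokenizer, and maps unknown tokens to [UNK] as an assumption.

```python
from typing import Dict, List

SPECIAL = ("[PAD]", "[UNK]", "[MASK]", "[CLS]", "[SEP]")


class ToyTokenizer:
    """Whitespace tokenizer over an in-memory vocabulary (illustrative only)."""

    def __init__(self, words: List[str]):
        # Special tokens take the first ids, followed by the given words.
        self._token2id = {tok: i for i, tok in enumerate(list(SPECIAL) + words)}
        self._id2token = {i: tok for tok, i in self._token2id.items()}

    def tokenize(self, sequence: str) -> List[str]:
        return sequence.split()

    def get_vocab(self) -> Dict[str, int]:
        return dict(self._token2id)

    def get_vocab_size(self) -> int:
        return len(self._token2id)

    def token_to_id(self, token: str) -> int:
        # Assumption: out-of-vocabulary tokens map to [UNK].
        return self._token2id.get(token, self._token2id["[UNK]"])

    def id_to_token(self, idx: int) -> str:
        return self._id2token[idx]

    def decode(self, ids: List[int], skip_special_tokens: bool = True) -> List[str]:
        tokens = [self.id_to_token(i) for i in ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t not in SPECIAL]
        return tokens


toy = ToyTokenizer(["hello", "world"])
ids = [toy.token_to_id(t) for t in ["[CLS]"] + toy.tokenize("hello world") + ["[SEP]"]]
print(ids)                                         # [3, 5, 6, 4]
print(toy.decode(ids))                             # ['hello', 'world']
print(toy.decode(ids, skip_special_tokens=False))  # ['[CLS]', 'hello', 'world', '[SEP]']
```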

class langml.tokenizer.SPTokenizer(vocab_path: str, lowercase: bool = False)[source]

Bases: Tokenizer

SentencePiece tokenizer. Wraps sentencepiece.SentencePieceProcessor.

get_vocab_size(self) → int[source]

Return the vocabulary size.

token_to_id(self, token: str) → int[source]

Convert the input token to its corresponding index.

Parameters

token (-) – str

Returns

int

id_to_token(self, idx: int) → str[source]

Convert an index to its corresponding token.

Parameters

idx (-) – int

Returns

str

tokenize(self, sequence: str) → List[str][source]

Tokenize the sequence into token pieces.

Parameters

sequence (-) – str

Returns

List[str]

decode(self, ids: List[int], skip_special_tokens: bool = True) → List[str][source]

Decode indices to tokens.

Parameters

ids (-) – List[int]
skip_special_tokens (-) – bool, optional, whether to skip special tokens, default True

Returns

List[str]

get_vocab(self) → Dict[source]

Return the vocabulary.

class langml.tokenizer.WPTokenizer(vocab_path: str, lowercase: bool = False)[source]

Bases: Tokenizer

WordPiece tokenizer. Wraps tokenizers.BertWordPieceTokenizer.

get_vocab_size(self) → int[source]

Return the vocabulary size.

token_to_id(self, token: str) → int[source]

Convert the input token to its corresponding index.

Parameters

token (-) – str

Returns

int

id_to_token(self, idx: int) → str[source]

Convert an index to its corresponding token.

Parameters

idx (-) – int

Returns

str

tokenize(self, sequence: str) → List[str][source]

Tokenize the sequence into token pieces.

Parameters

sequence (-) – str

Returns

List[str]

decode(self, ids: List[int], skip_special_tokens: bool = True) → List[str][source]

Decode indices to tokens.

Parameters

ids (-) – List[int]
skip_special_tokens (-) – bool, optional, whether to skip special tokens, default True

Returns

List[str]

get_vocab(self) → Dict[source]

Return the vocabulary.

add_special_tokens(self, tokens: List[str])[source]

Specify special tokens. The tokenizer keeps each special token whole (i.e., does not split it) during tokenization. Currently, only WPTokenizer supports specifying special tokens.

Parameters

tokens (-) – List[str], special tokens
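What it means to keep a special token whole can be shown without the library. In the sketch below, the user-defined markers are protected by splitting on them before the rest of the text is tokenized. The function `tokenize_keeping`, the `<e1>` markers, and the whitespace splitting are all hypothetical stand-ins, not WPTokenizer's behavior.

```python
import re
from typing import List


def tokenize_keeping(sequence: str, special_tokens: List[str]) -> List[str]:
    """Whitespace-tokenize `sequence`, emitting each special token unsplit."""
    if not special_tokens:
        return sequence.split()
    # Longer markers first, so a marker that contains another one wins.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    pieces = []
    for chunk in pattern.split(sequence):
        if chunk in special_tokens:
            pieces.append(chunk)          # keep the special token as one piece
        else:
            pieces.extend(chunk.split())  # ordinary text: split on whitespace
    return pieces


print(tokenize_keeping("the <e1>cat</e1> sat", ["<e1>", "</e1>"]))
# ['the', '<e1>', 'cat', '</e1>', 'sat']
```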
