langml.tokenizer

LangML Tokenizer

  • WPTokenizer: WordPiece Tokenizer

  • SPTokenizer: SentencePiece Tokenizer

Wraps:
  • tokenizers.BertWordPieceTokenizer

  • sentencepiece.SentencePieceProcessor

This module does not expose every function of the underlying tokenizers; use the raw tokenizer directly for full functionality.

Module Contents

Classes

Encoding

Product of tokenizer encoding

SpecialTokens

Tokenizer

Base Tokenizer

SPTokenizer

SentencePiece Tokenizer

WPTokenizer

WordPiece Tokenizer

class langml.tokenizer.Encoding(ids: Union[numpy.ndarray, List[int]], segment_ids: Union[numpy.ndarray, List[int]], tokens: List[str])

Product of tokenizer encoding

ids
segment_ids
tokens
class langml.tokenizer.SpecialTokens
PAD = '[PAD]'
UNK = '[UNK]'
MASK = '[MASK]'
CLS = '[CLS]'
SEP = '[SEP]'
__contains__(self, token: str) -> bool

Check whether the input token is one of the special tokens.

Parameters
  • token – str

Returns

bool

tokens(self) -> List[str]
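
As a rough illustration of how the membership check and tokens() fit together, the sketch below reimplements the class in plain Python. SpecialTokensSketch is a hypothetical stand-in, not the library's code; only the five constants and the two method names are taken from the listing above.

```python
from typing import List


class SpecialTokensSketch:
    """Hypothetical stand-in for langml.tokenizer.SpecialTokens."""

    PAD = '[PAD]'
    UNK = '[UNK]'
    MASK = '[MASK]'
    CLS = '[CLS]'
    SEP = '[SEP]'

    def tokens(self) -> List[str]:
        # Return the five documented special tokens in a fixed order.
        return [self.PAD, self.UNK, self.MASK, self.CLS, self.SEP]

    def __contains__(self, token: str) -> bool:
        # Enables the `token in special_tokens` idiom.
        return token in self.tokens()


special_tokens = SpecialTokensSketch()
print('[CLS]' in special_tokens)  # True
print('hello' in special_tokens)  # False
```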
class langml.tokenizer.Tokenizer(vocab_path: str, lowercase: bool = False)

Base Tokenizer

enable_truncation(self, max_length: int, strategy: str = 'post')
Parameters
  • max_length – int

  • strategy – str, optional, truncation strategy, options: post or pre, default post

tokens_mapping(self, sequence: str, tokens: List[str]) -> List[Tuple[int, int]]

Map each token to its position in the original sequence. Tokens may contain special marks, e.g., ##, ▁, and [UNK]; use this function to recover the corresponding raw text in the sequence.

Parameters
  • sequence – str, the input sequence

  • tokens – List[str], tokens of the input sequence

Returns

List[Tuple[int, int]]

Examples:

>>> sequence = 'I like watermelons'
>>> tokens = ['[CLS]', '▁i', '▁like', '▁water', 'mel', 'ons', '[SEP]']
>>> mapping = tokenizer.tokens_mapping(sequence, tokens)
>>> start_index, end_index = 3, 5
>>> print("current token", tokens[start_index: end_index + 1])
current token ['▁water', 'mel', 'ons']
>>> print("raw token", sequence[mapping[start_index][0]: mapping[end_index][1]])
raw token watermelons

Reference:

https://github.com/bojone/bert4keras
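
To make the idea of a token-to-position mapping concrete, here is a minimal, self-contained sketch of one way such spans could be computed. It is not the library's implementation: tokens_mapping_sketch is a hypothetical helper, spans are assumed to be end-exclusive (as the slicing in the example suggests), and special tokens, including [UNK], are simply given an empty span, which a real implementation would need to handle more carefully.

```python
from typing import List, Tuple

SPECIAL = {'[PAD]', '[UNK]', '[MASK]', '[CLS]', '[SEP]'}


def tokens_mapping_sketch(sequence: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Greedy left-to-right alignment of tokens to character spans."""
    text = sequence.lower()
    mapping: List[Tuple[int, int]] = []
    offset = 0
    for token in tokens:
        if token in SPECIAL:
            # Assumption: special tokens cover no characters.
            mapping.append((offset, offset))
            continue
        # Strip the WordPiece continuation prefix and the SentencePiece mark.
        piece = token[2:] if token.startswith('##') else token
        piece = piece.lstrip('▁').lower()
        start = text.find(piece, offset) if piece else -1
        if start < 0:
            # Nothing to align; record an empty span at the current offset.
            mapping.append((offset, offset))
            continue
        end = start + len(piece)
        mapping.append((start, end))
        offset = end
    return mapping


sequence = 'I like watermelons'
tokens = ['[CLS]', '▁i', '▁like', '▁water', 'mel', 'ons', '[SEP]']
mapping = tokens_mapping_sketch(sequence, tokens)
print(sequence[mapping[3][0]: mapping[5][1]])  # watermelons
```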

encode(self, sequence: str, pair: Optional[str] = None, return_array: bool = False) -> Encoding
Parameters
  • sequence – str, input sequence

  • pair – str, optional, pair sequence, default None

  • return_array – bool, optional, whether to return numpy array, default False

Returns

Encoding object
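
The sketch below shows how the three Encoding fields (ids, segment_ids, tokens) relate for a single sequence and for a pair. It assumes the usual BERT layout, [CLS] a [SEP] b [SEP], with segment id 0 for the first part and 1 for the second; whether LangML follows exactly this layout is an assumption. EncodingSketch, encode_sketch, and the toy vocabulary are hypothetical names used for illustration only.

```python
from typing import Dict, List, NamedTuple, Optional


class EncodingSketch(NamedTuple):
    """Hypothetical stand-in mirroring the three Encoding fields."""
    ids: List[int]
    segment_ids: List[int]
    tokens: List[str]


def encode_sketch(
    vocab: Dict[str, int],
    tokens: List[str],
    pair_tokens: Optional[List[str]] = None,
) -> EncodingSketch:
    # First segment: [CLS] tokens [SEP], all with segment id 0.
    all_tokens = ['[CLS]'] + tokens + ['[SEP]']
    segment_ids = [0] * len(all_tokens)
    if pair_tokens is not None:
        # Second segment: pair tokens [SEP], all with segment id 1.
        second = pair_tokens + ['[SEP]']
        all_tokens += second
        segment_ids += [1] * len(second)
    # Out-of-vocabulary tokens fall back to the [UNK] id.
    unk_id = vocab['[UNK]']
    ids = [vocab.get(token, unk_id) for token in all_tokens]
    return EncodingSketch(ids=ids, segment_ids=segment_ids, tokens=all_tokens)


vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, 'hello': 4, 'world': 5}
enc = encode_sketch(vocab, ['hello'], ['world', 'unseen'])
print(enc.ids)          # [2, 4, 3, 5, 1, 3]
print(enc.segment_ids)  # [0, 0, 0, 1, 1, 1]
```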

encode_batch(self, inputs: Union[List[str], List[Tuple[str, str]], List[List[str]]], padding: bool = True, padding_strategy: str = 'post', return_array: bool = False) -> Encoding
Parameters
  • inputs – Union[List[str], List[Tuple[str, str]], List[List[str]]], list of texts or list of text pairs.

  • padding – bool, optional, whether to pad sequences, default True

  • padding_strategy – str, optional, options: post or pre, default post

  • return_array – bool, optional, whether to return numpy array, default False

Returns

Encoding object
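
The difference between the post and pre padding strategies can be sketched as follows. This assumes the Keras-style convention in which post appends padding to the end and pre prepends it, and that the padding id is 0; both are assumptions, and pad_batch is a hypothetical helper rather than part of LangML.

```python
from typing import List


def pad_batch(batch_ids: List[List[int]], pad_id: int = 0, strategy: str = 'post') -> List[List[int]]:
    """Pad every id list to the length of the longest one in the batch."""
    if strategy not in ('post', 'pre'):
        raise ValueError("strategy must be 'post' or 'pre'")
    max_len = max((len(ids) for ids in batch_ids), default=0)
    padded = []
    for ids in batch_ids:
        fill = [pad_id] * (max_len - len(ids))
        # 'post' pads on the right, 'pre' pads on the left.
        padded.append(ids + fill if strategy == 'post' else fill + ids)
    return padded


batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
print(pad_batch(batch))                  # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(pad_batch(batch, strategy='pre'))  # [[0, 0, 101, 7, 102], [101, 7, 8, 9, 102]]
```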

stem(self, token)
sequence_lower(self, sequence: str) -> str

Lowercase the sequence, leaving special tokens unchanged.

Parameters
  • sequence – str

Returns

str
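
One way to lowercase text while leaving special tokens intact is to split on the special tokens first and lowercase only the remaining pieces. The following is a minimal sketch of that idea, not the library's implementation; lower_except_special is a hypothetical helper, and the special-token list is taken from SpecialTokens above.

```python
import re

SPECIAL_TOKENS = ['[PAD]', '[UNK]', '[MASK]', '[CLS]', '[SEP]']

# A capturing group makes re.split keep the matched special tokens.
_SPECIAL_PATTERN = re.compile('(' + '|'.join(re.escape(t) for t in SPECIAL_TOKENS) + ')')


def lower_except_special(sequence: str) -> str:
    parts = _SPECIAL_PATTERN.split(sequence)
    # Lowercase ordinary text; pass special tokens through unchanged.
    return ''.join(p if p in SPECIAL_TOKENS else p.lower() for p in parts)


print(lower_except_special('Hello [MASK] World'))  # hello [MASK] world
```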

sequence_truncating(self, max_token_length: int, tokens: List[str], pair_tokens: Optional[List[str]] = None) -> Tuple[List[str], Optional[List[str]]]

Truncate the token sequence(s).

Parameters
  • max_token_length – int, maximum token length

  • tokens – List[str], input tokens

  • pair_tokens – Optional[List[str]], optional, input pair tokens, default None

Returns

Tuple[List[str], Optional[List[str]]]
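
A possible truncation scheme for a sequence and an optional pair is sketched below. It rests on several assumptions that the listing above does not confirm: post drops tokens from the end and pre from the start, the longer of the two sequences is shortened first, and max_token_length counts only the tokens passed in (no [CLS] or [SEP]). truncate_tokens is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def truncate_tokens(
    max_token_length: int,
    tokens: List[str],
    pair_tokens: Optional[List[str]] = None,
    strategy: str = 'post',
) -> Tuple[List[str], Optional[List[str]]]:
    if strategy not in ('post', 'pre'):
        raise ValueError("strategy must be 'post' or 'pre'")
    # Work on copies so the caller's lists are not modified.
    first = list(tokens)
    second = list(pair_tokens) if pair_tokens is not None else None
    drop_at = -1 if strategy == 'post' else 0

    def total() -> int:
        return len(first) + (len(second) if second is not None else 0)

    while total() > max_token_length and total() > 0:
        # Shorten the longer sequence; ties shorten the first one.
        if second and len(second) > len(first):
            second.pop(drop_at)
        else:
            first.pop(drop_at)
    return first, second


print(truncate_tokens(4, list('abcdef')))                  # (['a', 'b', 'c', 'd'], None)
print(truncate_tokens(4, list('abcdef'), strategy='pre'))  # (['c', 'd', 'e', 'f'], None)
```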

raw_tokenizer(self) -> object

Return the raw tokenizer, i.e., an instance of tokenizers.BertWordPieceTokenizer or sentencepiece.SentencePieceProcessor.

abstract tokenize(self, sequence: str) -> List[str]
abstract decode(self, ids: List[int], skip_special_tokens: bool = True) -> List[str]
abstract get_vocab_size(self) -> int
abstract id_to_token(self, idx: int) -> str
abstract token_to_id(self, token: str) -> int
abstract get_vocab(self) -> Dict
class langml.tokenizer.SPTokenizer(vocab_path: str, lowercase: bool = False)

Bases: Tokenizer

SentencePiece tokenizer. Wraps sentencepiece.SentencePieceProcessor.

get_vocab_size(self) -> int

Return vocab size

token_to_id(self, token: str) -> int

Convert the input token to its corresponding index.

Parameters
  • token – str

Returns

int

id_to_token(self, idx: int) -> str

Convert an index to its corresponding token.

Parameters
  • idx – int

Returns

str

tokenize(self, sequence: str) -> List[str]

Tokenize the sequence into token pieces.

Parameters
  • sequence – str

Returns

List[str]

decode(self, ids: List[int], skip_special_tokens: bool = True) -> List[str]

Decode indices to tokens.

Parameters
  • ids – List[int]

  • skip_special_tokens – bool, optional, whether to skip special tokens, default True

Returns

List[str]

get_vocab(self) -> Dict

Return vocabulary

class langml.tokenizer.WPTokenizer(vocab_path: str, lowercase: bool = False)

Bases: Tokenizer

WordPiece tokenizer. Wraps tokenizers.BertWordPieceTokenizer.

get_vocab_size(self) -> int

Return vocab size

token_to_id(self, token: str) -> int

Convert the input token to its corresponding index.

Parameters
  • token – str

Returns

int

id_to_token(self, idx: int) -> str

Convert an index to its corresponding token.

Parameters
  • idx – int

Returns

str

tokenize(self, sequence: str) -> List[str]

Tokenize the sequence into token pieces.

Parameters
  • sequence – str

Returns

List[str]

decode(self, ids: List[int], skip_special_tokens: bool = True) -> List[str]

Decode indices to tokens.

Parameters
  • ids – List[int]

  • skip_special_tokens – bool, optional, whether to skip special tokens, default True

Returns

List[str]

get_vocab(self) -> Dict

Return vocabulary

add_special_tokens(self, tokens: List[str])

Specify special tokens. The tokenizer keeps each special token whole (i.e., does not split it) during tokenization. Currently, only WPTokenizer supports specifying special tokens.

Parameters
  • tokens – List[str], special tokens
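
To illustrate what keeping special tokens whole means in practice, the sketch below splits the text on the registered special tokens before tokenizing the rest. Plain whitespace splitting stands in for the real WordPiece step, and tokenize_with_special and the <e1> markers are hypothetical; this is an illustration of the behaviour, not how WPTokenizer implements it.

```python
import re
from typing import List


def tokenize_with_special(text: str, special_tokens: List[str]) -> List[str]:
    if not special_tokens:
        return text.split()
    # Longest tokens first so that overlapping specials match greedily.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = '(' + '|'.join(re.escape(t) for t in ordered) + ')'
    pieces: List[str] = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            # Special tokens are emitted as a single, unsplit piece.
            pieces.append(part)
        else:
            # Stand-in for the real subword tokenizer.
            pieces.extend(part.split())
    return pieces


print(tokenize_with_special('tag <e1> Paris </e1> now', ['<e1>', '</e1>']))
# ['tag', '<e1>', 'Paris', '</e1>', 'now']
```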