langml.tokenizer
LangML Tokenizer
WPTokenizer: WordPiece Tokenizer
SPTokenizer: SentencePiece Tokenizer
- Wrappers for:
tokenizers.BertWordPieceTokenizer
sentencepiece.SentencePieceProcessor
Not all functions of the underlying tokenizers are exposed; use the underlying tokenizer directly for full functionality.
Module Contents
Classes
Encoding: Product of tokenizer encoding
SpecialTokens
Tokenizer: Base Tokenizer
SPTokenizer: SentencePiece Tokenizer
WPTokenizer: WordPiece Tokenizer
- class langml.tokenizer.Encoding(ids: Union[numpy.ndarray, List[int]], segment_ids: Union[numpy.ndarray, List[int]], tokens: List[str])[source]
Product of tokenizer encoding
- class langml.tokenizer.SpecialTokens[source]
- class langml.tokenizer.Tokenizer(vocab_path: str, lowercase: bool = False)[source]
Base Tokenizer
- enable_truncation(self, max_length: int, strategy: str = 'post')[source]
- Parameters
max_length – int, maximum length
strategy – str, optional, truncation strategy, options: post or pre, default post
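The two strategy values can be illustrated with a minimal, library-independent sketch. It assumes that post drops surplus tokens from the end and pre drops them from the start (the convention used by Keras-style sequence utilities). The helper name truncate is hypothetical and is not part of langml.

```python
from typing import List


def truncate(tokens: List[str], max_length: int, strategy: str = "post") -> List[str]:
    """Hypothetical helper: cut `tokens` down to at most `max_length` items."""
    if strategy not in ("post", "pre"):
        raise ValueError("strategy must be 'post' or 'pre'")
    if len(tokens) <= max_length:
        return list(tokens)
    if strategy == "post":
        # Assumed meaning of 'post': drop the surplus from the end.
        return tokens[:max_length]
    # Assumed meaning of 'pre': drop the surplus from the start.
    return tokens[len(tokens) - max_length:]


tokens = ["i", "like", "water", "##mel", "##ons"]
post_result = truncate(tokens, 3, "post")  # keeps the first three tokens
pre_result = truncate(tokens, 3, "pre")    # keeps the last three tokens
```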
- tokens_mapping(self, sequence: str, tokens: List[str]) List[Tuple[int, int]][source]
Map each token to its character span in the original sequence. Tokens may contain special marks, e.g., ##, ▁, and [UNK]. Use this function to recover the raw text that a token, or a run of tokens, corresponds to in the sequence.
- Parameters
sequence – str, the input sequence
tokens – List[str], tokens of the input sequence
- Returns
List[Tuple[int, int]]
Examples:
>>> sequence = 'I like watermelons'
>>> tokens = ['[CLS]', '▁i', '▁like', '▁water', 'mel', 'ons', '[SEP]']
>>> mapping = tokenizer.tokens_mapping(sequence, tokens)
>>> start_index, end_index = 3, 5
>>> print("current token", tokens[start_index: end_index + 1])
current token ['▁water', 'mel', 'ons']
>>> print("raw token", sequence[mapping[start_index][0]: mapping[end_index][1]])
raw token watermelons
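One way such a mapping could be computed is sketched below. This is not langml's implementation: the function name map_tokens, the SPECIAL set, and the left-to-right search are all assumptions made for illustration. Tokens that cannot be located (for example [UNK]) simply receive an empty span here.

```python
from typing import List, Tuple

# Assumed set of special tokens; the real library may use a different set.
SPECIAL = {"[CLS]", "[SEP]", "[PAD]", "[MASK]"}


def map_tokens(sequence: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical sketch: return a (start, end) character span per token."""
    lowered = sequence.lower()  # assumes lowercasing preserves string length
    spans: List[Tuple[int, int]] = []
    cursor = 0
    for token in tokens:
        piece = token
        if piece.startswith("##"):
            piece = piece[2:]  # strip the WordPiece continuation mark
        piece = piece.replace("▁", "").lower()  # strip the SentencePiece mark
        if token in SPECIAL or not piece:
            spans.append((cursor, cursor))  # empty span for special tokens
            continue
        start = lowered.find(piece, cursor)
        if start < 0:
            spans.append((cursor, cursor))  # not found, e.g. [UNK]
            continue
        end = start + len(piece)
        spans.append((start, end))
        cursor = end
    return spans


sequence = "I like watermelons"
tokens = ["[CLS]", "▁i", "▁like", "▁water", "mel", "ons", "[SEP]"]
mapping = map_tokens(sequence, tokens)
raw = sequence[mapping[3][0]: mapping[5][1]]
```

Slicing from the start of the first piece to the end of the last piece recovers the raw word, which is the same pattern used in the example above.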
- encode(self, sequence: str, pair: Optional[str] = None, return_array: bool = False) Encoding[source]
- Parameters
sequence – str, input sequence
pair – str, optional, pair sequence, default None
return_array – bool, optional, whether to return numpy array, default False
- Returns
Encoding object
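To make the three Encoding fields (ids, segment_ids, tokens) concrete, here is a self-contained sketch of how a single sequence or a sequence pair is commonly laid out. The [CLS] a [SEP] b [SEP] layout and the 0/1 segment ids follow the usual BERT convention and are assumed here, not confirmed for langml. SketchEncoding, build_encoding, and the toy vocabulary are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class SketchEncoding(NamedTuple):
    """Hypothetical stand-in mirroring the documented Encoding fields."""
    ids: List[int]
    segment_ids: List[int]
    tokens: List[str]


def build_encoding(
    vocab: Dict[str, int],
    first: List[str],
    pair: Optional[List[str]] = None,
) -> SketchEncoding:
    # Assumed BERT-style layout: [CLS] first [SEP] (pair [SEP])
    tokens = ["[CLS]"] + first + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if pair is not None:
        tokens += pair + ["[SEP]"]
        segment_ids += [1] * (len(pair) + 1)
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return SketchEncoding(ids, segment_ids, tokens)


vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}
enc = build_encoding(vocab, ["hello"], ["world", "again"])
```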
- encode_batch(self, inputs: Union[List[str], List[Tuple[str, str]], List[List[str]]], padding: bool = True, padding_strategy: str = 'post', return_array: bool = False) Encoding[source]
- Parameters
inputs – Union[List[str], List[Tuple[str, str]], List[List[str]]], list of texts or list of text pairs.
padding – bool, optional, whether to pad sequences, default True
padding_strategy – str, optional, options: post or pre, default post
return_array – bool, optional, whether to return numpy array, default False
- Returns
Encoding object
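The effect of padding_strategy can be sketched without the library. The snippet assumes post appends padding ids after each sequence and pre prepends them, with every row padded to the longest row in the batch. The helper pad_batch and the pad id 0 are hypothetical choices.

```python
from typing import List


def pad_batch(batch: List[List[int]], pad_id: int = 0, strategy: str = "post") -> List[List[int]]:
    """Hypothetical helper: pad every row to the length of the longest row."""
    if strategy not in ("post", "pre"):
        raise ValueError("strategy must be 'post' or 'pre'")
    width = max((len(row) for row in batch), default=0)
    padded = []
    for row in batch:
        filler = [pad_id] * (width - len(row))
        # Assumed meaning: 'post' pads on the right, 'pre' pads on the left.
        padded.append(row + filler if strategy == "post" else filler + row)
    return padded


batch = [[5, 6, 7], [8]]
post_padded = pad_batch(batch, strategy="post")
pre_padded = pad_batch(batch, strategy="pre")
```

Because all rows end up with the same length, the result can be converted to a rectangular array, which is presumably what return_array relies on.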
- sequence_lower(self, sequence: str) str[source]
Lowercase the sequence, except for special tokens.
- Parameters
sequence – str
- Returns
str
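A minimal sketch of lowercasing that leaves special tokens untouched, assuming the special tokens are the bracketed BERT-style markers listed below. The list and the helper name lower_except_special are hypothetical, and langml may handle this differently.

```python
import re

# Assumed set of special tokens to protect from lowercasing.
SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[MASK]", "[UNK]", "[PAD]"]
_SPECIAL_PATTERN = re.compile(
    "(" + "|".join(re.escape(token) for token in SPECIAL_TOKENS) + ")"
)


def lower_except_special(sequence: str) -> str:
    """Hypothetical helper: lowercase everything except special tokens."""
    # The capturing group makes re.split keep the special tokens in the result.
    parts = _SPECIAL_PATTERN.split(sequence)
    return "".join(part if part in SPECIAL_TOKENS else part.lower() for part in parts)


result = lower_except_special("Paris is the [MASK] of France")
```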
- sequence_truncating(self, max_token_length: int, tokens: List[str], pair_tokens: Optional[List[str]] = None) Tuple[List[str], Optional[List[str]]][source]
Truncate the sequence.
- Parameters
max_token_length – int, maximum token length
tokens – List[str], input tokens
pair_tokens – Optional[List[str]], optional, input pair tokens, default None
- Returns
Tuple[List[str], Optional[List[str]]]
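The return type suggests that both token lists are shortened together. One common way to do this is a longest-first policy, sketched below. The policy, the helper name truncate_pair, and the treatment of max_token_length as a combined budget that ignores special tokens are all assumptions; the source does not say which policy langml uses.

```python
from typing import List, Optional, Tuple


def truncate_pair(
    max_token_length: int,
    tokens: List[str],
    pair_tokens: Optional[List[str]] = None,
) -> Tuple[List[str], Optional[List[str]]]:
    """Hypothetical sketch: trim the longer list first until the budget fits."""
    if max_token_length < 0:
        raise ValueError("max_token_length must be non-negative")
    tokens = list(tokens)  # copy so the caller's lists are not modified
    if pair_tokens is None:
        return tokens[:max_token_length], None
    pair_tokens = list(pair_tokens)
    while len(tokens) + len(pair_tokens) > max_token_length:
        # Remove one token from the end of whichever list is longer;
        # on a tie, the pair list is trimmed.
        if len(tokens) > len(pair_tokens):
            tokens.pop()
        else:
            pair_tokens.pop()
    return tokens, pair_tokens


first, second = truncate_pair(5, ["a", "b", "c", "d"], ["x", "y", "z"])
single, no_pair = truncate_pair(2, ["a", "b", "c"])
```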
- class langml.tokenizer.SPTokenizer(vocab_path: str, lowercase: bool = False)[source]
Bases: Tokenizer
SentencePiece Tokenizer. Wrapper for sentencepiece.
- token_to_id(self, token: str) int[source]
Convert the input token to its corresponding index.
- Parameters
token – str
- Returns
int
- id_to_token(self, idx: int) str[source]
Convert an index to its corresponding token.
- Parameters
idx – int
- Returns
str
- tokenize(self, sequence: str) List[str][source]
Tokenize the sequence into token pieces.
- Parameters
sequence – str
- Returns
List[str]
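SentencePiece pieces mark the start of a word with ▁ (U+2581), standing in for the preceding whitespace. The sketch below shows how a list of such pieces can be joined back into text. The helper name detokenize is hypothetical and the pieces are hand-written, not produced by a real model.

```python
from typing import List


def detokenize(pieces: List[str]) -> str:
    """Hypothetical helper: join SentencePiece-style pieces back into text."""
    # Each ▁ stands for a space before the piece; the leading one is stripped.
    return "".join(pieces).replace("\u2581", " ").strip()


pieces = ["\u2581I", "\u2581like", "\u2581water", "mel", "ons"]
text = detokenize(pieces)
```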
- class langml.tokenizer.WPTokenizer(vocab_path: str, lowercase: bool = False)[source]
Bases: Tokenizer
WordPiece Tokenizer. Wrapper for BertWordPieceTokenizer.
- token_to_id(self, token: str) int[source]
Convert the input token to its corresponding index.
- Parameters
token – str
- Returns
int
- id_to_token(self, idx: int) str[source]
Convert an index to its corresponding token.
- Parameters
idx – int
- Returns
str
- tokenize(self, sequence: str) List[str][source]
Tokenize the sequence into token pieces.
- Parameters
sequence – str
- Returns
List[str]
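For intuition about what tokenize, token_to_id, and id_to_token do for a WordPiece vocabulary, the following sketch shows the textbook greedy longest-match-first algorithm on a toy vocabulary. It is not the code used by langml or by BertWordPieceTokenizer. The function wordpiece and the vocabulary are hypothetical, and real tokenizers also normalize and pre-split the text first.

```python
from typing import Dict, List


def wordpiece(word: str, vocab: Dict[str, int], unk: str = "[UNK]") -> List[str]:
    """Hypothetical sketch: split one word into the longest matching pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


vocab = {"[UNK]": 0, "water": 1, "##mel": 2, "##ons": 3, "##on": 4}
id_to_token = {index: token for token, index in vocab.items()}

pieces = wordpiece("watermelons", vocab)
ids = [vocab[piece] for piece in pieces]          # token -> id lookup
restored = [id_to_token[index] for index in ids]  # id -> token lookup
```

The two dictionaries make the token/index conversion a pair of inverse lookups, which is the behaviour the token_to_id and id_to_token signatures describe.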