Tokenization
Tokenizer
A wrapper class for the BERT tokenizer from the Hugging Face Transformers library. Pass it a vocab_file to ensure that the correct vocabulary is used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
checkpoint | str | The name or path of the pre-trained BERT checkpoint to use. | 'bert-base-uncased' |
vocab_file | str | The path to the custom vocabulary file to use (optional). | 'config/vocab.txt' |
Attributes:
Name | Type | Description |
---|---|---|
tokenizer | BertTokenizer | The BERT tokenizer object. |
Source code in notebooks/experiments/tokenization.py
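A minimal end-to-end usage sketch, assuming the class is importable from notebooks/experiments/tokenization.py and that the default checkpoint and vocabulary paths exist (the import path is an assumption, not verified):

```python
# Minimal usage sketch; the import path is an assumption based on the
# documented source location, not a verified package layout.
from notebooks.experiments.tokenization import Tokenizer

tok = Tokenizer(checkpoint="bert-base-uncased", vocab_file="config/vocab.txt")

# Encode a sentence into BERT token IDs via __call__.
token_ids = tok("Hello, world!", add_special_tokens=True)

# Turn the IDs back into token strings via decode.
tokens = tok.decode(token_ids, skip_special_tokens=True)
print(token_ids, tokens)
```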
__call__(text, add_special_tokens=True)
Tokenizes the input text using the BERT tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The input text to tokenize. | required |
add_special_tokens | bool | Whether to add special tokens to the tokenized text (optional). | True |
Returns:
Name | Type | Description |
---|---|---|
tokens | List[int] | A list of token IDs representing the tokenized text. |
Source code in notebooks/experiments/tokenization.py
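For illustration, the effect of add_special_tokens can be seen by tokenizing the same sentence twice (a sketch reusing the tok instance from the example above; the exact IDs depend on the vocabulary in use):

```python
# With special tokens, the sequence is wrapped in [CLS] ... [SEP].
with_special = tok("transformers are neat", add_special_tokens=True)

# Without them, only the word-piece IDs are returned.
plain = tok("transformers are neat", add_special_tokens=False)

# For a standard BERT vocabulary the difference is the two wrapper tokens.
print(len(with_special) - len(plain))  # typically 2
```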
__init__(checkpoint='bert-base-uncased', vocab_file='config/vocab.txt')
Initializes the Tokenizer object with the specified checkpoint and vocabulary file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
checkpoint | str | The name or path of the pre-trained BERT checkpoint to use. | 'bert-base-uncased' |
vocab_file | str | The path to the custom vocabulary file to use (optional). | 'config/vocab.txt' |
Returns:
Type | Description |
---|---|
None | None. |
Source code in notebooks/experiments/tokenization.py
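The notebook source is not reproduced here; the following is only a plausible sketch of how the initializer might wire checkpoint and vocab_file together with BertTokenizer, based on the documented parameters:

```python
from transformers import BertTokenizer

class Tokenizer:
    """Sketch of the wrapper; the real implementation lives in
    notebooks/experiments/tokenization.py."""

    def __init__(self, checkpoint="bert-base-uncased", vocab_file="config/vocab.txt"):
        # Assumption: prefer the custom vocabulary when one is supplied,
        # otherwise fall back to the pre-trained checkpoint.
        if vocab_file is not None:
            self.tokenizer = BertTokenizer(vocab_file=vocab_file)
        else:
            self.tokenizer = BertTokenizer.from_pretrained(checkpoint)
```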
decode(tokens, skip_special_tokens=True)
Decodes the input token IDs into a list of strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokens | List[int] | A list of token IDs to decode. | required |
skip_special_tokens | bool | Whether to skip special tokens when decoding (optional). | True |
Returns:
Name | Type | Description |
---|---|---|
text | List[str] | A list of strings representing the decoded tokens. |
Source code in notebooks/experiments/tokenization.py
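A round-trip example reusing tok from the first sketch; because the documented return type is a list of token strings (rather than a single detokenized string), the wrapper presumably maps IDs back to word pieces, e.g. via convert_ids_to_tokens — an assumption based on the signature, not the source:

```python
token_ids = tok("the quick brown fox", add_special_tokens=True)

# skip_special_tokens=True drops markers such as [CLS] and [SEP].
print(tok.decode(token_ids, skip_special_tokens=True))
# e.g. ['the', 'quick', 'brown', 'fox'] with a standard BERT vocabulary

# Keeping the special tokens preserves the full encoded sequence.
print(tok.decode(token_ids, skip_special_tokens=False))
```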