Check if mask words are in the model vocabulary.
Usage
BERT_vocab(
models,
mask.words,
add.tokens = FALSE,
add.method = c("sum", "mean")
)
Arguments
- models
Model names at HuggingFace.
- mask.words
Candidate words for filling in the mask.
- add.tokens
Add new tokens (for out-of-vocabulary words or even phrases) to the model vocabulary? Defaults to FALSE. Tokens are added only temporarily for the task; the raw model file is not changed.
- add.method
Method used to produce the token embeddings of newly added tokens. Can be "sum" (default) or "mean" of subword token embeddings.
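As a rough illustration of the difference between the two add.method options, the sketch below combines subword embeddings by column-wise sum and by column-wise mean. The subword pieces and the 3 x 4 embedding values are made up for the example, and this is not the package's internal code; it only shows the arithmetic the two options imply.

```r
# Hypothetical embeddings for the subword pieces of one out-of-vocabulary word
# (rows = subword tokens, columns = embedding dimensions; values are invented)
subword.emb = rbind(
  "art"    = c(1, 0, 2, 4),
  "##ific" = c(3, 2, 0, 2),
  "##ial"  = c(2, 4, 1, 0)
)

# add.method = "sum": add the subword vectors dimension by dimension
emb.sum = colSums(subword.emb)

# add.method = "mean": average the subword vectors dimension by dimension
emb.mean = colMeans(subword.emb)

print(emb.sum)   # 6 6 3 6
print(emb.mean)  # 2 2 1 2
```

The two results point in the same direction and differ only in scale: the sum grows with the number of subword pieces, while the mean stays on the scale of a single subword embedding.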
Value
A data.table of model name, mask word, real token (replaced if out of vocabulary), and token id (0~N).
Examples
if (FALSE) { # \dontrun{
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)
BERT_vocab(models, c("bruce", "Bruce"))
BERT_vocab(models, 2020:2025) # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE) # add vocab
BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)
} # }