Check if mask words are in the model vocabulary.

Usage

BERT_vocab(
  models,
  mask.words,
  add.tokens = FALSE,
  add.method = c("sum", "mean"),
  add.verbose = TRUE
)

Arguments

models: A character vector of model names on HuggingFace.
mask.words: Candidate words (or phrases) to fill in the mask.
add.tokens: Add new tokens (for out-of-vocabulary words or phrases) to the model vocabulary? Defaults to FALSE. Tokens are added only temporarily for the current task; the raw model file is not changed.
add.method: Method used to produce the token embeddings of newly added tokens. Can be "sum" (default) or "mean" of subword token embeddings.
add.verbose: Print composition information of new tokens (for out-of-vocabulary words or phrases)? Defaults to TRUE.
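
The difference between the two add.method options can be illustrated with a toy matrix in base R. This is only a sketch of the arithmetic, not the package's internal code: the subword names (piece1, piece2) and the embedding values are made up for illustration, and real embeddings have far more dimensions.

```r
# Hypothetical subword embeddings for one out-of-vocabulary word:
# 2 subword pieces (rows) x 3 embedding dimensions (columns).
subword_emb <- rbind(
  piece1 = c(1, 2, 3),
  piece2 = c(3, 4, 5)
)

# add.method = "sum": add the subword embeddings dimension by dimension.
emb_sum <- colSums(subword_emb)

# add.method = "mean": average the subword embeddings dimension by dimension.
emb_mean <- colMeans(subword_emb)

print(emb_sum)
print(emb_mean)
```

With "sum", the magnitude of the new embedding grows with the number of subword pieces; with "mean", it stays on the scale of a single subword embedding.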

Value

A data.table containing the model name, mask word, real token (replaced if out of vocabulary), and token ID (0 to N).

Examples

if (FALSE) { # \dontrun{
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)

BERT_vocab(models, c("bruce", "Bruce"))

BERT_vocab(models, 2020:2025)  # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE)  # add vocab

BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)
} # }
