Check if mask words are in the model vocabulary.

Usage

BERT_vocab(
  models,
  mask.words,
  add.tokens = FALSE,
  add.method = c("sum", "mean")
)

Arguments

models

Character vector of model names on HuggingFace.

mask.words

Candidate words to fill in the mask.

add.tokens

Whether to add new tokens (out-of-vocabulary words or even phrases) to the model vocabulary. Defaults to FALSE. Tokens are added only temporarily for the task at hand; the raw model file is not changed.

add.method

Method used to produce the token embeddings of newly added tokens. Can be "sum" (default) or "mean" of subword token embeddings.
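
To make the difference between "sum" and "mean" concrete, here is a minimal base R sketch. It is not the package's implementation: the embedding values, the subword split of "individualism", and the helper combine_embeddings() are all made up for illustration.

```r
# Toy embedding matrix: one row per subword token, one column per dimension.
# ASSUMPTION: the values and the subword split are invented for illustration.
subword_embeddings <- rbind(
  "individual" = c(0.2, -0.4,  1.0),
  "##ism"      = c(0.6,  0.0, -0.2)
)

# Hypothetical helper: collapse subword embeddings into a single vector
# for the new token, either by summing or by averaging each dimension.
combine_embeddings <- function(emb, method = c("sum", "mean")) {
  method <- match.arg(method)  # the first choice ("sum") is the default
  if (method == "sum") colSums(emb) else colMeans(emb)
}

emb_sum  <- combine_embeddings(subword_embeddings, "sum")
emb_mean <- combine_embeddings(subword_embeddings, "mean")
```

With two subwords, the "mean" vector is the "sum" vector divided by two, so the two methods give the same direction and differ only in magnitude.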

Value

A data.table of model name, mask word, real token (replaced if out of vocabulary), and token ID (0 to N).
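
The idea behind the vocabulary check can be sketched with a toy vocabulary in base R. This is only an illustration under stated assumptions: the vocabulary, the helper lookup_token_id(), and the use of NA for out-of-vocabulary words are hypothetical and do not reproduce the function's actual output.

```r
# ASSUMPTION: a tiny made-up vocabulary; real models have many thousands of entries.
vocab <- c("[PAD]", "[UNK]", "bruce", "2020", "2021")

# Hypothetical helper: return 0-based token IDs, with NA for words
# that are not in the vocabulary. Matching is case-sensitive.
lookup_token_id <- function(words, vocab) {
  match(as.character(words), vocab) - 1L
}

ids <- lookup_token_id(c("bruce", "Bruce", 2020), vocab)
```

Here "Bruce" is not found because the toy vocabulary is lowercase only, which mirrors why the same word can be in the vocabulary of an uncased model but not of a cased one, or vice versa.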

Examples

if (FALSE) { # \dontrun{
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)

BERT_vocab(models, c("bruce", "Bruce"))

BERT_vocab(models, 2020:2025)  # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE)  # add vocab

BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)
} # }