😷 The Fill-Mask Association Test (掩码填空联系测验).
The Fill-Mask Association Test (FMAT) is an integrative and probability-based method using BERT Models to measure conceptual associations (e.g., attitudes, biases, stereotypes, social norms, cultural values) as propositions in natural language (Bao, 2024, JPSP).
Citation
- Bao, H.-W.-S. (2023). FMAT: The Fill-Mask Association Test. https://CRAN.R-project.org/package=FMAT
  - Note: This is the original citation. Please refer to the information displayed when you run library(FMAT) for the APA-7 format of the version you installed.
- Bao, H.-W.-S. (in press). The Fill-Mask Association Test (FMAT): Measuring propositions in natural language. Journal of Personality and Social Psychology. DOI: 10.1037/pspa0000396
Installation
To use the FMAT, the R package FMAT and two Python packages (transformers and torch) all need to be installed.
(1) R Package
## Method 1: Install from CRAN
install.packages("FMAT")
## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/FMAT", force=TRUE)
(2) Python Environment and Packages
Step 1
Install Anaconda (a recommended package manager which automatically installs Python, Python IDEs like Spyder, and a large list of necessary Python package dependencies).
Step 2
Specify the Python interpreter in RStudio.
RStudio → Tools → Global/Project Options → Python → Select → Conda Environments → Choose “…/Anaconda3/python.exe”
Step 3
Install the “transformers” and “torch” Python packages.
(Windows Command / Anaconda Prompt / RStudio Terminal)
pip install transformers torch
See Guidance for GPU Acceleration for installation guidance if you have an NVIDIA GPU device on your PC and want to use GPU to accelerate the pipeline.
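After Step 3, it can be useful to confirm that R finds a Python interpreter in which both packages are importable. The following is a minimal sketch that assumes the reticulate R package is installed; check_python_packages() is a hypothetical helper written for this illustration and is not part of FMAT.

```r
# Sketch: check that the Python packages needed by FMAT can be imported.
# check_python_packages() is a hypothetical helper, not an FMAT function.
check_python_packages = function(pkgs = c("transformers", "torch")) {
  # Return NA for every package if reticulate or Python is unavailable
  if (!requireNamespace("reticulate", quietly = TRUE) ||
      !reticulate::py_available(initialize = TRUE)) {
    return(setNames(rep(NA, length(pkgs)), pkgs))
  }
  # TRUE/FALSE per package: can the module be imported?
  vapply(pkgs, reticulate::py_module_available, logical(1))
}

print(check_python_packages())
```

A result of FALSE or NA suggests revisiting the interpreter selected in Step 2 before rerunning the pip command.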
Alternative Approach
(Not suggested) Besides the pip/conda installation in the Conda Environment, you might instead create and use a Virtual Environment (see the R code below, which uses the reticulate package), but then you need to specify the Python interpreter as “~/.virtualenvs/r-reticulate/Scripts/python.exe” in RStudio.
## DON'T RUN THIS UNLESS YOU PREFER VIRTUAL ENVIRONMENT
library(reticulate)
# install_python()
virtualenv_create()
virtualenv_install(packages=c("transformers", "torch"))
Guidance for FMAT
FMAT Step 1: Query Design
Design queries that conceptually represent the constructs you want to measure (see Bao, 2024, JPSP for how to design queries).
Use FMAT_query() and/or FMAT_query_bind() to prepare a data.table of queries.
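As a rough illustration of query design, the sketch below builds two small query tables and binds them into one. The sentences and word lists are made-up examples, and the argument names (MASK, TARGET, ATTRIB) are assumptions based on the package documentation; see ?FMAT_query for the authoritative usage. The code runs only if FMAT is installed.

```r
# Sketch only: illustrative queries, not taken from the published studies.
if (requireNamespace("FMAT", quietly = TRUE)) {
  # Template 1: {TARGET} is replaced by each listed phrase;
  # the candidates for [MASK] are "He" vs. "She".
  query1 = FMAT::FMAT_query(
    "[MASK] is {TARGET}.",
    MASK = list(Male = "He", Female = "She"),
    TARGET = list(Occupation = c("a doctor", "a nurse", "an artist"))
  )
  # Template 2: {ATTRIB} is replaced by each listed word.
  query2 = FMAT::FMAT_query(
    "[MASK] occupation is {ATTRIB}.",
    MASK = list(Male = "His", Female = "Her"),
    ATTRIB = list(Occupation = c("doctor", "nurse", "artist"))
  )
  # Combine both query tables into a single data.table
  query = FMAT::FMAT_query_bind(query1, query2)
  print(query)
}
```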
FMAT Step 2: Model Loading
Use BERT_download() and FMAT_load() to (down)load BERT models. Model files are permanently saved to your local folder “%USERPROFILE%/.cache/huggingface”. A full list of BERT-family models is available at Hugging Face.
If you want to use GPU (see Guidance for GPU Acceleration), please skip to FMAT Step 3: Model Processing and directly use FMAT_run() without FMAT_load().
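To see which models are already saved locally, you can inspect the cache folder from R. This is a base-R sketch that assumes the default cache location (the "hub" subfolder of the folder named above); a custom HF_HOME environment variable would move it elsewhere.

```r
# Sketch: list locally cached Hugging Face models (base R only).
# USERPROFILE is set on Windows; fall back to the home directory elsewhere.
home = Sys.getenv("USERPROFILE", unset = path.expand("~"))
cache.dir = file.path(home, ".cache", "huggingface", "hub")

if (dir.exists(cache.dir)) {
  print(list.files(cache.dir))
} else {
  message("No cache folder found at ", cache.dir)
}
```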
FMAT Step 3: Model Processing
Use FMAT_run() to get raw data (probability estimates) for further analysis.
Several steps of pre-processing have been included in the function for easier use (see FMAT_run() for details).
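Steps 2 and 3 can be sketched together as follows. run_fmat_sketch() is a hypothetical wrapper written for this illustration; it assumes that FMAT is installed, that the models have already been downloaded, and that FMAT_run() accepts a gpu argument as described in ?FMAT_run. Defining the function does not run any model.

```r
# Sketch: a hypothetical wrapper around FMAT Steps 2-3 (not part of FMAT).
run_fmat_sketch = function(model.names, query, use.gpu = FALSE) {
  if (use.gpu) {
    # GPU route: pass model names directly and skip FMAT_load()
    FMAT::FMAT_run(model.names, query, gpu = TRUE)
  } else {
    # CPU route: load the models first, then run the queries
    models = FMAT::FMAT_load(model.names)
    FMAT::FMAT_run(models, query)
  }
}
```

For example, run_fmat_sketch("bert-base-uncased", query) would process a query table prepared with FMAT_query() on the CPU.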
- For BERT variants using <mask> rather than [MASK] as the mask token, the input query will be automatically modified so that users can always use [MASK] in query design.
- For some BERT variants, special prefix characters such as \u0120 and \u2581 will be automatically added to match the whole words (rather than subwords) for [MASK].
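The two pre-processing ideas above can be illustrated with plain string operations. This is a base-R sketch with hypothetical helper names; it is not the code FMAT_run() uses internally, and the actual logic may differ. As background, \u0120 is the word-start marker in byte-level BPE vocabularies (e.g., RoBERTa) and \u2581 plays the same role in SentencePiece vocabularies (e.g., ALBERT).

```r
# Sketch (base R only): hypothetical helpers, not FMAT internals.

# Swap the generic [MASK] placeholder for a model-specific mask token.
adapt_mask_token = function(query, mask.token) {
  gsub("[MASK]", mask.token, query, fixed = TRUE)
}

# Prepend a word-start prefix so that a candidate word is looked up
# as a whole-word token rather than as a subword piece.
add_word_prefix = function(words, prefix) {
  paste0(prefix, words)
}

print(adapt_mask_token("[MASK] is a nurse.", "<mask>"))
print(add_word_prefix(c("he", "she"), "\u0120"))
```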
Notes
- Improvements are ongoing, especially for adaptation to more diverse (less popular) BERT models.
- If you find bugs or have problems using the functions, please report them at GitHub Issues or send me an email.
Guidance for GPU Acceleration
By default, the FMAT package uses CPU to enable the functionality for all users. But for advanced users who want to accelerate the pipeline with GPU, the FMAT_run() function now supports using a GPU device, about 3x faster than CPU.
Test results (on the developer’s computer, depending on BERT model size):
- CPU (Intel 13th-Gen i7-1355U): 500~1000 queries/min
- GPU (NVIDIA GeForce RTX 2050): 1500~3000 queries/min
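These throughput ranges translate into rough runtime estimates. The sketch below uses a hypothetical workload of 60,000 queries; actual speed depends on the hardware and the model size.

```r
# Sketch: rough runtime estimate from the throughput ranges above.
# n.queries is a made-up workload; rates are in queries per minute.
estimate_minutes = function(n.queries, rate.per.min) {
  n.queries / rate.per.min
}

n.queries = 60000
cpu.minutes = estimate_minutes(n.queries, c(slow = 500, fast = 1000))
gpu.minutes = estimate_minutes(n.queries, c(slow = 1500, fast = 3000))

# Rows: device; columns: slow vs. fast end of the reported range
print(rbind(CPU = cpu.minutes, GPU = gpu.minutes))
```

Under these assumptions, the CPU would need about 60 to 120 minutes and the GPU about 20 to 40 minutes.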
Checklist:
- Ensure that you have an NVIDIA GPU device (e.g., GeForce RTX Series) and an NVIDIA GPU driver installed on your system.
- Install PyTorch (Python torch package) with CUDA support.
  - Find guidance for installation command at https://pytorch.org/get-started/locally/.
  - CUDA is available only on Windows and Linux, but not on macOS.
  - If you have installed a version of torch without CUDA support, please first uninstall it (command: pip uninstall torch) and then install the suggested one.
  - You may also install the corresponding version of CUDA Toolkit (e.g., for the torch version supporting CUDA 12.1, the same version of CUDA Toolkit 12.1 may also be installed).
Example code for installing PyTorch with CUDA support:
(Windows Command / Anaconda Prompt / RStudio Terminal)
pip install torch --index-url https://download.pytorch.org/whl/cu121
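After installing, you can check from R whether PyTorch detects a CUDA device. This is a minimal sketch that assumes the reticulate R package is installed; cuda_available() is a hypothetical helper, and it returns NA when Python or torch cannot be found.

```r
# Sketch: ask PyTorch (via reticulate) whether CUDA is usable.
# cuda_available() is a hypothetical helper, not an FMAT function.
cuda_available = function() {
  if (!requireNamespace("reticulate", quietly = TRUE) ||
      !reticulate::py_available(initialize = TRUE) ||
      !reticulate::py_module_available("torch")) {
    return(NA)  # cannot tell: Python or torch is missing
  }
  torch = reticulate::import("torch")
  torch$cuda$is_available()  # TRUE if a CUDA device can be used
}

print(cuda_available())
```

If this prints FALSE on a machine with an NVIDIA GPU, the installed torch build probably lacks CUDA support (see the checklist above).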
BERT Models
The reliability and validity of the following 12 representative BERT models have been established in my research articles, but future work is needed to examine the performance of other models.
(model name on Hugging Face - downloaded model file size)
- bert-base-uncased (420 MB)
- bert-base-cased (416 MB)
- bert-large-uncased (1283 MB)
- bert-large-cased (1277 MB)
- distilbert-base-uncased (256 MB)
- distilbert-base-cased (251 MB)
- albert-base-v1 (45 MB)
- albert-base-v2 (45 MB)
- roberta-base (476 MB)
- distilroberta-base (316 MB)
- vinai/bertweet-base (517 MB)
- vinai/bertweet-large (1356 MB)
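To plan disk space, the file sizes listed above can be added up. This base-R sketch simply sums the listed values and makes no call to FMAT or Hugging Face.

```r
# Sketch: total disk space for the 12 models, using the sizes listed above (MB).
model.size.mb = c(
  "bert-base-uncased" = 420,
  "bert-base-cased" = 416,
  "bert-large-uncased" = 1283,
  "bert-large-cased" = 1277,
  "distilbert-base-uncased" = 256,
  "distilbert-base-cased" = 251,
  "albert-base-v1" = 45,
  "albert-base-v2" = 45,
  "roberta-base" = 476,
  "distilroberta-base" = 316,
  "vinai/bertweet-base" = 517,
  "vinai/bertweet-large" = 1356
)

total.mb = sum(model.size.mb)
cat(sprintf("Total: %d MB (about %.1f GB)\n", total.mb, total.mb / 1024))
```

The total is 6658 MB, or roughly 6.5 GB, so about 7 GB of free space is a reasonable target before downloading all 12 models.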
If you are new to BERT, these references can be helpful:
- What is Fill-Mask? [HuggingFace]
- An Explorable BERT [HuggingFace]
- BERT Model Documentation [HuggingFace]
- BERT Explained
- Breaking BERT Down
- Illustrated BERT
- Visual Guide to BERT
library(FMAT)
model.names = c(
"bert-base-uncased",
"bert-base-cased",
"bert-large-uncased",
"bert-large-cased",
"distilbert-base-uncased",
"distilbert-base-cased",
"albert-base-v1",
"albert-base-v2",
"roberta-base",
"distilroberta-base",
"vinai/bertweet-base",
"vinai/bertweet-large"
)
BERT_download(model.names)
ℹ Device Info:
Python Environment:
Package        Version
transformers   4.38.2
torch          2.2.1+cu121
NVIDIA GPU CUDA Support:
CUDA Enabled: TRUE
CUDA Version: 12.1
GPU (Device): NVIDIA GeForce RTX 2050
── Downloading model "bert-base-uncased" ───────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 570/570 [00:00<00:00, 113kB/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 48.0/48.0 [00:00<?, ?B/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.37MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 3.94MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 440M/440M [01:21<00:00, 5.40MB/s]
✔ Successfully downloaded model "bert-base-uncased"
── Downloading model "bert-base-cased" ─────────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 570/570 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 49.0/49.0 [00:00<00:00, 8.18kB/s]
vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 1.30MB/s]
tokenizer.json: 100%|██████████| 436k/436k [00:00<00:00, 3.67MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 436M/436M [01:20<00:00, 5.41MB/s]
✔ Successfully downloaded model "bert-base-cased"
── Downloading model "bert-large-uncased" ──────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 571/571 [00:00<00:00, 143kB/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 48.0/48.0 [00:00<00:00, 12.0kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 6.04MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.57MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 1.34G/1.34G [04:09<00:00, 5.39MB/s]
✔ Successfully downloaded model "bert-large-uncased"
── Downloading model "bert-large-cased" ────────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 762/762 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 49.0/49.0 [00:00<?, ?B/s]
vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 2.14MB/s]
tokenizer.json: 100%|██████████| 436k/436k [00:00<00:00, 1.75MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 1.34G/1.34G [04:08<00:00, 5.38MB/s]
✔ Successfully downloaded model "bert-large-cased"
── Downloading model "distilbert-base-uncased" ─────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 483/483 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<?, ?B/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.36MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.82MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 268M/268M [00:51<00:00, 5.24MB/s]
✔ Successfully downloaded model "distilbert-base-uncased"
── Downloading model "distilbert-base-cased" ───────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 465/465 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 29.0/29.0 [00:00<?, ?B/s]
vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 1.34MB/s]
tokenizer.json: 100%|██████████| 436k/436k [00:00<00:00, 4.20MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 263M/263M [00:49<00:00, 5.36MB/s]
✔ Successfully downloaded model "distilbert-base-cased"
── Downloading model "albert-base-v1" ──────────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 684/684 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 25.0/25.0 [00:00<00:00, 1.65kB/s]
spiece.model: 100%|██████████| 760k/760k [00:00<00:00, 4.58MB/s]
tokenizer.json: 100%|██████████| 1.31M/1.31M [00:00<00:00, 3.09MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 47.4M/47.4M [00:09<00:00, 5.07MB/s]
✔ Successfully downloaded model "albert-base-v1"
── Downloading model "albert-base-v2" ──────────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 684/684 [00:00<00:00, 45.5kB/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 25.0/25.0 [00:00<?, ?B/s]
spiece.model: 100%|██████████| 760k/760k [00:00<00:00, 2.13MB/s]
tokenizer.json: 100%|██████████| 1.31M/1.31M [00:00<00:00, 5.66MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 47.4M/47.4M [00:08<00:00, 5.51MB/s]
✔ Successfully downloaded model "albert-base-v2"
── Downloading model "roberta-base" ────────────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 481/481 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 25.0/25.0 [00:00<?, ?B/s]
vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 5.73MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 6.16MB/s]
tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 5.50MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 499M/499M [01:32<00:00, 5.38MB/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Successfully downloaded model "roberta-base"
── Downloading model "distilroberta-base" ──────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 480/480 [00:00<00:00, 30.7kB/s]
→ (2) Downloading tokenizer...
tokenizer_config.json: 100%|██████████| 25.0/25.0 [00:00<00:00, 7.98kB/s]
vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 5.18MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 5.71MB/s]
tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 3.83MB/s]
→ (3) Downloading model...
model.safetensors: 100%|██████████| 331M/331M [01:01<00:00, 5.39MB/s]
✔ Successfully downloaded model "distilroberta-base"
── Downloading model "vinai/bertweet-base" ─────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 558/558 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
vocab.txt: 100%|██████████| 843k/843k [00:00<00:00, 5.56MB/s]
bpe.codes: 100%|██████████| 1.08M/1.08M [00:00<00:00, 5.55MB/s]
tokenizer.json: 100%|██████████| 2.91M/2.91M [00:00<00:00, 5.50MB/s]
emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0
→ (3) Downloading model...
pytorch_model.bin: 100%|██████████| 543M/543M [01:40<00:00, 5.39MB/s]
✔ Successfully downloaded model "vinai/bertweet-base"
── Downloading model "vinai/bertweet-large" ────────────────────────────────────────
→ (1) Downloading configuration...
config.json: 100%|██████████| 614/614 [00:00<?, ?B/s]
→ (2) Downloading tokenizer...
vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 5.59MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 5.04MB/s]
tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 5.42MB/s]
→ (3) Downloading model...
pytorch_model.bin: 100%|██████████| 1.42G/1.42G [04:23<00:00, 5.40MB/s]
Some weights of RobertaModel were not initialized from the model checkpoint at vinai/bertweet-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Successfully downloaded model "vinai/bertweet-large"
── Downloaded models: ──
                            Size
albert-base-v1             45 MB
albert-base-v2             45 MB
bert-base-cased           416 MB
bert-base-uncased         420 MB
bert-large-cased         1277 MB
bert-large-uncased       1283 MB
distilbert-base-cased     251 MB
distilbert-base-uncased   256 MB
distilroberta-base        316 MB
roberta-base              476 MB
vinai/bertweet-base       517 MB
vinai/bertweet-large     1356 MB
✔ Downloaded models saved at C:/Users/Bruce/.cache/huggingface/hub (6.52 GB)
(Tested 2024/03 on the developer’s computer: HP Probook 450 G10 Notebook PC)
Related Packages
While the FMAT is an innovative method for the computationally intelligent analysis of psychology and society, you may also look for an integrative toolbox for other text-analytic methods. Another R package I developed, PsychWordVec, is useful and user-friendly for word embedding analysis (e.g., the Word Embedding Association Test, WEAT). Please refer to its documentation and feel free to use it.