
Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm.
Source: R/02-static.R
Usage
train_wordvec(
  text,
  method = c("word2vec", "glove", "fasttext"),
  dims = 300,
  window = 5,
  min.freq = 5,
  threads = 8,
  model = c("skip-gram", "cbow"),
  loss = c("ns", "hs"),
  negative = 5,
  subsample = 1e-04,
  learning = 0.05,
  ngrams = c(3, 6),
  x.max = 10,
  convergence = -1,
  stopwords = character(0),
  encoding = "UTF-8",
  tolower = FALSE,
  normalize = FALSE,
  iteration,
  tokenizer,
  remove,
  file.save,
  compress = "bzip2",
  verbose = TRUE
)
Arguments
- text
A character vector of text, or a file path on disk containing text.
- method
Training algorithm: "word2vec" (default), "glove", or "fasttext".
- dims
Number of dimensions of word vectors to be trained. Common choices include 50, 100, 200, 300, and 500. Defaults to 300.
- window
Window size (number of nearby words behind/ahead of the current word). It defines how many surrounding words are included in training: [window] words behind and [window] words ahead of the current word ([window]*2 in total). Defaults to 5.
- min.freq
Minimum frequency of words to be included in training. Words that appear fewer than this number of times will be excluded from the vocabulary. Defaults to 5 (i.e., only words that appear at least five times are kept).
- threads
Number of CPU threads used for training. A modest value produces the fastest training; too many threads are not always helpful. Defaults to 8.
- model
<Only for Word2Vec / FastText>
Learning model architecture:
"skip-gram" (default): Skip-Gram, which predicts surrounding words given the current word
"cbow": Continuous Bag-of-Words, which predicts the current word based on the context
- loss
<Only for Word2Vec / FastText>
Loss function (computationally efficient approximation):
"ns" (default): Negative Sampling
"hs": Hierarchical Softmax
- negative
<Only for Negative Sampling in Word2Vec / FastText>
Number of negative examples. Values in the range 5~20 are useful for small training datasets, while for large datasets the value can be as small as 2~5. Defaults to 5.
- subsample
<Only for Word2Vec / FastText>
Subsampling of frequent words (threshold for the occurrence of words). Words that appear with higher frequency in the training data will be randomly down-sampled. Defaults to 0.0001 (1e-04).
- learning
<Only for Word2Vec / FastText>
Initial (starting) learning rate, also known as alpha. Defaults to 0.05.
- ngrams
<Only for FastText>
Minimal and maximal ngram length. Defaults to c(3, 6).
- x.max
<Only for GloVe>
Maximum number of co-occurrences to use in the weighting function. Defaults to 10.
- convergence
<Only for GloVe>
Convergence tolerance for SGD iterations. Defaults to -1.
- stopwords
<Only for Word2Vec / GloVe>
A character vector of stopwords to be excluded from training.
- encoding
Text encoding. Defaults to "UTF-8".
- tolower
Convert all upper-case characters to lower-case? Defaults to FALSE.
- normalize
Normalize all word vectors to unit length? Defaults to FALSE. See normalize.
- iteration
Number of training iterations. More iterations make a more precise model, but the computational cost is linearly proportional to the number of iterations. Defaults to 5 for Word2Vec and FastText and 10 for GloVe.
- tokenizer
Function used to tokenize the text. Defaults to text2vec::word_tokenizer.
- remove
Strings (in regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". You may turn this off by specifying remove=NULL.
- file.save
File name of to-be-saved R data (must be .RData).
- compress
Compression method for the saved file. Defaults to "bzip2". Options include:
1 or "gzip": modest file size (fastest)
2 or "bzip2": small file size (fast)
3 or "xz": minimized file size (slow)
- verbose
Print information to the console? Defaults to TRUE.
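The sketch below shows how several of these non-default options fit together: a CBOW architecture with Hierarchical Softmax, a custom stopword list, and lower-cased tokens. It is a minimal illustration rather than a recommended configuration; all argument names follow the Usage above, and text is assumed to be a character vector as in the Examples.

## Minimal sketch (illustration only): CBOW + Hierarchical Softmax with stopwords
dt_cbow = train_wordvec(
  text,
  method = "word2vec",
  model = "cbow",
  loss = "hs",
  dims = 50, window = 5,
  stopwords = c("the", "a", "an", "of"),
  tolower = TRUE,
  normalize = TRUE)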
Download
Download pre-trained word vectors data (.RData):
https://psychbruce.github.io/WordVector_RData.pdf
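Once downloaded, a pre-trained .RData file can be loaded into the current R session with base R's load(), which restores the saved object under the name it was stored with. A minimal sketch, with a hypothetical file name:

## Load downloaded pre-trained word vectors (file name is hypothetical)
load("wordvec_pretrained.RData")
ls()  # inspect which object(s) were restored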
Examples
review = text2vec::movie_review  # a data.frame
text = review$review
## Note: All the examples train 50 dims for faster code check.
## Word2Vec (SGNS)
dt1 = train_wordvec(
text,
method="word2vec",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: Word2Vec (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14205 unique tokens (time cost = 5 secs)
dt1
#> # wordvec (data.table): [14205 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.0920, ...<50 dims>] 58797
#> 2: and [ 0.2850, ...<50 dims>] 32193
#> 3: a [ 0.0397, ...<50 dims>] 31783
#> 4: of [ 0.2136, ...<50 dims>] 29142
#> 5: to [ 0.1510, ...<50 dims>] 27218
#> ------
#> 14201: drunks [ 0.2087, ...<50 dims>] 5
#> 14202: flea [ 0.1382, ...<50 dims>] 5
#> 14203: liquid [ 0.1316, ...<50 dims>] 5
#> 14204: LOTR [ 0.1956, ...<50 dims>] 5
#> 14205: morose [ 0.2606, ...<50 dims>] 5
most_similar(dt1, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: ive 0.8430820 110
#> 2: seen 0.8033601 943
#> 3: criticized 0.7790563 1542
#> 4: funniest 0.7717599 3487
#> 5: Weve 0.7634671 5131
#> 6: Possibly 0.7593620 6331
#> 7: lately 0.7549035 6928
#> 8: youve 0.7546543 7225
#> 9: animes 0.7498266 8453
#> 10: Hands 0.7487120 12820
most_similar(dt1, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8153507 260
#> 2: daughter 0.8037683 299
#> 3: girl 0.7874227 523
#> 4: child 0.7636914 547
#> 5: shes 0.7493600 623
most_similar(dt1, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.7865210 260
#> 2: shes 0.7125322 299
#> 3: woman 0.7097587 523
#> 4: kid 0.7088952 547
#> 5: child 0.6783223 674
## GloVe
dt2 = train_wordvec(
text,
method="glove",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: GloVe
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: N/A
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 10 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 10 secs)
dt2
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.0173, ...<50 dims>] 58797
#> 2: and [ 0.0584, ...<50 dims>] 32193
#> 3: a [ 0.0089, ...<50 dims>] 31783
#> 4: of [ 0.0465, ...<50 dims>] 29142
#> 5: to [-0.0207, ...<50 dims>] 27218
#> ------
#> 14203: yea [ 0.1827, ...<50 dims>] 5
#> 14204: yearly [-0.0442, ...<50 dims>] 5
#> 14205: yearning [ 0.0199, ...<50 dims>] 5
#> 14206: yelled [ 0.3556, ...<50 dims>] 5
#> 14207: yer [-0.0790, ...<50 dims>] 5
most_similar(dt2, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: seen 0.9358600 91
#> 2: ever 0.8966583 110
#> 3: worst 0.7652979 124
#> 4: heard 0.7553788 127
#> 5: since 0.7244759 261
#> 6: youve 0.7128120 262
#> 7: havent 0.7066291 305
#> 8: watched 0.7020304 515
#> 9: movies 0.6791108 767
#> 10: best 0.6758477 950
most_similar(dt2, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8710100 34
#> 2: girl 0.7657829 198
#> 3: young 0.7483093 260
#> 4: child 0.7434563 299
#> 5: who 0.7288085 523
most_similar(dt2, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8251125 153
#> 2: young 0.7573023 198
#> 3: woman 0.7339802 260
#> 4: named 0.7169584 299
#> 5: man 0.6813441 867
## FastText
dt3 = train_wordvec(
text,
method="fasttext",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: FastText (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 12 secs)
dt3
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.0558, ...<50 dims>] 58797
#> 2: and [ 0.0328, ...<50 dims>] 32193
#> 3: a [ 0.1769, ...<50 dims>] 31783
#> 4: of [ 0.1591, ...<50 dims>] 29142
#> 5: to [-0.0726, ...<50 dims>] 27218
#> ------
#> 14203: spray [ 0.1070, ...<50 dims>] 5
#> 14204: disabilities [ 0.1351, ...<50 dims>] 5
#> 14205: crook [ 0.0652, ...<50 dims>] 5
#> 14206: Syndrome [ 0.0327, ...<50 dims>] 5
#> 14207: snipers [ 0.1003, ...<50 dims>] 5
most_similar(dt3, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: Youve 0.8487351 110
#> 2: Weve 0.8264918 765
#> 3: seen 0.7998097 945
#> 4: youve 0.7919921 3250
#> 5: WORST 0.7798288 5898
#> 6: ve 0.7662747 6913
#> 7: Columbo 0.7539891 7108
#> 8: havent 0.7537364 8617
#> 9: beforehand 0.7511504 9171
#> 10: Daisies 0.7450552 12894
most_similar(dt3, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8750410 261
#> 2: henchman 0.7684046 299
#> 3: salesman 0.7669638 5594
#> 4: madman 0.7548135 6553
#> 5: girl 0.7518621 12263
most_similar(dt3, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8039851 261
#> 2: kid 0.7290113 299
#> 3: woman 0.7038419 676
#> 4: boys 0.7001571 1045
#> 5: widow 0.6801468 5504
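## To keep a trained wordvec across sessions, the file.save and compress arguments
## documented above can be supplied at training time, and the saved .RData can later
## be restored with load(). A minimal sketch with a hypothetical file name
## (other arguments as in the Word2Vec example):
dt4 = train_wordvec(
  text,
  method="word2vec",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE,
  file.save="movie_word2vec_50d.RData",
  compress="bzip2")
## In a later R session, restore the saved object
load("movie_word2vec_50d.RData")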