
Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm.
Source: R/02-static.R (train_wordvec.Rd)
Usage
train_wordvec(
  text,
  method = c("word2vec", "glove", "fasttext"),
  dims = 300,
  window = 5,
  min.freq = 5,
  threads = 8,
  model = c("skip-gram", "cbow"),
  loss = c("ns", "hs"),
  negative = 5,
  subsample = 1e-04,
  learning = 0.05,
  ngrams = c(3, 6),
  x.max = 10,
  convergence = -1,
  stopwords = character(0),
  encoding = "UTF-8",
  tolower = FALSE,
  normalize = FALSE,
  iteration,
  tokenizer,
  remove,
  file.save,
  compress = "bzip2",
  verbose = TRUE
)
Arguments
- text
A character vector of text, or a file path on disk containing text.
- method
Training algorithm: "word2vec" (default), "glove", or "fasttext".
- dims
Number of dimensions of word vectors to be trained. Common choices include 50, 100, 200, 300, and 500. Defaults to 300.
- window
Window size (number of nearby words behind/ahead of the current word). It defines how many surrounding words are included in training: [window] words behind and [window] words ahead ([window]*2 in total). Defaults to 5.
- min.freq
Minimum frequency of words to be included in training. Words that appear fewer times than this value will be excluded from the vocabulary. Defaults to 5 (i.e., only words appearing at least five times are kept).
- threads
Number of CPU threads used for training. A modest value produces the fastest training; too many threads are not always helpful. Defaults to 8.
- model
<Only for Word2Vec / FastText>
Learning model architecture:
"skip-gram" (default): Skip-Gram, which predicts surrounding words given the current word
"cbow": Continuous Bag-of-Words, which predicts the current word based on the context
- loss
<Only for Word2Vec / FastText>
Loss function (computationally efficient approximation):
"ns" (default): Negative Sampling
"hs": Hierarchical Softmax
- negative
<Only for Negative Sampling in Word2Vec / FastText>
Number of negative examples. Values in the range 5~20 are useful for small training datasets, while for large datasets the value can be as small as 2~5. Defaults to 5.
- subsample
<Only for Word2Vec / FastText>
Subsampling of frequent words (threshold for word occurrence). Words that appear with higher frequency in the training data will be randomly down-sampled. Defaults to 0.0001 (1e-04).
- learning
<Only for Word2Vec / FastText>
Initial (starting) learning rate, also known as alpha. Defaults to 0.05.
- ngrams
<Only for FastText>
Minimal and maximal ngram length. Defaults to c(3, 6).
- x.max
<Only for GloVe>
Maximum number of co-occurrences to use in the weighting function. Defaults to 10.
- convergence
<Only for GloVe>
Convergence tolerance for SGD iterations. Defaults to -1.
- stopwords
<Only for Word2Vec / GloVe>
A character vector of stopwords to be excluded from training.
- encoding
Text encoding. Defaults to "UTF-8".
- tolower
Convert all upper-case characters to lower-case? Defaults to FALSE.
- normalize
Normalize all word vectors to unit length? Defaults to FALSE. See normalize.
- iteration
Number of training iterations. More iterations make a more precise model, but computational cost is linearly proportional to the number of iterations. Defaults to 5 for Word2Vec and FastText and 10 for GloVe.
- tokenizer
Function used to tokenize the text. Defaults to text2vec::word_tokenizer.
- remove
Strings (as a regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". You may turn this off by specifying remove=NULL.
- file.save
File name of the to-be-saved R data file (must be .RData); see the sketch after this argument list.
- compress
Compression method for the saved file. Defaults to "bzip2". Options include:
1 or "gzip": modest file size (fastest)
2 or "bzip2": small file size (fast)
3 or "xz": minimized file size (slow)
- verbose
Print information to the console? Defaults to TRUE.
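For a concrete sense of how these arguments combine, here is a minimal sketch (not run). The file name "movie_wordvec.RData" is a hypothetical placeholder, and text is assumed to be a character vector such as the one used in the Examples below.
# Minimal sketch (not run): train FastText vectors and save them to disk.
# "movie_wordvec.RData" is a hypothetical file name used only for illustration.
dt = train_wordvec(
  text,
  method = "fasttext",
  model = "skip-gram",
  dims = 100, window = 5, min.freq = 5,
  ngrams = c(3, 6),
  normalize = TRUE,
  file.save = "movie_wordvec.RData",
  compress = "bzip2"
)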
Download
Download pre-trained word vectors data (.RData):
https://psychbruce.github.io/WordVector_RData.pdf
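These downloads are standard .RData files, so base R can load them. A minimal sketch, assuming a file has already been downloaded to the working directory; "word_vecs.RData" is a placeholder name, and the name of the object stored inside depends on the specific file:
# Sketch only: "word_vecs.RData" is a placeholder for a downloaded file.
loaded = load("word_vecs.RData")  # load() returns the name(s) of the loaded object(s)
print(loaded)                     # check which object(s) the file contains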
Examples
review = text2vec::movie_review # a data.frame
text = review$review
## Note: All the examples train 50 dims for a faster code check.
## Word2Vec (SGNS)
dt1 = train_wordvec(
text,
method="word2vec",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: Word2Vec (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14205 unique tokens (time cost = 5 secs)
dt1
#> # wordvec (data.table): [14205 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.0491, ...<50 dims>] 58797
#> 2: and [ 0.1014, ...<50 dims>] 32193
#> 3: a [-0.0011, ...<50 dims>] 31783
#> 4: of [-0.0503, ...<50 dims>] 29142
#> 5: to [ 0.0211, ...<50 dims>] 27218
#> ------
#> 14201: drunks [ 0.0411, ...<50 dims>] 5
#> 14202: flea [-0.0367, ...<50 dims>] 5
#> 14203: liquid [ 0.0864, ...<50 dims>] 5
#> 14204: LOTR [ 0.0415, ...<50 dims>] 5
#> 14205: morose [-0.0026, ...<50 dims>] 5
most_similar(dt1, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: ive 0.8554545 766
#> 2: Possibly 0.7726418 943
#> 3: Hands 0.7681225 1542
#> 4: youve 0.7506576 2564
#> 5: Holes 0.7481481 3487
#> 6: Weve 0.7437758 5300
#> 7: funniest 0.7426766 5709
#> 8: occasions 0.7330269 6331
#> 9: favorites 0.7324063 6928
#> 10: havent 0.7306304 7225
most_similar(dt1, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8391127 260
#> 2: girl 0.7895311 299
#> 3: widow 0.7391445 1814
#> 4: Marie 0.7178303 2338
#> 5: lonely 0.7158450 5485
most_similar(dt1, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8547386 197
#> 2: woman 0.7767119 260
#> 3: kid 0.7070389 299
#> 4: young 0.6804687 674
#> 5: aged 0.6775028 1836
## GloVe
dt2 = train_wordvec(
text,
method="glove",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: GloVe
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: N/A
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 10 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 10 secs)
dt2
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.2248, ...<50 dims>] 58797
#> 2: and [ 0.1703, ...<50 dims>] 32193
#> 3: a [ 0.1768, ...<50 dims>] 31783
#> 4: of [-0.0120, ...<50 dims>] 29142
#> 5: to [ 0.1712, ...<50 dims>] 27218
#> ------
#> 14203: yea [ 0.1299, ...<50 dims>] 5
#> 14204: yearly [-0.0281, ...<50 dims>] 5
#> 14205: yearning [-0.0889, ...<50 dims>] 5
#> 14206: yelled [-0.1296, ...<50 dims>] 5
#> 14207: yer [-0.1449, ...<50 dims>] 5
most_similar(dt2, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: seen 0.9304579 24
#> 2: ever 0.8837130 74
#> 3: worst 0.7618407 91
#> 4: heard 0.7599531 110
#> 5: since 0.7306617 124
#> 6: watched 0.7053683 261
#> 7: have 0.6843567 262
#> 8: been 0.6843400 305
#> 9: already 0.6757345 468
#> 10: movies 0.6696096 515
most_similar(dt2, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8767317 198
#> 2: child 0.7512722 260
#> 3: young 0.7464289 299
#> 4: hit 0.7401282 523
#> 5: girl 0.7238749 594
most_similar(dt2, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8122265 150
#> 2: named 0.7351005 198
#> 3: young 0.7237474 260
#> 4: woman 0.7198218 299
#> 5: old 0.6657580 867
## FastText
dt3 = train_wordvec(
text,
method="fasttext",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: FastText (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 12 secs)
dt3
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.0031, ...<50 dims>] 58797
#> 2: and [ 0.0294, ...<50 dims>] 32193
#> 3: a [ 0.0985, ...<50 dims>] 31783
#> 4: of [ 0.0670, ...<50 dims>] 29142
#> 5: to [-0.0271, ...<50 dims>] 27218
#> ------
#> 14203: spray [ 0.0782, ...<50 dims>] 5
#> 14204: disabilities [ 0.0619, ...<50 dims>] 5
#> 14205: crook [ 0.0310, ...<50 dims>] 5
#> 14206: Syndrome [-0.0005, ...<50 dims>] 5
#> 14207: snipers [ 0.0125, ...<50 dims>] 5
most_similar(dt3, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: Youve 0.8477870 110
#> 2: Weve 0.8298344 765
#> 3: seen 0.8187215 945
#> 4: youve 0.8012831 3250
#> 5: ve 0.7895936 3494
#> 6: havent 0.7800220 5898
#> 7: WORST 0.7585596 6913
#> 8: ive 0.7508475 7108
#> 9: beforehand 0.7484528 9171
#> 10: Columbo 0.7372960 12894
most_similar(dt3, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8824737 261
#> 2: girl 0.7754487 299
#> 3: salesman 0.7562189 5594
#> 4: henchman 0.7539343 6553
#> 5: madman 0.7441767 12263
most_similar(dt3, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8236447 261
#> 2: woman 0.7225464 299
#> 3: kid 0.6877228 676
#> 4: boys 0.6849813 1045
#> 5: teenager 0.6844702 2364