
Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm.
Source: R/02-static.R, train_wordvec.Rd
Usage
train_wordvec(
text,
method = c("word2vec", "glove", "fasttext"),
dims = 300,
window = 5,
min.freq = 5,
threads = 8,
model = c("skip-gram", "cbow"),
loss = c("ns", "hs"),
negative = 5,
subsample = 1e-04,
learning = 0.05,
ngrams = c(3, 6),
x.max = 10,
convergence = -1,
stopwords = character(0),
encoding = "UTF-8",
tolower = FALSE,
normalize = FALSE,
iteration,
tokenizer,
remove,
file.save,
compress = "bzip2",
verbose = TRUE
)
Arguments
- text
A character vector of text, or a file path on disk containing text.
- method
Training algorithm:
"word2vec" (default): Word2Vec
"glove": GloVe
"fasttext": FastText
- dims
Number of dimensions of word vectors to be trained. Common choices include 50, 100, 200, 300, and 500. Defaults to 300.
- window
Window size (number of nearby words behind/ahead the current word). It defines how many surrounding words are included in training: [window] words behind and [window] words ahead ([window]*2 in total). Defaults to 5.
- min.freq
Minimum frequency of words to be included in training. Words that appear fewer than this number of times will be excluded from the vocabulary. Defaults to 5 (i.e., only words that appear at least five times are kept).
- threads
Number of CPU threads used for training. A modest value produces the fastest training; too many threads are not always helpful. Defaults to 8.
- model
<Only for Word2Vec / FastText>
Learning model architecture:
"skip-gram" (default): Skip-Gram, which predicts surrounding words given the current word
"cbow": Continuous Bag-of-Words, which predicts the current word based on the context
- loss
<Only for Word2Vec / FastText>
Loss function (computationally efficient approximation):
"ns" (default): Negative Sampling
"hs": Hierarchical Softmax
- negative
<Only for Negative Sampling in Word2Vec / FastText>
Number of negative examples. Values in the range 5~20 are useful for small training datasets, while for large datasets the value can be as small as 2~5. Defaults to 5.
- subsample
<Only for Word2Vec / FastText>
Subsampling of frequent words (threshold for occurrence of words). Words that appear with higher frequency in the training data will be randomly down-sampled. Defaults to 0.0001 (1e-04).
- learning
<Only for Word2Vec / FastText>
Initial (starting) learning rate, also known as alpha. Defaults to 0.05.
- ngrams
<Only for FastText>
Minimal and maximal ngram length. Defaults to c(3, 6).
- x.max
<Only for GloVe>
Maximum number of co-occurrences to use in the weighting function. Defaults to 10.
- convergence
<Only for GloVe>
Convergence tolerance for SGD iterations. Defaults to -1.
- stopwords
<Only for Word2Vec / GloVe>
A character vector of stopwords to be excluded from training.
- encoding
Text encoding. Defaults to "UTF-8".
- tolower
Convert all upper-case characters to lower-case? Defaults to FALSE.
- normalize
Normalize all word vectors to unit length? Defaults to FALSE. See normalize.
- iteration
Number of training iterations. More iterations make a more precise model, but the computational cost is linearly proportional to the number of iterations. Defaults to 5 for Word2Vec and FastText and 10 for GloVe.
- tokenizer
Function used to tokenize the text. Defaults to text2vec::word_tokenizer.
- remove
Strings (in regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". You may turn this off by specifying remove=NULL.
- file.save
File name of to-be-saved R data (must be .RData). See the sketch after this argument list for how file.save and compress work together.
- compress
Compression method for the saved file. Defaults to "bzip2". Options include:
1 or "gzip": modest file size (fastest)
2 or "bzip2": small file size (fast)
3 or "xz": minimized file size (slow)
- verbose
Print information to the console? Defaults to TRUE.
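The sketch below is not part of the original examples; it illustrates how the saving-related arguments combine with a stopword list. The file name "movie_wordvec.RData" and the stopword choices are placeholders, and dims/window are kept small only for speed.

text = text2vec::movie_review$review  # example corpus also used in Examples below
wv = train_wordvec(
  text,
  method = "word2vec",
  model = "skip-gram",
  dims = 50, window = 5,
  stopwords = c("the", "a", "of"),    # illustrative stopword list
  normalize = TRUE,
  file.save = "movie_wordvec.RData",  # placeholder .RData file name
  compress = "xz")                    # smallest file size, slowest compression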
Download
Download pre-trained word vectors data (.RData):
https://psychbruce.github.io/WordVector_RData.pdf
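Once downloaded, a .RData file can be loaded with base R; the file name below is a placeholder for whichever pre-trained data file you obtain.

load("pretrained_wordvec.RData")  # placeholder file name; loads the saved object into the workspace
ls()                              # check the name of the newly loaded object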
Examples
review = text2vec::movie_review # a data.frame
text = review$review
## Note: All the examples train 50 dims for faster code check.
## Word2Vec (SGNS)
dt1 = train_wordvec(
text,
method="word2vec",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: Word2Vec (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14205 unique tokens (time cost = 5 secs)
dt1
#> # wordvec (data.table): [14205 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.1656, ...<50 dims>] 58797
#> 2: and [ 0.2045, ...<50 dims>] 32193
#> 3: a [ 0.3494, ...<50 dims>] 31783
#> 4: of [ 0.2850, ...<50 dims>] 29142
#> 5: to [ 0.2042, ...<50 dims>] 27218
#> ------
#> 14201: drunks [ 0.2062, ...<50 dims>] 5
#> 14202: flea [ 0.1849, ...<50 dims>] 5
#> 14203: liquid [ 0.1972, ...<50 dims>] 5
#> 14204: LOTR [ 0.1978, ...<50 dims>] 5
#> 14205: morose [ 0.2178, ...<50 dims>] 5
most_similar(dt1, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: ive 0.8429288 110
#> 2: Youve 0.8357433 3487
#> 3: seen 0.8102906 3586
#> 4: Weve 0.8083023 4252
#> 5: Bollywood 0.8058193 4958
#> 6: lately 0.7966336 5131
#> 7: Honestly 0.7906977 5989
#> 8: Hands 0.7880533 6331
#> 9: Possibly 0.7835832 6928
#> 10: EVER 0.7807971 7225
most_similar(dt1, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8276659 260
#> 2: girl 0.8064163 299
#> 3: lady 0.7874162 523
#> 4: child 0.7652093 750
#> 5: herself 0.7614746 1140
most_similar(dt1, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.7993256 150
#> 2: woman 0.7154416 260
#> 3: lady 0.7129868 299
#> 4: child 0.6833631 523
#> 5: old 0.6626918 1140
## GloVe
dt2 = train_wordvec(
text,
method="glove",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: GloVe
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: N/A
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 10 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 10 secs)
dt2
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [-0.0137, ...<50 dims>] 58797
#> 2: and [ 0.0008, ...<50 dims>] 32193
#> 3: a [-0.0542, ...<50 dims>] 31783
#> 4: of [-0.0098, ...<50 dims>] 29142
#> 5: to [-0.0002, ...<50 dims>] 27218
#> ------
#> 14203: yea [ 0.2000, ...<50 dims>] 5
#> 14204: yearly [ 0.0159, ...<50 dims>] 5
#> 14205: yearning [ 0.0174, ...<50 dims>] 5
#> 14206: yelled [ 0.3742, ...<50 dims>] 5
#> 14207: yer [-0.0544, ...<50 dims>] 5
most_similar(dt2, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: seen 0.9347657 110
#> 2: ever 0.8857054 124
#> 3: heard 0.7634097 127
#> 4: worst 0.7576497 261
#> 5: since 0.7205904 262
#> 6: havent 0.7159107 305
#> 7: watched 0.7016216 468
#> 8: youve 0.6929644 515
#> 9: already 0.6788851 767
#> 10: best 0.6756883 950
most_similar(dt2, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8769695 198
#> 2: young 0.7668200 260
#> 3: girl 0.7626733 299
#> 4: child 0.7506577 523
#> 5: hit 0.7483415 594
most_similar(dt2, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8216121 153
#> 2: young 0.7601146 198
#> 3: woman 0.7331937 260
#> 4: named 0.7299843 299
#> 5: man 0.6748270 867
## FastText
dt3 = train_wordvec(
text,
method="fasttext",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: FastText (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 12 secs)
dt3
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [-0.0648, ...<50 dims>] 58797
#> 2: and [ 0.0762, ...<50 dims>] 32193
#> 3: a [ 0.0280, ...<50 dims>] 31783
#> 4: of [ 0.1054, ...<50 dims>] 29142
#> 5: to [-0.0241, ...<50 dims>] 27218
#> ------
#> 14203: spray [ 0.1294, ...<50 dims>] 5
#> 14204: disabilities [ 0.1156, ...<50 dims>] 5
#> 14205: crook [ 0.0155, ...<50 dims>] 5
#> 14206: Syndrome [ 0.0243, ...<50 dims>] 5
#> 14207: snipers [-0.0207, ...<50 dims>] 5
most_similar(dt3, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: Youve 0.8410995 110
#> 2: Weve 0.8326673 765
#> 3: seen 0.8123911 945
#> 4: WORST 0.7928248 5898
#> 5: youve 0.7796043 6913
#> 6: havent 0.7602517 7108
#> 7: Daisies 0.7538066 8617
#> 8: BETTER 0.7528388 9171
#> 9: ve 0.7524821 12545
#> 10: beforehand 0.7523178 12894
most_similar(dt3, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8891171 261
#> 2: girl 0.7929141 299
#> 3: salesman 0.7764813 5594
#> 4: henchman 0.7543738 12263
#> 5: caveman 0.7449740 12884
most_similar(dt3, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.7815206 150
#> 2: boys 0.7069117 261
#> 3: kid 0.7003743 299
#> 4: woman 0.6952878 676
#> 5: old 0.6799686 1045
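The Word2Vec and FastText results above return case variants such as "ive", "Youve", and "Weve" as nearest neighbors of "Ive". A rough sketch (output omitted) of merging such variants is to retrain with tolower=TRUE and query the lower-cased form:

dt1b = train_wordvec(
  text,
  method="word2vec",
  model="skip-gram",
  dims=50, window=5,
  tolower=TRUE,      # lower-case the corpus before training
  normalize=TRUE)
most_similar(dt1b, "ive")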