
Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm.
Source: R/02-static.R
train_wordvec.Rd

Usage
train_wordvec(
text,
method = c("word2vec", "glove", "fasttext"),
dims = 300,
window = 5,
min.freq = 5,
threads = 8,
model = c("skip-gram", "cbow"),
loss = c("ns", "hs"),
negative = 5,
subsample = 1e-04,
learning = 0.05,
ngrams = c(3, 6),
x.max = 10,
convergence = -1,
stopwords = character(0),
encoding = "UTF-8",
tolower = FALSE,
normalize = FALSE,
iteration,
tokenizer,
remove,
file.save,
compress = "bzip2",
verbose = TRUE
)

Arguments
- text
A character vector of text, or a file path on disk containing text.
- method
Training algorithm:
"word2vec"(default): usingword2vec::word2vec()"glove": usingrsparse::GloVe()andtext2vec::text2vec()"fasttext": usingfastTextR::ft_train()
- dims
Number of dimensions of word vectors to be trained. Common choices include 50, 100, 200, 300, and 500. Defaults to 300.
- window
Window size (number of nearby words behind/ahead of the current word). It defines how many surrounding words are included in training: [window] words behind and [window] words ahead ([window]*2 in total). Defaults to 5.
- min.freq
Minimum frequency of words to be included in training. Words that appear fewer than this number of times will be excluded from the vocabulary. Defaults to 5 (i.e., only words that appear at least five times are kept).
- threads
Number of CPU threads used for training. A modest value usually produces the fastest training; using too many threads is not always helpful. Defaults to 8.
- model
[Only for Word2Vec / FastText] Learning model architecture:
"skip-gram"(default): Skip-Gram, which predicts surrounding words given the current word"cbow": Continuous Bag-of-Words, which predicts the current word based on the context
- loss
[Only for Word2Vec / FastText] Loss function (computationally efficient approximation):
"ns"(default): Negative Sampling"hs": Hierarchical Softmax
- negative
[Only for Negative Sampling in Word2Vec / FastText] Number of negative examples. Values in the range 5~20 are useful for small training datasets, while for large datasets the value can be as small as 2~5. Defaults to 5.
- subsample
[Only for Word2Vec / FastText] Subsampling of frequent words (threshold for occurrence of words). Those that appear with higher frequency in the training data will be randomly down-sampled. Defaults to 0.0001 (1e-04).
- learning
[Only for Word2Vec / FastText] Initial (starting) learning rate, also known as alpha. Defaults to 0.05.
- ngrams
[Only for FastText] Minimal and maximal ngram length. Defaults to c(3, 6).
- x.max
[Only for GloVe] Maximum number of co-occurrences to use in the weighting function. Defaults to 10.
- convergence
[Only for GloVe] Convergence tolerance for SGD iterations. Defaults to -1.
- stopwords
[Only for Word2Vec / GloVe] A character vector of stopwords to be excluded from training.
- encoding
Text encoding. Defaults to "UTF-8".
- tolower
Convert all upper-case characters to lower-case? Defaults to FALSE.
- normalize
Normalize all word vectors to unit length? Defaults to FALSE. See normalize().
- iteration
Number of training iterations. More iterations make a more precise model, but the computational cost is linearly proportional to the number of iterations. Defaults to 5 for Word2Vec and FastText, and 10 for GloVe.
- tokenizer
Function used to tokenize the text. Defaults to text2vec::word_tokenizer().
- remove
Strings (in regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". You may turn this off by specifying remove=NULL.
- file.save
File name of the to-be-saved R data (must be .RData); see the sketch after this argument list.
- compress
Compression method for the saved file. Defaults to "bzip2".
1 or "gzip": modest file size (fastest)
2 or "bzip2": small file size (fast)
3 or "xz": minimized file size (slow)
- verbose
Print information to the console? Defaults to TRUE.
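
A minimal sketch of training and saving in one call, combining the file.save and compress arguments described above (the file name is hypothetical and the code is not run here):

dt = train_wordvec(
  text,
  method = "word2vec",
  dims = 50,
  file.save = "my_wordvec_50d.RData",  # hypothetical file name (must be .RData)
  compress = "xz"                      # minimized file size (slower to compress)
)
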
Download
Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf
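
A minimal sketch of loading one of these downloaded .RData files with base R, assuming a hypothetical file name (the object stored inside depends on the specific download):

load("wordvec_pretrained.RData")  # hypothetical file name; see the PDF above for actual files
ls()                              # check which object(s) were restored into the workspace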
Examples
review = text2vec::movie_review  # a data.frame
text = review$review
## Note: All the examples train 50 dims for faster code check.
## Word2Vec (SGNS)
dt1 = train_wordvec(
text,
method="word2vec",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: Word2Vec (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14205 unique tokens (time cost = 5 secs)
dt1
#> # wordvec (data.table): [14205 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.1920, ...<50 dims>] 58797
#> 2: and [ 0.0153, ...<50 dims>] 32193
#> 3: a [ 0.0948, ...<50 dims>] 31783
#> 4: of [-0.0234, ...<50 dims>] 29142
#> 5: to [ 0.1207, ...<50 dims>] 27218
#> ------
#> 14201: drunks [ 0.1232, ...<50 dims>] 5
#> 14202: flea [ 0.1208, ...<50 dims>] 5
#> 14203: liquid [ 0.1434, ...<50 dims>] 5
#> 14204: LOTR [ 0.1306, ...<50 dims>] 5
#> 14205: morose [ 0.1506, ...<50 dims>] 5
most_similar(dt1, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: ive 0.8877912 110
#> 2: seen 0.8031776 766
#> 3: havent 0.7913808 943
#> 4: criticized 0.7844135 1131
#> 5: Bollywood 0.7789753 3127
#> 6: dub 0.7767296 3487
#> 7: youve 0.7697009 3586
#> 8: WORST 0.7686428 4821
#> 9: quote 0.7680053 6904
#> 10: recent 0.7652648 8453
most_similar(dt1, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8964092 260
#> 2: girl 0.8416096 299
#> 3: lonely 0.7838781 478
#> 4: boy 0.7815670 1213
#> 5: married 0.7659517 2338
most_similar(dt1, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8561397 197
#> 2: woman 0.7813664 260
#> 3: aged 0.7309787 299
#> 4: young 0.7034726 1140
#> 5: lady 0.6838961 1836
## GloVe
dt2 = train_wordvec(
text,
method="glove",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: GloVe
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: N/A
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 10 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 10 secs)
dt2
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.1325, ...<50 dims>] 58797
#> 2: and [ 0.0807, ...<50 dims>] 32193
#> 3: a [ 0.1503, ...<50 dims>] 31783
#> 4: of [-0.1085, ...<50 dims>] 29142
#> 5: to [ 0.1437, ...<50 dims>] 27218
#> ------
#> 14203: yea [ 0.1552, ...<50 dims>] 5
#> 14204: yearly [ 0.0437, ...<50 dims>] 5
#> 14205: yearning [-0.0816, ...<50 dims>] 5
#> 14206: yelled [-0.1324, ...<50 dims>] 5
#> 14207: yer [-0.0856, ...<50 dims>] 5
most_similar(dt2, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: seen 0.9229199 24
#> 2: ever 0.8886434 91
#> 3: worst 0.7852577 110
#> 4: heard 0.7593296 124
#> 5: since 0.7340906 261
#> 6: watched 0.7040197 262
#> 7: read 0.6937020 305
#> 8: movies 0.6933035 330
#> 9: already 0.6910354 468
#> 10: have 0.6736166 515
most_similar(dt2, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8647834 34
#> 2: young 0.7469375 198
#> 3: child 0.7445291 260
#> 4: girl 0.7283451 299
#> 5: who 0.7177403 523
most_similar(dt2, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.8038970 150
#> 2: woman 0.7168044 198
#> 3: young 0.7102703 260
#> 4: named 0.7032806 299
#> 5: old 0.6642969 867
## FastText
dt3 = train_wordvec(
text,
method="fasttext",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
#> ✔ Tokenized: 70105 sentences (time cost = 2 secs)
#> ✔ Text corpus: 5242249 characters, 1185427 tokens (roughly words)
#>
#> ── Training model information ──────────────────────────────────────────────────
#> - Method: FastText (Skip-Gram with Negative Sampling)
#> - Dimensions: 50
#> - Window size: 5 (5 words behind and 5 words ahead the current word)
#> - Subsampling: 1e-04
#> - Min. freq.: 5 occurrences in text
#> - Iterations: 5 training iterations
#> - CPU threads: 8
#>
#> ── Training...
#> ✔ Word vectors trained: 14207 unique tokens (time cost = 12 secs)
dt3
#> # wordvec (data.table): [14207 × 3] (normalized)
#> word vec freq
#> 1: the [ 0.0145, ...<50 dims>] 58797
#> 2: and [-0.0386, ...<50 dims>] 32193
#> 3: a [ 0.0273, ...<50 dims>] 31783
#> 4: of [ 0.0613, ...<50 dims>] 29142
#> 5: to [-0.0866, ...<50 dims>] 27218
#> ------
#> 14203: spray [ 0.0716, ...<50 dims>] 5
#> 14204: disabilities [ 0.1356, ...<50 dims>] 5
#> 14205: crook [ 0.0305, ...<50 dims>] 5
#> 14206: Syndrome [-0.0106, ...<50 dims>] 5
#> 14207: snipers [ 0.0300, ...<50 dims>] 5
most_similar(dt3, "Ive") # evaluate performance
#> [Word Vector] =~ Ive
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: Youve 0.8561718 110
#> 2: seen 0.8367139 945
#> 3: Weve 0.8362379 2612
#> 4: youve 0.8028984 3105
#> 5: WORST 0.7753204 3494
#> 6: ve 0.7745071 5898
#> 7: ive 0.7565490 6913
#> 8: funnier 0.7524717 7108
#> 9: weve 0.7469759 8617
#> 10: Daisies 0.7461700 12894
most_similar(dt3, ~ man - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ man - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: woman 0.8977993 261
#> 2: girl 0.7848848 299
#> 3: salesman 0.7776333 5594
#> 4: madman 0.7481051 6553
#> 5: henchman 0.7462895 12263
most_similar(dt3, ~ boy - he + she, topn=5) # evaluate performance
#> [Word Vector] =~ boy - he + she
#> (normalized to unit length)
#> word cos_sim row_id
#> <char> <num> <int>
#> 1: girl 0.7819794 261
#> 2: boys 0.7061669 299
#> 3: woman 0.6909817 523
#> 4: kid 0.6832291 676
#> 5: child 0.6823652 1045
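## A minimal sketch (not run): if a model is trained with normalize=FALSE, its
## vectors can still be normalized to unit length afterwards via normalize(),
## referenced in the Arguments section above; the exact call is an assumption
## here, see ?normalize for details.
# dt_raw = train_wordvec(text, method="word2vec", dims=50, normalize=FALSE)
# dt_norm = normalize(dt_raw)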