Extract a subset of word vectors data (with S3 methods).
Source: R/01-basic.R
data_wordvec_subset.Rd

Extract a subset of word vectors data (with S3 methods).
You may specify either a wordvec or embed object
(loaded by data_wordvec_load)
or an .RData file transformed by data_transform.
Arguments

- x
  Can be:
  - a wordvec or embed object loaded by data_wordvec_load
  - an .RData file transformed by data_transform
- words
  [Option 1] Character string(s).
- pattern
  [Option 2] Regular expression (see str_subset).
  If neither words nor pattern are specified (i.e., both are NULL), then all words in the data will be extracted.
- as
  Reshape to wordvec (data.table) or embed (matrix). Defaults to the original class of x.
- file.save
  File name of to-be-saved R data (must be .RData).
- compress
  Compression method for the saved file. Defaults to "bzip2". Options include:
  - 1 or "gzip": modest file size (fastest)
  - 2 or "bzip2": small file size (fast)
  - 3 or "xz": minimized file size (slow)
- compress.level
  Compression level from 0 (none) to 9 (maximal compression for minimal file size). Defaults to 9.
- verbose
  Print information to the console? Defaults to TRUE.
- ...
  Parameters passed to data_wordvec_subset when using the S3 method subset.
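The `as` argument is not exercised in the Examples below, so here is a minimal sketch of reshaping during extraction. It assumes the PsychWordVec package and its built-in `demodata` are available, and that `as` accepts the class name as a string (consistent with the default "original class of x" behavior described above):

```r
library(PsychWordVec)  # assumed installed; provides demodata and the subset methods

## subset keeps the original class by default: demodata is a wordvec,
## so a wordvec (data.table) is returned
dt = subset(demodata, c("China", "Japan"))

## reshape to an embed (matrix) during extraction via `as`
m = subset(demodata, c("China", "Japan"), as = "embed")
```

Reshaping at extraction time avoids a separate as_embed() conversion pass over the subset.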
Download
Download pre-trained word vectors data (.RData):
https://psychbruce.github.io/WordVector_RData.pdf
Examples
## directly use `embed[i, j]` (3x faster than `wordvec`):
d = as_embed(demodata)
d[1:5]
#> # embed (matrix): [5 × 300] (NOT normalized)
#> dim1 ... dim300
#> 1: in 0.0703 ... <300 dims>
#> 2: for -0.0118 ... <300 dims>
#> 3: that -0.0157 ... <300 dims>
#> 4: is 0.0070 ... <300 dims>
#> 5: on 0.0267 ... <300 dims>
d["people"]
#> # embed (matrix): [1 × 300] (NOT normalized)
#> dim1 ... dim300
#> 1: people 0.2637 ... <300 dims>
d[c("China", "Japan", "Korea")]
#> # embed (matrix): [3 × 300] (NOT normalized)
#> dim1 ... dim300
#> 1: China -0.0732 ... <300 dims>
#> 2: Japan 0.0508 ... <300 dims>
#> 3: Korea 0.1089 ... <300 dims>
## specify `x` as a `wordvec` or `embed` object:
subset(demodata, c("China", "Japan", "Korea"))
#> # wordvec (data.table): [3 × 2] (NOT normalized)
#> word vec
#> 1: China [-0.0732, ...<300 dims>]
#> 2: Japan [ 0.0508, ...<300 dims>]
#> 3: Korea [ 0.1089, ...<300 dims>]
subset(d, pattern="^Chi")
#> 4 words matched...
#> # embed (matrix): [4 × 300] (NOT normalized)
#> dim1 ... dim300
#> 1: China -0.0732 ... <300 dims>
#> 2: Chicago -0.0786 ... <300 dims>
#> 3: Chinese -0.1367 ... <300 dims>
#> 4: Chile -0.2754 ... <300 dims>
## specify `x` and `pattern`, and save with `file.save`:
subset(demodata, pattern="Chin[ae]|Japan|Korea",
file.save="subset.RData")
#> 6 words matched...
#>
#> Compressing and saving...
#> ✔ Saved to subset.RData (time cost = 0.004 secs)
#> # wordvec (data.table): [6 × 2] (NOT normalized)
#> word vec
#> 1: China [-0.0732, ...<300 dims>]
#> 2: Japan [ 0.0508, ...<300 dims>]
#> 3: Chinese [-0.1367, ...<300 dims>]
#> 4: Japanese [ 0.0105, ...<300 dims>]
#> 5: Korean [ 0.0898, ...<300 dims>]
#> 6: Korea [ 0.1089, ...<300 dims>]
## load the subset:
d.subset = load_wordvec("subset.RData")
#> Loading...
#> ✔ Word vectors data: 6 vocab, 300 dims (time cost = 0.002 secs)
#> ✔ All word vectors have been normalized to unit length 1.
d.subset
#> # wordvec (data.table): [6 × 2] (normalized)
#> word vec
#> 1: China [-0.0270, ...<300 dims>]
#> 2: Japan [ 0.0183, ...<300 dims>]
#> 3: Chinese [-0.0499, ...<300 dims>]
#> 4: Japanese [ 0.0037, ...<300 dims>]
#> 5: Korean [ 0.0264, ...<300 dims>]
#> 6: Korea [ 0.0338, ...<300 dims>]
## specify `x` as an .RData file and save with `file.save`:
data_wordvec_subset("subset.RData",
words=c("China", "Chinese"),
file.save="new.subset.RData")
#> Loading...
#> ✔ Word vectors data: 6 vocab, 300 dims (time cost = 0.002 secs)
#>
#> Compressing and saving...
#> ✔ Saved to new.subset.RData (time cost = 0.003 secs)
#> # wordvec (data.table): [2 × 2] (NOT normalized)
#> word vec
#> 1: China [-0.0732, ...<300 dims>]
#> 2: Chinese [-0.1367, ...<300 dims>]
d.new.subset = load_embed("new.subset.RData")
#> Loading...
#> ✔ Word vectors data: 2 vocab, 300 dims (time cost = 0.002 secs)
#> ✔ All word vectors have been normalized to unit length 1.
d.new.subset
#> # embed (matrix): [2 × 300] (normalized)
#> dim1 ... dim300
#> 1: China -0.0270 ... <300 dims>
#> 2: Chinese -0.0499 ... <300 dims>
unlink("subset.RData") # delete file for code check
unlink("new.subset.RData") # delete file for code check