Tabulate cosine similarity/distance of word pairs.
Usage
tab_similarity(
  data,
  words = NULL,
  pattern = NULL,
  words1 = NULL,
  words2 = NULL,
  unique = FALSE,
  distance = FALSE
)
Arguments
- data
A wordvec (data.table) or embed (matrix); see data_wordvec_load.
- words
[Option 1] Character string(s).
- pattern
[Option 2] Regular expression (see str_subset). If neither words nor pattern is specified (i.e., both are NULL), then all words in the data will be extracted.
- words1, words2
[Option 3] Two sets of words, producing only n1 * n2 word pairs. See examples.
- unique
Return unique word pairs (TRUE) or all pairs with duplicates (FALSE; default).
- distance
Compute cosine distance instead? Defaults to FALSE (cosine similarity). See the sketch after this list.
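A minimal sketch of the distance option (output omitted; demodata and cc() are the same objects used in the examples below):

tab_similarity(demodata, cc("king, queen"), distance = TRUE)
# returns a cos_dist column instead of cos_sim;
# cosine distance is conventionally 1 - cosine similarity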
Value
A data.table of words, word pairs, and their cosine similarity (cos_sim) or cosine distance (cos_dist).
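Because the result is a data.table, it can be filtered and sorted with data.table syntax. A minimal sketch (reusing the call shown in the examples below):

res = tab_similarity(demodata, cc("king, queen, man, woman"), unique = TRUE)
res[order(-cos_sim)]  # sort word pairs from most to least similar
res[cos_sim > 0.5]    # keep only highly similar pairs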
Download
Download pre-trained word vector data (.RData):
https://psychbruce.github.io/WordVector_RData.pdf
Examples
tab_similarity(demodata, cc("king, queen, man, woman"))
#> word1 word2 wordpair cos_sim
#> 1: king king king-king 1.0000000
#> 2: king queen king-queen 0.6510958
#> 3: king man king-man 0.2294268
#> 4: king woman king-woman 0.1284797
#> 5: queen king queen-king 0.6510958
#> 6: queen queen queen-queen 1.0000000
#> 7: queen man queen-man 0.1665822
#> 8: queen woman queen-woman 0.3161813
#> 9: man king man-king 0.2294268
#> 10: man queen man-queen 0.1665822
#> 11: man man man-man 1.0000000
#> 12: man woman man-woman 0.7664012
#> 13: woman king woman-king 0.1284797
#> 14: woman queen woman-queen 0.3161813
#> 15: woman man woman-man 0.7664012
#> 16: woman woman woman-woman 1.0000000
tab_similarity(demodata, cc("king, queen, man, woman"),
unique=TRUE)
#> word1 word2 wordpair cos_sim
#> 1: king queen king-queen 0.6510958
#> 2: king man king-man 0.2294268
#> 3: king woman king-woman 0.1284797
#> 4: queen man queen-man 0.1665822
#> 5: queen woman queen-woman 0.3161813
#> 6: man woman man-woman 0.7664012
tab_similarity(demodata, cc("Beijing, China, Tokyo, Japan"))
#> word1 word2 wordpair cos_sim
#> 1: Beijing Beijing Beijing-Beijing 1.0000000
#> 2: Beijing China Beijing-China 0.7648461
#> 3: Beijing Tokyo Beijing-Tokyo 0.5229628
#> 4: Beijing Japan Beijing-Japan 0.3995245
#> 5: China Beijing China-Beijing 0.7648461
#> 6: China China China-China 1.0000000
#> 7: China Tokyo China-Tokyo 0.3814305
#> 8: China Japan China-Japan 0.5967756
#> 9: Tokyo Beijing Tokyo-Beijing 0.5229628
#> 10: Tokyo China Tokyo-China 0.3814305
#> 11: Tokyo Tokyo Tokyo-Tokyo 1.0000000
#> 12: Tokyo Japan Tokyo-Japan 0.7002254
#> 13: Japan Beijing Japan-Beijing 0.3995245
#> 14: Japan China Japan-China 0.5967756
#> 15: Japan Tokyo Japan-Tokyo 0.7002254
#> 16: Japan Japan Japan-Japan 1.0000000
tab_similarity(demodata, cc("Beijing, China, Tokyo, Japan"),
unique=TRUE)
#> word1 word2 wordpair cos_sim
#> 1: Beijing China Beijing-China 0.7648461
#> 2: Beijing Tokyo Beijing-Tokyo 0.5229628
#> 3: Beijing Japan Beijing-Japan 0.3995245
#> 4: China Tokyo China-Tokyo 0.3814305
#> 5: China Japan China-Japan 0.5967756
#> 6: Tokyo Japan Tokyo-Japan 0.7002254
## only n1 * n2 word pairs across two sets of words
tab_similarity(demodata,
               words1=cc("king, queen, King, Queen"),
               words2=cc("man, woman"))
#> word1 word2 wordpair cos_sim
#> 1: king man king-man 0.22942676
#> 2: king woman king-woman 0.12847968
#> 3: queen man queen-man 0.16658216
#> 4: queen woman queen-woman 0.31618132
#> 5: King man King-man 0.15777646
#> 6: King woman King-woman 0.06369529
#> 7: Queen man Queen-man 0.09365463
#> 8: Queen woman Queen-woman 0.20171619
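## [Option 2] select words by a regular expression instead of an
## explicit word list (a sketch: the pattern below is only
## illustrative, and output is omitted because it depends on the
## vocabulary of the loaded data)
tab_similarity(demodata, pattern="^(king|queen|man|woman)$",
               unique=TRUE)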