This demo data contains a sample of 8000 English words with 300-dimensional word vectors, pre-trained with the "word2vec" algorithm on the Google News corpus. Most of the words come from a list of the 8000 most frequent words; a few less frequent words were selected and appended.

Usage

data(demodata)

Format

A data.table (with the additional class wordvec) containing two variables, word and vec. It was converted from the raw data (see the URL in Source) to .RData with the data_transform function.

Source

Google Code - word2vec (https://code.google.com/archive/p/word2vec/)

Examples

class(demodata)
#> [1] "wordvec"    "data.table" "data.frame"
demodata
#> # wordvec (data.table): [8000 × 2] (NOT normalized)
#>                  word                      vec
#>    1:              in [ 0.0703, ...<300 dims>]
#>    2:             for [-0.0118, ...<300 dims>]
#>    3:            that [-0.0157, ...<300 dims>]
#>    4:              is [ 0.0070, ...<300 dims>]
#>    5:              on [ 0.0267, ...<300 dims>]
#> -----                                         
#> 7996:     salesperson [ 0.1245, ...<300 dims>]
#> 7997:     computation [ 0.0791, ...<300 dims>]
#> 7998:   psychotherapy [ 0.1445, ...<300 dims>]
#> 7999:       equations [ 0.3242, ...<300 dims>]
#> 8000: psychotherapist [ 0.1357, ...<300 dims>]

embed = as_embed(demodata, normalize=TRUE)
class(embed)
#> [1] "embed"  "matrix" "array" 
embed
#> # embed (matrix): [8000 × 300] (normalized)
#>                          dim1 ...     dim300
#>    1: in               0.0530 ... <300 dims>
#>    2: for             -0.0085 ... <300 dims>
#>    3: that            -0.0124 ... <300 dims>
#>    4: is               0.0037 ... <300 dims>
#>    5: on               0.0167 ... <300 dims>
#> -----                                       
#> 7996: salesperson      0.0381 ... <300 dims>
#> 7997: computation      0.0227 ... <300 dims>
#> 7998: psychotherapy    0.0405 ... <300 dims>
#> 7999: equations        0.0957 ... <300 dims>
#> 8000: psychotherapist  0.0364 ... <300 dims>
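
The printed headers above distinguish "NOT normalized" from "normalized" vectors. As a rough illustration of what that difference means, the base-R sketch below scales each row of a small matrix to unit Euclidean (L2) length. It assumes that normalize=TRUE refers to L2 normalization of each word vector; the matrix toy, its row names, and its values are made up for illustration and are not taken from demodata.

```r
# Toy stand-in for a few word vectors (hypothetical values, 3 dims instead of 300)
toy <- matrix(
  c(3, 4, 0,
    1, 2, 2,
    0, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("alpha", "beta", "gamma"), paste0("dim", 1:3))
)

# Euclidean (L2) length of each row vector
row_norms <- sqrt(rowSums(toy^2))

# Divide every row by its own length.
# R recycles row_norms down the columns, so element [i, j] is divided by row_norms[i].
toy_normalized <- toy / row_norms

# After scaling, every row has length 1 (up to floating-point error)
sqrt(rowSums(toy_normalized^2))
```

With unit-length rows, the dot product of two rows equals their cosine similarity, which is one common reason to normalize word vectors before comparing them.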