Embeddings

Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file into memory to look up embeddings, embeddings is backed by a database and is fast to load and query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop

Installation

pip install embeddings  # from pypi
pip install git+https://github.com/vzhong/embeddings.git  # from github

Usage

Upon first use, the embeddings are downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups are queried directly against the database. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (which defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
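
If the default location is not suitable (for example, a home directory on NFS), point $EMBEDDINGS_ROOT at a local disk before constructing any embedding objects. A minimal sketch, assuming /mnt/ssd/embeddings is a writable local directory (the path is illustrative):

import os

# Illustrative local path; any writable directory works.
os.environ['EMBEDDINGS_ROOT'] = '/mnt/ssd/embeddings'

from embeddings import GloveEmbedding
g = GloveEmbedding('common_crawl_840', d_emb=300)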

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto’s character ngram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!

embeddings package

embeddings.embedding module

class embeddings.embedding.Embedding[source]

Bases: object

__len__()[source]
Returns: number of embeddings in the database.
Return type: count (int)
clear()[source]

Deletes all embeddings from the database.
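
A minimal sketch of these two methods, assuming a small GloVe database has already been downloaded:

from embeddings import GloveEmbedding

g = GloveEmbedding('wikipedia_gigaword', d_emb=50)
print(len(g))  # number of embeddings stored in the local database
# g.clear()    # would delete all embeddings from the database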

static download_file(url, local_filename)[source]

Downloads a file from a url to a local file.

Parameters:
  • url (str) – url to download from.
  • local_filename (str) – local file to download to.
Returns: file name of the downloaded file.
Return type: str

static ensure_file(name, url=None, force=False, logger=<RootLogger root (WARNING)>, postprocess=None)[source]

Ensures that the requested file exists in the cache, downloading it if it does not.

Parameters:
  • name (str) – name of the file.
  • url (str) – url to download the file from, if it doesn’t exist.
  • force (bool) – whether to force the download, regardless of the existence of the file.
  • logger (logging.Logger) – logger to log results.
  • postprocess (function) – a function that, if given, will be applied after the file is downloaded. The function has the signature f(fname).
Returns: file name of the downloaded file.
Return type: str
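
A minimal sketch of ensure_file with a postprocess hook; the file name, URL, and unpacking step are illustrative, not part of the library:

import zipfile
from embeddings.embedding import Embedding

def unzip(fname):
    # Hypothetical postprocess step: unpack the downloaded archive into the cache.
    with zipfile.ZipFile(fname) as zf:
        zf.extractall(Embedding.path('myvectors'))

# Downloads only if 'myvectors.zip' is not already in the cache.
fname = Embedding.ensure_file('myvectors.zip', url='http://example.com/myvectors.zip', postprocess=unzip)
print(fname)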

static initialize_db(fname)[source]
Parameters: fname (str) – location of the database.
Returns: a SQLite3 database with an embeddings table.
Return type: db (sqlite3.Connection)
insert_batch(batch)[source]
Parameters: batch (list) – a list of embeddings to insert, each of which is a tuple (word, embedding).

Example:

from embeddings.embedding import Embedding

e = Embedding()
e.db = e.initialize_db(e.path('mydb.db'))  # create a SQLite database under $EMBEDDINGS_ROOT
e.insert_batch([
    ('hello', [1, 2, 3]),
    ('world', [2, 3, 4]),
    ('!', [3, 4, 5]),
])
load_memory()[source]
lookup(w)[source]
Parameters: w – word to look up.
Returns: the embedding for w if it exists, None otherwise.
static path(p)[source]
Parameters: p (str) – relative path.
Returns: absolute path to the file, located in the $EMBEDDINGS_ROOT directory.
Return type: str
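
For example (the file name is hypothetical; the result resolves against $EMBEDDINGS_ROOT, which defaults to ~/.embeddings):

from embeddings.embedding import Embedding

print(Embedding.path('mydb.db'))  # e.g. /home/user/.embeddings/mydb.db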

embeddings.fasttext module

class embeddings.fasttext.FastTextEmbedding(lang='en', show_progress=True, default='none')[source]

Bases: embeddings.embedding.Embedding

Reference: https://arxiv.org/abs/1607.04606

__init__(lang='en', show_progress=True, default='none')[source]
Parameters:
  • lang (str) – which language to use (e.g. 'en').
  • show_progress (bool) – whether to print progress.
  • default (str) – how to embed words that are out of vocabulary.

Note

The default embedding can be a zero vector, None, or a random vector with components drawn from [-0.1, 0.1].

d_emb = 300
emb(word, default=None)[source]
load_word2emb(show_progress=True, batch_size=1000)[source]
sizes = {'en': 1}
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.{}.zip'
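
A minimal usage sketch for the English vectors (the only language listed in sizes above); passing default='zero' is assumed here to request the zero-vector fallback described in the note:

from embeddings import FastTextEmbedding

f = FastTextEmbedding(lang='en', default='zero')
vec = f.emb('vancouver')
print(len(vec))  # 300, per d_emb above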

embeddings.glove module

class embeddings.glove.GloveEmbedding(name='common_crawl_840', d_emb=300, show_progress=True, default='none')[source]

Bases: embeddings.embedding.Embedding

Reference: http://nlp.stanford.edu/projects/glove

class GloveSetting(url, d_embs, size, description)

Bases: tuple

A named tuple with fields url (0), d_embs (1), size (2), and description (3).

__init__(name='common_crawl_840', d_emb=300, show_progress=True, default='none')[source]
Parameters:
  • name – name of the embedding to retrieve.
  • d_emb – embedding dimensions.
  • show_progress – whether to print progress.
  • default – how to embed words that are out of vocabulary. Can use zeros, return None, or generate random between [-0.1, 0.1].
emb(word, default=None)[source]
load_word2emb(show_progress=True, batch_size=1000)[source]
settings = {
    'common_crawl_48': GloveSetting(url='http://nlp.stanford.edu/data/glove.42B.300d.zip', d_embs=[300], size=1917494, description='48B token common crawl'),
    'common_crawl_840': GloveSetting(url='http://nlp.stanford.edu/data/glove.840B.300d.zip', d_embs=[300], size=2195895, description='840B token common crawl'),
    'twitter': GloveSetting(url='http://nlp.stanford.edu/data/glove.twitter.27B.zip', d_embs=[25, 50, 100, 200], size=1193514, description='27B token twitter'),
    'wikipedia_gigaword': GloveSetting(url='http://nlp.stanford.edu/data/glove.6B.zip', d_embs=[50, 100, 200, 300], size=400000, description='6B token wikipedia 2014 + gigaword 5'),
}
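
The settings table above is the source of valid names and dimensions; for instance, the Twitter vectors come in 25, 50, 100, and 200 dimensions. A minimal sketch:

from embeddings import GloveEmbedding

t = GloveEmbedding('twitter', d_emb=100)
vec = t.emb('canada')
print(len(vec) if vec is not None else 'out of vocabulary')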

embeddings.kazuma module

class embeddings.kazuma.KazumaCharEmbedding(show_progress=True)[source]

Bases: embeddings.embedding.Embedding

Reference: https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/

__init__(show_progress=True)[source]
Parameters: show_progress – whether to print progress.
d_emb = 100
emb(w, default='zero')[source]
load_word2emb(show_progress=True, batch_size=1000)[source]
size = 874474
url = 'https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/jmt_pre-trained_embeddings.tar.gz'
embeddings.kazuma.ngrams(sentence, n)[source]
Returns: a list of lists of words corresponding to the ngrams in the sentence.
Return type: list
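
A minimal sketch of the character n-gram embedding and the ngrams helper; the list-of-words input to ngrams is inferred from the return description above and is an assumption:

from embeddings import KazumaCharEmbedding
from embeddings.kazuma import ngrams

k = KazumaCharEmbedding()
print(len(k.emb('vancouver')))  # 100, per d_emb above

# Assumed input format: a sentence given as a list of words.
print(ngrams(['the', 'quick', 'brown', 'fox'], 2))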