Embeddings
Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.
Instead of loading a large file into memory to look up embeddings, embeddings
is backed by a database, which makes it fast to load and query:
>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop
>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop
>>> g = GloveEmbedding('common_crawl_840', d_emb=300)
>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop
Installation
pip install embeddings # from pypi
pip install git+https://github.com/vzhong/embeddings.git # from github
Usage
Upon first use, the embeddings are downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups are queried directly against the database.
Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.
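For example, to keep the cache on fast local storage instead, set the variable before constructing an embedding. A minimal sketch; the scratch path is illustrative:

import os

os.environ['EMBEDDINGS_ROOT'] = '/scratch/embeddings'  # illustrative path on fast local disk

from embeddings import GloveEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300)  # database now lives under /scratch/embeddings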
from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
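ConcatEmbedding concatenates the vectors of its constituent embeddings, so with the 300-dimensional GloVe and fastText vectors and the 100-dimensional character embeddings above, the combined vector should have 700 dimensions. A sketch, assuming the word is in vocabulary for all three:

w = 'toronto'
assert len(g.emb(w)) == 300              # GloVe
assert len(f.emb(w)) == 300              # fastText
assert len(k.emb(w)) == 100              # character n-grams
assert len(c.emb(w)) == 300 + 300 + 100  # concatenation of all three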
Docker
If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character n-gram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.
For example:
docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py
embeddings package
embeddings.embedding module
class embeddings.embedding.Embedding
Bases: object

static download_file(url, local_filename)
Downloads a file from a URL to a local file.
Parameters:
- url (str) – url to download the file from.
- local_filename (str) – path to save the downloaded file to.
Returns: file name of the downloaded file.
Return type: str
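A minimal sketch of calling the helper directly; the URL and local path are placeholders:

from embeddings.embedding import Embedding

# Both arguments are illustrative; the return value is the name of the local file.
fname = Embedding.download_file('http://example.com/vectors.txt', '/tmp/vectors.txt')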
static ensure_file(name, url=None, force=False, logger=<RootLogger root (WARNING)>, postprocess=None)
Ensures that the requested file exists in the cache, downloading it if it does not.
Parameters:
- name (str) – name of the file.
- url (str) – url to download the file from, if it doesn't exist.
- force (bool) – whether to force the download, regardless of the existence of the file.
- logger (logging.Logger) – logger to log results.
- postprocess (function) – a function that, if given, will be applied after the file is downloaded. The function has the signature f(fname).
Returns: file name of the downloaded file.
Return type: str
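A sketch of caching a file with a postprocess hook, following the documented f(fname) signature; the file name and URL are placeholders:

import tarfile

from embeddings.embedding import Embedding

def untar(fname):
    # Applied once, after the download finishes, per the postprocess contract f(fname).
    with tarfile.open(fname) as archive:
        archive.extractall()

# Downloads only if 'vectors.tar.gz' is not already in the cache (or if force=True).
path = Embedding.ensure_file('vectors.tar.gz', url='http://example.com/vectors.tar.gz', postprocess=untar)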
static initialize_db(fname)
Parameters: fname (str) – location of the database.
Returns: a SQLite3 database with an embeddings table.
Return type: db (sqlite3.Connection)
insert_batch(batch)
Parameters: batch (list) – a list of embeddings to insert, each of which is a tuple (word, embedding).
Example:
e = Embedding()
e.db = e.initialize_db(e.path('mydb.db'))
e.insert_batch([
    ('hello', [1, 2, 3]),
    ('world', [2, 3, 4]),
    ('!', [3, 4, 5]),
])
embeddings.fasttext module
class embeddings.fasttext.FastTextEmbedding(lang='en', show_progress=True, default='none')
Bases: embeddings.embedding.Embedding
Reference: https://arxiv.org/abs/1607.04606

__init__(lang='en', show_progress=True, default='none')
Parameters:
- lang (str) – language of the embeddings to retrieve.
- show_progress (bool) – whether to print progress.
- default (str) – how to embed words that are out of vocabulary. Can use zeros, return None, or generate random between [-0.1, 0.1].

d_emb = 300
sizes = {'en': 1}
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.{}.zip'
embeddings.glove module
class embeddings.glove.GloveEmbedding(name='common_crawl_840', d_emb=300, show_progress=True, default='none')
Bases: embeddings.embedding.Embedding
Reference: http://nlp.stanford.edu/projects/glove

class GloveSetting(url, d_embs, size, description)
Bases: tuple
- url – Alias for field number 0
- d_embs – Alias for field number 1
- size – Alias for field number 2
- description – Alias for field number 3

__init__(name='common_crawl_840', d_emb=300, show_progress=True, default='none')
Parameters:
- name – name of the embedding to retrieve.
- d_emb – embedding dimensions.
- show_progress – whether to print progress.
- default – how to embed words that are out of vocabulary. Can use zeros, return None, or generate random between [-0.1, 0.1].

settings = {
    'common_crawl_48': GloveSetting(url='http://nlp.stanford.edu/data/glove.42B.300d.zip', d_embs=[300], size=1917494, description='48B token common crawl'),
    'common_crawl_840': GloveSetting(url='http://nlp.stanford.edu/data/glove.840B.300d.zip', d_embs=[300], size=2195895, description='840B token common crawl'),
    'twitter': GloveSetting(url='http://nlp.stanford.edu/data/glove.twitter.27B.zip', d_embs=[25, 50, 100, 200], size=1193514, description='27B token twitter'),
    'wikipedia_gigaword': GloveSetting(url='http://nlp.stanford.edu/data/glove.6B.zip', d_embs=[50, 100, 200, 300], size=400000, description='6B token wikipedia 2014 + gigaword 5'),
}
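For instance, the twitter setting lists d_embs of 25, 50, 100, and 200, so any of those can be requested:

from embeddings import GloveEmbedding

t = GloveEmbedding('twitter', d_emb=100)  # d_emb must be one of the d_embs for the chosen setting
vec = t.emb('canada')                     # a 100-dimensional vector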
embeddings.kazuma module
class embeddings.kazuma.KazumaCharEmbedding(show_progress=True)
Bases: embeddings.embedding.Embedding
Reference: https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/

d_emb = 100
size = 874474
url = 'https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/jmt_pre-trained_embeddings.tar.gz'
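Because these are character n-gram embeddings, a vector can be assembled even for tokens that never occurred in training. A sketch; the out-of-vocabulary composition described in the comment is an assumption based on the character n-gram design:

from embeddings import KazumaCharEmbedding

k = KazumaCharEmbedding()
# Assumption: the vector is composed from character n-grams, so even a
# rare or made-up token yields a 100-dimensional embedding.
vec = k.emb('vancouverite')
print(len(vec))  # 100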