Embeddings
Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.
Instead of loading a large file into memory to look up embeddings, embeddings
is backed by a database, which makes it fast to load and query:
>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop
>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop
>>> g = GloveEmbedding('common_crawl_840', d_emb=300)
>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop
Installation
pip install embeddings # from pypi
pip install git+https://github.com/vzhong/embeddings.git # from github
Usage
Upon first use, the embeddings are downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups are queried directly against the database.
Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.
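For example, to keep the cache on fast local storage instead, set the variable before constructing an embedding. A minimal sketch; the scratch path is illustrative:

import os

os.environ['EMBEDDINGS_ROOT'] = '/scratch/embeddings'  # illustrative path on fast local disk

from embeddings import GloveEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300)  # database now lives under /scratch/embeddings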
from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
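ConcatEmbedding concatenates the vectors of its constituent embeddings, so with the 300-dimensional GloVe and fastText vectors and the 100-dimensional character embeddings above, the combined vector should have 700 dimensions. A sketch, assuming the word is in vocabulary for all three:

w = 'toronto'
assert len(g.emb(w)) == 300              # GloVe
assert len(f.emb(w)) == 300              # fastText
assert len(k.emb(w)) == 100              # character n-grams
assert len(c.emb(w)) == 300 + 300 + 100  # concatenation of all three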
Docker
If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character n-gram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.
For example:
docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py
embeddings package
embeddings.embedding module
class embeddings.embedding.Embedding
Bases: object

static download_file(url, local_filename)
Downloads a file from a URL to a local file.
Parameters:
- url (str) – url to download the file from.
- local_filename (str) – path to save the downloaded file to.
Returns: file name of the downloaded file.
Return type: str
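A minimal sketch of calling the helper directly; the URL and local path are placeholders:

from embeddings.embedding import Embedding

# Both arguments are illustrative; the return value is the name of the local file.
fname = Embedding.download_file('http://example.com/vectors.txt', '/tmp/vectors.txt')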
static ensure_file(name, url=None, force=False, logger=<RootLogger root (WARNING)>, postprocess=None)
Ensures that the requested file exists in the cache, downloading it if it does not.
Parameters:
- name (str) – name of the file.
- url (str) – url to download the file from, if it doesn't exist.
- force (bool) – whether to force the download, regardless of the existence of the file.
- logger (logging.Logger) – logger to log results.
- postprocess (function) – a function that, if given, will be applied after the file is downloaded. The function has the signature f(fname).
Returns: file name of the downloaded file.
Return type: str
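A sketch of caching a file with a postprocess hook, following the documented f(fname) signature; the file name and URL are placeholders:

import tarfile

from embeddings.embedding import Embedding

def untar(fname):
    # Applied once, after the download finishes, per the postprocess contract f(fname).
    with tarfile.open(fname) as archive:
        archive.extractall()

# Downloads only if 'vectors.tar.gz' is not already in the cache (or if force=True).
path = Embedding.ensure_file('vectors.tar.gz', url='http://example.com/vectors.tar.gz', postprocess=untar)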
static initialize_db(fname)
Parameters: fname (str) – location of the database.
Returns: a SQLite3 database with an embeddings table.
Return type: db (sqlite3.Connection)
insert_batch(batch)
Parameters: batch (list) – a list of embeddings to insert, each of which is a tuple (word, embedding).
Example:
e = Embedding()
e.db = e.initialize_db(e.path('mydb.db'))
e.insert_batch([
    ('hello', [1, 2, 3]),
    ('world', [2, 3, 4]),
    ('!', [3, 4, 5]),
])
embeddings.fasttext module
class embeddings.fasttext.FastTextEmbedding(lang='en', show_progress=True, default='none')
Bases: embeddings.embedding.Embedding
Reference: https://arxiv.org/abs/1607.04606

__init__(lang='en', show_progress=True, default='none')
Parameters:
- lang (str) – language of the embeddings to retrieve.
- show_progress (bool) – whether to print progress.
- default (str) – how to embed words that are out of vocabulary. Can use zeros, return None, or generate random between [-0.1, 0.1].

d_emb = 300
sizes = {'en': 1}
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.{}.zip'
embeddings.glove module
class embeddings.glove.GloveEmbedding(name='common_crawl_840', d_emb=300, show_progress=True, default='none')
Bases: embeddings.embedding.Embedding
Reference: http://nlp.stanford.edu/projects/glove

class GloveSetting(url, d_embs, size, description)
Bases: tuple
- url – Alias for field number 0
- d_embs – Alias for field number 1
- size – Alias for field number 2
- description – Alias for field number 3

__init__(name='common_crawl_840', d_emb=300, show_progress=True, default='none')
Parameters:
- name – name of the embedding to retrieve.
- d_emb – embedding dimensions.
- show_progress – whether to print progress.
- default – how to embed words that are out of vocabulary. Can use zeros, return None, or generate random between [-0.1, 0.1].

settings = {
    'common_crawl_48': GloveSetting(url='http://nlp.stanford.edu/data/glove.42B.300d.zip', d_embs=[300], size=1917494, description='48B token common crawl'),
    'common_crawl_840': GloveSetting(url='http://nlp.stanford.edu/data/glove.840B.300d.zip', d_embs=[300], size=2195895, description='840B token common crawl'),
    'twitter': GloveSetting(url='http://nlp.stanford.edu/data/glove.twitter.27B.zip', d_embs=[25, 50, 100, 200], size=1193514, description='27B token twitter'),
    'wikipedia_gigaword': GloveSetting(url='http://nlp.stanford.edu/data/glove.6B.zip', d_embs=[50, 100, 200, 300], size=400000, description='6B token wikipedia 2014 + gigaword 5'),
}
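For instance, the twitter setting lists d_embs of 25, 50, 100, and 200, so any of those can be requested:

from embeddings import GloveEmbedding

t = GloveEmbedding('twitter', d_emb=100)  # d_emb must be one of the d_embs for the chosen setting
vec = t.emb('canada')                     # a 100-dimensional vector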
embeddings.kazuma module
class embeddings.kazuma.KazumaCharEmbedding(show_progress=True)
Bases: embeddings.embedding.Embedding
Reference: https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/

d_emb = 100
size = 874474
url = 'https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/jmt_pre-trained_embeddings.tar.gz'
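Because these are character n-gram embeddings, a vector can be assembled even for tokens that never occurred in training. A sketch; the out-of-vocabulary composition described in the comment is an assumption based on the character n-gram design:

from embeddings import KazumaCharEmbedding

k = KazumaCharEmbedding()
# Assumption: the vector is composed from character n-grams, so even a
# rare or made-up token yields a 100-dimensional embedding.
vec = k.emb('vancouverite')
print(len(vec))  # 100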