Embeddings¶
Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.
Instead of loading a large file to query for embeddings, the package is backed by a database, which makes it fast to load and query:
>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop
>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop
>>> g = GloveEmbedding('common_crawl_840', d_emb=300)
>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop
Installation¶
pip install embeddings # from pypi
pip install git+https://github.com/vzhong/embeddings.git # from github
Usage¶
On first use, the embeddings are downloaded to disk in the form of a SQLite database.
This may take a long time for large embeddings such as GloVe.
Subsequent lookups are queried directly against the database.
Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.
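As a rough illustration of the lookup order described above, the sketch below resolves the storage directory from $EMBEDDINGS_ROOT and falls back to ~/.embeddings. The helper name resolve_embeddings_root is hypothetical and not part of the package; the package's own resolution logic may differ.

```python
import os


def resolve_embeddings_root(environ=None):
    """Hypothetical helper: return the directory where embedding databases would live."""
    environ = os.environ if environ is None else environ
    # Use $EMBEDDINGS_ROOT when it is set and non-empty, otherwise ~/.embeddings.
    root = environ.get('EMBEDDINGS_ROOT') or os.path.join('~', '.embeddings')
    return os.path.abspath(os.path.expanduser(root))


print(resolve_embeddings_root())
```

Pointing EMBEDDINGS_ROOT at a local disk before importing the package is one way to avoid the NFS slowdown mentioned above.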
from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding
g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
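To make the last line of the loop concrete: a concatenating wrapper would return one vector per word, formed by joining the vectors of its component embeddings end to end. The sketch below illustrates that idea with toy stand-ins. ToyEmbedding and ToyConcat are hypothetical names, and the real ConcatEmbedding may differ in details such as how it treats out-of-vocabulary words.

```python
class ToyEmbedding:
    """Hypothetical stand-in for an embedding object exposing emb(word)."""

    def __init__(self, table, d_emb):
        self.table = table
        self.d_emb = d_emb

    def emb(self, word):
        # Fall back to zeros for unknown words (an assumption made for this sketch).
        return list(self.table.get(word, [0.0] * self.d_emb))


class ToyConcat:
    """Hypothetical wrapper that joins the vectors of several embeddings."""

    def __init__(self, embeddings):
        self.embeddings = embeddings
        self.d_emb = sum(e.d_emb for e in embeddings)

    def emb(self, word):
        vec = []
        for e in self.embeddings:
            vec.extend(e.emb(word))
        return vec


a = ToyEmbedding({'canada': [0.1, 0.2]}, d_emb=2)
b = ToyEmbedding({'canada': [0.3, 0.4, 0.5]}, d_emb=3)
c = ToyConcat([a, b])
print(c.emb('canada'))        # a's two values followed by b's three values
print(len(c.emb('toronto')))  # length equals the sum of the component sizes
```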
Docker¶
If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto’s character n-gram embeddings is available at vzhong/embeddings.
To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.
For example:
docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py
embeddings package¶
embeddings.embedding module¶
class embeddings.embedding.Embedding
Bases: object

static download_file(url, local_filename)
Downloads a file from a URL to a local file.
Returns: file name of the downloaded file.

static ensure_file(name, url=None, force=False, logger=<RootLogger root (WARNING)>, postprocess=None)
Ensures that the file requested exists in the cache, downloading it if it does not exist.
Parameters:
- name (str) – name of the file.
- url (str) – url to download the file from, if it doesn’t exist.
- force (bool) – whether to force the download, regardless of the existence of the file.
- logger (logging.Logger) – logger to log results.
- postprocess (function) – a function that, if given, will be applied after the file is downloaded. The function has the signature f(fname).
Returns: file name of the downloaded file.
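The caching behaviour described for ensure_file can be sketched as follows. This is a minimal illustration rather than the package's implementation: ensure_file_sketch and fake_fetch are hypothetical names, and the download step is replaced by a local function so that the example runs without network access.

```python
import os
import tempfile


def ensure_file_sketch(path, fetch=None, force=False, postprocess=None):
    """Hypothetical cache helper: fetch `path` only if it is missing or `force` is set."""
    if force or not os.path.isfile(path):
        if fetch is None:
            raise FileNotFoundError('{} is not cached and no fetch function was given'.format(path))
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        fetch(path)            # stands in for the real download step
        if postprocess is not None:
            postprocess(path)  # same f(fname) shape as described above
    return path


calls = []


def fake_fetch(path):
    # Writes a small file instead of touching the network.
    calls.append(path)
    with open(path, 'w') as f:
        f.write('payload')


cache_dir = tempfile.mkdtemp()
target = os.path.join(cache_dir, 'vectors.txt')
ensure_file_sketch(target, fetch=fake_fetch)              # file missing, so it is fetched
ensure_file_sketch(target, fetch=fake_fetch)              # cache hit, nothing is fetched
ensure_file_sketch(target, fetch=fake_fetch, force=True)  # forced, so it is fetched again
print(len(calls))
```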

static initialize_db(fname)
Parameters: fname (str) – location of the database.
Returns: a SQLite3 database with an embeddings table.
Return type: db (sqlite3.Connection)

insert_batch(batch)
Parameters: batch (list) – a list of embeddings to insert, each of which is a tuple (word, embeddings).
Example:
e = Embedding()
e.db = e.initialize_db(e.path('mydb.db'))
e.insert_batch([
    ('hello', [1, 2, 3]),
    ('world', [2, 3, 4]),
    ('!', [3, 4, 5]),
])
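For readers curious how a SQLite-backed store of this kind can work, here is a self-contained sketch using only the standard library. The table schema, the 32-bit float packing, and the lookup helper are assumptions made for illustration; the functions below only mirror the names initialize_db and insert_batch and are not the package's code.

```python
import sqlite3
from array import array


def initialize_db(fname):
    """Open a SQLite database and make sure it has an embeddings table (assumed schema)."""
    db = sqlite3.connect(fname)
    db.execute('CREATE TABLE IF NOT EXISTS embeddings (word TEXT PRIMARY KEY, emb BLOB)')
    db.commit()
    return db


def insert_batch(db, batch):
    """Insert (word, vector) pairs, packing each vector as 32-bit floats."""
    rows = [(word, array('f', vec).tobytes()) for word, vec in batch]
    db.executemany('INSERT OR REPLACE INTO embeddings (word, emb) VALUES (?, ?)', rows)
    db.commit()


def lookup(db, word):
    """Return the vector for `word` as a list of floats, or None if it is absent."""
    row = db.execute('SELECT emb FROM embeddings WHERE word = ?', (word,)).fetchone()
    if row is None:
        return None
    vec = array('f')
    vec.frombytes(row[0])
    return vec.tolist()


db = initialize_db(':memory:')
insert_batch(db, [('hello', [1, 2, 3]), ('world', [2, 3, 4]), ('!', [3, 4, 5])])
print(lookup(db, 'hello'))
print(lookup(db, 'missing'))
```

Because the word column is the primary key, each lookup is an indexed query, which is consistent with the fast per-word timings shown at the top of this page.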
embeddings.fasttext module¶
class embeddings.fasttext.FastTextEmbedding(lang='en', show_progress=True, default='none')
Bases: embeddings.embedding.Embedding
Reference: https://arxiv.org/abs/1607.04606

__init__(lang='en', show_progress=True, default='none')
Note: default can use zeros, return None, or generate random values between [-0.1, 0.1].

d_emb = 300
sizes = {'en': 1}
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.{}.zip'
embeddings.glove module¶
class embeddings.glove.GloveEmbedding(name='common_crawl_840', d_emb=300, show_progress=True, default='none')
Bases: embeddings.embedding.Embedding
Reference: http://nlp.stanford.edu/projects/glove

class GloveSetting(url, d_embs, size, description)
Bases: tuple
- url – alias for field number 0
- d_embs – alias for field number 1
- size – alias for field number 2
- description – alias for field number 3

__init__(name='common_crawl_840', d_emb=300, show_progress=True, default='none')
Parameters:
- name – name of the embedding to retrieve.
- d_emb – embedding dimensions.
- show_progress – whether to print progress.
- default – how to embed words that are out of vocabulary. Can use zeros, return None, or generate random values between [-0.1, 0.1].
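The three out-of-vocabulary strategies described for default can be sketched as a small helper. Only the value 'none' appears in the signatures above; the option names 'zero' and 'random' and the function default_vector are assumptions made for this illustration.

```python
import random


def default_vector(default, d_emb, rng=random):
    """Hypothetical fallback for a word that has no stored embedding."""
    if default == 'none':
        return None
    if default == 'zero':
        return [0.0] * d_emb
    if default == 'random':
        # Each component is drawn uniformly from [-0.1, 0.1].
        return [rng.uniform(-0.1, 0.1) for _ in range(d_emb)]
    raise ValueError('unknown default: {!r}'.format(default))


print(default_vector('none', 300))
print(default_vector('zero', 3))
print(len(default_vector('random', 300)))
```

Returning None forces callers to handle unknown words explicitly, while zeros or small random values let downstream code treat every word uniformly.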

settings = {
    'common_crawl_48': GloveSetting(url='http://nlp.stanford.edu/data/glove.42B.300d.zip', d_embs=[300], size=1917494, description='48B token common crawl'),
    'common_crawl_840': GloveSetting(url='http://nlp.stanford.edu/data/glove.840B.300d.zip', d_embs=[300], size=2195895, description='840B token common crawl'),
    'twitter': GloveSetting(url='http://nlp.stanford.edu/data/glove.twitter.27B.zip', d_embs=[25, 50, 100, 200], size=1193514, description='27B token twitter'),
    'wikipedia_gigaword': GloveSetting(url='http://nlp.stanford.edu/data/glove.6B.zip', d_embs=[50, 100, 200, 300], size=400000, description='6B token wikipedia 2014 + gigaword 5'),
}
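Since each setting lists the dimensions it supports, a caller can check a (name, d_emb) pair before triggering a large download. The sketch below rebuilds two of the entries above with collections.namedtuple and validates a request against them. check_setting is a hypothetical helper, and whether the package performs an equivalent check is not stated here.

```python
from collections import namedtuple

# Same field order as documented above: url, d_embs, size, description.
GloveSetting = namedtuple('GloveSetting', ['url', 'd_embs', 'size', 'description'])

settings = {
    'twitter': GloveSetting(
        'http://nlp.stanford.edu/data/glove.twitter.27B.zip',
        [25, 50, 100, 200], 1193514, '27B token twitter'),
    'wikipedia_gigaword': GloveSetting(
        'http://nlp.stanford.edu/data/glove.6B.zip',
        [50, 100, 200, 300], 400000, '6B token wikipedia 2014 + gigaword 5'),
}


def check_setting(name, d_emb):
    """Hypothetical validation: return the setting if `d_emb` is available for `name`."""
    if name not in settings:
        raise KeyError('unknown embedding name: {!r}'.format(name))
    setting = settings[name]
    if d_emb not in setting.d_embs:
        raise ValueError('{!r} offers dimensions {}, not {}'.format(name, setting.d_embs, d_emb))
    return setting


print(check_setting('twitter', 50).description)
```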
embeddings.kazuma module¶
class embeddings.kazuma.KazumaCharEmbedding(show_progress=True)
Bases: embeddings.embedding.Embedding
Reference: https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/

d_emb = 100
size = 874474
url = 'https://www.logos.t.u-tokyo.ac.jp/~hassy/publications/arxiv2016jmt/jmt_pre-trained_embeddings.tar.gz'