Import GoogleNews-vectors-negative300.bin: Answers

【Question title】: Import GoogleNews-vectors-negative300.bin
【Posted】: 2018-03-08 02:37:56
【Question description】:

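I am writing code with gensim and am having a hard time troubleshooting a ValueError. I was finally able to zip the GoogleNews-vectors-negative300.bin.gz file so that I could use it in my model. I also tried gzip, without success. The error occurs on the last line of the code. What can be done to fix it? Are there any workarounds? Finally, is there a website I could use as a reference?

Thank you very much for your assistance!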

import gensim
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model

pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    pretrained_embeddings_path, binary=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-23bd96c1d6ab> in <module>()
      1 pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
----> 2 word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

C:\Users\green\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    244                             word.append(ch)
    245                     word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
--> 246                     weights = fromstring(fin.read(binary_len), dtype=REAL)
    247                     add_word(word, weights)
    248             else:

ValueError: string size must be a multiple of element size

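【Question comments】:

I can run your code without any error. Are you sure you have the latest version of gensim? Did you actually zip the vector file (as you wrote in your post), or did you mean "unzip"? Have you tried setting binary=False to check whether you have a text file rather than a binary one?
I zipped the file with WinZip. I also tried binary=False. I get the same result with the latest version of gensim. I am using Python 3.6.
I suspect your file is corrupted, or is not truly the uncompressed binary file. Gensim can read .gz files just fine, so you can use the original file as-is. Try downloading it fresh and make sure the size is as expected. If you still have problems, report the MD5 hash of the file you are trying, so it can be compared with other people's copies.
Thanks. I came across the wget package and downloaded the bin file that way. I will try again.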
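
The checks suggested in the comments above (is the file still gzip-compressed, how large is it, what is its MD5 hash) can be sketched with the standard library alone. This is only an illustration: the helper names `looks_gzipped`, `file_md5`, and `describe` are made up for this example, and no reference hash or size for the GoogleNews file is assumed here.

```python
import hashlib
import os

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest in chunks, so a multi-gigabyte file fits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path):
    """Collect the facts worth comparing with someone else's copy of the file."""
    return {
        "size_bytes": os.path.getsize(path),
        "gzipped": looks_gzipped(path),
        "md5": file_md5(path),
    }
```

Calling `describe("GoogleNews-vectors-negative300.bin")` and finding `gzipped` set to True would suggest the file was renamed rather than decompressed; a size or hash that differs from another person's copy would point to a damaged download.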

Tags: python gensim

【Solution 1】:

This worked for me. I loaded only part of the model rather than all of it, because it is very large.

!pip install wget

import wget
url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
filename = wget.download(url)

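import gzip
import shutil

# Decompress the downloaded archive; the with-block closes both files.
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in, \
        open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)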

import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.decomposition import PCA

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=100000)
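
To make concrete what loading "part of the model" means, here is a rough standard-library sketch of reading only the first few vectors from a word2vec binary file. It is not gensim's implementation. It assumes the usual layout (a text header line `vocab_size vector_size`, then for each word its bytes, a space, and `vector_size` 32-bit floats) and assumes the floats are little-endian. The function name `read_first_vectors` is hypothetical.

```python
import gzip
import struct


def read_first_vectors(path, limit):
    """Read at most `limit` (word, vector) pairs from a word2vec binary file.

    Works on both the plain .bin file and the .bin.gz archive.
    """
    opener = gzip.open if path.endswith(".gz") else open
    vectors = {}
    with opener(path, "rb") as f:
        # Header line: "<vocab_size> <vector_size>\n"
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(min(limit, vocab_size)):
            # The word runs up to the first space; newlines between entries are skipped.
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" ":
                    break
                if ch == b"":
                    raise EOFError("file ended inside a word; it may be truncated")
                if ch != b"\n":
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="ignore")
            raw = f.read(4 * dim)
            if len(raw) != 4 * dim:
                # A short read here is the situation that makes the float conversion fail.
                raise EOFError("file ended inside a vector; it may be truncated")
            vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors
```

A short read in the vector part is one plausible way to end up with the "must be a multiple of element size" error from the question, which is why a truncated or otherwise damaged file is worth ruling out first.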

【Comments】:

I am trying to use your method, but when I run wget.download(url) I get URLError: <urlopen error [Errno 11001] getaddrinfo failed>. Any suggestions?
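
A `getaddrinfo failed` error generally means the host name could not be resolved, so the problem is likely in the network, DNS, or proxy setup rather than in the download code. A small sketch for checking name resolution on its own, using only the standard library (the helper name `can_resolve` is made up for this example):

```python
import socket
from urllib.parse import urlparse


def can_resolve(url_or_host):
    """Return True if the host part of a URL (or a bare host name) resolves."""
    host = urlparse(url_or_host).hostname or url_or_host
    try:
        socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    return True
```

If `can_resolve(url)` returns False for the download URL while other sites resolve, the host name itself is unreachable from that machine; if nothing resolves, the machine's connection or proxy configuration is the first thing to check.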

【Solution 2】:

The following commands worked.

brew install wget

wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

This downloads the gzip-compressed file, which you can decompress with:

gzip -d GoogleNews-vectors-negative300.bin.gz

You can then load the word vectors with the following code.

from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)

【Comments】:

I got this warning: /usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:253: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function 'See the migration notes for details: %s' % _MIGRATION_NOTES_URL Any idea how to fix it?
Keep in mind that this is a huge file, with over a billion words.

【Solution 3】:

Try this:

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']

Also, see this link: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

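【Comments】:

Unfortunately, the model cannot infer vectors for unfamiliar words. This is a limitation of Word2Vec: if this limitation matters to you, look at the FastText model.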
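
In practice this limitation means a lookup for a word outside the vocabulary raises a KeyError, so it can help to test membership first. A minimal sketch follows, using a plain dict as a stand-in for the loaded vectors. The helper `lookup` and the toy data are hypothetical; the sketch relies only on the `in` test and `[]` indexing, which gensim's KeyedVectors also supports.

```python
def lookup(vectors, word, default=None):
    """Return the vector for `word`, or `default` if the word is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return default


# Toy stand-in for a loaded KeyedVectors object (made-up values).
toy_vectors = {"king": [0.1, 0.2, 0.3]}

known = lookup(toy_vectors, "king")     # found in the vocabulary
unknown = lookup(toy_vectors, "kingg")  # misspelled, so not in the vocabulary
```

Returning a default keeps a processing loop running over text that contains rare or misspelled words; whether to skip such words or substitute something else is a choice for the application.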

【Solution 4】:

You have to write the full path.

Use this path:

https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

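【Comments】:

Links to a solution are welcome, but please make sure your answer is useful without it: add context around the link so that your fellow users will have some idea what it is and why it is there, then quote the most relevant part of the page you are linking to, in case the target page is unavailable. Answers that are little more than a link may be deleted.
We (Skymind) would appreciate it if everyone used this link instead: deeplearning4jblob.blob.core.windows.net/resources/wordvectors/…
@CrabMan We get a bill in the thousands every month for hosting that file, and it has used up all of our AWS credits.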