Word2vec saved model is not UTF-8 encoded but the sentence input to the Word2vec model is UTF-8 encoded

【Question title】: Word2vec saved model is not UTF-8 encoded but the sentence input to the Word2vec model is UTF-8 encoded
【Posted】: 2017-06-22 23:06:43
【Question】:

I trained a word2vec model with the gensim package and saved it under the following name.

model_name = "300features_1minwords_10context"
model.save(model_name)

I got these INFO log messages while the model was being trained and saved.

INFO : not storing attribute syn0norm
INFO : not storing attribute cum_table

Then I tried to load the model with this:

from gensim.models import Word2Vec
model = Word2Vec.load("300features_1minwords_10context")

I get the following error.

2017-06-22 21:27:14,975 : INFO : loading Word2Vec object from 300features_1minwords_10context
2017-06-22 21:27:15,496 : INFO : loading wv recursively from 300features_1minwords_10context.wv.* with mmap=None
2017-06-22 21:27:15,497 : INFO : setting ignored attribute syn0norm to None
2017-06-22 21:27:15,498 : INFO : setting ignored attribute cum_table to None
2017-06-22 21:27:15,499 : INFO : loaded 300features_1minwords_10context
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-9d90db0f07c0> in <module>()
      1 from gensim.models import Word2Vec
      2 model = Word2Vec.load("300features_1minwords_10context")
----> 3 model.syn0.shape

AttributeError: 'Word2Vec' object has no attribute 'syn0'

Also, when I open the file "300features_1minwords_10context", it shows:

"300features_1minwords_10context" is not UTF-8 encoded
Saving disabled.
Open console for more details

To fix the attribute error above, I also tried the following, which I found on a Google Groups forum:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format("300features_1minwords_10context")
model.syn0.shape

This resulted in another error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The model was trained on UTF-8 encoded sentences. I am not sure why this error is thrown.

More information:

import pandas as pd

df = pd.read_csv('UNSPSCdataset.csv', encoding='mac_roman', low_memory=False)
features = ['MaterialDescription']
temp_features = df[features]
temp_features.to_csv('materialDescription', encoding='UTF-8')
X = pd.read_csv('materialDescription',encoding='UTF-8')

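Here I had to use the 'mac_roman' encoding to read the file into a pandas DataFrame. Since the text in the DataFrame has to be in UTF-8 when training the model, I saved that particular feature to a separate CSV file encoded as UTF-8 and then read that column back.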
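
For reference, the re-encoding step described above can be reproduced without pandas. This is a minimal sketch using only the standard library; the sample string is made up for illustration and does not come from UNSPSCdataset.csv.

```python
# Hypothetical sample text; not taken from the real dataset.
text = "café 10 mm"

# The bytes as they would appear in a Mac OS Roman encoded file.
mac_bytes = text.encode("mac_roman")

# Reading those bytes as UTF-8 fails, which is why the original CSV
# has to be opened with encoding='mac_roman'.
try:
    mac_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Decode with the correct codec, then re-encode as UTF-8.
utf8_bytes = mac_bytes.decode("mac_roman").encode("utf-8")
roundtrip = utf8_bytes.decode("utf-8")

print(decodes_as_utf8, roundtrip == text)
```

In Python 3 the decoded text is an ordinary str, so once the file has been read with the right codec, the encoding of the original file should no longer matter to code that consumes the strings.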

Any help is appreciated.

【Question comments】:

Tags: python-3.x utf-8 nlp gensim word2vec

【Answer 1】:

Are you using the latest gensim? If not, be sure to try it; older versions occasionally had save()/load() bugs.

The "not storing" INFO log lines are normal. They do not indicate any problem (and so could be removed from your question).

Are you getting the "no attribute" error directly on the load()? (A full error stack would be useful here and would clarify this.)

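Update: From the error stack now shown, the error occurs when you try to access model.syn0.shape, not during load(). Recent versions of gensim no longer keep syn0 as an attribute of the Word2Vec object itself; that information has moved to the constituent KeyedVectors object held in the wv attribute. So model.wv.syn0.shape will likely access what you are looking for, without an error.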
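
The attribute move described above can be handled defensively. The following is only a sketch: `embedding_matrix` is a hypothetical helper, not part of gensim, and the fallback to the name `vectors` is an assumption based on later gensim releases, where (as far as I know) the array formerly called syn0 is exposed as wv.vectors. The stand-in objects exist only so the snippet runs without a trained model.

```python
from types import SimpleNamespace


def embedding_matrix(model):
    """Return the word-vector array from a Word2Vec-like or KeyedVectors-like object.

    Hypothetical helper for illustration; it is not a gensim function.
    """
    # A Word2Vec model keeps its vectors on model.wv; a bare KeyedVectors
    # object has no .wv, so fall back to the object itself.
    keyed_vectors = getattr(model, "wv", model)
    # 'syn0' is the name used in this thread; 'vectors' is assumed to be
    # the name used by later gensim releases.
    for name in ("vectors", "syn0"):
        array = getattr(keyed_vectors, name, None)
        if array is not None:
            return array
    raise AttributeError("no word-vector array found on model or model.wv")


# Stand-ins that mimic only the attribute layout, not real gensim objects.
model_like = SimpleNamespace(wv=SimpleNamespace(syn0=[[0.1, 0.2, 0.3]]))
kv_like = SimpleNamespace(vectors=[[0.4, 0.5, 0.6]])

print(embedding_matrix(model_like))
print(embedding_matrix(kv_like))
```

With a real model loaded through Word2Vec.load(), the same call would be `embedding_matrix(model).shape`.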

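When your model is large, save() may create multiple files, with extra extensions, for the model's large array attributes (such as syn0). These files must be kept alongside the main file for the model to be load()ed again. Is it possible that you moved the 300features_1minwords_10context file (but not any such companion files) to a new location, where load() finds only an incomplete set?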
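
One way to check for companion files, and to move a model without losing them, is sketched below. It assumes the companions sit in the same directory and are named with the main filename followed by a dot, as the `300features_1minwords_10context.wv.*` pattern in the load log suggests; `model_files` and `copy_model` are hypothetical helpers, not gensim functions.

```python
import shutil
from pathlib import Path


def model_files(main_path):
    """Return the main model file plus sibling files named '<main name>.<something>'."""
    main = Path(main_path)
    prefix = main.name + "."
    companions = sorted(
        p for p in main.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
    return [main] + companions


def copy_model(main_path, dest_dir):
    """Copy the main file and all of its companions into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(p, dest)) for p in model_files(main_path)]
```

If `model_files("300features_1minwords_10context")` returns only the main file, then either the model was small enough to be stored in a single file or the companion files were left behind somewhere else.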

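You can't load_word2vec_format() a file that was written with gensim's native save(). They are entirely different formats, so the encoding error is just an artifact of trying to read a binary Python pickle file (from save()) as if it were a completely different format.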
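
The 0x80 byte in the UnicodeDecodeError is consistent with this explanation: pickles written with protocol 2 or higher begin with the byte 0x80. The sketch below reproduces the same error with the standard library alone; `starts_like_pickle` is a hypothetical helper, and the pickled dictionary is a stand-in for a saved model, not an actual gensim file.

```python
import pickle


def starts_like_pickle(first_bytes):
    """Return True if the data begins with 0x80, the opcode that opens protocol 2+ pickles."""
    return first_bytes[:1] == b"\x80"


# Stand-in for a file written by save(); any pickled object starts the same way.
payload = pickle.dumps({"any": "object"}, protocol=pickle.HIGHEST_PROTOCOL)

# Decoding the pickle as UTF-8 text fails on the very first byte.
try:
    payload.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(starts_like_pickle(payload))
print(error_message)
```

Reading the first byte of 300features_1minwords_10context in binary mode and passing it to `starts_like_pickle` would be a quick way to confirm that the file is a pickle and therefore belongs with Word2Vec.load() rather than load_word2vec_format().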

【Comments】:

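Hi @gojomo, I am using the latest gensim. Yes, I am getting the "no attribute" error directly on the load(), and I have attached the error stack to the question. Nothing like syn0 is generated. There isn't even anything in the saved model file. I guess the model was not saved at all because of a Unicode problem. When I print the sentences, I do not get Unicode markers like this: u'Airtel'
See the inline update above; the actual error is not in load() but in your own next line, which tries to access model.syn0. In recent gensim versions syn0 no longer exists there; it has moved to model.wv.syn0.
Hi @gojomo, please try this. stackoverflow.com/questions/44740161/…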