【Question title】: How to visualize Gensim Word2vec Embeddings in Tensorboard Projector
【Posted】: 2021-11-13 00:19:56
【Question】:

Following the gensim word2vec embedding tutorial, I trained a simple word2vec model:

from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
model.save("/content/word2vec.model")

I would like to visualize it using the Embedding Projector in TensorBoard. There is another straightforward tutorial in the gensim documentation. I did the following in Colab:

!python3 -m gensim.scripts.word2vec2tensor -i /content/word2vec.model -o /content/my_model

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 94, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 68, in word2vec2tensor
    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py", line 1438, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py", line 172, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/usr/local/lib/python3.7/dist-packages/gensim/utils.py", line 355, in any2unicode
    return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

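Please note that I did first check this exact same question from 2018, but the accepted answer no longer works because both gensim and tensorflow have been updated, so I think it is worth asking again in Q4 2021.

【Comments】:

  • Can you be more specific about how the older info "no longer works"? (Does it hit a specific error? Give results that don't look right? Etc.) If you show any specific error in your question, there may be a trivial code update that fixes it, for either package, such as the various tips given in the Gensim 4 migration guide: github.com/RaRe-Technologies/gensim/wiki/…).
  • Could you please refer to this doc? I hope it helps. Thanks.

Tags: python tensorflow gensim word2vec tensorboard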


【Solution 1】:

Saving the model in the original C word2vec implementation format resolves the issue: model.wv.save_word2vec_format("/content/word2vec.model"):

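from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Train on gensim's small built-in corpus.
# Note: in Gensim 4 the `size` parameter is named `vector_size`.
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)

# Export only the word vectors, in the original word2vec (text) format,
# instead of calling model.save(), which writes gensim's own format.
model.wv.save_word2vec_format("/content/word2vec.model")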
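
With the vectors exported this way, the word2vec2tensor command from the question should be able to read the file. As a rough illustration of what that conversion amounts to, here is a stdlib-only sketch, not gensim's actual implementation. The helper name `word2vec_text_to_tsv` is hypothetical, the `_tensor.tsv` and `_metadata.tsv` suffixes are assumed to mirror the script's output naming, and the sample vectors are made up for the demo.

```python
import os
import tempfile


def word2vec_text_to_tsv(w2v_path, out_prefix):
    """Split a word2vec *text* file into a vectors TSV and a labels TSV.

    Hypothetical helper for illustration; it assumes the usual text layout:
    a header line "<vocab_size> <dim>", then one "<word> <v1> ... <vN>" line
    per word, with no spaces inside the words themselves.
    """
    tensor_path = out_prefix + "_tensor.tsv"      # assumed suffix
    metadata_path = out_prefix + "_metadata.tsv"  # assumed suffix
    with open(w2v_path, encoding="utf-8") as src, \
            open(tensor_path, "w", encoding="utf-8") as tensor, \
            open(metadata_path, "w", encoding="utf-8") as meta:
        vocab_size, dim = map(int, src.readline().split())
        rows = 0
        for line in src:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                raise ValueError(f"expected {dim} values for {word!r}")
            meta.write(word + "\n")                  # one label per row
            tensor.write("\t".join(values) + "\n")   # tab-separated vector
            rows += 1
        if rows != vocab_size:
            raise ValueError(f"header says {vocab_size} words, found {rows}")
    return tensor_path, metadata_path


# Demo on a tiny made-up file in the word2vec text layout.
demo_dir = tempfile.mkdtemp()
demo_input = os.path.join(demo_dir, "vectors.txt")
with open(demo_input, "w", encoding="utf-8") as f:
    f.write("2 3\nhuman 0.1 0.2 0.3\ncomputer 0.4 0.5 0.6\n")

tensor_path, metadata_path = word2vec_text_to_tsv(
    demo_input, os.path.join(demo_dir, "my_model")
)
print(tensor_path)
print(metadata_path)
```

The two resulting files are the kind of input the Embedding Projector accepts through its load dialog: one file with the vectors and one with the labels.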

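gensim stores word2vec models in two formats: the keyed-vectors format of the original word2vec implementation, and a format that additionally stores the hidden weights, vocabulary frequencies, and so on. Examples and details can be found in the documentation. The script word2vec2tensor.py expects the original format and loads the model with load_word2vec_format (see the script's source code). model.save() writes the second kind, a Python pickle whose first byte is 0x80, which is why load_word2vec_format raised the UnicodeDecodeError shown in the question.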
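
The error message in the question can be reproduced without gensim. The sketch below is only an illustration: the pickled dictionary is a stand-in for a saved model, on the assumption that model.save() produces a pickle file. It shows that pickled data begins with byte 0x80 and that decoding it as UTF-8, which is what a text-format loader does with the header line, fails at position 0.

```python
import pickle

# Stand-in for the bytes that a pickle-based save would write to disk.
payload = pickle.dumps({"note": "stand-in for a saved model"})

# Pickle protocol 2 and later start with the PROTO opcode, byte 0x80.
first_byte = payload[0]
print(hex(first_byte))

# A loader for the text format decodes the first line as UTF-8.
try:
    payload.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

print(decode_error)
```

Seeing this error from load_word2vec_format is therefore a hint that the input file was written with save() rather than save_word2vec_format().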

【Comments】:

  • Could you provide an end-to-end runnable answer, including a brief explanation of the problem?
  • I have added the details.