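【Posted】:2020-10-27 00:30:14
【Question】:
I am training a word2vec model using about 700 text files as my corpus. However, when I start reading the files after the preprocessing step, I get the error shown below. The code is as follows: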
import logging
import multiprocessing

from gensim.models import Word2Vec


class MyCorpus(object):
    def __iter__(self):
        for i in ceo_path:  # ceo_path contains the absolute paths of all text files
            with open(i, 'r', encoding='utf-8') as file:
                text = file.read()
            ###########
            ########### text preprocessing steps
            ###########
            yield final_text  # the preprocessed text


sentences = MyCorpus()
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

# training the model
cores = multiprocessing.cpu_count()
w2v_model = Word2Vec(min_count=5,
                     iter=30,
                     window=3,
                     size=200,
                     sample=6e-5,
                     alpha=0.025,
                     min_alpha=0.0001,
                     negative=20,
                     workers=cores-1,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
w2v_model.save('ceo1.model')
The error I get is:
Traceback (most recent call last):
  File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 131, in <module>
    w2v_model.build_vocab(sentences)
  File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\base_any2vec.py", line 921, in build_vocab
    total_words, corpus_count = self.vocabulary.scan_vocab(
  File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1403, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1372, in _scan_vocab
    for sentence_no, sentence in enumerate(sentences):
  File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 65, in __iter__
    text = file.read()
  File "C:\Users\name\AppData\Local\Programs\Python\Python38-32\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
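Since I am new to this, I cannot make sense of the error. Before I started using the __iter__ method to send the data in chunks, as I do now, I did not get any error when reading the text files.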
【Discussion】:
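The message says that the very first byte of some file (position 0) is 0x80, which can never start a UTF-8 sequence, so at least one of the files is probably not UTF-8 text, or not a text file at all. Below is a minimal sketch for finding which files are affected. It assumes the paths are available as a list, like ceo_path in the question; find_undecodable_files is a hypothetical helper name, not part of gensim, and the demo files are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def find_undecodable_files(paths, encoding="utf-8"):
    """Return (path, error) pairs for files whose bytes do not decode with `encoding`."""
    failures = []
    for path in paths:
        raw = Path(path).read_bytes()  # read raw bytes so nothing is decoded implicitly
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            failures.append((str(path), exc))
    return failures


if __name__ == "__main__":
    # Demo with two throwaway files: one valid UTF-8, one starting with byte 0x80.
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.txt")
        bad = os.path.join(tmp, "bad.txt")
        with open(good, "wb") as f:
            f.write("plain text".encode("utf-8"))
        with open(bad, "wb") as f:
            f.write(b"\x80 not utf-8")
        for path, exc in find_undecodable_files([good, bad]):
            print(path, "->", exc)
```

Running find_undecodable_files(ceo_path) before building the corpus would list the files to inspect, re-encode, or leave out.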
Tags: python-3.x gensim word2vec