Answers: KeyError: word ' ' not in vocabulary (Word2Vec)

【Question title】: KeyError: word ' ' not in vocabulary (Word2Vec)
【Posted】: 2021-05-16 01:15:07
【Question】:

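I am working on a Python project that uses Word2Vec to recommend products. The code runs fine on my dataset of size 19401, but whenever I pass in a product's ID I get the error "KeyError: word '1077' not in vocabulary". I don't know how to fix this; I know very little about the topic and am still learning. Please help me solve this!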

purchases_train = []

for i in tqdm(product_train):
    temp = train_df[train_df["Clothing ID"] == i]["Review Text"].tolist()
    purchases_train.append(temp)


purchases_val = []

for i in tqdm(validation_df['Clothing ID'].unique()):
    temp = validation_df[validation_df["Clothing ID"] == i]["Review Text"].tolist()
    purchases_val.append(temp)



model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, # for negative sampling
                 alpha=0.03, min_count= 1 , min_alpha=0.0007,
                 seed = 14)


model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)

# save word2vec model
model.save("word2vec_2.model")


model.init_sims(replace=True)

# extract all vectors
X = model[model.wv.vocab]

products = train_df[["Clothing ID", "Review Text"]]

# remove duplicates
products.drop_duplicates(inplace=True, subset='Clothing ID', keep="last")

# create product-ID and product-description dictionary
products_dict = products.groupby('Clothing ID')['Review Text'].apply(list).to_dict()


def similar_products(v, n = 6):
    
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v, topn= n+1)[1:]
    
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
        
    return new_ms


similar_products(model['1077'])

【Comments on the question】:

Please post the full traceback of the error, along with a sample of the data you are working with.

Tags: python word2vec keyerror

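【Solution 1】:

If you get the error word '847' not in vocabulary, you can be certain of one thing: the token '847' was never supplied in your training data.

If you believe it was, inspect the data that was actually passed to the model; you will find that it is not there.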
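One way to run that check, as a minimal sketch: gensim treats every element of each inner list as one token, so counting those elements shows what the vocabulary can contain. `token_counts` and the toy corpus below are hypothetical illustrations, not part of the question's code; the toy corpus only mimics the shape of `purchases_train` (one inner list of review texts per product).

```python
from collections import Counter


def token_counts(sentences):
    """Count every token in a corpus shaped like Word2Vec input:
    an iterable of sentences, each sentence a list of tokens."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts


# Hypothetical stand-in for purchases_train from the question: one inner
# list per Clothing ID, holding that product's review texts (made-up strings).
toy_purchases_train = [
    ["Love this dress", "Runs small"],
    ["Great fabric"],
]

counts = token_counts(toy_purchases_train)

# Is the product ID one of the tokens the model would be trained on?
print("1077" in counts)

# The tokens here are whole review strings, not IDs.
print("Runs small" in counts)
```

If the real `purchases_train` behaves like the toy data, the vocabulary holds whole review texts rather than Clothing IDs, which would explain why '1077' is missing.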

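If your code needs to do something useful with words that are not in the training data, extend it to either:

(1) check whether a word is present before trying to fetch its vector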

    # look the word up on model.wv, the model's trained word vectors
    if '847' in model.wv:
        similar_products(model.wv['847'])
    else:
        # do something else
        ...

...或...

(2) catch the KeyError and do something else when it is raised.
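Option (2) might look like the sketch below. The helper name `vector_or_default` is hypothetical, and a plain dict stands in for the model's word vectors so the example runs without a trained model; it assumes only that the lookup raises KeyError for an unknown token, as in the error from the question.

```python
def vector_or_default(keyed_vectors, token, default=None):
    """Return the vector stored for `token`, or `default` if the lookup
    raises KeyError because the token is not in the vocabulary."""
    try:
        return keyed_vectors[token]
    except KeyError:
        return default


# Hypothetical stand-in for the trained vectors (made-up values).
toy_vectors = {"1078": [0.1, 0.2, 0.3]}

vec = vector_or_default(toy_vectors, "847")
if vec is None:
    # Fallback path: the token was never seen during training.
    print("'847' is not in the vocabulary")
else:
    print(vec)
```

With a real model you would pass the object you normally index for vectors in place of `toy_vectors`, and call `similar_products` only when a vector comes back.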

【Discussion】: