【Question title】: Why are my TF-IDF features per sample different for train and test inputs?
【Posted】: 2020-06-08 00:30:14
【Question】:

TF-IDF is raising a ValueError. The code was working fine before it started throwing this error.

tf_idf_vectorizer = TfidfVectorizer(ngram_range=(2,2))
tf_train=tf_idf_vectorizer.fit_transform(X_train)
tf_test= tf_idf_vectorizer.transform(X_test)
model=LogisticRegression()
model.fit(X_train,y_train)
y_predict=model.predict(X_test)

ValueError: X has 97624 features per sample; expecting 11

【Comments】:

    Tags: python machine-learning scikit-learn tf-idf


    【Solution 1】:

    It should be model.fit(tf_train, y_train), followed by model.predict(tf_test):

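    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    tf_idf_vectorizer = TfidfVectorizer(ngram_range=(2, 2))

    # Learn the bigram vocabulary on the training text only, then reuse it for the test text.
    tf_train = tf_idf_vectorizer.fit_transform(X_train)
    tf_test = tf_idf_vectorizer.transform(X_test)

    model = LogisticRegression()

    # Fit and predict on the TF-IDF matrices, not on the raw X_train / X_test.
    model.fit(tf_train, y_train)

    y_predict = model.predict(tf_test)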
    

    Fit the model on the fit_transform output, i.e. tf_train, and apply model.predict to the transformed test input, i.e. tf_test.
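
For reference, here is a minimal self-contained sketch of the same flow. The toy sentences and labels are invented purely for illustration and are not the asker's data. The point is only that both matrices come from the same fitted vectorizer and therefore have the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only (not the asker's data).
X_train = [
    "the movie was really good",
    "the movie was really bad",
    "a really good film overall",
    "a really bad film overall",
]
y_train = [1, 0, 1, 0]
X_test = ["the film was really good", "the film was really bad"]

vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tf_train = vectorizer.fit_transform(X_train)  # learns the bigram vocabulary
tf_test = vectorizer.transform(X_test)        # reuses that same vocabulary

model = LogisticRegression()
model.fit(tf_train, y_train)        # fit on the transformed training matrix
y_predict = model.predict(tf_test)  # predict on the transformed test matrix

# Row counts differ (samples), column counts match (features).
print(tf_train.shape, tf_test.shape)
```

Bigrams that occur only in the test text are simply ignored by transform, which is why the column count stays fixed.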


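    As a sanity check, compare the number of columns of what you pass to fit and to predict, for example tf_train.shape[1] and tf_test.shape[1]; they must be equal. Note that len() reports the number of samples, not features. The message below means that predict received 97624 features per sample while the model had been fitted on data with 11, and that mismatch is where this error comes from: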

    ValueError: X has 97624 features per sample; expecting 11
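
To see where the two numbers in such a message come from, the following sketch reproduces a mismatch on random arrays. The array sizes and the helper names (`X_fit`, `X_wide`, `n_expected`, `mismatch_raised`) are made up for the demonstration, and the exact wording of the error differs between scikit-learn versions, so only the exception type is relied on here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: the model is fitted on 11 columns.
X_fit = rng.normal(size=(20, 11))
y = np.array([0, 1] * 10)
model = LogisticRegression().fit(X_fit, y)

# The fitted coefficient matrix has one column per training feature.
n_expected = model.coef_.shape[1]

# Hypothetical input with a different width (97 columns instead of 11).
X_wide = rng.normal(size=(5, 97))

mismatch_raised = False
try:
    model.predict(X_wide)
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

print("model expects", n_expected, "features; got", X_wide.shape[1])
```

The first number in the message is the width of the array given to predict, and the second is the width the model saw during fit.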

    P.S. Take a careful look at https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
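
One way to make this kind of mix-up harder to repeat is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that raw text goes in at both fit and predict time. This is an alternative arrangement, not what the code above does, and the toy data is again invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
X_train = [
    "the movie was really good",
    "the movie was really bad",
    "a really good film overall",
    "a really bad film overall",
]
y_train = [1, 0, 1, 0]
X_test = ["the film was really good", "the film was really bad"]

# The pipeline calls fit_transform on the vectorizer during fit
# and transform during predict, so the feature space always matches.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(2, 2)),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)
y_predict = pipeline.predict(X_test)
print(y_predict)
```

With this setup there is no separate tf_train or tf_test variable to pass by mistake.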

    【Comments】:
