【Question Title】: How to add a predicted-data column to my dataframe?
【Posted】: 2023-04-01 04:02:02
【Question Description】:

I am using Naive Bayes to predict country names from a set of addresses. This is what I have tried:

import re
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
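def normalize_text(s):
    s = s.lower()
    # Replace whitespace followed by a non-word character (and the reverse) with one space
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # Collapse runs of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)
    return s
df['TEXT'] = [normalize_text(s) for s in df['Full_Address']]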

# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])

encoder = LabelEncoder()
y = encoder.fit_transform(df['CountryName'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)

What I want is to add another column to my dataframe containing the predicted country names. How can I do that?

Update:

df['Predicted'] = nb.predict(x)

         CountryName                                       Full_Address  \
8913       Indonesia  EJIP Industrial Park Plot 1E-2, Sukaresmi, Cik...   
7870   United States    360 Thelma Street, Sandusky, Michigan 48471 USA   
32037          China  1027, 26/F, Zhao Feng Mansion, Chang Ning Road...   
38769    New Zealand  NZ - 164 ST. ASAPH STREET, \tCHRISTCHURCH 8011...   
46639          India  301-306, Sahajanand Trade Center, Opp. Kothawa...   

                                                    TEXT  Predicted  
8913   ejip industrial park plot 1e-2 sukaresmi cikar...         66  
7870       360 thelma street sandusky michigan 48471 usa        169  
32037  1027 26/f zhao feng mansion chang ning road sh...         30  
38769  nz 164 st asaph street christchurch 8011 new z...        112  
46639  301-306 sahajanand trade center opp kothawala ...         65

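【Comments】:

  • Maybe I'm missing something, but I think it should basically be df['Predicted'] = nb.predict(x)
  • It shows a column of integers instead of country names. I'm sure each integer stands for a country name, but how do I actually get the country names? @piterbarg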
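
The integers in the Predicted column are the codes that LabelEncoder assigned to the country names. A minimal sketch of how such a code relates to a name, using made-up labels rather than the asker's data: after fitting, `classes_` holds the sorted unique labels, and each code is an index into that array.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, made up for illustration (not the asker's data).
countries = ["India", "China", "India", "New Zealand"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted unique labels; each code is an index into it.
code_to_name = dict(enumerate(encoder.classes_))

# Indexing classes_ with the codes recovers the original labels.
names = encoder.classes_[codes]
```

Here `code_to_name` is just a hypothetical helper name for inspecting the mapping; it is not needed to build the column itself.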

Tags: python pandas scikit-learn naivebayes multinomial


【Solution 1】:

Apply the inverse of the encoder.fit_transform call that you used on y to the model's output. Something like:

df['Predicted'] = encoder.inverse_transform(nb.predict(x))

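This assumes that the output of nb.predict(x) is a list of integers (not a list of lists). If it is not, you may need to do some reshaping. Since I cannot run your code without access to df, I can't say for sure.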
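
As a rough end-to-end check of the idea, here is a small sketch on a hypothetical four-row dataframe that stands in for the asker's df (the addresses are invented, and the train/test split is skipped to keep it short). In scikit-learn, `MultinomialNB.predict` returns a 1-D array with one class code per row, so in this setup `encoder.inverse_transform` can be applied directly without reshaping.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the asker's df.
df = pd.DataFrame({
    "TEXT": [
        "360 thelma street sandusky michigan usa",
        "12 main street detroit michigan usa",
        "164 st asaph street christchurch new zealand",
        "5 queen street auckland new zealand",
    ],
    "CountryName": [
        "United States", "United States", "New Zealand", "New Zealand",
    ],
})

# Bag-of-words features and integer-encoded labels, as in the question.
x = CountVectorizer().fit_transform(df["TEXT"])
encoder = LabelEncoder()
y = encoder.fit_transform(df["CountryName"])

nb = MultinomialNB().fit(x, y)

# predict returns a 1-D array of integer class codes, one per row.
codes = nb.predict(x)

# Map the codes back to the original country names.
df["Predicted"] = encoder.inverse_transform(codes)
```

Because `x` was built from every row of df in its original order, the predictions line up with the dataframe rows one to one.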

【Comments】:
