How to count the number of a label repeated for each row of excel in Python or Excel? Answers

【Question title】: How to count the number of a label repeated for each row of excel in Python or Excel?
【Posted】: 2020-08-27 11:51:40
【Question description】:

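I have an Excel file with 10K rows, each holding information about one tweet. The columns include: Tweet, Date of Tweet, User Name, Retweet Count, ..., User Location, Sentiment (values are Positive, Negative, or Neutral), State (the 50 US states), Abbreviation (the state abbreviations, such as CA, NJ, NY, ...), and CountofNegative (currently empty; I want to write the number of negative tweets for each state into this column, so it will contain 50 numbers).

Below you can see a screenshot of this dataset:

Problem: count the number of negative tweets for each state (matched by full name or by abbreviation) and write the result to the CountofNegative column. Here is my code: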

import pandas as pd

file=pd.read_excel("C:/Users/amtol/Desktop/Project/filter.xlsx")
UserLocation= file["User Location"]
Sentiment= file["Sentiment"]
CountofNegative= file["CountofNegative"]
State=file["State"]
Abbreviation= file["Abbreviation"]

for i, (loc,sent) in enumerate(zip(UserLocation, Sentiment)):
    count=0
    for j, (state, abbr) in enumerate(zip(State, Abbreviation)):
        if (loc == state or loc == abbr and sent == "Negative"):
            count=count+1
        file.loc[j+1,"CountofNegative"]=count

print(CountofNegative)

file.to_excel("C:/Users/amtol/Desktop/Project/filter.xlsx")

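There is no error, but in the output file the first 24 values of the "CountofNegative" column are zero and the rest are 1 (which is not the correct answer). I also tried to test the program with print(CountofNegative), but nothing happened (no output). How can I fix my code?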
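Two details of the loop above are worth noting; this is a reading of the code as posted, not a tested fix against the real file. First, `count` is reset for every tweet and `file.loc[j+1, "CountofNegative"]` is overwritten on every pass of the inner loop, so once both loops finish the column holds only the running count for the last tweet. That would be consistent with a run of zeros followed by ones. Second, Python evaluates `and` before `or`, so the condition is parsed as `loc == state or (loc == abbr and sent == "Negative")`. A minimal sketch of that second point, using made-up values:

```python
# Python parses `a or b and c` as `a or (b and c)`.
loc, state, abbr, sent = "California", "California", "CA", "Positive"

# Condition as written in the question: a full-name match alone makes it true,
# even though the sentiment is not "Negative".
as_written = (loc == state or loc == abbr and sent == "Negative")

# Condition with explicit grouping: the sentiment check applies to both matches.
grouped = ((loc == state or loc == abbr) and sent == "Negative")

print(as_written)  # True
print(grouped)     # False
```

With the explicit parentheses, a positive tweet from "California" is no longer counted.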

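【Comments】:

Please provide a reproducible copy of the DataFrame with df.to_clipboard(sep=','). Stack Overflow discourages screenshots. This question is likely to be downvoted. You are discouraging help, because nobody wants to retype your data or code, and screenshots are often illegible.
If you don't want to share the data, or the data is too large, please post sample data that looks like the real data.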

Tags: python python-3.x excel pandas twitterapi-python

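【Solution 1】:

OK, so if full state names and abbreviations are mixed inconsistently, first convert the full names to abbreviations using the dictionary in the code. If some names or abbreviations are wrong, adjust the dict accordingly.

Since we only care about the "Negative" count, convert Negative to 1 and every other response to 0, as follows: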

import pandas as pd

# Create a sample dataset
data = {
    'State': ['New York', 'New York', 'New York', 'New Jersey', 'New Jersey', 'New Jersey', 'California', 'California', 'California', 'NY', 'NJ', 'CA'],
    'Sentiment': ['Negative', 'Positive', 'Negative', 'Neutral', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Negative'],
}
df = pd.DataFrame(data, columns=['State', 'Sentiment'])
print(df)

# Dictionary of US states and abbreviations
di = {
'Alabama': 'AL',
'Alaska': 'AK',
'American Samoa': 'AS',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Guam': 'GU',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY'
}

# Map full state names in the State column to their abbreviations
df = df.replace({"State": di})

# Give weight only to negative comments: Negative -> 1, everything else -> 0
def convert_to_int(word):
    word_dict = {'Negative': 1, 'Positive': 0, 'Neutral': 0, 0: 0}
    return word_dict[word]

# Convert the Sentiment column using the function above
df['Sentiment'] = df['Sentiment'].apply(convert_to_int)

# Finally, sum the negative flags per state and broadcast the sum to every row
df['negative_sum'] = df['Sentiment'].groupby(df['State']).transform('sum')


# My final output

   State  Sentiment  negative_sum
0     NY          1             3
1     NY          0             3
2     NY          1             3
3     NJ          0             1
4     NJ          1             1
5     NJ          0             1
6     CA          0             1
7     CA          0             1
8     CA          0             1
9     NY          1             3
10    NJ          0             1
11    CA          1             1

Now you can optionally convert the Sentiment column back to strings, since we have the negative_sum column we need. I hope this serves the purpose.
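For comparison, the same idea can be written without the helper function and without overwriting the Sentiment column. The following is only a sketch under assumptions: the sample rows and the three-entry `name_to_abbr` mapping are made up for illustration, and a real mapping would need to cover every state that occurs in the data.

```python
import pandas as pd

# Small made-up sample mixing full names and abbreviations
df = pd.DataFrame({
    "State": ["New York", "NY", "California", "CA", "CA", "New Jersey"],
    "Sentiment": ["Negative", "Negative", "Positive", "Negative", "Negative", "Neutral"],
})

# Only the states used in this sample (hypothetical, deliberately incomplete)
name_to_abbr = {"New York": "NY", "California": "CA", "New Jersey": "NJ"}

# Normalise to abbreviations; values that are already abbreviations stay as they are
state = df["State"].replace(name_to_abbr)

# 1 for a negative tweet, 0 otherwise; the original Sentiment column is kept intact
is_negative = df["Sentiment"].eq("Negative").astype(int)

# Sum the flags per normalised state and broadcast the sum back to every row
df["negative_sum"] = is_negative.groupby(state).transform("sum")

print(df)
```

Because the grouping key is the normalised series rather than the raw column, "New York" and "NY" fall into the same group, and the Sentiment strings never need to be converted back.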

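【Comments】:

I have edited my question to make it clearer. Could you please read it again? Maybe this time you can understand my problem better. Thanks
Hey, I have now edited my answer according to the question. Please see whether this serves the purpose.
Thanks. So you changed the label "Negative" to 1 and the other labels to 0, and then counted per state. Actually, I want to find the negative labels for a state together with its abbreviation. For example, if CA has 3 negatives and California has 2, the program should put 5 in cell J6. Also, how can we create this new column with the number of negatives per state in the current Excel file? What do you think of my code? Is it completely wrong, or can you fix it?
Hey, I made some changes to the code. Please see whether this serves the purpose. If some state abbreviations and full names don't match, edit the dictionary yourself. The last line of the code already adds a new negative_sum column. Save this df to Excel so that the file contains the new column along with the original ones.
I added a line at the end to save the df to Excel: df.to_excel("C:/Users/amtol/Desktop/Project/filter5.xlsx"). I see that the "negative_sum" column was created, but all of its values are 0 or 1. In this dataset I have 186 tweets (rows) and 50 states, which means some states should have a "negative_sum" greater than 1. Is the last line I added to save the file to Excel wrong, or is it your code?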
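A possible explanation for the 0-or-1 result, assuming the sheet is laid out as the question describes: if the State column is a 50-row lookup list in which each state appears once, grouping by State puts every row in its own group, so each sum is just that row's own flag. The tweets would need to be grouped by User Location instead. The sketch below illustrates this under stated assumptions: the miniature DataFrame is invented, the column names are taken from the question, and locations are matched only after trimming whitespace and lower-casing, so free-text locations such as a city name would still be ignored.

```python
import pandas as pd

# Hypothetical miniature of the sheet: tweet columns fill every row, while
# State / Abbreviation form a lookup list in the first rows only.
df = pd.DataFrame({
    "User Location": ["CA", "California", "NY", "california ", "Texas", "Paris"],
    "Sentiment": ["Negative", "Negative", "Positive", "Negative", "Negative", "Negative"],
    "State": ["California", "New York", "Texas", None, None, None],
    "Abbreviation": ["CA", "NY", "TX", None, None, None],
})

lookup = df[["State", "Abbreviation"]].dropna()

# Map both spellings (lower-cased) to the abbreviation
to_abbr = {}
for state, abbr in zip(lookup["State"], lookup["Abbreviation"]):
    to_abbr[state.lower()] = abbr
    to_abbr[abbr.lower()] = abbr

# Normalise each tweet's location; anything not in the lookup becomes NaN
location = df["User Location"].astype(str).str.strip().str.lower().map(to_abbr)

# Count negative tweets per abbreviation (NaN locations are dropped)
neg_counts = location[df["Sentiment"].eq("Negative")].value_counts()

# Write each count next to its state row; listed states with no negatives get 0
has_state = df["Abbreviation"].notna()
df["CountofNegative"] = df["Abbreviation"].map(neg_counts)
df.loc[has_state, "CountofNegative"] = df.loc[has_state, "CountofNegative"].fillna(0)

print(df[["State", "Abbreviation", "CountofNegative"]])
```

Here "CA", "California", and "california " all count toward the California row. To keep the result in a workbook, the DataFrame can then be written with df.to_excel(path, index=False), which needs an Excel writer engine such as openpyxl to be installed.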