(Python) How can I compare 2 or more columns with Pandas? Answers

【Question title】: (Python) How can I compare 2 or more columns with Pandas?
【Posted】: 2020-10-06 02:22:30
【Question description】:

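I have been using the pandas module for data scraping, and although I understand how to (), I am still unsure how to compare 2 or more columns of a CSV. Taking the code below as an example, I would like to know, for instance, the 3 publishers that released the most Action, Shooter and Platform games, respectively. I wrote the code below, but the output shows "False" instead of the name of the genre. At least I believe the top 3 publishers are correct, but I am not sure. Could someone take a look?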

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

a = data['Publisher'].groupby((data['Genre'] == 'Action')).value_counts().head(3)
print(a)

s = data['Publisher'].groupby((data['Genre'] == 'Shooter')).value_counts().head(3)
print(s)

p = data['Publisher'].groupby((data['Genre'] == 'Platform')).value_counts().head(3)
print(p)

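In addition, I am supposed to find the top 3 publishers with the highest sales of Action, Shooter and Platform games. I tried writing this, but it did not work. How can I use 3 items from the same column at once and compare them against 2 other columns? And what if I wanted to include a time range, for example comparing all of these columns over the last 10 years?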

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

a = ((data['Genre'] == 'Action') & (data['Genre'] == 'Shooter') & (data['Genre'] == 'Platform')).groupby((data['Publisher']) & (data['Global_Sales'])).value_counts().head(3)
print(a)

【Question discussion】:

Tags: python pandas csv

【Solution 1】:

That is a lot of questions at once:

a = data['Publisher'].groupby((data['Genre'] == 'Action')).value_counts().head(3)
print(a)

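In a groupby you do not specify a particular genre such as 'Action'; that is what a query (a filter) is for. Grouping by the boolean Series data['Genre'] == 'Action' produces only two groups, False and True, which is why your output shows "False" instead of a genre name. The point of groupby is to run the same calculation for every Genre, as follows: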

In [11]: number_of_games = data.groupby('Genre')['Publisher'].value_counts()
Out[11]: 
Genre     Publisher              
Action    Activision                 311
          Namco Bandai Games         251
          Ubisoft                    198
          THQ                        194
          Electronic Arts            183
                                    ... 
Strategy  Time Warner Interactive      1
          Titus                        1
          Trion Worlds                 1
          Westwood Studios             1
          Zoo Digital Publishing       1
Name: Publisher, dtype: int64

Note that Publisher is selected after the grouping, so pandas internally loops over all values of Genre and runs value_counts on Publisher for each of them.
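To make the difference concrete, here is a minimal sketch on a few made-up rows (the `toy` frame is an assumption for illustration, not the real CSV). It compares grouping by a boolean mask with grouping by the Genre column itself:

```python
import pandas as pd

# Made-up rows standing in for the real CSV (illustrative assumption).
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Shooter", "Action", "Shooter"],
    "Publisher": ["Ubisoft", "Ubisoft", "Activision", "THQ", "Activision"],
})

# Grouping by a boolean Series: the only group keys are False and True.
by_mask = toy["Publisher"].groupby(toy["Genre"] == "Action").value_counts()
mask_keys = by_mask.index.get_level_values(0).unique().tolist()
print(mask_keys)

# Grouping by the column itself: one group per genre name.
by_genre = toy.groupby("Genre")["Publisher"].value_counts()
genre_keys = by_genre.index.get_level_values(0).unique().tolist()
print(genre_keys)
```

The first result is labelled only with False and True, while the second carries the genre names in its index.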

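I am supposed to find the top 3 publishers with the highest sales of Action, Shooter and Platform games

Simply filter for the categories you want, like this: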

In [25]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :]
Out[25]: 
Genre    Publisher         
Action   Activision            311
         Namco Bandai Games    251
         Ubisoft               198
         THQ                   194
         Electronic Arts       183
                              ... 
Shooter  Visco                   1
         Warashi                 1
         Wargaming.net           1
         Xseed Games             1
         id Software             1
Name: Publisher, dtype: int64

Then, once again, you want the 3 largest publishers per genre, so you use another groupby:

In [30]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :].groupby(['Genre']).head(3)
Out[30]: 
Genre     Publisher         
Action    Activision            311
          Namco Bandai Games    251
          Ubisoft               198
Platform  Nintendo              112
          THQ                    85
          Ubisoft                70
Shooter   Activision            162
          Electronic Arts       145
          Ubisoft                92
Name: Publisher, dtype: int64

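The head function implicitly relies on the values already being sorted, which value_counts guarantees here. Alternatively, you can use nlargest: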

In [31]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :].groupby(['Genre']).nlargest(3).droplevel(0)
Out[31]: 
Genre     Publisher         
Action    Activision            311
          Namco Bandai Games    251
          Ubisoft               198
Platform  Nintendo              112
          THQ                    85
          Ubisoft                70
Shooter   Activision            162
          Electronic Arts       145
          Ubisoft                92
Name: Publisher, dtype: int64

The result is the same, but you need to clean up the index with droplevel because Genre appears in it twice.
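The dependence of head on sort order is easiest to see with counts that are not sorted. The following sketch uses hypothetical numbers and placeholder publisher names "A" to "D" (assumptions for illustration only):

```python
import pandas as pd

# Hypothetical counts, deliberately NOT sorted within each genre.
counts = pd.Series(
    [5, 9, 7, 8, 2, 4, 6, 1],
    index=pd.MultiIndex.from_tuples(
        [("Action", "A"), ("Action", "B"), ("Action", "C"), ("Action", "D"),
         ("Shooter", "A"), ("Shooter", "B"), ("Shooter", "C"), ("Shooter", "D")],
        names=["Genre", "Publisher"],
    ),
    name="count",
)

# head(3) takes the first three rows of each group, whatever their values.
first_three = counts.groupby("Genre").head(3)

# nlargest(3) picks the three biggest values per group. The group key is
# prepended as an extra index level, so that duplicate level is dropped.
top_three = counts.groupby("Genre").nlargest(3).droplevel(0)

print(first_three)
print(top_three)
```

On unsorted input only nlargest returns the true top three, so it is the safer choice whenever the counts do not come straight from value_counts.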

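What if I wanted to include a time range, for example comparing all of these columns over the last 10 years?

You obviously need the time information in your data. If you only want games released in the last 10 years, filter the original data down to games released after the cutoff year. If you want to determine which publishers released the most games each year, create a column containing the release year and group by it. As you have already seen with Genre and Publisher, you can group by a list of columns.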
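A rough sketch of both ideas, on made-up rows that reuse the Year_of_Release, Genre and Publisher column names from the question's CSV. Measuring the cutoff from the newest release in the data is an assumption; a fixed year would work the same way:

```python
import pandas as pd

# Made-up rows; only the column names follow the CSV used in the question.
toy = pd.DataFrame({
    "Year_of_Release": [2004, 2008, 2008, 2012, 2012, 2012, 2016],
    "Genre": ["Action", "Action", "Shooter", "Action", "Action", "Shooter", "Action"],
    "Publisher": ["THQ", "Ubisoft", "Activision", "Ubisoft", "Ubisoft", "Activision", "THQ"],
})

# Assumption: "last 10 years" is measured from the newest release in the data.
cutoff = toy["Year_of_Release"].max() - 10
recent = toy[toy["Year_of_Release"] > cutoff]

# Group by a list of columns: releases per year and genre, counted per publisher.
per_year = recent.groupby(["Year_of_Release", "Genre"])["Publisher"].value_counts()
print(per_year)
```

Rows with a missing release year fail the comparison and are dropped by the filter, which may or may not be what you want on the real data.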

【Discussion】:

【Solution 2】:

For the top 3, you can do this:

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

# Count games per publisher within one genre, largest first.
# as_index=False is left out: in pandas 1.1 and later it makes size() return a
# DataFrame, and DataFrame.reset_index() does not accept name=.
a = data[data['Genre']=='Action'].groupby(by=['Publisher', 'Genre']).size().reset_index(name='count').sort_values('count', ascending=False)
print(a.head(3))

s = data[data['Genre']=='Shooter'].groupby(by=['Publisher', 'Genre']).size().reset_index(name='count').sort_values('count', ascending=False)
print(s.head(3))

p = data[data['Genre']=='Platform'].groupby(by=['Publisher', 'Genre']).size().reset_index(name='count').sort_values('count', ascending=False)
print(p.head(3))

Output:

              Publisher   Genre  count
10           Activision  Action    311
148  Namco Bandai Games  Action    251
214             Ubisoft  Action    198

           Publisher    Genre  count
5         Activision  Shooter    162
39   Electronic Arts  Shooter    145
135          Ubisoft  Shooter     92

   Publisher     Genre  count
60  Nintendo  Platform    112
81       THQ  Platform     85
86   Ubisoft  Platform     70

For the last one, you can do this:

# Keep only the three genres, then total Global_Sales per publisher, largest first
top_sales = data[data['Genre'].isin(['Platform', 'Shooter', 'Action'])].groupby(by=['Publisher'], as_index=False).agg({'Global_Sales': 'sum'}).sort_values('Global_Sales', ascending=False)
print(top_sales.head(3))

Output:

           Publisher  Global_Sales
195         Nintendo        623.24
11        Activision        480.94
84   Electronic Arts        287.13
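The result above totals sales across the three genres combined. If the goal is instead the top 3 publishers by sales within each genre, one possible variation is sketched below on made-up numbers (the `toy` frame and its values are assumptions; only the column names come from the CSV):

```python
import pandas as pd

# Made-up sales figures; only the column names follow the question's CSV.
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Action", "Shooter", "Shooter", "Platform", "Sports"],
    "Publisher": ["Ubisoft", "THQ", "Ubisoft", "Activision", "Ubisoft", "Nintendo", "Nintendo"],
    "Global_Sales": [1.5, 0.5, 2.0, 3.0, 1.0, 4.0, 9.0],
})

wanted = ["Action", "Shooter", "Platform"]

# Total sales per (Genre, Publisher) pair, restricted to the wanted genres.
sales = (
    toy[toy["Genre"].isin(wanted)]
    .groupby(["Genre", "Publisher"])["Global_Sales"]
    .sum()
)

# Up to three best-selling publishers inside each genre; droplevel removes
# the duplicated Genre level that nlargest prepends.
top_by_genre = sales.groupby("Genre").nlargest(3).droplevel(0)
print(top_by_genre)
```

Genres with fewer than three publishers simply return all of their rows.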

【Discussion】:

【Solution 3】:

For the first question, you can write:

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

# Group data in Genres
grouped_data = data['Publisher'].groupby((data['Genre'])).value_counts()

# By now you already have the values you want inside "grouped_data"
# But, you can create smaller tables to see it better

a = grouped_data['Action']
s = grouped_data['Shooter']
p = grouped_data['Platform']

For the second question, I did not quite understand exactly what you need. But you can compare Publisher and Genre using the following approach:

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

# Group by Publisher and Genre; sum only the numeric columns to get total sales
grouped_2 = data.groupby(['Publisher', 'Genre']).sum(numeric_only=True)

# Look up a specific Publisher x Genre pair
specific = grouped_2.loc[('Nintendo', 'Sports')]
print(specific)

# For making the analysis for last 10 years
Recent_data = data[data['Year_of_Release']>2010]

# Now you can replace "data" with "Recent_data" and run the same analysis for the recent years only.
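For comparing Publisher against Genre side by side, a pivot table can be easier to read than a grouped frame. This is a sketch on made-up numbers, offered as an alternative layout and not the method used in the answer above:

```python
import pandas as pd

# Made-up sales figures; only the column names follow the question's CSV.
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Nintendo", "Ubisoft", "Ubisoft", "Nintendo"],
    "Genre": ["Sports", "Platform", "Action", "Sports", "Sports"],
    "Global_Sales": [2.0, 4.0, 1.5, 0.5, 1.0],
})

# One row per publisher, one column per genre, cells hold summed sales.
# fill_value=0 replaces the missing Publisher x Genre combinations.
table = toy.pivot_table(
    index="Publisher",
    columns="Genre",
    values="Global_Sales",
    aggfunc="sum",
    fill_value=0,
)
print(table)

# A single cell is then a plain row/column lookup.
nintendo_sports = table.loc["Nintendo", "Sports"]
print(nintendo_sports)
```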

【Discussion】: