How can I run an analysis grouped by a column value, instead of using the whole dataset?

【Question title】: How can I run an analysis grouped by a column value, instead of using the whole dataset?
【Posted】: 2019-08-12 14:01:34
【Question description】:

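I'm working on a Python product recommendation system (see the answer by Mohsin hasan).

The simple script takes two variables (UserId, ItemId) as input and returns an affinity score between two products as output.

However, I have added a third column (Country). I would like to run the analysis separately per country, rather than on the whole data frame.

Originally I used R, where dplyr's 'group_by' function would do the job. At the moment I'm stuck (see my attempt below). Does anyone know how I can run this analysis per country? (I suspect 'pandas.DataFrame.groupby' could solve this, rather than the for loop I tried.)

Sample data (note: the only difference is that I added the Country column):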

UserId      ItemId          Country

1           Babyphone       Netherlands
1           Babyphone       Netherlands
1           CoffeeMachine   Netherlands
2           CoffeeMachine   Netherlands
2           Shaver          Netherlands
3           Shaver          Netherlands
3           CoffeeMachine   Netherlands
4           CoffeeMachine   Netherlands
4           Shaver          Netherlands
4           Blender         Netherlands
5           Blender         Netherlands
5           BabyPhone       Netherlands
5           Shaver          Netherlands
6           Shaver          Netherlands
7           CoffeeMachine   Netherlands
7           CoffeeMachine   Netherlands
8           BabyPhone       Netherlands
9           Blender         Netherlands
9           Blender         Netherlands   
1           Babyphone       Germany
1           Babyphone       Germany
1           CoffeeMachine   Germany
2           CoffeeMachine   Germany
2           Shaver          Germany
3           Shaver          Germany
3           CoffeeMachine   Germany
4           CoffeeMachine   Germany
4           Shaver          Germany
4           Blender         Germany
5           Blender         Germany
5           BabyPhone       Germany
5           Shaver          Germany
6           Shaver          Germany
7           CoffeeMachine   Germany
7           CoffeeMachine   Germany
8           BabyPhone       Germany
9           Blender         Germany
9           Blender         Germany

Working (original) code, using UserId and ItemId, without Country:

import itertools
import pandas as pd

# main is our data (a DataFrame with userId and productId columns).

# get unique items
items = set(main.productId)

n_users = len(set(main.userId))

# make a dictionary of item and users who bought that item
item_users = main.groupby('productId')['userId'].apply(set).to_dict()

# iterate over combinations of item1 and item2 and store scores
result = []
for item1, item2 in itertools.combinations(items, 2):

  score = len(item_users[item1] & item_users[item2]) / n_users
  item_tuples = [(item1, item2), (item2, item1)]
  result.append((item1, item2, score))
  result.append((item2, item1, score)) # store score for reverse order as well

# convert results to a dataframe
result = pd.DataFrame(result, columns=["item1", "item2", "score"])

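My attempt (with Country, but it doesn't work). What did I try?

Filter the data frame by country (yes, this is bad because it isn't dynamic)
Loop over the data frames (one data frame per country)
Try to plug in the solution (see above), applying it to each data frame separately.

As you can see, unfortunately it doesn't work...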

       Netherlands = df.loc[df['Country'] == 'Netherlands']
       Germany     = df.loc[df['Country'] == 'Germany']
       results = []
       for dataset in (Netherlands, Germany):
           for index, row in dataset.iterrows():
           Country = row['Country'] # Need to bind the name of the df later to the results 

           items = set(dataset.ItemId) #Get unique Items per country
           n_users = len(set(dataset.UserId) # Get unique number of users per country 
           item_users = dataset.groupby('ItemId'['UserId'].apply(set).to_dict() # I tried to add country here, but without results. 

           for item1, item2 in itertools.combinations(items, 2):
                print("item1", item1)
                print("item2", item2)
                score = len(item_users[item1] & item_users[item2]) / n_users
                item_tuples = [(item1, item2), (item2, item1)]
                result.append((item1, item2, score))
                result.append((item2, item1, score)) # store score for reverse order as well
                result = pd.DataFrame(result, columns=["item1", "item2", "score"])

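Edit 1: expected output

Edit 2: how is the score calculated? The score represents how many customers bought a product combination together, as a fraction of all customers.

For example, in the data you can see that Shaver and CoffeeMachine = 0.333 (because 3 out of 9 people bought this combination in each country). In the first code block the score works fine. However, I can't get it to run per country (which is the key problem here).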
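As a minimal sketch of that arithmetic: the two sets below are read off the sample data for a single country, and the variable names are only illustrative.

```python
# Users who bought each item in one country, taken from the sample data above.
shaver_users = {2, 3, 4, 5, 6}
coffee_users = {1, 2, 3, 4, 7}
n_users = 9  # distinct users in that country

both = shaver_users & coffee_users  # users who bought both items
score = len(both) / n_users

print(sorted(both), round(score, 3))  # [2, 3, 4] 0.333
```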

Many thanks in advance!

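【Comments】:

Could you add the expected output?
Or at least explain how the score is calculated. Please always provide sample data and expected output, and explain in words what you want to do, rather than code that is harder to follow because of nested for loops.
Hi @Erfan, sure. I've added the model's output. What the score is: how many customers bought a product combination together. For example, in the data you can see that Shaver and CoffeeMachine = 0.333 (because 3 out of 9 people bought this combination PER COUNTRY).

Tags: python pandas pandas-groupby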

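【Solution 1】:

Here you go

=^..^=

As you mentioned, group by will be used. First move your scoring loop into a function that carries the additional field 'country', then apply it to the grouped data frame as shown below: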

import pandas as pd
import itertools

Move the scoring into a function:

def get_score(item):
    country = item[0]
    df = item[1]

    # get unique items
    items = set(df.ItemId)
    n_users = len(set(df.UserId))

    # make a dictionary of item and users who bought that item
    item_users = df.groupby('ItemId')['UserId'].apply(set).to_dict()

    # iterate over combinations of item1 and item2 and store scores
    result = []
    for item1, item2 in itertools.combinations(items, 2):

      score = len(item_users[item1] & item_users[item2]) / n_users
      item_tuples = [(item1, item2), (item2, item1)]
      result.append((item1, item2, score, country))
      result.append((item2, item1, score, country)) # store score for reverse order as well

    # convert results to a dataframe
    result = pd.DataFrame(result, columns=["item1", "item2", "score", 'country'])
    return result

Group the data by country, then iterate over each group to get the scores:

# group by the column name (not a one-element list) so each key is a plain string such as 'Germany'
grouped_data = df.groupby('Country')

df_list = []
for item in list(grouped_data):
    df_list.append(get_score(item))

# concat frames
df = pd.concat(df_list)
# remove rows with 0 score
df = df[df['score'] > 0]

Output:

            item1          item2     score      country
0       BabyPhone        Blender  0.111111      Germany
1         Blender      BabyPhone  0.111111      Germany
4       BabyPhone         Shaver  0.111111      Germany
5          Shaver      BabyPhone  0.111111      Germany
8         Blender  CoffeeMachine  0.111111      Germany
9   CoffeeMachine        Blender  0.111111      Germany
10        Blender         Shaver  0.222222      Germany
11         Shaver        Blender  0.222222      Germany
14  CoffeeMachine         Shaver  0.333333      Germany
15         Shaver  CoffeeMachine  0.333333      Germany
16  CoffeeMachine      Babyphone  0.111111      Germany
17      Babyphone  CoffeeMachine  0.111111      Germany
0       BabyPhone        Blender  0.111111  Netherlands
1         Blender      BabyPhone  0.111111  Netherlands
4       BabyPhone         Shaver  0.111111  Netherlands
5          Shaver      BabyPhone  0.111111  Netherlands
8         Blender  CoffeeMachine  0.111111  Netherlands
9   CoffeeMachine        Blender  0.111111  Netherlands
10        Blender         Shaver  0.222222  Netherlands
11         Shaver        Blender  0.222222  Netherlands
14  CoffeeMachine         Shaver  0.333333  Netherlands
15         Shaver  CoffeeMachine  0.333333  Netherlands
16  CoffeeMachine      Babyphone  0.111111  Netherlands
17      Babyphone  CoffeeMachine  0.111111  Netherlands
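For reference, the same per-country idea can be sketched end to end with the sample data from the question, unpacking each group into its key and sub-frame. The names `purchases`, `pair_scores`, `frames` and `scores` are illustrative and not part of the code above. This is one possible arrangement under those assumptions, not the only way to do it.

```python
import itertools
import pandas as pd

# Sample purchases for one country, copied from the question (spelling kept as posted).
rows = [
    (1, "Babyphone"), (1, "Babyphone"), (1, "CoffeeMachine"),
    (2, "CoffeeMachine"), (2, "Shaver"),
    (3, "Shaver"), (3, "CoffeeMachine"),
    (4, "CoffeeMachine"), (4, "Shaver"), (4, "Blender"),
    (5, "Blender"), (5, "BabyPhone"), (5, "Shaver"),
    (6, "Shaver"), (7, "CoffeeMachine"), (7, "CoffeeMachine"),
    (8, "BabyPhone"), (9, "Blender"), (9, "Blender"),
]
# The question repeats the same rows for both countries.
purchases = pd.DataFrame(
    [(u, i, c) for c in ("Netherlands", "Germany") for u, i in rows],
    columns=["UserId", "ItemId", "Country"],
)


def pair_scores(group):
    """Return co-purchase scores for every ordered item pair in one group."""
    n_users = group["UserId"].nunique()
    item_users = group.groupby("ItemId")["UserId"].apply(set).to_dict()
    records = []
    for item1, item2 in itertools.combinations(sorted(item_users), 2):
        score = len(item_users[item1] & item_users[item2]) / n_users
        records.append((item1, item2, score))
        records.append((item2, item1, score))  # reverse order as well
    return pd.DataFrame(records, columns=["item1", "item2", "score"])


# Grouping by a single column name yields (country, sub-frame) pairs.
frames = [
    pair_scores(group).assign(country=country)
    for country, group in purchases.groupby("Country")
]
scores = pd.concat(frames, ignore_index=True)
scores = scores[scores["score"] > 0]  # drop pairs nobody bought together
print(scores)
```

Because the grouping column drives the loop, adding a third country to the data needs no code change.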

【Comments】:

Looks simple, great! Thanks a lot @Zaraki Kenpachi
One question: I guess this line ( item_tuples = [(item1, item2), (item2, item1)] ) isn't needed?