【Question title】: pandas: use value_counts in an apply function
【Posted】: 2021-10-15 10:21:30
【Question】:

Here is a toy example of my pandas DataFrame:

    country_market           language_market
0   United States            English
1   United States            French
2   Not used                 Not used
3   Canada OR United States  English
4   Germany                  English
5   United Kingdom           French
6   United States            German
7   United Kingdom           English
8   United Kingdom           English
9   Not used                 Not used
10  United States            French
11  United States            English
12  United Kingdom           English
13  United States            French
14  Not used                 English
15  Not used                 English
16  United States            French
17  United States            Not used
18  Not used                 English
19  United States            German

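I want to add a column top_country that shows whether the value in country_market is one of the two most common countries in the data. If it is, the new top_country column should show the value from country_market; if not, it should show "Other". I want to repeat this process for language_market (and for a large number of other market columns that I haven't shown here).

This is what I want the data to look like after processing: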

    country_market           language_market  top_country    top_language
0   United States            English          United States  English
1   United States            French           United States  French
2   Not used                 Not used         Not used       Other
3   Canada OR United States  English          Other          English
4   Germany                  English          Other          English
5   United Kingdom           French           Other          French
6   United States            German           United States  Other
7   United Kingdom           English          Other          English
8   United Kingdom           English          Other          English
9   Not used                 Not used         Not used       Other
10  United States            French           United States  French
11  United States            English          United States  English
12  United Kingdom           English          Other          English
13  United States            French           United States  French
14  Not used                 English          Not used       English
15  Not used                 English          Not used       English
16  United States            French           United States  French
17  United States            Not used         United States  Other
18  Not used                 English          Not used       English
19  United States            German           United States  Other

I wrote a function, original_top_markets_function, to do this, but I can't work out how to get the value_counts part of my function to work with pandas apply. I keep getting AttributeError: 'str' object has no attribute 'value_counts'

def original_top_markets_function(x):
    top2 = x.value_counts().nlargest(2).index
    for i in x:
        if i in top2:
            return i
        else:
            return 'Other'

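I understand this happens because apply looks at each element of my target column one at a time, but I also need the function to consider the whole column at once so that I can use value_counts. I don't know how to do that.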
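
The difference described above can be reproduced in a few lines. This is a minimal sketch on a small made-up Series `s` (not the data from the question): `Series.apply` calls the function once per element, whereas `DataFrame.apply` with the default `axis=0` calls it once per column and passes the whole column as a Series.

```python
import pandas as pd

s = pd.Series(["US", "US", "UK", "DE"], name="country_market")  # made-up data

# Series.apply hands the function one element at a time, here a str.
element_types = s.apply(lambda x: type(x).__name__)

# DataFrame.apply (axis=0, the default) hands the function a whole column.
column_types = s.to_frame().apply(lambda col: type(col).__name__)

# Calling a Series method on a single element reproduces the error.
try:
    s.apply(lambda x: x.value_counts())
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(element_types.tolist())
print(column_types.to_dict())
print(error_message)
```

So a function that needs value_counts has to receive a Series, either by being called on the column directly or through `DataFrame.apply`.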

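So I came up with this top_markets function as a workaround. It builds a list and does what I need, but it isn't efficient. I need to apply this function to many different market columns, so I'd like something more Pythonic.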

def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2:
            results.append(i)
        else:
            results.append('Other')
    return results

Here is a reproducible example. Could someone help me fix my top_markets function so that I can use it with apply?

import pandas as pd

d = {0: {'country_market': 'United States', 'language_market': 'English'},
 1: {'country_market': 'United States', 'language_market': 'French'},
 2: {'country_market': 'Not used', 'language_market': 'Not used'},
 3: {'country_market': 'Canada OR United States',
  'language_market': 'English'},
 4: {'country_market': 'Germany', 'language_market': 'English'},
 5: {'country_market': 'United Kingdom', 'language_market': 'French'},
 6: {'country_market': 'United States', 'language_market': 'German'},
 7: {'country_market': 'United Kingdom', 'language_market': 'English'},
 8: {'country_market': 'United Kingdom', 'language_market': 'English'},
 9: {'country_market': 'Not used', 'language_market': 'Not used'},
 10: {'country_market': 'United States', 'language_market': 'French'},
 11: {'country_market': 'United States', 'language_market': 'English'},
 12: {'country_market': 'United Kingdom', 'language_market': 'English'},
 13: {'country_market': 'United States', 'language_market': 'French'},
 14: {'country_market': 'Not used', 'language_market': 'English'},
 15: {'country_market': 'Not used', 'language_market': 'English'},
 16: {'country_market': 'United States', 'language_market': 'French'},
 17: {'country_market': 'United States', 'language_market': 'Not used'},
 18: {'country_market': 'Not used', 'language_market': 'English'},
 19: {'country_market': 'United States', 'language_market': 'German'}}

df = pd.DataFrame.from_dict(d, orient='index')

def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2: 
            results.append(i)
        else: 
            results.append('Other')         
    return results

df['top_country'] = top_markets(df['country_market'])
df['top_language'] = top_markets(df['language_market'])

df

【Comments】:

    Tags: python pandas apply


    【Solution 1】:

    If you need DataFrame.apply to process several columns with one function, you can use a lambda function, for example:

    import numpy as np

    cols = ['language_market', 'country_market']

    # Each column arrives as a Series, so value_counts is available here.
    f = lambda x: np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
    df = df.join(df[cols].apply(f).add_prefix('total_'))
    

    A solution without a lambda function:

    def top_markets(x):
        return np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
    
    df = df.join(df[cols].apply(top_markets).add_prefix('total_'))
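
The same idea can be extended to produce the column names the question asks for (top_country, top_language) instead of a `total_` prefix. This is a minimal sketch on made-up data; the `n` and `other` parameters and the assumption that every column to transform ends in `_market` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

def top_markets(col, n=2, other="Other"):
    """Keep the n most frequent values of a column; replace the rest."""
    top = col.value_counts().nlargest(n).index
    return np.where(col.isin(top), col, other)

# Made-up data; the real frame would have many more *_market columns.
df = pd.DataFrame({
    "country_market": ["US", "US", "US", "UK", "UK", "DE"],
    "language_market": ["en", "en", "en", "fr", "fr", "de"],
})

# Assumption: every column to transform ends in "_market".
market_cols = [c for c in df.columns if c.endswith("_market")]

top = df[market_cols].apply(top_markets)
# Rename country_market -> top_country, language_market -> top_language.
top.columns = ["top_" + c[: -len("_market")] for c in market_cols]

df = df.join(top)
print(df)
```

Selecting the columns by suffix avoids listing all of them by hand when there are dozens of market columns.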
    

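    【Comments】:

    • Perfect, this does exactly what I need, and I can easily use it for the 30 columns I need to transform this way. Thank you!
    【Solution 2】:

    I think you can use: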

    df['top_country'] = np.where(df['country_market'].isin(df['country_market'].value_counts().nlargest(2).index), df['country_market'], 'Other')
    df['top_language'] = np.where(df['language_market'].isin(df['language_market'].value_counts().nlargest(2).index), df['language_market'], 'Other')
    

    If you want to use your own function, you can use:

    df['top_country'] = df[['country_market']].apply(top_markets)
    df['top_language'] = df[['language_market']].apply(top_markets)
    
    #OR
    df[['top_country', 'top_language']] = df[['country_market', 'language_market']].apply(top_markets)
    

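    Edit based on the discussion in the comments:

    def top_markets(x, top):
        # x is a single value here because Series.apply works element by element
        if x in top:
            return x
        else:
            return 'Other'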
    
    top_country = df['country_market'].value_counts().nlargest(2).index
    top_languages = df['language_market'].value_counts().nlargest(2).index
    
    df['top_country'] = df['country_market'].apply(lambda x: top_markets(x, top_country))
    df['top_language'] = df['language_market'].apply(lambda x: top_markets(x, top_languages))
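
The pattern in this edit can also be written without a lambda. This is a minimal sketch on made-up data: the extra argument is passed through the `args` parameter of `Series.apply`, and `Series.where` is shown as a vectorized alternative that avoids the per-element Python call. The variable names are illustrative.

```python
import pandas as pd

def top_markets(x, top):
    # x is a single value because Series.apply works element by element.
    return x if x in top else "Other"

s = pd.Series(["US", "US", "US", "UK", "UK", "DE"])  # made-up data

# value_counts needs the whole column, so compute it once, outside apply.
top2 = s.value_counts().nlargest(2).index

# Extra positional arguments are forwarded to the function via args.
via_apply = s.apply(top_markets, args=(top2,))

# Vectorized equivalent: keep values where the mask is True, else "Other".
via_where = s.where(s.isin(top2), "Other")

print(via_apply.tolist())
print(via_where.tolist())
```

Both results keep the original index, so either can be assigned straight back to a new DataFrame column.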
    

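    【Comments】:

    • Great, this does get me what I want! More generally, though, is there a way to use a function like value_counts as part of a function I pass to apply?
    • I think doing it this way is faster than using a custom function.
    • Yes, I don't dispute the speed of np.where, but for my own learning I'd like to know how to use something like value_counts inside a function that I pass to apply.
    • @meenaparam - check the other answer.
    • If you want to use a custom function, check the edited answer.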