【Question title】: Expecting a sample of 1% or 3 in Audit script
【Posted】: 2020-04-10 15:19:52
【Question】:

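I am writing a script that takes a sample from each category in an Excel file. The script runs, but the results are not what I expect: I am getting samples of 2. I would like the script to take 1%, 3% or 5% from each category, unless the category has only a limited number of items; in that case I want a sample of 2. I have copied the code below. Sorry for the big block of text, I just thought seeing the whole code would help. Any help with this would be much appreciated.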

#imports
import pandas as pd

#read file
df = pd.read_excel(r"C:\Users\***\Desktop\***.xlsx")

#check for certain condition (Y)
df2 = df.loc[(df['Track Item']=='Y')]
print(len(df2))


#unique categories and subcategories
categories = df2['Category'].unique()
subcategories = df2['Subcategory'].unique()

#check for empty subcategories
subcategory = df2['Subcategory'].isnull().all()

#taking a sample based on whether subcategory is empty and the number of y-tracked items 
if subcategory == True:
    def sample_per(df2):
        if len(df2) >= 1500:
            for category in categories: 
                return df2.loc[(df2["Category"] == category)].apply(lambda x: x.sample(n=2) if 
                x.size*0.01 < 2 else x.sample(frac=0.01))
       elif len(df2) < 15000 and len(df2) > 10000:
            for category in categories: 
                return df2.loc[(df2["Category"] == category)].apply(lambda x: x.sample(n=2) if 
                x.size*0.03 < 2 else x.sample(frac=0.03))
       else:
            for category in categories: 
                return df2.loc[(df2["Category"] == category)].apply(lambda x: x.sample(n=2) if 
                x.size*0.05 < 2 else x.sample(frac=0.05))
else:
     def sample_per(df2):
        if len(df2) >= 1500:
            for subcategory in subcategories: 
                return df2.loc[(df2["Subcategory"] == subcategory)].apply(lambda x: x.sample(n=2) if 
                x.size*0.01 < 2 else x.sample(frac=0.01))
        elif len(df2) < 15000 and len(df2) > 10000:
            for subcategory in categories: 
                return df2.loc[(df2["Subcategory"] == subcategory)].apply(lambda x: x.sample(n=2) if 
                x.size*0.03 < 2 else x.sample(frac=0.03))
        else:
            for subcategory in subcategories: 
                return df2.loc[(df2["Subcategory"] == subcategory)].apply(lambda x: x.sample(n=2) if 
                x.size*0.05 < 2 else x.sample(frac=0.05))

    #result of sample_per function
    final = sample_per(df2)

The spacing looks off because the lines are long; the indentation is correct.

【Comments】:

    Tags: python pandas if-statement lambda


    【Solution 1】:

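    I see at least two problems in the code you posted. First, inside a function, return stops execution as soon as it is reached. That means you are not returning a sample for every category; you sample only the first (sub)category and then exit the function entirely. Second, the order of your if conditions means the middle condition can never be triggered: any dataframe with more than 10000 rows already satisfies the first test (len(df2) >= 1500), so it is handled there, and only dataframes with fewer than 1500 rows reach the final else.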
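
Both problems can be reproduced without pandas. The sketch below is only an illustration: the function names are hypothetical, and pick_frac_as_posted copies the branch order from the question rather than any intended thresholds.

```python
def first_only(categories):
    # `return` inside the loop ends the function on the first iteration
    for category in categories:
        return [category]


def every_one(categories):
    # Collect one result per category, then return once after the loop
    results = []
    for category in categories:
        results.append(category)
    return results


def pick_frac_as_posted(n_rows):
    # Same branch order as the question's code
    if n_rows >= 1500:
        return 0.01
    elif n_rows < 15000 and n_rows > 10000:
        return 0.03  # unreachable: any n_rows > 10000 already matched above
    else:
        return 0.05


cats = ["A", "B", "C"]
print(first_only(cats))             # ['A']
print(every_one(cats))              # ['A', 'B', 'C']
print(pick_frac_as_posted(12000))   # 0.01, not 0.03
```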

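    Here is a function that I think should do what you want. First, I do the category/subcategory test to determine which column to use (which eliminates a lot of duplicated code) and get the appropriate (sub)categories. Second, I create an empty dataframe to hold the results; the loop appends each subsample to it. Note that this is not a computationally efficient approach, but it should not be a problem as long as your dataframes do not get too large. Third, I create an internal function that actually does the subsampling. Finally, I rearranged the order of the if/else conditions. By starting with the largest and working down, they are mutually exclusive and exhaust all the possibilities. Note that the last branch covers dataframes with fewer than 1500 rows; I have pass there as a placeholder.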

    def sample_per(df):
        # Use 'Category' when every Subcategory is empty, otherwise 'Subcategory'
        if df['Subcategory'].isnull().all():
            col_name = 'Category'
        else:
            col_name = 'Subcategory'
    
        # Get unique (sub)categories
        categories = df[col_name].unique()
    
        # Create an empty dataframe to store results
        sample_df = pd.DataFrame()
    
        # Create an internal function to do the sampling
        def subsample(df, col_name, cat, frac):
            # Sample whole rows; DataFrame.apply would sample each column separately
            rows = df.loc[df[col_name] == cat]
            if len(rows) * frac < 2:
                # Fall back to 2 rows, or fewer if the (sub)category is smaller
                return rows.sample(n=min(2, len(rows)))
            return rows.sample(frac=frac)
    
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        if df.shape[0] >= 15000:
            for cat in categories:
                sample_df = pd.concat([sample_df, subsample(df, col_name, cat, 0.05)])
        elif df.shape[0] >= 10000:
            for cat in categories:
                sample_df = pd.concat([sample_df, subsample(df, col_name, cat, 0.03)])
        elif df.shape[0] >= 1500:
            for cat in categories:
                sample_df = pd.concat([sample_df, subsample(df, col_name, cat, 0.01)])
        else:
            # Placeholder: fewer than 1500 rows, nothing is sampled
            pass
    
        # Return the sampled dataframe
        return sample_df
    
    # result of sample_per function
    final = sample_per(df2)
    

    Of course, you could also do all of this with groupby:

    def simple_sample(df):
        # Use 'Category' when every Subcategory is empty, otherwise 'Subcategory'
        if df['Subcategory'].isnull().all():
            col_name = 'Category'
        else:
            col_name = 'Subcategory'
    
        def subsample(df, col_name, frac):
            # len(x) is the number of rows in the group; x.size would also count columns
            return df.groupby(col_name).apply(
                lambda x: x.sample(n=min(2, len(x))) if len(x)*frac < 2 else x.sample(frac=frac))
    
        if df.shape[0] >= 15000:
            return subsample(df, col_name, 0.05)
        elif df.shape[0] >= 10000:
            return subsample(df, col_name, 0.03)
        elif df.shape[0] >= 1500:
            return subsample(df, col_name, 0.01)
        else:
            # Fewer than 1500 rows: nothing is sampled
            return None
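
To check the per-group behaviour on something small, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the helper name sample_groups, its min_n and seed parameters, and the demo frame are not part of the original answer. It draws the requested fraction from each group, raises the draw to min_n rows for small groups, and caps it at the group size so a one-row group does not raise an error.

```python
import numpy as np
import pandas as pd


def sample_groups(df, col_name, frac, min_n=2, seed=None):
    """Sample `frac` of each group, but at least `min_n` rows where possible."""
    parts = []
    # groupby skips rows whose key is missing (NaN) by default
    for _, rows in df.groupby(col_name):
        # Rows to draw: the fraction, raised to min_n, capped at the group size
        n = min(len(rows), max(min_n, round(len(rows) * frac)))
        parts.append(rows.sample(n=n, random_state=seed))
    return pd.concat(parts)


# Synthetic data: 2000 rows in "big", 50 in "small", 1 in "tiny"
demo = pd.DataFrame({
    "Category": ["big"] * 2000 + ["small"] * 50 + ["tiny"],
    "Value": np.arange(2051),
})

picked = sample_groups(demo, "Category", frac=0.01, seed=0)
print(picked["Category"].value_counts().to_dict())
# {'big': 20, 'small': 2, 'tiny': 1}
```

Because whole rows are sampled, each selected row keeps its original index and its values stay together, which makes it easy to trace a sampled item back to the source file.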
    

    【Comments】:

    • Glad I could help!