【Question title】: How to create a filtered DataFrame with minimal code
【Posted】: 2017-01-07 12:36:30
【Question】:

There are four cars: bmw, geo, vw, porsche.

import pandas as pd
df = pd.DataFrame({
    'car':      ['bmw','geo','vw','porsche'],
    'warranty': ['yes','yes','yes','no'], 
    'dvd':      ['yes','yes','no','yes'], 
    'sunroof':  ['yes','no','no','no']})

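I want to create a filtered DataFrame that lists only the cars having all three features: DVD player, sunroof, and warranty (we know that here it is the BMW, which has every feature set to 'yes').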

I can do it one column at a time:

cars_with_warranty = df['car'][df['warranty']=='yes']
print(cars_with_warranty)

Then I need to repeat the same per-column filter for the dvd and sunroof columns:

cars_with_dvd = df['car'][df['dvd']=='yes']
cars_with_sunroof = df['car'][df['sunroof']=='yes']

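I would like to know whether there is a neat way to create the filtered DataFrame.

Edited later:

The posted solution works well, but the resulting cars_with_all_three is a plain list variable. We need a DataFrame object with the single 'bmw' car as its only row and all three columns: dvd, sunroof, and warranty (all three values set to 'yes').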

cars_with_all_three = []
for ind, car in enumerate(df['car']):
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes':
        cars_with_all_three.append(car)

【Comments】:

    Tags: python pandas indexing dataframe conditional-statements


    【Solution 1】:

    You can use a simple loop with enumerate:

    cars_with_all_three = []
    for ind, car in enumerate(df['car']):
        if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes':
            cars_with_all_three.append(car)
    

    If you run print(cars_with_all_three), you get ['bmw'].

    Or, if you want to be really clever and use a one-liner, you can do this:

    [car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes']
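The list above holds only the car names. If a DataFrame is needed instead, as the question's later edit asks, one possible follow-up is to select the matching rows with `isin`. This is a sketch added for illustration, not part of the original answer; the name `cars_df` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# names collected by the list comprehension from the answer
cars_with_all_three = [
    car for ind, car in enumerate(df['car'])
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes'
]

# keep only the rows whose 'car' value appears in that list
# (cars_df is a hypothetical name used for this sketch)
cars_df = df[df['car'].isin(cars_with_all_three)]
print(cars_df)
```

Note that this relies on car names being unique; with duplicate names, `isin` would keep every row sharing a matching name.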
    

    Hope this helps.

    【Discussion】:

      【Solution 2】:

      You can use boolean indexing:

      print ((df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes'))
      0     True
      1    False
      2    False
      3    False
      dtype: bool
      
      print (df[(df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes')])
         car  dvd sunroof warranty
      0  bmw  yes     yes      yes
      
      # if you only need the 'car' column (.ix was removed in pandas 1.0; use .loc)
      print (df.loc[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes'), 'car'])
      0    bmw
      Name: car, dtype: object
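The same three-way condition can probably also be written as a string with `DataFrame.query`, which some readers find easier to scan than chained `&` masks. The following is a minimal sketch of that alternative, not something the answer itself proposes; the name `all_three` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# query() takes the condition as a string; 'and' combines the three column tests
all_three = df.query("dvd == 'yes' and sunroof == 'yes' and warranty == 'yes'")
print(all_three)

# only the 'car' column of the matching rows
print(all_three['car'])
```

The result is still a DataFrame, so it can be sliced or passed on like the boolean-indexing result above.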
      

      Another solution is to compare the selected columns with yes, then use all to check that every value in a row is True:

      print ((df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1))
      0     True
      1    False
      2    False
      3    False
      dtype: bool
      
      print (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
         car  dvd sunroof warranty
      0  bmw  yes     yes      yes
      
      print (df.loc[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1), 'car'])
      0    bmw
      Name: car, dtype: object
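Because the `all(axis=1)` form takes a list of columns, it generalizes easily to any number of features. A small sketch of how it might be wrapped in a helper follows; the function name `rows_where_all_equal` and its parameters are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def rows_where_all_equal(frame, columns, value='yes'):
    """Return the rows of `frame` in which every listed column equals `value`."""
    # eq() compares element-wise; all(axis=1) requires True across each row
    mask = frame[columns].eq(value).all(axis=1)
    return frame[mask]

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# all three features
print(rows_where_all_equal(df, ['dvd', 'sunroof', 'warranty']))

# only two of the features
print(rows_where_all_equal(df, ['warranty', 'dvd']))
```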
      

      The solution with the least code, if the DataFrame has only 4 columns as in the example:

      print (df[(df.set_index('car') == 'yes').all(1).values])
         car  dvd sunroof warranty
      0  bmw  yes     yes      yes
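The `.values` at the end of the short form matters: after `set_index('car')` the boolean mask is labelled by car name, while `df` still has its default integer index, so the labels do not match. Converting the mask to a plain array makes the selection positional. The sketch below spells out the intermediate steps; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# with 'car' moved into the index, every remaining column is a feature column
mask = (df.set_index('car') == 'yes').all(axis=1)
print(mask.index.tolist())   # car names, not 0..3

# drop the labels so the mask is applied by position to the original df
positional = mask.to_numpy()
print(df[positional])
```

This only works when every column other than `car` should be compared, which is why the answer restricts it to the 4-column example.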
      

      Timings

      In [44]: %timeit ([car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes'])
      10 loops, best of 3: 120 ms per loop
      
      In [45]: %timeit (df[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes')])
      The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached.
      100 loops, best of 3: 2.09 ms per loop
      
      In [46]: %timeit (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
      1000 loops, best of 3: 1.53 ms per loop
      
      In [47]: %timeit (df[(df.ix[:, [u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
      The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached.
      1000 loops, best of 3: 1.51 ms per loop
      
      In [48]: %timeit (df[(df.set_index('car') == 'yes').all(1).values])
      1000 loops, best of 3: 1.64 ms per loop
      
      In [49]: %timeit (mer(df))
      The slowest run took 4.17 times longer than the fastest. This could mean that an intermediate result is being cached.
      100 loops, best of 3: 3.85 ms per loop
      

      Code used for the timings

      df = pd.DataFrame({
          'car':      ['bmw','geo','vw','porsche'],
          'warranty': ['yes','yes','yes','no'], 
          'dvd':      ['yes','yes','no','yes'], 
          'sunroof':  ['yes','no','no','no']})
      
      print (df)
      df = pd.concat([df]*1000).reset_index(drop=True)
      
      def mer(df):
          df = df.set_index('car')
          return df[df[[ u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().reset_index()
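`%timeit` is an IPython magic, so the timings above cannot be reproduced from a plain script as written. A rough sketch of a comparable measurement with the standard `timeit` module follows. The function names `chained` and `all_axis` and the repeat count of 20 are assumptions made for this illustration, and absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

small = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

# same enlargement as in the timing code: 4 rows repeated 1000 times
df = pd.concat([small] * 1000).reset_index(drop=True)

def chained(frame):
    # three masks combined with &
    return frame[(frame.dvd == 'yes') & (frame.sunroof == 'yes') & (frame.warranty == 'yes')]

def all_axis(frame):
    # one comparison on a column subset, then all() across each row
    return frame[(frame[['dvd', 'sunroof', 'warranty']] == 'yes').all(axis=1)]

runs = 20  # arbitrary repeat count for this sketch
for func in (chained, all_axis):
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(func.__name__, seconds / runs, 'seconds per call')
```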
      

      【Discussion】:

        【Solution 3】:

        Try this:

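        # make 'car' the index so that only the feature columns are compared
        df = df.set_index('car')
        # cells that are not "yes" become NaN; dropna() removes rows containing any NaN
        print (df[df[[ u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().reset_index())
           car  dvd sunroof warranty
        0  bmw  yes     yes      yes

        # if only the car names are needed (df is still indexed by 'car' here,
        # so set_index must not be called a second time)
        print (df[df[[ u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().index.values)
        ['bmw']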
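This approach works because indexing a DataFrame with a boolean DataFrame keeps the cells where the mask is True and replaces the rest with NaN; `dropna()` then discards every row that has at least one NaN. The sketch below prints the intermediate frame to make that visible. The names `feats`, `masked` and `complete` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'car':      ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd':      ['yes', 'yes', 'no', 'yes'],
    'sunroof':  ['yes', 'no', 'no', 'no']})

feats = df.set_index('car')

# cells that are not 'yes' are replaced by NaN
masked = feats[feats == 'yes']
print(masked)

# rows with any NaN are dropped, leaving only cars with every feature
complete = masked.dropna()
print(complete.index.tolist())
```

One caveat: rows that already contain missing values in these columns are dropped as well, so the boolean-mask solutions above are a safer choice when the data may have NaN entries.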
        

        【Discussion】:
