【Question title】: Python Pandas: Combine Dataframes that are unevenly filled
【Posted】: 2020-08-30 06:41:07
【Question】:

Good morning,

From one of our customers we receive CSV exports that look like this:

id  |  name  |  object_a  |  amount_a  |  object_b  |  amount_b  |  object_c  |  amount_c
1      abc      object_1     12           none         none         none         none


id  |  name  |  object_a  |  amount_a  |  object_b  |  amount_b  |  object_c  |  amount_c
2      def      object_2     7            object_3     19           none         none


id  |  name  |  object_a  |  amount_a  |  object_b  |  amount_b  |  object_c  |  amount_c
3      ghi      object_4     25           none         none         none         none

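I really only care about the object pairs (object name and amount). Within each dataset the maximum number of pairs is always the same, but they are filled in at random. My question: is it possible to load them all into a dataframe and transform them into something like this: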

object   |   amount
object_1     12
object_2     7
object_3     19
object_4     25

Loading all of these CSV exports into a single dataframe is not a problem, but does pandas offer a solution for this kind of problem?

Thanks for your help!

【Comments】:

    Tags: python pandas dataframe sorting


    【Solution 1】:

    First concat all the CSVs, then use pd.wide_to_long:

    import numpy as np
    import pandas as pd
    
    csv_paths = ["your_csv_paths..."]
    
    # treat the literal string "none" as missing so dropna() can remove it
    df = pd.concat([pd.read_csv(i) for i in csv_paths]).replace("none", np.nan)
    
    print(pd.wide_to_long(df, stubnames=["object","amount"],
                          i=["id","name"],j="Hi", suffix=r"\w*",
                          sep="_").dropna())
    
                  object amount
    id name Hi                 
    1  abc  a   object_1     12
    2  def  a   object_2      7
            b   object_3     19
    3  ghi  a   object_4     25
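
The answer above reads from placeholder paths, so here is a minimal, self-contained sketch of the same idea that can be run as-is. It assumes the exports are ordinary comma-separated files; the in-memory `csv_texts` strings and the index name `slot` are stand-ins invented for this illustration, and `na_values=["none"]` is swapped in for the `replace` call.

```python
import io
import pandas as pd

# Stand-ins for the three customer exports (assumed to be comma-separated).
csv_texts = [
    "id,name,object_a,amount_a,object_b,amount_b,object_c,amount_c\n"
    "1,abc,object_1,12,none,none,none,none\n",
    "id,name,object_a,amount_a,object_b,amount_b,object_c,amount_c\n"
    "2,def,object_2,7,object_3,19,none,none\n",
    "id,name,object_a,amount_a,object_b,amount_b,object_c,amount_c\n"
    "3,ghi,object_4,25,none,none,none,none\n",
]

# na_values turns the literal string "none" into NaN while parsing.
frames = [pd.read_csv(io.StringIO(text), na_values=["none"]) for text in csv_texts]
df = pd.concat(frames, ignore_index=True)

# object_a/amount_a, object_b/amount_b, ... become rows keyed by the suffix.
long_df = pd.wide_to_long(
    df, stubnames=["object", "amount"],
    i=["id", "name"], j="slot", sep="_", suffix=r"\w+",
)

# Keep only the filled pairs and restore integer amounts.
pairs = (
    long_df.dropna(subset=["object"])
    .reset_index(drop=True)[["object", "amount"]]
    .assign(amount=lambda d: d["amount"].astype(int))
)
print(pairs)
```

Because the unfilled slots become NaN, the amount column is read as float, hence the cast back to int at the end.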
    

    【Comments】:

      【Solution 2】:

      parser.py

      import pandas as pd
      
      df = pd.read_csv('test.csv')
      fields = (
          ('object_a', 'amount_a'),
          ('object_b', 'amount_b'),
          ('object_c', 'amount_c')
      )
      print(df, '\n')
      
      # collect every filled (object, amount) pair, row by row
      newDf = pd.DataFrame(columns=('object', 'amount'))
      for idx, row in df.iterrows():
          for objectCol, amountCol in fields:
              if row[objectCol] != 'none':
                  newDf.loc[len(newDf)] = (row[objectCol], row[amountCol])
      
      print(newDf, '\n')
      

      test.csv

      id,name,object_a,amount_a,object_b,amount_b,object_c,amount_c
      1,abc,object_1,12,none,none,none,none
      1,abc,object_2,15,object_3,42,none,none
      1,abc,none,none,none,none,object_4,16
      

      Output

         id name  object_a amount_a  object_b amount_b  object_c amount_c
      0   1  abc  object_1       12      none     none      none     none
      1   1  abc  object_2       15  object_3       42      none     none
      2   1  abc      none     none      none     none  object_4       16
      
           object amount
      0  object_1     12
      1  object_2     15
      2  object_3     42
      3  object_4     16
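
Row-by-row loops with `iterrows()` can get slow on large exports. As a possible alternative, the sketch below slices one two-column frame per pair and stacks them with `pd.concat`. It reuses the `fields` tuple and the test.csv contents from this answer; the names `parts`, `stacked`, and `new_df` are made up for the illustration.

```python
import io
import pandas as pd

csv_text = """id,name,object_a,amount_a,object_b,amount_b,object_c,amount_c
1,abc,object_1,12,none,none,none,none
1,abc,object_2,15,object_3,42,none,none
1,abc,none,none,none,none,object_4,16
"""
df = pd.read_csv(io.StringIO(csv_text))

fields = (
    ("object_a", "amount_a"),
    ("object_b", "amount_b"),
    ("object_c", "amount_c"),
)

# One two-column slice per pair, renamed to shared headers.
parts = [
    df[[obj_col, amt_col]].rename(columns={obj_col: "object", amt_col: "amount"})
    for obj_col, amt_col in fields
]

# A stable sort on the original row index restores row-major order (a, b, c per row).
stacked = pd.concat(parts).sort_index(kind="mergesort")

# Drop the unfilled slots; the amounts were parsed as strings, so cast them.
new_df = (
    stacked[stacked["object"] != "none"]
    .reset_index(drop=True)
    .assign(amount=lambda d: d["amount"].astype(int))
)
print(new_df)
```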
      

      【Comments】:

        【Solution 3】:

        This may not be the best approach, but if every .csv contains only a single row, you can do it like this:

        import pandas as pd
        
        def append_rows(df, rows):
            """Append each filled (object, amount) pair of a one-row frame to rows."""
            for column in df.columns:
                if column.startswith('object_'):
                    if df[column].values[0] != 'none':
                        suffix = column.replace('object_', '')
                        amount_col = 'amount_' + suffix
        
                        object_name = df[column].values[0]
                        amount_value = df[amount_col].values[0]
        
                        rows.append({'object': object_name, 'amount': amount_value})
        
            return rows
        
        # DataFrame.append was removed in pandas 2.0, so collect plain dicts
        # and build the result frame once at the end
        rows = []
        
        data={'id':[1], 'name':['abc'],'object_a':['Obj1'], 'amount_a':[17],'object_b':['none'], 'amount_b':['none'],'object_c':['none'], 'amount_c':['none'] }
        df = pd.DataFrame(data)
        rows = append_rows(df, rows)
        
        data={'id':[2], 'name':['def'],'object_a':['Obj2'], 'amount_a':[24],'object_b':['Obj3'], 'amount_b':[18],'object_c':['none'], 'amount_c':['none'] }
        df = pd.DataFrame(data)
        rows = append_rows(df, rows)
        
        data={'id':[3], 'name':['ghi'],'object_a':['Obj4'], 'amount_a':[40],'object_b':['none'], 'amount_b':['none'],'object_c':['Obj5'], 'amount_c':[70] }
        df = pd.DataFrame(data)
        rows = append_rows(df, rows)
        
        # fix the column order explicitly
        result_df = pd.DataFrame(rows, columns=['object', 'amount'])
        print(result_df)
        

        Result:

          object  amount
        0   Obj1      17
        1   Obj2      24
        2   Obj3      18
        3   Obj4      40
        4   Obj5      70
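
The example above builds its frames by hand. A rough sketch of how the same one-row-per-file idea might look when the data comes from actual files is given below. The file names, the temporary directory, and the comma-separated layout are all assumptions made for the illustration, not details from the question.

```python
import tempfile
from pathlib import Path

import pandas as pd

header = "id,name,object_a,amount_a,object_b,amount_b,object_c,amount_c\n"
# Hypothetical one-row exports, written to a temp directory for the demo.
samples = {
    "export_1.csv": "1,abc,Obj1,17,none,none,none,none\n",
    "export_2.csv": "2,def,Obj2,24,Obj3,18,none,none\n",
    "export_3.csv": "3,ghi,Obj4,40,none,none,Obj5,70\n",
}

rows = []
with tempfile.TemporaryDirectory() as tmp:
    for name, line in samples.items():
        (Path(tmp) / name).write_text(header + line)

    for path in sorted(Path(tmp).glob("*.csv")):
        # dtype=str keeps "none" and the amounts as plain strings.
        record = pd.read_csv(path, dtype=str).iloc[0]  # the single data row
        for column in record.index:
            if column.startswith("object_") and record[column] != "none":
                amount_col = "amount_" + column[len("object_"):]
                rows.append({"object": record[column],
                             "amount": int(record[amount_col])})

result_df = pd.DataFrame(rows, columns=["object", "amount"])
print(result_df)
```

Reading with `dtype=str` and converting each amount with `int()` keeps the amounts integral even though most slots hold the string "none".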
        

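        【Comments】:

          【Solution 4】:

          Here is an approach that reads the fixed-width file with pd.read_fwf(). The delimiter positions are found programmatically. The wide_to_long() suggested by @HenryYik is used as well.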

          # original data
          from io import StringIO
          import pandas as pd
          
          data = '''id  |  name  |  object_a  |  amount_a  |  object_b  |  amount_b  |  object_c  |  amount_c
          1      abc      object_1     12           none         none         none         none
          id  |  name  |  object_a  |  amount_a  |  object_b  |  amount_b  |  object_c  |  amount_c
          2      def      object_2     7            object_3     19           none         none
          id  |  name  |  object_a  |  amount_a  |  object_b  |  amount_b  |  object_c  |  amount_c
          3      ghi      object_4     25           none         none         none         none
          '''
          
          # get location of delimiters '|' from first line of file
          first_line = next(StringIO(data)).rstrip('\n')
          delimiter_pos = (
              [-1] +  # we will add 1 to this, to get 'real' starting location
              [idx for idx, c in enumerate(first_line) if c == '|'] + 
              [len(first_line)])
          
          # convert delimiter positions to start/end positions for each field
          #   zip() stops when the shortest sequence is exhausted
          colspecs = [ (start + 1, end) 
                      for start, end in zip(delimiter_pos, delimiter_pos[1:])]
          
          # import fixed width file
          df = pd.read_fwf(StringIO(data), colspecs=colspecs)
          
          # drop repeated header rows
          df = df[ df['id'] != df.columns[0] ]
          
          # convert wide to long
          df = pd.wide_to_long(
              df, stubnames=['object', 'amount'],
              i = ['id', 'name'], j = 'group',
              suffix=r'\w*', sep='_',).reset_index()
          
          # drop rows with no info
          mask = (df['object'] != 'none') & (df['amount'] != 'none')
          t = df.loc[mask, ['object', 'amount']].set_index('object')
          print(t)
          
                   amount
          object         
          object_1     12
          object_2      7
          object_3     19
          object_4     25
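
One thing to watch with this approach: because the repeated header rows are parsed as data, the amount column most likely ends up holding strings rather than numbers. A small sketch of one way to convert it follows. The `long_df` frame is a hypothetical stand-in for the long-format result before the "none" rows are dropped, and `pd.to_numeric` is swapped in for the string comparison used in the mask.

```python
import pandas as pd

# Hypothetical long-format frame with string amounts and unfilled "none" slots.
long_df = pd.DataFrame({
    "object": ["object_1", "none", "object_2", "object_3", "none", "object_4"],
    "amount": ["12", "none", "7", "19", "none", "25"],
})

# errors="coerce" maps anything non-numeric (here: "none") to NaN.
long_df["amount"] = pd.to_numeric(long_df["amount"], errors="coerce")

# Drop the unfilled slots and restore an integer dtype.
t = (
    long_df.dropna(subset=["amount"])
    .astype({"amount": int})
    .set_index("object")
)
print(t)
print(t["amount"].sum())
```

With numeric amounts, aggregations such as `sum()` behave arithmetically instead of concatenating strings.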
          

          【Comments】:
