【Question title】: Reorder strings within object based on another data frame lookup
【Posted】: 2021-07-15 02:51:31
【Question description】:

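This is my first data frame (df1).

I have to reorder the elements in the Sequence column (of my df1) by ascending count, looked up in the second data frame (df2).

So my final result should look like this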

【Question comments】:

  • Please explain how df2 was generated

Tags: python


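【Solution 1】:

I think the code below does what you need. I have commented inline and added links for further reading...

I know it could be condensed into shorter code, but I wanted the steps to be clear.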

import pandas as pd
from pprint import pprint


data1 = {'id': ['A1234', 'A2345'],
         'Sequence': ['16 31 17', '51 59 43']}
df1 = pd.DataFrame(data1)
 

# I assumed the label and count columns are integers
data2 = {'label': [10, 11, 12, 13, 16, 17, 21, 24, 31, 43, 44, 51, 59, 60],
         'count': [214, 128, 135, 37, 184, 68, 267, 264, 231, 13, 82, 100, 99, 92]}
df2 = pd.DataFrame(data2)


def seq_value_sort(seq_df, label_df):
    new_sequence_list = []
    for value in seq_df['Sequence'].values:
        print(f'{"":-<40}') # prints a line
        # convert string to list of integers
        # https://www.geeksforgeeks.org/python-converting-all-strings-in-list-to-integers/
        sequence = [int(i) for i in value.split()]
        
        # generate an unsorted list of dict items based on Sequence
        data = []
        for index, row in label_df.iterrows():
            if int(row['label']) in sequence:
                data.append({'label': int(row['label']),
                             'count': int(row['count'])})
        pprint(data)

        # now sort the unsorted list based on key 'count'
        # https://stackoverflow.com/a/73050/9267296
        data = sorted(data, key=lambda k: k['count'])
        pprint(data)

        # list comprehension to make list of strings out
        # of the list of dict
        # https://stackoverflow.com/a/7271523/9267296
        sequence_sorted = [ str(item['label']) for item in data ]
        pprint(sequence_sorted)

        # create the final sequence string from the list
        new_sequence_list.append(' '.join(sequence_sorted))
    
    # create return data
    return_data = {'id': list(seq_df['id'].values),
                   'Sequence': new_sequence_list}
    pprint(return_data)
    
    # finally return a new df
    return pd.DataFrame(return_data)


df3 = seq_value_sort(df1, df2)
print(f'{"":-<40}')
print(df3)

Edit:

I forgot the output:

----------------------------------------
[{'count': 184, 'label': 16},
 {'count': 68, 'label': 17},
 {'count': 231, 'label': 31}]
[{'count': 68, 'label': 17},
 {'count': 184, 'label': 16},
 {'count': 231, 'label': 31}]
['17', '16', '31']
----------------------------------------
[{'count': 13, 'label': 43},
 {'count': 100, 'label': 51},
 {'count': 99, 'label': 59}]
[{'count': 13, 'label': 43},
 {'count': 99, 'label': 59},
 {'count': 100, 'label': 51}]
['43', '59', '51']
{'Sequence': ['17 16 31', '43 59 51'], 'id': ['A1234', 'A2345']}
----------------------------------------
      id  Sequence
0  A1234  17 16 31
1  A2345  43 59 51
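
For comparison, the same steps can be condensed. This is only a minimal sketch, not a drop-in replacement for the function above: the `counts` dictionary and the `reorder` helper are names introduced here for illustration, and it assumes the labels in df2 are unique integers. Like the longer version, it drops labels that do not appear in df2; unlike it, a label repeated in a sequence is kept as many times as it occurs.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['A1234', 'A2345'],
                    'Sequence': ['16 31 17', '51 59 43']})
df2 = pd.DataFrame({'label': [10, 11, 12, 13, 16, 17, 21, 24, 31, 43, 44, 51, 59, 60],
                    'count': [214, 128, 135, 37, 184, 68, 267, 264, 231, 13, 82, 100, 99, 92]})

# Plain label -> count lookup table (assumes labels in df2 are unique)
counts = dict(zip(df2['label'].tolist(), df2['count'].tolist()))


def reorder(sequence):
    """Sort the space-separated labels in `sequence` by ascending count."""
    labels = [int(token) for token in sequence.split()]
    # Keep only labels that have a count in df2
    known = [label for label in labels if label in counts]
    return ' '.join(str(label) for label in sorted(known, key=counts.get))


# Build a new frame; df1 itself is left unchanged
df3 = df1.assign(Sequence=df1['Sequence'].map(reorder))
print(df3)
```

Building the dictionary once avoids scanning df2 for every row of df1, which should matter more as either frame grows.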

【Comments】:
