【Question title】: Merge lines that share the same key into one line
【Posted】: 2020-08-04 09:50:30
【Question】:

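I have a DataFrame and want to create another column that combines the columns whose names start with the same value, Answer, for rows sharing a QID.

That is, given the following DataFrame: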

    QID     Category    Text    QType   Question:   Answer0     Answer1     Country
0   16  Automotive  Access to car   Single  Do you have access to a car?    I own a car/cars    I own a car/cars  UK
1   16  Automotive  Access to car   Single  Do you have access to a car?    I lease/ have a company car     I lease/have a company car  UK
2   16  Automotive  Access to car   Single  Do you have access to a car?    I have access to a car/cars     I have access to a car/cars     UK
3   16  Automotive  Access to car   Single  Do you have access to a car?    No, I don’t have access to a car/cars   No, I don't have access to a car    UK
4   16  Automotive  Access to car   Single  Do you have access to a car?    Prefer not to say   Prefer not to say   UK

I would like to get the following result:

        QID     Category    Text    QType   Question:   Answer0     Answer1     Answer2    Answer3  Country    Answers
    0   16  Automotive  Access to car   Single  Do you have access to a car?    I own a car/cars    I lease/ have a company car      I have access to a car/cars    No, I don’t have access to a car/cars    UK    ['I own a car/cars', 'I lease/ have a company car'   ,'I have access to a car/cars', 'No, I don’t have access to a car/cars', 'Prefer not to say     Prefer not to say']

So far I have tried the following:

previous_qid = None
i = 0
j = 0
answers = []
new_row = {}
new_df = pd.DataFrame(columns=df.columns)
for _, row in df.iterrows():
    # get QID
    qid = row['QID']
    if qid == previous_qid:
        i+=1
        new_row['Answer'+str(i)]=row['Answer0']
        answers.append(row['Answer0'])
    elif new_row != {}:
        # we moved to a new row
        new_row['QID'] = qid
        new_row['Question'] = row['Question']
        new_row['Answers'] = answers
        # we create a new row in the new_dataframe
        new_df.append(new_row, ignore_index=True)
        # we clean up everything to receive the next row
        answers = []
        i=0
        j+=1
        new_row = {}
        # we add the information of the current row
        new_row['Answer'+str(i)]=row['Answer0']
        answers.append(row['Answer0'])
    previous_qid = qid

new_df ends up empty.
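
A possible explanation, offered as a sketch rather than a definitive diagnosis: DataFrame.append never modified the frame in place (it returned a new frame, and it was removed in pandas 2.0), and in the loop above the elif branch only runs when the QID changes, so with a single QID nothing is ever appended. The version below collects plain dicts and builds the frame once at the end. The toy data and the column name Question (without the trailing colon) are assumptions made for illustration.

```python
import pandas as pd

# Toy data (assumed for illustration): two questions, three answer rows.
df = pd.DataFrame({
    "QID": [16, 16, 17],
    "Question": ["Do you have access to a car?"] * 2 + ["Do you cycle?"],
    "Answer0": ["I own a car/cars", "Prefer not to say", "Yes"],
})

rows = []        # one dict per merged QID
current = None   # the merged row currently being built

for _, row in df.iterrows():
    if current is None or row["QID"] != current["QID"]:
        # A new QID starts: store the finished row, then open a new one.
        if current is not None:
            rows.append(current)
        current = {"QID": row["QID"], "Question": row["Question"], "Answers": []}
    # Number the answer by its position within the current QID.
    current[f"Answer{len(current['Answers'])}"] = row["Answer0"]
    current["Answers"].append(row["Answer0"])

# Flush the last group, which no later QID change would otherwise trigger.
if current is not None:
    rows.append(current)

new_df = pd.DataFrame(rows)
print(new_df)
```

Rows with fewer answers than the widest group end up with missing values in the extra Answer columns.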

【Comments】:

  • Please post a more minimal example and the expected result. The expected result above makes no sense to me.

Tags: python python-3.x pandas dataframe


【Solution 1】:

Logically, this is a group-by on QID that collects the answers into a list, followed by splitting that list back out into columns.

import re
import pandas as pd
data = """    QID     Category    Text    QType   Question:   Answer0     Answer1     Country
0   16  Automotive  Access to car   Single  Do you have access to a car?    I own a car/cars    I own a car/cars  UK
1   16  Automotive  Access to car   Single  Do you have access to a car?    I lease/ have a company car     I lease/have a company car  UK
2   16  Automotive  Access to car   Single  Do you have access to a car?    I have access to a car/cars     I have access to a car/cars     UK
3   16  Automotive  Access to car   Single  Do you have access to a car?    No, I don’t have access to a car/cars   No, I don't have access to a car    UK
4   16  Automotive  Access to car   Single  Do you have access to a car?    Prefer not to say   Prefer not to say   UK"""
a = [[t.strip() for t in re.split("  ",l) if t!=""]  for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]

df = pd.DataFrame(data=a[1:], columns=a[0])

# lazy - want first of all attributes except QID and Answer columns
agg = {col:"first" for col in list(df.columns) if col!="QID" and "Answer" not in col}
# get a list of all answers in Answer0 for a QID
agg = {**agg, **{"Answer0":lambda s: list(s)}}

# helper function for row call.  not needed but makes more readable
def ans(r, i):
    return "" if i>=len(r["AnswerT"]) else r["AnswerT"][i]

# split the list from the aggregation back out into columns using assign
# rename Answer0 to AnswerT after the aggregation so that it can be referred to
# drop AnswerT once it is no longer needed
dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0":"AnswerT"}).assign(
    Answer0=lambda dfa: dfa.apply(lambda r: ans(r, 0), axis=1),
    Answer1=lambda dfa: dfa.apply(lambda r: ans(r, 1), axis=1),
    Answer2=lambda dfa: dfa.apply(lambda r: ans(r, 2), axis=1),
    Answer3=lambda dfa: dfa.apply(lambda r: ans(r, 3), axis=1),
    Answer4=lambda dfa: dfa.apply(lambda r: ans(r, 4), axis=1),
    Answer5=lambda dfa: dfa.apply(lambda r: ans(r, 5), axis=1),
    Answer6=lambda dfa: dfa.apply(lambda r: ans(r, 6), axis=1),
).drop("AnswerT", axis=1)

print(dfgrouped.to_string(index=False))


Output

QID    Category           Text   QType                     Question: Country           Answer0                      Answer1                      Answer2                                Answer3            Answer4 Answer5 Answer6
 16  Automotive  Access to car  Single  Do you have access to a car?      UK  I own a car/cars  I lease/ have a company car  I have access to a car/cars  No, I don’t have access to a car/cars  Prefer not to say                
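
The core of the approach above is the aggregation step, which can be looked at in isolation. The following is a minimal sketch on made-up data (the frame toy and its values are assumptions, not the asker's data): "first" keeps one copy of the repeated attributes, while list gathers every Answer0 of a QID into a single cell.

```python
import pandas as pd

# Made-up data: QID 16 has three answer rows, QID 17 has one.
toy = pd.DataFrame({
    "QID": [16, 16, 16, 17],
    "Country": ["UK", "UK", "UK", "UK"],
    "Answer0": ["own", "lease", "none", "yes"],
})

grouped = (
    toy.groupby("QID")
       .agg({"Country": "first", "Answer0": list})  # one row per QID
       .reset_index()
)
print(grouped)
```

Each cell of the aggregated Answer0 column now holds a Python list, which is what the later assign step indexes into.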

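More dynamic

This goes a little further into advanced Python, using **kwargs and functools.partial. In reality it is still static: the number of columns is fixed by the constant MAXANS.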

import functools 
MAXANS=8
def ansassign(dfa, row=0):
    return dfa.apply(lambda r: "" if row>=len(r["AnswerT"]) else r["AnswerT"][row], axis=1)
dfgrouped = df.groupby("QID").agg(agg).reset_index().rename(columns={"Answer0":"AnswerT"}).assign(
    **{f"Answer{i}":functools.partial(ansassign, row=i) for i in range(MAXANS)}
).drop("AnswerT", axis=1)
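
One reason functools.partial fits well in the dict comprehension above, sketched with plain functions rather than pandas: a lambda created in a comprehension looks up the loop variable only when it is called, so every lambda would see the final value of i, whereas partial binds the value at creation time. The helper pick is hypothetical and only serves the illustration.

```python
import functools

def pick(items, row=0):
    """Return items[row], or "" when the list is too short."""
    return "" if row >= len(items) else items[row]

answers = ["a", "b", "c"]

# Late binding: each lambda reads i when called, after the loop has finished.
with_lambda = {f"Answer{i}": (lambda items: pick(items, row=i)) for i in range(3)}
# partial freezes the current value of i into each callable.
with_partial = {f"Answer{i}": functools.partial(pick, row=i) for i in range(3)}

lambda_result = [f(answers) for f in with_lambda.values()]
partial_result = [f(answers) for f in with_partial.values()]
print(lambda_result)   # ['c', 'c', 'c']
print(partial_result)  # ['a', 'b', 'c']
```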

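【Comments】:

  • Thank you very much! I may actually have more than just 7 answers. How can I make it dynamic, so that it produces as many answer columns as there are rows sharing the same (QID, Question:)?
  • @RevolucionforMonica I can't think of a truly dynamic way, because there is no way to find the number of items in the list row by row. Updated, but be aware that few people are proficient in this style of coding.
  • Your update is really great! So advanced! I am thinking of sharing the xlsx of the data I was using for this question. Maybe it could help to get the number dynamically.
  • @RevolucionforMonica In many ways I prefer the first approach - it is more transparent. 80% of the time is spent maintaining code, and code that uses a lot of advanced concepts is very expensive to maintain.
  • Yes, I actually think you are right. I am trying to make your code dynamic with the DataFrame I shared. No idea yet, but I am sure I can work something out. I posted a new question rather than bothering you here with this dynamic issue, in case you want more points ^^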
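
On the question raised in the comments about handling any number of answers: one possible approach, sketched here under assumptions and not taken from the answer above, is to measure the longest aggregated list first and create exactly that many columns. The toy data and the names toy and max_answers are made up for illustration.

```python
import pandas as pd

# Made-up data: QID 16 has three answers, QID 17 has two.
toy = pd.DataFrame({
    "QID": [16, 16, 16, 17, 17],
    "Country": ["UK"] * 5,
    "Answer0": ["own", "lease", "none", "yes", "no"],
})

grouped = toy.groupby("QID").agg({"Country": "first", "Answer0": list}).reset_index()
grouped = grouped.rename(columns={"Answer0": "Answers"})

# The widest answer list decides how many Answer columns are needed.
max_answers = int(grouped["Answers"].map(len).max())

for i in range(max_answers):
    # i=i binds the current index; shorter lists get "" as in the answer above.
    grouped[f"Answer{i}"] = grouped["Answers"].map(
        lambda answers, i=i: answers[i] if i < len(answers) else ""
    )

print(grouped.to_string(index=False))
```

This keeps the transparency of the first approach while removing the fixed column count; the Answers column holding the full list is kept because the question asked for it.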