Merging rows together on condition: answers

【Question title】: Merging rows together on condition
【Posted】: 2021-08-06 09:31:31
【Question】:

My data looks like this:

import pandas as pd
current_df = pd.DataFrame({ 'Col1':['Something (Something', 'Else) words', 'x', 'y', 'something (another', 'word) blah'], 
                    'Col2':['(some value)', '', 'some value', 'a', 'some value',''],
                    'Col3':['Something (Something', 'Else) words', '(x)', 'y', 'something (another', 'word) blah'], 
                    'Col4':['some value', '', 'some value', 'a', 'some value',''],
                    'Col5':['some value', '', 'some value', 'a', 'some value','']})

                   Col1          Col2                  Col3        Col4        Col5
0  Something (Something  (some value)  Something (Something  some value  some value
1           Else) words                         Else) words
2                     x    some value                   (x)  some value  some value
3                     y             a                     y           a           a
4    something (another    some value    something (another  some value  some value
5            word) blah

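I scraped this from a PDF, and in some cases a formatting problem caused part of a value to spill onto the next row. It always happens in the same two columns (Col1 and Col3 in the example), and, as shown, the spilled rows contain nothing in the other columns. Is there a way to merge these values into the row above? The desired output is below.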

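In the example, the text I want to join is split between a ( and its matching ). That holds for about 99% of my data, so I am thinking of exploiting it. But if there is a cleaner way to merge the cells, please let me know.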

goal_df = pd.DataFrame({ 'Col1':['Something (Something Else) words', 'x', 'y', 'something (another word) blah'], 
                    'Col2':['(some value)', 'some value', 'a', 'some value'],
                    'Col3':['Something (Something Else) words', '(x)', 'y', 'something (another word) blah'], 
                    'Col4':['some value', 'some value', 'a', 'some value'],
                    'Col5':['some value', 'some value', 'a', 'some value'],})

                               Col1          Col2                              Col3        Col4        Col5
0  Something (Something Else) words  (some value)  Something (Something Else) words  some value  some value
1                                 x    some value                               (x)  some value  some value
2                                 y             a                                 y           a           a
3     something (another word) blah    some value     something (another word) blah  some value  some value

【Comments】:

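Taking a step back, have you looked into a different scraping approach? Perhaps this one helps: "convert from pdf to text: lines and words are broken".
I am scraping from a formatted table. It is about 200 pages, every page in the same format, but this problem shows up in a few hundred cases: the data in Col1 is too wide for the margin and spills onto the next line. I am still learning Python, and I am not sure that approach would work for me.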

Tags: python pandas

【Solution 1】:

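This is a fairly manual approach, but it gets the job done. Follow these steps:

Merge the first two rows (change the row labels to fix the same problem anywhere else it occurs).
Drop the second row, since it is no longer needed.
Reset the index, because dropping the second row leaves a gap in it (0, 2, 3, ...).

The code: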

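# Join row 1 onto row 0 with a space, trimming the space left by empty cells.
df.loc[0] = (df.loc[0] + ' ' + df.loc[1]).str.strip()
# Drop the now-redundant second row and renumber the index.
df = df.drop(df.index[1])
df = df.reset_index(drop=True)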
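
The three steps above fix one pair of rows at a time. As a minimal sketch of handling every spilled row in one pass, the version below assumes (as the question describes) that a continuation row is exactly one whose Col2, Col4 and Col5 are all empty strings. The helper name `merge_spilled_rows` and its `empty_cols` parameter are hypothetical and introduced only for illustration.

```python
import pandas as pd

def merge_spilled_rows(df, empty_cols):
    """Join each continuation row onto the nearest non-continuation row above it."""
    # Assumption: a row is a continuation when every column in `empty_cols` is ''.
    is_continuation = df[empty_cols].eq('').all(axis=1)
    # Each non-continuation row starts a new group; continuation rows inherit it.
    group_id = (~is_continuation).cumsum()
    # Within each group, join the non-empty pieces of every column with a space.
    merged = df.groupby(group_id).agg(lambda s: ' '.join(v for v in s if v != ''))
    return merged.reset_index(drop=True)

current_df = pd.DataFrame({
    'Col1': ['Something (Something', 'Else) words', 'x', 'y', 'something (another', 'word) blah'],
    'Col2': ['(some value)', '', 'some value', 'a', 'some value', ''],
    'Col3': ['Something (Something', 'Else) words', '(x)', 'y', 'something (another', 'word) blah'],
    'Col4': ['some value', '', 'some value', 'a', 'some value', ''],
    'Col5': ['some value', '', 'some value', 'a', 'some value', '']})

result = merge_spilled_rows(current_df, ['Col2', 'Col4', 'Col5'])
print(result)
```

Because the grouping relies on empty cells and not on parentheses, it should also cover the remaining cases where the split does not fall inside a bracket, provided the empty-column assumption holds for the real data.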

【Discussion】:

【Solution 2】:

Try:

import pandas as pd
import re

current_df = pd.DataFrame({ 'Col1':['Something (Something', 'Else) words', 'x', 'y', 'something (another', 'word) blah'], 
                    'Col2':['(some value)', '', 'some value', 'a', 'some value',''],
                    'Col3':['Something (Something', 'Else) words', '(x)', 'y', 'something (another', 'word) blah'], 
                    'Col4':['some value', '', 'some value', 'a', 'some value',''],
                    'Col5':['some value', '', 'some value', 'a', 'some value','']})

print(current_df)

'''Shows:
    
    Col1                    Col2            Col3                    Col4            Col5
0   Something (Something    (some value)    Something (Something    some value      some value
1   Else) words                             Else) words     
2   x                       some value      (x)                     some value      some value
3   y                       a               y                       a               a
4   something (another      some value      something (another      some value      some value
5   word) blah                              word) blah      
'''
print('\n\n')

pattern_check = r'.*?\)(?:.(?!\)))+$'

cols = list(current_df.columns)

# Flag rows where any original column looks like the tail of a split value.
current_df['mergeUp'] = current_df[cols].apply(lambda x: x.str.contains(pattern_check, regex=True).any(), axis=1)

for col in cols:
    for row in range(1, len(current_df)):
        if not current_df.loc[row, 'mergeUp']:
            continue
        elif re.search(pattern_check, current_df.loc[row, col]):
            current_df.loc[row-1, col] = current_df.loc[row-1, col] + ' ' + current_df.loc[row, col]
        else:
            continue

current_df = current_df.loc[~current_df.mergeUp, :]

del current_df['mergeUp']
current_df = current_df.reset_index(drop=True)

print(current_df)

'''Shows:

    Col1                                Col2            Col3                                Col4        Col5
0   Something (Something Else) words    (some value)    Something (Something Else) words    some value  some value
1   x                                   some value      (x)                                 some value  some value
2   y                                   a               y                                   a           a
3   something (another word) blah       some value      something (another word) blah       some value  some value
'''

print('\n')
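
The question notes that the split usually falls between a ( and its ). As an alternative to the regular expression above, here is a sketch of a different check (not the one this answer uses): a cell is flagged as a continuation when it contains a ) that has no matching ( before it. The helper name `closes_unopened_paren` and the small `sample` frame are hypothetical.

```python
import pandas as pd

def closes_unopened_paren(text):
    """Return True if `text` contains a ')' with no matching '(' before it."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            if depth == 0:
                return True  # a closing bracket whose opener is on another row
            depth -= 1
    return False

# Hypothetical sample; 'y (z) w' is a complete cell with balanced brackets.
sample = pd.DataFrame({
    'Col1': ['Something (Something', 'Else) words', 'x', 'y (z) w'],
    'Col2': ['(some value)', '', 'some value', 'a']})

# Apply the check to every cell, then flag rows where any cell matches.
merge_up = sample.apply(lambda col: col.map(closes_unopened_paren)).any(axis=1)
print(merge_up.tolist())  # [False, True, False, False]
```

The difference from `pattern_check` is that the regular expression also matches a complete cell such as 'y (z) w', because it only looks for text after the last ), whereas the bracket count leaves such rows alone.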


【Discussion】: