【Question title】: How to split long strings in a pandas column by punctuation
【Posted】: 2020-04-20 20:24:40
【Question】:

I have a df that looks like this:

words                                              col_a   col_b  
I guess, because I have thought over that. Um,       1       0 
That? yeah.                                          1       1
I don't always think you're up to something.         0       1                                                       

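I want to split df.words into separate rows wherever a punctuation mark (.,?!:;) occurs. However, each new row should keep the col_a and col_b values of the row it came from. For example, the df above should end up looking like this: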

words                                              col_a   col_b  
I guess,                                             1       0
because I have thought over that.                    1       0
Um,                                                  1       0 
That?                                                1       1
yeah.                                                1       1
I don't always think you're up to something.         0       1

【Comments】:

    Tags: python pandas nlp


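    【Solution 1】:

    One approach is to use str.findall with the pattern (.*?[.,?!:;]), which non-greedily matches any run of characters up to and including one of these punctuation marks, and then explode the resulting lists: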

    (df.assign(words=df.words.str.findall(r'(.*?[.,?!:;])'))
       .explode('words')
       .reset_index(drop=True))
    
                                              words  col_a  col_b
    0                                      I guess,      1      0
    1             because I have thought over that.      1      0
    2                                           Um,      1      0
    3                                         That?      1      1
    4                                         yeah.      1      1
    5  I don't always think you're up to something.      0      1
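
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same findall + explode idea. The DataFrame is rebuilt from the sample in the question; the names `pieces` and `result` and the final `str.strip()` step are my additions, not part of the answer above. The strip is there because every piece after the first starts with the space that followed the previous punctuation mark.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah.",
        "I don't always think you're up to something.",
    ],
    "col_a": [1, 1, 0],
    "col_b": [0, 1, 1],
})

# Each match is the shortest run of characters that ends in one
# of the punctuation marks . , ? ! : ;
pieces = df["words"].str.findall(r"(.*?[.,?!:;])")

result = (
    df.assign(words=pieces)
      .explode("words")            # one row per piece; col_a/col_b are repeated
      .reset_index(drop=True)
)

# Assumption: leading whitespace on each piece is unwanted, so strip it.
result["words"] = result["words"].str.strip()
```

One thing to be aware of: with this pattern, any text that comes after the last punctuation mark in a string is not matched, so it will not appear in the result.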
    

    【Discussion】:

    • I was going to use split, but this works (-: Sorry, I meant to say this is better.
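
The comment above only mentions split and does not show code, so the following is just a guess at what a split-based variant might look like. It uses Python's `re` module with a lookbehind so the punctuation mark stays attached to the piece before it. The name `splitter` and the second sample row are hypothetical, added to show that text without any punctuation is kept as a single row.

```python
import re
import pandas as pd

# Small hypothetical sample; the second row has no punctuation at all.
df = pd.DataFrame({
    "words": ["That? yeah.", "no punctuation here"],
    "col_a": [1, 0],
    "col_b": [1, 1],
})

# Split on whitespace that directly follows one of . , ? ! : ;
# The lookbehind does not consume the mark, so it stays on the left piece.
splitter = re.compile(r"(?<=[.,?!:;])\s+")

split_result = (
    df.assign(words=df["words"].map(lambda s: splitter.split(s.strip())))
      .explode("words")            # one row per piece; col_a/col_b are repeated
      .reset_index(drop=True)
)
```

Note that this variant only splits where a punctuation mark is followed by whitespace, so something like "a,b" would stay in one piece, unlike the findall approach.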