Assistance with splitting a data frame into new columns: answers

【Question title】: Assistance with splitting data frame to new columns
【Posted】: 2020-12-28 10:10:43
【Question】:

I'm having trouble splitting a data frame column on "_" and creating new columns from the pieces.

Original value:

AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt    as section

My current code:

import pandas as pd

# myList is built earlier (not shown)
df = pd.DataFrame(myList, columns=['section', 'text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
# regex=True is required for the digit pattern on pandas 2.0+, where str.replace is literal by default
df['section'] = df['section'].str.replace('excerpt.txt', '').str.replace(r'\d{10}_|\d{8}_', '', regex=True)
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',', index=False)

Output:

section                                 text
AMAT_10Q_Filing Section: Risk Factors_  The COVID-19 pandemic and global measures taken in response 
                                        thereto have adversely impacted, and may continue to adversely 
                                        impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_  The COVID-19 pandemic and measures taken in response by 
                                        governments and businesses worldwide to contain its spread, 
                                        
AMAT_10Q_Filing Section: Risk Factors_  The degree to which the pandemic ultimately impacts Applied’s 
                                        financial condition and results of operations and the global 
                                        economy will depend on future developments beyond our control

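I would really like to split "section" on "_" and put the pieces into new columns. I have tried many regex variations to split "section", and all of them either gave me headers with no values filled in or added the new columns after section and text, which isn't useful. I should also add that there are about 100,000 observations.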

Desired result:

Ticker  Filing type  Section                       Text
AMAT    10Q          Filing Section: Risk Factors  The COVID-19 pandemic and global measures taken in response

Any guidance would be appreciated.

【Comments】:

Tags: python regex pandas string re

【Solution 1】:

If you always know the number of splits, you can do something like this:

import pandas as pd

df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })

# Split column "a" on "_"; each row becomes a list of parts
items = df["a"].str.split("_")

# list.pop removes and returns the last element of each list,
# so every call below consumes one more item from the end.

# Take the last item of each list and store it in "b"
df["b"] = items.apply(list.pop)

# Take the next-to-last item and store it in "c"
df["c"] = items.apply(list.pop)

# Take the remaining (first) item and store it in "d"
df["d"] = items.apply(list.pop)

The dataframe then becomes:

           a  b  c      d
0   test_a_b  b  a   test
1  test2_c_d  d  c  test2
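One caveat: `list.pop` modifies the lists stored in `items`, so the order of the three calls matters and `items` ends up holding empty lists. A minimal sketch of an equivalent that leaves `items` untouched, assuming the same toy data as above, is to index into each list through the `.str` accessor:

```python
import pandas as pd

df = pd.DataFrame({"a": ["test_a_b", "test2_c_d"]})

# Each row becomes a list of parts
items = df["a"].str.split("_")

# .str[i] picks element i of every list without modifying the list
df["b"] = items.str[-1]
df["c"] = items.str[-2]
df["d"] = items.str[-3]

print(df)
```

The resulting columns are the same as with `list.pop`, but the assignments can be written in any order.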

Since you want the columns in a particular order, you can rearrange the dataframe's columns like this:

>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
       d  c  b          a
0   test  a  b   test_a_b
1  test2  c  d  test2_c_d

【Comments】:

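Hey Marco, when I try this, the added columns show up after the section and text columns.
You can reorder the columns after inserting them. I'll edit my answer to include that reordering.
Wish I could upvote your answer. Everything works now, much appreciated!
@xSuperAnnuated I'm glad it helped!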