Slicing a pandas column using values from another column

【Question title】: Slicing a pandas column using values from another column
【Posted】: 2018-03-01 15:07:34
【Question】:

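I have a dataframe with a column that contains some text. In each row of that column I am trying to find two strings and then slice the row's text between them to get a substring. Something like this: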

startinds = df[column].str.find("First Event = ")
endinds   = df[column].str.find("\nLast Event = ")

df["first_timestamp"] = df[column].str.slice(startinds,endinds)

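This does not work, because startinds and endinds are Series, so I cannot use them as indices for slicing the strings in column.

Does anyone know a way to access these values so that the substring is taken row by row?

Sample input: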

    Data
0   "Blahblah
     First Event = 09/20/2017 12:00:00
     Last Event = 09/20/2017 13:00:00
     Blahblahblah"
1   "Blahblahblahblah
     Blahablahblah
     First Event = 09/20/2017 12:30:00
     Last Event = 09/20/2017 12:45:00
     Blahblahblah"

Output:

    first_timestamp
0   "First Event = 09/20/2017 12:00:00"
1   "First Event = 09/20/2017 12:30:00"

【Comments】:

This is an open issue on GitHub. You will most likely have to do it manually.
How about "First Event = " + df.Data.str.extract('(?<=First Event = )(.*)(?=\\\\nLast Event)', expand=False)?

Tags: python python-2.7 pandas substring

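【Solution 1】:

To complete your slicing approach, you can store startinds and endinds in df and then use apply with a lambda across the columns, i.e. slice each row's string using the values in those two columns (note that you need an escape character to match the \n):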

df['startinds'] = df['Data'].str.find("First Event = ")
df['endinds']  = df['Data'].str.find("\\nLast Event = ")

df.apply(lambda x : str(x['Data'])[x['startinds']:x['endinds']],1 )

Output:

0    First Event = 09/20/2017 12:00:00
1    First Event = 09/20/2017 12:30:00
dtype: object
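
As a self-contained sketch of the same row-wise slicing, the example below rebuilds the sample data from the question and pairs each string with its own start and end index. It assumes the text contains real newline characters, so the search string is "\nLast Event = " rather than the escaped form; the frame contents and the use of zip instead of apply are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data modelled on the question, with real newlines (an assumption).
df = pd.DataFrame({
    "Data": [
        "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
        "Last Event = 09/20/2017 13:00:00\nBlahblahblah",
        "Blahblahblahblah\nBlahablahblah\nFirst Event = 09/20/2017 12:30:00\n"
        "Last Event = 09/20/2017 12:45:00\nBlahblahblah",
    ]
})

# str.find returns one integer position per row.
starts = df["Data"].str.find("First Event = ")
ends = df["Data"].str.find("\nLast Event = ")

# Slice each string with its own pair of positions.
df["first_timestamp"] = [
    text[start:end] for text, start, end in zip(df["Data"], starts, ends)
]

print(df["first_timestamp"])
```

Iterating with zip avoids storing the helper columns in the frame, but like the apply version it only gives sensible results when both markers are present in every row.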

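【Comments】:

My mistake. The \n is a newline character. I just typed it into the sample data instead of inserting real line breaks, but it is not a literal backslash. I have edited the original post.
A quick question: is First Event always on the second line?
No. It can be anywhere, and sometimes it may not be in the data at all. I realized I have to use the regex solution, because this kind of string slicing does not work when the keyword is missing.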
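
The missing-keyword problem mentioned in the last comment can be seen in a short sketch: str.find returns -1 when the substring is absent, and -1 is still a valid slice index, so the slicing approach silently produces a wrong substring instead of a missing value. The two-row Series below is made-up illustration data.

```python
import pandas as pd

# Hypothetical data: the second row has no event markers at all.
s = pd.Series([
    "Blah\nFirst Event = 09/20/2017 12:00:00\nLast Event = 09/20/2017 13:00:00",
    "Blah with no event markers",
])

# -1 marks rows where the marker was not found.
positions = s.str.find("First Event = ")
print(positions.tolist())

# A regex extract yields NaN for rows without a match instead of a bogus slice.
extracted = s.str.extract(r"(First Event = .+)", expand=False)
print(extracted)
```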

【Solution 2】:

Unlike the answer in the comments, this approach with Series.str.extract should work:

# expand=False returns a Series, which can be assigned directly to a new column
df['first_timestamp'] = df['Data'].str.extract('(First Event = .+)', expand=False)

#                                                 Data  \
# 0  Blahblah\nFirst Event = 09/20/2017 12:00:00\nL...   
# 1  Blahblahblahblah\nFirst Event = 09/20/2017 12:...   
# 
#                      first_timestamp  
# 0  First Event = 09/20/2017 12:00:00  
# 1  First Event = 09/20/2017 12:30:00

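The pattern '(First Event = .+)' captures one group (the parentheses) in which "First Event = " is followed by one or more characters (the .+), stopping at the line break (the . character matches any character except a newline).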
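
The newline behaviour of . can be checked with the standard re module. This is a minimal sketch on a made-up string; the re.DOTALL comparison is included only to show what changes when . is allowed to match newlines.

```python
import re

# Illustrative text with real newlines between the lines.
text = (
    "Blahblah\nFirst Event = 09/20/2017 12:00:00\n"
    "Last Event = 09/20/2017 13:00:00"
)

# By default "." does not match "\n", so ".+" stops at the end of the line.
default_match = re.search(r"(First Event = .+)", text).group(1)

# With re.DOTALL "." also matches "\n", so ".+" runs to the end of the text.
dotall_match = re.search(r"(First Event = .+)", text, flags=re.DOTALL).group(1)

print(repr(default_match))
print(repr(dotall_match))
```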

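【Comments】:

@andraiamatrix The . character in a regex matches anything except a newline (so .+ matches one or more characters other than a newline). Based on your updated question, df['Data'].str.extract('(First Event = .+)') should capture your first_timestamp group. I'll update my answer.
So I noticed that .+ stops at a newline, but it does not stop at a carriage return, \r (which turns out to be what is in my data). Is there something that will stop at it? I tried (First Event = .+)[\r\n], but that did not keep the carriage return out of my output.
Instead of using ., could you try this? df['Data'].str.extract('(First Event = [^\n\r]+)')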
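
A small sketch of the carriage-return case, assuming the data uses Windows-style \r\n line endings (the sample string is invented for illustration). Because . matches \r, the greedy .+ swallows the carriage return; in the (First Event = .+)[\r\n] attempt the trailing class then simply matches the \n, which is why the \r stayed in the captured group. A negated character class excludes both characters from the capture.

```python
import pandas as pd

# Hypothetical data with \r\n line endings.
s = pd.Series([
    "Blah\r\nFirst Event = 09/20/2017 12:00:00\r\nLast Event = 09/20/2017 13:00:00\r\n"
])

# "." matches "\r", so the capture keeps a trailing carriage return.
with_dot = s.str.extract(r"(First Event = .+)", expand=False)

# "[^\r\n]+" stops at either line-ending character.
with_class = s.str.extract(r"(First Event = [^\r\n]+)", expand=False)

print(repr(with_dot.iloc[0]))
print(repr(with_class.iloc[0]))
```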