How to split strings by the commas and insert into a pandas dataframe: answers

【Question title】: How to split strings by the commas and insert into a pandas dataframe
【Posted】: 2020-09-10 04:00:08
【Question description】:

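I have a function with a for loop that returns a bunch of strings, for example:

58, pluto 172, uno 5, peaches

How do I get the first part of each string (the number) into one column of a pandas dataframe and the second part (the fruit) into a second column? The columns should be named "amount" and "fruit".

This is the code so far: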

import re

pattern = r"(\d+)( ML/year )(in the |the )([\w \/\(\)]+)"
for line in finalText.splitlines():  # finalText holds the block of text to search
    matches = re.finditer(pattern, line)

    for matchNum, match in enumerate(matches, start=1):
        print(match.group(1) + "," + match.group(4))

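I am using re to filter the data I need out of a large block of text, but right now it only prints to the console, and I need it to go into a dataframe.

Basically, the last print statement in that code needs to change so that I insert into a dataframe instead of printing.

An example of finalText is:

(a)58 ML/Y in the pear region (b) 64 ML/Y in the apple region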

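【Question discussion】:

Use the append function to append to the dataframe. In your example, do you want 58 and peaches? Is everything else dropped?
Basically I want match.group(1) in one column and match.group(4) in another column.
To split a string by commas, you can use s.split(','), where s is the name of the string.
Can't you just assign df[matchNum] = [match.group(1), match.group(4)]?
First you need to define the dataframe by creating an empty dataframe.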
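
The suggestions in these comments (split on the comma, then fill a dataframe) can be sketched as follows. This is only a minimal illustration under assumptions: the list `lines` is a hypothetical stand-in for the strings the asker's loop produces, and each string is assumed to have the form "number, name". It collects rows in a plain list and builds the dataframe once, which also avoids `DataFrame.append`, a method that was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-in for the strings produced by the asker's loop.
lines = ["58, pluto", "172, uno", "5, peaches"]

rows = []
for s in lines:
    # Split on the first comma only, then trim surrounding whitespace.
    amount, fruit = (part.strip() for part in s.split(",", 1))
    rows.append({"amount": int(amount), "fruit": fruit})

# Build the frame once at the end instead of appending row by row.
df = pd.DataFrame(rows, columns=["amount", "fruit"])
print(df)
```

Building the frame from a list of rows tends to be simpler and faster than growing a dataframe inside the loop.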

Tags: python pandas dataframe python-re

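【Solution 1】:

I had to work a bit to find a simpler solution for you. Use the \W regular expression to split the string on its non-word characters, such as ( ) and /.

If your strings always follow this pattern

(x)## ML/Y in the fruit region (y) ## ML/Y in the fruit region

then use this code. It drops ( ) and / and gives you a simpler list. Use the 3rd, 8th, 13th and 18th items of the list (indexes 2, 7, 12 and 17) to get what you want.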

import pandas as pd
import re

finalText = '(a)58 ML/Y in the pear region (b) 64 ML/Y in the apple region'

df = pd.DataFrame(data=None, columns=['amount','fruit'])

for line in finalText.splitlines():
    matches = re.split(r'\W',line)
    df.loc[len(df)] = [matches[2],matches[7]]
    df.loc[len(df)] = [matches[12],matches[17]]

print(df)

The output is:

  amount  fruit
0     58   pear
1     64  apple
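
To see why indexes 2, 7, 12 and 17 are the right ones, it may help to print every token that re.split returns for the sample string. This is just an inspection sketch; the variable name `tokens` is introduced here for illustration.

```python
import re

finalText = '(a)58 ML/Y in the pear region (b) 64 ML/Y in the apple region'

tokens = re.split(r'\W', finalText)
# Print each token with its index; empty strings appear wherever the
# string starts with, or contains two adjacent, non-word characters
# (for example the leading "(" or the pair " (").
for i, tok in enumerate(tokens):
    print(i, repr(tok))
```

The empty strings shift the positions, so the indexes hold only as long as the input keeps exactly this layout.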

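Another approach is to use findall.

df = pd.DataFrame(data=None, columns=['amount','fruit'])  # start again from an empty frame

for line in finalText.splitlines():
    # every run of word characters becomes one list element
    matches = re.findall(r'\w+',line)
    df.loc[len(df)] = [matches[1],matches[6]]
    df.loc[len(df)] = [matches[9],matches[14]]

print(df)

The result is the same as above: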

  amount  fruit
0     58   pear
1     64  apple

Old code

Try this and let me know if it works.

import re
import pandas as pd

df = pd.DataFrame(data=None, columns=['amount','fruit'])

pattern = r"(\d+)( ML/year )(in the |the )([\w \/\(\)]+)"
for line in finalText.splitlines():
    matches = re.finditer(pattern, line)

    for matchNum, match in enumerate(matches, start=1):
        # note: df[matchNum] = ... assigns a column named matchNum, not a new row
        df[matchNum] = [match.group(1), match.group(4)]
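
The loop above assigns to df[matchNum], which in pandas sets a column rather than adding a row. A possible alternative, sketched here under assumptions, is to collect one [amount, fruit] pair per match and build the dataframe afterwards. The pattern below is not the asker's original: it is an assumed variant with " ML/year " swapped for " ML/Y " and the last group narrowed to a single word, so that it matches the sample text while keeping group(1) and group(4) as the values of interest.

```python
import re
import pandas as pd

finalText = '(a)58 ML/Y in the pear region (b) 64 ML/Y in the apple region'

# Assumed pattern, adapted from the asker's regex to fit the sample text.
pattern = r"(\d+)( ML/Y )(in the |the )(\w+)"

rows = []
for line in finalText.splitlines():
    for match in re.finditer(pattern, line):
        # One list per match: group 1 is the amount, group 4 the fruit.
        rows.append([match.group(1), match.group(4)])

df = pd.DataFrame(rows, columns=['amount', 'fruit'])
print(df)
```

Because the rows come from the regex groups, this does not depend on fixed token positions in the line.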

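【Discussion】:

The code runs without errors, but when I do df.head() it returns nothing.
Never mind, when I print df it does have the first set of data, but it is placed in a third column, on separate rows.
Did you get it working, or do you need further help? If the problem is solved, you can read up on what to do when someone answers your question.
Hi, I still need further help. Your solution seems to work only partially: the data is not laid out correctly in the dataframe, and only the first row is added.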

【Solution 2】:

Here is my solution

import pandas as pd

s = "58, pluto 172, uno 5, peaches"
temp = s.split()     # ['58,', 'pluto', '172,', 'uno', '5,', 'peaches']
amount = temp[::2]   # ['58,', '172,', '5,']
fruit = temp[1::2]   # ['pluto', 'uno', 'peaches']
df = pd.DataFrame()  # start from an empty dataframe
df['amount'] = amount
df['fruit'] = fruit

You can then go on to remove the commas and change the type from string to int.
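
That clean-up step might look like the sketch below. It assumes every amount is a whole number followed by at most one trailing comma, and it rebuilds the frame from the same string so the example runs on its own.

```python
import pandas as pd

s = "58, pluto 172, uno 5, peaches"
temp = s.split()

df = pd.DataFrame({'amount': temp[::2], 'fruit': temp[1::2]})

# Strip the trailing comma from each amount, then convert to integers.
df['amount'] = df['amount'].str.rstrip(',').astype(int)
print(df.dtypes)
```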

【Discussion】:

Hi, I'm not sure this solution works, and I don't have a string like the 's' you show. Instead, I need to put specific match.group() items into the dataframe (match.group(1) and match.group(4)).