Extracting data using regular expressions [closed]

【Question title】: Extracts data using regular expression [closed]
【Posted】: 2021-02-14 06:46:25
【Question description】:

text='''

        Consumer Price Index:
        +0.3% in Aug 2020

        Unemployment Rate:
        +2.4% in Aug 2020
'''

Use a regular expression to extract the data into a list of tuples, for example

[('Consumer Price Index', '+0.3%', 'Aug 2020'), ...]

and return the list of tuples.

I have tried a few times with

re.findall( , text)

Does anyone have a good idea?

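【Question discussion】:

Is it just one big string?
When posting a question here, be careful about prescribing which tool should be used to solve the problem. In this case a regular expression is one tool that would work, but you are usually better off describing the problem in detail (sample input, sample desired output) and leaving the choice of solution to the answer authors.
@Oliver Hnat It is a short sample of the text.
What have you tried so far, other than that single function call?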

Tags: python regex nlp

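【Solution 1】:

I would first split the string on '\n\n' to break it into separate sections (to avoid confusion), then run the regular expression on each section to extract the groups.

For example: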

import re

text = '''

        Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020
        '''


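sections = text.split('\n\n')

# Compile once, outside the loop. The last group matches only "Mon YYYY",
# so trailing whitespace in the final section is not captured.
pattern = re.compile(r'\s*([\w ]+):\n\s*(\+.+) in (\w+ \d+)')

results = []

for section in sections:
    # match() anchors the pattern at the start of each section
    matches = pattern.match(section)

    if matches:
        results.append(matches.groups())

print(results)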

Output:

[
   ('Consumer Price Index', '+0.2%', 'Sep 2020'),
   ('Unemployment Rate', '+7.9%', 'Sep 2020')
]
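
The split-then-match idea can also be wrapped in a small helper that reports which sections failed to match, which is where the "divide and conquer" approach pays off. This is only a sketch under assumptions: the names `PATTERN` and `extract_sections` are made up for illustration, the pattern is stricter than the one above (a signed percentage followed by a `Mon YYYY` date), and `re.split` is used so that blank lines containing stray whitespace still separate sections. It is run here on the sample text from the question.

```python
import re

# Assumed format: a title line ending in ':', then on the next line a signed
# percentage, the word 'in', and a 'Mon YYYY' date.
PATTERN = re.compile(r'\s*([\w ]+):\n\s*([+-]\d+(?:\.\d+)?%) in (\w+ \d{4})')


def extract_sections(text):
    """Return (matches, skipped) for sections separated by blank lines."""
    matches, skipped = [], []
    # Split on blank lines, tolerating whitespace-only lines.
    for section in re.split(r'\n\s*\n', text):
        if not section.strip():
            continue  # ignore empty chunks produced by the split
        m = PATTERN.match(section)
        if m:
            matches.append(m.groups())
        else:
            skipped.append(section.strip())
    return matches, skipped


sample = '''

        Consumer Price Index:
        +0.3% in Aug 2020

        Unemployment Rate:
        +2.4% in Aug 2020
'''

found, skipped = extract_sections(sample)
print(found)
# [('Consumer Price Index', '+0.3%', 'Aug 2020'), ('Unemployment Rate', '+2.4%', 'Aug 2020')]
print(skipped)
# []
```

Any section that does not fit the assumed format ends up in `skipped` instead of silently disappearing, so it is easy to see when the pattern needs adjusting.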

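Update:

Here is a solution using re.findall, but as I said, there may be inconsistencies depending on how text is structured. To be safe, you should divide and conquer.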

import re

text = '''

        Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020
        '''


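# No splitting needed here: findall scans the whole text.
# The last group matches only "Mon YYYY", so no trailing newline is captured.
pattern = re.compile(r'\s*([\w ]+):\n\s*(\+.+) in (\w+ \d+)')

results = pattern.findall(text)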

print(results)
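
If matching against the whole text is preferred, one way to reduce the inconsistencies mentioned above is to anchor the pattern to line boundaries with re.MULTILINE and restrict each group to the characters it should contain. The following is a sketch, not the pattern from this answer: it assumes every entry is a title line ending in a colon followed by a line of the form `<signed percent> in <Mon YYYY>`, and the name `ENTRY` and the group names are made up for illustration.

```python
import re

ENTRY = re.compile(
    r'^[ \t]*(?P<name>[\w ]+):[ \t]*\n'           # title line, e.g. "Consumer Price Index:"
    r'[ \t]*(?P<change>[+-]\d+(?:\.\d+)?%)'       # signed percentage, e.g. "+0.2%"
    r' in (?P<period>[A-Za-z]{3} \d{4})[ \t]*$',  # period, e.g. "Sep 2020"
    re.MULTILINE,  # let ^ and $ match at every line boundary
)

text = '''

        Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020
        '''

# group() with several names returns a tuple of those groups
results = [m.group('name', 'change', 'period') for m in ENTRY.finditer(text)]
print(results)
# [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020')]
```

Because `[ \t]*` is used instead of `\s`, no group can spill across a newline, and `[+-]` also accepts negative changes, which the `\+` in the pattern above would skip.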

【Discussion】:

What about re.findall?