Extract date ranges from OCR extracts

【Question title】: Extract date ranges from OCR extracts
【Posted】: 2019-06-13 00:15:45
【Question description】:

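My regex returns a list of items, and I only need the date range from it. The date range is not always at a particular index in the list.

I tried converting the list to a string first and then extracting only the date range: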

possible_billing_periods = list(re.findall(r'Billing Period: (.*)|Billing period: (.*)|Billing Period (.*)|Billing period (.*)|period (.*)|period: (.*)', data))
billing_period           = str(possible_billing_periods)

for k in billing_period.split("\n"):
    if k != ['(A-Za-Z0-9)']:
        billing_period_2 = re.sub(r"[^a-zA-Z0-9]+", ' ', k)

print(possible_billing_periods)

Output: [('', '', '', '', 'Tel', ''), ('21-june-2018 - 25 -September-2018', '', '', '', '', '')]

Expected result: 21-june-2018 25-September-2018

Actual result: Tel 21 june 2018 25 September 2018

Sample data:
28 August 2018 Start Index: B1 0
28 August 2018 Start Index: E1 0
Billing Period: 21-june-2018 - 25-September-2018
Expected next reading: 25 December 2018

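【Comments】:

Could you also show us some sample data?
We need to see a few lines of data.
28 August 2018 Start Index: B1 0 28 August 2018 Start Index: E1 0 Billing Period: 21-june-2018 - 25-September-2018 Expected next reading: 25 December 2018
Are the dates you need always on a line that starts with 'Billing Period'? Do you need the regex to handle upper/lower case?
The ('', '', '', '', 'Tel', '') in the output of the print statement shows that your data matched the fifth group in the regex. That is the "|period (.*)|" alternative, where .* matched "Tel".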
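
The behaviour described in the last comment can be reproduced with a small sketch. The `text` sample is hypothetical (its first line is invented so that the bare "period" alternative has something to match), and the single-group pattern at the end is only one possible alternative, not the asker's code.

```python
import re

# Hypothetical OCR text: the first line is invented so that the
# bare "period (.*)" alternative has something to match.
text = (
    "Contract period Tel\n"
    "Billing Period: 21-june-2018 - 25-September-2018"
)

# The asker's pattern: six alternatives, each with its own capture group.
pattern = (r'Billing Period: (.*)|Billing period: (.*)|Billing Period (.*)'
           r'|Billing period (.*)|period (.*)|period: (.*)')

# findall returns one 6-tuple per match; only the group that belongs to
# the alternative that matched is non-empty.
matches = re.findall(pattern, text)
print(matches)

# A single capture group plus re.IGNORECASE yields a flat list of strings
# and no longer matches a bare "period".
single_group = re.findall(r'billing\s+period:?\s*(.*)', text, flags=re.IGNORECASE)
print(single_group)
```

With one capture group there is no need to search the tuples for the non-empty entry.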

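Tags: python regex

【Solution 1】:

Depending on the size of your sample data, a regex may not be the best way to retrieve the information (performance-wise).

Assuming the desired date strings are always on a line that starts with 'Billing Period', you could try something like this: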

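sample_data = """28 August2018 Start Index: B1 0
28 August 2018 Start Index: E1 0
Billing Period: 21-june-2018 - 25-September-2018
Expected next reading: 25 December 2018"""

billing_periods = list()
# First words that mark a billing-period line (the values are unused)
line_start = {'Billing':0, 'period':0, 'period:':0}

for line in sample_data.split('\n'):
    words = line.split()
    # Skip blank or too-short lines, which would otherwise raise IndexError
    if len(words) >= 3 and words[0] in line_start:
        # The start and end dates are the third-from-last and last words
        billing_periods.append((words[-3], words[-1]))

print(billing_periods)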

Output:

[('21-june-2018', '25-September-2018')]

The dict line_start lets you define several possible words that a matching line may start with.
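
The dictionary lookup above is case-sensitive and only looks at the first word. As a rough sketch of a variation, the check could be done on a lower-cased line with `str.startswith` and a tuple of prefixes. The names `PREFIXES` and `find_billing_periods` are hypothetical, and the code assumes the line ends with "<start> - <end>" as in the sample data.

```python
# Hypothetical prefixes, compared in lower case so that "Billing Period",
# "billing period" and "Period:" are all accepted.
PREFIXES = ("billing period", "period")


def find_billing_periods(text):
    """Return (start, end) tuples from lines that begin with a known prefix."""
    periods = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.lower().startswith(PREFIXES):
            continue
        words = stripped.split()
        # Assumes the line ends with "<start> - <end>", as in the sample data.
        if len(words) >= 3 and words[-2] == "-":
            periods.append((words[-3], words[-1]))
    return periods


sample = """28 August 2018 Start Index: B1 0
billing period 21-june-2018 - 25-September-2018
Expected next reading: 25 December 2018"""

print(find_billing_periods(sample))
```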

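【Comments】:

【Solution 2】:

I am guessing the data comes from a file, so it is easiest to process it line by line. Here is pseudocode for the usual way of processing a file: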

for each line in the file:
    if it is a line we care about:
        process the line

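Based on the sample data, the lines we care about start with some variant of "Billing Period:". Here is a regex that finds lines starting with any of the variants in the sample code. The ?x at the beginning is equivalent to the re.VERBOSE flag. It tells the regex compiler to ignore whitespace, so I can spread out the parts of the regex and add comments explaining what is going on.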

billing_period_re = re.compile(r"""(?xi)   # ignorecase and verbose; inline flags must come first
   ^                # match at the beginning of the string
   \s*
   (?:Billing)?     # optional Billing. (?: ...) means don't save the group
   \s*
   Period
   \s*
   :?               # optional colon
   \s*
   """)

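As a quick check of the statement that the inline flags are equivalent to passing flags to `re.compile`, the sketch below compiles the same pattern both ways and compares the results on a few lines. The names `inline_re`, `flagged_re` and `samples` are illustrative only, and the sample lines are variants assumed from the question.

```python
import re

# Inline flags: these must appear at the very start of the pattern.
inline_re = re.compile(r"""(?xi)
    ^ \s* (?:Billing)? \s* Period \s* :? \s*
    """)

# The same pattern with the flags passed as arguments instead.
flagged_re = re.compile(r"""
    ^ \s* (?:Billing)? \s* Period \s* :? \s*
    """, re.VERBOSE | re.IGNORECASE)

samples = [
    "Billing Period: 21-june-2018 - 25-September-2018",
    "billing period 21-june-2018 - 25-September-2018",
    "period: 21-june-2018 - 25-September-2018",
    "Expected next reading: 25 December 2018",
]

for line in samples:
    a = inline_re.search(line)
    b = flagged_re.search(line)
    # Both versions should agree on whether the line matches.
    print(bool(a), bool(b), repr(line))
```

Passing the flags as arguments avoids any dependence on where the inline group sits in the pattern.
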
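Now, if the billing period regex matches, we need to find a date range. Based on the sample data, a date range is two dates separated by " - ". A date is a 1-2 digit day, a month name, and a 4-digit year, separated by "-". Here is one way to build a regex for the date range: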

day   = r"\d{1,2}"
month = r"(?:january|february|march|april|may|june|july|august|september|october|november|december)"
year  = r"\d{4}"
date = rf"{day}-{month}-{year}"

date_range_re = re.compile(rf"(?i)(?P<from>{date}) - (?P<to>{date})")
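
The output shown in the question contains "25 -September-2018", with a stray space before the hyphen, which a strict day-month-year pattern would reject. A minimal sketch of a more tolerant variant follows, assuming the only OCR noise is extra whitespace around hyphens. The names `sep`, `loose_date`, `loose_range_re` and `clean` are hypothetical.

```python
import re

day = r"\d{1,2}"
month = (r"(?:january|february|march|april|may|june|july|august"
         r"|september|october|november|december)")
year = r"\d{4}"
sep = r"\s*-\s*"  # a hyphen with optional OCR-introduced spaces around it

loose_date = rf"{day}{sep}{month}{sep}{year}"
loose_range_re = re.compile(
    rf"(?P<start>{loose_date}){sep}(?P<end>{loose_date})", re.IGNORECASE
)


def clean(date_text):
    """Remove stray whitespace, e.g. '25 -September-2018' -> '25-September-2018'."""
    return re.sub(r"\s+", "", date_text)


m = loose_range_re.search("21-june-2018 - 25 -September-2018")
if m:
    print(clean(m["start"]), clean(m["end"]))
```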

Putting it all together:

# this could be for line in input_file:
for line in data.splitlines():

    # check if it's a billing period line
    linematch = billing_period_re.search(line)

    if linematch:

        # check if there is a date range
        date_range = date_range_re.search(line, linematch.end())

        if date_range:
            print(f"from: {date_range['from']} to: {date_range['to']}")
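
If the extracted strings need to be compared or sorted, they could be converted to date objects. This is only a sketch: `to_date` is a hypothetical helper, and it assumes English month names (the `%B` directive follows the current locale) and the day-month-year layout from the sample data.

```python
from datetime import datetime


def to_date(text):
    """Parse strings such as '21-june-2018' into a datetime.date."""
    # %B is the full month name; strptime matches it case-insensitively,
    # but it depends on the locale, so English names are an assumption.
    return datetime.strptime(text, "%d-%B-%Y").date()


start = to_date("21-june-2018")
end = to_date("25-September-2018")
print(start, end, (end - start).days)
```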

【Comments】: