Parsing Dates from Text in R: Answers

【Question title】: Parsing Dates from Text in R
【Posted】: 2015-12-04 12:58:11
【Question description】:

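I repeatedly run into the problem of parsing dates from relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. An example text is: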

"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."

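I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it to a format such as 2015-07-01 UTC (step 2). Step 2 can be done with, for example, parse_date_time from the lubridate package (which is nice because it handles many date formats):

Case 1:

library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")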
[1] "2015-07-01 UTC"

In some cases, parse_date_time also works on a larger string that contains the date. For example:

Case 2:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

However, as far as I can tell, step 2 does not work directly on the full example text:

Case 3:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA

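Apparently, some of the additional information in the text makes it difficult to parse the date directly from the full text. I can think of an approach in which step 1 is done with a regular expression that extracts a reduced string containing the date, on which parse_date_time works (similar to Case 1 or Case 2). However, combining regular expressions with dates always feels a bit dirty, because the regular expression does not know whether it has extracted a valid date.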
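The two-step workaround described above can be sketched in base R. This is only an illustration under stated assumptions: the function name extract_first_date is hypothetical, the pattern covers only the "Month Dth, YYYY" layout with English month names, and base R's as.Date is used in place of lubridate so that the sketch has no package dependency. The point it shows is that the regex only proposes a candidate, while the date conversion decides whether the candidate is a real calendar date.

```r
# Hypothetical helper (not part of lubridate): returns the first
# "Month Dth, YYYY" candidate as a Date, or NA if none is a valid date.
extract_first_date <- function(text) {
  # Step 1: capture month name, day number, ordinal suffix and year.
  pattern <- "([A-Z][a-z]+) ([0-9]{1,2})(st|nd|rd|th), ([0-9]{4})"
  parts <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(parts) == 0) return(as.Date(NA))   # no candidate found

  # month.name is a base R constant, so this lookup is locale independent.
  month <- match(parts[2], month.name)
  if (is.na(month)) return(as.Date(NA))         # capitalised word is not a month

  # Step 2: with an explicit format, as.Date() yields NA for impossible
  # dates such as February 30th, which validates the regex match.
  iso <- sprintf("%s-%02d-%02d", parts[5], month, as.integer(parts[3]))
  as.Date(iso, format = "%Y-%m-%d")
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
found <- extract_first_date(x)
invalid <- extract_first_date("Paris, February 30th, 2015 - nothing happened.")
```

Here found holds the date for July 1st, 2015, while invalid is NA because the regex matched a string that is not a calendar date. The sketch gives up after the first candidate; a more thorough version would try later matches as well.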

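Is there a way to perform step 2 directly on unstructured text like the example above (Case 3), that is, without a regex-based workaround?

Any input is much appreciated!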

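【Comments】:

Your best bet is probably to read the source code of parse_date_time and try to adapt it to situations like "Case 3", because I don't think there is a clear-cut, formulaic way to solve this kind of problem.
Welcome to regex-hell... I think your best shot is to adapt some regex snippet to look through the code and extract the values.
The two-step regex approach works very well for unstructured text, but it can get a bit cumbersome with multiple date formats. So I would not call it regex hell... maybe a milder form of it.
Using this website, we can construct some regex code: (( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+) but it does not work in R... :( That is what I mean by "hell"...
Thanks for the reference.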

Tags: regex r date parsing lubridate

【Solution 1】:

Using this website, we can construct some regex code: (( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+) but it does not work in R... :(
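Two things stand in the way of the quoted pattern. Inside an R string literal the backslash has to be doubled (`\\w`), and square brackets list single characters rather than numeric ranges, so `[1-31]` means "one of 1, 2, 3" and not "a number from 1 to 31". A small sketch of the second point, using only base R's grepl (the variable names are made up for illustration):

```r
# "[1-31]" is a character class: the range 1-3 plus a literal 1,
# so it matches exactly one character out of "1", "2", "3".
class_hits <- grepl("^[1-31]$", c("1", "3", "4", "31"))

# A grouped alternation expresses the intended day range 1..31.
day_pattern <- "^([1-9]|[12][0-9]|3[01])$"
range_hits <- grepl(day_pattern, c("1", "3", "4", "31", "32", "0"))
```

With these inputs, class_hits is TRUE only for "1" and "3", whereas range_hits is TRUE for "1", "3", "4" and "31" and FALSE for "32" and "0".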

It does work once corrected.

> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
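The pattern above accepts any word that starts with one of the letters J, F, M, A, S, O, N, D (for instance "Monday"), and regexpr returns only the first match, with the leading space included. As a possible variation, and only a sketch, the month part can be built from base R's month.name constant and all candidates collected with gregexpr and regmatches. The second sentence appended to the text below is invented purely to show a second match.

```r
x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
# Invented follow-up sentence, added only to demonstrate multiple matches.
x2 <- paste(x, "A follow-up is planned for March 22nd, 2016.")

# Alternation of the twelve English month names instead of "[JFMASOND]\\w+".
months <- paste(month.name, collapse = "|")
pattern <- paste0("(", months, ") ([1-9]|[12][0-9]|3[01])(st|nd|rd|th), [12][0-9]{3}")

# gregexpr() finds every match; regmatches() returns the matched substrings.
candidates <- regmatches(x2, gregexpr(pattern, x2))[[1]]
```

candidates then contains "July 1st, 2015" and "March 22nd, 2016". "November 2011" is skipped because it has no day part. Each candidate still needs a conversion step to confirm that it is a valid date.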

【Discussion】: