How to count years in the text in R? Answers

【Question title】: How to count years in the text in R?
【Posted】: 2020-05-29 11:14:25
【Question】:

I want to count the years that appear between opening and closing brackets in the following text, named txt.

library(stringr)
txt <- "Text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"

lengths(strsplit(txt,"\\(\\d{4}\\)")) returns 4, which is wrong. Can anyone help?
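A note on why this attempt returns 4: strsplit() cuts the string at every match and returns the pieces in between, so three matches in the middle of the text leave four pieces. The sketch below uses only base R to show the difference between counting pieces and counting matches. The names pieces, hits, n_pieces and n_years are introduced here for illustration only.

```r
txt <- "Text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"

# strsplit() returns the text between the matches, not the matches themselves
pieces <- strsplit(txt, "\\(\\d{4}\\)")[[1]]
n_pieces <- length(pieces)

# gregexpr() locates every match; regmatches() extracts the matched substrings
hits <- regmatches(txt, gregexpr("\\(\\d{4}\\)", txt))[[1]]
n_years <- length(hits)
```

Here n_pieces is 4 while n_years is 3, which is the count the question is after.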

【Comments】:

Do you mean inside (parentheses) rather than inside [brackets]?
@sindri_baldur ( ) are commonly called brackets. Source

Tags: r stringr strsplit

【Solution 1】:

You can use str_extract_all with a positive lookbehind and a positive lookahead in the regular expression.

stringr::str_extract_all(txt, '(?<=\\()\\d+(?=\\))')[[1]]
#[1] "2020" "1998" "2015"

If you want to count how many there are, use length.

length(stringr::str_extract_all(txt, '(?<=\\()\\d+(?=\\))')[[1]])
#[1] 3

It may be easier to use str_match_all with a capture group:

stringr::str_match_all(txt, '\\((\\d+)\\)')[[1]][, 2]
#[1] "2020" "1998" "2015"

【Comments】:

【Solution 2】:

If you prefer base R:

regmatches(txt, gregexpr("[^0-9]\\d{4}[^0-9]", txt))

which gives

[[1]]
[1] "(2020)" "(1998)" "(2015)"

If we wrap it in lengths( ... ), we get the correct answer.

Edit: or, if you really only want the count, we can shorten this to

lengths(gregexpr("[^0-9]\\d{4}[^0-9]", txt))
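One caveat with the shortened form is worth keeping in mind: when nothing matches, gregexpr() returns a single -1, so lengths() on its result reports 1 rather than 0. A minimal sketch of a safer count, assuming you want 0 for strings without a match, goes through regmatches() first. The helper name count_matches is hypothetical, not part of base R.

```r
txt <- "Text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"

# Hypothetical helper: count pattern matches per string, giving 0 when nothing matches
count_matches <- function(x, pattern = "[^0-9]\\d{4}[^0-9]") {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

n_found <- count_matches(txt)
n_none <- count_matches("no years here")

# gregexpr() signals "no match" with -1, and that single -1 still has length 1
n_naive <- lengths(gregexpr("[^0-9]\\d{4}[^0-9]", "no years here"))
```

With the sample text, n_found is 3. For the string without any year, n_none is 0 while n_naive is 1.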

【Comments】:

【Solution 3】:

I think you are looking for stringr::str_count():

str_count(txt, "\\([0-9]{4}\\)")
#[1] 3

To count only four-digit numbers in parentheses that start with 1 or 2, followed by 0 or 9:

str_count(txt, "\\([1-2][09][0-9]{2}\\)")

To require strictly 19 or 20 at the start:

str_count(txt, "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)")
# In R 4.0 or later, using a raw string
str_count(txt, r"(\(19[0-9]{2}\)|\(20[0-9]{2}\))")

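【Comments】:

Thanks. Is it possible to force the code to require that the first two digits of the year in parentheses be 19 or 20? The other two digits can be any digits.
I'm back again. I want to count years both with and without the opening and closing parentheses, such as (2020) or 2020. Given that str_count(txt, "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)") correctly counts (2020), how should this line be updated to also count years without parentheses?
@Gadaa Just remove \\( and \\) from the pattern.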
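As a sketch of that last suggestion: once the parentheses are removed, the pattern can also match four digits inside a longer number, so it may be worth adding word boundaries (\\b). The sample string txt2 and the names n_loose and n_strict below are made up for illustration and are not from the question.

```r
library(stringr)

# Hypothetical sample text: one bare year, one bracketed year, one longer number
txt2 <- "Survey 2020; reprint (1998); item 120155"

# Without boundaries, "2015" inside "120155" is counted as well
n_loose <- str_count(txt2, "(19|20)[0-9]{2}")

# \\b requires a word boundary on both sides, so digits inside longer numbers are skipped
n_strict <- str_count(txt2, "\\b(19|20)[0-9]{2}\\b")
```

On this sample, n_loose is 3 and n_strict is 2. Whether the boundary version is the right choice depends on how years appear in the real text.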