【Question title】: Extracting strings between two regular expressions in R
【Posted】: 2020-08-19 12:51:58
【Question】:

I have a txt file containing 19th-century congressional speeches. This is the general format:

Mr. JOHNSON. Researching congress is neat!  
Mr. JACKSON. For sure. Sometimes I think 
that I would do it for a living.  
Mr. SMITH, of Virginia. But then I realize
it's actually pretty hard!

I want to build a data frame that separates out the block of text spoken by each speaker. Something like:

SPEAKER                   STATEMENT
Mr. JOHNSON               Researching ...
Mr. Jackson               For sure. ...
Mr. Smith, of Virginia    But then...

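I came up with a regular expression that identifies every instance of Mr. [something] or Mr. [something, of some place] (these speeches are from the 19th century, so unfortunately they are all Mr.). It looks like this: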

pattern <- regex("((Mr\\.\\s[A-Z][A-Za-z1-9]+)(\\,\\sof\\s[A-Za-z1-9]+\\.|\\.)|(The\\sCHAIRMAN))", dotall = TRUE)
str_extract_all(data, pattern)

This returns:

[1] Mr. JOHNSON.
[2] Mr. JACKSON.
[3] Mr. SMITH, of Virginia.

My question now is: how do I extract the text between each of the extracted names? I tried the following, without success:

library(qdapRegex)
ex_between(data, pattern, pattern)[[1]]

Any ideas? Many thanks!

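【Comments】:

  • I know regular expressions in Python. Could you use a regex to replace the speakers? That is, replace each speaker's name with an empty string so that only the remaining text is left?
  • Could you show how you created the variable data? How the text is read into R matters for answering this question.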

Tags: r regex string


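【Solution 1】:

I usually don't like for loops, but this does work. It builds on your ex_between attempt, with a special case for the last statement (because that statement does not sit between two speakers).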

library(tidyverse)  # attaches stringr, readr and the %>% pipe
library(qdapRegex)  # provides ex_between()

# readr treats a string that contains a newline as literal data, not as a file path
data <- read_file("Mr. JOHNSON. Researching congress is neat!
Mr. JACKSON. For sure. Sometimes I think that I would do it for a living.
Mr. SMITH, of Virginia. But then I realize it's actually pretty hard!")
# Replace line breaks with spaces so each statement sits on a single line
data <- data %>% 
  str_replace_all("\\n", " ")

# Speaker pattern from the question
pattern <- regex("((Mr\\.\\s[A-Z][A-Za-z1-9]+)(\\,\\sof\\s[A-Za-z1-9]+\\.|\\.)|(The\\sCHAIRMAN))", dotall = TRUE)
people <- str_extract_all(data, pattern)[[1]]

statements <- character()
for (i in seq_along(people)) {
  if (i < length(people)) {
    # Text between this speaker and the next one
    statements[i] <- ex_between(data, people[i], people[i + 1])[[1]][1]
  } else {
    # Last speaker: take everything after the name
    statements[i] <-
      str_extract_all(data, sprintf("(?<=%s).*", people[i]))[[1]][1]
  }
}

df <- data.frame(people, statements, stringsAsFactors = FALSE)
df

                   people                                                   statements
1            Mr. JOHNSON.                                Researching congress is neat!
2            Mr. JACKSON. For sure. Sometimes I think that I would do it for a living.
3 Mr. SMITH, of Virginia.                But then I realize it's actually pretty hard!
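
If you would rather avoid the loop, one possible variant (a sketch only, not tested beyond this sample text) is to split the text on the same speaker pattern with stringr's str_split and pair the pieces with the extracted names. It assumes the text has already been collapsed onto one line and that every match of the pattern really is a speaker heading; the sample string is built with paste() here purely for illustration.

```r
library(stringr)

# Same sample text as above, already joined onto one line (illustrative input).
data <- paste(
  "Mr. JOHNSON. Researching congress is neat!",
  "Mr. JACKSON. For sure. Sometimes I think that I would do it for a living.",
  "Mr. SMITH, of Virginia. But then I realize it's actually pretty hard!"
)

# Speaker pattern taken unchanged from the question.
pattern <- regex("((Mr\\.\\s[A-Z][A-Za-z1-9]+)(\\,\\sof\\s[A-Za-z1-9]+\\.|\\.)|(The\\sCHAIRMAN))", dotall = TRUE)

# One entry per speaker heading, in order of appearance.
people <- str_extract_all(data, pattern)[[1]]

# Splitting on the pattern gives the text before the first speaker,
# followed by one piece per speaker; drop that first piece.
pieces <- str_split(data, pattern)[[1]]
statements <- str_trim(pieces[-1])

df <- data.frame(people, statements, stringsAsFactors = FALSE)
df
```

Because the split yields exactly one piece after each match, the last statement needs no special case. Any text before the first speaker ends up in the dropped first piece.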
