How to extract specific text from a corpus? Answers

【Question title】: How to extract specific text from a corpus?
【Posted】: 2019-10-31 07:27:12
【Question description】:

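I have a corpus of 213 documents of different lengths. My goal is to extract from each file one specific piece of text, the part that refers to "fiscal policy". What complicates my attempts is that the text I want to extract is not in the same position in every document. The only keywords that regularly appear at its beginning are "fiscal policy" or "fiscal policies", and nothing else.

To give an example: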

df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. 
Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"))

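cp <- corpus(df, text_field = "Text")  # quanteda; the text column here is named "Text"

The final goal is to obtain a corpus like this: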

df <- data.frame(Text = c("As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future.", "Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes.", "As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries.", "Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes."))

cp <- corpus(df, text_field = "Text")  # quanteda; the text column here is named "Text"

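Please note that I would be happy even if I only obtained the bit of interest together with the "MORE TEXT" that I do not want; I could simply subset it afterwards. I have not managed to get that far. So far I have tried corpus_segment and working on the data frame directly, both without success.

Can anyone help me with this?

Thanks a lot!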

【Question discussion】:

If that is all the information you have, I don't think you can get any further than stringr::str_extract(df$Text, stringr::regex("fiscal.*$", ignore_case=TRUE))
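The idea in the comment above can also be tried without stringr. The following is a minimal base R sketch, not the commenter's code: the toy vector `texts` and the pattern `[^.]*fiscal polic.*$` are assumptions made for illustration. It keeps everything from the start of the first sentence that mentions "fiscal polic…" to the end of the document, and returns one element per document (NA where nothing matches), so the document structure is preserved.

```r
# Toy documents (hypothetical; stand-ins for df$Text)
texts <- c(
  "Intro sentence. As regards fiscal policy, countries submitted programmes. MORE TEXT",
  "Nothing relevant here.",
  "Intro. Turning to Fiscal policies, targets were missed. MORE TEXT"
)

# "[^.]*" walks back to the previous full stop, so the match begins at the
# start of the sentence that contains the keyword rather than at the keyword.
m <- regexpr("[^.]*fiscal polic.*$", texts, ignore.case = TRUE)

# regmatches() silently drops non-matching elements, so fill a vector of
# NAs to keep one entry per original document.
extracted <- rep(NA_character_, length(texts))
extracted[m != -1] <- trimws(regmatches(texts, m))

extracted
```

Splitting on full stops is a rough notion of a sentence; abbreviations such as "e.g." would cut the match short, and the trailing "MORE TEXT" still has to be removed separately.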

Tags: r corpus

【Solution 1】:

A base R solution that does not need any corpus functions:

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")
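To make the nested call easier to follow, here is the same pipeline unpacked into steps. This is only an illustrative sketch: the toy vector `texts` and the intermediate names (`pieces`, `flat`, `hits`, `result`) are hypothetical and do not appear in the answer.

```r
# Toy documents (hypothetical; stand-ins for df$Text)
texts <- c(
  "Intro here. As regards fiscal policy, plans were filed. MORE TEXT",
  "Other intro. Turning to Fiscal policies, targets slipped. MORE TEXT"
)

# 1. Split every document at full stops: a list with one character vector per document
pieces <- strsplit(texts, "[.]")

# 2. Flatten the list into a single vector of sentences
#    (this is the step where the document boundaries are lost)
flat <- unlist(pieces)

# 3. Keep only the sentences that match the pattern, ignoring case
hits <- grep("fiscal polic", flat, ignore.case = TRUE, value = TRUE)

# 4. Remove the leading and trailing whitespace left over from the split
result <- trimws(hits, "both")

result
```

Because of step 2, the result is a plain vector of sentences that no longer records which document each sentence came from.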

In response to the follow-up question, finding the index and using it to subset the data:

# Return vector of sentences containing pattern: 

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

# Store the matched text as a vector: 

matched_text <- trimws(grep("fiscal .*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

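# Get the row index (or indices) of the data frame for each matched sentence;
# fixed = TRUE treats the sentence as a literal string rather than as a regex:

matched_text_idx <- sapply(matched_text, function(x) which(grepl(x, df$Text, fixed = TRUE)))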

# If you want to subset the data frame to only the elements that contain the pattern:

df$Text[grepl("fiscal polic.*", df$Text, ignore.case = TRUE)]
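If the link between each extracted sentence and its source document matters, one option is to split document by document instead of flattening everything first. The sketch below is an assumption-laden illustration rather than part of the answer: the toy data frame `docs`, the column names `doc_id` and `sentence`, and the helper objects `matches` and `per_doc` are all hypothetical.

```r
# Toy data frame (hypothetical; stands in for the 213-row df)
docs <- data.frame(
  Text = c(
    "Intro sentence. As regards fiscal policy, programmes were submitted. Fiscal imbalances remain. MORE TEXT",
    "Nothing relevant here. MORE TEXT",
    "Intro. Turning to fiscal policy, targets were missed. MORE TEXT"
  ),
  stringsAsFactors = FALSE
)

# Split each document on its own so that the list position is the document index
sentences_by_doc <- strsplit(docs$Text, "[.]")

# One row per matching sentence, tagged with the row number of its document
matches <- do.call(rbind, lapply(seq_along(sentences_by_doc), function(i) {
  s <- trimws(sentences_by_doc[[i]])
  hit <- s[grepl("fiscal", s, ignore.case = TRUE)]
  if (length(hit) == 0) return(NULL)  # documents without a match add no rows
  data.frame(doc_id = i, sentence = hit, stringsAsFactors = FALSE)
}))

# Optionally collapse back to one string per original document (NA if no match)
per_doc <- rep(NA_character_, nrow(docs))
collapsed <- tapply(matches$sentence, matches$doc_id, paste, collapse = ". ")
per_doc[as.integer(names(collapsed))] <- collapsed

matches
per_doc
```

`matches` handles documents with more than one relevant sentence, while `per_doc` has the same length as the original data frame and could be turned back into a corpus.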

Data:

    df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. 
Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"), stringsAsFactors = FALSE)

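【Discussion】:

@Arma_91 No worries, base R is very useful and often should not be overlooked in favour of packages!
You are right! I applied your code to my data frame and it works (better than I expected, because it actually catches all the bits that refer to "fiscal"). There is only one downside: the structure of the data frame is lost. There are no longer 213 documents but N character strings. It would be spectacular if I could link these bits back to the texts they were extracted from. Unlike in my example, there is more than one bit per row.
@Arma_91 Edited for brevity and optimised for the search.
Let us continue this discussion in chat.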