【Question title】: Text Extraction in R with stringi package
【Posted】: 2016-12-27 09:44:46
【Question】:

I have the text below and need to extract the words that come before and after certain specific words.

Example:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE)

Actual output:

[[1]]
[1] "engineering plastics"

[[2]]
[1] "iso 9001"

[[3]]
[1] "office automation"

Desired output:

[1] globally expanding its engineering plastics centered on polycarbonate resin
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of

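Basically, I need to extract the text before and after the specific words I mention.

【Comments】:

  • Your call to stri_extract_all_fixed references an undefined variable, prav_1. Please make your example reproducible.
  • All of the text is either before or after your specific words. You seem to want 3 words before and 4 words after "engineering plastics", and 2 words before and many words after "iso 9001"... Do you have a consistent rule for how much you want to extract before and after?
  • Please change prav_1 to sometext.
  • I need the 10 words before and the 10 words after.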

Tags: r text-extraction stringr stringi


【Solution 1】:

Here is an idea to start with:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
words <- c('engineering plastics', 'iso 9001','office automation')
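# Leading context: up to 10 tokens, each followed by a space.
# Trailing context: the mirrored group (space first), because each search word is followed by a space in the text.
pattern <- stri_paste("([^ ]+ ){0,10}", words, "( [^ ]+){0,10}")
# overlap is an option of the fixed-pattern search only, so it is not passed here.
stri_extract_all_regex(sometext, pattern, case_insensitive=TRUE)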

Explanation: I add a simple regex before and after the words you want:

"([^ ]+ ){0,10}"

which means:

  1. anything other than a space, repeated as many times as possible
  2. followed by a space
  3. all of that up to ten times
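To make the three steps concrete, here is a minimal sketch on a toy sentence. The sentence and the limit of 2 tokens are assumptions for illustration only, not part of the question's data.

```r
library(stringi)

# Toy sentence (hypothetical) to show how the quantified group behaves.
txt <- "one two three four five six seven"

# Up to 2 space-delimited tokens immediately before the word "five".
before <- stri_extract_first_regex(txt, "([^ ]+ ){0,2}five")
print(before)

# {0,2} is an upper bound: fewer tokens are accepted when the text runs out.
short <- stri_extract_first_regex(txt, "([^ ]+ ){0,2}two")
print(short)
```

The first call returns the two tokens before "five" plus the word itself; the second returns only one leading token because nothing else precedes "two".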

This is very simple and naive (for example, it treats every '&' or '>' as a word), but it works.
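If symbols such as '&' or '>' should not count as words, one possible variation is to count runs of word characters (\w+) and treat everything between them (\W+) as separators. The sketch below is only an illustration: extract_context is a hypothetical helper name, the sample strings are made up, and note that this approach splits hyphenated tokens such as "blu-ray" into two words.

```r
library(stringi)

# Hypothetical helper: up to `n` word tokens on each side of `phrase`.
# \w+ is a run of word characters; \W+ is whatever separates them
# (spaces, newlines, punctuation, '&', '>').
extract_context <- function(text, phrase, n = 10) {
  pattern <- stri_paste(
    "(\\w+\\W+){0,", as.integer(n), "}",
    "\\Q", phrase, "\\E",   # quote the phrase so it is matched literally
    "(\\W+\\w+){0,", as.integer(n), "}"
  )
  stri_extract_first_regex(text, pattern, case_insensitive = TRUE)
}

# Made-up sample: '&' sits between two words and is not counted as a token.
ctx_amp <- extract_context("a & b engineering plastics c", "engineering plastics", n = 2)
print(ctx_amp)

# Made-up sample: 3 words of context on each side.
txt <- "resin & plastic unit is expanding its engineering plastics centered on resin, where we hold"
ctx <- extract_context(txt, "engineering plastics", n = 3)
print(ctx)
```

Because the helper is vectorized through stri_paste, a character vector of phrases can be passed in one call, in the same way as the words vector above.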

