【Question title】: Extract n Words Around Defined Term (Multicase)
【Posted】: 2018-02-11 02:03:17
【Question】:

I have a vector of text strings, such as:

Sentences <- c("I would have gotten the promotion, but TEST my attendance wasn’t good enough.Let me help you with your baggage.",
               "Everyone was busy, so I went to the movie alone. Two seats were vacant.",
               "TEST Rock music approaches at high velocity.",
               "I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
               "A purple pig and a green donkey TEST flew a TEST kite in the middle of the night and ended up sunburnt.",
               "Rock music approaches at high velocity TEST.")

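I would like to extract n (e.g., three) words (a word being a run of characters with whitespace before and after it) AROUND (i.e., before and after) a specific term (e.g., "TEST"). Important: several matches should be allowed (i.e., if the specific term occurs more than once, the solution should cover these cases).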

The result might look like this (the format can be improved):

S1  <- c(before = "the promotion, but", after = "my attendance wasn’t")
S2  <- c(before = "",                   after = "")
S3  <- c(before = "",                   after = "Rock music approaches")
S4a <- c(before = "to take your",       after = "donation; any amount")
S4b <- c(before = "will be greatly",    after = "appreciated.")
S5a <- c(before = "a green donkey",     after = "flew a TEST")
S5b <- c(before = "TEST flew a",        after = "kite in the")
S6  <- c(before = "at high velocity",   after = "")

How can I do this? The other posts I have found either handle only single-match cases or rely on fixed sentence structures.

【Question discussion】:

    Tags: r text text-mining tm


    【Solution 1】:

    The quanteda package has a great function for this: kwic() (keywords in context).

    Out of the box, this works pretty well on your example:

    library("quanteda")
    names(Sentences) <- paste0("S", seq_along(Sentences))
    # tokenize first: recent quanteda releases no longer accept a character vector in kwic()
    (kw <- kwic(tokens(Sentences), "TEST", window = 3))
    # 
    # [S1, 9]   promotion, but | TEST | my attendance wasn't 
    # [S3, 1]                  | TEST | Rock music approaches
    # [S4, 7]     to take your | TEST | donation; any        
    # [S4, 15] will be greatly | TEST | appreciated.         
    # [S5, 8]   a green donkey | TEST | flew a TEST          
    # [S5, 11]     TEST flew a | TEST | kite in the          
    # [S6, 7] at high velocity | TEST | .               
    
    (kw2 <- as.data.frame(kw)[, c("docname", "pre", "post")])
    #   docname              pre                  post
    # 1      S1  promotion , but  my attendance wasn't
    # 2      S3                  Rock music approaches
    # 3      S4     to take your        donation ; any
    # 4      S4  will be greatly         appreciated .
    # 5      S5   a green donkey           flew a TEST
    # 6      S5      TEST flew a           kite in the
    # 7      S6 at high velocity                     .
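
Note that in the output above punctuation counts as a token of its own (for example `donation ; any`), whereas the question defines a word as anything delimited by whitespace. If that stricter definition matters, a base R sketch along the following lines may help. The helper `words_around()` is hypothetical (it is not part of quanteda), it matches the term case-sensitively, and it ignores leading and trailing punctuation only when looking for the term, so `TEST.` still counts as a match.

```r
# Hypothetical helper, not part of quanteda: for every occurrence of `term`
# in a single string `x`, return up to `n` whitespace-delimited words
# before and after it.
words_around <- function(x, term, n = 3) {
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[nzchar(words)]
  # strip surrounding punctuation for matching only; the output keeps it
  bare <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
  hits <- which(bare == term)
  lapply(hits, function(i) {
    before <- words[seq_len(i - 1)]
    after  <- words[seq_len(length(words) - i) + i]
    c(before = paste(tail(before, n), collapse = " "),
      after  = paste(head(after, n), collapse = " "))
  })
}

# Small stand-in for the Sentences vector from the question
example <- c(
  S2 = "Everyone was busy, so I went to the movie alone. Two seats were vacant.",
  S4 = "I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
  S6 = "Rock music approaches at high velocity TEST."
)

# One list element per sentence, one named vector per match
res <- lapply(example, words_around, term = "TEST", n = 3)
str(res)
```

Sentences without a match yield an empty list, so the multi-match and no-match cases are handled by the same code path.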
    

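    This is probably a better format than the separate objects you asked for in the question. But to get as close as possible to your target, you can transform it further as follows.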

    # this picks up the empty matching sentence S2
    (kw3 <- merge(kw2, 
                  data.frame(docname = names(Sentences), stringsAsFactors = FALSE), 
                  all.y = TRUE))
    # replaces the NA with the empty string
    kw4 <- as.data.frame(lapply(kw3, function(x) { x[is.na(x)] <- ""; x} ), 
                         stringsAsFactors = FALSE)
    # renames pre/post to before/after
    names(kw4)[2:3] <- c("before", "after")
    # makes the docname unique
    kw4$docname <- make.unique(kw4$docname)
    
    kw4
    #   docname           before                 after
    # 1      S1  promotion , but  my attendance wasn't
    # 2      S2                                       
    # 3      S3                  Rock music approaches
    # 4      S4     to take your        donation ; any
    # 5    S4.1  will be greatly         appreciated .
    # 6      S5   a green donkey           flew a TEST
    # 7    S5.1      TEST flew a           kite in the
    # 8      S6 at high velocity                     .
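
If separate named vectors like those in the question are really wanted, a data frame shaped like `kw4` can be turned into a named list with base R. The following is only a sketch: the small `kw4_demo` data frame is typed in by hand from a few of the rows shown above (it is an illustration, not output from quanteda), and `result` is a hypothetical name.

```r
# make.unique() is what produces the ".1" suffix on repeated document names
make.unique(c("S3", "S4", "S4"))

# Hand-built stand-in for a few rows of kw4 (assumption: same column layout)
kw4_demo <- data.frame(
  docname = c("S2", "S3", "S4", "S4.1"),
  before  = c("", "", "to take your", "will be greatly"),
  after   = c("", "Rock music approaches", "donation ; any", "appreciated ."),
  stringsAsFactors = FALSE
)

# One named character vector per row, keyed by the unique docname
result <- lapply(seq_len(nrow(kw4_demo)), function(i) {
  c(before = kw4_demo$before[i], after = kw4_demo$after[i])
})
names(result) <- kw4_demo$docname

result[["S4.1"]]
```

Keeping everything in one list, rather than creating free-standing objects such as `S4a` and `S4b`, makes it easier to loop over the matches later.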
    

    【Comments】:

    • Perfect. Thank you!