【Title】: Cleaning 'stringr str_replace_all' automatic concatenation when matching multiple times
【Posted】: 2016-03-03 15:22:21
【Question】:

I used police_officer <- str_extract_all(txtparts, "ID:.*\n") to extract the names of all police officers involved in a 911 call from a text file. For example:
2237 DISTURBANCE Report taken
Call Taker: Telephone Operators Sharon L Moran
Location/Address: [BRO 6949] 61 WILSON ST
ID: Patrolman Darvin Anderson
Disp-22:43:39 Arvd-22:48:57 Clrd-23:49:45
ID: Patrolman Stephen T Pina
Disp-22:43:48 Clrd-22:46:10
ID: Sergeant Michael V Damiano
Disp-22:46:33 Arvd-22:47:14 Clrd-22:55:22

When a section contains more than one ID: match, I get: "c(\" Patrolman Darvin Anderson\\n\", \" Patrolman Stephen T Pina\\n\", \" Sergeant Michael V Damiano\\n\")". Here is what I have tried so far to clean the data:
police_officer <- str_replace_all(police_officer,"c\\(.","")
police_officer <- str_replace_all(police_officer,"\\)","")
police_officer <- str_replace_all(police_officer,"ID:","")
police_officer <- str_replace_all(police_officer,"\\n\","") # I can't get rid of \\n\.

This is what I end up with:
" Patrolman Darvin Anderson\\n\", \" Patrolman Stephen T Pina\\n\", \" Sergeant Michael V Damiano\\n\""

I need help removing the \\n\.

【Comments】:

    Tags: regex r string substring stringr


    【Solution 1】:

    You can use the following regular expression with str_match_all:

    \bID:\s*(\w+(?:\h+\w+)*)
    


    > txt <- "Call Taker:    Telephone Operators Sharon L Moran\n  Location/Address:    [BRO 6949] 61 WILSON ST\n                ID:    Patrolman Darvin Anderson\n                       Disp-22:43:39                 Arvd-22:48:57  Clrd-23:49:45\n                ID:    Patrolman Stephen T Pina\n                       Disp-22:43:48                                Clrd-22:46:10\n                ID:    Sergeant Michael V Damiano\n                       Disp-22:46:33                 Arvd-22:47:14  Clrd-22:55:22"
    > str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
    [[1]]
         [,1]                                [,2]                        
    [1,] "ID:    Patrolman Darvin Anderson"  "Patrolman Darvin Anderson" 
    [2,] "ID:    Patrolman Stephen T Pina"   "Patrolman Stephen T Pina"  
    [3,] "ID:    Sergeant Michael V Damiano" "Sergeant Michael V Damiano"
    

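    The regex matches ID: as a whole word, then zero or more whitespace characters (\s*), and then captures a sequence of word-character runs optionally separated by horizontal whitespace (\h+). Because \h does not match a line break, the capture stops at the end of the line holding the name. str_match_all returns the captured group alongside the full match, whereas str_extract_all returns only the full match (including the ID: prefix), which is why str_match_all is used with this regex.

    Update: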
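To make the difference between the two functions concrete, here is a minimal sketch. It assumes stringr is installed and uses a shortened sample string in the same layout as the log entry; the variable names `pattern`, `full_matches` and `officers` are illustrative and not part of the original answer.

```r
library(stringr)

# Shortened sample in the same layout as the log entry (an assumption made
# for brevity; the real entry has more lines).
txt <- "ID:    Patrolman Darvin Anderson\n   Disp-22:43:39\nID:    Sergeant Michael V Damiano\n   Disp-22:46:33"

pattern <- "\\bID:\\s*(\\w+(?:\\h+\\w+)*)"

# str_extract_all() returns only the full match, so the "ID:" prefix stays.
full_matches <- str_extract_all(txt, pattern)[[1]]

# str_match_all() returns one matrix per input string: column 1 is the full
# match, column 2 is the first capture group (the name only).
officers <- str_match_all(txt, pattern)[[1]][, 2]

print(full_matches)
print(officers)
```

Taking column 2 directly avoids a second clean-up pass to strip the ID: prefix and the trailing line break.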

    > time <- str_trim(str_extract(txt, " [[:digit:]]{4}"))
    > Call_taker <- str_replace_all(str_extract(txt, "Call Taker:.*\n"),"Call Taker:","" ) %>% str_replace_all("\n","")
    > address <- str_extract(txt, "Location/Address:.*\n")
    > Police_officer <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)")
    > BPD_log <- cbind(time,Call_taker,address,list(Police_officer[[1]][,2]))
    > BPD_log <- as.data.frame(BPD_log)
    > colnames(BPD_log) <- c("time", "Call_taker", "address", "Police_officer")
    > BPD_log
      time                             Call_taker                                        address
    1 6949     Telephone Operators Sharon L Moran Location/Address:    [BRO 6949] 61 WILSON ST\n
                                                                       Police_officer
    1 Patrolman Darvin Anderson, Patrolman Stephen T Pina, Sergeant Michael V Damiano
    > 
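The update above keeps the officers as a list stored inside one cell. If a plain character column is preferred, one alternative (not from the original answer) is to collapse the names into a single string with paste() before building the data frame. The sketch below uses base R only and hard-codes the extracted values, so the literal strings are assumptions standing in for the results of the extraction steps.

```r
# Officer names as they would come out of str_match_all(...)[[1]][, 2];
# hard-coded here so the sketch runs without stringr.
officers <- c("Patrolman Darvin Anderson",
              "Patrolman Stephen T Pina",
              "Sergeant Michael V Damiano")

# Collapse the vector into one string so it fits in a single cell.
officers_cell <- paste(officers, collapse = ", ")

BPD_log <- data.frame(
  time = "2237",  # leading four digits of the sample entry in the question
  Call_taker = "Telephone Operators Sharon L Moran",
  address = "[BRO 6949] 61 WILSON ST",
  Police_officer = officers_cell,
  stringsAsFactors = FALSE
)

print(BPD_log$Police_officer)
```

The trade-off is that the individual names can no longer be accessed as separate elements without splitting the string again, for example with strsplit().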
    

    【Comments】:

    • Thanks! I guess the real problem appears when I put everything into a data frame with Call_taker, time, address and Police_officer. time <- str_trim(str_extract(txt, " [[:digit:]]{4}")) Call_taker <- str_replace_all(str_extract(txt, "Call Taker:.*\n"),"Call Taker:","" ) %>% str_replace_all("\n","") address <- str_extract(txt, "Location/Address:.*\n") Police_officer <- str_match_all(txt, "\\bID:\\s*(\\w+(?:\\h+\\w+)*)") BPD_log <- cbind(time,Call_taker,address,Police_officer) BPD_log <- as.data.frame(BPD_log) Once Police_officer is included, we still get c(
    • I don't know what your final data frame should look like, but note that you are adding the entire output of str_match_all, while you only need the [,2] column. Try BPD_log <- cbind(time,Call_taker,address,Police_officer[[1]][,2])
    • I just saw your update, but I want the data to appear in a single row, meaning all the police officers should be in one cell. It would be great if you could do that.
    • What about BPD_log <- cbind(time,Call_taker,address,list(Police_officer[[1]][,2]))?
    • I am not sure what you need. (?s)Location\/Address:[^\n]*\R(.*)?
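As a footnote to the c( text discussed in these comments: it is what R produces when a list element holding several strings is coerced to a single character value. The base R sketch below illustrates this with a hand-built list; the variable names are illustrative, and the assumption is that the same coercion happens when the list returned by str_extract_all or str_match_all is passed on as if it were a character vector.

```r
# A list whose single element holds several strings, similar in shape to the
# per-entry result of str_extract_all().
officers <- list(c("Patrolman Darvin Anderson", "Patrolman Stephen T Pina"))

# Coercing the list to character deparses each element, which is where the
# literal c(...) text comes from.
as_text <- as.character(officers)

# Indexing into the list first keeps the names as separate, clean strings.
clean <- officers[[1]]

print(as_text)
print(clean)
```

Indexing with [[1]] (and [, 2] for the str_match_all matrix) before any further processing avoids the need to strip c(, quotes and escaped line breaks afterwards.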