【Question title】: Replacing a regex pattern in R
【Posted】: 2017-09-02 23:44:30
【Question】:

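I have a text column containing speech-to-text transcripts of phone calls between customers and agents. After some text manipulation on the raw values, suppose I have a vector like the following: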

text <- " customer:customer text1 agent:agent text 1 customer:customer text2 agent:agent text 2"

(Note the leading space at the start of the text vector.)

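Question: how can I extract the customer and agent texts from the original source field (the text vector in this example) into two separate fields?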

# desired outputs:
# field for customer texts
"customer text1, customer text2"
# field for agent texts
"agent text1, agent text2"

What I have managed so far (with limited regex experience) is:

customerText <- gsub("^ customer:| agent:(.*)", "", text)
customerText 
[1] "customer text1"

Edit:

Please consider the reproducible, data-frame-based code below instead of the vector-based code above.

> callid <- c("1","2")
> conversation <- c(" customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2",
+                   " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9")
> conversationCustomer <- c("customer text 1, customer text 2", "customer text 8, customer text 9")
> conversationAgent <- c("agent text 1, agent text 2", "agent text 8, agent text 9")
> df <- data.frame(callid, conversation)
> dfDesired <- data.frame(callid, conversation, conversationCustomer, conversationAgent)
> rm(callid, conversation, conversationCustomer, conversationAgent)
> 
> df
  callid                                                                             conversation
1      1  customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2
2      2  agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9
> dfDesired
  callid                                                                             conversation             conversationCustomer          conversationAgent
1      1  customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2 customer text 1, customer text 2 agent text 1, agent text 2
2      2  agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9 customer text 8, customer text 9 agent text 8, agent text 9

Thanks!

【Comments】:

  • R for text parsing? God help you.

Tags: r regex gsub


【Solution 1】:

We can use str_extract_all from stringr:

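library(stringr)
# The lookbehind is grouped so that it applies to both speakers; in
# "(?<=:)(customer...)|(agent...)" it only guarded the customer branch.
v1 <- str_extract_all(text, "(?<=:)(?:customer|agent)\\s+\\w+\\s*\\d*")[[1]]
v1[c(TRUE, FALSE)]  # odd positions: customer texts (assumes the customer speaks first)
v1[c(FALSE, TRUE)]  # even positions: agent texts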

Or use strsplit:

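# Split on the speaker labels; the piece before the first label is empty.
v1 <- strsplit(trimws(text), "(customer|agent):\\s*")[[1]]
v2 <- trimws(v1[nzchar(v1)])  # drop the empty piece and trim whitespace
toString(v2[c(TRUE, FALSE)])  # customer texts (assumes the customer speaks first)
toString(v2[c(FALSE, TRUE)])  # agent texts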

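【Comments】:

  • In the question above I used the vector "text" as an example, and your solution works very well for that. Thanks! However, when I tried the 'strsplit' approach on the real data in my data frame, it gave the following error. > df$conversation_customer $<-.data.frame(*tmp*, conversationCustomer, value = c(" ", : replacement has 86 rows, data has 1. Then, with the help of your code, I came up with: df$conversationCustomer
  • @kzmlbyrk If it is a data.frame, then you do not need to subset the first element, e.g. lst <- strsplit(trimws(df$conversation), "(customer|agent):\\s*"); do.call(rbind, lapply(lst, function(x) x[nzchar(x)][c(TRUE, FALSE)])), and similarly with c(FALSE, TRUE).
  • Am I missing something? df$conversationCustomer
  • Another thing I have just realized: the "df$conversation" column sometimes starts with customer text and sometimes with agent text. So the [c(TRUE, FALSE)] index may not always select the intended texts. Sorry for noticing this so late.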
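
The ordering concern in the last comment can be avoided by matching each utterance together with its speaker label instead of relying on alternating positions. The following is only a sketch in base R: extract_role() is a hypothetical helper, not something from either answer, and it assumes the only labels in the text are "customer:" and "agent:".

```r
# Example data copied from the question.
df <- data.frame(
  callid = c("1", "2"),
  conversation = c(
    " customer:customer text 1 agent:agent text 1 customer:customer text 2 agent:agent text 2",
    " agent:agent text 8 customer:customer text 8 agent:agent text 9 customer:customer text 9"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect every utterance spoken by `role`,
# no matter which speaker opens the conversation.
extract_role <- function(x, role) {
  x <- as.character(x)  # tolerate factor columns
  # Start right after "<role>:" and stop (lazily) just before the next
  # speaker label or at the end of the string.
  pattern <- paste0("(?<=", role, ":).*?(?=\\s*(?:customer|agent):|$)")
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # Join the utterances of each conversation into one comma-separated string.
  vapply(hits, function(h) toString(trimws(h)), character(1), USE.NAMES = FALSE)
}

df$conversationCustomer <- extract_role(df$conversation, "customer")
df$conversationAgent    <- extract_role(df$conversation, "agent")
```

Because the pattern is anchored on the label itself, the second row (which starts with the agent) is handled the same way as the first. An utterance that itself contains "customer:" or "agent:" would be split at that point, so this only fits transcripts where those strings appear as labels.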
【Solution 2】:

For now, I was able to solve it as follows. I suppose someone more experienced with regular expressions could shorten it.

df$conversationCustomer <- gsub("agent:.*?customer:", ",", df$conversation)  # replace each span from "agent:" through the next "customer:" with a comma
df$conversationCustomer <- gsub("agent:.*", "", df$conversationCustomer) # drop an agent turn at the end of the conversation, which the first pattern cannot reach
df$conversationCustomer <- gsub("customer:", "", df$conversationCustomer) # strip the "customer:" label left over when the conversation starts with the customer
# Same three steps with the roles swapped.
df$conversationAgent <- gsub("customer:.*?agent:", ",", df$conversation)
df$conversationAgent <- gsub("customer:.*", "", df$conversationAgent)
df$conversationAgent <- gsub("agent:", "", df$conversationAgent)
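
As written, these gsub passes appear to leave stray whitespace around the commas, and a leading comma when the conversation opens with the other speaker. A possible cleanup pass is sketched below. tidy_list() is a hypothetical helper, not part of the answer, and the two input strings are assumed to have the shape the steps above produce for the example data.

```r
# Hypothetical cleanup helper: normalise a comma-separated field.
tidy_list <- function(x) {
  x <- trimws(x)                                      # outer whitespace
  x <- gsub("\\s*,\\s*", ", ", x, perl = TRUE)        # exactly ", " between items
  x <- gsub("^(?:, )+|(?:, )+$", "", x, perl = TRUE)  # leading/trailing separators
  trimws(x)
}

# Strings assumed to resemble the intermediate results for the two example rows.
raw <- c(" customer text 1 ,customer text 2 ",
         " ,customer text 8 ,customer text 9")
cleaned <- tidy_list(raw)
```

In the data-frame setting this would be applied as df$conversationCustomer <- tidy_list(df$conversationCustomer), and likewise for the agent column. Note that it also re-spaces any commas that occur inside an utterance.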
