Split long string by a vector of words

[Question title]: Split long string by a vector of words
[Posted]: 2018-01-04 18:33:35
[Question]:

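I want to split some TV scripts into a data frame with two variables: (1) the spoken dialogue and (2) the speaker.

Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html

Loaded into R as follows: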

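library(rvest)

# Address of the transcript page
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
# Download and parse the HTML (the variable url is reused for the parsed page)
url <- read_html(url)

# Extract all of the page text as a single string
all <- url %>% html_text()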

[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript\n\nBuffy Episode #127: \"Selfless\" \n  Transcript\nWritten by Drew Goddard\n  Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. \n\n \n   \n        NB: The content of this transcript, including the characters \n          and the story, belongs to Mutant Enemy. This transcript was created \n          based on the broadcast episode.\n      \n       \n      \n             \n            BUFFYWORLD.COM \n              prefers that you direct link to this transcript rather than post \n              it on your site, but you can post it on your site if you really \n              want, as long as you keep everything intact, this includes the link \n              to buffyworld.com and this writing. Please also keep the disclaimers \n              intact.\n            \n            Originally transcribed for: http://www.buffyworld.com/.\n\t  \n    TEASER (RECAP SEGMENT):\n  GILES (V.O.)\n\n  Previousl... <truncated>

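I am now trying to split on each character's name (I have a full list), for example "GILES" above. This works fine, except that I can't keep the character name when I split there. Here is a simplified example.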

to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)

This gives me the split I want, but it does not keep the character names.
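
To make the problem concrete, here is a minimal sketch on a short made-up string (an assumption standing in for the real transcript text). It shows that a plain strsplit treats the matched names as delimiters and drops them from the result:

```r
# Toy input, not the real transcript (assumption for illustration)
string <- "text BUFFY more text WILLOW other text"
to_parse <- paste(c("BUFFY", "WILLOW"), collapse = "|")

# The names match the split pattern, so they are consumed and lost
pieces <- strsplit(string, to_parse)[[1]]
print(pieces)
# [1] "text "       " more text " " other text"
```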

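Narrow question: is there any way to keep the character names with what I'm doing? Broad question: should I be trying a different approach altogether?

Thanks in advance!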

[Comments]:

Tags: r string strsplit

[Solution 1]:

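I think you can use a Perl-compatible regular expression with strsplit. For the purpose of explanation I used a shorter example string, but it should work the same way. The pattern below is a zero-width lookbehind, so the split happens right after each name and the name itself is not consumed: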

string <- "text BUFFY more text WILLOW other text"

to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
strsplit(string, paste0("(?<=", to_parse, ")"), perl = TRUE)

#[[1]]
#[1] "text BUFFY"        " more text WILLOW" " other text"

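As @Lamia suggested, if you want the name to come before its text, you can use a positive lookahead instead. I edited the suggestion slightly so that the split strings still contain the delimiter.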

strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)

#[[1]]
#[1] "text "             "BUFFY more text "  "WILLOW other text"
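
Building on the lookahead split, one possible way to get to the two-column data frame asked for in the question is sketched below. This is only an illustration under assumptions: the toy string stands in for the transcript, the names speakers, pieces, has_name and script are made up here, and any piece that does not start with a known name (such as the leading "text ") is simply dropped.

```r
# Toy input and a hypothetical list of speaker names (assumptions)
string <- "text BUFFY more text WILLOW other text"
speakers <- c("BUFFY", "WILLOW")
to_parse <- paste(speakers, collapse = "|")
name_at_start <- paste0("^(", to_parse, ")")

# Split so that each piece starts with a speaker name (lookahead variant)
pieces <- strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)[[1]]

# Keep only the pieces that begin with a known name
has_name <- grepl(name_at_start, pieces)
pieces <- pieces[has_name]

# Speaker = the leading name; dialogue = the rest with whitespace trimmed
script <- data.frame(
  speaker  = regmatches(pieces, regexpr(name_at_start, pieces)),
  dialogue = trimws(sub(name_at_start, "", pieces)),
  stringsAsFactors = FALSE
)
print(script)
```

On a real transcript the pattern would also split wherever a name appears in capitals inside someone's dialogue or a stage direction, so the result would probably need some checking.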

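[Comments]:

Worked like a charm - thanks for the quick and helpful response. I could easily have started with a simpler example string too, and will do that next time.
If you want the character name to come before the text, you can also use strsplit(x,".(?=GILES|BUFFY|WILLOW)",perl=T). Note that this pattern consumes the single character immediately before each name.
@MikeH. Just FYI - I didn't need to change the order, but I got an error when I implemented the code for the second option above. error strsplit(string, paste0("(?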