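【Posted】: 2018-01-04 18:33:35
【Question】:
I want to split some TV scripts into a data frame with two variables: (1) the spoken dialogue and (2) the speaker.
Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html
It is loaded into R as follows: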
require(rvest)
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
url <- read_html(url)
all <- url %>% html_text()
[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript\n\nBuffy Episode #127: \"Selfless\" \n Transcript\nWritten by Drew Goddard\n Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. \n\n \n \n NB: The content of this transcript, including the characters \n and the story, belongs to Mutant Enemy. This transcript was created \n based on the broadcast episode.\n \n \n \n \n BUFFYWORLD.COM \n prefers that you direct link to this transcript rather than post \n it on your site, but you can post it on your site if you really \n want, as long as you keep everything intact, this includes the link \n to buffyworld.com and this writing. Please also keep the disclaimers \n intact.\n \n Originally transcribed for: http://www.buffyworld.com/.\n\t \n TEASER (RECAP SEGMENT):\n GILES (V.O.)\n\n Previousl... <truncated>
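I am now trying to split on each character's name (I have a full list), for example "GILES" above. This works fine, except that I cannot keep the character name when I split there. Here is a simplified example.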
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)
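This gives me the split I want, but it does not keep the character names.
Narrow question: is there a way to keep the character names with what I am doing? Broader question: should I be trying a different approach altogether?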
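One direction that might work, sketched on a made-up string rather than the real transcript: find the names with gregexpr(), pull them out with regmatches(), and use invert = TRUE to get the text between them. The `toy` string and the `script` data frame are hypothetical names used only for illustration. The sketch assumes a character name never appears in capitals inside the dialogue itself.

```r
# Hypothetical toy transcript standing in for the real page text
toy <- "TEASER\nBUFFY\n Hello there.\nWILLOW\n Hi.\nBUFFY\n Bye."
to_parse <- paste(c("BUFFY", "WILLOW"), collapse = "|")

# Locate every occurrence of a character name
m <- gregexpr(to_parse, toy)

# The matched names, in order of appearance
speakers <- regmatches(toy, m)[[1]]

# The text around the matches: always one more piece than there are names
pieces <- regmatches(toy, m, invert = TRUE)[[1]]

# Drop the piece before the first name and pair each name with what follows it
script <- data.frame(
  speaker = speakers,
  dialogue = trimws(pieces[-1]),
  stringsAsFactors = FALSE
)
```

If names can also occur inside the dialogue, the pattern would probably need to be tightened, for example by requiring the name to sit alone on its line.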
Thanks in advance!
【Comments】: