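【Question title】:How to split a piece of text by a word in R? (break the text after a specific word)
【Posted】:2021-11-26 20:01:04
【Question】:

I need to split PDF files into their chapters. In each PDF, I added the word "Hirfar" at the beginning of every chapter so that I can find it and split the text there. Consider the following example: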

t <- c(" Hirfar Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”.

Hirfar In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”.

 Hirfar “At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said.

Hirfar He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”

Hirfar Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health.")

Here I used this code to break it into words:

library(stringr)
# split the text into individual words
wrds <- str_split(t, pattern = boundary(type = "word"))

Now I want to find the word "Hirfar" and split the text into 5 separate texts. Each one must run from the first word after a Hirfar up to the last word before the next Hirfar.

【Comments】:

    Tags: r tidyverse tidyr data-mining stringr


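    【Solution 1】:

    We could use a regex lookaround: split on the whitespace that is followed by "Hirfar", so the marker itself is kept in each piece. The [-1] drops the empty first element produced by the leading space.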

    strsplit(t, "\\s+(?=Hirfar)", perl = TRUE)[[1]][-1]
    

    -output

    [1] "Hirfar Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”."                                                                                                                                                                                                        
    [2] "Hirfar In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”."                                                                                                                                                                   
    [3] "Hirfar “At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said."                                                                                                                                                                                                                                                   
    [4] "Hirfar He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”"                                                                                                                                            
    [5] "Hirfar Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health."
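
To make the lookahead behaviour easier to see, here is a minimal sketch on a toy string. The variable `x` and its contents are made up for illustration and are not part of the question's data; the filter with `nzchar()` is just one way to drop the empty leading piece.

```r
# Toy input (hypothetical): same layout as the question's text,
# with a leading space and blank lines between chapters.
x <- " Hirfar one two\n\nHirfar three\n\n Hirfar four five"

# Split on whitespace that is followed by "Hirfar". The lookahead
# (?=Hirfar) is not consumed, so the marker stays in each piece.
with_marker <- strsplit(x, "\\s+(?=Hirfar)", perl = TRUE)[[1]]

# The leading space produces an empty first element; drop empty pieces.
with_marker <- with_marker[nzchar(with_marker)]

print(with_marker)
```

Filtering with `nzchar()` instead of `[-1]` should also behave sensibly when the text does not start with whitespace, since in that case there is no empty first element to remove.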
    

    If the result should not include Hirfar, split on the marker itself together with the whitespace after it:

    strsplit(t, "Hirfar\\s+")[[1]][-1]
    [1] "Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”.\n\n"                                                                                                                                                                                                    
    [2] "In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”.\n\n "                                                                                                                                                              
    [3] "“At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said.\n\n"                                                                                                                                                                                                                                               
    [4] "He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”\n\n"                                                                                                                                        
    [5] "Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health."
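
Note that the pieces above still end with the blank lines that preceded the next marker ("\n\n"). If those are unwanted, one option is to trim each piece afterwards. This is a minimal sketch on a toy string; the variable `x` is hypothetical and stands in for the question's `t`.

```r
# Toy input (hypothetical): same layout as the question's text.
x <- " Hirfar one two\n\nHirfar three\n\n Hirfar four five"

# Split on the marker plus the whitespace after it; the first element
# is whatever precedes the first marker (here a single space), so drop it.
parts <- strsplit(x, "Hirfar\\s+")[[1]][-1]

# Remove leading and trailing spaces, tabs and newlines from each piece.
parts <- trimws(parts)

print(parts)
```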
    
