【Question title】: R: parsing text file of quotations / splitting into paragraphs
【Posted】: 2020-03-27 22:26:54
【Question description】:

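I am looking for an R solution for parsing a text file of quotations (like the one shown below). The result should be a data.frame with one observation per quotation and the variables text and source, as described below.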

DIAGRAMS are of great utility for illustrating certain questions of vital statistics by
conveying ideas on the subject through the eye, which cannot be so readily grasped when
contained in figures.
--- Florence Nightingale, Mortality of the British Army, 1857

To give insight to statistical information it occurred to me, that making an
appeal to the eye when proportion and magnitude are concerned, is the best and
readiest method of conveying a distinct idea. 
--- William Playfair, The Statistical Breviary (1801), p. 2


Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.
--- William Playfair, Elemens de statistique, Paris, 1802, p. XX.

The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.
--- Charles Joseph Minard

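Here, each quotation is a paragraph, separated from the next by "\n\n". Within a paragraph, all lines up to the one beginning with --- make up the text, and whatever follows --- is the source.

I think I could do this if I could first split the lines of text into paragraphs (separated by '\\n\\n+', i.e. two or more consecutive newlines), but I have not been able to get that to work.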

【Comments】:

  • Could you share the code you have tried?

Tags: r parsing paragraph quotations


【Solution 1】:

Assuming the initial text is loaded as a single string in the variable rawText:

library(stringr)

strsplit(rawText, "\n\n")[[1]] %>%      # one element per quotation
  str_split_fixed("\n--- ", 2) %>%      # two-column matrix: text, source
  as.data.frame() %>% 
  setNames(c("text", "source"))
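The snippet above starts from a single string, which is the step the question got stuck on. A minimal base R sketch of that step follows. The temporary file and its contents are made up for illustration; they stand in for the real quotations file. Splitting on "\n{2,}" rather than "\n\n" is a choice made here so that runs of several blank lines are treated as one separator.

```r
# Hypothetical sample file standing in for the real quotations file
path <- tempfile(fileext = ".txt")
writeLines(c(
  "First quote, line one",
  "line two.",
  "--- Author A, Book (1801)",
  "",
  "Second quote.",
  "--- Author B"
), path)

# readLines() returns one element per line, so glue the lines back together
rawText <- paste(readLines(path), collapse = "\n")

# Split on runs of two or more newlines: one element per quotation
paragraphs <- strsplit(rawText, "\n{2,}")[[1]]
print(paragraphs)
```

Each element of paragraphs still contains the "\n--- " marker, so it can be passed on to str_split_fixed() as in the pipeline above.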

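【Comments】:

  • To me, this looks like by far the simplest and tidiest solution. What I was missing was (a) that the result of strsplit needs unlist() or [[1]]; (b) setNames() for assigning names inside a pipeline.
【Solution 2】:

This should do most of what you need. I am assuming you already have the file stored in a character vector of length 1 called txt: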

library(tidyverse)

txt                                             %>% 
strsplit("\n{2,5}")                             %>% 
unlist()                                        %>% 
lapply(function(x) unlist(strsplit(x, "--- "))) %>%
{do.call("rbind", .)}                           %>%
as.data.frame(stringsAsFactors = FALSE)         %>%
setNames(c("Text", "Source"))                    ->
df

If you tidy up the text by replacing the newlines with spaces, you get the following result:

df$Text <- gsub("\n", " ", df$Text)
as_tibble(df)
#> # A tibble: 4 x 2
#>   Text                                              Source                             
#>   <chr>                                             <chr>                              
#> 1 "DIAGRAMS are of great utility for illustrating ~ Florence Nightingale, Mortality of~
#> 2 "To give insight to statistical information it o~ William Playfair, The Statistical ~
#> 3 "Regarding numbers and proportions, the best way~ William Playfair, Elemens de stati~
#> 4 "The aim of my carte figurative is to convey pro~ Charles Joseph Minard 

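【Comments】:

  • Pretty much the same as my answer below. In this case I would suggest loading only magrittr, since tidyverse is a very large dependency just to get the pipe (%>%) operator.
  • @MarcusA. I'm a big fan of magrittr, but my experience on SO is that most (though not all) users already have some or all of the tidyverse loaded. Also, for display purposes, I wanted to show the result as a tibble rather than a data frame. I'm just being lazy!
  • 100% agree that most people have tidyverse. I was deliberately trying to sneak some code optimization into the thread, and to open up the black box of what tidyverse contains :)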
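On the dependency point raised in these comments: as a sketch only, the same pipeline can be written with no packages at all by using the native pipe |> and the \(x) lambda shorthand. This assumes R 4.1 or later, which the thread does not mention. The two-quotation txt string is a made-up stand-in for the real input.

```r
# Hypothetical two-quotation input, for illustration only
txt <- "Quote one,\nsecond line.\n--- Author A\n\n\nQuote two.\n--- Author B, 1802"

df <- txt |>
  strsplit("\n{2,}") |>                       # split into paragraphs
  unlist() |>
  strsplit("\n--- ", fixed = TRUE) |>         # split each paragraph into text / source
  (\(parts) do.call(rbind, parts))() |>       # stack the pairs into a 2-column matrix
  as.data.frame(stringsAsFactors = FALSE) |>
  setNames(c("Text", "Source"))

# Replace the remaining line breaks inside each quotation with spaces
df$Text <- gsub("\n", " ", df$Text, fixed = TRUE)
print(df)
```

The anonymous-function call replaces the magrittr idiom {do.call("rbind", .)}, because the native pipe has no dot placeholder of that kind.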
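【Solution 3】:

Assuming your text file is quote.txt in the working directory.

Base R solution: split twice, (1) on \n\n and (2) on ---, then combine the pieces into a data frame.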

quote <- readLines("quote.txt")
quote <- paste(quote, collapse = "\n")

DF <- strsplit(unlist(strsplit(quote, "\n\n")), "---")
DF <- data.frame(text= trimws(sapply(DF, "[[", 1)), 
           source = trimws(sapply(DF, "[[", 2)))

Output

DF
                                                                                                                                                                                                                                                                                 # text
# 1     DIAGRAMS are of great utility for illustrating certain questions of vital statistics by\nconveying ideas on the subject through the eye, which cannot be so readily grasped when\ncontained in figures.
# 2 To give insight to statistical information it occurred to me, that making an\nappeal to the eye when proportion and magnitude are concerned, is the best and\nreadiest method of conveying a distinct idea.
# 3                                                                                                           Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.
# 4                                                                     The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.
#                                                          source
# 1     Florence Nightingale, Mortality of the British Army, 1857
# 2       William Playfair, The Statistical Breviary (1801), p. 2
# 3 William Playfair, Elemens de statistique, Paris, 1802, p. XX.
# 4                                         Charles Joseph Minard
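The base R approach above relies on every paragraph containing a --- line: sapply(DF, "[[", 2) stops with a subscript error if one does not. The sketch below is one way to tolerate that case and runs of several blank lines. parse_quotes() is a hypothetical helper name, not part of any package, and the raw string is invented test input.

```r
# Sketch: parse quotations into a data frame, tolerating a missing source.
# parse_quotes() is a hypothetical helper, not an existing function.
parse_quotes <- function(raw) {
  # One element per paragraph; any run of 2+ newlines counts as a separator
  paragraphs <- trimws(strsplit(raw, "\n{2,}")[[1]])
  paragraphs <- paragraphs[nzchar(paragraphs)]   # drop empty chunks

  # Split each paragraph at the line that starts with ---
  parts <- strsplit(paragraphs, "\n---", fixed = TRUE)

  data.frame(
    text = trimws(vapply(parts, function(p) p[1], character(1))),
    # NA when the paragraph has no source line
    source = trimws(vapply(
      parts,
      function(p) if (length(p) >= 2) p[2] else NA_character_,
      character(1)
    )),
    stringsAsFactors = FALSE
  )
}

# Invented input: the middle quotation has no source, followed by extra blank lines
raw <- "A quote\nover two lines.\n--- Author A\n\nAn unattributed quote.\n\n\n\nLast quote.\n--- Author B"
quotes <- parse_quotes(raw)
print(quotes)
```

Using vapply() with character(1) instead of sapply() makes the function fail loudly if a piece is ever not a single string, rather than silently returning a list.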

