Is there something in R like the "here document" in bash? Answers

【Question title】: Is there in R something like the "here document" in bash?
【Posted】: 2015-10-24 13:37:56
【Question】:

My script contains the line

lines <- readLines("~/data")

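I would like to keep the contents of the file data (verbatim) in the script itself. Is there a "read_the_following_lines" function in R? Something like the "here document" in the bash shell?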

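【Comments】:

It depends on how your data is organized. Have a look at the text argument of read.table.

Tags: r

【Solution 1】:

Multi-line strings are as close as you will get. It is definitely not the same thing (because you have to take care of the quotes), but it works well for what you are trying to achieve, and not only with read.table: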

here_lines <- 'line 1
line 2
line 3
'

readLines(textConnection(here_lines))

## [1] "line 1" "line 2" "line 3" ""
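
The trailing "" in the output above appears to come from the line break just before the closing quote. A minimal sketch of one way to avoid it, assuming the text is split with strsplit() instead of going through a text connection (the name here_vec is only illustrative):

```r
here_lines <- 'line 1
line 2
line 3
'

# strsplit() drops the empty piece after a separator at the very end of the
# string, so the final line break does not produce an extra "" element.
here_vec <- strsplit(here_lines, "\n", fixed = TRUE)[[1]]
print(here_vec)
```

The result is a plain character vector, so it can be used wherever the result of readLines() would be used.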


here_csv <- 'thing,val
one,1
two,2
'

read.table(text=here_csv, sep=",", header=TRUE, stringsAsFactors=FALSE)

##   thing val
## 1   one   1
## 2   two   2


here_json <- '{
"a" : [ 1, 2, 3 ],
"b" : [ 4, 5, 6 ],
"c" : { "d" : { "e" : [7, 8, 9]}}
}
'

jsonlite::fromJSON(here_json)

## $a
## [1] 1 2 3
## 
## $b
## [1] 4 5 6
## 
## $c
## $c$d
## $c$d$e
## [1] 7 8 9

here_xml <- '<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>a
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
</CATALOG>
'

str(xml <- XML::xmlParse(here_xml))

## Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>

print(xml)

## <?xml version="1.0"?>
## <CATALOG>
##   <PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL><ZONE>4</ZONE>a
## <LIGHT>Mostly Shady</LIGHT><PRICE>$2.44</PRICE><AVAILABILITY>031599</AVAILABILITY></PLANT>
##   <PLANT>
##     <COMMON>Columbine</COMMON>
##     <BOTANICAL>Aquilegia canadensis</BOTANICAL>
##     <ZONE>3</ZONE>
##     <LIGHT>Mostly Shady</LIGHT>
##     <PRICE>$9.37</PRICE>
##     <AVAILABILITY>030699</AVAILABILITY>
##   </PLANT>
## </CATALOG>

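【Comments】:

Thanks. Unfortunately my lines are full of quotes and backslashes. When you read them from an external file, these are all escaped for you. If you want to keep everything in the same file, that seems to be impossible to achieve.
Is there a simple way to escape the quotes in a string?

【Solution 2】:

Page 90 of An Introduction to R states that it is possible to write R scripts like this (I quote the example from there, modified):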

chem <- scan()
2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20
5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70

print(chem)

Write these lines to a file and name it heredoc.R. If you then run that script non-interactively by typing in your terminal

Rscript heredoc.R

you will get the following output

Read 24 items
 [1]  2.90  3.10  3.40  3.40  3.70  3.70  2.80  2.50  2.40  2.40  2.70  2.20
[13]  5.28  3.37  3.03  3.03 28.95  3.77  3.40  2.20  3.50  3.60  3.70  3.70

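So you can see that the data provided in the file are stored in the variable chem. By default, the function scan(.) reads from the connection stdin(). In interactive mode (R called without specifying a script), stdin() refers to user input from the console, but when an input script is being read, the following lines of that script are read *). The empty line after the data is important because it marks the end of the data.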
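
Because the behaviour of stdin() depends on how the script is run, a variant that keeps the data in the script without reading from stdin() may be useful. The following is a minimal sketch, assuming the data are passed as a string through the text argument of scan(); quiet = TRUE only suppresses the "Read ... items" message, and the shortened data set is illustrative:

```r
# The numbers live in an ordinary multi-line string inside the script.
chem <- scan(text = "
2.90 3.10 3.40
5.28 3.37 3.03
", quiet = TRUE)

print(chem)
```

With this form no terminating empty line is needed, since the end of the string marks the end of the data.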

This also works with tabular data:

tab <- read.table(file=stdin(), header=TRUE)
A B C
1 1 0
2 1 0
3 2 9

summary(tab)

When using readLines(.), you have to specify the number of lines to read; the empty-line approach does not work here:

txt <- readLines(con=stdin(), n=5)                                             
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ultricies diam   
sed felis mattis, id commodo enim hendrerit. Suspendisse iaculis bibendum eros, 
ut mattis eros interdum sit amet. Pellentesque condimentum eleifend blandit. Ut 
commodo ligula quis varius faucibus. Aliquam accumsan tortor velit, et varius   
sapien tristique ut. Sed accumsan, tellus non iaculis luctus, neque nunc        

print(txt)

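You can overcome this limitation by reading one line at a time until a line is empty or equal to some other predefined string. Note, however, that you may run out of memory if you read large (>100MB) files this way, because every time a string is appended to the data read so far, all of the data is copied to another place in memory. See the chapter "Growing objects" in The R Inferno: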

txt <- c()
repeat{
    x <- readLines(con=stdin(), n=1)
    # stop at the end of the input (zero-length result) or at the EOF marker
    if(length(x) == 0 || x == "") break # you can use any EOF string you want here
    txt <- c(txt, x)
}
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ultricies diam
sed felis mattis, id commodo enim hendrerit. Suspendisse iaculis bibendum eros,
ut mattis eros interdum sit amet. Pellentesque condimentum eleifend blandit. Ut
commodo ligula quis varius faucibus. Aliquam accumsan tortor velit, et varius
sapien tristique ut. Sed accumsan, tellus non iaculis luctus, neque nunc

print(txt)
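
The memory concern mentioned above can be reduced by not growing the vector one element at a time. The following is a sketch under stated assumptions, not part of the original answer: read_until is a hypothetical helper that stores lines in a preallocated buffer and doubles the buffer when it is full, and the demonstration reads from an in-memory text connection instead of stdin() so that it can be run anywhere.

```r
# Hypothetical helper: read lines from `con` until the end of the input
# or until a line equal to `eof` is seen.
read_until <- function(con, eof = "") {
  buf <- character(16L)  # preallocated buffer
  n <- 0L                # number of lines stored so far
  repeat {
    x <- readLines(con, n = 1L)
    if (length(x) == 0L || identical(x, eof)) break
    n <- n + 1L
    if (n > length(buf)) length(buf) <- 2L * length(buf)  # double the capacity
    buf[n] <- x
  }
  buf[seq_len(n)]  # keep only the filled part of the buffer
}

# Demonstration on an in-memory connection (assumption: stands in for stdin()).
con <- textConnection(c("first line", "second line", "", "not read"))
txt <- read_until(con)
close(con)
print(txt)
```

Doubling the buffer means the data is copied only a logarithmic number of times instead of once per line.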

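*) If you want to read data from standard input in an R script, for example because you want to create a reusable script that can be called with any input data (Rscript reusablescript.R < input.txt or some-data-generating-command | Rscript reusablescript.R), do not use stdin(); use file("stdin") instead.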
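
A minimal sketch of such a reusable script, assuming a hypothetical helper summarise_input that takes any connection. In the real script the connection would be file("stdin"); that call is commented out here and replaced by an in-memory connection so that the sketch does not wait for input.

```r
# Hypothetical helper: count the lines and characters read from a connection.
summarise_input <- function(con) {
  lines <- readLines(con)
  c(lines = length(lines), chars = sum(nchar(lines)))
}

# In reusablescript.R the call would be:
# print(summarise_input(file("stdin")))

# Demonstration with an in-memory connection instead of standard input.
demo_con <- textConnection(c("alpha", "beta"))
result <- summarise_input(demo_con)
close(demo_con)
print(result)
```

Keeping the connection as a parameter makes the same function usable with files, pipes and test data.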

【Comments】:

【Solution 3】:

A way to handle multi-line strings without having to worry about quotes (only about backticks):

as.character(quote(`
all of the crazy " ' ) characters, except 
backtick and bare backslashes that aren't 
printable, e.g. \n works but a \ and c with no space between them would fail`))

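【Comments】:

【Solution 4】:

As of R v4.0.0 there is a new syntax for raw strings, as stated in changelogs, which to a large extent allows heredoc-style documents to be created.

In addition, from help(Quotes):

The delimiter pairs [] and {} can also be used, and R can be used instead of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote.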
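
The delimiter variants described in the quoted help text can be sketched as follows. This assumes R 4.0.0 or later; the variable names and string contents are only illustrative.

```r
# Parentheses as delimiters: quotes and backslashes need no escaping.
s1 <- r"(C:\path with "quotes")"

# Square brackets as delimiters, with an upper-case R prefix.
s2 <- R"[parentheses (like these) are fine]"

# Curly braces as delimiters.
s3 <- r"{a bare \ backslash}"

# Dashes allow the text itself to contain )" without ending the string.
s4 <- r"---(contains )" inside)---"

writeLines(c(s1, s2, s3, s4))
```

The number of dashes only has to be large enough that the closing sequence does not occur inside the text.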

For example, one can use (on a system with a BASH shell):

file_raw_string <-
r"(#!/bin/bash
echo $@
for word in $@;
do
  echo "This is the word: '${word}'."
done
exit 0
)"

writeLines(file_raw_string, "print_words.sh")

system("bash print_words.sh Word/1 w@rd2 LongWord composite-word")

or even another R script:

file_raw_string <- r"(
x <- lapply(mtcars[,1:4], mean)
cat(
  paste(
    "Mean for column", names(x), "is", format(x,digits = 2),
    collapse = "\n"
  )
)
cat("\n")
cat(r"{ - This is a raw string where \n, "", '', /, \ are allowed.}")
)"

writeLines(file_raw_string, "print_means.R")

source("print_means.R")

#> Mean for column mpg is 20
#> Mean for column cyl is 6.2
#> Mean for column disp is 231
#> Mean for column hp is 147
#>  - This is a raw string where \n, "", '', /, \ are allowed.

Created on 2021-08-01 by the reprex package (v2.0.0)

【Comments】: