Scan bibtexkeys in Rmarkdown documents: answers

[Question title]: Scan bibtexkeys in Rmarkdown documents
[Posted]: 2020-10-20 10:49:51
[Question]:

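I like how simple Rmarkdown makes it to produce documents, and I maintain my own library in a Bibtex (*.bib) file. I cite in my documents following these instructions (a bibtexkey is prefixed with the "@" symbol).

My question is: is there a way to scan an Rmarkdown document (*.Rmd) and extract the list of bibtexkeys cited in it? That would make it easy to generate a subset of my library to attach to a project, instead of all of the ca. 6000 references accumulated in my library.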

[Comments on the question]:

Tags: r r-markdown bibtex bibliography

[Solution 1]:

You can parse your .Rmd document by searching for a given string pattern (here, @).

Example:

Create a sample file:

Rmd_txt  <- "Lorem ipsum dolor sit amet [@bibkey_a], consectetur adipisici elit [@bibkey_b], sed eiusmod tempor incidunt ut labore et dolore magna aliqua [@bibkey_c;@bibkey_d]."
writeLines(Rmd_txt, "rmdfile.Rmd")

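Read the file as a single string:

Rmd <- readChar("rmdfile.Rmd", nchars = file.size("rmdfile.Rmd"))

Use a regular expression to find every substring that starts with [@ and ends with ]: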

pattern <- "\\[@(.*?)\\]"
m <- regmatches(Rmd,gregexpr(pattern,Rmd))[[1]]
m
[1] "[@bibkey_a]"           "[@bibkey_b]"           "[@bibkey_c;@bibkey_d]"

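Finally, split and clean the strings as needed:

# Split grouped citations such as "[@bibkey_c;@bibkey_d]" at the semicolon
res <- unlist(strsplit(m, ";"))

# Remove the square brackets
res <- gsub("\\[|\\]", "", res)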

res
[1] "@bibkey_a" "@bibkey_b" "@bibkey_c" "@bibkey_d"

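[Comments]:

Good point. The only problem is that keys can also be cited directly, e.g. "According to @bibkey_a blabla...", that is, outside square brackets...
I also think the opposite approach is worth considering: match a vector containing all bibtexkeys against the Rmd file and see which ones are "in use".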
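The first comment can be addressed with a pattern that targets the key itself rather than the surrounding brackets. The following is a minimal base R sketch under simplifying assumptions: `extract_citekeys()` is a hypothetical helper, and its pattern only approximates Pandoc's citation-key rules (letters, digits and `_`, with `:`, `.` or `-` allowed between them). The lookbehind skips e-mail addresses, but other uses of `@` in a document (for example roxygen tags inside code chunks) would still be picked up.

```r
# Hypothetical helper: extract citation keys that appear inside or outside
# square brackets. The pattern is a simplified approximation of Pandoc's rules.
extract_citekeys <- function(text) {
  # "@" must not follow a letter or digit (skips e-mail addresses);
  # internal ":", "." or "-" are kept only when followed by more key characters,
  # so sentence punctuation after a key is not captured.
  pattern <- "(?<![[:alnum:]])@[[:alnum:]_]+(?:[:.-][[:alnum:]_]+)*"
  hits <- unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
  unique(sub("^@", "", hits))
}

# Made-up sample lines, for illustration only
rmd_lines <- c(
  "According to @Noname2000, the world is round [@Ladybug1999; @Ladybug2009].",
  "Contact me at someone@example.org or see [-@bibkey_a, p. 3]."
)

cited <- extract_citekeys(rmd_lines)
print(cited)
```

Because the whole key is extracted, the result can be compared directly with the keys of a .bib file using `%in%` or `setdiff()`.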

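[Solution 2]:

After exploring several alternatives, I found the function str_extract() from the package stringr. Here I assume that you have a bibtex library that includes all of the cited references (and usually many more). Because bibtexkey styles differ, I have combined Oto Kaláb's example with my own.

First, the Rmd document: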

rmd_text <- c("# Introduction",
        "",
        "Lorem ipsum dolor sit amet [@bibkey_a], consectetur adipisici elit [@bibkey_b],",
        "sed eiusmod tempor incidunt ut labore et dolore magna aliqua [@bibkey_c;@bibkey_d].",
        "",
        "According to @Noname2000, the world is round [@Ladybug1999;@Ladybug2009].",
        "This knowledge got lost [@Ladybug2009a].")
writeLines(rmd_text, "document.Rmd")

The next code block is commented. At the end we get a vector with all citations, which can be reduced to distinct keys with unique().

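library(stringr)

# Bibtexkeys from bib file
keys <- c("bibkey_a", "bibkey_b", "bibkey_c", "bibkey_d",
        "Noname2000", "Ladybug1999", "Ladybug2009", "Ladybug2009a")
# Prefix "@" and append a word boundary so that, e.g., "@Ladybug2009"
# does not also match inside "@Ladybug2009a"
keys <- paste0("@", keys, "\\b")

# Read document
document <- readLines("document.Rmd")

# Scan document line by line; str_extract() returns, for each key,
# its first match in the line or NA
cited_refs <- list()
for(i in seq_along(document)) {
    cited_refs[[i]] <- str_extract(document[i], keys)
}

# Final output: drop the NAs (keys not found in a given line)
cited_refs <- unlist(cited_refs)
cited_refs <- cited_refs[!is.na(cited_refs)]

# Number of lines in which each key is cited
summary(as.factor(cited_refs))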

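The resulting vector can then be aggregated to see how often each key occurs in the text (which I think is also useful for spotting rarely cited references). I am also considering extracting the line numbers in the output.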
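The line-number idea can be sketched in base R by recording, for every match, the line it came from. This is only an illustration: the sample `document` and `keys` vectors are made up, and the pattern assumes keys consist of letters, digits and `_` only. Extracting the whole key and then filtering with `%in%` also keeps `Ladybug2009` from being counted when only `Ladybug2009a` is cited.

```r
# Made-up document lines and library keys, for illustration only
document <- c(
  "# Introduction",
  "",
  "Lorem ipsum [@bibkey_a], dolor [@bibkey_b].",
  "According to @Noname2000, the world is round [@bibkey_a; @Ladybug2009a]."
)
keys <- c("bibkey_a", "bibkey_b", "Noname2000", "Ladybug2009", "Ladybug2009a")

# Extract every "@key" per line (assumes keys of letters, digits and "_")
matches <- regmatches(document, gregexpr("@[[:alnum:]_]+", document))

# One row per citation: the line number and the key without its "@"
cited <- data.frame(
  line = rep(seq_along(document), lengths(matches)),
  key  = sub("^@", "", unlist(matches)),
  stringsAsFactors = FALSE
)

# Keep only keys that exist in the library
cited <- cited[cited$key %in% keys, ]

# Frequency of each cited key
freq <- table(cited$key)

print(cited)
print(freq)
```

Keeping the result as a data frame means `split(cited$line, cited$key)` gives the lines on which each reference is cited.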

[Comments]:

[Solution 3]:

A simpler solution is to use the function bbt_detect_citations() from the package rbbt.

See also this discussion
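Once the cited keys are known, by whichever method, the original goal of a project-specific subset of the library can be approximated in base R. The sketch below is an assumption-laden illustration rather than a full BibTeX parser: `subset_bib()` is a hypothetical helper, and it assumes each entry starts on its own line in the form `@type{key,` and runs until the next such line. The sample entries are made up.

```r
# Hypothetical helper: keep only the .bib entries whose key is in cited_keys.
# Assumes each entry starts on its own line as "@type{key," (no full parsing).
subset_bib <- function(bib_lines, cited_keys) {
  entry_start <- "^[[:space:]]*@[[:alpha:]]+[[:space:]]*\\{"
  starts <- grep(entry_start, bib_lines)
  if (length(starts) == 0) return(character(0))
  # Each entry ends on the line before the next entry starts
  ends <- c(starts[-1] - 1, length(bib_lines))
  # The key is the text between "{" and the first comma
  entry_keys <- sub(paste0(entry_start, "[[:space:]]*([^,[:space:]]+).*$"),
                    "\\1", bib_lines[starts])
  keep <- which(entry_keys %in% cited_keys)
  as.character(unlist(lapply(keep, function(i) bib_lines[starts[i]:ends[i]])))
}

# Made-up library, for illustration only
bib_lines <- c(
  "@article{Ladybug1999,",
  "  title = {A},",
  "}",
  "",
  "@book{Noname2000,",
  "  title = {B},",
  "}",
  "",
  "@article{Unused2001,",
  "  title = {C},",
  "}"
)

project_bib <- subset_bib(bib_lines, c("Noname2000", "Ladybug1999"))
writeLines(project_bib)
```

In practice `bib_lines` would come from `readLines()` on the full library, and the result would be written to a project file with `writeLines(project_bib, ...)`.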

[Comments]: