Using str_match_all to match the beginning and end of strings in R

【Question title】: Using str_match_all to match beginning and end of characters in R
【Posted】: 2019-05-08 19:00:31
【Question description】:

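Dear Stack Overflow community,

I am trying to use stringr to extract unique digital identifiers from a website. The page contains several unique DOIs, and each DOI is immediately followed by the word "Cite".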

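[1] I get the information from the website:

pg <- read_html("https://search.datacite.org/works?query=Movebank&resource-type-id=dataset")

[2] I am trying to get the 26 unique strings that begin with "doi" from the website.

[3] I intend to use str_match_all: the beginning must match "https://doi.org/", followed by some characters in between ("*"), and the end must match the word "Cite".

str_match_all(html_text(html_nodes(pg, "body")), pattern = "^https://doi.org/*Cite$")

[4] An example of what one of the DOIs looks like:

https://doi.org/10.5441/001/1.41076dq1/6Cite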

Any help is greatly appreciated!

Regards,

Diego

【Question discussion】:

Tags: regex string-matching rvest stringr

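【Solution 1】:

Using code similar to hrbrmstr's in the answer linked here, you can easily get all the URLs you want, by reading them from the href attributes of the links rather than matching the page text with a regex. https://stackoverflow.com/a/46674097/10710995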

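library(rvest)

# pg is the page read in the question; select every <a> whose href contains "doi.org"
fils <- html_nodes(pg, xpath=".//a[contains(@href, 'doi.org')]")

# Collect the href attribute of each matched link into a data frame
df <- data.frame(link = html_attr(fils, "href"))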

 df
                                          link
1  https://doi.org/10.25504/fairsharing.httzv2
2     https://doi.org/10.5441/001/1.41076dq1/6
3     https://doi.org/10.5441/001/1.q986rc29/3
4     https://doi.org/10.5441/001/1.q986rc29/4
5       https://doi.org/10.5441/001/1.25551gr6
6     https://doi.org/10.5441/001/1.25551gr6/1
7     https://doi.org/10.5441/001/1.25551gr6/2
8     https://doi.org/10.5441/001/1.q8b02dc5/4
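
If you would still rather work on the page text with str_match_all, here is a minimal sketch of why the pattern in the question finds nothing and one way it might be adjusted. The `txt` string is a made-up stand-in for `html_text(html_nodes(pg, "body"))`, built from two of the DOIs above, and it assumes "Cite" follows each DOI with no whitespace in between, as in the question's example. The real page text may differ, so treat the pattern as a starting point.

```r
library(stringr)

# Hypothetical sample text (not the real page): two DOIs, each followed directly by "Cite"
txt <- paste0(
  "Dataset A https://doi.org/10.5441/001/1.41076dq1/6Cite ",
  "Dataset B https://doi.org/10.5441/001/1.25551gr6Cite"
)

# Pattern from the question: ^ and $ anchor the match to the whole string,
# and "/*" means "zero or more slashes", not "any characters"
original <- str_match_all(txt, "^https://doi.org/*Cite$")[[1]]

# Adjusted pattern: no anchors, an escaped dot, a lazy run of non-whitespace
# characters, and a lookahead so that "Cite" itself is left out of the match
dois <- str_match_all(txt, "https://doi\\.org/\\S+?(?=Cite)")[[1]][, 1]

print(nrow(original))
print(dois)
```

On this sample the original pattern should give a matrix with zero rows, while the adjusted one should return the two DOI URLs without the trailing "Cite". Pulling the href attributes as above is likely to be more robust, since it does not depend on how the page text happens to be laid out.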

【Discussion】: