Cleaning HTML code in R: how to clean this list? Answers

【Question title】: Cleaning HTML code in R: how to clean this list?
【Posted】: 2017-08-08 09:32:51
【Question】:

I know this question has been asked here many times, but after reading a bunch of threads I am still stuck on this :(. I have a list of HTML nodes like this:

<a href="http://bit.d o/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://bit.d o/bnRinN9</a>

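I just want to strip out all the markup. Unfortunately I am a newbie, and the only thing that comes to mind is the Cthulhu way (regex, argh!). How can I do this?

*I put a space between the "d" and the "o" in the domain name because SO does not allow posting that link.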

【Comments】:

What have you tried?
I tried cleanlinks <- gsub('<.*?>', ' ', short_links), but that wiped out everything.

Tags: r regex gsub

【Solution 1】:

This uses the data downloaded from the link in "Why R can't scrape these links?".

library(rvest)
library(stringr)

# read the saved HTML page and collapse it into one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")

# the links are within a table, within spans. There isn't much structure
# and there are no identifiers, so it needs a little hacking to get the right elements.
# There probably are smarter CSS selectors that could avoid the hacks
spans <- read_html(text) %>% html_nodes(css = "table tbody tr td span")

# extract all the short links -- but remove the links to the edit pages
# note: these links have a trailing dash, so they point to the statistics,
# not to the content
short_links <- spans %>% html_nodes("a") %>% html_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]

# the real URLs are in the HTML text, prefixed with http
span_text  <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]

# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"

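【Comments】:

【Solution 2】:

The rvest package contains many simple functions for scraping and processing HTML. It depends on the xml2 package. In general, you can do the scraping and the filtering in one step.

It is not clear whether you want to extract the href value or the HTML text, which are the same in your example. This code extracts the href values by finding the a nodes and reading their href attribute. Alternatively, you can use html_text to get the link's display text.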

library(rvest)
links <- list('
<a href="http://anydomain.com/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://anydomain.com/bnRinN9</a>
<a href="domain.com/page">
')

# make one string
text <- paste0(links, collapse = "\n")
hrefs <- read_html(text) %>% html_nodes("a") %>% html_attr("href")
hrefs

## [1] "http://anydomain.com/bnRinN9" "domain.com/page"
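For the display-text alternative mentioned above, a minimal sketch could look like the following. The sample markup and the variable names (html, anchors, labels) are made up for illustration and are not taken from the asker's data.

```r
library(rvest)

# Hypothetical sample markup, modeled on the example in the question
html <- '<a href="http://anydomain.com/bnRinN9" target="_blank">http://anydomain.com/bnRinN9</a>
<a href="domain.com/page">Some page</a>'

# Parse the string and select every <a> element
anchors <- html_nodes(read_html(html), "a")

# The href attribute of each link
hrefs <- html_attr(anchors, "href")

# The visible text of each link
labels <- html_text(anchors)
```

Here hrefs and labels are ordinary character vectors, so they can be filtered or pasted like any other strings.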

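【Comments】:

Hi @epi99, thanks for your answer. Since the href attribute and the text are the same in this case, they are equivalent for me, but suppose I only need to scrape the text. Unfortunately I cannot use the method you posted, because all my nodes are in a list object, so I need a different one. I also don't understand why the regex example I posted doesn't work.
@massimo, 1) if all your links are in a list object, something like paste(list_of_links, collapse = "\n") should give you one string containing all the links. 2) A regex like gsub("<[^>]+>", "", short_links) should work. I think the .* in your regex consumes everything, but I'm not sure. The [^>] matches everything up to (but not including) >, then the > is matched, and then it repeats. 3) Providing reproducible test data (in this case, the list) will get you faster answers.
Unfortunately, the solutions you suggested do not work. With the paste function I get a series of strange [1] "<pointer: 0x00000000119e3f00>\n<pointer: 0x00000000119e5110>\n<pointer: 0x00000000119e7520>\n<pointer: 0x00000000119ea830>\n<pointer: 0x00000000119eaab0>, and with the regex I get a list of blank " " strings. If you want to try with my code and data, you can take a look here, thanks!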
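The "<pointer: ...>" output suggests, though the thread does not confirm it, that the list holds parsed xml2 node objects rather than character strings. If so, paste() and gsub() would be working on the printed form of internal pointers instead of on the markup. A minimal sketch under that assumption follows; node_list is a hypothetical stand-in for the asker's list, and the URLs are made up.

```r
library(rvest)

# Hypothetical stand-in for the asker's data: a plain list of parsed <a> nodes
html <- '<a href="http://anydomain.com/a1" target="_blank">http://anydomain.com/a1</a>
<a href="http://anydomain.com/b2" target="_blank">http://anydomain.com/b2</a>'
nodes <- html_nodes(read_html(html), "a")
node_list <- lapply(seq_along(nodes), function(i) nodes[[i]])

# Option 1: ask each node for its text directly; no regex is needed
link_text <- vapply(node_list, html_text, character(1))

# Option 2: serialize each node back to markup, then strip the tags
markup <- vapply(node_list, function(n) as.character(n), character(1))
stripped <- trimws(gsub("<[^>]+>", "", markup))
```

Option 1 is usually the simpler choice, because it leaves the tag handling to the HTML parser. Option 2 shows that the tag-stripping regex from the comments does its job once it is given real strings.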