【Question title】: Get text from href tag after specific class
【Posted】: 2018-09-25 05:22:47
【Question body】:

I am trying to scrape a web page:

library(RCurl)
webpage <- getURL("https://somewebpage.com")

webpage

<div class='CredibilityFacts'><span id='qZyoLu'><a class='answer_permalink'
action_mousedown='AnswerPermalinkClickthrough' href='/someurl/answer/my_id' 
id ='__w2_yeSWotR_link'>
<a class='another_class' action_mousedown='AnswerPermalinkClickthrough' 
href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>
<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' 
href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>

class(webpage)
[1] "character"

I am trying to extract every href value, but only when it is preceded by the class answer_permalink.

The expected output is

[1] "/someurl/answer/my_id"  "/another_url/answer/new_id"

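/ignore_url/answer/some_id should be ignored because it is preceded by another_class rather than by answer_permalink.

Right now I am thinking of a regex approach. I thought something like this could work as the pattern in stri_extract_all: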

class='answer_permalink'.*href='

But this does not give me what I want.

How can I achieve this? Also, apart from regex, is there a function in R that extracts elements by class, the way JavaScript does?
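
For the regex route, one likely reason the pattern above falls short is that `.*` is greedy and is free to run past the end of a tag, and that the pattern stops at `href='` without capturing the value. A minimal sketch with stringi, assuming the class attribute always appears before href inside the same tag and that single quotes are used throughout (both are assumptions taken from the sample markup, not guarantees about the real page):

```r
library(stringi)

# Sample input mirroring the question's markup (attributes span several lines).
webpage <- "<div class='CredibilityFacts'><span id='qZyoLu'><a class='answer_permalink'
action_mousedown='AnswerPermalinkClickthrough' href='/someurl/answer/my_id'
id ='__w2_yeSWotR_link'>
<a class='another_class' action_mousedown='AnswerPermalinkClickthrough'
href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>
<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough'
href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>"

# [^>]* cannot cross a '>' so the match stays inside one tag;
# the capture group holds the href value itself.
pattern <- "class='answer_permalink'[^>]*href='([^']*)'"

# stri_match_all_regex returns one matrix per input string:
# column 1 is the full match, column 2 is the first capture group.
hrefs <- stri_match_all_regex(webpage, pattern)[[1]][, 2]
hrefs
```

This breaks as soon as the attribute order or quoting changes, which is why an HTML parser is usually the safer choice.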

【Comments】:

  • You should be able to do this with the rvest package, using something like read_html(webpage) %>% html_nodes("answer_permalink") %>% html_attr("href")
  • @AndrewGustar That returns character(0) for me.
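
A possible explanation for the character(0) result: in CSS, a bare word such as `answer_permalink` selects elements by tag name, while a class selector needs a leading dot. The sketch below uses a trimmed inline copy of the markup; the closing `</a>` tags and link text are added here for illustration and are not in the original page.

```r
library(rvest)

# Trimmed sample of the question's markup (closing tags added for illustration).
webpage <- "<div class='CredibilityFacts'><span id='qZyoLu'>
<a class='answer_permalink' href='/someurl/answer/my_id'>a</a>
<a class='another_class' href='/ignore_url/answer/some_id'>b</a>
<a class='answer_permalink' href='/another_url/answer/new_id'>c</a>
</span></div>"

page <- read_html(webpage)

# Without the dot this looks for <answer_permalink> elements, so nothing matches.
no_dot <- page %>% html_nodes("answer_permalink") %>% html_attr("href")

# With the dot it is a class selector.
hrefs <- page %>% html_nodes(".answer_permalink") %>% html_attr("href")
hrefs
```

Unlike an exact attribute match, the `.answer_permalink` selector also matches elements that carry additional classes.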

Tags: r regex stringr stringi


【Solution 1】:

Instead of string parsing, you could use a package such as rvest or xml2:

library(xml2)
xml <- read_html(webpage)
l <- as_list(xml)[[1]][[1]][[1]][[1]]  #not sure why you need to go this deep.

l2 <- l[sapply(l, attr, ".class") == "answer_permalink"]
sapply(l2, attr, "href")
                       a                            a 
 "/someurl/answer/my_id" "/another_url/answer/new_id"

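【Comments】:

  • sapply(l, attr, ".class") gives me [[1]] NULL as output. Am I doing something wrong? length(l) is 1.
  • I don't know. I just read your webpage in as a string and ran the exact code above.
  • I had to go one level less deep: l <- as_list(xml)[[1]][[1]][[1]]. Not sure what changed there.
  • Yes, I'm not sure how robust this is. Normally Andrew's comment should be the right approach, but I couldn't get it to work either.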
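
Since the number of `[[1]]` levels seems to vary between setups, one way to avoid positional indexing altogether is to query the parsed document with XPath. A minimal sketch with xml2, using a trimmed inline sample (closing tags and link text added for illustration):

```r
library(xml2)

# Trimmed sample of the question's markup (closing tags added for illustration).
webpage <- "<div class='CredibilityFacts'><span id='qZyoLu'>
<a class='answer_permalink' href='/someurl/answer/my_id'>a</a>
<a class='another_class' href='/ignore_url/answer/some_id'>b</a>
<a class='answer_permalink' href='/another_url/answer/new_id'>c</a>
</span></div>"

doc <- read_html(webpage)

# Select <a> elements anywhere in the tree whose class is exactly answer_permalink.
nodes <- xml_find_all(doc, "//a[@class='answer_permalink']")

# Pull the href attribute from each matched node.
hrefs <- xml_attr(nodes, "href")
hrefs
```

Because the XPath starts with `//`, it does not depend on how deeply the links are nested.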
【Solution 2】:

With dplyr and rvest, we could do:

library(rvest)
library(dplyr)

"https://www.quora.com/profile/Ronak-Shah-96" %>% 
  read_html() %>% 
  html_nodes("[class='answer_permalink']") %>% 
  html_attr("href")
[1] "/How-can-we-adjust-in-engineering-if-we-are-not-in-IITs-or-NITs-How-can-we-enjoy-engineering-if-we-are-pursuing-it-from-a-local-private-college/answer/Ronak-Shah-96"
[2] "/Do-you-think-it-is-worth-it-to-change-my-career-path-For-the-past-2-years-I-was-pursuing-a-career-in-tax-advisory-in-a-BIG4-company-I-just-got-a-job-offer-that-will-allow-me-to-learn-coding-It-is-not-that-well-paid/answer/Ronak-Shah-96"
[3] "/Why-cant-India-opt-for-40-hours-work-a-week-for-all-professions-when-it-is-proved-and-working-well-in-terms-of-efficiency/answer/Ronak-Shah-96"
[4] "/Why-am-I-still-confused-and-thinking-about-my-career-after-working-more-than-one-year-in-software-engineering/answer/Ronak-Shah-96"
[5] "/Would-you-rather-be-a-jack-of-all-trades-or-the-master-of-one-trade/answer/Ronak-Shah-96"

【Comments】:

    【Solution 3】:
    require(XML)
    require(RCurl)
    
    doc <- getURL("https://www.quora.com/profile/Ronak-Shah-96" )
    html <- htmlTreeParse(doc, useInternalNodes = TRUE)
    nodes <- getNodeSet(html, "//a[@class='answer_permalink']")
    sapply(nodes, function(x) xmlAttrs(x)[["href"]])
    
    [1] "/Do-you-think-it-is-worth-it-to-change-my-career-path-For-the-past-2-years-I-was-pursuing-a-career-in-tax-advisory-in-a-BIG4-company-I-just-got-a-job-offer-that-will-allow-me-to-learn-coding-It-is-not-that-well-paid/answer/Ronak-Shah-96"
    [2] "/Why-cant-India-opt-for-40-hours-work-a-week-for-all-professions-when-it-is-proved-and-working-well-in-terms-of-efficiency/answer/Ronak-Shah-96"                                                                                             
    [3] "/Why-am-I-still-confused-and-thinking-about-my-career-after-working-more-than-one-year-in-software-engineering/answer/Ronak-Shah-96"                                                                                                         
    [4] "/Would-you-rather-be-a-jack-of-all-trades-or-the-master-of-one-trade/answer/Ronak-Shah-96"                                                                                                                                                   
    [5] "/Is-software-engineering-a-good-career-choice-I-know-it-pays-well-initially-but-if-you-look-at-the-managing-directors-of-most-companies-they-are-people-with-MBAs/answer/Ronak-Shah-96"                                                      
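
One caveat with `@class='answer_permalink'` is that it only matches when the class attribute is exactly that string. If a link carried several classes, a token-based test would be needed. The sketch below shows this with the XML package on an inline sample; the extra `highlighted` class is hypothetical and only there to illustrate the case.

```r
library(XML)

# Inline sample; the second class on the first link is a made-up example.
html_text <- "<div><a class='answer_permalink highlighted' href='/someurl/answer/my_id'>a</a>
<a class='another_class' href='/ignore_url/answer/some_id'>b</a>
<a class='answer_permalink' href='/another_url/answer/new_id'>c</a></div>"

# asText = TRUE tells htmlParse that the input is markup, not a file name or URL.
doc <- htmlParse(html_text, asText = TRUE)

# Pad the class list with spaces so the test matches whole class tokens only.
xp <- "//a[contains(concat(' ', normalize-space(@class), ' '), ' answer_permalink ')]"

# Apply xmlGetAttr(node, "href") to every matching node.
hrefs <- xpathSApply(doc, xp, xmlGetAttr, "href")
hrefs
```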
    

    【Comments】:
