【Question title】: How to get the comments in an HTML page while scraping?
【Posted】: 2018-07-28 16:27:21
【Question】:

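Here is the problem. I am trying to scrape the birthday from a Facebook page. When I view the page source in the browser, the birthday appears as an HTML comment inside a div with class="hidden_elem".

But when I look at the source returned by my GET request (using Selenium, Scrapy, or requests), I get only a div with class="hidden_elem", and the comment is nowhere to be seen, let alone available to parse.

So how can I get this text? If possible, please explain how to get the birthday.

The Facebook page probably has some JavaScript that causes this by design. How can I work around it?

This is the URL I am trying to get the birthday from: https://www.facebook.com/profile.php?id=100004456147835&sk=about

In the browser's page source it looks like this: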

<div class="hidden_elem"><code id="u_0_2g"><!-- <ul class="uiList _54nz _4kg _4kt" data-pnref="about"><li><div class="_5aj7"><div class="_4bl9"><div class="_54n- _2pi3"><div id="u_0_2e"></div></div></div><div class="_4bl7"><div class="_4ms4" id="u_0_2a"><div class="clearfix _ikh _5c0g" data-pnref="overview" id="u_0_2f"><div class="_4bl7"><ul class="uiList _1pi3 _4kg _6-h _703 _4ks"><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_344683"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No workplaces to show</span></div></div></div></div></li><li id="u_0_2b"><div class="clearfix _5y02" data-overviewsection="education" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://scontent.fblr6-1.fna.fbcdn.net/v/t1.0-1/c9.0.32.32/p32x32/580846_10149999285985791_1565762244_n.png?oh=d4ccc6a667e53f20db9cf60c0742f989&amp;oe=5B1420C5" alt="" aria-label="Cambridge Institute of technolagy" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Studies at <a class="profileLink" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1">Cambridge Institute of technolagy</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg">Past: <a class="profileLink" href="https://www.facebook.com/deekshaintegrated/" data-hovercard="/ajax/hovercard/page.php?id=176180289071224" data-hovercard-prefer-more-content-show="1">Deeksha Integrated</a> and 
<a class="profileLink" href="https://www.facebook.com/pages/chethana-vidya-mandiratumkur/378826618888908" data-hovercard="/ajax/hovercard/page.php?id=378826618888908" data-hovercard-prefer-more-content-show="1">chethana vidya mandira,tumkur</a></div></div></div></div></div></div></div></li><li id="u_0_2c"><div class="clearfix _5y02" data-overviewsection="places" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://external.fblr6-1.fna.fbcdn.net/safe_image.php?d=AQCKH3kcP1-A2NPe&amp;w=32&amp;h=32&amp;url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F8%2F80%2FBangaloreMontage.png&amp;cfs=1&amp;fallback=hub_city&amp;f&amp;_nc_hash=AQDbJ1ytdhSz3E8E" alt="" aria-label="Bangalore, India" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Lives in <a class="profileLink" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1">Bangalore, India</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg"><span id="u_0_2d">From <span class="fwb"><a class="profileLink" href="https://www.facebook.com/pages/Tumkur/106525352717093" data-hovercard="/ajax/hovercard/page.php?id=106525352717093" data-hovercard-prefer-more-content-show="1">Tumkur</a></span></span></div></div></div></div></div></div></div></li><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_585866"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No relationship info to 
show</span></div></div></div></div></li></ul></div><div class="_4bl9 _zu9"><ul class="uiList _5yql _4kg" data-overviewsection="contact_basic" role="button" tabindex="0"><li class="_4tnv _2pif"><div class="clearfix _ikh"><div class="_4bl7"><div class="_pvf _5pmc"><i class="img sp_yw06AF9sktb sx_e0cf75"></i></div></div><div class="_4bl9 _2pis _2dbl"><span class="_c24 _2ieq"><div><span class="accessible_elem">Birthday</span></div><div>April 28, 1998</div></span></div></div></li></ul></div></div></div></div></div></li></ul> --></code></div>

When I fetch the page source from my script, only <div class="hidden_elem"> </div> comes back.
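One way to confirm that difference is to count the comment nodes inside `hidden_elem` divs in each copy of the source. The following is a minimal sketch using only the standard library's `html.parser`; the class name `HiddenElemComments`, the helper `hidden_comments`, and both sample strings are hypothetical stand-ins, not the real Facebook responses.

```python
from html.parser import HTMLParser


class HiddenElemComments(HTMLParser):
    """Collects comments located inside <div class="... hidden_elem ...">."""

    def __init__(self):
        super().__init__()
        self.div_stack = []  # one bool per open <div>: is it a hidden_elem div?
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append("hidden_elem" in classes)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_comment(self, data):
        # Keep the comment only if some enclosing div is a hidden_elem div.
        if any(self.div_stack):
            self.comments.append(data)


def hidden_comments(page_source):
    """Return the bodies of all comments found inside hidden_elem divs."""
    parser = HiddenElemComments()
    parser.feed(page_source)
    parser.close()
    return parser.comments


# Hypothetical samples: one shaped like the browser source, one like the script result.
browser_source = (
    '<div class="hidden_elem"><code id="u_0_2g">'
    "<!-- <div>April 28, 1998</div> --></code></div>"
)
script_source = '<div class="hidden_elem"> </div>'

print(len(hidden_comments(browser_source)))  # 1
print(len(hidden_comments(script_source)))   # 0
```

If the count is zero for the source fetched by the script, the comment was never in that response, so no parser setting will recover it; the request itself (login state, headers, or scrolling) would have to change.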

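【Comments on the question】:

  • Without your sample code and/or the URL, it is hard to give a definite answer. Please see How to create a Minimal, Complete, and Verifiable example.
  • I have added the URL as well.
  • Scraping Facebook is against the ToS; you are likely to be challenged and may even end up in Facebook Jail. Use the Facebook API instead.
  • I know. I have just started learning scraping and wanted to try it out. I am only a student and am not trying to build any product.
  • Well, of course, you are welcome to ignore other people's wishes.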

Tags: python html selenium web-scraping scrapy


【Solution 1】:

With BeautifulSoup you can do this.

Try this:

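from bs4 import BeautifulSoup, Comment

# `html` holds the page source as a string
soup = BeautifulSoup(html, 'lxml')
# Comment is a NavigableString subclass, so filter the text nodes by type
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    print(comment)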
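Once a comment has been extracted, its body is itself HTML and has to be parsed a second time to reach the birthday. The following is a minimal sketch of that two-pass idea, written with the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The names `CommentCollector`, `TextCollector`, and `find_birthday` are hypothetical, and `sample` is a trimmed-down string modeled on the markup in the question. It assumes the date is the first text node after the "Birthday" label, as in that markup.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the body of every <!-- ... --> comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TextCollector(HTMLParser):
    """Collects non-empty text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def find_birthday(page_source):
    """Return the text node that follows the 'Birthday' label, or None."""
    collector = CommentCollector()
    collector.feed(page_source)
    collector.close()
    for comment in collector.comments:
        # The comment body is HTML in its own right, so parse it again.
        inner = TextCollector()
        inner.feed(comment)
        inner.close()
        for i, text in enumerate(inner.texts[:-1]):
            if text == "Birthday":
                return inner.texts[i + 1]
    return None


# Trimmed-down sample modeled on the markup shown in the question.
sample = (
    '<div class="hidden_elem"><code id="u_0_2g"><!-- '
    '<ul><li><span class="_c24 _2ieq">'
    '<div><span class="accessible_elem">Birthday</span></div>'
    "<div>April 28, 1998</div></span></li></ul>"
    " --></code></div>"
)

print(find_birthday(sample))  # April 28, 1998
```

With BeautifulSoup the same second pass would amount to building a new soup from each extracted comment string and searching it for the "Birthday" label.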

【Comments】:

    【Solution 2】:

    You need to scroll down the page:

    String s = "window.scrollBy(0,document.body.scrollHeight || document.documentElement.scrollHeight)";
    ScriptResult sr = page.executeJavaScript(s);
    LOG.info("Result= " + sr.getJavaScriptResult());
    

    After that, you will be able to get the list of "hidden_elem" objects:

    String xpathHiddenElem = "//div[contains(@class, 'hidden_elem')]";
    List<Object> responseHiddenElem = page.getByXPath(xpathHiddenElem);
    LOG.info("responseHiddenElem: {}", responseHiddenElem);
    if (responseHiddenElem != null && responseHiddenElem.size() > 0) {
        for (Object element : responseHiddenElem) {
            HtmlDivision elementCasted = (HtmlDivision) element;
            LOG.info("elementContent: {}", elementCasted.getTextContent());
            LOG.info("elementContent: {}", elementCasted.asText());
            LOG.info("elementContent: {}", elementCasted.getTagName());
            LOG.info("elementContent: {}", elementCasted.getIndex());
        }
    }
    

    【Comments】:
