【Question title】:Web Scraping in R from Google Images
【Posted】:2016-12-21 07:51:49
【Question】:

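I use the "rvest" package for web scraping for various purposes. Now I need to use it to get the source of an image object (png) from Google Images. I tried the solution at this link: Web scraping of image. It does exactly what I want. So I came up with the code below, but my html_nodes call returns an empty object.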

library("rvest")
page <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")
node <- html_nodes(page,xpath='//*[@id="rg_s"]/div[1]/a/img')
src <-  html_attr(node,"src")

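I also tried a CSS selector and the image name, as is done in the link I gave above. My node object comes back empty either way. I should also point out that I want to scrape the source of the first image on the page, which has the xpath I wrote above. Thank you in advance.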

【Question comments】:

    Tags: html r image xpath web-scraping


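    【Solution 1】:

    I think it is working fine; you just don't know the structure of the returned document well enough, i.e. there is probably no node that matches the xpath selector you wrote.

    For example, here I select all <img> nodes and print them: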

    library("rvest")
    page <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")
    node <- html_nodes(page,xpath = '//img')
    node
    

    Which yields:

    {xml_nodeset (21)}
     [1] <img style="padding-top:2px" src="/textinputassistant/tia.png" onclick="(function(){var text_input_assistant_js='/textinputassistant/11/tr_tia.js';var s = document.createElement('s ...
     [2] <img height="113" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRg92_01ZbpYpV_agaHP4M3GoRoaCsZW5Sym8eqcXG8M1iJ8Nag1SXufq8" width="150" alt="manitou ile ilgili görsel s ...
     [3] <img height="98" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSbJOecoEPbrJjZ-TjJMgMwlulXRMPLBWZX45vwUJNVXZk5MeY1chaZ07Y" width="143" alt="manitou ile ilgili görsel so ...
     [4] <img height="79" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcStpgymO--9B7R3O3OZJFrDsuOUuP94HwwNw-av9tUyjziG3sCl6M9s7G4" width="141" alt="manitou ile ilgili görsel so ...
     [5] <img height="95" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTkibMqBWEifcyw_d-vrNob6UqYP-hDFPoQG2pkzVsP5bgmbReFWqyHjWA" width="143" alt="manitou ile ilgili görsel so ...
     [6] <img height="91" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRhqrV1f--7QrQwovNBUHIpDFHe8Zwwad3UIvnwppv74GRIrsI1XYNPkFOg" width="150" alt="manitou ile ilgili görsel s ...
     [7] <img height="112" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS1gpUEBucliP4WK2_22K4wElI2lIrDs2PZT7sRCLXK1Yxjg7DoQ2BtyLat" width="142" alt="manitou ile ilgili görsel  ...
     [8] <img height="69" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSssiUhuZe_1YmQ9dwmYHdKoFXyQBj9IQPGX_LU8msjekOvRRHDG9FmoaD_" width="140" alt="manitou ile ilgili görsel s ...
     [9] <img height="113" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTCM9Mu6K63QpzNk20HFrHkybi--dw3JPu5JDd4LSEqz3UT5TBU5I0owLU" width="150" alt="manitou ile ilgili görsel s ...
    [10] <img height="95" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcS8sI3fBSJmjftqC9Rx2bhXh_xgP3-nS2WuD2as9U_87SLxggQvmo2awDk" width="143" alt="manitou ile ilgili görsel so ...
    [11] <img height="83" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT-gf45JbC4Q4lD3hioj_CP6imrO5RUWBeW6IuygNaN8LM1qydX56l5gFx4" width="148" alt="manitou ile ilgili görsel s ...
    [12] <img height="84" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS6tnxPJYeS48IoNAlN0D52U5TNjmq7Ta-GcPNifM4_k40Y2D8LDj5-e-Wz" width="150" alt="manitou ile ilgili görsel s ...
    [13] <img height="140" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTwmI9PxfLBT2dCPnR04I9pXmK8V9whAI2yEv4dX5qQq8G_JxHUAOwQB1mSTg" width="140" alt="manitou ile ilgili görse ...
    [14] <img height="71" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQNx2Pe1AZtT-0XQ44HSurWO6O2syXrXG6YPfggtZsTHaf6YXuQlcmMOu0" width="150" alt="manitou ile ilgili görsel so ...
    [15] <img height="130" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRGACLfeRm6U0xwSeYncSUDQtcd4noTewVF4aGnQcgz6TWYwwr917mjEtB6" width="113" alt="manitou ile ilgili görsel  ...
    [16] <img height="107" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ1RwAscQpzVXfquuAoPaLE9hFMuZSOpo6ckOzdpkTmg3KiswOIZIDTqrU" width="143" alt="manitou ile ilgili görsel s ...
    [17] <img height="98" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcTE5sLf71TxAYla6nlfLRgXwL1IC-gXzXQRq1ZcnB21c5NXmQklJyNeqEs" width="148" alt="manitou ile ilgili görsel so ...
    [18] <img height="91" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRHQjJ-Hc0Muy6Vjw5OlQZocflSCqR3oz0GBRu3Bs7_JCoNyjr5vjNP7KZ4" width="137" alt="manitou ile ilgili görsel s ...
    [19] <img height="68" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR8R_39V3bxWJUDdNhrsAS6YOYEg6U-QpaLEV0MQ5GBnVkeZa9lSB5MaGU" width="149" alt="manitou ile ilgili görsel so ...
    [20] <img height="99" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTIrnwcUbo9WYT-gyvrLb5g4JFEc27odkzzU6SwzxrxvrsajRMD1OroUaY" width="116" alt="manitou ile ilgili görsel so ...
    ...
    > 
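
One detail in the output above is worth noting: the alt text mentions "manitou", not "apple logo png". A likely reason is that everything after `#` in a URL is a fragment, which the browser handles on its own and does not send to the server, so the server only sees `q=manitou`. The following is a minimal base-R sketch of that split; the variable names are illustrative and the URL is a shortened version of the one in the question.

```r
# Shortened version of the URL from the question (some parameters dropped)
url <- "https://www.google.com.tr/search?q=manitou&tbm=isch#tbm=isch&q=apple+logo+png"

# Part that is sent to the server: everything before the first '#'
request_url <- sub("#.*$", "", url)

# Fragment: everything after the first '#', used only on the client side
fragment <- sub("^[^#]*#", "", url)

print(request_url)
print(fragment)
```

If the intended search is "apple logo png", moving that term into the `q=` parameter before any `#` is probably the fix, although the markup Google returns can still differ from what a browser shows.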
    

    Here is the first node:

    >node[[1]]
    {xml_node} <img style="padding-top:2px"
    src="/textinputassistant/tia.png" onclick="(function(){var
       text_input_assistant_js='/textinputassistant/11/tr_tia.js';var s =
      document.createElement('script');s.src =
      text_input_assistant_js;(document.getElementById('xjsc')||
      document.body).appendChild(s);})();" 
       alt="" height="23" width="27">
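
Building on the listing above, a rough sketch of getting from "all `<img>` nodes" to the first result thumbnail might look like this. The inline HTML is a hypothetical, much shortened stand-in for the page that `read_html()` returned (the `EXAMPLE1`/`EXAMPLE2` URLs are made up), and filtering on the `https://` prefix is an assumption based on the printed output, not a documented rule of the page.

```r
library(rvest)

# Hypothetical, shortened stand-in for the downloaded page;
# the real markup is longer and may change at any time.
html <- '<html><body>
  <img src="/textinputassistant/tia.png" alt="" height="23" width="27">
  <img src="https://encrypted-tbn2.gstatic.com/images?q=tbn:EXAMPLE1" alt="result 1">
  <img src="https://encrypted-tbn1.gstatic.com/images?q=tbn:EXAMPLE2" alt="result 2">
</body></html>'
page <- read_html(html)

# The selector from the question matches nothing in this document,
# so html_nodes() returns an empty node set rather than an error.
missing <- html_nodes(page, xpath = '//*[@id="rg_s"]/div[1]/a/img')
print(length(missing))

# A broad selector shows which <img> nodes are really present.
imgs <- html_nodes(page, xpath = "//img")
src <- html_attr(imgs, "src")

# Keep only absolute URLs, which drops the keyboard icon (tia.png),
# then take the first remaining entry as the first result image.
thumbs <- src[grepl("^https://", src)]
first_src <- thumbs[1]
print(first_src)
```

Checking `length()` of the node set before calling `html_attr()` is a cheap way to tell "selector matched nothing" apart from "attribute is missing".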
    

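    【Comments】:

    • Thank you, Mike. Grabbing all the images and then working through the elements one by one solved it. I know how to handle the objects, but my xpath was returning an empty object :) Thanks again.
    • Maybe you can help me with this one --> stackoverflow.com/questions/54884611/…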