How to scrape Google Images with unirest and cheerio?

【Question title】: How to scrape Google Images with unirest and cheerio?
【Posted】: 2023-01-29 18:35:00
【Question description】:

I am trying to scrape Google Images with unirest and cheerio, but I got stuck when I found that the parsing is not working correctly. Here is my code so far:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getData = async () => {
  let count = [], page_url = [];
  let url =
    "https://www.google.com/search?q=india&oq=india&tbm=isch&asearch=ichunk&async=_id:rg_s,_pms:s,_fmt:pc&sourceid=chrome&ie=UTF-8";
  const response = await unirest
    .get(url)
    .headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    })
    .proxy("proxy");

  const $ = cheerio.load(response.body);
  console.log(response.body); // HTML is returned successfully
  let title = [], link = [];
  $(".vbC6V").each((i, el) => {
    title[i] = $(el).find(".iKjWAf .mVDMnf").text(); // not parsing
    link[i] = $(el).find(".rg_l .rg_ic").attr("src"); // not parsing
  });
  console.log(title); // returned empty
  console.log(link); // returned empty
};

getData();


Tags: javascript web-scraping cheerio unirest

【Solution 1】:

Yes, I found that the parent class to use for parsing is rg_bx, not vbC6V. So the updated code would be:

$(".rg_bx").each((i, el) => {
  title[i] = $(el).find(".iKjWAf .mVDMnf").text();
  link[i] = $(el).find(".rg_l .rg_ic").attr("src");
});
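
A quick way to confirm which class names are actually present in the returned HTML, before writing selectors against them, is to count them directly. The following is only a minimal sketch: `sampleHtml` is a made-up stand-in for `response.body`, and `countClass` is a hypothetical helper, not part of unirest or cheerio.

```javascript
// Hypothetical stand-in for response.body; the real markup differs and changes over time.
const sampleHtml = `
<div class="rg_bx rg_di"><a class="rg_l"><img class="rg_ic" src="a.jpg"></a></div>
<div class="rg_bx rg_di"><a class="rg_l"><img class="rg_ic" src="b.jpg"></a></div>`;

// Count elements whose class attribute contains className as a whole token.
// Assumption: class attributes are double-quoted, which is enough for a quick check.
function countClass(html, className) {
  const classAttr = /class\s*=\s*"([^"]*)"/g;
  let count = 0;
  for (const match of html.matchAll(classAttr)) {
    if (match[1].split(/\s+/).includes(className)) count += 1;
  }
  return count;
}

for (const name of ["vbC6V", "rg_bx", "rg_ic"]) {
  console.log(name, countClass(sampleHtml, name));
}
```

A count of 0 for a class means a selector built on it cannot match anything in that response, which is the symptom described in the question.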


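【Solution 2】:

Selectors like ".rg_bx" and ".rg_l .rg_ic" are unstable and change often. I made a few small changes to your code (which I think make it more convenient to reuse) and suggest using more stable selectors: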

const $ = cheerio.load(response.body);
const results = Array.from($(".PNCib.MSM1fd")).map((el) => ({
  title: $(el).find(".VFACy").attr("title"),
  link: $(el).find(".VFACy").attr("href"),
}));

console.log(results);

Output:

[
   {
      "title":"India - Wikipedia",
      "link":"https://en.wikipedia.org/wiki/India"
   },
   {
      "title":"India | History, Map, Population, Economy, & Facts | Britannica",
      "link":"https://www.britannica.com/place/India"
   },
   {
      "title":"India - Know all about India including its History, Geography, Culture, etc",
      "link":"https://www.mapsofindia.com/india/"
   },
   {
      "title":"India | History, Map, Population, Economy, & Facts | Britannica",
      "link":"https://www.britannica.com/place/India"
   },
   ...and other results
]

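But even "more stable" selectors change from time to time, so you will always need to maintain your code. To make it more reliable, using regular expressions to extract inline JSON data is the way to go. Although the location of the inline JSON in the HTML can change, it does so less often, or not at all.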
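
The inline-JSON idea can be sketched roughly as follows. This is an illustration under stated assumptions, not Google's actual markup: the sample HTML, the `window.__DATA__` variable name, and the `extractInlineJson` helper are all hypothetical, and a real page would need its own pattern.

```javascript
// Hypothetical page that embeds its data as a JSON array inside a <script> tag.
const sampleHtml = `
<html><body>
<script>var unrelated = 1;</script>
<script>window.__DATA__ = [{"title":"India - Wikipedia","link":"https://en.wikipedia.org/wiki/India"},{"title":"India | Britannica","link":"https://www.britannica.com/place/India"}];</script>
</body></html>`;

// Pull the value assigned to variableName out of the HTML and parse it as JSON.
// Returns null when the assignment is missing or the captured text is not valid JSON.
function extractInlineJson(html, variableName) {
  // Escape regex metacharacters so the variable name is matched literally.
  const escaped = variableName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Lazily capture everything between "<name> =" and the ";" that precedes </script>.
  const pattern = new RegExp(escaped + "\\s*=\\s*([\\s\\S]*?);\\s*</script>");
  const match = html.match(pattern);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch (err) {
    return null;
  }
}

const results = extractInlineJson(sampleHtml, "window.__DATA__");
console.log(results);
```

Because the pattern targets the data assignment rather than CSS class names, it is unaffected when class names are renamed, but it still has to be updated if the way the data is embedded changes.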

You can read more about scraping Google Images with regular expressions in my blog post Web Scraping Google Images with Nodejs.
