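Make sure you send a user-agent header; otherwise the request will return empty output, because Google eventually blocks it. You can check yours by searching for "What is my user-agent".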
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
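To see what this header and the query parameters look like on an actual request without sending anything, here is a minimal sketch. It uses Ruby's standard-library Net::HTTP rather than HTTParty purely so that it runs without any gems; the variable names `uri` and `request` are illustrative and not part of the original answer.

```ruby
require 'net/http'
require 'uri'

# Same header hash as above.
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Build the search URL with an encoded query string.
uri = URI("https://www.google.com/search")
uri.query = URI.encode_www_form(q: "stackoverflow", num: "100")

# Create the request object only; nothing is sent over the network here.
request = Net::HTTP::Get.new(uri, headers)

puts uri                    # full URL including the query string
puts request["User-Agent"]  # header value that would be sent
```

Inspecting the request object this way is a quick check that the header is attached before blaming an empty result on the selectors.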
Code and example in the online IDE:
require 'nokogiri'
require 'httparty'
require 'json'
# Browser-like user-agent, so the request is less likely to be blocked
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Query-string parameters passed to the search URL
params = {
  q: "stackoverflow",
  num: "100"
}

response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

# Iterate over every .tF2Cxc result container; &. yields nil when a selector matches nothing
data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
  link = result.at_css(".yuRUbf a")&.attr("href")
  displayed_link = result.at_css(".tjvcx")&.text
  snippet = result.at_css(".VwiC3b")&.text
  # puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
  {
    title: title,
    link: link,
    displayed_link: displayed_link,
    snippet: snippet,
  }.compact # drop keys whose value is nil
end

puts JSON.pretty_generate(data)

# --------
=begin
[
  {
    "title": "Stack for Stack Overflow - Apps on Google Play",
    "link": "https://play.google.com/store/apps/details?id=me.tylerbwong.stack&hl=en_US&gl=US",
    "displayed_link": "https://play.google.com › store › apps › details",
    "snippet": "Stack is powered by Stack Overflow and other Stack Exchange sites. Search and filter through questions to find the exact answer you're looking for!"
  }
  ...
]
=end
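The scraping code above relies on `&.` and `.compact` so that a result with a missing element does not crash the loop or emit empty keys. A minimal sketch of that behaviour, using a plain `nil` as a stand-in for what `at_css` returns when a selector matches nothing (the `missing_node` name is hypothetical):

```ruby
require 'json'

# Stand-in for at_css(...) finding no element for this result.
missing_node = nil

title = "Stack for Stack Overflow - Apps on Google Play"
snippet = missing_node&.text # &. returns nil instead of raising NoMethodError

# compact removes the key whose value is nil, so only :title remains.
result = { title: title, snippet: snippet }.compact

puts JSON.pretty_generate(result)
```

This is why some entries in the output may lack a key such as "snippet" rather than showing it as null.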
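Alternatively, you can use the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference is that you don't have to figure out how to scrape specific parts of the page. All you need to do is iterate over structured JSON.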
require 'google_search_results'
require 'json'
# The API key is read from the API_KEY environment variable
params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "stackoverflow",
  hl: "en",
  num: "100"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

# Pick the same four fields as in the scraping example
data = hash_results[:organic_results].map do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]
  {
    title: title,
    link: link,
    displayed_link: displayed_link,
    snippet: snippet
  }.compact # drop keys whose value is nil
end

puts JSON.pretty_generate(data)

# -------------
=begin
[
  {
    "title": "Stack Overflow - Home | Facebook",
    "link": "https://www.facebook.com/officialstackoverflow/",
    "displayed_link": "https://www.facebook.com › Pages › Interest",
    "snippet": "Stack Overflow. 519455 likes · 587 talking about this. We are the world's programmer community."
  }
  ...
]
=end
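The "iterate over structured JSON" step can be tried offline without an API key. The sketch below parses a hand-written sample shaped like the output above (it is not a real API response, and the snippet field is left out on purpose) using only the standard-library `json`:

```ruby
require 'json'

# Hand-written sample, assumed to resemble the organic_results structure.
raw = <<~SAMPLE
  {
    "organic_results": [
      {
        "title": "Stack Overflow - Home | Facebook",
        "link": "https://www.facebook.com/officialstackoverflow/"
      }
    ]
  }
SAMPLE

# symbolize_names gives symbol keys, matching the hash access used above.
parsed = JSON.parse(raw, symbolize_names: true)

# slice keeps only the listed keys that exist; compact drops any nil values.
data = parsed[:organic_results].map do |result|
  result.slice(:title, :link, :displayed_link, :snippet).compact
end

puts JSON.pretty_generate(data)
```

`Hash#slice` is a shorter alternative to assigning each field to a local variable first; both approaches produce the same hash for present keys.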
Disclaimer: I work for SerpApi.