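【Posted】: 2018-04-03 07:13:17
【Question】:
I am new to web scraping and have a question about scraping Google search results. Suppose I want to scrape the first 100 pages of Google results for a search query and extract the text from the URLs they list. I have tried several pieces of code, but so far I have not gotten the result I want. Can anyone help me with this? The code below gets the URLs from the current results page and skips URLs that have already been visited, so that no page is visited more than once.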
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

base_query = 'inurl:www.bbc.com/urdu/pakistan'
base = "http://www.bbc.com.pk/"
google_search_url = 'https://www.google.com.pk/search?q=inurl:www.bbc.com/urdu/pakistan&filter=0&biw=1366&bih=638'
resp = requests.get(google_search_url)
soup = BeautifulSoup(resp.text, "html.parser")

# pages already visited (a set makes the membership test cheap)
visited = {base}

for cite in soup.find_all('cite'):
    # one URL string per <cite>, not a list that grows on every iteration
    url = cite.get_text(strip=True)
    # assumption: <cite> text may lack the scheme that urlopen() requires
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    # skip urls already visited
    if url in visited or url == google_search_url:
        print('... skipping:', url)
        continue
    # remember new page as visited
    visited.add(url)
    print("loading:", url)
    subpage = urlopen(url)
    subsoup = BeautifulSoup(subpage, "html.parser")
    # find div with text
    for story_body in subsoup.find_all('div', class_='story-body'):
        # find title
        h1 = story_body.find('h1', class_='story-body__h1')
        if h1:
            print('title:', h1.get_text(strip=True))
        # find div with paragraphs; skip the story if it is missing
        div = story_body.find('div', class_='story-body__inner')
        if div is None:
            continue
        # find all paragraphs in div
        for p in div.find_all('p'):
            print(p.get_text(strip=True))
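Since the goal is the first 100 result pages rather than only the first one, the search URL has to be generated once per page. The following is a minimal sketch, not a definitive implementation. It assumes that Google's `start` query parameter is a zero-based result offset and that each page holds 10 results; neither assumption comes from the code above. `build_search_urls` is a hypothetical helper name. Google may still throttle or block automated requests.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.google.com.pk/search"  # same host as in the question
RESULTS_PER_PAGE = 10  # assumption: default number of results per page


def build_search_urls(query, pages):
    """Return one search URL per result page (hypothetical helper)."""
    urls = []
    for page in range(pages):
        params = {
            "q": query,
            "filter": 0,  # taken from the question's URL
            "start": page * RESULTS_PER_PAGE,  # assumed zero-based result offset
        }
        # urlencode percent-encodes ':' and '/' inside the query text
        urls.append(SEARCH_ENDPOINT + "?" + urlencode(params))
    return urls


search_urls = build_search_urls("inurl:www.bbc.com/urdu/pakistan", pages=100)
print(len(search_urls))
print(search_urls[1])
```

Each generated URL can then be fetched and parsed the same way as `google_search_url` above, ideally with a pause between requests.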
【Discussion】:
-
How is it not working? Do you get an error message? Is anything printed at all? Also, Google blocks scrapers.
-
It shows the URLs, but it does not show any text after that.
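One plausible cause of this symptom: in the code as posted, `url` is a list that grows on every iteration, so `url in visited` never matches a single address and `urlopen(url)` is not handed one URL string. The text of a `<cite>` element may also lack a scheme or be shown in a breadcrumb form. Below is a minimal standard-library sketch of normalizing and de-duplicating each result. `normalize_cite` and `unvisited` are hypothetical helper names. The `›` breadcrumb handling, the `http://` default, and the sample strings are assumptions for illustration.

```python
from urllib.parse import urlsplit


def normalize_cite(text, default_scheme="http"):
    """Turn the display text of a <cite> into a fetchable URL, or None."""
    text = text.strip()
    # Assumption: breadcrumb-style cites look like "www.bbc.com › urdu › pakistan"
    text = "/".join(part.strip() for part in text.split("›"))
    if not text or " " in text or "…" in text:
        return None  # empty, truncated, or otherwise not a usable address
    if "://" not in text:
        text = default_scheme + "://" + text
    return text if urlsplit(text).netloc else None


def unvisited(cite_texts, visited):
    """Yield each normalized URL once, recording it in the visited set."""
    for raw in cite_texts:
        url = normalize_cite(raw)
        if url is None or url in visited:
            continue
        visited.add(url)
        yield url


visited = {"http://www.bbc.com.pk/"}
samples = [  # made-up cite texts, for illustration only
    "www.bbc.com/urdu/pakistan-43621325",
    "www.bbc.com/urdu/pakistan-43621325",  # duplicate, skipped
    "https://www.bbc.com/urdu/pakistan",
    "www.bbc.com › urdu › pakistan",
    "www.bbc.com/urdu/pak…",  # truncated, skipped
]
for url in unvisited(samples, visited):
    print("loading:", url)
```

Keeping `visited` as a set of strings makes the skip test a direct membership check, and returning `None` for unusable text keeps the fetch loop free of special cases.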
-
The variable "wiki" is never defined in your code. Can you fix that?
-
I fixed it. Please check it now.
-
FYI, the word is "scraping" (from "scrape"), not "scrapping" (from "scrap").
Tags: python python-3.x web-scraping beautifulsoup google-search