Getting error while web scraping the link: answers

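【Question title】: Getting error while web scraping the link
【Posted】: 2021-03-14 03:29:29
【Question】:

I get an error when scraping the link below. Can anyone help me resolve the error, and share code for scraping the page to get all of its text data?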

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link) 
webpage = urlopen(req).read()

【Comments】:

It would help if you posted the error...

Tags: python web-scraping data-wrangling

【Answer 1】:

You could try requests:

>>> import requests
>>> res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
>>> res.raise_for_status()
>>> res.text
'\r\n<!DOCTYPE html><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>...'

To extract the page's content (in this case, the actual story), you will probably want an HTML parsing library such as BeautifulSoup4 or lxml.

BeautifulSoup4

import bs4
import requests

res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
soup = bs4.BeautifulSoup(res.text, features="html.parser")
elem = soup.select("#chapter-content div:nth-child(3) div")[0]
content = elem.getText()

BeautifulSoup4 is a third-party module, so be sure to install it: pip install BeautifulSoup4.
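
The selector in the BeautifulSoup4 example above is tied to the page's exact markup, and indexing the result with [0] raises IndexError when nothing matches. A more forgiving sketch is shown here. It assumes the chapter text lives under an element with id chapter-content, which is taken from the selector above and not verified against the live page. SAMPLE_HTML and extract_text are hypothetical names used only for illustration.

```python
import bs4

# Stand-in markup for illustration; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
  <div id="chapter-content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def extract_text(html, selector="#chapter-content"):
    """Return the text under the first element matching selector, or None."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    elem = soup.select_one(selector)  # None when nothing matches
    if elem is None:
        return None
    # Join all nested text nodes, one per line, with whitespace stripped.
    return elem.get_text(separator="\n", strip=True)
```

Calling extract_text(res.text) on the downloaded page would then return either the text or None, so a selector that no longer matches can be detected without an exception.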

lxml

from urllib.request import urlopen
from lxml import etree

res = urlopen("https://novelfull.com/warriors-promise/chapter-1.html")
htmlparser = etree.HTMLParser()  # the class is HTMLParser, with a capital P
tree = etree.parse(res, htmlparser)
# xpath() returns a list of matching elements, so take the first one
elem = tree.xpath("//div[@id='chapter-content']//div[3]//div")[0]
content = elem.text

lxml is a third-party module, so be sure to install it: pip install lxml
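
Both libraries above need installing. If that is not an option, the standard library's html.parser can collect all of the text on a page, which is what the question asks for. This is only a minimal sketch: TextExtractor and html_to_text are hypothetical names, and skipping script and style contents is an assumption about what counts as text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.get_text()
```

Passing the decoded page source to html_to_text gives the whole page's text, including navigation and footer text, so narrowing it down to the story still needs a selector-based approach like the ones above.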

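【Comments】:

Thanks for your answer, but neither worked. Both BeautifulSoup and lxml show errors.
@desktopp Did you install them? Both are third-party modules, so you have to run pip install BeautifulSoup4 and pip install lxml respectively.

【Answer 2】:

Setting a User-Agent in the headers, as if the request came from a browser, seems to avoid the HTTP 403: Forbidden error, for example: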

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
webpage = urlopen(req).read()

For a similar case, see also this question
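
The User-Agent approach can be wrapped in a small helper that also reports the error instead of crashing, which makes it easier to see whether the server is still refusing the request. This is a sketch under assumptions: build_request, fetch_html and DEFAULT_USER_AGENT are hypothetical names, the User-Agent string is just an example, and whether a given site accepts it is not guaranteed.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Example value only; any browser-like User-Agent string may work.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request carrying a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Fetch url and return the decoded HTML, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as err:
        # For example 403 Forbidden when the server rejects the client.
        print(f"HTTP error {err.code}: {err.reason}")
    except URLError as err:
        print(f"Could not reach the server: {err.reason}")
    return None
```

Calling fetch_html(link) with the link from the question returns the page source as a string, or prints the status code and returns None if the server still rejects the request.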

【Comments】: