【Question title】: urllib.error.HTTPError: HTTP Error 404: Not Found -- difficulty web scraping
【Posted】: 2020-11-10 04:23:27
【Question】:

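I am trying to scrape a table from a web page, and this is the code I am using. I am new to Python and have tried several approaches, but none of them work. Does anyone have an idea? Please include in your answer where that part of the code should be inserted.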

import urllib
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup('https://www.transfermarkt.com/transfers/saisontransfers/statistik?land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&transferfenster=&saison-id=2020&plus=1')

I get this error:

Traceback (most recent call last):
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\GBEM\PycharmProjects\tablepractice\tablescrape.py", line 11, in <module>
    soup = make_soup('https://www.transfermarkt.com/transfers/saisontransfers/statistik?land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&transferfenster=&saison-id=2020&plus=1')
  File "C:\Users\GBEM\PycharmProjects\tablepractice\tablescrape.py", line 7, in make_soup
    thepage = urllib.request.urlopen(url)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\GBEM\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

【Question comments】:

    Tags: web-scraping html-table


    【Solution 1】:

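    To get a correct response from the server, specify the User-Agent HTTP header. By default urllib identifies itself as Python-urllib, which this server appears to reject: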

    import urllib.request
    from bs4 import BeautifulSoup
    
    
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
    
    def make_soup(url):
        req = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(req)
        return BeautifulSoup(response.read(), 'html.parser')
    
    soup = make_soup('https://www.transfermarkt.com/transfers/saisontransfers/statistik?land_id=0&ausrichtung=&spielerposition_id=&altersklasse=&leihe=&transferfenster=&saison-id=2020&plus=1')
    print(soup)
    

    Prints:

    <!DOCTYPE html>
    
    <!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ -->
    <!--[if IE 7]>
    <html class="ie7 oldie" lang="en"> <![endif]-->
    <!--[if IE 8]>
    <html class="no-js lt-ie9" lang="en"> <![endif]-->
    <!--[if gt IE 8]><!-->
    <html class="no-js" lang="en"> <!--<![endif]-->
    <head>
    
    ...and so on.
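
If the request still fails, it can help to catch the exception and look at the status code instead of letting the traceback end the script. The following is only a sketch under stated assumptions: `DEFAULT_HEADERS`, `build_request`, and `fetch_html` are hypothetical helper names that are not part of the original answer, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request

# Assumption: any realistic browser User-Agent string should behave similarly.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) "
                  "Gecko/20100101 Firefox/78.0"
}


def build_request(url, headers=None):
    """Return a Request that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url, opener=urllib.request.urlopen, timeout=10):
    """Return (status, body_bytes); on an HTTP error return (code, None)."""
    req = build_request(url)
    try:
        with opener(req, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # err.code is the HTTP status (for example 404), err.reason the message.
        print(f"Request failed: {err.code} {err.reason}")
        return err.code, None
```

The returned body can be passed to `BeautifulSoup(body, 'html.parser')` exactly as in the answer above; a `None` body signals that the server refused the request.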
    

    【Comments】:

    • Thank you very much! That solved the problem.
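
Since the original goal was to scrape a table, here is a rough sketch of how rows might be pulled out of the soup once the page loads. It runs on a small inline HTML sample rather than the live page. The `table.items` selector, the sample markup, and the `table_rows` helper are all assumptions for illustration, so inspect the real page source to find the right selector.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page; the real markup will differ.
SAMPLE_HTML = """
<table class="items">
  <thead><tr><th>Player</th><th>Fee</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>10m</td></tr>
    <tr><td>Player B</td><td>5m</td></tr>
  </tbody>
</table>
"""


def table_rows(soup, css_selector="table.items"):
    """Return the first matching table as a list of rows of cell text."""
    table = soup.select_one(css_selector)
    if table is None:
        return []  # selector did not match anything on this page
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for row in table_rows(soup):
    print(row)
```

With the answer's code, the `soup` returned by `make_soup(...)` would take the place of the sample soup here.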