【Posted】: 2021-07-29 09:44:04
【Question】:
I am trying to scrape data from ESPN Cricinfo using a Python script available on GitHub. The code is below.
import urllib.request as ur
import csv
import sys
import time
import os
import unicodedata
from urllib.parse import urlparse
from bs4 import BeautifulSoup
BASE_URL = 'http://www.espncricinfo.com'
for i in range(0, 6019):
    url = 'http://search.espncricinfo.com/ci/content/match/search.html?search=first%20class;all=1;page='
    soupy = BeautifulSoup(ur.urlopen(url + str(i)).read())
    time.sleep(1)
    for new_host in soupy.findAll('a', {'class' : 'srchPlyrNmTxt'}):
        try:
            new_host = new_host['href']
        except:
            continue
        odiurl = BASE_URL + urlparse(new_host).geturl()
        new_host = unicodedata.normalize('NFKD', new_host).encode('ascii','ignore')
        print (new_host)
        print (str.split(new_host, "/"))[4]
        html = urllib2.urlopen(odiurl).read()
        if html:
            with open('espncricinfo-fc/{0!s}'.format(str.split(new_host, "/")[4]), "wb") as f:
                f.write(html)
The error is raised on this line:
print (str.split(new_host, "/"))[4]
TypeError: descriptor 'split' requires a 'str' object but received a 'bytes'
Any help would be greatly appreciated. Thanks.
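For reference, a minimal sketch of what seems to be going wrong: `.encode('ascii', 'ignore')` returns `bytes`, and `str.split` only accepts `str`, which matches the TypeError above. Decoding back to `str` before splitting is one way around it. The helper name `match_id_from_href` and the sample href are hypothetical, chosen only for illustration. Note also that in Python 3 `print (...)[4]` indexes the return value of `print` (which is `None`), so the `[4]` most likely belongs inside the parentheses.

```python
import unicodedata


def match_id_from_href(href: str) -> str:
    """Return the fifth '/'-separated segment of href as an ASCII-only str."""
    # normalize + encode drops non-ASCII characters but produces bytes;
    # decode('ascii') converts the result back to str so str methods work.
    ascii_href = (
        unicodedata.normalize('NFKD', href)
        .encode('ascii', 'ignore')
        .decode('ascii')
    )
    # Index inside the call to print, not on print's return value.
    return ascii_href.split('/')[4]


# Hypothetical href, shaped like a relative link from a search results page.
sample = '/ci/engine/match/12345.html'
print(match_id_from_href(sample))
```

Because the result is a `str`, it can be passed directly to `format` when building the output file name.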
【Comments】:
- `urllib2` does not exist in the Python 3 standard library. Are you sure this is Python 3?
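To illustrate the point in this comment: in Python 3 the `urllib2.urlopen` call would normally be written with `urllib.request.urlopen`, which the script already imports as `ur`. The following is only a sketch under that assumption; the function name `save_page` and its parameters are hypothetical and not part of the original script.

```python
import os
from urllib.request import urlopen


def save_page(url: str, out_dir: str, filename: str) -> str:
    """Download url and write the raw bytes to out_dir/filename."""
    # Create the target directory if it does not exist yet.
    os.makedirs(out_dir, exist_ok=True)
    # urlopen lives in urllib.request in Python 3 (urllib2 was Python 2 only).
    with urlopen(url) as response:
        html = response.read()  # bytes
    path = os.path.join(out_dir, filename)
    if html:
        # Binary mode, because response.read() returns bytes.
        with open(path, 'wb') as f:
            f.write(html)
    return path


# Possible usage inside the loop (odiurl and the file name come from the script):
# save_page(odiurl, 'espncricinfo-fc', match_file_name)
```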
Tags: python csv beautifulsoup scrape