【Posted】: 2021-03-15 21:19:42
【Question】:
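I'm having trouble scraping multiple URLs. Basically I can only get it to run for one genre; the moment I include the other links, it stops working.
The goal is to fetch the data and put it into a CSV file containing the movie title, URL, and genre. Any help would be appreciated!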
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = ['https://www.netflix.com/browse/genre/1365', 'https://www.netflix.com/browse/genre/7424']

# fails when my_url is a list: urlopen expects a single URL string or Request object
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("li", {"class": "nm-content-horizontal-row-item"})

# name the output file to write to local disk
out_filename = "netflixaction2.csv"
# header of csv file to be written
headers = "Movie_Name, Movie_ID \n"

# opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)

for container in containers:
    # the same <a> element carries both the title text and the href
    title_container = container.findAll("a", {"class": "nm-collections-title nm-collections-link"})
    title_container = title_container[0].text
    movieid = container.findAll("a", {"class": "nm-collections-title nm-collections-link"})
    movieid = movieid[0].attrs['href']
    print("Movie Name: " + title_container, "\n")
    print("Movie ID: ", movieid, "\n")
    f.write(title_container + ", " + movieid + "\n")

f.close()  # Close the file
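One way the stated goal (a CSV with title, URL, and genre) might be approached is sketched here using only the standard-library csv module. The helper names genre_id_from_url and write_movie_rows, the three column names, and the sample row are illustrative assumptions, not part of the original script; the genre is assumed to be the last path segment of the genre URL.

```python
import csv
import io


def genre_id_from_url(url):
    """Hypothetical helper: treat the last path segment as the genre ID."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def write_movie_rows(out_file, rows):
    """Write a header plus (title, url, genre) rows to an open text file."""
    writer = csv.writer(out_file)  # quotes titles that contain commas
    writer.writerow(["Movie_Name", "Movie_URL", "Genre"])
    writer.writerows(rows)


# Demo with made-up data, written to memory instead of disk
genre = genre_id_from_url("https://www.netflix.com/browse/genre/1365")
buffer = io.StringIO()
write_movie_rows(buffer, [("Example, The Movie", "/title/123", genre)])
print(buffer.getvalue())
```

With a real file, it would be opened once before looping over the genres, for example with open(out_filename, "w", newline="", encoding="utf-8"), so that "w" mode truncates the file only once and the rows for every genre land in the same file.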
【Discussion】:
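-
Because you are using open(out_filename, "w"), it overwrites whatever you have already written to the file. Could you try changing "w" to "a" and see whether that fixes it? If so, I'll post it as an answer :)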
-
Unfortunately it didn't. I ran into this error:
Traceback (most recent call last):
  File "C:\Users\tghan\autoscrape3.py", line 6, in <module>
    uClient = uReq(my_url)
  File "C:\Users\tghan\Anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\tghan\Anaconda3\lib\urllib\request.py", line 516, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.3s]
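The AttributeError above arises because urlopen was handed the whole list; it accepts one URL string or Request object per call. A minimal sketch of opening each URL in turn follows. The function name fetch_pages and its opener parameter are hypothetical (the parameter exists only so the loop can be tried without network access), and whether Netflix serves these pages to an unauthenticated script is not something this thread confirms.

```python
import io
from urllib.request import urlopen


def fetch_pages(urls, opener=urlopen):
    """Return {url: raw_html_bytes}, opening each URL separately.

    `opener` is a hypothetical hook; it defaults to urllib's urlopen.
    """
    pages = {}
    for url in urls:
        # one call per URL: urlopen takes a single string or Request, not a list
        with opener(url) as response:
            pages[url] = response.read()
    return pages


# Offline demo: a stand-in opener that returns fake HTML for any URL
def fake_opener(url):
    return io.BytesIO(b"<html>" + url.encode() + b"</html>")


demo = fetch_pages(["genre/1365", "genre/7424"], opener=fake_opener)
print(sorted(demo))
```

Each returned page could then be passed to soup(page_html, "html.parser") as in the original script, with the container parsing and the f.write calls moved inside a loop over the pages.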
Tags: python beautifulsoup screen-scraping