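[Posted]: 2021-10-26 19:47:03
[Question]:
Complete beginner here, but I have managed to use Python to scrape EAN numbers from a list of links built by the upstream code. However, my output file contains all the scraped numbers as one continuous line instead of one EAN per line.
Here is my code. What is wrong with it? (The scraped URLs have been redacted.)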
import requests
from bs4 import BeautifulSoup
import urllib.request
import os

subpage = 1
while subpage <= 2:
    URL = "https://..." + str(subpage)
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    # writes all links under the h2 tag into a list
    links = []
    h2s = soup.find_all("h2")
    for h2 in h2s:
        links.append("http://www.xxxxxxxxxxx.com" + h2.a['href'])
    # opens links from list and extracts EAN number from underlying page
    with open("temp.txt", "a") as output:
        for link in links:
            urllib.request.urlopen(link)
            page_2 = requests.get(link)
            soup_2 = BeautifulSoup(page_2.content, "html.parser")
            if "EAN:" in soup_2.text:
                span = soup_2.find(class_="articleData_ean")
                EAN = span.a.text
                output.write(EAN)
    subpage += 1
os.replace('temp.txt', 'EANs.txt')
[Discussion]:
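The likely cause is that `output.write(EAN)` writes exactly the string it is given and, unlike `print`, does not append a line break, so consecutive EANs run together. A minimal sketch of one fix follows, assuming the scraped text may also carry stray whitespace. The helper `write_eans`, the sample values, and the temporary output path are hypothetical stand-ins for the scraping loop, not part of the original code.

```python
import os
import tempfile


def write_eans(eans, path):
    """Append each EAN to `path` on its own line."""
    with open(path, "a", encoding="utf-8") as output:
        for ean in eans:
            # strip() drops stray whitespace from the scraped text;
            # the explicit "\n" is what write() does not add by itself.
            output.write(ean.strip() + "\n")


# Hypothetical sample values standing in for scraped EANs.
sample = ["4006381333931", " 5901234123457 "]
out_path = os.path.join(tempfile.mkdtemp(), "EANs.txt")
write_eans(sample, out_path)

# Read the file back to confirm there is one EAN per line.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

In the question's loop, the equivalent change would be to replace `output.write(EAN)` with `output.write(EAN.strip() + "\n")`. Calling `print(EAN, file=output)` should work as well, since `print` adds the newline itself.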
Tags: python python-3.x web-scraping