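【Posted】: 2017-02-04 08:32:38
【Question】:
I want to parse some URLs on a website, and I have created a text file containing all the links I want to parse. How can I read these URLs one by one from the text file in a Python program?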
import json

import requests
from bs4 import BeautifulSoup

# Fetch the listing page and parse it
soup = BeautifulSoup(requests.get("https://www.example.com").content, "html.parser")

for d in soup.select("div[data-selenium=itemDetail]"):
    # Follow each item's link and look for the UPC on its detail page
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum")
    if upc:
        data = json.loads(d["data-itemdata"])
        text = upc.text.strip()
        print(upc.text)
        # Append the item data and the UPC as one comma-separated line
        with open('/Users/Burak/Documents/new_urllist.txt', 'a') as outFile:
            outFile.write(str(data))
            outFile.write(",")
            outFile.write(str(text))
            outFile.write("\n")
urllist.txt
https://www.example.com/category/1
category/2
category/3
category/4
Thanks in advance.
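For reference, reading the links one at a time could look roughly like the sketch below. It is only a minimal illustration under a few assumptions: `read_urls` and `process_file` are hypothetical helper names, the file holds one entry per line, and entries without a scheme (such as `category/2`) are assumed to be relative to `https://www.example.com/`. The per-page parsing from the code above would go in whatever function is passed as `handle`.

```python
from urllib.parse import urljoin


def read_urls(path, base_url="https://www.example.com/"):
    """Yield one absolute URL per non-empty line of the file at `path`."""
    with open(path, encoding="utf-8") as url_file:
        for line in url_file:
            entry = line.strip()
            if not entry:
                continue  # skip blank lines
            # Assumption: entries without a scheme are relative to base_url;
            # urljoin leaves entries that are already absolute unchanged.
            yield urljoin(base_url, entry)


def process_file(path, handle=print):
    """Call `handle(url)` for each URL in the file, in file order."""
    for url in read_urls(path):
        handle(url)
```

Calling `process_file("urllist.txt", scrape_page)` would then run a hypothetical `scrape_page(url)` function once per line; with the default `handle=print` it just prints each resolved URL, which is a quick way to check the file before making any requests.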
【Comments】:
Tags: python parsing web-scraping beautifulsoup