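【Posted】: 2019-07-05 16:52:02
【Question】:
I have a list of URLs collected from this page. They are basically just quotes from people, and I want to save the quotes from each URL in a separate file.
To get the list of URLs, I used: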
import re
from urllib.request import Request, urlopen as uReq
from bs4 import BeautifulSoup as soup

# define the URL of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'
# set a known browser user agent on the request to avoid an HTTPError
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
# open the connection and grab the page
uClient = uReq(req)
page_html = uClient.read()
uClient.close()
# the HTML is jumbled at the moment, so parse it with BeautifulSoup
# (stored as page_soup so the imported soup alias is not overwritten)
page_soup = soup(page_html, "html.parser")
# test: print the title of the page
print(page_soup.title)
# collect every <a> tag whose href is a javascript:pop(...) call
tags = page_soup.findAll("a", href=re.compile("javascript:pop"))
print(tags)
# get the list of all URLs
for links in tags:
    link = links.get('href')
    if "java" in link:
        # drop the 18-character javascript prefix and the 3 trailing characters
        print("http://archive.ontheissues.org" + link[18:len(link)-3])
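The fixed slice `link[18:len(link)-3]` only works if every href has exactly the same prefix and suffix lengths. A minimal sketch of a more tolerant alternative is below. It assumes each href looks like `javascript:pop('<relative path>')`, which the post does not confirm; `pop_href_to_url`, `POP_RE`, and the sample path `2016/Some_Page.htm` are hypothetical names used only for illustration.

```python
import re
from urllib.parse import urljoin

# Page the links were collected from (taken from the question).
BASE_URL = "http://archive.ontheissues.org/Free_Trade.htm"

# Assumed href shape: javascript:pop('<relative path>').
# This pattern is a guess at the format, not something the source confirms.
POP_RE = re.compile(r"javascript:pop\(\s*['\"]([^'\"]+)['\"]")


def pop_href_to_url(href, base_url=BASE_URL):
    """Return the absolute URL wrapped in a javascript:pop(...) href, or None."""
    match = POP_RE.search(href)
    if match is None:
        return None
    # urljoin resolves the relative path against the page it came from.
    return urljoin(base_url, match.group(1))


if __name__ == "__main__":
    # Hypothetical href, for illustration only.
    print(pop_href_to_url("javascript:pop('2016/Some_Page.htm')"))
```

With something like this, the loop above could collect `pop_href_to_url(links.get('href'))` into a list and skip any `None` results instead of printing sliced strings.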
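How can I extract the content from each link, including text, bullet points, and paragraphs, and then save it to a separate file per link? Also, I don't want anything that isn't a quote, such as the other URLs on those pages.
【Discussion】:
Tags: html python-3.x web-scraping beautifulsoup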
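One possible starting point, offered as a rough sketch rather than a tested solution for these particular pages: strip the text out of each downloaded page while ignoring anything inside `<a>`, `<script>`, or `<style>`, then write it to a file named after the URL. The sketch uses only the standard library's `html.parser` so that it runs without extra installs; the same filtering could be done with BeautifulSoup by calling `decompose()` on the `<a>` tags and then `get_text()`. The names `QuoteTextParser`, `extract_text`, `filename_for`, `save_page_text`, and the `quotes` output folder are all hypothetical. It also assumes every `<a>` tag is properly closed, which may not hold on older pages.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse


class QuoteTextParser(HTMLParser):
    """Collect visible text, skipping anything inside <a>, <script>, or <style>."""

    SKIP = {"a", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page text, one line per text fragment, without link text."""
    parser = QuoteTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def filename_for(url):
    """Derive a .txt file name from the last path segment of the URL."""
    name = os.path.basename(urlparse(url).path) or "index"
    return os.path.splitext(name)[0] + ".txt"


def save_page_text(url, html, out_dir="quotes"):
    """Write the extracted text of one page to its own file; return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_for(url))
    with open(path, "w", encoding="utf-8") as f:
        f.write(extract_text(html))
    return path
```

Each URL from the list could be fetched with the same `Request` and `urlopen` pattern used in the question, decoded to a string, and passed to `save_page_text(url, html)`. Deciding which remaining text is actually a quote would still depend on how those pages are laid out, which this sketch does not attempt to guess.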