Web scraping with URLs that change by date

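【Question title】: Web scraping with urls changing by date
【Posted】: 2016-03-22 19:14:04
【Question】:

I am writing a script that uses Python and BeautifulSoup4. The script itself is finished; the only part causing problems is the URLs it uses.

I am currently using this code to build and queue the URLs: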

urllist = ["samplewebsitename.com/2015/05/xxx-chapter-{}.html".format(str(pgnum).zfill(2)) for pgnum in range(1, chapter_number+1)]
for url in urllist:
    url_queue.put(url)

The problem is that, while scraping the site, I noticed that part of the URL changes depending on when the chapter was uploaded. For example:

samplewebsitename.com/2015/05/xxx-chapter-01.html
samplewebsitename.com/2015/06/xxx-chapter-32.html
samplewebsitename.com/2015/10/xxx-chapter-47.html

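I can handle the chapter numbers because they are sequential, but the year and month in which each chapter was added follow no fixed pattern. I would like to know whether there is a way around this.

The year and month would also need to become variables that replace the hard-coded values in the example, but getting them from the website seems harder than I expected.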
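
Turning the year and month into variables could look roughly like the sketch below. It is only an illustration: `chapter_url`, `URL_TEMPLATE` and the `upload_dates` mapping are hypothetical names, and the dates are just the three sample values quoted above. It does not solve the harder part of finding the real upload dates.

```python
# Hypothetical template: year, month and chapter are all placeholders now.
URL_TEMPLATE = "samplewebsitename.com/{year:04d}/{month:02d}/xxx-chapter-{chapter:02d}.html"


def chapter_url(year, month, chapter):
    """Build one chapter URL from its upload year/month and chapter number."""
    return URL_TEMPLATE.format(year=year, month=month, chapter=chapter)


# Assumed data: chapter number -> (year, month), copied from the sample URLs.
upload_dates = {1: (2015, 5), 32: (2015, 6), 47: (2015, 10)}

urllist = [chapter_url(year, month, chapter)
           for chapter, (year, month) in sorted(upload_dates.items())]

for url in urllist:
    print(url)
```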

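Edit: Apparently the links can be taken from a dropdown list on the page, which reduces the whole problem to parsing that dropdown for all of the links.

The only small problem I have now is how to parse it correctly. I am currently trying to find the site's select element, but I am still very new to this.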

import requests
from bs4 import BeautifulSoup

# Get the URLs for every chapter
urllist = []
starturl = "http://www.bimanga.com/2015/05/read-manga-tokyo-ghoul-re-chapter-01.html"
response = requests.get(starturl)
html = response.content
soup = BeautifulSoup(html, "html.parser")
for option in soup.findAll('option'):
    #urllist.append(option["value"])
    print(option["value"])  # Debugging

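【Comments】:

Unless you know the upload dates of the books you want to scrape, you cannot determine this. Alternatively, if you really need to, you could try a brute-force approach and check every possible date, but I would really discourage that.
Are the chapters linked to one another? If so, once you have found one chapter you can look for the links to the others.
There are no explicit links, but the site has a dropdown list. Would it be possible to get them from there? i.imgur.com/pvKgnDw.png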

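Tags: python html web-scraping beautifulsoup

【Answer 1】:

The year and month can be obtained from the dropdown list you can see here: http://i.imgur.com/pvKgnDw.png

Parse the dropdown list (the select element) and take the links from it. You may then not even need to construct the URL from the year and month, since the dropdown probably contains the full URL for each chapter.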
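
As a rough sketch of what parsing the dropdown could look like: the HTML below is made up for illustration (the real page's markup is not shown in this thread), and `chapter_links` is a hypothetical helper name. It collects the `value` attribute of every option inside every select element, skips empty placeholder entries, and resolves relative paths against an assumed base URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup standing in for the chapter page; the real page may differ.
SAMPLE_HTML = """
<select name="menu">
  <option value="">Select chapter</option>
  <option value="http://samplewebsitename.com/2015/05/xxx-chapter-01.html">Chapter 1</option>
  <option value="/2015/06/xxx-chapter-32.html">Chapter 32</option>
</select>
"""


def chapter_links(html, base_url):
    """Return the non-empty option values of all select elements as absolute URLs."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for select in soup.find_all("select"):
        for option in select.find_all("option"):
            value = option.get("value")  # None if the attribute is missing
            if value:
                links.append(urljoin(base_url, value))
    return links


links = chapter_links(SAMPLE_HTML, "http://samplewebsitename.com/")
for link in links:
    print(link)
```

Using `option.get("value")` rather than `option["value"]` avoids a `KeyError` on option tags that have no `value` attribute.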

【Comments】:

Do you know of any examples of parsing a dropdown list, or could you point me to one? I am having a bit of trouble. This is what I have so far:
#Gets all the url's for each chapter
urllist = []
starturl = "http://www.bimanga.com/2015/05/read-manga-tokyo-ghoul-re-chapter-01.html"
response = requests.get(starturl)
html = response.content
soup = BeautifulSoup(html, "html.parser")
for option in soup.find_all('select', attrs={'name':'menu'}):
    #urllist.append(option["value"])
    sys.stdout.write(option["value"])
for url in urllist:
    url_queue.put(url)
You need to select the option tags.
I tried using option as the tag: for option in soup.findAll('option'): but it just keeps giving me an empty set. Could it be that BeautifulSoup does not support dropdown lists? Or is it my mistake?
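
BeautifulSoup handles option like any other tag, so an empty result suggests the tags are missing from the HTML that was actually parsed. One possible cause, which nothing in this thread confirms, is that the page fills the dropdown with JavaScript after loading; requests only downloads the raw HTML and does not run scripts. A rough way to check is to compare how often `<option` appears in the raw text with how many tags the parser finds. The two sample pages and the helper name `count_options` below are made up for illustration.

```python
from bs4 import BeautifulSoup


def count_options(html):
    """Return (occurrences of '<option' in the raw text, option tags found by the parser)."""
    raw_count = html.lower().count("<option")
    parsed_count = len(BeautifulSoup(html, "html.parser").find_all("option"))
    return raw_count, parsed_count


# Made-up page whose dropdown is present in the downloaded HTML.
static_page = (
    '<select name="menu">'
    '<option value="a.html">1</option>'
    '<option value="b.html">2</option>'
    '</select>'
)

# Made-up page whose dropdown would only be filled in by a script at runtime.
scripted_page = '<select name="menu"></select><script>/* filled in later */</script>'

print(count_options(static_page))
print(count_options(scripted_page))
```

With the real page, `response.text` could be passed to `count_options`. If both counts are zero, the options are not in the downloaded HTML at all, and the links would have to come from somewhere else on the page or from a tool that executes JavaScript.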