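【Posted】: 2013-11-21 04:40:59
【Question】:
I want to build my own RSS feed with Python.
Is it possible to extract only the title and the download link ("Uploaded") from hd-area.org?
This is what I have so far: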
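import urllib2
from BeautifulSoup import BeautifulSoup
import re

# Download and parse the front page (Python 2, BeautifulSoup 3).
page = urllib2.urlopen("http://hd-area.org").read()
soup = BeautifulSoup(page)

# Print every title on the page.
for title in soup.findAll("div", {"class" : "title"}):
    print (title.getText())

# Print the href of every link whose contents include 'Uploaded.net'.
for a in soup.findAll('a'):
    if 'Uploaded.net' in a:
        print a['href']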
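It already extracts the titles.
But I'm stuck at the point where the links should be extracted.
It extracts them, but not in the right order...
Any suggestions on how I can make sure the script first checks that the "title" and the "link" are both inside this div class: <div class="topbox">?
Edit
It's done now.
Here is the final code.
Thanks, everyone, for pushing me in the right direction.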
import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import PyRSS2Gen

print "top_rls"

page = urllib2.urlopen("http://hd-area.org/index.php?s=Cinedubs").read()
soup = BeautifulSoup(page)

movieTit = []
movieLink = []

# Collect all titles.
for title in soup.findAll("div", {"class" : "title"}):
    movieTit.append(title.getText())

# Collect the download links found inside the inline <span> elements.
for span in soup.findAll('span', attrs={"style":"display:inline;"},recursive=True):
    for a in span.findAll('a'):
        if 'ploaded' in a.getText():
            movieLink.append(a['href'])
        elif 'cloudzer' in a.getText():
            movieLink.append(a['href'])

# Print each title followed by the link stored at the same index.
for i in range(len(movieTit)):
    print movieTit[i]
    print movieLink[i]

rss = PyRSS2Gen.RSS2(
    title = "HD-Area Cinedubs",
    link = "http://hd-area.org/index.php?s=Cinedubs",
    description = " "
                  " ",
    lastBuildDate = datetime.datetime.now(),
    items = [
        PyRSS2Gen.RSSItem(
            title = movieTit[0],
            link = movieLink[0]),
        PyRSS2Gen.RSSItem(
            title = movieTit[1],
            link = movieLink[1]),
        PyRSS2Gen.RSSItem(
            title = movieTit[2],
            link = movieLink[2]),
        PyRSS2Gen.RSSItem(
            title = movieTit[3],
            link = movieLink[3]),
        PyRSS2Gen.RSSItem(
            title = movieTit[4],
            link = movieLink[4]),
        PyRSS2Gen.RSSItem(
            title = movieTit[5],
            link = movieLink[5]),
        PyRSS2Gen.RSSItem(
            title = movieTit[6],
            link = movieLink[6]),
        PyRSS2Gen.RSSItem(
            title = movieTit[7],
            link = movieLink[7]),
        PyRSS2Gen.RSSItem(
            title = movieTit[8],
            link = movieLink[8]),
        PyRSS2Gen.RSSItem(
            title = movieTit[9],
            link = movieLink[9]),
    ])

rss.write_xml(open("cinedubs.xml", "w"))
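
The ten hard-coded RSSItem entries assume the page always yields at least ten titles and ten links. A minimal sketch of building the items in a loop instead is shown here. It is written for Python 3 and uses only the standard library (xml.etree.ElementTree rather than PyRSS2Gen, so it is a swapped-in technique, not the code above). The function name build_rss, the limit parameter, and the sample titles and links are made up for illustration.

```python
import datetime
import xml.etree.ElementTree as ET
from email.utils import format_datetime


def build_rss(feed_title, feed_link, entries, limit=10):
    """Return an RSS 2.0 document (str) with one <item> per (title, link) pair."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title
    now = datetime.datetime.now(datetime.timezone.utc)
    ET.SubElement(channel, "lastBuildDate").text = format_datetime(now)
    # Slicing never raises, so a page with fewer than `limit` entries is fine.
    for title, link in entries[:limit]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Hypothetical sample data standing in for the scraped titles and links.
movie_titles = ["Movie A", "Movie B", "Movie C"]
movie_links = ["http://example.com/a", "http://example.com/b"]

# zip() stops at the shorter list, so an unmatched title is dropped
# instead of raising IndexError.
feed_xml = build_rss(
    "HD-Area Cinedubs",
    "http://hd-area.org/index.php?s=Cinedubs",
    list(zip(movie_titles, movie_links)),
)
print(feed_xml)
```

With PyRSS2Gen the same loop would presumably produce the list passed as items, one RSSItem per pair, which avoids the fixed indices 0 to 9.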
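【Comments】:
-
What do you mean: the order is not right?
-
Yes. I guess that is what I was trying to say in my bad English :)
-
Oh, I meant: what do you mean by the order not being right?
-
When you visit hd-area.org, every movie has 2 download links. Each entry I scrape should yield 1 title + 1 download link, and so on, in alternating fashion. Right now it doesn't do that. First it grabs all the titles and then all the download links.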
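
One way to get the alternating title/link output described in the last comment is to walk the page one entry container at a time and take the title and one link from inside each container. The following is only a sketch under assumptions: it supposes every entry sits in its own <div class="topbox"> containing a <div class="title"> and an <a> whose text contains "ploaded" (those names come from the question; the real page layout may differ). The sample HTML, the class name TopboxParser, and the host_hint parameter are invented for illustration. It is Python 3 with the standard library only; with BeautifulSoup the same idea would be a loop over soup.findAll("div", {"class": "topbox"}) with the title and link lookups scoped to each box.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TopboxParser(HTMLParser):
    """Collect one (title, link) pair per <div class="topbox"> block."""

    def __init__(self, host_hint="ploaded"):
        super().__init__()
        self.host_hint = host_hint
        self.entries = []        # finished (title, link) pairs
        self.depth = 0           # nesting depth inside the current topbox
        self.title_depth = None  # depth at which the title div was opened
        self.title = ""
        self.link = None
        self.href = None         # href of the <a> currently open, if any
        self.a_text = ""

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth == 0:
            # Outside any entry: only a topbox div starts a new one.
            if tag == "div" and "topbox" in classes:
                self.depth = 1
                self.title, self.link = "", None
            return
        self.depth += 1
        if (tag == "div" and "title" in classes
                and self.title_depth is None and not self.title):
            self.title_depth = self.depth
        elif tag == "a":
            self.href, self.a_text = attrs.get("href"), ""

    def handle_data(self, data):
        if self.title_depth is not None:
            self.title += data
        if self.href is not None:
            self.a_text += data

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        if tag == "a" and self.href is not None:
            # Keep only the first matching link of this entry.
            if self.link is None and self.host_hint in self.a_text:
                self.link = self.href
            self.href = None
        if self.title_depth == self.depth:
            self.title_depth = None
        self.depth -= 1
        # The topbox just closed: store the pair if both parts were found.
        if self.depth == 0 and self.title and self.link:
            self.entries.append((self.title.strip(), self.link))


# Made-up markup that mimics the assumed layout.
SAMPLE_HTML = """
<div class="topbox">
  <div class="title"><a href="/?p=1">Movie A</a></div>
  <span style="display:inline;">
    <a href="http://ul.example/a">Uploaded.net</a>
    <a href="http://cz.example/a">Cloudzer.net</a>
  </span>
</div>
<div class="topbox">
  <div class="title">Movie B</div>
  <span style="display:inline;">
    <a href="http://cz.example/b">Cloudzer.net</a>
    <a href="http://ul.example/b">Uploaded.net</a>
  </span>
</div>
"""

parser = TopboxParser()
parser.feed(SAMPLE_HTML)
parser.close()
for title, link in parser.entries:
    print(title)
    print(link)
```

Because the title and the link are collected per container, an entry with two download links still contributes exactly one pair, and an entry missing either part is skipped rather than shifting the later pairs out of step.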