Scraping HTML data and parsing it into a list

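【Question title】: scraping html data and parsing into list
【Posted】: 2014-04-14 04:51:12
【Question】:

I am writing an Android app using Python for Android (SL4A). What I want it to do is search a joke website and extract a joke, then read that joke aloud to wake me up. So far it saves the raw HTML source into a list, but I need it to build a new list from the data between the HTML tags and then read that data to me. It is the parser that I can't get working. Here is the code: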

import android
droid = android.Android() 
import urllib 
current = 0
newlist = []

sock = urllib.urlopen("http://m.funtweets.com/random") 
htmlSource = sock.read() 
sock.close() 
rawhtml = []
rawhtml.append (htmlSource)

while current < len(rawhtml):
    while current != "<div class=":
        if [current] == "</b></a>":
            newlist.append (current)
            current += 1


print newlist

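【Comments】:

For scraping, see Beautiful Soup.
I don't know how to install the Beautiful Soup module, because I'm using the Android scripting layer rather than a typical Python installation.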
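
Where installing third-party modules is not an option, the standard library's HTML parser may be enough. The following is only a minimal sketch, not a tested SL4A solution. `SAMPLE_HTML` is made-up markup that assumes each joke is the text immediately after a closing `</a>` tag (as the `</b></a>` check in the question suggests), and `JokeParser` and `extract_jokes` are hypothetical names. The real page layout may differ.

```python
# Standard-library only: no Beautiful Soup required.
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 (e.g. older SL4A builds)

# Hypothetical markup, assumed for illustration; not copied from the real site.
SAMPLE_HTML = (
    '<div class="tweet">'
    '<a href="/alice"><b><span>@</span>alice</b></a> Why did the chicken cross the road?'
    '</div>'
    '<div class="tweet">'
    '<a href="/bob"><b><span>@</span>bob</b></a> I told my computer a joke.'
    '</div>'
)


class JokeParser(HTMLParser):
    """Collects the text that directly follows each closing </a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.after_link = False   # True right after a </a> has been seen
        self.jokes = []

    def handle_starttag(self, tag, attrs):
        # Any new opening tag means we are no longer directly after a link.
        self.after_link = False

    def handle_endtag(self, tag):
        # Only a closing </a> arms the collector.
        self.after_link = (tag == "a")

    def handle_data(self, data):
        text = data.strip()
        if self.after_link and text:
            self.jokes.append(text)
            self.after_link = False


def extract_jokes(html):
    """Return a list with the text found after each link in `html`."""
    parser = JokeParser()
    parser.feed(html)
    parser.close()
    return parser.jokes


if __name__ == "__main__":
    for joke in extract_jokes(SAMPLE_HTML):
        print(joke)
```

To try it on the live page, pass the downloaded `htmlSource` string to `extract_jokes` instead of `SAMPLE_HTML`. The state flag would probably need adjusting once the actual markup is known.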

Tags: android python html parsing scrape

【Solution 1】:

Use this library to parse HTML on Android: http://jsoup.org/. It is effective and widely accepted by developers, and it is also available for Python :)

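【Comments】:

After reading the jsoup documentation a thousand times, I still can't get any code to do what I want. Any specific suggestions on how to use jsoup for this purpose?
Follow this tutorial: survivingwithandroid.com/2014/04/…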

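【Solution 2】:

Here is how to do it:

[code]
import re
import urllib2

# Download the page source as a single string (Python 2, urllib2).
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()

# Capture the user name after the "@" span and the text following each link.
user = re.compile(r'<span>@</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")

user_lst = [match.group(1) for match in user.finditer(page)]
text_lst = [match.group(1) for match in text.finditer(page)]

# Pair each user with the corresponding text and print both.
for _user, _text in zip(user_lst, text_lst):
    print '@{0}\n{1}\n'.format(_user, _text)
[/code]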
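
The same regex idea can be tried offline before pointing it at the site. This is a hedged sketch: `SAMPLE_PAGE` is invented markup shaped to fit the two patterns above, `extract_pairs` is a hypothetical helper, and the text pattern is a variation (`[^<]+` instead of `\w.*`) so that the capture stops at the next tag rather than running to the end of the line.

```python
import random
import re

# Hypothetical markup, assumed for illustration; the real page may differ.
SAMPLE_PAGE = """
<div class="tweet"><a href="/alice"><b><span>@</span>alice</b></a> Why did the chicken cross the road?</div>
<div class="tweet"><a href="/bob"><b><span>@</span>bob</b></a> I told my computer a joke.</div>
"""

USER_RE = re.compile(r'<span>@</span>(\w+)')
# Variation on the answer's pattern: capture up to the next "<" only.
TEXT_RE = re.compile(r'</b></a>\s*([^<]+)')


def extract_pairs(page):
    """Return a list of (user, text) tuples found in `page`."""
    users = USER_RE.findall(page)
    texts = [t.strip() for t in TEXT_RE.findall(page)]
    return list(zip(users, texts))


if __name__ == "__main__":
    pairs = extract_pairs(SAMPLE_PAGE)
    if pairs:
        # Pick a single joke, since the question only needs one.
        user, joke = random.choice(pairs)
        print("@{0}: {1}".format(user, joke))
```

On SL4A, the chosen text could then be handed to the text-to-speech call (`droid.ttsSpeak`, assuming it is available on your install) instead of being printed. Regexes like these are brittle against markup changes, so treat the patterns as a starting point.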

【Comments】: