Parsing a table with BeautifulSoup and writing it to a text file

【Question title】: Parsing a table with BeautifulSoup and writing it to a text file
【Posted】: 2010-02-09 11:55:56
【Question description】:

I need the data from the table written to a text file (output.txt) in this format: data1;data2;data3;data4;.....

Celkova podlahova plocha bytu;33m;Vytah;Ano;Nadzemne podlazie;Prizemne podlazie;.....;Forma vlastnictva;Osobne

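Everything should be on one line, with ";" as the separator (to be exported as a CSV file later).

I'm a beginner, so any help is appreciated. Thanks.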

from BeautifulSoup import BeautifulSoup
import urllib2
import codecs

response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
html = response.read()
soup = BeautifulSoup(html)

tabulka = soup.find("table", {"class" : "detail-char"})

for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    prvy = col[0].string.strip()
    druhy = col[1].string.strip()
    record = ([prvy], [druhy])

fl = codecs.open('output.txt', 'wb', 'utf8')
for rec in record:
    line = ''
    for val in rec:
        line += val + u';'
    fl.write(line + u'\r\n')
fl.close()

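【Question discussion】:

Tags: python beautifulsoup

【Solution 1】:

You are not keeping each record as you read it in: `record` is reassigned on every pass through the loop, so only the last row is left when you write the file. Try this, which stores every record in the list `records`: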

from BeautifulSoup import BeautifulSoup
import urllib2
import codecs

response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
html = response.read()
soup = BeautifulSoup(html)

tabulka = soup.find("table", {"class" : "detail-char"})

records = [] # store all of the records in this list
for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    prvy = col[0].string.strip()
    druhy = col[1].string.strip()
    record = '%s;%s' % (prvy, druhy) # store the record with a ';' between prvy and druhy
    records.append(record)

fl = codecs.open('output.txt', 'wb', 'utf8')
line = ';'.join(records)
fl.write(line + u'\r\n')
fl.close()

This could be cleaned up further, but I think it does what you want.
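The accumulate-then-join pattern from the answer above can be tried without hitting the network. The following is a minimal Python 3 sketch, not the answerer's code: `join_records` and `sample_rows` are hypothetical names, and the sample rows are hard-coded stand-ins for the `(prvy, druhy)` pairs scraped from the table.

```python
import io


def join_records(rows):
    """Join (label, value) pairs into one semicolon-separated line."""
    records = []  # keep every row instead of overwriting a single variable
    for label, value in rows:
        records.append("%s;%s" % (label.strip(), value.strip()))
    return ";".join(records)


# Hypothetical sample rows standing in for the scraped table cells.
sample_rows = [
    ("Výťah", "Áno"),
    ("Forma vlastníctva", "osobné"),
]

line = join_records(sample_rows)

# newline="" keeps the explicit "\r\n" from being translated on write.
with io.open("output.txt", "w", encoding="utf-8", newline="") as fl:
    fl.write(line + "\r\n")
```

Appending inside the loop is the whole fix: the list grows by one entry per table row, and the single `join` at the end produces the one-line output the question asks for.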

【Discussion】:

【Solution 2】:

Here is another way that does not use BeautifulSoup; it is tailored to this specific task:

import urllib2

store = []  # collected cell texts
url = """http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever"""
page = urllib2.urlopen(url)
data = page.read()
# Split the raw HTML on closing table tags and keep only the "detail-char" table
for table in data.split("</table>"):
    if "<table" in table and 'class="detail-char' in table:
        # In each chunk ending at </td>, the text after the last ">" is the cell content
        for item in table.split("</td>"):
            if "<td" in item:
                store.append(item.split(">")[-1].strip())
print ','.join(store)

Output

$ ./python.py
Celková podlahová plocha bytu,33 m2,Výťah,Áno,Nadzemné podlažie,Prízemné podlažie,Stav,Čiastočná rekonštrukcia,Konštrukcia bytu,tehlová,Forma vlastníctva,osobné

【Discussion】:

It should be ';'.join(store), because the items need to be separated by semicolons.
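Since the question mentions exporting to CSV later, one option is to let the standard `csv` module produce the semicolon-separated line, so that a cell which itself contains ";" gets quoted instead of breaking the format. This is a Python 3 sketch under that assumption; `to_semicolon_line` is a hypothetical helper, and `store` is filled with a few values copied from the output above.

```python
import csv
import io


def to_semicolon_line(items):
    """Format one list of cell texts as a single ';'-delimited CSV line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\r\n")
    writer.writerow(items)
    return buf.getvalue()


# A few values taken from the output shown in the answer.
store = ["Výťah", "Áno", "Stav", "Čiastočná rekonštrukcia"]
line = to_semicolon_line(store)
```

For plain values this gives the same result as `';'.join(store)` plus a line ending; the difference only shows up when a value contains the delimiter or a quote character.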
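Wow, this is great, but there is one limitation: it only grabs the first item. How can it keep going and grab all the data in the table, including nested tables?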
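One possible way to reach cells inside nested tables, staying with a non-BeautifulSoup approach, is the standard library's `html.parser`. The sketch below is Python 3 and is not from the thread: `CellCollector`, `collect_cells`, and `sample_html` are hypothetical names, and the sample markup is made up to illustrate nesting. It keeps one text buffer per open `<td>`, so an outer cell that only wraps a nested table contributes nothing of its own.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, including cells of nested tables."""

    def __init__(self):
        super().__init__()
        self.cells = []   # finished cell texts, in the order the cells close
        self._stack = []  # one text buffer per currently open <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._stack.append([])

    def handle_data(self, data):
        # Text belongs to the innermost open cell only.
        if self._stack:
            self._stack[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._stack:
            text = "".join(self._stack.pop()).strip()
            if text:  # skip cells that hold only a nested table or whitespace
                self.cells.append(text)


def collect_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up markup: the second row wraps a nested table.
sample_html = (
    "<table><tr><td>A</td><td>1</td></tr>"
    "<tr><td><table><tr><td>B</td><td>2</td></tr></table></td></tr></table>"
)
cells = collect_cells(sample_html)
print(";".join(cells))
```

This collects every `<td>` in the document; restricting it to the `detail-char` table would need an extra check on the `attrs` passed to `handle_starttag`, which is left out here to keep the sketch short.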