【Question title】: Parsing a table with BeautifulSoup and writing it to a text file
【Posted】: 2010-02-09 11:55:56
【Question】:

I need the data from the table in a text file (output.txt), in this format: data1;data2;data3;data4;.....

Celkova podlahova plocha bytu;33m;Vytah;Ano;Nadzemne podlazie;Prizemne podlazie;.....;Forma vlastnictva;Osobne

Everything should be on ONE line, with ";" as the separator (to be exported as a CSV file later).

I'm a beginner... please help, thanks.

from BeautifulSoup import BeautifulSoup
import urllib2
import codecs

response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
html = response.read()
soup = BeautifulSoup(html)

tabulka = soup.find("table", {"class" : "detail-char"})

for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    prvy = col[0].string.strip()
    druhy = col[1].string.strip()
    record = ([prvy], [druhy])

fl = codecs.open('output.txt', 'wb', 'utf8')
for rec in record:
    line = ''
    for val in rec:
        line += val + u';'
    fl.write(line + u'\r\n')
fl.close()

【Question comments】:

    Tags: python beautifulsoup


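    【Solution 1】:

    You are not keeping each record as you read it in: record is overwritten on every pass through the loop, so only the last row survives. Try this, which stores every record in the list records: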

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import codecs
    
    response = urllib2.urlopen('http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever')
    html = response.read()
    soup = BeautifulSoup(html)
    
    tabulka = soup.find("table", {"class" : "detail-char"})
    
    records = [] # store all of the records in this list
    for row in tabulka.findAll('tr'):
        col = row.findAll('td')
        prvy = col[0].string.strip()
        druhy = col[1].string.strip()
        record = '%s;%s' % (prvy, druhy) # store the record with a ';' between prvy and druhy
        records.append(record)
    
    fl = codecs.open('output.txt', 'wb', 'utf8')
    line = ';'.join(records)
    fl.write(line + u'\r\n')
    fl.close()
    

    This could be cleaned up further, but I think it is what you want.
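Since the question mentions exporting to CSV later, here is a minimal sketch of the write step using the standard-library csv module instead of joining strings by hand. It is written for Python 3, unlike the Python 2 code above. The function name write_flat_record and the sample rows are made up for illustration; in practice the pairs would come from the (prvy, druhy) values collected in the loop.

```python
import csv
import io


def write_flat_record(pairs, stream):
    """Write all (label, value) pairs as a single ';'-separated line."""
    # Flatten [(label, value), ...] into [label, value, label, value, ...].
    flat = [field for pair in pairs for field in pair]
    writer = csv.writer(stream, delimiter=';', lineterminator='\r\n')
    writer.writerow(flat)


# Hypothetical sample rows standing in for the scraped table cells.
sample_pairs = [
    ("Celkova podlahova plocha bytu", "33m"),
    ("Vytah", "Ano"),
    ("Forma vlastnictva", "Osobne"),
]

buffer = io.StringIO()
write_flat_record(sample_pairs, buffer)
output_line = buffer.getvalue()
print(output_line, end="")
```

To write to a real file, pass open('output.txt', 'w', newline='', encoding='utf-8') as the stream. One advantage over ';'.join is that csv.writer quotes any value that itself contains a ';', so the line still parses correctly afterwards.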

    【Comments】:

      【Solution 2】:

      Here is another way that does not use BeautifulSoup. It only fits this particular task:

      import urllib2

      store=[] #to store your results
      url="""http://www.reality.sk/zakazka/0747-003578/predaj/1-izb-byt/kosice-mestska-cast-sever-sladkovicova-kosice-sever/art-real-1-izb-byt-sladkovicova-ul-kosice-sever"""
      page=urllib2.urlopen(url)
      data=page.read()
      for table in data.split("</table>"):
          if "<table" in table and 'class="detail-char' in table:
               for item in table.split("</td>"):
                    if "<td" in item:
                        store.append(item.split(">")[-1].strip())
      print ','.join(store)
      

      Output

      $ ./python.py
      Celková podlahová plocha bytu,33 m2,Výťah,Áno,Nadzemné podlažie,Prízemné podlažie,Stav,Čiastočná rekonštrukcia,Konštrukcia bytu,tehlová,Forma vlastníctva,osobné
      

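      【Comments】:

      • That should be ';'.join(store), since the items need to be separated by semicolons.
      • Wow, this is great, but it has one limitation: it only grabs the first item. How can I keep going and grab all the data in the table, including nested tables?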
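On the nested-table question in the last comment, one possible approach is sketched below. It stays in the non-BeautifulSoup spirit of this answer but uses the standard-library html.parser (Python 3) instead of splitting strings. The names TableCellCollector and extract_cells and the sample HTML are made up for illustration, and the sketch assumes every cell is explicitly closed with </td>.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of every <td> inside tables with a given class,
    including cells of tables nested within them."""

    def __init__(self, table_class):
        super().__init__(convert_charrefs=True)
        self.table_class = table_class
        self.table_depth = 0   # > 0 while inside a matching table
        self.open_cells = []   # indexes into self.cells for open <td> tags
        self.cells = []        # cell texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            # Count the matching table itself and any table nested in it.
            if self.table_depth or self.table_class in classes:
                self.table_depth += 1
        elif tag == "td" and self.table_depth:
            # Reserve a slot now so results keep document order.
            self.cells.append("")
            self.open_cells.append(len(self.cells) - 1)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
        elif tag == "td" and self.open_cells:
            self.open_cells.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open cell only.
        if self.open_cells:
            self.cells[self.open_cells[-1]] += data


def extract_cells(html, table_class="detail-char"):
    parser = TableCellCollector(table_class)
    parser.feed(html)
    parser.close()
    # Drop cells that only wrapped a nested table or whitespace.
    return [text.strip() for text in parser.cells if text.strip()]


# Hypothetical HTML standing in for the downloaded page.
sample_html = """
<table class="detail-char">
  <tr><td>Vytah</td><td>Ano</td></tr>
  <tr><td>
    <table><tr><td>Stav</td><td>Ciastocna rekonstrukcia</td></tr></table>
  </td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""
print(";".join(extract_cells(sample_html)))
# prints: Vytah;Ano;Stav;Ciastocna rekonstrukcia
```

For a real page, pass the decoded HTML string to extract_cells. Cells that hold only a nested table come out empty and are filtered, while the nested table's own cells are kept.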