【Question title】: Retrieve all data from an HTML table and put it in a CSV
【Posted】: 2021-03-10 15:58:54
【Question description】:

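I am trying to write a Python script that retrieves all the data from an HTML table on several pages (I have an array of links), and I want to put that table data into a CSV. How should I go about it? I did something like the code below, but the data does not end up in rows and columns, and each page's data is deleted right after it is written and replaced by the next page's. What is the cleanest way to do this? Here is the table: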

<div class="table-responsive">
                    <table class="table table-striped product-page-specifications">
                        <tbody><tr>
                                <td class="col-xs-4 text-muted">Product type</td>
                                <td class="col-xs-8">1</td>
                            </tr><tr>
                                <td class="col-xs-4 text-muted">Tip2</td>
                                <td class="col-xs-8">MMA
TIG/WIG
</td>
                            </tr><tr>
                                <td class="col-xs-4 text-muted">Material</td>
                                <td class="col-xs-8">Metal </td>
                            </tr><tr>
                                <td class="col-xs-4 text-muted">Size</td>
                                <td class="col-xs-8">Universal </td>
                            </tr><tr>
                                <td class="col-xs-4 text-muted">Color</td>
                                <td class="col-xs-8">Black</td>
                            </tr><tr>
                                <td class="col-xs-4 text-muted">Content</td>
                                <td class="col-xs-8">Material made of a material as resistant as possible</td>
                            </tr></tbody>
                    </table>
                </div>

Here is the code:

        for a_link in all_links:
            res = requests.get(a_link).text
            soup = BeautifulSoup(res, 'html.parser')
            table = soup.select_one("table")

            output_rows = []
            for table_row in table.findAll('tr'):
              columns = table_row.findAll('td')
              output_row = []
              for column in columns:
                output_row.append(column.text)
                output_rows.append(output_row)

                df = pd.DataFrame(output_rows)
                print(df)

【Comments】:

    Tags: python beautifulsoup


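    【Solution 1】:

    It looks like pd.read_html works fine on that table (here s presumably holds the table's HTML as a string), but you may need to do some massaging/merging afterwards, depending on what the rest of the page looks like and how you want the final output to look: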

    In [13]: pd.read_html(StringIO(s))
    Out[13]:
    [              0                                                  1
     0  Product type                                                  1
     1          Tip2                                        MMA TIG/WIG
     2      Material                                              Metal
     3          Size                                          Universal
     4         Color                                              Black
     5       Content  Material made of a material as resistant as po...]
    

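    In particular, you will probably want to set the first column as the index and transpose, so that you get nicely named columns out of it: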

    In [15]: pd.read_html(StringIO(s))[0].set_index(0).T
    Out[15]:
    0 Product type         Tip2 Material       Size  Color                                            Content
    1            1  MMA TIG/WIG    Metal  Universal  Black  Material made of a material as resistant as po...
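The transpose step above can be extended to several pages. The following is a minimal sketch, not the answerer's code: it assumes each page's HTML is already available as a string, that the first table on each page is the two-column specification table, and that pandas has an HTML parser backend (such as lxml) installed. The helper names `page_to_record` and `pages_to_csv` are hypothetical.

```python
from io import StringIO
from typing import Iterable

import pandas as pd


def page_to_record(html: str) -> pd.DataFrame:
    """Turn one page's two-column spec table into a single-row DataFrame."""
    # read_html returns a list of DataFrames, one per <table> it finds;
    # assumption: the first table is the specification table.
    spec = pd.read_html(StringIO(html))[0]
    # Column 0 holds the labels and column 1 the values, so setting the
    # index to column 0 and transposing makes each label a column name.
    return spec.set_index(0).T


def pages_to_csv(pages: Iterable[str], path: str) -> pd.DataFrame:
    """Stack one row per page and write the result to a CSV once."""
    combined = pd.concat(
        [page_to_record(page) for page in pages], ignore_index=True
    )
    # Writing once, after the loop, avoids overwriting earlier pages.
    combined.to_csv(path, index=False)
    return combined
```

With the question's loop, the `pages` argument could be built as `[requests.get(a_link).text for a_link in all_links]`. Labels missing on some pages would come out as empty cells, because `pd.concat` takes the union of the columns.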
    

    【Comments】:

      【Solution 2】:
       for a_link in all_links:
                  res = requests.get(a_link).text
                  soup = BeautifulSoup(res, 'html.parser')
                  table = soup.select_one("table")
                  rows = table.findAll('tr')
                  headers = rows[0]
                  header_text = []
                  for th in headers.findAll('th'):
                      header_text.append(th.text)
                      row_text_array = []
                      for row in rows[1:]:
                          row_text = []
                      # loop through the elements
                      for row_element in row.findAll(['th', 'td']):
                          # append the array with the elements inner text
                          row_text.append(row_element.text.replace('\n', '').strip())
                      # append the text array to the row text array
                      row_text_array.append(row_text)
                      with open("out.csv", "w") as f:
                          wr = csv.writer(f)
                          wr.writerow(header_text)
                          # loop through each row array
                          for row_text_single in row_text_array:
                              wr.writerow(row_text_single)
      
                          df = pd.DataFrame(output_rows)
                          print(df)
      

      This is the complete code, but the output does not display correctly in the CSV. @Randy

      【Comments】:
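Two things in the code as posted could explain the symptoms. The sample table contains only `td` cells, so looking up headers with `findAll('th')` finds nothing. Also, `out.csv` is opened in `"w"` mode inside the page loop, which truncates the file on every iteration. Below is a hedged sketch of one possible alternative using BeautifulSoup and `csv.DictWriter`. It treats the first cell of each row as the column name and opens the file only once. The function names `parse_spec_table` and `write_records` are hypothetical.

```python
import csv

from bs4 import BeautifulSoup


def parse_spec_table(html):
    """Return {label: value} built from the first table's two-cell rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table")
    record = {}
    if table is None:
        return record  # page without a table: nothing to record
    for tr in table.find_all("tr"):
        # Collapse newlines and repeated spaces inside each cell.
        cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        if len(cells) == 2:
            record[cells[0]] = cells[1]
    return record


def write_records(records, path):
    """Write all records to one CSV, opened once, with a shared header."""
    # Collect every label seen on any page, in first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        # restval fills in labels that a given page does not have.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
```

In the original loop this might be used as `records = [parse_spec_table(requests.get(a_link).text) for a_link in all_links]` followed by `write_records(records, "out.csv")`, giving one CSV row per page.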
