Converting an HTML table to a CSV file from the shell: answers

【Question title】: Converting HTML table to CSV file from shell
【Posted】: 2014-03-27 21:37:03
【Question】:

I am trying to convert a file containing an HTML table to CSV format. An excerpt of the file:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head id="Head1"><link rel="shortcut icon" href="favicon.ico" /><title>
    Untitled Page
    </title></head>
    <body>
        <form name="form1" method="post" action="mypricelist.aspx" id="form1">
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />

    <div>
    <table id="price_list" border="0">
    <tr>
    <td>ProdCode</td><td>Description</td><td>Your Price</td>
    </tr><tr>
    <td>ab101</td><td>loruem</td><td>1.1</td>
    </tr><tr>
    <td>ab102</td><td>ipsum</td><td>0.1</td>
    </tr><tr>

I tried using

    xls2csv -x -c\; evprice.xls > evprice.csv

but this gave me the error

    evprice.xls is not OLE file or Error

I googled it, and apparently this happens because the file is not a real XLS file but just HTML.

When I run

    file evprice.xls

it reports that the file is HTML, so I found a "solution" that uses libreoffice:

    libreoffice --headless -convert-to csv ./evprice.xls

This runs without an error, but the CSV output file is strange, like opening an exe file in Notepad.

It contains a lot of odd characters like these:

    —¬žþ9ü~ÆóXþK¢

Does anyone know why this happens, and is there a working solution?
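
The diagnosis in the question (an `.xls` extension on a file that is really HTML) can be checked from a script by looking at the first bytes of the file, much as `file` does. The sketch below is only an illustration: the helper name `sniff_spreadsheet` and its return values are made up for this example, and the HTML check is a rough heuristic rather than a full content sniffer.

```python
# Hypothetical helper: classify a file by its leading bytes, not its extension.

# Signature that opens every OLE2 compound file (the container used by legacy .xls).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet(path):
    """Return 'ole2', 'html', or 'unknown' based on the start of the file."""
    with open(path, "rb") as f:
        head = f.read(1024)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    # Drop an optional UTF-8 byte-order mark and leading whitespace.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    text = head.lstrip().lower()
    # Heuristic: an HTML doctype, an <html> root, or a <table> near the top.
    if text.startswith((b"<!doctype html", b"<html")) or b"<table" in text:
        return "html"
    return "unknown"
```

For a file that begins like the excerpt above, `sniff_spreadsheet("evprice.xls")` would return `"html"`, which is consistent with `xls2csv` refusing it as "not OLE file".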

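【Comments】:

Is the sample data you are using publicly available? I don't see how anyone can offer something that handles a file in an undetermined format that we have never seen.
Sorry, it is not public. I can post part of the file.
I wouldn't describe this as an "XLS" file at all; it is an HTML table and has nothing to do with Excel or XLS.
...so, given that, this looks like a duplicate of stackoverflow.com/questions/259091/… (the accepted answer there is not automated at all, but there are other answers).
Well, I didn't know what type of file it is. They say it is a generated xls file, but whatever. libreoffice can open it when I open it manually. Why do these strange characters appear when I use the command-line version?

Tags: linux bash csv libreoffice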

【Solution 1】:

I have built a Python utility that converts all the tables in an HTML file into separate CSV files.

You can find it here.

The key part of the script is:

    # Python 2 with BeautifulSoup 3, as in the original answer.
    import csv
    import sys

    from BeautifulSoup import BeautifulSoup

    # Path to the HTML file, taken from the first command-line argument;
    # it also serves as the prefix for the output file names.
    filename = sys.argv[1]

    print "Opening file"
    with open(filename, 'r') as fin:
        html = fin.read()

    print "Parsing file"
    soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

    print "Preemptively removing unnecessary tags"
    for s in soup('script'):
        s.extract()

    print "CSVing file"
    for tablecount, table in enumerate(soup.findAll("table")):
        print "Processing Table #%d" % tablecount
        # One output file per table: <filename>0.csv, <filename>1.csv, ...
        with open(filename + str(tablecount) + '.csv', 'wb') as csvfile:
            fout = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            for row in table.findAll('tr'):
                cols = row.findAll(['td', 'th'])
                if cols:
                    # Write the text of every cell in this row
                    fout.writerow([x.text for x in cols])
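
The script above targets Python 2 and BeautifulSoup 3. For readers on Python 3 who prefer to avoid third-party packages, the same idea (walk every `table`, then every `tr`, then collect the `td`/`th` text) can be sketched with the standard library's `html.parser`. This is not the answer author's code: the names `TableExtractor` and `tables_to_csv` are invented for this example, and the sketch assumes tables are not nested inside one another.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # one list of rows per table seen so far
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:  # skip rows without any cells
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._end_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()  # flush a row left open by truncated input


def tables_to_csv(html_text):
    """Return one CSV-formatted string per table found in html_text."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    results = []
    for rows in parser.tables:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        results.append(buf.getvalue())
    return results
```

Each returned string can then be written to its own file, for example one `.csv` per table as the original script does.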

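【Comments】:

Very useful indeed. I ran into some errors such as "UnicodeEncodeError: 'ascii' codec can't encode character at special name..."; however, I was able to fix them by adding the following lines at the top of the file: import sys; reload(sys); sys.setdefaultencoding('utf-8'). The accepted answer to this question is what I actually did and am suggesting in this comment.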
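
The `sys.setdefaultencoding` workaround in this comment applies to Python 2 only; Python 3 has no builtin `reload` and no `sys.setdefaultencoding`. On Python 3 the usual way to avoid this kind of encoding error is to name the encoding when opening the output file. A minimal sketch, where the helper name `write_rows_utf8` is hypothetical:

```python
import csv


def write_rows_utf8(path, rows):
    """Write rows (lists of str) to path as UTF-8 encoded CSV."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

With the encoding stated explicitly, cells that contain non-ASCII text are written as UTF-8 regardless of the platform's default encoding.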