Parsing HTML to CSV data in Python: answers

【Question title】: html to csv data parsing in python
【Posted】: 2017-11-04 15:37:45
【Question】:

I am using Python to convert HTML tables to CSV format.

Code:

from BeautifulSoup import BeautifulSoup
import sys
import csv
import argparse

reload(sys)
sys.setdefaultencoding('utf-8')
parser = argparse.ArgumentParser(description='Reads in an HTML and attempts to convert all tables into CSV files.')
parser.add_argument('--delimiter', '-d', action='store', default=',',help="Character with which to separate CSV columns")
parser.add_argument('--quotechar', '-q', action='store', default='"',help="Character within which to nest CSV text")
parser.add_argument('filename',nargs="?",help="HTML file from which to extract tables")
args = parser.parse_args()

if sys.stdin.isatty() and not args.filename:
  parser.print_help()
  sys.exit(-1)
elif not sys.stdin.isatty():
  args.filename = sys.stdin
else:
  args.filename = open(sys.argv[1],'r')

print "Opening file"
fin  = args.filename.read()

print "Parsing file"
soup = BeautifulSoup(fin,convertEntities=BeautifulSoup.HTML_ENTITIES)

print "Preemptively removing unnecessary tags"
[s.extract() for s in soup('script')]

print "CSVing file"
tablecount = -1
for table in soup.findAll("table"):
  tablecount += 1
  print "Processing Table #%d" % (tablecount)
  with open(sys.argv[1]+str(tablecount)+'.csv', 'wb') as csvfile:
        fout = csv.writer(csvfile, delimiter=args.delimiter, quotechar=args.quotechar, quoting=csv.QUOTE_MINIMAL)
        for row in table.findAll('tr'):
          cols = row.findAll(['td','th'])
          if cols:
            cols = [x.text for x in cols]
            fout.writerow(cols)

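Here, I don't want to pass the file name explicitly as a sys argument; I would rather hard-code it in the script. The current usage is: python html2csv.py test.html.

Is there any way to do that?

Error: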

File "html2csv.py", line 17, in <module>
if sys.stdin.isatty() and not args.filename:
AttributeError: 'Namespace' object has no attribute 'filename'

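【Question comments】:

You want to leave "test.html" out of the script call, and have that file name used as the default?
Oh, did I not say where it should go?
Sorry :) I meant to ask what exactly you are trying to achieve. What should be hard-coded, just the file name test.html? Which operating system are you developing on?
The operating system is Windows, and yes, I don't want to pass the file (test.html) explicitly as a sys argument when running python html2csv.py; it should be hard-coded in html2csv.py instead.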

Tags: python python-2.7 python-3.x csv beautifulsoup

【Solution 1】:

It looks like an open file object is being stored in the variable args.filename.

I would try (quick and very dirty) adding the line

args.filename = open('test.html', 'r')

just before

print "Opening file"
fin  = args.filename.read()

The parser may complain that you did not pass a file name on the command line, but maybe it won't. Give it a try :)
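An alternative to overwriting args.filename by hand is to let argparse supply the default itself. The following is only a sketch, assuming the goal from the question comments (test.html used when nothing is passed); the names DEFAULT_FILENAME and build_parser are made up for illustration and do not appear in the original script.

```python
import argparse

# Hypothetical constant: the file name the asker wants baked into the script.
DEFAULT_FILENAME = "test.html"


def build_parser():
    """Build the same parser as in the question, but with a default file name."""
    parser = argparse.ArgumentParser(
        description="Reads in an HTML and attempts to convert all tables into CSV files.")
    parser.add_argument("--delimiter", "-d", default=",",
                        help="Character with which to separate CSV columns")
    parser.add_argument("--quotechar", "-q", default='"',
                        help="Character within which to nest CSV text")
    # nargs="?" makes the positional optional; default is used when it is omitted.
    parser.add_argument("filename", nargs="?", default=DEFAULT_FILENAME,
                        help="HTML file from which to extract tables")
    return parser


# No file name on the command line: the default is used.
print(build_parser().parse_args([]).filename)
# A file name on the command line still overrides the default.
print(build_parser().parse_args(["other.html"]).filename)
```

With this approach the rest of the script would have to open args.filename instead of sys.argv[1], because sys.argv[1] does not exist when the script is run without arguments. The same applies to the output name built from sys.argv[1] in the CSV loop.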

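【Comments】:

It still asks for the test.html file as an argument :-(
Hm... parser is a fairly complex object that I don't know very well. Try deleting the line with parser.add_argument('filename'... and maybe also else: args.filename = open(sys.argv[1],'r')
How does it "ask" for the argument? By the way, the error message would also help to understand it.
I've added the resulting error to the question above. I tried the steps you recommended.
Oh yes, of course! You deleted the add_argument('filename' line, so the if can no longer be evaluated, I see. Just delete the whole if-block, not only the final else but also the five lines before it. It should work now ^^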
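For reference, here is roughly what the script might look like once the whole if-block is gone and the file name is hard-coded. This is a hedged sketch rather than the asker's final code: it targets Python 3 and swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages. INPUT_FILENAME, TableExtractor and html_tables_to_csv are hypothetical names, and the parser assumes explicit closing tags and no nested tables.

```python
import csv
import os
from html.parser import HTMLParser

INPUT_FILENAME = "test.html"  # hard-coded instead of read from sys.argv


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell texts."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None        # cells of the <tr> being read, or None
        self._cell = None       # text chunks of the <td>/<th> being read, or None
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag in ("td", "th") and self._cell is not None:
            # Surrounding whitespace is trimmed here; the original kept it.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without cells, like `if cols:` in the question
                self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Script bodies are ignored, mirroring the soup('script') removal.
        if self._cell is not None and not self._in_script:
            self._cell.append(data)


def html_tables_to_csv(html_path, delimiter=",", quotechar='"'):
    """Write one CSV per table next to html_path and return the paths written."""
    extractor = TableExtractor()
    with open(html_path, encoding="utf-8") as html_file:
        extractor.feed(html_file.read())
    extractor.close()

    written = []
    for index, rows in enumerate(extractor.tables):
        out_path = "%s%d.csv" % (html_path, index)  # test.html0.csv, test.html1.csv, ...
        with open(out_path, "w", newline="", encoding="utf-8") as csv_file:
            writer = csv.writer(csv_file, delimiter=delimiter,
                                quotechar=quotechar, quoting=csv.QUOTE_MINIMAL)
            writer.writerows(rows)
        written.append(out_path)
    return written


if __name__ == "__main__":
    if os.path.exists(INPUT_FILENAME):
        for path in html_tables_to_csv(INPUT_FILENAME):
            print("Wrote", path)
    else:
        print("Input file not found:", INPUT_FILENAME)
```

Running python html2csv.py with no arguments would then read test.html from the current directory. The output files keep the naming scheme from the question (input name plus table index plus .csv).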