python提取网页表格并保存为csv

1.参考

表格标签

表格	描述
<table>	定义表格
<caption>	定义表格标题。
<th>	定义表格的表头。
<tr>	定义表格的行。
<td>	定义表格单元。
<thead>	定义表格的页眉。
<tbody>	定义表格的主体。
<tfoot>	定义表格的页脚。
<col>	定义用于表格列的属性。
<colgroup>	定义表格列的组。

表格元素定位

参看网页源代码并没有 thead 和 tbody。。。

<table class="wikitable sortable" style="text-align: center; font-size: 85%; width: auto; table-layout: fixed;">
　　<caption>List of text editors</caption>
　　<tr>
　　　　<th style="width: 12em">Name</th>
　　　　<th>Creator</th>
　　　　<th>First public release</th>
　　　　<th data-sort-type="number">Latest stable version</th>
　　　　<th>Latest Release Date</th>
　　　　<th><a href="/wiki/Programming_language" title="Programming language">Programming language</a></th>
　　　　<th data-sort-type="currency">Cost (<a href="/wiki/United_States_dollar" title="United States dollar">US$</a>)</th>
　　　　<th><a href="/wiki/Software_license" title="Software license">Software license</a></th>
　　　　<th><a href="/wiki/Free_and_open-source_software" title="Free and open-source software">Open source</a></th>
　　　　<th><a href="/wiki/Command-line_interface" title="Command-line interface">Cli available</a></th>
　　　　<th>Minimum installed size</th>
　　</tr>
　　<tr>
　　　　<th

2.提取表格数据

表格标题可能出现超链接，导致标题被拆分，

也可能不带表格标题。。

<caption>Text editor support for remote file editing over 
　　<a href="/wiki/Lists_of_network_protocols" title="Lists of network protocols">network protocols</a>
</caption>

表格内容换行

<td>
　　<a href="/wiki/Plan_9_from_Bell_Labs" title="Plan 9 from Bell Labs">Plan 9</a>
　　 and 
　　<a href="/wiki/Inferno_(operating_system)" title="Inferno (operating system)">Inferno</a>
</td>

tag 规律

table
thead tr1	th	th	th	th
tbody tr2	td/th	td
tbody tr3	td/th
tbody tr3	td/th

2.1提取所有表格标题列表

filenames = []

for index, table in enumerate(response.xpath(\'//table\')):
    caption = table.xpath(\'string(./caption)\').extract_first()    #提取caption tag里面的所有text，包括子节点内的和文本子节点，这样也行 caption = \'\'.join(table.xpath(\'./caption//text()\').extract())
    filename = str(index+1)+\'_\'+caption if caption else str(index+1)  #xpath 要用到 table 计数，从[1]开始
    filenames.append(re.sub(r\'[^\w\s()]\',\'\',filename))    #移除特殊符号


In [233]: filenames
Out[233]:
[u\'1_List of text editors\',
 u\'2_Text editor support for various operating systems\',
 u\'3_Available languages for the UI\',
 u\'4_Text editor support for common document interfaces\',
 u\'5_Text editor support for basic editing features\',
 u\'6_Text editor support for programming features (see source code editor)\',
 u\'7_Text editor support for other programming features\',
 \'8\',
 u\'9_Text editor support for key bindings\',
 u\'10_Text editor support for remote file editing over network protocols\',
 u\'11_Text editor support for some of the most common character encodings\',
 u\'12_Right to left (RTL)  bidirectional (bidi) support\',
 u\'13_Support for newline characters in line endings\']

2.2每个表格分别写入csv文件

for index, filename in enumerate(filenames):
    print filename
    with open(\'%s.csv\'%filename,\'wb\') as fp:
        writer = csv.writer(fp)
        for tr in response.xpath(\'//table[%s]/tr\'%(index+1)):
            writer.writerow([i.xpath(\'string(.)\').extract_first().replace(u\'\xa0\', u\' \').strip().encode(\'utf-8\',\'replace\') for i in tr.xpath(\'./*\')])  #xpath组合，限定 tag 范围，tr.xpath(\'./th | ./td\')

代码处理 .replace(u\'\xa0\', u\' \')

HTML转义字符&npsp；表示non-breaking space，unicode编码为u\'\xa0\',超出gbk编码范围？

使用 \'w\' 写csv文件，会出现如下问题，使用\'wb\' 即可解决问题

【已解决】Python中通过csv的writerow输出的内容有多余的空行 – 在路上

所有表格写入同一excel文件的不同工作表 sheet，需要使用xlwt

python ︰创建 excel 工作簿和倾倒 csv 文件作为工作表