Extracting a table from a webpage with regex: answers

【Question title】: Extracting a table from webpage with regex
【Posted】: 2015-01-05 10:44:40
【Question】:

Looking at the HTML source, I can clearly see that the region I want is structured like this:

[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]

So I wrote this small snippet:

import urllib2,re
from lxml import html
response = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html')

content = response.read()

print re.match(r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)",content)

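The page content is fetched correctly without any problem. However, the regex match always returns None (the print here is only for debugging).

Given the structure of the page, I don't understand why there is no match. I expected three groups, with the second being the table contents.

【Comments】:

Please use an HTML parser.
stackoverflow.com/questions/1732348/…
@sshashank124 That's not an option in this case.
Does your assignment specifically require you to use regex?
@sshashank124 Yes, I need to demonstrate a regex that extracts the table. I can get regexes to work on other strings, but I can't figure out why this one fails on this large string.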

Tags: python html regex web-scraping html-table

【Solution 1】:

By default, . does not match a newline. You need to pass the dot-all flag to make it do so:

re.match(..., content, re.DOTALL)

Here is a demonstration:

>>> import re
>>> content = '''
... [CONTENT BEFORE TABLE]
... <table border="1" cellpadding="6" bordercolor="#000000">
... [IP ADDRESSES AND OTHER INFO]
... </table>
... [CONTENT AFTER TABLE]
... '''
>>> pat = r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)"
>>> re.match(pat, content, re.DOTALL)
<_sre.SRE_Match object at 0x02520520>
>>> re.match(pat, content, re.DOTALL).group(2)
'\n[IP ADDRESSES AND OTHER INFO]\n'
>>>

The dot-all flag can also be activated by using re.S or by placing (?s) at the start of the pattern.
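To make the three spellings concrete, here is a minimal sketch that runs the question's pattern against the same placeholder text used in the demonstration above, first with no flag and then with each form of the dot-all flag. The variable names (`base`, `no_flag`, and so on) are made up for illustration, and the sample string is a stand-in rather than the real page.

```python
import re

# Placeholder input mirroring the structure described in the question
# (not the real page content).
content = """
[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]
"""

base = r'(.*)<table border="1" cellpadding="6" bordercolor="#000000">(.*)</table>(.*)'

# Without the flag, the leading (.*) cannot cross the first newline,
# so re.match() fails at the start of the string.
no_flag = re.match(base, content)

# Three equivalent ways to switch on dot-all mode.
with_dotall = re.match(base, content, re.DOTALL)
with_s = re.match(base, content, re.S)
with_inline = re.match("(?s)" + base, content)

print(no_flag)
for m in (with_dotall, with_s, with_inline):
    print(repr(m.group(2)))
```

Note that group 2 is greedy: if a page contained more than one table, (.*) would run to the last </table>, whereas (.*?) would stop at the first one.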

【Discussion】:

Thanks! I didn't know about DOTALL.

【Solution 2】:

For parsing HTML, I prefer BeautifulSoup:

from bs4 import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://www.nirsoft.net/countryip/za.html').read())
for x in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    print x

For cleaner output, print each row's text as tab-separated values:

for x in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    for y in x:
        try:
            if y.name == 'tr':
                print "\t".join(y.get_text().split())
        except:
            pass  # ignore children that raise (e.g. non-tag nodes)
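The same row-by-row idea can be tried without a network connection or a third-party install. The sketch below is not the BeautifulSoup code above: it swaps in the standard library's html.parser and feeds it a made-up table, so the class name `TableRows`, the sample markup, and the addresses in it are all assumptions for illustration. Unlike the find_all call, it does not filter on the table's attributes.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # cells of the current row, or None outside <tr>
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Append text only while inside a cell of an open row.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Made-up sample markup; the addresses are documentation-range placeholders.
sample = """
<table border="1" cellpadding="6" bordercolor="#000000">
<tr><th>From IP</th><th>To IP</th></tr>
<tr><td>192.0.2.0</td><td>192.0.2.255</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
for row in parser.rows:
    print("\t".join(row))
```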

【Discussion】:

Thanks. I need regex for this, but I'll look into Beautiful Soup; it looks neat.
@Juicy BeautifulSoup is a good tool for parsing HTML pages.