Unable to scrape certain values of a website using regex: answers

【Question title】: Unable to scrape certain values of a website using regex
【Posted】: 2014-07-03 12:09:24
【Question】:

I have been trying to scrape information from a specific set of p tags on a website, but I'm having a lot of trouble.

My code is as follows:

import urllib   
import re

def scrape():
        url = "https://www.theWebsite.com"

        statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
        htmlfile = urllib.urlopen(url)
        htmltext = htmlfile.read()

        status = re.findall(statusText,htmltext)

        print("Status: " + str(status))
scrape()

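Unfortunately, this only returns: "Status: []"

That said, I don't know what I'm doing wrong, because when I test against the same website I can instead use

statusText = re.compile('<a href="/about">(.+?)</a>')

and get what I want: "Status: ['About', 'About']"

Does anyone know what I can do to get the information inside the div tag? Or, more specifically, the set of p tags that the div contains? I have tried plugging in just about every value I could think of and got nowhere. After searching Google, YouTube, and SO, I'm running out of ideas.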

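【Comments】:

Did you first check that htmltext is not empty?
@zx81 I don't see how that would differ between matching an a tag and matching a div tag. Wouldn't htmltext hold the data in both cases?
Is it absolutely necessary to use regex? Try looking at the BeautifulSoup or Scrapy libraries for Python.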
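
One possible cause, offered as an assumption because the real page source is not shown in the thread: in Python's re module, '.' does not match a newline unless re.DOTALL is set. A link usually fits on one line, while a div holding several p tags usually spans multiple lines, so the same pattern shape can work for the a tag and fail for the div. The sketch below uses a made-up HTML fragment (the html string and its contents are hypothetical) to show the difference. It is written with print() calls so it also runs on Python 3.

```python
import re

# Hypothetical page fragment: the link sits on one line,
# while the div's contents span several lines.
html = """<a href="/about">About</a>
<div id="holdsThePtagsIwant">
<p>first</p>
<p>second</p>
</div>"""

# Without re.DOTALL, '.' stops at newlines, so the multi-line div never matches.
single_line = re.compile(r'<div id="holdsThePtagsIwant">(.+?)</div>')

# With re.DOTALL, '.' also matches newlines, so the div body is captured.
multi_line = re.compile(r'<div id="holdsThePtagsIwant">(.+?)</div>', re.DOTALL)

no_match = single_line.findall(html)
div_bodies = multi_line.findall(html)

# Pull the text of each p tag out of the captured div body.
paragraphs = re.findall(r'<p>(.*?)</p>', div_bodies[0], re.DOTALL)

print(no_match)
print(paragraphs)
```

Even with re.DOTALL, the non-greedy match stops at the first </div>, so a div nested inside the target would cut the capture short. That fragility is one reason the comments point toward an HTML parser instead.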

Tags: python regex python-2.7 web-scraping

【Solution 1】:

I use BeautifulSoup to extract information between HTML tags. Suppose you want to extract a div like this: <div class='article_body' itemprop='articleBody'>...</div>. You can then use BeautifulSoup to extract it as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(htmltext, 'html.parser')  # create the bs object; htmltext is the page's HTML as a string
ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})

See also the official bs4 documentation.

As an example, I edited your code to extract the div from a Bloomberg article. You can adapt it to your own needs.

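import urllib
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    # Download the page (Python 2 urllib API, as in the question)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # Parse the HTML; naming the parser explicitly keeps results consistent across machines
    soup = BeautifulSoup(htmltext, 'html.parser')
    # Find the first div whose class and itemprop attributes both match
    ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
    print(ans)
scrape()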
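
Applied to the question's own case, the same approach might look like the sketch below. The HTML fragment is invented for illustration, and the sketch assumes the target div can be identified by its id attribute, as in the question's regex. It requires the bs4 package to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
html = """
<div id="holdsThePtagsIwant">
  <p>first status</p>
  <p>second status</p>
</div>
<div id="other"><p>ignored</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the container div by its id; find() returns None if it is absent.
container = soup.find("div", id="holdsThePtagsIwant")

# Collect the text of every p tag inside that div only.
statuses = []
if container is not None:
    statuses = [p.get_text(strip=True) for p in container.find_all("p")]

print(statuses)
```

Checking for None before calling find_all avoids an AttributeError when the page layout changes and the div is no longer present.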

You can download BeautifulSoup separately (it is published on PyPI as beautifulsoup4).

P.S.: I use Scrapy and BeautifulSoup for web scraping, and I am happy with both.

【Comments】: