Regex: find a string, then look behind

【Question title】: regex: find a string, then look behind
【Posted】: 2015-03-15 15:56:16
【Question】:

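I'm new to regular expressions, so I hope this isn't too obvious a question.

I'm looking for the neighborhood in the HTML of Craigslist apartment listings. The neighborhood is listed like this: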

(castro / upper market)
</h2>

Here is an example of the HTML:

<a class="backup" disabled="disabled">&#9650;</a>
<a class="next" disabled="disabled"> next &#9654;</a>
</span>

</section>

<h2 class="postingtitle">
<span class="star"></span>
&#x0024;5224 / 2br - Stunning Furnished 2BR with Hardwwod Floors &amp; Newly  renovated Kitchen (pacific heights)
</h2>
<section class="userbody">
<figure class="iw">


<div class="slidernav">
    <button class="sliderback">&lt;</button>
    <span class="sliderinfo"></span>
    <button class="sliderforward">&gt;</button>

This should find all the different neighborhoods, but it takes far too long on a whole HTML page:

\w+\s?(\/)?\s?\w+\s?(\/)?\s?\w+\s?(\/)?\s?\w+\)\n<\/h2>

# \w+ to find the word 
# \s?(\/)?\s? for a space or space, forward slash, space
# \n<\/h2> because </h2> is uniquely next to the neighborhood in the html

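Is there a way to find

</h2>

and then look behind it for the neighborhood text string?

Any help, or a pointer in the right direction, is much appreciated.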

【Question comments】:

Using regular expressions on HTML is not a good idea (more here). Use a proper tool, such as scrapy.org.

Tags: python html regex web-scraping html-parsing

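【Solution 1】:

Use an HTML parser to extract the title (the contents of the h2 tag), then use a regular expression to extract the neighborhood (the text inside the parentheses).

Example (using the BeautifulSoup HTML parser):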

import re

import requests
from bs4 import BeautifulSoup

response = requests.get('http://sfbay.craigslist.org/sfc/apa/4849806764.html')
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(response.content, 'html.parser')

# Capture the text between a "(" and the ")" that ends the title.
pattern = re.compile(r'\((.*?)\)$')
text = soup.find('h2', class_='postingtitle').text.strip()
print(pattern.search(text).group(1))

This prints pacific heights.

Note the \((.*?)\)$ regular expression: it captures everything inside the parentheses located right before the end of the string.
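To see the pattern in isolation, here is a minimal sketch that runs it against the title text from the sample HTML in the question, without any network access. The `title` string (with HTML entities decoded by hand) and the `extract_neighborhood` helper name are assumptions made up for illustration.

```python
import re

# Anchored at the end of the string: capture the text between a "("
# and the ")" that is the last character of the title.
PATTERN = re.compile(r'\((.*?)\)$')


def extract_neighborhood(title):
    """Return the text inside the trailing parentheses, or None if absent."""
    match = PATTERN.search(title.strip())
    return match.group(1) if match else None


# Title text copied from the sample HTML in the question.
title = ("$5224 / 2br - Stunning Furnished 2BR with Hardwwod Floors & "
         "Newly  renovated Kitchen (pacific heights)")
print(extract_neighborhood(title))
```

If a title contains an earlier opening parenthesis, the lazy `.*?` still starts matching at the first one, so `[^()]*` in place of `.*?` is a stricter alternative.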

With the Scrapy web-scraping framework you can solve it in one line, since Selectors have built-in support for regular expressions. Example from the Scrapy shell:

$ scrapy shell http://sfbay.craigslist.org/sfc/apa/4849806764.html
In [1]: response.xpath('//h2[@class="postingtitle"]/text()').re(r'\((.*?)\)$')[0]
Out[1]: u'pacific heights'

Also see the hundred reasons why you should not use regular expressions to parse HTML:

RegEx match open tags except XHTML self-contained tags

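【Comments】:

You're probably right. Currently it takes about 6 seconds to go through roughly 5000 listings, extracting about 20 features so far. When I have time to redo all of this, I'll look into it more closely.
@DavidFeldman Sure, start looking into Scrapy, and organize and modularize your code with a Scrapy project containing spiders, items and pipelines.
Actually, parsing HTML and extracting content from a (web) page are not quite the same thing. While you should not parse HTML with regular expressions, for this particular case I bet that extraction with a well-designed RE could be orders of magnitude faster.
@fnl That only covers the speed of extracting text from a single page. What about readability, complexity, reliability and so on? There are dedicated tools made specifically for parsing these formats, tested and used by a large number of users and proven to work.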
@alecxe Hehe, sure. I appreciate your suddenly Husserlian view on the matter :)

【Solution 2】:

Assuming your HTML is stored in a variable named page, how about this pattern?

re.findall(r"\(([^()]+)\)\n</h2>", page)

For good measure, allow extra whitespace as well:

re.findall(r"\(([^()]+)\)\s*\n\s*</h2>", page)

Finally, precompile the pattern:

neighborhoods = re.compile(r"\(([^()]+)\)\s*\n\s*</h2>")

# somewhere else, for each page 
for nh in neighborhoods.findall(page):
    print(nh)

For your example HTML page, this prints the only neighborhood in the listing:

pacific heights

If there is only one location per page, re.search() will be faster. Keep in mind that search() returns an intermediate match object, not the string itself.
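A small sketch of the re.search() variant, assuming one neighborhood per page. The `page` string is a shortened, made-up stand-in for the HTML in the question, and `first_neighborhood` is a hypothetical helper name.

```python
import re

# The whitespace-tolerant pattern from this answer, written as a raw string.
NEIGHBORHOOD = re.compile(r"\(([^()]+)\)\s*\n\s*</h2>")


def first_neighborhood(page):
    """Return the first neighborhood found in the page, or None."""
    match = NEIGHBORHOOD.search(page)  # a match object, or None
    # group(1) is the text captured between the parentheses.
    return match.group(1) if match else None


page = (
    '<h2 class="postingtitle">\n'
    '<span class="star"></span>\n'
    '&#x0024;5224 / 2br - Newly renovated Kitchen (pacific heights)\n'
    '</h2>\n'
)
print(first_neighborhood(page))
```

Checking for None before calling group(1) avoids an AttributeError on pages where the pattern does not match.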

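【Comments】:

Thanks, but this ends up grabbing too much text on some of them.
That means you want to avoid earlier opening parentheses. I'm fixing the pattern to account for that.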

【Solution 3】:

How about using string.find to locate the index of the tag, then slicing backwards from that index?

 In [1]: import re

 In [2]: c = "123456</h2>7890"

 In [3]: x = c.find("</h2>")

 In [4]: print(c[x-6:x])
 123456
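Applied to the neighborhood problem, the same idea could look like the sketch below: find </h2>, then search backwards from that index for the nearest parentheses with str.rfind. The `page` string and the `neighborhood_before_h2` name are assumptions for illustration, and the sketch only handles the first </h2> in the page.

```python
def neighborhood_before_h2(page):
    """Return the text in the last (...) before the first </h2>, or None."""
    end_tag = page.find("</h2>")
    if end_tag == -1:
        return None
    # Search backwards from the closing tag for ")" and then for "(".
    close = page.rfind(")", 0, end_tag)
    if close == -1:
        return None
    open_ = page.rfind("(", 0, close)
    if open_ == -1:
        return None
    return page[open_ + 1:close]


page = "2br - Newly renovated Kitchen (pacific heights)\n</h2>\n<section>"
print(neighborhood_before_h2(page))
```

Unlike the fixed six-character slice, this does not need to know the length of the neighborhood name in advance.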

【Comments】: