Python regex match but not include characters (Beautiful Soup): answers

【Question title】: Python regex match but not include characters beautiful soup
【Posted】: 2017-04-27 10:53:32
【Question】:

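I'm using Beautiful Soup and requests to scrape information from a web page. I'm trying to get a list of book titles that contains only the titles themselves, without the text title= in front of them.

sample text = 'a bunch of junk title=book1 more junk text title=book2'

What I get is titleList = ['title=book1', 'title=book2']

What I want is titleList = ['book1', 'book2']

I tried matching groups, which does separate title= from book1, but I'm not sure how to append only group(2) to the list.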

titleList = []

def getTitle(productUrl):

  res = requests.get(productUrl, headers=headers)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'lxml')
  title = re.compile(r'title=[A-Za-z0-9]+')
  findTitle = title.findall(res.text.strip())
  titleList.append(findTitle)

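【Comments】:

Can you post an example of the HTML you are working with?
Is this really a BeautifulSoup question? You're not actually using the soup object.
The question is, why are you using BeautifulSoup?

Tags: python regex python-2.7 beautifulsoup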

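【Answer 1】:

Your regex has no capture group. When the pattern contains exactly one group, findall returns only the text matched by that group instead of the whole match. Also note that findall returns a list, so you should use extend rather than append (unless you want titleList to be a list of lists).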

title = re.compile(r'title=([A-Za-z0-9]+)')   # note parenthesis
findTitle = title.findall(res.text.strip())
titleList.extend(findTitle)   # using extend and not append

A self-contained example:

import re

titleList = []
text = 'a bunch of junk title=book1 more junk text title=book2'

title = re.compile(r'title=([A-Za-z0-9]+)') 
findTitle = title.findall(text.strip())
titleList.extend(findTitle) 
print(titleList)  # ['book1', 'book2']
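
To make the two points above concrete, here is a small sketch comparing findall with and without a capture group, and append with extend. The variable names (whole, names, appended, extended) are only illustrative and do not come from the question's code.

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

# Without a capture group, findall returns the whole match.
whole = re.findall(r'title=[A-Za-z0-9]+', text)

# With exactly one capture group, findall returns only the group's text.
names = re.findall(r'title=([A-Za-z0-9]+)', text)

appended = []
appended.append(names)   # adds the list itself as a single element

extended = []
extended.extend(names)   # adds each string individually

print(whole)     # ['title=book1', 'title=book2']
print(names)     # ['book1', 'book2']
print(appended)  # [['book1', 'book2']]
print(extended)  # ['book1', 'book2']
```

Since getTitle is presumably called once per product URL, extend keeps titleList flat across calls, whereas append would add one nested list per page.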

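【Comments】:

Thanks so much. None of my searching turned up the extend option, and I also just added the capture group. I just needed a second pair of eyes.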

【Answer 2】:

Simply use re.findall with a capture group:

>>> import re
>>> text = 'a bunch of junk title=book1 more junk text title=book2'
>>> re.findall(r'title=(\S+)', text)
['book1', 'book2']
>>>
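
A few variations that may be worth comparing, shown as a rough sketch. The sample string below is made up to include punctuation and is not from the question. It illustrates that \S+ captures everything up to the next whitespace, while [A-Za-z0-9]+ stops at the first non-alphanumeric character. It also shows two other ways to get the title without the title= prefix: finditer with group(1) (the pattern has a single group, so it is group 1 rather than group 2), and a fixed-width lookbehind.

```python
import re

# Hypothetical sample with punctuation after and inside the titles.
text = 'junk title=book1, more junk title=book-2 end'

# finditer yields match objects, so the group can be picked explicitly.
pattern = re.compile(r'title=([A-Za-z0-9]+)')
by_group = [m.group(1) for m in pattern.finditer(text)]

# A lookbehind requires 'title=' before the match but does not consume it,
# so no capture group is needed.
lookbehind = re.findall(r'(?<=title=)[A-Za-z0-9]+', text)

# \S+ runs to the next whitespace, so it keeps the comma and the hyphen.
non_space = re.findall(r'title=(\S+)', text)

print(by_group)    # ['book1', 'book']
print(lookbehind)  # ['book1', 'book']
print(non_space)   # ['book1,', 'book-2']
```

Which character class is right depends on what the real titles look like in the page source, which the question does not show.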
