(Python 3) - Regex issue not returning matches

【Question title】: (Python 3) - Regex issue not returning matches
【Posted】: 2021-08-15 15:35:17
【Question】:

Some background: I'm trying to scrape pastebin's archive page and get only the paste IDs. An ID is 8 characters long, and an example paste link looks like this: "https://pastebin.com/A8XGWYBu"

The code I have so far is able to get all the data from the tags, but it also retrieves unnecessary information.

import requests
import re
from bs4 import BeautifulSoup

def get_recent_id():
    
    URL = requests.get('https://pastebin.com/archive', verify=False)

    href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"

    soup = BeautifulSoup(URL.content, 'html.parser')
    pastes = soup.find_all('a')

    # Works good here
    # prints the necessary things using the regex above
    pastes_findall = re.findall(href_regex, str(pastes))

    try:
        for id, t in pastes_findall:
            output = f"{t} -> {id}"
            get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'

            final = re.findall(get_valid, output)
            print(final)
    except IndexError:
        pass

get_recent_id()

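Where it breaks is the regex inside the try statement. Instead of returning the information I expect, it returns empty [] brackets.

Sample output when using the regex inside the try statement: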

[]
[]
[]
[]
...

I have tested the regex on regex101, and it works fine against the value of the output variable.

The output I'm trying to achieve should contain only the title and the paste ID, and should look like this:

blood sword v1.0 -> cvWdRuaV
lab2 -> eRJY9YAb
example 210526a -> A2sv2shx
2021-05-26_stats.json -> wjsmucFF
2021-05-25_stats.json -> TsXrW7ex
Flake#5595 (466999758096039936) RD -> q8tHsgMz
Untitled -> akrSbCyT
...

I'm not sure why I get nothing back from the output when regex101 clearly shows matches in both groups. If anyone is able to help, I'd really appreciate it!

Thanks!

【Question comments】:

Parsing HTML with regular expressions is considered bad form - use an HTML parser: THE PONY HE COMES
Do you mean relying on BeautifulSoup alone?

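Tags: python python-3.x regex web-scraping pastebin

【Solution 1】:

You can achieve the desired output with fewer lines of code. Make sure your bs4 version is up to date, or at least >= 4.7.0, so that it supports the pseudo CSS selectors I've used in the script.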

import requests
from bs4 import BeautifulSoup

link = 'https://pastebin.com/archive'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.maintable tr:has(> td > a[href]) > td:nth-of-type(1) > a"):
        title = item.text
        _id = item.get("href").lstrip("/")
        print(title," -> ",_id)

Output at the time of writing (truncated):

new_meta_format  ->  JjMxWDzh
Paste Ping  ->  bH54QCb9
Untitled  ->  EEMQigvX
free checked credit cards  ->  b6LE4e78
Untitled  ->  wJA8Axbb
Untitled  ->  fFFrEJnv
Untitled  ->  A8XGWYBu
Ejercicio01  ->  CqP4grhP
Ejercicio01  ->  nhxM8Tca
Untitled  ->  8Y485jwG
f_get_product_balance_stock_exclude_reserved  ->  hc64MsgH
in_product_balance_stock_reserved  ->  ZGXgRWKQ
My Log File  ->  24TnZK2F
Untitled  ->  tvbwuWkL
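
The selector above can be tried offline before pointing it at the live site. The following is a minimal sketch, assuming a hand-written table that only loosely imitates the archive page; `SAMPLE_HTML` and `extract_pastes` are hypothetical names for illustration, and the real page markup may differ. It uses `html.parser` so that `lxml` does not need to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the archive table; the real markup may differ.
SAMPLE_HTML = """
<table class="maintable">
  <tr><th>Name / Title</th><th>Posted</th><th>Syntax</th></tr>
  <tr>
    <td><a href="/JjMxWDzh">new_meta_format</a></td>
    <td>5 sec ago</td>
    <td><a href="/archive/json">JSON</a></td>
  </tr>
  <tr>
    <td><a href="/bH54QCb9">Paste Ping</a></td>
    <td>10 sec ago</td>
    <td>-</td>
  </tr>
</table>
"""

def extract_pastes(html):
    """Return (title, paste_id) pairs from the first cell of each data row."""
    soup = BeautifulSoup(html, "html.parser")
    # :has(> td > a[href]) keeps only rows that contain a linked cell,
    # which skips the header row made of <th> cells.
    selector = "table.maintable tr:has(> td > a[href]) > td:nth-of-type(1) > a"
    return [
        (a.get_text(strip=True), a["href"].lstrip("/"))
        for a in soup.select(selector)
    ]

for title, paste_id in extract_pastes(SAMPLE_HTML):
    print(f"{title} -> {paste_id}")
```

Because `td:nth-of-type(1)` restricts the match to the first cell, the syntax link in the third cell of the first row is not picked up.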

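【Comments】:

【Solution 2】:

I don't think you need a regex here. You can take the href value of each element in pastes, strip the / characters, and then build the output value by appending -> and the text of the a element: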

[i["href"].strip('/') + " -> " + i.get_text() for i in pastes]

The whole function looks like this:

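import requests
from bs4 import BeautifulSoup

def get_recent_id():
    # verify=False (kept from the question) disables TLS certificate checks
    URL = requests.get('https://pastebin.com/archive', verify=False)
    soup = BeautifulSoup(URL.content, 'html.parser')
    # href=True skips <a> tags without an href, which would raise KeyError below
    pastes = soup.find_all('a', href=True)
    return [i["href"].strip('/') + " -> " + i.get_text() for i in pastes]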
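
Since `find_all('a')` returns every link on the page, the list above also contains navigation links. One possible way to narrow it down, sketched here under the assumption from the question that a paste ID is exactly 8 letters or digits, is to test each stripped href with `re.fullmatch`. The names `PASTE_ID`, `filter_paste_links`, and `SAMPLE_HTML` are hypothetical, and the sample markup is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Assumption: a paste ID is exactly 8 ASCII letters or digits.
PASTE_ID = re.compile(r"[A-Za-z0-9]{8}")

def filter_paste_links(html):
    """Return 'title -> id' strings for links whose href looks like a paste ID."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a in soup.find_all("a", href=True):  # skip anchors without an href
        candidate = a["href"].strip("/")
        # fullmatch requires the whole string to be 8 characters, not just a prefix
        if PASTE_ID.fullmatch(candidate):
            results.append(f"{a.get_text(strip=True)} -> {candidate}")
    return results

# Invented sample: three navigation links, one anchor without href, two pastes.
SAMPLE_HTML = """
<a href="/">pastebin</a>
<a href="/archive">archive</a>
<a href="/doc_api">API</a>
<a name="top">no href here</a>
<a href="/cvWdRuaV">blood sword v1.0</a>
<a href="/eRJY9YAb">lab2</a>
"""

print(filter_paste_links(SAMPLE_HTML))
# ['blood sword v1.0 -> cvWdRuaV', 'lab2 -> eRJY9YAb']
```

This check is only a heuristic: any site page whose path happens to be 8 letters long would pass it too, so restricting the search to the archive table is likely the more robust option.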

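【Comments】:

【Solution 3】:

So after playing around a bit, I was able to find the answer to my problem.

@Wiktor, your answer was good, but it still returned some results I didn't need.

The final code looks like this: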

import re
import requests
from bs4 import BeautifulSoup

def get_recent_id():
    
    URL = requests.get('https://pastebin.com/archive', verify=False)

    href_regex = r"<a href=\"\/(.*?)\">(.*?)<\/a>"

    soup = BeautifulSoup(URL.content, 'html.parser')
    pastes = soup.find_all('a')
    
    # Works until here
    # prints the necessary things using the regex above
    pastes_findall = re.findall(href_regex, str(pastes))

    try:
        for id, t in pastes_findall:
            output = f"{t} -> {id}"
            get_valid = r'(.*?) \-\> ([A-Za-z\d+]{8})'
            final = re.search(get_valid, output)
            
            if final is None:
                pass
            else:
                final = final.group(0)
                print(final)
            
    except IndexError:
        pass

get_recent_id()

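So essentially, I had some other stuff in my local output variable that I didn't show in my post. Once I removed it, what I originally posted worked (should have tried that sooner...).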

Then I got a "NoneType" error, but a simple if statement fixed that too.
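
The "NoneType" error comes from the fact that `re.search` returns `None` when nothing matches, so calling `.group()` on the result fails. A minimal sketch of the guard, using the pattern from the question and a hypothetical helper named `format_match`:

```python
import re

# Pattern from the question. Inside [...], '+' is a literal plus sign,
# not a quantifier, so it is also accepted as an ID character here.
GET_VALID = r'(.*?) \-\> ([A-Za-z\d+]{8})'

def format_match(output):
    """Return the matched 'title -> id' text, or None if there is no match."""
    match = re.search(GET_VALID, output)
    if match is None:
        # Without this check, match.group(0) would raise AttributeError
        return None
    return match.group(0)

print(format_match("lab2 -> eRJY9YAb"))    # lab2 -> eRJY9YAb
print(format_match("archive -> archive"))  # None (only 7 characters after the arrow)
```

Note that the pattern is not anchored at the end, so a longer path such as `languages` would still match on its first 8 characters. Appending `$` to the pattern would be one way to tighten it.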

In the end I got the desired output, shown below:

$ ./tool.py

Paste Ping -> bH54QCb9
Untitled -> EEMQigvX
free checked credit cards -> b6LE4e78
Untitled -> wJA8Axbb
Untitled -> fFFrEJnv
Untitled -> A8XGWYBu
Ejercicio01 -> CqP4grhP
Ejercicio01 -> nhxM8Tca
Untitled -> 8Y485jwG
f_get_product_balance_stock_exclude_reserved -> hc64MsgH
in_product_balance_stock_reserved -> ZGXgRWKQ
My Log File -> 24TnZK2F
Untitled -> tvbwuWkL
Woocommerce Minimum Order Amount -> j35Hg0Ci
...

Thanks for the answers!

【Comments】: