Trying to scrape multiple URLs from one page

【Question title】: Trying to Scrape multiple URLS from one page
【Posted】: 2021-03-23 12:55:12
【Question】:

I am trying to extract information from the election results for the 18 NI constituencies:

http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results

Each of the unique URLs starts like this:

http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/

The selector for the 18 URLs is:

#container > div.two-column-content.clearfix > div > div.right-column.cms > div > ul > li

What I want to start with is a list of the 18 URLs. The list should be clean (i.e. only the actual addresses, no tags etc.).

My code so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver

url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'

response = requests.get(url)
response.status_code

text = requests.get(url).text

soup = BeautifulSoup(text, parser="html5lib")

link_list = []
for a in soup('a'):
    if a.has_attr('href'):
        link_list.append(a)

re_pattern = r"^/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/"

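This is where I get lost, because I need to search for all 18 URLs that start with that pattern (I'm pretty sure the pattern is wrong. Please help!)

The rest of the code: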

import re
good_urls = [url for url in link_list if re.match(re_pattern, url)]

Here I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-f3fbbd3199b1> in <module>
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]

<ipython-input-36-f3fbbd3199b1> in <listcomp>(.0)
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]

~/opt/anaconda3/lib/python3.7/re.py in match(pattern, string, flags)
    173     """Try to apply the pattern at the start of the string, returning
    174     a Match object, or None if no match was found."""
--> 175     return _compile(pattern, flags).match(string)
    176 
    177 def fullmatch(pattern, string, flags=0):

TypeError: expected string or bytes-like object

What should I do differently to get those 18 URLs? Thanks!

【Comments】:

This should help: stackoverflow.com/questions/57417684/…
Sorry, that doesn't help... different application

Tags: python regex web-scraping

【Solution 1】:

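This seems to do the job.

I've removed some unnecessary imports and other things that aren't needed here; add them back if you need them elsewhere.

The error message comes from running a regex comparison on a soup object (a bs4 Tag). `re.match` only accepts a string or bytes, so the tag has to be converted to a string first. This is the same problem discussed in the link @Huzefa posted, so that link is definitely relevant.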
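
To make the cause concrete, here is a minimal sketch that reproduces the error without touching the network. The one-link HTML snippet and the `/Elections-2019/Foyle` path are made-up placeholders, not content taken from the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-link snippet standing in for the real page.
html = '<a href="/Elections-2019/Foyle">Foyle</a>'
tag = BeautifulSoup(html, "html.parser").find("a")

pattern = r".*/Elections-2019/"

# Passing the Tag itself reproduces the TypeError from the question.
try:
    re.match(pattern, tag)
    raised = False
except TypeError:
    raised = True

# Passing a string works: either the attribute value or the tag's markup.
href_match = re.match(pattern, tag["href"])
markup_match = re.match(r'<a href=".*/Elections-2019/', str(tag))

print(raised, bool(href_match), bool(markup_match))
```

Both `tag["href"]` and `str(tag)` are plain strings, which is why either one is accepted where the `Tag` is not.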

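Fixing that still leaves the problem of isolating the right string. I've simplified the regex used for matching, then used a simple string split on " and picked the second item the split produces (which is our URL).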

import re

import requests
from bs4 import BeautifulSoup

url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
# Fetch the page once and parse it.
text = requests.get(url).text
soup = BeautifulSoup(text, "html.parser")

# Match the markup of any anchor whose href contains /Elections-2019/.
re_pattern = r'<a href=".*/Elections-2019/.*'
link_list = []
for a in soup('a'):
    # re.match needs a string, so convert the Tag with str() first.
    if a.has_attr('href') and re.match(re_pattern, str(a)):
        # The href value is the second piece when splitting on double quotes.
        link_list.append(str(a).split('"')[1])
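
One caveat with splitting on " is that it relies on `href` being the first attribute in the tag's markup. As a possible alternative, the sketch below reads the `href` attribute directly and resolves it against the page address with `urljoin`, so relative links come out as full URLs. The function name `constituency_links`, the `marker` parameter, the `example.org` addresses, and the sample HTML are all hypothetical; whether the real page uses relative or absolute links is not something I have checked.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def constituency_links(html, page_url, marker="/Elections-2019/"):
    """Return de-duplicated absolute URLs of <a> tags whose href contains marker."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        # Resolve relative hrefs against the address of the page itself.
        absolute = urljoin(page_url, a["href"])
        # Keep matching links, skipping the page's own URL and repeats.
        if marker in absolute and absolute != page_url and absolute not in links:
            links.append(absolute)
    return links


# Made-up markup: one relative link, one absolute link with href not first,
# one unrelated link, and one anchor without href.
sample_html = """
<ul>
  <li><a href="/Elections/Elections-2019/Belfast-East">Belfast East</a></li>
  <li><a class="x" href="http://www.example.org/Elections/Elections-2019/Foyle">Foyle</a></li>
  <li><a href="/About-us">About</a></li>
  <li><a name="top">no href</a></li>
</ul>
"""
page = "http://www.example.org/Elections/Elections-2019/Results"
result = constituency_links(sample_html, page)
print(result)
```

If other links on the real page also contain the marker, the search could probably be narrowed by starting from `soup.select(...)` with the selector given in the question plus ` a`, instead of scanning every anchor.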

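Hope it serves your purpose; ask if anything is unclear.

【Discussion】:

Sorry I didn't reply sooner, but thanks for your solution! Very neat!
@chefbilby Great, glad to hear it! Don't forget to mark my answer as the accepted answer to help others find the solution.
Just did! Thanks again!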