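【Posted】: 2021-03-23 12:55:12
【Question】:
I am trying to extract information from the election results for the 18 Northern Ireland (NI) constituencies.
Each constituency URL starts like this: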
http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/
The CSS selector for the 18 URLs is:
#container > div.two-column-content.clearfix > div > div.right-column.cms > div > ul > li
To start, I want a list containing the 18 URLs. The list should be clean (that is, only the actual addresses, with no tags or other markup).
My code so far:
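One way to get such a clean list is to reuse the selector above directly. The following is only a sketch: `SAMPLE_HTML`, its `/example/...` hrefs, and the helper name `extract_links` are hypothetical stand-ins for the real page, and it assumes each `li` contains an `a` tag with an `href`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://www.eoni.org.uk"
SELECTOR = (
    "#container > div.two-column-content.clearfix > div > "
    "div.right-column.cms > div > ul > li"
)

# Hypothetical stand-in for the downloaded page, reduced to the
# structure that the selector describes. The hrefs are made up.
SAMPLE_HTML = """
<div id="container">
  <a href="/example/home">Home</a>
  <div class="two-column-content clearfix">
    <div>
      <div class="right-column cms">
        <div>
          <ul>
            <li><a href="/example/constituency-a">Constituency A</a></li>
            <li><a href="/example/constituency-b">Constituency B</a></li>
          </ul>
        </div>
      </div>
    </div>
  </div>
</div>
"""


def extract_links(html, base=BASE):
    """Return absolute URLs for the <a> tags inside the selected <li> items."""
    soup = BeautifulSoup(html, "html.parser")
    # a["href"] is a plain string; urljoin turns site-relative paths
    # into full addresses and leaves absolute URLs unchanged.
    return [urljoin(base, a["href"]) for a in soup.select(SELECTOR + " a[href]")]


links = extract_links(SAMPLE_HTML)
print(links)
```

On the real page, `extract_links(requests.get(url).text)` would be the equivalent call, provided the live markup still matches the selector.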
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver
url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
response = requests.get(url)
response.status_code
text = requests.get(url).text
# The parser name is passed as the second argument (features)
soup = BeautifulSoup(text, "html5lib")
link_list = []
for a in soup('a'):
    if a.has_attr('href'):
        link_list.append(a)
re_pattern = r"^/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/"
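This is where I get lost, because I need to find all 18 URLs that start with this pattern. (I am fairly sure the pattern is wrong. Please help!)
The rest of the code: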
import re
good_urls = [url for url in link_list if re.match(re_pattern, url)]
Here I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-f3fbbd3199b1> in <module>
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]
<ipython-input-36-f3fbbd3199b1> in <listcomp>(.0)
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]
~/opt/anaconda3/lib/python3.7/re.py in match(pattern, string, flags)
173 """Try to apply the pattern at the start of the string, returning
174 a Match object, or None if no match was found."""
--> 175 return _compile(pattern, flags).match(string)
176
177 def fullmatch(pattern, string, flags=0):
TypeError: expected string or bytes-like object
What should I do differently to get these 18 URLs? Thanks!
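The TypeError arises because `link_list` holds BeautifulSoup `Tag` objects, while `re.match` expects a string, so the `href` values need to be pulled out first (for example `a['href']`). A minimal sketch of the filtering step follows, using only the standard library. The helper name `filter_result_urls` and the sample hrefs are hypothetical, and it is an assumption that the page uses either site-relative or absolute links with the prefix from the question.

```python
import re
from urllib.parse import urljoin

BASE = "http://www.eoni.org.uk"
PREFIX = (
    "/Elections/Election-results-and-statistics/"
    "Election-results-and-statistics-2003-onwards/Elections-2019/"
)
# re.escape keeps every character of the prefix literal. The optional
# group lets the pattern accept site-relative and absolute hrefs alike.
RE_PATTERN = re.compile(r"^(?:https?://www\.eoni\.org\.uk)?" + re.escape(PREFIX))


def filter_result_urls(hrefs, base=BASE):
    """Keep hrefs matching the prefix, as absolute URLs without duplicates."""
    result = []
    for href in hrefs:
        # Skip anything that is not a string (such as a Tag object).
        if isinstance(href, str) and RE_PATTERN.match(href):
            full = urljoin(base, href)
            if full not in result:
                result.append(full)
    return result


# With the soup from the question, the strings would come from the tags:
# hrefs = [a["href"] for a in soup.find_all("a", href=True)]
# The list below is made-up sample data for illustration only.
hrefs = [
    PREFIX + "Hypothetical-Constituency-A",
    BASE + PREFIX + "Hypothetical-Constituency-B",
    "/Contact-Us",
    PREFIX + "Hypothetical-Constituency-A",
]
good_urls = filter_result_urls(hrefs)
print(good_urls)
```

The prefix alone may also match navigation links that share it, so the result on the live page could contain more than the 18 constituency URLs. Restricting the search to the list selected by the CSS selector in the question would be one way to narrow it.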
【Discussion】:
- Sorry, that doesn't help... different application
Tags: python regex web-scraping