【Posted】:2021-03-16 19:36:31
【Question】:
I am trying to replace specific text with Beautiful Soup. My code:
import requests
from bs4 import BeautifulSoup as bs

# Despite the prompt text, this value is used as the path to a file of keywords
dorks = input("Keyword : ")
binglist = "http://www.bing.com/search?q="
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}

with open(dorks, mode="r", encoding="utf-8") as my_file:
    for line in my_file:
        # Strip the trailing newline before building the search URL
        clean = binglist + line.strip()
        r = requests.get(clean, headers=headers)
        soup = bs(r.text, 'html.parser')
        # find_all returns every <cite> tag, which matches the list output below
        links = soup.find_all('cite')
        print(links)
Output:
[<cite>https://www.wsltv.com/tv-<strong>allinurl:-streaming</strong>/s17455</cite>, <cite>https://www.<strong>google</strong>.es/webhp</cite>]
So I am trying to remove all of the tags (<cite>, <strong>) and keep only the URLs.
I tried this regular expression, but I could not extract the site URLs with it.
Like this:
links = soup.find_all('http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$')
But I did not manage to extract only the URLs.
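One likely reason the attempt above returns nothing: when `find_all` is given a plain string as its first argument, it treats that string as a tag name, so the pattern is never applied to the text. A regular expression would have to be run against text that has already been extracted. A minimal sketch of that idea, assuming the cite text is already available as plain strings (the `candidates` list and the simplified `URL_RE` pattern are illustrative and not taken from the post):

```python
import re

# Simplified, illustrative URL pattern (an assumption, not the asker's regex):
# optional scheme, a dotted host name, then an optional path.
URL_RE = re.compile(r'^(https?://)?[a-z0-9-]+(\.[a-z0-9-]+)+(/\S*)?$', re.IGNORECASE)

# Hypothetical strings standing in for text already pulled out of <cite> tags.
candidates = [
    "https://www.wsltv.com/tv-allinurl:-streaming/s17455",
    "https://www.google.es/webhp",
    "not a url",
]

# Keep only the strings that look like URLs.
urls = [text for text in candidates if URL_RE.match(text)]
print(urls)
```

For this particular output a regex may not be needed at all, since every `<cite>` element already holds a URL.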
Thanks for your help.
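One way to get just the URLs, sketched here under the assumption that the goal is the plain text of each `<cite>` element: call `get_text()` on every tag returned by `find_all('cite')`. The `html` string below is a stand-in copied from the output shown in the question, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question.
html = (
    "<cite>https://www.wsltv.com/tv-<strong>allinurl:-streaming</strong>/s17455</cite>"
    "<cite>https://www.<strong>google</strong>.es/webhp</cite>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates the text of a tag and all of its descendants,
# so the nested <strong> markup disappears from the result.
urls = [cite.get_text() for cite in soup.find_all("cite")]
print(urls)
```

In the original loop, the same list comprehension could replace `print(links)` so that each search prints plain URL strings instead of tags.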
【Comments】:
- You can find the answer here: stackoverflow.com/questions/56421148/…
Tags: python beautifulsoup text-extraction