Parsing href using Python: answers

【Question title】: Parsing href using Python
【Posted】: 2018-06-04 09:04:08
【Question】:

I am using Python for web scraping and got this code from a website:

<a href="javascript:document.frmMain.action.value='display_physician_info';document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();" title="For more information, click here.">JOHN, DOE</a>

I want to parse specific values out of the href, for example the PhysicianID value 1234567 in "document.frmMain.PhysicianID.value".

At the moment I get the whole href text like this:

for i in soup.select('.data'):
    name = i.find('a', attrs={'title': 'For more information, click here.'})

Any ideas? Thanks in advance.

【Comments】:

Tags: python web-scraping html-parsing

【Solution 1】:

Once you have the link, BeautifulSoup makes it easy to access the href:

href = name['href']

You can then use a regular expression with the re module:

import re

# Escape the dots so they match a literal "." rather than any character.
match = re.search(r'document\.frmMain\.PhysicianID\.value=\d+;', href).group()
value = re.search(r'\d+', match).group()
print(value)  # prints 1234567

Putting it together with your code:

import re

for i in soup.select('.data'):
    name = i.find('a', attrs={'title': 'For more information, click here.'})
    href = name['href']  # the full javascript: href string
    match = re.search(r'document\.frmMain\.PhysicianID\.value=\d+;', href).group()
    value = re.search(r'\d+', match).group()
    print(value)  # prints 1234567
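
The two-step search above can also be written with a single capture group, and it may be worth guarding against links that do not contain a PhysicianID at all, since re.search returns None in that case. The following is only a sketch under that assumption; the names PHYSICIAN_ID_RE and extract_physician_id are hypothetical helpers and are not part of the original answer.

```python
import re
from typing import Optional

# Compiled once. The dots are escaped so they match literal "." characters,
# and the parentheses capture only the digits after "value=".
PHYSICIAN_ID_RE = re.compile(r"document\.frmMain\.PhysicianID\.value=(\d+)")


def extract_physician_id(href: Optional[str]) -> Optional[str]:
    """Return the PhysicianID digits from a javascript: href, or None."""
    if not href:
        return None
    match = PHYSICIAN_ID_RE.search(href)
    return match.group(1) if match else None


href = (
    "javascript:document.frmMain.action.value='display_physician_info';"
    "document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();"
)
print(extract_physician_id(href))                  # prints 1234567
print(extract_physician_id("javascript:void(0)"))  # prints None
```

Inside the loop, this could be called as extract_physician_id(name['href']) after checking that name is not None, because find returns None when no matching tag exists.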

【Discussion】:

【Solution 2】:

Alternatively, without a regular expression:

from bs4 import BeautifulSoup

content = """
<a href="javascript:document.frmMain.action.value='display_physician_info';document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();" title="For more information, click here.">JOHN, DOE</a>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.select_one("a")['href'].split("PhysicianID.value=")[1].split(";")[0]
print(item)

Output:

1234567

【Discussion】:
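
If more than one form field is needed (for example both action and PhysicianID), the same split-based idea can be extended to collect every document.<form>.<field>.value assignment into a dictionary. This is a rough sketch, not a JavaScript parser: it assumes the statements are separated by ";" and that the values themselves contain no ";". The function name parse_form_assignments is hypothetical.

```python
def parse_form_assignments(href: str) -> dict:
    """Collect document.<form>.<field>.value=<value> pairs from a javascript: href."""
    prefix = "javascript:"
    body = href[len(prefix):] if href.startswith(prefix) else href

    fields = {}
    for statement in body.split(";"):
        target, sep, raw_value = statement.partition("=")
        if not sep:
            continue  # e.g. "document.frmMain.submit()" is not an assignment
        parts = target.strip().split(".")
        # Only accept targets shaped like document.<form>.<field>.value
        if len(parts) == 4 and parts[0] == "document" and parts[3] == "value":
            # Strip surrounding quotes from string values such as 'display_physician_info'
            fields[parts[2]] = raw_value.strip().strip("'\"")
    return fields


href = (
    "javascript:document.frmMain.action.value='display_physician_info';"
    "document.frmMain.PhysicianID.value=1234567;document.frmMain.submit();"
)
fields = parse_form_assignments(href)
print(fields["PhysicianID"])  # prints 1234567
print(fields["action"])       # prints display_physician_info
```

All values come back as strings, so the ID would need int(...) if a number is required.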