Extract part of the text from an href using bs4

【Question title】: Extract part of the text from the href using bs4
【Posted】: 2018-09-11 04:46:11
【Question description】:

I want to extract part of the text from the href, but it seems I can only get the whole href out of the HTML.

from bs4 import BeautifulSoup

soup=BeautifulSoup("""<div class="cdAllIn"><a href="/footba/all.aspx?lang=EN&amp;tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0" title="All Odds"><img src="/football/info/images/btn_odds.gif?CV=L302R1g" alt="All Odds" title="All Odds"></a></div>
<div class="cdAllIn"><a href="/footba/all.aspx?lang=EN&amp;tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0" title="All Odds"><img src="/football/info/images/btn_odds.gif?CV=L302R1g" alt="All Odds" title="All Odds"></a></div>
<div class="cdAllIn"><a href="/footba/all.aspx?lang=EN&amp;tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0" title="All Odds"><img src="/football/info/images/btn_odds.gif?CV=L302R1g" alt="All Odds" title="All Odds"></a></div>
<div class="cdAllIn"><a href="/footba/all.aspx?lang=EN&amp;tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0" title="All Odds"><img src="/football/info/images/btn_odds.gif?CV=L302R1g" alt="All Odds" title="All Odds"></a></div>
""",'html.parser')

lines=soup.find_all('a')
for line in lines:
    print(line['href'])

Result:

/footba/all.aspx?lang=EN&tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0
/footba/all.aspx?lang=EN&tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0
/footba/all.aspx?lang=EN&tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0
/footba/all.aspx?lang=EN&tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0

Expected result:

6be0690b-93e3-4300-87e9-7d0aa5797ae0
6be0690b-93e3-4300-87e9-7d0aa5797ae0
6be0690b-93e3-4300-87e9-7d0aa5797ae0
6be0690b-93e3-4300-87e9-7d0aa5797ae0

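【Comments】:

Why would that be the expected result? ...['href'] returns the value of the href attribute, as expected. Perhaps you want something else?
Yes, I was just trying to write code that gives that result! Sorry, my mistake.

Tags: python beautifulsoup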

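【Solution 1】:

Split the string on = and take the last element. This works here because tmatchid is the last parameter in the URL.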

for line in lines:
    print(line['href'].split('=')[-1])
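
If the parameter order might change, a more defensive variant is to parse the query string with the standard library instead of splitting on =. The following is only a sketch: the helper name get_tmatchid and the reordered URL are made up for illustration and are not part of the original answer.

```python
from urllib.parse import urlparse, parse_qs


def get_tmatchid(href):
    """Return the tmatchid query parameter of href, or None if it is absent."""
    query = urlparse(href).query              # e.g. 'lang=EN&tmatchid=6be0...'
    values = parse_qs(query).get('tmatchid')  # list of values, or None
    return values[0] if values else None


# With the soup from the question this would be called as get_tmatchid(line['href']).
href = '/footba/all.aspx?lang=EN&tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0'
print(get_tmatchid(href))

# Hypothetical reordered URL: the parameter order no longer matters.
print(get_tmatchid('/footba/all.aspx?tmatchid=abc&lang=EN'))
```

The trade-off is a couple of extra lines in exchange for not depending on where tmatchid sits in the URL.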

Hope this helps! Cheers!

【Discussion】:

【Solution 2】:

Since you only need the tmatchid value, find the substring tmatchid= in the URL and take the rest of the URL starting right after it.

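lines = soup.find_all('a')
key = 'tmatchid='
for line in lines:
    href = line['href']
    index = href.find(key)
    if index != -1:  # find() returns -1 when the key is missing
        # everything after 'tmatchid=' is the value we want
        print(href[index + len(key):])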

Output

6be0690b-93e3-4300-87e9-7d0aa5797ae0
6be0690b-93e3-4300-87e9-7d0aa5797ae0
6be0690b-93e3-4300-87e9-7d0aa5797ae0
6be0690b-93e3-4300-87e9-7d0aa5797ae0
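
The find-based snippet returns everything after tmatchid=, which is fine for the URLs in the question because tmatchid is the last parameter. As a hedged alternative, a regular expression can stop at the next & or # so that trailing parameters are not included. The names TMATCHID_RE and extract_tmatchid, and the URL with an extra page parameter, are assumptions for illustration only.

```python
import re

# Match 'tmatchid=' only at the start of a query parameter,
# then capture up to the next '&' or '#' (or the end of the string).
TMATCHID_RE = re.compile(r'[?&]tmatchid=([^&#]*)')


def extract_tmatchid(href):
    """Return the tmatchid value found in href, or None if there is none."""
    match = TMATCHID_RE.search(href)
    return match.group(1) if match else None


# URL from the question.
print(extract_tmatchid(
    '/footba/all.aspx?lang=EN&tmatchid=6be0690b-93e3-4300-87e9-7d0aa5797ae0'))

# Hypothetical URL with a parameter after tmatchid: only the id is returned.
print(extract_tmatchid('/footba/all.aspx?tmatchid=abc&page=2'))
```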

【Discussion】: