【Question title】: I want to extract the text inside each h4, the text related to that h4, and the links related to them (with XPath)
【Posted】: 2020-09-28 11:27:54
【Question】:

I want to extract some titles, texts, and links from a given string. The Python script looks like this:

from lxml.html import fromstring

url='''
<div class="topLinks">
<div class="hd left">
</div><div class="hd-middle middle">

        <h4>TTTTTTTTTTTTT</h4></div><div class="hd right"></div><div class="boxMiddle"><ul><li><a href="FullStory.aspx?gid=4&id=6516" title="1399/03/18" target="_blank">PPPPPPPPPPPPPPP<img class="new" src="images/new.png"></a></li><li><a href="http://register1.sanjesh.org/fanni99up" title="1399/03/11" target="_blank">CCCCCCCCCCCCC</a></li><li><a href="http://www6.sanjesh.org/download/fani99/FaniNote99.pdf" title="1399/03/11" target="_blank"> ZZZZZZZZ </a></li><li><a href="FullStory.aspx?gid=4&id=6509" title="1399/03/11" target="_blank">FFFFFF</a></li><li><a href="FullStory.aspx?gid=4&id=6498" title="1399/02/21" target="_blank">XXXXXXXXXXXXXX </a></li></ul></div><div class="boxBottom"></div></div>


<div class="topLinks"><div class="hd left_alter"></div><div class="hd-middle middle_alter">

<h4>CCCCCCCCCCCC</h4></div><div class="hd right_alter"></div><div class="boxMiddle_alter"><ul><li><a href="http://register1.sanjesh.org/rgempiactax99/" title="1399/03/18" target="_blank">GGGGGGGGGGGGGGGG <img class="new" src="images/new.png"></a></li><li><a href="FullStory.aspx?gid=11&id=6515" title="1399/03/18" target="_blank">FFFFFFFFF<img class="new" src="images/new.png"></a></li><li><a href="http://register2.sanjesh.org/RGKhanevadehConsult/" title="1399/03/12" target="_blank">HHHHHHHHH</a></li><li><a href="FullStory.aspx?gid=11&id=6512" title="1399/03/12" target="_blank">FFFFFFFF</a></li><li><a href="FullStory.aspx?gid=11&id=6505" title="1399/02/24" target="_blank">NNNNNNNNNNNNNNNNNNNNNNNNNN</a></li><li><a href="http://dl.sanjesh.org/NOETDownload/DownloadHandler.ashx?id=1271" title="1398/12/12" target="_blank">OOOOOOOOOOOO</a></li><li><a href="FullStory.aspx?gid=11&id=6480" title="1399/01/26" target="_blank">JJJJJJJ</a></li></ul></div><div class="boxBottom_alter"></div></div>

'''  

tree = fromstring(url)

# Each "topLinks" div is one titled group of links.
titrs = tree.xpath("//div[@class='topLinks']")
for titr in titrs:
    print(titr)

# Text of every link, across all groups.
texts = tree.xpath("//div[@class='topLinks']//a/text()")
for text in texts:
    print(text)

# href of every link, across all groups.
links = tree.xpath("//div[@class='topLinks']//a/@href")
for link in links:
    print(link)

The sample output is:

【Comments】:

  • What is the sample output?

Tags: python xpath href


【Solution 1】:

Strictly speaking, you need the following XPath expressions. Example for h4="TTTTTTTTTTTTT":

To retrieve the text:

//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]//text()
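As a rough sketch of how this expression might be run with lxml: the HTML below is a trimmed-down stand-in for the markup in the question (an assumption made to keep the example short), and the names HTML, raw, and texts are illustrative only. Because //text() also returns the whitespace-only nodes between tags, the result is filtered and stripped.

```python
from lxml.html import fromstring

# Trimmed-down stand-in for the question's markup (assumption: two links only).
HTML = """
<div class="topLinks">
  <div class="hd-middle middle"><h4>TTTTTTTTTTTTT</h4></div>
  <div class="boxMiddle"><ul>
    <li><a href="FullStory.aspx?gid=4&amp;id=6516">PPPPPPPPPPPPPPP<img class="new" src="images/new.png"></a></li>
    <li><a href="http://register1.sanjesh.org/fanni99up"> CCCCCCCCCCCCC </a></li>
  </ul></div>
</div>
"""

tree = fromstring(HTML)

# Same expression as above: every text node inside the "boxMiddle" div
# that follows the h4 whose text is exactly "TTTTTTTTTTTTT".
raw = tree.xpath(
    '//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]//text()'
)

# Drop whitespace-only nodes and trim the rest.
texts = [t.strip() for t in raw if t.strip()]
print(texts)
```

With this sample, the list should contain only the two link texts.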

To retrieve the links:

//h4[.="TTTTTTTTTTTTT"]/following::div[@class="boxMiddle"]//@href
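Two details of the question's markup may matter here: the second group uses the class boxMiddle_alter rather than boxMiddle, and some hrefs are relative. The sketch below is one possible way to handle both, not part of the original answer. It assumes a shortened copy of the markup, a hypothetical BASE_URL (the real page address is not given in the question), and a hypothetical helper named hrefs_for. It matches the class with starts-with(), keeps only the nearest following box with [1], and passes the heading as an XPath variable.

```python
from urllib.parse import urljoin

from lxml.html import fromstring

# Hypothetical page address, used only to resolve relative hrefs.
BASE_URL = "http://www.example.org/"

# Shortened copy of the question's markup (assumption: fewer links per group).
HTML = """
<div class="topLinks">
  <div class="hd-middle middle"><h4>TTTTTTTTTTTTT</h4></div>
  <div class="boxMiddle"><ul>
    <li><a href="FullStory.aspx?gid=4&amp;id=6516">PPPPPPPPPPPPPPP</a></li>
  </ul></div>
</div>
<div class="topLinks">
  <div class="hd-middle middle_alter"><h4>CCCCCCCCCCCC</h4></div>
  <div class="boxMiddle_alter"><ul>
    <li><a href="http://register1.sanjesh.org/rgempiactax99/">GGGGGGGGGGGGGGGG</a></li>
    <li><a href="FullStory.aspx?gid=11&amp;id=6515">FFFFFFFFF</a></li>
  </ul></div>
</div>
"""


def hrefs_for(tree, heading):
    """Return absolute hrefs from the box that follows the given h4 text."""
    # starts-with() covers both "boxMiddle" and "boxMiddle_alter";
    # [1] keeps only the nearest following box, so later groups are not mixed in.
    expr = (
        '//h4[. = $h]'
        '/following::div[starts-with(@class, "boxMiddle")][1]//@href'
    )
    return [urljoin(BASE_URL, href) for href in tree.xpath(expr, h=heading)]


tree = fromstring(HTML)
first = hrefs_for(tree, "TTTTTTTTTTTTT")
second = hrefs_for(tree, "CCCCCCCCCCCC")
print(first)
print(second)
```

Absolute hrefs pass through urljoin unchanged, so only the FullStory.aspx links are rewritten against BASE_URL.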

As a one-liner:

(//text()[normalize-space()]|//@href)[preceding::h4[1][.="TTTTTTTTTTTTT"]]
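The one-liner returns texts and hrefs together in a single flat list for one heading. If the goal is to keep every h4 together with its own link texts and hrefs, an alternative (a sketch, not the answer's method) is to loop over each topLinks div and use relative XPath expressions inside it. The markup is again a shortened copy of the question's, and links_by_title is a hypothetical helper name.

```python
from lxml.html import fromstring

# Shortened copy of the question's markup (assumption: fewer links per group).
HTML = """
<div class="topLinks">
  <div class="hd-middle middle"><h4>TTTTTTTTTTTTT</h4></div>
  <div class="boxMiddle"><ul>
    <li><a href="FullStory.aspx?gid=4&amp;id=6516">PPPPPPPPPPPPPPP<img class="new" src="images/new.png"></a></li>
  </ul></div>
</div>
<div class="topLinks">
  <div class="hd-middle middle_alter"><h4>CCCCCCCCCCCC</h4></div>
  <div class="boxMiddle_alter"><ul>
    <li><a href="http://register1.sanjesh.org/rgempiactax99/">GGGGGGGGGGGGGGGG <img class="new" src="images/new.png"></a></li>
    <li><a href="FullStory.aspx?gid=11&amp;id=6515">FFFFFFFFF</a></li>
  </ul></div>
</div>
"""


def links_by_title(html):
    """Map each h4 title to a list of (link text, href) pairs from its group."""
    tree = fromstring(html)
    result = {}
    for box in tree.xpath('//div[@class="topLinks"]'):
        # The leading "." makes each expression relative to the current group.
        title = str(box.xpath("normalize-space(.//h4)"))
        result[title] = [
            (str(a.xpath("normalize-space(.)")), a.get("href"))
            for a in box.xpath(".//a[@href]")
        ]
    return result


groups = links_by_title(HTML)
for title, links in groups.items():
    print(title)
    for text, href in links:
        print("   ", text, "->", href)
```

Because every lookup is scoped to one topLinks div, the differing class names (boxMiddle and boxMiddle_alter) need no special handling, and the heading text does not have to be known in advance.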
