【Posted】: 2020-09-28 11:27:54
【Question】:
I want to extract the titles, link text, and links from a given string. The Python script looks like this:
from lxml.html import fromstring
import requests
import html.parser
url='''
<div class="topLinks">
<div class="hd left">
</div><div class="hd-middle middle">
<h4>TTTTTTTTTTTTT</h4></div><div class="hd right"></div><div class="boxMiddle"><ul><li><a href="FullStory.aspx?gid=4&id=6516" title="1399/03/18" target="_blank">PPPPPPPPPPPPPPP<img class="new" src="images/new.png"></a></li><li><a href="http://register1.sanjesh.org/fanni99up" title="1399/03/11" target="_blank">CCCCCCCCCCCCC</a></li><li><a href="http://www6.sanjesh.org/download/fani99/FaniNote99.pdf" title="1399/03/11" target="_blank"> ZZZZZZZZ </a></li><li><a href="FullStory.aspx?gid=4&id=6509" title="1399/03/11" target="_blank">FFFFFF</a></li><li><a href="FullStory.aspx?gid=4&id=6498" title="1399/02/21" target="_blank">XXXXXXXXXXXXXX </a></li></ul></div><div class="boxBottom"></div></div>
<div class="topLinks"><div class="hd left_alter"></div><div class="hd-middle middle_alter">
<h4>CCCCCCCCCCCC</h4></div><div class="hd right_alter"></div><div class="boxMiddle_alter"><ul><li><a href="http://register1.sanjesh.org/rgempiactax99/" title="1399/03/18" target="_blank">GGGGGGGGGGGGGGGG <img class="new" src="images/new.png"></a></li><li><a href="FullStory.aspx?gid=11&id=6515" title="1399/03/18" target="_blank">FFFFFFFFF<img class="new" src="images/new.png"></a></li><li><a href="http://register2.sanjesh.org/RGKhanevadehConsult/" title="1399/03/12" target="_blank">HHHHHHHHH</a></li><li><a href="FullStory.aspx?gid=11&id=6512" title="1399/03/12" target="_blank">FFFFFFFF</a></li><li><a href="FullStory.aspx?gid=11&id=6505" title="1399/02/24" target="_blank">NNNNNNNNNNNNNNNNNNNNNNNNNN</a></li><li><a href="http://dl.sanjesh.org/NOETDownload/DownloadHandler.ashx?id=1271" title="1398/12/12" target="_blank">OOOOOOOOOOOO</a></li><li><a href="FullStory.aspx?gid=11&id=6480" title="1399/01/26" target="_blank">JJJJJJJ</a></li></ul></div><div class="boxBottom_alter"></div></div>
'''
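tree = fromstring(url)

# Each "topLinks" block; printing the element shows only its repr
titrs = tree.xpath("//div[@class='topLinks']")
for titr in titrs:
    print(titr)

# Text of every link inside the blocks
texts = tree.xpath("//div[@class='topLinks']//a/text()")
for text in texts:
    print(text)

# href of every link inside the blocks
links = tree.xpath("//div[@class='topLinks']//a/@href")
for link in links:
    print(link)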
The sample output is:
【Comments】:
- What is the sample output?
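Since the expected output is not shown, here is only a minimal sketch, assuming the goal is one heading per `topLinks` block together with the text, `href`, and `title` of each link in that block. `SAMPLE_HTML` is a shortened, hypothetical stand-in for the markup in the question, and `extract_blocks` is an illustrative name, not something from the original script. The idea is to loop over the blocks and run relative XPath queries (starting with `.`) on each one, so the links stay grouped under their heading.

```python
from lxml.html import fromstring

# Shortened stand-in for the markup in the question (hypothetical sample data).
SAMPLE_HTML = """
<div>
<div class="topLinks">
  <div class="hd-middle middle"><h4>Heading A</h4></div>
  <div class="boxMiddle"><ul>
    <li><a href="FullStory.aspx?gid=4&id=6516" title="1399/03/18">First <img class="new" src="images/new.png"></a></li>
    <li><a href="http://register1.sanjesh.org/fanni99up" title="1399/03/11">Second</a></li>
  </ul></div>
</div>
<div class="topLinks">
  <div class="hd-middle middle_alter"><h4>Heading B</h4></div>
  <div class="boxMiddle_alter"><ul>
    <li><a href="FullStory.aspx?gid=11&id=6515" title="1399/03/18">Third</a></li>
  </ul></div>
</div>
</div>
"""


def extract_blocks(html):
    """Return a list of (heading, [(text, href, title), ...]) per topLinks block."""
    tree = fromstring(html)
    blocks = []
    for block in tree.xpath("//div[@class='topLinks']"):
        # string(...) yields the text of the first matching <h4>, or "" if none.
        heading = block.xpath("string(.//h4)").strip()
        # Relative query: only the <a> elements inside this block.
        links = [
            (a.text_content().strip(), a.get("href"), a.get("title"))
            for a in block.xpath(".//a")
        ]
        blocks.append((heading, links))
    return blocks


if __name__ == "__main__":
    for heading, links in extract_blocks(SAMPLE_HTML):
        print(heading)
        for text, href, title in links:
            print("   ", text, href, title)
```

Using `text_content()` instead of `a/text()` collects all text inside the link in one string, which avoids split or missing fragments when an `<a>` also contains an `<img>`.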