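【Posted】: 2015-11-06 03:44:15
【Question】:
I am trying to scrape a page that looks like the HTML below, where each group has 3 or more span tags. The goal is to get a list of dictionaries like this: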
{'ctl02_lblAppearanceInfo1': 'Text',
'ctl02_lblAppearanceInfo2': 'Text'}
html:
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE.............. </span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE..........</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE..........</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE..........</span>
I have used
tree.xpath('//span[starts-with(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl")]')
successfully, because it returns element objects with id and text attributes. But if I run into something like this:
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
TEXT LINE 1
<br>TEXT LINE 2
<br>TEXT LINE 3
<br>TEXT LINE 4</span>
it only returns "TEXT LINE 1".
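A likely explanation, assuming the page is parsed with lxml.html: an element's text attribute holds only the text before its first child element, and the text after each <br> is stored as that <br>'s tail. The sketch below shows one way to collect every line with itertext(). The sample markup, the helper name span_lines, and the way the short dictionary key is derived from the id are illustrative assumptions, not part of the original page or code.

```python
from lxml import html

# Sample markup modeled on the span in the question (wrapped in a div for parsing).
SNIPPET = """
<div>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
TEXT LINE 1
<br>TEXT LINE 2
<br>TEXT LINE 3
<br>TEXT LINE 4</span>
</div>
"""

tree = html.fromstring(SNIPPET)
spans = tree.xpath(
    '//span[starts-with(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl")]'
)


def span_lines(span):
    """Hypothetical helper: every non-empty text chunk inside the span, in document order."""
    # itertext() walks the element's own text plus the text and tail of its descendants.
    return [chunk.strip() for chunk in span.itertext() if chunk.strip()]


span = spans[0]

# .text stops at the first child element (<br>), which matches the behavior in the question.
first_only = span.text.strip()

# All four lines, including the ones stored as tails of the <br> elements.
lines = span_lines(span)

# Assumption: the wanted key is the last two underscore-separated parts of the id.
key = "_".join(span.get("id").rsplit("_", 2)[-2:])

record = {key: lines}
print(first_only)
print(record)
```

If a single string is preferred over a list, the lines can be joined with a separator of your choice; elements parsed by lxml.html also offer text_content(), which returns all descendant text as one string.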
【Comments】: