【Question title】: lxml xpath - Get all text within span tags
【Posted】: 2015-11-06 03:44:15
【Question】:

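I'm trying to scrape a page that looks like the one below, where each group has 3 or more span tags. The goal is to get a list of dictionaries like this: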

{'ctl02_lblAppearanceInfo1': 'Text',
'ctl02_lblAppearanceInfo2': 'Text'}

html:

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............   </span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE..........</span>


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE..........</span>


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE..........</span>

I used

tree.xpath('//span[starts-with(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl")]')

with success, since it returns element objects that have id and text attributes. But if I run into something like this:

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
TEXT LINE 1
<br>TEXT LINE 2
<br>TEXT LINE 3
<br>TEXT LINE 4</span>

it only returns "TEXT LINE 1".
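
The behavior described above follows from how lxml stores mixed content: an element's text attribute holds only the text before its first child, and the text after each child (here, each <br>) is kept on that child's tail attribute. A minimal sketch that reproduces this, using a shortened, hypothetical id (demo_lblAppearanceInfo1) rather than the real one:

```python
from lxml import html

# Hypothetical, shortened id; the markup mirrors the multi-line span above.
SNIPPET = """<div><span id="demo_lblAppearanceInfo1" class="ParamText">
TEXT LINE 1
<br>TEXT LINE 2
<br>TEXT LINE 3
<br>TEXT LINE 4</span></div>"""

span = html.fromstring(SNIPPET).xpath('//span')[0]

# .text holds only the text before the first child element (<br>).
first_only = span.text.strip()

# Text following each <br> is stored on that child's .tail attribute.
tails = [br.tail.strip() for br in span.iterchildren('br')]

# itertext() yields every text node under the element in document order.
all_lines = [t.strip() for t in span.itertext() if t.strip()]

print(first_only)
print(tails)
print(all_lines)
```

So the remaining lines are not lost; they just have to be collected from the text nodes rather than from the text attribute alone.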

【Comments】:

    Tags: python xpath lxml


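    【Solution 1】:

    Use contains() to match the span ids and .//text() to collect every text node inside each span, including the text that follows each <br>.

    The code is as follows: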

    from lxml import html
    
    HTML = """<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE 1..............   </span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE 2..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE 3..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE 4..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE 5..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE 6..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE 7..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE 8..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE 9..............</span>
    <span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
    TEXT LINE 10.............
    <br>TEXT LINE 11.............
    <br>TEXT LINE 12.............
    <br>TEXT LINE 13.............</span>
    """
    
    tree = html.fromstring(HTML)
    # Select every span whose id contains the shared prefix.
    text_lines = tree.xpath('//span[contains(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl")]')
    
    results = dict()
    
    for i, text_line in enumerate(text_lines):
        span_id = text_line.get('id')
        # .//text() returns all text nodes under the span, not just the first one.
        span_text = [x.strip() for x in text_line.xpath('.//text()')]
        results[i] = dict(id=span_id, texts=span_text)
    
    print(results)
    

    Output:

    {
        0: {
            'texts': ['TEXT HERE 1..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1'
        },
        1: {
            'texts': ['TEXT HERE 2..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2'
        },
        2: {
            'texts': ['TEXT HERE 3..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace'
        },
        3: {
            'texts': ['TEXT HERE 4..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1'
        },
        4: {
            'texts': ['TEXT HERE 5..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2'
        },
        5: {
            'texts': ['TEXT HERE 6..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace'
        },
        6: {
            'texts': ['TEXT HERE 7..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1'
        },
        7: {
            'texts': ['TEXT HERE 8..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2'
        },
        8: {
            'texts': ['TEXT HERE 9..............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace'
        },
        9: {
            'texts': ['TEXT LINE 10.............', 'TEXT LINE 11.............', 'TEXT LINE 12.............', 'TEXT LINE 13.............'],
            'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1'
        }
    }
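
The result above is keyed by position, while the question asked for a list of dictionaries keyed by the short id. One possible way to get there is sketched below. It assumes that the part of the id after the shared prefix always starts with a row marker such as ctl02 or ctl03, and that a span's lines should be joined with newlines; both are assumptions, not something the original answer states. The sample markup and the names PREFIX, groups, and records are made up for illustration.

```python
from lxml import html

# Shared id prefix taken from the ids in the question.
PREFIX = "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_"

# Shortened sample markup (hypothetical text values).
HTML = """<div>
<span id="{p}ctl02_lblAppearanceInfo1" class="ParamText">
LINE A
<br>LINE B</span>
<span id="{p}ctl02_lblAppearanceInfo2" class="ParamText"> TEXT 2 </span>
<span id="{p}ctl03_lblAppearanceInfo1" class="ParamText"> TEXT 3 </span>
</div>""".format(p=PREFIX)

tree = html.fromstring(HTML)
groups = {}  # row marker such as "ctl02" -> dict for that group

# lxml lets XPath variables ($prefix) be passed as keyword arguments.
for span in tree.xpath('//span[starts-with(@id, $prefix)]', prefix=PREFIX):
    short_id = span.get("id")[len(PREFIX):]   # e.g. "ctl02_lblAppearanceInfo1"
    row = short_id.split("_", 1)[0]           # e.g. "ctl02"
    lines = [t.strip() for t in span.xpath(".//text()") if t.strip()]
    groups.setdefault(row, {})[short_id] = "\n".join(lines)

# One dictionary per row, in the order the rows first appear (Python 3.7+).
records = list(groups.values())
print(records)
```

If the lines should stay separate, storing the list instead of the joined string keeps the same structure as the answer's texts field.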
    

    【Comments】:
