【Question title】: Manage quotation marks in XPath (lxml)
【Posted】: 2017-03-04 06:34:20
【Question】:

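I want to extract web elements from the "Manufacturing at a Glance" table on the given website. However, the name of one row contains a ' (single quote), which interferes with my syntax. How do I get around this? The same code works for the other rows.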

import requests
from lxml import html, etree

ism_pmi_url = 'https://www.instituteforsupplymanagement.org/ismreport/mfgrob.cfm?SSO=1'
page = requests.get(ism_pmi_url)
tree = html.fromstring(page.content)

PMI_CustomerInventories = tree.xpath('//strong[text()="Customers' Inventories"]/../../following-sibling::td/p/text()')
PMI_CustomerInventories_Curr_Val = PMI_CustomerInventories[0]

【Question comments】:

    Tags: python parsing xpath lxml elementtree


    【Solution 1】:

    Here is how I would sidestep your problem. It may not be exactly what you need, but it should give you the idea.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    from pprint import pprint

    import lxml.html
    import requests


    def load_lxml(response):
        # Parse the HTTP response body into an lxml HTML tree.
        return lxml.html.fromstring(response.text)


    url = 'https://www.instituteforsupplymanagement.org/ismreport/mfgrob.cfm?SSO=1'
    response = requests.get(url)
    root = load_lxml(response)

    headers = []
    data = []
    # Walk every row of the second table inside the feature container.
    for index, row in enumerate(root.xpath('//*[@id="home_feature_container"]/div/div/div/span/table[2]/tbody/tr')):
        rows = []
        for cindex, column in enumerate(row.xpath('./th//text() | ./td//text()')):
            # Skip the second text node of each row.
            if cindex == 1:
                continue
            column = column.strip()
            # Ignore the first row and any whitespace-only text nodes.
            if index == 0 or not column:
                continue
            elif index == 1:
                # The second row supplies the header texts.
                headers.append(column)
            else:
                rows.append(column)

        # Keep only rows that yielded exactly six values.
        if rows and len(rows) == 6:
            data.append(rows)

    # Put the header texts in front of the data rows.
    data.insert(0, headers)

    pprint(data)
    

    Result:

    [['Series Index',
      'Feb',
      'Series Index',
      'Jan',
      'Percentage',
      'Point',
      'Change',
      'Direction',
      'Rate of Change',
      'Trend* (Months)'],
     ['65.1', '60.4', '+4.7', 'Growing', 'Faster', '6'],
     ['62.9', '61.4', '+1.5', 'Growing', 'Faster', '6'],
     ['54.2', '56.1', '-1.9', 'Growing', 'Slower', '5'],
     ['54.8', '53.6', '+1.2', 'Slowing', 'Faster', '10'],
     ['51.5', '48.5', '+3.0', 'Growing', 'From Contracting', '1'],
     ['47.5', '48.5', '-1.0', 'Too Low', 'Faster', '5'],
     ['68.0', '69.0', '-1.0', 'Increasing', 'Slower', '12'],
     ['57.0', '49.5', '+7.5', 'Growing', 'From Contracting', '1'],
     ['55.0', '54.5', '+0.5', 'Growing', 'Faster', '12'],
     ['54.0', '50.0', '+4.0', 'Growing', 'From Unchanged', '1']]
    [Finished in 2.9s]
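For the quoting problem itself, the XPath expression from the question can be kept as long as the apostrophe never terminates the Python string. The following is a minimal sketch of two ways to do that: escaping the apostrophe in the Python literal, and passing the label as an XPath variable through keyword arguments to `xpath()`. The `SAMPLE` markup and its values are hypothetical stand-ins for the real page, whose structure may differ.

```python
from lxml import html

# Hypothetical stand-in for the real page; the actual markup may differ.
SAMPLE = """
<table>
  <tr>
    <td><p><strong>Customers' Inventories</strong></p></td>
    <td><p>47.5</p></td>
    <td><p>48.5</p></td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# Option 1: escape the apostrophe so it does not end the Python string.
# The XPath engine still receives a double-quoted literal.
escaped = tree.xpath(
    '//strong[text()="Customers\' Inventories"]'
    '/../../following-sibling::td/p/text()'
)

# Option 2: pass the label as an XPath variable ($label), so no quoting
# is needed inside the expression at all.
via_variable = tree.xpath(
    '//strong[text()=$label]/../../following-sibling::td/p/text()',
    label="Customers' Inventories",
)

print(escaped)
print(via_variable)
```

The variable form is probably the more robust choice when the label comes from data rather than from source code, because it also copes with labels that contain both kinds of quotation mark.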
    

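    【Comments】:

    • Thanks, wu4m4n. I have noticed that the website changes from time to time: the XPath goes from '/div/div/div/' to '/div/div[2]/div/' and so on, which is completely unpredictable. As a result, code that relied on this technique has failed in the past.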
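One way to reduce the breakage described in this comment is to locate the row by its label text instead of by a positional path through the wrapper elements. The sketch below assumes the label sits in a `<strong>` inside the first cell of its row; `row_values`, `ROW`, `LAYOUT_A`, and `LAYOUT_B` are hypothetical names and markup invented for illustration, not taken from the real page.

```python
from lxml import html

# Hypothetical row markup with placeholder values.
ROW = (
    "<tr><td><p><strong>Customers' Inventories</strong></p></td>"
    "<td><p>47.5</p></td><td><p>48.5</p></td></tr>"
)

# Two hypothetical layouts that differ only in how the table is wrapped,
# mimicking a change from /div/div/div/ to /div/div[2]/div/.
LAYOUT_A = '<div id="c"><div><div><div><table>' + ROW + '</table></div></div></div></div>'
LAYOUT_B = '<div id="c"><div></div><div><div><table>' + ROW + '</table></div></div></div>'


def row_values(tree, label):
    """Return the non-empty cell texts after the first cell of the row
    whose <strong> label matches `label`."""
    cells = tree.xpath(
        '//tr[.//strong[normalize-space()=$label]]/td[position()>1]//text()',
        label=label,  # XPath variable, so apostrophes need no escaping
    )
    return [c.strip() for c in cells if c.strip()]


values_a = row_values(html.fromstring(LAYOUT_A), "Customers' Inventories")
values_b = row_values(html.fromstring(LAYOUT_B), "Customers' Inventories")
print(values_a, values_b)
```

Because the expression never mentions the surrounding `div` elements, both layouts yield the same cells. It would still break if the label text or the cell structure inside the row changed.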