【Question title】:Python: Get text from all HTML child elements with lxml xpath
【Posted】:2020-09-01 10:03:30
【Question description】:

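I am using lxml's XPath support in Python. If I give the full path to an HTML tag, I can extract its text. However, I cannot extract all the text inside a tag, including the text of its child elements, into a list. For example, given this HTML, I would like to get all the text inside the element with class "example":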

<div class="example">
    "Some text"
    <div>
        "Some text 2"
        <p>"Some text 3"</p>
        <p>"Some text 4"</p>
        <span>"Some text 5"</span>
    </div>
    <p>"Some text 6"</p> 
</div>

I would like to get:

["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]

【Question comments】:

Tags: python xpath lxml


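【Solution 1】:

mzjn's answer is correct. After some trial and error I managed to get it working. This is what the final code looks like. You need to put //text() at the end of the XPath. It has not been refactored yet, so there are bound to be some mistakes and bad practices, but it works.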

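    import urllib.request

    import requests
    from bs4 import BeautifulSoup
    from lxml import html
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # Fetch the page through a session that retries failed connections
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    page = session.get("The url you are webscraping")
    content = page.content

    # Second download parsed with BeautifulSoup; `soup` is not used by the XPath query below
    htmlsite = urllib.request.urlopen("The url you are webscraping")
    soup = BeautifulSoup(htmlsite, 'lxml')
    htmlsite.close()

    # Parse with lxml; the trailing //text() selects every descendant text node
    tree = html.fromstring(content)
    scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')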
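
To make the role of the trailing //text() easier to see, here is a minimal sketch that applies the same idea to the HTML from the question instead of a live page. The helper name `extract_texts` and the constant `SNIPPET` are made up for illustration, and the literal quotation marks around each string are left out on the assumption that they were only added by the browser's inspector.

```python
from lxml import html

# HTML from the question, without the quotation marks around each string
# (assumption: those were a display artifact of the browser inspector).
SNIPPET = """
<div class="example">
    Some text
    <div>
        Some text 2
        <p>Some text 3</p>
        <p>Some text 4</p>
        <span>Some text 5</span>
    </div>
    <p>Some text 6</p>
</div>
"""


def extract_texts(markup, class_name):
    """Hypothetical helper: all non-blank text nodes under div.class_name."""
    tree = html.fromstring(markup)
    # //text() walks every descendant text node in document order;
    # $cls is an XPath variable filled in from the keyword argument.
    raw = tree.xpath('//div[@class=$cls]//text()', cls=class_name)
    # Drop whitespace-only nodes and trim the indentation around the rest.
    return [t.strip() for t in raw if t.strip()]


if __name__ == "__main__":
    print(extract_texts(SNIPPET, "example"))
```

Without the final list comprehension, the result would also contain the whitespace-only strings that sit between the tags.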

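I tried it on the team page of keeleyteton.com. It returned the following list, which is correct (although it needs a lot of cleanup!), even though the texts sit in different tags, some of them child tags. Thanks for the help!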

['\r\n        ', '\r\n        ', 'Nicholas F. Galluccio', '\r\n        ', '\r\n        ', 'Managing Director and Portfolio Manager', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Scott R. Butler', '\r\n        ', '\r\n        ', 'Senior Vice President and Portfolio Manager ', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Thomas E. Browne, Jr., CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Brian P. Leonard, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Robert M. Goldsborough', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Brian R. Keeley, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Edward S. Borland', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Kevin M. Keeley', '\r\n        ', '\r\n        ', 'President', '\r\n      ',
 '\r\n        ', '\r\n        ', 'Deanna B. Marotz', '\r\n        ', '\r\n        ', 'Chief Compliance Officer', '\r\n      ']
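
Most of the cleanup this list needs is removing the whitespace-only entries. One possible approach, sketched here under assumptions, is to filter them out inside the XPath itself with a `[normalize-space()]` predicate and then collapse any remaining stray whitespace in Python. `SAMPLE` is a made-up fragment that only imitates the shape of the real page, and `clean_texts` is a hypothetical helper name.

```python
from lxml import html

# Made-up fragment imitating the page's structure (not the real markup).
SAMPLE = (
    '<div class="clearfix">\r\n        <h3>Scott R. Butler</h3>\r\n        '
    '<p>Senior Vice President and Portfolio Manager </p>\r\n      </div>'
)


def clean_texts(markup):
    """Hypothetical helper: text nodes with blank entries filtered out."""
    tree = html.fromstring(markup)
    # text()[normalize-space()] keeps only text nodes that still contain
    # something after XPath whitespace normalization.
    nodes = tree.xpath(
        '//div[contains(@class, "clearfix")]//text()[normalize-space()]'
    )
    # Collapse internal runs of whitespace and trim both ends.
    return [" ".join(node.split()) for node in nodes]


if __name__ == "__main__":
    print(clean_texts(SAMPLE))
```

If the list has already been scraped, a plain `[s.strip() for s in scraped if s.strip()]` gives a similar result without re-running the query.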

【Comments】:
