XPath scrapes data from all tables rather than the one I intend to

【Question title】: XPath scrapes data from all tables rather than the one I intend to
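【Posted】: 2021-12-01 13:02:57
【Question description】:

With the help of the answer to the question "Python: Get html table data by xpath", I am trying to scrape the "Shareholding Pattern" information from a web page. The code is as follows: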

import lxml.html as LH
import pprint
import requests

def screenerdata (symbol):
    with requests.Session() as sess:
        resp = sess.get('https://www.screener.in/company/'+symbol+'/consolidated/')
        root= LH.fromstring(resp.content)

        for tbody in root.xpath('/html/body/main/section[9]/div[2]/table/tbody'):
            data = [ [tdata.text_content().replace(u'\xa0', u'').strip()
                     for tdata in trow.xpath('td')]
                     for trow in tbody.xpath('//tr') ]
        pprint.pprint(data)

screenerdata("LTTS")

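Since the HTML tables on the page have no id or class, I copied the XPath using the Mozilla Firefox web developer tools. Everything works, except that the code also scrapes data from the other tables. Any ideas on how to fix this? Thanks in advance.

Update after two answers: it does not matter much, but I found that although the table I want to scrape has no id or unique class, the section tag that holds it does have a unique id. So I modified the code accordingly.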
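The modified code is not shown in the post. A minimal sketch of the idea, anchoring the XPath on the section's id instead of a positional path, might look like the following. The sample HTML and the id values `quarters` and `shareholding` are assumptions made for illustration, not taken from the real page.

```python
import lxml.html as LH

# Hypothetical stand-in for the real page: two sections, each holding a table.
SAMPLE_HTML = """
<html><body><main>
  <section id="quarters">
    <table><tbody>
      <tr><td>Sales</td><td>100</td></tr>
    </tbody></table>
  </section>
  <section id="shareholding">
    <table><tbody>
      <tr><td>Promoters&nbsp;+</td><td>74.15</td></tr>
      <tr><td>FIIs&nbsp;+</td><td>9.50</td></tr>
    </tbody></table>
  </section>
</main></body></html>
"""

def table_rows(html, section_id):
    """Return the cell text of every row in tables under the given section."""
    root = LH.fromstring(html)
    # Select rows only inside the section with the given id (assumed id).
    rows = root.xpath('//section[@id=$sid]//table//tr', sid=section_id)
    return [
        [td.text_content().replace('\xa0', ' ').strip() for td in tr.xpath('td')]
        for tr in rows
    ]

data = table_rows(SAMPLE_HTML, "shareholding")
print(data)
```

Selecting by id is less brittle than `section[9]/div[2]`, which breaks as soon as the page gains or loses a section.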

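【Question discussion】:

Tags: python-3.x web-scraping xpath

【Solution 1】:

Do you have to go through XPath? Since these are <table> tags, why not let pandas parse the tables? It returns a list of DataFrames (essentially one for each <table> tag in the HTML). The last table is the "Shareholding Pattern", so you can simply index into the list of DataFrames.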

import pandas as pd

def screenerdata (symbol):
    url = 'https://www.screener.in/company/'+symbol+'/consolidated/'
    df = pd.read_html(url)[-1]
    print(df.to_string())

screenerdata("LTTS")

Output:

    Unnamed: 0  Dec 2018  Mar 2019  Jun 2019  Sep 2019  Dec 2019  Mar 2020  Jun 2020  Sep 2020  Dec 2020  Mar 2021  Jun 2021  Sep 2021
0  Promoters +     80.41     78.88     74.97     74.97     74.74     74.62     74.60     74.36     74.27     74.24     74.23     74.15
1       FIIs +      4.22      5.09      8.50      8.93      8.26      8.37      8.95      7.97      8.87      9.06      8.92      9.50
2       DIIs +      4.25      4.43      4.75      4.76      4.52      4.88      4.45      5.83      6.40      6.36      6.68      6.14
3     Public +     11.12     11.60     11.78     11.34     12.48     12.13     12.00     11.83     10.46     10.34     10.17     10.21
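Indexing with `[-1]` relies on the table staying last on the page. As a possible variation, `pd.read_html` accepts a `match` argument that keeps only tables containing the given text. The sketch below runs on a small inline HTML sample (an assumption, standing in for the real page) and assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page: two tables with the same header.
SAMPLE_HTML = """
<table>
  <thead><tr><th></th><th>Sep 2021</th></tr></thead>
  <tbody><tr><td>Sales</td><td>100</td></tr></tbody>
</table>
<table>
  <thead><tr><th></th><th>Sep 2021</th></tr></thead>
  <tbody>
    <tr><td>Promoters +</td><td>74.15</td></tr>
    <tr><td>FIIs +</td><td>9.50</td></tr>
  </tbody>
</table>
"""

# Keep only the tables whose text matches "Promoters".
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Promoters")
df = tables[0]
print(df.to_string())
```

If no table matches, `read_html` raises a `ValueError`, which makes a layout change on the page fail loudly instead of silently returning the wrong table.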

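【Discussion】:

Thanks, this is a nice solution. It makes the code shorter and neater. Still, any idea why my code is not working?
With tbody.xpath('//tr') it grabs every <tr> tag in the document, whereas you only want the <tr> tags inside the specific <tbody> element you selected. As noted in the other solution, use for trow in tbody.xpath('.//tr') instead of for trow in tbody.xpath('//tr').
Thanks a lot, I understand now.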

【Solution 2】:

This line is the problem:

for trow in tbody.xpath('//tr') ]

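An expression that starts with // is evaluated from the document root, no matter which element you call it on. You are "jumping" to the top of the tree and then searching down through the entire document for any and all tr elements.

You should make it a relative expression, .//tr instead of //tr. That starts from the current context node (the selected tbody) and finds only the tr elements below it.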
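The difference can be seen on a small, self-contained sample. This is only a sketch: the HTML and the table ids `first` and `second` are made up for illustration.

```python
import lxml.html as LH

# Hypothetical document with two tables: two rows in the first, one in the second.
SAMPLE_HTML = """
<html><body>
  <table id="first"><tbody>
    <tr><td>a</td></tr>
    <tr><td>b</td></tr>
  </tbody></table>
  <table id="second"><tbody>
    <tr><td>c</td></tr>
  </tbody></table>
</body></html>
"""

root = LH.fromstring(SAMPLE_HTML)
tbody = root.xpath('//table[@id="second"]/tbody')[0]

# Absolute: searches the whole document, ignoring the tbody it is called on.
absolute_rows = tbody.xpath('//tr')
# Relative: searches only the descendants of this tbody.
relative_rows = tbody.xpath('.//tr')

print(len(absolute_rows), len(relative_rows))  # 3 1
```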

【Discussion】:

Thank you very much for pointing out the problem in my code.