Scrape a table with no ids or classes using only standard libraries?

【Question title】: Scrape table with no ids or classes using only standard libraries?
【Posted】: 2020-02-06 17:07:59
【Question】:

I would like to scrape two pieces of data from a website:

https://www.moneymetals.com/precious-metals-charts/gold-price

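Specifically, I want the "Gold Price per Ounce" and the "Spot Change" percentage, which sits two columns to its right.

Is this possible using only the Python standard library? Many tutorials rely on HTML element ids to scrape effectively, but inspecting the source of this page shows it is just a plain table. Specifically, I want the second and fourth <td> that appear on the page.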

【Question comments】:

Is BeautifulSoup not allowed?
@Cryptoharf84 I would like to stay within the Python standard library.

Tags: html python-3.x web-scraping

【Solution 1】:

It is possible to do this with just the Python standard library; ugly, but possible:

import urllib.request  # "import urllib" alone does not load the request submodule
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'

page = urllib.request.Request(URL)
with urllib.request.urlopen(page) as result:
    # Decode the bytes (UTF-8 is assumed here); str() on bytes would give
    # the "b'...'" repr with newlines escaped as a literal backslash-n
    resulttext = result.read().decode('utf-8', errors='replace')

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.gold = []  # every text node, in document order (per instance)

    def handle_data(self, data):
        self.gold.append(data)

parser = MyHTMLParser()
parser.feed(resulttext)

for target, text in enumerate(parser.gold):
    if 'Gold Price per Ounce' in text:
        # target is the index of the heading; the target items are
        # 2, 5 and 9 positions down in the list
        print(parser.gold[target+2])
        print(parser.gold[target+5].strip())
        print(parser.gold[target+9].strip())

Output (as of the time the URL was loaded):

$1,566.70
8.65
0.55%
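
Since the question asks for the second and fourth <td> specifically, another option is to track the <td> tags themselves instead of counting text nodes. The following is only a minimal sketch of that idea: the names TdCollector and td_texts are hypothetical, and SAMPLE is made-up markup standing in for the real page, whose actual layout may differ. It also does not handle tables nested inside a cell.

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collect the text content of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []   # finished cell texts
        self._buf = None  # text parts of the current cell, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._buf is not None:
            self.cells.append(''.join(self._buf).strip())
            self._buf = None


def td_texts(html):
    """Return the stripped text of each <td> found in the given HTML string."""
    parser = TdCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up sample markup (an assumption, not the real page source)
SAMPLE = """
<table>
  <tr><td>Gold Price per Ounce</td><td>$1,566.70</td><td>8.65</td><td>0.55%</td></tr>
</table>
"""

cells = td_texts(SAMPLE)
print(cells[1], cells[3])  # second and fourth <td>
```

To try this against the live page, pass the decoded page text to td_texts instead of SAMPLE; the indices 1 and 3 would need checking against whatever the page actually serves at that time.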

【Comments】: