Finding the XPath on Yahoo Finance Using R: Answers

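【Question title】: Finding the XPath on Yahoo Finance Using R
【Posted】: 2019-08-01 03:42:57
【Question description】:

I am trying to scrape some data from Yahoo Finance. Usually I have no trouble doing this. Today, however, I ran into a problem when trying to pull a particular container. What could be making this so difficult?

I have tried many XPath combinations. SelectorGadget, for some reason, cannot pick up the XPath. I have posted some of my attempts and the URLs below.

The green area is what I want to pull into the console.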

library(tidyverse)
library(rvest)
library(httr)

read_html("https://ca.finance.yahoo.com/quote/SPY/holdings?p=SPY") %>% html_nodes(xpath = '//*[@id="Col1-0-Holdings-Proxy"]/section/div[1]/div[1]') 

{xml_nodeset (0)}

# When I search for all tables using the following call:
read_html("https://finance.yahoo.com/quote/xlk/holdings?p=xlk") %>% html_nodes("table") %>% .[1] %>% html_table(fill = TRUE)

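I get the table at the bottom of the page. Trying different numbers inside [] results in an error.

What am I doing wrong? This seems like it should be easy to scrape. Any help is much appreciated.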

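【Question comments】:

Do you need the two headers as well? Or just the rows (excluding the chart)?
Just the rows (sector name and associated value, i.e. Basic Materials - 2.48%)

Tags: r xpath web-scraping rvest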

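【Solution 1】:

Your data is not in an actual HTML table.

You can currently use the CSS selectors below, though much of the page looks dynamic and I suspect the attributes and classes will change in the future. I have tried to keep the selectors somewhat generic to compensate, but you should make them more generic still if you can.

I use CSS selectors throughout for flexibility and specificity. [] denotes an attribute selector and . denotes a class selector. The * in *= is the contains operator: it specifies that the value of the attribute on the left contains the string on the right. For example, [class*=screenerBorderGray] means that the class attribute contains the string screenerBorderGray.

The " ", ">" and "+" between selectors are called combinators. They specify the relationship between the nodes matched by successive parts of the selector sequence: a space matches any descendant, ">" matches a direct child, and "+" matches the immediately following sibling.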
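As a rough illustration of the attribute selector, the escaped class selector, and the three combinators described above, here is a minimal Python sketch that runs BeautifulSoup against a small inline HTML snippet. The markup is a hypothetical simplification and not a copy of Yahoo's actual page, and the second row's name and value are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used in this answer.
# "Sector B" / "10.00%" are invented placeholder values.
html = """
<div id="Col1-0-Holdings-Proxy">
  <section>
    <div class="Fl(start)">
      <div class="Bdbc($screenerBorderGray)"><span>Basic Materials</span><span class="chart"></span><span>2.48%</span></div>
      <div class="Bdbc($screenerBorderGray)"><span>Sector B</span><span class="chart"></span><span>10.00%</span></div>
    </div>
  </section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [id*=Holdings]                 attribute selector: id contains "Holdings"
# section > .Fl\(start\)         child combinator; parentheses in the class name are escaped
# [class*=screenerBorderGray]    class attribute contains "screenerBorderGray"
# > span:nth-child(1)            the first child span of each row (left column)
left = r"[id*=Holdings] section > .Fl\(start\) [class*=screenerBorderGray] > span:nth-child(1)"

# span + span:last-child         a span that directly follows another span
#                                and is the last child of its row (right column)
right = r"[id*=Holdings] section > .Fl\(start\) [class*=screenerBorderGray] span + span:last-child"

lhs = [node.text.strip() for node in soup.select(left)]
rhs = [node.text.strip() for node in soup.select(right)]

for title, value in zip(lhs, rhs):
    print(title, value)
```

The raw strings (r"...") keep Python from interpreting the backslashes, so the escapes reach the selector engine unchanged. In R the same backslashes have to be doubled inside a normal string, which is why the R selectors in this answer contain `\\(` and `\\)`.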

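I generate one list of nodes for the left column and one for the right column (ignoring the chart column in the middle). I then join them into the final data frame.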

library(rvest)
library(magrittr)

pg <- read_html('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')

lhs <- pg %>% 
  html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] > span:nth-child(1)') %>% 
  html_text()

rhs <- pg %>% 
  html_nodes('[id*=Holdings] section > .Fl\\(start\\) [class*=screenerBorderGray] span + span:last-child') %>% 
  html_text()

df <- data.frame(lhs,rhs) %>% set_names(., c('Title','value'))
df <- df[-c(3),] 
rownames(df) <- NULL
print(df)

Python:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://finance.yahoo.com/quote/xlk/holdings?p=xlk&guccounter=1')
soup = bs(r.content, 'lxml')
# Raw strings keep the CSS escapes (\( \) \$) intact and avoid invalid-escape warnings
lhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) > span:nth-child(1)')]
rhs = [i.text.strip() for i in soup.select(r'[id*=Holdings] section > .Fl\(start\) .Bdbc\(\$screenerBorderGray\) span + span:last-child')]
df = pd.DataFrame(zip(lhs, rhs), columns = ['Title','Value'])
df = df.drop([2]).reset_index(drop = True)
print(df)

References:

Row re-numbering: @thelatemail

【Discussion】:

Thanks for the in-depth answer, QHarr! Lifesaver!