【Title】: rvest html_table() error: max(p) returning -Inf
【Posted】: 2021-02-04 22:14:10
【Question】:

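I am trying to scrape a table from the web (here: https://www.cryptoslam.io/nba-top-shot/marketplace).

I have been researching how to do this, and the html_table() function from the rvest library seems to come closest. In fact, I can download the "FIFA World Cup record" table from https://en.wikipedia.org/wiki/Brazil_national_football_team with this code: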

webpage_url <- "https://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- xml2::read_html(webpage_url)
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[[6]] %>%
  html_table(fill = TRUE)

Note that I have loaded the libraries with library(xml2) and library(rvest). I then use essentially the same code here:

webpage_url <- "https://www.cryptoslam.io/nba-top-shot/marketplace"
webpage <- xml2::read_html(webpage_url)
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill = TRUE)

but I get this error:

Error in matrix(NA_character_, nrow = n, ncol = maxp) : 
  invalid 'ncol' value (too large or NA)
In addition: Warning messages:
1: In max(p) : no non-missing arguments to max; returning -Inf
2: In matrix(NA_character_, nrow = n, ncol = maxp) :
  NAs introduced by coercion to integer range

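I have not been able to find any discussion of this error anywhere else. One difference between the two tables is that the second one, the one that fails, contains a thead tag. My knowledge of HTML is very limited, so I may be missing some other important difference in how the tables are implemented.

【Comments】:

    Tags: javascript html r rvest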


    【Solution 1】:

    One approach is to use RSelenium:

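    library(RSelenium)
    library(rvest) # requires xml2, no need to load it separately
    # Same URL as in the question
    webpage_url <- "https://www.cryptoslam.io/nba-top-shot/marketplace"
    # Start a Selenium server with a Chrome session; chromever must match a driver available locally
    driver <- rsDriver(browser = "chrome", port = 4234L, chromever = "87.0.4280.87")
    client <- driver[["client"]]
    client$navigate(webpage_url)
    # Page source as the browser currently holds it
    source <- client$getPageSource()[[1]]
    
    # Parse every table on the page and keep the first one
    read_html(source) %>%
      html_nodes("table") %>%
      html_table() %>%
      `[[`(1) -> result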
    
    head(result)
                Listed  Rank                   Crypto        Set                  Team Play Category   SN# Current Price          Owner
    1 NA 5 minutes ago 10324     2020-21 Bradley Beal   Base Set    Washington Wizards       Handles 10691   (10.00 USD)        P1BenEe
    2 NA 5 minutes ago  1096     2019-20 Kelly Olynyk The Finals            Miami Heat         Layup   360  (180.00 USD) Top_Shot3point
    3 NA 5 minutes ago  3138      2019-20 Alex Caruso   Base Set    Los Angeles Lakers         Block   679     67.00 USD CaptainThunder
    4 NA 5 minutes ago  3586  2020-21 Kelly Oubre Jr.   Base Set Golden State Warriors          Dunk  3583      5.00 USD       dddd9999
    5 NA 5 minutes ago  3318  2020-21 Bismack Biyombo   Base Set     Charlotte Hornets         Layup  3315      7.00 USD       ectoasty
    6 NA 5 minutes ago  4940 2020-21 DeMarcus Cousins   Base Set       Houston Rockets     3 Pointer  4937    (3.00 USD) StoneColdBroke
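
A plausible reason a real browser helps here: if the page fills its table with JavaScript, the HTML that read_html() downloads contains only an empty table shell, and html_table() then has no rows to measure, which matches the max(p) warning in the question. The sketch below uses made-up markup (the `static_html` string is an assumption, not the site's actual source) to show one way to check for rows before calling html_table().

```r
library(rvest)

# Hypothetical static markup: a table shell that a script would fill in later
static_html <- '<html><body><table id="market"><thead></thead><tbody></tbody></table></body></html>'

page <- read_html(static_html)
tbl <- html_nodes(page, "table")[[1]]

# Count the <tr> elements inside the table; zero means there is nothing to parse yet
n_rows <- length(html_nodes(tbl, "tr"))

if (n_rows == 0) {
  message("Table has no rows in the static HTML; it is probably filled in by JavaScript.")
} else {
  result <- html_table(tbl)
}
```

If the count is also zero for the real page, the rows arrive after page load, which is the situation a browser session or the site's data API is meant to handle.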
    

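    【Comments】:

    • Thanks very much for your help! In case anyone else stumbles on this, I did not have much luck with the rsDriver() function. Many people switch to Docker to avoid the headache. Information can be found here: rpubs.com/johndharrison/RSelenium-Docker
    【Solution 2】:

    The data comes from an API POST request that returns JSON. You can make that request yourself and then parse the JSON, which content() returns as a list, into whatever format you want.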

    library(httr)
    
    headers = c(
      'user-agent'= 'Mozilla/5.0',
      'content-type'= 'application/json',
      'referer'= 'https://www.cryptoslam.io/',
      'accept-language'= 'en-GB,en-US;q=0.9,en;q=0.8'
    )
    
    data = '{"draw":1,"columns":[{"data":null,"name":"","searchable":true,"orderable":false,"search":{"value":"","regex":false}},{"data":null,"name":"TimeStamp","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":null,"name":"","searchable":true,"orderable":false,"search":{"value":"","regex":false}},{"data":null,"name":"Tokens.0.Rank","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":null,"name":"Tokens.Attributes.Name","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":null,"name":"Tokens.Attributes.Set","searchable":true,"orderable":false,"search":{"value":"","regex":false}},{"data":null,"name":"Tokens.Attributes.Team","searchable":true,"orderable":false,"search":{"value":"","regex":false}},{"data":null,"name":"Tokens.Attributes.PlayCategory","searchable":true,"orderable":false,"search":{"value":"","regex":false}},{"data":null,"name":"Tokens.Attributes.SerialNumber","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":null,"name":"CurrentPrice","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":null,"name":"EndingPriceGwei","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":null,"name":"","searchable":true,"orderable":false,"search":{"value":"","regex":false}},{"data":null,"name":"","searchable":true,"orderable":false,"search":{"value":"","regex":false}}],"order":[{"column":1,"dir":"desc"}],"start":0,"length":50,"search":{"value":"","regex":false},"startdate":"","enddate":"","marketplace":"","attributesQuery":{}}'
    
    # Nest the calls instead of piping, so the snippet does not depend on %>% being attached
    r <- httr::content(httr::POST(url = 'https://api2.cryptoslam.io/api/marketplace/NBA Top Shot/search', httr::add_headers(.headers = headers), body = data))
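
Once the response is parsed, the records still have to be flattened into a table. The thread does not show the structure of the response, so the sketch below uses a hypothetical `mock` list with invented field names (`rank`, `name`, `price`) and placeholder values. It only illustrates one base R way to turn a list of records into a data frame.

```r
# Hypothetical stand-in for a parsed JSON response: a list of records.
# The field names and values are placeholders, not the API's real schema.
mock <- list(
  data = list(
    list(rank = 1, name = "Player A", price = 10.5),
    list(rank = 2, name = "Player B", price = 180)
  )
)

# Convert each record to a one-row data frame, then stack the rows
rows <- lapply(mock$data, function(rec) {
  as.data.frame(rec, stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)

print(df)
```

For the real response, the element that holds the records and the fields inside each record would need to be checked first, for example with str() on the parsed list.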
    

    【Comments】:
