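[Question title]: Web-scraping popup table without submit button using R
[Posted]: 2019-05-25 17:03:06
[Question]:

I am trying to extract zip codes from "https://www.zipcodestogo.com/county-zip-code-list.htm", where the state and county will be supplied in a data set. Take Dale, Alabama as an example (as shown in the picture below). However, when I use Selector Gadget to extract the table, it does not show up, and when I view the page source I cannot find the table there either. I don't know how to solve this. I am very new to web scraping, so I apologize in advance if this is a silly question. Thank you.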

library(httr)
library(rvest)

zipurl = 'https://www.zipcodestogo.com/county-zip-code-list.htm'
query = list('State:'="Alabama",
             'Counties:'="Dale"
)
website = POST(zipurl, body = query, encode = "form")
# The zip code table does not show up among these nodes
tables <- html_nodes(content(website), css = 'table')

[Comments on the question]:

    Tags: css r web-scraping


    [Solution 1]:

    Same idea, but grab the table and drop the header row

    library(rvest)
    library(dplyr)  # provides slice()
    state = "ALABAMA"
    county = "DALE"
    url = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",state,"&county=",county)
    
    r <- read_html(url) %>%
      html_node("table table") %>%  # the nested table holds the zip codes
      html_table() %>%
      slice(-1)                     # drop the header row
    
    print(r)
    

    The zip code column alone is:

    r$X1
    

    You can also restrict the selection to the first column and drop the first row:

    r <- read_html(url) %>%
      html_nodes("table table td:nth-of-type(1)") %>%  # first cell of every row
      html_text()                                      # already a character vector
    
    print(r[-1])
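
Since the question mentions that state and county come from a data set, the URL construction above could be wrapped in a small helper and applied to each pair. The following is a minimal sketch, not a tested recipe for this site: `build_zip_url` and the `pairs` data frame are hypothetical names, and both the upper-casing (mirroring the "ALABAMA"/"DALE" example) and the percent-encoding of spaces are assumptions about what the endpoint accepts.

```r
# Endpoint taken from the answer above
base_url <- "https://www.zipcodestogo.com/lookups/countyZipCodes.php"

# Hypothetical helper: build the lookup URL for one state/county pair
build_zip_url <- function(state, county) {
  # Upper-casing mirrors the ALABAMA/DALE example; whether the site
  # requires it is an assumption
  state  <- utils::URLencode(toupper(trimws(state)),  reserved = TRUE)
  county <- utils::URLencode(toupper(trimws(county)), reserved = TRUE)
  paste0(base_url, "?state=", state, "&county=", county)
}

# Hypothetical input data set of state/county pairs
pairs <- data.frame(state  = c("Alabama", "New York"),
                    county = c("Dale", "New York"),
                    stringsAsFactors = FALSE)

# One URL per row; each could then be passed to read_html()
pairs$url <- mapply(build_zip_url, pairs$state, pairs$county,
                    USE.NAMES = FALSE)
print(pairs$url)
```

Each resulting URL can be fed to `read_html()` in place of `url`; adding a short `Sys.sleep()` between requests is a reasonable courtesy when looping over many counties.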
    

    [Comments]:

    • html_node("table table"), nice one
    • @Alexandregeorges + for you. I'm new to R and have a lot to learn.
    [Solution 2]:

    You can use the link that the browser shows under Inspect > Network tab.

    Here is a solution:

    library(rvest)
    
    state = "ALABAMA"
    county = "DALE"
    url_scrape = paste0("https://www.zipcodestogo.com/lookups/countyZipCodes.php?state=",state,"&county=",county) # Inspect > Network > XHR links
    
    # function => First letter Capital (needed for regexp)
    capwords <- function(s, strict = TRUE) { # You can find this function on the forum
      cap <- function(s) paste(toupper(substring(s, 1, 1)),
                               {s <- substring(s, 2); if(strict) tolower(s) else s},
                               sep = "", collapse = " " )
      sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
    }
    
    zip_codes = read_html(url_scrape) %>% html_nodes("td") %>% html_text()
    zip_codes = zip_codes[-c(1:6)] # Delete header
    string_regexp = paste0(capwords(state),"|View") # pattern as var
    # grepl() instead of -grep(): -grep() would drop everything if nothing matched
    zip_codes = zip_codes[!grepl(pattern = string_regexp, zip_codes)]
    is_zip = grepl("\\d", zip_codes) # cells containing digits are zip codes
    df = data.frame(zip = zip_codes[is_zip], label = zip_codes[!is_zip])
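
The filtering step above can also be done by position instead of by pattern. The sketch below assumes each table row yields its cells in the order zip code, place name, state, link, which is inferred from the filters used in this answer rather than confirmed. `extract_zip_labels` and the sample `cells` vector are hypothetical stand-ins for the output of `html_text()`.

```r
# Hypothetical helper: pair each 5-digit zip with the cell that follows it
extract_zip_labels <- function(cells) {
  cells <- trimws(cells)
  # Positions of cells that are exactly five digits
  zip_pos <- which(grepl("^[0-9]{5}$", cells))
  data.frame(zip   = cells[zip_pos],
             label = cells[zip_pos + 1],  # assumed: name follows the zip
             stringsAsFactors = FALSE)
}

# Illustrative cell texts, standing in for html_text() on the <td> nodes
cells <- c("36322", "Daleville", "Alabama", "View Map",
           "36360", "Ozark",     "Alabama", "View Map")

zips <- extract_zip_labels(cells)
print(zips)
```

Pairing by position avoids dropping a place whose name happens to match the state name, which a pattern-based filter on the state would remove.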
    

    [Comments]:
