【Question title】: RSelenium scraping loop and list
【Posted】: 2017-11-16 10:36:34
【Question】:

I am trying to use this code:

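library(RSelenium)
library(XML)  # provides htmlTreeParse() and readHTMLTable() used below
# checkForServer() and startServer() are defunct in current RSelenium;
# rsDriver() starts the server and opens a browser session.
rD <- rsDriver()
remDr <- rD[["client"]]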

appURL <- 'http://www.mtmis.excise-punjab.gov.pk'
remDr$navigate(appURL)
remDr$findElement("name", "vhlno")$sendKeysToElement(list("ria-07-777"))

The CSS selector was not recognized:

remDr$findElements("class", "ent-button-div")[[1]]$clickElement()

After the search query:

elem <- remDr$findElement(using="class", value="result-div")
elemtxt <- elem$getElementAttribute("outerHTML")[[1]]
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=TRUE)
final <- readHTMLTable(elemxml)

remDr$close()
rD[["server"]]$stop()

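What I want is an automated "for loop" over the different vehicles in a list, with all vehicles merged into one final table that carries a unique identifier such as "ria-07-777".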

list <- c("ria-07-776", "ria-07-777", "ria-07-778")

【Question comments】:

    Tags: r for-loop web-scraping rselenium


    【Solution 1】:

    Why do you need Selenium?

    library(httr)
    library(rvest)
    
    clean_cols <- function(x) {
      x <- tolower(x)
      x <- gsub("[[:punct:][:space:]]+", "_", x)
      x <- gsub("_+", "_", x)
      x <- gsub("(^_|_$)", "", x)
      make.unique(x, sep = "_")
    }
    
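    get_vehicle_info <- function(vhlno) {
    
      # Submit the search form directly, sending the has_js cookie with the plate number
      POST(
        url = 'http://www.mtmis.excise-punjab.gov.pk/',
        set_cookies(has_js=1),
        body = list(vhlno=vhlno)
      ) -> res
    
      stop_for_status(res) # raise an R error on an HTTP 4xx/5xx response
    
      pg <- content(res)
      # Keep only result-table rows that have a cell without a colspan attribute
      rows <- html_nodes(pg, xpath=".//div[contains(@class, 'result-div')]/table/tr[td[not(@colspan)]]")
    
      # Build a one-row data frame: td[1] gives the column names, td[2] the values
      cbind.data.frame(
        as.list(
          setNames(
            html_text(html_nodes(rows, xpath=".//td[2]")),
            clean_cols(html_text(html_nodes(rows, xpath=".//td[1]")))
          )
        ),
        stringsAsFactors=FALSE
      )
    
    }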
    

    Now use the function above:

    vehicles <- c("ria-07-776", "ria-07-777", "ria-07-778")
    
    Reduce(
      rbind.data.frame,
      lapply(vehicles, function(v) {
        Sys.sleep(5) # your desire to steal a bunch of vehicle info to make a sketch database does not give you the right to hammer the server, and you'll very likely remove this line anyway, but I had to try
        get_vehicle_info(v)
      })
    ) -> vehicle_df
    
    str(vehicle_df)
    ## 'data.frame': 3 obs. of  12 variables:
    ##  $ registration_number: chr  "ria-07-776" "ria-07-777" "ria-07-778"
    ##  $ chassis_number     : chr  "KZJ95-0019869" "NFBFD15746R101101" "NZE1206066278"
    ##  $ engine_number      : chr  "1KZ-0375851" "R18A11981105" "X583994"
    ##  $ make_name          : chr  "LAND - CRUISER" "HONDA - CIVIC" "TOYOTA - COROLLA"
    ##  $ registration_date  : chr  "17-Dec-2007 12:00 AM" "01-Aug-2007 12:00 AM" "01-Jan-1970 12:00 AM"
    ##  $ model              : chr  "1997" "2006" "2007"
    ##  $ vehicle_price      : chr  "1,396,400" "1,465,500" "0"
    ##  $ color              : chr  "MULTI" "GRENDA B.P" "SILVER"
    ##  $ token_tax_paid_upto: chr  "June 2015" "June 2011" "June 2016"
    ##  $ owner_name         : chr  "FATEH DIN AWAN" "M BILAL YASIN" "MUHAMMAD ALTAF"
    ##  $ father_name        : chr  "HAFIZ ABDUL HAKEEM AWAN" "CH M. YASIN" "NAZAR MUHAMMAD"
    ##  $ owner_city         : chr  "RAWALPINDI" "ISLAMABAD" "SARGODHA"
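
The column names in the `str()` output come from running the table's label cells through `clean_cols()`. As a minimal sketch of what that helper does, the snippet below repeats its definition from the answer and applies it to a few labels. The label strings are hypothetical (the page's actual wording is not shown here); the repeated label is included only to show how `make.unique()` handles duplicates.

```r
# Copied from the answer so this snippet runs on its own.
clean_cols <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)  # punctuation/space runs -> "_"
  x <- gsub("_+", "_", x)                     # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                 # trim leading/trailing underscore
  make.unique(x, sep = "_")                   # de-duplicate repeated names
}

# Hypothetical labels; the real page's wording may differ.
labels <- c("Registration Number:", "Token Tax Paid Upto:",
            "Owner Name", "Owner Name")

cleaned <- clean_cols(labels)
print(cleaned)
# "registration_number" "token_tax_paid_upto" "owner_name" "owner_name_1"
```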
    

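    You will need to handle network and scraping errors yourself. I can't justify going any further with what may be an unethical effort (the answer is more to help others with similar questions).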
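
One possible shape for that error handling, offered as a sketch rather than as part of the original answer: wrap each lookup in `tryCatch()`, retry a few times, and record the plates that never succeeded. The names `safe_fetch` and `fake_fetch` are hypothetical. `fake_fetch` stands in for `get_vehicle_info()` so the example runs offline, and the retry count and wait time are arbitrary assumptions.

```r
# Hypothetical helper: call fetch_fun(vhlno), retrying when it raises an error.
# Returns NULL (after printing a message) if every attempt fails.
safe_fetch <- function(vhlno, fetch_fun, retries = 3, wait = 5) {
  for (attempt in seq_len(retries)) {
    res <- tryCatch(fetch_fun(vhlno), error = function(e) e)
    if (!inherits(res, "error")) return(res)
    if (attempt < retries) Sys.sleep(wait)  # pause before the next attempt
  }
  message("giving up on ", vhlno, ": ", conditionMessage(res))
  NULL
}

# Stand-in for get_vehicle_info() so the sketch runs without network access;
# it fails for one plate to exercise the error path.
fake_fetch <- function(vhlno) {
  if (vhlno == "ria-07-777") stop("simulated network failure")
  data.frame(registration_number = vhlno, stringsAsFactors = FALSE)
}

vehicles <- c("ria-07-776", "ria-07-777", "ria-07-778")

results <- lapply(vehicles, function(v) {
  safe_fetch(v, fake_fetch, retries = 2, wait = 0)
})

# Plates whose lookup never succeeded, kept for a later retry or a log.
failed <- vehicles[vapply(results, is.null, logical(1))]

# Bind only the successful one-row data frames.
vehicle_df <- do.call(rbind, Filter(Negate(is.null), results))
```

With the real function, `fake_fetch` would be replaced by `get_vehicle_info` and `wait` left at a few seconds so the server is not hit repeatedly in quick succession.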

    【Comments】:
