【Question title】: R/Python webscraping of websites [closed]
【Posted】: 2014-05-12 08:02:57
【Question】:

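I would like to summarise the data available on the Alsop website (http://www.auction.co.uk/residential/onlineCatalogue.asp).

Ideally, I would like to end up with a data.frame containing the following fields from the website:

Lot number, type, location/full address, guide price, number of bedrooms, and the URL of any photos.

I tried using Google Chrome's Inspect Element together with htmlParse (my usual way of finding links), but I get the same URL for every lot number, i.e. http://www.auction.co.uk/residential/LotDetails.asp?A=877&MP=24&ID=877000001&S=L&O=A

So I am a bit stumped, because my usual method of scraping a site by looking for links no longer works.

I prefer R, but I would understand if Python is more useful here, and I am open to suggestions on how to achieve this.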

【Comments】:

    Tags: python r web-scraping


    【Solution 1】:

    You can use Selenium to get the data.

    library(RSelenium)
    # startServer() comes from older RSelenium releases; newer releases
    # replace it with rsDriver().
    RSelenium::startServer()
    Sys.sleep(5)  # give the Selenium server time to start
    appUrl <- "http://www.auction.co.uk/residential/onlineCatalogue.asp"
    remDr <- remoteDriver()
    remDr$open()
    remDr$navigate(appUrl)
    webElem <- remDr$findElement("css selector", '[href="onlineCatalogue.asp"]')
    # check element
    webElem$highlightElement()
    # click link
    webElem$clickElement()
    # get the links to the remaining catalogue pages
    webElems <- remDr$findElements("css selector", "#Table7 a[href]")
    appUrl <- c(appUrl, sapply(webElems, function(x) x$getElementAttribute("href")[[1]]))
    # visit each page and keep the outer HTML of the results table
    out <- lapply(appUrl, function(x) {
      remDr$navigate(x)
      webElem <- remDr$findElement("id", "Table6")
      webElem$getElementAttribute("outerHTML")[[1]]
    })
    remDr$close()
    remDr$closeServer()
    

    Now we can process the HTML.

    # Process the HTML tables
    library(XML)      # htmlParse, getNodeSet, xpathSApply, getChildrenStrings
    library(selectr)  # querySelectorAll
    asDF <- lapply(out, function(appData) {
      xData <- htmlParse(appData)
      lotAndLoc <- querySelectorAll(xData, "a.tooltip")
      # keep every other a.tooltip node and pull out lot number and image URL
      alsopLot <- lapply(lotAndLoc[c(TRUE, FALSE)], function(node) {
        lot <- getNodeSet(node, ".//span[@class = 'lotnum']")
        lot <- xmlValue(lot[[1]])
        img <- getNodeSet(node, ".//img")
        img <- xmlGetAttr(img[[1]], "src")
        data.frame(lot = lot, img = img)
      })
      alsopLot <- do.call(rbind.data.frame, alsopLot)
      # [-1] drops the header row
      alsopType <- xpathSApply(xData, "//tr/td[2]", xmlValue)[-1]
      alsopPrice <- xpathSApply(xData, "//tr/td[4]", xmlValue)[-1]
      # strip stray "Â" characters (encoding debris in front of the pound sign)
      alsopPrice <- gsub("Â", "", alsopPrice)
      alsopAddr <- xpathSApply(xData, "//tr/td[3]/*//span[@class='text']", function(node) {
        Addr <- getChildrenStrings(node)[names(getChildrenStrings(node)) %in% c("text", "span")]
        Addr <- gsub("\\n\\s*", "", Addr)
        Addr <- Addr[Addr != ""]
        # join the address pieces with "~" so they can be split again later
        paste(Addr, collapse = "~")
      })

      alsopDf <- data.frame(type = alsopType, price = alsopPrice, address = alsopAddr)
      alsopDf <- cbind.data.frame(alsopLot, alsopDf)
      alsopDf
    })
    # stack the per-page data frames into one
    asDF <- do.call(rbind.data.frame, asDF)
    

    You will need to tidy up the addresses, but the rest of the data is as you wanted.

    > head(asDF)
      lot                                                                   img
    1   1 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg
    2   2 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp2.jpg
    3   3 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp3.jpg
    4   4 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp4.jpg
    5   5 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp5.jpg
    6   6 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp6.jpg
                                type               price
    1        VACANT - Leasehold Flat           £225,000+
    2        VACANT - Leasehold Flat           £160,000+
    3     VACANT - Freehold Building           £250,000+
    4        VACANT - Leasehold Flat           £180,000+
    5                 Freehold House           £180,000+
    6 INVESTMENT - Freehold Building £110,000 - £120,000
                                                                      address
    1 1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR
    2                                   2~London W3~17 York Road~Acton~W3 6TS
    3                 3~London SE27~23 Thurlestone Road~West Norwood~SE27 0PE
    4             4~London N16~Flat G~74 Darenth Road~Stoke Newington~N16 6ED
    5                              5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX
    6                                  6~Ilford~52 Balfour Road~Essex~IG1 4JG
    

    The data frame asDF has the required number of lots:

    > str(asDF)
    'data.frame':   347 obs. of  5 variables:
     $ lot    : Factor w/ 347 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ img    : Factor w/ 347 levels "http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ type   : Factor w/ 102 levels "Freehold Building",..: 30 30 23 30 2 5 23 1 1 19 ...
     $ price  : Factor w/ 151 levels "£1.25M - £1.5M",..: 31 19 33 21 21 9 54 68 68 68 ...
     $ address: Factor w/ 347 levels "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",..: 1 14 27 38 49 60 71 82 94 2 ...
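
The address column above still holds "~"-joined strings. As a rough sketch of the tidying step, written in Python since the question is open to either language, the strings could be split along the lines below. The function name `tidy_address` and the field names are hypothetical, and the layout (lot number, then area, then street lines, then postcode) is an assumption read off the six rows printed above; it may not hold for every lot.

```python
def tidy_address(raw: str) -> dict:
    """Split a '~'-joined address string into labelled parts.

    Assumed layout (inferred from the sample rows only):
    lot ~ area ~ one or more street lines ~ postcode
    """
    parts = [p.strip() for p in raw.split("~") if p.strip()]
    if len(parts) < 3:
        # Too few pieces to apply the assumed layout; keep the text untouched.
        return {"lot": None, "area": None, "street": raw, "postcode": None}
    lot, area, *street, postcode = parts
    return {
        "lot": lot,
        "area": area,
        "street": ", ".join(street),
        "postcode": postcode,
    }


rows = [
    "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",
    "5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX",
]
for row in rows:
    print(tidy_address(row))
```

Rows that do not fit the assumed layout are returned with the raw text in `street`, so they can be checked by hand rather than silently mis-split.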
    

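    【Comments】:

    • Wow! That was very quick and very good! I did not know about the RSelenium package... thanks. I noticed that the address is not in the output... is it missing?
    • The address is the 5th column in the data frame. I left it out of the printout because it needs further processing.
    • Ah! Understood, thanks!
    • @h.l.m I added some extra code to tidy the address and added the output of the data frame.
    • Out of curiosity, how would this be adjusted to account for more than one page, i.e. the lots go up to 334 and there are 347 lots in total...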