[Posted]: 2014-07-15 14:10:40
[Question]:
This is a hard problem to put in one sentence, but I am trying to scrape some HTML from the following page:
http://www.ncbi.nlm.nih.gov/snp/?term=(human[Organism])+AND+GLRA3[Gene Name]
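I can scrape what I need with R, but because the browser shows only the first 20 entries, I only get the HTML for those 20. This is a problem because I want to scrape all of the entries, not just the ones on the page the browser serves. Anyway, here is my R code: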
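library(XML)
library(httr)
# Go to Nectar Mutation and get SNP refs
# Build the dbSNP search URL for the gene of interest
dbsnp.searchterm <- "(human[Organism])+AND+GLRA1[Gene Name]"
dbsnp.url <- paste0("http://www.ncbi.nlm.nih.gov/snp/?term=", dbsnp.searchterm)
# Fetch the results page and read the response body as text
dbsnp.get <- GET(dbsnp.url)
dbsnp.content <- content(dbsnp.get, as = "text")
# Collect the href of every link that points at snp_ref
links <- xpathSApply(htmlParse(dbsnp.content), "//a[contains(@href, 'snp_ref')]", xmlGetAttr, "href")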
Result:
> links
[1] "/projects/SNP/snp_ref.cgi?rs=116474260"
[2] "/projects/SNP/snp_ref.cgi?rs=121918408"
[3] "/projects/SNP/snp_ref.cgi?rs=121918409"
[4] "/projects/SNP/snp_ref.cgi?rs=121918410"
[5] "/projects/SNP/snp_ref.cgi?rs=121918411"
[6] "/projects/SNP/snp_ref.cgi?rs=121918412"
[7] "/projects/SNP/snp_ref.cgi?rs=121918413"
[8] "/projects/SNP/snp_ref.cgi?rs=121918414"
[9] "/projects/SNP/snp_ref.cgi?rs=121918415"
[10] "/projects/SNP/snp_ref.cgi?rs=121918416"
[11] "/projects/SNP/snp_ref.cgi?rs=121918417"
[12] "/projects/SNP/snp_ref.cgi?rs=121918418"
[13] "/projects/SNP/snp_ref.cgi?rs=267600494"
[14] "/projects/SNP/snp_ref.cgi?rs=267606848"
[15] "/projects/SNP/snp_ref.cgi?rs=281864912"
[16] "/projects/SNP/snp_ref.cgi?rs=281864913"
[17] "/projects/SNP/snp_ref.cgi?rs=281864914"
[18] "/projects/SNP/snp_ref.cgi?rs=281864915"
[19] "/projects/SNP/snp_ref.cgi?rs=281864916"
[20] "/projects/SNP/snp_ref.cgi?rs=281864917"
As you can see, this returns only 20 links, whereas I need all 4058 entries.
[Discussion]:
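One possible direction, offered as a rough sketch and not a confirmed solution: instead of scraping the paginated HTML, the search could be sent to NCBI's E-utilities `esearch` endpoint, which returns record IDs as XML and accepts `retstart` and `retmax` for paging. The helper names below (`esearch_url`, `page_offsets`, `parse_ids`, `fetch_all_ids`) are hypothetical, the page size of 500 is an arbitrary choice, and it is an assumption worth verifying that the IDs returned for `db=snp` match the `rs` numbers in the links above.

```r
library(XML)

# Build an esearch URL for one page of results.
# retstart is a 0-based offset; retmax is the page size.
esearch_url <- function(term, retstart = 0, retmax = 500) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
         "?db=snp",
         "&term=", URLencode(term, reserved = TRUE),
         "&retstart=", retstart,
         "&retmax=", retmax)
}

# Offsets needed to cover `total` records in pages of `page_size`.
page_offsets <- function(total, page_size = 500) {
  seq(0, total - 1, by = page_size)
}

# Extract the <Id> values from an esearch XML response.
parse_ids <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  xpathSApply(doc, "//IdList/Id", xmlValue)
}

# Fetch every page and concatenate the IDs (needs network access and httr).
# The pause keeps the request rate under NCBI's limit for clients without an API key.
fetch_all_ids <- function(term, total, page_size = 500) {
  pages <- lapply(page_offsets(total, page_size), function(start) {
    resp <- httr::GET(esearch_url(term, start, page_size))
    Sys.sleep(0.4)
    parse_ids(httr::content(resp, as = "text"))
  })
  unlist(pages)
}
```

Under those assumptions, something like `ids <- fetch_all_ids("human[Organism] AND GLRA1[Gene Name]", total = 4058)` would collect the IDs, and `paste0("/projects/SNP/snp_ref.cgi?rs=", ids)` would rebuild links in the same form as the output above. The total could also be read from the `<Count>` element of the first response instead of being hard-coded.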
Tags: html r xpath web-scraping