【Posted】: 2016-06-02 16:09:38
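【Question】:
I am trying to scrape data from the UCAS website to display all the Uni names from every page returned by a basic search.
So far, without the loop working, it displays the names of all the universities on the first page, along with some unrelated text, as shown below: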
"The University of Aberdeen
Abertay University
Aberystwyth University
ABI College
Abingdon and Witney College
The Academy of Contemporary Music
Access to Music
Accrington & Rossendale College
Activate Learning (Oxford, Reading, Banbury & Bicester)
The College of Agriculture, Food and Rural Enterprise
Amersham & Wycombe College
Amsterdam Fashion Academy
Anglia Ruskin University
Anglo European College of Chiropractic
Arden University (RDI)
University of the Arts London
Arts University Bournemouth (formerly University College)
ARU London
Askham Bryan College
Aston University, Birmingham
Availability
Applying through Extra
Single/Combined subjects
Provider types
How you study
Qualification level
Conservatoire specialism"
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'mechanize'

mechanize = Mechanize.new
doc = mechanize.get('http://search.ucas.com/')
form = doc.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'
doc = form.submit
doc.search('li.results clearfix').each do |h3|
  puts h3.text.strip
  while a = doc.at('div.pagerclearfix a')
    doc = Nokogiri::HTML(open(a[:href]))
    doc.search('results clearfix').each do |h3|
      puts h3.text.strip
    end
  end
end
【Comments】:
-
What exactly is your question? Are you only getting results from the first page rather than from all pages?
-
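Yes, the first puts seems to print, but the loop doesn't seem to work, so the puts inside it never runs. I think the problem is with div.pagerclearfix: when I inspect the element on the page, it is called pager.clearfix.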
-
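Welcome to SO. Please read "minimal reproducible example". We need a minimal HTML sample in the question itself. Some people trying to help can't access the internet, or don't want to sift through a large file to find the tags in question.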
Tags: ruby-on-rails ruby web-scraping nokogiri mechanize