Scanning a webpage for URLs with Ruby and regex

【Question title】: Scanning a webpage for URLs with Ruby and regex
【Posted】: 2017-02-18 00:30:08
【Question】:

I'm trying to build an array containing all the links from the URL below. Using page.scan(URI.regexp) or URI.extract(page) returns more than just URLs.

How do I get only the URLs?

require 'net/http'
require 'uri'

uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)

【Comments】:

Tags: ruby uri net-http

【Solution 1】:

If you just want to extract the links (<a href="..."> elements) from that text file, it is better to parse it as real HTML with Nokogiri and pull the links out that way:

require 'nokogiri'
require 'open-uri'

# Parse the raw HTML text
# (URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs)
doc = Nokogiri.parse(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))

# Extract all a-elements (HTML links)
all_links = doc.css('a')

# Sort + weed out duplicates and empty links
links = all_links.map { |link| link.attribute('href').to_s }.uniq.
        sort.delete_if { |h| h.empty? }

# Print out some of them
puts links.grep(/store/)

http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...
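
The href values collected this way are not guaranteed to be full URLs: a page can also contain relative paths, fragment-only links, or mailto: links. As a minimal sketch (not part of the original answer), the helper below keeps only absolute http(s) URLs using the standard uri library. The name absolute_http_links, the base URL, and the sample hrefs are assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against an assumed base URL and
# keep only absolute http/https URLs, without duplicates.
def absolute_http_links(hrefs, base)
  links = hrefs.filter_map do |href|
    href = href.to_s.strip
    # Skip empty hrefs and fragment-only links such as "#top".
    next if href.empty? || href.start_with?('#')

    begin
      url = URI.join(base, href)
    rescue URI::Error
      next # skip hrefs that cannot be parsed as URIs
    end

    # URI::HTTPS is a subclass of URI::HTTP, so this keeps both schemes.
    url.to_s if url.is_a?(URI::HTTP)
  end
  links.uniq
end

# Sample input (assumed for illustration, not taken from the gist).
sample_hrefs = [
  'http://store.steampowered.com/app/214590/',
  '/app/218090/',
  'mailto:someone@example.com',
  '#top',
  ''
]
puts absolute_http_links(sample_hrefs, 'http://store.steampowered.com/')
```

In the script above, this could be applied to the links array, with the base set to whichever page the HTML originally came from.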

【Comments】: