How can I get the original URL from an archive.is short link using python?

【Question title】: How can I get the original URL from an archive.is short link using python?
【Posted】: 2018-06-21 21:45:38
【Question description】:

I want to write a function that takes an archive.is (or archive.fo, archive.li, or archive.today) link as input and returns the URL of the original site as output.

For example, if the input is 'http://archive.is/9mIro', then I want the output to be 'http://www.dailytelegraph.com.au/news/nsw/australian-army-bans-male-recruits-to-get-female-numbers-up/news-story/69ee9dc1d4f8836e9cca7ca2e3e5680a'.

How can I do this in python?

【Question comments】:

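After doing some research, the approach I plan to take (unless someone has a better idea) is to use BeautifulSoup to read the href field of the <link rel="bookmark" href="..."> in the head of the archive page, and then use a regular expression to extract the original URL from it.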
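The approach described in this comment could be sketched roughly as follows. To keep the example free of dependencies it swaps BeautifulSoup for the standard library's html.parser, and it parses an HTML string rather than fetching the page. The layout of the bookmark href (archive host, then a numeric timestamp, then the original URL) is an assumption, and the sample markup and the names BookmarkLinkParser, BOOKMARK_PATTERN, and original_url_from_html are hypothetical.

```python
import re
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collects the href of the first <link rel="bookmark"> tag."""

    def __init__(self):
        super().__init__()
        self.bookmark_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first bookmark link is kept.
        if tag != "link" or self.bookmark_href is not None:
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "bookmark" in rel_values:
            self.bookmark_href = attr_map.get("href")


# Assumed href layout: scheme://archive-host/<numeric timestamp>/<original URL>
BOOKMARK_PATTERN = re.compile(r"^https?://[^/]+/\d+/(?P<original>https?://.+)$")


def original_url_from_html(html):
    """Return the original URL embedded in the bookmark link, or None."""
    parser = BookmarkLinkParser()
    parser.feed(html)
    if parser.bookmark_href is None:
        return None
    match = BOOKMARK_PATTERN.match(parser.bookmark_href)
    return match.group("original") if match else None


# Hypothetical markup standing in for the head of an archive page.
sample = (
    '<html><head><link rel="bookmark" '
    'href="http://archive.today/20160418203634/http://www.example.com/news/story"/>'
    "</head></html>"
)
print(original_url_from_html(sample))  # http://www.example.com/news/story
```

If the real pages use a different href layout, only BOOKMARK_PATTERN would need to change.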

Tags: python web-services url short-url

【Solution 1】:

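Yes, your approach might work for another site, but archive.is seems to protect its data against automated queries: when I tried curl and python (urllib2), I got the error Empty reply from server. You need something that mimics a real browser, such as phantomjs. Even then, I believe it will only work for a few queries before the site shows a captcha or returns an error. They also seem to log IP addresses: even phantomjs gets an error when it runs on the same machine that was used for the curl or python attempts.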

Here is the phantomjs code:

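// Usage: phantomjs script.js <archive short URL>
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;

// Present a desktop browser user agent, since the site rejects obvious bots.
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';

// Open the archive page and pass the original URL (or null on failure) to cb.
function getOriginalUrl(shortUrl, cb) {
  page.open(shortUrl, function(status) {
    if (status !== 'success') {
      cb(null);
      return;
    }
    // Read the value of the first input inside a form on the page.
    var url = page.evaluate(function() {
      var input = document.querySelector('form input');
      return input ? input.value : null;
    });
    cb(url);
  });
}

if (args.length > 1) {
  getOriginalUrl(args[1], function(url) {
    console.log(url === null ? 'Could not read the original URL' : url);
    phantom.exit();
  });
} else {
  console.log('Pass url');
  phantom.exit();
}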
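
Since the question asked for python, the extraction step of the script above (reading the value of the first input inside a form) could be sketched like this with the standard library only. This is not a tested solution: as noted in this answer, plain HTTP clients may get an empty reply or a captcha, so fetch_html is an untested assumption that sending a browser-like User-Agent is worth trying. The names FirstFormInputParser, first_form_input_value, fetch_html, and DEFAULT_USER_AGENT are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser

# Same user agent string as the phantomjs script; any realistic one could be used.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
)


class FirstFormInputParser(HTMLParser):
    """Records the value attribute of the first <input> nested in a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.found = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0 and not self.found:
            self.found = True
            self.value = dict(attrs).get("value")

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def first_form_input_value(html):
    """Return the value of the first form input in html, or None."""
    parser = FirstFormInputParser()
    parser.feed(html)
    return parser.value


def fetch_html(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Download a page with a browser-like User-Agent (may still be blocked)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call such as first_form_input_value(fetch_html('http://archive.is/9mIro')) would combine the two steps, provided the request is not rejected.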

【Comments】: