【Question title】:How to bypass the Oracle ADF loopback script when scraping a website with the PHP cURL library?
【Posted】:2019-05-28 11:33:58
【Question】:

I am scraping a website that uses the Oracle ADF loopback script, which keeps redirecting me back to the same page. How can I bypass it?

Here is my PHP code.

<?php
    $url = 'https://www.mywebsite.com/faces/index.jspx';
    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $headers = [];
    $headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36';
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // disables certificate verification
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follows HTTP Location redirects only; cURL does not run JavaScript
    $data = curl_exec($ch);
    if (curl_errno($ch)) { // check for execution errors before closing the handle
      echo 'Scraper error: ' . curl_error($ch);
      curl_close($ch);
      exit;
    }
    curl_close($ch);
    echo $data;
?>

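When I run the code above, I am redirected back to the same page,

with some query string parameters added, for example ?_afrLoop=39478247795404&_afrWindowMode=0&_afrWindowId=null

On the real site, _afrWindowId is a random alphanumeric string, but I get null.

After manually stopping the page redirect, I found that the page I receive contains the following Oracle loopback script,

which causes the redirect. What should I do?

Loopback script: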

    <html lang="el-GR"><head><script>
/*
** Copyright (c) 2008, Oracle and/or its affiliates. All rights reserved.
*/

/**
 * This is the loopback script to process the url before the real page loads. It introduces
 * a separate round trip. During this first roundtrip, we currently do two things: 
 * - check the url hash portion, this is for the PPR Navigation. 
 * - do the new window detection
 * the above two are both controled by parameters in web.xml
 * 
 * Since it's very lightweight, so the network latency is the only impact. 
 * 
 * here are the list of will-pass-in parameters (these will replace the param in this whole
 * pattern: 
 *        viewIdLength                           view Id length (characters), 
 *        loopbackIdParam                        loopback Id param name, 
 *        loopbackId                             loopback Id,
 *        loopbackIdParamMatchExpr               loopback Id match expression, 
 *        windowModeIdParam                      window mode param name, 
 *        windowModeParamMatchExpr               window mode match expression, 
 *        clientWindowIdParam                    client window Id param name, 
 *        clientWindowIdParamMatchExpr           client window Id match expression, 
 *        windowId                               window Id, 
 *        initPageLaunch                         initPageLaunch, 
 *        enableNewWindowDetect                  whether we want to enable new window detection
 *        jsessionId                             session Id that needs to be appended to the redirect URL
 *        enablePPRNav                           whether we want to enable PPR Navigation
 *
 */

var id = null; 
var query = null; 
var href = document.location.href; 
var hashIndex = href.indexOf("#"); 
var hash = null;

/* process the hash part of the url, split the url */
if (hashIndex > 0) 
{ 
  hash = href.substring(hashIndex + 1); 
  /* only analyze hash when pprNav is on (bug 8832771) */
  if (false && hash && hash.length > 0) 
  { 
    hash = decodeURIComponent(hash); 
    if (hash.charAt(0) == "@") 
    { 
      query = hash.substring(1); 
    } 
    else 
    { 
      var state = hash.split("@"); 
      id = state[0]; 
      query = state[1]; 
    } 
  } 
  href = href.substring(0, hashIndex); 
} 

/* process the query part */
var queryIndex = href.indexOf("?"); 
if (queryIndex > 0) 
{
  /* only when pprNav is on, we take in the query from the hash portion */
  query = (query || (id && id.length>0))? query: href.substring(queryIndex); 
  href = href.substring(0, queryIndex); 
} 

var jsessionIndex = href.indexOf(';');
if (jsessionIndex > 0)
{
  href = href.substring(0, jsessionIndex);
}

/* we will replace the viewId only when pprNav is turned on (bug 8832771) */
if (false) 
{
  if (id != null && id.length > 0) 
  { 
    href = href.substring(0, href.length - 11) + id;
  } 
}

var isSet = false; 
if (query == null || query.length == 0) 
{ 
  query = "?"; 
} 
else if (query.indexOf("_afrLoop=") >= 0) 
{ 
  isSet = true; 
  query = query.replace(/_afrLoop=[^&]*/, "_afrLoop=39279593944826"); 
} 
else 
{ 
  query += "&"; 
} 
if (!isSet) 
{ 
  query = query += "_afrLoop=39279593944826"; 
} 

/* below is the new window detection logic */
var initWindowName = "_afr_init_"; // temporary window name set to a new window
var windowName = window.name;

// if the window name is "_afr_init_", treat it as redirect case of a new window
if ((true) && (!windowName || windowName==initWindowName || 
    windowName!="null"))  
{ 
  /* append the _afrWindowMode param */
  var windowMode;
  if (true) 
  {
    /* this is the initial page launch case, 
       also this could be that we couldn't detect the real windowId from the server side */
    windowMode=0;
  }
  else if ((href.indexOf("/__ADFvDlg__") > 0) || (query.indexOf("__ADFvDlg__") >= 0))
  {
    /* this is the dialog case */
    windowMode=1;
  }
  else 
  {
    /* this is the ctrl-N case */
    windowMode=2;
  }

  if (query.indexOf("_afrWindowMode=") >= 0) 
  { 
    query = query.replace(/_afrWindowMode=[^&]*/, "_afrWindowMode="+windowMode); 
  } 
  else 
  { 
    query = query += "&_afrWindowMode="+windowMode; 
  } 

  /* append the _afrWindowId param */
  var clientWindowId;
  /* in case we couldn't detect the windowId from the server side */
  if (!windowName || windowName == initWindowName) 
  {
    clientWindowId = "null";

    // set window name to an initial name so we can figure out whether a page is loaded from
    // cache when doing Ctrl+N with IE
    window.name = initWindowName;
  }
  else 
  {
    clientWindowId = windowName;
  }  

  if (query.indexOf("_afrWindowId=") >= 0) 
  { 
    query = query.replace(/_afrWindowId=\w*/, "_afrWindowId="+clientWindowId); 
  } 
  else 
  { 
    query = query += "&_afrWindowId="+clientWindowId; 
  } 

}

var sess = "";

if (sess.length > 0)
  href += sess; 

/* if pprNav is on, then the hash portion should have already been processed */
if ((false) || (hash == null))
  document.location.replace(href + query);
else 
  document.location.replace(href + query + "#" + hash);
</script>
</head>
</html>
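
Because cURL does not execute JavaScript, the redirect performed by this script has to be reproduced by hand if the scraper is to move past the loopback page. The following is a minimal sketch, assuming the server accepts the same query string the script builds on an initial page launch (`_afrLoop` taken from the returned page, `_afrWindowMode=0`, `_afrWindowId=null`). The helper name `buildLoopbackUrl` is hypothetical, and there is no guarantee that a given ADF site will then serve the real page.

```php
<?php
// Hypothetical helper: rebuilds the URL the loopback script would redirect to,
// using the _afrLoop value that the server embedded in the returned page.
// Returns null when the HTML does not look like a loopback page.
function buildLoopbackUrl(string $url, string $loopbackHtml): ?string
{
    // The script hard-codes the loop id, e.g. "_afrLoop=39279593944826".
    if (!preg_match('/_afrLoop=(\d+)/', $loopbackHtml, $m)) {
        return null;
    }

    // Split off the fragment and the query string, as the script does.
    $url = explode('#', $url, 2)[0];
    [$base, $query] = array_pad(explode('?', $url, 2), 2, '');
    $base = explode(';', $base, 2)[0]; // drop any ;jsessionid suffix

    // Remove stale copies of the three parameters, then append fresh ones.
    $pairs = array_filter(explode('&', $query), function ($pair) {
        return $pair !== '' && !preg_match('/^_afr(Loop|WindowMode|WindowId)=/', $pair);
    });
    $pairs[] = '_afrLoop=' . $m[1];
    $pairs[] = '_afrWindowMode=0';   // assumption: initial page launch case
    $pairs[] = '_afrWindowId=null';  // assumption: no window name is set yet

    return $base . '?' . implode('&', $pairs);
}
```

A scraper would fetch the page once, pass the response body to `buildLoopbackUrl`, and, if the result is not null, request that URL with the same cookie file so the session is preserved.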

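【Comments】:

  • Would disabling the loopback feature of the ADF project work for you?
  • @MrAdibou I can't disable it, because I am scraping someone else's website that I don't own.

Tags: php curl web-scraping oracle-adf loopback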


【Answer 1】:

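The proper way to crawl ADF pages is to pass the following parameter in the URL

*domain.com*?org.apache.myfaces.trinidad.outputMode=webcrawler

on every GET request your script makes. Keep in mind that the page will look different once you switch to crawler mode, because it is not meant for human consumption, but it should contain all the raw details you need to scrape.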
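
As a sketch of what this looks like with PHP cURL, the parameter can be appended to the request URL before it is handed to `curl_init`. The helper name `withCrawlerMode` is hypothetical, and whether a particular ADF version still honors this parameter is not confirmed here.

```php
<?php
// Hypothetical helper: appends the Trinidad output-mode parameter to a URL,
// keeping any query string and fragment that are already present.
function withCrawlerMode(string $url): string
{
    $param = 'org.apache.myfaces.trinidad.outputMode=webcrawler';

    // Separate the fragment so the parameter lands in the query string.
    [$rest, $fragment] = array_pad(explode('#', $url, 2), 2, null);

    // Use '?' for the first parameter and '&' for any later one.
    $separator = (strpos($rest, '?') === false) ? '?' : '&';
    $result = $rest . $separator . $param;

    return $fragment === null ? $result : $result . '#' . $fragment;
}
```

The returned string can be passed to `curl_init` in place of the original `$url`; nothing else in the cURL setup needs to change.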

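Although this is an old question and the OP has probably moved on to better things long ago, I wanted to answer it here to help others who run into the same problem.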

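【Comments】:

  • Ashvin, I am using the PHP cURL library and I can't set the output mode the way you describe. I think you can set it in ADF, but not in PHP.
  • I am referring to a parameter on the URL you are requesting.
  • Interesting... is this officially documented anywhere? Or is it an unpublished "feature"?
  • I can't find this in the repo any more; it has most likely been discontinued. I remember it from when I contributed there a few years ago. But if anyone still needs this capability, email mode seems to be the closest thing, since as I recall it inlines all CSS and turns off all kinds of dynamic scripts. The email parameter appears to be "org.apache.myfaces.trinidad.agent.email", from github.com/apache/myfaces-trinidad/blob/master/trinidad-impl/…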