【Question title】: How to extract information from a script tag
【Posted】: 2016-07-24 22:34:15
【Question】:

After a few queries in my code, I end up with a variable whose content looks like this:

<!DOCTYPE html>
<html dir=ltr>
  <head>
    <script>
      mapslite = {
        START_TIME: new Date()
      };
      mapslite.getBasePageResponse = function(cacheResponse) {
        delete mapslite.getBasePageResponse;
        cacheResponse([[[3988.776886432477,103.7950744,1.3090672],[0,0,0],[1024,768],13.10000038146973],"/maps-lite/js/2/maps_lite_20160404_RC01",107,null,null,["en",""],["/maps/lite/ApplicationService.GetEntityDetails","/maps/lite/ApplicationService.UpdateStarring","/maps/lite/ApplicationService.Search",null,"/maps/lite/suggest","/maps/lite/directions","/maps/lite/MapsLiteService.GetHotelAvailability",null,"https://www.google.com/maps/api/js/.....
,[null,null,1.3090672,103.7950744],null,"11401",null,"PjoDV_jjE8yPuATo_LmYDA","Asia/Singapore",[["\u003cb\u003eBuses\u003c/b\u003e from this station",[[3,"bus.png",null,"Bus",[["https://maps.gstatic.com/mapfiles/transit/iw2/b/bus.png",0,[15,15],null,0]]]],[[null,null,null,null,"0x31da18325b415901:0xeb661015c651c24a",[[5,["48",1,"#ffffff"]]]],[null,null,null,null,"0x31da19f34e04d59b:0x5758ef6990938b",[[5,["61",1,"#ffffff"]]]],[null,null,null,null,"0x31da1a5b8b75c379:0x6a13e189555f9fab",[[5,["95",1,"#ffffff"]]]],[null,null,null,null,"0x31da1a16ea23bf95:0xd7c90f15535c2b9f",[[5,["106",1,"#ffffff"]]]],[null,null,null,null,"0x31da10a7613d616f:0xf1f61ffeac2ea8a4",[[5,["970",1,"#ffffff"]]]],[null,null,null,null,"0x31da1a0bd6262d0b:0xfbd5d2bfd7a1252",[[5,["NR8",1,"#ffffff"]]]]],null,0,"5"]]],["http://www.google.com/search?q=
....
[0,0,"",0,1,null,null,null,0,0,1,1,0,"map,common",null,0,0,1,null,null,1,"1","2,1","","",0],null,null,"PjoDV_jjE8yPuATo_LmYDA",null,null,null,null,"//consent.google.com","2.maps_lite_20160404_RC01"]);
      };
      executeOgJs = function() {

        delete executeOgJs;
      };
    </script>

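The important information I want to extract is every route number on the "this station" line: "48, 61, 95, 106, 970, NR8" (each one sits right before ,1,"#ffffff").

I tried the following Python code: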

from lxml import html

tree = html.fromstring(buspage, base_url=detail['result']['url'])
bus_elm = tree.xpath("/html/body/div[1]/div/div[4]/div[4]/div/div/div[2]/div/div[2]/div[1]/div[2]/div/div/div[2]/div/table/tr/td")

but I ran into errors and difficulties. Is there a convenient way to do this in PHP?

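【Question discussion】:

  • Why scrape a script?
  • Because there is no API for this, even though it is public information. So this is the only way, and it is not illegal. But back to the coding side: how would this be done in PHP?

Tags: php beautifulsoup html-parsing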


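【Solution 1】:

I believe your best option is a regular expression, provided you are sure the data always has that particular structure.

An expression that matches an 'array' such as ["5N4",323,"#asdasd"] is (\[\"[a-zA-Z0-9]*?\"\,\d*?\,\".*?\"\])

You can then use explode() in PHP or split() in Python to pull out the number you want (5N4 in this example), like this: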

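function get_numbers_from($input) {
    $numbers = [];
    // preg_match_all() returns the match count; the matches themselves go into $matches.
    preg_match_all('/\["[a-zA-Z0-9]*?",\d*?,".*?"\]/', $input, $matches);
    foreach ($matches[0] as $match) {
        // The first comma-separated field looks like ["5N4, so strip the bracket and quote.
        $numbers[] = trim(explode(',', $match)[0], '["');
    }

    return $numbers;
}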
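
For comparison, here is a minimal Python sketch of the same regex-then-split idea, since the answer mentions split() as the Python counterpart of explode(). The pattern is the one from the answer; the constant name ROUTE_PATTERN and the sample string are made up for illustration, and the sketch assumes the triples always appear without spaces, as in the page source shown in the question.

```python
import re

# Pattern from the answer: matches triples such as ["48",1,"#ffffff"].
# ROUTE_PATTERN is an illustrative name, not something from the original post.
ROUTE_PATTERN = re.compile(r'\["[a-zA-Z0-9]*?",\d*?,".*?"\]')


def get_numbers_from(text):
    """Return the first field of every ["label",digits,"string"] triple in text."""
    numbers = []
    for match in ROUTE_PATTERN.findall(text):
        # split(",")[0] yields something like ["48, so strip the bracket and quote.
        numbers.append(match.split(",")[0].strip('["'))
    return numbers


# Hypothetical, shortened sample in the same shape as the question's data.
sample = '[[5,["48",1,"#ffffff"]]]],[null,"0x31da",[[5,["NR8",1,"#ffffff"]]]]'
print(get_numbers_from(sample))
```

Note that this matches every such triple in the input, so it only gives the right result when it is run on the relevant part of the page.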

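【Discussion】:

  • Actually, that format also appears in a few other places. If you search for "from this station", the last occurrence is where the data should be extracted from, as you can see in the page source here: view-source:google.com/maps/place/Blk+12/@1.3090672,103.7928857,17z/…
  • You can do a preg_match to capture everything from "from this station" to [...], and then try that regex on the result.
  • There are 4 places containing "from this station", and I only want the last one. The string the regex should handle is [\"91\",1,\"#ffffff\"], from which I want the number 91. Could you add the preg_match to the answer as well?
  • I think I know how to get the last part, using: "end(explode('from this station',$str))". But could you update the regex to match the exact format I mentioned above?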
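
The two ideas from this discussion (keep only the text after the last "from this station", then match the [\"91\",1,\"#ffffff\"] format) could be combined roughly as follows. This is a Python sketch, not the answerer's code: the names MARKER, TRIPLE and routes_after_last_marker are hypothetical, and it assumes the third field is always a hex colour such as #ffffff and that quotes may or may not be backslash-escaped.

```python
import re

MARKER = "from this station"
# \\?" accepts both a plain quote and a backslash-escaped one (\").
# Assumption: the last field is a hex colour like #ffffff.
TRIPLE = re.compile(r'\[\\?"([A-Za-z0-9]+)\\?",\d+,\\?"#[0-9A-Fa-f]+\\?"\]')


def routes_after_last_marker(page):
    """Return the route labels found after the last MARKER occurrence in page."""
    # rpartition splits on the LAST occurrence; sep is empty when MARKER is absent.
    _, sep, tail = page.rpartition(MARKER)
    if not sep:
        return []
    # findall returns only the captured label because the pattern has one group.
    return TRIPLE.findall(tail)
```

The tail runs to the end of the page, so if triples in the same format appear after the bus list they would be picked up as well; cutting the tail at a known end marker first would avoid that.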