How to extract information from a script tag

【Question title】: How to extract information from a script tag
【Posted】: 2016-07-24 22:34:15
【Question description】:

After a few queries in my code, I end up with a variable whose content looks like this:

<!DOCTYPE html>
<html dir=ltr>
  <head>
    <script>
      mapslite = {
        START_TIME: new Date()
      };
      mapslite.getBasePageResponse = function(cacheResponse) {
        delete mapslite.getBasePageResponse;
        cacheResponse([[[3988.776886432477,103.7950744,1.3090672],[0,0,0],[1024,768],13.10000038146973],"/maps-lite/js/2/maps_lite_20160404_RC01",107,null,null,["en",""],["/maps/lite/ApplicationService.GetEntityDetails","/maps/lite/ApplicationService.UpdateStarring","/maps/lite/ApplicationService.Search",null,"/maps/lite/suggest","/maps/lite/directions","/maps/lite/MapsLiteService.GetHotelAvailability",null,"https://www.google.com/maps/api/js/.....
,[null,null,1.3090672,103.7950744],null,"11401",null,"PjoDV_jjE8yPuATo_LmYDA","Asia/Singapore",[["\u003cb\u003eBuses\u003c/b\u003e from this station",[[3,"bus.png",null,"Bus",[["https://maps.gstatic.com/mapfiles/transit/iw2/b/bus.png",0,[15,15],null,0]]]],[[null,null,null,null,"0x31da18325b415901:0xeb661015c651c24a",[[5,["48",1,"#ffffff"]]]],[null,null,null,null,"0x31da19f34e04d59b:0x5758ef6990938b",[[5,["61",1,"#ffffff"]]]],[null,null,null,null,"0x31da1a5b8b75c379:0x6a13e189555f9fab",[[5,["95",1,"#ffffff"]]]],[null,null,null,null,"0x31da1a16ea23bf95:0xd7c90f15535c2b9f",[[5,["106",1,"#ffffff"]]]],[null,null,null,null,"0x31da10a7613d616f:0xf1f61ffeac2ea8a4",[[5,["970",1,"#ffffff"]]]],[null,null,null,null,"0x31da1a0bd6262d0b:0xfbd5d2bfd7a1252",[[5,["NR8",1,"#ffffff"]]]]],null,0,"5"]]],["http://www.google.com/search?q=
....
[0,0,"",0,1,null,null,null,0,0,1,1,0,"map,common",null,0,0,1,null,null,1,"1","2,1","","",0],null,null,"PjoDV_jjE8yPuATo_LmYDA",null,null,null,null,"//consent.google.com","2.maps_lite_20160404_RC01"]);
      };
      executeOgJs = function() {

        delete executeOgJs;
      };
    </script>

The information I need is every number on the "this station" line: "48, 61, 95, 106, 970, NR8" (each one sits right before `,1,"#ffffff"`).

I tried this Python code:

from lxml import html

tree = html.fromstring(buspage, base_url=detail['result']['url'])
bus_elm = tree.xpath("/html/body/div[1]/div/div[4]/div[4]/div/div/div[2]/div/div[2]/div[1]/div[2]/div/div/div[2]/div/table/tr/td")

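But I ran into errors and difficulties. Is there a convenient way to do this in PHP?

【Comments】:

Scraping a script, why?
Because there is no API for this, even though it is public information. So this is the only way, and it is not illegal. But back to the coding side: how can this be done in PHP?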

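Tags: php beautifulsoup html-parsing

【Solution 1】:

I believe your best option is a regular expression, provided you are sure the data always has that specific structure.

An expression that matches an 'array' such as ["5N4",323,"#asdasd"] is (\[\"[a-zA-Z0-9]*?\"\,\d*?\,\".*?\"\]).

You can then use explode() in PHP or split() in Python to get the number you want (5N4 in this example), like so: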

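function get_numbers_from($input) {
    $numbers = [];
    // preg_match_all() fills $matches; its return value is only the match count.
    preg_match_all('/\["[a-zA-Z0-9]*?",\d*?,".*?"\]/', $input, $matches);
    foreach ($matches[0] as $match) {
        // $match looks like ["48",1,"#ffffff"]; keep the first field without [ and ".
        $numbers[] = trim(explode(',', $match)[0], '["');
    }

    return $numbers;
}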
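
For comparison, here is a minimal Python sketch of the same regular-expression approach. It uses a capture group instead of splitting on commas. The pattern, the name ROUTE_ENTRY, and the shortened sample string are assumptions based on the excerpt in the question, not a tested scraper for the live page.

```python
import re

# Matches arrays such as ["48",1,"#ffffff"] and captures the first field.
# The pattern is an assumption derived from the sample in the question.
ROUTE_ENTRY = re.compile(r'\["([A-Za-z0-9]+)",\d+,"[^"]*"\]')


def get_numbers_from(text):
    """Return the first field of every ["label",digits,"string"] array."""
    return ROUTE_ENTRY.findall(text)


# Shortened excerpt of the data shown in the question.
sample = (
    '[[null,null,null,null,"0x31da18325b415901:0xeb661015c651c24a",'
    '[[5,["48",1,"#ffffff"]]]],'
    '[null,null,null,null,"0x31da1a0bd6262d0b:0xfbd5d2bfd7a1252",'
    '[[5,["NR8",1,"#ffffff"]]]]]'
)
print(get_numbers_from(sample))  # ['48', 'NR8']
```

Because the label is captured directly, no trimming of brackets or quotes is needed afterwards.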

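【Discussion】:

Actually, that format also shows up in a few other places. If you search for "from this station", the last occurrence is the place to extract from, as you can see in the page source here: view-source:google.com/maps/place/Blk+12/@1.3090672,103.7928857,17z/…
You could do a preg_match to capture everything that follows "from this station", then try that regex on the result.
There are 4 places with "from this station"; I only want the last one. The string the regex should handle is: [\"91\",1,\"#ffffff\"], to get the number 91. Can you add the preg_match to the answer as well?
I think I know how to get the last part, using "end(explode('from this station',$str))", but can you update the regex so it handles the exact format I mentioned above?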
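
One way to combine the two requests in this thread (keep only what follows the last "from this station", and accept quotes written as \") is sketched below in Python. The function name routes_after_last_marker, the pattern, and the sample input are made up for illustration; the real page may contain further matching arrays after the marker, which this sketch would also return.

```python
import re

MARKER = "from this station"

# \\? allows an optional backslash before each quote, so both
# ["91",1,"#ffffff"] and [\"91\",1,\"#ffffff\"] are accepted.
ROUTE_ENTRY = re.compile(r'\[\\?"([A-Za-z0-9]+)\\?",\d+,\\?"[^"\\]*\\?"\]')


def routes_after_last_marker(page, marker=MARKER):
    """Return route labels found after the last occurrence of marker."""
    _, found, tail = page.rpartition(marker)
    if not found:
        return []  # marker absent: nothing to extract
    return ROUTE_ENTRY.findall(tail)


# Made-up input in the escaped format quoted in the thread.
page = (
    r'from this station\",[[\"12\",1,\"#ffffff\"]] ... '
    r'from this station\",[[\"91\",1,\"#ffffff\"],[\"NR8\",1,\"#ffffff\"]]'
)
print(routes_after_last_marker(page))  # ['91', 'NR8']
```

str.rpartition plays the role of end(explode(...)) here, except that it lets the function tell a missing marker apart from a marker at the very start of the text.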