【Question title】: Extract data from a page
【Posted】: 2013-09-05 12:51:24
【Question】:

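Hello, I want to create an HTML and PHP page that can fetch the data contained in the table at this link: http://www.comuni-italiani.it/province.html

I would appreciate any hints. I would use file_get_contents, but I don't know how to extract all the various data.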

【Comments】:

    Tags: php html file-get-contents html-content-extraction


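    【Answer 1】:

    Could you explain more clearly what you want to get from this page?

    In any case, you can use file_get_contents to fetch the page and then, depending on what you want from it (I assume you want every <td> element of the table), use PHP regular expressions (preg_match, preg_match_all) to extract the data you need.

    An example for your case: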

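    // Download the raw HTML of the page as a string.
    $page = file_get_contents("http://www.comuni-italiani.it/province.html");

    // Collect every match; "." does not cross newlines, so each match stays on one line.
    $output = array();
    preg_match_all('/<td.*.<\/td>/', $page, $output);

    // $output[0] holds the full matches.
    print_r($output);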
    

    This will output:

    Array ( [0] => Array ( [0] =>    [1] => [2] => Agrigento [3] => Alessandria [4] => Ancona [5] => Aosta [6] => Arezzo [7] => Ascoli Piceno [8] => Asti [9] => Avellino [10] => Bari [11] => Barletta-Andria-Trani [12] => Belluno [13] => Benevento [14] => Bergamo [15] => Biella [16] => Bologna [17] => Bolzano [18] => Brescia [19] => Brindisi [20] => Cagliari [21] => Caltanissetta [22] => Campobasso [23] => Carbonia-Iglesias [24] => Caserta [25] => Catania [26] => Catanzaro [27] => Chieti [28] => Como [29] => Cosenza [30] => Cremona [31] => Crotone [32] => Cuneo [33] => Enna [34] => Fermo [35] => Ferrara [36] => Firenze [37] => Foggia [38] => Forlì-Cesena [39] => Frosinone [40] => Genova [41] => Gorizia [42] => Grosseto [43] => Imperia [44] => Isernia [45] => La Spezia [46] => L'Aquila [47] => Latina [48] => Lecce [49] => Lecco [50] => Livorno [51] => Lodi [52] => Lucca [53] => Macerata [54] => Mantova [55] => Massa-Carrara [56] => Matera [57] => Messina [58] => Milano [59] => Modena [60] => Monza e della Brianza [61] => Napoli [62] => Novara [63] => Nuoro [64] => Olbia-Tempio [65] => Oristano [66] => Padova [67] => Palermo [68] => Parma [69] => Pavia [70] => Perugia [71] => Pesaro e Urbino [72] => Pescara [73] => Piacenza [74] => Pisa [75] => Pistoia [76] => Pordenone [77] => Potenza [78] => Prato [79] => Ragusa [80] => Ravenna [81] => Reggio Calabria [82] => Reggio Emilia [83] => Rieti [84] => Rimini [85] => Roma [86] => Rovigo [87] => Salerno [88] => Medio Campidano [89] => Sassari [90] => Savona [91] => Siena [92] => Siracusa [93] => Sondrio [94] => Taranto [95] => Teramo [96] => Terni [97] => Torino [98] => Ogliastra [99] => Trapani [100] => Trento [101] => Treviso [102] => Trieste [103] => Udine [104] => Varese [105] => Venezia [106] => Verbano-Cusio-Ossola [107] => Vercelli [108] => Verona [109] => Vibo Valentia [110] => Vicenza [111] => Viterbo [112] => CercaNel Sito e sul WebPagine UtiliElenco Province per PopolazionePrincipali Città ItalianeLista Alfabetica 
RegioniAmministrazioni LocaliScuole in Italia [113] =>   ) )
    

    Of course, the result can be filtered.

    In your case, for example, by adding a small foreach loop:

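    $page = file_get_contents("http://www.comuni-italiani.it/province.html");

    $output = array();
    preg_match_all('/<td.*.<\/td>/', $page, $output);

    $provinces = array();

    // $output has a single entry (the full matches), so the outer loop runs once.
    foreach ($output as $id => $list) {
        // Indices 2..111 are the province cells on the page as it was when this was written.
        for ($i = 2; $i <= 111; $i++) {
            array_push($provinces, $list[$i]);
        }
    }

    print_r($provinces);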
    

    This will give you:

    Array ( [0] => Agrigento [1] => Alessandria [2] => Ancona [3] => Aosta [4] => Arezzo [5] => Ascoli Piceno [6] => Asti [7] => Avellino [8] => Bari [9] => Barletta-Andria-Trani [10] => Belluno [11] => Benevento [12] => Bergamo [13] => Biella [14] => Bologna [15] => Bolzano [16] => Brescia [17] => Brindisi [18] => Cagliari [19] => Caltanissetta [20] => Campobasso [21] => Carbonia-Iglesias [22] => Caserta [23] => Catania [24] => Catanzaro [25] => Chieti [26] => Como [27] => Cosenza [28] => Cremona [29] => Crotone [30] => Cuneo [31] => Enna [32] => Fermo [33] => Ferrara [34] => Firenze [35] => Foggia [36] => Forlì-Cesena [37] => Frosinone [38] => Genova [39] => Gorizia [40] => Grosseto [41] => Imperia [42] => Isernia [43] => La Spezia [44] => L'Aquila [45] => Latina [46] => Lecce [47] => Lecco [48] => Livorno [49] => Lodi [50] => Lucca [51] => Macerata [52] => Mantova [53] => Massa-Carrara [54] => Matera [55] => Messina [56] => Milano [57] => Modena [58] => Monza e della Brianza [59] => Napoli [60] => Novara [61] => Nuoro [62] => Olbia-Tempio [63] => Oristano [64] => Padova [65] => Palermo [66] => Parma [67] => Pavia [68] => Perugia [69] => Pesaro e Urbino [70] => Pescara [71] => Piacenza [72] => Pisa [73] => Pistoia [74] => Pordenone [75] => Potenza [76] => Prato [77] => Ragusa [78] => Ravenna [79] => Reggio Calabria [80] => Reggio Emilia [81] => Rieti [82] => Rimini [83] => Roma [84] => Rovigo [85] => Salerno [86] => Medio Campidano [87] => Sassari [88] => Savona [89] => Siena [90] => Siracusa [91] => Sondrio [92] => Taranto [93] => Teramo [94] => Terni [95] => Torino [96] => Ogliastra [97] => Trapani [98] => Trento [99] => Treviso [100] => Trieste [101] => Udine [102] => Varese [103] => Venezia [104] => Verbano-Cusio-Ossola [105] => Vercelli [106] => Verona [107] => Vibo Valentia [108] => Vicenza [109] => Viterbo )
    

    (Sorry for the huge arrays.)

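    However, this keeps the links in the array, so if you want only the values and not the anchors wrapped around them, feel free to use another regular expression.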
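    As a rough illustration of such a regular expression, the sketch below captures only the link text inside each cell. The sample markup in $page is made up for the example and only assumed to resemble the real page, so the pattern may need adjusting against the actual HTML.

```php
<?php
// Sample markup standing in for the fetched page (an assumption, not the real page).
$page = <<<HTML
<table>
<tr><td><a href="agrigento/">Agrigento</a></td><td>AG</td></tr>
<tr><td><a href="alessandria/">Alessandria</a></td><td>AL</td></tr>
</table>
HTML;

// Capture only the text between <a ...> and </a> when the link opens a <td>.
// The /i flag ignores tag case; [^<]+ stops at the next tag.
preg_match_all('/<td[^>]*>\s*<a[^>]*>([^<]+)<\/a>/i', $page, $matches);

// $matches[1] holds the first capture group: the link text without markup.
$names = array_map('trim', $matches[1]);

print_r($names);
```

    Cells without a link (such as the second column in the sample) are skipped, because the pattern requires an anchor right after the opening <td>.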

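    Hope this helps.

    (Treat this as an example only. Keep in mind that if the page changes, this foreach trick may stop working; I posted it just to give you an idea of how to approach the problem.)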
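    One way to depend less on fixed positions is to select cells by what they contain rather than by index. The sketch below is only an illustration: the $cells list is hypothetical, and the idea that province links point at a three-digit directory such as "084/" is an assumption that would have to be checked against the real page.

```php
<?php
// Hypothetical cell list, similar in shape to $output[0] from preg_match_all.
$cells = [
    '<td>&nbsp;</td>',
    '<td><a href="084/">Agrigento</a></td>',
    '<td><a href="006/">Alessandria</a></td>',
    '<td><b>Pagine Utili</b></td>',
];

$provinces = [];
foreach ($cells as $cell) {
    // Keep only cells whose link target looks like "NNN/" (an assumed pattern).
    if (preg_match('/<a\s+href="\d{3}\/"[^>]*>([^<]+)<\/a>/i', $cell, $m)) {
        $provinces[] = trim($m[1]);
    }
}

print_r($provinces);
```

    With this approach, extra cells added before or after the list do not shift the result, although a change to the link format would still break it.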

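    【Comments】:

    • Thank you very much, but what if I want to add the 'sigla' column?
    【Answer 2】:

    Try to learn more about DOMDocument. Reference: http://php.net/manual/en/class.domdocument.php

    These questions may also help you: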

    Getting DOM elements by classname

    PHP Parse HTML code
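    As a minimal sketch of the DOMDocument approach, the example below reads each table row into an array with a name and a 'sigla' entry. The sample markup and the column order (name first, sigla second) are assumptions for illustration; the real table may be laid out differently.

```php
<?php
// Sample markup; the column order (name, then sigla) is an assumption.
$html = <<<HTML
<table>
<tr><th>Provincia</th><th>Sigla</th></tr>
<tr><td><a href="084/">Agrigento</a></td><td>AG</td></tr>
<tr><td><a href="006/">Alessandria</a></td><td>AL</td></tr>
</table>
HTML;

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];
// Select every table row with at least two <td> cells (the header row uses <th>).
foreach ($xpath->query('//tr[count(td) >= 2]') as $tr) {
    $cells = $xpath->query('td', $tr);
    $rows[] = [
        'name'  => trim($cells->item(0)->textContent),
        'sigla' => trim($cells->item(1)->textContent),
    ];
}

print_r($rows);
```

    To run this against the live page, $html could be filled from file_get_contents instead of the sample string. Because textContent returns the cell text without markup, no extra step is needed to strip the anchors.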

    【Comments】:
