【Question title】: Get all urls in a string with php
【Posted】: 2012-07-20 06:32:52
【Question】:

I'm trying to figure out a way to get an array of URLs from a string of text. The text is formatted somewhat like this:

Here is some random text

http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/

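Obviously, the links could be anything (and there could be many of them; these are just the ones I'm testing with right now). With a simple URL, my regex works fine.

I'm using: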

preg_match_all('((https?|ftp|gopher|telnet|file|notes|ms-help):'.
    '((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)',
    $bodyMessage, $matches, PREG_PATTERN_ORDER);

When I run print_r($matches); the result I get is:

Array ( [0] => Array (
    [0] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon=
    [1] => http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= 
    [2] => http://techcrunch.co=
    [3] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-ip= 
    [4] => http://techcrunch.com/2012/07/20/last-day-to-purc=
    [5] => http://tec=
)
...

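None of the items in that array is a complete link from the text above.

Does anyone know a good way to get what I need? I've found a bunch of regex snippets for extracting links with PHP, but none of them work.

Thanks!

Edit:

OK, so I'm pulling these links from an email. The script parses the email, grabs the message body, and then tries to extract the links from it. After inspecting the email, it seems that for some reason a space is being inserted in the middle of the URLs. Here is the message body as my PHP script sees it.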

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 

Any suggestions on how to keep it from breaking the URLs?

Edit 2

Following the suggestion in the comments, I ran this code:

 $bodyMessage = str_replace("= ", "",$bodyMessage);

However, when I echo the result, it doesn't seem to have replaced the "= ":

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 

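【Comments】:

  • Looks fine to me: ideone.com/ulJ4a.
  • Hmm, very interesting... I just edited my question... the links come from an email, which I then parse to get the body... it seems the email is putting a space right in the middle of the links! Suggestions?
  • Those = instances look suspicious, like some kind of chunked encoding that your code isn't handling properly.
  • I would simply replace every "= " with nothing before processing the string.
  • Good point. See my latest edit... the string replace doesn't seem to want to work on it.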
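
The dumped part header declares Content-Transfer-Encoding: quoted-printable, and the trailing = at each break plus the =3D sequences are typical of that encoding. The str_replace("= ", ...) attempt presumably fails because the raw bytes are = followed by a line break, which only looks like a space once echoed. A minimal sketch, assuming the MIME part body has already been isolated into a string (the $rawBody sample and the example.com URL are made up for illustration), that decodes with PHP's built-in quoted_printable_decode() before matching:

```php
<?php
// Hypothetical sample body: quoted-printable text with soft line breaks
// ("=" followed by CRLF) and "=3D" standing in for a literal "=".
// example.com is a placeholder, not taken from the original email.
$rawBody = "http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick=\r\n"
         . "ets-for-disrupt-sf/\r\n"
         . "http://example.com/?a=3D1&b=3D2\r\n";

// Undo the transfer encoding first: soft line breaks vanish, =3D becomes "=".
$decoded = quoted_printable_decode($rawBody);

// Then match on the decoded text; a URL ends at whitespace, quotes or angle brackets.
preg_match_all('~https?://[^\s"<>]+~i', $decoded, $matches);
$urls = $matches[0];

print_r($urls);
```

Decoding first means the regex never sees the line-wrapped form, so no special handling of = is needed in the pattern itself.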

Tags: php regex url


【Solution 1】:
    /**
     * Get all URLs from a string (the string itself may be a URL).
     *
     * @param string $string
     * @return array
     */
    function getUrls($string) {
        // Match http:// or https:// followed by everything up to a space or double quote.
        $regex = '/https?\:\/\/[^\" ]+/i';
        preg_match_all($regex, $string, $matches);
        return $matches[0];
    }

【Comments】:

  • You should also add the newline to the negated character class: $regex = '/https?\:\/\/[^\" \n]+/i';
【Solution 2】:
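
To illustrate why the newline matters, here is a small sketch comparing the two character classes on text where two URLs sit on consecutive lines. The a.example and b.example hosts and the variable names are placeholders; using \s instead of a literal space and \n is my own variation on the comment's suggestion.

```php
<?php
// Two URLs on consecutive lines with no space between them (placeholder hosts).
$text = "http://a.example/one\nhttp://b.example/two";

// Pattern as in the answer: only a space or a double quote ends a URL,
// so the match runs straight across the line break.
preg_match_all('/https?:\/\/[^" ]+/i', $text, $withoutNewline);

// Variant that excludes all whitespace (\s covers \n, \r, tabs and spaces).
preg_match_all('/https?:\/\/[^"\s]+/i', $text, $withWhitespace);

echo count($withoutNewline[0]), "\n"; // 1 (both URLs glued into one match)
echo count($withWhitespace[0]), "\n"; // 2
```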

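Use the following regular expression instead. It is wrapped in ~ delimiters, which the preg_* functions require, and uses the u modifier because the last character class contains multi-byte quote characters.

$regex = "~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~u";

Hope this helps.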
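
A hedged usage sketch for this expression with preg_match_all(). The ~ delimiters and the u modifier are what PHP's PCRE functions need around the pattern, and the sample sentence is made up. Note that with u the input must be valid UTF-8, so an ISO-8859-1 email body would need converting first.

```php
<?php
// The pattern from the answer, wrapped in ~ delimiters with the u modifier.
$regex = "~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~u";

// Made-up sample: one URL with balanced parentheses followed by a comma,
// and one scheme-less www address followed by a full stop.
$text = "See http://example.com/foo_(bar), and www.example.org.";

preg_match_all($regex, $text, $matches);
$urls = $matches[0]; // index 0 holds the full matches

print_r($urls);
```

The trailing comma and full stop are left out of the matches, while the balanced parentheses stay part of the first URL.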

【Comments】:

    【Solution 3】:

    With the following code you get an array $urls_in_string; at index zero, $urls_in_string[0], you will find all the URLs.

        $urls_in_string = [];
        $string_with_urls = "World's most popular social networking website is https://www.facebook.com. We have many other such websites like https://twitter.com/home and https://www.linkedin.com/feed/ etc.";
        $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/\S*)?/im";
        preg_match_all($reg_exUrl, $string_with_urls, $urls_in_string);
        print_r($urls_in_string);

    // Output
    /*
    Array
    (
        [0] => Array
            (
                [0] => https://www.facebook.com
                [1] => https://twitter.com/home
                [2] => https://www.linkedin.com/feed/
            )
    
        [1] => Array
            (
                [0] => https
                [1] => https
                [2] => https
            )
    
        [2] => Array
            (
                [0] => 
                [1] => /home
                [2] => /feed/
            )
    
    )
    */
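
    The extra arrays at indexes 1 and 2 in that output come from the two capturing groups. If only the full URLs are wanted, a variation (my own sketch, not part of the answer; the sample text is adapted and the variable names are placeholders) is to make the groups non-capturing with (?:...), so the result holds just the full matches:

```php
<?php
$text = "Visit https://www.facebook.com. See also https://twitter.com/home "
      . "and https://www.linkedin.com/feed/ etc.";

// Same idea as the answer's pattern, but with non-capturing groups and
// ~ as the delimiter so the slashes need no escaping.
$pattern = '~(?:https?|ftps?)://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,6}(?:/\S*)?~i';

preg_match_all($pattern, $text, $matches);
$urls = $matches[0];

print_r($urls);
```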
    

    【Comments】:

      【Solution 4】:
      You can do something like the following:
      
      $url = "http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~
      
      http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/";
      
      $dataArray = explode("http",$url);
      
      echo "<pre>";print_r($dataArray);
      
      This returns an array like the following:
      
      Array
      (
       [0] => 
       [1] => ://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~
      
      
       [2] => ://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/
      )
      
      When you use the output above, prepend "http" to each entry. I think this will help you.
      
      Happy coding!
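
      The prepend step can be sketched as follows. The placeholder hosts and variable names are assumptions for illustration. Keep in mind that this approach splits on every occurrence of "http", so a URL that itself contains "http" (for example in a redirect parameter) would be cut in two, and any text between the URLs ends up attached to the preceding entry.

```php
<?php
// Placeholder input: two URLs separated by blank lines.
$text = "http://a.example/x\n\nhttps://b.example/y";

// Split on "http"; the first element is whatever came before the first URL.
$parts = explode("http", $text);
array_shift($parts);

// Put "http" back in front of each piece and trim surrounding whitespace.
$urls = array_map(function ($part) {
    return "http" . trim($part);
}, $parts);

print_r($urls);
```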
      

      【Comments】:
