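【Question title】: Regex erase control characters except between words
【Posted】: 2017-12-01 23:56:37
【Question】:

I fetched an array of products from a database, and I want to split the large description text below into smaller chunks of product attribute names and values. Ultimately I am working toward database normalization, as I am currently building an import tool that maps between two different database designs.

The array I get from the old products table: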

        Array

            (
                [0] => Array
                    (
                        [product_id] => 219
                        [product_description] =>
<table style="color:; text-align: left;">
<tr>
<td>
Processor:
</td>
<td>
        Intel Core 2 Duo - E8400
</td>
</tr>
<tr>
<td>
Clock speed:
</td>
<td>
        3.0 GHz
</td>
</tr>
<tr>
<td>
Memory:
</td>
<td>
        4 GB
</td>
</tr>
<tr>
<td>
Hard disk:
</td>
<td>
        250 GB
</td>
</tr>
<tr>
<td>
Video-adapter:
</td>
<td>
        VGA, Display
</td>
</tr>
<tr>
<td>
Netwerk card:
</td>
<td>
        1000 Mbps LAN
</td>
</tr>
<tr>
<td>
Optical drive:
</td>
<td>
        DVD-Rewriter
</td>
</tr>
<tr>
<td>
Operating system:
</td>
<td>
        Windows 7 or 10 Pro
</td>
</tr>
<tr>
<td>
Warranty:
</td>
<td>
        1 year
</td>
</tr>
</table>
                    )
            )

My code so far:

$sth = $dbh->prepare("SELECT * from products WHERE product_status_id = '1' ORDER BY order_num ASC");
$sth->execute();
$result = $sth->fetchAll(PDO::FETCH_ASSOC);

$output = array();

$tdpattern = "!<td>(.*?)</td>!is";

foreach ($result as $key=>$val)  {
    preg_match_all($tdpattern, $val['product_description'], $result);
    foreach ($result as $key => $arr) {
        foreach ($arr as $key2 => $description) {
            $output[] = preg_replace('/\n^[\x0a\x20]+|[\x0a\x20]+$/','',$description);
        }
    }
}

// return $output to controller

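As shown below, the output has several spaces in front of the words but none between them, and it still contains newlines that should be removed. How can I erase all of these control characters, such as newlines and spaces, while keeping a single space between the words of each array element, so that the result ideally looks like the layout at the bottom?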

Array
(
    [0] => Processor
    [1] =>         IntelCore2-E5500
    [2] => Clockspeed
    [3] =>         2.93GHz
    [4] => Memory
    [5] =>         4GB
    [6] => Harddisk
    [7] =>         250GB
    [8] => Video-adapter
    [9] =>         VGA,Display
    [10] => Netwerkcard
    [11] =>         1000mbpsLAN
    [12] => Opticaldrive
    [13] =>         DVD-Rewriter
    [14] => Operatingsystem
    [15] =>         Windows7or10Pro
    [16] => Warranty
    [17] =>         2jaar
)

I would like to convert it into this layout:

[219] => array (
    [product_description] => array (
        [processor] => Intel Core 2 - E5500
        [clock speed] => 2.93 GHz
        [memory] => 2.93 GHz
        [hard disk] => 2.93 GHz
        [video adapter] => 2.93 GHz
        [network card] => DVD Rewriter
        [optical drive] => DVD Rewriter
        [operating system] => Windows 7 or 10 Pro
        [warranty] => 2 years
    )
)

Some direction would be great, especially on how to improve the regex.

【Comments】:

Tags: php arrays regex


【Solution 1】:

Don't parse HTML with a regular expression; use DOMDocument.

<?php
//...
$result = $sth->fetchAll(PDO::FETCH_ASSOC);

$dom_err = libxml_use_internal_errors(true);
$dom = new DOMDocument();

foreach ($result as $key => $val)  {

    // fix product_description
    $product_description = [];
    if (!empty($val['product_description'])) {
        $html = $val['product_description'];

        // process
        $dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('td') as $i => $td) {
            if ($i % 2 == 0) {
                $label = strtolower(trim($td->nodeValue));
                $label = str_replace('-', ' ', trim($label, ':'));
            } else {
                $product_description[$label] = trim($td->nodeValue);
            }
        }
    }
    $val['product_description'] = $product_description;

    // ... rest
}

libxml_clear_errors();
libxml_use_internal_errors($dom_err);

Example:

https://3v4l.org/vECil

Result:

Array
(
    [processor] => Intel Core 2 Duo - E8400
    [clock speed] => 3.0 GHz
    [memory] => 4 GB
    [hard disk] => 250 GB
    [video adapter] => VGA, Display
    [netwerk card] => 1000 Mbps LAN
    [optical drive] => DVD-Rewriter
    [operating system] => Windows 7 or 10 Pro
    [warranty] => 1 year
)
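
The loop above pairs cells by their position in the whole document. As a minimal sketch of the layout requested in the question (an array keyed by product id), the variant below pairs the two cells of each `<tr>` instead. The helper name `parseSpecTable()` and the shortened sample row are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper: turns one specification table into [label => value] pairs.
function parseSpecTable(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $specs = [];
    foreach ($dom->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        if ($cells->length < 2) {
            continue; // skip rows without a label/value pair
        }
        // Same label clean-up as in the answer: lowercase, no colon, hyphen to space.
        $label = strtolower(trim($cells->item(0)->textContent));
        $label = str_replace('-', ' ', rtrim($label, ':'));
        $specs[$label] = trim($cells->item(1)->textContent);
    }
    return $specs;
}

// Sample rows shaped like the fetchAll() result in the question (shortened).
$rows = [
    ['product_id' => 219, 'product_description' =>
        "<table><tr><td>\nClock speed:\n</td><td>\n        3.0 GHz\n</td></tr>"
        . "<tr><td>\nVideo-adapter:\n</td><td>\n        VGA, Display\n</td></tr></table>"],
];

$products = [];
foreach ($rows as $row) {
    $products[$row['product_id']] = [
        'product_description' => parseSpecTable($row['product_description']),
    ];
}
print_r($products);
```

Pairing per row means a row with a missing cell is skipped rather than shifting every later label onto the wrong value.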

【Comments】:

    【Solution 2】:

    Source: https://stackoverflow.com/a/2326239/5245032

    <?php
    $str = "This is  a string       with
    spaces, tabs and newlines present";
    
    $stripped = preg_replace(array('/\s{2,}/', '/[\t\n]/'), ' ', $str);
    
    echo $str;
    echo "\n---\n";
    echo "$stripped";
    ?>
    

    This outputs:

    This is  a string   with
    spaces, tabs and newlines present
    ---
    This is a string with spaces, tabs and newlines present
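
Applied to a single table cell from the question, the same idea can be sketched as follows. The helper name `normalizeWhitespace()` is an assumption for illustration; it collapses every run of whitespace into one space and then trims both ends, which also removes the leading indentation that the snippet above would leave as a single space.

```php
<?php
// Hypothetical helper: collapse whitespace runs into one space and trim the ends.
function normalizeWhitespace(string $text): string
{
    // \s matches spaces, tabs, and newlines, so one pattern covers all of them.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// A cell value as it appears between <td> and </td> in the question's data.
$cell = "\n        Intel Core 2 Duo - E8400\n";
echo normalizeWhitespace($cell), "\n";

$sentence = "This is  a string\twith\nspaces";
echo normalizeWhitespace($sentence), "\n";
```

Single spaces between words are preserved because a run of one space is replaced by one space.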
    

    【Comments】:

      【Solution 3】:

      Given an array like the following:

      <?php
      
      $a = [ 0 => [ "product_id" => 219,
                    "product_description" => "<table style=\"color:; text-align: left;\">
      <tr>
      <td>
      Processor:
      </td>
      <td>
              Intel Core 2 Duo - E8400
      </td>
      </tr>
      <tr>
      <td>
      Clock speed:
      </td>
      <td>
              3.0 GHz
      </td>
      </tr>
      <tr>
      <td>
      Memory:
      </td>
      <td>
              4 GB
      </td>
      </tr>
      <tr>
      <td>
      Hard disk:
      </td>
      <td>
              250 GB
      </td>
      </tr>
      <tr>
      <td>
      Video-adapter:
      </td>
      <td>
              VGA, Display
      </td>
      </tr>
      <tr>
      <td>
      Netwerk card:
      </td>
      <td>
              1000 Mbps LAN
      </td>
      </tr>
      <tr>
      <td>
      Optical drive:
      </td>
      <td>
              DVD-Rewriter
      </td>
      </tr>
      <tr>
      <td>
      Operating system:
      </td>
      <td>
              Windows 7 or 10 Pro
      </td>
      </tr>
      <tr>
      <td>
      Warranty:
      </td>
      <td>
              1 year
      </td>
      </tr>
      </table>"]
           ];
      

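      You can use the DOM or other libraries to parse it. Alternatively, the string value can be processed with various built-in PHP functions, as in the following example: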

      <?php
      
      $arr = array_pop($a);
      $str =  $arr["product_description"];
      
      $stripped = strip_tags( $str, "<td>" );
      $replaced = str_replace( "</td>", "", $stripped );
      $arr = explode( "<td>", $replaced );
      array_shift( $arr );
      
      $arrKeyVal=[];
      
      for( $i=0, $max = count( $arr ); $i < $max; $i+=2 ) {
             $key = trim( $arr[$i],"\r\t\n :" );
             $arrKeyVal[strtolower( $key )] = trim( $arr[$i+1] );
      }
      print_r( $arrKeyVal );
      

      live code

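      The code uses array_pop() to extract the nested associative array. The value of the "product_description" element is assigned to $str for easier handling. All tags in the string are stripped except "<td>". Each closing td tag is replaced with an empty string, and the string is then split on the opening td tag. The first element of the resulting array is empty, so it is shifted off the array. The code then uses a loop to build an associative array in which each element of $arr becomes a key or a value depending on whether its index is even or odd. Each element is trimmed to remove surrounding whitespace, and the keys also lose their trailing colon. Finally, strtolower() ensures that every key is lowercase.

      Using a regular expression for this is generally not recommended; see here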
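
One reason for that advice can be shown with a small sketch. The sample markup below is an assumption: it adds a `class` attribute to one cell, which the question's data does not have. The question's pattern only matches a bare `<td>`, so it misses that cell, while DOMDocument still finds both.

```php
<?php
// Assumed sample: the first cell carries an attribute, the second does not.
$html = '<table><tr><td class="label">Memory:</td><td>4 GB</td></tr></table>';

// The pattern from the question matches only a literal "<td>" opening tag.
$regexCount = preg_match_all('!<td>(.*?)</td>!is', $html, $matches);

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The DOM parser identifies elements by name, regardless of attributes.
$domCount = $dom->getElementsByTagName('td')->length;

printf("regex: %d cell(s), DOM: %d cell(s)\n", $regexCount, $domCount);
```

Splitting on the literal string "<td>" would be affected by the same kind of markup change.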

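      【Comments】:

      • Thanks, the DOM parser is indeed the way to go. I did have to change a few things, for example: $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8')); and then remove the nbsp characters inside the foreach loop: $td->nodeValue = preg_replace("~\x{00a0}~siu", "", $td->nodeValue ); Now I get what I want.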
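
The two adjustments described in this comment can be sketched as below. This is a variant rather than the commenter's exact code: the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated since PHP 8.2, so the sketch uses mb_encode_numericentity() instead (assuming the mbstring extension is available), and it replaces each non-breaking space with a regular space instead of deleting it. The sample markup is an assumption for illustration.

```php
<?php
// Assumed sample: UTF-8 input containing a euro sign and &nbsp; entities.
$html = "<table><tr><td>Price:</td><td>&nbsp;\u{20AC}&nbsp;99</td></tr></table>";

// Turn every non-ASCII character into a numeric entity so the parser
// does not have to guess the input encoding.
$ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($ascii);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    // &nbsp; is decoded to U+00A0; swap it for a regular space, then tidy up.
    $text = str_replace("\u{00A0}", ' ', $td->textContent);
    $cells[] = trim(preg_replace('/\s+/', ' ', $text));
}
print_r($cells);
```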