重构循环？答案

【问题标题】：Refactoring a loop?重构循环？
【发布时间】：2011-11-05 04:23:48
【问题描述】：

我想循环超过 200,000 个用户数据集来过滤 30,000 个产品，我该如何优化这个嵌套的大循环以获得最佳性能？

  //settings , 5 max per user, can up to 200,000
   $settings = array(...);

   //all prods, up to 30,000
   $prods = array(...);

   //all prods category relation map, up to 2 * 30,000
   $prods_cate_ref_all = array(...);

   //msgs filtered by settings saved yesterday , more then 100 * 200,000
   $msg_all = array(...);

   //filter counter
   $j = 0;

   //filter result
   $res = array();

   foreach($settings as $set){

       foreach($prods as $k=>$p){

           //filter prods by site_id 
           if ($set['site_id'] != $p['site_id']) continue;

               //filter prods by city_id , city_id == 0 is all over the country
           if ($set['city_id'] != $p['city_id'] && $p['city_id'] > 0) continue;

           //muti settings of a user may get same prods
               if (prod_in($p['id'], $set['uuid'], $res)) continue;

            //prods filtered by settings saved  to msg table yesterday
           if (msg_in($p['id'], $set['uuid'], $msg_all)) continue;

               //filter prods by category id 
           if (!prod_cate_in($p['id'], $set['cate_id'], $prods_cate_ref_all)) continue;

            //filter prods by tags of set not in prod title, website ...
                $arr = array($p['title'], $p['website'], $p['detail'], $p['shop'], $p['tags']);
           if (!tags_in($set['tags'], $arr)) continue; 

               $res[$j]['name'] = $v['name'];
           $res[$j]['prod_id'] = $p['id'];
               $res[$j]['uuid'] = $v['uuid'];
               $res[$j]['msg'] = '...';
               $j++;
       }

   }

   save_to_msg($res);

function prod_in($prod_id, $uuid, $prod_all){
    foreach($prod_all as $v){
    if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid)
        return true;
    }
    return false;
}

function prod_cate_in($prod_id, $cate_id, $prod_cate_all){
    foreach($prod_cate_all as $v){
    if ($v['prod_id'] == $prod_id && $v['cate_id'] == $cate_id)
        return true;
    }
    return false;
}

function tags_in($tags, $arr){
    $tag_arr = explode(',', str_replace('，', ',', $tags));
    foreach($tag_arr as $v){
    foreach($arr as $a){
        if(strpos($a, strtolower($v)) !== false){
        return true;
        }
    }
    }
    return false;
}

function msg_in($prod_id, $uuid, $msg_all){
    foreach($msg_all as $v){
    if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid)
        return true;
    }
    return false;
}

更新：非常感谢。是的，数据在mysql中，下面是主要结构：

-- user settings to filter prods, 5 max per user
CREATE TABLE setting(
   id INT NOT NULL AUTO_INCREMENT, 
   uuid VARCHAR(100) NOT NULL DEFAULT '',
   tags VARCHAR(100) NOT NULL DEFAULT '',
   site_id SMALLINT UNSIGNED NOT NULL DEFAULT 0,
   city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
   cate_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
   addtime INT UNSIGNED NOT NULL DEFAULT 0,
   PRIMARY KEY (`id`), 
   KEY `idx_setting_uuid` (`uuid`),
   KEY `idx_setting_tags` (`tags`),
   KEY `idx_setting_city_id` (`city_id`),
   KEY `idx_setting_cate_id` (`cate_id`)
) DEFAULT CHARSET=utf8;


CREATE TABLE users(
   id INT NOT NULL AUTO_INCREMENT, 
   uuid VARCHAR(100) NOT NULL DEFAULT '',
   PRIMARY KEY (`id`),   
   UNIQUE KEY `idx_unique_uuid` (`uuid`)
) DEFAULT CHARSET=utf8;


-- filtered prods
CREATE TABLE msg_list(
   id INT NOT NULL AUTO_INCREMENT, 
   uuid VARCHAR(100) NOT NULL DEFAULT '',
   prod_id INT UNSIGNED NOT NULL DEFAULT 0,
   msg TEXT NOT NULL DEFAULT '',
   PRIMARY KEY (`id`),
   KEY `idx_ml_uuid` (`uuid`)
) DEFAULT CHARSET=utf8;



-- prods and prod_cate_ref table in another database, so can not join it


CREATE TABLE prod(
   id INT NOT NULL AUTO_INCREMENT, 
   website VARCHAR(100) NOT NULL DEFAULT '' COMMENT ' site name ',
   site_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
   city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
   title VARCHAR(50) NOT NULL DEFAULT '',
   tags VARCHAR(50) NOT NULL DEFAULT '',
   detail VARCHAR(500) NOT NULL DEFAULT '',
   shop VARCHAR(300) NOT NULL DEFAULT '',
   PRIMARY KEY (`id`),
   KEY `idx_prod_tags` (`tags`),
   KEY `idx_prod_site_id` (`site_id`),
   KEY `idx_prod_city_id` (`city_id`),
   KEY `idx_prod_mix` (`site_id`,`city_id`,`tags`)
) DEFAULT CHARSET=utf8;

CREATE TABLE prod_cate_ref(
   id MEDIUMINT NOT NULL AUTO_INCREMENT, 
   prod_id INT NOT NULL NULL DEFAULT 0,
   cate_id MEDIUMINT NOT NULL NULL DEFAULT 0,
   PRIMARY KEY (`id`),
   KEY `idx_pcr_mix` (`prod_id`,`cate_id`)
) DEFAULT CHARSET=utf8;


-- ENGINE all is myisam

我不知道如何只使用一个 sql 来获取所有内容。

【问题讨论】：

我假设您正在从数据库中获取此信息。我还假设您使用的是 SQL 数据库。那为什么不使用JOINs 和WHERE 子句呢？
您从哪里获得设置？数据库？
这是要存储在 PHP 数组中的大量数据。这是来自数据库还是什么？
我同意 NullUserEsception，如果可能的话，你应该在 MySQL 中这样做。
不能使用查找表吗？只需遍历其中一组，查找另一组数据...？

标签： php algorithm refactoring large-data

【解决方案1】：

谢谢大家的鼓励，我终于明白了，方法确实很简单，但是迈出了一大步！

我重新组合 $prods_cate_ref_all 和 $msg_all 中的数据（使用最后的两个函数），还有结果数组$res，然后使用 strpos 和 in_array 代替三个迭代函数 (prod_in msg_in prod_cate_in) ，

我得到了惊人的 50 倍加速！！！数据越大，效果越有效。

  //settings , 5 max per user, can up to 200,000
   $settings = array(...);

   //all prods, up to 30,000
   $prods = array(...);

   //all prods category relation map, up to 2 * 30,000
   $prods_cate_ref_all = get_cate_ref_all();

   //msgs filtered by settings saved yesterday , more then 100 * 200,000
   $msg_all = get_msg_all();

   //filter counter
   $j = 0;

   //filter result
   $res = array();


  foreach($settings as $set){

       foreach($prods as $p){

       $res_uuid_setted = false;

       $uuid = $set['uuid'];

       if (isset($res[$uuid])){
           $res_uuid_setted = true;
       }

       //filter prods by site_id 
       if ($set['site_id'] != $p['site_id']) 
               continue;

       //filter prods by city_id , city_id == 0 is all over the country
       if ($set['city_id'] != $p['city_id'] && $p['city_id'] > 0) 
               continue;


       //muti settings of a user may get same prods
       if ($res_uuid_setted)
           //in_array faster than strpos if item < 1000
           if (in_array($p['id'], $res[$uuid]['prod_ids']))
           continue;

       //prods filtered by settings saved  to msg table yesterday
       if (isset($msg_all[$uuid]))
           //strpos faster than in_array in large data
           if (false !== strpos($msg_all[$uuid], ' ' . $p['id'] . ' '))
           continue;

       //filter prods by category id 
       if (false === strpos($prods_cate_ref_all[$p['id']], ' ' . $set['cate_id'] . ' '))
           continue;

       $arr = array($p['title'], $p['website'], $p['detail'], $p['shop'], $p['tags']);
       if (!tags_in($set['tags'], $arr))
           continue;


       $res[$uuid]['prod_ids'][] = $p['id'];

       $res[$uuid][] = array(
        'name' => $set['name'],
        'prod_id' => $p['id'],
        'msg' => '',
       );

       }

   }


function get_msg_all(){

    $temp = array();
    $msg_all = array(
        array('uuid' => 312, 'prod_id' => 211),
        array('uuid' => 1227, 'prod_id' => 31),
        array('uuid' => 1, 'prod_id' => 72),
        array('uuid' => 993, 'prod_id' => 332),
        ...
    );

    foreach($msg_all as $k=>$v){
    if (!isset($temp[$v['uuid']])) 
        $temp[$v['uuid']] = ' ';

    $temp[$v['uuid']] .= $v['prod_id'] . ' ';
    }

    $msg_all = $temp;
    unset($temp);

    return $msg_all;
}


function get_cate_ref_all(){

    $temp = array();
    $cate_ref = array(
        array('prod_id' => 3, 'cate_id' => 21),
        array('prod_id' => 27, 'cate_id' => 1),
        array('prod_id' => 1, 'cate_id' => 232),
        array('prod_id' => 3, 'cate_id' => 232),
        ...
    );

    foreach($cate_ref as $k=>$v){
    if (!isset($temp[$v['prod_id']]))
        $temp[$v['prod_id']] = ' ';

    $temp[$v['prod_id']] .= $v['cate_id'] . ' ';
    }
    $cate_ref = $temp;
    unset($temp);

    return $cate_ref;
}

【讨论】：

【解决方案2】：

由于您已经在外部循环中设置了较大的集合，因此很难判断在哪里可以优化它。您可以内联函数代码以节省函数调用或从您的许多 foreach 中部分展开。

比如这个函数里面

function tags_in($tags, $arr){
    $tag_arr = explode(',', str_replace('，', ',', $tags));
    foreach($tag_arr as $v){
    foreach($arr as $a){
        if(strpos($a, strtolower($v)) !== false){
        return true;
        }
    }
    }
    return false;
}

您基本上是在使用数组进行字符串访问。更直接地做字符串（注意：完整的标签匹配，你的做部分匹配）：

function tags_in($tags, $arr)
{
    $tags = ', '.strtolower($tags).', ';

    foreach($arr as $tag)
    {
        if (false !== strpos($tags, ', '.$tag.', ')
          return true;
    }
    return false;
}

但由于您拥有大量数据，因此需要很长时间。

仅优化您的发布版本。
只做小的、体面的改变。
每次更改后运行测试。
根据具有代表性的真实世界测试数据分析每个更改。

因此，除了较小的代码移动和更改之外，您可能正在寻找规模。如果你之前可以部分解决问题，你可以做一个 map 和 reduce 策略。也许您已经在使用基于文档的数据库，该数据库为此提供了接口。

【讨论】：