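How to make needle-in-a-haystack comparisons more efficient

【Question title】: How to make comparisons of a needle in a haystack more efficient
【Posted】: 2019-05-19 00:29:09
【Question description】:

I have been trying to make the following code more efficient.

In short:

I have a database with titles and descriptions. It will hold about 10,000 texts on average. I want to compare these texts by splitting each title with "mb_split" and then looping over all other texts to check whether each word occurs in them. Depending on the number of matching words, I want to write the article IDs to another table in that database.

The code below works and gets the job done, but it takes a very long time to finish and uses a lot of resources. I can't seem to find a more efficient way to compare these texts.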

function compareArticle() {
  include '../include/write.php';
  $readNewsQuery = "select title,text,articleid,name from texts";
  $readNews = $dbwrite->query($readNewsQuery);

  if ($readNews) {
    //Fetch mysql data as an array
    $news = $readNews->fetch_all(MYSQLI_NUM);
      // Start foreach to read every article once
      foreach ($news as $item) {
        echo $item[2].'<br />';
        // Start another foreach to loop through the articles to compare with
        foreach ($news as $compare) {
          $strippedWords = mb_split(' +', $item[0]);
          $count = 0;
          $compareString = "";
          $compareString .= $compare[0];
          $compareString .= $compare[1];
          $compareString = strtolower($compareString);
          // Start yet another foreach to loop through the words
          foreach ($strippedWords as $word) {
            // I only want to count the words that are longer than 4 characters
            if (strlen($word) > 4) {
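              // Lowercase the needle too (the original assigned it to a misspelled, unused variable)
              $word = strtolower($word);
              // strpos() returns 0 for a match at position 0, so test against false explicitly
              if (strpos($compareString, $word) !== false && $compare[2] != $item[2]) {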
                $count++;
              }
            }
          }
          if ($count > 5) {
            echo $count.'<br />';
            //Insert action to write comparison to database (item[2] and compare[2])
          }
       }
    }
  }
}

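What I really want to know: can I make this more efficient? Can I use fewer loops, or is there a simpler way to search the arrays? If it can be made more efficient, could someone point me in the right direction?

Edit: it may be useful to know what data I retrieve and what I want to write to the other table.

The texts table is set up to contain:

| article id | title | text | sourcename |

I compare the words in the title with the words in the combined title and text of every other article. If they match well enough, I want to write both article IDs to another table:

| id | original article id | compared article id |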

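【Question discussion】:

Use LIKE or REGEXP directly in SQL.
Yes, interesting comment @DavidLemon. I started out using only MySQL to collect most of the data, and that seemed much faster. Combined with my cronjobs it caused some problems, and I hoped PHP arrays would be faster than the queries I was firing at MySQL. That turned out not to be the case. I will look into your suggestion, because I am 100% sure I can make my script more efficient with some more advanced MySQL queries.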
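To make the LIKE suggestion concrete, here is a minimal sketch of how one title could be turned into a single scoring query. The table and column names come from the question; the helper name `buildMatchQuery` and the sample title are assumptions made up for illustration, and the query has not been run against the asker's data.

```php
<?php
// Hypothetical helper: builds one parameterised scoring query for a single title.
// Table/column names (texts, title, text, articleid) follow the question.
function buildMatchQuery(string $title, int $articleId, int $minHits = 5): array
{
    $terms  = [];
    $params = [];
    foreach (mb_split(' +', strtolower($title)) as $word) {
        if (strlen($word) <= 4) {
            continue; // same rule as the question: only words longer than 4 characters
        }
        // In MySQL each LIKE evaluates to 0 or 1, so the sum is the number of matching words.
        $terms[]  = '(LOWER(CONCAT(title, text)) LIKE ?)';
        // Escape LIKE wildcards so the word is matched literally.
        $params[] = '%' . addcslashes($word, '%_\\') . '%';
    }
    if ($terms === []) {
        return ['', []]; // no usable words in this title
    }
    $sql = 'SELECT articleid, hits FROM ('
         . 'SELECT articleid, ' . implode(' + ', $terms) . ' AS hits '
         . 'FROM texts WHERE articleid <> ?'
         . ') AS scored WHERE hits > ?';
    $params[] = $articleId;
    $params[] = $minHits;
    return [$sql, $params];
}

[$sql, $params] = buildMatchQuery('Needle found inside haystack', 7);
echo $sql, "\n";
print_r($params);
```

The returned SQL and parameter list are meant for a prepared statement, so no title text is concatenated into the query. Note that a pattern with a leading `%` cannot use an ordinary index, so each title still costs a scan of the table; whether this beats the PHP loops would have to be measured.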

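Tags: php performance mysqli foreach strpos

【Solution 1】:

Once you have looped through a news item, you no longer need to compare any other news item against it. For example, if news item 1 does not match any of the other 50 items, then by the time you start checking news item 2 you already know it does not match news item 1.

So instead of looping over all news items twice, you can start the second loop at the current index of the first loop + 1 (you don't need to compare the current news item with itself).

Edit: here is a sample loop:

Optimized loop: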

$matches = array();
$a = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 ];
$count = 0;
for ($i = 0; $i < count($a); ++$i) {
    for ($j = $i+1; $j < count($a); ++$j) {
        if ($a[$i] == $a[$j]) {
            array_push($matches, "$i, $j");
        }
        $count++; 
    }
}
echo "Optimized n loops: $count\n";
echo 'Matches: ' . count($matches);

// Output
// Optimized n loops: 435
// Matches: 5

Un-optimized loop:

$matches = array();
$count = 0;
for ($i = 0; $i < count($a); ++$i) {
    for ($j = 0; $j < count($a); ++$j) {
        if ($a[$i] == $a[$j]) {
            array_push($matches, "$i, $j");
        }
        $count++; 
    }
}
$matches = array_unique($matches); // Dedupe
echo "Un-optimized n loops: $count\n";
echo 'Matches: ' . count($matches);

// Output
// Un-optimized n loops: 900
// Matches: 40

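The un-optimized loop contains a lot of redundant matches (index 1 matches index 5, and index 5 matches index 1 again).

【Discussion】:

While I admit I had not thought of this solution, it is not 100% correct. The comparison only starts from the words in the title of article 1. The words from that title are compared to the title and text of the next article. That means words that occur in article 1, but not in the title of article 1, would never trigger when compared against the title of article 2. I do find this solution interesting enough to throw at my texts and see how much of a difference it makes. Thanks for your suggestion! :)

【Solution 2】:

I have run a lot of tests and made some changes to my script, and I now know what the biggest culprit is.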

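Original case:

Sample size of 10,000;
Execution time: more than 600 seconds (maximum execution time reached).

Test case:

A fully stripped-down version of the original
Sample size of 1,000;
Execution time: 24 seconds.

What made the biggest difference?

The biggest difference came from changing the position of the following line:

$strippedWords = mb_split(' +', $item[0]);

I moved that line into the first loop instead of the second. This way each title is split once, so 1,000 items need 1,000 splits instead of 1,000 × 1,000. I measured the difference in time:

mb_split in the second loop:

Total execution time in seconds: 162.17704296112

mb_split in the first loop:

Total execution time in seconds: 24.564566135406

That is a surprisingly large difference. I guess mb_split is not the cheapest operation for PHP. Putting mb_split in the wrong part of my code made the script almost 7 times slower :|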
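The change described above is a case of hoisting loop-invariant work out of an inner loop. A minimal sketch on made-up data follows; the array keys, the sample articles, the low threshold and the space inserted between title and text are assumptions for illustration, not the asker's actual code.

```php
<?php
// Illustrative data; the keys are assumptions, not the asker's column indexes.
$articles = [
    ['id' => 1, 'title' => 'Storm warning issued', 'text' => 'Heavy rain expected'],
    ['id' => 2, 'title' => 'Local storm damage',   'text' => 'A warning was issued earlier'],
    ['id' => 3, 'title' => 'Football results',     'text' => 'Nothing about weather'],
];

$splitCalls = 0;
$matches = [];
foreach ($articles as $item) {
    // Loop-invariant work: the title does not change while the inner loop runs,
    // so split it once here instead of once per comparison.
    $words = mb_split(' +', strtolower($item['title']));
    $splitCalls++;

    foreach ($articles as $compare) {
        if ($compare['id'] === $item['id']) {
            continue; // never compare an article with itself
        }
        // A space keeps the last title word from fusing with the first text word.
        $haystack = strtolower($compare['title'] . ' ' . $compare['text']);
        $count = 0;
        foreach ($words as $word) {
            if (strlen($word) > 4 && strpos($haystack, $word) !== false) {
                $count++;
            }
        }
        if ($count >= 2) { // low threshold so the tiny sample produces a match
            $matches[] = [$item['id'], $compare['id'], $count];
        }
    }
}
echo "mb_split calls: $splitCalls\n";
print_r($matches);
```

With three articles the title is split three times rather than nine; the same ratio is what turns a million splits into a thousand on the 1,000-item sample.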

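strtolower()

After getting this result, I was curious which other text modifiers I could move. So I took strtolower() and placed it in the first loop wherever possible.

strtolower() in the second loop:

Total execution time in seconds: 44.315208911896

strtolower() in the first loop:

Total execution time in seconds: 37.129139900208

Although this difference is much smaller (roughly 16% less time), it is still noticeable.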
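Taking the same idea one step further, every article can be normalised once before any comparison starts, so strtolower() runs once per article rather than once per pair or per word. This is only a sketch: `prepareArticles`, `countMatches` and the array keys are hypothetical names, and it keeps all prepared strings in memory at once.

```php
<?php
// Hypothetical helper: normalises every article once, before any comparison runs.
function prepareArticles(array $rows): array
{
    $prepared = [];
    foreach ($rows as $row) {
        $title = strtolower($row['title']);
        $prepared[] = [
            'id'       => $row['id'],
            // keep only the words the comparison will actually use
            'words'    => array_values(array_filter(
                mb_split(' +', $title),
                fn($w) => strlen($w) > 4
            )),
            // title and text, lowercased once and joined with a space
            'haystack' => $title . ' ' . strtolower($row['text']),
        ];
    }
    return $prepared;
}

// Counts how many of the prepared words occur in the prepared haystack.
function countMatches(array $words, string $haystack): int
{
    $count = 0;
    foreach ($words as $word) {
        if (strpos($haystack, $word) !== false) {
            $count++;
        }
    }
    return $count;
}

$prepared = prepareArticles([
    ['id' => 1, 'title' => 'Storm warning issued', 'text' => 'Heavy rain expected'],
    ['id' => 2, 'title' => 'Local storm damage',   'text' => 'A warning was issued earlier'],
]);
echo countMatches($prepared[0]['words'], $prepared[1]['haystack']), "\n"; // prints 3
```

The nested loops then only call strpos(), which leaves the pairwise comparison itself as the remaining cost.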

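Possible other cause

I am not sure whether this is entirely correct, because I currently don't have time to test it, but while testing some cases I noticed problems with my browser. When I tell PHP to send a lot of output to my browser, the scripts feel like they run longer, and the browser also stops displaying output after a while.

When I find some spare time I will test this theory and see whether my browser really affects the duration of my PHP script. I can't think of a logical reason why it would, since I would expect the browser to crash while my PHP script keeps working on the server side... but the thought has crossed my mind a few times.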
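One way to test this theory would be to keep all output out of the timed section: collect the results in an array, stop the clock, and only then send everything to the browser in one write. The sketch below assumes a hypothetical `renderReport` helper and made-up sample pairs; it does not claim that output is actually the bottleneck.

```php
<?php
// Hypothetical reporting helper: turns collected results into one string.
function renderReport(array $pairs): string
{
    $lines = [];
    foreach ($pairs as [$originalId, $comparedId, $count]) {
        $lines[] = "$originalId matches $comparedId ($count words)";
    }
    return implode("<br />\n", $lines);
}

$timeStart = microtime(true);
$pairs = [[1, 2, 6], [2, 1, 7]]; // made-up sample results
// ... the comparison loops would fill $pairs here, without echoing anything ...
$elapsed = microtime(true) - $timeStart; // pure compute time, no output included

echo renderReport($pairs);
```

If the elapsed time measured this way stays the same whether the report is large or small, the browser output is not what slows the comparison down.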

Anyway, here is the new script:

function compareArticle() {
  //For timing my script
  $time_start = microtime(true);

  include '../include/write.php';
  $readNewsQuery = "select title,text,articleid,name,datetoday from texts";
  $readNews = $dbwrite->query($readNewsQuery);
  $dateToday = date("Y-m-d");

  if ($readNews) {
    //Fetch mysql data as an array
    $news = $readNews->fetch_all(MYSQLI_NUM);
  }

  foreach ($news as $item) {
    // Decrease the sample pool
    if ($item[4] != $dateToday) {
      continue;
    }
    $strippedWords = strtolower($item[0]);
    $strippedWords = mb_split(' +', $strippedWords);

    // Start another foreach to loop through the articles to compare with
      foreach ($news as $compare) {

        $compareString = "";
        $compareString .= $compare[0];
        $compareString .= $compare[1];

        $count = 0;

        // Start yet another foreach to loop through the words
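        foreach ($strippedWords as $word) {
          // I only want to count the words that are longer than 4 characters
          if (strlen($word) > 4) {
            // strpos() returns 0 for a match at position 0, so test against false explicitly
            if (strpos(strtolower($compareString), $word) !== false) {
              $count++;
            }
          }
        }
        // The post is cut off here; the lines below are reconstructed from the question's code
        if ($count > 5) {
          echo $count.'<br />';
          //Insert action to write comparison to database (item[2] and compare[2])
        }
      }
  }
}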

【Discussion】: