【Question Title】: Store duplicate array elements
【Posted】: 2014-12-07 09:28:16
【Question】:

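I'm desperately trying to solve the following problem: in a list of sentences/news headlines, I want to find the ones that are very similar (3 or 4 words in common) and put them into a new array. So, for this original array/list: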

'Title1: Hackers expose trove of snagged Snapchat images',
'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
'Title3: Family says goodbye at funeral for 16-year-old',
'Title4: New Jersey officials talk about Ebola quarantine',
'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
'Title6: Hackers expose Snapchat images'

the result should be:

Array
(
    [0] => Title1: Hackers expose trove of snagged Snapchat images
    [1] => Array
        (
            [duplicate] => Title6: Hackers expose Snapchat images
        )

    [2] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [3] => Array
        (
            [duplicate] => Title4: New Jersey officials talk about Ebola quarantine
        )
    [4] => Title3: Family says goodbye at funeral for 16-year-old
    [5] => Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands
)

Here is my code:

    $titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
    );
$z = 1;
foreach ($titles as $feed)
{
    $feed_A = explode(' ', $feed);
    for ($i=$z; $i<count($titles); $i++)
    {
        $feed_B = explode(' ', $titles[$i]);
        $intersect_A_B = array_intersect($feed_A, $feed_B);
        if(count($intersect_A_B)>3)
        {
            $titluri[] = $feed;
            $titluri[]['duplicate'] = $titles[$i]; 
        }
        else 
        {
            $titluri[] = $feed;
        }
    }
    $z++;
}

It outputs this [awkward, but somewhat close to the desired result]:

Array
(
    [0] => Title1: Hackers expose trove of snagged Snapchat images
    [1] => Title1: Hackers expose trove of snagged Snapchat images
    [2] => Title1: Hackers expose trove of snagged Snapchat images
    [3] => Title1: Hackers expose trove of snagged Snapchat images
    [4] => Title1: Hackers expose trove of snagged Snapchat images
    [5] => Array
        (
            [duplicate] => Title6: Hackers expose Snapchat images
        )

    [6] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [7] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [8] => Array
        (
            [duplicate] => Title4: New Jersey officials talk about Ebola quarantine
        )

    [9] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [10] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [11] => Title3: Family says goodbye at funeral for 16-year-old
    [12] => Title3: Family says goodbye at funeral for 16-year-old
    [13] => Title3: Family says goodbye at funeral for 16-year-old
    [14] => Title4: New Jersey officials talk about Ebola quarantine
    [15] => Title4: New Jersey officials talk about Ebola quarantine
    [16] => Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands
)

Any suggestions would be greatly appreciated!

【Comments】:

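  • Here are some useful links that should help: Highlight the difference between two strings in PHP. You can also look at the similar_text function in the PHP manual.
  • It's dirty, but couldn't you use array_unique on $titluri after the loop to get the expected array?
  • @AlbanPommeret, array_unique doesn't work; I've already tried it.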
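
As an illustration of the similar_text suggestion in the comments above, here is a minimal sketch. The helper name titlesAreSimilar and the 60% threshold are assumptions made for this example, not something from the thread; the threshold would need tuning against real headlines.

```php
<?php
// Hypothetical helper (not from the thread): reports whether two titles
// are at least $minPercent similar according to PHP's similar_text().
function titlesAreSimilar($a, $b, $minPercent = 60.0)
{
    $percent = 0.0;
    // similar_text() writes the similarity percentage into its third argument.
    similar_text(strtolower($a), strtolower($b), $percent);
    return $percent >= $minPercent;
}
```

Unlike the word-intersection approach in the question, this compares characters rather than whole words, so a shared prefix such as "Title" also contributes to the score.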

Tags: php arrays sorting duplicates


【Solution 1】:

Here is my solution, inspired by @DomWeldon, without the repeated entries:

<?php
$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
);
$titluri    =   array(); // unless it's declared elsewhere
$duplicateTitles = array();
// loop through each line of the array
foreach ($titles as $key => $originalFeed)
{
    if(!in_array($key, $duplicateTitles)){
        $titluri[] = $originalFeed; // list this feed, since it was not already recorded as a duplicate
        $feed_A = explode(' ', $originalFeed);
        foreach ($titles as $newKey => $comparisonFeed)
        {
            // iterate through the array again and see if they intersect
            if ($key != $newKey) { // but don't compare a line against itself!
                $feed_B = explode(' ', $comparisonFeed);
                $intersect_A_B = array_intersect($feed_A, $feed_B);
                // do they share more than three words?
                if(count($intersect_A_B)>3)
                {
                    // yes, add a duplicate entry
                    $titluri[]['duplicate'] = $comparisonFeed;
                    $duplicateTitles[] = $newKey;
                }
            }
        }
    }
}

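【Comments】:

  • Thanks, it seems to work well. I'll see whether I can adapt it to integrate the code into a larger scheme. Good luck, Alban!
  • This is another way to solve it (in_array performs some internal looping anyway). It is certainly better than Dom Weldon's solution, but I think we could use 2 for loops (instead of 2 foreach) for better performance. First loop: $i from 0 while $i < count($titles)-1; second loop: $j from $i+1 while $j < count($titles). However, it may need more tweaking to make it work (not just changing the loops).
  • @KingKing, I did use 2 for loops as you describe, with '$i' and '$j=$i+1', but in my old non-working code. I'll try it in Alban's solution, but tomorrow morning. Thanks for the tip!
  • @VladAndrei I just scanned it again: using for loops still requires you to check whether an index has already been taken (as a duplicate), but you should keep those indexes in a dedicated array (instead of checking with in_array). That performs better (because the lookup is by key, not by value) at the cost of a little more memory.
  • @VladAndrei here is my edited code using 2 for loops: $dup = array(); for ($i=0;$i < count($titles)-1; $i++) { if($dup[$i]) continue; $titluri[] = $titles[$i]; $feed_A = explode(' ', $titles[$i]); for ($j=$i+1; $j<count($titles); $j++) { $feed_B = explode(' ', $titles[$j]); $intersect_A_B = array_intersect($feed_A, $feed_B); if(count($intersect_A_B)>3) { $titluri[]['duplicate'] = $titles[$j]; $dup[$j] = true; } } }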
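
The two-for-loop idea from the comments above can be sketched as a function. This is a minimal sketch rather than the commenter's exact code: groupSimilarTitles and $minShared are hypothetical names. It differs from the inline version in three ways, all of them assumptions about the desired behavior: it uses isset() so that unset indexes in $dup raise no notice, it lets the outer loop run to the last title so that a non-duplicate final title is still listed, and it skips titles that were already recorded as duplicates.

```php
<?php
// Hypothetical helper (not from the thread): lists each title once and
// follows it with array('duplicate' => ...) entries for similar later titles.
function groupSimilarTitles(array $titles, $minShared = 4)
{
    $titles = array_values($titles); // make sure keys run 0..n-1
    $count  = count($titles);
    $result = array();
    $dup    = array(); // index => true for titles already stored as duplicates

    for ($i = 0; $i < $count; $i++) {
        if (isset($dup[$i])) {
            continue; // already attached to an earlier title
        }
        $result[] = $titles[$i];
        $feedA = explode(' ', $titles[$i]);

        // Only look at later titles, so each pair is compared once.
        for ($j = $i + 1; $j < $count; $j++) {
            if (isset($dup[$j])) {
                continue;
            }
            $feedB = explode(' ', $titles[$j]);
            // Same test as count(...) > 3 in the thread when $minShared is 4.
            if (count(array_intersect($feedA, $feedB)) >= $minShared) {
                $result[] = array('duplicate' => $titles[$j]);
                $dup[$j] = true;
            }
        }
    }
    return $result;
}
```

Called with the six titles from the question, this should produce Title1, Title6 as a duplicate, Title2, Title4 as a duplicate, then Title3 and Title5, which is the layout the question asks for.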
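【Solution 2】:

I think this code may be what you're looking for (comments included). If not, let me know - it was written in a hurry and is untested. Also, you may want to look at this alternative - nested foreach loops can cause performance problems on large sites.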

<?php

$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
    );
$titluri    =   array(); // unless it's declared elsewhere
// loop through each line of the array
foreach ($titles as $key => $originalFeed)
{
    $titluri[] = $originalFeed; // all feeds are listed in the new array
    $feed_A = explode(' ', $originalFeed);
    foreach ($titles as $newKey => $comparisonFeed)
    {
        // iterate through the array again and see if they intersect
        if ($key != $newKey) { // but don't compare a line against itself!
            $feed_B = explode(' ', $comparisonFeed);
            $intersect_A_B = array_intersect($feed_A, $feed_B);
            // do they share more than three words?
            if(count($intersect_A_B)>3)
            {
                // yes, add a duplicate entry
                $titluri[]['duplicate'] = $comparisonFeed;
            }
        }
    }
}

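【Comments】:

  • Just replaced $i with $newKey; I think your code is fine now!
  • Not sure whether this works, but it isn't very efficient: for example, the first pass compares Title1 with Title4, and a later pass compares Title4 with Title1 for essentially the same result (likewise for the other pairs). Using for loops (with counters) should be better.
  • You're right, @KingKing - this was written quickly, please edit!
  • Using for loops certainly performs better (you save some calls to array_intersect), but in this case it is more complex to implement. My comment was meant as a note for the OP, who may want to try it himself (it will probably need some testing).
  • @AlbanPommeret, your code does the job, but it repeats some entries, as can be seen in the array: title1 picks up title6 and title2 picks up title4 because they are similar, but title6 then also gets title1 listed under it and title4 gets title2. That repetition is what I'm trying to avoid. Please print the resulting array to see what I mean.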
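
On the performance point raised in these comments, one further saving is to split each title into words only once instead of calling explode inside the inner loop. The sketch below is an assumption about how that could look, not code from the thread; tokenizeTitles and sharedWordCount are hypothetical names. It stores each title's words as array keys, so a word repeated within one title counts only once, which differs slightly from array_intersect on the raw word lists.

```php
<?php
// Hypothetical helper (not from the thread): split every title into a set
// of unique words once, keyed by the title's original array key.
function tokenizeTitles(array $titles)
{
    $sets = array();
    foreach ($titles as $key => $title) {
        // array_flip() turns the words into keys, which also drops repeats.
        $sets[$key] = array_flip(explode(' ', $title));
    }
    return $sets;
}

// Hypothetical helper: number of distinct words shared by two word sets.
function sharedWordCount(array $setA, array $setB)
{
    // array_intersect_key() compares keys only, so lookups are by key.
    return count(array_intersect_key($setA, $setB));
}
```

The sets returned by tokenizeTitles can then be compared pair by pair in either solution's loops, using sharedWordCount($sets[$i], $sets[$j]) > 3 in place of the explode and array_intersect calls.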