Algorithm for comparing a paragraph with an array of keywords in PHP: answers

[Question title]: Algorithm for comparing a paragraph with an array of keywords in PHP
[Posted]: 2017-10-11 20:49:58
[Question]:

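I want to develop a quiz system for subjective papers. In this system, users answer each question with a paragraph instead of picking from multiple choices. The admin adds each question together with a set of keywords that serve as the answer. I want an efficient algorithm to compare the user's answer (up to 100 words) with the predefined keyword array (up to 50 words). How can I implement this? Please help me.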

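Thanks in advance! I am thinking of converting the user's paragraph into an array of words and then comparing it with the predefined keyword array. But I think this is a time-consuming way to implement the system. For example, if the user's answer array contains 100 words and the predefined array contains 50 words, that is 100*50 comparisons, which is too expensive.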

Please help me find an efficient solution in PHP.

[Comments]:

Tags: php arrays algorithm data-structures logic

[Answer 1]:

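If you want to split text into words, you have to handle each language separately. You will run into the problem that splitting the text on spaces is not enough for your task: the punctuation is still there. So you have to keep - and trim characters such as ! and , from the words. Meanwhile, if you look at Chinese, you will find that it uses a different set of punctuation marks, so you would have to enumerate all of them.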

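However, this task is easy to solve with str_word_count and a little help from a predefined alphabet. The example below works for English text (no extra alphabet needed) and for Greek text (with an alphabet):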

function words($string, $charlist = null)
{
    return str_word_count($string, 1, $charlist);
}


$string_ASCII = 'ASCII string example'; # string(20) "ASCII string example"

$result = words($string_ASCII); # Array
                                # (
                                #     [0] => ASCII
                                #     [1] => string
                                #     [2] => example
                                # )


$string_UTF8 = 'UTF-8 string πράδειγμα'; # string(31) "UTF-8 string πράδειγμα"

$alphabet = '1234567890-ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαάβγδεζηθικλμνξοπρστυφχψω';

$result = words($string_UTF8, $alphabet); # Array
                                          # (
                                          #     [0] => UTF-8
                                          #     [1] => string
                                          #     [2] => πράδειγμα
                                          # )

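You do not need to compare the 2 arrays against each other again and again. Use an index. The best approach is to call array_flip on your keyword array, then walk through the user's words only once and check each word with isset: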

$keywords = array_flip($keywords); # $keywords - your 50 words

$words = words($string); # $string - a text with 100 words from user

foreach ($words as $word)
{
    # only 100 iterations with fast isset validation

    if (isset($keywords[$word]))
    {
        # it exists!
    }
}
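The fragment above can be wrapped into a small self-contained function. This is a minimal sketch: count_keyword_hits() and the sample data are made up for illustration. It shows why the cost drops from 100*50 comparisons to roughly 100 hash lookups plus one pass to build the index.

```php
<?php
# Hypothetical helper: count_keyword_hits() is not part of the original answer.
function count_keyword_hits(array $keywords, array $words): int
{
    # array_flip turns the values into keys, so every check is one hash lookup.
    $index = array_flip($keywords);
    $hits  = 0;

    foreach ($words as $word)
    {
        if (isset($index[$word]))
        {
            $hits++;
        }
    }

    return $hits;
}

$keywords = ['array', 'index', 'hash'];
$words    = ['an', 'index', 'is', 'a', 'hash', 'lookup'];

echo count_keyword_hits($keywords, $words), "\n"; # prints 2
```

The lookup is case-sensitive, so in practice both sides would probably be lowercased first.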

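It is also a good idea to think about word normalization, because a user may write legs when your keyword list contains leg, and that answer is correct too. For English, I can recommend the following code: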

# Author - https://gist.github.com/tbrianjones
# Source - https://gist.github.com/tbrianjones/ba0460cc1d55f357e00b
#
# The MIT License (MIT)
#
# Copyright (c) 2015
#
#
# Changes:
#   Removed rule for virus -> viri
#   Added rule for potato -> potatoes
#   Added rule for *us -> *uses

class english
{
    private static $plural = array
    (
        '/(quiz)$/i'                     => '$1zes',
        '/^(ox)$/i'                      => '$1en',
        '/([ml])ouse$/i'                 => '$1ice',
        '/(matr|vert|ind)(?:ix|ex)$/i'   => '$1ices',
        '/(x|ch|ss|sh)$/i'               => '$1es',
        '/([^aeiouy]|qu)y$/i'            => '$1ies',
        '/(hive)$/i'                     => '$1s',
        '/(?:([^f])fe|([lr])f)$/i'       => '$1$2ves',
        '/(shea|lea|loa|thie)f$/i'       => '$1ves',
        '/sis$/i'                        => 'ses',
        '/([ti])um$/i'                   => '$1a',
        '/(tomat|potat|ech|her|vet)o$/i' => '$1oes',
        '/(bu)s$/i'                      => '$1ses',
        '/(alias)$/i'                    => '$1es',
        '/(octop)us$/i'                  => '$1i',
        '/(ax|test)is$/i'                => '$1es',
        '/(us)$/i'                       => '$1es',
        '/s$/i'                          => 's',
        '/$/'                            => 's'
    );

    private static $singular = array
    (
        '/(quiz)zes$/i'              => '$1',
        '/(matr)ices$/i'             => '$1ix',
        '/(vert|ind)ices$/i'         => '$1ex',
        '/^(ox)en$/i'                => '$1',
        '/(alias)es$/i'              => '$1',
        '/(octop|vir)i$/i'           => '$1us',
        '/(cris|ax|test)es$/i'       => '$1is',
        '/(shoe)s$/i'                => '$1',
        '/(o)es$/i'                  => '$1',
        '/(bus)es$/i'                => '$1',
        '/([ml])ice$/i'              => '$1ouse',
        '/(x|ch|ss|sh)es$/i'         => '$1',
        '/(m)ovies$/i'               => '$1ovie',
        '/(s)eries$/i'               => '$1eries',
        '/([^aeiouy]|qu)ies$/i'      => '$1y',
        '/([lr])ves$/i'              => '$1f',
        '/(tive)s$/i'                => '$1',
        '/(hive)s$/i'                => '$1',
        '/(li|wi|kni)ves$/i'         => '$1fe',
        '/(shea|loa|lea|thie)ves$/i' => '$1f',
        '/(^analy)ses$/i'            => '$1sis',
        # $1 already holds the whole stem, so $2 must not be appended as well
        '/((a)naly|(b)a|(d)iagno|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$/i' => '$1sis',
        '/([ti])a$/i'                => '$1um',
        '/(n)ews$/i'                 => '$1ews',
        '/(h|bl)ouses$/i'            => '$1ouse',
        '/(corpse)s$/i'              => '$1',
        '/(us)es$/i'                 => '$1',
        '/s$/i'                      => ''
    );

    private static $irregular = array
    (
        'move'   => 'moves',
        'foot'   => 'feet',
        'goose'  => 'geese',
        'sex'    => 'sexes',
        'child'  => 'children',
        'man'    => 'men',
        'tooth'  => 'teeth',
        'person' => 'people'
    );

    private static $uncountable = array
    (
        'sheep',
        'fish',
        'deer',
        'series',
        'species',
        'money',
        'rice',
        'information',
        'equipment'
    );

    public static function pluralize($string)
    {
        # save some time in the case that singular and plural are the same
        if (in_array(mb_strtolower($string), self::$uncountable))
        {
            return $string;
        }


        # check for irregular singular forms
        foreach (self::$irregular as $pattern => $result)
        {
            $pattern = '/' . $pattern . '$/i';

            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }

        # check for matches using regular expressions
        foreach (self::$plural as $pattern => $result)
        {
            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }

        return $string;
    }

    public static function singularize($string)
    {
        # save some time in the case that singular and plural are the same
        if (in_array(mb_strtolower($string), self::$uncountable))
        {
            return $string;
        }

        # check for irregular plural forms
        foreach (self::$irregular as $result => $pattern)
        {
            $pattern = '/' . $pattern . '$/i';

            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }

        # check for matches using regular expressions
        foreach (self::$singular as $pattern => $result)
        {
            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }

        return $string;
    }
}
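One way to plug normalization into the lookup is to run the keywords and the user's words through the same normalizer before matching. The sketch below makes several assumptions: match_normalized() is a hypothetical helper, and $naive is a deliberately crude stand-in normalizer so that the example runs on its own.

```php
<?php
# Hypothetical helper: normalizes both sides with the same callable, then matches.
function match_normalized(array $keywords, array $words, callable $normalize): array
{
    $index = array_flip(array_map($normalize, $keywords));
    $found = [];

    foreach ($words as $word)
    {
        $key = $normalize($word);

        if (isset($index[$key]))
        {
            $found[$key] = true; # remember each matched keyword only once
        }
    }

    return array_keys($found);
}

# Stand-in normalizer for this demo only: lowercases and strips one trailing "s".
$naive = function (string $word): string {
    return preg_replace('/s$/', '', strtolower($word));
};

$matched = match_normalized(['leg', 'arm'], ['Legs', 'and', 'arms', 'legs'], $naive);
# $matched is ['leg', 'arm']
```

With the class above loaded, the string 'english::singularize' could be passed as the callable instead of $naive; since that method does not change case, lowercasing would still have to be done separately.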

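If you want to remove duplicates from the user's input, you can also call $words = array_unique($words) right after converting the text into words (this solves the problem where your keyword list contains legs and the user writes it 100 times to earn 100 points). It also makes your code a little faster, because fewer words = fewer iterations in the loop :)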
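The effect of deduplication on the score can be checked with a tiny experiment. This is a sketch with made-up data; the $score closure is a hypothetical stand-in for whatever scoring loop the quiz system ends up using.

```php
<?php
$keywords = array_flip(['legs']);
$words    = ['legs', 'legs', 'legs', 'walk'];

# Hypothetical scoring closure: one point per word found in the keyword index.
$score = function (array $words) use ($keywords): int {
    $points = 0;

    foreach ($words as $word)
    {
        if (isset($keywords[$word]))
        {
            $points++;
        }
    }

    return $points;
};

$raw    = $score($words);               # 3: every repetition scores
$unique = $score(array_unique($words)); # 1: each distinct word scores once
```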

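[Comments]:

Thank you very much, my friend :-) You made my day. I will implement it the way you suggested. If I run into difficulties or need help, may I contact you? If so, please share your email in the comments. Again, thank you so much for your help. :-)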

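[Answer 2]:

Keep the keyword array sorted.
Once the user submits an answer, keep only the unique words together with their counts.
Now iterate over each unique word in the answer and binary-search the sorted keyword array. On a match, increase the score by the stored count.

The complexity of the check is O(number_of_unique_answer_keywords * log(keywords) * avg(string_length)).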
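The three steps can be sketched in PHP as follows. The function names binary_search() and score_answer() are hypothetical, and the sketch sorts the keywords on every call for simplicity, whereas the answer suggests keeping the array sorted ahead of time.

```php
<?php
# Hypothetical helper: classic binary search over a sorted list of strings.
function binary_search(array $sorted, string $needle): bool
{
    $low  = 0;
    $high = count($sorted) - 1;

    while ($low <= $high)
    {
        $mid = intdiv($low + $high, 2);
        $cmp = strcmp($sorted[$mid], $needle);

        if ($cmp === 0)
        {
            return true;
        }

        if ($cmp < 0)
        {
            $low = $mid + 1;  # needle sorts after the middle element
        }
        else
        {
            $high = $mid - 1; # needle sorts before the middle element
        }
    }

    return false;
}

# Hypothetical helper: scores an answer using the steps described above.
function score_answer(array $keywords, array $words): int
{
    sort($keywords, SORT_STRING);         # step 1: sorted keyword array
    $counts = array_count_values($words); # step 2: unique words with counts
    $score  = 0;

    foreach ($counts as $word => $count)  # step 3: one search per unique word
    {
        if (binary_search($keywords, (string) $word))
        {
            $score += $count;
        }
    }

    return $score;
}

$score = score_answer(['index', 'array', 'hash'], ['hash', 'of', 'hash', 'index']);
# $score is 3: "hash" counts twice, "index" once
```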

[Comments]:

[Answer 3]:

Use PHP's levenshtein or similar_text functions to compare the two arrays for both exact and near matches.
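A minimal sketch of near matching with levenshtein might look like this. The helper closest_keyword() and the default threshold of one edit are assumptions chosen for illustration, not something the answer specifies.

```php
<?php
# Hypothetical helper: returns the closest keyword within $maxDistance edits, or null.
function closest_keyword(string $word, array $keywords, int $maxDistance = 1): ?string
{
    $best         = null;
    $bestDistance = $maxDistance + 1;

    foreach ($keywords as $keyword)
    {
        $distance = levenshtein($word, $keyword);

        if ($distance < $bestDistance)
        {
            $best         = $keyword;
            $bestDistance = $distance;
        }
    }

    return $best;
}

$exact = closest_keyword('index', ['index', 'array']);  # 'index' (distance 0)
$near  = closest_keyword('arrey', ['index', 'array']);  # 'array' (distance 1)
$none  = closest_keyword('banana', ['index', 'array']); # null (nothing close enough)
```

This compares every word with every keyword, so it brings back the words * keywords cost the question wanted to avoid; it may be worth running only on words that fail the exact lookup. levenshtein also works on bytes, so distances for multibyte text can be larger than expected.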

[Comments]: