【Question title】:Algorithm to compare a paragraph with an array of keywords in PHP
【Posted】:2017-10-11 20:49:58
【Question】:

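I want to build a quiz system for subjective (essay-style) papers. In this system, users answer each question with a paragraph instead of choosing from multiple options. The admin adds each question together with a set of keywords that serve as the answer. I need an efficient algorithm to compare the user's answer (up to 100 words) with the predefined keyword array (up to 50 words). How should I implement this? Please help.

Thanks in advance! I am thinking of converting the user's paragraph into an array of words and then comparing it with the predefined keyword array. But I think that is a time-consuming way to implement the system. For example, if the answer array contains 100 words and the predefined array contains 50 words, that is 100*50 comparisons, which is too expensive.

Please help me find an efficient solution in PHP.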

【Comments】:

    Tags: php arrays algorithm data-structures logic


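    【Solution 1】:
    1. If you want to split the text into words, you have to implement the algorithm separately for each language. Splitting the text on spaces is not enough for your task, because the punctuation remains. So you have to keep characters such as - and trim characters such as ! and ,. Meanwhile, if you look at Chinese, you will find that it uses a different set of punctuation marks, so you would have to enumerate them all.

      However, this task is easy to solve with str_word_count and a little help from a predefined alphabet. The example below works for English text (no extra alphabet) and Greek text (with an alphabet):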

    function words($string, $charlist = null)
    {
        return str_word_count($string, 1, $charlist);
    }
    
    
    $string_ASCII = 'ASCII string example'; # string(20) "ASCII string example"
    
    $result = words($string_ASCII); # Array
                                    # (
                                    #     [0] => ASCII
                                    #     [1] => string
                                    #     [2] => example
                                    # )
    
    
    $string_UTF8 = 'UTF-8 string πράδειγμα'; # string(31) "UTF-8 string πράδειγμα"
    
    $alphabet = '1234567890-ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαάβγδεζηθικλμνξοπρστυφχψω';
    
    $result = words($string_UTF8, $alphabet); # Array
                                              # (
                                              #     [0] => UTF-8
                                              #     [1] => string
                                              #     [2] => πράδειγμα
                                              # )
    
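    2. You do not need to compare the two arrays again and again. Use an index. The best approach is to call array_flip on your keyword array, then walk through the user's words only once and check each word with isset: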
    $keywords = array_flip($keywords); # $keywords - your 50 words
    
    $words = words($string); # $string - a text with 100 words from user
    
    foreach ($words as $word)
    {
        # only 100 iterations with fast isset validation
    
        if (isset($keywords[$word]))
        {
            # it exists!
        }
    }
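The lookup above can be wrapped into a small scoring function. This is a sketch under assumptions: the name `count_keyword_hits` is made up for illustration, matching is made case-insensitive with `mb_strtolower`, and each keyword is counted at most once.

```php
<?php
// Hypothetical scoring helper: counts how many distinct keywords occur in the answer.
function count_keyword_hits(array $answerWords, array $keywords): int
{
    // Build the index once; lower-case so "Leg" and "leg" are the same word.
    $index = array_flip(array_map('mb_strtolower', $keywords));

    $hits = [];
    foreach ($answerWords as $word) {
        $key = mb_strtolower($word);
        if (isset($index[$key])) {
            $hits[$key] = true; // remember each matched keyword only once
        }
    }

    return count($hits);
}

$keywords = ['photosynthesis', 'chlorophyll', 'sunlight'];
$answer   = ['Plants', 'use', 'sunlight', 'and', 'chlorophyll', 'sunlight'];

echo count_keyword_hits($answer, $keywords); // 2
```

With 100 answer words and 50 keywords this performs about 150 cheap operations (50 to build the index, 100 lookups) instead of 5000 pairwise comparisons.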
    
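    3. It is also a good idea to normalize words, because the user may write legs when your keyword list contains leg, and that answer is correct too. For English, I would recommend the following code:
    # Author - https://gist.github.com/tbrianjones
    # Source - https://gist.github.com/tbrianjones/ba0460cc1d55f357e00b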
    #
    # The MIT License (MIT)
    #
    # Copyright (c) 2015
    #
    #
    # Changes:
    #   Removed rule for virus -> viri
    #   Added rule for potato -> potatoes
    #   Added rule for *us -> *uses
    
    class english
    {
        private static $plural = array
        (
            '/(quiz)$/i'                     => '$1zes',
            '/^(ox)$/i'                      => '$1en',
            '/([ml])ouse$/i'                 => '$1ice',
            '/(matr|vert|ind)(?:ix|ex)$/i'   => '$1ices',
            '/(x|ch|ss|sh)$/i'               => '$1es',
            '/([^aeiouy]|qu)y$/i'            => '$1ies',
            '/(hive)$/i'                     => '$1s',
            '/(?:([^f])fe|([lr])f)$/i'       => '$1$2ves',
            '/(shea|lea|loa|thie)f$/i'       => '$1ves',
            '/sis$/i'                        => 'ses',
            '/([ti])um$/i'                   => '$1a',
            '/(tomat|potat|ech|her|vet)o$/i' => '$1oes',
            '/(bu)s$/i'                      => '$1ses',
            '/(alias)$/i'                    => '$1es',
            '/(octop)us$/i'                  => '$1i',
            '/(ax|test)is$/i'                => '$1es',
            '/(us)$/i'                       => '$1es',
            '/s$/i'                          => 's',
            '/$/'                            => 's'
        );
    
        private static $singular = array
        (
            '/(quiz)zes$/i'              => '$1',
            '/(matr)ices$/i'             => '$1ix',
            '/(vert|ind)ices$/i'         => '$1ex',
            '/^(ox)en$/i'                => '$1',
            '/(alias)es$/i'              => '$1',
            '/(octop|vir)i$/i'           => '$1us',
            '/(cris|ax|test)es$/i'       => '$1is',
            '/(shoe)s$/i'                => '$1',
            '/(o)es$/i'                  => '$1',
            '/(bus)es$/i'                => '$1',
            '/([ml])ice$/i'              => '$1ouse',
            '/(x|ch|ss|sh)es$/i'         => '$1',
            '/(m)ovies$/i'               => '$1ovie',
            '/(s)eries$/i'               => '$1eries',
            '/([^aeiouy]|qu)ies$/i'      => '$1y',
            '/([lr])ves$/i'              => '$1f',
            '/(tive)s$/i'                => '$1',
            '/(hive)s$/i'                => '$1',
            '/(li|wi|kni)ves$/i'         => '$1fe',
            '/(shea|loa|lea|thie)ves$/i' => '$1f',
            '/(^analy)ses$/i'            => '$1sis',
            '/((a)naly|(b)a|(d)iagno|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$/i' => '$1$2sis',
            '/([ti])a$/i'                => '$1um',
            '/(n)ews$/i'                 => '$1ews',
            '/(h|bl)ouses$/i'            => '$1ouse',
            '/(corpse)s$/i'              => '$1',
            '/(us)es$/i'                 => '$1',
            '/s$/i'                      => ''
        );
    
        private static $irregular = array
        (
            'move'   => 'moves',
            'foot'   => 'feet',
            'goose'  => 'geese',
            'sex'    => 'sexes',
            'child'  => 'children',
            'man'    => 'men',
            'tooth'  => 'teeth',
            'person' => 'people'
        );
    
        private static $uncountable = array
        (
            'sheep',
            'fish',
            'deer',
            'series',
            'species',
            'money',
            'rice',
            'information',
            'equipment'
        );
    
        public static function pluralize($string)
        {
            # save some time in the case that singular and plural are the same
            if (in_array(mb_strtolower($string), self::$uncountable))
            {
                return $string;
            }
    
    
            # check for irregular singular forms
            foreach (self::$irregular as $pattern => $result)
            {
                $pattern = '/' . $pattern . '$/i';
    
                if (preg_match($pattern, $string))
                {
                    return preg_replace($pattern, $result, $string);
                }
            }
    
            # check for matches using regular expressions
            foreach (self::$plural as $pattern => $result)
            {
                if (preg_match($pattern, $string))
                {
                    return preg_replace($pattern, $result, $string);
                }
            }
    
            return $string;
        }
    
        public static function singularize($string)
        {
            # save some time in the case that singular and plural are the same
            if (in_array(mb_strtolower($string), self::$uncountable))
            {
                return $string;
            }
    
            # check for irregular plural forms
            foreach (self::$irregular as $result => $pattern)
            {
                $pattern = '/' . $pattern . '$/i';
    
                if (preg_match($pattern, $string))
                {
                    return preg_replace($pattern, $result, $string);
                }
            }
    
            # check for matches using regular expressions
            foreach (self::$singular as $pattern => $result)
            {
                if (preg_match($pattern, $string))
                {
                    return preg_replace($pattern, $result, $string);
                }
            }
    
            return $string;
        }
    }
    
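    4. If you want to remove duplicates from the user input, you can also call $words = array_unique($words) right after converting the text into words (this may solve the problem where your keyword list contains legs and the user writes it 100 times to earn 100 points). It also makes your code a little faster, because fewer words means fewer loop iterations :)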
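Normalization and de-duplication can be combined in one step. The sketch below is an illustration only: `unique_normalized` is a hypothetical name, and the `$naive` normalizer is a deliberately crude stand-in used to keep the example self-contained. In practice a callable built on a real singularizer, such as the `english::singularize` method shown earlier, would take its place.

```php
<?php
// Hypothetical helper: normalize every word, then drop duplicates.
function unique_normalized(array $words, callable $normalize): array
{
    $normalized = array_map($normalize, $words);

    // array_unique keeps the original keys, so re-index with array_values.
    return array_values(array_unique($normalized));
}

// Stand-in normalizer for this demo only: lower-case and strip one trailing "s".
$naive = function (string $word): string {
    return preg_replace('/s$/', '', strtolower($word));
};

print_r(unique_normalized(['Legs', 'leg', 'arm', 'legs'], $naive));
```

Normalizing before calling array_unique matters: without it, Legs, leg and legs would survive as three separate entries.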

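    【Comments】:

    • Thank you very much, my friend :-) You made my day. I will implement it as you suggested. If I run into any difficulty or need help, may I contact you? If so, please share your email in the comments. Again, thank you very much for your help. :-)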
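    【Solution 2】:
    • Keep the keyword array sorted
    • Once the user submits an answer, keep only the unique words together with their counts
    • Now iterate over each unique word in the answer and binary-search the sorted keyword array. On a match, add the stored count to the score

    The lookup complexity is O(number_of_unique_answer_keywords * log(keywords) * avg(string_length))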
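The three steps above might look like this in PHP. It is a sketch, not a reference implementation: `binary_search` is a hand-written hypothetical helper (PHP has no built-in binary search for arrays), and the sample keywords are invented.

```php
<?php
// Hypothetical helper: classic binary search over a sorted list of strings.
function binary_search(array $sorted, string $needle): bool
{
    $lo = 0;
    $hi = count($sorted) - 1;

    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        $cmp = strcmp($sorted[$mid], $needle);

        if ($cmp === 0) {
            return true;
        }
        if ($cmp < 0) {
            $lo = $mid + 1;   // needle sorts after the middle element
        } else {
            $hi = $mid - 1;   // needle sorts before the middle element
        }
    }

    return false;
}

$keywords = ['sunlight', 'chlorophyll', 'photosynthesis'];
sort($keywords, SORT_STRING); // sort once, e.g. when the admin saves the question

// Unique answer words with their counts.
$counts = array_count_values(['sunlight', 'plants', 'sunlight', 'chlorophyll']);

$score = 0;
foreach ($counts as $word => $count) {
    if (binary_search($keywords, (string) $word)) {
        $score += $count; // add the stored count on a match
    }
}

echo $score; // 3
```

SORT_STRING and strcmp both compare byte-wise, so the sort order and the search order agree. For arrays this small, the isset lookup on a flipped array is usually simpler and at least as fast.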

    【Comments】:

      【Solution 3】:

      Use PHP's levenshtein and similar_text functions to compare the two arrays for exact and near matches
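A minimal sketch of near-match lookup with `levenshtein` follows. The helper name `closest_keyword` and the default threshold of 2 edits are assumptions chosen for illustration. Note that `levenshtein` works on bytes, so results for multibyte text may be off, and that this approach compares every word with every keyword, so it is best used only for words that fail an exact lookup first.

```php
<?php
// Hypothetical helper: return the keyword closest to $word, or null when no
// keyword is within $maxDistance edits.
function closest_keyword(string $word, array $keywords, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;

    foreach ($keywords as $keyword) {
        $distance = levenshtein($word, $keyword);
        if ($distance < $bestDistance) {
            $best = $keyword;
            $bestDistance = $distance;
        }
    }

    return $best;
}

$keywords = ['chlorophyll', 'sunlight'];

var_dump(closest_keyword('clorophyll', $keywords)); // string(11) "chlorophyll"
var_dump(closest_keyword('water', $keywords));      // NULL
```

`similar_text` could be substituted where a percentage similarity is preferred over an edit count.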

      【Comments】:
