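- If you want to split text into words, you have to implement the algorithm separately for each language. Splitting on spaces alone is not enough, because the punctuation marks stay attached to the words. So you have to keep "-" and trim characters such as "!", "," and so on. And if you look at Chinese, you will find that it uses a different set of punctuation marks, so you would have to enumerate all of those as well.
However, this task is easy to solve with str_word_count and a little help from a predefined alphabet. The example below works for English text (no extra alphabet needed) and for Greek text (with an alphabet):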
function words($string, $charlist = null)
{
    return str_word_count($string, 1, $charlist);
}
$string_ASCII = 'ASCII string example'; # string(20) "ASCII string example"
$result = words($string_ASCII); # Array
# (
# [0] => ASCII
# [1] => string
# [2] => example
# )
$string_UTF8 = 'UTF-8 string πράδειγμα'; # string(31) "UTF-8 string πράδειγμα"
$alphabet = '1234567890-ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαάβγδεζηθικλμνξοπρστυφχψω';
$result = words($string_UTF8, $alphabet); # Array
# (
# [0] => UTF-8
# [1] => string
# [2] => πράδειγμα
# )
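If enumerating an alphabet per language gets tedious, a Unicode-aware regular expression is one possible alternative. This is a different technique from the str_word_count approach above, and the helper name words_unicode is my own, not something from PHP. A minimal sketch, assuming the input is valid UTF-8:

```php
<?php
# Hypothetical helper (not a PHP built-in): collect runs that start with a
# letter or digit and continue with letters, combining marks, digits,
# apostrophes or hyphens. The /u modifier makes PCRE treat input as UTF-8.
function words_unicode(string $string): array
{
    preg_match_all('/[\p{L}\p{N}][\p{L}\p{M}\p{N}\'-]*/u', $string, $matches);
    return $matches[0];
}

print_r(words_unicode('UTF-8 string πράδειγμα'));
# UTF-8, string, πράδειγμα

print_r(words_unicode('你好,世界!Hello, world!'));
# 你好, 世界, Hello, world
```

Note that this only strips punctuation. Chinese text has no spaces between words, so each run between punctuation marks comes back as one token; real word segmentation for such languages needs more than a regular expression.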
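- You don't need to compare the two arrays again and again. Use an index. The best approach is to call array_flip on your keywords array, then loop over the user's words only once and check each word with isset: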
$keywords = array_flip($keywords); # $keywords - your 50 words
$words = words($string); # $string - a text with 100 words from user
foreach ($words as $word)
{
    # only 100 iterations with fast isset validation
    if (isset($keywords[$word]))
    {
        # it exists!
    }
}
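Here is the same idea as a small runnable sketch. The function name find_keywords and the sample data are made up for illustration:

```php
<?php
# Hypothetical helper: return the user words that appear in the keyword list.
function find_keywords(array $keywords, array $words): array
{
    # Values become keys, so each check below is a hash lookup
    # instead of a scan through the whole keyword list.
    $index = array_flip($keywords);

    $found = [];
    foreach ($words as $word)
    {
        if (isset($index[$word]))
        {
            $found[] = $word;
        }
    }
    return $found;
}

print_r(find_keywords(['leg', 'arm', 'head'], ['my', 'leg', 'and', 'my', 'arm', 'hurt']));
# leg, arm
```

The lookup is case-sensitive, so "Leg" would not match "leg" unless both sides are lowercased first.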
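- It is also a good idea to think about word normalization, because when your keyword list contains leg, the user may write legs, which is correct as well. For English, I can recommend the following code: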
# Author - https://gist.github.com/tbrianjones
# Source - https://gist.github.com/tbrianjones/ba0460cc1d55f357e00b
#
# The MIT License (MIT)
#
# Copyright (c) 2015
#
#
# Changes:
# Removed rule for virus -> viri
# Added rule for potato -> potatoes
# Added rule for *us -> *uses
class english
{
    private static $plural = array
    (
        '/(quiz)$/i' => '$1zes',
        '/^(ox)$/i' => '$1en',
        '/([m|l])ouse$/i' => '$1ice',
        '/(matr|vert|ind)(?:ix|ex)$/i' => '$1ices',
        '/(x|ch|ss|sh)$/i' => '$1es',
        '/([^aeiouy]|qu)y$/i' => '$1ies',
        '/(hive)$/i' => '$1s',
        '/(?:([^f])fe|([lr])f)$/i' => '$1$2ves',
        '/(shea|lea|loa|thie)f$/i' => '$1ves',
        '/sis$/i' => 'ses',
        '/([ti])um$/i' => '$1a',
        '/(tomat|potat|ech|her|vet)o$/i' => '$1oes',
        '/(bu)s$/i' => '$1ses',
        '/(alias)$/i' => '$1es',
        '/(octop)us$/i' => '$1i',
        '/(ax|test)is$/i' => '$1es',
        '/(us)$/i' => '$1es',
        '/s$/i' => 's',
        '/$/' => 's'
    );
    private static $singular = array
    (
        '/(quiz)zes$/i' => '$1',
        '/(matr)ices$/i' => '$1ix',
        '/(vert|ind)ices$/i' => '$1ex',
        '/^(ox)en$/i' => '$1',
        '/(alias)es$/i' => '$1',
        '/(octop|vir)i$/i' => '$1us',
        '/(cris|ax|test)es$/i' => '$1is',
        '/(shoe)s$/i' => '$1',
        '/(o)es$/i' => '$1',
        '/(bus)es$/i' => '$1',
        '/([m|l])ice$/i' => '$1ouse',
        '/(x|ch|ss|sh)es$/i' => '$1',
        '/(m)ovies$/i' => '$1ovie',
        '/(s)eries$/i' => '$1eries',
        '/([^aeiouy]|qu)ies$/i' => '$1y',
        '/([lr])ves$/i' => '$1f',
        '/(tive)s$/i' => '$1',
        '/(hive)s$/i' => '$1',
        '/(li|wi|kni)ves$/i' => '$1fe',
        '/(shea|loa|lea|thie)ves$/i' => '$1f',
        '/(^analy)ses$/i' => '$1sis',
        '/((a)naly|(b)a|(d)iagno|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$/i' => '$1sis',
        '/([ti])a$/i' => '$1um',
        '/(n)ews$/i' => '$1ews',
        '/(h|bl)ouses$/i' => '$1ouse',
        '/(corpse)s$/i' => '$1',
        '/(us)es$/i' => '$1',
        '/s$/i' => ''
    );
    private static $irregular = array
    (
        'move' => 'moves',
        'foot' => 'feet',
        'goose' => 'geese',
        'sex' => 'sexes',
        'child' => 'children',
        'man' => 'men',
        'tooth' => 'teeth',
        'person' => 'people'
    );
    private static $uncountable = array
    (
        'sheep',
        'fish',
        'deer',
        'series',
        'species',
        'money',
        'rice',
        'information',
        'equipment'
    );
    public static function pluralize($string)
    {
        # save some time in the case that singular and plural are the same
        if (in_array(mb_strtolower($string), self::$uncountable))
        {
            return $string;
        }
        # check for irregular singular forms
        foreach (self::$irregular as $pattern => $result)
        {
            $pattern = '/' . $pattern . '$/i';
            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }
        # check for matches using regular expressions
        foreach (self::$plural as $pattern => $result)
        {
            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }
        return $string;
    }
    public static function singularize($string)
    {
        # save some time in the case that singular and plural are the same
        if (in_array(mb_strtolower($string), self::$uncountable))
        {
            return $string;
        }
        # check for irregular plural forms
        foreach (self::$irregular as $result => $pattern)
        {
            $pattern = '/' . $pattern . '$/i';
            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }
        # check for matches using regular expressions
        foreach (self::$singular as $pattern => $result)
        {
            if (preg_match($pattern, $string))
            {
                return preg_replace($pattern, $result, $string);
            }
        }
        return $string;
    }
}
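To actually use normalization in the lookup, both the keywords and the user's words have to go through the same normalizer before they are compared. The sketch below shows one way to wire that up. The function match_normalized is a name I made up, and the $naive normalizer is only a stand-in so the example runs on its own:

```php
<?php
# Hypothetical helper: index the normalized keywords, then normalize each
# user word the same way before checking it against the index.
function match_normalized(array $keywords, array $words, callable $normalize): array
{
    $index = array_flip(array_map($normalize, $keywords));

    $found = [];
    foreach ($words as $word)
    {
        if (isset($index[$normalize($word)]))
        {
            $found[] = $word; # keep the word as the user wrote it
        }
    }
    return $found;
}

# Stand-in normalizer (an assumption for this sketch only):
# lowercase the word, then drop a single trailing "s".
$naive = function (string $word): string
{
    return preg_replace('/s$/', '', strtolower($word));
};

print_r(match_normalized(['leg', 'Arms'], ['Legs', 'arm', 'head'], $naive));
# Legs, arm
```

With the class above loaded, the callable ['english', 'singularize'] could be passed instead of $naive. It does not change letter case, so you would probably still want to lowercase both sides.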
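- If you want to remove duplicates from the user's input, you can also call $words = array_unique($words) right after converting the text to words (this may solve the problem where your keyword list contains legs and a user writes it 100 times to get 100 points). It also makes your code a bit faster, because fewer words = fewer loop iterations :)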
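A quick sketch of the difference deduplication makes to the score. The function count_hits and the sample data are hypothetical:

```php
<?php
# Hypothetical helper: count how many of the given words are keywords.
# $index is expected to be the result of array_flip() on the keyword list.
function count_hits(array $index, array $words): int
{
    $hits = 0;
    foreach ($words as $word)
    {
        if (isset($index[$word]))
        {
            $hits++;
        }
    }
    return $hits;
}

$index = array_flip(['legs', 'arm']);
$words = ['legs', 'legs', 'legs', 'arm', 'toe'];

echo count_hits($index, $words), "\n";               # 4: every repeat scores
echo count_hits($index, array_unique($words)), "\n"; # 2: each keyword scores once
```

array_unique keeps the original array keys, which is fine for foreach. Wrap it in array_values if you need the result reindexed from 0.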