【Question title】: How to count word frequency by removing non-letters from a string?
【Posted】: 2020-03-03 04:17:53
【Question description】:

I have a string:

var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";

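What is the best way to remove all non-letter characters and then split each word onto a new line, so that I can store the words and count how many of each there are?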

var words = text.Split(' ');

foreach(var word in words)
{
    word.Trim(',','.','-');
}

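I have tried various approaches, such as replacing the unwanted characters with whitespace via text.Replace and then splitting. I have also tried Regex (which I would rather not use).

I also tried using the StringBuilder class to take the characters from the text (string) one by one, appending a character only if it is a letter a-z / A-Z.

I also tried calling sb.Replace or sb.Remove on the characters I do not want before storing the words in a Dictionary. But I still seem to end up with characters I do not want.

Whatever I try, I seem to be left with at least one unwanted character, and I cannot figure out why it is not working.

Thanks!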

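【Question comments】:

  • Trim() returns the trimmed string from the method call; it does not change the string you call Trim() on. You need to adjust your code to use the value returned by the call to Trim() and update your words array with the result.
  • Do you have to move them onto new lines? Could you use a regular expression instead and see how many strings match it?
  • A simple regex will capture all the words without producing all the strings that a split would, e.g. Regex.Matches(text, @"\w+") will capture every run of consecutive word characters. The match count will be the number of words. Word characters include digits. "[a-zA-Z]+" will capture only English letters, while "\p{L}+" will capture letters in any language.
  • Does non-letter count as one word or two? Do you want to skip numbers too, so that "this is a 10 number" would be 4 words?
  • non-letter would be two words; ideally I would replace the hyphen with a space so that I can split each word onto a new line. My text does not actually contain any numbers, so it does not matter at this stage, but I would probably choose to skip them.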
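
As a rough illustration of the first comment above: Trim() returns a new string instead of changing the one it is called on, so the returned value has to be stored. The sketch below shows one way to do that for the loop in the question. The names TrimExample and SplitAndTrim are made up for this example and do not come from the original post.

```csharp
using System;

public static class TrimExample
{
    // Hypothetical helper: splits on whitespace and strips the listed
    // punctuation from both ends of every word.
    public static string[] SplitAndTrim(string text)
    {
        // A null separator array makes Split treat any whitespace
        // (spaces, tabs, new lines) as a delimiter.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < words.Length; i++)
        {
            // Strings are immutable, so keep the value that Trim() returns.
            words[i] = words[i].Trim(',', '.', '-');
        }
        return words;
    }
}
```

Note that Trim() only removes characters from the ends of a string, so with this sketch "non-letter" stays a single entry; splitting it into two words requires treating the hyphen as a separator, as the answers below do.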

Tags: c# string word-count distinct-values


【Solution 1】:

Using an extension method, without Regex or LINQ

using System;
using System.Collections.Generic;
using System.Text;

static public class StringHelper
{
  static public Dictionary<string, int> CountDistinctWords(this string text)
  {
    string str = text.Replace(Environment.NewLine, " ");
    var words = new Dictionary<string, int>();
    var builder = new StringBuilder();
    char charCurrent;
    // Adds the word collected so far (if any) to the dictionary.
    Action processBuilder = () =>
    {
      var word = builder.ToString();
      if ( !string.IsNullOrEmpty(word) )
      {
        if ( !words.ContainsKey(word) )
          words.Add(word, 1);
        else
          words[word]++;
      }
    };
    for ( int index = 0; index < str.Length; index++ )
    {
      charCurrent = str[index];
      if ( char.IsLetter(charCurrent) )
        builder.Append(charCurrent);
      else if ( !char.IsNumber(charCurrent) )
        charCurrent = ' ';  // any other non-digit character ends the current word
      // Digits are dropped without ending the current word.
      if ( char.IsWhiteSpace(charCurrent) )
      {
        processBuilder();
        builder.Clear();
      }
    }
    processBuilder();  // flush the last word
    return words;
  }
}

It walks through every character, discards all non-letters, and builds a dictionary of the words while counting their occurrences.

Test

var result = text.CountDistinctWords();
Console.WriteLine($"Found {result.Count} distinct words:");
Console.WriteLine();
foreach ( var item in result )
  Console.WriteLine($"{item.Key}: {item.Value}");

Sample result

Found 36 distinct words:

I: 3
have: 2
a: 2
long: 1
string: 1
with: 1
load: 1
of: 3
words: 1
and: 3
it: 1
includes: 1
new: 1
lines: 1
non: 1
letter: 1
characters: 1
want: 1
to: 2
remove: 1
all: 1
them: 1
split: 1
this: 1
text: 1
one: 1
word: 2
per: 1
line: 1
then: 1
can: 1
count: 1
how: 1
many: 1
each: 1
exist: 1


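    【Solution 2】:

    I do believe that counting frequencies with a dictionary is the best practice in terms of both performance and clarity. Here is my version, which differs slightly from the accepted answer (I use String.Split() instead of iterating over the characters of the string):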

    var text = @"
        I have a long string with a load of words,
        and it includes new lines and non-letter characters.
        I want to remove all of them and split this text to have one word per line, then I       can count how many of each word exist.";
    
    var words = text.Split(new [] {',', '.', '-', ' ', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
    
    var freqByWord = new Dictionary<string, int>();
    
    foreach (var word in words)
    {
        if (freqByWord.ContainsKey(word))
        {
            freqByWord[word]++; // we found the same word
        }
        else
        {
            freqByWord.Add(word, 1); // we don't have this one yet
        }
    }
    
    foreach (var word in freqByWord.Keys)
    {
        Console.WriteLine($"{word}: {freqByWord[word]}");
    }
    

    The result is almost the same:

    I: 3
    have: 2
    a: 2
    long: 1
    string: 1
    with: 1
    load: 1
    of: 3
    words: 1
    and: 3
    it: 1
    includes: 1
    new: 1
    lines: 1
    non: 1
    letter: 1
    characters: 1
    want: 1
    to: 2
    remove: 1
    all: 1
    them: 1
    split: 1
    this: 1
    text: 1
    one: 1
    word: 2
    per: 1
    line: 1
    then: 1
    can: 1
    count: 1
    how: 1
    many: 1
    each: 1
    exist: 1
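
A possible variation on the counting loop above, offered as a sketch and not as part of the original answer: Dictionary.TryGetValue can replace the ContainsKey check followed by an indexer access, so each word needs one lookup and one write. The class name SplitWordFrequency and the method name Count are invented for this example; the separator list is the one used in the answer.

```csharp
using System;
using System.Collections.Generic;

public static class SplitWordFrequency
{
    // Hypothetical helper: counts how often each word occurs after
    // splitting on the same separators as the answer above.
    public static Dictionary<string, int> Count(string text)
    {
        var separators = new[] { ',', '.', '-', ' ', '\n', '\r' };
        var freqByWord = new Dictionary<string, int>();

        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // TryGetValue leaves count at 0 when the word has not been seen yet.
            freqByWord.TryGetValue(word, out int count);
            freqByWord[word] = count + 1;
        }
        return freqByWord;
    }
}
```

Like the answer it is based on, this only removes the characters named in the separator list; any other non-letter character (for example a question mark) would remain attached to its word.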
    


      【Solution 3】:

      Use a regular expression to exclude the non-letter characters. This also gives you a collection of all the words.

      var text = @"
      I have a long string with a load of words,
      and it includes new lines and non-letter characters.
      I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";
      
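      // Each match is one run of letters, so no Trim/Split is needed and no empty entries can appear.
      var words = Regex.Matches(text, @"[A-Za-z]+").Cast<Match>().Select(m => m.Value);
      int wordCount = words.Count(); // total number of words, not the per-word frequency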
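
The snippet above yields the total number of words, while the question asks how many times each word occurs. One way to get there, sketched here under the assumption that words are runs of ASCII letters and that matching is case-sensitive, is to group the matches by their value. The names RegexWordFrequency and Count are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordFrequency
{
    // Hypothetical helper: returns each distinct word with its number of occurrences.
    public static Dictionary<string, int> Count(string text)
    {
        return Regex.Matches(text, @"[A-Za-z]+")   // every run of ASCII letters is one word
            .Cast<Match>()
            .GroupBy(m => m.Value)                  // case-sensitive grouping
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

To accept letters from any language, the pattern could be swapped for \p{L}+, as one of the question comments suggests.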
      
