【Question title】: Counting/sorting characters in a text file
【Posted】: 2015-05-17 19:26:51
【Question description】:

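I am trying to write a program that reads a text file, sorts it by character, and keeps track of how many times each character appears in the document. This is what I have so far.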

class Program
{
    static void Main(string[] args)
    {
        CharFrequency[] Charfreq = new CharFrequency[128];

        try
        {            
        string line;
        System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\User\Documents\Visual Studio 2013\Projects\Array_Project\wap.txt");
        while ((line = file.ReadLine()) != null)
        {
            int ch = file.Read();

            if (Charfreq.Contains(ch))
            {

            }     
        }

        file.Close();

        Console.ReadLine();
        }
        catch (Exception e)
        {
            Console.WriteLine("The process failed: {0}", e.ToString());
        }
    }
}

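My question is: what should go inside the if statement here?

I also have a CharFrequency class, which I am including here in case it is helpful or necessary (and yes, I am required to use an array rather than a List or ArrayList).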

public class CharFrequency
{
    private char m_character;
    private long m_count;

    public CharFrequency(char ch)
    {
        Character = ch;
        Count = 0;
    }

    public CharFrequency(char ch, long charCount)
    {
        Character = ch;
        Count = charCount;
    }

    public char Character
    {
        set
        {
            m_character = value;
        }

        get
        {
            return m_character;
        }
    }

    public long Count
    {
        get
        {
            return m_count;
        }
        set
        {
            if (value < 0)
                value = 0;

            m_count = value;
        }
    }

    public void Increment()
    {
        m_count++;

    }

    public override bool Equals(object obj)
    {
        bool equal = false;
        CharFrequency cf = new CharFrequency('\0', 0);

        cf = (CharFrequency)obj;

        if (this.Character == cf.Character)
            equal = true;

        return equal;
    }

    public override int GetHashCode()
    {
        return m_character.GetHashCode();
    }

    public override string ToString()
    {
        String s = String.Format("'{0}' ({1})     = {2}", m_character, (byte)m_character, m_count);

        return s;
    }

}

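【Comments】:

  • Are you reading character by character? If so, why do you have the ReadLine() call?
  • The ReadLine shouldn't be there; it is left over from earlier code.
  • Why not do a "strbob = .ReadToEnd()", then loop over the character set, computing strbob.length - strbob.replace(strloopchar).length() for each character and putting the result into the array?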
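
The replace-based idea from the last comment can be sketched roughly as follows. This is only an illustration of the comment's pseudocode, not code from the thread: the `ReplaceCounter` class, the `CountByReplace` method and the sample string are hypothetical names, and the text is assumed to have been read already (for example with `ReadToEnd()`). Note that it makes one pass over the text per character code, so it does more work than counting in a single pass.

```csharp
using System;

public static class ReplaceCounter
{
    // For each ASCII code, the number of occurrences equals the length of the
    // text minus the length of the text with that character removed.
    public static long[] CountByReplace(string text)
    {
        var counts = new long[128];
        for (int i = 0; i < counts.Length; i++)
        {
            string c = ((char)i).ToString();
            counts[i] = text.Length - text.Replace(c, "").Length;
        }
        return counts;
    }

    public static void Main()
    {
        // Hypothetical sample text standing in for the file contents.
        long[] counts = CountByReplace("hello world");
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] > 0)
            {
                Console.WriteLine("'{0}' ({1}) = {2}", (char)i, i, counts[i]);
            }
        }
    }
}
```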

Tags: c# arrays counter


【Solution 1】:

【Discussion】:

    【Solution 2】:

    You shouldn't use Contains.

    First you need to initialize your Charfreq array:

    CharFrequency[] Charfreq = new CharFrequency[128];
    
    for (int i = 0; i < Charfreq.Length; i++)
    {
        Charfreq[i] = new CharFrequency((char)i);
    }
    
    try
    

    Then you can do this:

    int ch;
    
    // -1 means that there are no more characters to read,
    // otherwise ch is the char read
    while ((ch = file.Read()) != -1)
    {
         CharFrequency cf = new CharFrequency((char)ch);
    
         // This works because CharFrequency overrides the
         // Equals method, and that Equals method compares only
         // the Character property of CharFrequency
         int ix = Array.IndexOf(Charfreq, cf);
    
         // if an entry for this character exists in the array
         if (ix != -1)
         {
             Charfreq[ix].Increment();
         }     
    }
    

    Note that this is not how I would write the program. It is the minimum change needed to make your program work.

    As a side note, this program only counts ASCII characters (code < 128).

    Also, in your Equals method, this:

    CharFrequency cf = new CharFrequency('\0', 0);
    
    cf = (CharFrequency)obj;
    

    is a useless initialization:

    CharFrequency cf = (CharFrequency)obj;
    

    is enough; otherwise you create a CharFrequency only to discard it on the very next line.
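
Because the array holds exactly one entry per ASCII code, the character code itself can serve as the array index, which avoids allocating a temporary object and searching with `Array.IndexOf` for every character read. The sketch below shows that idea; it is not the answerer's code. The `Tally` and `AsciiCounter` names and the sample string are made up for illustration: `Tally` is a hypothetical stand-in for the question's `CharFrequency` class so that the block is self-contained, and a `StringReader` stands in for the `StreamReader` over the file. The final sort by count is an optional extra, since the array already starts out ordered by character code.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the question's CharFrequency class.
public class Tally
{
    public char Character;
    public long Count;
}

public static class AsciiCounter
{
    // Counts ASCII characters (codes 0..127) read from any TextReader.
    public static Tally[] CountAscii(TextReader reader)
    {
        var table = new Tally[128];
        for (int i = 0; i < table.Length; i++)
        {
            table[i] = new Tally { Character = (char)i };
        }

        int ch;
        // Read() returns -1 when there is nothing left to read.
        while ((ch = reader.Read()) != -1)
        {
            // The character code doubles as the array index;
            // characters outside the ASCII range are skipped.
            if (ch < table.Length)
            {
                table[ch].Count++;
            }
        }
        return table;
    }

    public static void Main()
    {
        Tally[] table = CountAscii(new StringReader("hello world"));

        // Reorder so that the most frequent characters come first.
        Array.Sort(table, (a, b) => b.Count.CompareTo(a.Count));

        foreach (Tally t in table)
        {
            if (t.Count > 0)
            {
                Console.WriteLine("'{0}' ({1}) = {2}", t.Character, (int)t.Character, t.Count);
            }
        }
    }
}
```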

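    【Discussion】:

      【Solution 3】:

      A dictionary is well suited to a task like this. You didn't say which character set and encoding the file uses, so, because Unicode is so common, let's assume the Unicode character set and the UTF-8 encoding. (Unicode is, after all, the character set used by .NET, Java, JavaScript, HTML, XML, and so on.) If that is not the case, read the file with the applicable encoding and fix your code, because you are currently using a UTF-8 StreamReader.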

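      The next step is to iterate over the "characters", incrementing the count in the dictionary for each "character" as it is seen in the text.

      Unicode does have some complicating features. One is combining characters, where a base character can be overlaid with diacritics and the like. Users perceive such a combination as one "character", or, as Unicode calls it, a grapheme. Thankfully, .NET provides the StringInfo class, which iterates over them as "text elements".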
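
To make the difference between `char` values and text elements concrete, here is a small sketch. The `TextElementDemo` class and `CountTextElements` method are hypothetical names chosen for illustration. The sample string is the letter "e" followed by U+0301 (combining acute accent): it has a `Length` of 2, but `StringInfo` treats it as a single text element.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Counts text elements: a base character together with any
    // combining marks that follow it counts once.
    public static int CountTextElements(string s)
    {
        int count = 0;
        TextElementEnumerator itor = StringInfo.GetTextElementEnumerator(s);
        while (itor.MoveNext())
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // "e" + U+0301 COMBINING ACUTE ACCENT, displayed as a single accented e.
        string s = "e\u0301";
        Console.WriteLine(s.Length);              // UTF-16 code units
        Console.WriteLine(CountTextElements(s));  // text elements
    }
}
```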

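      So, if you think about it, doing this with an array would be very hard: you would have to build your own dictionary on top of the array.

      The example below uses a dictionary and can be run as a LINQPad script. After building the dictionary, it sorts it and dumps it in a nicely formatted display.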

      var path = Path.GetTempFileName();
      // Get some text we know is encoded in UTF-8 to simplify the code below
      // and contains combining codepoints as a matter of example.
      using (var web = new WebClient())
      {
          web.DownloadFile("http://superuser.com/questions/52671/which-unicode-characters-do-smilies-like-%D9%A9-%CC%AE%CC%AE%CC%83-%CC%83%DB%B6-consist-of", path); 
      }
      // since the question asks to analyze a file
      var content = File.ReadAllText(path, Encoding.UTF8); 
      var frequency = new Dictionary<String, int>();
      var itor = System.Globalization.StringInfo.GetTextElementEnumerator(content);
      while (itor.MoveNext()) 
      {
          var element = (String)itor.Current;
          if (!frequency.ContainsKey(element)) 
          {
              frequency.Add(element, 0);
          }
          frequency[element]++;
      }
      var histogram = frequency
          .OrderByDescending(f => f.Value)
          // jazz it up with the list of codepoints in each text element
          .Select(pair =>  
              {
                  var bytes = Encoding.UTF32.GetBytes(pair.Key);
                  var codepoints = new UInt32[bytes.Length/4];
                  Buffer.BlockCopy(bytes, 0, codepoints, 0, bytes.Length);
                  return new { 
                      Count = pair.Value, 
                      textElement = pair.Key, 
                      codepoints = codepoints.Select(cp => String.Format("U+{0:X4}", cp) ) };
              });
      histogram.Dump(); // For use in LINQPad
      

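      【Discussion】:

      • Wow! I hadn't noticed the over-the-top handling of surrogate pairs and combining characters! I always love it when Unicode is handled correctly! :-)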