Count the frequency of each character in a large text file using MATLAB

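【Question title】: Count the frequency of each character in a large text file using MATLAB
【Posted】: 2023-04-03 11:12:01
【Question description】:

I am trying to read a huge text file and count the frequency of each letter, and then I want to find the probability distribution of the letters. This is what I have tried so far: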

f = fopen('c:\words.txt');
ns = textscan(f, '%s');
fclose(f);

counts = hist(num, 1:26); 
prob = counts / numel(ns{:})

Any tips, help, or working code?

I am also trying this code, but the answer is not accurate:

fid = fopen('c:\words.txt');
c = fread(fid);
fclose(fid);


y = unique(c);
counts = histc(c, y);

I would like to get results like the following:

a = 2338 times
b = 4533 times 
c = 1233 times

and so on...

Regards,

【Question discussion】:

How large is it? Do you mean it is so large that it has to be read in smaller chunks?
Have you tried this? mathworks.com/matlabcentral/fileexchange/7738-countmember

Tags: matlab probability

【Solution 1】:

For large text files, you may want to avoid using hist or histc.

Code

%// Convert everything to chars
letters_char = reshape(char(ns{:}),[],1);

%// Get the case-insensitive count of each letter 
count_lettters = sum(bsxfun(@eq,letters_char,97:122),1) + ...
    sum(bsxfun(@eq,letters_char,65:90),1)

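Finally, to get the probability distribution, use plot(count_lettters./sum(count_lettters)) or bar(count_lettters./sum(count_lettters)), whichever you think looks better.

Then, if you want to label the x-axis with the letter that each probability belongs to, use set(gca, 'XTickLabel',cellstr(char(97:122)'),'XTick',1:26).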
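Putting the counting, normalization, and labeling steps together might look like the sketch below. It is only an illustration: `sample_text` is a made-up string standing in for the contents of the file, and `prob` is a name introduced here, not something from the answer.

```matlab
% Sample text standing in for the file contents (assumption for illustration)
sample_text = 'Hello World, hello MATLAB';

% Column vector of character codes
letters_char = double(sample_text(:));

% Case-insensitive count of each letter a-z (codes 97..122 and 65..90)
count_lettters = sum(bsxfun(@eq, letters_char, 97:122), 1) + ...
                 sum(bsxfun(@eq, letters_char, 65:90), 1);

% Normalize the counts so they sum to 1
prob = count_lettters ./ sum(count_lettters);

% Bar chart with the letters a-z as x-axis tick labels
bar(prob);
set(gca, 'XTick', 1:26, 'XTickLabel', cellstr(char(97:122)'));
xlabel('Letter');
ylabel('Probability');
```

Non-letter characters (spaces, punctuation, digits) match neither code range, so they are simply ignored and the probabilities are relative to the total number of letters.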

Sample plot -

Now, this was a random text file, but it at least shows one interesting fact: 'e' is probably the most frequent letter in typical text.
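If the file is too large to hold in memory at once (the concern raised in the question discussion), one possible variation is to read it in chunks and accumulate the counts. This is a rough sketch, not the approach from the answer above: it uses accumarray instead of bsxfun, assumes a single-byte encoding such as ASCII, and the demo file, `chunkSize`, `byteCounts`, and `letterCounts` are all names made up for illustration.

```matlab
% Hypothetical demo file so the sketch is self-contained
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Hello World, hello MATLAB');
fclose(fid);

chunkSize = 8;               % bytes per read; tiny for the demo, e.g. 1e6 for real files
byteCounts = zeros(256, 1);  % one bin per byte value 0..255

fid = fopen(fname, 'r');
while true
    % Read the next chunk as a column vector of byte values
    chunk = fread(fid, chunkSize, 'uint8=>double');
    if isempty(chunk)
        break;               % end of file reached
    end
    % Tally this chunk; +1 because MATLAB indices start at 1
    byteCounts = byteCounts + accumarray(chunk + 1, 1, [256 1]);
end
fclose(fid);
delete(fname);

% Fold upper-case into lower-case counts (a-z = 97..122, A-Z = 65..90)
letterCounts = byteCounts((97:122) + 1) + byteCounts((65:90) + 1);
```

Only one chunk is in memory at a time, so the memory use is bounded by `chunkSize` rather than by the file size.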

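【Discussion】:

How do I calculate the probability distribution of the results and plot it in a chart?
@user2085339 Do you mean something like this - plot(count_lettters./sum(count_lettters)) or bar(count_lettters./sum(count_lettters))?
One last request, please... I want to replace the numbers (1-26) on the x-axis with the letters (a-z).
Perfect, thanks *****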

【Solution 2】:

This reads all characters at once into the array A:

fileID = fopen('words.txt','r');
A = fscanf(fileID, '%c');   % this also works for unicode characters.
fclose(fileID);

Using a Map, you can count the occurrences of every character:

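% Create an empty map from single characters to counts
keyMap = containers.Map('KeyType', 'char', 'ValueType', 'double');

for i = 1:numel(A)
    if isKey(keyMap, A(i))
        % Character seen before: increment its count
        keyMap(A(i)) = keyMap(A(i)) + 1;
    else
        % First occurrence of this character
        keyMap(A(i)) = 1;
    end
end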
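To print the counts in the "a = 2338 times" style the question asks for, the map can be walked key by key. The following is a minimal sketch under a few assumptions: the string 'abracadabra' stands in for the file contents, and `allKeys`, `total`, and `ch` are names chosen here for illustration.

```matlab
A = 'abracadabra';   % sample text standing in for the file contents (assumption)

% Count every character with a containers.Map, as in the loop above
keyMap = containers.Map('KeyType', 'char', 'ValueType', 'double');
for i = 1:numel(A)
    ch = A(i);
    if isKey(keyMap, ch)
        keyMap(ch) = keyMap(ch) + 1;
    else
        keyMap(ch) = 1;
    end
end

% Print each character with its count and relative frequency
allKeys = keys(keyMap);   % cell array of the distinct characters
total = numel(A);
for k = 1:numel(allKeys)
    ch = allKeys{k};
    fprintf('%s = %d times (p = %.4f)\n', ch, keyMap(ch), keyMap(ch) / total);
end
```

Because the map counts every character it sees, whitespace and punctuation in a real file would appear as keys too; filtering A first (for example with isletter) is one way to restrict the output to letters.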

【Discussion】: