Answers to: Python Word Count of a file using ssh from 100 machines

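【Question Title】: Python Word Count of a file using ssh from 100 machines
【Posted】: 2014-11-11 00:33:24
【Question】:

I am trying to write word-count code for a given file. When I run it, I get an empty entry in my dictionary; I only want each word and its frequency. I am not sure where this goes wrong.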

import collections, re

class Wordcount(object):
    def __init__(self):
        self.freq_dict = collections.defaultdict(int)

    def count(self,input_file):
        with open(input_file) as f:
            for line in f:
                words = line.rstrip().strip().split()
                for word in words:
                    word = word.lower()
                    word = re.sub("[^A-Za-z0-9]+",'',word)
                    self.freq_dict[word]+=1
        print self.freq_dict

def Main():
    c1 = Wordcount()
    c1.count('out.txt')

My out.txt looks like this:

The quick brown fox jumps over the lazy dog

--
 asd
 asdasd


The quick brown fox jumps over the lazy dog's

The quick brown fox jumps over the lazy dog

The space before asd is getting parsed into the dictionary:

defaultdict(<type 'int'>, {'': 1, 'brown': 3, 'lazy': 3, 'over': 3, 'fox': 3, 'dog': 2, 'asdasd': 1, 'dogs': 1, 'asd': 1, 'quick': 3, 'the': 6, 'jumps': 3})

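I also want to extend this to ssh into nearly 1000 machines, read the file on each one, and increment the word frequencies. What is the best way to do that? Should I create a thread T1 that logs in to a machine and hands the login to another thread that reads the file, which in turn hands off to a separate thread that increments the hash values?

Any suggestions on how to scale this would be really helpful.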
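The thread layout described above can be simplified. A minimal sketch, assuming Python 3 (where `concurrent.futures` is in the standard library): each worker thread handles one host end to end (fetch, then count), and only the main thread merges, so the shared counter needs no lock. The `fetch_text` callable and the `FAKE_FILES` data are hypothetical stand-ins for the real SSH read (for example via paramiko or an `ssh` subprocess); they are not part of the original post.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_host(host, fetch_text):
    """Fetch one host's file contents and return its word counts."""
    return Counter(fetch_text(host).lower().split())


def count_all(hosts, fetch_text, max_workers=32):
    """Run count_host for every host in a thread pool and merge the results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Workers only fetch and count; merging happens in this thread,
        # so the shared Counter is never touched concurrently.
        for host_counts in pool.map(lambda h: count_host(h, fetch_text), hosts):
            total.update(host_counts)
    return total


# Hypothetical stand-in for the real SSH read, so the sketch runs anywhere.
FAKE_FILES = {
    "host1": "the quick brown fox",
    "host2": "the lazy dog",
    "host3": "The fox",
}


def fake_fetch(host):
    return FAKE_FILES[host]


total = count_all(sorted(FAKE_FILES), fake_fetch, max_workers=3)
print(total["the"], total["fox"])
```

Because the work is dominated by network waits, a bounded pool of threads is usually enough; `max_workers` caps how many SSH sessions would be open at once.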

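【Comments】:

Map/Reduce technique?
Yes, it is an MR job, but I want to use only Python!
Tip for avoiding the empty entry: check whether the line is empty. Also, strip alone is enough; drop the rstrip.
Better to use Counter for counting. For splitting, create an iterator.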
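The suggestions in these comments can be sketched as follows. Note that `split()` already discards leading whitespace, so the `''` key in the posted output comes from the `--` line, which is reduced to an empty string by the regex substitution. The function name `count_words` and the `sample` list are illustrative and not from the original post.

```python
import re
from collections import Counter

# Anything that is not a lowercase letter or digit is stripped from a token.
NON_ALNUM = re.compile(r"[^a-z0-9]+")


def count_words(lines):
    """Count cleaned, lowercased words, skipping tokens that clean to ''."""
    counts = Counter()
    for line in lines:
        for token in line.split():  # split() ignores surrounding whitespace
            word = NON_ALNUM.sub("", token.lower())
            if word:  # "--" cleans to "", so it is dropped here
                counts[word] += 1
    return counts


# Sample lines mirroring the out.txt shown in the question.
sample = [
    "The quick brown fox jumps over the lazy dog\n",
    "\n",
    "--\n",
    " asd\n",
    " asdasd\n",
    "\n",
    "The quick brown fox jumps over the lazy dog's\n",
    "The quick brown fox jumps over the lazy dog\n",
]
counts = count_words(sample)
print(counts.most_common(3))
```

For a real file, the same function accepts the open file object directly, e.g. `with open('out.txt') as f: count_words(f)`.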

Tags: python multithreading ssh paramiko

【Solution 1】:

Here is a simple example using fabric. Fabric is a framework that lets you execute commands on multiple machines over ssh.

from fabric.api import task, run, get
from collections import Counter
from StringIO import StringIO


def worlds(data):
    return data.split()


@task
def count_worlds():
    s_fp = StringIO()
    # for big files it is better to download to a temp file
    get('/some/remote/file', s_fp)
    # split into words first: Counter on a raw string would count characters
    world_count = Counter(worlds(s_fp.getvalue()))
    # do something with world_count

To execute this script on multiple machines, just save it as fabfile.py and run:

$ fab count_worlds -H host1,host2,host3

You can also define the hosts in the fabfile itself; see this for more information. Of course, you need to install fabric first.
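The task above leaves "do something with world_count" open. One option, sketched here under assumptions, is to have the task return its `Counter` and merge the per-host results afterwards. As far as I know, Fabric 1.x's `execute` returns a dict mapping each host string to the task's return value; the `per_host` dict below only imitates that shape with made-up data so the snippet runs without fabric, and `merge_host_counts` is a hypothetical helper name.

```python
from collections import Counter


def merge_host_counts(per_host):
    """Sum a mapping of host name -> Counter into one Counter."""
    total = Counter()
    for host, counts in per_host.items():
        total.update(counts)  # update() adds counts rather than replacing them
    return total


# Hypothetical per-host results, shaped like {host: task_return_value}.
per_host = {
    "host1": Counter({"the": 2, "fox": 1}),
    "host2": Counter({"the": 1, "dog": 1}),
}
total = merge_host_counts(per_host)
print(total.most_common(1))
```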

【Comments】:

Thanks! This is definitely what I was looking for.