Creating an ARFF file from Python output: answers

【Question title】: Creating an ARFF file from python output
【Posted】: 2011-07-10 23:36:13
【Question】:

gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': {'dail': 1, 'focus': 1, 'actions': 1, 'trade': 2, 'protest': 1, 'identify': 1, 'previous': 1, 'detectives': 1, 'republican': 1, 'group': 1, 'monitor': 1, 'clashes': 1, 'civil': 1, 'charge': 1, 'breaches': 1, 'travelling': 1, 'main': 1, 'disrupt': 1, 'real': 1, 'policing': 3, 'march': 6, 'finance': 1, 'drawn': 1, 'assistant': 1, 'protesters': 1, 'emphasised': 1, 'department': 1, 'traffic': 2, 'outbreak': 1, 'culprits': 1, 'proportionate': 1, 'instructions': 1, 'warned': 2, 'commanders': 1, 'michael': 2, 'exploit': 1, 'culminating': 1, 'large': 2, 'continue': 1, 'team': 1, 'hijack': 1, 'disorder': 1, 'square': 1, 'leaders': 1, 'deal': 2, 'people': 3, 'streets': 1, 'demonstrations': 2, 'observed': 1, 'street': 2, 'college': 1, 'organised': 1, 'operation': 1, 'special': 1, 'shown': 1, 'attendance': 1, 'normal': 1, 'unions': 2, 'individuals': 1, 'safety': 2, 'prosecuted': 1, 'ira': 1, 'ground': 1, 'public': 2, 'told': 1, 'body': 1, 'stewards': 2, 'obey': 1, 'business': 1, 'gathered': 1, 'assemble': 1, 'garda': 5, 'sinn': 1, 'broken': 1, 'fachtna': 1, 'management': 2, 'possibility': 1, 'groups': 3, 'put': 1, 'affiliated': 1, 'strong': 2, 'security': 1, 'stage': 1, 'behaviour': 1, 'involved': 1, 'route': 2, 'violence': 1, 'dublin': 3, 'fein': 1, 'ensure': 2, 'stand': 1, 'act': 2, 'contingency': 1, 'troublemakers': 2, 'facilitate': 2, 'road': 1, 'members': 1, 'prepared': 1, 'presence': 1, 'sullivan': 2, 'reassure': 1, 'number': 3, 'community': 1, 'strategic': 1, 'visible': 2, 'addressed': 1, 'notify': 1, 'trained': 1, 'eirigi': 1, 'city': 4, 'gpo': 1, 'from': 3, 'crowd': 1, 'visit': 1, 'wood': 1, 'editor': 1, 'peaceful': 4, 'expected': 2, 'today': 1, 'commissioner': 4, 'quay': 1, 'ictu': 1, 'advance': 1, 'murphy': 2, 'gardai': 6, 'aware': 1, 'closures': 1, 'courts': 1, 'branch': 1, 'deployed': 1, 'made': 1, 'thousands': 1, 'socialist': 1, 'work': 1, 'supt': 2, 'feehan': 1, 'mr': 1, 'briefing': 1, 'visited': 
1, 'manner': 1, 'irish': 2, 'metropolitan': 1, 'spotters': 1, 'organisers': 1, 'in': 13, 'dissident': 1, 'evidence': 1, 'tom': 1, 'arrangements': 3, 'experience': 1, 'allowed': 1, 'sought': 1, 'rally': 1, 'connell': 1, 'officers': 3, 'potential': 1, 'holding': 1, 'units': 1, 'place': 2, 'events': 1, 'dignified': 1, 'planned': 1, 'independent': 1, 'added': 2, 'plans': 1, 'congress': 1, 'centre': 3, 'comprehensive': 1, 'measures': 1, 'yesterday': 2, 'alert': 1, 'important': 1, 'moving': 1, 'plan': 2, 'highly': 1, 'law': 2, 'senior': 2, 'fair': 1, 'recent': 1, 'refuse': 1, 'attempt': 1, 'brady': 1, 'liaising': 1, 'conscious': 1, 'light': 1, 'clear': 1, 'headquarters': 1, 'wing': 1, 'chief': 2, 'maintain': 1, 'harcourt': 1, 'order': 2, 'left': 1}}

I have a Python script that extracts words from text files and counts how many times each word occurs in a file.

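I want to add these to an ".ARFF" file for use in Weka classification. Above is sample output from my Python script. How can I insert them into an ARFF file so that each text file is kept separate? Each file is delimited by {"with their words in here!!"}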

【Comments】:

Tags: python file classification weka arff

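【Solution 1】:

There are details of the ARFF file format here, and it is very simple to generate. For example, using a cut-down version of your Python dictionary, the following script: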

import re

d = { 'gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html':
      {'dail': 1,
       'focus': 1,
       'actions': 1,
       'trade': 2,
       'protest': 1,
       'identify': 1 }}

# Write one ARFF file per source .html file.
for original_filename in d:
    # Strip the .html extension; skip keys that do not have one.
    m = re.search(r'^(.*)\.html$', original_filename)
    if not m:
        print("Ignoring the file:", original_filename)
        continue
    output_filename = m.group(1) + '.arff'
    with open(output_filename, "w") as fp:
        # ARFF header: each data row is a (word, count) pair.
        fp.write('''@RELATION wordcounts

@ATTRIBUTE word string
@ATTRIBUTE count numeric

@DATA
''')
        for word_and_count in d[original_filename].items():
            fp.write("%s,%d\n" % word_and_count)

generates output of the form:

@RELATION wordcounts

@ATTRIBUTE word string
@ATTRIBUTE count numeric

@DATA
dail,1
focus,1
actions,1
trade,2
protest,1
identify,1

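... in a file called gardai-plan-crackdown-on-troublemakers-at-protest-2438316.arff. If that's not exactly what you want, I'm sure you can easily alter it. (For example, if the "words" might contain spaces or other punctuation, you would probably want to quote them.)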
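One possible adaptation of the script above, sketched under a few assumptions: if Weka should see each text file as one instance, all documents can go into a single ARFF file with one numeric attribute per word and one row per file. The helper names quote_arff and write_arff, the document_filename attribute, and the quoting rule (wrap in single quotes, backslash-escape single quotes and backslashes) are illustrative choices, not taken from the question or from any library. Words are assumed not to contain newlines.

```python
import io

# Characters that, in this sketch, force a name or value to be quoted.
_SPECIAL = set(" ,{}%'\"\\\t")

def quote_arff(text):
    """Return text unchanged if it is safe, otherwise single-quoted and escaped."""
    if text and not any(ch in _SPECIAL for ch in text):
        return text
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def write_arff(counts_by_file, fp, relation="wordcounts"):
    """Write one dense ARFF row per document.

    counts_by_file maps a filename to a {word: count} dictionary,
    the same shape as the dictionary d in the script above.
    """
    # The vocabulary is the sorted union of the words of all documents.
    vocabulary = sorted({w for counts in counts_by_file.values() for w in counts})

    fp.write("@RELATION %s\n\n" % quote_arff(relation))
    # Hypothetical attribute name; it must not clash with a word in the vocabulary.
    fp.write("@ATTRIBUTE document_filename string\n")
    for word in vocabulary:
        fp.write("@ATTRIBUTE %s numeric\n" % quote_arff(word))
    fp.write("\n@DATA\n")

    for filename in sorted(counts_by_file):
        counts = counts_by_file[filename]
        row = [quote_arff(filename)]
        # Words that do not occur in this document get a count of 0.
        row += [str(counts.get(word, 0)) for word in vocabulary]
        fp.write(",".join(row) + "\n")
```

Calling write_arff(d, fp) with the dictionary from the script and a file opened for writing would give one row for that file. For classification you would still need to add a class attribute yourself, since the word counts alone carry no label.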

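【Comments】:

Okay, do you have any Java source code example for creating an .arff file and inserting data into it?

【Solution 2】:

I know it's pretty easy to generate ARFF files yourself, but I still wanted to make it simpler, so I wrote a Python package: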

https://github.com/ubershmekel/arff

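It's also on PyPI, so: easy_install arff

【Comments】:

Thanks for your work! I've been trying to use it, but it doesn't seem to handle strings that contain commas. Dumping works, but loading back does not.
I'd appreciate it if you could email me at my username at gmail.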

【Solution 3】:

This project seems to be a bit more up to date. You can install it via pip:

$ pip install liac-arff

or easy_install:

$ easy_install liac-arff

【Comments】: