【Question title】: Python remove tabs in strings and tokenize list
【Posted】: 2014-06-22 11:36:15
【Question】:

I have tried a lot, but I just cannot get this to work.

Input:

condor  t   airline airline
eight   n   0   flightnumber
nine    n   0   flightnumber
five    n   0   flightnumber
hallo   t   0   sentence
turn    t   com turn_heading
left    t   0   direction
heading t   com turn_heading
three   n   0   degree_absolute
two     n   0   degree_absolute
zero    n   0   degree_absolute

Expected output:

<s> <callsign> <airline> condor </airline> <flightnumber> eight nine five </flightnumber> </callsign> hallo <command="turn_heading"> turn <direction> left </direction> heading <degree_absolute> three two zero </degree_absolute> </command> </s>

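Every time I try to read the input, the tabs get in the way of tokenizing the strings, whether I read them in as a list or as a string. This is what I get when I try to strip the tabs: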

['condor\tt\tairline\tairline\n', 'eight\tn\t \tflightnumber\n', 'nine\tn\t \tflightnumber\n', 'five\tn\t \tflightnumber\n', 'hallo\tt\t \tsentence\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'three\tn\t \tdegree_absolute\n', 'two\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n', '\n', 'aeh\tt\t \tsentence\n', 'two\tn\t \tflightnumber\n', 'eight\tn\t \tflightnumber\n', 'november\tt\tflightnumber\tflightnumber\n', 'hallo\tt\t \tsentence\n', 'reduce\tt\tcom\treduce\n', 'two\tn\t \tspeed\n', 'two\tn\t \tspeed\n', 'zero\tn\t \tspeed\n', 'knots\tt\t \treduce\n', '\n', 'condor\tt\tairline\tairline\n', 'eight\tn\t \tflightnumber\n', 'nine\tn\t \tflightnumber\n', 'five\tn\t \tflightnumber\n', 'descend\tt\tcom\tdescend\n', 'three\tn\t \taltitude\n', 'thousand\tn\t \taltitude\n', 'feet\tt\t \tdescend\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'two\tn\t \tdegree_absolute\n', 'six\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n', 'cleared\tt\tcom\tcleared_ils\n', 'ils\tt\t \tcleared_ils\n', 'runway\tt\t \tcleared_ils\n', 'two\tn\t \trunway\n', 'three\tn\t \trunway\n', 'left\tt\t \trunway\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'two\tn\t \tdegree_absolute\n', 'five\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n']

Can anyone help me strip the tabs, tokenize the lines, and convert them to the markup format?

The code I am using to remove the control characters:

import string
with open('input.txt', 'r') as file1:
    lines = str(list(file1))
    print lines.translate(string.maketrans("\n\t\r", "   "))

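【Comments】:

  • See this post on removing specific control characters.
  • @KobiK Nope, still not working. It still gives the same output.
  • Share your code; that is the only way to know where your problem is.
  • @KobiK import string with open('input.txt', 'r') as file1:lines = str(list(file1)) print lines.translate(string.maketrans("\n\t\r", ""))
  • Why don't you use the csv module?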

Tags: python string tokenize markup


【Solution 1】:

If you use the csv module, this is easy:

>>> import csv
>>> f = ["condor\tt\tairline\tairline", 
         "eight\tn\t0\tflightnumber",
         "nine\tn\t0\tflightnumber",
         "turn\tt\tcom\tturn_heading",
         "left\tt\t0\tdirection"] # fake 'file' for testing
>>> list(csv.DictReader(f, delimiter="\t"))
[{'condor': 'eight', 't': 'n', 'airline': 'flightnumber'}, 
 {'condor': 'nine', 't': 'n', 'airline': 'flightnumber'},
 {'condor': 'turn', 't': 't', 'airline': 'turn_heading'}, 
 {'condor': 'left', 't': 't', 'airline': 'direction'}]

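Note that I pass delimiter='\t' to indicate a tab-separated (rather than the default comma-separated) input file, and use DictReader to automatically turn each row into a dictionary {fieldname: value, ...}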
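One thing to watch in the session above: without a fieldnames argument, DictReader treats the first row as the header, so the "condor" row ends up as dictionary keys instead of data. A minimal sketch of one way around that, written for Python 3, is to pass explicit column names. The names word, kind, flag and label are hypothetical; the question does not say what the four columns are called.

```python
import csv

# Stand-in for an open file object; any iterable of lines works.
lines = [
    "condor\tt\tairline\tairline",
    "eight\tn\t0\tflightnumber",
    "nine\tn\t0\tflightnumber",
]

# Hypothetical column names (assumption, not given in the question).
FIELDNAMES = ["word", "kind", "flag", "label"]

# With fieldnames supplied, the first row is kept as data, not as a header.
reader = csv.DictReader(lines, fieldnames=FIELDNAMES, delimiter="\t")
rows = [dict(row) for row in reader]

for row in rows:
    print(row["word"], row["label"])
```

With a real file, the same call would take the file object in place of the list, for example inside with open(filename, newline="") as f.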

You can then process these dictionaries into whatever format you want.
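As a rough illustration of that processing step, the sketch below (Python 3) splits the lines into utterances at blank lines and wraps each run of consecutive words that share the same fourth-column label in a tag. It uses csv.reader rather than DictReader to keep it short. This is only a flat grouping under assumed rules: it does not reproduce the nested <callsign> and <command="..."> elements from the expected output, and it also wraps "sentence" words, because the question does not spell out the rules for those. The helper names split_utterances and tag_utterance are made up for this example.

```python
import csv
from itertools import groupby

# A few lines in the same shape as the list shown in the question.
raw_lines = [
    "condor\tt\tairline\tairline\n",
    "eight\tn\t \tflightnumber\n",
    "nine\tn\t \tflightnumber\n",
    "five\tn\t \tflightnumber\n",
    "hallo\tt\t \tsentence\n",
    "\n",
    "aeh\tt\t \tsentence\n",
    "two\tn\t \tflightnumber\n",
]

def split_utterances(lines):
    """Yield lists of non-blank lines, using blank lines as separators."""
    current = []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            yield current
            current = []
    if current:
        yield current

def tag_utterance(lines):
    """Wrap each run of consecutive words sharing a label in <label> tags."""
    rows = csv.reader(lines, delimiter="\t")
    parts = []
    # groupby only merges adjacent rows, which keeps the word order intact.
    for label, group in groupby(rows, key=lambda row: row[3]):
        words = " ".join(row[0] for row in group)
        parts.append("<{0}> {1} </{0}>".format(label, words))
    return "<s> " + " ".join(parts) + " </s>"

tagged = [tag_utterance(utt) for utt in split_utterances(raw_lines)]
for line in tagged:
    print(line)
```

Building the nested structure would need extra rules on top of this, for instance deciding from the third column when a command element opens and closes.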

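【Comments】:

  • This seems to work fine in the shell, but it does not work when I read the input from a file. Do I have to do some kind of data type conversion for this?
  • What do you mean by "does not work"? What happens instead: an error, unexpected results? Did you open the file first (e.g. with open(filename) as f: reader = csv.DictReader(f, ...))? You could try Sniffer to determine the appropriate dialect for the file.
  • Wow. It just tore apart all the characters and separated them with commas. Oops, sorry.. I meant to say that I am not getting the desired output; instead every character gets mapped like this => [{'[': "'"}, {'[': 'c'}, {'[' : 'o'}, {'[': 'n'}, {'[': 'd'}, {'[': 'o'}, {'[': 'r'},...
  • Yes. That gave me some pointers. I will post the final program once it is done. @jonrsharpe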
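A likely explanation for the character-by-character output in the comments, assuming the reader was still being handed the result of str(list(file1)) from the question: str() of a list returns its printed representation, in which each tab appears as the two characters backslash and "t" rather than as a real tab. Iterating over that string then yields one character at a time, so the csv module treats every character as its own line. The sketch below (Python 3, with an in-memory stand-in for input.txt) reproduces both effects.

```python
import csv
import io

# In-memory stand-in for input.txt (assumed contents, two sample rows).
fake_file = io.StringIO(
    "condor\tt\tairline\tairline\neight\tn\t0\tflightnumber\n"
)

# str() of a list gives its repr: brackets, quotes, and escaped "\\t" text.
as_text = str(list(fake_file))
print(as_text)
has_real_tab = "\t" in as_text       # no real tab characters are left
has_escaped_tab = "\\t" in as_text   # only backslash followed by "t"

# A string iterates character by character, so each character becomes a
# "line"; the first one, "[", is taken as the header.
char_rows = list(csv.DictReader(as_text, delimiter="\t"))
print(char_rows[:3])

# Passing the file object (an iterable of lines) gives proper rows.
fake_file.seek(0)
good_rows = list(csv.reader(fake_file, delimiter="\t"))
print(good_rows)
```

The same reasoning would explain why translate appeared to do nothing: there were no real tab or newline characters left in the string for it to replace.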