Your idea of splitting on .I seems like a good start.
The following seems to work:
with open('crantest.txt') as f:
    articles = f.read().split('\n.I')

def process(i, article):
    article = article.replace('\n.T\n','.T=')
    article = '.T=' + article.split('.T=')[1] #strips off the article number, restored below
    article = article.replace('\n.A\n',',.A=')
    article = article.replace('\n.B\n',',.B=')
    article = article.replace('\n.W\n',',.W=')
    return 'article ' + str(i) + ':' + article

data = [process(i+1, article) for i,article in enumerate(articles)]
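I created a test file containing only the first 10 articles (discarding a small header and everything from .I 11 onward). When I run the code above, I get a list of length 10. It is important that the first line begins with .I (with no preceding newline), because I make no effort to test whether the first entry of the split is empty. The first entry in the list is a string that begins: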
article 1:.T=experimental investigation of the aerodynamics of a\nwing in a slipstream .,.A=brenckman,m.,.B=j. ae. scs. 25, 1958, 324.,.W=experimental investigation of the aerodynamics of a\nwing in a slipstream
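If the input cannot be trimmed so that its first line begins with .I, one possible way to relax that requirement is sketched below. It is only an illustration under assumptions: the helper name `split_articles`, the inline sample text, and its header line are hypothetical and not part of the original data. The idea is to prepend a newline before splitting and then keep only the chunks that contain a title marker.

```python
# Inline sample standing in for the file contents; the header line is hypothetical.
raw = (
    "some header line\n"
    ".I 1\n.T\nfirst title\n.A\nauthor one\n.B\nsource one\n.W\nbody one\n"
    ".I 2\n.T\nsecond title\n.A\nauthor two\n.B\nsource two\n.W\nbody two\n"
)

def split_articles(text):
    # Prepend a newline so an article at the very start of the text
    # is split off in the same way as every later article.
    chunks = ('\n' + text).split('\n.I')
    # Keep only chunks that contain a title marker; this drops a
    # leading header or an empty first entry.
    return [chunk for chunk in chunks if '\n.T\n' in chunk]

articles = split_articles(raw)
print(len(articles))
```

Each surviving chunk still starts with the article number followed by the `\n.T\n` marker, so it can be passed to either `process` function in this answer unchanged.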
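On edit: here is a dictionary version that uses partition to successively pull off the relevant chunks. It produces a dictionary of dictionaries rather than a list of strings: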
with open('crantest.txt') as f:
    articles = f.read().split('\n.I')

def process(article):
    article = article.split('\n.T\n')[1]
    T, _, article = article.partition('\n.A\n')
    A, _, article = article.partition('\n.B\n')
    B, _, W = article.partition('\n.W\n')
    return {'T':T, 'A':A, 'B':B, 'W':W}

data = {(i+1):process(article) for i,article in enumerate(articles)}
For example:
>>> data[1]
{'A': 'brenckman,m.', 'T': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .', 'B': 'j. ae. scs. 25, 1958, 324.', 'W': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .\n an experimental study of a wing in a propeller slipstream was\nmade in order to determine the spanwise distribution of the lift\nincrease due to slipstream at different angles of attack of the wing\nand at different free stream to slipstream velocity ratios . the\nresults were intended in part as an evaluation basis for different\ntheoretical treatments of this problem .\n the comparative span loading curves, together with\nsupporting evidence, showed that a substantial part of the lift increment\nproduced by the slipstream was due to a /destalling/ or\nboundary-layer-control effect . the integrated remaining lift\nincrement, after subtracting this destalling lift, was found to agree\nwell with a potential flow theory .\n an empirical evaluation of the destalling effects was made for\nthe specific configuration of the experiment .'}
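s.partition() returns a triple consisting of the part of the string s before the first occurrence of the separator, the separator itself, and the part of the string after that separator. The underscore (_) in the code is a Python idiom that emphasizes the intent to discard that part of the return value.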
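A small sketch of how `str.partition` behaves may make this concrete. The sample strings are made up for illustration; only the built-in method itself is used.

```python
record = "title text\n.A\nauthor name\n.B\nsource"

# Separator present: text before it, the separator itself, text after it.
head, sep, rest = record.partition('\n.A\n')
print(repr(head), repr(sep), repr(rest))

# Separator absent: the whole string comes back first,
# followed by two empty strings.
print("no marker here".partition('\n.A\n'))

# Only the first occurrence of the separator is used as the split point.
print("a-b-c".partition('-'))
```

One consequence for the dictionary version: if a record lacked, say, the `.A` marker, `T` would receive all of the remaining text and `A`, `B`, and `W` would come back as empty strings, so no exception would signal the malformed record.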