python - how to parse semi structured text (cran.all.1400): answers

【Question title】: python - how to parse semi structured text (cran.all.1400)
【Posted】: 2016-12-21 11:12:54
【Question description】:

I need to work with the cran.all.1400 text file.

It is a collection of article abstracts, with some additional data about each article. It has this form:

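.I 1
.T
experimental investigation of the aerodynamics of a wing in a slipstream .
.A
brenckman,m.
.B
j. ae. scs. 25, 1958, 324.
.W
//a lot of text
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small viscosity .
.A
ting-yili
.B
department of aeronautical engineering, rensselaer polytechnic institute troy, n.y.
.W
//a lot of text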

and so on.

What I need is the data organized like this:

article 1: .T="whatever the title is for article 1", .A="w/e the author is", .B="w/e", .W="all the text"
article 2: .T="whatever the title is", .A="w/e the author is", .B="w/e", .W="all the text"

How would I do this in Python? Thank you for your time.

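【Question discussion】:

What have you tried? Read the question rules here.
It looks like you have keywords on lines of their own, each made up of a dot, a capital letter, and an optional attribute, followed by regular lines. Just process the file line by line, and come back here with a more precise question if you run into trouble.
I tried reading the whole file as a single string (with read) and then splitting the string using .I as the delimiter. That gives me a list of articles (with an empty element at the start, but I can manage that). Now I need to split each article by the other tags/keywords while still knowing which element belongs to which article. I think I need a dictionary or a table/2D array.
If I process the text line by line, I don't know how to put the lines in the right place.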
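
The line-by-line approach suggested in the comments can be sketched by remembering two pieces of state: the article currently being filled in and the section the next text lines belong to. This is only a minimal sketch, not the method of the answer that follows. The names `parse_cran_lines`, `ARTICLE_START` and `SECTION_TAG` are hypothetical, and it assumes every tag sits on a line of its own, exactly as in the sample in the question.

```python
import re

# Hypothetical helper names; the tag layout is assumed from the question's sample.
ARTICLE_START = re.compile(r'^\.I\s+(\d+)\s*$')
SECTION_TAG = re.compile(r'^\.([TABW])\s*$')

def parse_cran_lines(lines):
    """Return {article_number: {'T': str, 'A': str, 'B': str, 'W': str}}."""
    collected = {}
    current = None  # line lists for the article being read
    field = None    # section that the following text lines belong to
    for raw in lines:
        line = raw.rstrip('\n')
        start = ARTICLE_START.match(line)
        tag = SECTION_TAG.match(line)
        if start:
            # ".I <n>" opens a new article
            current = {'T': [], 'A': [], 'B': [], 'W': []}
            collected[int(start.group(1))] = current
            field = None
        elif tag and current is not None:
            # ".T", ".A", ".B" or ".W" switches the active section
            field = tag.group(1)
        elif current is not None and field is not None:
            # an ordinary text line goes into the active section
            current[field].append(line)
        # anything else (e.g. a header before the first .I) is ignored
    return {number: {key: '\n'.join(text) for key, text in parts.items()}
            for number, parts in collected.items()}

sample = """.I 1
.T
title one
.A
author one
.B
bib one
.W
line a
line b
.I 2
.T
title two
.A
.B
.W
text two
"""
result = parse_cran_lines(sample.splitlines())
print(result[1]['W'])
```

Passing an open file object instead of `sample.splitlines()` should work as well, since iterating over a file yields its lines. One limitation of this sketch: a text line that consists of nothing but a tag such as `.T` would be mistaken for a section marker.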

Tags: python parsing text-parsing

【Solution 1】:

Your idea of splitting on .I seems like a good start.

The following seems to work:

with open('crantest.txt') as f:
    # one chunk per article; assumes the file starts with ".I"
    articles = f.read().split('\n.I')

def process(i, article):
    # turn each tag line into an inline "key=" marker
    article = article.replace('\n.T\n','.T=')
    article = '.T=' + article.split('.T=')[1] #strips off the article number, restored below
    article = article.replace('\n.A\n',',.A=')
    article = article.replace('\n.B\n',',.B=')
    article = article.replace('\n.W\n',',.W=')
    return 'article ' + str(i) + ':' + article

# articles are numbered from 1, in file order
data = [process(i+1, article) for i,article in enumerate(articles)]

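I created a test file containing only the first 10 articles (dropping a small header and everything from .I 11 onward). When I run the code above, I get a list of length 10. It is important that the first line begins with .I (with no preceding newline), since I make no effort to test whether the first entry of the split is empty. The first entry in the list is a string which begins: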
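
If trimming the header by hand is not convenient, one possible alternative is to split with a regular expression anchored at the start of a line. This is a sketch under assumptions, not part of the original answer: `split_articles` is a hypothetical name, and the pattern assumes each article starts with `.I <number>` on a line of its own.

```python
import re

def split_articles(text):
    """Map each article number to the raw text that follows its '.I n' line."""
    # With one capturing group, re.split returns
    # [preamble, number, body, number, body, ...]
    parts = re.split(r'^\.I[ \t]+(\d+)[ \t]*\n', text, flags=re.MULTILINE)
    numbers = parts[1::2]  # the captured article numbers
    bodies = parts[2::2]   # the text after each '.I n' line
    # parts[0] (a header, or '' if the file starts with '.I') is dropped
    return {int(n): body for n, body in zip(numbers, bodies)}

sample = "some header line\n.I 1\n.T\nfoo\n.I 2\n.T\nbar\n"
chunks = split_articles(sample)
print(sorted(chunks))
```

Because the article number is captured from the file itself, the keys do not depend on the position of the chunk in the list, and a leading header or an empty first entry needs no special handling.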

article 1:.T=experimental investigation of the aerodynamics of a\nwing in a slipstream .,.A=brenckman,m.,.B=j. ae. scs. 25, 1958, 324.,.W=experimental investigation of the aerodynamics of a\nwing in a slipstream

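On edit: here is a dictionary version which uses partition to successively pull out the relevant chunks. It builds a dictionary of dictionaries rather than a list of strings: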

with open('crantest.txt') as f:
    articles = f.read().split('\n.I')

def process(article):
    article = article.split('\n.T\n')[1]
    T, _, article = article.partition('\n.A\n')
    A, _, article = article.partition('\n.B\n')
    B, _, W = article.partition('\n.W\n')
    return {'T':T, 'A':A, 'B':B, 'W':W}

data = {(i+1):process(article) for i,article in enumerate(articles)}

For example:

>>> data[1]
{'A': 'brenckman,m.', 'T': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .', 'B': 'j. ae. scs. 25, 1958, 324.', 'W': 'experimental investigation of the aerodynamics of a\nwing in a slipstream .\n  an experimental study of a wing in a propeller slipstream was\nmade in order to determine the spanwise distribution of the lift\nincrease due to slipstream at different angles of attack of the wing\nand at different free stream to slipstream velocity ratios .  the\nresults were intended in part as an evaluation basis for different\ntheoretical treatments of this problem .\n  the comparative span loading curves, together with\nsupporting evidence, showed that a substantial part of the lift increment\nproduced by the slipstream was due to a /destalling/ or\nboundary-layer-control effect .  the integrated remaining lift\nincrement, after subtracting this destalling lift, was found to agree\nwell with a potential flow theory .\n  an empirical evaluation of the destalling effects was made for\nthe specific configuration of the experiment .'}

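s.partition() returns a triple consisting of the part of the string s before the first occurrence of the delimiter, the delimiter itself, and the part of the string after that delimiter. The underscore (_) in the code is a Python idiom which emphasizes that the intention is to discard that part of the return value.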
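
A small illustration of that behaviour, using a made-up string rather than real data from the file:

```python
s = 'title text\n.A\nauthor text'

# Delimiter present: (text before, delimiter, text after)
before, sep, after = s.partition('\n.A\n')
print(repr(before), repr(sep), repr(after))

# Delimiter absent: (whole string, '', '')
missing = s.partition('\n.B\n')
print(missing)
```

The second case matters for the dictionary version: if a tag line is missing from an article, the rest of the text stays in the field before the missing tag and the later fields come back as empty strings.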

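【Discussion】:

This is very close to what I need. However, I'm afraid my question wasn't clear enough. I need the strings broken apart as well. I need some kind of data structure from which I can access, for example, the .W section of article 7. A list of lists or something like that. I'm still not sure I'm making myself clear. But "for i,article in enumerate(articles)" is a big help, and I think I can use it to get where I need to go. Thank you!
It sounds like you want a list of dictionaries: one dictionary per article, each with the keys 'T', 'A', 'B' and 'W'.
Yes, that sounds exactly right! :) I'll try to modify your code later today; I need to finish a few other things first. Thanks for your help!
@Car I added a second, dictionary-based approach.
Great! That is exactly what I needed, perfect! Thank you very much!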