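The key here is to examine the table thoroughly and understand what you want to extract.
First, a string like this is usually easier to parse row by row, so you need to split on the table rows and then parse the columns within each row. We do this mainly because the likes and dislikes span multiple lines.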
1. Get each row
We don't know how wide the table might be, so we use a regular expression to break it up, like this:
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # Drop the header and the tail
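This gives us an array whose entries correspond to our multi-line rows. The slice at the end drops the header and any trailing whitespace we don't want to process. However, we still have the problem of pulling together strings that span multiple lines within a cell.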
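To see what the split produces, here is a minimal sketch on a made-up miniature table (the `sample` string is an assumption for illustration, not the table from the question). Printing each piece with repr() shows the text before the first border, the header, the data rows, and the tail that the [2:-1] slice discards.

```python
import re

# Hypothetical miniature table, used only to illustrate the split.
sample = """
+-----+-----+
| a   | b   |
+-----+-----+
| one | two |
+-----+-----+
| x   | y   |
| z   |     |
+-----+-----+"""

# Split on every border line such as "+-----+-----+", however wide it is.
pieces = re.split(r"\+-*\+-*\+\n?", sample)
for i, piece in enumerate(pieces):
    print(i, repr(piece))

# Index 0 is the text before the first border, index 1 is the header row,
# and the last index is whatever follows the final border.
rows = pieces[2:-1]
print(rows)
```

Each entry of `rows` still contains embedded newlines when a cell wraps, which is what the next steps deal with.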
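2. Find the likes and dislikes
If we iterate over this array of rows, we know that each row holds one like and one dislike, each spanning an unknown number of lines. We initialize the like and the dislike as lists so that the final join is faster.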
for p in pairs:
    like, dislike = [], []
3. Process each line
For each row, we need to split it on newlines and then split each line on the pipe character (|).
    for l in p.split('\n'):
        pair = l.split('|')
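4. Pull out each like and dislike
If the pair we get has more than one value, there must be a like or a dislike for us to capture. So we append it to our like and dislike lists, not to likes or dislikes, since those hold our final formatted strings. We should also call strip() on these to remove any leading or trailing whitespace.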
        if len(pair) > 1:
            # Not a blank line
            like.append(pair[1].strip())
            dislike.append(pair[2].strip())
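As a quick illustration of why indices 1 and 2 are used: splitting a table line on | yields an empty string before the first pipe and another after the last one, so the two cells sit at positions 1 and 2. The line below is taken from the table; the variable names are only for this sketch.

```python
line = "| Healthy debates and collaboration  | Ego-driven rhetoric, drama and FUD|"
cells = line.split('|')
print(cells)
# cells[0] and cells[-1] are the empty strings outside the outer pipes.

like_part = cells[1].strip()
dislike_part = cells[2].strip()
print(like_part)
print(dislike_part)

# A blank line (such as the one left after the last newline of a row) has no
# pipe, so split('|') returns a single-element list and the length check skips it.
blank_cells = "".split('|')
print(len(blank_cells))
```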
5. Create the final text
Once the row has been processed, we can join the strings with a space and finally add them to our likes and dislikes lists.
    if len(like) > 0:
        likes.append(" ".join(like))
    if len(dislike) > 0:
        dislikes.append(" ".join(dislike))
6. Use our new data structure
Now we can use these two new lists in any way we choose, either printing each list separately...
from pprint import pprint
print("Likes:")
pprint(likes, indent=4)
print("Dislikes:")
pprint(dislikes, indent=4)
... or zip() them together to create a list of paired likes and dislikes!
print("A set of paired likes and dislikes")
pprint(list(zip(likes, dislikes)), indent=4)
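On Python 3, zip() returns a lazy iterator rather than a list, so it needs to be wrapped in list() before printing, or looped over directly. A minimal sketch with two stand-in lists (the values are the first two entries this approach produces for the table; the name `paired` is only for illustration):

```python
# Stand-in data: the first two entries produced for the table.
likes = ["Meritocracy", "Healthy debates and collaboration "]
dislikes = ["Favoritism, ass-kissing, politics",
            "Ego-driven rhetoric, drama and FUD to get one's way"]

# list() materializes the pairs so they can be printed or indexed.
paired = list(zip(likes, dislikes))

# Looping over the pairs unpacks each like with its matching dislike.
for like, dislike in paired:
    print("%s  <->  %s" % (like.strip(), dislike))
```

Note that zip() stops at the shorter list, so the pairing only lines up if every row contributed both a like and a dislike.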
Full code:
import re
from pprint import pprint

likes_and_dislikes = """
+------------------------------------+-----------------------------------+
| likes                              | dislikes                          |
+------------------------------------+-----------------------------------+
| Meritocracy                        | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration  | Ego-driven rhetoric, drama and FUD|
|                                    | to get one's way                  |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure      |
| capable of attracting top-tier     | managers compensating for a weak, |
| talent                             | immature team                     |
+------------------------------------+-----------------------------------+ """

likes, dislikes = [], []
# Split on the border lines, then drop the header and the tail
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]
for p in pairs:
    like, dislike = [], []
    for l in p.split('\n'):
        pair = l.split('|')
        if len(pair) > 1:
            # Not a blank line
            like.append(pair[1].strip())
            dislike.append(pair[2].strip())
    if len(like) > 0:
        likes.append(" ".join(like))
    if len(dislike) > 0:
        dislikes.append(" ".join(dislike))

print("Likes:")
pprint(likes, indent=4)
print("Dislikes:")
pprint(dislikes, indent=4)
print("A set of paired likes and dislikes")
pprint(list(zip(likes, dislikes)), indent=4)
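This results in the following (shown as printed by Python 2; Python 3's pprint may wrap the longest strings differently):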
Likes:
[   'Meritocracy',
    'Healthy debates and collaboration ',
    'Autonomy given by confident leaders capable of attracting top-tier talent']
Dislikes:
[   'Favoritism, ass-kissing, politics',
    "Ego-driven rhetoric, drama and FUD to get one's way",
    'Micro-management by insecure managers compensating for a weak, immature team']
A set of paired likes and dislikes
[   ('Meritocracy', 'Favoritism, ass-kissing, politics'),
    (   'Healthy debates and collaboration ',
        "Ego-driven rhetoric, drama and FUD to get one's way"),
    (   'Autonomy given by confident leaders capable of attracting top-tier talent',
        'Micro-management by insecure managers compensating for a weak, immature team')]
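Note the trailing space in 'Healthy debates and collaboration ': the second line of that row has an empty likes cell, and joining it in adds a space. If that matters, one option is to skip empty cells before appending. The sketch below wraps the same steps in a function; `parse_likes_table` is a hypothetical name, and the two-column layout is assumed, as in the rest of this answer.

```python
import re

def parse_likes_table(text):
    """Return (likes, dislikes) lists from a two-column ASCII table."""
    likes, dislikes = [], []
    # Border lines separate the logical rows; drop the header and the tail.
    for row in re.split(r"\+-*\+-*\+\n?", text)[2:-1]:
        like, dislike = [], []
        for line in row.split('\n'):
            cells = line.split('|')
            if len(cells) > 2:
                # Keep only non-empty cell fragments to avoid stray spaces.
                if cells[1].strip():
                    like.append(cells[1].strip())
                if cells[2].strip():
                    dislike.append(cells[2].strip())
        if like:
            likes.append(" ".join(like))
        if dislike:
            dislikes.append(" ".join(dislike))
    return likes, dislikes

table = """
+------------------------------------+-----------------------------------+
| likes                              | dislikes                          |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration  | Ego-driven rhetoric, drama and FUD|
|                                    | to get one's way                  |
+------------------------------------+-----------------------------------+ """

likes, dislikes = parse_likes_table(table)
print(likes)
print(dislikes)
```

As with the original loop, a row whose likes cell is entirely empty would contribute only a dislike, so the two lists could drift out of step on such input.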
You can see the complete code in action on codepad.