python - How to calculate the number of lines separating two occurrences of the same keyword in a text file in Python? Answers

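【Question title】: How to calculate the number of lines separating 2 occurrences of the same keyword in a text file in Python?
【Posted】: 2019-12-16 17:37:20
【Question description】:

I have a Python scraping script that collects information about upcoming concerts. However many concerts show up, the text always follows the same pattern; the only difference is that an extra line with the ticket price sometimes appears when tickets can still be booked, as in this example: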

LIVE 01/01/99 9PM
Iron Maiden
Madison Square Garden 
New York City
LIVE 01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
LIVE 01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
Book a ticket now for $99,99
LIVE 01/02/99 9PM
Diana Ross
City Hall
New York City 
Book a ticket now for $79,99       etc...

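I need to count the number of lines in each text block and check whether it is 4 or 5 lines long. My idea is to find each occurrence of the first word of a block ("LIVE") and then add an if statement to sort the blocks into two categories (4-line blocks and 5-line blocks).

The if statement part is not hard, but I just don't know how to do the first part. Maybe use readlines and, whenever a line contains the keyword "LIVE", record its position (for the sample data that would be lines 1, 5, 9 and 14, where we can clearly see that the first 2 blocks are 4 lines long and the 3rd is 5 lines long), then let the if statement sort them out.

Any help would be greatly appreciated, thanks!

Edited with my code idea, which I hope makes it clearer. I need the code that produces the variables line_number and gap_each_line: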

with open('concerts_list.txt', 'r') as file:          
    reading_file = file.read()
    lines = reading_file.split('\n')
    for "LIVE" in lines:
        line_number = #the part where I'm stuck to tell each line number where the word "LIVE" appears. output desired: [0, 4, 8, 13]
        gap_each_line = #calculate the gap between each number of previous variable line_number. output desired: [4, 4, 5]
    if gap == 4 for gap in gap_each_line:
        dates = [i for i in lines [0::4]]
    elif gap == 5 for gap in gap_each_line:
        dates = [i for i in lines [0::5]]

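【Question comments】:

What is your expected output?
I have actually already assigned a variable to each piece of data (dates, bands, locations, etc.), so when I iterate over the lines I do this: dates = [i for i in lines[0::4]]. So once I have the line numbers, I will be able to set up 2 categories for my if statement: dates = [i for i in lines[0::4]] and dates = [i for i in lines[0::5]]
So your final output will be a table with column headers such as date, band, location, price, etc. Am I right?
I just edited my original post with a code idea to make it clearer ;)

Tags: python string text line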

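【Solution 1】:

You can use read_csv from the pandas module.

I expect your whole problem (finding the dates and so on) can be solved with pandas.

Below is code that finds the number of lines between the lines starting with 'LIVE':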

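import pandas as pd

# The separator '~~~' never occurs in the file, so every line lands in a single column
df = pd.read_csv('/Users/prince/Downloads/test3.csv', sep='~~~', header=None, engine='python')
df.columns = ['Details']
# Running count of 'LIVE' lines: every line of a block gets the same block number
df['si_no'] = df['Details'].str.startswith('LIVE').cumsum()
# Number of lines in each block
gaps = df.groupby('si_no').size().values
print(gaps)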

It prints:

[4 4 5 5]
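
Building on the same cumsum idea, here is a minimal sketch of how the blocks could be turned into a table with one row per concert. The inline `lines` list stands in for the file, and the column names (`date`, `band`, `venue`, `city`, `price`) are my own assumption, not something taken from the question's code.

```python
import pandas as pd

# Sample lines copied from the question; they stand in for the real file.
lines = [
    'LIVE 01/01/99 9PM', 'Iron Maiden', 'Madison Square Garden', 'New York City',
    'LIVE 01/01/99 9.30PM', 'The Doors', 'Staples Center', 'Los Angeles',
    'LIVE 01/02/99 8.45PM', 'Dr Dre & Snoop Dogg', 'Staples Center', 'Los Angeles',
    'Book a ticket now for $99,99',
    'LIVE 01/02/99 9PM', 'Diana Ross', 'City Hall', 'New York City',
    'Book a ticket now for $79,99',
]

df = pd.DataFrame({'Details': lines})
# Block id: increases by one at every line that starts with 'LIVE'.
df['si_no'] = df['Details'].str.startswith('LIVE').cumsum()
# Position of each line inside its own block (0 for the 'LIVE' line).
df['pos'] = df.groupby('si_no').cumcount()

# One row per block, one column per position; 4-line blocks get NaN in the last column.
table = df.pivot(index='si_no', columns='pos', values='Details')
# Hypothetical column names, chosen for illustration only.
table = table.rename(columns={0: 'date', 1: 'band', 2: 'venue', 3: 'city', 4: 'price'})
print(table)
```

With this layout the 4-line and 5-line blocks do not need separate handling: a block without a booking line simply has a missing value in the `price` column.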

【Comments】:

【Solution 2】:

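I know you already got the output you asked for (Prince Francis's answer), but I have the feeling you are trying to solve the problem the hard way.

Take a look at this: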

from collections import defaultdict  # defaultdict lets you create a dictionary that starts every new key with an empty list

concerts = defaultdict(list)
current_dictKey = None  # Starts "unset"
with open('/tmp/concerts_list.txt', 'r') as file:
    reading_file = file.read()
    lines = reading_file.splitlines()  # Unlike split('\n'), this leaves no empty string after a trailing newline
    for line in lines:
        print('I just read the following:', line)
        if line.startswith('LIVE'):
            print('The current line starts with the keyword "LIVE", so it becomes the new dictionary key')
            current_dictKey = line
            continue  # Continue to the next line without doing anything else

        if line.startswith('Book a ticket'):
            print("This line starts with 'Book a ticket'. Let's skip those too.")
            continue  # Skip those lines. I guess you don't want them either.

        concerts[current_dictKey].append(line)  # Just add the line to the current key's list


print()
print('This is the object "concerts" you get as the result:')
print(concerts)
print()


print('You can access a specific value like this:', concerts['LIVE 01/01/99 9PM'])

Once the data is in a dictionary, you can access all of it very easily.
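
For example, here is a sketch of how the dictionary could be read back, using a small inline sample instead of the file. The names `band`, `venue` and `city` are only my guess at what each line means.

```python
from collections import defaultdict

# Two blocks from the question's sample, inline so the snippet runs without a file.
lines = [
    'LIVE 01/01/99 9PM', 'Iron Maiden', 'Madison Square Garden', 'New York City',
    'LIVE 01/02/99 9PM', 'Diana Ross', 'City Hall', 'New York City',
    'Book a ticket now for $79,99',
]

concerts = defaultdict(list)
current_key = None
for line in lines:
    if line.startswith('LIVE'):
        current_key = line
    elif not line.startswith('Book a ticket'):
        concerts[current_key].append(line)

# Every value now holds exactly three entries, so it can be unpacked directly.
for date, (band, venue, city) in concerts.items():
    print(date, '|', band, '|', venue, '|', city)
```

Because the booking lines are skipped, every block ends up with the same shape and the 4-line versus 5-line distinction no longer matters.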

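【Comments】:

I only need the code for those two variables, because the part of my code that accesses the values in my dates, bands, locations, etc. variables already works fine. So I really just need to check the number of lines in each block and then sort the blocks into their respective 4-line or 5-line category.

【Solution 3】:

(I created a new answer because I still think my first answer fits most cases better.)

This will create the output you want: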

live_lines = []
line_counter = 0
distances = []
with open('concerts_list.txt', 'r') as file:
    reading_file = file.read()
    lines = reading_file.split('\n')
    for line in lines:
        if line.startswith('LIVE'):
            live_lines.append(line_counter)

        line_counter += 1

for position in range(len(live_lines)-1):
    new_distance = live_lines[position+1] - live_lines[position]
    distances.append(new_distance)

print('live_lines:', live_lines)
print('distances', distances)

Output:

live_lines: [0, 4, 8, 13]
distances [4, 4, 5]
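
Note that the list of distances has one entry fewer than live_lines: nothing follows the last block, so its length is never measured. One possible way to cover it, and to sort the blocks into the 4-line and 5-line categories, is to add the total number of lines as a final boundary. This is only a sketch: the inline `lines` list replaces the file, and `boundaries`, `blocks_by_length`, `four_line_dates` and `five_line_prices` are names I made up.

```python
# Sample lines copied from the question; they stand in for the real file.
lines = [
    'LIVE 01/01/99 9PM', 'Iron Maiden', 'Madison Square Garden', 'New York City',
    'LIVE 01/01/99 9.30PM', 'The Doors', 'Staples Center', 'Los Angeles',
    'LIVE 01/02/99 8.45PM', 'Dr Dre & Snoop Dogg', 'Staples Center', 'Los Angeles',
    'Book a ticket now for $99,99',
    'LIVE 01/02/99 9PM', 'Diana Ross', 'City Hall', 'New York City',
    'Book a ticket now for $79,99',
]

live_lines = [i for i, line in enumerate(lines) if line.startswith('LIVE')]
# Add the end of the data as a last boundary so the final block is measured too.
boundaries = live_lines + [len(lines)]

blocks_by_length = {}  # hypothetical name: block length -> list of blocks
for start, end in zip(boundaries, boundaries[1:]):
    block = lines[start:end]
    blocks_by_length.setdefault(len(block), []).append(block)

# Each category can now be read with its own layout.
four_line_dates = [block[0] for block in blocks_by_length.get(4, [])]
five_line_prices = [block[4] for block in blocks_by_length.get(5, [])]
print(four_line_dates)
print(five_line_prices)
```

Slicing each block out by its own start and end avoids the fixed steps `lines[0::4]` and `lines[0::5]`, which drift out of alignment as soon as blocks of both lengths are mixed in one file.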

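【Comments】:

Works great, thanks! But this is only the first part of my problem. The idea now is to calculate the gap between each block and put the result into the variable gap_each_line, and finally to sort the 4-line and 5-line blocks into their respective categories so the data (dates, bands, etc.) can be accessed correctly.
I just edited my post and added the distance calculation. It iterates over the live_lines array (except for the last element) and does a simple subtraction.
Thanks a lot again, Alex, it works like a charm! Sorry to push one step further (for the last time, I promise, haha), but in the final part, the if statement I added to sort the blocks and access my data according to each category does not work.

【Solution 4】:

Similar to what you wrote, but more Pythonic.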

with open('concerts_list.txt', 'rt') as file:
    indices = [index for index, line in enumerate(file) if line.startswith("LIVE")]
    block_lengths = [adjacent - current for current, adjacent in zip(indices, indices[1:])]

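If your file is very large, you can use a generator comprehension, itertools.tee and itertools.islice to lazily load only the data you need in memory for the calculation. So, compared with the first example here, you process the data as a stream with iterator objects instead of in-memory lists.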

import itertools

with open('concerts_list.txt', 'rt') as file:
    # generator comprehension
    indices = (index for index, line in enumerate(file) if line.startswith("LIVE"))
    # itertools.tee makes independent copies of an iterator
    indices_1, indices_2 = itertools.tee(indices)
    # here itertools.islice makes a new iterator that skips the first element
    block_lengths = [adjacent - current for current, adjacent in
                     zip(indices_1, itertools.islice(indices_2, 1, None))]
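
The tee/islice combination above is essentially the classic "pairwise" recipe. Here is a small sketch that wraps it in a helper and runs it on an in-memory file (`io.StringIO`), so it can be tried without `concerts_list.txt`. The helper name `pairwise` and the shortened sample are mine; on Python 3.10 and later, `itertools.pairwise` does the same job.

```python
import io
import itertools


def pairwise(iterable):
    """Return an iterator over (x0, x1), (x1, x2), ... without building a list first."""
    first, second = itertools.tee(iterable)
    next(second, None)  # advance the second copy by one element
    return zip(first, second)


# In-memory stand-in for the file, shortened from the question's sample.
sample = io.StringIO(
    "LIVE 01/01/99 9PM\nIron Maiden\nMadison Square Garden\nNew York City\n"
    "LIVE 01/02/99 8.45PM\nDr Dre & Snoop Dogg\nStaples Center\nLos Angeles\n"
    "Book a ticket now for $99,99\n"
    "LIVE 01/02/99 9PM\nDiana Ross\nCity Hall\nNew York City\n"
)

# Lazy stream of the positions of the 'LIVE' lines.
indices = (index for index, line in enumerate(sample) if line.startswith("LIVE"))
block_lengths = [later - earlier for earlier, later in pairwise(indices)]
print(block_lengths)  # [4, 5]
```

As with the versions above, the last block is not measured, because no "LIVE" line follows it.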

【Comments】: