【Question title】: Python converting srt file into csv file
【Posted】: 2019-11-28 17:12:18
【Question description】:

I have an srt file containing lines like these:

355
00:52:44,533 --> 00:52:51,467
Og så er der selvfølgelig masser af valg både her på <initial> P </initial> et og på nettet og på <initial> DR </initial> et i løbet af dagen og i aften. Godt valg.

356
S1 00:52:54,733 --> 00:53:01,933
Du kan finde alle <initial> P </initial> et programmer på dr punktum dk skråstreg <initial> P </initial> et. Det giver mening.

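355 and 356 are segment numbers. Sometimes a segment has no speaker ID such as "S1", in which case I want to leave that field empty. In 00:52:54,733 --> 00:53:01,933, the first value is the start time and the second is the end time. Don't worry too much about the format in which I convert these numbers.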
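The timing line described above can be handled with a single pattern in which the speaker ID is optional. The following is a minimal sketch, not a definitive implementation: the names `TIMING_RE`, `to_seconds` and `parse_timing` are made up for illustration, and it assumes a speaker ID consists only of letters and digits, as "S1" does.

```python
import re

# Optional speaker ID, then "HH:MM:SS,mmm --> HH:MM:SS,mmm".
# Assumption: speaker IDs contain only letters and digits (e.g. "S1").
TIMING_RE = re.compile(
    r'^(?:(?P<speaker>[A-Za-z0-9]+)\s+)?'
    r'(?P<start>\d{2}:\d{2}:\d{2}[,.]\d{2,3})\s+-->\s+'
    r'(?P<end>\d{2}:\d{2}:\d{2}[,.]\d{2,3})$'
)


def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' (or 'HH:MM:SS.mmm') to seconds as a float."""
    hours, minutes, seconds = stamp.replace(',', '.').split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_timing(line):
    """Return (speaker, start, end), or None if this is not a timing line."""
    match = TIMING_RE.match(line.strip())
    if match is None:
        return None
    speaker = match.group('speaker') or ''  # empty string when no speaker ID
    return speaker, to_seconds(match.group('start')), to_seconds(match.group('end'))


with_speaker = parse_timing('S1 00:52:54,733 --> 00:53:01,933')
without_speaker = parse_timing('00:52:44,533 --> 00:52:51,467')
```

Here `with_speaker[0]` is `'S1'` and `without_speaker[0]` is the empty string, so the speaker column can simply be left blank when the ID is missing.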

I am trying to convert it into a csv file with the following format:

filename;starttime;endtime;speaker;transcripts

The transcript is, for example: Og så er der selvfølgelig masser af valg både her på <initial> P </initial> et og på nettet og på <initial> DR </initial> et i løbet af dagen og i aften. Godt valg.
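One way to get exactly one transcript per segment is to split the file on blank lines first and only then look at each block. This is a rough sketch under the assumption that every segment is a number line, a timing line and at least one text line, separated from the next segment by a blank line. `split_segments` is a hypothetical helper name, and the transcript text in the sample is shortened from the example above.

```python
import re


def split_segments(srt_text):
    """Return a list of (number, timing_line, transcript) tuples."""
    segments = []
    # Segments are assumed to be separated by one or more blank lines.
    for block in re.split(r'\n\s*\n', srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3 or not lines[0].isdigit():
            continue  # skip anything that does not look like a full segment
        # Text that spans several lines is joined into a single transcript.
        segments.append((int(lines[0]), lines[1], ' '.join(lines[2:])))
    return segments


sample = (
    "355\n"
    "00:52:44,533 --> 00:52:51,467\n"
    "Godt valg.\n"
    "\n"
    "356\n"
    "S1 00:52:54,733 --> 00:53:01,933\n"
    "Det giver mening.\n"
)
segments = split_segments(sample)
```

Each tuple then carries everything needed for one csv row once the timing line has been parsed.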

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import re
import csv

SRTFILE = sys.argv[1]
CSVFILE = re.sub(r'\.srt$', '.csv', SRTFILE)
BASEFILE = re.sub(r'\.srt$', '', SRTFILE)

if CSVFILE == SRTFILE:
    sys.exit('check the srt suffix')

with open(SRTFILE, 'r') as fid:
    lines = fid.readlines()

newLine = False
transcript = []
captionStart = False
speaker = ''
t1 = 0
t2 = 0
for line in lines:
    line = line.strip()
    if re.match(r'^[0-9]+$', line):
        newLine = True
        continue
    if re.match(r'^$', line):
        if captionStart and len(transcript) > 0:
            continue
            print('%s;%1.3f;%1.3f;%s;;%s' % (BASEFILE, t1, t2, speaker, ' '.join(transcript)))
        newLine = False
        transcript = []
        continue
    matchobj = re.match(r'^([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
    if matchobj:
        t1 = int(matchobj.group(1))*3600.0 + int(matchobj.group(2))*60.0 + float(re.sub(r',', '.', matchobj.group(3)))
        t2 = int(matchobj.group(4))*3600.0 + int(matchobj.group(5))*60.0 + float(re.sub(r',', '.', matchobj.group(6)))
        captionStart = True
        if speaker == '':
            continue
        continue
    else:
        matchobj = re.match(r'^([a-zA-Z0-9]+) +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
        if matchobj:
            t1 = int(matchobj.group(2))*3600.0 + int(matchobj.group(3))*60.0 + float(re.sub(r',', '.', matchobj.group(4)))
            t2 = int(matchobj.group(5))*3600.0 + int(matchobj.group(6))*60.0 + float(re.sub(r',', '.', matchobj.group(7)))
            speaker = matchobj.group(1)
            captionStart = True
            continue
    if newLine:
        transcript.append(line)
    if speaker:
        print(CSVFILE, t1, t2, speaker, line)
        with open(CSVFILE, 'w') as fid:
                writer = csv.writer(fid, delimiter=';')
                writer.writerow(CSVFILE, t1, t2, speaker, line)
    else:
        print(CSVFILE, t1, t2, line)
        with open(CSVFILE, 'w') as fid:
                writer = csv.writer(fid, delimiter=';')
                writer.writerow(CSVFILE, t1, t2, line)

with open(CSVFILE, 'w') as fid:
    writer = csv.writer(fid, delimiter=';')
    writer.writerow(transcript)

You can see what I am ultimately trying to do:

with open(CSVFILE, 'w') as fid:
    writer = csv.writer(fid, delimiter=';')
    writer.writerow(CSVFILE, t1, t2, speaker, line)

But writerow accepts only one argument. Is there another efficient way to achieve this and convert the srt file into a csv file in the filename;starttime;endtime;speaker;transcripts format?
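For reference, `csv.writer.writerow` takes a single iterable and writes each of its items as one field. A small sketch of that, writing to an in-memory buffer instead of a real file: the file name `example` is a placeholder, and the times and text are taken from segment 356 above.

```python
import csv
import io

# Placeholder file name; times and text come from the sample segment 356.
filename, start, end, speaker = 'example', 3174.733, 3181.933, 'S1'
transcript = 'Det giver mening.'

buffer = io.StringIO()  # in-memory stand-in for the csv file
writer = csv.writer(buffer, delimiter=';')

# One list holding the five fields, instead of five separate arguments.
writer.writerow([filename, start, end, speaker, transcript])

row_text = buffer.getvalue()
print(row_text)  # example;3174.733;3181.933;S1;Det giver mening.
```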

【Question comments】:

    Tags: python python-3.x csv srt


    【Solution 1】:

    Put your five variables into a list, then pass that list as the argument to writerow:

    if speaker:
        new_list = [CSVFILE, t1, t2, speaker, line]
        print(CSVFILE, t1, t2, speaker, line)
        with open(CSVFILE, 'w') as fid:
            writer = csv.writer(fid, delimiter=';')
            writer.writerow(new_list)
    

    【Comments】:

    • It only adds the last segment to the csv file
    • Make the change at every csv.writer() call
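    A likely reason that only the last segment ends up in the file is that opening it with mode 'w' truncates it, so reopening it for every segment discards the rows written before. A minimal sketch of one way around this, assuming the segments have already been collected into a list: open the file once and write all rows inside that single `with` block. `write_segments` and the file name `demo` are hypothetical, and the transcripts are shortened from the question's sample.

```python
import csv
import os
import tempfile


def write_segments(csv_path, segments):
    """Write every (filename, start, end, speaker, transcript) row to one file."""
    # The file is opened once, outside any loop, so earlier rows are kept.
    # newline='' is what the csv module documentation recommends for writers.
    with open(csv_path, 'w', encoding='utf-8', newline='') as fid:
        writer = csv.writer(fid, delimiter=';')
        for segment in segments:
            writer.writerow(segment)


# Hypothetical rows; the speaker field is empty when there is no speaker ID.
segments = [
    ('demo', 3164.533, 3171.467, '', 'Godt valg.'),
    ('demo', 3174.733, 3181.933, 'S1', 'Det giver mening.'),
]

csv_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_segments(csv_path, segments)

with open(csv_path, encoding='utf-8', newline='') as fid:
    content = fid.read()
print(content)
```

    Opening with mode 'a' inside the loop would also keep earlier rows, but it appends to whatever a previous run left in the file, so opening once is usually the simpler choice.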