【Posted】:2019-11-28 17:12:18
【Question】:
I have an srt file containing lines like these:
355
00:52:44,533 --> 00:52:51,467
Og så er der selvfølgelig masser af valg både her på <initial> P </initial> et og på nettet og på <initial> DR </initial> et i løbet af dagen og i aften. Godt valg.
356
S1 00:52:54,733 --> 00:53:01,933
Du kan finde alle <initial> P </initial> et programmer på dr punktum dk skråstreg <initial> P </initial> et. Det giver mening.
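355 and 356 are segment numbers. A segment sometimes has no speaker ID such as "S1", and in that case I want to leave the speaker field empty. In 00:52:54,733 --> 00:53:01,933, the first value is the start time and the second is the end time. I convert both to numbers myself, so don't worry too much about their format.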
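The timing line described above, with its optional speaker ID, could be parsed roughly as follows. This is only a sketch: the names `TIMING_RE`, `to_seconds` and `parse_timing` are hypothetical and not part of the script in this question, and it assumes a speaker ID consists only of letters and digits.

```python
import re

# Optional speaker ID (e.g. "S1"), then "HH:MM:SS,mmm --> HH:MM:SS,mmm".
# Assumption: speaker IDs contain only ASCII letters and digits.
TIMING_RE = re.compile(
    r'^(?:(?P<speaker>[A-Za-z0-9]+)\s+)?'
    r'(?P<start>\d{2}:\d{2}:\d{2}[,.]\d{2,3})\s+-->\s+'
    r'(?P<end>\d{2}:\d{2}:\d{2}[,.]\d{2,3})$'
)


def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' (comma or dot separator) to seconds as a float."""
    hours, minutes, seconds = stamp.replace(',', '.').split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_timing(line):
    """Return (speaker, start, end), or None if this is not a timing line.

    The speaker is an empty string when the line carries no speaker ID.
    """
    match = TIMING_RE.match(line.strip())
    if match is None:
        return None
    return (
        match.group('speaker') or '',
        to_seconds(match.group('start')),
        to_seconds(match.group('end')),
    )
```

Making the speaker group optional lets a single pattern handle both kinds of timing line, so the speaker comes back as `''` whenever the ID is missing.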
I am trying to convert it to a csv file with the following format:
filename;starttime;endtime;speaker;transcripts
The transcript is, for example: Og så er der selvfølgelig masser af valg både her på <initial> P </initial> et og på nettet og på <initial> DR </initial> et i løbet af dagen og i aften. Godt valg.
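Before a row in that format can be built, the lines belonging to one segment (number, timing line, transcript text) have to be grouped. A minimal sketch of that step is shown here; `iter_segments` is a hypothetical helper, and it assumes that a segment starts at a digits-only line immediately followed by a line containing `-->`, which also works when the blank separator lines are missing.

```python
def iter_segments(lines):
    """Yield (number, timing_line, text) for each segment in an srt file.

    Assumption: a segment starts at a digits-only line that is directly
    followed by a line containing '-->'. Blank lines are ignored.
    """
    stripped = [line.strip() for line in lines]
    stripped = [line for line in stripped if line]  # drop blank lines
    current = None  # [number, timing_line, list_of_text_lines]
    for i, line in enumerate(stripped):
        next_line = stripped[i + 1] if i + 1 < len(stripped) else ''
        if line.isdigit() and '-->' in next_line:
            # A new segment begins: emit the previous one first.
            if current is not None:
                yield current[0], current[1], ' '.join(current[2])
            current = [line, None, []]
        elif current is not None and current[1] is None:
            current[1] = line  # the timing line follows the number
        elif current is not None:
            current[2].append(line)  # transcript text, possibly multi-line
    if current is not None and current[1] is not None:
        yield current[0], current[1], ' '.join(current[2])
```

Multi-line transcripts are joined with a single space, so each segment ends up as one csv field.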
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import re
import csv
SRTFILE = sys.argv[1]
CSVFILE = re.sub(r'\.srt$', '.csv', SRTFILE)
BASEFILE = re.sub(r'\.srt$', '', SRTFILE)
if CSVFILE == SRTFILE:
    sys.exit('check the srt suffix')
with open(SRTFILE, 'r') as fid:
    lines = fid.readlines()
newLine = False
transcript = []
captionStart = False
speaker = ''
t1 = 0
t2 = 0
for line in lines:
    line = line.strip()
    # segment number, e.g. 355
    if re.match(r'^[0-9]+$', line):
        newLine = True
        continue
    # blank line: end of the current segment
    if re.match(r'^$', line):
        if captionStart and len(transcript) > 0:
            continue
            # not reached: the continue above skips it
            print('%s;%1.3f;%1.3f;%s;;%s' % (BASEFILE, t1, t2, speaker, ' '.join(transcript)))
        newLine = False
        transcript = []
        continue
    # timing line without a speaker ID
    matchobj = re.match(r'^([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
    if matchobj:
        t1 = int(matchobj.group(1))*3600.0 + int(matchobj.group(2))*60.0 + float(re.sub(r',', '.', matchobj.group(3)))
        t2 = int(matchobj.group(4))*3600.0 + int(matchobj.group(5))*60.0 + float(re.sub(r',', '.', matchobj.group(6)))
        captionStart = True
        if speaker == '':
            continue
        continue
    else:
        # timing line with a leading speaker ID, e.g. S1
        matchobj = re.match(r'^([a-zA-Z0-9]+) +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3}) +--> +([0-9][0-9]):([0-9][0-9]):([0-9][0-9][,\.][0-9]{2,3})$', line)
        if matchobj:
            t1 = int(matchobj.group(2))*3600.0 + int(matchobj.group(3))*60.0 + float(re.sub(r',', '.', matchobj.group(4)))
            t2 = int(matchobj.group(5))*3600.0 + int(matchobj.group(6))*60.0 + float(re.sub(r',', '.', matchobj.group(7)))
            speaker = matchobj.group(1)
            captionStart = True
            continue
    # anything else is transcript text
    if newLine:
        transcript.append(line)
        if speaker:
            print(CSVFILE, t1, t2, speaker, line)
            with open(CSVFILE, 'w') as fid:
                writer = csv.writer(fid, delimiter=';')
                writer.writerow(CSVFILE, t1, t2, speaker, line)  # fails: too many arguments
        else:
            print(CSVFILE, t1, t2, line)
            with open(CSVFILE, 'w') as fid:
                writer = csv.writer(fid, delimiter=';')
                writer.writerow(CSVFILE, t1, t2, line)  # fails: too many arguments
with open(CSVFILE, 'w') as fid:
    writer = csv.writer(fid, delimiter=';')
    writer.writerow(transcript)
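You can see what I am ultimately trying to do:
with open(CSVFILE, 'w') as fid:
    writer = csv.writer(fid, delimiter=';')
    writer.writerow(CSVFILE, t1, t2, speaker, line)
However, writerow accepts only one argument. Is there another effective way to achieve this and convert the srt to a csv in the format filename;starttime;endtime;speaker;transcripts?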
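Since `writerow` takes a single row, the five values can be passed together as one list. The sketch below illustrates this with an in-memory buffer; `write_rows`, `HEADER` and the sample rows are hypothetical, and it assumes that the format line is also wanted as a header row and that three decimals are enough for the times.

```python
import csv
import io

# Assumption: the format line from the question doubles as a header row.
HEADER = ['filename', 'starttime', 'endtime', 'speaker', 'transcripts']


def write_rows(fid, rows):
    """Write a header plus one ';'-separated line per row to an open file."""
    writer = csv.writer(fid, delimiter=';')
    writer.writerow(HEADER)
    for filename, start, end, speaker, text in rows:
        # writerow takes ONE sequence; each element becomes one field.
        writer.writerow([filename, '%.3f' % start, '%.3f' % end, speaker, text])


# Hypothetical rows; an empty string stands for a missing speaker ID.
rows = [
    ('example', 3164.533, 3171.467, '', 'Godt valg.'),
    ('example', 3174.733, 3181.933, 'S1', 'Det giver mening.'),
]

buffer = io.StringIO()  # stands in for the real csv file
write_rows(buffer, rows)
csv_text = buffer.getvalue()
```

With a real file, opening it once before the loop, for example `open(CSVFILE, 'w', newline='', encoding='utf-8')`, matters: mode `'w'` truncates the file on every open, so reopening it for each segment keeps only the last row.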
【Comments】:
Tags: python python-3.x csv srt