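【Posted】: 2020-03-31 13:22:52
【Question】:
Basically, I have trained a few models with Keras for isolated word recognition. At the moment I can record audio for a pre-fixed duration with sounddevice's record function and save it as a WAV file. I have implemented silence detection to trim the unwanted samples, but all of this only works after the whole recording has finished. I would like to get the trimmed audio segments immediately, while recording is still in progress, so that I can do speech recognition in real time. I am using Python 2 and TensorFlow 1.14.0. Below is a snippet of what I currently have: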
import sounddevice as sd
import matplotlib.pyplot as plt
import time
#import tensorflow.keras.backend as K
import numpy as np
from scipy.io.wavfile import write
from scipy.io.wavfile import read
from scipy.io import wavfile
from pydub import AudioSegment
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()
from contextlib import closing
import multiprocessing
models=['model1.h5','model2.h5','model3.h5','model4.h5','model5.h5']
loaded_models=[]
for model in models:
    loaded_models.append(tf.keras.models.load_model(model))
def prediction(model_ip):
    # model_ip is a (model, input) tuple; returns the class scores of the first sample
    model,t=model_ip
    ret_val=model.predict(t).tolist()[0]
    return ret_val
print("recording in 5sec")
time.sleep(5)
fs = 44100 # Sample rate
seconds = 10 # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs,myrecording = read(wav_file) # read the file once; returns (sample rate, samples)
#Now the silence removal function is called which trims and saves only the useful audio samples in the form of a wav file. This trimmed audio contains the full word which can be recognized.
end_points(wav_file,thresh,50)
#The for loop below pairs each loaded model (I'm using multiple models) with the input in a tuple
final_ans='' # collects one recognized label per trimmed word
for trimmed_aud in trimmed_audio:
    ...
    ... # The trimmed audio is processed further into the input t that the models can predict on
    ...
    modelon=[]
    for md in loaded_models:
        modelon.append((md,t))
    start_time=time.time()
    with closing(multiprocessing.Pool()) as p:
        predops=p.map(prediction,modelon)
    print('Total time taken: {}'.format(time.time() - start_time))
    actops=[]
    for predop in predops:
        actops.append(predop.index(max(predop)))
    print(actops)
    # majority vote over the class indices predicted by the models
    max_freqq = max(set(actops), key = actops.count)
    final_ans+=str(max_freqq)
print("Output: {}".format(final_ans))
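Note that the code above only contains the parts relevant to the question and will not run as is. I wanted to give an overview of what I have so far, and I would really appreciate your input on how to record and trim the audio at the same time, based on a threshold. The idea is that if several words are spoken within the 10-second recording duration (the seconds variable in the code), then whenever the energy of the samples in a 50 ms window falls below a certain threshold, I cut the audio at those two points, trim it, and use it for prediction. Recording and prediction of the trimmed segments must happen simultaneously, so that each output word can be displayed right after it is uttered, during the 10 seconds of recording. Any suggestions on how to go about this are much appreciated.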
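To make the goal concrete, here is a minimal sketch of the kind of streaming pipeline described above. It is not the `end_points` function from the snippet. `StreamingEndpointer` and `recognize_stream` are hypothetical names introduced only for illustration. The sketch also assumes that `thresh` is an RMS amplitude threshold per 50 ms window and that `gaplimit` is the number of consecutive silent samples that ends a word; the original code may interpret both differently. Audio arrives in small blocks, finished words are pushed onto a queue, and a worker thread recognizes them while more audio is still being fed.

```python
import threading
import numpy as np

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2


class StreamingEndpointer(object):
    """Incrementally splits a mono audio stream into non-silent segments."""

    def __init__(self, fs=44100, window_ms=50, thresh=0.025, gaplimit=9000):
        self.win = fs * window_ms // 1000                # samples per energy window
        self.thresh = thresh                             # assumed: RMS threshold
        self.gap_windows = max(1, gaplimit // self.win)  # silent windows that end a word
        self.pending = np.zeros(0, dtype=np.float32)     # samples not yet filling a window
        self.segment = []                                # windows of the current word
        self.silent_run = 0                              # trailing silent windows in it

    def _emit(self):
        # Drop the trailing silent windows and return the voiced part.
        voiced = self.segment[:len(self.segment) - self.silent_run]
        self.segment, self.silent_run = [], 0
        return np.concatenate(voiced)

    def feed(self, block):
        """Consume one block of samples; return the segments finished by it."""
        finished = []
        block = np.asarray(block, dtype=np.float32).ravel()
        self.pending = np.concatenate([self.pending, block])
        while len(self.pending) >= self.win:
            window = self.pending[:self.win]
            self.pending = self.pending[self.win:]
            rms = np.sqrt(np.mean(window ** 2))
            if rms >= self.thresh:
                self.segment.append(window)
                self.silent_run = 0
            elif self.segment:                 # silence inside an open word
                self.segment.append(window)
                self.silent_run += 1
                if self.silent_run >= self.gap_windows:
                    finished.append(self._emit())
        return finished

    def flush(self):
        """Return the word still open when the stream ends, if any."""
        return [self._emit()] if self.segment else []


def recognize_stream(blocks, recognize, endpointer):
    """Feed blocks to the endpointer while a worker thread recognizes each
    finished segment; returns the results in spoken order."""
    segments = queue.Queue()
    results = []

    def worker():
        while True:
            seg = segments.get()
            if seg is None:                    # sentinel: no more audio
                break
            results.append(recognize(seg))

    thread = threading.Thread(target=worker)
    thread.start()
    for block in blocks:                       # live: blocks come from the sound card
        for seg in endpointer.feed(block):
            segments.put(seg)
    for seg in endpointer.flush():
        segments.put(seg)
    segments.put(None)
    thread.join()
    return results
```

In a live setup, `blocks` could be a generator that reads from a second queue filled by the callback of an `sd.InputStream`, for example by putting `indata.copy()` on it. `recognize` would wrap the feature extraction and the majority vote over the loaded models. A thread is used here instead of `multiprocessing.Pool` mainly to keep the sketch short; whether that is fast enough for five models is something to measure.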
【Discussion】:
Tags: python tensorflow multiprocessing speech-recognition real-time