How to simultaneously read audio samples while recording in Python for real-time speech-to-text conversion?

【Question title】: How to simultaneously read audio samples while recording in python for real-time speech to text conversion?
【Posted】: 2020-03-31 13:22:52
【Question description】:

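Basically, I have trained a few models using Keras to do isolated word recognition. Currently, I can record audio for a fixed duration using sounddevice's rec function and save it as a WAV file. I have implemented silence detection to trim unwanted samples. But all of this only works after the whole recording is complete. I would like to get the trimmed audio segments immediately, while the recording is still in progress, so that I can do speech recognition in real time. I'm using python2 and tensorflow 1.14.0. Below is a snippet of what I currently have: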

import sounddevice as sd
import matplotlib.pyplot as plt
import time
#import tensorflow.keras.backend as K
import numpy as np 
from scipy.io.wavfile import write
from scipy.io.wavfile import read
from scipy.io import wavfile
from pydub import AudioSegment
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()
from contextlib import closing
import multiprocessing 

models=['model1.h5','model2.h5','model3.h5','model4.h5','model5.h5']
loaded_models=[]

for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    model,t=model_ip
    ret_val=model.predict(t).tolist()[0]
    return ret_val 

print("recording in 5sec")
time.sleep(5)
fs = 44100  # Sample rate
seconds = 10  # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs, myrecording = read(wav_file)  # read() returns (sample rate, data); no need to read the file twice
#Now the silence removal function is called which trims and saves only the useful audio samples in the form of a wav file. This trimmed audio contains the full word which can be recognized. 
end_points(wav_file,thresh,50)

final_ans = ''  # accumulates the predicted label of each trimmed word
# The loop below pairs each loaded model (several models are used) with the input in a tuple
for trimmed_aud in trimmed_audio:
    ...
    ... # The trimmed audio is processed further; t is the resulting input that the model can predict on
    ...
    modelon=[]
    for md in loaded_models:
        modelon.append((md,t))
    start_time=time.time()
    with closing(multiprocessing.Pool()) as p:
        predops=p.map(prediction,modelon)
    print('Total time taken: {}'.format(time.time() - start_time))          
    actops=[]
    for predop in predops:
        actops.append(predop.index(max(predop)))
    print(actops)
    max_freqq = max(set(actops), key = actops.count) 
    final_ans+=str(max_freqq)
print("Output: {}".format(final_ans))

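Note that the code above only contains the parts relevant to the question and will not run. I wanted to give an overview of what I have so far, and I would really appreciate your input on how to record and trim the audio simultaneously based on a threshold, so that if multiple words are spoken within the 10-second recording duration (the seconds variable in the code), then as I speak, whenever the energy of a 50 ms window of samples falls below a certain threshold, the audio is cut at those two points, trimmed, and used for prediction. Recording and prediction on the trimmed segments must happen simultaneously, so that each output word can be displayed immediately after it is uttered during the 10 seconds of recording. Any suggestions on how to approach this are much appreciated.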
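
The windowed energy check described above can also be applied incrementally, as samples arrive, instead of on a finished WAV file. The following is a minimal sketch under stated assumptions, not the asker's `end_points` function: the class name `EnergySegmenter`, the `max_gap_frames` parameter, and the use of mean squared amplitude as the "energy" are hypothetical choices, and the threshold would need tuning to match the original `thresh`. It uses plain Python sequences so that it runs without dependencies.

```python
class EnergySegmenter(object):
    """Cut a stream of samples into segments whose frame energy exceeds a threshold.

    Hypothetical helper for illustration; "energy" here is the mean squared
    amplitude of each frame, which is an assumption about the original code.
    """

    def __init__(self, fs, thresh=0.025, frame_ms=50, max_gap_frames=4):
        self.frame_len = int(fs * frame_ms / 1000)  # samples per analysis frame
        self.thresh = thresh
        self.max_gap_frames = max_gap_frames  # quiet frames tolerated inside a word
        self.pending = []  # samples that do not yet fill a whole frame
        self.segment = []  # samples of the word currently being collected
        self.tail = []     # quiet frames seen since the last loud frame
        self.gap = 0       # number of frames currently held in self.tail

    def feed(self, samples):
        """Add new samples; return the list of segments completed by this call."""
        done = []
        self.pending.extend(float(s) for s in samples)
        while len(self.pending) >= self.frame_len:
            frame = self.pending[:self.frame_len]
            del self.pending[:self.frame_len]
            energy = sum(s * s for s in frame) / len(frame)
            if energy >= self.thresh:
                # A short pause followed by more speech belongs to the same word.
                self.segment.extend(self.tail)
                self.segment.extend(frame)
                self.tail, self.gap = [], 0
            elif self.segment:
                self.tail.extend(frame)
                self.gap += 1
                if self.gap > self.max_gap_frames:
                    # The pause is long enough: emit the word without trailing silence.
                    done.append(self.segment)
                    self.segment, self.tail, self.gap = [], [], 0
        return done
```

Each call to `feed` returns only the segments that ended during that call, so prediction on one word can start while recording continues. With sounddevice input, a 1-D slice such as `indata[:, 0]` could be passed to `feed`.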

【Question discussion】:

Tags: python tensorflow multiprocessing speech-recognition real-time

【Solution 1】:

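It is hard to say what your model architecture is, but there are models designed specifically for streaming recognition, such as Facebook's streaming convnets. You will not be able to implement them easily in Keras, though.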

【Discussion】:

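Thanks. I'm using a CNN, but not a streaming one. I basically want to know how to read data from the recording, which could be a separate process running in parallel with prediction, and check an energy threshold to trim segments of audio samples and process them while the recording is ongoing. For example, what module should I use? Is there already a module that does this? I want to use my own models, but I need the audio samples offloaded as numpy arrays while recording is still in progress. I don't mind even if it is a bit slow...
Do you want threads and data queues? You can have them, but they may not work well in Python, which is bad at threading.
Do you mean the processing speed won't be good? In any case, I would appreciate it if you could guide me. Thank you, sir.
Thanks, @Nikolay Shmyrev. I would really appreciate it if you could guide me on using threads with data queues.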
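
The thread-and-queue arrangement mentioned in the comments could look roughly like the sketch below. It assumes sounddevice's callback-based `InputStream` as the audio source; the function names `consume`, `record_to_queue`, and `run`, and the `block_ms` parameter, are made up for illustration. The callback only copies each block into a queue, and a worker thread does the slower work (trimming, prediction) so that the audio thread is not blocked.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2, which the question uses


def consume(chunks, handle_chunk):
    """Pull audio chunks off the queue until the None sentinel arrives."""
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        handle_chunk(chunk)


def record_to_queue(chunks, fs=44100, seconds=10, block_ms=50):
    """Push microphone blocks into the queue while recording (needs sounddevice)."""
    import sounddevice as sd  # imported here so the rest runs without audio hardware

    def callback(indata, frames, time, status):
        # Called on the audio thread for every block; copy because the buffer is reused.
        chunks.put(indata[:, 0].copy())

    with sd.InputStream(samplerate=fs, channels=1,
                        blocksize=int(fs * block_ms / 1000), callback=callback):
        sd.sleep(int(seconds * 1000))  # keep the stream open for the whole recording
    chunks.put(None)  # tell the consumer that recording has ended


def run(handle_chunk, **record_kwargs):
    """Start the consumer thread, record, then wait for the consumer to finish."""
    chunks = queue.Queue()
    worker = threading.Thread(target=consume, args=(chunks, handle_chunk))
    worker.start()
    record_to_queue(chunks, **record_kwargs)
    worker.join()
```

A `handle_chunk` function passed to `run` might feed each block to a silence detector and call `model.predict` on every completed segment, which is one way to get results while the recording is still running. Python threads share the interpreter lock, so if prediction turns out to be too slow in the worker thread, the consumer could hand segments to a `multiprocessing.Pool`, as the question's code already does.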