How to create a numpy array from a pydub AudioSegment? Answers

【Question title】: How to create a numpy array from a pydub AudioSegment?
【Posted】: 2016-10-27 04:28:55
【Question description】:

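I'm aware of the following question: How to create a pydub AudioSegment using an numpy array?

My question is the exact opposite. If I have a pydub AudioSegment, how can I convert it to a numpy array?

I'd like to use scipy filters and the like. It's not clear to me what the internal structure of the AudioSegment raw data is.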

【Question comments】:

Tags: python arrays numpy wave pydub

【Solution 1】:

Pydub has a facility for getting the audio data as an array of samples. It returns an array.array instance (not a numpy array), but you should be able to convert it to a numpy array relatively easily:

from pydub import AudioSegment
sound = AudioSegment.from_file("sound1.wav")

# this is an array
samples = sound.get_array_of_samples()

You could probably also write a numpy variant of that method. Its implementation is very simple:

def get_array_of_samples(self):
    """
    returns the raw_data as an array of samples
    """
    return array.array(self.array_type, self._data)
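
A numpy variant of that method could look roughly like the sketch below. It is not part of pydub: `samples_as_ndarray` and the `WIDTH_TO_DTYPE` table are names made up here. It assumes the segment exposes `raw_data` and `sample_width` the way AudioSegment does, and that the samples are signed integers in native byte order. The demo uses a stand-in object so it runs without an audio file.

```python
from types import SimpleNamespace

import numpy as np

# Assumed mapping from bytes-per-sample to a signed integer dtype.
WIDTH_TO_DTYPE = {1: np.int8, 2: np.int16, 4: np.int32}


def samples_as_ndarray(segment):
    """Return the segment's samples as a 1-D numpy array (channels interleaved)."""
    dtype = WIDTH_TO_DTYPE[segment.sample_width]
    # np.frombuffer gives a read-only view of the bytes; copy() makes it writable.
    return np.frombuffer(segment.raw_data, dtype=dtype).copy()


# Stand-in for an AudioSegment (hypothetical data): four 16-bit samples.
fake_segment = SimpleNamespace(
    raw_data=np.array([1, -2, 3, -4], dtype=np.int16).tobytes(),
    sample_width=2,
)
demo_samples = samples_as_ndarray(fake_segment)
print(demo_samples)
```

With a real segment the call would simply be `samples_as_ndarray(sound)`.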

It is also possible to create a new audio segment from a (modified?) array of samples:

new_sound = sound._spawn(samples)

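The code above is a little hacky. `_spawn` was written for internal use within the AudioSegment class, but it mainly just figures out what type of audio data you are passing in (array of samples, list of samples, bytes, bytestring, etc.). It is safe to use despite the underscore prefix.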
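
After processing in numpy, the samples have to become bytes (or an array.array) again before `_spawn` can take them. A minimal sketch, assuming signed integer samples; `ndarray_to_raw_bytes` is a hypothetical helper, not a pydub function:

```python
import numpy as np

# Assumed mapping from bytes-per-sample to a signed integer dtype.
WIDTH_TO_DTYPE = {1: np.int8, 2: np.int16, 4: np.int32}


def ndarray_to_raw_bytes(samples, sample_width):
    """Round, clip to the integer range for sample_width, and return raw bytes."""
    dtype = WIDTH_TO_DTYPE[sample_width]
    info = np.iinfo(dtype)
    as_float = np.asarray(samples, dtype=np.float64)
    clipped = np.clip(np.rint(as_float), info.min, info.max)
    return clipped.astype(dtype).tobytes()


# Processed float samples (made-up values); the second one is out of range.
processed = np.array([1000.0, -40000.0, 3.6])
raw = ndarray_to_raw_bytes(processed, sample_width=2)
# With a real segment: new_sound = sound._spawn(raw)
```

Clipping before the cast avoids integer wrap-around when a filter pushes values past full scale.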

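【Comments】:

Is there a way to do the reverse as well? I.e., create an AS object on the fly from raw/array data, without touching the file system.
@olix20 I've added the relevant information to my answer.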

【Solution 2】:

You can get an array.array from the AudioSegment and then convert it to a numpy.ndarray:

from pydub import AudioSegment
import numpy as np
song = AudioSegment.from_mp3('song.mp3')
samples = song.get_array_of_samples()
samples = np.array(samples)

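【Comments】:

The array won't be shaped/ordered the way scipy filters need it. After the code block above you will probably want: samples = samples.reshape(song.channels, -1, order='F'); samples.shape # (<probably 2>, <len(song) in samples>). The samples waveform is then ready for filtering, FFT analysis, plotting, etc. (although you may want to convert it to float).
This comment was really helpful; combined with the answer, it solved my problem.
Is the code after the ; required, or is it just a comment?
@ChrisP No, it isn't required; it's only there for explanation.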
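
The reshaping described in the comments can be tried on a toy array. This is only a sketch: the six made-up values stand in for the output of get_array_of_samples() on a stereo segment, where the channels are interleaved.

```python
import numpy as np

channels = 2
# Interleaved stereo samples: L0, R0, L1, R1, L2, R2
interleaved = np.array([10, -10, 20, -20, 30, -30], dtype=np.int16)

# One row per channel, as suggested in the comment.
per_channel = interleaved.reshape(channels, -1, order="F")

# Equivalent: one row per frame, then transpose.
per_channel_alt = interleaved.reshape(-1, channels).T

# Convert to float before filtering to avoid integer overflow and truncation.
as_float = per_channel.astype(np.float32)
print(per_channel)
```

Row 0 then holds the first channel and row 1 the second.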

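【Solution 3】:

None of the existing answers is perfect: they miss the reshaping and the sample width. I wrote this function to convert the audio into the standard audio representation in np: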

from typing import Tuple

import numpy as np
import pydub


def pydub_to_np(audio: pydub.AudioSegment) -> Tuple[np.ndarray, int]:
    """
    Converts a pydub audio segment into np.float32 of shape [duration_in_seconds*sample_rate, channels],
    where each value is in range [-1.0, 1.0].
    Returns tuple (audio_np_array, sample_rate).
    """
    # Interleaved samples -> one row per frame, then divide by the integer full-scale value.
    return np.array(audio.get_array_of_samples(), dtype=np.float32).reshape((-1, audio.channels)) / (
            1 << (8 * audio.sample_width - 1)), audio.frame_rate
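
To make the divisor concrete: for 16-bit audio, `1 << (8 * 2 - 1)` is 32768, so the most negative sample maps to -1.0. The sketch below isolates just that scaling step on made-up sample values; `int_to_unit_float` is a name invented for illustration.

```python
import numpy as np


def int_to_unit_float(samples, sample_width):
    """Scale signed integer samples to float32 by the integer full-scale value."""
    full_scale = 1 << (8 * sample_width - 1)  # 32768 for 16-bit audio
    return np.asarray(samples, dtype=np.float32) / full_scale


pcm16 = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
scaled = int_to_unit_float(pcm16, sample_width=2)
print(scaled)
```

Note that the largest positive 16-bit sample lands just below 1.0, since 32767 / 32768 is slightly less than one.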

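【Comments】:

【Solution 4】:

get_array_of_samples (which I couldn't find on [ReadTheDocs.AudioSegment]: audiosegment module) returns a 1-dimensional array, which doesn't work very well because it loses information about the audio stream (frames, channels, ...).

A few days ago I ran into this problem because I was using [PyPI]: sounddevice (which expects a numpy.ndarray) to play sound (I needed to play it on a different output audio device). Here's what I came up with.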

code00.py:

#!/usr/bin/env python

import sys
from pprint import pprint as pp
import numpy as np
import pydub
import sounddevice as sd


def audio_file_to_np_array(file_name):
    asg = pydub.AudioSegment.from_file(file_name)
    dtype = getattr(np, "int{:d}".format(asg.sample_width * 8))  # Or could create a mapping: {1: np.int8, 2: np.int16, 4: np.int32, 8: np.int64}
    arr = np.ndarray((int(asg.frame_count()), asg.channels), buffer=asg.raw_data, dtype=dtype)
    print("\n", asg.frame_rate, arr.shape, arr.dtype, arr.size, len(asg.raw_data), len(asg.get_array_of_samples()))  # @TODO: Comment this line!!!
    return arr, asg.frame_rate


def main(*argv):
    pp(sd.query_devices())  # @TODO: Comment this line!!!
    a, fr = audio_file_to_np_array("./test00.mp3")
    dvc = 5  # Index of an OUTPUT device (from sd.query_devices() on YOUR machine)
    #sd.default.device = dvc  # Change default OUTPUT device
    sd.play(a, samplerate=fr)
    sd.wait()


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output:

 [cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q038015319]> set PATH=%PATH%;f:\Install\pc064\FFMPEG\FFMPEG\4.3.1\bin

 [cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q038015319]> dir /b
 code00.py
 test00.mp3

 [cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q038015319]> "e:\Work\Dev\VEnvs\py_pc064_03.09.01_test0\Scripts\python.exe" code00.py
 Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)] 064bit on win32

    0 Microsoft Sound Mapper - Input, MME (2 in, 0 out)
 >  1 Microphone (Logitech USB Headse, MME (2 in, 0 out)
    2 Microphone (Realtek Audio), MME (2 in, 0 out)
    3 Microsoft Sound Mapper - Output, MME (0 in, 2 out)
 <  4 Speakers (Logitech USB Headset), MME (0 in, 2 out)
    5 Speakers / Headphones (Realtek , MME (0 in, 2 out)
    6 Primary Sound Capture Driver, Windows DirectSound (2 in, 0 out)
    7 Microphone (Logitech USB Headset), Windows DirectSound (2 in, 0 out)
    8 Microphone (Realtek Audio), Windows DirectSound (2 in, 0 out)
    9 Primary Sound Driver, Windows DirectSound (0 in, 2 out)
   10 Speakers (Logitech USB Headset), Windows DirectSound (0 in, 2 out)
   11 Speakers / Headphones (Realtek Audio), Windows DirectSound (0 in, 2 out)
   12 Realtek ASIO, ASIO (2 in, 2 out)
   13 Speakers (Logitech USB Headset), Windows WASAPI (0 in, 2 out)
   14 Speakers / Headphones (Realtek Audio), Windows WASAPI (0 in, 2 out)
   15 Microphone (Logitech USB Headset), Windows WASAPI (1 in, 0 out)
   16 Microphone (Realtek Audio), Windows WASAPI (2 in, 0 out)
   17 Microphone (Realtek HD Audio Mic input), Windows WDM-KS (2 in, 0 out)
   18 Speakers (Realtek HD Audio output), Windows WDM-KS (0 in, 2 out)
   19 Stereo Mix (Realtek HD Audio Stereo input), Windows WDM-KS (2 in, 0 out)
   20 Microphone (Logitech USB Headset), Windows WDM-KS (1 in, 0 out)
   21 Speakers (Logitech USB Headset), Windows WDM-KS (0 in, 2 out)

  44100 (82191, 2) int16 164382 328764 164382

 --- (Manually inserted line) Sound is playing :) ---

 Done.

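Notes:

As seen, there are no hardcoded values (in terms of dimensions, dtype, ...).
I also needed to return the sample rate (since it can't be embedded in the array), and the device requires it (in this case it was 44.1k, which is the default, but I've also tested with files having half that value).
All existing answers use float to represent the samples. That didn't work for me, because for most of my test files the sample width was 16 bits and np.float16 is not supported (by my FPU), so I had to use int.
As a side note, while testing with various files, SoundDevice was unable to play an .m4a on my Win laptop (most likely because of its 32k sample rate), but PyDub was able to.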
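
The numbers printed in the run above are consistent with each other: 328764 bytes of 16-bit stereo audio give 164382 samples and an array of shape (82191, 2). A small sketch of that arithmetic, using a helper name (`frames_and_shape`) invented here:

```python
def frames_and_shape(n_bytes, sample_width, channels):
    """Return (n_samples, (n_frames, channels)) for a raw PCM buffer."""
    bytes_per_frame = sample_width * channels
    if n_bytes % bytes_per_frame:
        raise ValueError("buffer length is not a whole number of frames")
    n_frames = n_bytes // bytes_per_frame
    return n_frames * channels, (n_frames, channels)


# Values from the run above: 328764 bytes, 2 bytes per sample, 2 channels.
n_samples, shape = frames_and_shape(328764, sample_width=2, channels=2)
print(n_samples, shape)
```

A check like this can catch a wrong dtype or channel count before np.ndarray is pointed at the raw buffer.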

【Comments】: