[Python] Speech Recognition with Faster Whisper

September 25, 2024

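Python can use speech recognition, a technology that converts spoken audio into text.
There are several ways to do speech recognition in Python; one of them is Whisper, provided by OpenAI.
Faster Whisper is a faster reimplementation of OpenAI's Whisper.
This article uses Faster Whisper to perform speech recognition.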

やすひら

An introduction to speech recognition with Faster Whisper.

What you will learn in this article

What speech recognition is
What Whisper and Faster Whisper are
Speech recognition with the Faster Whisper library
How to use the Faster Whisper library

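What is speech recognition?

Speech recognition is a technology in which a computer analyzes human speech and interprets the audio.
It can convert speech into text or control devices through voice commands.
Smartphones and smart speakers use it to let users control the device by voice.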

What are Whisper and Faster Whisper?

Whisper and Faster Whisper are both speech recognition software.

What is Whisper?

Whisper is speech recognition software provided by OpenAI.
The version published on GitHub is open source, so anyone can use it easily.
It is also available through a paid API.

What is Faster Whisper?

Faster Whisper is a reimplementation of OpenAI's Whisper built on CTranslate2.
Like the GitHub version of Whisper, it is open source and anyone can use it.
It can run up to 4 times faster than Whisper.

Installing the libraries

Install the Python packages with pip and ffmpeg with apt.

Command line

pip install torch
pip install faster-whisper
sudo apt install ffmpeg

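PyTorch and ffmpeg are requirements of the original openai-whisper package, which is used for the speed comparison later in this article: it runs on PyTorch and calls the ffmpeg command to decode audio.
faster-whisper itself runs inference through CTranslate2, and according to its README it decodes audio with PyAV, which bundles the FFmpeg libraries, so it should not strictly need either of them.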
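Before moving on, it can help to confirm what the install steps actually produced. The following is a minimal sketch using only the standard library; the function name `check_environment` is made up for this illustration, and the only assumption about the packages is that `pip install faster-whisper` is imported as `faster_whisper`.

```python
import importlib.util
import shutil


def check_environment(modules=("faster_whisper", "torch"), tools=("ffmpeg",)):
    """Report which Python modules and command-line tools can be found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        report[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry only means the module or command was not found from the current Python interpreter and PATH; it does not say whether that component is needed for a given script.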

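Speech recognition with Faster Whisper

This section runs speech recognition with Faster Whisper.

Faster Whisper can run on either a CPU or a GPU.
If you want higher speed, or already have a GPU environment set up, using the GPU is recommended.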
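One way to make the CPU/GPU choice at run time is sketched below. The helper names (`pick_device`, `detect_cuda_device_count`, `load_model`) are hypothetical, and the pairing of `float16` with the GPU and `int8` with the CPU is an assumption chosen for illustration, not a recommendation from this article. `ctranslate2.get_cuda_device_count()` and the `device` / `compute_type` arguments of `WhisperModel` are existing APIs.

```python
def pick_device(cuda_device_count):
    """Return (device, compute_type) for WhisperModel given the number of CUDA GPUs."""
    if cuda_device_count > 0:
        return "cuda", "float16"
    # Assumption: int8 quantization as a CPU-friendly compute type.
    return "cpu", "int8"


def detect_cuda_device_count():
    """Ask CTranslate2 how many CUDA devices it sees; 0 if it is not installed."""
    try:
        import ctranslate2
    except ImportError:
        return 0
    return ctranslate2.get_cuda_device_count()


def load_model(model_size="base"):
    """Load a faster-whisper model on the GPU when one is visible, else on the CPU."""
    from faster_whisper import WhisperModel

    device, compute_type = pick_device(detect_cuda_device_count())
    return WhisperModel(model_size, device=device, compute_type=compute_type)


# The selection logic can be checked without any GPU or model download.
print(pick_device(0))
print(pick_device(1))
```

Calling `load_model()` downloads the model on first use, so it is left out of the demo lines above.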

Speech recognition with Faster Whisper (CPU)

This example runs Faster Whisper speech recognition on the CPU.

Source code

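from faster_whisper import WhisperModel

# Load the model (downloaded automatically on first use)
model = WhisperModel("base", device="cpu")

# Path to the audio file
audio_file = "output.wav"

# Run speech recognition; language="ja" fixes the language to Japanese
segments, info = model.transcribe(audio_file, language="ja")

# Print the results; segments is a generator, so decoding happens during this loop
print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")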

Example run

$ python3 -B python-faster-whisper-speech-recognition.py 
[2024-09-24 14:04:02.429] [ctranslate2] [thread 58988] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
Detected language: ja
[0.00s - 2.00s] テスト

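The script transcribes the audio file with Faster Whisper and prints each recognized segment with its start and end time.
The ctranslate2 warning in the output only reports that the model weights, saved as float16, were converted to float32 because the CPU does not support efficient float16 computation.

The first run takes longer because the model has to be downloaded.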
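Because `segments` is a generator in faster-whisper, it can only be iterated once; to reuse the result, collect it while iterating. The sketch below shows one way to do that. `Segment` is a hypothetical stand-in that has only the three fields read in the example above (`start`, `end`, `text`), so the code runs without a model; `collect_transcript` is likewise a name invented here.

```python
from collections import namedtuple

# Hypothetical stand-in with the three fields the example above reads.
Segment = namedtuple("Segment", ["start", "end", "text"])


def collect_transcript(segments, separator=""):
    """Consume the segment iterable once and return (lines, full_text)."""
    lines = []
    texts = []
    for segment in segments:
        lines.append(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
        texts.append(segment.text.strip())
    # An empty separator suits Japanese; pass " " for space-delimited languages.
    return lines, separator.join(texts)


# A generator, like the one transcribe() returns, built from fake data.
fake_segments = (s for s in [Segment(0.0, 2.0, "テスト")])
lines, full_text = collect_transcript(fake_segments)
print(lines[0])
print(full_text)
```

With a real model, the same function would be called as `collect_transcript(segments)` right after `model.transcribe(...)`.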

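Options for speech recognition with Faster Whisper

This section covers the options available when using Faster Whisper.

Trained models

Faster Whisper offers several trained models of different sizes. In the table, speed is relative to the large model.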

Model	Parameters	Required memory	Relative speed
tiny	39M	1GB	32x
base	74M	1GB	16x
small	244M	2GB	6x
medium	769M	5GB	2x
large	1550M	10GB	1x

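Lighter models with fewer parameters tend to run faster.
Models with more parameters run slower but tend to be more accurate.

Choose a model based on the specifications of the hardware that runs Faster Whisper and on the accuracy and speed you need.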
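As a rough illustration of that choice, the sketch below picks the largest model whose memory figure in the table fits a given budget. The function `pick_model` and the rule "largest model that fits" are assumptions made for this example; the memory values are copied from the table and should be treated as approximate.

```python
# (model name, approximate required memory in GB) copied from the table above,
# ordered from smallest to largest.
MODEL_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_gb):
    """Return the largest model whose listed memory fits, or None if none fit."""
    chosen = None
    for name, required_gb in MODEL_MEMORY_GB:
        if required_gb <= available_gb:
            chosen = name
    return chosen


for budget in (0.5, 1, 4, 16):
    print(budget, "GB ->", pick_model(budget))
```

The returned name can be passed as the first argument of `WhisperModel`. If speed matters more than accuracy, a smaller model than the one returned here may be the better fit.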

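Preparing an audio file

Prepare an audio file to transcribe with Faster Whisper.
You can either record one or generate one with speech synthesis.

Recording audio in Python is covered in a separate article.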
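Whichever way the file is produced, it can be useful to check its basic properties before transcribing it. The sketch below reads a WAV header with the standard `wave` module. `describe_wav` and `write_silence` are hypothetical helpers; the silent file exists only so that the example runs on its own, and it is not useful input for speech recognition.

```python
import os
import tempfile
import wave


def write_silence(path, seconds=1.0, sample_rate_hz=16000):
    """Write a mono 16-bit PCM WAV file containing only silence (placeholder audio)."""
    n_frames = int(seconds * sample_rate_hz)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * n_frames)


def describe_wav(path):
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Demo on a throwaway file so the sketch runs without any recording.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path, seconds=2.0)
    demo_info = describe_wav(demo_path)
print(demo_info)
```

For the file used in this article, the call would be `describe_wav("output.wav")`.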

Processing speed of Faster Whisper

Faster Whisper is reported to run up to 4 times faster than Whisper.
Here are the results measured in the author's environment.

Whisper results

$ time python3 -B python-whisper-speech-recognition.py 
/home/user/.local/lib/python3.8/site-packages/whisper/__init__.py:146: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(fp, map_location=device)
/home/user/.local/lib/python3.8/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
テスト

real    0m8.221s
user    0m6.078s
sys    0m1.697s

Faster Whisper results

$ time python3 -B python-faster-whisper-speech-recognition.py 
[2024-09-24 14:04:02.429] [ctranslate2] [thread 58988] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
Detected language: ja
[0.00s - 2.00s] テスト

real    0m7.069s
user    0m4.951s
sys    0m1.380s

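Both measurements include model loading, so they do not isolate transcription speed. Even so, Faster Whisper finished sooner on this short clip: 7.07 s versus 8.22 s of wall-clock time, roughly 14% less.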
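To see how much of that time is model loading and how much is transcription, the two steps can be timed separately. The following is a minimal sketch, not the method used for the measurements above: `timed` and `benchmark_faster_whisper` are hypothetical helpers, and the file name, model size, and language are taken from the earlier example.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark_faster_whisper(audio_file="output.wav", model_size="base"):
    """Time model loading and transcription separately (needs faster-whisper)."""
    from faster_whisper import WhisperModel

    model, load_s = timed(WhisperModel, model_size, device="cpu")

    def run():
        segments, _info = model.transcribe(audio_file, language="ja")
        # segments is a lazy generator, so force the decoding inside the timer.
        return list(segments)

    _segments, transcribe_s = timed(run)
    return load_s, transcribe_s


# Wall-clock ratio of the two `time` runs above (model loading included).
speedup = 8.221 / 7.069
print(f"Whisper / Faster Whisper wall-clock ratio: {speedup:.2f}x")
```

`benchmark_faster_whisper()` is not called here because it downloads a model and needs `output.wav`; on a clip only a few seconds long, load time is likely to dominate either way.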

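Summary

This article introduced speech recognition with Faster Whisper.

Key points about speech recognition with Faster Whisper:

The setup in this article installs PyTorch and ffmpeg, which the original Whisper requires; faster-whisper itself runs on CTranslate2
Several trained models are available
It runs up to 4 times faster than OpenAI's Whisper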

Speech recognition is essential when building a voice assistant or a speech-to-text feature.
Because the Faster Whisper library can turn audio files into text, you can build applications in which an AI responds to spoken input.
