The integration of LLMs with voice capabilities has created new opportunities in personalized customer interactions.
This guide will walk you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers,
Qwen2-Audio-7B-Instruct, and Bark.
Prerequisites
Before we begin, make sure you have the following installed:
- Python: Version 3.9 or higher.
- PyTorch: For running the models.
- Transformers: Provides access to the Qwen model.
- Accelerate: Required in some environments.
- FFmpeg & pydub: For audio processing.
- FastAPI: To create the web server.
- Uvicorn: ASGI server to run FastAPI.
- Bark: For text-to-speech synthesis.
- python-multipart & SciPy: To parse form/file uploads in FastAPI and to write WAV files.
FFmpeg can be installed via apt install ffmpeg on Linux or brew install ffmpeg on macOS.
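You can install the Python dependencies using pip: pip install torch transformers accelerate pydub fastapi uvicorn python-multipart scipy git+https://github.com/suno-ai/bark.git
Note that Suno's Bark is installed from its GitHub repository; the package named bark on PyPI is an unrelated project.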
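Before downloading several gigabytes of model weights, it can help to confirm that the dependencies are actually importable. The following is a minimal sketch using only the standard library. The helper names `missing_modules` and `ffmpeg_on_path` are hypothetical, and the import names in `REQUIRED` are assumptions about how each package is exposed (python-multipart is left out because its import name differs between versions).

```python
import importlib.util
import shutil

# Import names assumed for the packages listed in the prerequisites.
REQUIRED = ["torch", "transformers", "accelerate", "pydub",
            "fastapi", "uvicorn", "bark", "scipy", "numpy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ffmpeg_on_path():
    """Return True if an ffmpeg executable is discoverable on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(REQUIRED) or "none")
    print("FFmpeg found:", ffmpeg_on_path())
```

An empty "missing" list only shows that the modules can be located, not that their versions are compatible with each other.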
Step 1: Setting Up the Environment
First, let’s set up our Python environment and choose our PyTorch device:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
This code checks if a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly.
If no such GPU is available, PyTorch will run on the CPU instead, which is much slower.
For newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, but the Metal backend does not yet implement every PyTorch operation.
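The three-way choice between CUDA, Metal, and CPU can be sketched as a small helper. The function name `pick_device` is hypothetical, and the preference order is an assumption; whether Qwen2-Audio and Bark run correctly on mps is not verified here.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend (mps), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch
    device = pick_device(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
except ImportError:
    # PyTorch is not installed; fall back to CPU so the sketch still runs.
    device = pick_device(False, False)

print(device)
```

Keeping the decision in a plain function makes it easy to force a particular device while debugging.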
Step 2: Loading the Model
Most open-source LLMs only support text input and text output. Building a voice-in, voice-out system around such a model would require two additional models: (1) one to convert the speech into text before it's fed into our LLM and (2) one to convert the LLM output back into speech.
By using a multimodal LLM like Qwen Audio, we can get away with one model to turn speech input into a text response, and then only need a second model to convert the LLM output back into speech.
This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but also usually yields better results, since the input audio is sent straight to the LLM instead of passing through a separate transcription step.
If you're running on a cloud GPU host like Runpod or Vast, you'll want to point the Hugging Face and Bark cache directories at your volume storage by running
export HF_HOME=/workspace/hf && export XDG_CACHE_HOME=/workspace/bark
before downloading the models.
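If you would rather set these variables from Python than from the shell, a rough equivalent is sketched below. The helper name `configure_cache_dirs` is hypothetical and the `/workspace` base path simply mirrors the example above. The call should come before `transformers` or `bark` are imported, since both read their cache location when they are loaded.

```python
import os


def configure_cache_dirs(base_dir, env=os.environ):
    """Point the Hugging Face and Bark caches at base_dir unless already set."""
    env.setdefault("HF_HOME", os.path.join(base_dir, "hf"))
    env.setdefault("XDG_CACHE_HOME", os.path.join(base_dir, "bark"))
    return env["HF_HOME"], env["XDG_CACHE_HOME"]


# Run this before importing transformers or bark.
print(configure_cache_dirs("/workspace"))
```

Because `setdefault` is used, values already exported in the shell take precedence over the ones set here.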
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
# device_map="auto" already places the weights, so no extra .to(device) call is needed
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")
We chose to use the small 7B variant of the Qwen Audio model series here in order to reduce our computational requirements. However, Qwen may have released stronger and bigger audio models by the time you are reading this article. You can view all the Qwen models on HuggingFace to double check you're using their latest model.
For a production environment, you may want to use a fast inference engine like vLLM for much higher throughput.
Step 3: Loading the Bark model
Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.
from bark import SAMPLE_RATE, generate_audio, preload_models
preload_models()
Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones might be more performant, they come at a much higher cost. The TTS arena keeps an up-to-date comparison.
With both Qwen Audio 7B & Bark loaded into memory, the approximate (V)RAM usage is 24GB, so make sure your hardware supports this. Otherwise, you may use a quantized version of the Qwen model to save on memory.
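To get a feel for how much quantization can save, the weight memory can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic only: the 7e9 parameter count is an assumption taken from the "7B" in the model name rather than an exact figure, and the estimate ignores activations, the KV cache, the audio encoder's working memory, and Bark itself.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: the "7B" in the model name, not an exact count

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, halving the bits per parameter halves the weight footprint, which is why 8-bit or 4-bit variants are a common fallback on smaller GPUs.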
Step 4: Setting Up the FastAPI Server
We’ll create a FastAPI server with two routes to handle incoming audio or text inputs and return audio responses.
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
app = FastAPI()
@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
This server accepts audio files via POST requests at the /voice endpoint and form-encoded text at the /text endpoint.
Step 5: Processing Audio Input
We’ll use pydub (which relies on FFmpeg for decoding) to process the incoming audio and prepare it for the Qwen model.
from pydub import AudioSegment
from io import BytesIO
import numpy as np
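def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to the target rate, downmix to mono, and force 16-bit samples
    # so that the int16 interpretation below holds for any input format.
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    # Scale signed 16-bit integers to floats in [-1.0, 1.0).
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    # pydub decodes the uploaded bytes (using FFmpeg for non-WAV formats).
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array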
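The division by 32768 in the helper above maps signed 16-bit PCM samples, which span -32768 to 32767, onto floats in the range [-1.0, 1.0). The following dependency-free sketch illustrates the same scaling on a handful of values; `pcm16_to_float` is a hypothetical name, and it uses plain Python lists instead of NumPy purely to stay self-contained.

```python
from array import array

INT16_SCALE = 32768.0  # 2**15, the magnitude of the most negative int16 value


def pcm16_to_float(samples):
    """Scale signed 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return [s / INT16_SCALE for s in samples]


pcm = array("h", [-32768, -16384, 0, 16384, 32767])  # "h" = signed 16-bit
print(pcm16_to_float(pcm))
```

The largest positive sample lands just below 1.0, which is why the range is half-open.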
Step 6: Generating Textual Response with Qwen
With the processed audio, we can generate a textual response using the Qwen model. This will need to handle both text & audio inputs.
The processor will convert our input to the model's chat template (ChatML in Qwen's case).
def generate_response(conversation):
    # Render the conversation into the model's chat template as plain text.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Collect every audio clip referenced in the conversation.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)
    # max_new_tokens limits only the reply; max_length would also count the
    # prompt, which can already be long once audio tokens are included.
    generate_ids = model.generate(**inputs, max_new_tokens=256)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response
Feel free to play around with the generation parameters like the temperature on the model.generate function.
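As one way to experiment with the generation parameters, the sketch below gathers them in a small helper. The function name `build_generation_kwargs` and the default values are illustrative assumptions, not tuned recommendations; `do_sample`, `temperature`, `top_p`, and `max_new_tokens` are standard arguments of `generate` in Transformers.

```python
def build_generation_kwargs(temperature=0.7, top_p=0.9, max_new_tokens=256):
    """Assemble keyword arguments for model.generate (values are illustrative)."""
    if temperature <= 0:
        # Treat a non-positive temperature as a request for greedy decoding.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    return {
        "do_sample": True,           # sample instead of always taking the top token
        "temperature": temperature,  # lower = more deterministic
        "top_p": top_p,              # nucleus sampling cutoff
        "max_new_tokens": max_new_tokens,
    }


print(build_generation_kwargs(temperature=0.5))
```

These would be passed as `model.generate(**inputs, **build_generation_kwargs(temperature=0.5))` in place of the fixed length argument. Note that `max_new_tokens` caps only the generated reply, whereas `max_length` also counts the prompt tokens.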
Step 7: Converting Text to Speech with Bark
Finally, we’ll convert the generated text response back to speech.
from scipy.io.wavfile import write as write_wav
def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer
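Bark is generally described as producing only short clips per call (on the order of ten-odd seconds), so longer replies may come out truncated. One possible workaround, sketched here under that assumption, is to split the text into sentence-sized chunks and synthesize each separately. The helper name `split_into_chunks` and the 200-character limit are hypothetical choices, not values from Bark's documentation.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("Hello there. How can I help you today? Let me check.", max_chars=40))
```

Each chunk could then be passed to `generate_audio`, with the resulting arrays joined (for example with `np.concatenate`) before writing the WAV file.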
Step 8: Integrating Everything in the APIs
Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.
@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")
You may choose to also add a system message to the conversations to gain more control over the assistant responses.
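A system message can be prepended with a small helper like the one sketched below. The name `build_conversation` and the example prompt are hypothetical; the message layout follows the role/content structure used by the endpoints above, with the system content given as a plain string.

```python
def build_conversation(user_content, system_prompt=None):
    """Return a chat conversation, optionally starting with a system message."""
    conversation = []
    if system_prompt:
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append({"role": "user", "content": user_content})
    return conversation


conversation = build_conversation(
    [{"type": "text", "text": "Hey"}],
    system_prompt="You are a concise, friendly voice assistant.",  # example prompt
)
print(conversation)
```

Short, speech-friendly answers tend to work well here, since every extra sentence also has to be synthesized by the text-to-speech model.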
Step 9: Testing things out
We can use curl to ping our server as follows:
# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"
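If you don't have a recording at hand, a placeholder input.wav can be generated with the standard library. This is a minimal sketch: `write_test_tone` is a hypothetical helper, and a pure sine tone only exercises the upload and decoding path, so don't expect a meaningful spoken reply to it.

```python
import math
import struct
import wave


def write_test_tone(path, seconds=1.0, freq_hz=440.0, rate=16000, amplitude=0.3):
    """Write a mono 16-bit PCM sine wave to path and return the frame count."""
    n_frames = int(seconds * rate)
    frames = bytearray()
    for i in range(n_frames):
        value = amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
        frames += struct.pack("<h", int(value * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return n_frames


if __name__ == "__main__":
    write_test_tone("input.wav")
```

For a realistic test, replace the generated file with a short recording of an actual spoken question.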
Conclusion
By following these steps, you’ve set up a simple local server capable of two-way voice interactions using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.
Applications
If you’re exploring ways to monetize AI-powered language models, consider these potential applications:
- Chatbots (e.g. Character AI, NSFW AI Chat)
- Phone Agents (e.g. Synthflow, Bland)
- Customer Support Automation (e.g. Zendesk, Forethought)
- Legal Assistants (e.g. Harvey AI, Leya AI)
Full code
import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
# device_map="auto" already places the weights, so no extra .to(device) call is needed
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")
preload_models()
app = FastAPI()
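def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to the target rate, downmix to mono, and force 16-bit samples
    # so that the int16 interpretation below holds for any input format.
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    # Scale signed 16-bit integers to floats in [-1.0, 1.0).
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    # pydub decodes the uploaded bytes (using FFmpeg for non-WAV formats).
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
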
def generate_response(conversation):
    # Render the conversation into the model's chat template as plain text.
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Collect every audio clip referenced in the conversation.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)
    # max_new_tokens limits only the reply; max_length would also count the
    # prompt, which can already be long once audio tokens are included.
    generate_ids = model.generate(**inputs, max_new_tokens=256)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)