Make a real-time AI application with FastRTC


FastRTC: The Real-Time Communication Library for Python

Disclaimer: This post has been translated to English using a machine translation model. Please let me know if you find any mistakes.

In recent months, we have seen significant advancements in real-time voice models, with entire companies being founded around both open-source and closed models. Some key milestones include:

  • OpenAI and Google launched their live multimodal APIs for ChatGPT and Gemini. OpenAI even launched a phone number 1-800-ChatGPT!
  • Kyutai launched Moshi, a fully open-source audio-to-audio LLM.
  • Alibaba launched Qwen2-Audio, an open-source LLM that natively understands audio.
  • Fixie.ai launched Ultravox, another open-source LLM that also natively understands audio.
  • ElevenLabs raised 180 million dollars in its Series C.

Despite this explosion in models and funding, it remains difficult to build real-time AI applications that stream audio and video, especially in Python.

  • Machine learning engineers may not have experience with the necessary technologies to build real-time applications, such as WebRTC.
  • Even code assistance tools like Cursor and Copilot struggle to write Python code that supports real-time audio/video applications.

That's why the announcement of FastRTC, the real-time communication library for Python, is exciting. The library is designed to make it easy to build real-time audio and video AI applications entirely in Python!

Main Features of FastRTC

  • 🗣️ Automatic voice detection and built-in turn-taking, so you only have to worry about the logic for responding to the user.
  • 💻 Automatic UI - Built-in Gradio UI enabled for WebRTC for testing (or deployment to production!).
  • 📞 Phone call - Use fastphone() to get a free phone number to call your audio stream (HF token required).
  • ⚡️ Support for WebRTC and Websocket.
  • 💪 Customizable - You can mount the stream in any FastAPI application to serve a custom UI and deploy beyond Gradio.
  • 🧰 Many utilities for text-to-speech, speech-to-text, and pause detection to help you get started.

Installation

To be able to use FastRTC, you first need to install the library:

pip install fastrtc

But if we want to install the pause detection, speech-to-text, and text-to-speech functionalities, we need to install some additional dependencies:

pip install "fastrtc[vad, stt, tts]"

Getting Started

We will start by building the hello world of real-time audio: echoing what the user says. In FastRTC, this is as simple as:

from fastrtc import Stream, ReplyOnPause
import numpy as np

def echo(audio: tuple[int, np.ndarray]) -> tuple[int, np.ndarray]:
    # Yield the received audio straight back once the user pauses
    yield audio

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()

* Running on local URL: http://127.0.0.1:7872
To create a public link, set `share=True` in `launch()`.

When we go to the link that Gradio suggests, we first have to give the browser permission to access the microphone. Then the following interface appears (screenshot: FastRTC hello world interface).

If we click on the tab to the right of the word Record, we can select the microphone we want to use.

When we press the Record button, everything we say will be repeated by the application. That is, it captures the audio, detects when we have stopped speaking, and repeats it.

Let's break it down:

  • ReplyOnPause handles voice detection and turn-taking for you, so you only need to worry about the logic for responding to the user. You pass it the function that will process the input audio; in our case that's echo, which captures the input audio and returns it as a stream using yield. Because it uses yield, echo is a generator, Python's mechanism for creating iterators (if you want to learn more about yield, you can read my post on Python). Any generator that yields audio tuples (represented as (sample_rate, audio_data)) will work, as the first sketch after this list shows.
  • The Stream class builds a Gradio UI for you so you can quickly test your stream. Once you have finished prototyping, you can deploy your Stream as a production-ready FastAPI application in a single line of code, as the second sketch below shows.
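For example, a handler can also yield its reply in several pieces. The following chunked variant is a minimal sketch of my own (the half-second chunk size is an arbitrary choice for illustration, not part of FastRTC's API) showing the (sample_rate, audio_data) convention:

import numpy as np

def chunked_echo(audio: tuple[int, np.ndarray]):
    """Yield the user's audio back in roughly half-second chunks."""
    sample_rate, data = audio
    chunk = sample_rate // 2  # number of samples in half a second
    for start in range(0, data.shape[-1], chunk):
        # Each yielded item is a (sample_rate, np.ndarray) tuple
        yield sample_rate, data[..., start:start + chunk]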
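And here is a sketch of the production path, assuming the Stream.mount() helper described in the FastRTC docs; the module name in the uvicorn command is hypothetical:

from fastapi import FastAPI
from fastrtc import Stream, ReplyOnPause

def echo(audio):
    # Same echo handler as above
    yield audio

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")

app = FastAPI()
stream.mount(app)  # mount the stream's WebRTC/Websocket routes on our app

# Serve it with, e.g.: uvicorn fastrtc_app:app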

Here we can see an example from the creators of FastRTC

Leveling Up: Voice Chat with LLM

The next level is to use an LLM to respond to the user. FastRTC comes with built-in speech-to-text and text-to-speech capabilities, so working with LLMs is really easy. Let's modify our echo function accordingly:

from fastrtc import ReplyOnPause, Stream, get_stt_model, get_tts_model
from gradio_client import Client

# Client for the LLM backend deployed on Hugging Face
client = Client("Maximofn/SmolLM2_localModel")

# Built-in speech-to-text and text-to-speech models
stt_model = get_stt_model()
tts_model = get_tts_model()

def echo(audio):
    # Transcribe the user's audio
    prompt = stt_model.stt(audio)

    # Ask the LLM for a response
    response = client.predict(
        message=prompt,
        system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
        max_tokens=512,
        temperature=0.7,
        top_p=0.95,
        api_name="/chat"
    )

    # Stream the response back as audio
    for audio_chunk in tts_model.stream_tts_sync(response):
        yield audio_chunk

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()

Loaded as API: https://maximofn-smollm2-localmodel.hf.space ✔
* Running on local URL: http://127.0.0.1:7871
To create a public link, set `share=True` in `launch()`.

As the speech-to-text model it uses Moonshine, which supposedly only supports English, but I have tested it in Spanish and it understands it well.

As a language model, we will use the model I deployed in a backend on Hugging Face and wrote about in the post Deploying a Backend with LLM on HuggingFace. It uses the LLM HuggingFaceTB/SmolLM2-1.7B-Instruct, which is a small model since it's running on a backend with CPU, but it works quite well.

As the text-to-speech model it uses Kokoro, which does have options to speak in other languages, but this is not yet implemented in the FastRTC library.

If we really want speech-to-text and text-to-speech models in other languages, we could plug in our own, since the greatest potential of FastRTC lies in the real-time communication layer. I won't go into that in depth now, but the sketch below shows the idea.
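As a rough illustration (this wrapper class and the multilingual transcriber are my own assumptions, not part of FastRTC), any object that exposes the same stt(audio) method as the model returned by get_stt_model() could be dropped into the code above:

import numpy as np

class CustomSTT:
    """Hypothetical drop-in replacement for get_stt_model().

    Our echo handler only calls stt((sample_rate, audio_data)) -> str,
    so any object with that method can take Moonshine's place.
    """

    def __init__(self, transcribe_fn):
        # transcribe_fn: any callable mapping (sample_rate, np.ndarray)
        # to text, e.g. a multilingual Whisper pipeline
        self.transcribe_fn = transcribe_fn

    def stt(self, audio: tuple[int, np.ndarray]) -> str:
        sample_rate, data = audio
        return self.transcribe_fn(sample_rate, data)

# Hypothetical usage, replacing the line stt_model = get_stt_model():
# stt_model = CustomSTT(my_multilingual_transcriber)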

Now, if we run the code we just wrote, we have a real-time voice chatbot.

Phone Call

This time we put the code in a script, because it doesn't always work from a Jupyter notebook.
%%writefile fastrtc_phone_demo.py
from fastrtc import ReplyOnPause, Stream, get_stt_model, get_tts_model
import gradio
from gradio_client import Client
import os
from gradio.networking import setup_tunnel as original_setup_tunnel
import socket

# Monkey patch setup_tunnel so that it accepts the additional parameter
def patched_setup_tunnel(host, port, share_token, share_server_address, share_server_tls_certificate=None):
    return original_setup_tunnel(host, port, share_token, share_server_address, share_server_tls_certificate)

# Replace the original function with our patched version
gradio.networking.setup_tunnel = patched_setup_tunnel

# Get the token from the environment variable
HUGGINGFACE_FASTRTC_PHONE_CALL_TOKEN = os.getenv("HUGGINGFACE_FASTRTC_PHONE_CALL_TOKEN")

# Initialize the LLM client
llm_client = Client("Maximofn/SmolLM2_localModel")

# Initialize the STT and TTS models
stt_model = get_stt_model()
tts_model = get_tts_model()

# Define the echo function
def echo(audio):
    # Convert the audio to text
    prompt = stt_model.stt(audio)

    # Generate the response
    response = llm_client.predict(
        message=prompt,
        system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
        max_tokens=512,
        temperature=0.7,
        top_p=0.95,
        api_name="/chat"
    )

    # Stream the response back as audio
    for audio_chunk in tts_model.stream_tts_sync(response):
        yield audio_chunk

def find_free_port(start_port=8000, max_port=9000):
    """Find the first free port starting from start_port."""
    print(f"Searching for a free port starting from {start_port}...")
    for port in range(start_port, max_port):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            result = sock.connect_ex(('127.0.0.1', port))
            if result != 0:  # If result != 0, the port is free
                print(f"Free port found: {port}")
                return port
    raise RuntimeError(f"No free port found between {start_port} and {max_port}")

free_port = find_free_port()  # Search for a free port

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.fastphone(token=HUGGINGFACE_FASTRTC_PHONE_CALL_TOKEN, port=free_port)

Let's go through the code.

This part:

# Monkey patch setup_tunnel so that it accepts the additional parameter
def patched_setup_tunnel(host, port, share_token, share_server_address, share_server_tls_certificate=None):
    return original_setup_tunnel(host, port, share_token, share_server_address, share_server_tls_certificate)
    
# Replace the original function with our patched version
gradio.networking.setup_tunnel = patched_setup_tunnel

This is necessary because FastRTC is written for an older version of Gradio that does not support the share_server_address parameter in the setup_tunnel method, so we patch it to accept the additional parameter.

Since a Hugging Face token is required, we obtain it from the environment variable HUGGINGFACE_FASTRTC_PHONE_CALL_TOKEN.

# Get the token from the environment variable
HUGGINGFACE_FASTRTC_PHONE_CALL_TOKEN = os.getenv("HUGGINGFACE_FASTRTC_PHONE_CALL_TOKEN")

Next, the LLM client, the speech-to-text model, and the text-to-speech model are created, along with the echo function that will handle the input and output audio.

# Initialize the LLM client
llm_client = Client("Maximofn/SmolLM2_localModel")

# Initialize the STT and TTS models
stt_model = get_stt_model()
tts_model = get_tts_model()

# Define the echo function
def echo(audio):
    # Convert the audio to text
    prompt = stt_model.stt(audio)

    # Generate the response
    response = llm_client.predict(
            message=prompt,
            system_message="You are a friendly Chatbot. Always reply in the language in which the user is writing to you.",
            max_tokens=512,
            temperature=0.7,
            top_p=0.95,
            api_name="/chat"
    )

    # Stream the response back as audio
    for audio_chunk in tts_model.stream_tts_sync(response):
        yield audio_chunk

Since we have used port 8000 before and it may be occupied, we create a function to find a free port, and we use it to pick one.

def find_free_port(start_port=8000, max_port=9000):
    """Find the first free port starting from start_port."""
    print(f"Searching for a free port starting from {start_port}...")
    for port in range(start_port, max_port):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            result = sock.connect_ex(('127.0.0.1', port))
            if result != 0:  # If result != 0, the port is free
                print(f"Free port found: {port}")
                return port
    raise RuntimeError(f"No free port found between {start_port} and {max_port}")

free_port = find_free_port()    # Search for a free port

We create the stream, and this time we use stream.fastphone() to get a free phone number to call our stream, instead of the stream.ui.launch() we used before to create the graphical interface.

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.fastphone(token=HUGGINGFACE_FASTRTC_PHONE_CALL_TOKEN, port=free_port)

If we run it, we will see something like this:

!python fastrtc_phone_demo.py
Loaded as API: https://maximofn-smollm2-localmodel.hf.space ✔
INFO: Warming up STT model.
INFO: STT model warmed up.
INFO: Warming up VAD model.
INFO: VAD model warmed up.
Searching for a free port starting from 8000...
Free port found: 8004
INFO: Started server process [24029]
INFO: Waiting for application startup.
INFO: Visit https://fastrtc.org/userguide/api/ for WebRTC or Websocket API docs.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8004 (Press CTRL+C to quit)
INFO: Your FastPhone is now live! Call +1 877-713-4471 and use code 994514 to connect to your stream.
INFO: You have 30:00 minutes remaining in your quota (Resetting on 2025-04-07)
INFO: Visit https://fastrtc.org/userguide/audio/#telephone-integration for information on making your handler compatible with phone usage.

Notice these lines in the output:

INFO:	  Your FastPhone is now live! Call +1 877-713-4471 and use code 994514 to connect to your stream.
INFO:	  You have 30:00 minutes remaining in your quota (Resetting on 2025-04-07)

If we go to the Telephone Integration section of the FastRTC documentation, we will see that it uses Twilio to make the call. There are options to configure a local number in the United States, Dublin, Frankfurt, Tokyo, Singapore, Sydney, and São Paulo.

I tried making the call from Spain (quite expensive for me) and it works, but slowly: I called, entered the code, and waited for the agent to connect, but since it was taking too long, I hung up.
