Tutorial: Train Your Own Medical Voice AI

General-purpose automatic speech recognition (ASR) models struggle with medical terminology. Words like “Dermatofibrosarkoma” or “Efalizumab” often come out as gibberish in the transcript. Fixing this usually requires a large dataset of recorded medical dictations, which most clinicians do not have. In this tutorial, you will learn how to generate your own synthetic medical dataset and train a small voice model on it so that it recognizes these terms far more reliably. Every step runs locally on your Mac: no cloud dependency, no API costs, and no patient data leaving your machine.

What You Need Before You Start

This tutorial is designed for Apple Mac computers. You will need:

A Mac with an Apple Silicon chip (M1, M2, M3, or M4).
macOS 13.5 or newer.
About 40 GB of free disk space to store the models and generated audio.
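
If you would like to check these requirements from Python rather than by hand, the following is a minimal sketch using only the standard library. The 40 GB threshold comes from the list above; the function name check_prerequisites is our own and is not part of the tutorial repository.

```python
import platform
import shutil

REQUIRED_FREE_GB = 40  # figure taken from the requirements list above


def check_prerequisites(path="."):
    """Return a dict describing whether this machine matches the listed requirements."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "is_macos": platform.system() == "Darwin",
        # Native Python on Apple Silicon reports "arm64".
        "is_apple_silicon": platform.machine() == "arm64",
        # Empty string when not running on macOS.
        "macos_version": platform.mac_ver()[0],
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= REQUIRED_FREE_GB,
    }


if __name__ == "__main__":
    for key, value in check_prerequisites().items():
        print(f"{key}: {value}")
```

One caveat: a Python build running under Rosetta reports "x86_64" even on an Apple Silicon Mac, so treat a negative result for is_apple_silicon as a hint rather than a verdict.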

Step 1: Install the Necessary Tools

We need to install a few tools before we can run the tutorial. If you have never used the “Terminal” before, don’t worry: it’s just a place to type commands.

Open the Terminal app on your Mac (press Cmd + Space, type “Terminal”, and hit Enter).
First, we need Ollama, an app that lets your Mac run AI models locally. Download and install it from ollama.com.
Once Ollama is installed, go back to your Terminal and type this command, then press Enter:
ollama pull qwen3.5:35b
This downloads the AI model that will write our medical sentences. It is a large file (~22 GB), so this might take a while depending on your internet connection.
Next, we need uv, a tool that installs and manages Python and its packages. Paste this command into the Terminal and press Enter:
curl -LsSf https://astral.sh/uv/install.sh | sh
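
Once you have Python available, a quick way to confirm that the tools from this step (plus git, which Step 2 uses) can be found is sketched below. It only looks the commands up on your PATH and does not run them; the helper name find_tools is our own.

```python
import shutil

TOOLS = ("git", "ollama", "uv")


def find_tools(names=TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If a tool you just installed shows up as NOT FOUND, opening a fresh Terminal window is worth trying first, since installers often update the PATH only for new shells.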

Step 2: Download the Tutorial Files

We have prepared all the code for you in a “repository” (a folder of code) on GitHub. You just need to download it to your Mac.

In your Terminal, type this command and press Enter to download the folder:
git clone https://github.com/Isaree-ai/tutorials.git
Now, move into the folder you just downloaded by typing:
cd tutorials/asr-tutorial
Finally, install all the required Python packages by typing:
uv sync

Step 3: Open the Tutorial Notebook

We use something called a “Jupyter Notebook” to run the code. It lets you run small blocks of code one at a time and see the results immediately.

Make sure Ollama is running in the background. Open a new Terminal window and type:
ollama serve
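Leave this window open. If the command reports that the address is already in use, Ollama is already running in the background and you can simply continue.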
Go back to your first Terminal window (which should still be in the tutorials/asr-tutorial folder) and type:
uv run jupyter notebook tutorial.ipynb
A web page will automatically open in your browser showing the tutorial code.
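
If Stage 1 later complains that it cannot reach Ollama, a small check like the one below can help narrow things down. It is a sketch that assumes Ollama is listening on its default address (http://localhost:11434) and uses its /api/tags endpoint, which lists the models available locally. The function name is our own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address
MODEL = "qwen3.5:35b"                  # the model pulled in Step 1


def list_ollama_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable. Is 'ollama serve' running?")
    else:
        print(f"Ollama is running; {MODEL} downloaded: {MODEL in models}")
```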

Step 4: Run the Pipeline

The notebook is divided into six stages. To run a block of code (called a “cell”), click on it and press Shift + Enter. Here is what happens at each stage:

Stage 1: Generate Medical Text

The first cell uses Ollama to write realistic German dermatology sentences (for example, the German equivalent of “The patient presents with an erythematous plaque”). It automatically rejects sentences that contain abbreviations, since abbreviations tend to be read aloud inconsistently and would leave the audio and the text out of step. Click the cell and press Shift + Enter to generate 50 test sentences.
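
To give a feel for what such an abbreviation filter might look like, here is a minimal sketch. The rules below (dotted forms such as “z.B.”, all-caps acronyms, and a short list of common German short forms) are our own illustrative assumptions; the notebook’s actual filter may work differently.

```python
import re

# Hypothetical heuristics for illustration; not the notebook's real rules.
LETTER = "A-Za-zÄÖÜäöüß"
INNER_DOT = re.compile(rf"[{LETTER}]\.[{LETTER}]")  # z.B., d.h., V.a.
ACRONYM = re.compile(r"^[A-ZÄÖÜ]{2,}$")             # BCC, PUVA, MRT
SHORT_FORMS = {"bzw.", "ca.", "ggf.", "evtl.", "inkl.", "li.", "re."}


def contains_abbreviation(sentence):
    """Heuristically detect abbreviations that a TTS voice might misread."""
    tokens = sentence.split()
    for index, raw in enumerate(tokens):
        token = raw.strip(",;:!?()\"'„“”")
        is_last = index == len(tokens) - 1
        if INNER_DOT.search(token):
            return True
        if ACRONYM.match(token.rstrip(".")):
            return True
        # A period on the final token is ordinary sentence punctuation.
        if not is_last and token.lower() in SHORT_FORMS:
            return True
    return False


def filter_sentences(sentences):
    """Keep only the sentences without detected abbreviations."""
    return [s for s in sentences if not contains_abbreviation(s)]


if __name__ == "__main__":
    candidates = [
        "Der Patient zeigt eine erythematöse Plaque am linken Unterarm.",
        "V.a. BCC an der Nase.",
        "Die Läsion misst ca. zwei Zentimeter.",
    ]
    for sentence in filter_sentences(candidates):
        print(sentence)
```

Heuristics like these err on the side of rejecting too much, which is usually acceptable here because the language model can simply generate a replacement sentence.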

Stage 2: Synthesize Audio

The next cell takes those written sentences and turns them into spoken audio using a Text-to-Speech (TTS) model. It also creates “noisy” and “sped up” versions of the audio so the model learns to cope with different recording conditions and speaking rates. Click the cell and press Shift + Enter.
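
The two augmentations can be sketched in a few lines. This is an illustration under our own assumptions, not the notebook’s code: audio is represented as a plain list of float samples, noise is white Gaussian noise mixed in at a chosen signal-to-noise ratio (SNR), and the speed-up is done by simple linear-interpolation resampling. A real pipeline would more likely use an audio library.

```python
import math
import random


def add_noise(samples, snr_db, seed=0):
    """Mix white Gaussian noise into the signal at the requested SNR (in dB)."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def speed_up(samples, factor):
    """Resample by linear interpolation so the clip plays `factor` times faster.

    Note: this simple approach also raises the pitch of the voice.
    """
    new_length = max(1, int(len(samples) / factor))
    out = []
    for i in range(new_length):
        position = i * factor
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        frac = position - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


if __name__ == "__main__":
    # One second of a 440 Hz tone at 16 kHz stands in for a spoken sentence.
    tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
    print(len(add_noise(tone, snr_db=20)), len(speed_up(tone, 1.25)))
```

A lower snr_db means a noisier clip, and a factor above 1.0 means a shorter, faster one.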

Stage 3: Package the Dataset

This quick step sorts your generated audio into three piles: Training data (to teach the model), Validation data (to check its progress), and Test data (to grade its final performance). Click the cell and press Shift + Enter.
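
One detail worth knowing about this kind of split: because Stage 2 produces several audio versions of each sentence, it is good practice to keep all versions of a sentence in the same pile, so the test score is not flattered by sentences the model has already seen. The sketch below shows one way to do that. The 80/10/10 ratio and the file names are our own assumptions, not values taken from the notebook.

```python
import random
from collections import defaultdict


def split_by_sentence(clips, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (sentence_id, filename) pairs so all variants of a sentence share one split."""
    by_sentence = defaultdict(list)
    for sentence_id, filename in clips:
        by_sentence[sentence_id].append(filename)

    # Shuffle sentence ids (not individual files) with a fixed seed for repeatability.
    ids = sorted(by_sentence)
    random.Random(seed).shuffle(ids)

    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    groups = {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
    return {
        name: [f for sid in sids for f in by_sentence[sid]]
        for name, sids in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical file names: 50 sentences, three audio variants each.
    clips = [(i, f"sent{i:04d}_{v}.wav") for i in range(50) for v in ("clean", "noisy", "fast")]
    for name, files in split_by_sentence(clips).items():
        print(name, len(files))
```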

Stage 4: Finetune the Model

This is the core step. Your Mac will now teach a base voice model (Qwen3-ASR) to understand the medical words you generated. It does this by creating a small “adapter” that sits on top of the base model. Click the cell and press Shift + Enter. This will take a few minutes.
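
To see why an adapter can be so small, here is a toy sketch of one common adapter design, the low-rank (LoRA-style) adapter. That the notebook uses this particular design is our assumption; the idea is simply that the base weights W stay frozen while two thin matrices A and B learn a small correction, so the output becomes W x + scale * B (A x). The layer size used in the example is illustrative and is not Qwen3-ASR’s real dimension.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapted_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x): frozen base weights plus a low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def adapter_param_count(d_out, d_in, rank):
    """Number of trainable values in A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base layer (2 x 3)
    A = [[1.0, 1.0, 1.0]]                   # rank-1 down-projection (1 x 3)
    B = [[0.5], [-1.0]]                     # rank-1 up-projection (2 x 1)
    print(adapted_forward(W, A, B, [1.0, 2.0, 3.0]))

    # Illustrative 1024 x 1024 layer with a rank-8 adapter.
    full, small = 1024 * 1024, adapter_param_count(1024, 1024, 8)
    print(f"adapter trains {small} values instead of {full}")
```

With B set to all zeros the adapted layer behaves exactly like the base model, which is why training can start from the original model’s behavior and only gradually specialize.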

Stage 5: Evaluate

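Once training is done, this cell tests the new model. It compares the “Word Error Rate” (WER) of the original model with that of your newly trained model. WER is the number of word-level mistakes (wrong, missing, or extra words) divided by the number of words in the correct text, so lower numbers are better! Click the cell and press Shift + Enter.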
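
For the curious, WER can be computed with a standard word-level edit distance, as in the sketch below. The lowercasing step is our own simplification; the notebook may normalize text differently (for example, by also stripping punctuation), which changes the exact numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, computed row by row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "der patient zeigt ein dermatofibrosarkom"
    hypothesis = "der patient zeigt ein derma to fibro sarkom"
    print(word_error_rate(reference, hypothesis))
```

In this example the single medical term is split into four wrong words, which counts as one substitution plus three insertions against five reference words. This also shows that WER can exceed 1.0 when a transcript contains many extra words.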

Stage 6: Try It Yourself

Now for the fun part. You can record your own voice saying a medical sentence, save it as my_recording.wav in the same folder as the notebook, and the notebook will transcribe it using your custom model.
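
If the transcription of your recording fails or looks odd, it can help to inspect the file first. The sketch below reads the basic properties of a WAV file with Python’s built-in wave module. Which format the notebook expects is not stated here; many speech models work with 16 kHz mono audio, so treat that as a reasonable guess rather than a requirement.

```python
import os
import wave


def describe_wav(path):
    """Return channel count, sample rate, bit depth, and duration of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    if os.path.exists("my_recording.wav"):
        print(describe_wav("my_recording.wav"))
    else:
        print("my_recording.wav not found in this folder")
```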

Adapting to Your Own Specialty

The tutorial defaults to German Dermatology, but you can change it to any specialty (like Cardiology or Neurology). To do this, make the following changes before running Stage 1:

Open the file asr/taxonomy.json and replace the skin conditions with conditions from your specialty.
In the Stage 1 cell of the notebook, change the vocabulary list to include terms from your field (e.g., “Auskultation”, “Myokardinfarkt”).
Change the specialty parameter to match your field.

What’s Next?

You have just trained a medical AI model on your own computer! The default tutorial only generates 50 samples to show you how it works. For a model you can rely on in daily practice, increase the n_samples number in Stage 1 to generate thousands of sentences, and let your Mac run overnight. Want to run your new finetuned model on your phone for your clinical workflow? Visit Isaree.ai to learn how to deploy it securely.
