Voice-to-Form: How I Built an AI Agent That Listens to Doctors
Imagine a doctor finishing a patient examination and simply speaking their observations out loud. Within seconds, a complex medical form with dozens of fields is automatically filled out with structured, validated data. No typing, no clicking through dropdowns, no wrestling with forms.
This is exactly what I built with my Voice Agentic Workflow—a system that transforms natural speech into structured medical data using AI. In this post, I'll walk you through how this system works from the moment a doctor clicks the microphone button to when the form magically populates itself.
The Big Picture: Architecture Overview
Before diving into the details, let's look at the overall architecture:
The system has six main stages, sketched in code just after this list:
- Audio Capture: Recording the doctor's voice in the browser
- Transcription: Converting audio to text using AI
- Intelligent Processing: Understanding the transcript using Query Parser
- Orchestration: Managing the feedback loop and UI state
- Form Population: Filling out the form fields automatically
- Persistence: Saving the data as a draft
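To make the flow concrete, here is a compact sketch of the pipeline as TypeScript types. The interface and type names are illustrative, not the actual production code.

```typescript
// Illustrative only: the real interfaces in the codebase will differ.
type Base64Audio = string;                    // data-URL-encoded recording
type FormPatch = Record<string, unknown>;     // field id -> extracted value

interface VoicePipeline {
  capture(): Promise<Base64Audio>;                  // Stage 1: record in the browser
  transcribe(audio: Base64Audio): Promise<string>;  // Stage 2: Whisper speech-to-text
  parse(transcript: string): Promise<FormPatch>;    // Stage 3: Query Parser extraction
  orchestrate(patch: FormPatch): Promise<void>;     // Stage 4: feedback loop & UI steps
  populate(patch: FormPatch): void;                 // Stage 5: fill the form fields
  saveDraft(patch: FormPatch): Promise<void>;       // Stage 6: persist as a draft
}
```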
Stage 1: Audio Capture - Listening to the Doctor
The Voice Input Interface
When a doctor wants to fill out a form using voice, they can simply say "Hey Axone" (or similar wake words like "Hi Axone", "Hey Action") to trigger the recording hands-free. Alternatively, they can manually click the microphone button to bring up the floating control panel.
```tsx
import { useState } from "react";
// The recorder hook and prop types live elsewhere in the project; the import
// path here is illustrative.
import { useAudioRecorder } from "@/hooks/useAudioRecorder";

export function VoiceInputBar<T>({
  onVoiceFormData,
  formData,
  schemaId,
}: VoiceInputBarProps<T>) {
  const [isRecording, setIsRecording] = useState(false);
  const [transcript, setTranscript] = useState("");
  const [isGenerating, setIsGenerating] = useState(false);

  // Hook that handles all the audio recording logic
  const {
    startRecording,
    stopRecording,
    audioURL,
    base64Audio,
    isRecording: recorderIsRecording,
    frequencyData,
  } = useAudioRecorder();

  // ... recording logic
}
```

How Audio Recording Works
I use the browser's built-in MediaRecorder API and a lightweight voice activity detection (VAD) model to listen for the wake word.
```typescript
// Step 1: Request microphone access
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    noiseSuppression: true, // Remove background noise
    echoCancellation: true, // Remove echo
    sampleRate: 44100,      // High-quality audio
  },
});

// Step 2: Create a recorder
const recorder = new MediaRecorder(stream, {
  mimeType: "audio/webm", // WebM format for compatibility
});
```
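For completeness, here is roughly how the hook might turn the recorded chunks into the audioURL and base64Audio values it exposes. This continues the snippet above and is a sketch, not the production useAudioRecorder implementation.

```typescript
// Step 3 (sketch): collect the recorded chunks and expose them as a playback
// URL plus a base64 string for upload.
const chunks: Blob[] = [];
recorder.ondataavailable = (event) => {
  if (event.data.size > 0) chunks.push(event.data);
};

recorder.onstop = () => {
  const blob = new Blob(chunks, { type: "audio/webm" });
  const audioURL = URL.createObjectURL(blob);    // for local playback
  const reader = new FileReader();
  reader.onloadend = () => {
    const base64Audio = reader.result as string; // data URL sent to the backend
    // ...hand audioURL and base64Audio back to component state here
  };
  reader.readAsDataURL(blob);
};
```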
Stage 2: Transcription - Audio to Text
Once the doctor stops recording (or the system auto-stops), the audio is sent to my backend for transcription using OpenAI's Whisper model.
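On the client, shipping that recording to the backend can be a single fetch call. The endpoint path and payload shape below are assumptions for illustration, not the actual API.

```typescript
// Hypothetical client-side upload helper.
async function sendForTranscription(base64Audio: string): Promise<string> {
  const res = await fetch("/api/voice/transcribe", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ audio: base64Audio, sttType: "dictation" }),
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
  const { text } = await res.json();
  return text;
}
```

On the backend, the transcription helper itself looks like this: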
```typescript
import type { ReadStream } from "node:fs";
// openAiClient (an OpenAI SDK instance) and getSttPrompt are defined elsewhere
// in the backend.

export const transcription = async (
  readStream: ReadStream,
  sttType: "dictation" | "conversation"
) => {
  // Get medical context prompt for better accuracy
  const sttPrompt = getSttPrompt(sttType);

  // Call OpenAI Whisper API
  const transcriptionData = await openAiClient.audio.transcriptions.create({
    file: readStream,
    model: "whisper-1",
    prompt: sttPrompt, // Helps Whisper understand medical terms
  });

  return transcriptionData.text;
};
```
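To show how the pieces connect, here is a rough sketch of a route handler turning the uploaded base64 audio into the ReadStream this function expects. The temp-file handling and names are assumptions; the real backend may stream the upload directly.

```typescript
import fs from "node:fs";
import os from "node:os";
import path from "node:path";
import { transcription } from "./transcription"; // illustrative import path

export async function handleVoiceUpload(base64Audio: string): Promise<string> {
  // Strip the data-URL prefix the browser may have added
  const raw = base64Audio.replace(/^data:audio\/webm;base64,/, "");
  const tmpPath = path.join(os.tmpdir(), `dictation-${Date.now()}.webm`);
  fs.writeFileSync(tmpPath, Buffer.from(raw, "base64"));

  const text = await transcription(fs.createReadStream(tmpPath), "dictation");

  fs.unlinkSync(tmpPath); // clean up the temp file
  return text;
}
```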
Stage 3: Intelligent Processing
This is where the magic happens. The system takes the plain text transcript and transforms it into structured form data using my Query Parser.
The Query Parser
The Query Parser is an intelligent orchestration layer that has deep knowledge of the UI state. It doesn't just extract data; it creates a dynamic set of steps to achieve the user's request.
- Context Awareness: It understands which form is open, what fields are visible, and the current validation state.
- Dynamic Prompting: Based on the user's intent and the current UI state, it dynamically chooses the next best prompt to send to the LLM.
- Step Generation: It generates a sequence of actions (e.g., "Open Accordion B", "Select Dropdown Option C", "Fill Text Field D"); a sketch of what these steps might look like follows this list.
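Here is one way those generated steps could be modeled in TypeScript; the names are illustrative, not the real schema.

```typescript
// Hypothetical step format for illustration only.
type UiAction =
  | { kind: "openAccordion"; targetId: string }
  | { kind: "selectOption"; fieldId: string; option: string }
  | { kind: "fillText"; fieldId: string; value: string };

interface ParsedPlan {
  intent: string;     // e.g. "fill the examination form from these notes"
  steps: UiAction[];  // executed in order by the orchestration layer
}
```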
To extract the raw values in the first place, the Query Parser assembles a prompt along these lines for the LLM:

```json
[
  {
    "role": "system",
    "content": "You are a medical data extraction assistant. Extract structured data from the doctor's notes and return it as JSON..."
  },
  {
    "role": "user",
    "content": "Extract data from these notes:\n\n__NOTES__"
  }
]
```
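The model's reply is plain JSON that maps onto the form's fields. The example below is purely illustrative; the real field names come from the form schema.

```typescript
// Purely illustrative extraction result.
const extracted = {
  chiefComplaint: "persistent dry cough for two weeks",
  temperatureCelsius: 37.8,
  bloodPressure: "128/82",
  followUpRequired: true,
};
```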
Stage 4: Orchestration & Feedback Loop
One of the most powerful features of Axone is its self-correcting feedback loop. In a complex UI, things don't always go according to plan—elements might be hidden, network requests might fail, or the state might change unexpectedly.
Self-Healing Actions
If the orchestration layer fails to execute a step (e.g., "Create Button not found"), the failure is not fatal. Instead:
- Error Capture: The system captures the specific failure (e.g., ElementNotFound: #create-btn).
- Context Injection: This error is added back into the AI's context.
- Reasoning & Retry: The AI reasons about why it failed (e.g., "Ah, the accordion is closed. I need to open the accordion first.").
- New Plan: The Query Parser generates a new set of steps.
This loop ensures that the agent is resilient and can navigate through complex, multi-step UI interactions just like a human would.
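A minimal sketch of that loop, assuming hypothetical planSteps and executeStep helpers rather than the production orchestrator:

```typescript
// Illustrative self-correcting loop; planSteps and executeStep are hypothetical.
type Step = { kind: string; targetId?: string; value?: string };

async function runWithSelfHealing(
  transcript: string,
  planSteps: (transcript: string, errors: string[]) => Promise<Step[]>,
  executeStep: (step: Step) => Promise<void>,
  maxAttempts = 3
): Promise<void> {
  const errors: string[] = [];                         // error context fed back to the AI
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const steps = await planSteps(transcript, errors); // Query Parser re-plans
    try {
      for (const step of steps) {
        await executeStep(step);                       // e.g. open accordion, fill field
      }
      return;                                          // every step succeeded
    } catch (err) {
      // Capture the specific failure so the next plan can reason about it
      errors.push(`Step failed: ${(err as Error).message}`);
    }
  }
  throw new Error("Could not complete the request after retries");
}
```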
Conclusion
Building a voice-to-form AI agent involves orchestrating multiple technologies: Browser APIs, Whisper, GPT-4, and dynamic prompts. The result is a system that saves doctors and nurses 85-95% of their time on form filling, allowing them to focus on what matters: patient care.