Voice-to-Form: How I Built an AI Agent That Listens to Doctors
Imagine a doctor finishing a patient examination and simply speaking their observations out loud. Within seconds, a complex medical form with dozens of fields is automatically filled out with structured, validated data. No typing, no clicking through dropdowns, no wrestling with forms.
This is exactly what I built with my Voice Agentic Workflow—a system that transforms natural speech into structured medical data using AI. In this post, I'll walk you through how this system works from the moment a doctor clicks the microphone button to when the form magically populates itself.
The Big Picture: Architecture Overview
Before diving into the details, let's look at the overall architecture:
The system has six main stages, sketched in code just after this list:
- Audio Capture: Recording the doctor's voice in the browser
- Transcription: Converting audio to text using AI
- Intelligent Processing: Understanding the transcript using Query Parser
- Orchestration: Managing the feedback loop and UI state
- Form Population: Filling out the form fields automatically
- Persistence: Saving the data as a draft
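To make the flow concrete, here is a compact sketch of the pipeline as TypeScript types. The interface and type names are illustrative, not the actual production code.

```typescript
// Illustrative only: the real interfaces in the codebase will differ.
type Base64Audio = string;                    // data-URL-encoded recording
type FormPatch = Record<string, unknown>;     // field id -> extracted value

interface VoicePipeline {
  capture(): Promise<Base64Audio>;                  // Stage 1: record in the browser
  transcribe(audio: Base64Audio): Promise<string>;  // Stage 2: Whisper speech-to-text
  parse(transcript: string): Promise<FormPatch>;    // Stage 3: Query Parser extraction
  orchestrate(patch: FormPatch): Promise<void>;     // Stage 4: feedback loop & UI steps
  populate(patch: FormPatch): void;                 // Stage 5: fill the form fields
  saveDraft(patch: FormPatch): Promise<void>;       // Stage 6: persist as a draft
}
```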
Stage 1: Audio Capture - Listening to the Doctor
The Voice Input Interface
When a doctor wants to fill out a form using voice, they can simply say "Hey Axone" (or similar wake words like "Hi Axone", "Hey Action") to trigger the recording hands-free. Alternatively, they can manually click the microphone button to bring up the floating control panel.
```tsx
import { useState } from "react";
// The recorder hook and prop types live elsewhere in the project; the import
// path here is illustrative.
import { useAudioRecorder } from "@/hooks/useAudioRecorder";

export function VoiceInputBar<T>({
  onVoiceFormData,
  formData,
  schemaId,
}: VoiceInputBarProps<T>) {
  const [isRecording, setIsRecording] = useState(false);
  const [transcript, setTranscript] = useState("");
  const [isGenerating, setIsGenerating] = useState(false);

  // Hook that handles all the audio recording logic
  const {
    startRecording,
    stopRecording,
    audioURL,
    base64Audio,
    isRecording: recorderIsRecording,
    frequencyData,
  } = useAudioRecorder();

  // ... recording logic
}
```

How Audio Recording Works
I use the browser's built-in MediaRecorder API and a lightweight voice activity detection (VAD) model to listen for the wake word.
```typescript
// Step 1: Request microphone access
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    noiseSuppression: true, // Remove background noise
    echoCancellation: true, // Remove echo
    sampleRate: 44100,      // High-quality audio
  },
});

// Step 2: Create a recorder
const recorder = new MediaRecorder(stream, {
  mimeType: "audio/webm", // WebM format for compatibility
});
```
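For completeness, here is roughly how the hook might turn the recorded chunks into the audioURL and base64Audio values it exposes. This continues the snippet above and is a sketch, not the production useAudioRecorder implementation.

```typescript
// Step 3 (sketch): collect the recorded chunks and expose them as a playback
// URL plus a base64 string for upload.
const chunks: Blob[] = [];
recorder.ondataavailable = (event) => {
  if (event.data.size > 0) chunks.push(event.data);
};

recorder.onstop = () => {
  const blob = new Blob(chunks, { type: "audio/webm" });
  const audioURL = URL.createObjectURL(blob);    // for local playback
  const reader = new FileReader();
  reader.onloadend = () => {
    const base64Audio = reader.result as string; // data URL sent to the backend
    // ...hand audioURL and base64Audio back to component state here
  };
  reader.readAsDataURL(blob);
};
```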
Stage 2: Transcription - Audio to Text
Once the doctor stops recording (or the system auto-stops), the audio is sent to my backend for transcription using OpenAI's Whisper model.
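On the client, shipping that recording to the backend can be a single fetch call. The endpoint path and payload shape below are assumptions for illustration, not the actual API.

```typescript
// Hypothetical client-side upload helper.
async function sendForTranscription(base64Audio: string): Promise<string> {
  const res = await fetch("/api/voice/transcribe", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ audio: base64Audio, sttType: "dictation" }),
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
  const { text } = await res.json();
  return text;
}
```

On the backend, the transcription helper itself looks like this: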
```typescript
import type { ReadStream } from "node:fs";
// openAiClient (an OpenAI SDK instance) and getSttPrompt are defined elsewhere
// in the backend.

export const transcription = async (
  readStream: ReadStream,
  sttType: "dictation" | "conversation"
) => {
  // Get medical context prompt for better accuracy
  const sttPrompt = getSttPrompt(sttType);

  // Call OpenAI Whisper API
  const transcriptionData = await openAiClient.audio.transcriptions.create({
    file: readStream,
    model: "whisper-1",
    prompt: sttPrompt, // Helps Whisper understand medical terms
  });

  return transcriptionData.text;
};
```
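To show how the pieces connect, here is a rough sketch of a route handler turning the uploaded base64 audio into the ReadStream this function expects. The temp-file handling and names are assumptions; the real backend may stream the upload directly.

```typescript
import fs from "node:fs";
import os from "node:os";
import path from "node:path";
import { transcription } from "./transcription"; // illustrative import path

export async function handleVoiceUpload(base64Audio: string): Promise<string> {
  // Strip the data-URL prefix the browser may have added
  const raw = base64Audio.replace(/^data:audio\/webm;base64,/, "");
  const tmpPath = path.join(os.tmpdir(), `dictation-${Date.now()}.webm`);
  fs.writeFileSync(tmpPath, Buffer.from(raw, "base64"));

  const text = await transcription(fs.createReadStream(tmpPath), "dictation");

  fs.unlinkSync(tmpPath); // clean up the temp file
  return text;
}
```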
Stage 3: Intelligent Processing
This is where the magic happens. The system takes the plain text transcript and transforms it into structured form data using my Query Parser.
The Query Parser
The Query Parser is an intelligent orchestration layer that has deep knowledge of the UI state. It doesn't just extract data; it creates a dynamic set of steps to achieve the user's request.
- Context Awareness: It understands which form is open, what fields are visible, and the current validation state.
- Dynamic Prompting: Based on the user's intent and the current UI state, it dynamically chooses the next best prompt to send to the LLM.
- Step Generation: It generates a sequence of actions (e.g., "Open Accordion B", "Select Dropdown Option C", "Fill Text Field D"); a sketch of what these steps might look like follows this list.
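Here is one way those generated steps could be modeled in TypeScript; the names are illustrative, not the real schema.

```typescript
// Hypothetical step format for illustration only.
type UiAction =
  | { kind: "openAccordion"; targetId: string }
  | { kind: "selectOption"; fieldId: string; option: string }
  | { kind: "fillText"; fieldId: string; value: string };

interface ParsedPlan {
  intent: string;     // e.g. "fill the examination form from these notes"
  steps: UiAction[];  // executed in order by the orchestration layer
}
```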
To extract the raw values in the first place, the Query Parser assembles a prompt along these lines for the LLM:

```json
[
  {
    "role": "system",
    "content": "You are a medical data extraction assistant. Extract structured data from the doctor's notes and return it as JSON..."
  },
  {
    "role": "user",
    "content": "Extract data from these notes:\n\n__NOTES__"
  }
]
```
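The model's reply is plain JSON that maps onto the form's fields. The example below is purely illustrative; the real field names come from the form schema.

```typescript
// Purely illustrative extraction result.
const extracted = {
  chiefComplaint: "persistent dry cough for two weeks",
  temperatureCelsius: 37.8,
  bloodPressure: "128/82",
  followUpRequired: true,
};
```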
Stage 4: Orchestration & Feedback Loop
One of the most powerful features of Axone is its self-correcting feedback loop. In a complex UI, things don't always go according to plan—elements might be hidden, network requests might fail, or the state might change unexpectedly.
Self-Healing Actions
If the orchestration layer fails to execute a step (e.g., "Create Button not found"), the failure is not fatal. Instead:
- Error Capture: The system captures the specific failure (e.g., ElementNotFound: #create-btn).
- Context Injection: This error is added back into the AI's context.
- Reasoning & Retry: The AI reasons about why it failed (e.g., "Ah, the accordion is closed. I need to open the accordion first.").
- New Plan: The Query Parser generates a new set of steps.
This loop ensures that the agent is resilient and can navigate through complex, multi-step UI interactions just like a human would.
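A minimal sketch of that loop, assuming hypothetical planSteps and executeStep helpers rather than the production orchestrator:

```typescript
// Illustrative self-correcting loop; planSteps and executeStep are hypothetical.
type Step = { kind: string; targetId?: string; value?: string };

async function runWithSelfHealing(
  transcript: string,
  planSteps: (transcript: string, errors: string[]) => Promise<Step[]>,
  executeStep: (step: Step) => Promise<void>,
  maxAttempts = 3
): Promise<void> {
  const errors: string[] = [];                         // error context fed back to the AI
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const steps = await planSteps(transcript, errors); // Query Parser re-plans
    try {
      for (const step of steps) {
        await executeStep(step);                       // e.g. open accordion, fill field
      }
      return;                                          // every step succeeded
    } catch (err) {
      // Capture the specific failure so the next plan can reason about it
      errors.push(`Step failed: ${(err as Error).message}`);
    }
  }
  throw new Error("Could not complete the request after retries");
}
```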
Conclusion
Building a voice-to-form AI agent involves orchestrating multiple technologies: Browser APIs, Whisper, GPT-4, and dynamic prompts. The result is a system that saves doctors and nurses 85-95% of their time on form filling, allowing them to focus on what matters: patient care.