Standardizing Clinical Performance Evaluation with Multimodal GenAI

About Client:

The client is a globally recognized university system specializing in postgraduate medical and nursing education. It operates one of the largest clinical simulation centers in North America, conducting thousands of high-stakes Objective Structured Clinical Examinations (OSCEs) and Simulated Patient Encounters (SPEs) every year.

Background:

In these simulations, students engage with trained actors portraying patients or, in some cases, medical mannequins used for hands-on procedures, while faculty evaluators assess their performance across both technical and interpersonal skills—from clinical accuracy and diagnosis to empathy and communication.

However, as enrollment grew and simulation scenarios became more complex, evaluation became increasingly burdensome. Faculty had to manually review hours of video, cross-check multiple criteria, and provide detailed feedback. With thousands of sessions conducted annually, maintaining consistency and standardization across evaluators became a significant challenge.

Challenge:

The university’s leadership sought a way to scale the evaluation process without compromising on fairness or educational quality. They faced three key pain points:

  • Manual, Time-Intensive Evaluation: Reviewing and scoring each session took 45–60 minutes, making it difficult to keep pace with growing simulation volume.
  • Lack of Standardization: Evaluation methods varied between one-on-one, multi-student, and multi-room simulations, leading to discrepancies in scoring.
  • Subjectivity in Assessing Soft Skills: Attributes like empathy, confidence, and patient engagement were prone to evaluator bias despite standardized rubrics.

The goal was clear: reduce instructor burden, enhance consistency, and deliver faster, more actionable feedback to students.

Solution:

To meet these challenges, we designed and implemented a secure, cloud-native Multimodal Generative AI Evaluation System built on AWS. The system combined video, audio, and text analysis through advanced AI models to produce objective and rich performance evaluations, while incorporating Retrieval-Augmented Generation (RAG) to ensure continuous accuracy improvement over time. The entire application was deployed on Amazon EKS to support scalable, containerized workloads, and Amazon RDS served as the foundation for both the RAG knowledge store and the backend database.
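
As an illustration of the retrieval side of that RAG workflow, the sketch below shows how rubric and feedback passages might be pulled from the RDS-backed knowledge store. It assumes the RDS instance runs PostgreSQL with the pgvector extension and a hypothetical rubric_chunks table of embedded passages; none of these specifics are confirmed by the engagement, they simply illustrate the pattern.

```python
# Minimal sketch of RAG retrieval from an RDS-backed knowledge store.
# Assumes PostgreSQL with the pgvector extension and a hypothetical
# "rubric_chunks" table (id, content, embedding) -- illustrative only.
import os
import psycopg2

def retrieve_rubric_context(query_embedding: list[float], top_k: int = 5) -> list[str]:
    """Return the rubric/feedback passages closest to the query embedding."""
    conn = psycopg2.connect(
        host="my-rds-endpoint.amazonaws.com",   # hypothetical endpoint
        dbname="evaluation",
        user="app",
        password=os.environ.get("DB_PASSWORD"),
    )
    try:
        # pgvector expects a literal like "[0.1,0.2,...]"; <=> is cosine distance.
        vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT content
                FROM rubric_chunks
                ORDER BY embedding <=> %s::vector
                LIMIT %s
                """,
                (vector_literal, top_k),
            )
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```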

The solution was implemented in three key phases:

Phase 1: Data Integration and Processing

  • All video and audio feeds from simulation rooms were time-synced and securely ingested into the cloud.
  • Speech recognition and sentiment analysis were applied to identify tone, clarity, and emotional state, capturing both student confidence and patient sentiment in real time (a sketch of this step follows the list).
  • Video data was processed using Gemini Pro, a multimodal model that detected non-verbal behaviors such as body posture, hand gestures, eye contact, and adherence to procedural protocols.
  • All extracted text, transcripts, and structured observations were stored in Amazon RDS, enabling the RAG workflow to retrieve historical context, institutional rubrics, and prior feedback patterns to enrich each evaluation.
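
To make the audio side of Phase 1 concrete, the sketch below shows one way the speech-recognition and sentiment steps could be wired together using Amazon Transcribe and Amazon Comprehend. The case study does not name these specific services, and the job names and URIs are placeholders.

```python
# Illustrative sketch of the audio pipeline: transcribe a simulation recording,
# then run sentiment analysis on the transcript. Service choices (Transcribe,
# Comprehend) and all names are assumptions, not confirmed details.
import json
import time
import urllib.request
import boto3

transcribe = boto3.client("transcribe")
comprehend = boto3.client("comprehend")

def transcribe_and_score(media_uri: str, job_name: str) -> dict:
    """Start a transcription job, wait for it, then tag transcript sentiment."""
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},   # e.g. s3://sim-recordings/room-3/session.mp4
        MediaFormat="mp4",
        LanguageCode="en-US",
    )
    while True:
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            break
        time.sleep(10)
    if status == "FAILED":
        raise RuntimeError(f"Transcription job {job_name} failed")

    transcript_uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
    with urllib.request.urlopen(transcript_uri) as resp:
        transcript_text = json.load(resp)["results"]["transcripts"][0]["transcript"]

    # Comprehend caps detect_sentiment input at 5 KB; truncate for this sketch.
    sentiment = comprehend.detect_sentiment(Text=transcript_text[:4500], LanguageCode="en")
    return {
        "transcript": transcript_text,
        "sentiment": sentiment["Sentiment"],
        "scores": sentiment["SentimentScore"],
    }
```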

Phase 2: Intelligent Evaluation and Scoring

  • The outputs from audio, video, and sentiment models were synthesized by Claude 3 Sonnet, an advanced reasoning model that integrated all data points into a single, coherent analysis.
  • The client’s detailed evaluation rubrics were embedded in the model’s system prompts, ensuring full alignment with institutional grading standards (see the sketch after this list).
  • Through the RAG pipeline, Claude 3 Sonnet accessed the latest rubric updates, domain guidelines, and evaluator feedback stored in RDS — creating a continuous feedback loop that enhanced scoring accuracy over time.
  • For each simulation, the system produced a structured evaluation report including:
    1. Numeric scores across rubric categories (e.g., Clinical Reasoning – 4/5; Empathy – 5/5).
    2. Narrative feedback highlighting specific moments and improvement areas (e.g., “The student showed strong diagnostic reasoning but missed summarizing patient concerns before moving to examination.”).
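
A hedged sketch of the scoring call is shown below. It assumes Claude 3 Sonnet is invoked through Amazon Bedrock (one plausible route on AWS, not stated in the case study) with the institutional rubric and retrieved RAG context placed in the system prompt; the JSON report shape mirrors the two report elements above, but the exact schema is illustrative.

```python
# Illustrative scoring call: synthesize transcript, sentiment, and video
# observations into a rubric-aligned report. Assumes Claude 3 Sonnet is served
# via Amazon Bedrock; the prompt and output schema are simplified examples.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def evaluate_session(rubric: str, rag_context: str, observations: dict) -> dict:
    """Ask the model for numeric rubric scores plus narrative feedback as JSON."""
    system_prompt = (
        "You are a clinical simulation evaluator. Score the student strictly "
        f"against this rubric:\n{rubric}\n\nRelevant institutional context:\n{rag_context}\n"
        "Respond only with JSON: {\"scores\": {category: 1-5}, \"narrative\": str}"
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1500,
        "system": system_prompt,
        "messages": [{
            "role": "user",
            "content": [{"type": "text",
                         "text": "Session observations:\n" + json.dumps(observations)}],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    reply = json.loads(response["body"].read())
    # The generated report sits in the first content block of the model reply.
    return json.loads(reply["content"][0]["text"])
```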

Phase 3: Adaptability Across Simulation Setups

  • The system was built to handle all levels of complexity — from simple one-on-one encounters to multi-student, multi-patient, or distributed multi-room simulations.
  • In group sessions, each student’s performance metrics were isolated and tracked individually, ensuring scoring fairness even in collaborative settings.
  • In distributed environments, synchronized metadata correlated activities from multiple cameras and microphones into a unified timeline, allowing the AI to evaluate the entire session as a single, cohesive event (sketched after this list).
  • Deployed on Amazon EKS, the platform automatically scaled to support large simulation volumes, parallel evaluation workloads, and peak assessment periods.
  • Using RDS-backed RAG, the system continuously learned from evaluator corrections and institutional updates, steadily improving rubric alignment and evaluation precision.
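
As a minimal illustration of the multi-room correlation step, the sketch below merges timestamped observations from several camera and microphone feeds into one ordered timeline keyed by student. The event fields are hypothetical; real feeds would carry richer metadata.

```python
# Minimal sketch: correlate per-room observation streams into a single,
# time-ordered session timeline so each student's events can be scored
# individually. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Observation:
    timestamp: float     # seconds since session start (feeds are time-synced)
    room: str            # e.g. "sim-room-2"
    student_id: str
    modality: str        # "audio" or "video"
    detail: str          # e.g. "maintained eye contact", "summarized concerns"

def build_timeline(streams: list[list[Observation]]) -> dict[str, list[Observation]]:
    """Merge all room streams, sort by timestamp, and group by student."""
    merged = sorted((obs for stream in streams for obs in stream),
                    key=lambda obs: obs.timestamp)
    per_student: dict[str, list[Observation]] = {}
    for obs in merged:
        per_student.setdefault(obs.student_id, []).append(obs)
    return per_student
```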

Outcome:

The university now operates with a fully standardized, AI-supported evaluation process — faster, fairer, and more scalable. Faculty focus on personalized guidance, while students receive immediate, actionable feedback that strengthens learning outcomes. Key results include:

| Metric | Before (Manual) | After (GenAI) | Improvement |
|---|---|---|---|
| Instructor Evaluation Time | 45–60 mins | 10–15 mins | 67% faster |
| Total Instructor Hours (Annually) | ~9,500 hrs | ~2,375 hrs | 7,000+ hrs saved |
| Evaluation Consistency (ICC) | 0.65 | 0.89 | 37% higher reliability |
| Feedback Delivery Time | 3–5 days | <1 hour | 95% faster feedback |
| Cost per Session | ~$45 | ~$5.50 | 87% cost reduction |
| Pass/Fail Variance | ~12% | <1% | Near-perfect alignment |

Lasting Impact

The GenAI system has become the standard evaluation framework for all simulation-based training within the university. It ensures every student is assessed fairly, every instructor’s time is optimized, and every session produces meaningful learning insights.

By embracing AI as a collaborative tool rather than a replacement for human judgment, the university has redefined how clinical competency can be measured at scale — with speed, consistency, and compassion.
