Eardrum Triage ML prototype · Ben Strumeyer
Live ML prototype · built for Blueberry Pediatrics

Eardrum Triage

A working ML prototype of Blueberry's eardrum-triage feature: upload an otoscope image, and the model triages it for a physician.

98% accuracy on held-out data
454 expert-labeled images
~2 min to train on a laptop GPU
Triage support, not a diagnosis.
Built in Python + Django + PyTorch, the same stack Blueberry uses.

Try the model on real ears

Click any otoscope image. These are held-out test images the model never trained on, including two it gets wrong.

Held-out test set · never seen in training
Select an image to run the model.
The model returns a prediction, confidence, and a triage route.

Inference runs the same PyTorch model exported from training.
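A minimal sketch of how raw model scores could become the prediction, confidence, and triage route described above. The class order, the 0.80 review cutoff, and the route names are assumptions for illustration; the page does not state them.

```python
import math

# Hypothetical class order and cutoff; the page does not state either.
CLASSES = ("Effusion", "Normal", "Tube")
REVIEW_THRESHOLD = 0.80


def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits):
    """Return (prediction, confidence, route) for one image's logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    prediction, confidence = CLASSES[best], probs[best]
    if confidence < REVIEW_THRESHOLD:
        route = "physician review (low confidence)"  # hypothetical route name
    elif prediction == "Normal":
        route = "routine"                            # hypothetical route name
    else:
        route = "physician review (flagged)"         # hypothetical route name
    return prediction, confidence, route
```

In this sketch every non-Normal or low-confidence result goes to a physician, which keeps the output a routing decision rather than a diagnosis.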

The homework

What I learned about Blueberry

I started from the business, not the model. Here's the company I built this for.

Mission

Care at home

Turn every family's living room into their own pediatric urgent care: home exam kits + 24/7 pediatricians + AI.

Built for Medicaid families, and it cuts emergency department (ED) costs by up to 50%.

The stack

Server-rendered & pragmatic

Django + Hotwire (HTMX-like, server-rendered). The React Native app is being rewritten into Hotwire to simplify the stack. WebRTC lives inside their in-house EMR, on Google Cloud, with PyTorch / scikit-learn for ML.

Django · Hotwire · WebRTC · Google Cloud · PyTorch

Momentum

Scaling up

New leadership from MDLive (CEO) and Teladoc: the team is scaling, and engineering ownership is expanding with it.

What they value

Product engineers
  • Start from the business problem, not the tech
  • Impact over complexity
  • Own work end-to-end: a "Manager of One"
  • Human-in-the-loop, always

Roadmap fit

The four projects they're shipping, and how I'd build each

Matching the right tool to each problem matters more than reaching for ML every time.

Eardrum detection model

What this demo is

A transfer-learning CNN that triages otoscope images. Triage, not diagnosis.

How I'd build it

Pretrained ResNet-18 backbone, class-weighted loss, doctor in the loop on every flag.

Developmental screeners

Not ML

A scoring engine over validated questionnaires (ASQ-3, M-CHAT-R, CDC milestones) that flags delays for early intervention.

How I'd build it

Deterministic scoring rules. Auditable, explainable, and exactly what clinicians already trust.
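A minimal sketch of what deterministic, auditable scoring could look like. The domain names and cutoffs below are placeholders, not values from ASQ-3, M-CHAT-R, or the CDC milestones; a real engine would load each instrument's published scoring tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainRule:
    """Cutoffs for one questionnaire domain (values here are placeholders)."""
    name: str
    refer_below: int    # total under this -> refer
    monitor_below: int  # total under this -> monitor


# Hypothetical domains and cutoffs for illustration only; real instruments
# publish their own age-specific scoring tables.
RULES = (
    DomainRule("communication", refer_below=20, monitor_below=35),
    DomainRule("gross_motor", refer_below=25, monitor_below=40),
)


def score_screener(answers, rules=RULES):
    """Map per-domain item scores to a flag, keeping the total behind each flag."""
    report = {}
    for rule in rules:
        total = sum(answers[rule.name])
        if total < rule.refer_below:
            flag = "refer"
        elif total < rule.monitor_below:
            flag = "monitor"
        else:
            flag = "on track"
        report[rule.name] = {"total": total, "flag": flag}
    return report
```

Because the same answers always produce the same flag, and the total behind each flag is returned alongside it, a clinician can check any result by hand.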

At-risk child flagging

Rules first, model second

A transparent rules engine on home-kit vitals first; then a calibrated XGBoost + SHAP model that prioritizes the physician queue.

How I'd build it

Optimize for recall on the dangerous class: missing a sick child is the only error that truly costs.
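One way to make "optimize for recall" concrete is to choose the decision threshold from a recall target instead of defaulting to 0.5. The sketch below assumes the model emits a risk score per child and that label 1 marks the dangerous class; both function names are hypothetical.

```python
def recall_at(threshold, scores, labels):
    """Fraction of truly at-risk cases (label 1) scored at or above threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("need at least one positive example")
    return sum(s >= threshold for s in positives) / len(positives)


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold whose recall on the positive class meets the target."""
    for candidate in sorted(set(scores), reverse=True):
        if recall_at(candidate, scores, labels) >= target_recall:
            return candidate
    raise ValueError("target recall is unreachable")
```

Taking the highest threshold that still meets the target keeps false alarms as low as the recall requirement allows. The threshold should be picked on validation data, not on the test set.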

WebRTC call reliability

Engineering, not ML

TURN relay servers, ICE-restart on failure, adaptive bitrate, and getStats() metrics.

Why it matters

Families are often on weak connections, so reliability is the product, not a nice-to-have.

The build

How I built the eardrum model

A few hundred well-labeled ears, a pretrained backbone, and disciplined data hygiene.

1

Transfer learning on ResNet-18

The backbone already knows edges and textures, so a few hundred labeled ears are enough to specialize it. ResNet-18 has only ~12M parameters, so scaling accuracy is a data problem, not a compute one.

2

Trustworthy data only

The open Ohio / OtoMatch otoscope dataset: 454 expert-labeled images (Normal, Effusion, Tube), decoded from its labels spreadsheet into clean class folders. No web-scraped images: scraped labels are untrustworthy and unsafe for a medical model.
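A sketch of the spreadsheet-to-folders step, assuming the labels spreadsheet has been exported to CSV. The column names `filename` and `label` and the function name are hypothetical, since the dataset's real layout is not shown here.

```python
import csv
import shutil
from pathlib import Path


def sort_into_class_folders(labels_csv, image_dir, out_dir,
                            file_col="filename", label_col="label"):
    """Copy each labeled image into out_dir/<label>/ and return per-class counts.

    Column names are assumptions; the real spreadsheet layout is not shown.
    Rows whose image file is missing are skipped rather than guessed at.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    counts = {}
    with open(labels_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            source = image_dir / row[file_col]
            label = row[label_col].strip()
            if not source.is_file() or not label:
                continue
            target_dir = out_dir / label
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target_dir / source.name)
            counts[label] = counts.get(label, 0) + 1
    return counts
```

The returned counts double as a sanity check: they should add up to the number of labeled images expected, and they show the class imbalance before training starts.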

3

Handling imbalance & training

Class-weighted loss + augmentation to counter class imbalance. I also evaluated Zenodo, Figshare, and a 956-image Kaggle acute otitis media (AOM) set before settling on the cleanest source. The whole thing trains in ~2 minutes on an RTX 4060.
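A sketch of one common way to derive class weights: inverse class frequency. The page does not say which weighting scheme was used, so treat this formula as an assumption.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, common ones below 1.
    This scheme is an assumption, not necessarily the one used in training.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

Ordered by class index and converted to a tensor, these values could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that mistakes on rare classes contribute more to the loss.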

ResNet-18 · Transfer learning · Class-weighted loss · Augmentation · 454 images · RTX 4060

Evidence

Results

98% overall accuracy on held-out data and, crucially, the two misses were the safe kind.

[Figure: normalized confusion matrix of the eardrum triage model on the held-out test set · rows = true class, columns = predicted class]
Class      Precision   Recall
Effusion   0.98        0.98
Normal     1.00        0.97
Tube       0.95        1.00
Overall: 98% accuracy
The two misses were an abnormal-vs-abnormal mix-up and a false positive a physician would catch, never a missed sick child.
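For readers who want to see how per-class precision and recall fall out of a confusion matrix, here is a minimal sketch. It takes raw counts, not the normalized values shown in the figure, and the function names are hypothetical.

```python
def per_class_metrics(matrix, class_names):
    """Precision and recall per class from a confusion matrix of counts.

    matrix[i][j] counts images of true class i predicted as class j.
    """
    metrics = {}
    for k, name in enumerate(class_names):
        true_pos = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum
        actual_k = sum(matrix[k])                    # row sum
        metrics[name] = {
            "precision": true_pos / predicted_k if predicted_k else 0.0,
            "recall": true_pos / actual_k if actual_k else 0.0,
        }
    return metrics


def accuracy(matrix):
    """Correct predictions (the diagonal) over all predictions."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)
```

Recall on the abnormal classes is the number to watch for triage: a drop there means sick ears are being routed as normal, whereas a drop in precision only means extra physician reviews.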
Intellectual honesty

Honest limitations & what's next

A prototype that hides its weaknesses isn't trustworthy. Here are mine.

Where it stops today

  • Triage, not diagnosis: deliberately out of FDA device territory
  • Random split, not patient-level: production must split by patient
  • Only 3 classes, all from public data
  • The real accuracy moat is Blueberry's own home-kit photos, labeled by its doctors
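The patient-level split named in the list above can be sketched as follows. It assumes each image record carries a patient identifier, which the public dataset may not expose; the record shape and function name are hypothetical.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_path) records so no patient spans both sets."""
    patients = sorted({patient for patient, _ in records})
    random.Random(seed).shuffle(patients)  # seeded for a reproducible split
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test
```

Splitting on patients rather than images prevents near-duplicate photos of the same ear from landing on both sides of the split, which would inflate held-out accuracy.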