Echo-HFrEF Classifier

3D CNN vs. Vision Transformer for heart failure detection from echocardiogram video, under a 6 GB VRAM constraint

Can a Vision Transformer outperform a 3D CNN at detecting heart failure from echocardiogram video, trained on a 6 GB laptop GPU?

I compared R(2+1)D-18 (~31.5M parameters) against MViT v2 Small (~34.5M parameters) on the EchoNet-Dynamic dataset (10,030 videos), holding pretraining data (Kinetics-400) constant to isolate architecture as the variable. Models were evaluated at a clinical operating target — sensitivity at 90% specificity — with thresholds locked on the validation set before any test-set evaluation.
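
A minimal sketch of that threshold-locking step, assuming scikit-learn's `roc_curve` on validation-set probabilities (function and variable names here are illustrative, not the repo's actual code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def lock_threshold(y_val, p_val, target_specificity=0.90):
    """Pick the decision threshold on the VALIDATION set that meets the target
    specificity with the highest sensitivity, then freeze it for the test set."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    meets_spec = fpr <= (1.0 - target_specificity)   # specificity = 1 - FPR
    best = np.argmax(tpr[meets_spec])                # most sensitive point that still qualifies
    return thresholds[meets_spec][best]

# The locked threshold is applied unchanged at test time:
# test_pred = (p_test >= threshold)
```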

Before fine-tuning, the models tied: a paired bootstrap found no statistically significant difference, which pointed to a shared ceiling; Kinetics-400 pretraining on natural video wasn't transferring well to noisy ultrasound for either architecture. After fine-tuning, MViT v2 pulled ahead, with ROC-AUC 0.936 vs. 0.902 and sensitivity 0.825 vs. 0.725 at ~90% specificity. The AUC gap was statistically significant (95% CI on the difference: [0.013, 0.057]).
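
The interval comes from a paired bootstrap over the test set; a sketch along these lines (hypothetical helper, resampling the same cases for both models so their per-case correlation is preserved) produces it:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_ci(y_test, p_mvit, p_cnn, n_boot=2000, seed=0):
    """95% CI for AUC(MViT) - AUC(CNN), resampling identical cases for both models."""
    rng = np.random.default_rng(seed)
    diffs = []
    n = len(y_test)
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_test[idx].min() == y_test[idx].max():   # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y_test[idx], p_mvit[idx])
                     - roc_auc_score(y_test[idx], p_cnn[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```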

The cost: MViT’s self-attention required halving the batch size to stay within 6 GB, doubling steps per epoch and pushing training time to ~10 minutes per epoch versus ~2 min 45 sec for the CNN — roughly 3.6x slower, despite exposing fewer trainable parameters. Better accuracy, but the CNN stays the more practical choice anywhere frequent retraining matters.
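
Both backbones come from torchvision with Kinetics-400 weights. A sketch of the setup and the mixed-precision training step that keeps the footprint under 6 GB; the head layer names follow torchvision's current implementation, and the helper names and batching details are illustrative rather than the repo's exact code:

```python
import torch
from torch import nn
from torchvision.models.video import (
    r2plus1d_18, R2Plus1D_18_Weights,
    mvit_v2_s, MViT_V2_S_Weights,
)

def build_model(arch="mvit"):
    """Load the Kinetics-400 checkpoint and swap the classifier head for a single HF logit."""
    if arch == "cnn":
        m = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
        m.fc = nn.Linear(m.fc.in_features, 1)
    else:
        m = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
        m.head[-1] = nn.Linear(m.head[-1].in_features, 1)  # head = Sequential(Dropout, Linear)
    return m

# AMP step; MViT runs at half the CNN's batch size to stay within 6 GB of VRAM.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, clips, labels):
    # clips: (B, C, T, H, W) echo video tensor; labels: (B,) binary HF labels
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits = model(clips.cuda()).squeeze(1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.cuda().float())
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```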

Stack: Python, PyTorch, OpenCV, AMP, R(2+1)D-18, MViT v2 Small

GitHub