
When I migrated training for my plant identification model from an AMD Radeon 5700 XT to an NVIDIA RTX 3060 Ti, I expected modest improvements. What I got was a 7.9-point jump in top-1 accuracy, from 72.3% to 80.2%, after 15 training epochs. Everything else stayed the same: the architecture, the training data. Only the hardware changed.
For context: my model identifies 5,337 plant species from 1.6 million training images. It’s the backbone of Botanify, and this accuracy jump was the difference between a “decent prototype” and a “production-ready system.” The AMD 5700 XT was a gift from a friend—a last-gen card that got me into serious ML experimentation. While I’m grateful for it, I wish I’d switched sooner.
The Batch Size Problem
With the AMD 5700 XT, I had to use DirectML as a translation layer to connect to PyTorch, the ML training library I used to build the identification model. This introduced several limitations compared with CUDA, NVIDIA’s architecture that lets PyTorch run on the GPU directly.
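For anyone unfamiliar with the setup, the split looks roughly like the sketch below. The torch_directml package and its device() call are as I remember them, so treat this as an illustration of the two paths rather than my exact training script.

    import torch

    # AMD path: PyTorch reaches the Radeon 5700 XT only through the DirectML layer
    try:
        import torch_directml
        device = torch_directml.device()
    except ImportError:
        # NVIDIA path: PyTorch talks to the RTX 3060 Ti natively through CUDA
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(8, 2).to(device)   # the rest of the code is identical either way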
DirectML’s memory management limitations under PyTorch forced me down to a batch size of 4, meaning the model processed only 4 images per training step. This impaired Batch Normalization, a fundamental component of modern neural networks. BatchNorm calculates the mean and standard deviation across the current batch to stabilize training. With only 4 samples, those statistics became unreliable and training suffered for it; imagine trying to understand “what plants look like” by examining only 4 random photos at a time.
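To see why 4 samples are too few, here is a small standalone experiment (not taken from my training code) that measures how much a batch’s mean wanders from the true mean at batch sizes 4 and 16:

    import torch

    torch.manual_seed(0)
    activations = torch.randn(100_000)   # stand-in for one channel's activation values

    def batch_mean_spread(batch_size, n_batches=1000):
        # Draw many random batches and measure how widely their means scatter.
        idx = torch.randint(0, len(activations), (n_batches, batch_size))
        return activations[idx].mean(dim=1).std().item()

    print("spread of batch means at size 4: ", batch_mean_spread(4))    # roughly 0.5
    print("spread of batch means at size 16:", batch_mean_spread(16))   # roughly 0.25

BatchNorm normalizes by these per-batch estimates on every forward pass, so that noise feeds straight back into the gradients.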
EfficientNet-B4 has 100+ Batch Normalization layers (one after every convolution), and each one computed divergent statistics from just 4 samples. When I switched to CUDA, I was able to scale to batch size 16, and the training curves smoothed out dramatically. This single change explained 5-7 percentage points of the accuracy gap.
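If you want to check the BatchNorm count yourself, torchvision’s EfficientNet-B4 makes it a few lines (the exact count depends on the torchvision version):

    import torch.nn as nn
    from torchvision.models import efficientnet_b4

    model = efficientnet_b4(weights=None)   # architecture only, no pretrained weights
    bn_count = sum(1 for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    print(bn_count)   # every one of these layers recomputes batch statistics each training step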
Mixed Precision: The Feature DirectML Couldn’t Deliver
My DirectML code included a with autocast(): block to enable mixed precision training, which uses 16-bit floats (FP16) instead of 32-bit (FP32) for faster computation. The bits refer to how the GPU stores every number during training (weights, activations, gradients): FP32 uses 4 bytes per number, FP16 uses 2 bytes, half the memory.
While DirectML technically supports FP16 operations, PyTorch’s torch.cuda.amp doesn’t function with DirectML devices—the autocast context gets ignored, forcing FP32-only training. This resulted in double the memory consumption and half the speed. When I switched to CUDA, mixed precision actually worked—my batch size quadrupled from 4 to 16 and I could reduce training time from 48 hours to 16 hours per epoch.
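For reference, a minimal CUDA mixed-precision step looks roughly like the sketch below. The tiny model and dummy batch are placeholders I made up for illustration; the autocast/GradScaler pattern itself is standard PyTorch.

    import torch
    from torch import nn
    from torch.cuda.amp import autocast, GradScaler

    device = torch.device("cuda")
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(), nn.Linear(8, 10)).to(device)   # stand-in model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = GradScaler()                                   # keeps FP16 gradients from underflowing

    images = torch.randn(16, 3, 64, 64, device=device)      # dummy batch of 16
    labels = torch.randint(0, 10, (16,), device=device)

    optimizer.zero_grad()
    with autocast():                                        # forward pass runs in FP16 where safe
        loss = nn.functional.cross_entropy(model(images), labels)
    scaler.scale(loss).backward()                           # backward on the scaled loss
    scaler.step(optimizer)                                  # unscales gradients, then steps
    scaler.update()

Under DirectML the same with autocast(): block silently ran everything in FP32, which is why the memory and speed benefits never materialized for me.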
Differential Learning Rates: The Optimization DirectML Prevented
For transfer learning (starting from pretrained weights), the best practice is to use different learning rates for different parts of the network. The pretrained backbone needs gentle updates (3e-5) to preserve learned features, while the new classifier needs aggressive updates (9e-5) to adapt quickly to 5,337 plant species. This 3x difference prevents catastrophic forgetting while enabling rapid specialization.
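In PyTorch this is just two parameter groups on the optimizer. The sketch below uses torchvision’s EfficientNet-B4 and assumes its features/classifier attribute layout; my actual model code differs, but the idea is the same:

    import torch
    from torchvision.models import efficientnet_b4

    model = efficientnet_b4(weights="IMAGENET1K_V1")   # pretrained ImageNet backbone
    model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 5337)  # new head

    optimizer = torch.optim.AdamW([
        {"params": model.features.parameters(),   "lr": 3e-5},   # gentle updates, preserve features
        {"params": model.classifier.parameters(), "lr": 9e-5},   # 3x higher, adapt the new classifier
    ])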
DirectML’s memory overhead and slower execution made this impractical. I was stuck with a single learning rate across all layers—an unhappy compromise.
Research shows differential learning rates provide 2-3% accuracy gains in transfer learning. I also used the CosineAnnealingLR scheduler (smooth learning rate decay) instead of simple step decay, which should have been worth another 1-2% improvement.
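The scheduler is a one-liner on top of whatever optimizer you build. Here is a minimal, self-contained sketch with a stand-in parameter so it runs on its own:

    import torch
    from torch.optim.lr_scheduler import CosineAnnealingLR

    params = [torch.nn.Parameter(torch.zeros(1))]         # stand-in for the model's parameters
    optimizer = torch.optim.AdamW(params, lr=9e-5)
    scheduler = CosineAnnealingLR(optimizer, T_max=15)     # smooth cosine decay over the 15 epochs

    for epoch in range(15):
        # ... one full training pass would go here ...
        scheduler.step()                                   # anneal the learning rate once per epoch
        print(epoch, scheduler.get_last_lr())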
The Compound Effect: Why an ~8-Point Improvement Made Sense
CUDA removed all of these bottlenecks at once. Batch size 16 stabilized BatchNorm. Real mixed precision doubled memory efficiency. Differential learning rates optimized each layer appropriately. Training that ran 3x faster enabled proper convergence. Result: 80.2% accuracy after 15 epochs.
What This Means for ML Execution
Batch size is architectural, not just a speed setting. Modern networks that lean heavily on BatchNorm (ResNet, EfficientNet, MobileNet) implicitly assume training will use sizable batches. In my experience, small batches don’t just slow training; they break the learning dynamics.
My AMD 5700 XT taught me the fundamentals and validated my approach. I’m grateful for that experience. But the CUDA switch revealed my model’s true potential.
For ML practitioners: if your model plateaus at “good but not great” accuracy and you’re using DirectML, the bottleneck might not be your architecture or data; it might be your training conditions. Moving from 72.3% to 80.2% wasn’t a groundbreaking innovation; it came from removing infrastructure bottlenecks. For Botanify, this hardware switch was the difference between an interesting demo and a reliable product users can trust.
Technical details: EfficientNet-B4 backbone (~18M params), 5,337-class classifier (~10M params), total ~28M parameters. Trained on 1.6M images across iNaturalist, PlantNet, and houseplant datasets.
