Claude Opus 4.6 Told Me to Give Up. I Asked It to Do More Research Instead

A Q1 challenge + pushing back on Claude

I am happy to announce the release of Botanify’s new plant identification model, now expanded to identify 60% more plant species than before (from roughly 5,100 to nearly 8,000 classes) with ~80% top-1 accuracy. The full release notes are here: https://botanify.io/2026/04/05/botanify-q1-2026-release/

The truth is, getting here was far harder than the first training run. I assumed expanding the dataset would be relatively straightforward: rerun the training scripts that had worked before, feed in more data, and iterate. After all, I had a solid GPU, the NVIDIA RTX 3060 Ti that had been instrumental in training the original 5,100-species model. In practice, accuracy for the expanded dataset plateaued stubbornly at 75%, even after exhausting a long list of training strategies. At one point, Claude Opus 4.6 told me several times to just accept it, that 75% was “good enough” for an 8,000-class dataset given feature complexity. I refused.

The lesson I eventually took from this experience wasn’t about enhancing a specific algorithm or dataset. It was about something more fundamental: understanding the problem space before writing a single line of training code.

The Problem

Botanify’s original model used EfficientNet-B4 — a convolutional neural network — as its visual backbone. Trained on iNaturalist, PlantNet, and houseplant datasets, it achieved 80.2% top-1 accuracy across 5,071 species. When I set out to expand coverage to nearly 8,000 species by integrating PlantCLEF and Singapore’s NParks data, I assumed the path forward was clear: more data, more training, incrementally better results. I was wrong.

New species stubbornly plateaued around 65% accuracy, while the model simultaneously began forgetting species it had previously identified well — a phenomenon known as catastrophic forgetting. What followed were six or seven training iterations spread across months, each attempting a different fix: knowledge distillation, data ratio tuning, partial layer freezing. Some approaches showed marginal gains. None broke the 75% ceiling.
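Of those fixes, knowledge distillation deserves a word of explanation: it penalises the new model for drifting away from the old model's softened predictions on previously learned species, which is why it is a standard countermeasure to catastrophic forgetting. A minimal sketch of the loss in plain Python (the temperature T and the T² scaling follow Hinton et al.'s formulation; the actual values I used are not shown here):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing the "dark knowledge" in the teacher's non-top classes.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between the old (teacher) model's softened
    predictions and the new (student) model's, scaled by T**2 as in
    Hinton et al. Added to the usual cross-entropy on new data, it
    discourages forgetting of previously learned species."""
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (T ** 2) * kl
```

When student and teacher agree exactly the loss is zero; any drift on old species is penalised in proportion to how confident the teacher was.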

The Expensive Detour

Midway through, on Claude's advice, I pivoted to ArcFace, a metric learning technique originally developed for face recognition that replaces a traditional softmax classifier with an embedding-based approach. In theory it was a compelling fit: no softmax competition between classes, better handling of visually similar species, and the ability to add new species without retraining. In practice, top-1 accuracy dropped from 75% to 62.9%. Further iterations with larger backbones, higher-dimensional embeddings, and multi-reference galleries moved the needle by only a few percentage points.
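For context, ArcFace's core trick is an additive angular margin: the logit for the true class is computed from cos(θ + m) instead of cos(θ), where θ is the angle between the image embedding and the class weight vector, forcing same-species embeddings to cluster tightly. A toy sketch (the margin 0.5 and scale 64 are the common defaults from the ArcFace paper, not necessarily what my runs used):

```python
import math

def arcface_logit(cos_theta, is_target, margin=0.5, scale=64.0):
    """Additive angular margin logit. For the true class, the angle
    between embedding and class weight is penalised by `margin`
    before being scaled, so the model must pull same-class embeddings
    closer to overcome the handicap. Non-target logits are unchanged."""
    if is_target:
        theta = math.acos(max(-1.0, min(1.0, cos_theta)))
        return scale * math.cos(theta + margin)
    return scale * cos_theta
```

The margin makes the target logit strictly harder to win, which is exactly what sharpens decision boundaries between near-identical species in theory, and what made the optimisation so much harder in practice.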

The diagnosis, which came too late, was clear: the bottleneck was never the training strategy. It was the backbone itself. EfficientNet-B4 was pretrained on ImageNet — a dataset of everyday objects. Distinguishing Vitis californica from Vitis arizonica requires a model that has spent millions of training steps learning what makes plants visually distinct at the finest grain. EfficientNet hadn’t. No amount of loss function tuning could compensate for features that were never there to begin with.

What Research Would Have Revealed

I was close to giving up when I came across a message a tech lead had shared in one of my work Teams channels — a lesson he had learned working with LLMs. His advice was simple: do thorough research before committing to an approach. With LLMs, he pointed out, conducting research is cheap and fast — which makes it inexcusable to skip.

Andrew Ng has made a similar point repeatedly, emphasising that strong fundamentals and a clear problem diagnosis matter more than jumping to implementation. In machine learning, that means knowing not just how to train a model, but which architecture is appropriate for the scale and nature of your task.

Had I done that research upfront, I would have found DINOv2 — Meta’s Vision Transformer pretrained on 142 million images using self-supervised learning — and more specifically, weights fine-tuned on 7,806 plant species from the PlantCLEF 2024 benchmark, available publicly on Hugging Face. I would also have found that every top-performing entry in the PlantCLEF competition used Vision Transformers, not EfficientNets. That evidence alone would have redirected three months of work.

[Screenshots: Claude Opus telling me to give up — and, later, suggesting DINOv2]

The Result

With DINOv2’s backbone frozen and a simple linear classification layer trained on top of its 768-dimensional feature representations, accuracy jumped to 79.8% top-1 across all 7,918 species — surpassing the EfficientNet baseline by nearly five percentage points. Each training epoch took 45 seconds, down from several hours. When the model’s confidence score reaches 90% or above, it is correct 96.5% of the time.
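This setup is a standard linear probe: each image is pushed through the frozen backbone exactly once, the 768-dimensional features are cached, and only a small softmax layer is trained on top — which is why epochs dropped to seconds. A self-contained NumPy sketch of the head (the tiny gradient-descent loop and hyperparameters here are illustrative, not Botanify's actual training configuration):

```python
import numpy as np

def train_linear_head(feats, labels, n_classes, lr=0.1, epochs=200):
    """Train a softmax linear head on frozen, precomputed backbone
    features (the linear-probe setup). feats: (N, D) array of cached
    embeddings; labels: (N,) integer class ids."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(feats)  # softmax cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(feats, W, b):
    return (feats @ W + b).argmax(axis=1)
```

Because the expensive part (feature extraction) happens once, the only thing each epoch touches is a D × C weight matrix — cheap even at 7,918 classes.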

The architectural change also solved the underlying expansion problem. Because DINOv2’s backbone stays frozen, adding new species now requires only extracting features for the new images — seconds per image — and retraining the lightweight classification head, a process that takes minutes rather than weeks.
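To make the expansion property concrete, here is a toy nearest-class-mean head over cached features. (This is illustrative only — the production model retrains a linear layer rather than using class means — but it shows why a frozen backbone makes adding a species cheap: you only touch cached embeddings, never the backbone.)

```python
import numpy as np

class FrozenFeatureIndex:
    """Toy nearest-class-mean classifier over frozen backbone features.
    Adding a species is just averaging its cached embeddings; the
    backbone is never retrained."""

    def __init__(self):
        self.means = {}  # species name -> unit-normalised mean embedding

    def add_species(self, name, feats):
        # feats: (N, D) cached embeddings for the new species' images.
        v = feats.mean(axis=0)
        self.means[name] = v / np.linalg.norm(v)

    def predict(self, feat):
        # Return the species whose mean embedding is most similar.
        f = feat / np.linalg.norm(feat)
        return max(self.means, key=lambda name: float(self.means[name] @ f))
```

Whether the head is a linear layer or class means, the economics are the same: new species cost seconds of feature extraction plus minutes of head training, instead of weeks of backbone fine-tuning.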

The Takeaway

The instinct to start building quickly is understandable. But speed at the start of a training loop is not the same as efficiency. The months spent across those iterations were not entirely wasted — they produced a clear diagnosis and a hard-won understanding of where the real constraints lay. But that understanding could have come from research rather than trial and error.

For anyone building ML products: before your next training run, spend a day understanding what practitioners in your domain are actually using. The answer is often already out there — and all that’s needed is to adapt it to your specific use case.
