Future-proofing your ML workflows – Plan for scale, log and hypothesize

If building a plant identification app from scratch was tough, expanding the number of plant classes it could identify turned out to be an unexpected challenge of its own. The experience drove home the importance of future-proofing not just the app, but the machine learning (ML) workflows behind each model.

Architecture decisions matter from Day 1

Since launching Botanify.io in Nov 2025, I had planned to expand the number of plant classes the model was trained on.

The source of my new dataset was PlantCLEF. After removing plant classes with fewer than 8 images (too few to split reliably into training, validation and test sets), I had around 7,000 plant classes. When I merged these with my existing ~5,000-class dataset, 2,663 were brand new classes; the rest already existed in my dataset and would enrich the training data for those species.
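
That filtering step can be as simple as counting images per class folder and dropping anything below the threshold. A minimal sketch, assuming one folder per class (dataset_dir and the file extensions are placeholders, not my actual layout):

import os

MIN_IMAGES = 8

def filter_small_classes(dataset_dir, min_images=MIN_IMAGES):
    """Keep only class folders that contain at least min_images images."""
    kept = []
    for class_name in sorted(os.listdir(dataset_dir)):
        class_dir = os.path.join(dataset_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        num_images = len([
            f for f in os.listdir(class_dir)
            if f.lower().endswith(('.jpg', '.jpeg', '.png'))
        ])
        if num_images >= min_images:
            kept.append(class_name)
    return kept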

Keeping the datasets separate wasn’t a great architectural decision

When I first trained my plant identification model, I kept species from each dataset separate: if Abies alba appeared in the iNaturalist dataset, the PlantNet copy became Abies_alba_2 rather than being merged into a single class. At the time, it simply seemed easier to manage and to track errors that way.

In hindsight, merging them into a single class would have been more logical for scale, especially given plans to include more datasets.

The world has about 200K plant classes after all, and 5K barely scratches the surface.

New mapping and indexing issues

Using Claude Code, I was able to merge the plant classes easily while still retaining the train, validation and test folder locations from each dataset.
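
Conceptually, the merge boils down to building a unified, sorted species list while remembering which dataset folders feed each class. The sketch below is an illustration with hypothetical folder names, not the script Claude Code produced:

def merge_class_mappings(dataset_dirs):
    """Build a unified class list from several datasets, keeping track of
    which dataset folders contribute images to each species."""
    species_to_dirs = {}
    for dataset_root, class_names in dataset_dirs.items():
        for class_name in class_names:
            # Strip the _2 suffix used for duplicate species across datasets
            base = class_name.rsplit('_', 1)[0] if class_name.endswith('_2') else class_name
            species_to_dirs.setdefault(base, []).append(f"{dataset_root}/{class_name}")

    unified_classes = sorted(species_to_dirs)
    class_to_index = {name: idx for idx, name in enumerate(unified_classes)}
    return class_to_index, species_to_dirs

# Example input (hypothetical folder names):
# dataset_dirs = {
#     "inaturalist/train": ["Abies_alba", "Aloe_vera"],
#     "plantclef/train": ["Abies_alba_2", "Acacia_confusa"],
# }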

The final count of unique classes was 7,734, and I started training from the previous best model, which had reached 80.2% accuracy.

Unfortunately, by Epoch 4 of training, the old classes were showing only 4.5% accuracy, and it was clear that there was an issue.

Essentially, the class indices had shifted: the model was looking up positions in the new index space while its weights still encoded the old one. It was like a neuron that knew what a monstera looked like returning to its old index, only to find aloe vera sitting there instead.
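
In concrete terms, the mismatch looks like this (toy class lists, purely for illustration):

old_classes = ['Monstera_deliciosa', 'Rosa_canina']               # monstera was index 0
new_classes = ['Aloe_vera', 'Monstera_deliciosa', 'Rosa_canina']  # index 0 is now aloe vera

# Without remapping, the classifier row trained to fire for monstera
# is now read as the score for aloe vera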

To resolve this, I ran a script that mapped old indices to new indices based on species names, then used the mapping to write a new checkpoint with the pretrained weights in their new positions.

def create_index_mapping(old_mapping, new_mapping):
    """Map old indices to new indices based on species names.

    old_mapping: ordered list of class names from the old checkpoint
                 (position = old class index).
    new_mapping: dict of class name -> index in the merged class list.
    """
    old_to_new = {}

    for old_idx, old_species in enumerate(old_mapping):
        # Handle duplicates: Acacia_confusa_2 → Acacia_confusa
        base_species = old_species.rsplit('_', 1)[0] if old_species.endswith('_2') else old_species

        if base_species in new_mapping:
            old_to_new[old_idx] = new_mapping[base_species]

    return old_to_new

# Transfer pretrained weights and biases into their new positions
for old_idx, new_idx in old_to_new.items():
    new_classifier_weight[new_idx] = old_classifier_weight[old_idx]
    new_classifier_bias[new_idx] = old_classifier_bias[old_idx]
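
With toy inputs (illustrative species names), the mapping comes out as expected:

old_mapping = ['Abies_alba', 'Acacia_confusa_2', 'Aloe_vera']         # position = old index
new_mapping = {'Abies_alba': 0, 'Acacia_confusa': 1, 'Aloe_vera': 2}

old_to_new = create_index_mapping(old_mapping, new_mapping)
# {0: 0, 1: 1, 2: 2}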

However, the problem persisted. I asked Claude Code to come up with several hypotheses about what might be wrong. Examples included weight decay being applied to frozen parameters (which can keep shrinking weights even when their gradients are zero) and a logit scale mismatch between the old and new classes.
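
To rule out the weight decay hypothesis, one simple check is to make sure frozen parameters never reach the optimizer in the first place. A generic PyTorch sketch (the toy model below is a stand-in, not my actual architecture):

import torch
import torch.nn as nn

# Toy stand-in for a frozen backbone plus a trainable classifier head
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 7734))

# Freeze everything except the final classifier layer
for p in model[0].parameters():
    p.requires_grad = False

# Only hand trainable parameters to the optimizer, so weight decay
# cannot silently shrink the frozen weights between epochs
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-2)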

Understanding logits and why scale matters

Logits are the raw, unnormalized scores a model produces for each class: before any normalization, they represent the model's belief that a given class matches the input image. Softmax then converts these raw scores into probabilities (between 0 and 1, summing to 1), with higher logits receiving higher probabilities.
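
A quick numeric example shows why the scale matters (illustrative values, plain PyTorch):

import torch

# One strongly negative "old class" logit next to near-zero "new class" logits
logits = torch.tensor([-149.32, 0.007, 0.01, -0.02])
probs = torch.softmax(logits, dim=0)
print(probs)
# ≈ tensor([0.0000, 0.3360, 0.3370, 0.3270])
# The class with the very negative logit receives essentially zero probability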

So I ran a diagnostic script and found that the logits for the old classes were significantly lower than those for the new classes.

Old logits mean: -149.32 | New logits mean: 0.007
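
The diagnostic only needs the model's raw outputs and the list of old-class indices. A minimal sketch, not the exact script I ran:

import torch

def compare_logit_scales(logits, old_class_indices):
    """Compare mean logits for old vs. new classes.

    logits: (batch_size, num_classes) model outputs before softmax.
    old_class_indices: indices of classes present in the old checkpoint.
    """
    old_mask = torch.zeros(logits.shape[1], dtype=torch.bool)
    old_mask[old_class_indices] = True
    old_mean = logits[:, old_mask].mean().item()
    new_mean = logits[:, ~old_mask].mean().item()
    print(f"Old logits mean: {old_mean:.2f} | New logits mean: {new_mean:.3f}")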

As a result, the new classes were consistently being prioritized over the old classes, which had been trained extensively during the first run.

Claude explained why this happens:

  • Initialization: Classifiers typically start with small random weights (mean ~0)
  • Early training: Logits are small, softmax spreads probability somewhat evenly
  • Later training: As the model learns, it pushes correct class logits higher and incorrect class logits lower
  • Final state: With thousands of classes, most logits end up strongly negative (high confidence against wrong classes), with only the correct class logit being positive or less negative

The fix: Matching statistical properties

To address this, I re-initialized the new classifier's weights to match the statistical properties of the old ones:

def initialize_new_classifier_properly(old_classifier, new_classifier):
    """Initialize the new classifier to match the old classifier's statistics."""
    # Match weight statistics (use .item() to pass plain floats to normal_)
    old_weight_mean = old_classifier.weight.mean().item()
    old_weight_std = old_classifier.weight.std().item()

    new_classifier.weight.data.normal_(old_weight_mean, old_weight_std)

    # Critical: Give new classes a DISADVANTAGE initially
    # Set their biases 5 points lower than the old-class average
    old_bias_mean = old_classifier.bias.mean().item()
    new_classifier.bias.data.fill_(old_bias_mean - 5.0)

New classes now started with a handicap. The goal was to have pretrained old classes dominate predictions initially, then gradually let new classes compete as they learn.
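
Putting the pieces together, the overall flow looked roughly like this. The layer sizes are placeholders, and old_to_new is the species-name mapping built earlier:

import torch
import torch.nn as nn

feature_dim, num_old, num_new = 1024, 5000, 7734      # placeholder sizes
old_classifier = nn.Linear(feature_dim, num_old)       # in practice, loaded from the old checkpoint
new_classifier = nn.Linear(feature_dim, num_new)

# 1. Match the old classifier's statistics and handicap the new classes
initialize_new_classifier_properly(old_classifier, new_classifier)

# 2. Copy pretrained rows into their new positions via the index mapping
with torch.no_grad():
    for old_idx, new_idx in old_to_new.items():
        new_classifier.weight[new_idx] = old_classifier.weight[old_idx]
        new_classifier.bias[new_idx] = old_classifier.bias[old_idx]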

With this fix, I restarted training:

Epoch 1: Overall=59.30%, Old=74.74%, New=16.39%

Training now showed 74.74% on the old classes: the pretrained knowledge was finally preserved. The new classes started at around 16% and would improve over the course of training.

Key learnings

1/ Architectural decisions at the start are key. Think carefully about whether and when you plan to scale, rather than just training a model and stopping there. In my experience, simpler workflows, such as a single merged set of class names, make model expansion much easier.

2/ Hypothesize – when things don’t work, systematically work through hypotheses. In my case, I asked Claude Code to propose hypotheses and then ran diagnostics to confirm or rule out each one.

3/ Check the model outputs when in doubt. In this case, inspecting the raw logits pointed straight at the problem.

Better-equipped tech teams might simply retrain from scratch with extensive GPU resources. That said, resource constraints forced me to deeply understand transfer learning, an education no amount of compute could replace.
