Out-of-Fold Predictions for Stacking ML Models

Hello everyone, I’m Kien, an AI engineer at Ridge-i. Today, I’d like to walk you through a common issue that can arise in machine learning systems, and share an effective way to address it.

Introduction

If you are working with forecasting ML models, you may come across stacking [1]. Stacking is an ensemble learning technique in which multiple base models are trained first, and their predictions are then used as inputs to a higher-level model (called a meta-model) that produces the final prediction. The idea is that different models capture different patterns in the data, and the meta-model can learn how to best combine their outputs. However, stacking introduces a potential overfitting problem if the meta-model is trained on predictions that the base models generated on their own training data. Let's walk through a concrete example to see what the problem is.
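As a quick reference, here is a minimal stacking sketch using scikit-learn's StackingClassifier on a synthetic dataset (the data and model choices are purely illustrative). Note that StackingClassifier trains its meta-model on cross-validated base predictions by default, a detail that becomes important later in this post.

```python
# Minimal stacking sketch: two base models feed a logistic-regression meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=5,  # base-model predictions fed to the meta-model come from cross-validation
)
stack.fit(X, y)
print(stack.predict_proba(X[:3]))
```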

AI medical case study and problem definition

Imagine you are an engineer tasked with building an AI system for disease screening and action recommendation. At a high level, the workflow is straightforward: given a user’s data, the system first estimates the probability of various diseases. Then, based on those probabilities and the user’s past medical actions, it recommends the next step - such as ordering a blood test or an X-ray - along with a confidence score for each recommendation.

To train the screening model, suppose you have access to a patient history dataset (let’s call it the customer info dataset). This dataset contains information about past patients, including age, gender, symptoms, vital signs, and more. Your first task is to train a model that predicts disease probabilities from this data. That sounds simple enough: feed the demographic and clinical features into a standard machine-learning classifier and have it output the likelihood of each disease.

Next comes the action recommendation model. This second model reuses the same customer info dataset, but now it also takes as input the disease probabilities produced by the screening model, along with a dataset of historically effective medical actions. Why design the system this way? In practice, the screening model acts as a form of feature engineering, providing higher-level signals that help the recommendation model make better decisions.

You might also wonder why the recommendation model is trained on predicted disease probabilities rather than the ground-truth disease labels. The reason is simple but important: during inference, the system only has access to probabilities, not confirmed diagnoses. Training the second model on probabilities, therefore, better reflects real-world conditions and leads to more robust recommendations.
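To make the architecture concrete before we look at the workflow diagram, here is a rough sketch of the naive two-stage training pipeline. All names (customer_features, past_actions, and so on) are hypothetical placeholders rather than the actual system, and the step that generates probabilities on the training data is exactly the part we examine next.

```python
# Hypothetical two-stage pipeline sketch (all names are illustrative placeholders).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def train_pipeline(customer_features, disease_labels, past_actions, action_labels):
    # Stage 1: the screening model predicts disease probabilities from customer info.
    screening_model = GradientBoostingClassifier()
    screening_model.fit(customer_features, disease_labels)

    # Stage 2: the recommendation model consumes predicted probabilities (not
    # ground-truth labels) plus past medical actions, mirroring inference time.
    # NOTE: predicting on the same data used for training is the pitfall discussed below.
    disease_probs = screening_model.predict_proba(customer_features)
    action_inputs = np.hstack([customer_features, disease_probs, past_actions])
    action_model = LogisticRegression(max_iter=1000)
    action_model.fit(action_inputs, action_labels)
    return screening_model, action_model
```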

So far, everything seems reasonable. But this is where the interesting challenges begin. Let’s take a closer look at the workflow.

Fig 1: Stacking models example

Now, imagine you start with a single customer info dataset for training. You use this dataset to train the disease screening model. To train the action recommendation model, you then need disease probabilities, so you pass the same training data through the screening model to generate them. This is where the problem arises. When we train a screening model on a dataset, it naturally learns to minimize the loss on that exact data. The loss function directly penalizes prediction errors on those samples, pushing the model to fit them as well as possible. As a result, the predicted disease probabilities on the training data can look extremely confident. For example:

  • A negative sample might receive a very low predicted probability, such as 0.1.

  • A positive sample might receive a very high predicted probability, such as 0.9.

After that, we use those overly optimistic predicted probabilities to train the recommendation model. When we move to an inference (or unseen) dataset, the situation often changes. Even if the overall distribution is similar, small differences in feature patterns, noise, or sampling can affect the model's outputs. Instead of predicting 0.1 and 0.9, the screening model might produce more moderate probabilities such as 0.3 and 0.7. The distribution of the inputs to the recommendation model therefore shifts at inference time, hurting its performance because the probabilities it was trained on do not accurately reflect real-world behavior. The short experiment below illustrates this confidence gap.
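The snippet uses synthetic data and an illustrative random forest as the screening model; it compares how far the predicted probabilities sit from 0.5 on the training data versus held-out data, and the in-sample predictions are noticeably more confident.

```python
# Overconfident in-sample probabilities vs. more moderate held-out probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

screening = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_probs = screening.predict_proba(X_train)[:, 1]
test_probs = screening.predict_proba(X_test)[:, 1]

# Average distance from 0.5 as a crude "confidence" measure.
print("confidence on training data:", np.abs(train_probs - 0.5).mean())
print("confidence on held-out data:", np.abs(test_probs - 0.5).mean())
```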

How to solve the problem

If you have a huge dataset, one straightforward solution is to use two separate customer info datasets, one to train the screening model and another to train the action recommendation model. In practice, however, data is often limited, so this approach is not always feasible. That's when you need a different strategy. A common and effective solution is to use K-fold cross-validation. If you're not familiar with it, K-fold is typically introduced as a way to evaluate model performance [2]. In this case, however, the same idea can also be used to solve the overfitting problem [3,4].

To make the idea concrete, let's look at a simple example with 3-fold cross-validation. First, we split the training dataset K into three equal-sized folds, which we'll call K1, K2, and K3. Next, we train three separate screening models - M1, M2, and M3 - to generate the disease probabilities:

  • Model M1 is trained on K2 and K3
  • Model M2 is trained on K1 and K3
  • Model M3 is trained on K1 and K2

Each model is then used to predict disease probabilities for the fold it has never seen: M1 predicts K1, M2 predicts K2, and M3 predicts K3. Because each fold is predicted by a model that was not trained on it, the resulting probabilities are not inflated by overfitting and better reflect real inference-time behavior. These out-of-fold predictions can then be safely used as inputs for training the action recommendation model; a code sketch follows Fig 2 below.

Fig 2: Out-of-Fold Predictions example (3 folds)
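In code, the same 3-fold procedure could look like the sketch below (the feature and label names are placeholders, and the model choice is arbitrary). Each fold's probabilities are produced by a model that never saw that fold during training.

```python
# Out-of-fold disease probabilities with 3 folds (names are placeholders).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

def out_of_fold_probabilities(X, y, n_splits=3, random_state=0):
    """X, y: NumPy arrays of customer features and binary disease labels."""
    oof_probs = np.zeros(len(X))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, heldout_idx in kf.split(X):
        # Train a screening model on the other folds (e.g. M1 on K2 + K3) ...
        model = GradientBoostingClassifier()
        model.fit(X[train_idx], y[train_idx])
        # ... and predict only the fold it has never seen (e.g. K1).
        oof_probs[heldout_idx] = model.predict_proba(X[heldout_idx])[:, 1]
    return oof_probs
```

If you prefer a one-liner, scikit-learn's cross_val_predict with method="predict_proba" produces the same kind of out-of-fold predictions.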

Things to note

How many folds should you use?

In theory, even 2 folds are enough to prevent the overfitting problem. However, using fewer folds means each screening model is trained on less data, which can hurt performance. On the other hand, increasing the number of folds improves training data coverage but requires training more models, which can significantly increase computation time.

Ultimately, this is a trade-off between model quality and training cost. The best choice depends on your dataset size, model complexity, and available resources. In practice, 5-fold cross-validation is a common and well-balanced choice.

Does the input to the action model become larger?

The short answer is: No. Let's revisit the example above. Suppose we train the screening models using 3 folds. For each fold K1, K2, K3, we obtain a set of predictions - let's call them P1, P2, and P3 - where

|P1| + |P2| + |P3| = |K1| + |K2| + |K3| = |K|

These predictions (P1, P2, P3) are then combined and used as input features for the downstream action model, exactly as the recommendation model would otherwise be trained on the entire dataset K. The amount of training data is therefore unchanged.
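This is also easy to check in code: the out-of-fold predictions contain exactly one row per training sample (a small sketch with synthetic data standing in for K).

```python
# One out-of-fold prediction per training sample: |P1| + |P2| + |P3| = |K|.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=900, n_features=20, random_state=0)
oof_probs = cross_val_predict(GradientBoostingClassifier(), X, y,
                              cv=3, method="predict_proba")[:, 1]
assert oof_probs.shape[0] == X.shape[0]  # |P| equals |K|
```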

Are the K-fold screening models and the action model trained at the same time?

Since the action model uses the outputs of the screening model as its input features, it must be trained only after all the K-fold screening models are fully trained and all out-of-fold predictions have been generated.

Which screening model should be used during inference?

Unlike the standard workflow, K-fold training produces multiple screening models. This raises a natural question: which one should you use at inference time? One option is to select the model that achieved the best validation performance during training. Another widely used approach is to retrain a final screening model using the entire training dataset and deploy that model for inference. This second approach is often preferred, as it allows the model to learn from all available data while still ensuring that the downstream recommendation model is trained on predictions that are not affected by overfitting.
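In code, that second option is just one extra fit on the entire dataset after the out-of-fold predictions have been generated (again a sketch with synthetic data standing in for K).

```python
# Deploy a final screening model retrained on the entire training dataset K.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=900, n_features=20, random_state=0)  # stand-in for K

final_screening_model = GradientBoostingClassifier().fit(X, y)

# At inference time, this single model supplies the disease probabilities
# that are fed into the action recommendation model.
new_probs = final_screening_model.predict_proba(X[:5])[:, 1]
```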

Notes on screening model performance

When training screening models for each fold, it’s important to monitor their performance carefully. Ideally, all models should achieve similar metrics across folds. Large performance discrepancies may indicate instability in the training process or issues with data distribution. While strict consistency is not always mandatory, ensuring comparable performance across folds is strongly recommended to improve the robustness and reliability of the overall system.

Should we use the remaining split for validation?

In standard K-fold cross-validation, it is common to use the held-out fold for validation. However, in this scenario, that approach is not ideal. Instead, the screening model should be validated using a separate, standalone validation dataset.

The goal here is to avoid data overfitting when generating inputs for the second (action recommendation) model. If the held-out fold is also used for validation, for example, to select the best checkpoint or tune hyperparameters, the model will inevitably become biased toward that split. As a result, when the same model is later used to generate disease probabilities for that fold, those predictions may no longer reflect true inference-time behavior. Using an independent validation dataset ensures that each fold remains genuinely unseen by the screening model when producing out-of-fold predictions, preserving their realism and preventing subtle overfitting from leaking into the downstream model.
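Putting the last two notes together, a sketch of the validation setup might look like this: each fold's screening model is evaluated on a standalone validation set (never on its held-out fold), which both keeps the out-of-fold predictions unbiased and lets you compare fold-to-fold consistency. Data and model choices are illustrative.

```python
# Validate each fold model on a standalone validation set, not its held-out fold.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

val_scores = []
for train_idx, heldout_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X_tr):
    model = GradientBoostingClassifier().fit(X_tr[train_idx], y_tr[train_idx])
    # Checkpoint selection and tuning rely on the standalone validation set only;
    # the held-out fold is reserved for generating out-of-fold predictions.
    val_scores.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

print("per-fold validation AUC:", [round(s, 3) for s in val_scores])
```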

Conclusion

When building a multi-stage AI system where one model’s outputs feed into another, preventing data overfitting is critical. Training both models on the same data can produce overfitted intermediate predictions that do not reflect real inference conditions. Using K-fold to generate out-of-fold predictions provides a practical solution, especially when data is limited, by ensuring realistic inputs that do not lead to overfitting in the downstream model. Careful decisions around the number of folds, validation strategy, and final model selection help balance performance, stability, and computational cost, ultimately leading to a more robust and reliable system.

References

  1. Grandmaster Pro Tip: Winning First Place in a Kaggle Competition with Stacking Using cuML | NVIDIA Technical Blog
  2. 3.1. Cross-validation: evaluating estimator performance — scikit-learn 1.8.0 documentation
  3. Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015).
  4. Nagalapalli, S., J. Anmala, R. Kanjirappuzha, and M. Varma. "A stacking ANN ensemble model of ML models for stream water quality prediction of Godavari River Basin, India." Ecological Informatics 80 (2024): 102500. https://doi.org/10.1016/j.ecoinf.2024.102500