The Test-Train Split Won't Save You

In modeling, a test/train split (or cross-validation more generally) is often treated as a panacea for a poor dataset or a poor model selection procedure. I will not cite specific examples of this practice, although such a list could easily run to several dozen entries across several fields.

Instead, I will illustrate through simulation why a test/train split cannot salvage a poor pipeline.

First, some housekeeping: below is the function I use to generate data in which the predictors (\(X_{n}\)) may or may not have an underlying relationship to the response (\(y\)):

library(stats)
library(leaps)
library(dplyr)
library(caret)
library(MASS)

create_synthetic_data <- function(n = 100, p = 10,
                                  non_random_params = 2, 
                                  signal_variance = 0.8, 
                                  seed = 123) { 
  # Seed for reproducibility
  set.seed(seed)
  
  # Generate random data (n predictors)
  X <- matrix(rnorm(n * p), 
              nrow = n, ncol = p)  
  colnames(X) <- paste0("X", 1:p)               
  
  # Error handling if non_random_params > p
  if (non_random_params > p) {
    stop("'non_random_params' cannot be greater than the number of predictors (p).")
  }
  
  # Generate non-random signal
  noise <- 1 - signal_variance
  
  if (non_random_params > 0) {
    # Coefficients for the non-random predictors
    signal_coefficients <- rep(0.7, non_random_params)
    
    # Construct y: weighted sum of the signal predictors plus noise
    y <- rowSums(sweep(X[, 1:non_random_params, drop = FALSE], 2,
                       signal_coefficients, "*")) +
      rnorm(n, sd = sqrt(noise))
  } else {
    # If no non-random predictors, y is pure noise
    y <- rnorm(n, sd = 1)
  }
  
  # Convert to a data frame
  data <- data.frame(y, X)
  
  # Scale all columns (response and predictors)
  data_scaled <- as.data.frame(scale(data))
  
  return(data_scaled)
}
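As a quick sanity check (a minimal illustration, not part of the simulation itself), the generator can be run with a large n; with non_random_params = 2 and signal_variance = 0.5, only X1 and X2 should show a substantial correlation with the response:

# Illustrative check of the generator on a large sample
check_data <- create_synthetic_data(n = 10000, p = 5,
                                    non_random_params = 2,
                                    signal_variance = 0.5,
                                    seed = 1)

# Correlation of each column with the response;
# X1 and X2 should stand out, X3-X5 should sit near zero
round(cor(check_data)["y", ], 2)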

Using this simulated data, I then perform forward stepwise predictor selection, after first removing collinear predictors (pairwise correlation cutoff of 0.3). In the example below, a dataset containing 10 datapoints and 5 predictors, 2 of which are genuinely related to the response (signal variance of 0.5), is evaluated:

synthetic_data_test <- create_synthetic_data(n = 10, p = 5,
                                             non_random_params = 2,
                                             signal_variance = 0.5,
                                             seed = 456)

sim_caret_control <- train(y ~ ., 
                           data = synthetic_data_test %>%
                             dplyr::select(-c(findCorrelation(
                               cor(synthetic_data_test %>%
                                     dplyr::select(-y)),
                               cutoff = 0.3, names = TRUE))),
                           method = "leapForward", 
                           tuneGrid = data.frame(nvmax = 5),
                           trControl = 
                             trainControl(
                               method = "LGOCV",
                               p = 0.7,
                               number = 3))
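The fitted object can be inspected to see what the split-based evaluation actually reports; results, bestTune and resample are standard fields of a caret train object:

# Hold-out performance of the selected model, averaged over the three splits
sim_caret_control$results

# The nvmax value chosen (here fixed by the single-row tuneGrid)
sim_caret_control$bestTune

# Per-split metrics for the final model
sim_caret_control$resample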

Then, using a function which automates the data generation and model selection and tabulates the results, I test a range of scenarios with different ratios of sample size (N) to number of predictors (p), and with between 0 and 2 “true” predictors included. The simulation includes 100 datasets for each combination of factors:

generate_all_datasets <- function(n_values, n_p_ratios, 
                                  non_random_params_values,
                                  r2_signal_values, 
                                  num_datasets){
  
  # Initialize a list to store all datasets
  all_datasets <- list()
  
  # Initialize an iteration counter
  it <- 1
  
  # Loop over all parameter combinations
  for (n in n_values) {
    for (ratio in n_p_ratios) {
      p <- round(n / ratio)
      for (non_random_params in non_random_params_values) {
        for (r2_signal in r2_signal_values) {
          for (i in 1:num_datasets) {
            # Generate the synthetic data
            sim_data <- create_synthetic_data(n, p,
                                              non_random_params,
                                              r2_signal,
                                              seed = it)
            
            # Drop collinear predictors, then run forward stepwise
            # selection with a 70/30 hold-out repeated three times
            step_reg <- train(y ~ ., 
                              data = sim_data %>%
                                dplyr::select(-c(findCorrelation(
                                  cor(sim_data %>%
                                        dplyr::select(-y)),
                                  cutoff = 0.3, names = TRUE))),
                              method = "leapForward", 
                              tuneGrid = data.frame(
                                nvmax = n/5),
                              trControl = 
                                trainControl(
                                  method = "LGOCV",
                                  p = 0.7,
                                  number = 3))
            
            # Record metadata and results for this iteration
            output_df <- data.frame(
              iter = it,
              N = n, p = p,
              N_to_p = ratio,
              gt_param = non_random_params,
              gt_r2 = r2_signal,
              sel_mod_r2 = as.numeric(
                step_reg$results$Rsquared),
              no_pred_mod = as.numeric(
                step_reg$bestTune))
            
            # Append the dataset to the list
            all_datasets[[length(all_datasets) + 1]] <- 
              output_df
            
            # Increment the iteration counter
            it <- it + 1
          }
        }
      }
    }
  }
  
  # Combine all datasets into one large dataframe
  combined_df <- do.call(rbind, all_datasets) 
  
  return(combined_df)
}

sim_dat_df <- generate_all_datasets(n_values = seq(20, 100, by = 5),
                                    n_p_ratios = c(0.5, 1, 2),
                                    non_random_params_values = 0:2,
                                    r2_signal_values = 0.5,
                                    num_datasets = 100)

In the model selection procedure, I have restricted the algorithm to models containing at most 20% as many predictors as datapoints (the 1-in-5 rule of thumb), via the nvmax = n/5 tuning grid. In addition, a 70/30 test/train split, repeated three times (caret's "LGOCV" resampling), is employed to evaluate the selected model.
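For orientation, each row of sim_dat_df summarises one simulated dataset, with the columns defined in generate_all_datasets above:

# iter        - simulation counter (also used as the RNG seed)
# N, p        - sample size and number of candidate predictors
# N_to_p      - the N/p ratio used to set p
# gt_param    - number of "true" predictors in the generated data
# gt_r2       - signal variance passed to the generator
# sel_mod_r2  - hold-out R-squared of the selected model
# no_pred_mod - number of predictors (nvmax) in the selected model
head(sim_dat_df)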

Examining the \(R^2\) of the final model selected in each case yields the following plot:

library(ggplot2)

R2_plot <- ggplot(data = sim_dat_df,
                  aes(x = as.factor(N_to_p),
                      y = sel_mod_r2,
                      fill = as.factor(gt_param))) +
  geom_hline(yintercept = 0.7,
             alpha = 0.3) +
  geom_boxplot() +
  labs(fill = "True Predictors",
       x = "N/p Ratio",
       y = expression(paste("Model ",
                            R^2))) +
  ylim(0,1) +
  theme_classic() +
  theme(legend.position = "bottom")

print(R2_plot)
(Figure: boxplots of model \(R^2\) by N/p ratio, grouped by the number of true predictors.)

The line indicates an \(R^2\) of 0.7, which is often used as a rule-of-thumb cutoff for a model with acceptable predictive ability. The plot reveals several deficiencies of the overall process. Firstly, there is a non-negligible number of false negatives in the scenarios where at least one true predictor is present; this is not rectified even when the N/p ratio becomes more favourable. As a reminder, p here is the number of predictors which were in the original dataset and evaluated at all, NOT those in the final model.
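This false-negative rate can be tabulated directly from the simulation output; a minimal sketch using the columns defined above:

# Proportion of runs with at least one true predictor whose selected
# model still falls below the R^2 = 0.7 rule of thumb, by N/p ratio
sim_dat_df %>%
  dplyr::filter(gt_param >= 1) %>%
  dplyr::group_by(N_to_p, gt_param) %>%
  dplyr::summarise(false_negative_rate = mean(sel_mod_r2 < 0.7, na.rm = TRUE),
                   .groups = "drop")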

Now, looking at the number of predictors in the models:

predictor_plot <- ggplot(data = sim_dat_df,
                         aes(x = N,
                             y = no_pred_mod,
                             color = as.factor(gt_param))) +
  geom_point() +
  labs(color = "True Predictors",
       x = "Sample Size (N)",
       y = "Predictors in Model") +
  theme_classic() +
  theme(legend.position = "bottom")

print(predictor_plot)
(Figure: number of predictors in the selected model versus sample size (N), coloured by the number of true predictors.)

It is evident that the algorithm uses the maximum allowable number of predictors in every case. Even when no true predictors are present, it fits models to pure noise, with a low mean \(R^2\) (see the previous graph) and outliers above 0.7 roughly 0.2% of the time. The real failing, however, comes when “true” predictors are present: the procedure keeps adding predictors that are pure noise to the final model until it reaches the limit.
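Both observations can be checked numerically from sim_dat_df; a short sketch along the same lines as above:

# How often a pure-noise dataset still yields a "good-looking" model
sim_dat_df %>%
  dplyr::filter(gt_param == 0) %>%
  dplyr::summarise(false_positive_rate = mean(sel_mod_r2 > 0.7, na.rm = TRUE))

# How often the selected model uses the maximum allowed size (N/5)
sim_dat_df %>%
  dplyr::summarise(prop_at_maximum = mean(no_pred_mod == N / 5))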