Overfitting. Nothing to do with clouds. This is a major hazard of model building that can affect all types of regression and, particularly, machine learning methods. It happens when you try to squeeze so many variables (actually, so many parameters, which I'll explain in a minute) into your model that it can't cope and it explodes.

Let's use an analogy. You have an empty train, and many people are waiting on the platform to get on, to take them to the airport. If a normal-sized adult or a child with a small bag gets on first, there'll still be loads of room on the train. But if a sumo wrestler gets on, there'll still be lots of room, though less than with the child. As more people board the train, the amount of free space falls, until people really have to push their way on, until no one has enough room to move and breathe. That's my daily commute in London. It's also overfitting.

It's the same with modelling. In the analogy, the train is your model and the people are your predictors. Too many predictors in the model and you're in trouble. If you just pick one predictor, then you'll have no problem if your predictor is the equivalent of an average person or a child with a small bag. There can be a problem, however, if that predictor is the equivalent of a sumo wrestler with enough luggage for a year's stay at the North Pole.

So, which types of variable are like children, and which are like sumo wrestlers with luggage? Continuous variables are like children. Why? Because they only need one parameter in the model to describe their relation with the outcome. Remember that this assumes the relation is linear; to get a curved relation, you need more parameters. Also like children are categorical variables with only two categories, for instance male or female. They also need only one parameter to describe their relation, which would be the odds for females relative to the odds for males, for instance.
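The parameter-counting rule above can be sketched in a few lines of code. This is a minimal illustration, not part of any particular statistics package, and the model specification below (blood pressure, sex, age bands) is a hypothetical example.

```python
def count_parameters(predictors):
    """Count model parameters: 1 for the intercept, 1 per continuous
    variable (assuming a linear relation), and (k - 1) per categorical
    variable with k categories, since the reference category is
    absorbed into the intercept."""
    total = 1  # intercept
    for name, kind, k in predictors:
        if kind == "continuous":
            total += 1       # one slope parameter
        else:
            total += k - 1   # one parameter per non-reference category
    return total

# Hypothetical model: blood pressure (continuous), sex (2 categories),
# and age in five-year bands (20 categories).
spec = [
    ("blood_pressure", "continuous", None),
    ("sex", "categorical", 2),
    ("age_band", "categorical", 20),
]
print(count_parameters(spec))  # 1 + 1 + 1 + 19 = 22
```

Notice how a single 20-category variable contributes 19 parameters on its own: it is the sumo wrestler of this model.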
Remember that the reference category, which is males in this example, forms part of the model intercept. The variables that take up the most room in a model, and are like sumo wrestlers, are categorical variables with lots of categories. For instance, suppose you've got age with 20 categories, each of which is a five-year age band. This would need 19 parameters plus the intercept. The patients in your dataset have to be spread among 20 categories, so you might not get very many patients in each one. That can cause the software's algorithm problems. You may need to combine some of the categories; that's the equivalent of stuffing a bag inside a hard suitcase.

So, how can you spot overfitting? Well, in the most extreme case, the software would just give up and tell you that the model has not converged. This means that the underlying algorithm that's trying to estimate all your odds ratios is unable to find the best solution. Or it would give warning messages and tell you that the algorithm did not converge. But even if it did converge, there might still be problems. So, to check, I always inspect the standard errors for the odds ratios and the size of the odds ratios themselves. Large values of either make me uneasy, but especially the standard errors, as they are also an indication of how many patients and outcomes were used to estimate the associated odds ratio. This is more often a problem for logistic regression than for linear regression, but it can happen in any type of regression. It can also happen when two or more of your predictors are highly correlated with each other. But how large is large? Well, there's no agreed cut-off for standard errors, but anything over 10, I'd say, is definitely too big, and I personally rarely accept standard errors over one. So, what can you do about overfitting? Happily, there are some simple remedies that often work.
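The standard-error check described above can be automated once you have a fitted model. This is a sketch only: the term names and values below are made up for illustration, and in practice the standard errors would come from your regression software's output. The two cut-offs reflect the rules of thumb just given (over 1 makes me uneasy; over 10 is definitely too big).

```python
def flag_large_standard_errors(std_errors, warn_at=1.0, reject_at=10.0):
    """Split model terms into those with worrying standard errors
    (above warn_at) and those with unacceptable ones (above reject_at).
    std_errors maps term name -> standard error of the log odds ratio."""
    warnings = [t for t, se in std_errors.items() if warn_at < se <= reject_at]
    rejects = [t for t, se in std_errors.items() if se > reject_at]
    return warnings, rejects

# Hypothetical standard errors from a logistic regression:
ses = {"sex_female": 0.21, "age_band_85_89": 2.7, "age_band_90_plus": 14.3}
warn, reject = flag_large_standard_errors(ses)
print(warn)    # ['age_band_85_89']
print(reject)  # ['age_band_90_plus']
```

A huge standard error on a single category, as for the oldest age band here, usually means that category holds very few patients or outcomes.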
So, if one or more of the categories in your categorical variable has large standard errors, then try combining them with another category in a way that makes sense. Also, check that your reference category isn't tiny. For instance, if you have age as 20 categories and only four people are in the under-fives, then don't use the under-fives as the reference. If those things don't work, then you'll need to drop the whole variable, and you might need to drop several variables with big standard errors.

Overfitting is a major pitfall of predictive modelling, and it happens when you try to squeeze too many predictors or too many categories into your model. Happily, simple tricks often get around it, but it's vital to try your model out on a separate set of patients whenever possible, to check that your model is robust.
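The first remedy, combining sparse categories, can be sketched as a simple recoding step. The age-band labels and counts below are hypothetical, chosen so the two oldest bands are too sparse to estimate on their own.

```python
from collections import Counter

def combine_categories(values, mapping):
    """Recode each category according to the supplied mapping,
    leaving unmapped categories unchanged."""
    return [mapping.get(v, v) for v in values]

# Hypothetical patient ages: the oldest five-year bands are very sparse.
ages = ["40-44"] * 30 + ["45-49"] * 28 + ["85-89"] * 3 + ["90+"] * 1

# Merge the two sparse bands into a single, better-populated "85+" band.
mapping = {"85-89": "85+", "90+": "85+"}

print(Counter(combine_categories(ages, mapping)))
# Counter({'40-44': 30, '45-49': 28, '85+': 4})
```

The merge should make clinical sense, of course: "85 and over" is a sensible band, whereas merging the under-fives with the over-nineties would not be.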