Variance and Bias

What are they and how do they affect the performance of models?

The main aim of any statistical or machine learning model is to capture the relationship between the predictors and the response variable, so that we can predict the response. This relationship can be written as:

      Y = ƒ(X) + ε

Here ƒ is the true relationship between the predictors and the response, and ε is the irreducible error term, caused by factors not captured by our predictor variables. As the name suggests, we cannot reduce the irreducible error. What we can reduce is the error caused by the difference between our estimate of this relationship, ƒ̂, and the actual relationship ƒ. This reducible error, which we can shrink to make our model better, is made up of Variance and Bias, both of which are introduced by our choice of model.
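The setup Y = ƒ(X) + ε is easy to simulate. The sketch below is a minimal illustration, assuming a hypothetical true relationship ƒ(x) = sin(x) and Gaussian noise (both my choices, not from the text); it shows that even a "perfect" model that recovers ƒ exactly still pays the irreducible error Var(ε):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)   # hypothetical true relationship (an assumption)

n = 200
noise_sd = 0.3                            # std. dev. of the irreducible error ε
x = rng.uniform(0, 2 * np.pi, n)
y = f(x) + rng.normal(0, noise_sd, n)     # Y = f(X) + ε

# Even predicting f(x) exactly leaves an error of roughly Var(ε):
mse_of_perfect_model = np.mean((y - f(x)) ** 2)
print(mse_of_perfect_model)               # close to noise_sd**2 = 0.09
```

No modelling choice can push the test error below this floor; everything we discuss next is about the part of the error that sits on top of it.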

Generally, we can write the expected test error of our estimate ƒ̂ at a point xi as:

      E(yi - ƒ̂(xi))² = Var(ƒ̂(xi)) + [Bias(ƒ̂(xi))]² + Var(ε)
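This decomposition can be checked empirically: refit the same model on many fresh training sets, then measure the spread of the predictions (variance), the systematic offset (bias), and the actual test error at a fixed point. The sketch below is illustrative only; the true relationship (sin), the model class (a degree-3 polynomial), the noise level, and the seed are all assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(x)              # hypothetical true relationship (an assumption)

noise_sd, n, trials, degree = 0.3, 50, 2000, 3
x0 = 1.5                          # fixed test point

# Refit the same model class on many independent training sets.
preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 2 * np.pi, n)
    y = f(x) + rng.normal(0, noise_sd, n)
    preds[t] = np.polyval(np.polyfit(x, y, degree), x0)

variance = preds.var()                   # spread of the estimate across training sets
bias_sq = (preds.mean() - f(x0)) ** 2    # squared systematic error at x0
noise_var = noise_sd ** 2                # irreducible Var(ε)

# Direct estimate of the expected test error at x0, for comparison.
y0 = f(x0) + rng.normal(0, noise_sd, trials)
test_mse = np.mean((y0 - preds) ** 2)
print(test_mse, variance + bias_sq + noise_var)   # the two sides nearly agree
```

Up to Monte Carlo noise, the measured test error matches the sum of the three terms, which is exactly what the equation above claims.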

So, what do these terms mean intuitively?

  • Variance: the amount by which ƒ̂ would differ if we were to train the model again using a different set of data.
  • Bias: the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

Variance is introduced when the model fits the training data too closely and ends up fitting the noise in the training data as well. Flexible non-parametric models are a good example of models prone to high variance. On the other hand, a good example of models that introduce bias are parametric models such as linear regression.
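To make the contrast concrete, the sketch below compares a deliberately rigid parametric model (a straight line) with a deliberately flexible non-parametric one (1-nearest-neighbour). The data-generating function, noise level, and seed are illustrative assumptions, not anything from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.sin(x)                       # hypothetical true relationship

n, noise_sd = 30, 0.3
x = np.sort(rng.uniform(0, 2 * np.pi, n))
y = f(x) + rng.normal(0, noise_sd, n)

x_test = np.linspace(0, 2 * np.pi, 500)
y_test = f(x_test) + rng.normal(0, noise_sd, 500)

# High-bias parametric model: a straight line cannot follow sin(x).
line = np.polyfit(x, y, 1)
lin_train = np.mean((y - np.polyval(line, x)) ** 2)
lin_test = np.mean((y_test - np.polyval(line, x_test)) ** 2)

# High-variance non-parametric model: 1-nearest-neighbour reproduces
# the training data, noise included.
def knn1(xq):
    idx = np.abs(x[None, :] - xq[:, None]).argmin(axis=1)
    return y[idx]

knn_train = np.mean((y - knn1(x)) ** 2)          # exactly 0: the noise is memorised
knn_test = np.mean((y_test - knn1(x_test)) ** 2)

print(lin_train, lin_test)   # both sizeable: the line underfits (bias)
print(knn_train, knn_test)   # zero training error, non-zero test error (variance)
```

The line misses the curvature on training and test data alike (bias), while 1-NN scores a perfect zero on the training set precisely because it has memorised its noise, which then shows up as test error (variance).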

So neither type of model, parametric or non-parametric, guarantees both low Variance and low Bias. In practice, one has to strike a balance and reduce both bias and variance as much as possible in order to minimise the total error.
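One way to see this balance is to sweep model flexibility and watch training and test error pull in different directions. The sketch below uses polynomial degree as the flexibility knob; the true relationship, noise level, degrees, and seed are all my assumptions, and a typical run shows an underfitting low degree, a well-balanced middle, and an overfitting high degree:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(3 * x)                   # hypothetical true relationship

n, noise_sd = 40, 0.3
x = rng.uniform(-1, 1, n)                  # inputs in [-1, 1] keep high-degree fits well conditioned
y = f(x) + rng.normal(0, noise_sd, n)

x_test = np.linspace(-1, 1, 1000)
y_test = f(x_test) + rng.normal(0, noise_sd, 1000)

train_mse, test_mse = {}, {}
for degree in (1, 4, 12):                  # too rigid, about right, too flexible
    coefs = np.polyfit(x, y, degree)
    train_mse[degree] = np.mean((y - np.polyval(coefs, x)) ** 2)
    test_mse[degree] = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(degree, round(train_mse[degree], 3), round(test_mse[degree], 3))
```

Training error keeps falling as the degree grows, because a more flexible polynomial can always chase the training points more closely; test error, by contrast, is lowest at an intermediate degree, which is the balance point the paragraph above describes.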