What it means and how to perform it.
Hypothesis testing is one of the basic things any data scientist may need to perform. It means what it sounds like: you test a hypothesis. The computation itself is simple, especially if you are using a software package. It is, however, really important to understand which test is suitable for which scenario and what the results mean. Properly understanding the various techniques, and the reasons for performing a certain operation in hypothesis testing, requires knowledge of many statistical concepts. In this blog I will just try to describe the whole process and explain the thought process behind the methods. You can learn about the statistical methods in depth by going through the courses on descriptive and inferential statistics, which can be found on the resources page.
What it means: By performing a hypothesis test, we try to show that a given sample is significantly different from a given population. In other words, the sample does not belong to the population. To clarify this point, consider the example of a drug trial. In a drug trial, all the people affected by the disease that the drug is trying to cure form the population, and the new drug is given to a randomly selected sample from that population. The drug is considered effective only if, after its administration, the health of the people in the sample is significantly better than that of the rest of the population. In other words, the sample is significantly different from the population.
Let's look at it another way. Suppose 100 people were chosen for the drug trial described earlier. From the Central Limit Theorem, we know that the distribution of sample means will be approximately normal, provided the sample size is large enough. In other words, if we draw, say, 100000 samples of size 100 from the population of people suffering from the disease and calculate the mean of the health parameters in each sample, the distribution of these 100000 means will be bell-shaped (approximately normally distributed), like the following figure.
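To make the Central Limit Theorem claim concrete, here is a minimal simulation sketch. The population is an assumption made purely for illustration: exponentially distributed "health scores" with mean 50, chosen because it is clearly not bell-shaped. The function name `sample_means` and the default of 10000 samples (fewer than the 100000 mentioned above, to keep it fast) are also just choices for this example.

```python
import random
import statistics

def sample_means(n_samples=10_000, sample_size=100, seed=0):
    """Draw repeated samples from a skewed population and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        # Hypothetical population: exponential scores with mean 50 (not normal).
        sample = [rng.expovariate(1 / 50) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means = sample_means()
center = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the mean; check how close the sample means come.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means: {center:.2f}")
print(f"std of sample means:  {spread:.2f}")
print(f"share within 1 sd:    {within_one_sd:.3f}")
```

Even though the individual scores are heavily skewed, the sample means should cluster around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.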
The next step would be to administer the drug to the sample and then calculate the mean of the sample's health parameters. With a hypothesis test, we now try to show that the red dot representing the treated sample actually belongs to a different population (hopefully the population of healthy people) with a certain probabilistic certainty.
So what we will try to show is that the mean of the treated sample does not belong to the original population but rather to a different population, such as the second distribution shown, with a certain probabilistic certainty. If you made it this far, you already sort of understand the idea behind hypothesis testing. For the rest of the tutorial we will follow an example and understand the process and terminology related to hypothesis testing.
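The idea of "how far out does the treated sample's mean lie?" can be sketched numerically. The snippet below assumes the population mean and standard deviation are known (which makes this a z-test) and uses made-up numbers; the function name `one_tailed_p_value` is hypothetical and not part of any library.

```python
from math import sqrt
from statistics import NormalDist

def one_tailed_p_value(sample_mean, pop_mean, pop_sd, n):
    """Return the z-score of a sample mean and the probability of seeing a
    mean at least this large if the sample came from the original population."""
    standard_error = pop_sd / sqrt(n)  # spread of the distribution of sample means
    z = (sample_mean - pop_mean) / standard_error
    p = 1 - NormalDist().cdf(z)  # area in the upper tail beyond z
    return z, p

# Hypothetical numbers, for illustration only.
z, p = one_tailed_p_value(sample_mean=53.0, pop_mean=50.0, pop_sd=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small probability here means the treated sample's mean would be unusual for a sample drawn from the original population, which is the kind of evidence a hypothesis test looks for.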
St. Junior High wants to test whether a new learning technique, 'new better technique', is really effective in helping its students learn better. A rather simple way to test this would be to introduce the technique to all the students and then see if there was a significant increase in their understanding of concepts. But this would be really inefficient, as it could turn out that the technique had no effect at all. So they ask you, an aspiring data scientist, to help them out, and you remember from your statistical foundation courses that hypothesis testing is what is needed here.
But before you can get started with the analysis, you need their help with the following:
Now that you have decided upon an α level of 0.05, you need to set up your null and alternative hypotheses.
Let's now look at what I just said visually:
In a directional, or one-tailed, test, we would only reject the null hypothesis in favor of the alternative if the sample lies in the shaded region. We would do the same for a two-tailed test, except that there the 5% area is split equally between the two tails. Since in the case of St. Junior High we only care whether the new technique had a positive effect, we decide to go with a positive one-tailed test.
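The difference between the two kinds of test shows up in the critical values. The sketch below assumes the test statistic is a z-score following the standard normal distribution; the helper `reject_null` is a hypothetical name used only for this illustration.

```python
from statistics import NormalDist

ALPHA = 0.05
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Positive one-tailed test: the whole 5% rejection area sits in the upper tail.
z_crit_one_tailed = std_normal.inv_cdf(1 - ALPHA)
# Two-tailed test: the 5% is split into 2.5% in each tail.
z_crit_two_tailed = std_normal.inv_cdf(1 - ALPHA / 2)

def reject_null(z, alpha=ALPHA, two_tailed=False):
    """Return True if z falls inside the rejection region."""
    if two_tailed:
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    return z > std_normal.inv_cdf(1 - alpha)

print(f"one-tailed critical z: {z_crit_one_tailed:.3f}")
print(f"two-tailed critical z: +/-{z_crit_two_tailed:.3f}")
```

Note the trade-off: the one-tailed test has a lower cutoff in the direction we care about, but it can never flag an effect in the opposite direction, however large.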
Now that we have decided on the significance level and our null and alternative hypotheses, we just need to choose which of the following tests is best suited for this example.