What is ANOVA - Analysis of Variance & Why is it so useful?

Note: You need to know some statistics to understand the rest of the article.

An ANOVA test is a way to find out whether survey or experiment results are significant. In other words, it helps you figure out whether to reject the null hypothesis in favor of the alternative hypothesis, or fail to reject it.

You can skip the reading and jump straight into the video where this concept is very nicely explained.

First, consider the question "What is the probability that two samples come from the same population?" This is answered with the Z and t distributions.

The Z and t distributions look very similar, except that the t distribution has fatter tails. As the sample size grows, the t distribution converges to the Z distribution.
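To make the comparison between the two distributions concrete, here is a small Python sketch using only the standard library. The helper t_pdf and the choice of x = 3 are my own illustrations, not something from the video; it simply evaluates both densities at a point far out in the tail for a few degrees of freedom.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Work in logs (lgamma) so large df values do not overflow.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


z = NormalDist()  # standard normal: mean 0, standard deviation 1

x = 3.0  # an illustrative point in the tail
for df in (2, 5, 30, 1000):
    # The t density in the tail shrinks toward the Z density as df grows.
    print(df, t_pdf(x, df), z.pdf(x))
```

With few degrees of freedom the t density at x = 3 is several times larger than the Z density, and the gap closes as the degrees of freedom increase.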

Now, let's turn to the F-distribution, which is used to answer the question "What is the probability that two samples come from populations that have the same variance?"

It can also answer the question "What is the probability that three or more samples come from the same population?"

This is where analysis of variance, or ANOVA, comes in.

Normally our null hypothesis would look like: H0: μA = μB = μC

We could test each pair:

H0: μA = μB, α = 0.05
H0: μA = μC, α = 0.05
H0: μB = μC, α = 0.05

But there is a problem with this approach: our overall confidence drops. We have 3 null hypotheses, each tested at a significance level of 0.05, so each test has a 95% chance of avoiding a false positive. To get the chance that all 3 tests avoid one, we multiply the confidence levels together, and the result does not look good:

.95  x  .95  x  .95 = .857
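The same arithmetic can be sketched in a few lines of Python. The variable names are mine, and the calculation assumes the three tests are independent, which is only an approximation when the pairs share groups.

```python
import math

alpha = 0.05   # significance level used for each pairwise test
groups = 3     # groups A, B and C

# Number of distinct pairs we would have to test.
n_tests = math.comb(groups, 2)

# Chance that every test avoids a false positive (assuming independence).
overall_confidence = (1 - alpha) ** n_tests

# Chance of at least one false positive across all the tests.
familywise_error = 1 - overall_confidence

print(n_tests, round(overall_confidence, 3), round(familywise_error, 3))
```

So instead of a 5% chance of a false positive, the three pairwise tests together carry roughly a 14% chance, and it gets worse as the number of groups grows.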

This is what ANOVA solves: it compares all the groups in a single test instead of many pairwise ones. We compute an F value, and then we compare it to a critical value determined by our degrees of freedom (which depend on the number of groups and the number of items in each group). And that is why ANOVA is important.

Basically, we are testing the groups to check if there is any difference between them. Some examples:

  1. You are A/B testing different versions of a page: one with a Submit button, one with a Learn More button and one with a Contact Us button

  2. A group of psychiatric patients are trying 3 different therapies: counseling, medication and biofeedback, and you want to see whether one of the therapies works better than the others

  3. You want to know whether one party or another will win an election

  4. Students from different colleges/schools take the same exam, and you want to see if one college outperforms the others

So the formula for the F value becomes

F = (Variance between groups) / (Variance within groups) = (SSG / df groups) / (SSE / df error)

SSG = Sum of Squares Groups

SSE = Sum of Squares Error

df groups = degrees of freedom (groups)

df error = degrees of freedom (error)
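As a rough illustration of the formula, here is a minimal Python sketch for a one-way ANOVA. The function name one_way_f and the three tiny groups of numbers are made up for this example; the degrees of freedom follow the usual convention (number of groups minus 1, and total observations minus number of groups).

```python
from statistics import mean


def one_way_f(groups):
    """Return (SSG, SSE, df groups, df error, F) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    # SSG: how far each group mean sits from the grand mean,
    # weighted by the number of items in the group.
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    # SSE: how far each value sits from its own group mean.
    sse = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_groups = len(groups) - 1
    df_error = len(all_values) - len(groups)

    f_value = (ssg / df_groups) / (sse / df_error)
    return ssg, sse, df_groups, df_error, f_value


# Hypothetical data: three groups of three measurements each.
group_a = [4, 5, 6]
group_b = [6, 7, 8]
group_c = [8, 9, 10]

ssg, sse, df_groups, df_error, f_value = one_way_f([group_a, group_b, group_c])
print(ssg, sse, df_groups, df_error, f_value)
```

For this made-up data SSG is 24, SSE is 6, and F works out to 12 with 2 and 6 degrees of freedom. A standard F table gives a critical value of about 5.14 for those degrees of freedom at α = 0.05, so here we would reject the null hypothesis that all three group means are equal.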

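ANOVA tests are built around independent variables (IVs), and they come in one-way and two-way versions.

A one-way ANOVA has one independent variable with 2 or more levels (for example, the age group of the people in a sample)

A two-way ANOVA has two independent variables (for example, age and sex in a sample). With more independent variables (age, sex, BMI, race and so on), the same idea extends to a factorial, or N-way, ANOVA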

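What do we mean by groups and levels?

Levels are the different values that a single independent variable can take, and a group is the set of observations that share one level. In the therapy example above, therapy is the one independent variable, and counseling, medication and biofeedback are its three levels; the patients receiving each therapy form a group. A patient's age, sex and race, on the other hand, are separate variables, each with its own levels. There is also a nested ANOVA concept for groups that have a hierarchical structure, for example when studying genetic similarities between family members.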
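One way to picture levels and groups is as a mapping from each level to its observations. The sketch below reuses the therapy example from earlier in the article; the scores are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical (therapy, outcome score) pairs, one per patient.
observations = [
    ("counseling", 12),
    ("medication", 15),
    ("biofeedback", 9),
    ("counseling", 14),
    ("medication", 17),
    ("biofeedback", 11),
]

# Each distinct therapy is a level; the scores collected under it form a group.
groups = defaultdict(list)
for therapy, score in observations:
    groups[therapy].append(score)

levels = sorted(groups)
for level in levels:
    print(level, groups[level])
```

Here there is one independent variable (therapy) with three levels, and each level has a group of two scores. These groups are what the F value compares.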

Here is a sample document that explains how to calculate ANOVA. In another article I talk more about ANOVA's usage, statistical inference using data, and testing our hypothesis.