To find the variance of a series or a column in a DataFrame in pandas, the easiest way is to use the pandas **var()** function.

`df["Column1"].var()`

You can also use the numpy **var()** function, but be careful as the default algorithm is different than the default pandas **var()** algorithm.

```
np.var(df["Column1"]) #Different result from default pandas function
np.var(df["Column1"],ddof=1) #Same result as default pandas function
```

When doing data analysis, the ability to compute different summary statistics, such as the mean or median of a variable, is very useful to help us understand the data. One such summary statistic which can be useful is the variance of a variable.

The variance is the average of the squared deviations from the mean.
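To make the definition concrete, here is a minimal sketch in plain Python that computes the variance by hand. It uses the same height values as the example DataFrame further down; the variable names are only illustrative. It also shows the alternative divisor of n - 1, which matters later when comparing pandas and numpy.

```python
# Height values reused from the example DataFrame in this article
heights = [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]

n = len(heights)
mean = sum(heights) / n

# Squared deviation of each value from the mean
squared_deviations = [(x - mean) ** 2 for x in heights]

# Dividing by n gives the average squared deviation (population variance)
population_variance = sum(squared_deviations) / n

# Dividing by n - 1 instead gives the sample variance
sample_variance = sum(squared_deviations) / (n - 1)

print(round(population_variance, 2))  # 75.13
print(round(sample_variance, 2))      # 90.15
```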

Finding the variance of columns or a Series using pandas is easy. We can use the pandas **var()** function to find the variance of a column of numbers.

Let’s say we have the following DataFrame.

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})
print(df)
# Output:
    Name  Weight  Height
0    Jim  160.20   50.10
1  Sally  160.20   68.94
2    Bob  209.45   71.42
3    Sue  150.35   48.56
4   Jill  187.52   59.37
5  Larry  187.52   63.42
```

To get the variance of the column “Height”, we can use the pandas **var()** function in the following Python code:

```
print(df["Height"].var())
# Output:
90.15417666666664
```
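The same **var()** function can also be called on a DataFrame to get the variance of several columns at once. The sketch below rebuilds the example DataFrame and selects the two numeric columns explicitly, which avoids any question about how the text column “Name” is handled.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})

# Select the numeric columns, then call var() on the resulting DataFrame.
# The result is a Series with one variance per column.
variances = df[["Weight", "Height"]].var()
print(variances)
```

Each entry in the result matches what you would get by calling **var()** on that column individually.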

## Calculating the Variance of a Series with numpy

We can also find the variance of a Series using the numpy **var()** function. Depending on the rest of our code, for example if the data is already stored in numpy arrays, it might be faster to use the numpy **var()** function.

Let’s say we have the same dataset as above.

To get the variance of the column “Height”, we can use the numpy **var()** function in the following Python code.

```
print(np.var(df["Height"]))
# Output:
75.12848055555554
```

As you can verify for yourself, this is a different result from the pandas **var()** function. The reason is that the default normalization differs between pandas and numpy. By default, pandas uses 1 delta degree of freedom (ddof=1): it divides the sum of squared deviations by N - 1, which gives an unbiased estimate of the variance of a hypothetical infinite population. By default, numpy uses ddof=0 and divides by N, which gives the variance of the data treated as the entire population.
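To see what the ddof setting does, the following sketch computes the sum of squared deviations once and then divides it by N and by N - 1. It uses a plain numpy array holding the same height values; the variable names are only illustrative.

```python
import numpy as np

heights = np.array([50.10, 68.94, 71.42, 48.56, 59.37, 63.42])
n = heights.size

# Sum of squared deviations from the mean
sum_sq = np.sum((heights - heights.mean()) ** 2)

# ddof=0 (numpy default): divide by n
print(np.isclose(sum_sq / n, np.var(heights)))                # True

# ddof=1 (pandas default): divide by n - 1
print(np.isclose(sum_sq / (n - 1), np.var(heights, ddof=1)))  # True
```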

To get the same variance using both numpy and pandas, you need to pass ‘ddof=1’ to the numpy **var()** function.

```
print(np.var(df["Height"]))
print(np.var(df["Height"],ddof=1))
print(df["Height"].var())
# Output:
75.12848055555554
90.15417666666664
90.15417666666664
```

As you can see above, we received the same result from the code when we pass ‘ddof=1’ to the numpy **var()** function.
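The match also works in the other direction: the pandas **var()** function accepts a ddof argument too, so passing ddof=0 should reproduce the numpy default. A minimal sketch, using a standalone Series with the same height values:

```python
import numpy as np
import pandas as pd

heights = pd.Series([50.10, 68.94, 71.42, 48.56, 59.37, 63.42], name="Height")

# pandas with ddof=0 divides by N, like the numpy default
pandas_population = heights.var(ddof=0)

# numpy default (ddof=0) on the underlying array
numpy_population = np.var(heights.to_numpy())

print(np.isclose(pandas_population, numpy_population))  # True
```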

Hopefully this article has been helpful for you to understand how to find the variance of a variable within a column or Series using pandas.
