In statistics, quartiles are values that split a dataset into four equal parts.
When analyzing a distribution, we’re typically interested in the following quartiles:
- First Quartile (Q1): The value located at the 25th percentile
- Second Quartile (Q2): The value located at the 50th percentile
- Third Quartile (Q3): The value located at the 75th percentile
You can use the following methods to calculate the quartiles for columns in a pandas DataFrame:
Method 1: Calculate Quartiles for One Column
df['some_column'].quantile([0.25, 0.5, 0.75])
Method 2: Calculate Quartiles for Each Numeric Column
df.quantile(q=[0.25, 0.5, 0.75], axis=0, numeric_only=True)
The examples below show how to use each method in practice with this pandas DataFrame:
import pandas as pd
#create DataFrame
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                   'points': [12, 14, 14, 16, 24, 26, 28, 30, 31, 35],
                   'assists': [2, 2, 3, 3, 4, 6, 7, 8, 10, 15]})
#view DataFrame
print(df)
  team  points  assists
0    A      12        2
1    B      14        2
2    C      14        3
3    D      16        3
4    E      24        4
5    F      26        6
6    G      28        7
7    H      30        8
8    I      31       10
9    J      35       15
Example 1: Calculate Quartiles for One Column
The following code shows how to calculate the quartiles for the points column only:
#calculate quartiles for points column
df['points'].quantile([0.25, 0.5, 0.75])
0.25    14.5
0.50    25.0
0.75    29.5
Name: points, dtype: float64
From the output we can see:
- The first quartile is located at 14.5.
- The second quartile is located at 25.
- The third quartile is located at 29.5.
Knowing just these three values gives us a good idea of how the values in the points column are distributed.
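One common way to turn the quartiles into a single measure of spread is the interquartile range (IQR), the distance between Q3 and Q1. The following is a minimal sketch that assumes the same points values as in the example DataFrame; the variable names `q1`, `q3`, and `iqr` are only illustrative:

```python
import pandas as pd

# same values as the points column in the example DataFrame
points = pd.Series([12, 14, 14, 16, 24, 26, 28, 30, 31, 35], name='points')

# first and third quartiles (pandas default: linear interpolation)
q1 = points.quantile(0.25)
q3 = points.quantile(0.75)

# interquartile range: the width of the middle half of the data
iqr = q3 - q1

print(q1, q3, iqr)
```

Here the middle half of the points values spans 15 points, from 14.5 to 29.5.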
Example 2: Calculate Quartiles for Each Numeric Column
The following code shows how to calculate the quartiles for each numeric column in the DataFrame:
#calculate quartiles for each numeric column in DataFrame
df.quantile(q=[0.25, 0.5, 0.75], axis=0, numeric_only=True)
      points  assists
0.25    14.5     3.00
0.50    25.0     5.00
0.75    29.5     7.75
The output displays the quartiles for the two numeric columns in the DataFrame.
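If you also want the count, mean, and extremes next to the quartiles, the `describe()` function reports the same 25%, 50%, and 75% values for the numeric columns. The following sketch rebuilds the example DataFrame so that it runs on its own; the names `summary` and `quartile_rows` are only illustrative:

```python
import pandas as pd

# rebuild the example DataFrame
df = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                   'points': [12, 14, 14, 16, 24, 26, 28, 30, 31, 35],
                   'assists': [2, 2, 3, 3, 4, 6, 7, 8, 10, 15]})

# summary statistics for the numeric columns only (the default behavior)
summary = df.describe()

# keep only the rows that hold the quartiles
quartile_rows = summary.loc[['25%', '50%', '75%']]

print(quartile_rows)
```

The values in these three rows should match the output of `df.quantile()` shown above, since both use linear interpolation by default.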
Note that there is more than one way to calculate quartiles for a distribution.
By default, the pandas quantile() function uses linear interpolation between the two nearest data points. Refer to the pandas documentation page to see the other methods available through its interpolation argument.
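To see how much the choice of method can matter, the following sketch computes the first quartile of the points values under each option accepted by the `interpolation` argument of `Series.quantile()`. The dictionary name `q1_by_method` is only illustrative:

```python
import pandas as pd

# same values as the points column in the example DataFrame
points = pd.Series([12, 14, 14, 16, 24, 26, 28, 30, 31, 35], name='points')

# interpolation options accepted by Series.quantile()
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']

# first quartile under each option
q1_by_method = {m: points.quantile(0.25, interpolation=m) for m in methods}

for method, value in q1_by_method.items():
    print(f'{method}: {value}')
```

For this column the first quartile ranges from 14 (lower, nearest) to 16 (higher), with the default linear method giving 14.5.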
Additional Resources
The following tutorials explain how to perform other common tasks in pandas:
How to Calculate Percent Change in Pandas
How to Calculate Cumulative Percentage in Pandas
How to Calculate Percentage of Total Within Group in Pandas
I am confused about how the quantile function is calculating the first and third quartiles. As a math teacher I have always excluded the median from calculating Q1. For example, if the data = {1,2,3,4,5,6,7}, the median is 4, Q1 = 2 and Q3 = 6. If I run the same data in Python using the pandas library and the quantile function, I get Q1=2.5 and Q3=5.5.
Can you explain? I am a high school math/computer science teacher and teaching a data science class for the first time. I would like to get a grasp on what is happening so that I can explain it to my students.
Hi Raymond, the difference you’re seeing in the quartiles comes down to how the pandas `quantile()` function handles the data versus the traditional method you teach.
When you calculate quartiles by hand and exclude the median from the calculation of Q1 and Q3, you’re using what is known as the **exclusive method**. This method divides the data set into two halves and excludes the median from both halves when calculating Q1 and Q3.
However, the pandas `quantile()` function uses linear interpolation by default: when the position of the requested quantile falls between two data points, it interpolates between them, which can lead to non-integer results for Q1 and Q3. This is sometimes referred to as the **inclusive method**.
Here’s what happens in your example:
Given the data set: {1, 2, 3, 4, 5, 6, 7}
– The **median** is 4.
– For Q1 (25th percentile), pandas calculates the 0-based position as \(0.25 \times (n-1) = 0.25 \times 6 = 1.5\), which falls halfway between the 2nd and 3rd values (2 and 3), resulting in \(Q1 = 2.5\).
– For Q3 (75th percentile), pandas calculates the 0-based position as \(0.75 \times (n-1) = 0.75 \times 6 = 4.5\), which falls halfway between the 5th and 6th values (5 and 6), resulting in \(Q3 = 5.5\).
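If it helps to see the arithmetic in code, here is a hand-written sketch of linear interpolation at the 0-based position \(q \times (n-1)\). It is not the pandas source code, and `linear_quantile` is a hypothetical helper name, but it should reproduce what `quantile()` returns by default for this data set:

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Quantile by linear interpolation at the 0-based position q * (n - 1)."""
    data = sorted(values)
    pos = q * (len(data) - 1)      # fractional index into the sorted data
    lower = math.floor(pos)        # index of the value just below the position
    upper = math.ceil(pos)         # index of the value just above the position
    fraction = pos - lower         # how far the position sits between the two
    return data[lower] + fraction * (data[upper] - data[lower])

data = [1, 2, 3, 4, 5, 6, 7]

# hand calculation
manual_q1 = linear_quantile(data, 0.25)
manual_q3 = linear_quantile(data, 0.75)

# pandas default for comparison
pandas_q1 = pd.Series(data).quantile(0.25)
pandas_q3 = pd.Series(data).quantile(0.75)

print(manual_q1, pandas_q1)
print(manual_q3, pandas_q3)
```

Both approaches give 2.5 for Q1 and 5.5 for Q3.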
### Key Difference:
– **Traditional method (manual)**: Q1 = 2, Q3 = 6 (you divide the dataset excluding the median).
– **Pandas `quantile()`**: Q1 = 2.5, Q3 = 5.5 (uses linear interpolation, which can give non-integer results).
You can explain to your students that several methods of calculating quartiles exist and that pandas uses linear interpolation by default, which is also the default in other statistical software such as NumPy and R. Both methods are valid but reflect slightly different definitions of quartiles.
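For a side-by-side comparison you could show in class, here is a rough sketch of the hand method (take the median of each half, leaving the median out when the number of values is odd) next to the pandas default. `quartiles_excluding_median` is a hypothetical helper written for this comparison, not a pandas function, and it assumes at least two values:

```python
import statistics
import pandas as pd

def quartiles_excluding_median(values):
    """Q1, Q2, Q3 as medians of the lower half, whole set, and upper half.

    When the number of values is odd, the median is left out of both halves.
    Assumes at least two values.
    """
    data = sorted(values)
    half = len(data) // 2                  # size of each half
    lower_half = data[:half]               # values below the median
    upper_half = data[len(data) - half:]   # values above the median
    return (statistics.median(lower_half),
            statistics.median(data),
            statistics.median(upper_half))

data = [1, 2, 3, 4, 5, 6, 7]

# hand method taught in class
hand_q1, hand_q2, hand_q3 = quartiles_excluding_median(data)

# pandas default (linear interpolation)
pandas_quartiles = pd.Series(data).quantile([0.25, 0.5, 0.75]).tolist()

print(hand_q1, hand_q2, hand_q3)
print(pandas_quartiles)
```

For this data set the hand method gives 2, 4, and 6, while pandas gives 2.5, 4.0, and 5.5.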
Does this help clarify the difference? Let me know if you’d like further clarification for your class!