
Image by Author | ChatGPT
Descriptive statistics summarize and describe the main features of a dataset without making inferences about the entire population. From understanding central tendencies to identifying data quality issues, descriptive statistics form the foundation of any data analysis project.
In this article, we’ll explore 10 Python one-liners that demonstrate different approaches to descriptive statistics, progressing from basic pandas operations to specialized statistical libraries. Each example uses the arabica coffee quality dataset, which contains chemical measurements and quality scores for over 1,300 coffee samples.
Getting Started
We’ll start with pandas, the go-to library for data manipulation in Python, and introduce other libraries as needed throughout our analysis.
# Start with pandas
import pandas as pd

# Load the arabica coffee quality dataset
url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
df = pd.read_csv(url)
This loads a dataset with 1,311 rows and 44 columns, including measurements like Total Cup Points, Aroma, Flavor, and Processing Method.
1. Basic Summary
df['Total.Cup.Points'].describe()
Output:
count    1311.000000
mean       82.115927
std         3.515761
min         0.000000
25%        81.170000
50%        82.500000
75%        83.670000
max        90.580000
This provides a quick overview of the distribution including count, mean, standard deviation, and key percentiles. For all numeric columns, you can use df.describe() instead. Note that the 50% value is the median, making this function particularly valuable for understanding both central tendency and spread. The output also includes the 25th and 75th percentiles (first and third quartiles), which help identify the interquartile range and potential outliers.
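The quartiles from describe() translate directly into the standard 1.5 × IQR rule for flagging outliers. Here is a minimal sketch using a small made-up series of scores (the values are illustrative, not from the coffee dataset):

```python
import pandas as pd

# Hypothetical scores standing in for df['Total.Cup.Points']
scores = pd.Series([80.0, 81.2, 82.5, 83.7, 84.0, 90.6, 60.0])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
outliers = scores[(scores < lower) | (scores > upper)]
```

The same pattern applied to the real column would surface the suspicious 0.0 minimum that describe() reports.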
2. Extended Statistical Moments
from scipy import stats
stats.describe(df['Total.Cup.Points'])._asdict()
Output:
{'nobs': 1311,
'minmax': (np.float64(0.0), np.float64(90.58)),
'mean': np.float64(82.11592677345537),
'variance': np.float64(12.360573167152864),
'skewness': np.float64(-10.51756590695268),
'kurtosis': np.float64(228.37864259146895)}
SciPy’s describe function adds valuable insights about distribution shape through skewness and kurtosis, which aren’t available in pandas’ describe method. The extreme negative skewness (-10.5) indicates a heavy left tail, suggesting some unusually low coffee scores that could represent defective beans or data entry errors. The exceptionally high kurtosis (228) reveals a distribution with extremely heavy tails, indicating that while most coffees cluster around the mean, there are notable outliers at both ends of the quality spectrum.
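To see how a single extreme value can produce skewness of this magnitude, here is a small simulation (synthetic data, not the coffee scores): a roughly normal sample with one zero appended, mimicking a defective row.

```python
import numpy as np
from scipy import stats

# Simulated scores: ~N(82, 2) plus one extreme low value
rng = np.random.default_rng(0)
scores = np.append(rng.normal(82, 2, 500), 0.0)

skew_all = stats.skew(scores)               # strongly negative
skew_trimmed = stats.skew(scores[scores > 50])  # near zero without the outlier
```

Dropping the single implausible value brings the skewness back toward zero, which is one reason to investigate those low rows before modeling.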
3. Grouped Analysis
df.groupby('Country.of.Origin')['Total.Cup.Points'].describe().round(2).head(5)
Output:
                   count   mean   std    min    25%    50%    75%    max
Country.of.Origin
Brazil               132  82.41  2.14  70.67  81.73  82.42  83.25  88.83
Burundi                2  81.83  2.12  80.33  81.08  81.83  82.58  83.33
China                 16  82.93  2.33  78.00  82.40  83.17  84.31  87.25
Colombia             183  83.11  1.41  72.83  82.62  83.25  83.92  86.00
Costa Rica            51  82.79  2.38  71.75  81.75  83.25  84.46  87.17
This shows how coffee quality varies by country of origin, revealing that Colombia tends to have higher scores with less variability compared to Brazil. The groupby approach automatically handles missing values and provides complete statistical summaries for each group, making it easy to identify patterns across categories. Notice how some countries like Burundi have very small sample sizes (only 2 coffees), which suggests we should be cautious about drawing strong conclusions from their statistics.
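One way to act on the small-sample caveat is to filter out groups below a minimum size before summarizing. A sketch with toy data (hypothetical column names and values, echoing the Burundi case):

```python
import pandas as pd

# Toy data: one country with only two rows, like Burundi above
df = pd.DataFrame({
    'Country': ['Brazil'] * 5 + ['Burundi'] * 2,
    'Points':  [82, 83, 81, 84, 82, 80, 83],
})

# Keep only countries with at least 5 samples, then describe
counts = df['Country'].value_counts()
big = df[df['Country'].isin(counts[counts >= 5].index)]
summary = big.groupby('Country')['Points'].describe()
```

On the real dataset the same filter (with a threshold of your choosing) keeps the per-country statistics trustworthy.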
4. Correlation Analysis
df[['Total.Cup.Points', 'Aroma', 'Flavor', 'Aftertaste']].corrwith(df['Total.Cup.Points']).round(3)
Output:
Total.Cup.Points    1.000
Aroma               0.797
Flavor              0.878
Aftertaste          0.866
This reveals that Flavor has the strongest correlation with overall coffee quality (0.878), followed closely by Aftertaste (0.866), among the taste attributes examined. These positive correlations make intuitive sense since flavor and aftertaste are key components of coffee quality assessment. The correlation with Aroma (0.797) is slightly lower but still strong, suggesting that while aroma contributes significantly to quality perception, the actual taste experience is more important in determining overall scores.
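If you also want to know whether a correlation is statistically significant, scipy.stats.pearsonr returns the coefficient together with a p-value. A sketch on made-up, near-linear data (column names are illustrative):

```python
import pandas as pd
from scipy import stats

# Hypothetical quality data with a strong linear relationship
df = pd.DataFrame({
    'Total':  [80, 82, 84, 86, 88, 81, 85],
    'Flavor': [7.0, 7.4, 7.8, 8.2, 8.6, 7.1, 8.1],
})

r, p = stats.pearsonr(df['Total'], df['Flavor'])
```

With 1,311 samples in the real dataset, correlations of this size would be significant at any conventional level.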
5. Comprehensive Statistical Summary
from statsmodels.stats.descriptivestats import describe
describe(df['Total.Cup.Points'])
Output:
                  Total.Cup.Points
nobs                  1.311000e+03
missing               0.000000e+00
mean                  8.211593e+01
std_err               9.709972e-02
upper_ci              8.230624e+01
lower_ci              8.192561e+01
std                   3.515761e+00
iqr                   2.500000e+00
iqr_normal            1.853253e+00
mad                   1.895843e+00
mad_normal            2.376086e+00
coef_var              4.281460e-02
range                 9.058000e+01
max                   9.058000e+01
min                   0.000000e+00
skew                 -1.051757e+01
kurtosis              2.313786e+02
jarque_bera           2.873236e+06
jarque_bera_pval      0.000000e+00
mode                  8.317000e+01
mode_freq             2.898551e-02
median                8.250000e+01
1%                    7.235500e+01
5%                    7.792000e+01
10%                   7.933000e+01
25%                   8.117000e+01
50%                   8.250000e+01
75%                   8.367000e+01
90%                   8.467000e+01
95%                   8.550000e+01
99%                   8.791100e+01
Statsmodels provides an extensive summary including confidence intervals, robust statistics, detailed percentiles, and normality tests – perfect for thorough statistical analysis. The confidence intervals for the mean (81.93-82.31) give us a range where we expect the true population mean to fall with 95% confidence. The Jarque-Bera test with a p-value of 0.000 strongly rejects the null hypothesis of normality, confirming that coffee quality scores don’t follow a normal distribution due to the extreme outliers and skewness we observed earlier.
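The confidence interval statsmodels reports can be reproduced by hand with the t-distribution: mean ± t × standard error. A minimal sketch on a small made-up sample:

```python
import numpy as np
from scipy import stats

# Hypothetical scores; any 1-D numeric array works
scores = np.array([81.2, 82.5, 83.7, 80.9, 82.0, 84.1, 81.5])

mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
```

With n = 1,311 the t multiplier is close to 1.96, which is why the statsmodels interval is so narrow around 82.1.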
6. Data Quality Overview
df.isnull().sum().sort_values(ascending=False).head(10)
Output:
Lot.Number              1041
Farm.Name                356
Mill                     310
Color                    267
Producer                 230
altitude_low_meters      227
altitude_high_meters     227
altitude_mean_meters     227
Altitude                 223
Company                  209
This quickly identifies columns with missing data, essential for understanding data completeness and deciding on cleaning strategies. The high number of missing values in Lot.Number (1041 out of 1311) suggests this field might be optional in the data collection process. Farm.Name also shows significant missingness (356 missing), which could indicate that many coffee samples come from larger processing facilities rather than individual farms.
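Raw counts are easier to judge as percentages of the dataset. Since isnull() returns booleans, chaining .mean() gives the missing fraction directly. A sketch on a toy frame (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Lot.Number': [None, '42', None, None],
    'Farm.Name':  ['A', None, 'B', 'C'],
})

# Fraction missing per column, expressed as a percentage
pct_missing = df.isnull().mean().mul(100).round(1).sort_values(ascending=False)
```

On the real data, Lot.Number works out to roughly 79% missing (1041 / 1311), which strongly suggests dropping the column rather than imputing it.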
7. Dataset Structure
df.dtypes.value_counts()
Output:
object     24
float64    16
int64       4
Name: count, dtype: int64
Understanding the distribution of data types helps optimize memory usage and choose appropriate analysis methods. With 24 object (string) columns, 16 float64 columns, and only 4 int64 columns, this dataset is predominantly categorical and continuous numeric. This distribution suggests the data captures both qualitative assessments (stored as strings) and quantitative measurements (stored as floats), typical of sensory evaluation datasets.
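Once you know the type mix, select_dtypes is the natural next step for splitting the frame into numeric and categorical views. A sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    'score':  [82.5, 83.1],          # float64
    'count':  [1, 2],                # int64
    'method': ['Washed', 'Natural'], # object
})

numeric = df.select_dtypes(include='number')  # floats and ints
text = df.select_dtypes(include='object')     # string columns
```

This split makes it easy to run df.describe() on the numeric half and value_counts() on the categorical half.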
8. Categorical Distribution
df['Processing.Method'].value_counts()
Output:
Processing.Method
Washed / Wet                 812
Natural / Dry                251
Semi-washed / Semi-pulped     56
Other                         26
Pulped natural / honey        14
Name: count, dtype: int64
This shows that Washed/Wet processing dominates the dataset (62%), followed by Natural/Dry processing (19%). The significant imbalance toward Washed/Wet processing reflects its prevalence in commercial coffee production, as this method typically produces more consistent flavor profiles. The small number of “Other” (26) and “Pulped natural/honey” (14) samples suggests these are specialty processing methods that might be worth investigating for their quality characteristics.
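The percentages quoted above come straight from value_counts with normalize=True. A sketch with a toy series (the 80/20 split is made up for illustration):

```python
import pandas as pd

methods = pd.Series(['Washed / Wet'] * 8 + ['Natural / Dry'] * 2,
                    name='Processing.Method')

# Shares as percentages instead of raw counts
shares = methods.value_counts(normalize=True).mul(100).round(1)
```

On the real column this one-liner reproduces the 62% / 19% figures cited in the text.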
9. Cross-tabulation Analysis
pd.crosstab(df['Country.of.Origin'], df['Processing.Method'], normalize='index').round(3).head(5)
Output:

Image By Author
This reveals interesting patterns: Brazil heavily favors Natural/Dry processing (67.8%), while Burundi exclusively uses Washed/Wet processing. These processing preferences often reflect regional traditions, climate conditions, and market demands. The normalize='index' parameter converts raw counts to percentages within each country, making it easier to compare processing method preferences across countries with different sample sizes.
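Since the output above appears as an image, here is a self-contained sketch of the same crosstab pattern on toy data (country and method values are made up to echo the Brazil/Burundi contrast):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Brazil', 'Brazil', 'Brazil', 'Burundi', 'Burundi'],
    'Method':  ['Natural', 'Natural', 'Washed', 'Washed', 'Washed'],
})

# normalize='index' gives row-wise proportions: each country sums to 1.0
ct = pd.crosstab(df['Country'], df['Method'], normalize='index').round(3)
```

Swapping in normalize='columns' or normalize='all' changes the denominator to columns or the whole table, respectively.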
10. Professional Summary Table
from tableone import TableOne TableOne(df, columns=['Total.Cup.Points', 'Aroma', 'Flavor', 'Aftertaste'], groupby='Processing.Method').tableone
Output:

Image By Author
TableOne creates publication-ready summary tables, showing that Pulped natural/honey processing yields the highest quality scores (82.8) despite being the smallest group with only 14 samples. The table also reveals that Natural/Dry achieves 82.4 points compared to 82.0 for Washed/Wet. The standard deviations tell another story – Pulped natural/honey shows remarkable consistency (SD=1.7), while “Other” processing methods exhibit the highest variability (SD=4.0), suggesting this category might contain very diverse or experimental techniques.
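If tableone is unavailable in your environment, a plain groupby/agg gets you most of the way there. A sketch with toy data (method names and scores are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Method': ['Washed', 'Washed', 'Natural', 'Natural', 'Honey'],
    'Points': [82.0, 82.2, 82.4, 82.6, 82.8],
})

# Count, mean, and spread per processing method
table = df.groupby('Method')['Points'].agg(['count', 'mean', 'std']).round(2)
```

This lacks TableOne's formatting niceties but reports the same core numbers, including the per-group standard deviations discussed above.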
Summary
These 10 one-liners demonstrate the progression from basic descriptive statistics to sophisticated analysis:
- Simple summaries with pandas for quick overviews
- Extended moments using SciPy for distribution insights
- Grouped analysis to compare categories
- Correlation analysis to understand relationships
- Comprehensive summaries with statsmodels for detailed statistics
- Data quality checks to assess completeness
- Structure analysis to understand variable types
- Distribution analysis for categorical variables
- Cross-tabulations to examine relationships between categories
- Professional formatting with TableOne for research-ready outputs
Starting with pandas and progressively introducing specialized libraries reflects the natural progression of data analysis projects. Each tool adds specific capabilities: SciPy for advanced statistics, statsmodels for comprehensive summaries, and TableOne for publication-ready tables.
Whether you’re exploring a new dataset or preparing analysis for stakeholders, these one-liners provide efficient ways to understand your data’s characteristics and communicate findings effectively.

Thanks Vinod,
I am using Python in Excel and got most of them to work but here are my comments that I hope you find helpful.
I cannot get Python in Excel to import data from an URL so I imported the data from github manually and created df= … by having the table on another worksheet in my Excel file. That’s FYI since it all worked but not the way you intended.
There are several problems with the data:
2 Extended Statistical Moments
I suppose this is a Python in Excel problem but this returned “dict” and not the output you suggested
3 Grouped Analysis throws up a few rows that seem to be corrupted. If I am right, you will find them very easily!
10 Professional Summary Table
I get this message: Error:Python
ModuleNotFoundError: No module named ‘tableone’
Otherwise, many thanks
Duncan
Thanks for the helpful feedback on using Python in Excel. You’ve identified a few key areas where Python-in-Excel integration currently has some friction. Here’s a breakdown of your points along with suggestions and clarification:
---
### 1. **Importing Data from URL**
You mentioned:
> *”I cannot get Python in Excel to import data from a URL…”*
**Cause:** Python in Excel currently runs in a sandboxed environment, which **restricts direct internet access** (for security reasons).
**Workaround (as you did):**
* Download the file manually.
* Load it into a worksheet.
* Use `=PY()` to read it into a DataFrame with:
```python
# Python in Excel's xl() function reads a worksheet range
# directly into a DataFrame (headers=True treats row 1 as column names)
df = xl("SheetName!A1:D100", headers=True)
```
✅ Good solution!
---
### 2. **Extended Statistical Moments returning `dict`**
You said:
> *”Returned ‘dict’ and not the output you suggested…”*
That’s expected behavior rather than a bug:
* The one-liner ends in `._asdict()`, which returns a plain Python `dict`.
* Python in Excel can only spill DataFrames (and a few other types) as tables; a `dict` is shown as an opaque object reference.
**Fix: Convert the dict to a DataFrame before outputting**
```python
import pandas as pd

moments_dict = {"mean": 4.5, "skew": 1.2, "kurtosis": 3.1}
pd.DataFrame(moments_dict.items(), columns=["Statistic", "Value"])
```
✅ This will display correctly in Excel.
---
### 3. **Grouped Analysis – Corrupted Rows**
> *”Throws up a few rows that seem to be corrupted…”*
**What to check:**
* Missing or NaN values in the grouping column.
* Unexpected data types (e.g., string in numeric columns).
* Special characters or non-UTF-8 encoding.
**Fix:** Clean the data before group analysis.
```python
# .copy() avoids SettingWithCopyWarning on the filtered frame
df_clean = df.dropna(subset=["GroupColumn", "ValueColumn"]).copy()
df_clean["GroupColumn"] = df_clean["GroupColumn"].astype(str)
```
Then re-run your groupby analysis:
“`python
df_clean.groupby(“GroupColumn”)[“ValueColumn”].mean()
“`
---
### 4. **`tableone` Module Not Found**
> *”No module named ‘tableone’…”*
**Cause:** `tableone` is not currently included in Python in Excel’s default environment.
**Options:**
* ✅ Use a manual alternative:
```python
df.groupby("Group").describe()
```
or create a summary manually with:
“`python
import numpy as np
summary = df.groupby(“Group”).agg([“mean”, “std”, “min”, “max”, “count”])
“`
* ⛔ You **cannot install extra packages** like `tableone` unless Microsoft adds them to the supported list.
**Alternative Tool:** Use the built-in `statsmodels` (available) for professional summaries.
---