10 Python One-Liners for Descriptive Statistics


Descriptive statistics summarize and describe the main features of a dataset without making inferences about the entire population. From understanding central tendencies to identifying data quality issues, descriptive statistics form the foundation of any data analysis project.

In this article, we’ll explore 10 Python one-liners that demonstrate different approaches to descriptive statistics, progressing from basic pandas operations to specialized statistical libraries. Each example uses the arabica coffee quality dataset, which contains chemical measurements and quality scores for over 1,300 coffee samples.

Getting Started

We’ll start with pandas, the go-to library for data manipulation in Python, and introduce other libraries as needed throughout our analysis.

 
# Start with pandas
import pandas as pd

# Load the arabica coffee quality dataset
url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
df = pd.read_csv(url)

This loads a dataset with 1,311 rows and 44 columns, including measurements like Total Cup Points, Aroma, Flavor, and Processing Method.

1. Basic Summary

 
df['Total.Cup.Points'].describe()

Output:

 
count    1311.000000
mean       82.115927
std         3.515761
min         0.000000
25%        81.170000
50%        82.500000
75%        83.670000
max        90.580000

This provides a quick overview of the distribution including count, mean, standard deviation, and key percentiles. For all numeric columns, you can use df.describe() instead. Note that the 50% row is the median, making this function valuable for understanding both central tendency and spread. The output also includes the 25th and 75th percentiles (first and third quartiles), which define the interquartile range and help flag potential outliers.
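As a quick sanity check on toy values (illustrative numbers, not the coffee dataset), the 50% row of describe() matches Series.median(), and the quartile rows give the interquartile range directly:

```python
import pandas as pd

# Toy scores standing in for Total.Cup.Points (illustrative only)
s = pd.Series([79.5, 80.1, 81.2, 82.0, 82.5, 83.7, 84.9])

summary = s.describe()
# The 50% row of describe() is exactly the median
assert summary['50%'] == s.median()
# Quartile rows give the IQR
iqr = summary['75%'] - summary['25%']
```
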

2. Extended Statistical Moments

 
from scipy import stats
stats.describe(df['Total.Cup.Points'])._asdict()

Output:

 
{'nobs': 1311,
 'minmax': (np.float64(0.0), np.float64(90.58)),
 'mean': np.float64(82.11592677345537),
 'variance': np.float64(12.360573167152864),
 'skewness': np.float64(-10.51756590695268),
 'kurtosis': np.float64(228.37864259146895)}

SciPy’s describe function adds valuable insights about distribution shape through skewness and kurtosis, which aren’t available in pandas’ describe method. The extreme negative skewness (-10.5) indicates a heavy left tail, suggesting some unusually low coffee scores that could represent defective beans or data entry errors. The exceptionally high kurtosis (228) reveals a distribution with extremely heavy tails, indicating that while most coffees cluster around the mean, there are notable outliers at both ends of the quality spectrum.
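If you prefer to stay in pandas, Series.skew() and Series.kurt() compute bias-corrected skewness and excess kurtosis; calling stats.describe with bias=False makes SciPy use the same corrected estimators, so the two agree (a small sketch on toy values, not the coffee data):

```python
import pandas as pd
from scipy import stats

# Toy scores with one low outlier to produce negative skew (illustrative only)
s = pd.Series([70.0, 81.2, 82.5, 83.7, 82.0, 82.3, 84.9])

# bias=False matches pandas' bias-corrected skew()/kurt()
sp = stats.describe(s, bias=False)
assert abs(sp.skewness - s.skew()) < 1e-6
assert abs(sp.kurtosis - s.kurt()) < 1e-6
```
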

3. Grouped Analysis

 
df.groupby('Country.of.Origin')['Total.Cup.Points'].describe().round(2).head(5)

Output:

 
                count  mean   std    min    25%    50%    75%    max
Country.of.Origin                                                    
Brazil            132  82.41  2.14  70.67  81.73  82.42  83.25  88.83
Burundi             2  81.83  2.12  80.33  81.08  81.83  82.58  83.33
China              16  82.93  2.33  78.00  82.40  83.17  84.31  87.25
Colombia          183  83.11  1.41  72.83  82.62  83.25  83.92  86.00
Costa Rica         51  82.79  2.38  71.75  81.75  83.25  84.46  87.17

This shows how coffee quality varies by country of origin, revealing that Colombia tends to have higher scores with less variability compared to Brazil. The groupby approach automatically handles missing values and provides complete statistical summaries for each group, making it easy to identify patterns across categories. Notice how some countries like Burundi have very small sample sizes (only 2 coffees), which suggests we should be cautious about drawing strong conclusions from their statistics.
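One way to act on that caution is to drop tiny groups before comparing. A minimal sketch on made-up data (hypothetical values, not the real dataset) that keeps only countries with a minimum sample size:

```python
import pandas as pd

# Toy data: two countries with several samples, one with a single sample
df = pd.DataFrame({
    'Country.of.Origin': ['Brazil'] * 4 + ['Colombia'] * 4 + ['Burundi'],
    'Total.Cup.Points':  [82.4, 81.7, 83.2, 82.0, 83.1, 83.3, 82.6, 83.9, 81.8],
})

# Keep only countries with at least 3 samples before summarizing
counts = df['Country.of.Origin'].value_counts()
keep = counts[counts >= 3].index
summary = (df[df['Country.of.Origin'].isin(keep)]
           .groupby('Country.of.Origin')['Total.Cup.Points'].describe())
```
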

4. Correlation Analysis

 
df[['Total.Cup.Points', 'Aroma', 'Flavor', 'Aftertaste']].corrwith(df['Total.Cup.Points']).round(3)

Output:

 
Total.Cup.Points    1.000
Aroma               0.797
Flavor              0.878
Aftertaste          0.866

This reveals that Flavor has the strongest correlation with overall coffee quality (0.878), followed closely by Aftertaste (0.866), among the taste attributes examined. These positive correlations make intuitive sense since flavor and aftertaste are key components of coffee quality assessment. The correlation with Aroma (0.797) is slightly lower but still strong, suggesting that while aroma contributes significantly to quality perception, the actual taste experience is more important in determining overall scores.
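To see all pairwise relationships at once rather than correlations against a single column, df.corr() builds the full Pearson correlation matrix. A sketch on toy numbers (illustrative, not the coffee data):

```python
import pandas as pd

# Toy quality scores that move roughly together (illustrative only)
df = pd.DataFrame({
    'Total.Cup.Points': [82.1, 83.5, 80.2, 84.0, 81.7],
    'Aroma':            [7.5, 7.9, 7.1, 8.0, 7.4],
    'Flavor':           [7.6, 8.0, 7.0, 8.1, 7.5],
})

corr = df.corr()  # pairwise Pearson correlations among all numeric columns
```
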

5. Comprehensive Statistical Summary

 
from statsmodels.stats.descriptivestats import describe
describe(df['Total.Cup.Points'])

Output:

 
                  Total.Cup.Points
nobs                  1.311000e+03
missing               0.000000e+00
mean                  8.211593e+01
std_err               9.709972e-02
upper_ci              8.230624e+01
lower_ci              8.192561e+01
std                   3.515761e+00
iqr                   2.500000e+00
iqr_normal            1.853253e+00
mad                   1.895843e+00
mad_normal            2.376086e+00
coef_var              4.281460e-02
range                 9.058000e+01
max                   9.058000e+01
min                   0.000000e+00
skew                 -1.051757e+01
kurtosis              2.313786e+02
jarque_bera           2.873236e+06
jarque_bera_pval      0.000000e+00
mode                  8.317000e+01
mode_freq             2.898551e-02
median                8.250000e+01
1%                    7.235500e+01
5%                    7.792000e+01
10%                   7.933000e+01
25%                   8.117000e+01
50%                   8.250000e+01
75%                   8.367000e+01
90%                   8.467000e+01
95%                   8.550000e+01
99%                   8.791100e+01

Statsmodels provides an extensive summary including confidence intervals, robust statistics, detailed percentiles, and normality tests – perfect for thorough statistical analysis. The confidence intervals for the mean (81.93-82.31) give us a range where we expect the true population mean to fall with 95% confidence. The Jarque-Bera test with a p-value of 0.000 strongly rejects the null hypothesis of normality, confirming that coffee quality scores don’t follow a normal distribution due to the extreme outliers and skewness we observed earlier.
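The confidence interval arithmetic is worth seeing spelled out: a normal-approximation 95% CI is the mean plus or minus about 1.96 standard errors, which is essentially how the upper_ci/lower_ci rows above arise. A minimal sketch on toy values (not the real scores):

```python
import numpy as np

# Toy scores (illustrative only)
scores = np.array([81.2, 82.5, 83.7, 80.1, 84.9, 82.0, 79.5, 83.3])

mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean
lower, upper = mean - 1.96 * se, mean + 1.96 * se  # approximate 95% CI
```
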

6. Data Quality Overview

 
df.isnull().sum().sort_values(ascending=False).head(10)

Output:

 
Lot.Number              1041
Farm.Name                356
Mill                     310
Color                    267
Producer                 230
altitude_low_meters      227
altitude_high_meters     227
altitude_mean_meters     227
Altitude                 223
Company                  209

This quickly identifies columns with missing data, essential for understanding data completeness and deciding on cleaning strategies. The high number of missing values in Lot.Number (1041 out of 1311) suggests this field might be optional in the data collection process. Farm.Name also shows significant missingness (356 missing), which could indicate that many coffee samples come from larger processing facilities rather than individual farms.
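Raw counts are easier to interpret as percentages of the total row count; isnull().mean() gives the missing fraction per column directly. A sketch on a tiny made-up frame (not the coffee data):

```python
import pandas as pd
import numpy as np

# Toy frame with deliberate gaps (illustrative only)
df = pd.DataFrame({
    'Lot.Number': [np.nan, np.nan, 'A1', np.nan],
    'Farm.Name':  ['x', np.nan, 'y', 'z'],
})

# Missing share per column, worst first, as a percentage
missing_pct = df.isnull().mean().sort_values(ascending=False).mul(100).round(1)
```
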

7. Dataset Structure

 
df.dtypes.value_counts()

Output:

 
object     24
float64    16
int64       4
Name: count, dtype: int64

Understanding the distribution of data types helps optimize memory usage and choose appropriate analysis methods. With 24 object (string) columns, 16 float64 columns, and only 4 int64 columns, this dataset is predominantly categorical and continuous numeric. This distribution suggests the data captures both qualitative assessments (stored as strings) and quantitative measurements (stored as floats), typical of sensory evaluation datasets.
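One practical follow-up: object columns with few distinct values can be converted to the category dtype, which stores each distinct string once and keeps small integer codes per row. A sketch on synthetic data (illustrative sizes, not the real dataset):

```python
import pandas as pd

# Toy frame: a low-cardinality string column repeated many times
df = pd.DataFrame({
    'Processing.Method': ['Washed / Wet'] * 500 + ['Natural / Dry'] * 500,
})

before = df['Processing.Method'].memory_usage(deep=True)
df['Processing.Method'] = df['Processing.Method'].astype('category')
after = df['Processing.Method'].memory_usage(deep=True)
# category stores 2 strings plus compact per-row codes
assert after < before
```
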

8. Categorical Distribution

 
df['Processing.Method'].value_counts()

Output:

 
Processing.Method
Washed / Wet                 812
Natural / Dry                251
Semi-washed / Semi-pulped     56
Other                         26
Pulped natural / honey        14
Name: count, dtype: int64

This shows that Washed/Wet processing dominates the dataset (62%), followed by Natural/Dry processing (19%). The significant imbalance toward Washed/Wet processing reflects its prevalence in commercial coffee production, as this method typically produces more consistent flavor profiles. The small number of “Other” (26) and “Pulped natural/honey” (14) samples suggests these are specialty processing methods that might be worth investigating for their quality characteristics.
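Percentages like the 62% quoted above come straight out of value_counts with normalize=True. A sketch on toy labels (illustrative counts, not the dataset's):

```python
import pandas as pd

# Toy processing-method labels (illustrative only)
methods = pd.Series(['Washed / Wet'] * 6 + ['Natural / Dry'] * 3 + ['Other'])

shares = methods.value_counts(normalize=True)  # proportions instead of raw counts
```
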

9. Cross-tabulation Analysis

 
pd.crosstab(df['Country.of.Origin'], df['Processing.Method'], normalize='index').round(3).head(5)

Output:

Coffee processing method proportions (natural, other, pulped, semi-washed, washed) for each country of origin.

This reveals interesting patterns: Brazil heavily favors Natural/Dry processing (67.8%), while Burundi exclusively uses Washed/Wet processing. These processing preferences often reflect regional traditions, climate conditions, and market demands. The normalize='index' parameter converts raw counts to percentages within each country, making it easier to compare processing method preferences across countries with different sample sizes.
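Because normalize='index' computes proportions within each row, every country's row should sum to 1, which is a handy sanity check. A sketch on toy data (illustrative rows, not the real dataset):

```python
import pandas as pd

# Toy origin/processing pairs (illustrative only)
df = pd.DataFrame({
    'Country.of.Origin': ['Brazil', 'Brazil', 'Brazil', 'Burundi', 'Burundi'],
    'Processing.Method': ['Natural / Dry', 'Natural / Dry', 'Washed / Wet',
                          'Washed / Wet', 'Washed / Wet'],
})

ct = pd.crosstab(df['Country.of.Origin'], df['Processing.Method'],
                 normalize='index')
# Proportions are computed within each country, so rows sum to 1
assert ((ct.sum(axis=1) - 1.0).abs() < 1e-9).all()
```
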

10. Professional Summary Table

 
from tableone import TableOne
TableOne(df, columns=['Total.Cup.Points', 'Aroma', 'Flavor', 'Aftertaste'], groupby='Processing.Method').tableone

Output:

Mean coffee quality scores (cup points, aroma, flavor, aftertaste) and sample counts grouped by processing method.

TableOne creates publication-ready summary tables, showing that Pulped natural/honey processing yields the highest quality scores (82.8) despite being the smallest group with only 14 samples. The table also reveals that Natural/Dry achieves 82.4 points compared to 82.0 for Washed/Wet. The standard deviations tell another story – Pulped natural/honey shows remarkable consistency (SD=1.7), while “Other” processing methods exhibit the highest variability (SD=4.0), suggesting this category might contain very diverse or experimental techniques.
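If TableOne isn't available in your environment, a plain pandas groupby/agg gets you most of the way to the same table. A sketch on toy data (illustrative values, not the dataset's):

```python
import pandas as pd

# Toy quality scores by processing method (illustrative only)
df = pd.DataFrame({
    'Processing.Method': ['Washed / Wet', 'Washed / Wet',
                          'Natural / Dry', 'Natural / Dry'],
    'Total.Cup.Points':  [82.0, 82.2, 82.3, 82.5],
    'Flavor':            [7.5, 7.6, 7.7, 7.8],
})

# Mean, spread, and sample size per group, TableOne-style
table = (df.groupby('Processing.Method')[['Total.Cup.Points', 'Flavor']]
           .agg(['mean', 'std', 'count'])
           .round(2))
```
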

Summary

These 10 one-liners demonstrate the progression from basic descriptive statistics to sophisticated analysis:

  1. Simple summaries with pandas for quick overviews
  2. Extended moments using SciPy for distribution insights
  3. Grouped analysis to compare categories
  4. Correlation analysis to understand relationships
  5. Comprehensive summaries with statsmodels for detailed statistics
  6. Data quality checks to assess completeness
  7. Structure analysis to understand variable types
  8. Distribution analysis for categorical variables
  9. Cross-tabulations to examine relationships between categories
  10. Professional formatting with TableOne for research-ready outputs

Starting with pandas and progressively introducing specialized libraries reflects the natural progression of data analysis projects. Each tool adds specific capabilities: SciPy for advanced statistics, statsmodels for comprehensive summaries, and TableOne for publication-ready tables.

Whether you’re exploring a new dataset or preparing analysis for stakeholders, these one-liners provide efficient ways to understand your data’s characteristics and communicate findings effectively.

2 Replies to “10 Python One-Liners for Descriptive Statistics”

  1. Thanks Vinod,
    I am using Python in Excel and got most of them to work but here are my comments that I hope you find helpful.
    I cannot get Python in Excel to import data from an URL so I imported the data from github manually and created df= … by having the table on another worksheet in my Excel file. That’s FYI since it all worked but not the way you intended.
    There are several problems with the data:

    2 Extended Statistical Moments
    I suppose this is a Python in Excel problem but this returned “dict” and not the output you suggested
    3 Grouped Analysis throws up a few rows that seem to be corrupted. If I am right, you will find them very easily!
    10 Professional Summary Table
    I get this message: Error:Python
    ModuleNotFoundError: No module named ‘tableone’

    Otherwise, many thanks

    Duncan

    1. Thanks for the helpful feedback on using Python in Excel. You’ve identified a few key areas where Python-in-Excel integration currently has some friction. Here’s a breakdown of your points along with suggestions and clarification:

      ### 1. **Importing Data from URL**

      You mentioned:

      > *”I cannot get Python in Excel to import data from a URL…”*

      **Cause:** Python in Excel currently runs in a sandboxed environment, which **restricts direct internet access** (for security reasons).

      **Workaround (as you did):**

      * Download the file manually.
      * Load it into a worksheet.
      * Use `=PY()` to read it into a DataFrame with:

      ```python
      df = pd.read_excel(xl("SheetName")["A1:D100"])
      ```

      ✅ Good solution!

      ### 2. **Extended Statistical Moments returning `dict`**

      You said:

      > *”Returned ‘dict’ and not the output you suggested…”*

      That’s likely due to:

      * Code using `describe()` or custom functions returning a dictionary instead of a DataFrame.
      * When Python in Excel returns a `dict`, it cannot display it like a table.

      **Fix: Convert the dict to a DataFrame before outputting**

      ```python
      import pandas as pd

      moments_dict = {"mean": 4.5, "skew": 1.2, "kurtosis": 3.1}
      pd.DataFrame(moments_dict.items(), columns=["Statistic", "Value"])
      ```

      ✅ This will display correctly in Excel.

      ### 3. **Grouped Analysis – Corrupted Rows**

      > *”Throws up a few rows that seem to be corrupted…”*

      **What to check:**

      * Missing or NaN values in the grouping column.
      * Unexpected data types (e.g., string in numeric columns).
      * Special characters or non-UTF-8 encoding.

      **Fix:** Clean the data before group analysis.

      ```python
      df_clean = df.dropna(subset=["GroupColumn", "ValueColumn"])
      df_clean["GroupColumn"] = df_clean["GroupColumn"].astype(str)
      ```

      Then re-run your groupby analysis:

      ```python
      df_clean.groupby("GroupColumn")["ValueColumn"].mean()
      ```

      ### 4. **`tableone` Module Not Found**

      > *”No module named ‘tableone’…”*

      **Cause:** `tableone` is not currently included in Python in Excel’s default environment.

      **Options:**

      * ✅ Use a manual alternative:

      ```python
      df.groupby("Group").describe()
      ```

      or create a summary manually with:

      ```python
      import numpy as np
      summary = df.groupby("Group").agg(["mean", "std", "min", "max", "count"])
      ```

      * ⛔ You **cannot install extra packages** like `tableone` unless Microsoft adds them to the supported list.

      **Alternative Tool:** Use the built-in `statsmodels` (available) for professional summaries.
