Awesome Data Science

A curated list of tools, libraries, platforms, datasets, workflows, and learning resources for data science, spanning data collection, analysis, visualization, machine learning, and production analytics.

Contents

Foundations & References
Programming Languages
Data Wrangling & ETL
Exploratory Data Analysis
Visualization
Statistics & Probability
Machine Learning
Deep Learning
Time Series & Forecasting
Big Data & Distributed Computing
Databases & Storage
MLOps & Production
Notebooks & Experimentation
Datasets & Open Data
Learning Resources
Related Awesome Lists

Foundations & References

Data Science Stack Exchange – Community Q&A covering theory, tools, and practice.
arXiv Data Science – Open-access preprints across statistics, ML, and data analysis.
CRISP-DM – Widely used process model for data mining projects.
KDnuggets – News, tutorials, and opinions in data science and analytics.
Towards Data Science – Popular publication with practical data science articles.

Programming Languages

Python – Dominant language for data science, ML, and scientific computing.
R – Statistical computing language with rich analysis packages.
Julia – High-performance language for numerical and scientific computing.
Scala – JVM language commonly used with Apache Spark.
SQL – Query language essential for data extraction and analysis.

Data Wrangling & ETL

pandas – Core Python library for data manipulation and analysis.
Polars – Fast DataFrame library optimized for performance.
Apache Airflow – Workflow orchestration platform for data pipelines.
dbt – Transformation tool for analytics engineering workflows.
Apache NiFi – Visual tool for automating data flows between systems.
Talend Open Studio – Open-source ETL and integration platform.

Exploratory Data Analysis

ydata-profiling – Automated EDA reports for pandas DataFrames.
Sweetviz – Visual EDA tool for quick dataset inspection.
PandasGUI – Interactive GUI for exploring pandas DataFrames.
D-Tale – Interactive visualizer for pandas and NumPy data.

Visualization

Matplotlib – Foundational plotting library for Python.
Seaborn – Statistical data visualization built on Matplotlib.
Plotly – Interactive plotting library for dashboards and notebooks.
Altair – Declarative statistical visualization library for Python.
Tableau – Business intelligence and data visualization platform.
Power BI – Analytics and visualization service by Microsoft.

Statistics & Probability

SciPy Stats – Statistical functions for scientific computing.
Statsmodels – Statistical modeling and hypothesis testing in Python.
Stan – Probabilistic programming for Bayesian inference.
PyMC – Bayesian statistical modeling and probabilistic ML.
scikit-posthocs – Post-hoc statistical tests for Python.

Machine Learning

scikit-learn – Core ML library for classical algorithms in Python.
XGBoost – Gradient boosting library for structured data.
LightGBM – Fast gradient boosting framework by Microsoft.
CatBoost – Gradient boosting with strong categorical feature support.
MLflow – Platform for tracking experiments and managing ML lifecycles.

Deep Learning

TensorFlow – End-to-end deep learning framework.
PyTorch – Popular deep learning library with dynamic computation graphs.
Keras – High-level neural networks API.
Hugging Face Transformers – Pretrained models for NLP and multimodal tasks.
fastai – High-level library simplifying deep learning workflows, built on PyTorch.

Time Series & Forecasting

statsforecast – Fast statistical forecasting library.
Prophet – Time series forecasting tool for business use cases.
GluonTS – Probabilistic time series modeling by AWS.
Darts – Python library for easy time series forecasting.
tslearn – Machine learning toolkit for time series data.

Big Data & Distributed Computing

Apache Spark – Distributed data processing engine.
Apache Hadoop – Framework for distributed storage and processing.
Dask – Parallel computing library that scales Python workflows.
Ray – Distributed computing framework for ML and Python apps.
Apache Flink – Stream and batch processing framework.

Databases & Storage

PostgreSQL – Advanced open-source relational database.
MySQL – Popular relational database system.
MongoDB – NoSQL document-oriented database.
DuckDB – In-process analytical SQL database.
BigQuery – Serverless data warehouse on GCP.
Snowflake – Cloud-native data warehouse platform.

MLOps & Production

Kubeflow – Kubernetes-native ML workflows.
Seldon Core – Model deployment and inference on Kubernetes.
BentoML – Framework for serving ML models in production.
Weights & Biases – Experiment tracking and model monitoring.
Evidently – Data and model monitoring for ML systems.

Notebooks & Experimentation

Jupyter – Interactive notebooks for data analysis and ML.
Google Colab – Cloud-hosted Jupyter notebooks with free, usage-limited GPU access.
Kaggle Notebooks – Collaborative notebooks with datasets and competitions.
VS Code Notebooks – Notebook support inside VS Code.

Datasets & Open Data

Kaggle Datasets – Public datasets for data science projects.
UCI ML Repository – Classic datasets for ML research.
OpenML – Open datasets and benchmarks for ML experiments.
Google Dataset Search – Search engine for public datasets.
World Bank Open Data – Global development and economic data.

Learning Resources

Tutorials

Kaggle Learn – Hands-on micro-courses in data science.
DataCamp – Interactive courses for data science skills.
Coursera Data Science – University-backed data science programs.

Guides

Python Data Science Handbook – Comprehensive guide to Python data tools.
Google ML Crash Course – Practical intro to ML concepts.
The Data Science Lifecycle – Overview of end-to-end data science workflows.

Courses

Applied Data Science – End-to-end data analysis and modeling.
Machine Learning Engineering – Production ML systems and MLOps.
Statistics for Data Science – Probability and inference foundations.

Related Awesome Lists

Contribute

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.

Pull requests that do not adhere to the contribution guidelines may be closed.
