A curated list of tools, libraries, platforms, datasets, workflows, and learning resources for data science, spanning data collection, analysis, visualization, machine learning, and production analytics.
- Foundations & References
- Programming Languages
- Data Wrangling & ETL
- Exploratory Data Analysis
- Visualization
- Statistics & Probability
- Machine Learning
- Deep Learning
- Time Series & Forecasting
- Big Data & Distributed Computing
- Databases & Storage
- MLOps & Production
- Notebooks & Experimentation
- Datasets & Open Data
- Learning Resources
- Related Awesome Lists

## Foundations & References

- Data Science Stack Exchange – Community Q&A covering theory, tools, and practice.
- arXiv Data Science – Open-access preprints across statistics, ML, and data analysis.
- CRISP-DM – Widely used process model for data mining projects.
- KDnuggets – News, tutorials, and opinions in data science and analytics.
- Towards Data Science – Popular publication with practical data science articles.

## Programming Languages

- Python – Dominant language for data science, ML, and scientific computing.
- R – Statistical computing language with rich analysis packages.
- Julia – High-performance language for numerical and scientific computing.
- Scala – JVM language commonly used with Apache Spark.
- SQL – Query language essential for data extraction and analysis.

## Data Wrangling & ETL

- pandas – Core Python library for data manipulation and analysis.
- Polars – Fast DataFrame library with a Rust core and lazy, multi-threaded query execution.
- Apache Airflow – Workflow orchestration platform for data pipelines.
- dbt – Transformation tool for analytics engineering workflows.
- Apache NiFi – Visual tool for automating data flows between systems.
- Talend Open Studio – Open-source ETL and integration platform.

## Exploratory Data Analysis

- ydata-profiling – Automated EDA reports for pandas DataFrames.
- Sweetviz – Visual EDA tool for quick dataset inspection.
- PandasGUI – Interactive GUI for exploring pandas DataFrames.
- D-Tale – Interactive visualizer for pandas and NumPy data.

## Visualization

- Matplotlib – Foundational plotting library for Python.
- Seaborn – Statistical data visualization built on Matplotlib.
- Plotly – Interactive plotting library for dashboards and notebooks.
- Altair – Declarative statistical visualization library for Python.
- Tableau – Business intelligence and data visualization platform.
- Power BI – Analytics and visualization service by Microsoft.

## Statistics & Probability

- SciPy Stats – Statistical functions for scientific computing.
- Statsmodels – Statistical modeling and hypothesis testing in Python.
- Stan – Probabilistic programming language for Bayesian inference.
- PyMC – Bayesian statistical modeling and probabilistic ML.
- scikit-posthocs – Post-hoc statistical tests for Python.

## Machine Learning

- scikit-learn – Core ML library for classical algorithms in Python.
- XGBoost – Gradient boosting library for structured data.
- LightGBM – Fast gradient boosting framework by Microsoft.
- CatBoost – Gradient boosting with strong categorical feature support.
- MLflow – Platform for tracking experiments and managing ML lifecycles.

## Deep Learning

- TensorFlow – End-to-end deep learning framework.
- PyTorch – Popular deep learning library with dynamic computation graphs.
- Keras – High-level neural networks API.
- Hugging Face Transformers – Pretrained models for NLP and multimodal tasks.
- fastai – High-level library simplifying deep learning workflows.

## Time Series & Forecasting

- statsforecast – Fast statistical forecasting library.
- Prophet – Time series forecasting tool for business use cases.
- GluonTS – Probabilistic time series modeling by AWS.
- Darts – Python library for easy time series forecasting.
- tslearn – Machine learning toolkit for time series data.

## Big Data & Distributed Computing

- Apache Spark – Distributed data processing engine.
- Apache Hadoop – Framework for distributed storage and processing.
- Dask – Parallel computing library that scales Python workflows.
- Ray – Distributed computing framework for ML and Python apps.
- Apache Flink – Stream and batch processing framework.

## Databases & Storage

- PostgreSQL – Advanced open-source relational database.
- MySQL – Popular relational database system.
- MongoDB – NoSQL document-oriented database.
- DuckDB – In-process analytical SQL database.
- BigQuery – Serverless data warehouse on GCP.
- Snowflake – Cloud-native data warehouse platform.

## MLOps & Production

- Kubeflow – Kubernetes-native ML workflows.
- Seldon Core – Model deployment and inference on Kubernetes.
- BentoML – Framework for serving ML models in production.
- Weights & Biases – Experiment tracking and model monitoring.
- Evidently – Data and model monitoring for ML systems.

## Notebooks & Experimentation

- Jupyter – Interactive notebooks for data analysis and ML.
- Google Colab – Cloud-hosted Jupyter notebooks with free, usage-limited GPU access.
- Kaggle Notebooks – Collaborative notebooks with datasets and competitions.
- VS Code Notebooks – Jupyter notebook support inside Visual Studio Code.

## Datasets & Open Data

- Kaggle Datasets – Public datasets for data science projects.
- UCI ML Repository – Classic datasets for ML research.
- OpenML – Open datasets and benchmarks for ML experiments.
- Google Dataset Search – Search engine for public datasets.
- World Bank Open Data – Global development and economic data.

## Learning Resources

- Kaggle Learn – Hands-on micro-courses in data science.
- DataCamp – Interactive courses for data science skills.
- Coursera Data Science – University-backed data science programs.
- Python Data Science Handbook – Comprehensive guide to Python data tools.
- Google ML Crash Course – Practical intro to ML concepts.

## Related Awesome Lists

- The Data Science Lifecycle – Overview of end-to-end data science workflows.
- Applied Data Science – End-to-end data analysis and modeling.
- Machine Learning Engineering – Production ML systems and MLOps.
- Statistics for Data Science – Probability and inference foundations.

## Contributing

Contributions are welcome. Please make sure your submission follows the requirements in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not follow the contribution guidelines may be closed.