brandonhimpfen/awesome-data-science

Awesome Data Science

A curated list of tools, libraries, platforms, datasets, workflows, and learning resources for data science, spanning data collection, analysis, visualization, machine learning, and production analytics.

Contents

Foundations & References

Programming Languages

  • Python – Widely used general-purpose language for data science, ML, and scientific computing.
  • R – Statistical computing language with rich analysis packages.
  • Julia – High-performance language for numerical and scientific computing.
  • Scala – JVM language commonly used with Apache Spark.
  • SQL – Query language essential for data extraction and analysis.

Data Wrangling & ETL

  • pandas – Core Python library for data manipulation and analysis.
  • Polars – Fast DataFrame library written in Rust, with eager and lazy execution.
  • Apache Airflow – Workflow orchestration platform for data pipelines.
  • dbt – Transformation tool for analytics engineering workflows.
  • Apache NiFi – Visual tool for automating data flows between systems.
  • Talend Open Studio – Open-source ETL and integration platform (discontinued in 2024).

Exploratory Data Analysis

  • ydata-profiling – Automated EDA reports for pandas DataFrames.
  • Sweetviz – Visual EDA tool for quick dataset inspection.
  • PandasGUI – Interactive GUI for exploring pandas DataFrames.
  • D-Tale – Interactive visualizer for pandas and NumPy data.

Visualization

  • Matplotlib – Foundational plotting library for Python.
  • Seaborn – Statistical data visualization built on Matplotlib.
  • Plotly – Interactive plotting library for dashboards and notebooks.
  • Altair – Declarative statistical visualization library for Python.
  • Tableau – Business intelligence and data visualization platform.
  • Power BI – Analytics and visualization service by Microsoft.

Statistics & Probability

  • SciPy Stats – Statistical functions for scientific computing.
  • Statsmodels – Statistical modeling and hypothesis testing in Python.
  • Stan – Probabilistic programming for Bayesian inference.
  • PyMC – Bayesian statistical modeling and probabilistic ML.
  • scikit-posthocs – Post-hoc statistical tests for Python.

Machine Learning

  • scikit-learn – Core ML library for classical algorithms in Python.
  • XGBoost – Gradient boosting library for structured data.
  • LightGBM – Fast gradient boosting framework by Microsoft.
  • CatBoost – Gradient boosting with strong categorical feature support.
  • MLflow – Platform for tracking experiments and managing ML lifecycles.

Deep Learning

  • TensorFlow – End-to-end deep learning framework.
  • PyTorch – Popular deep learning library with dynamic computation graphs.
  • Keras – High-level neural networks API.
  • Hugging Face Transformers – Pretrained models for NLP and multimodal tasks.
  • fastai – High-level library built on PyTorch that simplifies deep learning workflows.

Time Series & Forecasting

  • statsforecast – Fast statistical forecasting library.
  • Prophet – Time series forecasting tool for business use cases.
  • GluonTS – Probabilistic time series modeling by AWS.
  • Darts – Python library for easy time series forecasting.
  • tslearn – Machine learning toolkit for time series data.

Big Data & Distributed Computing

  • Apache Spark – Distributed data processing engine.
  • Apache Hadoop – Framework for distributed storage and processing.
  • Dask – Parallel computing library that scales Python workflows.
  • Ray – Distributed computing framework for ML and Python apps.
  • Apache Flink – Stream and batch processing framework.

Databases & Storage

  • PostgreSQL – Advanced open-source relational database.
  • MySQL – Popular relational database system.
  • MongoDB – NoSQL document-oriented database.
  • DuckDB – In-process analytical SQL database.
  • BigQuery – Serverless data warehouse on Google Cloud.
  • Snowflake – Cloud-native data warehouse platform.

MLOps & Production

  • Kubeflow – Kubernetes-native platform for ML workflows.
  • Seldon Core – Model deployment and inference on Kubernetes.
  • BentoML – Framework for serving ML models in production.
  • Weights & Biases – Experiment tracking and model monitoring.
  • Evidently – Data and model monitoring for ML systems.

Notebooks & Experimentation

Datasets & Open Data

Learning Resources

Tutorials

Guides

Courses

  • Applied Data Science – End-to-end data analysis and modeling.
  • Machine Learning Engineering – Production ML systems and MLOps.
  • Statistics for Data Science – Probability and inference foundations.

Related Awesome Lists

Contribute

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.

Pull requests that do not adhere to the contribution guidelines may be closed.

License

CC0
