HackForge

Transfer learning that cuts AI waste: lower carbon, less compute, safer deployment.

HackForge is a from-scratch PyTorch framework for evaluating how transfer learning affects performance, carbon emissions, parameter efficiency, and deployment feasibility across both classical ML and deep CNNs.


🚀 Inspiration

AI is powerful, but it is also expensive to train, energy-intensive, and often inaccessible to teams without large compute budgets.

In many real-world settings, practitioners do not have unlimited GPU access, massive labeled datasets, or time to retrain models from scratch. At the same time, transfer learning is often treated as automatically beneficial, even though it can help, hurt, or simply save compute without improving performance.

We built HackForge to answer a practical question:

Can transfer learning make AI not just better, but greener?

Instead of making generic efficiency claims, we wanted to measure exactly:

| What we care about | Why it matters |
| --- | --- |
| Performance | Transfer learning should improve or preserve quality |
| Carbon | Lower emissions make AI more sustainable |
| Parameters | Fewer trainable parameters mean lower compute cost |
| Safety | Harmful transfer should be detected before wasting compute |

✨ What it does

HackForge is a transfer-learning sustainability benchmarking framework.

Core capabilities

| Area | What HackForge supports |
| --- | --- |
| Classical ML | Scratch baselines, regularized transfer, Bayesian transfer, domain-shift analysis, negative transfer safety gate |
| Deep Learning | ResNet50, EfficientNetB0, MobileNetV2; scratch, frozen backbone, fine-tuning, progressive unfreezing |
| Benchmarking | Low-data sweeps at 100%, 50%, 25%, and 10% of the training data |
| Reporting | CO2, runtime, trainable vs frozen parameters, official model size, edge feasibility |
| Metrics | Sensitivity, specificity, F1, ROC-AUC, confusion matrix |
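The low-data sweeps above can be illustrated with a small sketch. The helper name and seeding scheme here are illustrative, not HackForge's actual API: the idea is to draw nested, reproducible subsets at each fraction so that differences between regimes come from data volume alone.

```python
import random

def low_data_sweep(indices, fractions=(1.0, 0.5, 0.25, 0.10), seed=42):
    """Return reproducible index subsets for each low-data fraction."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    subsets = {}
    for frac in fractions:
        k = max(1, int(len(shuffled) * frac))
        # Nested subsets: the 10% split is contained in the 25% split, etc.
        subsets[frac] = sorted(shuffled[:k])
    return subsets

splits = low_data_sweep(list(range(1000)))
print({frac: len(ix) for frac, ix in splits.items()})
```

Because every fraction is cut from the same shuffled ordering under a fixed seed, each smaller regime is a strict subset of the larger ones, which keeps the sweep comparable across strategies.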

🧠 Project overview

| Scenario | Category | What we tested | Key takeaway |
| --- | --- | --- | --- |
| Housing Affordability | Classical ML | Regression under geographic shift | Transfer reduced compute and emissions while matching or improving performance |
| Health Screening | Classical ML | Classification under tumor-size shift | Bayesian transfer reduced compute while staying competitive |
| Negative Transfer Safety | Classical ML | Harmful transfer detection | Prevented wasted compute on a severely degraded transfer setup |
| Synthetic Histopathology | Deep Learning | CNN transfer, low-data behavior, carbon, deployment | Validated the benchmarking and carbon-tracking pipeline |

🌱 Sustainability impact

HackForge is built to make efficiency measurable, not anecdotal.

Classical ML results

| Task | Baseline | Transfer result | Carbon impact |
| --- | --- | --- | --- |
| Housing Affordability | Scratch: R² = 0.56 | Bayesian transfer: R² = 0.59 | CO2 dropped from 7.35e-06 kg to 4.78e-09 kg (~99.9% reduction) |
| Health Screening | Scratch: 93.52% accuracy | Bayesian transfer: 91.55% accuracy | CO2 dropped from 2.32e-06 kg to 1.22e-06 kg (~47% reduction) |
| Negative Transfer Safety | Naive transfer failed badly | Safety gate detected it and safe transfer recovered performance | Avoided wasteful compute on harmful transfer |
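The logic behind a negative-transfer safety gate can be sketched in a few lines. The function name, the probe-run framing, and the 0.05 threshold are illustrative assumptions, not HackForge's actual interface: the gate compares a cheap transfer probe against the scratch baseline and rejects transfer before full training is wasted on it.

```python
def transfer_safety_gate(scratch_score, transfer_probe_score, max_degradation=0.05):
    """Reject transfer when a cheap probe run scores far below the scratch baseline.

    Both scores use a higher-is-better metric (e.g. R², accuracy).
    Returns (accept, degradation).
    """
    degradation = round(scratch_score - transfer_probe_score, 6)
    accept = degradation <= max_degradation
    return accept, degradation

# Healthy transfer: small gap, gate accepts and full training proceeds.
print(transfer_safety_gate(0.93, 0.91))  # → (True, 0.02)
# Negative transfer: large gap, gate rejects before burning full compute.
print(transfer_safety_gate(0.93, 0.55))  # → (False, 0.38)
```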

Deep learning results

The deep learning portion is currently a synthetic proof-of-concept inspired by breast cancer histopathology classification.

It is used to validate:

| Pipeline element | Why it matters |
| --- | --- |
| CNN benchmarking | Compare transfer strategies fairly |
| NVML carbon tracking | Measure real GPU energy on CUDA |
| Parameter accounting | Show what is actually being trained |
| Low-data evaluation | Test transfer behavior when labels are scarce |
| Deployment analysis | Check whether models are edge-feasible |

Current results show:

| Finding | Current outcome |
| --- | --- |
| CO2 reduction | Frozen backbones reduced CO2 by 52–76% |
| Trainable parameter reduction | Frozen backbones reduced trainable parameters by 87–98% |
| Low-data result | On the synthetic task, scratch outperformed frozen transfer in all tested low-data regimes |
| Interpretation | The synthetic signal appears too simple to benefit from pretrained texture features |
| Next step | Apply the same pipeline to BreaKHis and PatchCamelyon |

📐 Carbon measurement

For CUDA/NVIDIA experiments, HackForge uses:

NVML Energy API on Tesla T4

This gives hardware-level GPU energy measurement, rather than relying only on rough timing estimates.

For portable settings, the framework also supports time-based estimation:

$$ \mathrm{CO_2} = P \times t \times \mathrm{PUE} \times \mathrm{CI} $$

| Symbol | Meaning |
| --- | --- |
| \(P\) | Average power draw (kW) |
| \(t\) | Training time (hours) |
| \(\mathrm{PUE}\) | Power usage effectiveness (dimensionless) |
| \(\mathrm{CI}\) | Grid carbon intensity (kg CO2 per kWh) |
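The time-based estimate is a one-liner in practice. A minimal sketch, where the defaults (PUE of 1.2, grid intensity of 0.4 kg CO2/kWh) are illustrative placeholders rather than HackForge's configured values:

```python
def estimate_co2_kg(power_watts, seconds, pue=1.2, carbon_intensity=0.4):
    """Time-based CO2 estimate: CO2 = P * t * PUE * CI.

    power_watts:      average device power draw in watts
    seconds:          training wall-clock time in seconds
    pue:              data-center power usage effectiveness (dimensionless)
    carbon_intensity: grid carbon intensity in kg CO2 per kWh
    """
    # Convert W and s into kWh, then scale by facility overhead and grid mix.
    energy_kwh = (power_watts / 1000.0) * (seconds / 3600.0)
    return energy_kwh * pue * carbon_intensity

# One hour at a 70 W average draw:
print(estimate_co2_kg(70, 3600))  # ≈ 0.0336 kg CO2
```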

🏗️ How we built it

HackForge was built as a from-scratch PyTorch framework focused on transparency, control, and reproducibility.

Engineering overview

| Component | Implementation |
| --- | --- |
| Training | Custom PyTorch loops for scratch, frozen transfer, fine-tuning, and progressive unfreezing |
| Evaluation | Multi-seed experiments, low-data sweeps, metric aggregation |
| Sustainability | NVML-based energy measurement, parameter accounting, runtime tracking |
| Analysis | Shift metrics, transfer safety checks, deployment feasibility |
| Reliability | 98 unit tests, 8 demo scripts, seeded experiments, JSON export |

Model support

| Deep learning | Classical ML |
| --- | --- |
| ResNet50 | Scratch baselines |
| EfficientNetB0 | Bayesian transfer |
| MobileNetV2 | Regularized transfer |
| TorchVision pretrained backbones | Domain-shift metrics |
| Transfer strategy benchmarking | Negative transfer safety gate |
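The frozen-backbone strategy and the parameter accounting it enables can be sketched in plain PyTorch. The helper names are illustrative, and the tiny two-layer model is a stand-in for the TorchVision backbones (e.g. ResNet50) that HackForge actually benchmarks:

```python
import torch.nn as nn

def freeze_backbone(model, head_names=("head",)):
    """Freeze every parameter except those belonging to the named head modules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(h) for h in head_names)
    return model

def parameter_report(model):
    """Account for trainable vs frozen parameters separately."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return {"trainable": trainable, "frozen": total - trainable, "total": total}

# Stand-in for a pretrained backbone plus a freshly initialized head.
model = nn.Sequential()
model.add_module("backbone", nn.Linear(512, 256))
model.add_module("head", nn.Linear(256, 2))
freeze_backbone(model)
print(parameter_report(model))
```

Progressive unfreezing follows the same pattern: later in training, selected backbone parameters get `requires_grad = True` again, and the report makes the growing trainable count visible.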

🧪 Deep learning proof-of-concept disclaimer

Important: The CNN section is a synthetic proof-of-concept, not a clinical benchmark.

| What it is | What it is not |
| --- | --- |
| Synthetic images mimicking histopathology-style structure | Not a medical claim |
| Controlled source vs target domain shift | Not a diagnostic tool |
| A benchmarking pipeline for carbon, transfer, and low-data behavior | Not a patient-level clinical evaluation |

Real next step

We plan to run the exact same pipeline on:

  • BreaKHis
  • PatchCamelyon

with patient-aware splits and real deployment-oriented evaluation.


😓 Challenges we ran into

| Challenge | What we learned |
| --- | --- |
| Transfer learning is not always better | On our synthetic CNN task, scratch outperformed frozen transfer, which forced us to make the project more rigorous and more honest |
| Avoiding overclaiming | We intentionally framed the CNN section as a proof-of-concept instead of presenting it as clinical AI |
| Parameter accounting | Official model size, experimental params, trainable params, and frozen params are all different and needed to be reported clearly |
| Carbon tracking across hardware | Supporting both NVML and time-based estimation added complexity but made the framework more portable |
| Reporting sustainability | Carbon, time, and parameter counts had to be treated as first-class outputs, not side notes |

🏆 Accomplishments that we're proud of

HackForge is more than a model demo — it is a measurement and decision framework.

| Highlight | Why we’re proud of it |
| --- | --- |
| Unified sustainability benchmarking | One framework across classical ML and deep learning |
| ~99.9% CO2 reduction in one classical transfer setting | Shows how powerful efficient transfer can be |
| 60.6% aggregate CO2 reduction in the current benchmark run | Demonstrates measurable sustainability impact |
| Negative transfer safety gate | Prevents wasteful compute before it happens |
| NVML integration | Adds hardware-level GPU energy tracking |
| 3 CNNs × 4 strategies × 4 regimes | Broad benchmarking instead of cherry-picked results |
| 98 tests + 8 demo scripts | Stronger reproducibility and engineering quality |

What matters most to us

HackForge is intentionally honest:

  • real classical ML results are presented as real
  • deep learning is clearly marked as a synthetic proof-of-concept
  • unsupported claims are intentionally avoided

📚 What we learned

We learned that sustainable AI is not just about smaller models.

It is about:

| Principle | Meaning |
| --- | --- |
| Measure energy | Efficiency should be observable, not assumed |
| Reuse useful representations | Transfer can reduce waste when it genuinely helps |
| Avoid harmful transfer | Some transfer setups cost compute without improving performance |
| Benchmark low-data behavior | Label scarcity is one of the most practical real-world constraints |
| Design for deployment | A model that cannot be deployed efficiently is harder to justify |

Trust matters more than hype.
The strongest projects are the ones where the evidence is clear.


🔭 What's next for HackForge

The next major step is turning the CNN proof-of-concept into a real benchmark.

Roadmap

| Next milestone | Goal |
| --- | --- |
| Run on BreaKHis | Evaluate transfer on real histopathology structure |
| Run on PatchCamelyon | Test the pipeline on a larger real benchmark |
| Use patient-aware splits | Make the evaluation clinically credible |
| Compare real pretrained transfer vs scratch | Validate whether real texture/shape structure produces the expected transfer gains |
| Improve reporting and visualizations | Make results easier to understand and present |
| Expand edge deployment analysis | Test feasibility on practical hospital hardware |

Long-term vision

We want HackForge to help teams answer a practical question:

Is this transfer-learning choice actually saving compute, carbon, and cost — and is it worth deploying?


💡 Why this matters

Transfer learning should not only improve model performance.

It should also help reduce:

| Waste | Benefit |
| --- | --- |
| Carbon emissions | Greener AI systems |
| Compute waste | Lower training cost |
| Retraining overhead | Faster iteration |
| Infrastructure demands | More realistic deployment in low-resource settings |

HackForge brings those tradeoffs into the open by making them measurable.

Transfer learning can make AI greener — but only if we measure it honestly.
