HackForge
Transfer learning that cuts AI waste: lower carbon, less compute, safer deployment.
HackForge is a from-scratch PyTorch framework for evaluating how transfer learning affects performance, carbon emissions, parameter efficiency, and deployment feasibility across both classical ML and deep CNNs.
🚀 Inspiration
AI is powerful, but it is also expensive to train, energy-intensive, and often inaccessible to teams without large compute budgets.
In many real-world settings, practitioners do not have unlimited GPU access, massive labeled datasets, or time to retrain models from scratch. At the same time, transfer learning is often treated as automatically beneficial, even though it can help, hurt, or simply save compute without improving performance.
We built HackForge to answer a practical question:
Can transfer learning make AI not just better, but greener?
Instead of making generic efficiency claims, we wanted to measure exactly:
| What we care about | Why it matters |
|---|---|
| Performance | Transfer learning should improve or preserve quality |
| Carbon | Lower emissions make AI more sustainable |
| Parameters | Fewer trainable parameters mean lower compute cost |
| Safety | Harmful transfer should be detected before wasting compute |
✨ What it does
HackForge is a transfer-learning sustainability benchmarking framework.
Core capabilities
| Area | What HackForge supports |
|---|---|
| Classical ML | Scratch baselines, regularized transfer, Bayesian transfer, domain-shift analysis, negative transfer safety gate |
| Deep Learning | ResNet50, EfficientNetB0, MobileNetV2; scratch, frozen backbone, fine-tuning, progressive unfreezing |
| Benchmarking | Low-data sweeps at 100%, 50%, 25%, and 10% of the training data |
| Reporting | CO2, runtime, trainable vs frozen parameters, official model size, edge feasibility |
| Metrics | Sensitivity, specificity, F1, ROC-AUC, confusion matrix |
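The low-data sweeps above can be sketched as deterministic, seeded subsampling of the training indices, so every strategy sees the same subset at each fraction. This is a minimal illustration, not HackForge's actual API; the name `low_data_split` is hypothetical.

```python
import random

def low_data_split(indices, fraction, seed=0):
    """Return a reproducible subset of training indices for a low-data sweep."""
    rng = random.Random(seed)                 # seeded so every run sees the same subset
    k = max(1, int(len(indices) * fraction))  # keep at least one sample
    return sorted(rng.sample(indices, k))     # sorted for a stable, comparable order

# One subset per regime: 100%, 50%, 25%, 10%
train_idx = list(range(1000))
sweeps = {f: low_data_split(train_idx, f, seed=42) for f in (1.0, 0.5, 0.25, 0.10)}
```

Because the seed is fixed, scratch and transfer runs at the same fraction train on identical data, which keeps the comparison fair.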
🧠 Project overview
| Scenario | Category | What we tested | Key takeaway |
|---|---|---|---|
| Housing Affordability | Classical ML | Regression under geographic shift | Transfer reduced compute and emissions while matching or improving performance |
| Health Screening | Classical ML | Classification under tumor-size shift | Bayesian transfer reduced compute while staying competitive |
| Negative Transfer Safety | Classical ML | Harmful transfer detection | Prevented wasted compute on a severely degraded transfer setup |
| Synthetic Histopathology | Deep Learning | CNN transfer, low-data behavior, carbon, deployment | Validated the benchmarking and carbon-tracking pipeline |
🌱 Sustainability impact
HackForge is built to make efficiency measurable, not anecdotal.
Classical ML results
| Task | Baseline | Transfer result | Carbon impact |
|---|---|---|---|
| Housing Affordability | Scratch: R² = 0.56 | Bayesian transfer: R² = 0.59 | CO2 dropped from 7.35e-06 kg to 4.78e-09 kg (~99.9% reduction) |
| Health Screening | Scratch: 93.52% accuracy | Bayesian transfer: 91.55% accuracy | CO2 dropped from 2.32e-06 kg to 1.22e-06 kg (~47% reduction) |
| Negative Transfer Safety | Naive transfer failed badly | Safety gate detected it and safe transfer recovered performance | Avoided wasteful compute on harmful transfer |
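The negative-transfer safety gate in the table above can be reduced to a simple decision rule: run a cheap validation pass, and only commit full training compute to transfer if it does not degrade past the scratch baseline by more than a tolerance. This is a hedged sketch of the idea; the function name and tolerance value are illustrative, not HackForge's exact implementation.

```python
def transfer_is_safe(scratch_score, transfer_score, tolerance=0.02):
    """Gate harmful transfer: accept transfer only if its validation score is
    no more than `tolerance` below the scratch baseline.
    Scores are assumed higher-is-better (e.g. accuracy or R^2)."""
    return transfer_score >= scratch_score - tolerance

# A severely degraded transfer setup is rejected before wasting full training compute:
small_dip_ok = transfer_is_safe(0.93, 0.92)       # within tolerance -> proceed
negative_transfer = transfer_is_safe(0.93, 0.60)  # badly degraded -> gated out
```

The key design point is that the gate runs before the expensive training budget is spent, so harmful transfer costs only a cheap probe rather than a full run.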
Deep learning results
The deep learning portion is currently a synthetic proof-of-concept inspired by breast cancer histopathology classification.
It is used to validate:
| Pipeline element | Why it matters |
|---|---|
| CNN benchmarking | Compare transfer strategies fairly |
| NVML carbon tracking | Measure real GPU energy on CUDA |
| Parameter accounting | Show what is actually being trained |
| Low-data evaluation | Test transfer behavior when labels are scarce |
| Deployment analysis | Check whether models are edge-feasible |
Current results show:
| Finding | Current outcome |
|---|---|
| CO2 reduction | Frozen backbones reduced CO2 by 52–76% |
| Trainable parameter reduction | Frozen backbones reduced trainable parameters by 87–98% |
| Low-data result | On the synthetic task, scratch outperformed frozen transfer in all tested low-data regimes |
| Interpretation | The synthetic signal appears too simple to benefit from pretrained texture features |
| Next step | Apply the same pipeline to BreaKHis and PatchCamelyon |
📐 Carbon measurement
For CUDA/NVIDIA experiments, HackForge uses:
NVML Energy API on Tesla T4
This gives hardware-level GPU energy measurement, rather than relying only on rough timing estimates.
For portable settings, the framework also supports time-based estimation:
$$ CO_2 = P \times t \times PUE \times CI $$
| Symbol | Meaning |
|---|---|
| \(P\) | Average power draw (kW) |
| \(t\) | Training time (hours) |
| \(PUE\) | Power usage effectiveness of the data center (dimensionless) |
| \(CI\) | Grid carbon intensity (kg CO2 per kWh) |
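The time-based estimate above is a direct product of the four terms. A minimal sketch, assuming power in kW, time in hours, and carbon intensity in kg CO2 per kWh; the default PUE and grid-intensity values here are illustrative placeholders, not HackForge's configured values.

```python
def estimate_co2_kg(power_kw, hours, pue=1.12, carbon_intensity=0.475):
    """Time-based carbon estimate: CO2 = P * t * PUE * CI.

    power_kw:         average power draw in kW
    hours:            training wall-clock time in hours
    pue:              data-center power usage effectiveness (dimensionless)
    carbon_intensity: grid intensity in kg CO2 per kWh
    Returns estimated emissions in kg CO2.
    """
    return power_kw * hours * pue * carbon_intensity

# e.g. a ~70 W GPU training for 30 minutes:
co2 = estimate_co2_kg(power_kw=0.070, hours=0.5)
```

With consistent units the product comes out directly in kg CO2, which is why the symbol table fixes kW, hours, and kg/kWh.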
🏗️ How we built it
HackForge was built as a from-scratch PyTorch framework focused on transparency, control, and reproducibility.
Engineering overview
| Component | Implementation |
|---|---|
| Training | Custom PyTorch loops for scratch, frozen transfer, fine-tuning, and progressive unfreezing |
| Evaluation | Multi-seed experiments, low-data sweeps, metric aggregation |
| Sustainability | NVML-based energy measurement, parameter accounting, runtime tracking |
| Analysis | Shift metrics, transfer safety checks, deployment feasibility |
| Reliability | 98 unit tests, 8 demo scripts, seeded experiments, JSON export |
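The parameter accounting mentioned above comes down to splitting a model's parameters by their `requires_grad` flag. A minimal sketch, duck-typed so it works with any object exposing a PyTorch-style `parameters()` iterator (such as a `torch.nn.Module`); the function name is illustrative, not HackForge's actual API.

```python
def count_params(model):
    """Split a model's parameter count into trainable vs frozen.

    Works with any object whose .parameters() yields tensors exposing
    .numel() and .requires_grad (e.g. a torch.nn.Module)."""
    trainable = frozen = 0
    for p in model.parameters():
        n = p.numel()
        if p.requires_grad:
            trainable += n   # updated by the optimizer
        else:
            frozen += n      # backbone weights excluded from training
    return {"trainable": trainable, "frozen": frozen, "total": trainable + frozen}
```

Reporting trainable and frozen counts separately is what makes the 87–98% trainable-parameter reductions for frozen backbones verifiable rather than anecdotal.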
Model support
| Deep learning | Classical ML |
|---|---|
| ResNet50 | Scratch baselines |
| EfficientNetB0 | Bayesian transfer |
| MobileNetV2 | Regularized transfer |
| TorchVision pretrained backbones | Domain-shift metrics |
| Transfer strategy benchmarking | Negative transfer safety gate |
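Of the transfer strategies listed, progressive unfreezing is the least standard, so here is a minimal sketch of the schedule idea: start with the whole backbone frozen, then enable gradients stage by stage from the output end as training advances. Stage granularity, the epoch cadence, and the function name are all illustrative assumptions, not HackForge's exact implementation.

```python
def progressive_unfreeze(stages, epoch, epochs_per_stage=2):
    """Unfreeze backbone stages from the top (output end) downward.

    `stages` is an ordered list of stage objects, each exposing a
    PyTorch-style .parameters() iterator. By epoch e, the last
    (e // epochs_per_stage) stages are trainable; the rest stay frozen."""
    n_open = min(len(stages), epoch // epochs_per_stage)
    for i, stage in enumerate(stages):
        trainable = i >= len(stages) - n_open   # later stages unfreeze first
        for p in stage.parameters():
            p.requires_grad = trainable
```

Calling this at the start of each epoch gradually widens the trainable set, so early epochs keep the low-compute profile of a frozen backbone while later epochs recover fine-tuning capacity.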
🧪 Deep learning proof-of-concept disclaimer
Important: The CNN section is a synthetic proof-of-concept, not a clinical benchmark.
| What it is | What it is not |
|---|---|
| Synthetic images mimicking histopathology-style structure | Not a medical claim |
| Controlled source vs target domain shift | Not a diagnostic tool |
| A benchmarking pipeline for carbon, transfer, and low-data behavior | Not a patient-level clinical evaluation |
Real next step
We plan to run the exact same pipeline on:
- BreaKHis
- PatchCamelyon
with patient-aware splits and real deployment-oriented evaluation.
😓 Challenges we ran into
| Challenge | What we learned |
|---|---|
| Transfer learning is not always better | On our synthetic CNN task, scratch outperformed frozen transfer, which forced us to make the project more rigorous and more honest |
| Avoiding overclaiming | We intentionally framed the CNN section as a proof-of-concept instead of presenting it as clinical AI |
| Parameter accounting | Official model size, experimental params, trainable params, and frozen params are all different and needed to be reported clearly |
| Carbon tracking across hardware | Supporting both NVML and time-based estimation added complexity but made the framework more portable |
| Reporting sustainability | Carbon, time, and parameter counts had to be treated as first-class outputs, not side notes |
🏆 Accomplishments that we're proud of
HackForge is more than a model demo — it is a measurement and decision framework.
| Highlight | Why we’re proud of it |
|---|---|
| Unified sustainability benchmarking | One framework across classical ML and deep learning |
| ~99.9% CO2 reduction in one classical transfer setting | Shows how powerful efficient transfer can be |
| 60.6% aggregate CO2 reduction in the current benchmark run | Demonstrates measurable sustainability impact |
| Negative transfer safety gate | Prevents wasteful compute before it happens |
| NVML integration | Adds hardware-level GPU energy tracking |
| 3 CNNs × 4 strategies × 4 regimes | Broad benchmarking instead of cherry-picked results |
| 98 tests + 8 demo scripts | Stronger reproducibility and engineering quality |
What matters most to us
HackForge is intentionally honest:
- real classical ML results are presented as real
- deep learning is clearly marked as a synthetic proof-of-concept
- unsupported claims are intentionally avoided
📚 What we learned
We learned that sustainable AI is not just about smaller models.
It is about:
| Principle | Meaning |
|---|---|
| Measure energy | Efficiency should be observable, not assumed |
| Reuse useful representations | Transfer can reduce waste when it genuinely helps |
| Avoid harmful transfer | Some transfer setups cost compute without improving performance |
| Benchmark low-data behavior | Label scarcity is one of the most practical real-world constraints |
| Design for deployment | A model that cannot be deployed efficiently is harder to justify |
Trust matters more than hype.
The strongest projects are the ones where the evidence is clear.
🔭 What's next for HackForge
The next major step is turning the CNN proof-of-concept into a real benchmark.
Roadmap
| Next milestone | Goal |
|---|---|
| Run on BreaKHis | Evaluate transfer on real histopathology structure |
| Run on PatchCamelyon | Test the pipeline on a larger real benchmark |
| Use patient-aware splits | Make the evaluation clinically credible |
| Compare real pretrained transfer vs scratch | Validate whether real texture/shape structure produces the expected transfer gains |
| Improve reporting and visualizations | Make results easier to understand and present |
| Expand edge deployment analysis | Test feasibility on practical hospital hardware |
Long-term vision
We want HackForge to help teams answer a practical question:
Is this transfer-learning choice actually saving compute, carbon, and cost — and is it worth deploying?
💡 Why this matters
Transfer learning should not only improve model performance.
It should also help reduce:
| Waste | Benefit |
|---|---|
| Carbon emissions | Greener AI systems |
| Compute waste | Lower training cost |
| Retraining overhead | Faster iteration |
| Infrastructure demands | More realistic deployment in low-resource settings |
HackForge brings those tradeoffs into the open by making them measurable.

