Presentations
Networking Events
Description: Co-located with the 2026 DAC
In-person, 8:00 AM - 5:00 PM (PDT)
More Information Coming Soon!
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
Description: The Visual Autoregressive (VAR) model, through its next-resolution prediction, demonstrates the significant potential of GPT-style AR models in image generation. However, due to its coarse-to-fine nature, the input token-map size grows dramatically with each step, resulting in excessive memory access and computational overhead. In this paper, we propose 3D-DuRA, an algorithm-architecture co-design based on a hybrid 3D near-memory and in-memory computing architecture equipped with dual-ring sparse attention, for efficient next-resolution visual-autoregressive generation. Experimental results demonstrate that 3D-DuRA achieves a 4.1× improvement in area efficiency over the RTX 6000 Ada GPU, along with 3.5× and 9.1× speedups and 10.1× and 13.1× improvements in energy efficiency on Infinity2B and VAR-d36, respectively.
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
Description: Low-batch LLM inference on edge hardware places stringent demands on both memory bandwidth and computational capacity. While 3D-stacked DRAM accelerators offer a promising solution, they introduce two critical overheads that are frequently under-optimized: DRAM refresh and collective communication.
To mitigate these issues, we propose 3D-ReSAC, a near-memory processor based on 3D-stacked DRAM, equipped with an area-efficient sequential refresh-skip method and high-link-utilization ArrowMesh communication.
Our evaluations show that 3D-ReSAC reduces refresh and communication overheads by 7–100% and 53–75%, respectively, leading to a 1.12× to 2.02× latency reduction across low-batch LLM inference workloads.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Description: 3DIC chiplet heterogeneous integration faces significant challenges in cross-die timing signoff because unified 3D PDKs are lacking and conventional STA cannot account for jitter from TSV+HB arrays. This paper proposes a 3DEM-driven, PDK-independent STA approach for reliable cross-die timing signoff. The approach constructs cross-die channels from die final-stage buffers and cross-die sections, and integrates 3D EM modeling for S-parameter acquisition, SPICE simulation for total jitter analysis, R/C parameter back-annotation, and incorporation of jitter into STA as uncertainty. A case study on 9-layer stacking (1 GHz) shows that the proposed method enables transmission pattern optimization, reducing jitter from 230.74 ps to 120.28 ps and achieving timing closure, which is unattainable with conventional STA. This work facilitates reliable and flexible 3DIC chiplet heterogeneous integration.
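As a rough illustration of how total jitter can enter STA as clock uncertainty, the sketch below computes setup slack for a 1 GHz cross-die path using the two jitter values quoted in the abstract. The path delay and setup time are hypothetical values chosen only for illustration; they are not figures from the paper, and the function name is an assumption.

```python
def setup_slack_ps(period_ps, path_delay_ps, setup_ps, uncertainty_ps):
    """Setup slack = clock period - data path delay - setup time - clock uncertainty."""
    return period_ps - path_delay_ps - setup_ps - uncertainty_ps


PERIOD_PS = 1000.0     # 1 GHz clock, as stated in the abstract
PATH_DELAY_PS = 750.0  # hypothetical cross-die path delay (assumption)
SETUP_PS = 50.0        # hypothetical setup time (assumption)

# Jitter values from the abstract, applied as clock uncertainty.
slack_before = setup_slack_ps(PERIOD_PS, PATH_DELAY_PS, SETUP_PS, 230.74)
slack_after = setup_slack_ps(PERIOD_PS, PATH_DELAY_PS, SETUP_PS, 120.28)

print(f"slack with 230.74 ps jitter: {slack_before:.2f} ps")
print(f"slack with 120.28 ps jitter: {slack_after:.2f} ps")
```

With these assumed numbers the path fails setup at the higher jitter and passes at the lower one, which is the kind of difference a jitter-unaware STA run would not expose.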
Research Manuscript
EDA
EDA2. Design Verification and Validation
Description: Existing Boolean processor-based (BP-based) hardware emulation systems typically rely on fixed interconnects, which often suffer from severe bandwidth underutilization under highly imbalanced traffic. We propose a BP-based hardware emulation system with hierarchical reconfigurable interconnects. At both the inter-chip and intra-chip levels, our system dynamically reallocates idle interconnect lanes from low-traffic pairs to high-demand pairs, alleviating communication bottlenecks and improving interconnect utilization without modifying the underlying ASIC fabric. The hardware is co-designed with the compiler, which profiles communication, generates per-lane configuration tables and scheduling strategies, and programs the interconnect via registers. Experimental results on industrial digital designs show up to 64% emulation performance improvement and an average 27% reduction in compilation time, with only 7.4% area overhead compared to a fixed-interconnect baseline system.
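The idea of moving idle lanes from low-traffic pairs to high-demand pairs can be sketched as a proportional allocation over profiled traffic. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name, the traffic figures, the lane count, and the largest-remainder rounding are all hypothetical.

```python
def allocate_lanes(traffic, total_lanes, min_lanes=1):
    """Split total_lanes across endpoint pairs in proportion to profiled traffic.

    Every pair keeps at least min_lanes; the spare lanes are distributed
    proportionally, with rounding resolved by the largest-remainder rule.
    """
    pairs = list(traffic)
    spare = total_lanes - min_lanes * len(pairs)
    if spare < 0:
        raise ValueError("not enough lanes for the minimum allocation")

    alloc = {p: min_lanes for p in pairs}
    total = sum(traffic.values())
    if total == 0:
        return alloc  # nothing profiled: leave the spare lanes idle

    shares = {p: spare * traffic[p] / total for p in pairs}
    for p in pairs:
        alloc[p] += int(shares[p])  # whole-lane part of each share

    # Hand out lanes lost to rounding, largest fractional share first.
    leftover = total_lanes - sum(alloc.values())
    by_remainder = sorted(pairs, key=lambda p: shares[p] - int(shares[p]), reverse=True)
    for p in by_remainder[:leftover]:
        alloc[p] += 1
    return alloc


# Hypothetical profile: one hot pair and two nearly idle pairs sharing 12 lanes.
profile = {("A", "B"): 900, ("A", "C"): 50, ("B", "C"): 50}
lanes = allocate_lanes(profile, total_lanes=12)
print(lanes)
```

A fixed interconnect would give each of the three pairs 4 lanes here; the proportional scheme shifts most of the lanes to the busy pair while keeping every pair connected.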
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Description: We propose a chip-level, layout-aware DTCO framework to correct deterministic interlayer misalignment. Using layout-driven features and machine learning, the proposed approach achieves ~70% variation reduction, enables fast inline prediction, and improves yield on silicon-proven products.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Description: Automotive radar systems rely on highly linear FMCW chirp signals to ensure accurate distance and velocity measurements for ADAS applications. Deviations in chirp linearity can compromise radar performance and violate functional safety standards. This paper presents a compact, all-digital IP for real-time chirp linearity monitoring integrated within the radar transmitter subsystem. The proposed solution leverages zero-crossing alignment between the chirp signal and an on-chip reference clock to compute instantaneous frequency slope deviations without requiring dedicated high-frequency clocks. The architecture includes counters, edge overlap detection, and an error computation engine, and it detects deviations as small as 0.001% across a 10 μs chirp duration. Implemented entirely in digital logic, the IP is technology-independent, incurs less than 5% area overhead, and operates with any available on-chip clock source. This scalable, low-cost solution enhances radar transmitter reliability and supports compliance with automotive safety standards, making it well suited for integration into next-generation radar SoCs.
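One simplified way to picture counter-based slope monitoring is to count chirp zero crossings in successive reference-clock windows and take finite differences. The sketch below is an assumption-laden illustration, not the IP's actual architecture: the window length, the nominal slope, the counts, and the function name are all hypothetical.

```python
def slope_deviations(counts, window_s, nominal_slope_hz_per_s):
    """Relative slope error between adjacent counting windows.

    counts[i] is the number of chirp cycles (rising zero crossings) seen in
    the i-th reference window of length window_s. The mean frequency in a
    window is counts[i] / window_s, and the slope between adjacent windows
    is the frequency difference divided by window_s.
    """
    freqs = [c / window_s for c in counts]
    slopes = [(f1 - f0) / window_s for f0, f1 in zip(freqs, freqs[1:])]
    return [(s - nominal_slope_hz_per_s) / nominal_slope_hz_per_s for s in slopes]


WINDOW_S = 1e-6        # hypothetical 1 us counting window
NOMINAL_SLOPE = 1e14   # hypothetical chirp slope: 100 MHz per us

# An ideal linear chirp gains the same number of cycles in every window.
ideal_counts = [1050, 1150, 1250, 1350]
# A distorted chirp: the third window sees 10 extra cycles.
distorted_counts = [1050, 1150, 1260, 1350]

devs_ideal = slope_deviations(ideal_counts, WINDOW_S, NOMINAL_SLOPE)
devs_bad = slope_deviations(distorted_counts, WINDOW_S, NOMINAL_SLOPE)
print(devs_ideal)
print(devs_bad)
```

Integer counts alone limit the resolution of this estimate; the edge-overlap detection mentioned in the abstract presumably refines the alignment between chirp edges and reference edges, which this sketch does not model.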
Research Manuscript
Design
Quantum
DES6. Quantum Computing
Description: The Gottesman–Kitaev–Preskill (GKP) encoding is a promising approach for realizing fault-tolerant continuous-variable (CV) photonic quantum computers.
Recent studies have investigated various GKP decoding algorithms, among which correlation-aware methods that exploit inter-qubit correlations achieve significantly improved error-correction performance.
However, hardware implementation of GKP decoding remains largely unexplored.
Error decoding in CV photonic quantum computers must operate within tens of nanoseconds to match the optical clock frequency.
A straightforward hardware implementation of correlation-aware decoding introduces substantial computational latency due to its arithmetic complexity.
To address this challenge, this paper proposes a high-accuracy and low-latency accelerator architecture optimized for correlation-aware GKP decoding.
Logic synthesis using 7-nm FinFET technology demonstrates that our decoder can complete correlation-aware GKP decoding within 12.96 ns.
In addition, comprehensive design space exploration is conducted to evaluate tradeoffs among decoding accuracy, latency, and circuit area, leading to design guidelines for future correlation-aware GKP decoder implementations.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
Description: Graph partitioning is essential for many EDA applications that leverage task graph parallelism for faster execution. For instance, RTL simulators partition an input RTL design into dependent tasks and schedule them across threads. However, existing partitioners are largely limited to general-purpose heuristics that overlook real threading costs, resulting in suboptimal performance. We therefore introduce DiffPart, a differentiable task graph partitioning framework that automatically learns high-quality partitions under real operating conditions. Applied to RTL simulation, DiffPart improves the partitioning quality of the state-of-the-art Verilator, delivering 1.22× to 55.25× faster simulation runtime across diverse designs.
Engineering Presentation
Design
EDA
Systems
Description: For large-scale SerDes designs, dynamic power integrity sign-off is becoming increasingly challenging due to growing design scale and performance demands. Modern SerDes integrate analog blocks, digital blocks, sensitive clocks, and decaps with a complex Power Delivery Network (PDN) design. Such designs can reach tens of millions of transistors and multi-million-node scale as design nodes shrink, making full-chip dynamic EMIR sign-off and Chip Power Model (CPM) generation difficult. In addition, data rates are scaling from 56G to 112G and 224G, leading to higher switching activity and larger transient current peaks. Accurate sign-off and CPM generation require finer, picosecond-level time resolution and long transient windows of up to hundreds of nanoseconds, which lengthens sign-off runtime even further.
In traditional power integrity analysis, existing EMIR tools have limited distributed scalability for full-chip, multi-domain SerDes designs. Realistic dynamic simulations often exhibit low parallel efficiency and inefficient utilization of computing resources, resulting in long runtime to days or even weeks. To reduce runtime and resource usage, designers have to downsize current vector capture window or simplify extracted networks, which impacts overall power integrity analysis accuracy. Moreover, past early analysis is typically limited to simple static analysis, while dynamic analysis and CPM generation can only be performed at the sign-off stage, making chip-package-system PDN co-optimization hard to process in early stage.
Here, we adopt a distributed and scalable dynamic power integrity flow from early analysis to sign-off. Early-stage dynamic power integrity analysis based on Build Quality Metric (BQM) provides early EMIR insight and enables early CPM generation for package- and system-level power integrity analysis, allowing issues to be addressed earlier and reducing sign-off iterations. Distributed and scalable dynamic EMIR sign-off flow enables efficient full-chip analysis under realistic workloads through multi-machine, multi-thread parallelism, without simplifying extracted networks or reducing analysis windows. With the unified dynamic power integrity flow from early analysis to sign-off, analysis efficiency is significantly improved while maintaining accuracy and increasing confidence in SerDes design robustness.
keywords:large-scale Serdes designs, distributed and scalable, dynamic power integrity, Chip Power Model
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe generation of fluid-dynamics fields is essential for understanding complex nonlinear systems and enabling real-time scientific computing. Conventional computational fluid dynamics pipelines rely on finite-element or finite-volume solvers on von Neumann architectures, which discretize continuous physical evolution into many iterative updates, leading to prohibitive latency and energy consumption. Inspired by neural dynamical systems in the brain, we propose a biologically inspired continuous-time hardware–software co-design framework for flow-matching–based turbulent-flow generation. (1) The flow-matching model adopts an MLP-Mixer architecture that emulates cortical-style information integration and hierarchical signal mixing, providing a compact backbone that naturally aligns with closed-loop analog computation. (2) A fully analog continuous-time RRAM CIM neural ordinary differential equation (ODE) solver is developed to physically realize neural-like continuous-time latent dynamics, enabling high-speed and low-power flow generation. (3) Noise-aware training and decoder retraining are jointly introduced to ensure robust generation quality in the presence of RRAM read/write noise. Experiments on three turbulent-flow datasets show that the MLP-Mixer backbone matches convolutional and attention-based flow-matching models in velocity-field accuracy while mapping efficiently to CIM hardware, and that the proposed analog ODE solver reduces energy consumption by 98.24% and latency by 99.99% compared with an NVIDIA A100 GPU, while maintaining stable generation fidelity under realistic RRAM read/write noise. This work establishes a new paradigm for high-speed, energy-efficient physical process generation and scientific AI acceleration using neuromorphic continuous-time CIM computing.
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionThe optimization of variational quantum algorithms (VQAs) is notoriously challenging due to poor parameter initialization, which often traps optimizers in suboptimal local minima. Existing methods rely on a static guessing paradigm that is fundamentally limited. This paper presents a Hamiltonian-guided pre-trainer (HGP), a new approach that dynamically constructs a better starting point. HGP iteratively refines parameters by performing exact global optimization within low-dimensional subspaces. These subspaces are identified using a Hamiltonian-guided parameter blocking strategy, and the optimization is achieved by reconstructing the analytic landscape from a few quantum measurements via a Fast Fourier Transform. We evaluated HGP on canonical spin models, where it consistently produced superior starting points for standard optimizers. Ablation studies reveal Hamiltonian-guided parameter blocking reduces the initial energy error by nearly 30-fold versus the next best benchmark. These results highlight the importance of Hamiltonian guided pre-training for enhancing VQA performance.
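The landscape-reconstruction step can be illustrated in one dimension. The sketch below is not the authors' HGP implementation; it assumes a single parameter whose energy has the form E(θ) = a0 + a1·cos θ + b1·sin θ (as for a single Pauli-rotation gate), uses a plain discrete Fourier sum in place of an FFT library, and the function name `reconstruct_and_minimize` is hypothetical. Three equally spaced evaluations are enough to recover the coefficients and read off the exact global minimum.

```python
import cmath
import math


def reconstruct_and_minimize(energy, n_samples=3):
    """Recover E(t) = a0 + a1*cos(t) + b1*sin(t) from equally spaced
    samples and return (theta_min, e_min), the exact global minimum.

    Assumption: the landscape contains only frequencies 0 and 1, so
    n_samples >= 3 avoids aliasing.
    """
    thetas = [2 * math.pi * k / n_samples for k in range(n_samples)]
    values = [energy(t) for t in thetas]

    # Discrete Fourier coefficients for frequencies 0 and 1.
    c0 = sum(values) / n_samples
    c1 = sum(v * cmath.exp(-1j * t) for v, t in zip(values, thetas)) / n_samples

    # E(t) = c0 + 2*|c1|*cos(t + arg(c1)), minimized where the cosine is -1.
    theta_min = (math.pi - cmath.phase(c1)) % (2 * math.pi)
    e_min = c0 - 2 * abs(c1)
    return theta_min, e_min


# Stand-in for quantum measurements: a known single-frequency landscape.
def toy_energy(t):
    return 0.5 + 1.2 * math.cos(t) - 0.7 * math.sin(t)


theta_star, e_star = reconstruct_and_minimize(toy_energy)
```

In a multi-parameter block the same idea would need a multidimensional transform and more samples; that extension is not shown here.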
Engineering Presentation
EDA
Systems
DescriptionAs AI, machine learning, and cloud workloads scale, interconnects must deliver higher bandwidth and lower latency without compromising system-level power integrity and PDN robustness. Massive parallel compute leads to high transient current demand, fast di/dt switching, and increased simultaneous switching noise, causing traditional PDN assumptions to break down at the system level.
A scalable system-level PI simulation flow is key to balancing accuracy and turnaround time for large, high-performance product designs under aggressive go-to-market timelines. This work presents a virtual validation framework for system-level power integrity sign-off analysis, including die, package, and PCB. It enables engineers to identify anomalously high impedance responses in the system, excessive transient voltage drops, and elevated current ripples at each interface (per bump) of the system.
This approach helps to correct the design before tape-out, helping to avoid costly redesign cycles and potential product failures.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionVerification of status flags and interrupts is critical for correct interaction between IP components, yet exhaustive validation of interrupt behavior remains challenging when relying solely on simulation-based approaches. This work presents a hybrid simulation and formal verification methodology for interrupt and status flag verification using an interrupt Verification IP (VIP) and demonstrates its application on a Real-Time Clock (RTC) IP. Simulation-based verification is used to validate interrupt behavior under realistic programming scenarios, while formal verification is leveraged to exhaustively prove interrupt properties and corner cases that are difficult to cover in simulation. The results highlight the effectiveness of combining dynamic and formal verification through a reusable interrupt verification infrastructure to improve confidence in interrupt correctness at the IP level.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPower consumption is a key factor in SoC design. With advances in wireless technologies through 3GPP LTE and NR, IEEE Wi-Fi, and now 3GPP 6G, SoCs need more advanced architectures. Typically, an SoC uses a sleep and wakeup (warmboot) procedure to save battery. The warmboot procedure involves switching DRAM from refresh mode to active mode, which takes time during which other operations are delayed. DRAM usage also consumes additional power. Hence, a hardware design that gives software systematic control over wakeup (warmboot) and sleep operations is required. This work discusses an SoC design that uses a hybrid mode of SRAM and DRAM, the memory entities that play a key role in wakeup, software execution, and sleep. By combining software control with the proposed hardware design, the SoC can efficiently increase sleep time by minimizing DRAM usage and wakeup time, managing with SRAM and executing the procedures required during wakeup faster.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionAs integrated circuit (IC) designs grow increasingly complex and transistor counts per chip exceed ten billion, post-layout SPICE simulations involve large-scale sparse linear systems, severely degrading simulation efficiency.
Current GPU acceleration methods, despite their promise, struggle with efficient load balancing and resource utilization, which restricts their effectiveness in ultra-large-scale circuit simulations.
In this paper, we propose a levelized load-balanced and structure-adaptive LU factorization framework for GPU-based circuit simulation.
Our method improves resource utilization and parallel efficiency by introducing computation-balanced dependency level partitioning, adaptive resource allocation, and a hybrid matrix indexing mechanism.
These strategies ensure that both the memory and computational resources of the GPU are fully leveraged.
We demonstrate significant acceleration over existing methods, achieving a 2.1X speedup compared to GLU3.0 and a 5.2X speedup over 16-thread PARDISO on circuit sparse matrices ranging from thousands to millions of dimensions.
Additionally, our framework has been successfully integrated into the open-source SPICE simulator Ngspice, accelerating circuit simulations with promising results and showcasing its potential for large-scale IC design verification.
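The dependency-level idea behind levelized LU factorization can be sketched as follows. This is a minimal illustration and not the proposed framework: it only groups columns into levels that can be processed in parallel, and it omits the computation balancing, adaptive resource allocation, and hybrid indexing described above. The function name `levelize` and the input format (a dictionary mapping each column to the earlier columns it depends on) are assumptions made for the example.

```python
def levelize(deps):
    """Group columns into dependency levels.

    deps[j] lists the columns that column j depends on; every dependency
    is assumed to have a smaller index than j and to appear as a key.
    Columns in the same level have no mutual dependencies, so they can
    be factorized concurrently.
    """
    level = {}
    for j in sorted(deps):
        # A column sits one level above its deepest dependency.
        level[j] = 1 + max((level[i] for i in deps[j]), default=-1)

    groups = {}
    for j, lvl in level.items():
        groups.setdefault(lvl, []).append(j)
    return [groups[lvl] for lvl in sorted(groups)]


# Hypothetical 5-column dependency pattern.
example_levels = levelize({0: [], 1: [], 2: [0], 3: [1, 2], 4: []})
```

Here columns 0, 1, and 4 form the first level, column 2 the second, and column 3 the third. A load-balanced scheme would additionally account for the amount of work per column when assigning levels to GPU resources.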
People
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionMining temporal motifs in temporal graphs is essential for many critical applications. Although several software/hardware temporal motif mining solutions have been proposed, they still suffer from substantial redundant and irregular off-chip communications due to misaligned search tree expansions across different motif matching tasks. In this work, we observe that different tasks traverse the same temporal graph edges in strict chronological order, exhibiting strong data locality among these tasks. Motivated by this insight, we propose LTMiner, a locality-aware hardware accelerator designed to efficiently handle temporal motif mining. Specifically, LTMiner incorporates a novel chunk-based search tree expansion mechanism into the accelerator design to align the graph traversals of different tasks at the granularity of data chunks, substantially boosting the data locality among these tasks for lower data access cost. The results show that LTMiner achieves speedups of 1.1×–652.6× and 1.8×–70.3×, and energy savings of 3.9×–2050.9× and 1.2×–17.3×, compared to cutting-edge software and hardware solutions, respectively.
People
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn analog-on-top design flows, ensuring pin position and ordering constraints from the analog layout team is critical for the Place & Route of digital IP blocks. Traditionally, digital layout engineers manually convert DEF files, which describe physical layout and pin information, into .save.io files required by Innovus. This manual editing is time-consuming, error-prone, and must be repeated frequently due to evolving constraints. We present "IO Generator," a Python-based tool that automates the conversion of DEF files into the .save.io format, significantly accelerating the backend digital flow. The tool accurately assigns pins to floorplan sides and computes offsets and skips based on LEF-defined pitch and width parameters. It includes robust error checking to ensure pin alignment and offers two modes: pin extraction for file generation and pin comparison for mismatch reporting between DEF and netlist files. With a user-friendly GUI and command-line interface, IO Generator reduces manual effort from hours to seconds, minimizes errors, and integrates seamlessly into existing flows. IO Generator has been successfully adopted across multiple projects, proving its effectiveness and versatility in modern analog-on-top digital design flows.
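The side-assignment step can be sketched as below. This is an illustrative assumption of how pins might be mapped to floorplan sides, not the IO Generator code: pin coordinates are taken as already parsed from the DEF file, the die is assumed to be a rectangle with its origin at (0, 0), the helper names are hypothetical, and neither the pitch-based skip computation nor the .save.io output format is shown.

```python
def assign_side(x, y, width, height):
    """Return the die edge closest to the pin at (x, y).

    Ties are resolved in the order left, right, bottom, top.
    """
    distances = {
        "left": x,
        "right": width - x,
        "bottom": y,
        "top": height - y,
    }
    return min(distances, key=distances.get)


def group_pins_by_side(pins, width, height):
    """Map each side to its pins as (name, offset) pairs sorted by offset.

    The offset is measured along the edge: y for left/right, x for
    bottom/top.
    """
    sides = {"left": [], "right": [], "bottom": [], "top": []}
    for name, (x, y) in pins.items():
        side = assign_side(x, y, width, height)
        offset = y if side in ("left", "right") else x
        sides[side].append((name, offset))
    return {s: sorted(p, key=lambda item: item[1]) for s, p in sides.items()}


# Hypothetical pins on a 100 x 80 block.
example_pins = {"clk": (0, 30), "rst": (0, 10), "out": (100, 20), "vdd": (50, 80)}
example_sides = group_pins_by_side(example_pins, width=100, height=80)
```

Sorting by offset gives the pin order along each edge, which is the information a pin-ordering constraint check would compare against the analog team's requirements.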
Engineering Presentation
Chiplet
EDA
DescriptionThe rapid growth of AI and machine-learning workloads has driven adoption of advanced packaging technologies such as silicon interposers, bridges, and heterogeneous 3D integration in chiplet-based systems. These structures often require patterned ground planes, such as hatched or meshed geometries, to meet manufacturability and reliability constraints. However, these geometries introduce significant challenges for signal and power integrity analysis due to impedance variation, increased loss, resonances, and crosstalk.
This paper presents a practical mixed-domain interconnect modeling approach that combines specialized quasi-static 2D field solvers with full-wave Finite Element Method (FEM) solvers to efficiently analyze hatched ground planes in advanced interconnects. The proposed approach is designed to integrate into back-end interconnect simulation flows, balancing modeling accuracy with scalability suitable for design iteration and signoff.
Two test vehicles, a UCIe 2.0–based silicon bridge and a mobile flex PCB, are fabricated, simulated, and correlated against measured data. The results demonstrate strong agreement with measurements while improving modeling efficiency compared to standalone 3D EM analysis, enabling more reliable interconnect analysis and faster design turnaround for advanced packaging implementations.
Exhibitor Forum
AI
EDA
Systems
DescriptionVerification is hard, and it consumes an outsized share of silicon schedule and budget. VerifAgent is an agentic AI platform for accelerating UVM-based functional verification of digital IPs. Given an IP's micro-architecture specification and partial RTL, VerifAgent can generate a comprehensive test plan and implement full UVM test bench and test cases. It can compress the verification cycle by a factor of seven and reduce engineering cost by more than 85%. Our Multi-AI-Agent Orchestration Methodology produces high-quality code, and the EDA-in-the-loop approach ensures the generated code is always compile clean. Deployed at 10+ chip companies, we have helped our customers save thousands of hours and millions of dollars over the past 6 months.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe increasing complexity of digital circuits and stringent functional safety standards, such as ISO 26262 and IEC 61508, require test coverage levels above 99%, which often leads to significant area overhead due to traditional Design For Testability (DFT) techniques. This work presents a novel automated methodology, driven by a Python script, that combines ATPG patterns generated by TestMAX ATPG with functional fault simulation run using VC_Z01X to significantly improve test coverage without additional area overhead. The methodology was validated on an STMicroelectronics synchronous step-down regulator, showing coverage improvements from 78.87% to 92.90% in full-scan configuration and from 42.09% to 84.25% in partial-scan mode, while drastically reducing ATPG untestable faults. The automated flow leverages the existing functional testbench, enhancing verification efficiency and repeatability. This approach advances the trade-off between achieving high test coverage and minimizing area impact, offering valuable benefits for safety-critical applications and designs with strict area constraints.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionWith the ongoing advancement of automotive intelligence, vehicles are incorporating an increasing number of semiconductor chips to improve efficiency and functionality. This technological progression results in elevated current levels across automotive systems, leading to two key physical challenges: electromagnetic compatibility (EMC) and thermal management. Properly managing these factors is crucial for ensuring system stability and reliability.
However, simultaneously addressing EMC and thermal effects is complex, as their respective physical impacts and design requirements are often conflicting. This complexity extends the design cycle needed to achieve a suitable balance between EMC and thermal considerations that meets chip specifications. To address this challenge, Denso and Siemens EDA have partnered to develop an automated optimization workflow by integrating Siemens' Solido Simulation Suite, HEEDS, and FLOEFD. This approach efficiently determines optimal design solutions for both EMC and thermal performance.
Utilizing this innovative workflow can reduce design timelines by up to 68% and decrease physical area requirements by as much as 20%. These improvements significantly accelerate time-to-market and lower costs associated with circuit design for automotive applications.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern semiconductor fabrication faces an escalating verification challenge as technology nodes shrink to 3nm and beyond, where Layout Versus Schematic (LVS) verification complexity has grown exponentially. Today's LVS must go beyond comparing Schematic and Layout designs; it must extract R/C values for hundreds of devices, perform parasitic extraction (PEX), and analyze complex CAD layer interactions. At 3nm nodes, device extraction requires 30 to 50 CAD layers per device due to Mask Data Preparation (MDP) demands, making manual verification infeasible. This paper presents a novel automated Device Extraction Quality Assurance (QA) framework integrating three verification engines: Truth Table Checker, Compare Checker, and Device Extraction QA. The workflow ensures device extraction is properly performed as described by process development, extracting layouts from seed layers and supplying devices to SPICE netlists for simulation-based verification. Validated across technologies from 8-12 inch legacy processes to 3nm nodes, the framework enables early detection of truth table errors and parasitic device extraction issues through automated parameter qualification. The truth table checker flow has dramatically reduced turn-around time to rule deck release while significantly improving quality. With average runtimes under 30 minutes, qualification engineers receive comprehensive reports clearly indicating where attention is needed, enabling rapid issue resolution and high-confidence PDK releases.
Work in Progress
DescriptionThe increasing complexity of modern digital designs presents significant challenges for formal verification, particularly when properties fail to converge within practical proof bounds. Non-convergent assertions hinder verification sign-off and limit the ability to expose deep corner-case bugs. This work proposes a contract-based refinement framework to enhance property convergence in the formal verification of complex hardware systems. The methodology employs proof decomposition to identify refinement properties—helper assertions selected through overlapping cones of influence (COI)—which are then composed to strengthen the convergence of the target property. Implemented and evaluated using the Cadence JasperGold formal verification platform, the approach demonstrates improved proof bounds and enhanced bug detection across multiple architectures, including memory controllers and CPUs. Results show that the proposed technique systematically improves convergence for critical properties while maintaining scalability and broad applicability to diverse digital designs.
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionNeural compact models are increasingly explored for design–technology co-optimization (DTCO), yet their black-box nature hinders physical interpretability and seamless SPICE deployment. We introduce a physics-prior neural-to-symbolic compact modeling framework based on Efficient Kolmogorov–Arnold Networks (EKAN) trained on multidimensional oxide-FET data. EKAN first learns a smooth, bias-aware log-current surrogate; its spline activations are then distilled into a closed-form current expression via KAN-derived one-dimensional atoms, physics-guided feature libraries, and weighted sparse regression with monotonicity regularization. The resulting Verilog-A model is SPICE-ready, preserves key device trends across bias and process, and attains accuracy comparable to neural compact models while remaining interpretable.
People
Engineering Presentation
Chiplet
EDA
DescriptionDynamic Thermal Management (DTM) techniques are increasingly critical for high-power heterogeneous chip designs, where sustained peak operation can lead to rapid thermal limit violations. We present a dynamic, probe-based thermal methodology to quantitatively evaluate settling time after peak power events and maximize peak state duration. The design comprises a compute die along with HBM placed on an interposer and then a package. Distinct tile-based power maps were created for peak and lower-activity (50-90%) states. Static thermal analysis was performed to find the corresponding maximum temperatures.
Next, using DTM analysis, we extracted the settling time required for each lower power state to reach steady-state temperature after exiting the peak. This settling time was exponentially inverse to power. Lastly, peak state duration between certain thresholds was maximized depending on the settling time of each state. Maximum peak state durations of 80% and 76% were seen for the peak-to-50% and peak-to-60% power state transitions, respectively. This methodology enables evaluation of transient metrics using DTM analysis by linking power throttling decisions directly to time-domain thermal behavior, providing actionable insights for DVFS, workload scheduling, and safe peak performance budgeting.
Engineering Presentation
Design
EDA
Systems
DescriptionThis work presents an automated EDA toolchain for synthesizing programmable and scalable CMOS analog optimization IP cores. Addressing the latency wall in real-time control, where conventional digital solvers face polynomial scaling bottlenecks, our methodology translates high-level mathematical specifications (AMPL/MPS) directly into verification-ready SPICE netlists. The synthesized IP utilizes a reconfigurable switched-capacitor architecture that implements continuous-time Karush-Kuhn-Tucker dynamics, allowing it to solve constrained optimization problems through parallel physics-based evolution rather than sequential algorithms. Unlike prior art limited to small-scale fixed-function circuits, our architecture incorporates a software-driven calibration layer to neutralize PVT variations, ensuring reliability in standard CMOS nodes. We demonstrate the flow's scalability on problem sizes ranging from dense 500-variable instances to sparse 10,000-variable workloads. Results show the generated IP achieves invariant sub-millisecond convergence regardless of complexity, delivering a >300X speedup over state-of-the-art digital interior-point solvers while maintaining solution accuracy within 0.02% relative error. This work provides a complete code-to-silicon path for deploying high-performance analog computing in edge AI and real-time control systems.
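To make the notion of continuous-time Karush-Kuhn-Tucker dynamics concrete, the sketch below integrates a primal-dual gradient flow in software. It is an assumption-laden toy and not the synthesized circuit: the problem is restricted to minimizing 0.5·‖x‖² subject to a single equality constraint a·x = b, the flow is discretized with explicit Euler steps, and the function name `kkt_flow` and its step parameters are hypothetical.

```python
def kkt_flow(a, b, dt=0.01, steps=5000):
    """Euler-integrate the primal-dual flow for
        minimize 0.5 * ||x||^2   subject to   a . x = b.

    Dynamics (all state variables evolve simultaneously):
        dx/dt   = -(x + lam * a)   # descent on the Lagrangian
        dlam/dt =  a . x - b       # ascent on the constraint residual
    The equilibrium of this flow satisfies the KKT conditions.
    """
    x = [0.0] * len(a)
    lam = 0.0
    for _ in range(steps):
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        x = [xi - dt * (xi + lam * ai) for ai, xi in zip(a, x)]
        lam = lam + dt * residual
    return x, lam


# Toy instance: minimize 0.5*(x0^2 + x1^2) subject to x0 + x1 = 1.
x_opt, lam_opt = kkt_flow([1.0, 1.0], 1.0)
```

For this instance the flow settles at x = (0.5, 0.5) with multiplier -0.5. In an analog realization every state variable would evolve in parallel in physical time, whereas the loop above steps through that evolution sequentially.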
Research Special Session
Systems
DescriptionAutonomous edge machine vision requires image sensor architectures that simultaneously advance inference autonomy and energy autonomy. This paper introduces a quantitative modeling framework that spans conventional CMOS imagers and emerging designs—event-based/DVS sensors, coded-exposure sensors, and self-powered sensors with energy-harvesting pixels. We unify optical, circuit, and algorithmic modeling in a single analytical flow to capture how architectural choices in pixel structures, readout pipelines, compressive acquisition, and hybrid imaging–harvesting mechanisms propagate to power, latency, noise, and task-level vision performance. Extending first-principles models of self-powered vision systems, the framework also evaluates when energy-harvesting pixels provide system-level advantages over external solar harvesting, particularly in long-lived and hard-to-maintain deployments. By enabling rapid, physically grounded exploration and comparative analyses, our results reveal key trade-offs that determine when unconventional sensor architectures meaningfully enhance edge intelligence. This framework offers a principled foundation for designing next-generation energy-aware, autonomous machine vision systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionProblem Statement & Motivation
• This paper presents an automated scorecard analysis system designed to systematically evaluate and track design quality metrics throughout the development cycle.
• The system provides quantifiable insights into design using a weighted categorical scoring approach, highlighting optimization opportunities and flow gaps, enabling project teams to make informed decisions based on severity-categorized metrics.
• This Scorecard tracker application is designed to manage and display the scores of various physical design runs to ensure design quality through a structured checklist-based scoring process. This tool presents designers with categorized information to enable pin-pointed evaluation to improve quality while reducing manual efforts, leading to improved turnaround time for design closure.
• The tool offers various features such as quality assessment of flow logs, quality checks on the run environment, highlighting of missed checks, and ensuring that each manual check is run and cleaned up. Checks are also binned into different categories to help designers prioritize debugging.
• The system allows users to monitor their runs efficiently. It is built with a focus on usability and on monitoring the various physical design flows (Synthesis, PnR, reliability, CV, custom flows) that are used for SoC development.
• Results demonstrate improved prediction of design quality issues and more effective allocation of engineering resources, leading to measurable reductions in design iterations and validation time while maintaining high quality during implementation.
• Additionally, all the scores for each run, project, and block are logged centrally, which helps in analyzing design closure trends during and after implementation. This can be an effective tool to collate learnings during implementation and improve the design cycle for upcoming designs.
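A weighted categorical score of the kind described in the bullets above can be sketched as follows. The severity categories, their weights, and the function names are hypothetical choices for illustration; the abstract does not specify the actual weighting scheme.

```python
# Hypothetical severity weights; the real tool's categories may differ.
SEVERITY_WEIGHTS = {"critical": 5, "major": 3, "minor": 1}


def run_score(checks):
    """Return a 0-100 score for one run.

    checks is a list of (name, severity, passed) tuples. Each check
    contributes its severity weight when it passes.
    """
    total = sum(SEVERITY_WEIGHTS[sev] for _, sev, _ in checks)
    earned = sum(SEVERITY_WEIGHTS[sev] for _, sev, ok in checks if ok)
    return 100.0 * earned / total if total else 100.0


def failures_by_severity(checks):
    """Bin failing checks by severity so designers can prioritize debugging."""
    bins = {sev: [] for sev in SEVERITY_WEIGHTS}
    for name, sev, ok in checks:
        if not ok:
            bins[sev].append(name)
    return bins


# Hypothetical checklist for a single place-and-route run.
example_checks = [
    ("lvs_clean", "critical", True),
    ("drc_clean", "major", False),
    ("log_warnings_reviewed", "minor", True),
]
example_score = run_score(example_checks)
example_failures = failures_by_severity(example_checks)
```

Logging `example_score` per run, project, and block would give the centrally stored trend data mentioned in the last bullet.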
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionAs post-quantum cryptographic (PQC) schemes are standardized, evaluating their resilience to side-channel attacks (SCA) becomes critical. While most prior studies focus on physical SCAs, the practicality of remote SCAs on full implementations of standardized PQCs remains largely unexplored. In this paper, we present the first generic remote power SCA on the Module-Lattice-Based Key Encapsulation Mechanism (ML-KEM), evaluated on a modern Intel x86 processor. Our result demonstrates that power traces can be exploited remotely as a plaintext-checking oracle, enabling secret key recovery despite the scheme's theoretical IND-CCA security. Using ML-KEM as a case study, we show that complex microarchitectural mechanisms such as speculative execution and dynamic power management do not eliminate exploitable power leakage. We evaluated our attack on the PQClean implementation, achieving secret key recovery with a success rate up to 99.5%. These findings provide a realistic assessment of PQC leakage behavior on high-end processors and underscore the need for architecture-aware leakage models and co-designed hardware–software defenses to ensure secure PQC deployment in practice.
People
Engineering Presentation
EDA
DescriptionAs technology nodes continue to scale, DTCO (Design-Technology Co-Optimization) increasingly requires layout teams to evaluate multiple architectural options early in the design cycle—especially in standard cells and SRAM periphery, where small structural changes can significantly affect area, performance, and DRC results.
In reality, layout implementation has not kept pace with this demand. Even modest architectural updates often invalidate existing layouts, forcing engineers to repeatedly rework structurally similar designs and slowing down DTCO feedback loops.
This work presents a lightweight, rule-aware layout conversion framework aimed at reducing repetitive layout modification effort at the cell- and library-level. Rather than introducing a new design flow, the framework focuses on automating common but time-consuming tasks such as cell-height migration, layout template regeneration, and structural refactoring under updated design rules. Each transformation step is performed in a DRC-conscious manner, helping preserve layout intent and physical consistency.
Experimental evaluations on standard cell and memory-related layouts demonstrate significant reductions in layout turnaround time while maintaining area efficiency and rule compliance comparable to hand-crafted designs. The framework integrates smoothly into existing back-end environments with minimal disruption, making it practical for everyday use.
In the longer term, this structured approach also establishes a solid foundation for future AI-assisted layout automation grounded in real layout expertise and reusable transformation logic.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionUltra‑large digital designs at advanced technology nodes now include billions of instances, deep hierarchies, and highly complex clocking and interconnect structures, making traditional flat static timing analysis (STA) increasingly impractical. Timing closure cycles have become prohibitively long due to excessive runtime, memory demands, and limited visibility across block boundaries. This work presents a scalable and silicon‑correlated timing analysis and closure methodology tailored for these massive designs. The approach unifies Boundary Model, Context‑Aware Timing, and Advanced Multi‑Input Switching (AMIS) to deliver accurate hierarchical timing without requiring design flattening. Boundary Model preserves interface logic by abstracting internal logic in order to reduce design size, while Context‑Aware Timing ensures that each block's interface timing remains aligned with top‑level requirements, regardless of differences introduced by independently developed constraints. AMIS effectively addresses inherent optimism in single‑input switching by capturing simultaneous switching effects. Combined with Tempus ECO and Certus, the methodology enables fast, localized optimization and predictable convergence. Applied to a multi‑billion‑instance design across 150+ timing views, the flow demonstrates 3.5×–5× runtime improvement, 50–65% memory reduction, and strong correlation with flat STA. This scalable methodology provides a robust foundation for achieving efficient timing closure in emerging high‑performance systems.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAs the architectural complexity of SSD controllers increases, the demand for "Shift-Left" in security firmware (FW) development has become a critical necessity. Conventional firmware development heavily relies on hardware prototypes or FPGA environments, which often suffer from limited debuggability and late-stage availability, particularly for "hardware-enforced" security features.
Engineering Presentation
Design
EDA
Systems
DescriptionModern Dual‑SIM‑Dual‑Standby (DSDS) designs must synchronize signaling, paging, and bearer scheduling to avoid QoS loss caused by radio conflicts, SIM‑switch delays, or RF‑resource contention, which makes it challenging to preserve quality‑of‑service (QoS) across simultaneous networks. Dynamic Voltage and Frequency Scaling (DVFS) is a widely adopted power‑management technique in today's processors and system‑on‑chip (SoC) architectures. By dynamically adjusting supply voltage and clock frequency in response to real‑time workload demands, DVFS cuts both dynamic and static power while still meeting performance targets. In dual‑SIM devices that share radio resources between two protocol stacks, activity on one stack can induce a "blackout" on the other, leading to data inactivity and QoS degradation. The blackout duration is directly governed by the DVFS settings applied to the processor and bus while the opposite stack is executing its tasks. Existing state‑of‑the‑art methods focus on monolithic workloads, thermal‑throttling avoidance, or overall energy minimization, and they ignore the inter‑stack blackout overhead that is unique to dual‑SIM phones. This paper proposes a service‑energy‑based, intelligent DVFS framework that jointly optimizes the DVFS levels for both stacks by minimizing a weighted sum of throughput loss and total energy consumption. Simulation results show that the algorithm achieves a smooth, monotonic trade‑off, allowing a seamless shift between reduced blackout time and lower energy use.
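The joint optimization described above can be sketched as an exhaustive search over pairs of DVFS levels. Everything concrete here is an assumption for illustration: the level table, the simple models (execution time as work divided by frequency, dynamic energy proportional to V² times work), the blackout model, and the names `LEVELS`, `cost`, and `best_levels`. The abstract does not give the paper's actual formulation.

```python
from itertools import product

# Hypothetical DVFS levels as (frequency in GHz, voltage in V); values are illustrative.
LEVELS = [(0.6, 0.60), (1.0, 0.75), (1.5, 0.90), (2.0, 1.05)]


def cost(levels, work, rate, alpha):
    """Weighted sum of throughput loss and energy for one pair of levels.

    levels: (level for stack 0, level for stack 1)
    work:   giga-cycles each stack must execute
    rate:   data rate each stack loses per second of blackout
    alpha:  weight in [0, 1]; 1 favors throughput, 0 favors energy
    """
    (f0, v0), (f1, v1) = levels
    t0, t1 = work[0] / f0, work[1] / f1
    # Assumed blackout model: while one stack executes, the other loses data.
    throughput_loss = rate[1] * t0 + rate[0] * t1
    # Assumed dynamic-energy model with normalized capacitance: E ~ V^2 * cycles.
    energy = v0 ** 2 * work[0] + v1 ** 2 * work[1]
    return alpha * throughput_loss + (1 - alpha) * energy


def best_levels(work, rate, alpha):
    """Pick the pair of DVFS levels that minimizes the weighted cost."""
    return min(product(LEVELS, repeat=2), key=lambda p: cost(p, work, rate, alpha))


for alpha in (0.0, 0.5, 1.0):
    print(alpha, best_levels(work=(2.0, 1.0), rate=(3.0, 1.0), alpha=alpha))
```

Sweeping `alpha` from 0 to 1 moves the chosen levels from the lowest-voltage pair toward the highest-frequency pair, which is the kind of trade-off between energy and blackout time that the abstract refers to.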
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionFor large designs with multiple levels of hierarchy, predicting LVS convergence is very challenging when there are power and ground shorts and opens, especially across hierarchies, because debugging them can be time-consuming and iterative. It is also difficult to identify all integration issues at the next hierarchy level if any partition is LVS dirty. Traditional techniques that enable hierarchical LVS runs, such as black-box LVS and destructive cleanup, have inherent disadvantages. To overcome these shortcomings, we describe a simple and efficient solution that can be implemented at any stage of a partition's APR flow, from floorplan to route. In this solution, critical interface details such as the power-ground network, global clock and signal routes pushed down to partitions, hard macros, and partition ports are retained and made LVS clean by an automated flow. The solution was widely used across many partitions in multiple projects to identify and fix LVS issues accurately and early, and it enabled timely SoC tape-outs. It also enabled seamless integration of late-arriving partitions, because all integration risks were identified and fixed upfront.
Engineering Presentation
Design
EDA
Systems
DescriptionThis paper presents a CISC processor architecture designed for modern CMOS technology, featuring a compact and flexible datapath capable of operating efficiently on arbitrary data formats. The architecture supports high-level languages as well as graphics, signal processing, memory, and I/O workloads within a unified execution model.
The processor has evolved over several decades, from a TTL-based minicomputer to multiple generations of CMOS microprocessors implemented in different fabrication nodes. It has been deployed in a wide range of commercial applications worldwide. A recent dual-core implementation has been silicon-verified in 65 nm technology, and a corresponding 22 nm design has been functionally validated on FPGA.
Measured results demonstrate high energy efficiency, strong code density, and extensive support for specialized processing and peripheral control. Because much of the system functionality is implemented in writable microcode rather than fixed hardware, the architecture is well suited as a control processor in universal SoCs targeting mid-volume IoT/OT devices, where custom SoC development is impractical.
The core can also function as a processing element in AI accelerators, with microcode distributed across clustered PEs to implement individual DNN layers.
Engineering Special Session
AI
Design
EDA
Systems
DescriptionCorvicAI introduces an Intelligence Composition Platform designed to free enterprises from the complexity and rigidity of traditional AI data pipelines. Modern organizations struggle with fragmented tooling, extensive plumbing, and specialized expertise requirements across ingestion, parsing, feature engineering, retrieval, and reasoning. Corvic replaces this with a unified, agentic architecture that enables rapid deployment of zero-hallucination GenAI applications built on complex, multimodal data. The platform operationalizes three core laws—Agentic Data Transformation, Multimodal Retrieval Fabric, and Adaptive Agentic Orchestration—to ensure trustworthy data access, contextual understanding, and dynamic reasoning. By integrating graph AI, semantic search, multimodal retrieval, and explainable agentic workflows, Corvic delivers higher precision, full traceability, and dramatically faster time-to-value, reducing development cycles from months to days. The platform powers use cases across compliance, analytics, customer support, research, and predictive intelligence, offering enterprises a scalable path to reliable AI adoption.
People
Engineering Presentation
AI
Design
EDA
DescriptionHigh-quality standard-cell libraries are essential for reliable SoC design and sign-off. However, modern libraries span many views, PVT corners, and modeling formats, making validation increasingly complex, fragmented, and time-consuming when using ad-hoc or per-view approaches.
This work presents a structured quality-checking methodology based on Siemens Solido Crosscheck for systematic validation of standard-cell collaterals. The approach classifies checks by configuration complexity, ranging from simple Boolean validations to fully parametric checks requiring characterization-specific inputs. Native Crosscheck checks are leveraged and fine-tuned in collaboration with design and characterization engineers, while custom checks are introduced to address requirements not covered by native capabilities. The methodology supports both early, incremental validation during Liberty generation and comprehensive quality assessment once all views are available, enabling unified cross-view analysis.
The proposed framework enables scalable, parallel validation across thousands of configurations, improves early detection of modeling and consistency issues, provides structured reporting and traceability, and reduces late-stage rework, ultimately accelerating time-to-market while improving release confidence and overall library quality.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAdvanced‑node libraries must meet aggressive PPA targets across wide voltage ranges, including near‑/ultra‑low‑voltage operation where process variation is amplified and non‑Gaussian. Traditional sensitivity‑based LVF (SBA) fails to retain accuracy under these conditions, while Monte‑Carlo (MC) is impractical for full‑library, multi‑PVT production. We present a unified methodology that integrates
(i) ML‑based LVF to capture moments and sigma at ULV with production‑viable runtime,
(ii) MIS‑aware characterization to model simultaneous input switching for accurate gate delays—particularly on hold‑critical paths, and
(iii) in‑flow EM reliability generation and validation.
Across combinational and sequential cells, our ML‑LVF correlates closely with MC (comparable accuracy at a fraction of runtime), and MIS modeling closes the GLS vs. standalone delay gap observed on short paths. The flow also delivers signoff‑quality Liberty views with PrimeTime‑consistent timing/power correlation and enables early reliability checks. Overall, the methodology accelerates library turnaround while improving accuracy and reducing downstream timing/reliability closure risk for advanced‑node designs.
Analyst Presentation
DescriptionWe will examine the financial performance and key business metrics of the EDA industry through 2025, the consolidation of Engineering Software, and the material technical and market trends and requirements that are affecting the industry's business performance and strategies, including our AI/ML Phenomenonology. Among the trends and catalysts, we will again examine the progression of semiconductor R&D spending and how the market values of Cadence and Synopsys have evolved. Lastly, we will provide our updated financial projections for the EDA industry for 2026.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionGeneral Sparse Matrix-Vector Multiplication (SpMV) is a fundamental kernel in scientific computing, graph analysis, and deep learning. However, fully exploiting the performance of CUDA cores requires systematic optimization of SpMV. In this paper, we propose OmniSpMV, a high-performance SpMV library on CUDA cores with multiple optimizations, including data-locality-aware reordering, memory-efficient tiling, sparsity-aware load balancing, and a highly optimized SpMV kernel. Extensive experiments on various NVIDIA GPU architectures with 2715 matrices show that OmniSpMV achieves significant average speedups: 2.58x (up to 6.39x) on RTX 4090, 1.91x (up to 4.92x) on A800, and 1.78x (up to 8.21x) on H100 over cuSPARSE, which is designed for CUDA cores, and 1.93x (up to 15.67x) on RTX 4090, 1.70x (up to 15.02x) on A800, and 1.75x (up to 14.53x) on H100 over DASP, which is designed for Tensor cores.
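For readers unfamiliar with the kernel, the following is a minimal reference SpMV over the standard CSR (compressed sparse row) format. It is a plain sequential baseline for illustration, not OmniSpMV's method; the reordering, tiling, and load-balancing optimizations named in the abstract are not shown, and the function name `spmv_csr` is an assumption.

```python
def spmv_csr(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    indptr[r]:indptr[r + 1] delimits row r's entries in indices/data;
    indices holds column ids and data holds the matching nonzero values.
    """
    num_rows = len(indptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Rows differ in length, which is why GPU kernels need load balancing.
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
indptr = [0, 2, 3, 5]
indices = [0, 2, 2, 0, 1]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_csr(indptr, indices, data, [1.0, 2.0, 3.0])
print(y)  # [7.0, 9.0, 14.0]
```

The indirect access `x[indices[k]]` is the irregular memory pattern that locality-aware reordering and tiling aim to improve.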
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionDesign Rule Checking (DRC) is a critical yet computation-intensive stage in modern very-large-scale integration (VLSI) design. As processes evolve, the high computational cost of DRC has become a significant bottleneck for design efficiency. Current acceleration approaches do not fully leverage the parallelism of DRC, resulting in limited performance gains. To address this challenge, we propose AccDRC, an FPGA-accelerated DRC based on software–hardware co-design. On the software side, we design a cell-aware partitioning strategy with a data preparation and task encapsulation mechanism, which reorganizes layouts into balanced task units tailored for FPGA processing. On the hardware side, we implement an acceleration architecture consisting of a locality-preserving data-loading module, a unified and reconfigurable check core, and a sparse result writeback module. This architecture exploits DRC's locality, structural commonality across rules, and sparse violation outcomes, enabling high-throughput dataflow execution with multi-level parallelism. Experimental results show that AccDRC achieves a 522.11x to 1071.03x speedup over the CPU-based DRC tool KLayout, and a 9.62x to 26.07x speedup over the state-of-the-art GPU-based DRC tool OpenDRC.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionSRAM-based compute-in-memory (CIM) offers high computational density and energy efficiency for deep neural network (DNN) accelerators, but its limited capacity causes on/off-chip data movement overhead for large DNN models. Existing CIM accelerator studies typically assume that DNN models fit entirely on-chip, leaving efficient dataflow design largely unexplored. This paper introduces AccelCIM, a systematic dataflow exploration framework for SRAM CIM accelerators, which addresses two key limitations of prior work. (1) It formulates a systematic dataflow design space spanning CIM macro configurations and macro-array organizations. (2) It introduces rigorous design evaluation using cycle-accurate architectural simulation and post-layout PPA analysis. We conduct an extensive design space exploration and apply AccelCIM to representative LLM applications, providing practical insights for the principled design of CIM accelerators.
People
Work in Progress
DescriptionAccurate dynamic voltage drop (DVD) analysis is increasingly critical in advanced nodes, where higher densities, lower voltages, and complex packaging exacerbate power delivery challenges. Traditional simulations are computationally expensive and typically performed late in the design cycle, risking costly redesigns. We propose a lightweight ML-based DVD prediction model using multi-scale CNNs, fusion layers, and skip connections to capture spatial and hierarchical power grid features. The model uniquely incorporates package and grid inductance and supports both vectorless and vector-based inputs. Evaluated on a 16 nm RISC-V core, it achieves 80-86 % accuracy with 5 mV tolerance and over 25,000× faster runtime than commercial tools.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionSynthetic Aperture Radar (SAR), benefiting from its all-weather, all-time, and high-resolution characteristics, has become a vital tool in earth observation.
A typical application of SAR first completes the imaging process of echo data and then conducts subsequent analysis.
As AI excels in image classification and recognition, integrating AI with SAR has garnered significant interest.
To leverage the acceleration capabilities of existing AI frameworks, particularly graph optimizations, while also reducing development complexity, developers strive to streamline the entire SAR application within these frameworks.
However, a key performance challenge arises: SAR imaging differs greatly from typical AI tasks in both the requirements of data layouts and the composition of operators, causing graph optimizations to fail in effectively accelerating the SAR imaging process.
The fundamental reason is the failure of the two key graph optimizations: layout transformation and operator splitting.
In this paper, we first address the issue of significant transpose overhead introduced by the layout transformation strategy.
To this end, we propose a novel layout transformation strategy based on pseudo-transposition operators, which can completely eliminate transpose overhead while maintaining memory access efficiency.
Subsequently, we design a tailored splitting strategy based on movable reverse-order operators to compensate for existing frameworks' lack of capability in handling the core FFT operators.
The proposed strategies were implemented in PyTorch and LiteRT, yielding a speedup of 3.45x for the SAR imaging process.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAs technology nodes scale into the angstrom regime, design complexity has surged due to stringent performance, power, and area (PPA) targets and the need to manage diverse cell libraries across multiple PVT corners. Achieving optimal cell selection, macro placement, and layer distribution under these conditions is highly challenging, making manual tuning impractical given tight timing closure, power budgets, VT proliferation, and floorplan sensitivity. Automation is now essential to enable systematic design space exploration and maintain competitiveness.
This paper presents an AI-driven approach to automate floorplanning and VT optimization for macro-dominated, high-frequency CPU designs in sub-nanometer nodes. The proposed solution integrates VT-Optimizer (VT-Opt), which tunes multi-VT flows by generating adaptive VT recipes and validating them through full-flow runs, and FP-Opt, which explores alternative floorplans by adjusting bounding boxes, aspect ratios, and macro placements while targeting utilization, congestion, timing, and power. Optimal configurations are selected based on full-flow evaluations, followed by PPA optimization to achieve best-in-class results.
The methodology significantly reduces design turnaround time, mitigates risk, and improves PPA, demonstrating its effectiveness for next-generation CPU designs.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis paper describes the development of a full-chip gate-level emulation environment that enables DFT verification and the successful application of Debug STIL vectors generated by the Tessent DFT tool in the Veloce emulation environment, significantly improving debuggability and shortening the product verification time.
People
Engineering Presentation
Design
EDA
DescriptionAs hardware designs grow in complexity, Formal Verification (FV) often hits scalability walls, resulting in "bounded" proofs rather than full closure. Deep state-space exploration is limited by computational resources. Overcoming this typically requires "helper" assertions (invariants), but manually identifying and writing these is labor-intensive and requires deep micro-architectural knowledge. In this presentation, we share findings from using AI/ML and GenFV to generate helpers that achieve full proofs and reduce verification cycle time.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionHPC chips for datacenter, server, AI-training, and AI-inference applications are Multi-Billion Big-Die designs that require huge compute and disk resources for simulation. Depending on the design architecture, repeated logic subsystems make up 30% to 80% of the full design. Full flat simulations on such designs take ~8k to 11k cores, with a peak machine memory requirement of 200TB to 300TB and a disk space requirement of ~50TB. Hence, running full flat simulations is not practical. Full-chip EMIR checks are important to ensure that blocks and subsystems are well connected to RDL/bumps and that signoff is within margin. These checks can be done at an abstract level, with the current and parasitics of blocks and subsystems modeled using a Reduced Order Model (ROM). With ROM, a significant reduction in runtime and disk space is observed while accuracy is maintained.
Work in Progress
DescriptionFunctional fault grading is essential for post-silicon validation in scan-limited designs, but simulation cost grows with fault and pattern volume. We propose an optimization-driven methodology combining static fault optimization, fault clustering, design pruning, stimulus grading, dynamic fault optimization, and parallel fault simulation management. By leveraging both structural design information and stimulus behavior, the flow eliminates redundant computation by performing optimized simulations tailored to fault activation and propagation potential. Applied to production-grade NAND Flash and DRAM designs, the approach achieved 3.7x and 2x reductions in simulation time, respectively, while preserving test coverage. The methodology is broadly applicable to logic and SoC designs, offering scalable fault grading without reliance on scan structures.
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionWith the growing deployment of large language models (LLMs), LLM inference cost has become a key challenge. Pruning techniques that introduce sparsity into weight matrices can accelerate inference. However, maintaining model quality typically limits pruning to moderate unstructured sparsity (around 50%). At these sparsity levels, none of the existing GPU kernels for sparse matrix multiplication (SpMM) can outperform their dense counterparts. This paper proposes an efficient GPU inference method for LLMs with moderate sparsity. We propose a three-layer matrix storage format comprising: (i) a Sparse-TC layer enabling sparse tensor cores to accelerate SpMM; (ii) a Slot-Filling layer using parallel differential distance for matrix compression while supporting low-cost on-chip decoding; (iii) a lightweight Residual layer ensuring correct SpMM computation. Building on this format, we design an SpMM kernel that jointly utilizes sparse tensor cores and CUDA cores. This design enables an efficient execution pipeline and overlaps on-chip computation with memory access. Evaluations show that our work is the first to outperform dense matrix multiplication on modern GPUs equipped with high-bandwidth memory (HBM). It achieves up to 1.64× kernel-level speedup over SpInfer (EuroSys'25, Best Paper) and up to 1.41× end-to-end speedup over FlashLLM (VLDB'24). Our source code: https://anonymous.4open.science/r/spmm-32E1.
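The split between a tensor-core-friendly layer and a residual layer can be illustrated with a small sketch. It assumes the 2:4 pattern used by NVIDIA sparse tensor cores (at most two nonzeros in every group of four); the abstract does not state the pattern, and the function name `split_2to4` is hypothetical. The Slot-Filling compression layer is not modeled here.

```python
def split_2to4(row):
    """Split one matrix row into a 2:4-compliant part and a residual.

    Assumption: the tensor-core layer holds at most 2 nonzeros per group of 4;
    every other nonzero is moved to the residual so that kept + residual == row.
    """
    kept = [0.0] * len(row)
    residual = [0.0] * len(row)
    for start in range(0, len(row), 4):
        group = range(start, min(start + 4, len(row)))
        nonzero = [i for i in group if row[i] != 0.0]
        # Keep the two largest-magnitude entries; the choice is illustrative.
        nonzero.sort(key=lambda i: abs(row[i]), reverse=True)
        for rank, i in enumerate(nonzero):
            if rank < 2:
                kept[i] = row[i]
            else:
                residual[i] = row[i]
    return kept, residual


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


row = [1.0, 0.0, 2.0, 3.0, 0.0, 0.0, 4.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kept, residual = split_2to4(row)
# The two partial products add up to the product with the original row.
print(dot(kept, x) + dot(residual, x), dot(row, x))
```

The point of the sketch is only that the residual restores exactness: whatever does not fit the structured pattern is computed separately and added back.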
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionComplex applications (such as AI and HPC) and advanced technology nodes are driving a dramatic increase in design scale and complexity. Building a robust power delivery network (PDN) by spending large amounts of design resources is no longer practical, and higher local cell density causes worse voltage drop (IR) violations than ever before. At the same time, process margins keep shrinking, so the influence of voltage drop on timing is becoming increasingly prominent. It is therefore crucial to establish the correlation between IR and timing to avoid over-fixing.
In the traditional flow:
§ IR violations are fixed by modifying thousands of violating instances (victims). This approach has a low payoff: it takes multiple manual iterations to resolve the violations because their root causes (aggressors) are not identified.
§ The impact on timing cannot be considered during the manual IR fix cycles. An instance with an IR issue is usually also timing-sensitive, so a hard tradeoff between IR and timing is eventually required.
§ Separate IR and timing ECOs are time-consuming and costly, and they have always posed significant challenges for design closure within tape-out timelines.
To minimize iteration counts and design changes and to avoid timing degradation or chip failure, we propose an automated, late-stage, timing-aware IR fix methodology. The joint IR-ECO flow ensures seamless communication between the golden ECO tool PrimeClosure and the golden IR/EM analysis tool RedHawk-SC, enables accurate identification of root-cause aggressors, provides immediate feedback on the voltage impact of ECO operations proposed by PrimeClosure, and automatically fixes IR violations without degrading timing.
In our design, IR-ECO fixes over 60% of signoff-stage IR violations within a few hours with little or no negative impact on timing, which is very important for timing-critical blocks. Addressing aggressors minimizes the number of instances changed. Fewer design changes and fewer iterations save several weeks of iteration time compared with the traditional flow, offering better PPA and significantly shortening the time to tape-out.
Keywords: IR-ECO, IR-timing closure, aggressor analytics, timing-aware IR fix
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAchieving rigorous 6-6.5σ yield targets in L1 caches and other memory IP components requires extensive high-sigma verification. However, traditional brute-force Monte Carlo simulations and manually piloted GUI-based analyses are too slow and inefficient to verify large numbers of cells within tight project timelines, limiting analysis scope and compromising critical worst-case corner identification.
This paper introduces an innovative AI-powered batch flow that automates and significantly accelerates high-sigma memory cache verification, ensuring accurate, full-coverage analysis across all cells and components. The proposed methodology leverages adaptive AI to rapidly identify worst-case corners and eliminates unnecessary high-sigma runs by only deploying brute-force accurate verification on the identified critical corners.
In an example testcase, this AI-powered batch flow achieved full-coverage verification across 726 netlists in 38 hours and 20 minutes, averaging 2,514 simulations per job, representing an average 3.25x runtime speedup and 2.4x simulation speedup per job over the previous GUI-based method. Enhanced modeling and yield solver algorithms also contributed to reducing per-job simulations, and the flow automation and parallelization reduced engineering effort and overall runtime, making full coverage verification feasible within production timelines and improving disk space management. This scalable, AI-driven flow provides a fast, accurate, and comprehensive solution for memory IP validation challenges.
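To give a sense of why brute-force Monte Carlo is impractical at these targets, the sketch below converts a sigma level into a failure probability and a rough sample count. It assumes a one-sided Gaussian tail and a rule of thumb of about ten observed failures per estimate; both are illustrative assumptions, not details from the abstract.

```python
import math


def tail_probability(sigma: float) -> float:
    """One-sided tail probability P(X > sigma) of a standard normal variable."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


def brute_force_samples(sigma: float, expected_failures: float = 10.0) -> float:
    """Samples needed to observe about `expected_failures` failures on average.

    The default of 10 failures is an assumed rule of thumb for a usable estimate.
    """
    return expected_failures / tail_probability(sigma)


for sigma in (6.0, 6.5):
    p = tail_probability(sigma)
    n = brute_force_samples(sigma)
    print(f"{sigma} sigma: p_fail ~ {p:.2e}, brute-force samples ~ {n:.2e}")
```

Under these assumptions a single 6σ check already needs on the order of ten billion simulations, which is the gap that adaptive worst-case search is meant to close.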
Engineering Presentation
AI
EDA
Systems
DescriptionThe accelerating scale and intricacy of system‑on‑chip (SoC) designs demand innovative methodologies for register‑transfer‑level (RTL) development and verification to sustain aggressive time‑to‑market objectives. Typical chip development starts with architectural planning and diagrammatic representations for visual reference (.vsdx). This is followed by the generation of block- and subsystem-level specifications for use in RTL development and verification. However, the conventional flow relies on manual abstraction of these downstream specifications, which is highly resource-intensive, error‑prone, and susceptible to inconsistencies with the canonical design intent. The proposed work introduces an automated utility that extracts structural information from high‑level architecture diagrams (.vsdm) and synthesizes a comprehensive JSON metadata repository. The generated metadata serves as a single source of truth for automatically deriving all downstream collaterals required for RTL coding and verification, thereby eliminating manual transcription errors and ensuring alignment with the golden architecture reference. The utility also incorporates a sanity checker that validates and corrects architecture diagrams to conform with design guidelines. Integration of the metadata with existing verification frameworks further streamlines test‑bench generation and compliance checking. Experimental deployment demonstrates a substantial reduction in specification‑related issue tickets and a measurable acceleration of RTL development and verification cycles, confirming the utility's effectiveness in reducing human effort and error.
Engineering Presentation
EDA
Systems
DescriptionWith the advent of the AI era, the amount of data handled by SoCs has skyrocketed and power consumption has inevitably increased. To counter this, various low-power hardware architectures have been added to SoCs. As a consequence, 1) "X" values that cause bugs at the silicon level have appeared, and 2) verification complexity and TAT have increased in order to detect "X". An emulator is a general solution for reducing verification time, but it degrades the quality of power verification since it can only express two logical states ("0" and "1"). In this study, we propose a high-performance 4-state ("0", "1", "X", and "Z") RTL power-aware simulation methodology leveraging an emulator. This methodology focuses on the precise modeling of UPF (Unified Power Format) intent and 4-state logic, which are vital for capturing realistic power behavior. We demonstrate seamless verification of UPF-driven power scenarios at hardware speeds while maintaining the logic granularity of 4-state simulation. Finally, by applying the methodology to the latest SoC project, we confirm a 35x performance gain compared to conventional simulation.
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
DescriptionTo meet the increasing computational demands of large language models (LLMs), multi-systolic-array architectures are widely adopted by AI accelerators. However, compared with single-array designs, hardware faults in multi-array accelerators propagate in more complex ways and cause more severe reliability degradation. Directly deploying single-array fault-tolerant mechanisms on multi-array architectures introduces significant overhead. To address this limitation, we propose a Reliability-Aware Scheduling framework that jointly considers hardware-level reliability variations, operator-level error sensitivity, and inter-operator data dependencies. The scheduler maps insensitive operators to faulty arrays and reserves reliable arrays for critical computations, significantly improving reliability with minimal performance loss.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAs semiconductor technology scales to smaller nodes, achieving area-optimized and low-power integrated circuits requires advanced design architectures coupled with precise power estimation. Accurate power analysis depends on NLPM data embedded in Liberty models, which guide critical architectural trade-offs.
This paper investigates internal power modelling challenges in modern System-on-Chip designs employing Combo-IO architectures, where multiple communication standards such as GPIO, I2C, and I3C are integrated within a single physical block. The absence of dedicated control signals causes multiple receiver outputs to switch simultaneously, leading to significant inaccuracies in conventional internal energy characterization.
A comparative study of two characterization methodologies, path-based vs design-based, is presented for Combo-IO, with experiments driven by Siemens Characterizer. Liberty models generated by these methods are validated through power analysis.
Results demonstrate that path-based characterization leads to a scalable overestimation of internal power, with errors ranging from 2X to 4X for designs containing two to four active outputs. The proposed design-based methodology accurately distributes energy dissipation across concurrently switching outputs, enabling reliable power estimation for advanced low-power ICs.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAnalyzing circuits for susceptibility to electro-migration has been a requirement for integrated circuit design since the 1960's. In fact, one of the basic equations for electro-migration, Black's Equation, was formalized in 1969 and is still in use today. In related work, Statistical Electro-migration budgeting (SEB) was developed by DEC in the 1980's. Both these equations are the foundation of the electro-migration analyses used in modern EMIR tools. One very important feature of these equations is that they are exponentially dependent upon the wire temperature and failures start to occur with increasing probability at higher temperatures. This fact is important because with the introduction of new device structures such as FINFET, GAA, and possibly CFET, the power per unit area and resulting heating generated by these structures has been increasing dramatically. In addition, the power dissipated by high resistance wires and vias is also increasing as these wires need to carry higher RMS currents to support higher frequency operation of the circuits. Therefore, it is becoming increasingly critical to model the temperature very accurately for small device structures to avoid EM failures associated with high device and wire heating. By utilizing the latest Finite Element solvers (FEM) available from various CAD vendors, accurate wire temperatures can be computed by solving the heat conduction equations at the device and wire nanoscale thereby providing very useful feedback to designers regarding thermal and EM SEB FIT reliability risk for unusual corner cases that require additional validation using these more accurate simulations.
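The temperature dependence mentioned above can be made concrete with Black's equation, MTTF = A · J^(-n) · exp(Ea / (k·T)). The sketch below is illustrative only: the activation energy (0.9 eV), the current-density exponent (n = 2), and the prefactor are assumed values chosen for the example, not parameters from this abstract.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf(current_density, temp_k, prefactor=1.0, n=2.0, ea_ev=0.9):
    """Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T)).

    prefactor, n, and ea_ev are assumed, material-dependent values.
    """
    return prefactor * current_density ** (-n) * math.exp(
        ea_ev / (BOLTZMANN_EV_PER_K * temp_k)
    )


def lifetime_ratio(temp_cool_k, temp_hot_k, ea_ev=0.9):
    """MTTF at the cooler temperature divided by MTTF at the hotter one."""
    return math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_cool_k - 1.0 / temp_hot_k)
    )


# Compare a wire at 105 C with the same wire self-heated to 125 C.
ratio = lifetime_ratio(378.15, 398.15)
print(f"Lifetime at 105 C is about {ratio:.1f}x the lifetime at 125 C")
```

With these assumed parameters, a 20 K error in the modeled wire temperature changes the predicted lifetime by roughly a factor of four, which is why the accuracy of the thermal solve matters for EM signoff.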
Engineering Presentation
Design
EDA
DescriptionIt has been observed for several years that transistors of different threshold voltage (VT) types on one chip may exhibit different silicon-to-SPICE (S2S) gaps. This gap creates a risk of yield loss when the industry signs off with SS (all VTs at SS) and FF (all VTs at FF) corners, because these corners omit the case where one VT is fast while another is slow. This timing risk is called VT skew. Currently, there is no efficient approach that can accurately quantify VT skew risk. To solve this issue, this study introduces an analytical metric and a Monte Carlo (MC)-based solution to accurately model the impact of VT skew. The analytical metric provides a quick early assessment, and the MC-based solution has been developed inside an EDA tool.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionSpiking Neural Networks (SNNs) promise exceptional energy efficiency for neuromorphic computing through event-driven processing. However, unlocking their full potential requires navigating a complex, strongly coupled design space of network topology and temporal neural encoding. Existing SNN Neural Architecture Search (NAS) frameworks typically decouple these dimensions, focusing exclusively on topology search while relying on manual, fixed encoding schemes. This limitation leaves a vast portion of the design space unexplored, resulting in sub-optimal energy-accuracy trade-offs. To bridge this gap, we present ACE-NAS, the first zero-cost NAS framework that automates the co-design of SNN architecture and encoding mechanisms. Addressing the challenge of "encoding-blind" proxies, we introduce Jacob_cov, a novel Jacobian-based spectral metric that efficiently quantifies the temporal discriminability of different encoding schemes without training. ACE-NAS integrates Jacob_cov with structural proxies (ZiCo) into a hardware-aware, multi-objective evolutionary search strategy. On CIFAR-10, ACE-NAS achieves 92.95% accuracy, effectively identifying Pareto-optimal designs that balance high performance with minimal spike activity. By automating the joint optimization of structure and dynamics, ACE-NAS delivers an orders-of-magnitude reduction compared to standard training-based NAS and a 7× speedup over state-of-the-art efficient SNN-NAS methods (e.g., AutoSNN).
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAutomotive Ethernet ICs require robust design and extensive high-sigma verification to meet stringent "zero defect" quality standards, with every device parameter required to yield at least 5σ. Robustness by design is time-consuming and introduces PPA and EMI tradeoffs that must be mitigated by parameter optimization. Traditional brute-force Monte Carlo simulations are impractical for high-sigma verification, and manually trimming and optimizing parameters is time-intensive.
This paper presents an innovative AI-powered methodology that significantly accelerates high-sigma verification, trimming, and optimization for automotive ethernet solutions. This approach integrates adaptive AI for efficient parameter robustness screening and ML-based sensitivity analysis for identifying parameters that most contribute to variation, while intelligent, built-in trimming and optimization in-the-loop automatically mitigate sensitive parameters to bring critical measurements back into spec.
In the example Bandgap Reference to Low-Dropout (LDO) circuit, the proposed methodology achieves strong correlation with silicon production data, while providing >64,000x speedup over brute-force methods. This solution accurately predicts high-sigma failures, automatically determines trim codes, and automatically optimizes targets with minimal simulations, leading to substantial speedups, reduced resource requirements, and faster time-to-market.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe RETIME-IP is a critical component in OMNIVISION's image sensor SoC, used in automotive, 360-degree surround-view systems. This IP is responsible for receiving raw data from upstream image sources, processing it, and storing it into SRAM for consumption by downstream modules. Given that this SoC is classified as ISO-26262 ASIL-D, the RETIME-IP must also adhere to ASIL-D requirements, in accordance with the ASIL decomposition guidelines set forth in ISO 26262-10:2018, Clause 11. While the RETIME-IP was initially designed with several safety-mechanisms (SMs) to meet the necessary fault coverage, subsequent fault injection analysis identified the need for additional SMs to reach the ASIL-D targets.
This paper outlines a novel methodology using a tool certified per ISO 26262-8, Clause 11. Static-analysis fault classification is used to find the safety gaps, so that SMs are added only in the areas that require them, without otherwise increasing the hardware. An intuitive mapping from the tool's fault classification to the ISO 26262 classification is employed. In addition, to reduce the number of faults that need to be simulated, constant analysis is employed to mark blocked faults as safe. As a result, the ASIL-D SPFM target of 99% was achieved in far fewer iterations, saving project time compared to similar SoC projects.
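For readers less familiar with the metric, the single-point fault metric is SPFM = 1 − Σ(λ_SPF + λ_RF) / Σλ over the safety-related hardware. The sketch below uses a simplified model in which every non-safe fault not caught by a safety mechanism counts as single-point or residual; the block names, FIT rates, safe fractions, and diagnostic coverage values are hypothetical and do not come from the RETIME-IP analysis.

```python
from dataclasses import dataclass


@dataclass
class FaultBucket:
    name: str
    fit: float            # total failure rate of the block, in FIT
    safe_fraction: float  # share of faults classified safe (e.g. blocked faults)
    dc: float             # diagnostic coverage of the block's safety mechanism


def spfm(buckets):
    """SPFM = 1 - (single-point + residual failure rate) / total failure rate."""
    total = sum(b.fit for b in buckets)
    uncovered = sum(b.fit * (1.0 - b.safe_fraction) * (1.0 - b.dc) for b in buckets)
    return 1.0 - uncovered / total


# Hypothetical numbers: the control block is the safety gap.
before = [
    FaultBucket("datapath", fit=100.0, safe_fraction=0.2, dc=0.99),
    FaultBucket("control", fit=50.0, safe_fraction=0.1, dc=0.90),
]
# Strengthen the safety mechanism only where the gap was found.
after = [
    before[0],
    FaultBucket("control", fit=50.0, safe_fraction=0.1, dc=0.995),
]
print(f"SPFM before: {spfm(before):.4f}, after: {spfm(after):.4f}")
```

The example mirrors the idea in the text: raising coverage in one targeted block is enough to cross the 99% threshold, and a higher safe fraction (for instance from marking blocked faults as safe) reduces the uncovered failure rate in the same way.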
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionNear-memory processing (NMP) is a promising way to overcome the memory wall in large language models (LLMs). However, dataflow optimization in NMP is fundamentally constrained, as existing analyses cannot efficiently handle the new distributed vault/channel organization. We propose the Remote-Access-Free (RAF) dataflow, which uses tensor rotation to abstract this distributed organization and eliminate all intra-operator remote access. On top of RAF, we apply analytical optimization to minimize intra-operator local access, thereby achieving the intra-operator communication lower bound. We then introduce data partitioning that removes inter-operator remote access and enables operator fusion to minimize inter-operator local access, so that the inter-operator communication lower bound is also reached. Experimental results show that RAF reduces energy by 50.4%, 39.0%, and 37.8%, and delivers speedups of 3.98×, 1.72×, and 1.57× over IANUS, H^2LLM, and OptiPIM, respectively.
Student
Student
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
DescriptionVision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20–30 Hz, current VLA models typically operate at only 3–5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy.
To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55× improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware.
Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
People
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionLarge Foundation Model (LFM) inference is both memory- and compute-intensive, traditionally relying on GPUs. However, their limited availability and high cost have driven growing interest in high-performance general-purpose CPUs, particularly emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. While these architectures improve bandwidth and data locality, they introduce severe thermal constraints and non-uniform cache latencies caused by 3D Networks-on-Chip (NoC). Efficient management of thread migration and V/f scaling remains challenging due to diverse LFM kernels and hardware heterogeneity. We propose AILFM, an Active Imitation Learning (AIL)–based scheduling framework that learns near-optimal thermal-aware policies from Oracle demonstrations with minimal runtime overhead. AILFM captures both core-level performance variations and kernel-specific behavior, maintaining thermal safety while maximizing inference efficiency. Extensive experiments demonstrate that AILFM outperforms state-of-the-art baselines and generalizes across diverse LFM workloads.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAccurate switching activity is essential for effective low‑power optimization in modern digital flows; however, activity fidelity rapidly degrades as RTL progresses through synthesis, mapping, and physical implementation. This loss of annotation forces downstream tools to depend on incomplete or vectorless estimates, diminishing the quality of power optimization. Glitch power, which can account for 20-30% of total dynamic power at advanced nodes, typically requires costly delay‑based Gate Level Simulation (GLS), resulting in prohibitive runtime and compute overhead. Conventional activity regeneration further relies on RTL or GLS teams, creating schedule bottlenecks and widening PPA gaps.
The proposed Flash Replay flow methodology eliminates these limitations by enabling rapid, on‑demand activity refresh using the Joules RTL Power tool, with minimal turnaround‑time impact on the full digital flow. It generates glitch‑aware power data significantly faster than GLS and reconstructs complete switching activity for vectorless or partially stimulated designs using lightweight seed vectors or prior RTL activity of the same design. Seamless integration with Genus and Innovus enables consistent, delay‑based activity generation throughout the RTL‑to‑signoff flow.
Results demonstrate up to 10% total power reduction when Flash Replay is incorporated into existing implementation flows, along with up to 25X speedup versus simulation‑based replay. These improvements establish Flash Replay as a practical, scalable solution for restoring activity accuracy and enhancing low‑power implementation efficiency.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionRegister-transfer-level (RTL) simulation is commonly accelerated through multi-threading and event-driven techniques.
However, when these techniques are combined, severe load imbalance arises: in event-driven execution, circuit regions exhibit highly uneven activity, leaving many threads idle with very low utilization.
Existing multi-threaded RTL partitioning models often optimize full-cycle execution and ignore the dynamic activity variation that dominates event-driven workloads.
We present PGSIM, a multi-threaded event-driven RTL simulator that incorporates runtime activity information into its partitioning model.
Our model augments hypergraph weights with measured activity frequency and inter-block activation correlation, which is efficiently estimated through a lightweight MinHash profiler.
By aligning thread assignment with true activity patterns, PGSIM substantially reduces idle time and improves utilization.
Across experiments on large benchmarks, PGSIM achieves 20–30% higher performance than its activity-unaware version, delivers a simulation speed of 200-400 kHz and up to 8.9× speedup over Verilator.
These results demonstrate that runtime activity modeling is essential for scalable multi-threaded event-driven simulation.
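The abstract mentions estimating inter-block activation correlation with a lightweight MinHash profiler. As a rough illustration of the general MinHash idea only (not PGSIM's profiler), the sketch below estimates the Jaccard similarity of two blocks' activation sets from short signatures. The block names, the traces, and the choice of 64 hash functions are assumptions made for this example.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return a MinHash signature: the minimum hash value of the set
    under each of `num_hashes` seeded hash functions."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates |A & B| / |A | B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical activation traces: the cycles in which each block was active.
block_a = set(range(0, 1000, 2))   # active on even cycles
block_b = set(range(0, 1000, 4))   # subset of block_a, true Jaccard = 0.5
block_c = set(range(1, 1000, 2))   # active on odd cycles, disjoint from block_a

sig_a = minhash_signature(block_a)
sig_b = minhash_signature(block_b)
sig_c = minhash_signature(block_c)

print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))
```

Blocks whose estimated similarity is high tend to fire together, which is the kind of signal a partitioner could fold into hypergraph edge weights; how PGSIM actually does this is not specified in the abstract.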
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionReducing the number of Gaussian-tile pairs is one of the most promising approaches to improve 3D Gaussian Splatting (3D-GS) rendering speed on GPUs. However, differences in importance among Gaussian-tile pairs have not been considered in previous work. In this paper, we propose AdaGScale, a novel viewpoint-adaptive Gaussian scaling technique for reducing the number of Gaussian-tile pairs. AdaGScale is based on the observation that peripheral tiles located far from the Gaussian center contribute negligibly to pixel color accumulation. This suggests an opportunity for reducing the number of Gaussian-tile pairs based on color contribution. AdaGScale efficiently estimates the color contribution in the peripheral region of each Gaussian during a preprocessing stage and adaptively scales its size based on the peripheral score. As a result, Gaussians with lower importance intersect with fewer tiles during the intersection test, which improves rendering speed while maintaining image quality. The adjusted size is used only for the tile intersection test, and the original size is retained during color accumulation to preserve visual fidelity. Experimental results show that AdaGScale achieves up to 9.47× speedup over the original 3D-GS on a GPU, with only about 0.2 dB degradation in PSNR.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionDistributed training of Graph Neural Networks (GNNs) is hindered by communication overhead from exchanging embeddings and gradients. Existing quantization methods mitigate this cost by using shared bit widths at the layer or group level, followed by node-level uniform quantization, but such coarse granularity cannot fully exploit redundancy in long-tailed communication distributions. Entropy compression is well suited to these distributions, but static entropy compression schemes become less effective when training induces untracked distribution shifts. To address this issue, AdaHuff-GNN, a convergency-aware adaptive Huffman compression framework, is proposed to apply entropy compression to reduce communication cost in distributed GNN training. Within AdaHuff-GNN, loss-triggered codebook reconstruction and coarse-to-fine adaptive selection of codebook sizes jointly produce stage-wise codebooks that track distribution shifts. Binning-assisted clustering and conflict-free decoding reduce codec overhead on the communication critical path. Experiments show that AdaHuff-GNN reduces communication volume by 2.87× and shortens training epoch time by 37.5% over state-of-the-art methods without degrading model accuracy.
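To make the case for entropy coding on long-tailed distributions concrete, here is a minimal sketch of a standard static Huffman codebook built over quantized values. It is not AdaHuff-GNN's adaptive scheme: the sample distribution and the 2-bit fixed-width baseline are assumptions chosen for illustration.

```python
import heapq
from collections import Counter


def build_huffman_codebook(symbols):
    """Build a prefix-free codebook {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, partial_codebook); the unique
    # tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical long-tailed stream of 2-bit quantized values.
values = [0] * 80 + [1] * 10 + [2] * 6 + [3] * 4
codebook = build_huffman_codebook(values)

huffman_bits = sum(len(codebook[v]) for v in values)
fixed_bits = 2 * len(values)
print(f"fixed-width: {fixed_bits} bits, Huffman: {huffman_bits} bits")
```

A codebook like this is only efficient while the value distribution stays close to the one it was built from, which is the motivation the abstract gives for rebuilding codebooks as training shifts the distribution.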
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionWe propose Adana, a hardware-software co-design that enables efficient low-bit group-wise quantization for LLMs based on the adaptive nonuniform asymmetric numeric type. First, Adana introduces a novel numeric type that precisely captures the nonuniformity and asymmetry of data within small groups. In addition, an approximate metric for quantization error is proposed to facilitate efficient implementation of online adaptive activation quantization. Finally, a dedicated LLM acceleration microarchitecture is developed for Adana. Compared to state-of-the-art designs, Adana achieves 1.42×–2.10× speedups and 18.9%–48.5% power savings on LLMs, while maintaining superior accuracy.
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionSpiking neural networks (SNNs) represent a promising solution for emerging architectures due to high sparsity and low power consumption. Spiking transformers extend these advantages to attention-based modeling and show strong potential for energy-efficient applications. However, their practical deployment remains difficult. Spiking transformers demand substantial computation and memory access because of long token sequences and heavy workloads in feed-forward networks. Moreover, existing sparse optimization methods provide limited benefit for spiking transformers since they either focus only on unstructured bit-level sparsity or require model retraining. This work presents ADAPT, an algorithm–hardware co-design that exploits the hierarchical sparsity of spiking transformers. At the algorithm level, we propose Adaptive Token Pruning (ATP), a training-free method that evaluates token diversity and spike activity to remove redundant tokens. At the hardware level, we design a hierarchical-sparse accelerator that introduces a block-sparsity compression format and a pattern processing unit to leverage repeated bit patterns inside non-zero blocks. Experiments on multiple spiking transformer models demonstrate that ATP prunes 50% of tokens with negligible accuracy loss. The ADAPT accelerator achieves average speedups of 3.1x and 4.2x over Prosperity, a state-of-the-art SNN accelerator, and a GPU, respectively, while reducing energy by 1.9x and 149.4x. These results show that exploiting hierarchical sparsity with algorithm–hardware co-design enables efficient deployment of spiking transformers.
People
Engineering Presentation
AI
Design
EDA
DescriptionAs chip complexity reaches tens of billions of transistors, standard cells are duplicated millions of times, making fast and accurate high‑sigma verification essential. Fixed‑sigma approaches are no longer viable. Each cell requires a flexible sigma target to avoid redesign and to enable yield‑based repurposing rather than discarding.
We present a fully automated, AI‑driven methodology that verifies an entire standard cell library in a single pass. Adaptive AI tailors verification jobs to individual cells, while Additive AI iteratively refines models across multiple PVT and input‑vector conditions, delivering brute‑force‑level accuracy without additional simulations.
The Worst‑Case Yield Solver (WCYS) identifies near‑target and worst‑case samples in the Solido PVTMC Verifier, builds predictive models, and triggers a reinforcement‑learning‑based High‑Sigma Verifier only when sign‑off criteria demand. Dynamic Constraint Yield Sign‑off (DCYS) automatically sweeps failing‑cell constraints until they pass, reducing the number of failed cells by 2.4×.
Compared with traditional scaled Monte‑Carlo methods, the proposed flow achieves a ten‑fold speedup, a tighter confidence interval (6.000 [5.900–6.098] vs 6.180 [5.067–7.146]), and a 60% reduction in cells failing between 0.5 V and 0.6 V. This AI‑enabled verification provides 6‑sigma‑level confidence across massive libraries while substantially reducing verification time and resource consumption.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern semiconductor design demands managing diverse constraints (timing, power, CDC, RDC, UPF) across multiple EDA flows. Manual writing, scripting, and validation of these constraints are slow, error-prone, and inconsistent.
We propose a multi-agent, debate-driven system, orchestrated by GitHub Copilot, to transform this process into intelligent automation. The system orchestrates constraint generation and validation through a structured refinement loop: Generate, Critique, Debate, Human-Feedback, and Evolution. The system becomes an expert that understands our silicon context by integrating human expertise into an adaptive feedback loop, rapidly prioritizing corrective actions based on real-time violations. Advanced context engineering methodologies like RAG, RLHF, and MCP ensure consistent context management for the agents and effective constraint generation across diverse tool environments.
This approach accelerates signoff, reduces errors, and scales seamlessly across heterogeneous constraint domains, paving the way for intent-driven, intelligent design methodologies.
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionDeep neural network (DNN) accelerators have been investigated for efficient inference. Underscaling the supply voltage of MACs can effectively reduce power dissipation. In this paper, we propose a bit-width adjustment circuit (denoted as ADA) for arbitrary MAC units under aggressive voltage underscaling. At the cost of 16% area overhead, the proposed ADA enables a MAC array to achieve zero accuracy loss. While preserving accuracy, the MAC array equipped with ADA achieves up to 48% power reduction compared to that without ADA. Furthermore, we propose ADA-Plus to optimize the MACs in output stationary systolic arrays, which reduces the area of ADA by 23%.
Engineering Presentation
AI
EDA
Systems
DescriptionThis work presents a reinforcement learning (RL) approach to optimize System-on-Chip (SoC) performance by automatically tuning Quality of Service (QoS) knobs. As SoC complexity increases, manual optimization of diverse QoS parameters across varying scenarios becomes intractable. To address this, we develop a simulation environment mimicking the SoC, where an RL agent explores the design space to identify optimal knob settings. We employ Deep Q-Networks (DQN) and enhancements (e.g., Double DQN, Dueling Networks) within a Markov Decision Process (MDP) framework, defining states, actions, and reward systems based on key metrics like throughput, latency, and power. Techniques such as Temporal Difference (TD) learning and Prioritized Experience Replay (PER) improve sample efficiency and convergence. Our framework rapidly evaluates configurations under different SoC architectures. Experimental results demonstrate the RL model's ability to discover high-performance, power-efficient QoS settings through training, significantly improving over manual methods. This work highlights RL's potential in automating SoC design optimization, offering a scalable solution for complex multi-master systems and paving the way for future design automation research.
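The temporal-difference update underlying the DQN family can be shown in tabular form. The sketch below uses plain tabular Q-learning in place of a neural network, so it is a simplification rather than the presented framework; the state names, the two knob actions, and the reward value are hypothetical.

```python
def td_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning TD update in place and return the TD error.

    target = reward + gamma * max_a' Q(next_state, a')
    Q(state, action) += alpha * (target - Q(state, action))
    """
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical Q-table: two congestion states, two QoS knob actions.
q_table = {
    "congested": {"raise_priority": 0.0, "lower_priority": 0.0},
    "balanced": {"raise_priority": 1.0, "lower_priority": 0.0},
}

# One transition: raising priority in the congested state earned a reward
# of 0.5 (for example, a throughput gain) and led to the balanced state.
error = td_update(q_table, "congested", "raise_priority", 0.5, "balanced")
print("TD error:", error)
print("updated Q:", q_table["congested"]["raise_priority"])
```

Prioritized Experience Replay, as named in the abstract, samples stored transitions with probability related to the magnitude of this TD error, which is why the function returns it.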
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
Description3D NAND flash is widely used from PCs to data centers, but its major weakness is data reliability. As data is retained, charge leakage causes read errors. Vendors mitigate this using the read retry mechanism, which iteratively applies predefined read reference voltage combinations, called read retry parameters (RRPs), stored in a read retry table (RRT). However, the limited number of RRPs can lead to RRP burnout, a condition where all read retry parameter entries have been exhausted without successful data recovery.
In such cases, the data becomes undecodable and the corresponding block is marked as bad, degrading both the capacity and lifetime of the NAND flash. Experiments show that even a small portion of recovery failures may lead to significant capacity loss. We propose an Adaptive RRT Extension framework, integrating two key mechanisms: WL-Aware Interpolated Retry (WIR) and Multi-Dimensional Guided Retry (MDGR), to expand recovery capability. Experiments show that our method recovers up to 99% of failed pages and preserves up to 99% and 84% of usable capacity under 12- and 30-month retention, respectively, where the baseline retains only 90% or reaches end-of-life.
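The read retry loop and the RRP burnout condition described above can be sketched as follows. The page model (a single voltage offset with a decode tolerance) and the midpoint-based table extension are assumptions made purely for illustration; they are not the WIR or MDGR mechanisms, whose details the abstract does not give.

```python
def read_with_retry(read_page, retry_table):
    """Try each read retry parameter (RRP) in order.

    Returns (data, index) on success, or (None, None) on RRP burnout,
    meaning every table entry was tried without a successful decode.
    """
    for index, rrp in enumerate(retry_table):
        ok, data = read_page(rrp)
        if ok:
            return data, index
    return None, None


def make_page(true_offset_mv, tolerance_mv=10):
    """Hypothetical page: decodes only if the applied offset is close enough."""
    def read_page(rrp_offset_mv):
        ok = abs(rrp_offset_mv - true_offset_mv) <= tolerance_mv
        return ok, ("payload" if ok else None)
    return read_page


def interpolate_table(retry_table):
    """Hypothetical extension: insert a midpoint between adjacent entries."""
    extended = []
    for a, b in zip(retry_table, retry_table[1:]):
        extended.extend([a, (a + b) / 2])
    extended.append(retry_table[-1])
    return extended


rrt = [0, -40, -80, -120]              # assumed vendor table, offsets in mV
page = make_page(true_offset_mv=-60)   # charge loss shifted the optimum to -60 mV

baseline = read_with_retry(page, rrt)
extended = read_with_retry(page, interpolate_table(rrt))
print("baseline:", baseline)
print("extended:", extended)
```

In this toy setup the baseline table burns out because its nearest entries are 20 mV away from the optimum, while the extended table contains an entry that decodes the page.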
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionProviding deterministic timing guarantees, beyond merely optimizing the accuracy-latency trade-off, is a mandatory yet unaddressed challenge for ultra-low-power spiking neural networks (SNNs) in resource-limited, safety-critical systems. In this paper, we propose RT-SNN, a novel adaptive SNN method that integrates a system-level scheduling framework for SNN-based multi-object detection and, for the first time, co-optimizes inference accuracy while providing these strict timing guarantees. RT-SNN orchestrates SNN inference at both frame and timestep levels, introducing flexible timestep control and a novel membrane potential reuse mechanism to enhance accuracy without increasing latency. Evaluations on the KITTI dataset show that RT-SNN significantly improves accuracy and energy efficiency compared to both state-of-the-art SNNs and traditional ANNs. Furthermore, a case study on a ROS-based F1/10 autonomous vehicle testbed demonstrates its real-time efficacy, validating its practical deployment in safety-critical systems.
People
Additional Meeting
DescriptionJoin Accellera for a dynamic luncheon exploring how artificial intelligence is reshaping the standards landscape for design and verification. As AI and machine learning are increasingly integrated into EDA workflows—from design generation to verification and system-level optimization—new challenges are emerging around interoperability, data exchange, trust, and reproducibility.
This session will highlight key areas where standards can enable scalable innovation, including AI-driven methodologies, training data availability, safety and security considerations. Attendees will gain forward-looking insights from industry experts and have the opportunity to share perspectives on where standardization is most needed across industry and academia.
Seating is limited and will be offered on a first-come, first-served basis.
Additional Meeting
DescriptionSemiconductors have enabled today’s AI, and continued progress in AI will remain tightly coupled to advances in semiconductor hardware. This hardware is becoming increasingly specialized and heterogeneous, from accelerator-rich system-on-chip (SoC) devices to chiplet-based systems in advanced packages. At the same time, growing design complexity threatens this progress, as engineering and verification effort continues to rise and the semiconductor industry faces a persistent workforce gap. Agentic AI is expected to help address this challenge by boosting design productivity through a new class of EDA tools. Yet major obstacles remain. Hardware design is intrinsically different from software development, and many of the high-quality design artifacts, datasets, and workflows needed to train, evaluate, and validate AI-based methods are proprietary. Building on my experience developing ESP, an open-source platform for heterogeneous SoC design, this talk makes the case that open-source hardware is a key enabler for effective AI for EDA. By providing shareable artifacts, reproducible evaluation, and collaboration at scale across universities, government labs, and industry, open-source hardware can accelerate the development and validation of agentic AI workflows, while fostering curriculum innovation to train the next generation of hardware engineers and EDA tool developers.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionA Daisy Chain (DC) die replicates the final package design mechanically to evaluate Board Level Reliability (BLR) and die-package interface. Typically the daisy chain die is created manually with some automation enabled for signal or ground routes. In case of package iterations, this causes rework especially depending on the ground bump pattern.
The paper looks at a semi-automated method to make the daisy chain connections thereby reducing manual effort and an automated way to confirm the integrity of the package-die connection. Using the proposed method the daisy chain die creation has been reduced from a couple of days to a couple of hours. The integrated checks also help catch real shorts if present on the package removing the need for manual review.
Engineering Presentation
EDA
Security
DescriptionAbstract
• Modern SOCs typically have dozens of clock domains, multi‑phase clock‑generation structures and dividers, complex reset-architecture and numerous exception constraints. This information must be handled consistently from RTL design to verification (DV) testbenches, DFT clock/reset architecture and finally to the timing constraints used for physical‑design implementation. Current practices rely on manual extraction of clocking, constraints and reset data from multiple documents, leading to long turnaround times, human error, and fragmented management of inconsistent data.
• This paper presents an advanced automation framework that is built on a Template Clock & Reset Management (CRM) document defined for a given family of devices. It generates timing constraints as well as the verification testcases from the CRM document, leveraging a lightweight parsing engine. The framework automatically generates UVM‑compatible DV testbench components (clock drivers, reset sequencers) and timing constraint files (source and generated clock definitions, case analysis, exceptions like multicycle path, false path, clock groups, etc.), ensuring strict traceability between design intent, verification, test and implementation. The automation framework also generates the DFT clock planner directly from the functional clock planner, deriving the test frequency, test clock domains, shaping ICG creation and DFT override test structures, which ensures error-free DFT RTL for higher quality, fewer iterations, and faster pre-silicon verification.
Motivation
• Increase productivity and reduce human error – By consolidating all clock, reset and exception related metadata (clock definitions, MCP entries, exception rules, reset ordering) into a single, version‑controlled document, the same data can be consumed automatically by scripts that generate DV test‑bench components, DFT clocking planner and PD constraint files. This eliminates repetitive iterations and ensures that any change is reflected everywhere instantly.
• Establish a single source of truth – A centrally maintained document provides traceability and auditability. Designers can track who modified a clock parameter, when it was changed, and what downstream artefacts were regenerated, supporting robust change‑impact analysis. It also enforces strict alignment between functional and DFT clock domain plans by streamlining DFT clocking to ensure DFT clocks follow related functional clocks. This in turn helps reduce DFT timing overheads.
• Accelerate timing closure – Automated generation of SDC constraints, DFT case analysis and exception scripts shorten the PD flow, while automatically produced UVM‑compatible clock drivers and reset sequencers speed up verification. Early detection of mismatches between design intent and implementation reduces the risk of late‑stage bugs.
• Demonstrated maturity and extensibility – The tool has already evolved through multiple releases (e.g., addition of clock‑mode columns, MCP value fields, reset‑polarity handling, and auto‑generation flags), highlighting continuous improvement and real‑world applicability of the template. Once the updates are made, all downstream collateral can be regenerated using scripts, which saves substantial effort across all domains while maintaining high quality standards.
• Facilitate collaboration across domains – Because the document template is editable by both verification and physical‑design engineers, it encourages cross‑team communication and aligns expectations early in the design cycle.
• Create a foundation for future enhancements – With the data model in place, extensions such as power‑aware clock gating, dynamic frequency scaling, or integration with a broader "SOC timing data hub" can be added with minimal effort.
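The idea of generating timing constraints from one clock table can be sketched as below. The record fields (`name`, `port`, `period_ns`, `source_pin`, `pin`, `divide_by`) and the example clocks are hypothetical and do not reflect the actual CRM template; only the emitted `create_clock` and `create_generated_clock` commands follow standard SDC syntax.

```python
def sdc_from_clock_table(clocks):
    """Turn clock records into SDC command strings.

    A record with a 'divide_by' key is treated as a generated (divided)
    clock; any other record is treated as a primary clock on a port.
    """
    lines = []
    for clk in clocks:
        if "divide_by" in clk:
            lines.append(
                f"create_generated_clock -name {clk['name']} "
                f"-source [get_pins {clk['source_pin']}] "
                f"-divide_by {clk['divide_by']} [get_pins {clk['pin']}]"
            )
        else:
            lines.append(
                f"create_clock -name {clk['name']} "
                f"-period {clk['period_ns']} [get_ports {clk['port']}]"
            )
    return lines


# Hypothetical rows, as they might be parsed from a clock planning sheet.
clock_table = [
    {"name": "sys_clk", "port": "SYS_CLK", "period_ns": 10.0},
    {
        "name": "sys_clk_div2",
        "source_pin": "u_div/clk_in",
        "pin": "u_div/clk_out",
        "divide_by": 2,
    },
]

sdc_lines = sdc_from_clock_table(clock_table)
print("\n".join(sdc_lines))
```

Because the same table could also drive testbench clock drivers or DFT clock plans, a change to one row propagates to every generated artefact, which is the single-source-of-truth argument made in the bullets above.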
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionMotivation: SoCs often require and integrate third-party and TI internal hard IPs such as PHYs, PLLs, and dividers. An integration and specification document is usually shared that describes the intricacies to be handled during implementation. However, the integration requirements are often scattered across multiple pages or embedded in figures and tables. The current manual approach of extracting, reviewing, and translating these requirements into actionable checklists is time-consuming and error-prone. The requirements can also be scattered across different documents. For example, the ODP and efuse controller are expected to be in the same power domain, but the requirement is present in the fusefarm document and not in the ODP document. Such requirements need to be combined and might be missed unless the user has read all documents.
Solution: An AI-based automated approach in two phases. Phase 1 leverages AI-based document parsing to extract and categorize integration requirements into multiple categories (floorplan, IR, timing, placement, etc.) while automatically correlating cross-document dependencies. Phase 2 employs AI to translate these requirements into automated TCL-based checkers and, where feasible, to provide TCL-based fix scripts.
Impact: The Physical Design (PD) team is the most significant beneficiary of this AI-powered solution, as they are directly responsible for implementing the majority of integration requirements extracted from specification documents. Experimental validation across various design integration documents has shown time savings of up to 2 weeks and zero requirement omissions, making it a scalable solution.
Future scope: While the current system requires prompt refinement and domain expertise for evolving IP types and documentation formats, future enhancements will incorporate automated prompt optimization to minimize specialist dependencies.
Engineering Presentation
AI
Design
EDA
DescriptionMotivation: SoCs often integrate third-party and TI-internal hard IPs such as PHYs, PLLs, and dividers. Each IP typically comes with an integration and specification document describing the intricacies that must be handled during implementation. However, the integration requirements are often scattered across multiple pages or embedded in figures and tables. The current manual approach of extracting, reviewing, and translating these requirements into actionable checklists is time-consuming and error-prone. Requirements can also be scattered across different documents. For example, the ODP and the efuse controller are expected to be in the same power domain, but this requirement appears in the fusefarm document and not in the ODP document. Such requirements need to be combined and may be missed unless the user has read all the documents.
Solution: An AI-based automated approach in two phases. Phase 1 uses AI-based document parsing to extract integration requirements and sort them into categories (floorplan, IR, timing, placement, etc.) while automatically correlating cross-document dependencies. Phase 2 employs AI to translate these requirements into automated TCL-based checkers and, where feasible, TCL-based fix scripts.
Impact: The Physical Design (PD) team is the main beneficiary of this solution, as it is directly responsible for implementing the majority of the integration requirements extracted from specification documents. Experimental validation across various design integration documents has shown time savings of up to 2 weeks and zero omitted requirements, making it a scalable solution.
Future scope: While the current system requires prompt refinement and domain expertise for evolving IP types and documentation formats, future enhancements will incorporate automated prompt optimization to minimize specialist dependencies.
Research Special Session
Systems
DescriptionAdvanced packaging is deployed with increasing frequency to support heterogeneous integration, chiplet design strategies, and enablement of large SOCs. For AI applications in particular, it is the only way to integrate the large compute silicon area and high memory capacity and bandwidth required for competitive products. This talk will review the usage of advanced packaging technologies in AMD products across various families, with an emphasis on AI products such as the AMD Instinct™ MI300, which blends 3D hybrid bond logic die stacking, 2.5D interposers, and HBM into a single SOC. Trends observed in AI products are then used to project the evolution of advanced packaging technologies, including finer pitch scaling, continued growth in module size, and increasing power delivery and thermal demands. These trends will finally be connected to design and test strategies. As cost and complexity grow, creating manufacturing flows that allow accurate screening at multiple steps in the process to ensure good yield and quality is an essential part of 'Design-Technology Co-optimization' (DTCO). Successful products will require careful consideration of test methods in every chiplet and at every step of the manufacturing flow.
Research Special Session
EDA
DescriptionIn this talk, AIM Photonics presents advances in heterogeneous integration and 3D packaging within 300 mm silicon photonics at the Albany Nanotech Complex. We highlight results from laser integration on PICs using 2.5D and monolithic bonding, offering dense integration capabilities. The talk also covers 3D co-packaged optics (CPO) architectures that lead to large gains in optical interconnect efficiency and bandwidth. Finally, we discuss unique features of our design automation offerings, including feature-rich photonic PDKs and Assembly Design Kit, enabling efficient, accurate circuit design and packaging. These innovations address key challenges in manufacturability, scalability, and yield for next-generation photonic AI systems.
Research Special Session
AI
DescriptionAI-driven hardware code generation, across both frontier LLMs and emerging agentic workflows, is advancing rapidly, yet it still trails the maturity of AI-driven software development. High-quality benchmarks are essential for catalyzing progress, revealing capability gaps, and guiding the development of new methodologies. The Comprehensive Verilog Design Problems (CVDP) benchmark, now adopted by the Silicon Integration Initiative's LLM Benchmarking Coalition, provides a foundation for evaluating these capabilities within the broader silicon design community. We survey the evolving landscape of language models and agentic systems for hardware design, examining their performance on RTL generation, testbench creation, debugging, and related tasks through the lens of CVDP. We highlight emerging trends, persistent challenges, and key opportunities for advancing trustworthy and automation-ready hardware design flows.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIncorrect clock domain synchronization can cause critical functional bugs in SoCs, potentially requiring costly fixes. When low-power requirements with multiple switchable power domains are involved, CDC verification becomes more complex. In UPF-based low-power designs, isolation cells at power domain boundaries preserve signal integrity but may cause resynchronization issues if their enable signals are improperly synchronized, leading to metastability or failures. The presence of isolation cells, retention registers, and their enable signals increases the number of asynchronous paths to analyze.
This paper proposes a new flow based on 'formally proven' smart waiver CDC, run jointly with LPV, that drastically reduces the number of CDC and RDC violations to analyze in a typical ultra-low-power SoC. Normally, the number of violations can be several hundred, but this flow safely waives those proved true through formal verification. The remaining violations can then be analyzed to uncover real resynchronization bugs.
In the presented test case, the number of violations to analyze was reduced by about 35%, enabling the discovery of bugs that might have been lost among the initial hundreds of violation reports. Formal verification remains the optimal method for CDC checking, and Jasper CDC with LPV enables power-aware CDC verification using UPF.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionMacro placement is a critical stage in physical design, directly impacting the quality and performance of VLSI circuits. We propose a reinforcement learning (RL)-based macro placement framework that integrates proven design practices through a design-practice-embedded action mask, including peripheral placement, dead space avoidance, and proximal placement for macros with shared design hierarchy and physical footprint. Unlike previous RL-based methods, our method uses macro clusters, formed according to design hierarchy and physical footprint, as the basic placement units, which reduces placement steps and consequently accelerates convergence and improves runtime. The proposed framework also introduces a novel compaction method to minimize wasted area caused by grid granularity, and jointly optimizes macro cluster location and tiling pattern for more effective exploration. Experimental results show that our approach achieves expert-level placement quality and consistently outperforms three leading commercial macro placers on industrial designs, with reduced turnaround time. On public benchmarks, our method achieves up to 25.37% and 39.51% improvements in worst negative slack (WNS) and total negative slack (TNS), respectively, over five state-of-the-art (SOTA) placers. These results demonstrate the effectiveness and real-world applicability of our RL-based framework, paving the way for further advancements in physical design automation.
Engineering Presentation
Design
EDA
Systems
DescriptionSilicon Lifecycle Monitoring (SLM) addresses lifetime silicon health and reliability challenges from early life to end of life. While ATPG-based structural tests are very effective at detecting manufacturing defects, many other complex failure mechanisms manifest during operational lifetime and require in-field debug and continuous monitoring. The root causes of such failures include latent manufacturing defects, IR drop, leakage currents, thermal effects, soft errors, and aging, many of which fall into the category of silent data corruption (SDC). Functional failures often manifest as program misbehavior that impacts large-scale deployment of highly connected systems such as web servers, 5G base stations, automotive systems, and AI/ML processors.
We describe an SLM solution that uses an embedded trace system as its foundation. It collects time-stamped data to analyze the trajectory of transactions involving the CPU, memory, I/Os, peripherals, and other subsystems. Trace data is highly compressible (up to 700×) and can easily be stored in-system or offloaded for analysis, which executes in less than 165 milliseconds for these programs. Results on an industrial benchmark (for RISC-V CPUs) and an AI inference engine silicon case study demonstrate the value and effectiveness of this approach.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionQuantum computing holds the potential to revolutionize numerous fields, yet the practical execution of quantum circuits depends on an efficient compilation process, including the placement of logical qubits on a quantum chip. This placement is a hard combinatorial problem that often defeats traditional heuristics and manual optimization. We present QAgent, a novel multi-agent reinforcement learning (RL) framework that autonomously optimizes logical qubit layouts on a quantum processor. In QAgent, one agent is assigned to each logical qubit, and agents jointly learn placement policies that minimize circuit execution cost. To address the challenges of sparse rewards and credit assignment, we propose the Breakthrough Return Bonus (BRB), a dynamic reward shaping mechanism that encourages meaningful layout improvements and accelerates convergence. Extensive experiments on diverse quantum circuit benchmarks show that QAgent reduces execution costs by up to 53.9% compared to leading approaches, and significantly enhances circuit success rates. Ablation studies confirm that BRB is essential for stable training and effective policy optimization. These results demonstrate the promise of AI-driven, workload-aware placement for advancing fault-tolerant quantum computation.
People
Work in Progress
DescriptionWe benchmark a new agentic AI approach for chip design and verification on the CVDP dataset. Our results show that multi-agent orchestration, custom system prompts, and improved tool-use guidance enable our agent to debug and complete substantially more complex hardware verification problems. On relevant CVDP problems, our agent demonstrates a relative advantage of 11.6% vs. CVDP state of the art performance and 15.3% vs. Claude Code. Beyond aggregate metrics, we analyze execution traces and find evidence of enhanced reasoning and debugging capabilities, as well as important limitations of under- and over-specified test harnesses.
Additional Meeting
DescriptionJuly 27th, 4:00 PM – 5:00 PM
EDA is on the brink of its most disruptive transformation yet. Agentic AI—multi-agent systems capable of autonomous, goal-driven decision-making—promises to move beyond assistive point solutions toward orchestrating entire design flows. From synthesis and verification to physical design and system optimization, these agents could redefine productivity and complexity management. Yet as autonomy grows, critical questions emerge: Who—or what—is really in control?
This panel will explore the technical, organizational, and ethical tensions surrounding agentic AI adoption. We’ll examine infrastructure and governance: securing sensitive IP and maintaining compliance when autonomous agents operate across hybrid environments. We’ll discuss interoperability and standards: for example, Si2’s AI/ML EDA ontology provides a standardized representation of design terminology and relationships, bridging graph representations, ML datasets, and tool semantics to enable agents to reason about workflows. Such efforts highlight both the promise and the challenge of enabling multi-vendor collaboration without fragmenting the ecosystem. Finally, we’ll confront trust and accountability: how do we reason about correctness, explainability, and oversight when design decisions emerge from interacting agents rather than deterministic scripts?
The discussion will surface points of disagreement across the ecosystem, including how much autonomy is desirable, which risks are acceptable, and whether current infrastructures and standards are sufficient. Attendees will leave with a clearer understanding of the readiness of agentic AI, practical tensions it raises, and open questions shaping its future adoption.
Watch for Si2 DAC updates!
DAC Pavilion Panel
DescriptionEDA is on the brink of its most disruptive transformation yet. Agentic AI—multi-agent systems capable of autonomous, goal-driven decision-making—promises to move beyond assistive point solutions toward orchestrating entire design flows. From synthesis and verification to physical design and system optimization, these agents could redefine productivity and complexity management. Yet as autonomy grows, critical questions emerge: Who—or what—is really in control?
This panel will explore the technical, organizational, and ethical tensions surrounding agentic AI adoption. We'll examine infrastructure and governance: securing sensitive IP and maintaining compliance when autonomous agents operate across hybrid environments. We'll discuss interoperability and standards: for example, Si2's AI/ML EDA ontology provides a standardized representation of design terminology and relationships, bridging graph representations, ML datasets, and tool semantics to enable agents to reason about workflows. Such efforts highlight both the promise and the challenge of enabling multi-vendor collaboration without fragmenting the ecosystem. Finally, we'll confront trust and accountability: how do we reason about correctness, explainability, and oversight when design decisions emerge from interacting agents rather than deterministic scripts?
The discussion will surface points of disagreement across the ecosystem, including how much autonomy is desirable, which risks are acceptable, and whether current infrastructures and standards are sufficient. Attendees will leave with a clearer understanding of the readiness of agentic AI, practical tensions it raises, and open questions shaping its future adoption.
Key Takeaways
- Diverse industry perspectives on the promise and perils of agentic AI in EDA
- Emerging standards and collaboration needs, including ontology-driven approaches, for interoperability
- Critical risk areas—including security, governance, and trust—under debate
- Points of contention and open questions shaping adoption strategies
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe deployment of advanced high-speed protocols (e.g., PCIe Gen6/CXL) in multi-billion-gate designs creates a severe "Logical-Physical Gap," where traditional implementation flows suffer from unmanageable runtimes and convergence volatility. This work introduces a novel Cognitive Agentic Framework that orchestrates multi-objective partitioning and semantic-driven floorplanning to bridge this divide. Unlike conventional methodologies relying on blind min-cut algorithms or static placement rules, our approach utilizes HGNN to predict physical feasibility during logical partitioning, optimizing for size balance, boundary timing, and data stream integrity. Furthermore, we propose Dataflow-Driven Vision-Language Reinforced Semantic Floorplanning, where a Large Multimodal Model functions as a "visual perception sensor" within a physics-based force field. This allows the agent to iteratively "see" congestion hotspots and dynamically adjust repulsion forces to enforce semantic consistency. By embedding physical awareness into the architectural phase, our framework reduces the design iteration cycle, achieving superior PPA metrics and deterministic timing closure compared to standard industrial flows.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionHigh-Level Synthesis (HLS) code is often developed using a subset of C/C++, along with specialized libraries. This allows HLS engineers to make use of standard software engineering tools and techniques during the development and debug phase of their project. This talk introduces the topic of time-travel debugging, whereby the state of a design can be examined by going backwards and forwards in time. This approach saves huge amounts of effort, allowing the root cause of bugs, including challenging concurrency bugs, to be found with ease and a new codebase to be understood rapidly.
Beyond this, we will look at how the latest AI models can be connected to time-travel debugging tools to enable Agentic Time-Travel Debugging. The agents verify hypotheses and predictions against authoritative records of what actually happened during execution, closing the loop so that they can iterate and improve their diagnosis until they reach a verified root cause for issues found during development and simulation. It is also possible to generate waveforms from the recordings and report detailed root-cause explanations, using waveforms and language that can explain complex software issues to hardware developers generating or using the HLS models.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionChiplet accelerators offer a scalable solution for LLM inference with reduced manufacturing costs. However, the design space exploration (DSE) of chiplet accelerators is challenging due to the complex design space. Prior black-box Bayesian optimization (BO) solutions lack domain knowledge, limiting their effectiveness. In this work, we propose AgenticDSE, a multi-agent DSE framework that incorporates the collaboration of three LLM agents, i.e., exploration orchestrator, architecture analyst, and optimization engineer. This multi-agent framework enhances exploration efficiency through analysis-driven design space refinement and phase-wise design exploration using ensemble surrogate modeling. Over the same number of explorations, AgenticDSE achieves up to 36.9% reduction of average distance to the real Pareto front, and 66% increase in diversity of explored Pareto front compared to state-of-the-art DSE solutions. Additionally, it offers scalable performance with 46x and 18x reduction in input and output token consumption compared to prior LLM-based solutions.
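The "average distance to the real Pareto front" figure quoted above can be illustrated with a small sketch. This is not the AgenticDSE implementation: the two objectives, the sample points, and the use of plain Euclidean distance are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical 2-objective design point, e.g. (latency, cost); both are minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted for stable output."""
    unique = set(points)
    return sorted(p for p in unique if not any(dominates(q, p) for q in unique))


def avg_distance_to_front(explored: List[Point], reference: List[Point]) -> float:
    """For each reference-front point, take the distance to the closest explored
    point, then average. Euclidean distance is an assumption of this sketch."""
    return sum(min(math.dist(r, e) for e in explored) for r in reference) / len(reference)


designs = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]
front = pareto_front(designs)
print(front)                                      # non-dominated designs
print(avg_distance_to_front([(2.0, 3.0)], front)) # smaller means closer to the front
```

A lower average distance means the explored designs sit closer to the reference front; a value of zero means every reference point was found.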
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionCo-packaged optics (CPO) presents unique physical design challenges: numerous small blocks with irregular shapes to accommodate photonics and analog integration, combined with frequent RTL changes extending close to tapeout due to the evolving system architectures of cutting-edge silicon photonics design. Traditional manual PD flows widely used in the industry cannot sustain rapid iteration requirements while maintaining team scalability. We present a production-grade GitOps-based PD orchestration framework that treats PD recipes as version-controlled code, enabling automated synthesis through signoff with CI/CD integration. Our framework leverages YAML/Jinja configuration templates, parallel multi-block execution, commercial LLM APIs enabled with EDA tool docs, and human-in-the-loop validation via automated QoR regression and Slack notifications. Deployed at Lightmatter, a fabless CPO company, this system enabled a team of fewer than 10 PD engineers to successfully deliver multiple test chips and product tapeouts with boosted productivity compared to industry baselines, while seamlessly absorbing RTL changes in the final two months before tapeout. Beyond immediate productivity gains, this infrastructure establishes the data collection and execution framework necessary for AI-agent-assisted PD optimization, providing auditable run histories and structured metrics that serve as training data for autonomous design exploration.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionSpeculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile single-NPU-PIM system faces idle overhead in traditional operator-level synchronous execution and wasted computation in asynchronous execution due to fluctuations in draft length. This paper introduces AHASD, a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding. Notably, AHASD achieves parallel drafting on the PIM and verification on a single NPU through task-level DLM-TLM decoupling. Specifically, it incorporates Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to dynamically manage adaptive drafting algorithm execution and pre-verification timing, suppressing invalid drafting based on low-confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM to enable attention link localization and sub-microsecond task switching on the PIM side. Experimental results for different LLMs and adaptive drafting algorithms show that AHASD achieves up to 4.2× throughput and 5.6× energy-efficiency improvements over a GPU-only baseline, and 1.5× throughput and 1.24× energy-efficiency gains over the state-of-the-art GPU+PIM baseline, with hardware overhead below 3% of the DRAM area.
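The draft-then-verify idea behind speculative decoding can be sketched in a few lines. This is a toy, greedy-decoding illustration only and says nothing about how AHASD schedules work across the NPU and PIM: the "models" are plain functions from a token prefix to the next token, verification is done position by position where a real system would use one batched pass, and all names are hypothetical.

```python
from typing import Callable, List

# A "model" is assumed to be a function: token prefix -> next token (greedy decoding).
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: call the model once per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    """Draft several tokens with the small model, keep the prefix the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        k = min(draft_len, goal - len(out))
        # 1) Draft: the small model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Verify: compare each drafted token with the target's own choice.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    # Deliberately wrong whenever the last token is 2, 5 or 8.
    return 0 if prefix[-1] % 3 == 2 else (prefix[-1] + 1) % 10


print(speculative_decode(toy_target, toy_draft, [0], 8))
```

With greedy verification the result matches what the target model alone would produce; the draft model only changes how many target calls could be batched together.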
People
Engineering Special Session
AI
Chiplet
Design
EDA
DescriptionThe convergence of AI and multi‑die architectures is reshaping the fundamentals of compute efficiency. Near‑memory and in‑memory processing—enabled by advanced non‑volatile memory technologies—reduces data movement and latency, accelerating AI workloads at their source. Cost‑effective multi‑die integration unlocks scalable performance without escalating silicon expense, fueling innovation from edge devices to hyperscale cloud.
Automotive‑grade AI raises the bar further, requiring multi‑die reliability that meets rigorous functional‑safety demands. Sustainability pressures drive advances in interconnect efficiency, power‑aware partitioning, and intelligent storage hierarchies across AI–multi‑die systems. Collectively, these trends point toward a future where heterogeneous integration and intelligent compute evolve in tandem, delivering new levels of performance, robustness, and environmental efficiency.
Engineering Special Session
AI
Design
EDA
Systems
DescriptionSpecifications remain the foundation of silicon development, yet they suffer from systemic problems that compound across the stack. This talk addresses four critical challenges: (1) Formal vs. Informal Specs: Natural language specs create ambiguity, contradiction, and drift. AI enables incremental formalization - translating high-value sections into machine-readable formats while keeping specs accessible. (2) Living Documents vs. Snapshots: Specs diverge from implementation almost immediately. AI-powered drift detection and spec-as-code practices can keep specs aligned with reality throughout the development lifecycle. (3) Spec Archaeology: Legacy IP, acquisitions, and lost documentation create integration risk. AI can reconstruct specs from RTL/firmware and surface undocumented assumptions from tribal knowledge sources. (4) Cross-Company Interfaces: Specs crossing organizational boundaries become contractual documents where ambiguity has legal and financial consequences. Structured, AI-validated specs reduce handoff risk with IP vendors, OSATs, and customers.
People
Welcome Address
AI
EDA
DescriptionA new era is emerging where AI supercomputing and EDA come together to redefine what is possible in chip and system design. Accelerated computing, foundation models, and intelligent design flows are transforming every stage of the journey, from architecture and verification to deployment in complex, secure systems. As AI, Design, EDA, Security, and Systems converge, design cycles can shrink, chiplet-based and future quantum-ready platforms can become practical, and human creativity can be amplified. This keynote explores how collaborating with AI at scale can unlock breakthrough performance, faster innovation, and a new generation of chips and systems that power the world’s most ambitious ideas.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAs chip complexity continues to grow, shift-left verification has become critical for improving verification efficiency and design quality. Formal Verification is a key enabler of shift-left adoption, as it provides exhaustive analysis and exposes corner case bugs early in the design cycle. Despite these advantages, many design verification (DV) teams struggle to adopt formal methods due to expertise gaps, scalability limitations, and the perceived difficulty of manual property authoring.
This presentation introduces a practical, AI-assisted roadmap for scaling formal verification across IP, subsystem, and SoC level designs. The approach combines early "quick win" formal applications (linting, UNR, connectivity) with scalable intermediate techniques (CSR, CDC/RDC, safety and security), and extends to advanced Formal Property Verification (FPV). Recent advancements in AI-assisted assertion generation, leveraging natural-language-to-property workflows, context-aware reasoning, and automated constraint derivation significantly reduce the manual effort required for FPV setup.
The methodology has been field tested across real design teams, demonstrating improved scalability, faster proof convergence, reduced assertion authoring effort, and higher overall adoption. Key complexity management techniques, including modularization, abstraction, and constraint strategies, are also discussed as essential enablers for deploying formal verification at scale in modern front-end verification flows.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis work presents a new analog design methodology that generates schematic topologies for practical use.
The concept of topology generation is as follows.
Existing circuit schematics are decomposed into small cells or blocks.
The best combination of these is then assembled into a block-level architecture schematic that meets each specific target, using SPICE-simulation-based optimization. The design can be an existing schematic that is partially structured as blocks, each built internally by connecting cells or blocks selected from libraries.
Because each cell and block is already validated and parametric, the final schematic is acceptable to designers.
The tool is independent of circuit category and process. Evaluation relies on SPICE simulation of testbenches under a PDK.
This may produce a better combined circuit topology than relying on designers' experience or skill.
The key technologies of this tool are:
1. An AI search method for huge combinatorial spaces, similar to the one used in AlphaGo
2. A recent machine learning algorithm for circuit parameter optimization
3. Separate learning and inference stages, which scale with available machine resources
This improves reusability and migration efficiency, including migrations that change the topology.
Final design sign-off can be done in the customer's current design environment.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEkepower is committed to developing chips for advanced printing technology. A key feature of these innovative chips is their integration of diverse circuit block functionalities into a single device. These include high-voltage motor and printhead drivers, precise timing references essential for the entire system, and RF sensor integration capable of detecting signals with extremely low voltage. Finding a universal simulation and verification platform that efficiently accommodates the unique requirements of different circuit blocks remains a significant challenge.
By leveraging integration with the Solido Simulation Suite and Solido Design Environment, users gain access to comprehensive analysis and accelerated performance compared to traditional workflows, without compromising accuracy. Additionally, AI-driven variation-aware features enable thorough verification and identification of circuit distributions, facilitating the efficient creation of trimming plans to optimize both analysis and circuit performance.
Ekepower leverages this design flow to efficiently verify and optimize circuit blocks with varying properties simultaneously through transient, RF, and mixed-signal analysis, achieving speed improvements of up to 7.6x over conventional methods. Additionally, the AI-powered and automated trimming flow effectively optimizes circuit blocks, resulting in an overall reduction in top-level voltage variation.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionLevel shifters are critical for reliable cross-voltage domain communication: their failure risks the integrity of the circuit, excessive power dissipation, and damage to components. Extensive verification is required to mitigate their susceptibility to device variability and multiple failure mechanisms. Traditional brute-force Monte Carlo analysis is infeasible for evaluating all failure modes across large corner sets at advanced nodes, making accurate, timely verification challenging.
This paper introduces an AI-powered, high-throughput batch flow for fast, accurate signoff verification of single and multibit level shifter cells. Adaptive AI performs rapid identification of worst-case corners, which are then verified with AI-powered, brute-force accurate high-sigma technology. This precision deployment optimizes runtime and resources by running high-sigma verification only when necessary.
The proposed methodology demonstrates significant improvements over previous methods. 3σ single-cell validation was 3x faster than traditional brute-force Monte Carlo, and high-sigma verification precision deployment reduced overall runtime by 10x across 256 cells at 6σ. Targeted analysis validated glitch heights in multibit level shifters at 3σ and 6σ, with precision deployment reducing overall runtime by >14x. This AI-driven solution optimizes compute and engineering resources to provide timely, accurate guidance on reduced verification timelines, ensuring full-coverage functional robustness, improved silicon quality, and faster time-to-market.
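As a rough illustration of why brute-force Monte Carlo becomes infeasible at high sigma, the following sketch estimates how many samples are needed before failures are even observed. It assumes a Gaussian-distributed metric with a one-sided failure tail, and the target of ten expected failures is an arbitrary illustrative choice; neither assumption comes from the abstract.

```python
import math


def tail_probability(sigma: float) -> float:
    """One-sided Gaussian tail probability P(X > sigma) for a standard normal X."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


def samples_for_failures(sigma: float, expected_failures: int = 10) -> int:
    """Samples needed so that `expected_failures` failures are expected on average.

    The default of 10 is an illustrative assumption, not a signoff criterion.
    """
    return math.ceil(expected_failures / tail_probability(sigma))


if __name__ == "__main__":
    for sigma in (3.0, 6.0):
        p = tail_probability(sigma)
        n = samples_for_failures(sigma)
        print(f"{sigma:.0f} sigma: tail probability {p:.3e}, samples needed {n:.3e}")
```

Under these assumptions a 3σ check needs on the order of thousands of samples per cell and corner, while a 6σ check needs on the order of ten billion, which is the gap that high-sigma techniques aim to close.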
Engineering Presentation
AI
Design
EDA
DescriptionEDA tool development often suffers from redundant effort and complexity due to repeated implementation of common HDL parsing and construction tasks. This work introduces an AI-assisted framework that leverages MCP modular building blocks and an Agentic AI approach to automate tool generation, eliminating the need for deep parser expertise. The proposed methodology collects user specifications and synthesizes EDA tools using reusable components. Applied to IBM structural verification tools, the framework demonstrated significant productivity gains - reducing development time from an estimated four person-weeks to under 30 minutes. The generated implementation required 1,110 lines of code, with approximately 90% reused across checks, ensuring scalability and maintainability. By automating repetitive tasks and enabling rapid deployment of new checks, this approach accelerates design workflows, improves reusability, and shortens time-to-market for EDA tools. The results highlight a transformative shift from manual coding to AI-driven synthesis, addressing inefficiencies and empowering design teams to focus on innovation.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe increasing adoption of chiplet-based 2.5D advanced packaging has introduced formidable power integrity (PI) challenges, requiring the simultaneous optimization of static DC performance and dynamic AC response—a complex, multi-variable task traditionally reliant on manual expertise. This paper presents an integrated, AI-driven workflow that fully automates the PI co-optimization process. For DC performance, the methodology automates the design of the interposer Power-Ground (PG) mesh. It leverages AI-based multi-objective optimization within Redhawk-SC to intelligently adjust metal patterns, directly minimizing IR drop while adhering to process design rules. For system-level AC performance, the workflow automates the selection and configuration of the hierarchical decoupling network. By combining HSPICE simulation with optimizers (optiSLang/ASO.ai), it co-optimizes components from die MIMCAP and interposer Deep-Trench Capacitors (DTC) to package decaps and PCB MLCCs, targeting impedance and noise reduction under constraints of cost, area, and assembly rules. This automated approach achieves a 26% reduction in DC IR drop and a 12% improvement in AC noise margin. Ultimately, it consolidates fragmented manual procedures into a unified flow, reducing the total PI optimization timeline from over three weeks to approximately one week, thereby enhancing design robustness and significantly accelerating time-to-market.
Exhibitor Forum
AI
EDA
Systems
DescriptionMove Silicon presents Spaceman, an AI-driven design automation toolchain developed to accelerate and optimize analog and mixed-signal circuit design. Seamlessly integrated with leading EDA environments, Spaceman automates sizing, optimization, and design-space exploration, dramatically reducing both time-to-market and engineering effort for IP and ASIC development. The goal is not to replace analog designers, but to augment their capabilities, enabling faster, smarter, and more scalable design methodologies for the next generation of semiconductors.
At the core of these results lies Move Silicon’s proprietary AI modeling technology, capable of autonomously training neural network models directly from a circuit schematic and its testbench. This approach enables highly efficient and virtually unlimited IP reuse, significantly improving design productivity and portability across projects and PDKs. Spaceman goes beyond simply identifying the correct sizing for a given specification. It shifts the paradigm from traditional SPICE-driven iterative workflows toward AI-driven predictive models, enabling immediate exploration, analysis, and forecasting of the entire design solution space for any IP.
Engineering Presentation
AI-Enabled EDA Cloud Infrastructure and Design Optimization for Next-Generation Semiconductor Design
11:45am - 12:00pm PDT Tuesday, July 28 Seaside Ballroom A
AI
Design
EDA
DescriptionAs semiconductor designs scale to billions of transistors, compute demands for electronic design automation (EDA) increasingly exceed the capacity of traditional on-premises infrastructure. Migrating workloads to hybrid or cloud environments offers scalability, but many organizations face inefficiencies in job provisioning, resource utilization, and scheduling accuracy. This work introduces AI Assist for EDA Cloud Optimization, a framework leveraging graph neural networks (GNNs) and reinforcement learning (RL) to forecast compute requirements and optimize workload execution across heterogeneous environments.
By analyzing job metadata from synthesis, placement, and verification stages, AI Assist predicts runtime and resource needs prior to submission—reducing overprovisioning by 60–80% and improving runtime completion speed by 20–40% in test deployments. The adaptive scheduling layer learns policies that balance cost, performance, and licensing constraints across hybrid compute environments.
Complementary modules—such as placement optimization and hotspot prediction—apply similar AI paradigms to design optimization and verification, demonstrating measurable improvements in PPA efficiency and iteration time. Together, these solutions illustrate how AI Assist enables semiconductor firms to modernize their EDA workflows securely and efficiently while accelerating design convergence in the transition to cloud-scale engineering.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionArtificial Intelligence (AI) has emerged as a transformative technology in electronic design automation (EDA) flows, enabling significant improvements in power, performance, and area (PPA) metrics. This paper explores the critical role of AI-driven techniques in optimizing semiconductor design processes, highlighting how intelligent algorithms can automate decision-making, predict optimal design parameters, and effectively reduce silicon area. By integrating AI into the design flow, users can achieve enhanced PPA trade-offs, minimize manual intervention, and accelerate time-to-market. The study demonstrates practical methodologies and case studies illustrating how AI empowers designers to optimize chip layouts, improve resource utilization, and achieve superior silicon efficiency.
Index Terms: EDA, PPA, Artificial Intelligence, Floorplanning, Silicon Area Reduction
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis paper presents an advanced AI methodology for building a high-fidelity power amplifier behavioral model using the Keysight commercial EDA tools SystemVue and Advanced Design System (ADS), in conjunction with AI-enabled modeling techniques. The example device is a power amplifier circuit modeled within ADS, for which a corresponding behavioral model is developed in the form of a neural network. Training data for the neural network model is generated through co-simulation between SystemVue and ADS, ensuring consistency and accuracy. The resulting behavioral model not only achieves high modeling accuracy but also enables the characterization of transient circuit behavior and supports operation over a broad power range. Once developed, the model can be imported into system-level simulation for performance verification and subsequent system-level design.
Engineering Presentation
Chiplet
EDA
DescriptionWhile 3D disaggregated chiplets are viewed as the future of chip design, the approach has inherent drawbacks in thermal performance, because some of the chiplets in a given 3D stack may not be in contact with a good heat sink. In traditional 2D design methodology, a good heat sink is usually assumed to exist, which allows thermal analysis to be deferred to the end of the design cycle. In 3D disaggregated designs, however, we may not have the luxury of deferring thermal analysis, as thermal violations may not be fixable without contact with a sufficient heat sink. Therefore, it is imperative to optimize for thermal behavior along with traditional performance, power, and area (PPA) metrics early in the design flow. In this work, we propose an AI-driven methodology to incorporate thermal-aware module placement that simultaneously optimizes for thermal and other PPA targets. This methodology automatically experiments with hundreds of thermal-constrained scenarios in heterogeneous 3D chiplet stacks using AI-assisted analysis to achieve the best possible thermal and PPA results, while requiring minimal user guidance and intervention.
Exhibitor Forum
AI
EDA
Systems
DescriptionCustom ASIC development remains out of reach for many hardware companies. Traditional design services are expensive, timelines are long, and off-the-shelf chips often leave major cost, power, and performance on the table. Visibl is building AI-native workflows that automate large portions of chip development, from architecture and design through verification and production readiness. In this presentation, we will show how a software platform can help engineering teams move faster by reducing repetitive work, improving regression triage, and accelerating the path from product requirements to manufacturable silicon. The goal is not just better internal productivity, but making custom chip development accessible to robotics, drones, industrial, IoT, and edge hardware companies that have historically been priced out of the market.
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionEliminating undefined behaviors (UBs) in Rust programs requires a deep semantic understanding to enable accurate and reliable repair. While existing studies have demonstrated the potential of LLMs to support Rust code analysis and repair, most frameworks remain constrained by inflexible templates or lack grounding in executable semantics, resulting in limited contextual awareness and semantic incorrectness. Here, we present AkiraRust, an LLM-driven repair and verification framework that incorporates a finite-state machine to dynamically adapt its detection and repair flow to runtime semantic conditions. AkiraRust introduces a dual-mode reasoning strategy that coordinates fast and slow thinking across multiple agents. Each agent is mapped to an FSM state, and a waveform-driven transition controller manages state switching, rollback decisions, and semantic checkpointing, enabling context-aware and runtime-adaptive repair. Experimental results show that AkiraRust achieves about 92% semantic correctness and delivers a 2.15× average speedup compared to SOTA.
People
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionComplex-Valued Neural Networks (CVNNs) have significant advantages in handling tasks that involve complex numbers. However, existing CVNNs are unable to quantify predictive uncertainty. We propose, for the first time, dropout-based Bayesian Complex-Valued Neural Networks (BayesCVNNs) to enable uncertainty quantification for complex-valued applications, exhibiting broad applicability and efficiency for hardware implementation due to modularity. Furthermore, as the dual-part nature of complex values significantly broadens the design space and enables novel configurations based on layer-mixing and part-mixing, we introduce an automated search approach to effectively identify optimal configurations for both real and imaginary components. To facilitate deployment, we present a framework that generates customized FPGA-based accelerators for BayesCVNNs, leveraging a set of optimized building blocks. Experiments demonstrate the best configuration can be effectively found via the automated search, attaining higher performance with lower hardware costs compared with manually crafted models. The optimized accelerators achieve approximately 4.5× and 13× speedups on different models with less than 10% power consumption compared to GPU implementations, and outperform existing work in both algorithm and hardware aspects. The code will be open-source after acceptance.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionCounter-based stochastic computing (CBSC) utilizes simple counting logic to realize multiplication. Inspired by CBSC, this paper proposes the ALOHA architecture that significantly reduces the computing latency and stochastic number generation (SNG) overhead. Specifically, ALOHA incorporates three key innovations, including input scaling, bit-reversal counter-based SNG, and sequence-aware optimization schemes. Experimental results show that the ALOHA multiplier significantly outperforms state-of-the-art CBSC designs in terms of accuracy, area and power consumption with reduced sequence length. In CNN and Transformer inference, ALOHA matches the accuracy of prior FSM-based multipliers while reducing area and energy.
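The counting idea behind CBSC multiplication can be sketched as follows. This is a minimal software model under stated assumptions, not the ALOHA architecture: one operand sets how many cycles the counter runs, and the other is encoded as a deterministic bitstream produced by comparing a bit-reversed cycle counter against the operand. The function names and the 8-bit width are hypothetical.

```python
def bit_reverse(value: int, n_bits: int) -> int:
    """Reverse the lowest `n_bits` bits of `value`."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def cbsc_multiply(x: int, w: int, n_bits: int) -> int:
    """Approximate x * w / 2**n_bits using only counting and comparison.

    x (0..2**n_bits) sets the number of cycles; w (0..2**n_bits) is encoded
    as a bitstream whose bit in a given cycle is 1 when the bit-reversed
    cycle index is below w. The result is the number of 1s seen in x cycles.
    """
    count = 0
    for cycle in range(x):
        stream_bit = bit_reverse(cycle, n_bits) < w
        count += stream_bit
    return count


if __name__ == "__main__":
    n = 8
    for x, w in [(128, 128), (200, 50), (37, 91)]:
        approx = cbsc_multiply(x, w, n)
        exact = x * w / 2**n
        print(f"x={x} w={w}: counted {approx}, exact {exact:.2f}")
```

Bit reversal spreads the 1s of the bitstream evenly over time, so a prefix of any length carries a close estimate of the encoded value. That property is what allows the computation to stop after only x cycles.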
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionAnalog layout design remains heavily dependent on manual expertise, with placement being the most critical stage that requires significant development time and domain knowledge. Current automated placement techniques face challenges in capturing expert design practices and fall short of practical deployment. To address this limitation, we present AlphaPlacer, a novel MCTS-based analog placement framework that learns from historical layouts to provide expert-guided placement optimization.
Our approach formulates analog placement as a hierarchical sequence-pair search with a two-level MCTS structure, embedding a pretrained learning framework that captures sequence-pair distributions from expert layouts to guide the search toward high-quality solutions.
Experimental results demonstrate that our method significantly outperforms existing baselines across multiple key metrics including area, wirelength, and post-layout performance.
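For readers unfamiliar with the sequence-pair representation that the search operates on, the sketch below decodes a sequence pair into block coordinates using the textbook rule. It is not AlphaPlacer's code; the block names and sizes are hypothetical, and analog constraints such as symmetry and matching are ignored.

```python
from typing import Dict, List, Tuple


def decode_sequence_pair(
    gamma_pos: List[str],
    gamma_neg: List[str],
    sizes: Dict[str, Tuple[float, float]],
) -> Dict[str, Tuple[float, float]]:
    """Return the lower-left (x, y) of each block for a sequence pair.

    Rule: if a precedes b in both sequences, a is left of b; if a follows b
    in gamma_pos but precedes b in gamma_neg, a is below b.
    """
    pos_index = {name: i for i, name in enumerate(gamma_pos)}
    coords: Dict[str, Tuple[float, float]] = {}
    # Every block left of or below b appears before b in gamma_neg, so
    # visiting blocks in gamma_neg order means predecessors are already placed.
    for j, b in enumerate(gamma_neg):
        x_b, y_b = 0.0, 0.0
        for a in gamma_neg[:j]:
            x_a, y_a = coords[a]
            w_a, h_a = sizes[a]
            if pos_index[a] < pos_index[b]:
                x_b = max(x_b, x_a + w_a)  # a is left of b
            else:
                y_b = max(y_b, y_a + h_a)  # a is below b
        coords[b] = (x_b, y_b)
    return coords


if __name__ == "__main__":
    sizes = {"a": (2.0, 1.0), "b": (1.0, 1.0), "c": (1.0, 2.0)}
    print(decode_sequence_pair(["a", "b", "c"], ["b", "a", "c"], sizes))
```

Each pair of permutations maps to exactly one overlap-free packing, which is why a tree search can explore placements by editing the two sequences and scoring the decoded layout.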
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionModern LLMs use diverse integer, floating-point, and microscaling (MX) precisions, but most accelerators are optimized for only a few formats. We propose AMBER, a general-purpose LLM accelerator for plug-and-play multi-precision deployment and compression of weights and KV caches in cloud and edge. AMBER introduces bit-transposed encoding that exploits bit-level statistical concentration for lossless compression across all of INT, FP, and MX formats. A precision-agnostic, stage-pipelined bit-serial PE further reuses bit-level redundancy for efficient versatile precision computation. Evaluated on 8 LLMs and 10 formats, AMBER boosts memory efficiency 1.17× and compute 3.16×, surpassing Olive and Tender in throughput and energy.
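One plausible reading of a bit-transposed layout is a plain bit-plane transposition, sketched below for unsigned 8-bit values. This is an assumption for illustration only and is not AMBER's actual encoding: it shows how grouping the same bit position across many values concentrates zeros into whole planes, which a lossless coder could then exploit.

```python
from typing import List


def to_bit_planes(values: List[int], n_bits: int = 8) -> List[int]:
    """Transpose values into bit planes.

    planes[b] is an integer whose i-th bit equals bit b of values[i].
    Values are assumed to be unsigned and to fit in `n_bits` bits.
    """
    planes = []
    for b in range(n_bits):
        plane = 0
        for i, value in enumerate(values):
            plane |= ((value >> b) & 1) << i
        planes.append(plane)
    return planes


def from_bit_planes(planes: List[int], count: int) -> List[int]:
    """Invert to_bit_planes, recovering `count` values exactly."""
    values = []
    for i in range(count):
        value = 0
        for b, plane in enumerate(planes):
            value |= ((plane >> i) & 1) << b
        values.append(value)
    return values


if __name__ == "__main__":
    weights = [3, 1, 0, 2, 7, 5, 4, 6]  # hypothetical small-magnitude values
    planes = to_bit_planes(weights)
    print([format(p, "08b") for p in planes])
    print(from_bit_planes(planes, len(weights)) == weights)
```

In this toy input only the three lowest planes carry information; the five upper planes are all zero and would compress to almost nothing, and the round trip stays lossless.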
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionIn Large Language Model (LLM) inference services, it is challenging to configure a parallelism strategy that efficiently processes requests with varying context lengths. Long-context requests require a high degree of parallelism to provide more memory for the Key-Value (KV) cache, while short-context requests prefer a low degree of parallelism to increase concurrency and thus improve throughput.
To maintain high throughput while supporting large context lengths on demand, we propose Amoeba, a runtime Tensor Parallel (TP) transformation for online LLM inference services, which adaptively adjusts the TP of running instances to align with the dynamics of incoming requests. Evaluations using real-world traces show that Amoeba improves throughput by 1.75×-6.57× compared to state-of-the-art solutions.
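The memory pressure that drives the choice of TP degree can be illustrated with a back-of-the-envelope calculation. The sketch below is not Amoeba's algorithm: the model shape, the per-GPU KV budget, and the assumption that the KV cache shards evenly across TP ranks are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def kv_cache_bytes(
    context_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """KV cache size for one request; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def min_tp_degree(
    context_len: int,
    kv_budget_per_gpu: int,
    model: Dict[str, int],
    candidates: Sequence[int] = (1, 2, 4, 8),
) -> Optional[int]:
    """Smallest TP degree whose per-GPU KV share fits the budget, else None.

    Assumes the KV cache is split evenly across TP ranks.
    """
    total = kv_cache_bytes(context_len, **model)
    for tp in candidates:
        if total / tp <= kv_budget_per_gpu:
            return tp
    return None


# Hypothetical model shape and budget, chosen only for illustration.
HYPOTHETICAL_MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
GIB = 1024**3

if __name__ == "__main__":
    for tokens in (4096, 131072, 262144):
        print(tokens, min_tp_degree(tokens, 8 * GIB, HYPOTHETICAL_MODEL))
```

With these numbers a 4K-token request fits at TP=1, while a 256K-token request needs TP=4, which matches the tension the abstract describes between concurrency for short requests and memory for long ones.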
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionThis work presents a silicon-verified analog–mixed-signal (AMS) layout-automation framework integrating a circuit-level analog standard-cell (CLAS) library with self-biasing circuits in a digital place-and-route (PnR) environment. Unlike transistor-level stem-cell approaches, the CLAS library standardizes matched circuit-level blocks, including self-biased amplifiers, current mirrors, and delay cells. Fabricated 180-nm and 65-nm CMOS chips validate fully automated, DRC/LVS-clean layouts for a current source, adaptive-bandwidth PLL, and probabilistic computer. All benchmarks show strong pre-/post-layout and silicon correlation, with the PLL exhibiting 0.98–1.50 ps simulated and 4.1 ps measured jitter. The framework achieves ≥96 % area utilization, 14.4–69.5 % DCAP density, and <1 min–1 hr runtime, demonstrating scalable, silicon-consistent AMS automation.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDigital FIR filters are typically designed for worst-case full-scale input precision, resulting in unnecessary dynamic power dissipation when processing low-amplitude signals. This paper introduces an adaptive technique that tracks real-time signal dynamic range and selectively disables toggling of sign-extended bits in datapath registers. By gating both data and clock for these redundant bits, the approach significantly reduces switching activity without impacting filter functionality or performance. Demonstrated on a 20-tap FIR filter in 28nm FDSOI technology, the proposed architecture achieved over 7% total power savings for a wideband OFDM signal with -24 dB input amplitude. The solution is fully digital, incurs minimal hardware overhead, and is easily deployable across existing filter architectures and technologies. Furthermore, it is scalable to other shift-register-based structures such as delay chains and FIFOs, making it a practical and efficient method for dynamic power reduction in contemporary signal processing systems.
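The notion of redundant sign-extended bits can be made concrete with a short software model. This is a sketch under stated assumptions rather than the presented hardware: samples are two's-complement integers, the 16-bit word width is hypothetical, and the range tracker is simplified to a minimum over a window of recent samples.

```python
from typing import Iterable


def redundant_sign_bits(sample: int, width: int) -> int:
    """Count upper bits that only repeat the sign bit in a `width`-bit word."""
    if sample >= 0:
        magnitude_bits = sample.bit_length()
    else:
        magnitude_bits = (~sample).bit_length()
    needed = magnitude_bits + 1  # magnitude bits plus one sign bit
    if needed > width:
        raise ValueError(f"{sample} does not fit in {width} bits")
    return width - needed


def gateable_bits(window: Iterable[int], width: int) -> int:
    """Upper bits that stay redundant for every sample in the window.

    A hardware tracker could gate data and clock for this many register bits
    while the signal stays within the observed range.
    """
    return min(redundant_sign_bits(sample, width) for sample in window)


if __name__ == "__main__":
    recent = [5, -8, 100, -3]  # hypothetical low-amplitude samples
    print(gateable_bits(recent, width=16))
```

Since each bit corresponds to about 6.02 dB of dynamic range, a signal whose peaks sit roughly 24 dB below full scale leaves about four upper bits per word carrying nothing but sign copies.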
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionThe prohibitively slow speed of long-context LLM decoding is a critical bottleneck, caused by massive Key-Value (KV) access that saturates memory bandwidth. Existing solutions fail due to a fundamental trade-off: static predictors cannot handle the Tokens Fluctuation, leading to permanent accuracy loss from irreversible eviction, while an on-the-fly scoring method creates a synchronous bottleneck that stalls the entire pipeline. This paper presents a co-designed system that breaks this trade-off by leveraging an Adaptive Prediction algorithm to enable a fully Decoupled Prefetching architecture. The proposed Adaptive Prediction model first mirrors the nature of Tokens Fluctuation, enabling a lightweight, reuse-based recovery mechanism that solves irreversible eviction. This high-fidelity prediction then unlocks the Decoupled Prefetching system, which completely hides data transfer latency and eliminates the synchronous bottleneck. On long-context LLM decoding tasks, the proposed system matches the accuracy of a full KV cache with just 256 tokens, while achieving an average 2.50x end-to-end speedup.
Engineering Presentation
EDA
Security
DescriptionEmbedded memories occupy the largest part of modern SoCs. Because memories are high-density physical structures, they are more prone to failure than other circuits and account for the large majority of fabrication faults, which hurts yield. Memory Built-In Self-Repair (BISR) is therefore mandatory for maintaining acceptable fabrication yield, and the repair information is burned into electrically programmable fuses (efuses). As chip scale and the number of memories grow, the required efuse resources also increase dramatically.
Researchers have therefore proposed a series of methods to compress memory repair information efficiently and save efuse space. In the traditional compression method, a repair segment consists of a zero-compress field and the repair chain data. To handle the worst case, the zero-compress field is calculated over the whole chain length and the repair chain data uses the longest BISR register.
This paper proposes an adjustable-length compression method that effectively reduces the storage space needed for repair information. The probability of memory faults is very low, and the fault location, fault count, and memory repair chain length all vary, so there is no need to store the longest BISR register or to compress the whole repair chain. The more fault points in the memory, the greater the benefit of the adjustable-length compression method.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis paper proposes a novel modeling methodology with improved accuracy for Multi-plate Metal-Insulator-Metal Capacitors (MIMCAPs), aimed at increasing capacitance density per unit area while minimizing parasitic resistance. As a critical component for decoupling and noise filtering, the demand for high-performance MIMCAPs is intensifying as process nodes scale down and device footprints shrink. In current Samsung foundry processes, 3-plate MIMCAP structures are utilized to maximize capacitance and reduce resistance. However, despite the 3-plate structure, the model is simplified to a 2-port representation for user convenience. In the most reasonable conventional modeling approach, the capacitance is calculated from the layout area and perimeter and reflected in the SPICE model as the sum of the main capacitance and fringe capacitance, while resistance is incorporated through extraction to cover various layout patterns. Nevertheless, this approach has several limitations. First, the resistance of the bottom metal layer is not accurately reflected in a dangling configuration. Second, the use of a lumped capacitor model leads to significant discrepancies from the actual distributed RC characteristics. Third, extraction accuracy for plate resistance remains insufficient across various layout patterns. To overcome these challenges, this study derives compensation coefficients to address each limiting factor and integrates them into a refined 3-plate MIMCAP model. The proposed methodology provides a high-fidelity model that ensures superior accuracy for Multi-plate MIMCAPs across a wide range of layout configurations.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionCoarse-Grained Reconfigurable Arrays (CGRAs) are promising as accelerators for energy and performance benefits. Agile and accurate energy estimates are essential to explore the CGRA design space and guide algorithm mapping decisions. Existing estimations are agile but inaccurate since they do not factor in the impact of wires in the design. This paper presents synchoros energy estimation methodology for post-route accurate estimations. The methodology leverages the properties of synchoros VLSI design style to incorporate the impact of interconnects in the design. Overall, our methodology achieves 98% estimation accuracy relative to post-route results for algorithms irrespective of variations in functionality, dimensions, and parallelism.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAn AI-enhanced real-time analytics and prediction platform designed to bring comprehensive intelligence to complex ASIC Physical Design (PD) flows. This tool automatically captures PD runs, parses massive volumes of logs and reports, and unifies timing, congestion, CTS, IR/EM, DRC/LVS, runtime, and resource metrics into centralized dashboards. By integrating AI-driven pre-run prediction, runtime anomaly detection, and post-run learning, the platform enables early identification of QoR regressions, resource bottlenecks, and failure risks before costly iterations are executed. The solution transforms PD workflows from reactive debugging to proactive, data-driven optimization, significantly reducing manual effort, improving resource utilization, accelerating convergence, and enhancing confidence in physical design closure.
Work in Progress
DescriptionFeedforward equalizers (FFEs) are critical digital signal processing (DSP) components in ultra-high-speed wireline receivers, frequently limiting power and area efficiency while requiring extensive design effort. This work introduces an autonomous design agent that automatically generates FFEs from system specifications to Verilog implementation. The framework employs a computational data-flow graph model combining unfolding and fast FIR algorithms, along with nonlinear programming for architecture-to-circuit-level optimization. A 128-parallel, 24-tap FFE for a 112-Gbps PAM-4 receiver is successfully synthesized in TSMC 28-nm CMOS within 6 hours using the design agent, achieving 14% power and 25% area savings.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionFine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory requirements exceed the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) a lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O; (2) a highly efficient heterogeneous memory management scheme that significantly reduces peak memory usage; and (3) optimized Triton kernels that solve key bottlenecks, together with integrated advanced I/O. This collaborative design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8× larger batch sizes and 6× larger models. In evaluations, SlideFormer achieves 1.40× to 6.27× higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining >95% peak performance on both NVIDIA and AMD GPUs.
Work in Progress
DescriptionElectroencephalography (EEG) enables non-invasive monitoring of brain activity, but its high channel count and computationally intensive neural models pose major challenges to realizing energy-efficient edge accelerator hardware for real-time brain–computer interface (BCI) systems. This work presents EEGDeep, an energy-efficient and fully automated hardware generator that bridges neural network model development and deep-learning hardware accelerators for brain–computer interface (BCI) applications. The framework integrates three core modules: an EEG channel reduction module, which prunes non-informative channels to minimize model size and computational load, enabling practical deployment in wearable applications; an EEGDeep architecture evaluator, which performs constraint-aware neural architecture and hardware exploration through model optimization, a layer-reordering algorithm, and predefined deep-learning configurations to balance accuracy, area, latency, energy, and memory bandwidth; and an EEGDeep RTL generator, which converts optimized models into synthesizable RTL using a parameterized IP template. Using this automated framework, an EEGDeep design implemented in TSMC 90-nm CMOS technology achieves a 91.67% reduction in EEG channels, a processing latency of 1.43 ms per trial, and an energy consumption of 5.51 µJ per trial. These results correspond to a 65.9% reduction in area, a 98.49% improvement in latency, a 98.6% reduction in energy consumption, and a 93.19% improvement in memory bandwidth compared with conventional implementations, demonstrating its potential for practical, low-energy, and real-time BCI applications.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionThe Mixture-of-Experts (MoE) model sparsely activates a subset of experts for each token, causing each expert to process a varying number of tokens during inference. However, existing GPU-based MoE inference frameworks adopt a fixed tensor size for the inputs to each expert, requiring padding when an expert receives fewer tokens, which leads to substantial redundant computation.
In response, we propose a token-stationary dataflow that uniformly abstracts the multiplication between an expert's parameters and input tensors with varying token counts. Based on this dataflow, we design a reconfigurable systolic array that eliminates padding-incurred redundant computation. Evaluation demonstrates that our design outperforms the state-of-the-art GPU-based MoE inference frameworks significantly.
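To make the padding overhead concrete, here is a minimal sketch that computes the fraction of expert input slots that are padding when every expert receives a fixed-size tensor. The function name, the `capacity` parameter, and the token counts are hypothetical values chosen for illustration; the sketch does not model the proposed dataflow or systolic array.

```python
def padding_waste(tokens_per_expert, capacity: int) -> float:
    """Fraction of input slots that hold padding under a fixed per-expert capacity."""
    if any(t > capacity for t in tokens_per_expert):
        raise ValueError("an expert received more tokens than the fixed capacity")
    total_slots = capacity * len(tokens_per_expert)
    used_slots = sum(tokens_per_expert)
    # Every unused slot is padded and still multiplied by the expert's weights.
    return (total_slots - used_slots) / total_slots


if __name__ == "__main__":
    # Four experts, each padded to 8 tokens, with an uneven routing result.
    print(padding_waste([8, 1, 0, 3], capacity=8))
```

With this uneven routing, most of the slots carry padding, which is the redundant computation that a design processing only the real tokens would avoid.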
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAn engineering case study examining the practical implications of autonomous PCB layout in a real hardware development workflow. The automation used in this study is domain-specific and constraint-driven, not an LLM. The case study system is a Linux-capable computer comprising a SOM and baseboard with over 800 components and more than 5,000 pins, designed under realistic electrical, manufacturing, and assembly constraints.
Starting from completed schematics and a defined board outline, the design progressed through automated layout generation, human review and cleanup, fabrication, assembly, and system bring-up using standard advanced manufacturing processes. The system booted Linux and ran real workloads on the first spin.
This outcome required non-trivial human correction of automated results. These intervention points illustrate why experienced PCB designers remain essential, particularly for identifying and correcting failures that are not locally observable prior to system integration and bring-up.
Rather than focusing on tools or algorithms, this talk analyzes what was required for success at each stage of the workflow, where automation was effective, where expert intervention remained necessary, and which assumptions did not hold in practice. It examines how autonomous layout reallocates expert effort—from manual routing toward constraint definition, review, and exception handling—within a broader trajectory of system design.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionMicrochip FPGAs feature multiple I/O banks, each independently configurable for diverse standards such as SSTL and LVSTL, enabling designers to mix and match signaling protocols on the same device while covering a wide frequency range. The I/O banks support voltages from 0.5 V to 3.3 V, including high-speed DDR PHY interfaces for LPDDR4/5 at up to 3.2 Gbps. Robust signal integrity across this wide frequency-voltage spectrum is achieved through on-die I/O PVT calibration, which dynamically compensates for variations in silicon process, supply voltage, and temperature, ensuring consistent electrical characteristics such as output drive strength and termination impedance. A SAR-based digital block and an analog block collaboratively manage reference currents and voltages, resulting in a complex mixed-signal system.
Verification of this subsystem is paramount to prevent chip-level failures. Full-chip verification challenges include multiple I/O banks distributed across the chip periphery and individual CPU-to-bank connections. Validating signal paths from the CPU to the farthest I/O bank requires analyzing a significant portion of the chip, which increases simulation runtime and memory usage.
This presentation covers the PVT calibration architecture and verification strategies. It highlights how Siemens Symphony Pro optimizes debugging, reduces simulation time by ~3x, improves memory usage by ~2-3x, and streamlines regressions, ensuring reliable calibration verification across diverse I/O standards and extreme PVT conditions.
Engineering Presentation
Design
EDA
Systems
DescriptionAnalog IP verification remains a bottleneck in modern mixed-signal SoC development due to lengthy SPICE simulation times and integration challenges. This paper introduces ANA-MODELGEN, a comprehensive hierarchical modeling methodology that transforms analog IP verification through automated model generation and validation. Our approach enables rapid development and validation of behavioral models for any analog IP type, featuring direct integration capabilities with industry-standard verification environments. The methodology includes automated testbench generation for rigorous model validation and equivalence checking against SPICE references. Deployment in production SoC verification flows demonstrates substantial gains in functional coverage and execution speed, enabling earlier bug detection and faster time-to-market.
Engineering Special Session
AI
Design
EDA
DescriptionThis talk explores the intersection of physics-aware AI and electronic design automation, focusing on how modern neural network architectures can transform analog and mixed-signal verification. We review the evolution of neural networks—from convolutional models to attention-based transformers—and discuss why purely data-driven approaches fall short for engineering problems governed by differential equations. Physics-aware AI embeds domain constraints directly into learning architectures, enabling models that respect conservation laws, device behavior, and circuit dynamics. This approach gives rise to “analog intelligence”: AI systems that reason over continuous-time, multi-physics behavior rather than discrete abstractions. By accelerating core analysis tasks such as transient and frequency-domain verification, physics-aware AI can significantly reduce verification turnaround time, improve design confidence, and enable faster time-to-market with higher first-silicon success rates.
People
Research Special Session
EDA
DescriptionGoogle's quantum chips have been designed to be ideal for running digital quantum circuits, particularly for performing quantum error correction with the surface code. However, starting in 2024, we have demonstrated that these chips can also be repurposed to run extremely high-fidelity analog quantum simulations. Furthermore, the two capabilities can be combined, mixing digital gates with analog evolution. In this talk, I will discuss several recent experiments that leverage our hybrid analog-digital simulation platform, including adiabatic preparation of low-energy states both within the qubit subspace and incorporating higher energy levels beyond |0⟩ and |1⟩, as well as an analog counterpart to random circuit sampling.
People
Engineering Presentation
EDA
Systems
DescriptionAs high-speed, highly integrated systems-on-chip (SoCs) advance, the design of robust power delivery networks (PDNs) becomes critical to ensure reliable operation. In 3D-stacked high bandwidth memory (HBM), the vertical stacking of multiple core dies (Cdie) on a base die (Bdie) introduces significant challenges, including increased PDN impedance and cross-die current interactions, necessitating full-stack 3D-IC IR-drop analysis. However, conventional EMIR tools face scalability bottlenecks due to the massive transistor count and terabyte-scale input in HBM Bdie designs, often leading to excessive runtime or tool failure. This study presents a scalable methodology using Synopsys Totem-SC to enable full-chip 3D-IC EMIR analysis for HBM. We introduce (1) a DSPF-only flow with parallel parsing that eliminates redundant processes, (2) a cell-based decoupling capacitor method that cuts transistor count in DSPF files by up to 70%, and (3) a current copy methodology. Further enhancements (4) improve runtime (by up to 78.3%) and capacity (up to 2 billion transistors) in the parasitic/current extraction, GDS-to-DEF conversion, and partitioned database generation stages. Successful 3D-IC analysis shows the maximum device-point IR drop rising from 0.5 to 2.21, that is, to 442% of the single-die value, highlighting the necessity of full-stack simulation. The proposed method enables efficient, high-capacity EMIR analysis for next-generation 3D-IC designs.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionTiming-driven global placement is critical in modern VLSI physical design for achieving timing closure. Among existing approaches, weighting is a mainstream technique, but traditional weights are often manually designed based on heuristics, which can lead to suboptimal timing performance. To address this, we present an analytically-derived model of interconnect and cell-delay contributions of pin pairs along critical paths, from which we obtain explicit and differentiable formulations approximating Total Negative Slack (TNS) and Worst Negative Slack (WNS). The formulations include three wirelength components: net wirelength, linear pin-to-pin wirelength, and quadratic pin-to-pin wirelength, where net and linear pin-to-pin wirelength are smoothed via the weighted-average (WA) model. Based on these formulations, we derive a hybrid net-and-pin weighting scheme and propose a novel timing-driven global placement framework that directly optimizes TNS and WNS. The weighting scheme features dynamic and cumulative updates, ensuring that consistently critical paths are prioritized throughout optimization. Experimental results on the ICCAD'15 benchmark demonstrate that our method achieves average improvements of 39% in TNS and 6% in WNS compared with state-of-the-art timing-driven placers, while maintaining competitive wirelength and runtime, validating the effectiveness of the analytically-derived weighting framework.
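For readers unfamiliar with the weighted-average (WA) model mentioned above, the sketch below shows the standard WA smoothing of the span max(x) - min(x) along one axis, with a smoothing parameter `gamma`. It illustrates only the WA model itself; the function name and the sample coordinates are assumptions, and the TNS/WNS formulations and weighting scheme of the paper are not reproduced here.

```python
import math


def wa_wirelength(coords, gamma: float) -> float:
    """Weighted-average (WA) smooth approximation of max(coords) - min(coords)."""
    hi, lo = max(coords), min(coords)
    # Subtracting hi (or lo) inside exp() avoids overflow; the constant factor
    # cancels between numerator and denominator, so the result is unchanged.
    w_max = [math.exp((x - hi) / gamma) for x in coords]
    w_min = [math.exp(-(x - lo) / gamma) for x in coords]
    smooth_max = sum(x * w for x, w in zip(coords, w_max)) / sum(w_max)
    smooth_min = sum(x * w for x, w in zip(coords, w_min)) / sum(w_min)
    return smooth_max - smooth_min


if __name__ == "__main__":
    pins_x = [0.0, 3.0, 10.0]  # hypothetical pin x-coordinates of one net
    for gamma in (0.5, 2.0, 8.0):
        print(gamma, wa_wirelength(pins_x, gamma))
```

A smaller `gamma` tracks the exact span more closely, while a larger `gamma` gives a smoother function at the cost of underestimating the span; applying the same function to the y-coordinates and summing gives a smooth half-perimeter estimate.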
Engineering Special Session
AI
Design
EDA
Systems
DescriptionSandisk will present how a leading semiconductor company applies AI to transform specification-driven development and engineering productivity. The session will showcase practical workflows that improve specification quality, accelerate downstream documentation, and help engineers ramp faster on complex systems. Examples include using AI to detect ambiguities, inconsistencies, and missing requirements in specifications, while recommending fixes and useful extensions. We will also cover multimodal capabilities for analyzing existing diagrams and generating new visual content directly from specification inputs. The presentation will demonstrate an AI-powered knowledge assistant integrated into engineers’ daily development environment, enabling fast access to trusted design information. In addition, AI-generated outputs such as verification plans, coverage plans, sign-off documentation, and structured work-item breakdowns from high-level requirements will be shown. This talk highlights practical ways AI can modernize semiconductor engineering workflows today.
People
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionApproximate Nearest Neighbor (ANN) search has become fundamental to modern AI infrastructure, powering recommendation systems, search engines, and large language models across industry leaders from Google to OpenAI. Hierarchical Navigable Small World (HNSW) graphs have emerged as the dominant ANN algorithm, widely adopted in production systems due to their superior recall vs. latency balance. However, as vector databases scale to billions of embeddings, HNSW faces critical bottlenecks: memory consumption expands, distance computation overhead dominates query latency, and performance is suboptimal on heterogeneous data distributions. This paper presents Adaptive Quantization and Rerank HNSW (AQR-HNSW), a novel framework that synergistically integrates three strategies to enhance HNSW's scalability. AQR-HNSW introduces (1) density-aware adaptive quantization, achieving 4× compression while preserving distance relationships; (2) multi-state re-ranking that reduces unnecessary computations by 35%; and (3) quantization-optimized SIMD implementations delivering 16-64 operations per cycle across architectures. Evaluation on standard benchmarks demonstrates 2.5-3.3× higher QPS than state-of-the-art HNSW implementations while maintaining 98%+ recall, with 75% memory reduction for the index graph and 5× faster index construction.
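The general pattern of ranking on compressed codes and then re-ranking a shortlist with exact distances can be sketched as follows. This is a plain uniform 8-bit scalar quantizer with a brute-force scan, chosen for brevity; it is not the density-aware quantizer, the re-ranking scheme, or the HNSW graph traversal of AQR-HNSW, and all function names and parameters are assumptions for illustration.

```python
def quantize(vec, lo: float, hi: float) -> bytes:
    """Map each value in [lo, hi] to one byte (0..255).

    Storing one byte per dimension instead of a 32-bit float is a 4x reduction.
    """
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all values are equal
    return bytes(min(255, max(0, round((v - lo) / scale))) for v in vec)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query, vectors, k: int, shortlist: int):
    """Return indices of the k nearest vectors using quantize-then-rerank."""
    lo = min(min(v) for v in vectors + [query])
    hi = max(max(v) for v in vectors + [query])
    codes = [quantize(v, lo, hi) for v in vectors]
    q_code = quantize(query, lo, hi)
    # Stage 1: cheap approximate ranking on the 1-byte codes.
    order = sorted(range(len(vectors)), key=lambda i: squared_distance(q_code, codes[i]))
    candidates = order[:shortlist]
    # Stage 2: exact distances only for the shortlist.
    candidates.sort(key=lambda i: squared_distance(query, vectors[i]))
    return candidates[:k]


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
    print(search([1.0, 1.0], data, k=2, shortlist=3))
```

The shortlist size trades accuracy for work: exact distances are computed only for the candidates that survive the approximate stage.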
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionConventional SRAMs are designed for random access, resulting in significant energy waste under the regular access patterns common in image processing, computer vision, deep learning, and dense linear algebra, to name a few. We propose an energy-efficient ARC-SRAM subarray architecture that integrates three synergistic energy reduction schemes: (1) address decoding energy reduction by relocating the address generation to the memory periphery; (2) read energy reduction by reducing wordline activity to one pulse per row and minimizing precharge activity to suppress half-selection energy; (3) write energy reduction by reusing the bitline charge across consecutive rows and by minimizing wordline activation.
We implement a 64KB instance of our ARC-SRAM using an advanced gate-all-around nanosheet 1.4 nm research process design kit. At the cell-array level, the design achieves a 63–76% energy reduction for reads and 68–72% for writes in loop-nest dominated applications, with only 4% area overhead.
People
Exhibitor Forum
DescriptionAI-agentic flows for chip design have emerged with the potential to create a paradigm shift in the EDA industry. At Architect Labs, we are building the first generation of AI-native design methodology for chip development. In this presentation, we share details on our AI-based automation flow for fast and verifiable generation of frontend design collateral, including architectural exploration, software modeling, RTL, and verification. Our methodology captures the design intent of the architecture from design specifications, while the human in the loop iterates on the specs as the “architect”. At the core of this methodology, an agentic harness combined with ML methods, including reinforcement learning and test-time scaling, achieves fully end-to-end autonomous HW design and verification capability, all while using SOTA EDA tools and proven chip design methods. As a demonstration vehicle, we further present how our AI system autonomously designed a specialized AI hardware accelerator targeting FPGA deployment for running inference on a SOTA LLM. In this end-to-end demo, we share details on how our harness performs architectural exploration under memory-bound vs. compute-bound workloads to generate performance models, firmware, RTL, AI kernels, and more. We will bring the FPGA for an in-person demo and share quantitative results on both the methodology and the design results, such as number of engineers, time-to-design, and achieved performance numbers.
Engineering Presentation
EDA
DescriptionVerification coverage analysis evaluates the effectiveness of verification stimulus on design verification. Significant computational resources are spent toward coverage closure. Robust infrastructure and tools are available to analyze large coverage data to efficiently allocate resources to the most effective testcases. While we have correlation of testcases to coverage events they are likely to discover, little is known about why.
Further investigation into the components that make up the testcases, from both stimulus-completeness and coverage-correlation perspectives, is necessary. In this work, we augment coverage data with instrumentation to collect testcase components. Our analysis in the POWER simulation environment guided the resolution of previously unknown biases in stimulus generation and inefficient tests. Furthermore, using ML and statistical techniques, we establish correlation between test components and coverage events, allowing composition of new testcases from components highly correlated with desired event scenarios. Coverage analysis shows these super-tests are most likely to achieve hard-to-hit scenarios and therefore make up 86% of recommended testcases. Increasing the likelihood of such unlikely events and tuning environment inefficiencies lead to savings in computational resources in the verification environment.
Engineering Presentation
Design
EDA
DescriptionHighly configurable, hash-based address translation logic is central to modern memory subsystems, enabling scalability, parallelism, and load balancing (fairness) across multiple clients and lanes. However, the resulting configuration and address spaces grow far beyond what simulation can feasibly explore. Even subtle specification-level errors in hashing or decoding semantics can silently propagate into RTL and system software, leading to persistent data corruption and costly late-stage debug. This paper presents an architectural-level formal verification methodology that shifts correctness assurance left, before RTL development. The intended hash and decoder behaviour is captured using an executable architectural reference model, independent of RTL implementation. Architectural intent is expressed as global invariants, such as no aliasing, load balancing (fairness), address safety, lane conflict avoidance, and controlled error propagation, which are proven symbolically across all legal configurations and the full address space. The same reference model and proven invariants are then reused directly at the RTL level to perform invariant-guided equivalence checking. By integrating these already verified block-level properties as assumptions in the existing end-to-end RTL formal environment, system-level correctness is validated without building a new formal setup or re-verifying internal blocks. This approach uncovers mis-integration bugs while avoiding over-constraint. Applying this methodology exposed multiple critical specification bugs prior to RTL. The results demonstrate a scalable, reusable, and production-ready formal flow for complex memory systems.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionFPGAs are widely deployed in critical systems, but flaws in EDA toolchains can enable malicious HDL injection, leading to severe hardware vulnerabilities. We propose DefVul-Risk, an automated framework for assessing fault-injection risks in FPGA synthesis tools. It constructs a defect knowledge base, generates triggerable vulnerability samples via large language models, and fine-tunes risk-assessment models to prioritize high-risk defects. DefVul-Risk enhances toolchain security evaluation and vulnerability remediation efficiency. We submitted 26 CVE reports, with 9 officially confirmed.
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionAs conventional FinFET architectures encounter severe scaling limitations, Complementary-FET (CFET) technology with vertically stacked PMOS and NMOS transistors has emerged as a promising solution for continued standard cell density scaling. However, aggressive area compaction in CFET standard cells drastically limits intra-cell routing resources, leading to routing congestion and design rule challenges. To mitigate this issue, multi-row CFET standard cell architectures have been introduced to improve intra-cell routability and alleviate block-level congestion. Nevertheless, these multi-row configurations introduce new placement-routing coupling and design rule complexities, making it challenging to achieve compact, DRC-clean, and routable layouts. Therefore, this work proposes an area-optimal and routability-driven layout synthesis framework for multi-row CFET cells, which effectively addresses the challenges of area efficiency, constrained pin accessibility, and DRC compliance under multi-row CFET architectures. Among its components is a two-stage Satisfiability Modulo Theories (SMT)-based routing flow consisting of a Multi-Commodity Flow (MCF)-based routability-guaranteed pin-access selection and an Integer Linear Programming (ILP)-enhanced hierarchical routing to ensure DRC/LVS closure. Compared with state-of-the-art multi-row CFET cell generators, experimental results show that our algorithm consistently achieves the optimal layout area, while delivering significant improvements in solution quality and efficiency.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionSolid-State Drives (SSDs) now power data centers, high-performance computing, and artificial intelligence workloads, but the increasing complexity of modern SSD controllers has surpassed the capabilities of traditional rule-based verification, necessitating scalable, data-driven testing methods. In this paper, we present ARTEMIS, a reinforcement‑learning (RL)‑based framework for automatic SSD test case (TC) generation that operates directly on commercial devices. Our contributions are threefold. First, we formulate a novel RL problem, enabling seamless deployment across heterogeneous commercial products without requiring any device-specific customization. Second, to cope with the highly dynamic and non‑stationary operating environments of SSDs, we introduce an ensemble‑based inference mechanism that aggregates policies learned under diverse workload distributions, thereby improving generalization and robustness. Third, we validate the approach on two representative stress‑testing tasks: Meta and User Garbage Collection. We show that the RL‑driven tester discovers high-impact TCs that trigger critical blocking garbage collection events more frequently than conventional methods in diverse tasks.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionFront-end RTL verification workflows consume significant compute resources and engineer time, with Verilog simulations often failing due to syntax errors, testbench issues, or configuration problems detected only after resource consumption. Traditional workflow management lacks predictive intelligence to prevent this waste.
This presentation introduces ARTEMIS (Automated RTL Testing & Error Management Intelligent System), a production-deployed multi-agent AI architecture that optimizes front-end verification workflows through intelligent error detection. ARTEMIS employs four specialized agents working collaboratively: Debug Agent (error identification), Supervisor Agent (resource orchestration), Scheduler Agent (regression test sequencing), and Job Analysis Agent (real-time monitoring). The system integrates with Verilog and industry-standard schedulers through Model Context Protocol servers.
The intelligent error detection system analyzes iverilog output patterns to identify syntax errors, module instantiation issues, and testbench problems before they consume compute resources. When failures are predicted, the system automatically terminates jobs and releases resources.
Production deployment metrics demonstrate a 15-25% reduction in compute costs and a 45% reduction in debugging time. This presentation chronicles the journey from award-winning prototype to production MLP with actionable implementation guidance for EDA workflows.
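The log-pattern idea described above can be sketched as follows. This is not the ARTEMIS detector: the two regular expressions, the category names, and the sample log are assumptions modeled on the general shape of iverilog messages (`file:line: ...`), and a production system would need a much larger pattern set.

```python
import re

# Hypothetical pattern table: (category, compiled regex over one log line).
PATTERNS = [
    ("syntax", re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): syntax error")),
    ("instantiation",
     re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): error: Unknown module type: \w+")),
]

def classify(log_text):
    """Return (category, file, line) for every log line matching a known pattern."""
    findings = []
    for raw in log_text.splitlines():
        for category, pattern in PATTERNS:
            match = pattern.match(raw)
            if match:
                findings.append((category, match.group("file"), int(match.group("line"))))
                break  # first matching category wins
    return findings

def should_terminate(findings):
    """Toy policy (assumption): stop the job as soon as any known error is seen."""
    return bool(findings)

# Assumed sample output, for illustration only.
SAMPLE_LOG = """tb.v:12: syntax error
tb.v:12: error: malformed statement
top.v:7: error: Unknown module type: adder8
2 error(s) during elaboration."""

findings = classify(SAMPLE_LOG)
```

Running the classifier on the compile log before a simulation job is dispatched is what allows compute resources to be released early.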
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionFor the first time, this paper introduces an adaptive sparsely-gated mixture of experts neural network (ASG-MOENN) as a novel digital predistortion (DPD) framework for wide dynamic power range quadrature switched-capacitor power amplifiers (SCPAs). State-of-the-art (SOTA) DPD models suffer from fixed, high computational complexity, leading to inefficiency at lower power levels. To overcome this limitation, the proposed ASG-MOENN employs a dual-stage adaptive mechanism. First, it embeds the underlying physical principles of SCPA power variation to condition the input signal. Second, it incorporates an adaptive sparse gating mechanism that dynamically determines both which experts to activate and how many are required based on a cumulative confidence criterion, allowing the model to flexibly scale its run-time complexity according to the PA's operating state. Experimental results demonstrate that, compared with the SOTA baseline, the proposed model achieves superior linearization performance while reducing the average computational load by up to 50% across a 30-dB dynamic power range.
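A cumulative-confidence gate of the kind described can be sketched as below: experts are ranked by gate probability and activated one by one until their summed probability reaches a threshold, so an easy input uses few experts and an ambiguous one uses more. The threshold value, the renormalization step, and the function names are assumptions, not details taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_experts(gate_logits, confidence=0.9):
    """Pick the smallest top-ranked expert set whose gate probability sums to >= confidence.

    Returns {expert_index: renormalized_weight}.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for index in ranked:
        chosen.append(index)
        cumulative += probs[index]
        if cumulative >= confidence:
            break
    weight_sum = sum(probs[i] for i in chosen)
    return {i: probs[i] / weight_sum for i in chosen}
```

With a sharply peaked gate such as `[10.0, 0.0, 0.0, 0.0]` a single expert suffices, while a flat gate such as `[0.0, 0.0, 0.0, 0.0]` activates all four, which is how run-time cost can track the operating state.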
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionDistributed large language model (LLM) training faces prevalent failures and requires efficient checkpointing. State-of-the-art approaches employ partition-based pipelined checkpointing, splitting checkpoints into partitions for concurrent processing. However, existing solutions rely on a fixed partition size, which our analysis reveals is suboptimal for LLM training: large partitions cause bandwidth stalls during forward passes, while small partitions incur substantial startup overhead during backward passes. We propose AsymCheck, which employs asymmetric partitioning: small partitions for forward passes and large partitions for backward passes, plus selective partition compression and batched flushing optimizations. Evaluation on 64 GPUs shows AsymCheck reduces training time by 20.1%-48.2% over state-of-the-art methods, approaching no-checkpointing efficiency.
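The asymmetric-partitioning idea can be illustrated with a small planner that cuts a checkpoint into small chunks when it will be processed during forward passes and into large chunks during backward passes. The partition sizes, the `phase` argument, and the function names are hypothetical; the abstract does not give concrete values, and compression and batched flushing are not modeled.

```python
MIB = 1 << 20

def partition(total_bytes, part_bytes):
    """Split [0, total_bytes) into (offset, length) chunks of at most part_bytes."""
    return [(offset, min(part_bytes, total_bytes - offset))
            for offset in range(0, total_bytes, part_bytes)]

def plan_partitions(total_bytes, phase, small=4 * MIB, large=64 * MIB):
    """Choose the partition size by training phase (sizes are illustrative assumptions)."""
    if phase not in ("forward", "backward"):
        raise ValueError(f"unknown phase: {phase!r}")
    return partition(total_bytes, small if phase == "forward" else large)
```

For a 128 MiB checkpoint this yields 32 small partitions for the forward phase and 2 large ones for the backward phase.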
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionWe present Athena, a scalable analytical framework for pre-silicon side-channel leakage evaluation that avoids simulation entirely. Athena integrates static timing information with signal-probability propagation to model time-dependent circuit behavior, capturing leakage arising from input-arrival skew, glitches, and other timing-induced effects. The fine-grained leakage estimates it produces precisely identify vulnerable signals and time intervals, and support evaluation of diverse countermeasures including masking, gate resizing, and differential logic. We evaluate Athena across a broad set of countermeasures on S-boxes and full ciphers, comparing both correctness and performance against state-of-the-art tools. Across all benchmarks, Athena delivers over 400x speedup over existing methods.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionThis work presents ATLAS, an LLM-driven framework that bridges standardized threat modeling and property-based formal verification for System-on-Chip (SoC) security. Starting from vulnerability knowledge bases such as Common Weakness Enumeration (CWE), the framework identifies SoC-specific assets, maps relevant weaknesses, and generates assertion-based security properties and JasperGold scripts for verification. By combining asset-centric analysis with standardized threat model templates, it automates the transformation from vulnerability reasoning to formal proof. Evaluated on three HACK@DAC benchmarks, ATLAS detected 39 of 48 CWEs and generated correct properties for 33 of those 39, advancing automated, knowledge-driven SoC security verification toward a secure-by-design paradigm.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionSafety-critical integrated circuits targeting ISO 26262 ASIL-D require a failure-in-time (FIT) < 10. Physics-based tools for FIT prediction, such as BFIT, are accurate but slow, while ML baselines collapse under long-tailed FIT distributions. We propose ATLAS, a decoupled GNN framework that combines a bidirectional asynchronous topological message passing backbone, aligned with BFIT's forward and backward propagation, with synergistic RankNet and CalibNet heads. RankNet uses a gap-aware pairwise ranking loss with log-gap weighting Δlog(t) and explicit emphasis on head samples to identify top-k% high-risk gates. CalibNet employs a dual-domain loss in log and linear spaces, incorporating importance weighting, for accurate calibration. The proposed design achieves O(|V|+|E|) complexity, delivering up to 220× speedup over BFIT while reducing selective hardening area overhead by 53% compared to DeepGate2 on ITC'99 benchmarks, thereby enabling the rapid reliability-aware design of large-scale circuits. We plan to open-source our data and model code after acceptance.
People
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionIn this paper, we propose an analytical method for 3D mixed-size placement that incorporates macro-orientation awareness in face-to-face (F2F) terminal-bonded heterogeneous 3D ICs. This work introduces a novel one-pass placement framework that bypasses the conventional 2.5D co-optimization, effectively narrowing the gap between partitioning and placement and achieving significant time savings. Our method constructs a differentiable distributed macro-rotation system that unifies partitioning, placement, and macro-orientation within 3D global placement. To compensate for deviations arising from wirelength estimation in 3D placement, we further apply an FM-based post-partitioner. Experimental results on the ICCAD 2023 contest benchmark show that our approach achieves an 8% higher quality than the first-place contest entry and a 2% improvement against the best published method with 1.6× runtime speedup.
Engineering Presentation
Design
EDA
Security
Systems
DescriptionThis work presents a pre-silicon side-channel leakage assessment of an open-source AES-XTS hardware implementation at the RTL level. The complete encryption flow is functionally verified, and switching activity is captured and converted into high-resolution waveform data for power reconstruction. Test Vector Leakage Assessment (TVLA) and Correlation Power Analysis (CPA) are applied to identify statistically significant leakage across encryption cycles and key-byte disclosure. Results reveal dominant leakage during early intermediate datapath operations, motivating targeted security refinement. The study demonstrates the effectiveness of integrating vector-based power reconstruction with systematic leakage evaluation to expose vulnerabilities prior to fabrication, enabling cost-effective mitigation and improved hardware security robustness.
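For readers unfamiliar with TVLA, the sketch below shows its usual core: a Welch's t-test between a fixed-input trace set and a random-input trace set at every time sample, flagging samples where |t| exceeds the conventional 4.5 threshold. The trace layout (a list of equal-length lists of power values) and the function names are assumptions; this is not the authors' tool flow.

```python
import math
from statistics import mean, variance

TVLA_THRESHOLD = 4.5  # conventional |t| limit used in TVLA

def welch_t(group_a, group_b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    spread = variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    diff = mean(group_a) - mean(group_b)
    if spread == 0:
        # Degenerate case: both groups are constant at this time sample.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(spread)

def leaky_samples(fixed_traces, random_traces):
    """Return the time-sample indices whose |t| exceeds the TVLA threshold."""
    leaks = []
    for t in range(len(fixed_traces[0])):
        fixed_col = [trace[t] for trace in fixed_traces]
        random_col = [trace[t] for trace in random_traces]
        if abs(welch_t(fixed_col, random_col)) > TVLA_THRESHOLD:
            leaks.append(t)
    return leaks
```

In a pre-silicon flow, the traces would come from power values reconstructed from simulated switching activity rather than from an oscilloscope.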
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionHardware security competitions such as HackTheSilicon serve as benchmarking platforms for evaluating vulnerability detection methods and training humans and AI. However, our study reveals that LLMs threaten their validity. Instead of performing genuine security reasoning, LLM-based detectors exploit a diff-style syntactic comparison, achieving an 83% detection rate and undermining fair evaluation. To mitigate this, we propose the first LLM-oriented, semantics-preserving obfuscation framework for these benchmarks. Unlike IP-protection approaches, it applies human-readable transformations and controlled diff-noise while preserving functionality. On HackTheSilicon, the framework reduces LLM-based detection accuracy by 50% with only 10% obfuscation and by 78.6% under complete obfuscation, restoring benchmark reliability.
People
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionAs the capacitance extraction accuracy of rule-based pattern matching becomes difficult to sustain at advanced nodes, a growing trend is to develop deep-learning-based 2D capacitance models. However, existing MLP- and CNN-based methods constrain their input to fixed metal-layer combinations in a specific process node, limiting their usability in practice. Recognizing the inherent similarity between the capacitance matrix and the prevailing attention mechanism, we propose AttentionCap, a customized Transformer for capacitance matrix learning, with a Gram representation framework, a physics-aligned symmetric-attention output layer, and a novel normalized Laplacian loss. We also introduce a process-node embedding to enable multi-node learning. Trained on synthetic data, AttentionCap attains 0.67%/3.99% self/coupling-capacitance error on unseen real designs under a multi-layer and multi-node setting, surpassing the CNN-Cap baseline with 4.6x/5.7x lower self/coupling error and 192x faster inference speed. A pretrained AttentionCap accurately transfers to an unseen node with only 5K samples and 4K finetuning steps. With sufficient accuracy on unseen real designs and strong transferability to new process nodes, AttentionCap offers highly practical value for modern EDA workflows. Code and data are available at https://anonymous.4open.science/r/AttentionCap-release-F698.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionThis paper presents AttestLLM, the first attestation framework to protect device vendors' hardware-level intellectual property by ensuring that only authorized large language models (LLMs) can execute on target platforms. To overcome the scalability and efficiency limitations of prior work, AttestLLM leverages an algorithm/software/hardware co-design approach to embed robust watermarks onto the activation of LLM layers. In addition, it optimizes the attestation protocol within the trusted execution environment, providing efficient ownership verification without compromising inference throughput. Evaluations on various on-device LLMs demonstrate AttestLLM's attestation reliability, fidelity preservation, and efficiency. Furthermore, AttestLLM exhibits resilience against forgery, replacement, and system attacks.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionModern zero-knowledge proof (ZKP) schemes are moving from costly elliptic curve-based systems towards algebraic proof systems. This transition is ushering in a new era of broad adoption of hardware-friendly ZKPs, implementable even on resource-constrained edge devices. Algebraic structures, such as towers of binary fields, are core components of such schemes, and they require hardware and algorithmic innovation to realize advanced ZKPs, such as Binius and Binius-FRI, efficiently.
Hence, this paper introduces the ATTINA framework, a high-throughput additive number theoretic transform (ANTT) accelerator operating over extended binary fields. ANTT is the most computationally intensive task for realizing Binius. ATTINA partitions ANTT tasks into multiple subtasks and strategically allocates them to processing elements (PEs) and hardware programmable logic (PL) kernels in an interleaved computing pattern. ATTINA features a scalable architecture that enables dynamic parameter changes for different cryptographic applications. To understand its applicability, this paper implements ATTINA on an edge-deployable heterogeneous Versal adaptive system-on-chip (ASoC) platform. ATTINA on Versal ASoC maintains a throughput of 10 Gb/s. ATTINA outperforms a benchmark CPU implementation of ANTT on a high-end processor (i9-14900K) by 139× and a PE-only implementation by 38× in terms of latency while operating at ≤5 W for a 4096-point ANTT. ATTINA's code is published at https://anonymous.4open.science/r/ATTINA for evaluation and reproducible research.
Engineering Presentation
EDA
Security
DescriptionRecent RCA studies show that failures in digital sign-off rarely stem from EDA execution itself, but from the last-mile interpretation layer. Script-based automation, built on rigid syntax matching, breaks under tool/version drift and cannot enforce cross-stage causality, leading to silent misses. Human-centric checklists, introduced as a safety net, collapse under massive unstructured reports and schedule pressure, causing cognitive overload and false sign-off escapes.
We present Auto-Chk, a neuro-symbolic compiler framework that closes this interpretation gap. Instead of hard-coding scripts, Auto-Chk compiles natural-language verification intent into a deterministic intermediate representation, ItemSpec, which decouples intent from implementation. Generated checkers are stress-tested using adversarial, metamorphic validation without requiring golden data, and executed inside a secure sandbox with runtime monitoring. A cross-layer self-healing loop automatically adapts to log or tool changes. On top, an evidence-driven dashboard and role-based copilots transform raw logs into confidence-aware decisions and organizational knowledge.
Auto-Chk has been deployed at production scale, covering over 120 checklist items across multiple IPs, EDA tool versions, and process nodes. Results demonstrate a 20× reduction in checker development time, 100% semantic consistency, and sub-30-minute time to recovery, with full project convergence achieved within 15 hours. Outputs are delivered through a project-level dashboard and a role-based copilot, providing executable, resilient sign-off intelligence.
Exhibitor Forum
AI
EDA
Systems
DescriptionAI can already help individual engineers write code faster, but generating the complex, interdependent artifacts that silicon design needs - like testbenches, reference models, checkers and RTL - requires intelligent harnesses and agent orchestration layers that understand how these artifacts relate and how they're validated.
Existing applications of AI for silicon design fall into two patterns: general-purpose coding agents that lack native understanding of silicon workflows, and LLM wrappers around legacy tool suites that reach baseline quality on isolated tasks but degrade as workflows chain across steps. Both generate artifacts in isolation, and each step can introduce errors that compound downstream. A holistic approach to generation requires continuous feedback among specifications, simulation results, downstream design metrics, and the engineers themselves.
Normal Computing addresses these limitations by building AI systems that learn from every artifact generated, every human edit, every simulation run, and every downstream signal about power, performance, and area. Generation and validation agents, trained on silicon engineering trajectories and simulation feedback, with native access to the toolchains, are the foundation. We treat continual learning as the architecture, and each interaction strengthens a shared knowledge base that grounds future generations in the design intent of the team, and the physics. Auto-formalization is one mechanism within this system, used to construct structured intermediate representations that keep complex artifacts coherent as they scale. Closing the loop with our own synthesis and analysis tools lets the system reason about quantitative design tradeoffs while it generates RTL and verification artifacts that are aware of the constraints they'll be measured against.
This talk presents the architecture behind these learning loops, benchmarks against general-purpose approaches on production-representative tasks, and results from deployments with top semiconductor partners globally.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe rapid evolution of 3DIC stacking technologies has unlocked new opportunities for scaling and system technology co-optimization. However, these technologies also bring increased power density and thermal impedance while simultaneously expanding the design space. In this environment, the consequences of early architectural decisions are magnified, the information available to make them is sparse, and speed is of the essence.
We present an automated 3Dblox-based solution that streamlines the generation and evaluation of multi-die architectures, enabling architects to rapidly prototype, analyze, and optimize complex 3D Multi-Chip Modules for thermal and IR performance. Our work builds upon an existing GUI-driven prototyping system which supports flexible floor planning and power density specifications without requiring RTL or netlist data. Here we integrate the generation of industry standard 3dbv, 3dbx, and 3dbf files for analysis with EDA tools.
By leveraging a combined internal and EDA tool flow, we reduce simulation turnaround time for thermal and IR feasibility from days to minutes. This approach identifies thermal and IR hotspots, facilitating better-informed design decisions early in the process. We demonstrate a scalable and practical methodology for early-stage 3DIC design, shifting critical analysis left in the development cycle and enabling broader, more rigorous exploration of next-generation system architectures.
Late Breaking Results
DescriptionDesigning device fingerprinting circuits for FPGAs is challenging due to limited knowledge about device layout and manufacturing-induced biases, often resulting in poor uniqueness and biased responses. Existing mitigation techniques are typically device-specific, time-consuming, and difficult to generalize. We propose an automated EDA tool that analyses the target FPGA and optimizes the circuit layout to maximise extracted entropy and mitigate systematic bias.
Across a 100-device testbed, the tool outperformed comparable manual designs, achieving near-ideal uniqueness and bias. These results demonstrate that automated device-aware design can reliably estimate and approach the best achievable performance on a given FPGA.
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionUndocumented instructions pose a growing security threat, yet current research is limited.
Existing work focuses on discovery with constrained techniques, while semantic analysis is manual and ad-hoc. Moreover, a significant gap exists in methodologies for systematic security verification. This paper presents a comprehensive framework for the automated discovery, systematic classification, and security verification of undocumented instructions. Our evaluation discovered 23 new instructions and classified 29 using a classifier with 99.8357% accuracy. Most importantly, we verified that 5 instructions have tangible security implications, demonstrating our approach's efficacy in addressing threats at the processor instruction level.
Engineering Presentation
EDA
Systems
DescriptionDesigning System-on-Chips (SoCs) requires a well-organized development process. Pre-silicon validation/verification plays a crucial role; it can be performed in various ways. Here, simulation and prototyping are addressed. This presentation focuses on digital devices. Therefore, FPGA-based prototyping fits perfectly. SoC development requires significant time for RTL coding and IP integration. The design evolves daily. Continuous simulations are launched, and prototyping must keep pace. Hence, constant realignment between the two environments is necessary. The FPGA scenario differs slightly from the simulation one: some blocks must be replaced (memories, PLLs and others) when switching from one domain to another. This replacement essentially involves managing and modifying lists of source code files (Verilog, SystemVerilog and VHDL) used for simulation and FPGA synthesis. Given the complexity of modern designs, manual handling could lead to human mistakes. This presentation describes a system, based on Python scripts, for the automatic transition from simulation to prototyping (and back).
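The file-list handling described above might look like the following sketch, in which a replacement table maps simulation-only sources to their FPGA counterparts and is applied in either direction. The file paths, the table, and the function names are hypothetical; the presentation's actual scripts are not shown in the abstract.

```python
# Hypothetical mapping from simulation-only sources to FPGA replacements.
FPGA_REPLACEMENTS = {
    "rtl/mem/sram_model.v": "fpga/mem/bram_wrapper.v",
    "rtl/clk/pll_model.sv": "fpga/clk/pll_fpga.sv",
}

def to_fpga(sim_files, replacements=FPGA_REPLACEMENTS):
    """Swap every simulation-only file for its FPGA counterpart, keeping order."""
    return [replacements.get(path, path) for path in sim_files]

def to_sim(fpga_files, replacements=FPGA_REPLACEMENTS):
    """Inverse transform: restore the simulation models from an FPGA file list."""
    reverse = {fpga: sim for sim, fpga in replacements.items()}
    return [reverse.get(path, path) for path in fpga_files]

SIM_FILELIST = ["rtl/top.sv", "rtl/mem/sram_model.v", "rtl/clk/pll_model.sv", "rtl/core.vhd"]
fpga_filelist = to_fpga(SIM_FILELIST)
```

Keeping the mapping in a single table means one edit updates both directions, which helps the two environments stay aligned as the design changes.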
Engineering Presentation
AI
Design
EDA
DescriptionSignoff-quality standard cell libraries are critical for successful SoC tape-outs. Even minor inaccuracies in Liberty (.lib) models can lead to late-stage failures in timing, power, IR drop, and noise analysis. Manual QA of Liberty files is error-prone due to their complexity, especially with statistical LVF and CCS waveform data for advanced nodes.
Traditional validation methods are slow, resource-intensive, and often miss subtle inconsistencies. The need for a scalable, automated, and SPICE-accurate validation framework was driven by:
• Increasing complexity of Liberty views
• Growing number of PVT corners and cells
• Demand for early detection of outliers to reduce debug cycles
We present an automated validation framework methodology covering
• Moments-based LVF validation: Reduces false positives from traditional sigma checks.
• STA-like LVF comparison: Uses effective delay/transition metrics for realistic analysis.
• Automated SPICE setup: Supports all arcs, slew/load conditions, and parasitics.
• Insight-driven debug: Visual trend plots accelerate root cause analysis.
Applied to over 1000 cells across 30+ PVT corners, the methodology enables early outlier detection, improves SPICE-level correlation, and delivers up to 2× productivity gains in validation and revision analysis. The results demonstrate a scalable approach to achieving consistent, signoff-ready Liberty libraries with reduced silicon risk.
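A moments-based comparison of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' framework: the function names, the dictionary keys, and the 5% tolerance are hypothetical, and the mapping of these statistics onto specific Liberty LVF attributes is not shown.

```python
import math


def sample_moments(delays, nominal):
    """Mean shift, standard deviation, and standardized skewness of samples."""
    n = len(delays)
    mean = sum(delays) / n
    variance = sum((d - mean) ** 2 for d in delays) / n
    std = math.sqrt(variance)
    third = sum((d - mean) ** 3 for d in delays) / n
    skew = third / std ** 3 if std > 0 else 0.0
    return {"mean_shift": mean - nominal, "std_dev": std, "skewness": skew}


def flag_outliers(lib, spice, rel_tol=0.05, abs_tol=1e-12):
    """Names of moments where library and SPICE values disagree."""
    return [
        key for key in spice
        if not math.isclose(lib[key], spice[key], rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical Monte Carlo delay samples and library values for one arc.
spice = sample_moments([1.0, 2.0, 3.0, 4.0, 5.0], nominal=2.5)
library = {"mean_shift": 0.5, "std_dev": 2.0, "skewness": 0.0}
mismatches = flag_outliers(library, spice)
```

Comparing each moment separately shows which aspect of the distribution disagrees, which a single sigma check cannot distinguish.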
Engineering Presentation
AI
EDA
Systems
DescriptionChiplets allow flexibility of design, enabling higher yields and faster turnaround times compared to monolithic SoCs. As part of creating the Cadence chiplet development solution, various areas for automation were identified to speed up the emulation and performance processes, thereby reducing the time required to bring chiplet designs to market.
Provision of emulation platforms early in the design cycle allows real hardware to be used by stakeholders such as system architects and software developers, letting them kick off their activities significantly earlier and reducing inter-team dependencies by creating a cohesive flow between all departments.
The verification flow uses outputs from design automation to generate the simulation & emulation platforms, the verification environment and tests. Tests are written using the Accellera Portable Test and Stimulus Standard (PSS) to allow C code to be generated for different target platforms, reducing duplication and saving effort and time.
The flow further generates the performance testing suite allowing metrics to be automatically retrieved between any defined source and end point using the Cadence System Performance Analyser. Availability of performance metrics early in the design cycle results in earlier design closure, reducing late architectural changes and increasing confidence in functionality and performance.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAs power integrity margins continue to tighten in advanced nodes and 3DIC designs, identifying the true worst-case PDN stress scenarios across multiple compute tiles has become increasingly complex and time-consuming. Traditional manual exploration of block workloads and vector combinations often requires expert intervention, exhaustive simulations, and significant runtime. To address this challenge, we present an automated RedHawk-SC framework that intelligently explores Reduced Order Model (ROM)-based block workload combinations to uncover high-stress PDN conditions with minimal runtime overhead.
The proposed flow introduces a novel scenario optimization algorithm that automatically generates, ranks, and prunes workload mixes for each compute tile, capturing die-package co-analysis through integrated RedHawk-SC simulations. The framework incorporates workload intelligence to learn per-block power impact, adaptive optimization to select optimal workload pairings using a greedy exploration algorithm, and scalability across multiple domains and workload vectors. Implemented as a Python wrapper, it delivers a plug-and-play architecture supporting automation and rapid integration within existing analysis environments.
This automated flow replaces expert-driven manual analysis with a data-guided methodology that ensures accurate prediction of voltage stress regions, early detection of grid vulnerabilities, and stronger silicon correlation. The approach simplifies scenario complexity, enhances reliability coverage, and accelerates sign-off for chiplet and 3DIC architectures.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionTiming constraints are written to specify the design intent. SDC constraints are needed to properly: specify clocks in a design, instantiate on the correct pins/ports, compute uncertainty for clocks, and assert behavior/interaction of clocks in a design.
Timing constraints are typically written manually by the designer. The work can be tedious, time-consuming, and prone to missed interactions or calculation errors, leading to lengthy iterations and the risk of hardware failures or excessive power.
We present a novel tool, ClockACE, to automatically create and apply all necessary timing constraints.
This saves a significant amount of work for the designer and prevents human error in computation or missing constraints. Any change to a clock spec is automatically re-computed within the timing run.
In this paper we present ClockACE for automatic constraint generation. We highlight a specific example using generated clocks, and the challenges with implementation of this clocking paradigm. We show how ClockACE handles this scenario automatically.
Engineering Presentation
EDA
Security
DescriptionAs SoC complexity scales, the resulting explosion in Automatic Test Pattern Generation (ATPG) pattern counts has become a primary driver of escalating manufacturing test costs. A critical bottleneck in this process is the emergence of "TCPF (Test Cost Per Fault) Hotspots" across designs: highly complex, auto-generated blocks such as Control and Status Registers (CSRs) that exhibit disproportionately high pattern counts due to deep combinational logic and structural bottlenecks like shared address buses. Traditional structural test point insertion often fails to address these architectural constraints effectively, as it operates on a late-stage netlist without design-specific context.
This paper proposes an "Extreme Left-Shift" methodology that moves DFT intelligence directly into the specification-to-RTL generation phase. By integrating Design-Aware Test Points (DATP) natively within a SystemRDL-to-RTL generator, we introduce architectural enhancements—including address parallelism, local protocol control, and optimized OR-reduction logic—that are structurally impossible to implement efficiently at the netlist level. This automated approach utilizes intelligent algorithms to analyze input RDL specifications and fine-tune DATP parameters at scale.
We demonstrate the efficacy of this methodology on a production-grade Tensor SoC. Experimental results show that while maintaining a minimal area overhead (~0.1%), the proposed DATP solution achieves a 4x improvement in TCPF at the IP level and a 50% overall reduction in ATPG pattern count at the top level compared to reference designs. By solving DFT bottlenecks at the source, this scalable framework provides a robust path for significantly reducing SoC test costs across the industry.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionTest coverage, test time, and test power are trade-offs in automotive system logic test. Unlike test coverage and test time, power is a critical factor that is hard to predict and evaluate during implementation. Logic test usually toggles a large number of registers at the same time, resulting in scan-shift IR drop that might fail the test. Reducing pattern toggling would result in longer test time. Months of time-consuming iterations are required to validate and fine-tune the scan structure for lower peak current until IR analysis is completed.
Automotive System Logic Test Power management includes two parts: Pseudo-CPM Analyzer for test current prediction, and ASKA + POWA for shift power reduction. It helps users fine-tune the design during the design phase, and IR drop was reduced by 3.2% with add-on clock skew on a real product. The product also met its design spec for logic test coverage, test time, and power, and is silicon-proven.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionThis paper presents the first self-evolved logic synthesis framework, which leverages Large Language Models (LLMs) in a multi-agent setup to autonomously improve the source code of ABC. Our approach builds upon recent work in LLM-driven code evolution and extends the idea to the substantially more complex monolithic ABC codebase. We bootstrap the process with established human-designed optimizations and external research code. In each iteration, the agents propose and implement code modifications, which are then validated for correctness and evaluated on standard benchmark circuits to provide quality-of-result (QoR) feedback. Over time, the framework organically discovers improvements beyond the initial heuristics and completes the code evolution autonomously across the whole ABC repository, effectively learning to progress toward a better synthesis tool.
Exhibitor Forum
AI
EDA
Systems
DescriptionHu-mind.ai VLSI Teammate autonomously handles the full design cycle of a digital block, from initial concept to timing constraints. The Teammate executes these key phases: (1) Architecture: Transforming definitions into formal architecture documents. (2) RTL Design: Generating high-quality RTL code from architectural specs. (3) Test Planning: Drafting comprehensive test plans for robust verification. (4) Verification: Building SystemVerilog environments and test cases based on the test plan. (5) Debug: Running simulations and resolving design issues. (6) Timing: Composing SDC (Synopsys Design Constraints) for synthesis. Whether you're an RTL designer, a verification lead, or a physical design engineer, this shows how the VLSI Teammate augments your workflow to hit tape-out faster. With VLSI Teammates, every engineer steps into the role of a team leader. Communication with your virtual team is facilitated through a command-line interface, a web interface, or via your preferred instant messaging application on any device.
People
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionFlip-flops (FFs) are a critical component affecting system-level power, performance, and area (PPA). Many logic-based FF design methods have been introduced to expand the scope of topology exploration beyond intuition. However, their exploration scope is still limited to 2-bit finite state machines (FSMs) because of inefficient search space representation. We present B-Flex, an automated FF FSM search that integrates a complete, graph-based behavioral equivalence check into a pruning-based generative search. This efficient and scalable approach vastly expands the design space and allows exhaustive exploration of 3-bit-state FSMs. B-Flex has identified over 568 million valid FF mechanisms, including many novel 3-bit-state FF designs. Topology synthesis on 20 sampled FSMs has produced several high-performance FF circuits that outperform conventional FFs. For instance, FF1 achieves a 2.53× speedup over a transmission-gate FF (TGFF) and an improved metastability window at 0.9 V.
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionThe increasing adoption of GPUs in real-time systems necessitates precise timing analysis for GPU thread blocks to ensure overall system predictability. However, the uncertainties of branch executions within GPU warps impose significant barriers for predicting the worst-case execution time (WCET) of the thread block. Existing WCET analysis for thread blocks typically assumes deterministic warp execution paths, which is unrealistic given the dynamic control flows of threads within the warps. Moreover, as the warp scheduler operates as a black box, the analysis must rely on relaxed scheduling assumptions, resulting in overly-pessimistic bounds in order to cover edge scenarios. This paper first establishes the need for static analysis by showing how branch divergence can trigger timing anomalies and influence block WCETs. We then develop an exact WCET analysis for a GPU thread block under the same scheduling constraints as prior work while not imposing any warp execution path assumptions. Furthermore, by enforcing a practical constraint on warp executions, we present a tighter analysis that enhances system predictability. Experiments show that the proposed analysis under warp execution constraints significantly reduces the WCET estimations of GPU thread blocks (by 19.13% on average and up to 39.39%).
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionCircuit Representation Learning (CRL) offers a powerful paradigm to guide and optimize core Electronic Design Automation (EDA) tasks, but its practical adoption is hindered by the immense scale of industrial netlists and a failure to explicitly model register-level temporal dynamics. To overcome these barriers, we introduce DeepSeq3, a novel hierarchical framework that abstracts circuits into a two-level representation: fine-grained combinational subgraphs partitioned by flip-flops (FFs), and a high-level Super-Node Graph (SNG) that models the register-transfer structure. A dual Graph Neural Network (GNN) architecture learns representations at both levels, capturing local Boolean logic and global state transitions. Crucially, we introduce a state-centric pre-training scheme that predicts the reachability between FF states, endowing the model with a deep understanding of temporal behavior. Demonstrated on large-scale benchmarks, DeepSeq3's approach yields superior scalability and richer representations, reducing bounded model checking (BMC) solving time by 18% while guaranteeing correctness. Our code is available at https://anonymous.4open.science/r/DeepSeq3-6760
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionDespite recent progress, every CPU fuzzer explores only a narrow slice of the vast micro-architectural state space due to its fixed mutation and feedback biases. While different fuzzers thus excel in disjoint regions, naïve combination fails because of conflicting strategies, seed pollution, and early saturation.
LiFU introduces micro-architecture-aware orchestration that dynamically profiles heterogeneous fuzzers, detects complementary strengths, decreases harmful interactions, and steers each fuzzer in real time using coverage and bug feedback, augmented by semantic seed triage.
Evaluated on the BOOM core, LiFU achieves 93.4% line, 43.6% FSM, and 95.7% condition coverage (90.1%, 75.0%, 78.3% on Rocket) with around 40% fewer tests than the best standalone fuzzer, consistently closing long-standing verification gaps in modern CPUs.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern AI accelerators, such as Meta's MTIA, rely on massive arrays of Processing Elements (PEs) to maximize compute density. This architecture creates a design structure known as MIMs (Multiple Instantiated Modules). While these repetitive structures are essential for physical design optimization, they create a verification nightmare when doing flat analysis. Most modern flows rely on hierarchical verification using abstract models of smaller blocks at top level. However, in massive AI chips even this approach falls short due to the sheer volume of redundant Clock Domain Crossing (CDC) and Reset Domain Crossing (RDC) checks generated when large subsystems are instantiated repeatedly, causing chip-level flow runtimes to become unmanageable.
This paper presents a static verification methodology designed to overcome these scalability challenges in reticle-limit AI chips. While standard hierarchical flows utilizing Block Abstract Models attempt to manage complexity, the physical repetition of MIMs results in top-level analysis runs that exceed 26 hours and consume up to 12x more memory due to redundant reporting.
To address this, we propose an RTL-based "stubbing" technique that selectively isolates unique instances of large blocks, such as Processing Elements, while neutralizing redundant copies via parameterized RTL defines. By shifting signoff to the block level and reducing the scope of top-level analysis, this methodology reduces the number of flat instances from 14.4 billion to 1.1 billion. Consequently, this approach eliminates redundant crossings, slashing verification runtime by 90% (down to 3.5 hours) and significantly reducing storage requirements, thereby enabling faster design convergence and feedback loops.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionLook at the block diagram of any modern SoC, whether a standalone device or a chiplet, and you will see that most of the blocks, and most of the die area, consist of generated logic. From regular structures like RAMs and ROMs, to semi-regular structures like control/status registers and NoCs, to entire processors, the HDL source and other input files for these blocks are produced by tooling from a higher-level specification, not typed by designers or output by AI. My work revolves around raising the level of abstraction used when writing HDL generators beyond scripts that output text. I propose that generators should not emit HDL text directly, as a "script" does; rather, generators should be written against an API that assembles the HDL for output. This assures that the generator can only output syntactically valid HDL, and that errors are reported in the context of user input. This is especially important when an HDL generator is a product, where failures and poor error reporting result in higher support costs for vendors and poor outcomes for customers.
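The difference between printing HDL text and assembling it through an API can be sketched in a few lines. This is an illustration only, not the presenter's tool: the `Module` and `Port` classes and their methods are hypothetical, and the identifier check is simplified (it does not reject Verilog keywords).

```python
import re
from dataclasses import dataclass, field

# Simplified rule for a Verilog simple identifier (keywords are not checked).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


@dataclass
class Port:
    direction: str
    name: str
    width: int = 1


@dataclass
class Module:
    name: str
    ports: list = field(default_factory=list)

    def __post_init__(self):
        if not _IDENT.match(self.name):
            raise ValueError(f"module {self.name!r}: not a valid identifier")

    def add_port(self, direction, name, width=1):
        """Validate a port against the user's input before storing it."""
        if direction not in ("input", "output", "inout"):
            raise ValueError(f"port {name!r}: unknown direction {direction!r}")
        if not _IDENT.match(name):
            raise ValueError(f"port {name!r}: not a valid identifier")
        if width < 1:
            raise ValueError(f"port {name!r}: width must be >= 1, got {width}")
        if any(p.name == name for p in self.ports):
            raise ValueError(f"port {name!r}: declared twice")
        self.ports.append(Port(direction, name, width))
        return self

    def emit(self):
        """Render the assembled module; text is produced only at this point."""
        decls = []
        for p in self.ports:
            rng = f"[{p.width - 1}:0] " if p.width > 1 else ""
            decls.append(f"  {p.direction} {rng}{p.name}")
        return f"module {self.name} (\n" + ",\n".join(decls) + "\n);\nendmodule\n"


csr = Module("csr_block").add_port("input", "clk").add_port("output", "rdata", 32)
text = csr.emit()
```

Because every name and width is checked when it is added, a bad specification fails with a message naming the offending port rather than surfacing later as a parse error in a downstream tool.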
Work in Progress
DescriptionRowhammer attacks repeatedly activate DRAM rows and cause bit flips in adjacent rows. DDR5 introduced Per-Row Activation Counting (PRAC) to address this, but static thresholds and unnecessary distant-row refreshes limit its efficiency. This paper proposes Dynamic-PRAC, which adapts mitigation thresholds based on service-queue utilization and refines refresh actions by row distance. Experiments show that Dynamic-PRAC improves performance over PRAC, MOAT, and QPRAC while lowering overhead. Security results further demonstrate the lowest unmitigated activation count and up to 73% reduced energy use, offering an efficient and balanced Rowhammer mitigation mechanism.
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionDeep neural networks tend to make overconfident predictions on unseen data, potentially raising risk in safety-critical domains. Bayesian neural networks (BayesNNs) can estimate the uncertainty of prediction results by modeling the uncertainty of model weights. However, BayesNNs have expensive computing costs due to repeated sampling while evaluating uncertainty, which makes them inapplicable in practice. This work proposes a binarization framework for BayesNN, named Binarized Bayesian Neural Network (BiBNN), thus unlocking the potential for applications on embedded devices. We train real latent weights with a Gaussian prior distribution by Variational Inference and use binary weights with a Bernoulli distribution to perform inference. Experimental results show that BiBNN retains similar uncertainty evaluation capability at only 11.4% of the computing cost of BayesNN. BiBNN also outperforms binary neural networks by 1.5% on the CIFAR-10 dataset in terms of inference accuracy.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionDespite decades of success, CMOS technology is increasingly constrained by power dissipation and propagation delay, thus motivating the exploration of photonic integrated circuits as an alternative for high-speed, energy-efficient computing. However, existing optical logic synthesis frameworks rely on an oversimplified efficiency factor model that fails to accurately capture physical attenuation and the hierarchical nature of signal degradation. This work presents Bident, a comprehensive optical logic synthesis framework that provides a physically accurate loss model and achieves optimal combiner configuration. We introduce the binary aggregate model, which explicitly reflects the binary-tree topology and input ordering in multi-input combiners. Based on this model, the Huffman tree optimization guarantees optimal Y-branch combiner configuration without requiring additional hardware resources. Meanwhile, a greedy algorithm optimally employs directional couplers at the leaf layer of a binary-combiner tree to achieve harmonic-mean efficiency at minimal hardware cost. Experimental results demonstrate that Bident achieves superior signal-attenuation reduction and lower switch cost than state-of-the-art methods, using both the conventional efficiency factor model and our binary aggregate model. A schematic-level optical circuit simulation further validates the feasibility and effectiveness of our binary-combiner configurations.
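For readers unfamiliar with the construction, the textbook Huffman procedure behind a binary combiner tree can be sketched as below. This is not Bident's binary aggregate model: it assumes, purely for illustration, that every two-input combiner stage costs each passing signal the same fixed loss, so an input at depth d crosses d stages, and that the goal is to minimize the weighted sum of depths. The input weights and the function name `huffman_depths` are hypothetical.

```python
import heapq
from itertools import count


def huffman_depths(weights):
    """Depth of each input leaf in a Huffman tree built over the weights."""
    tie = count()  # tie-breaker so the heap never compares the index lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    depths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one stage deeper.
        for i in left + right:
            depths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), left + right))
    return depths


weights = [1, 1, 2, 4]  # hypothetical per-input weights
depths = huffman_depths(weights)
weighted_depth = sum(w * d for w, d in zip(weights, depths))
```

Under these assumptions the heaviest input ends up closest to the root and therefore crosses the fewest combiner stages.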
Engineering Presentation
AI
Design
EDA
DescriptionAutomating Register-Transfer Level (RTL) code generation and summarization is critical for improving hardware design productivity and reducing time-to-market. However, existing approaches struggle due to limited Verilog and VHDL training data and the structural concurrency inherent in hardware description languages (HDLs). We propose BiMem-RAG, a dual-memory retrieval and bidirectional reasoning framework for HDL synthesis and documentation. BiMem-RAG combines (i) an exemplar memory of specification-to-code and code-to-summary pairs indexed using joint code and abstract syntax tree embeddings, and (ii) an analogical memory capturing reusable hardware design patterns such as sequential and combinational logic. The system employs modular decomposition, reinforcement-guided retrieval, and backward verification to improve semantic fidelity and reduce hallucinations. Experiments on Verilog and VHDL benchmarks demonstrate up to 19 percentage-point improvements in Pass@1 and 9-point gains in ROUGE-L over strong baselines. These results show that structured retrieval and verification significantly improve RTL generation and summarization, enabling more reliable hardware design automation.
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionIn advanced packages, large power/ground (P/G) planes are essential for power integrity but prone to warpage. While a uniform metal-density constraint combats warpage, mandatory degassing holes complicate adherence and degrade IR-drop. We propose BLADE, the first P/G plane synthesis methodology that enforces this constraint while optimizing power integrity. BLADE expands P/G planes from skeletons under metal density control to ensure connectivity, naturally leaving holes for degassing. Guided by a fast IR-drop evaluator, our bi-level Bayesian optimization framework efficiently navigates the design space. Experiments on six SiP cases demonstrate that BLADE satisfies the metal-density constraint and achieves superior IR-drop performance.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIncremental changes are common during IP delivery, with approximately half of IP releases involving limited updates such as view additions, PVT updates, cell fixes, or layout modifications. While expected, these changes significantly complicate IP qualification, as conventional version-to-version validation often produces excessive false positives, increasing debug effort and slowing delivery cycles. Efficiently distinguishing intended modifications from unintended regressions is therefore critical to maintaining IP quality at scale.
This paper presents a profile-driven IP validation framework deployed at STMicroelectronics using Solido IPDelta and integrated within the internal IP QA infrastructure. The approach introduces predefined validation profiles that explicitly model expected changes between IP versions, such as layout-only updates, layout-plus-abstract changes, netlist updates, or liberty model revisions. By waiving known, intentional differences through profile definitions, the framework isolates unexpected inconsistencies for focused analysis across CAD views including GDS, OA, LEF, DEF, netlists, and liberty formats.
Applied across diverse IP types and technology nodes, the profile-based methodology significantly reduces validation noise and improves debugging efficiency. Measured results demonstrate up to a 60% reduction in manual QA effort per IP delivery, while providing a scalable and consistent validation solution for evolving IP ecosystems. This work shows that profile-driven change detection enables robust, efficient IP qualification in incremental delivery environments.
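The waiving idea described above can be illustrated with a minimal Python sketch. It is not the Solido IPDelta interface or the STMicroelectronics flow: the profile names follow the examples in the abstract, but the mapping from each profile to the views it is allowed to change, the `Diff` record, and the `triage` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical profiles: each one lists the CAD views that are EXPECTED to
# differ between two IP versions. The view sets are illustrative assumptions.
PROFILES = {
    "layout_only": {"GDS", "OA"},
    "layout_plus_abstract": {"GDS", "OA", "LEF"},
    "netlist_update": {"netlist"},
    "liberty_revision": {"liberty"},
}


@dataclass(frozen=True)
class Diff:
    view: str    # CAD view in which the difference was detected
    detail: str  # free-text description of the difference


def triage(diffs, profile):
    """Split diffs into (waived, unexpected) according to the chosen profile."""
    expected_views = PROFILES[profile]
    waived = [d for d in diffs if d.view in expected_views]
    unexpected = [d for d in diffs if d.view not in expected_views]
    return waived, unexpected


# Example: a release declared as layout-only that also touched a liberty model.
diffs = [
    Diff("GDS", "metal2 shape moved in cell A"),
    Diff("OA", "layout view updated"),
    Diff("liberty", "setup arc value changed"),
]
waived, unexpected = triage(diffs, "layout_only")
```

With the `layout_only` profile, the two layout differences are waived and only the liberty difference is left for manual review, which is the kind of noise reduction the abstract describes.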
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionGraph neural networks (GNNs) are crucial for numerous applications, yet their huge computational demands often lead to suboptimal performance.
Hyper-Dimensional Computing (HDC) is a brain-inspired learning approach for efficient and robust learning.
HDC-based Graph Learning (HDGL) shows significantly improved computational efficiency and accuracy by learning graph representations in a high-dimensional space. Despite these advantages, general-purpose computing platforms such as CPUs and GPUs are insufficient for efficiently handling HDGL tasks.
In this paper we propose an accelerator called HDGAS through algorithm and hardware co-design for HDGL.
Based on the insight that not all node features in a graph are equally important, we propose to jointly optimize a lightweight filter with the HDGL model to dynamically identify and eliminate less significant node features during runtime.
Moreover, we design a specialized system architecture for end-to-end HDGL acceleration, harnessing the proposed dynamic sparsification technique in tandem with the inherent SpMM operations within HDGL.
Extensive experiments demonstrate that HDGAS achieves $6.76\times$ ($69.31\times$) speedup and $7.58\times$ ($80.12\times$) energy-efficiency improvements over GNN accelerators (GPU).
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis paper presents the formal verification of a subsystem with indirect read mechanisms and a nonstandard access protocol. Verifying registers using the control and status register (CSR) protocol typically requires a proof accelerator to convert a standard bus into the CSR protocol. However, this secure design uses a custom protocol, creating unique formal verification challenges. In this design, writing to an IP register through the subsystem is straightforward and followed by polling a status register to check the write data status. The read data is available at a different address within the subsystem register map.
To address this, a protocol-to-protocol AMBA bridge is developed. The bridge converts write commands from the verification intellectual property (VIP) into writes for the device under test (DUT) and reads data from the DUT, returning it to the corresponding VIP address. Because the DUT does not return data at the same address, address conversion logic with several finite state machines (FSMs) is implemented. This approach allows the VIP to perform checks as expected, despite the indirect read mechanism.
Counter value abstraction is used to manage delays and simplify verification, ensuring efficient completion. This approach enables robust and efficient register verification for complex subsystems and saves significant time compared with the UVM register model.
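The indirect read mechanism and the role of the bridge can be pictured with a small behavioral model in Python. This is only a sketch under stated assumptions, not the authors' AMBA bridge or its FSMs: the status address `0xF0`, the read-back offset `0x100`, the fixed write latency, and both class names are hypothetical.

```python
class IndirectReadDut:
    """Toy DUT: written data becomes readable at a DIFFERENT address, and a
    status register reports when the write has completed."""

    STATUS_ADDR = 0xF0   # hypothetical status register address
    READ_OFFSET = 0x100  # hypothetical offset of the read-back window

    def __init__(self, latency=3):
        self.regs = {}
        self.latency = latency
        self.busy = 0  # remaining status polls before the write is "done"

    def write(self, addr, data):
        self.regs[addr + self.READ_OFFSET] = data
        self.busy = self.latency

    def read(self, addr):
        if addr == self.STATUS_ADDR:
            done = self.busy == 0
            if self.busy:
                self.busy -= 1
            return 1 if done else 0
        return self.regs.get(addr, 0)


class Bridge:
    """Presents an ordinary same-address read/write view to the checker."""

    def __init__(self, dut, max_polls=16):
        self.dut = dut
        self.max_polls = max_polls

    def write(self, addr, data):
        self.dut.write(addr, data)
        # Poll the status register until the DUT reports completion.
        for _ in range(self.max_polls):
            if self.dut.read(self.dut.STATUS_ADDR) == 1:
                return
        raise TimeoutError("status register never reported completion")

    def read(self, addr):
        # Address conversion: fetch from the read-back window instead.
        return self.dut.read(addr + self.dut.READ_OFFSET)


dut = IndirectReadDut()
bridge = Bridge(dut)
bridge.write(0x10, 0xAB)
value = bridge.read(0x10)
```

Through the bridge, a write to `0x10` followed by a read of `0x10` returns the written value, even though the toy DUT itself only exposes that value at `0x110`.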
Exhibitor Forum
AI
EDA
Systems
DescriptionOboe is accelerating the future of silicon development with next-generation Emulation and Waveform platforms purpose-built to unlock true agentic chip design. While AI has yielded remarkable productivity gains in software, hardware engineering lags behind as agents are severely bottlenecked by legacy EDA infrastructure. During frontend verification, both human engineers and autonomous agents waste critical time waiting for traditional simulation engines to return results, stalling the iterative loop and limiting agentic effectiveness. As agents drop the cost of forming hypotheses and generating tests, the burden of data generation, evaluation, and retrieval only becomes more acute. In this presentation, we discuss Oboe’s solutions for accelerating simulation, bringing the digital verification workflow to the speed of agents and ultimately enabling rapid and accurate root cause analysis.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAchieving full formal sign-off for deep hardware tracking structures (critical components in high-bandwidth networking and AI accelerators) often stalls against a 'Complexity Wall' as pipeline depths increase. Standard Bounded Model Checking (BMC) requires sequential unrolling proportional to latency, leading to an exponential state-space explosion that typically renders proofs inconclusive beyond 32 stages. For an N-stage buffer with W-bit identifiers, the total unrolled state complexity grows as O(N^2×W), quadrupling the verification effort with every depth doubling.
This paper introduces a scalable formal verification methodology that shifts the verification paradigm from deep sequential searching to single-cycle inductive transitions, effectively achieving linear scalability (O(N×W)). The approach uses an LLM-driven "Impact Grading" framework that automates the discovery of high-impact inductive invariants. By categorizing candidate properties into a tiered hierarchy (S/A/B), we isolate a 'Golden Triangle' of invariants (State Mapping, Ingress Consistency, and Control-Path Sync) that bridge the reference model and hardware state.
Experimental results demonstrate a paradigm shift: while standard BMC remains inconclusive at 64 stages, our methodology achieves 100% mathematical convergence in ~50 minutes. Furthermore, we demonstrate deterministic scalability up to 256 stages in approximately 7 hours. This work establishes a repeatable, automated framework for the formal sign-off of high-latency structures previously deemed unreachable.
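The "quadrupling with every depth doubling" statement follows directly from the stated bounds. As a worked check, assume the costs are exactly proportional to those bounds, with hypothetical constants $c_1$ and $c_2$ that are not given in the abstract: $C_{\text{BMC}}(N) = c_1 N^2 W$ for unrolled BMC and $C_{\text{ind}}(N) = c_2 N W$ for the inductive formulation. Doubling the depth then gives:

$$
\frac{C_{\text{BMC}}(2N)}{C_{\text{BMC}}(N)} = \frac{c_1 (2N)^2 W}{c_1 N^2 W} = 4,
\qquad
\frac{C_{\text{ind}}(2N)}{C_{\text{ind}}(N)} = \frac{c_2 (2N) W}{c_2 N W} = 2.
$$

Under this idealized model, each doubling of the pipeline depth multiplies the unrolled effort by four but the inductive effort only by two. Measured runtimes also depend on solver behavior, which the model ignores.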
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionNext-generation SoCs demand robust and high-confidence DDR subsystem verification, yet traditional simulation struggles to keep pace. Long-duration DDR behaviors—such as training loops, refresh cycles, retention checks, and hours of sustained traffic—consume excessive simulation time and slow down project velocity. These time-intensive patterns often stall coverage closure, push out debug windows, and delay subsystem readiness.
This work showcases a high-performance verification acceleration strategy using the Siemens Veloce Strato CS hardware emulation platform to remove these bottlenecks. By transitioning existing UVM stimulus into an emulation-optimized transactor environment, the flow executes long-running DDR workloads an order of magnitude faster while maintaining cycle-accurate behavior and full debug correlation. The result is faster iteration, increased test throughput, and earlier discovery of timing-sensitive issues that rarely surface in conventional simulation schedules.
Key highlights include an optimized DDR command/data transactor architecture for emulation, transparent reuse of simulation sequences, and targeted debug techniques purpose-built for hardware-assisted verification. This accelerated DDR flow enables teams to run extensive, hours-to-days traffic profiles within practical timelines—transforming verification productivity and delivering higher confidence in subsystem quality ahead of tape-out.
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionYield Multi-Corner Analysis validates circuits across 125+ Process-Voltage-Temperature corners, creating a combinatorial simulation cost of $O(K \times N)$, where $K$ denotes the number of corners and $N$ exceeds $10^5$ samples per corner. Existing methods face a fundamental trade-off: simple models achieve automation but fail on nonlinear circuits, while advanced AI models capture complex behaviors but require hours of hyperparameter tuning per design iteration, forming the Tuning Barrier.
We break this barrier by replacing engineered priors (i.e., model specifications) with learned priors from a foundation model pre-trained on millions of regression tasks. This model performs in-context learning, instantly adapting to each circuit without tuning or retraining. Its attention mechanism automatically transfers knowledge across corners by identifying shared circuit physics between operating conditions. Combined with an automated feature selector (1152D to 48D), our method matches state-of-the-art accuracy (mean MREs as low as 0.11%) with zero tuning, reducing total validation cost by over $10\times$.
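To make the $O(K \times N)$ cost concrete, take the smallest values quoted in the description, $K = 125$ corners and $N = 10^5$ samples per corner. Treating the cost as exactly $K \times N$ simulations is an illustrative assumption:

$$
K \times N \;\geq\; 125 \times 10^{5} \;=\; 1.25 \times 10^{7} \ \text{simulations}.
$$

Because the abstract states both quantities as lower bounds, the actual brute-force simulation count would be at least this large.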
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionWe present BREW-RC, a non-invasive technique that uses a timing side channel to reverse engineer error-correcting codes (ECCs) in ReRAM crossbar memories. ECCs use a parity submatrix to extend data words with parity bits, forming codewords. These parity bits cannot be read directly, but BREW-RC reveals them by using write latency variability to determine the aggregate polarity of codeword transitions. Timing perturbations caused by ECC parity bit updates are encoded as a SAT instance whose solution yields the bit-exact parity matrix. Knowledge of the parity matrix enables: (1) improved timing models for side-channel attacks/mitigations and latency-sensitive applications, and (2) ECC-aware reliability characterization of devices whose error correction capabilities are unadvertised. We demonstrate BREW-RC on two commercial ReRAM products, recovering the complete ECC from multiple samples of each.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe rapid adoption of chiplet-based architectures and high-speed die-to-die interconnects, driven primarily by the massive data-throughput requirements of AI/ML and by integration complexity, presents significant mixed-signal verification challenges. Traditional Digital Mixed-Signal (DMS) flows lack analog fidelity, while Analog Mixed-Signal (AMS) flows struggle with scalability and manual overhead. To address these fundamental limitations, Analog Port, in close collaboration with Siemens EDA, developed a unified, top-down verification methodology designed to bridge the gap between analog fidelity and digital performance.
This approach, which leverages Symphony Pro, part of the Solido Simulation Suite, extends the conventional UVM-based digital verification framework to the mixed-signal domain, enabling a holistic verification strategy while overcoming tool fragmentation and data integrity issues. It combines high-fidelity SPICE simulation for critical analog blocks with high-performance digital simulation for the broader system.
In this presentation, we will demonstrate how Analog Port achieved up to 26X memory savings and more than 3X performance gain for their RX and TX simulations. We will also explore this novel methodology and its implementation, showcasing how it successfully verified a complex 32 Gbps, 16-transmit/16-receive full-duplex die-to-die interface, significantly enhancing productivity, improving silicon quality, and reducing time-to-market by resolving traditional mixed-signal verification trade-offs.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn advanced verification environments, standard UVM monitors may not offer enough flexibility or abstraction to capture and relay all relevant hardware events to the testbench. To solve this, a standardized mechanism is needed that allows smooth, event-driven communication from hardware interfaces to the UVM environment. This is especially crucial when the testbench must respond to complex hardware events—such as status changes, signal transitions, or protocol-specific conditions—and use these events to drive scoreboards, perform checks, or trigger reference model transactions. By adding proxy objects that serve as intermediaries between the hardware (HDL) and software (UVM) domains, the testbench can respond to hardware events in a modular and reusable way, reducing dependence on direct signal polling and improving maintainability. This approach is generally applicable to any hardware signal that needs to stay synchronized with the UVM environment.
Engineering Special Session
EDA
Quantum
DescriptionAs quantum computing continues to advance, superconducting qubits have emerged as a leading platform due to their scalability and strong compatibility with microwave engineering. At the same time, quantum amplifiers are essential for enabling high-fidelity readout by amplifying fragile quantum signals while minimizing added noise. Designing these systems, however, remains challenging due to tight performance margins, dense integration, cryogenic operation, and complex interactions between electromagnetic, circuit, and quantum effects. In this talk, I discuss how Quantum EDA techniques, tailored to the unique requirements of superconducting quantum circuits, can help bridge physics-driven design and scalable engineering practice. I will highlight how EM and eigenmode analysis, automated quantum parameter extraction, and nonlinear circuit co-simulation can be combined to guide the design and optimization of qubits, readout resonators, and quantum amplifiers within a unified workflow. These approaches illustrate how Quantum EDA can accelerate development toward robust and scalable superconducting quantum systems.
Research Panel
Design
DescriptionQuantum technologies are rapidly transitioning from academic curiosity to practical enablers of next-generation electronic design automation (EDA). Yet the real opportunity lies not only in quantum computing itself, but in the convergence of quantum simulation, materials modeling, AI-native design, and system-level automation. This panel brings together leaders across quantum hardware, software, and industrial design workflows to examine how quantum technologies—paired with frontier AI—will transform the semiconductor and system-design landscape over the next decade.
Today's design challenges have outgrown classical scaling curves. Whether optimizing advanced node devices, discovering novel materials, modeling parasitic effects, or designing 3D-IC systems with extreme multiphysics coupling, classical simulation methods are straining under exponential complexity. Emerging quantum-accelerated approaches offer breakthrough potential: Hamiltonian-based solvers for nanoscale transport; quantum-enhanced simulation for chemical and materials discovery; quantum optimization methods for scheduling, routing, and verification; and hybrid quantum-classical pipelines that integrate seamlessly with existing EDA flows.
At the same time, generative AI and large language models (LLMs) are redefining design productivity. The intersection—Quantum × AI × EDA—is creating a new paradigm where quantum simulation feeds AI-driven design agents, AI discovers optimal materials for quantum and classical devices, and EDA frameworks orchestrate end-to-end system exploration. This compounding loop is catalyzing a shift toward Materials Design Automation (MDA), automated device co-optimization, and rapid multi-physics exploration at atomic accuracy.
This panel will explore four key themes:
1. Quantum Simulation for Materials & Devices: How quantum-accurate modeling (AIMD, MLFF, QMC, variational quantum solvers) unlocks breakthroughs in semiconductors, batteries, photonics, and superconducting components.
2. Quantum Hardware and Software Stack Evolution: Practical roadmaps, error-correction thresholds, algorithmic readiness, and how near-term quantum systems can augment industrial design workflows.
3. AI-Native and LLM-Driven Design Automation: How foundation models, retrieval-augmented simulation, and autonomous design agents interface with quantum tools to accelerate PPA and manufacturability.
4. Industrial Adoption & ROI: What it will take for design teams—fabless, IDM, EDA vendors, and hyperscalers—to integrate quantum capabilities into real production flows.
The session will conclude with a forward-looking discussion on how quantum technologies may reshape the future of design automation—shifting the industry from traditional scaling to physics-accelerated, AI-co-designed innovation. Attendees will leave with a clear understanding of what's real, what's coming, and how to prepare their organizations for the quantum-enabled EDA era.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern SoCs integrate a wide range of IP types (I/Os, SerDes, PLLs, DACs, ADCs, memories, and more), each operating across multiple protocols, voltage domains, and performance modes. While characterization flows accurately capture this mode-dependent electrical behavior (including distinct thresholds, load limits, slew limits, and rail sensitivities), the Liberty (.lib) format remains largely scalar and static.
Key attributes such as max_capacitance, max_transition, and voltage_map can only be expressed as fixed per-pin values, and existing constructs like when or mode_support affect only logical activation, not the underlying electrical state of the IP.
This fundamental limitation forces designers to maintain multiple mode-specific .lib files, which increases configuration overhead and risks silent extrapolation during sign-off when tools operate outside valid characterized ranges.
We propose a State Aware Liberty Modeling approach that introduces explicit constructs for both electrical configuration states and operational modes (protocol-specific electrical constraints). This allows voltage mapping, thresholds, load limits, and slew limits to vary per mode within a single unified library. These enhancements generalize across all IP types, improving silicon-to-model correlation and enabling robust, mode-accurate timing and power sign-off for next-generation SoC architectures.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionProcessing-in-Memory (PIM) alleviates the memory bottleneck of modern AI workloads; however, its limited computational capability often necessitates hybrid architectures that integrate PIM with nearby processing units. In such heterogeneous systems, communication and host-side overheads remain dominant performance bottlenecks. Our profiling of SK hynix's AiMX reveals that CPU–AiMX communication and CPU-side data reordering account for 50.07% of end-to-end LLM prefill and 78.36% of decode execution time. Notably, 54.03% of reordering operations occur adjacent to AiMX-executable nodes, indicating significant untapped potential for near-memory execution to eliminate unnecessary data movement.
We propose a graph compiler that jointly optimizes computation and data movement for AiMX-based acceleration. The compiler performs (1) graph simplification to maximize AiMX-kernel coverage, (2) DMA-assisted tensor reordering and layout redefinition to offload host-side preprocessing, and (3) operator fusion to suppress redundant memory transfers. Integrated into the ONNX Runtime, our approach achieves a 2.00x to 2.73x speedup in the prefill phase and an 18.50x to 38.29x speedup in the decode phase, demonstrating the substantial benefits of graph-level co-optimization for PIM-enabled LLM inference and achieving performance comparable to hand-tuned AiMX execution.
Exhibitor Forum
DescriptionBronco Debug in production is able to tackle multi-day, multi-person debugs in a matter of minutes. This session covers some of Bronco's AI-native EDA tooling that enables this work at full-chip SoC scale.
Exhibitor Forum
DescriptionAn announcement and overview of DVBench — Bronco's to-be-released benchmark that evaluates AI across real DV tasks spanning multiple data modalities and design hierarchies from block-level to full SoC, at production difficulty.
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionGraph Convolutional Networks (GCNs), a foundational technology for relational data applications, are often bottlenecked by irregular memory accesses caused by sparse graph structures. Existing accelerators face two fundamental limitations: their rigid dataflows cannot adapt to varying graph sparsities, and their narrow focus on high-degree nodes (HDNs) leads to systematic neglect of memory traffic from the vast number of low-degree nodes (LDNs).
To address these issues, this paper introduces BSGCN, an accelerator built on a novel band-segmentation strategy. This approach partitions the graph such that the majority of LDNs are grouped into large bands with similar data reuse characteristics, while the remaining nodes form small bands containing only a few LDNs. This segregation enables two co-designed innovations: 1) a Band-Segmented Dataflow (BSD), which applies a tailored traversal strategy to each band type to handle diverse sparsity patterns, and 2) a Band-Segmented Caching (BSC) hierarchy, combining a specialized caching policy to exploit the reuse potential of LDNs with a dedicated pinning mechanism for HDNs. Evaluation results show that BSGCN achieves average speedups of 1.41× (up to 4.80×) over state-of-the-art accelerators, while reducing DRAM traffic by 24.24% and improving energy efficiency by 1.30×.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionBackside power delivery networks (BSPDN) provide superior power integrity while freeing backside routing resources. However, existing works fail to fully exploit these resources for power, performance, and area (PPA) optimization through strategic net allocation. This paper presents a co-optimization framework that maximizes PPA by strategically routing both clock and signal nets on the backside, enabling double-side routing. Experimental results demonstrate 88.1% IR-drop reduction, 23.0% frequency improvement, 10.9% power savings, and timing improvements (66.3% WNS, 40.2% TNS) over FSPDN with negligible nTSV overhead.
People
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis work evaluates the performance and thermal benefits of Backside Power Delivery Networks (BSPDN) compared to traditional Frontside (FSPDN) architectures. Instead of pushing for maximum performance, we propose focusing on power efficiency. Our results demonstrate that BSPDN achieves a 19.8% reduction in power consumption under nominal operating conditions compared to FSPDN. Furthermore, the structural optimization of the power grid significantly enhances thermal dissipation, resulting in a 9.3°C reduction in peak operating temperature. These findings suggest that BSPDN is a critical enabler for next-generation datacenters.
DAC Pavilion Panel
DescriptionArtificial Intelligence is rapidly transforming semiconductor design automation, promising major gains in productivity across the design lifecycle — from system requirements and architectural specifications to verification, layout optimization, and design closure. Yet unlike traditional EDA tools, AI systems rely on proprietary data, evolving models, and complex integrations with mission-critical workflows. This raises a fundamental and provocative question for the industry: would you trust an AI system to make decisions that directly impact tape-out success?
This panel brings together leaders from startups, established EDA vendors, and semiconductor design houses to explore how AI is being applied to both upstream activities such as requirements definition and specification development, as well as downstream design and verification tasks. The discussion will examine trust, explainability, data ownership, and accountability when AI influences critical engineering decisions.
Panelists will debate the strategic build-versus-buy decision for AI-based design automation, including whether AI-generated requirements should be treated as authoritative inputs and how organizations can balance innovation with risk in cost- and schedule-critical silicon programs.
Engineering Presentation
EDA
DescriptionSemiconductor design involves extensive documentation, often spanning hundreds of pages in documents with highly intricate and complex internal relationships, making manual navigation and cross-referencing highly inefficient. To overcome this critical bottleneck, we implemented a solution that significantly reduces manual effort by leveraging an Ontology-based Knowledge Graph (KG). Our core methodology involves defining a formal ontology to systematically model the diverse entities (e.g., modules, registers, protocols) and their complex dependencies across disparate documents (specifications, reports, manuals). This structured conversion process transforms unstructured documentation into a unified, queryable KG, which is the key to easily mapping and understanding the intricate web of connections inherent in the design. This paradigm shift from keyword matching to relationship-aware knowledge representation significantly simplifies knowledge discovery. Furthermore, this meticulously constructed Knowledge Graph serves as the knowledge base for a Graph RAG chatbot, which provides engineers with accurate, context-aware answers to natural language queries.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionThe scaling-up of large language models (LLMs) necessitates computing systems to have multi-processor-chip architectures, elevating the importance of chip-to-chip (C2C) communication. However, designing efficient C2C hardware architectures for LLM workloads faces three key challenges: generating realistic LLM-specific C2C traffic, accurately simulating hardware-level communication at scale, and efficiently exploring the exponentially large C2C design space. We propose C2C-Explorer, an adaptive Bayesian DSE framework that integrates an LLM-workload-driven traffic generator, a scalable interconnect simulator (switch/full-mesh, up to 512 chips), and a metric-guided evaluator into a workload-to-hardware optimization pipeline, enabling systematic C2C architectural co-design under realistic LLM workloads. Validated against FPGA-based C2C prototypes, the C2C simulator achieves 2.46–8.23% end-to-end timing error across diverse traffic patterns. Its hybrid cycle & event model further accelerates large-scale simulation by up to 7.8× over a pure cycle-accurate baseline. Applied to a 32-XPU DeepSeek-R1-671B inference workload, C2C-Explorer identifies configurations that improve goodput by 44.1% and reduce memory by 98.4%. C2C-Explorer is open-source and available at https://anonymous.4open.science/r/C2C-Explorer.
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionModern Non-Uniform Memory Access (NUMA) servers equipped with distributed Non-Volatile Memory Express (NVMe) storage present new challenges for I/O coordination. First, conventional page cache allocators ignore storage topology, resulting in inefficient cross-node cache placement and performance issues. Second, CPU schedulers overlook cache and storage locality, causing frequent cross-node cache accesses and further degrading performance. Third, the OS page cache employs a rigid eviction policy that fails to adapt to diverse application workloads, while even programmable alternatives often require complex manual tuning. To address these challenges, we propose Laelaps, a workload-aware I/O coordination framework for NUMA storage systems. Laelaps introduces three key techniques: (1) storage-topology-aware page cache placement, co-locating cache pages with their backing NVMe devices; (2) storage-topology-aware I/O thread scheduling that aligns thread placement with cache distribution to minimize cross-node accesses; and (3) workload-aware adaptive page caching that automatically selects eviction policies based on observed application access patterns. Evaluation shows that Laelaps achieves 1.41× geometric mean throughput improvement with 3.5% runtime overhead.
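For readers unfamiliar with the headline metric, the geometric mean of per-workload throughput ratios can be computed as in the minimal sketch below. The function name and the sample ratios are hypothetical and are not taken from the Laelaps evaluation.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive per-workload throughput ratios."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty sequence of positive numbers")
    # Averaging in log space avoids overflow when many ratios are multiplied.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical ratios (optimized throughput / baseline throughput), one per workload.
example_ratios = [1.2, 1.5, 2.0]
print(geometric_mean(example_ratios))
```

The geometric mean is the usual choice for summarizing ratios because one workload with a very large speedup does not dominate the summary the way it would in an arithmetic mean.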
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionCache partitioning mechanisms improve system performance by judiciously adjusting cache space among applications. Their effectiveness can be enhanced by partitioning in both time and space. However, existing solutions are limited to only one dimension.
This paper presents Cachence, a fine-grained cache partitioning mechanism in both time and space. Cachence profiles cache access patterns with theoretically guaranteed accuracy while incurring fixed storage overhead. Based on runtime microarchitectural statistics, we also propose a lightweight yet precise performance prediction model. Our evaluation shows that Cachence outperforms existing approaches by an average of 9.5% to 28.2%, and up to 48.8% on a 16-core system.
People
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionDiffusion models, despite their exceptional generative capabilities, are bottlenecked by slow inference owing to their complex architectures and multi-step iterative processes. Although caching is a promising acceleration method, previous approaches suffer from a tight coupling with specific network structures and low computational density. Moreover, caching is challenging to integrate with other methods to further enhance performance due to non-orthogonality, as simple integration often leads only to model performance degradation.
To address these challenges, we propose CacheParallel, a novel training-free scheme that optimizes both caching and parallelism at the intra- and inter-step levels to reduce the computational load and increase the computational density simultaneously. We first introduce the fusion node to abstract the network structures into generic pre-fusion and post-fusion computational streams. For the pre-fusion stream, we apply intra-step caching optimized by the proposed CPS algorithm. For the post-fusion stream, we exploit inter-step output correlations to enable parallel computing and boost computational density. To mitigate accumulated errors and non-orthogonality issues, we design an error correction mechanism based on our observation of the inherent linear correlations of the models.
Comparative experiments show significant performance improvements across various models, with the peak improvement surpassing 100%.
People
Exhibitor Forum
DescriptionThe growing complexity and specialization of modern chips necessitate a fundamental shift in how we approach design and verification. Large Language Models (LLMs) present a timely and transformative opportunity to address key bottlenecks in chip design flows.
This talk will explore how the advanced natural language processing capabilities of LLMs can streamline design and verification processes, from automated verification plan generation to intelligent debug assistance—supported by real-world case studies. We will share insights from fine-tuning LLMs for DV-specific tasks, demonstrating measurable improvements in accuracy of LLMs on real-world examples.
Beyond the opportunities, we will discuss the challenges of deploying generative AI solutions in production chip design teams, including infrastructure constraints, reliability concerns, and building user trust. By integrating generative AI with modern methodologies and software stacks, this talk will outline how LLMs can accelerate chip design cycles and significantly reduce time to market for next-generation chips.
Exhibitor Forum
DescriptionDesign verification has made enormous strides at the chip level, yet a stubborn class of errors continues to plague teams working at the board and system level — voltage mismatches, misconfigured interfaces, overlooked datasheet constraints, and derating violations that standard DRC tools weren't built to catch. These aren't exotic corner cases. They're fundamental checks that experienced engineers know matter but that are tedious to perform manually and easy to miss under schedule pressure. The result: unnecessary respins, late-cycle fire drills, and eroded confidence in design closure.
This session explores how AI-assisted validation can bridge that gap by automatically constructing detailed behavioral models for every component in a design — grounded in actual datasheet specifications, not heuristic rules — and running deterministic checks on interface compatibility, power sequencing, thermal derating, and bus configuration as the schematic evolves. Unlike LLM-based "copilot" approaches, this methodology produces findings that are verifiable, traceable, and citable back to source documentation, while keeping proprietary design IP fully protected.
We'll walk through anonymized case studies from design consultancies, medical device OEMs, and industrial electronics teams where automated validation uncovered previously unknown, fabrication-blocking issues early enough to resolve with a schematic edit instead of a board respin. Attendees will leave with a concrete framework for where AI-assisted verification fits alongside existing EDA workflows — complementing, not replacing, the tools and judgment they already rely on.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionGeneral matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input 𝐴 along the dimension 𝑀 matches the output tile size of 𝐶. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2™ AI Engine (AIE), achieving up to a 4.54× speedup, from 4.8 to 24.6 TFLOPS on mixed-precision BFP16–BF16 GEMM, establishing a new performance record for XDNA2™ AIE.
People
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionApproximate Nearest Neighbor (ANN) search is a foundational primitive for AI applications such as Retrieval-Augmented Generation (RAG). CPU and GPU-based solutions face scalability bottlenecks due to limited local memory, while a multi-tier architecture using SSDs introduces high latency from coarse-grained I/O, mismatched with fine-grained data access patterns inherent to ANN search. We present CANNON (A CXL-Based Near-Memory Processing Architecture for Approximate Nearest Neighbor Search on Real Hardware), a fully offloaded Near-Memory Processing (NMP) architecture implemented on real CXL hardware. CANNON transforms the ANN search pipeline into a fine-grained, deeply pipelined dataflow architecture to maximize throughput, and introduces asynchronous hashing, a speculative execution mechanism that hides hash-check latency to prevent pipeline stalls. Evaluated on large-scale vector datasets, CANNON achieves up to two orders of magnitude performance improvement over state-of-the-art CPU and GPU baselines.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionFP-INT GEMM accelerators for weight-only quantized large language models suffer from large partial sum (psum) overhead and a lack of precision-convertibility, limiting scalability. To mitigate this, we propose CAPA, an accelerator achieving high efficiency and versatility. (1) CAPA introduces Hybrid Delta Block Floating Point (HDBFP), a novel INT-based format that reduces psum overhead, preserves accuracy, and integrates BF16/FP16 arithmetic. (2) CAPA adopts Weight Decomposition to unify INT8 and INT4 arithmetic, enabling parallel processing. We demonstrate that CAPA matches GPU accuracy while improving area efficiency by 3.98x and power efficiency by 4.38x over the FP-FP baseline.
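The abstract does not spell out how Weight Decomposition works. One common way to serve INT8 weights on INT4 hardware, shown here purely as an assumption and not as CAPA's actual scheme, is to split each signed 8-bit weight into a signed high nibble and an unsigned low nibble, so that one INT8 product becomes two 4-bit partial products. The function names are hypothetical.

```python
def split_int8(w):
    """Split a signed 8-bit weight into (signed high nibble, unsigned low nibble)."""
    if not -128 <= w <= 127:
        raise ValueError("w must fit in a signed 8-bit integer")
    hi = w >> 4   # arithmetic shift: signed value in [-8, 7]
    lo = w & 0xF  # masked low bits: unsigned value in [0, 15]
    return hi, lo

def mul_via_nibbles(x, w):
    """Compute x * w from two 4-bit partial products, using w == 16 * hi + lo."""
    hi, lo = split_int8(w)
    return ((x * hi) << 4) + x * lo

# Exhaustive check over every int8 weight for a few activation values.
assert all(mul_via_nibbles(x, w) == x * w
           for x in (-7, 0, 3, 100)
           for w in range(-128, 128))
```

Because the two partial products are independent, they can be issued to two 4-bit multipliers in parallel and recombined with a shift and an add.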
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSeveral companies have been developing quantum computers. Co-Planar Waveguides (CPWs) couple qubits in a quantum computer and are also used to probe the quantum state in qubits. The design of meander CPW devices is an emerging area where modeling the signal conductor's capacitance is important. Due to the complex shape of a meander CPW, a field solver is exclusively used to find the total capacitance of the signal conductor of a CPW device. In this work, we show how to model 3D capacitance in a CPW efficiently. We decompose the space along the path of the signal conductor into multiple regions and thus reduce the complex 3D capacitance modeling problem to a set of smaller 3D capacitance problems. Some of these smaller 3D capacitance problems (where the segment of the signal conductor is straight) are treated as 2D capacitance problems for a fixed cross section, although the shape of the cross section can vary from one cross section to the next. Others (where the segment of the signal conductor is a half-circle, for example) are still 3D capacitance problems. These remaining true 3D capacitance problems are further reduced to equivalent 2D capacitance problems.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionThis paper reveals and exploits a critical security vulnerability: the electromagnetic (EM) side channel of capacitive touchscreens leaks sufficient information to recover fine-grained, continuous handwriting trajectories. We present Touchscreen Electromagnetic Side-channel Leakage Attack (TESLA), a non-contact attack framework that captures EM signals generated during on-screen writing and regresses them into two-dimensional (2D) handwriting trajectories in real time. Extensive evaluations across a variety of commercial off-the-shelf (COTS) smartphones show that TESLA achieves 77% character recognition accuracy and a Jaccard index of 0.74, demonstrating its capability to recover highly recognizable motion trajectories that closely resemble the original handwriting under realistic attack conditions.
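As a reminder of what the reported Jaccard index measures, the sketch below computes it for two sets of touched grid cells. Treating a trajectory as a set of rasterized (x, y) cells is an assumption made for illustration; the abstract does not say how TESLA discretizes trajectories, and the sample data is hypothetical.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)

# Hypothetical rasterized trajectories: grid cells touched by each stroke.
ground_truth = {(0, 0), (1, 1), (2, 2), (3, 3)}
recovered = {(1, 1), (2, 2), (3, 3), (4, 4)}
print(jaccard_index(ground_truth, recovered))  # 3 shared cells / 5 distinct cells = 0.6
```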
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionWe present CapBench, a fully reproducible, multi-PDK dataset for capacitance extraction. The dataset is derived from open-source designs, including single-core CPUs, Systems-on-Chip, and media accelerators. All designs are fully placed and routed using 14 independent OpenROAD flow runs spanning three technology nodes: ASAP7, NanGate45, and Sky130HD. From these layouts, we extract 61,855 3D windows across three size tiers to enable transfer learning and scalability studies. High-fidelity capacitance labels are generated using RWCap, a state-of-the-art random-walk solver, and validated against the industry-standard Raphael, achieving a mean absolute error of 0.64% for total capacitance. Each window is pre-processed into density maps, graph representations, and point clouds. We evaluate 10 machine learning architectures that illustrate dataset usage and serve as baselines, including convolutional neural networks (CNNs), point cloud transformers, and graph neural networks (GNNs). CNNs demonstrate the lowest errors (1.75%), while GNNs are up to 41.4 times faster but exhibit larger errors (10.2%), illustrating a clear accuracy–speed trade-off.
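The error figures above are percentages. A minimal sketch of one plausible reading, the mean of per-sample absolute errors relative to the reference value, is given below. Whether CapBench defines its metric exactly this way is an assumption, and the function name and sample values are hypothetical.

```python
def mean_abs_pct_error(predicted, reference):
    """Mean of |predicted - reference| / |reference|, expressed in percent."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Note: a reference value of exactly zero would raise ZeroDivisionError.
    total = sum(abs(p - r) / abs(r) for p, r in zip(predicted, reference))
    return 100.0 * total / len(reference)

# Hypothetical capacitances in arbitrary units: model output vs. solver label.
model_out = [1.01, 1.98]
solver_ref = [1.00, 2.00]
print(mean_abs_pct_error(model_out, solver_ref))
```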
Engineering Special Session
EDA
Quantum
DescriptionAs we drive toward building the Quantum EDA (QEDA) stack, a familiar enemy stands in our way. To fully model a quantum chip, whether accurately capturing the physics of an individual qubit or solving the compact model of an entire processor, one must solve the quantum many-body problem. This is the "exponential wall" that pushes us to build a QPU in the first place. In this talk, I will lay the groundwork and thesis for a comprehensive QEDA stack. I will then discuss how Kothar's classical solvers pave a path to modeling hundreds of qubits without sacrificing the "quantumness" of the chip's components. Finally, we will establish a critical roadmap: the transition toward "using quantum computers to design quantum computers."
People
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionMicrocontroller Units (MCUs) are widely used in safety-critical systems, making them frequent attack targets. This demands lightweight defenses that remain reliable even after software compromise. Control Flow Auditing (CF-Aud) strengthens Control Flow Attestation by ensuring authenticated control-flow logs (CFLogs) are reliably delivered from a compromised prover (Prv) to a remote verifier (Vrf), enabling assessment of system behavior and support for remediation. However, existing CF-Aud designs rely on a costly busy-wait phase that limits Prv's utilization. In this work, we propose CARAMEL: a hybrid hardware-software root-of-trust architecture that reduces this bottleneck by enabling log transmission without halting execution. Its minimal communication interface and implementation are open-source.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionCoverage-Directed Generation (CDG) is essential for driving verification toward hard-to-hit coverage events, but existing DNN-augmented CDG flows have largely remained at proof-of-concept scale with strong GPU dependence. This work presents a production-grade, design-agnostic DNN-assisted CDG framework deployed on next-generation IBM POWER and Z processor units, scaling from 4 parameters and 320 events to 1,800 parameters and 36,000 events while remaining practical for industrial flows. By exploiting latent clustering in coverage events, hidden-layer width is sized by correlation structure (∼100 clusters) instead of raw event count, delivering a 508× memory reduction (30 GB to 60 MB) and 3.75× CPU training speedup (7.5 h to 2 h), enabling CPU-only deployment without accuracy loss. On a complex production unit, the DNN-guided CDG hits 59 of 74 RIT-critical no-hit events in 10 days—including 4 events missed for 272 days by state-of-the-art regressions—while requiring one template to match and surpass five baseline CDG templates. The same configuration generalizes across units without hyperparameter retuning, demonstrating scalable, transferable CDG acceleration suitable for real tapeout schedules.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionAccurate frequency measurement in high-speed data streams is crucial. While FPGA-based solutions are promising, existing designs often trade off throughput against accuracy due to limited hardware resources. This paper introduces CelestialEye, a CPU-FPGA synergistic architecture that employs a CPU-based heuristic decoding algorithm for high accuracy and offloads high-throughput tasks to the FPGA. Its pipelined hardware integrates hot/cold item separation, advanced compression, and encoding to optimize on-chip SRAM efficiency and bandwidth utilization. Furthermore, the frequency bounds generated by the FPGA process can accelerate the CPU's decoding. Experimental results demonstrate that CelestialEye significantly outperforms the state-of-the-art BitMatcher in both performance and accuracy.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionStandard-cell-based design has long been the dominant approach in VLSI design, providing scalability and compatibility with established EDA tools. However, as design technology co-optimization (DTCO) becomes increasingly important at advanced process nodes, the fixed structure of standard cells limits the ability to optimize wirelength and area effectively, imposing a critical bottleneck at the Middle-of-Line (MOL) layers. In this context, dense pin access and M0 routing resources, critical for alleviating congestion on higher metal layers, are rigidly constrained by the cell boundary. A more flexible approach is to directly place transistors on the design canvas and perform routing at the transistor level, enabling physical-level optimization beyond the logic-level constraints imposed by standard-cell design. This work introduces a novel framework featuring cell-structure-independent transistor-level placement coupled with an MOL-aware routing engine. Experimental results demonstrate that the proposed framework achieves considerable improvements in wirelength and design area compared to the commercial standard-cell-based design tool and state-of-the-art transistor-level work.
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionAutomated standard cell library extension is crucial for maximizing Quality of Results (QoR) in modern VLSI design. We introduce CellE, a novel framework that leverages formal methods to achieve exhaustive discovery of functionally equivalent subcircuits. CellE applies equality saturation to the post-mapping netlist, generating an e-graph to cluster all functionally equivalent implementations. This canonical representation enables an efficient pattern mining algorithm to select the most area-optimal standard cells. Experimental results show a 15.41% average area reduction (up to 23.64% over prior work). Furthermore, characterization in a commercial flow demonstrates an 8.00% average delay reduction, confirming CellE's superior QoR optimization capabilities.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionWe present a scalable verification strategy for a highly concurrent, partitionable Central DMA (CDMA) IP that supports inter‑execution‑environment transfers, multicast distribution, and native PCIe integration/exposure. The verification challenge spans deep AXI ID usage, unbounded read‑data interleaving, QoS and route‑by‑port attributes, ECC/parity reactions, and heterogeneous bus interfaces (AXI/APB/ACE‑Lite), compounded by system scale (dozens of station interfaces, MSI‑X vectors, and configurable engines/rings). Our solution is a layered, channel‑aware UVM scoreboard that cleanly separates transfer‑intent binding from protocol and data checking. A transactor decodes descriptors to publish transfer intent; the scoreboard binds read IDs (via the first read address of the source), captures source data, then binds write IDs (via the first write address of the destination) to perform beat‑accurate data comparisons. Completed transfers are sent to specialized scoreboards (AXI attribute/protocol/bandwidth, writeback, and MSI‑X coalescing/ordering) with a unified completion model. Results show >50% reduction in average debug time, faster coverage closure (especially DMA cross bins), simpler scenario extensibility through reusable checkers, and earlier detection of critical issues; to date, no silicon issues have been observed. This layered approach improves observability, triage clarity, and reuse, accelerating time‑to‑market while lifting verification quality for CDMA variants.
People
Engineering Special Session
AI
Chiplet
EDA
Systems
DescriptionThis talk will explore the key challenges facing next-generation emulation and prototyping, including scalability, integration with heterogeneous systems, debugging capabilities, and support for emerging application domains such as AI and automotive. The session will then present cutting-edge solutions and best practices that address these hurdles, such as hybrid emulation approaches, cloud-based prototyping, improved hardware/software co-verification techniques, and advanced instrumentation for real-time analysis. Attendees will gain insights into recent technological advancements, practical deployment strategies, and future trends shaping the landscape of emulation and prototyping. This talk is intended for engineers, researchers, and decision-makers seeking to optimize verification flows and accelerate innovation in next-generation systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPower delivery challenges in advanced System-on-Chip (SoC) designs are increasingly critical due to stringent performance, area, and reliability requirements. Traditional approaches to channel width optimization are performed late in the design cycle, often relying on post-RTL Engineering Change Orders (ECOs). These late-stage modifications are complex, error-prone, and risk project delays, while narrow or sub-optimal interconnects exacerbate IR drop and electromigration (EM) issues, degrading power integrity and yield.
We propose an early-stage methodology that shifts EM/IR analysis and channel width optimization to the early design phase, leveraging preliminary floorplans, power grid data, and pseudo currents derived from historical designs. This proactive approach enables identification of voltage drop and current density hotspots before RTL availability, allowing designers to balance IR drop mitigation against routing resource efficiency. Results from initial implementations demonstrate strong correlation with post-RTL simulations, validating the accuracy of early predictions.
By addressing EM/IR risks upfront, this methodology reduces reliance on late ECOs, improves design robustness, and accelerates turnaround time. Future integration with automated in-design IR-fixing techniques promises further efficiency gains, ensuring reliable power delivery in next-generation SoCs.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionFunctional verification accounts for over 50% of the IC development lifecycle, making SystemVerilog Assertions (SVAs) indispensable for rigorous digital chip verification. However, manual SVA authoring is labor-intensive and error prone. To address these challenges, we introduce ChatSVA, an end-to-end SVA generation system built upon a multi-agent framework. The AgentBridge platform facilitates systematic data generation, augmentation, and validation, decomposing complex verification processes into modular, verifiable subtasks. ChatSVA achieves syntax and function pass rates of 98.66% and 96.12%, averaging 139.5 SVAs per design with 82.50% function coverage. A ChatSVA web service is publicly available.
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionFully homomorphic encryption (FHE) allows direct computation on encrypted data and thus enables privacy-preserving computation in the cloud. Among FHE schemes, CKKS efficiently supports arithmetic computation, while TFHE excels in logic operations. Recent studies have attempted to develop unified FHE accelerators that support diverse computational tasks and leverage the complementary strengths of the two schemes. However, existing unified FHE accelerators are predominantly CKKS-centric. At the operator layer, TFHE is mapped onto CKKS-oriented operators, causing significant performance degradation. Furthermore, the invocation to functional units and memory is entirely dominated by CKKS, resulting in hardware underutilization for TFHE.
In this paper, we propose three coordinated co-optimizations to address these challenges. First, we unify the computation of Fast Fourier Transform (TFHE core operation) and Number Theoretic Transform (CKKS core operation), and design a dual-mode computational flow to efficiently support both computations. Second, we propose an aggressive strategy of bootstrapping key unrolling to maximize TFHE hardware utilization. Third, we design a hardware-oriented support for FFT Shrinking KeySwitch to mitigate the performance bottleneck of TFHE KeySwitch. Accordingly, we present Chimera, a unified accelerator that redesigns the functional units and memory organization for TFHE to efficiently support these optimizations. Lastly, we develop an error, memory, and bandwidth-constrained auto-tuning framework that derives optimal TFHE configurations to maximize TFHE hardware utilization. Compared with the state-of-the-art unified FHE designs, Chimera achieves an average performance improvement by 14.84x across TFHE workloads, while maintaining comparable performance on CKKS workloads, with only a 9.6% area increase.
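To illustrate why FFT and NTT lend themselves to a shared dual-mode flow, the sketch below runs one radix-2 butterfly routine over two different arithmetic back ends: complex numbers for the FFT and integers modulo a prime for the NTT. This is a textbook illustration only, not Chimera's datapath; the function names and the toy parameters (modulus 17 with 2 as a primitive 8th root of unity) are chosen here for demonstration.

```python
import cmath


def transform(values, root, one, add, sub, mul):
    """Radix-2 decimation-in-time transform over any ring.

    `root` must be a primitive n-th root of unity in that ring,
    where n = len(values) is a power of two.
    """
    n = len(values)
    if n == 1:
        return list(values)
    root_sq = mul(root, root)
    even = transform(values[0::2], root_sq, one, add, sub, mul)
    odd = transform(values[1::2], root_sq, one, add, sub, mul)
    out = [one] * n
    twiddle = one
    for k in range(n // 2):
        t = mul(twiddle, odd[k])
        out[k] = add(even[k], t)           # shared butterfly: a + w*b
        out[k + n // 2] = sub(even[k], t)  # shared butterfly: a - w*b
        twiddle = mul(twiddle, root)
    return out


def fft(values):
    """FFT mode: complex arithmetic, root = exp(-2*pi*i/n)."""
    n = len(values)
    root = cmath.exp(-2j * cmath.pi / n)
    return transform(values, root, 1 + 0j,
                     lambda a, b: a + b,
                     lambda a, b: a - b,
                     lambda a, b: a * b)


def ntt(values, modulus, root):
    """NTT mode: modular arithmetic, root is a primitive n-th root mod modulus."""
    return transform(values, root, 1,
                     lambda a, b: (a + b) % modulus,
                     lambda a, b: (a - b) % modulus,
                     lambda a, b: (a * b) % modulus)
```

For example, `ntt([1, 2, 3, 4, 0, 0, 0, 0], 17, 2)` and `fft([1, 2, 3, 4, 0, 0, 0, 0])` traverse the same butterfly schedule; only the add, subtract, and multiply primitives differ.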
People
Exhibitor Forum
DescriptionTiming closure remains one of the most iteration-heavy bottlenecks in chip design, spanning multiple teams and abstraction levels from RTL through signoff. Each iteration requires running tools that take hours to days, making the feedback loop between identifying a timing violation and validating a fix prohibitively slow. Extracting actionable signal from massive timing reports, mapping timing violations back to RTL, and choosing among fixes with complex PPA tradeoffs demand deep expertise and remain largely manual. We explore how AI agents can be applied to this problem: what domain-specific capabilities they need beyond general code generation and how an agent-driven "shift-left" strategy can reduce iteration count and accelerate convergence. We present ChipAgents' approach to timing closure along with the lessons we've learned along the way.
Exhibitor Forum
DescriptionOver half of frontend ASIC engineering time is spent on debugging and root cause analysis, navigating millions of lines of HDL and terabytes of waveform data. Despite this cost, hardware debugging remains almost entirely manual. This presentation introduces ChipAgents RCA, the first autonomous, agentic AI system for end-to-end ASIC root cause analysis using both code and waveform data at commercial scale.
ChipAgents RCA treats debugging as a structured search problem. The system combines three core innovations: (1) a waveform understanding engine purpose-built for AI agents that enables symbolic, query-based reasoning over massive waveform databases; (2) a novel multi-agent prover-verifier architecture that explores debugging hypotheses in parallel while enforcing skepticism and verification; and (3) a self-consistency ranking layer that calibrates confidence and surfaces the most reliable explanations to engineers.
The system has been evaluated across a diverse dataset of commercial-scale IPs, including bus fabrics, RISC-V cores, and complex protocols such as PCIe and DDR, spanning bug classes like backpressure, data corruption, protocol violations, and clock-domain issues. ChipAgents RCA achieves over 3x higher pass-at-one accuracy than state-of-the-art generic AI agents. In a representative PCIe 3.0 case study (36k lines of code, multi-level indirection), ChipAgents RCA isolated the exact root cause and patch in 10 minutes, compared to 4–8 hours of projected human effort, representing a 12x speedup.
This talk will present the system architecture, evaluation results, and lessons learned deploying autonomous debugging agents in real verification flows, highlighting how agentic AI can fundamentally change how hardware teams approach debug and verification closure.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionWhile chiplet-based 2.5D integration offers scalable high-performance computing with reduced manufacturing cost, it also poses significant challenges in design closure. Achieving timing closure in chiplet-based systems is notoriously difficult and frequently demands repeated engineering change orders (ECOs) and architectural-level iterations. To overcome these challenges, we propose ChiPlanner, a holistic early-stage design planner that integrates chiplet partitioning and physical planning. ChiPlanner first employs approximate placement to estimate post-chiplet-placement timing in advance of partitioning, which provides guidance for partitioning and helps reduce timing gaps. It further models inter-chiplet delay and incorporates a parallel net-weighting strategy in timing-driven physical planning to optimize overall system performance. Moreover, Bayesian optimization is leveraged to systematically explore the trade-off between chiplet manufacturing cost and system timing. Experimental results demonstrate that incorporating ChiPlanner's early-stage planning enables downstream chiplet design tools to achieve notably superior optimization outcomes, delivering average improvements of 42.4% in total negative slack (TNS) and 16.8% in worst negative slack (WNS).
These results confirm that accurate early-stage planning provides a far better initialization for chiplet design and significantly reduces the burden on later ECOs and architectural-level iterations.
People
Work in Progress
DescriptionAs automotive compute platforms evolve toward chiplet-based architectures, the increasing heterogeneity introduces new challenges in exploring the architecture design space as a function of metrics such as performance, power, area, and cost. In this pursuit, traditional high-fidelity methodologies, using detailed virtual prototypes, can be cumbersome to build and computationally prohibitive for early-stage design exploration. In contrast, analytical techniques such as roofline modeling offer rapid insights but depend on overly idealized assumptions, such as perfect computation–communication overlap and sustained peak bandwidth, that do not hold in heterogeneous chiplet systems.
To address these limitations, we introduce ChipLite, a hybrid modeling framework that extends the hierarchical roofline model with workload-aware task mapping, realistic memory-access distributions, and inter-/intra-chiplet fabric flow modeling. Using ChipLite, we analyze three distinct chiplet architectures as example case studies and demonstrate how workload completion times are shaped by inter- and intra-chiplet congestion. Compared to a naïve roofline model, for these case studies, ChipLite achieves up to 6x lower error in predicting execution time, while retaining orders-of-magnitude faster turnaround than detailed simulation.
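The two idealizations named in the abstract can be made concrete with a small sketch. The first function is the standard roofline time estimate, which assumes perfect overlap and peak bandwidth. The second relaxes both with two illustrative knobs, `bw_efficiency` and `overlap`; these parameters and the blending formula are assumptions made here for illustration and are not ChipLite's model.

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Naive roofline: compute and data movement overlap perfectly,
    and memory always delivers its peak bandwidth."""
    return max(flops / peak_flops, bytes_moved / peak_bw)


def derated_time(flops, bytes_moved, peak_flops, peak_bw,
                 bw_efficiency=0.5, overlap=0.5):
    """Illustrative relaxation (hypothetical parameters):
    bw_efficiency -- fraction of peak bandwidth actually sustained (0..1]
    overlap       -- fraction of the shorter phase hidden under the longer one
    """
    t_compute = flops / peak_flops
    t_memory = bytes_moved / (peak_bw * bw_efficiency)
    hidden = overlap * min(t_compute, t_memory)
    return t_compute + t_memory - hidden


# Example workload: 2 GFLOP and 1 GB of traffic on a 1 TFLOP/s, 100 GB/s part.
ideal = roofline_time(2e9, 1e9, 1e12, 1e11)
realistic = derated_time(2e9, 1e9, 1e12, 1e11)
```

With `bw_efficiency=1.0` and `overlap=1.0` the second function collapses to the naive roofline, which makes the size of the idealization easy to see: any lower setting can only lengthen the predicted time.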
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionAs integrated circuit designs grow in complexity, reference model development for functional verification faces increasing challenges. We propose ChipModeler, an LLM-assisted platform that streamlines reference model generation and verification through design standardization and hierarchical agile modeling. By employing a building-block generation strategy, ChipModeler significantly enhances both efficiency and quality. Evaluation on 300 diverse designs shows up to 58.99% improvement in performance, a 9.18× increase in generation capacity, and a 7.11× acceleration in design and validation cycles compared to manual methods, demonstrating ChipModeler's effectiveness in automating reference model development.
People
Workshop
DescriptionThe ChipsHub is a new community infrastructure to broadly support IC chip design and semiconductor workforce development. The objective of the Faculty Fellows program is the creation of appropriate and sharable educational and tool content to enable a new pool of universities to teach chip design curricula to previously-unserved populations of learners. To this end, the Faculty Fellows are establishing a library of end-to-end design workflows and newly developed classes for post-fab chip testing. In this workshop, the ChipsHub Faculty Fellows will present their work in developing and disseminating pedagogical materials including lectures and tutorials for chip design workflows. The discussion will include how agile open source and industrial tool flows and PDKs will carry working code forward in a robust, maintainable, and scalable fashion.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionHigh-Level Synthesis (HLS) lowers the barrier to FPGA development by allowing a wider range of programmers to design hardware accelerators. However, determining the appropriate synthesis directives (pragmas) remains a major challenge, particularly for developers without hardware expertise. As designs grow more complex and the pragma search space expands, choosing the right pragmas becomes essential for achieving low resource usage and high performance. Automated design space exploration (DSE) provides an effective solution to this challenge. The enormous search space and time-consuming design-point evaluation highlight the need for efficient search strategies. However, existing search strategies mostly rely on inefficient exhaustive search, hyperparameter-sensitive metaheuristic methods, or dedicated methods that are difficult to port and generalize.
To address these issues, we propose a rule-mining-based search strategy that efficiently guides exploration toward the most promising regions of the design space. In addition, we introduce a design space decomposition method to prune the search space, as well as a CodeLLM-based design-point evaluation method, which is both faster than directly invoking HLS tools, and more accurate than prior GNN-based approaches. Experimental results on four widely-used HLS benchmarks demonstrate that, under the same time budget, our DSE framework achieves better Quality of Results (QoR) than the state of the art. Our demo code is released at https://github.com/ScopeHLS/CHiRM-DSE
People
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionCo-designing chiplet-based neural-network accelerators spans partitioning, placement, dataflow, and microarchitecture under tight energy and latency limits. Fast system-level estimators scale to large searches but blur intra-core effects that dominate energy; fine-grain reference models capture them but are too slow in-loop. We close this speed–fidelity gap with a coarse-to-fine flow: a compact, architecture-aware surrogate of intra-core cost guides the global search, and only top designs are rechecked with the reference model. Our contributions are an energy-focused feature space and a lightweight predictor across convolutional and GEMM-like workloads. On ResNet-50 and a Transformer, we reduce a manufacturing-weighted energy–delay objective by up to 62% and 44%, respectively.
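The coarse-to-fine pattern described above can be sketched generically: rank every candidate with a cheap surrogate, then spend the expensive reference model only on a shortlist. The function and the toy cost models below are hypothetical stand-ins, not the authors' surrogate or reference model.

```python
def coarse_to_fine(candidates, surrogate_cost, reference_cost, top_k):
    """Rank all candidates with the fast surrogate, then re-evaluate
    only the best `top_k` with the slow reference model."""
    shortlist = sorted(candidates, key=surrogate_cost)[:top_k]
    return min(shortlist, key=reference_cost)


# Toy stand-ins: the surrogate is slightly biased relative to the reference.
reference_calls = []


def toy_surrogate(x):
    return abs(x - 10)


def toy_reference(x):
    reference_calls.append(x)  # track how often the slow model runs
    return (x - 12) ** 2


best = coarse_to_fine(range(21), toy_surrogate, toy_reference, top_k=5)
```

In this toy case the surrogate's optimum (10) differs from the reference optimum (12), but because 12 lands in the shortlist the recheck still recovers it while the reference model runs only five times instead of twenty-one. The approach relies on the surrogate being accurate enough to keep the true optimum inside the shortlist.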
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionConfidential environments such as Trusted Execution Environments (TEEs) are increasingly used to protect the data and models of machine learning applications from adversarial attacks. Frameworks like TEE-protected Neural Networks (NN) have been deployed for secure cloud-based inference. However, recent studies have revealed a class of ciphertext side-channel attacks, known as CipherSteal attacks, that exploit the encryption process within TEEs. Specifically, when data is transferred into a TEE using encryption schemes such as AES, the resulting ciphertext patterns can be observed to infer input data.
In this paper, we propose a series of methods named CipherShield to mitigate the ciphertext side-channel leakage and protect sensitive input data. It is a set of lightweight transformations that diffuse and decorrelate ciphertext without modifying the cryptography scheme. Our defenses, including block-based encryption, sparsity, and quantization, disrupt the per-address mapping patterns that ciphertext side-channels rely on to detect collisions while maintaining TEE compatibility and high throughput. The ciphertext hit rate, which reflects the amount of information leaked through ciphertext traces, drops from 50–80% to nearly zero across all evaluated datasets. On MNIST, CipherShield reduces classification accuracy from about 68% to as low as 2%. Evaluations on the Chest X-ray and CelebA datasets show similar reductions. Also, prototyping the block-based encryption in a TEE environment achieves markedly lower runtime than the original trace-based method, reducing encryption time by over 20x.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionIn integrated circuit (IC) workflows, EDA tools provide accurate circuit metrics by simulating and analyzing physical characteristics, but their computational overhead makes them impractical for rapid design iterations. Thus, accurate and efficient circuit metric prediction at the register-transfer-level (RTL) stage has become a hot topic. However, current methods still struggle to efficiently bridge the gap between RTL and physical characteristics. We propose CircuitDiff, a generative pre-training framework designed to learn a unified representation space between RTL and netlist. This is achieved by encoding the RTL graph as a condition for training a netlist graph denoising diffusion model. During fine-tuning, learnable queries are used to incentivize knowledge from the pre-trained model via cross-attention mechanisms. Experimental results show that CircuitDiff achieves superior performance in early-stage circuit metric prediction compared to state-of-the-art models, supporting the "left-shift" paradigm while maintaining computational efficiency. Our code and data will be publicly available at https://github.com/CatIIIIIIII/CircuitDiff.
Exhibitor Forum
AI
EDA
Systems
DescriptionRicursive Intelligence is a frontier AI lab focused on building self-improving systems, starting with chip design. We are reinventing chip development and closing the loop between AI and the hardware that fuels it, recursively accelerating the path to artificial superintelligence. We are the team behind AlphaChip (Nature 2021), RL-CCD (DAC Best Paper 2023), Insta (DAC Best Paper 2025), C3PO (ASP-DAC Best Paper 2026), with hands-on experience developing Gemini, Claude, and TPUs. Our team members come from Google DeepMind, Anthropic, NVIDIA, Cadence, Apple, Stanford, MIT, Harvard, and other top institutions, and we are backed by $335M from Sequoia, Lightspeed, DST, and NVentures. In this talk, we will discuss our technical progress and long-term vision for the future of AI and chip design.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionEstimating the true hardware cost of quantum machine learning (QML) models is challenging due to repeated circuit evaluations affected by noise, decoherence, and routing delays. Conventional metrics like gate count overlook such hardware-dependent effects. We propose an analytical quantum cost model that estimates required quantum hardware resources using real device calibration data, incorporating gate durations, routing overheads, and noise-induced inefficiencies. Complementing this, a classical cost model converts FLOPs into equivalent units, providing a unified hardware-aware hybrid cost metric. Integrating both, we propose the Hyb-HANAS framework, which employs multi-objective NAS (NSGA-II) to jointly optimize accuracy, execution time, and parameter count in hybrid quantum–classical networks.
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionDRAM scaling increases vulnerability to peripheral faults, making symbol-level ECCs essential. LPDDR6 introduces a 12-bit data beat that misaligns with Reed-Solomon codes that use 8- or 16-bit symbols. An 8-bit RS code meets on-die parity budgets but cannot guarantee correction. A 12-bit RS code guarantees correction but exceeds the budget. CLUE-ECC pools the 16-bit on-die and the system ECC parity for proper correction with less parity. Evaluation shows superior error handling with 33.3% on-die parity storage, 14.83% area, and 24.18% power savings with the same bandwidth impact compared to a conventional RS code.
Research Special Session
Systems
DescriptionArtificial Intelligence (AI) at the edge must address conflicting performance and efficiency needs. One major challenge is the cost of data movement between processors and memories. Our proposed Near Memory Computing (NMC) architecture tackles this by providing memory arrays with lightweight arithmetic units, tailored to AI algorithms' needs for near sensor processing. We optimize holistically a) the design of NMC hardware, b) its integration within Systems-on-Chip (SoC) components, and c) edge AI applications. Our solution uses data parallelism, inherent in Deep Neural Network (DNN) models, to distribute computations across compute memory banks. It also employs aggressive quantization to reduce weight data size and lower computational demands. Additionally, it integrates seamlessly into standard SoC architectures, enabling end-to-end DNN inference near memory. Performance improvements of up to 250x are achieved compared to software execution, with only about 11% area overhead relative to similar non-compute memories.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionStochastic computing offers efficient approximate arithmetic that aligns well with error-tolerant machine learning workloads, but its deployment is limited by long bitstream latency in stochastic multiply-accumulate (MAC) units. Prior work reduces MAC latency through deterministic bitstream generation and differential accumulation, but these methods do not fully exploit the statistical property of convolution weights. This work presents a novel stochastic MAC architecture named CO-MAC, which employs center-out weight ordering and an enhanced convolution engine design to reduce effective computation cycles while maintaining high accuracy. The method sorts weights by magnitude, reuses the incremental differences in magnitudes, and applies sign handling after accumulation. This shortens counter activity, maintains accuracy with long effective bitstreams, and simplifies the MAC hardware by avoiding bidirectional counters. Across convolutional neural network workloads, CO-MAC decreases MAC latency by up to 54.8% compared to prior stochastic MAC architectures, while preserving accuracy and hardware simplicity.
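The arithmetic idea of sorting weights by magnitude, reusing incremental differences, and deferring the sign can be shown with exact integers. The sketch below is only the underlying summation identity, written here for illustration with hypothetical names; it does not model bitstreams, counters, or the CO-MAC hardware.

```python
def mac_by_increments(weights, inputs):
    """Compute sum(w * x) by walking weights in order of increasing
    magnitude and accumulating only the magnitude increments."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    # Inputs still "active", split by the sign of their weight.
    pos_active = sum(inputs[i] for i in order if weights[i] >= 0)
    neg_active = sum(inputs[i] for i in order if weights[i] < 0)
    pos_acc = 0
    neg_acc = 0
    prev_mag = 0
    for i in order:
        delta = abs(weights[i]) - prev_mag  # incremental difference
        pos_acc += delta * pos_active       # both accumulators only count up
        neg_acc += delta * neg_active
        prev_mag = abs(weights[i])
        # Weight i has now received its full magnitude; retire its input.
        if weights[i] >= 0:
            pos_active -= inputs[i]
        else:
            neg_active -= inputs[i]
    return pos_acc - neg_acc  # sign applied once, after accumulation


result = mac_by_increments([3, -1, 2, -4], [5, 7, 1, 2])  # equals 3*5 - 1*7 + 2*1 - 4*2
```

Each input is multiplied by the sum of all increments up to its own weight's magnitude, which is exactly that magnitude, so the result matches the direct dot product. Because the two accumulators only ever increase, a single subtraction at the end replaces any up/down counting.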
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionDeep neural networks (DNNs) on Neural Processing Units (NPUs) require carefully optimized operator mappings to achieve high performance, yet the mapping space grows rapidly with increasingly complex on-chip memory hierarchies. Existing dataflow models rely on a computation–data coupled paradigm that forces all tensors to share identical storage structures and data transfer paths, severely limiting the expressible mapping space. We present CODA, a computation–data decoupled dataflow paradigm that enables tensor-wise independent modeling of on-chip storage and movement. CODA introduces the non-uniform loop space to jointly represent computation and per-tensor data mappings, together with an analytical performance model and a simulated-annealing–based optimizer. Across single-operator and fused-operator workloads, CODA achieves 1.10x–1.11x and 1.14x–1.85x speedups over state-of-the-art methods, respectively.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionThe partitioned microarchitecture of Streaming Multiprocessor (SM) is a fundamental design that enables massive parallelism in modern General-Purpose Graphics Processing Units (GPGPUs). This approach, however, inherently introduces two critical inefficiencies: vast data redundancy across distributed private registers and severe conflicts and load imbalance in the limited banks. To holistically address these cross-partition bottlenecks, this paper introduces CODA, a cooperative Register File (RF) distribution and arbitration framework. CODA is composed of two synergistic mechanisms: the Cooperative Register Renaming (CRR), which eliminates data redundancy by maintaining a single physical copy of shared data across sub-cores, and the Dual-Skewed Arbitrator (DSA) that mitigates fine-grained bank conflicts by incorporating the operand collector ID into its arbitration logic. Evaluations on a diverse suite of deep learning workloads show that CODA exhibits superior performance compared to its baseline architecture, achieving a speedup of 28.9% alongside a 15.6% reduction in RF power consumption. These gains are directly attributed to a 15.9% reduction in RF load imbalance and a 12.7% lower bank conflict rate.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionElectromigration (EM) and resistance constraints are traditionally verified at sign-off, making routing electrically blind and leading to costly post-layout iterations. This paper introduces a simulation-driven routing (SDR) methodology that integrates electrical intent into the layout process, enabling EM-correct-by-construction routing. The proposed approach leverages average current datasets from circuit simulation or design intent (DI) set in the schematic database to dynamically guide routing decisions. For optimal routing, the automation implements customizable trunk stacking across multiple metal layers, optimized trunk widths for area efficiency, and uniform via distribution to minimize resistance. SDR supports configurable stacking strategies (Auto, Trunk, and Custom), allowing full user control through SKILL-based modifications for scalability across nodes. Visualization of current distribution further enhances topology optimization. Implemented on a 28nm analog design, the methodology achieves routing quality equivalent to the golden reference while reducing EM sign-off iterations and manual effort. Results demonstrate 30% faster turnaround, zero area wastage, and improved reliability without compromising design integrity. This framework addresses critical pain points in conventional flows and provides a scalable solution for high-current nets in both mature and advanced technologies.
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionFlash-based SSDs are prone to bit errors in flash cells. To ensure reliability, SSDs employ error correction codes (ECCs) that encode user data bits into codewords composed of data and redundant parity bits, enabling correction of a limited number of bit errors. In practice, however, the raw bit error rate (RBER) of SSDs fluctuates over time. We observe that for ECCs with a fixed code rate (the fraction of user data bits per codeword), longer codewords provide higher reliability, while shorter codewords offer lower read latency. To this end, we propose COLA, an adaptive coding framework that optimizes the code length (i.e., total number of user data and parity bits per codeword) for individual SSD pages to achieve low-latency reads while maintaining reliability guarantees and constant storage overhead. COLA adopts a failure-aware read mechanism that selectively transfers and decodes failed codewords, and integrates a failure-aware read-latency model to determine actual read operations based on current RBERs and select the optimal code length for each write operation. Evaluation using the MQSim simulation shows that COLA significantly reduces average and tail read latencies compared to the default fixed-code-length approach.
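The relationship between code rate, code length, and storage overhead described above can be illustrated with a little arithmetic. The sketch below is not COLA's algorithm; it only shows that, at a fixed code rate, changing the code length changes the number of codewords per page while the total parity stays constant. The 16 KiB page size, the 8/9 code rate, and the function name `codeword_layout` are illustrative assumptions, not values from the abstract.

```python
from fractions import Fraction


def codeword_layout(page_data_bits, code_length_bits, code_rate):
    """Split a page's user data into fixed-rate codewords.

    Returns (codewords_per_page, data_bits_per_codeword, parity_bits_per_codeword).
    code_rate is the fraction of user data bits per codeword.
    """
    data_bits = code_length_bits * Fraction(code_rate)
    if data_bits.denominator != 1:
        raise ValueError("code length and code rate must give whole data bits")
    data_bits = int(data_bits)
    if page_data_bits % data_bits != 0:
        raise ValueError("page data must divide evenly into codewords")
    parity_bits = code_length_bits - data_bits
    return page_data_bits // data_bits, data_bits, parity_bits


# Hypothetical page: 16 KiB of user data, code rate 8/9.
PAGE_BITS = 16 * 1024 * 8
RATE = Fraction(8, 9)

short = codeword_layout(PAGE_BITS, 9216, RATE)    # shorter codewords
long_ = codeword_layout(PAGE_BITS, 18432, RATE)   # longer codewords

# Total parity per page is identical for both code lengths.
print(short, short[0] * short[2])
print(long_, long_[0] * long_[2])
```

With these assumed numbers, both layouts spend the same number of parity bits per page, which is the "constant storage overhead" property; the trade-off between reliability and read latency comes from how those bits are grouped.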
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionIn resource-constrained tiny edge devices like MCUs, deploying data/compute-intensive AI like CNNs is challenging. Compute-in-Memory (CIM) and Binary Neural Networks (BNN) offer promising solutions by reducing memory access and simplifying computations. We propose RES-BNN, a lightweight, reconfigurable accelerator integrating MRAM-based CIM with serial BNN computation. It introduces a temporal reconfigurable input mechanism and an Integrate-and-Fire adder for serial MAC operations, plus an Exact Result Inferring method for efficient binary convolutional computations. RES-BNN achieves up to 85% energy and 83% power reduction over ADC-based CIM baselines, and enables dynamic latency-power tradeoffs, dropping power to 0.03% in serial mode for edge adaptability.
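As background for the serial MAC operations mentioned above, the following sketch shows the standard XNOR-popcount identity for binary dot products, computed once in parallel and once one bit per step. It does not model RES-BNN's Integrate-and-Fire adder or its Exact Result Inferring method; the function names and the bit-packing convention (bit 1 encodes +1, bit 0 encodes -1) are assumptions made for illustration.

```python
def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as bits (1 -> +1, 0 -> -1)."""
    mask = (1 << n) - 1
    # XNOR marks positions where input and weight agree (product is +1).
    agree = bin(~(x_bits ^ w_bits) & mask).count("1")
    # Each agreement contributes +1, each disagreement -1.
    return 2 * agree - n


def serial_binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Same result, accumulated one bit position per step."""
    acc = 0
    for i in range(n):
        x = (x_bits >> i) & 1
        w = (w_bits >> i) & 1
        acc += 1 if x == w else -1
    return acc


print(binary_dot(0b1011, 0b1101, 4), serial_binary_dot(0b1011, 0b1101, 4))
```

The serial form trades latency for a much smaller datapath, which is the kind of latency-power trade-off the abstract refers to.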
Work in Progress
DescriptionWith continuous technology scaling, accurate and efficient glitch modeling is critical for designing energy-efficient and reliable ICs. In this work, we present a new gate-level approach for glitch propagation modeling, utilizing Artificial Neural Networks (ANNs) to estimate the key glitch shape characteristics, propagation delay, and power dissipation. Moreover, we introduce a framework that automates ANN generation and integrates them into standard cell libraries, exploring different architectures to balance accuracy and memory footprint. The proposed framework employs efficient techniques to generate realistic input glitch waveforms, reduce characterization effort, and improve model accuracy, memory efficiency, and robustness. Experimental results on gates implemented in 7 nm FinFET and 45 nm bulk CMOS technologies indicate that our models exhibit a strong correlation with SPICE, achieving a mean R² score of 0.995 across all gates and process, voltage, and temperature corners while maintaining low memory demands. Furthermore, validation on paths extracted from real circuits confirms our models' high accuracy and performance. Thus, our approach could enable accurate full-chip glitch analysis and effectively guide glitch reduction techniques.
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionHypergraph partitioning is a critical step in the design of complex embedded systems, essential for optimizing task mapping on heterogeneous MPSoCs and enabling multi-FPGA prototyping.
Many existing methods rely on community detection to identify modules with dense internal and sparse external connections, typically utilizing them to constrain the coarsening phase—a widely adopted paradigm. In this work, we propose ComPart, a generalized framework that integrates diverse community detection methods to uncover high-quality clusterings throughout the post-coarsening stages (i.e., initial partitioning and uncoarsening). These discovered clusterings serve as distinct structural guides, enabling the refinement process to identify superior partitioning solutions. Our framework offers two key advantages: (1) it establishes a new paradigm that leverages community structures detected during uncoarsening to escape local optima and explore globally meaningful solution subspaces, transcending the limitations of standard local refinements; and (2) it flexibly accommodates both existing and future community detection methods. Furthermore, we theoretically generalize locally-dense decomposition—originally from graphs—to the hypergraph domain. We provide the formal extension and necessary proofs to apply this technique to hypergraphs, marking its first application in hypergraph partitioning. Specifically, we utilize this rigorously derived decomposition to guide the initial partitioning phase toward superior starting points. Experimental results on standard benchmarks demonstrate that our method consistently outperforms state-of-the-art methods in solution quality.
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionRecent experimental progress in surface-code hardware, including demonstrations of break-even logical memory on devices with up to hundreds of physical qubits, has materially advanced the prospects for fault-tolerant quantum computation. This progress creates urgency for compilation workflows that directly target the forthcoming generation of devices with thousands of physical qubits, for which algorithm execution becomes practical. We develop a pipeline for compiling logical algorithms to physical circuits implementing lattice surgery on the surface code, and use this pipeline to identify the requirements for achieving algorithmic break-even (where quantum error correction improves the performance of a quantum algorithm) for the quantum approximate optimization algorithm (QAOA). Our pipeline integrates several open-source software tools, and leverages recent advances in error-aware unitary gate synthesis, high-fidelity magic-state production, and the calculation of correlation surfaces in the surface code. We apply our pipeline by performing classical simulations of physical Clifford proxy circuits produced by our pipeline, and find that 5-qubit QAOA can reach algorithmic break-even with 2517 physical qubits (surface code distance 𝑑 = 11) at physical error rates of 𝑝 = 10⁻³, or 1737 physical qubits (𝑑 = 9) at 𝑝 = 5 × 10⁻⁴. Our work thereby identifies conditions for achieving algorithmic break-even with near-term quantum hardware and paves the way towards an end-to-end compiler for early-fault-tolerant surface code architectures.
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionRTL generators enable agile design of DNN accelerators, but the lack of early-stage power feedback forces designers to discover energy inefficiencies only after costly synthesis. Existing arch-level simulator-based approaches fall short for agile workflows: they require expertise and effort incompatible with rapid iteration. While machine learning struggles to capture software-hardware coupling, we reveal a key insight: compilation tells energy. Compiler toolchains in RTL generators already fuse workload and hardware characteristics—the coupling determining power. By extracting features from compiler IRs and multi-task learning, our methodology achieves practical accuracy through push-button workflows requiring no expertise. Validation on Gemmini and hls4ml demonstrates broad applicability, enabling true power-aware agile design.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionWe present a directional-transport (DT)-based remote CZ gate and compiler for zoned neutral-atom arrays that overcomes movement-bound entanglement limitations. Current AOD-based shuttling faces row/column non-crossing constraints, device-speed limits, and FOV/NA-restricted range, all of which are bottlenecks for long-distance connectivity. Our approach reserves AODs for channel setup and micro-tuning while making DT the default for remote entanglement. Under antiblockade, a detuning-modulated pi-pulse sequence drives directional transport of a Rydberg excitation along a dynamic and resettable ancilla corridor, realizing a CZ gate between stationary, non-adjacent qubits. This cuts entangling-stage duration by approximately 50% to 90% versus AOD-only baselines and enables long-distance connectivity beyond objective-limited shuttling.
People
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionThis work presents COmPOSER, an open-source, end-to-end framework for RF/mm-wave design automation that translates target specifications into optimized circuits with layouts. It unifies schematic synthesis, layout generation for actives and passives, and placement/routing, incorporating physics-based equations and machine-learning-driven electromagnetic models.
Based on post-layout validation on multiple LNAs and PAs operating at up to 60GHz in a commercial 65nm process-kit, COmPOSER meets performance targets, comparable to expert manual designs, while delivering a 100-300x productivity gain.
Engineering Presentation
Chiplet
EDA
DescriptionAs semiconductor technology advances, the industry is moving from traditional monolithic SoC designs to heterogeneous integration enabled by advanced 2.5D and 3D IC packaging. These architectures deliver higher performance, improved power efficiency, and greater functionality by combining multiple dies in a single package. However, this shift introduces complex physical verification challenges that conventional single-die methodologies cannot address.
Traditional techniques like DRC and LVS, designed for monolithic SoCs, are inadequate for multi-die integration, which involves new elements such as vertical interconnects, micro-bumps, TSVs, and interposers with unique geometric and electrical constraints. Simple XOR-based checks fail to accurately validate bump alignment, and multi-die LVS is complicated by the lack of foundry-provided rule decks, requiring custom solutions and multiple iterations—leading to time-consuming, error-prone flows and schedule risks.
To overcome these challenges, a scalable 3D physical verification methodology is essential. Leveraging tools like Calibre 3DStack enables precise die-to-die connectivity checks, robust micro-bump and TSV alignment validation, and efficient hierarchical verification flows. This approach minimizes iterations, reduces risk, and supports next-generation heterogeneous integration by enabling hierarchy reuse for improved design efficiency and scalability.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPower systems are a critical component of modern systems-on-chip (SoCs) and present formidable mixed-signal verification challenges due to increasing design complexity and diverse power requirements. Designers face hurdles like managing numerous power domains, operating modes, intricate power sequencing, state transitions, and critical analog-digital interactions at domain boundaries. Furthermore, verifying complex Unified Power Format (UPF) strategies for isolation and retention adds significant complexity. Traditional full SPICE simulations for comprehensive power state validation are prohibitively time-consuming, often requiring weeks of runtime and delaying time-to-market by months.
To address these profound limitations, Microchip, in collaboration with Siemens EDA, developed an optimized mixed-signal verification methodology for comprehensive SoC power system validation. This innovative approach, leveraging Siemens EDA's AI-accelerated Solido Simulation Suite, employs a digital-on-top strategy, seamlessly integrating high-fidelity SPICE views for critical analog modules with high-performance digital simulation for the broader RTL design and UPF strategies.
In this presentation, we will detail this novel methodology and demonstrate how it achieves more than 22X runtime reduction for power-state simulation without compromising accuracy, thereby enabling rapid and accurate verification of critical power management scenarios. This approach improves verification coverage and silicon quality and accelerates time-to-market for robust, energy-efficient SoCs.
Research Special Session
AI
DescriptionSince its inception, Disney has been at the forefront of adopting hardware that bridges the gap between imagination and reality. As Extended Reality (XR) hardware reaches new heights of graphical and computational fidelity, the boundary between the viewer and the narrative is dissolving. This talk explores Disney's strategic approach to the XR landscape, focusing on how we leverage next-generation hardware to transform Disney+ into a spatial entertainment hub. Using our work with Apple Vision Pro and Meta Quest as a case study, we examine how we moved beyond traditional 2D interfaces to develop a "Spatial UI" that prioritizes immersion and presence. We will discuss how the immense compute power of modern XR devices allows for innovations previously impossible on mobile or living room screens, such as photorealistic environments and high-bitrate 3D movie playback. These technical leaps continue to drive towards a future where Disney stories are no longer confined to a screen; they unfold in real time around the audience. This session offers a deep dive into the process of creating new experiences on cutting edge hardware platforms.
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionAdvanced 3D and 3.5D IC packaging significantly improves integration density but elevates thermal management challenges due to cross-layer heat coupling and complex cooling structures. Traditional solvers deliver high fidelity but are too slow for iterative design flows, while existing learning-based methods either fail to capture inter-die thermal coupling or treat cooling structures as static components, limiting their applicability in real packaging co-design scenarios.
In this work, we introduce COOL, a cooling-aware point transformer framework that represents heterogeneous assemblies (dies, interposers, TIMs, heat spreaders) as annotated 3D point clouds embedding geometric, material, and power attributes. COOL explicitly encodes geometric boundaries and cooling structures, and introduces a physics-informed boundary condition (PI-BC) loss to enforce thermal consistency at material interfaces and cooling boundaries. Extensive experiments demonstrate that COOL achieves a remarkable 2.4% NMAE on our constructed benchmark of multi-package thermal designs, substantially outperforming existing learning-based approaches while providing over 15.7× speedup compared to commercial FEM solvers.
People
DAC Pavilion Panel
Design
EDA
EDA
Design
DescriptionCome watch the EDA troublemakers answer the edgy, user-submitted questions about this year's most controversial issues! It's an old-style open Q&A from the days before corporate marketing took over every aspect of EDA company images.
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionFull-chip inverse lithography technology (ILT) is critical for semiconductor manufacturing, but maintaining global solution integrity remains difficult. While gradient fusion flow mitigates mask stitching artifacts, it suffers from a uniform, unweighted update policy. We identify heterogeneous smoothness and inter-clip coupling as key factors that create conflicts between local clip convergence and global integrity. We propose a coordinated clip-wise gradient scheduling framework trained via policy learning to resolve these conflicts. The method constructs a full-chip state by combining per-clip static geometric descriptors and dynamic optimization signals, and aggregates them with a graph neural network to encode inter-clip relations. From this state, a scheduler generates continuous, correlated gradient weights learned with flow-matching policy gradients, capturing cross-clip dependencies. On industry-scale layouts, the approach outperforms state-of-the-art full-chip ILT baselines.
People
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionRecent advances in large language models (LLMs) have driven exponential growth in parameter counts, amplifying memory footprints and creating acute computational bottlenecks.
Weight-only quantization is a common mitigation because weights dominate storage and incur less accuracy degradation than activation quantization. However, it leaves activations in floating point (FP), forcing FP execution whose area and energy costs far exceed those of integer (INT) compute. To address this limitation, binary coding quantization (BCQ) lowers this cost by replacing many FP multiplies with FP additions, but still requires FP multiplications for scaling factors and suffers from limited representational capacity, causing non-trivial accuracy loss. In this paper, we propose a sum-of-power-of-two scaling factor–based BCQ (SS-BCQ) that approximates FP scaling factors with a small set of power-of-two terms, eliminating FP multiplications (shift–add only) while preserving accuracy. To efficiently realize SS-BCQ in hardware, we introduce CORE, a fully pipelined FP–INT general matrix-matrix multiplication (GEMM) engine. CORE features (i) an adaptive data-mapping module that computes optimized block-wise activation scales to minimize scaling operations, and (ii) an extra processing unit that allocates a small subset of processing elements to high-importance weights to retain accuracy at low cost. Evaluated on the OPT-6.7B model, SS-BCQ reduces perplexity by 3.14 compared with prior BCQ methods. Implemented in CORE, our approach achieves up to 1.41x higher area efficiency (TOPS/mm^2) and 2.18x higher energy efficiency (TOPS/W) than state-of-the-art FP–INT accelerators, enabling a homogeneous low-precision execution path for on-device LLMs.
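The idea of approximating a floating-point scaling factor with a few power-of-two terms can be sketched as follows. This is a minimal greedy illustration, not the SS-BCQ procedure itself, whose term selection the abstract does not specify. The function names `sum_of_pow2` and `apply_scale` are hypothetical, and `math.ldexp` (an exponent adjustment) stands in for a hardware shift.

```python
import math


def sum_of_pow2(scale: float, num_terms: int) -> list[int]:
    """Greedily choose exponents e_i so that sum(2**e_i) approximates scale > 0."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    exponents = []
    residual = scale
    for _ in range(num_terms):
        if residual <= 0:
            break  # scale is already represented exactly
        e = math.floor(math.log2(residual))  # largest power of two <= residual
        exponents.append(e)
        residual -= 2.0 ** e
    return exponents


def apply_scale(x: float, exponents: list[int]) -> float:
    """Scale x using only exponent shifts and additions (no multiplication)."""
    return sum(math.ldexp(x, e) for e in exponents)


exps = sum_of_pow2(0.75, 2)        # 0.75 = 2**-1 + 2**-2
print(exps, apply_scale(8.0, exps))
```

More terms reduce the approximation error at the cost of more shift-add steps; how a real design balances this is outside the scope of this sketch.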
Engineering Presentation
EDA
Security
DescriptionLimitations in Sign-off Methodology:
Traditional STA corner selection: an STA corner 10% below the PMIC voltage is selected, so the design is constantly optimised for a 10% lower voltage and fails to build margin against differential drop.
IR-aware STA does not account for a timing path's geometric imbalances (logic depths) or net-dominated interconnect skews (net delays and metal-layer variation), all of which are dominant in advanced process nodes. Furthermore, it is workload dependent: fixing IR STA violations does not build margin for unseen vectors.
Frequent silicon issues due to these gaps:
Low-voltage-mode scan-shift Vmin jumps: slack must be met on paths that become exponentially sensitive to voltage gradients. Even small differential IR drops (capture and launch paths traversing contrasting IR hotspot and cool regions) cause catastrophic slack loss.
High-divergence paths with structural imbalances, where clock paths are net-dominated and operated at high speeds, often fail to meet hold timing despite good pre-silicon margins, owing to high interlayer metal-sheet and via resistances in lower process nodes.
Additionally, there is considerable PPA impact: higher dynamic and leakage power in the clock and data paths, respectively, of divergent paths. Our proposed solution aims to address the above gaps.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIdentifying recurring sub-circuits within large transistor-level designs is a common requirement in circuit analysis and verification. Traditional approaches rely on exact subgraph isomorphism, requiring identical structure and attributes between a pattern and a design. However, real designs often include small structural variations introduced by incremental changes, optimizations, or renaming, causing exact matching methods to miss relevant instances and increasing manual analysis effort.
This work presents a scalable framework for partial subgraph matching that quantifies structural similarity rather than enforcing exact equivalence. Circuits are represented as attributed graphs, and matching is formulated as an injective optimization problem using a cost-based graph edit distance. The cost function captures label mismatches, unmatched nodes, and missing or extra edges, enabling fine-grained assessment of near-miss matches. To address the computational complexity of partial matching, the framework combines biased candidate subgraph sampling with an efficient approximation strategy based on greedy initialization and iterative refinement.
The approach produces ranked candidate matches along with detailed, interpretable difference reports that highlight structural deviations between circuits. Experimental results demonstrate predictable runtime scaling with candidate budget and consistent behavior across a range of circuit patterns. The framework also establishes a foundation for automated tuning of cost parameters and learning-based candidate filtering.
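The cost function described above (label mismatches, unmatched nodes, and missing or extra edges) can be sketched for a single candidate mapping as follows. This is an illustrative scoring routine under assumed unit costs, not the framework's actual formulation; the data layout, the function name `match_cost`, and the device labels in the example are hypothetical. Design nodes outside the mapping are not penalized, on the assumption that the pattern is much smaller than the design.

```python
def match_cost(pattern_nodes, pattern_edges, design_nodes, design_edges, mapping,
               label_cost=1.0, node_cost=1.0, edge_cost=1.0):
    """Edit cost of an injective (possibly partial) mapping pattern -> design.

    *_nodes: dict node -> label; *_edges: set of frozenset({u, v}) with u != v;
    mapping: dict pattern node -> design node.
    """
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("mapping must be injective")
    cost = 0.0
    # Label mismatches between matched nodes.
    for p, d in mapping.items():
        if pattern_nodes[p] != design_nodes[d]:
            cost += label_cost
    # Pattern nodes left without a partner.
    cost += node_cost * (len(pattern_nodes) - len(mapping))
    # Pattern edges that are absent from the design (or touch an unmatched node).
    images, missing = set(), 0
    for edge in pattern_edges:
        u, v = tuple(edge)
        if u in mapping and v in mapping:
            image = frozenset((mapping[u], mapping[v]))
            images.add(image)
            if image not in design_edges:
                missing += 1
        else:
            missing += 1
    # Design edges among matched nodes that the pattern does not have.
    matched = set(mapping.values())
    extra = sum(1 for e in design_edges if e <= matched and e not in images)
    return cost + edge_cost * (missing + extra)


pattern_nodes = {"a": "nmos", "b": "pmos", "c": "res"}
pattern_edges = {frozenset(("a", "b")), frozenset(("b", "c"))}
design_nodes = {"x": "nmos", "y": "pmos", "z": "cap"}
design_edges = {frozenset(("x", "y")), frozenset(("x", "z"))}
print(match_cost(pattern_nodes, pattern_edges, design_nodes, design_edges,
                 {"a": "x", "b": "y", "c": "z"}))
```

In the example, the near-miss match is charged for one label mismatch, one missing edge, and one extra edge; an exact match would score zero, which is what lets candidates be ranked by similarity.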
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionThe Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to a significant parameter size and intermediate data. Prior works attempt to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference.
To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)–enabled CPU–GPU collaborative system that comprehensively optimizes MoE inference by combining coalesced expert execution with strategic workload orchestration for higher throughput. CoX-MoE introduces (i) a coalescing-aware orchestration policy to jointly optimize resource allocation by adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading, and (ii) a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU, mitigating PCIe transfer overhead and balancing workload for the CPU and GPU during inference. Compared to state-of-the-art frameworks, CoX-MoE delivers significant gains, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.
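The static pre-assignment of frequently activated experts to the GPU can be illustrated with a simple ranking. This sketch only sorts experts by a profiled activation count and fills a fixed number of GPU slots; it does not reproduce CoX-MoE's stratification scheme, which also balances CPU and GPU workload. The function name `place_experts`, the slot count, and the activation counts are hypothetical.

```python
def place_experts(activation_counts: dict[int, int], gpu_slots: int):
    """Assign the most frequently activated experts to the GPU, the rest to the CPU.

    Returns (gpu_experts, cpu_experts), each sorted by expert id.
    Ties in activation count are broken by lower expert id.
    """
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    return sorted(ranked[:gpu_slots]), sorted(ranked[gpu_slots:])


# Hypothetical profile: expert id -> number of activations observed offline.
profile = {0: 50, 1: 400, 2: 10, 3: 300}
gpu, cpu = place_experts(profile, gpu_slots=2)
print(gpu, cpu)
```

Keeping hot experts resident on the GPU avoids repeated PCIe transfers for the weights used most often, which is the overhead the abstract says the scheme mitigates.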
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionBinary neural networks (BNNs) have emerged as promising models for lightweight intelligent inference by binarizing inputs and weights. Compute-in-memory (CIM) architectures, which reduce data movement through in-situ operations, have become a strong candidate for BNN accelerators. However, prior works often overlook the security of the model, leaving BNN weights exposed in memory cells and vulnerable to threats such as cloning, tampering, and reverse engineering. This work proposes CPA-BNN, a secure BNN CIM architecture based on resistive random access memory (RRAM). We propose a 4T2R RRAM cell design that implements in-situ encryption and ciphertext convolution operations to protect the weights. In addition, an in-memory batch normalization (BN) scheme is proposed to reduce area overhead and improve compute density. Furthermore, we propose an intrinsic physical unclonable function (PUF) entropy extraction method and a current tilt-based masking strategy, enabling reliable key extraction within a unified array. The results show that CPA-BNN achieves ~100% key reliability and effectively prevents model attacks. Compared to state-of-the-art SRAM/NVM BNN schemes, CPA-BNN achieves >1.4× compute density and >1.6× storage density improvement.
Engineering Presentation
EDA
DescriptionHigh-performance and high-frequency cores operate with aggressive clock targets, making timing closure highly sensitive to Common Path Pessimism Removal (CPPR). While advanced clock tree synthesis (CTS) techniques such as Flex-H and multi-point CTS effectively reduce skew and clock latency, they typically rely on static, geometry-based sink assignment, limiting achievable clock–data path correlation and CPPR on critical timing paths.
This work proposes a CPPR-aware tap-point sink reassignment methodology that selectively reassigns launch and capture sinks at clock tap points to maximize shared clock path length without modifying clock topology, buffer depth, or skew constraints. A script-driven framework identifies timing-critical tap-point pairs, evaluates reassignment opportunities based on timing impact, and integrates seamlessly with standard CTS optimization flows.
Experimental results on high-performance core designs demonstrate up to 25% reduction in setup total negative slack and up to 43% improvement in register-to-register setup TNS, along with ~13% improvement in hold TNS, achieved by reassigning a limited number of sinks. The proposed approach incurs negligible power and physical impact, enabling robust timing closure in advanced high-frequency designs where CPPR is often the primary remaining optimization lever.
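To make the role of the shared clock path concrete, the sketch below computes a CPPR credit as the sum of (late minus early) delay over the clock segments common to the launch and capture paths, before and after moving a capture sink to the launch sink's tap point. The segment names and delay values are invented for illustration and are not taken from the presentation.

```python
# (early, late) delay in picoseconds per clock segment; values are made up.
SEGMENT_DELAY = {
    "root": (100, 110),
    "buf1": (50, 56),
    "tapA": (40, 45),
    "tapB": (40, 46),
}

def shared_clock_path(launch, capture):
    """Return the common prefix of the launch and capture clock paths."""
    shared = []
    for a, b in zip(launch, capture):
        if a != b:
            break
        shared.append(a)
    return shared

def cppr_credit(launch, capture):
    """Pessimism recovered on the shared segments: sum of (late - early)."""
    return sum(SEGMENT_DELAY[s][1] - SEGMENT_DELAY[s][0]
               for s in shared_clock_path(launch, capture))

launch_path = ["root", "buf1", "tapA"]
before = cppr_credit(launch_path, ["root", "buf1", "tapB"])  # sinks on different taps
after = cppr_credit(launch_path, ["root", "buf1", "tapA"])   # capture sink moved to tapA
print(before, after)
```

In this toy case the reassignment lengthens the shared path by one segment, so the credit grows by that segment's early/late spread.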
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionSource Mask Optimization (SMO), as a key enabler in Design-Technology Co-Optimization (DTCO), plays a vital role in enlarging the process window (PW) for advanced technology nodes and PDK development. Conventional SMO relies heavily on iterative optimization using lithography models to obtain a single source-mask pair for improved PW. In full-chip applications, however, a set of critical patterns are co-optimized under a common source, forming a one-source-to-many-masks optimization paradigm for maximum common process window (CPW). To overcome the limitations of existing methods, we introduce CPW-SMO, a novel single-shot generative flow that simultaneously generates an optimal source and a set of corresponding mask patterns using set-based attention mechanisms. Our approach formulates the source and mask patterns as a unified set and employs a highly parallelized generative simulator to enable efficient training. By transforming multi-objective, non-differentiable CPW into a single-objective CPW preference penalty and optimizing generators through an SMO-aware gradient response method, CPW-SMO achieves nearly 2x CPW compared to state-of-the-art methods, while delivering ~200x speedup in runtime. These improvements significantly boost the practicality and effectiveness of SMO for holistic lithography applications.
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionDeploying Retrieval-Augmented Generation (RAG) on edge devices is in high demand, but is hindered by the latency of massive data movement and computation on traditional architectures. Compute-in-Memory (CiM) architectures address this bottleneck by performing vector search directly within their crossbar structure. However, CiM's adoption for RAG is limited by a fundamental "representation gap," as high-precision, high-dimension embeddings are incompatible with CiM's low-precision, low-dimension array constraints. This gap is compounded by the diversity of CiM implementations (e.g., SRAM, ReRAM, FeFET), each with unique designs (e.g., 2-bit cells, 512x512 arrays). Consequently, RAG data must be naively reshaped to fit each target implementation.
Current data shaping methods handle dimension and precision disjointly, which degrades data fidelity. This not only negates the advantages of CiM for RAG but also confuses hardware designers, making it unclear if a failure is due to the circuit design or the degraded input data. As a result, CiM adoption remains limited. In this paper, we introduce CQ-CiM, a unified, hardware-aware data shaping framework to jointly perform Compression and Quantization, flexibly shaping data to fit various CiM designs. To the best of our knowledge, this is the first work to shape data for comprehensive CiM usage on RAG.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionWith the scaling of switching chips, modular architectures are becoming mainstream.
As a result, buffers become physically distributed across tiles, leading to poor utilization under imbalanced traffic.
To address this, we propose CrediX, a lightweight dynamic buffer management mechanism that aggregates distributed buffers into region-based shared pools and allocates usage on demand through a Regional Credit Allocator (RCA).
The design leverages existing credit-based flow control and preserves packet ordering via a simple VC-to-path mapping.
Evaluations show that CrediX reduces backpressure by 65.6% under synthetic Hotspot traffic and mitigates severe hotspot occupancy in a realistic workload.
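The difference between per-tile buffers and a region-based shared pool can be sketched with a toy credit allocator. The class and method names below are hypothetical, and the sketch ignores the VC-to-path mapping and any fairness policy that the actual Regional Credit Allocator may use.

```python
class RegionalCreditAllocator:
    """Toy shared-pool credit allocator for one region (illustrative only)."""

    def __init__(self, tile_buffers):
        # Aggregate the per-tile buffer slots into one shared pool of credits.
        self.free = sum(tile_buffers.values())
        self.in_use = {tile: 0 for tile in tile_buffers}

    def request(self, tile, n):
        """Grant up to n credits to a tile, limited by what the pool has left."""
        granted = min(n, self.free)
        self.free -= granted
        self.in_use[tile] += granted
        return granted

    def release(self, tile, n):
        """Return up to n credits from a tile to the shared pool."""
        returned = min(n, self.in_use[tile])
        self.in_use[tile] -= returned
        self.free += returned
        return returned

# Four tiles with 8 slots each: a static split caps a hotspot tile at 8 slots.
rca = RegionalCreditAllocator({"t0": 8, "t1": 8, "t2": 8, "t3": 8})
hot = rca.request("t0", 20)    # the hotspot tile borrows idle capacity
cold = rca.request("t1", 16)   # only 12 credits remain in the pool
freed = rca.release("t0", 5)
print(hot, cold, freed, rca.free)
```

Under imbalanced traffic the hotspot tile can use slots that would sit idle in a static per-tile split, which is the utilization effect the abstract describes.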
Engineering Presentation
EDA
DescriptionWith the increasing complexity of high-performance VLSI designs at advanced technology nodes below 2 nm, crosstalk-induced signal integrity and timing violations have emerged as major challenges in physical implementation flows. Traditional crosstalk analysis is typically performed after detailed routing, making post-route fixes computationally expensive and time-consuming, often leading to prolonged design closure cycles. This work proposes a machine learning–driven framework that predicts crosstalk effects before detailed routing and guides the router to prevent them. The approach leverages deep convolutional neural networks (CNNs) to extract spatial and physical features and generate a crosstalk hotspot map that identifies crosstalk-critical nets. To further enhance optimization, the CNN-based predictor is integrated with a reinforcement learning–based Cadence Cerebrus flow to guide routing decisions such as net ordering and spacing. Experimental results demonstrate that, after training on 40 Cerebrus regression runs, the proposed model achieves up to 74% prediction accuracy and delivers up to 15% improvement in setup total negative slack (TNS). The trained model is reusable across similar designs within the same technology node, yielding consistent PPA improvements and significantly reducing overall design turnaround time. The proposed framework enables efficient pre-routing signal integrity analysis, minimizes costly post-routing iterations, and improves timing closure, making it well suited for next-generation VLSI physical design flows.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEarly identification of crosstalk-affected regions is critical for achieving timing closure in advanced VLSI designs. This paper presents a framework for the pre-routing prediction of coupling-induced timing and signal integrity risks within the physical design flow. The proposed method introduces a novel Coupling Capacitance Estimator (CCE) to accurately estimate coupling effects arising from neighboring interconnects.
The framework extracts multi-channel spatial features from the Innovus placement database, including congestion, pin density, timing criticality, routing density, macro blockages, and coupling characteristics, and feeds them into a CNN-based U-Net architecture to generate high-resolution crosstalk hotspot maps. Owing to its encoder–decoder structure with skip connections, the U-Net model preserves fine-grained spatial information and enables precise localization of crosstalk-sensitive regions in dense layouts.
Experimental evaluation across multiple design blocks demonstrates that the generated heatmaps closely correlate with post-routing signoff results, indicating strong predictive fidelity. By enabling reliable pre-routing crosstalk analysis, the proposed framework reduces dependence on costly post-routing iterations and supports improved timing closure, reduced design turnaround time, and enhanced physical implementation efficiency, making it well suited for advanced technology nodes.
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionWith continued technology scaling, crosstalk effects have become critical for timing closure. Existing graph learning-based timing models fail to accurately predict crosstalk timing due to limited modeling of coupling behaviors and multi-net interactions. We introduce a graph prompt learning framework featuring a dual-level prompt architecture. Node-level prompts capture fine-grained coupling effects while subgraph-level prompts model holistic multi-net interactions, effectively adapting pre-trained models. Experimental results demonstrate precise crosstalk timing prediction, consistently outperforming existing methods across benchmark designs.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionScaling fault-tolerant quantum computing is increasingly constrained by the limited bandwidth and power budget across the 4 K to room-temperature (RT) interface. We present CryoZip, a cross-layer cryogenic compression framework that cooperates with a lightweight in-house quantum error correction (QEC) predecoder to reduce syndrome transmission under realistic, circuit-level noise. CryoZip targets sparse syndrome vectors with a sliding-window compression architecture sized under strict decoding-latency constraints to maximize energy efficiency. We implement and evaluate the design in 22 nm FDSOI characterized at 4 K, using vector-based power, performance, and area analysis to obtain realistic hardware data. CryoZip achieves up to 48.3× compression (1.81× higher than state-of-the-art compressors) across various QEC codes; when paired with the predecoder it yields over 14,238.86× bandwidth reduction (48.3× without predecoding), and delivers 3.97–25.74× energy savings for cryo-to-RT links alone, rising to 42.19× when accounting for predecoding and realistic QEC interface overheads.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionChain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks, yet incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache. Unlike traditional generation tasks where all tokens are uniformly important, CoT emphasizes the final answer, rendering conventional KV compression strategies ineffective. In this paper, we present Crystal-KV, an efficient KV cache management framework tailored for CoT reasoning. Our key insight is the answer-first principle. By mapping answer preferences onto the think-stage attention map, we distinguish between SlipKV, which mainly maintains the reasoning flow but may occasionally introduce misleading context, and CrystalKV, which truly contributes to the correctness of the final answer. Next, we propose an attention-based Least Recently Frequently Used algorithm. It identifies when a SlipKV entry's utility expires and evicts it, retaining CrystalKV without disrupting the reasoning flow. Finally, we introduce an adaptive cache budget allocation algorithm. Based on the dynamic proportion of CrystalKV, it estimates the importance of each layer/head and adjusts the KV cache budget during inference, amplifying critical components to improve budget utilization. Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and enables faster response time, while maintaining, or even improving, answer accuracy for CoT reasoning.
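One way to picture an attention-based Least Recently Frequently Used policy is the classic LRFU score with attention weights as the reference weights: each cached position keeps a score that decays exponentially with decode steps and grows by the attention it receives. The sketch below is an assumption-laden toy, not Crystal-KV's algorithm; the class name, the decay parameter `lam`, and the attention values are all hypothetical.

```python
class AttentionLRFU:
    """Toy KV eviction: exponentially decayed sum of received attention."""

    def __init__(self, budget, lam=1.0):
        self.budget = budget      # maximum number of cached positions
        self.lam = lam            # decay rate; larger values favour recency
        self.score = {}           # position -> (score, step of last update)

    def _decayed(self, pos, step):
        value, last = self.score[pos]
        return value * 0.5 ** (self.lam * (step - last))

    def observe(self, step, attention):
        """Fold in this step's attention weights, then evict down to budget."""
        for pos, weight in attention.items():
            old = self._decayed(pos, step) if pos in self.score else 0.0
            self.score[pos] = (old + weight, step)
        evicted = []
        while len(self.score) > self.budget:
            victim = min(self.score, key=lambda p: self._decayed(p, step))
            del self.score[victim]
            evicted.append(victim)
        return evicted

cache = AttentionLRFU(budget=2, lam=1.0)
first = cache.observe(0, {0: 0.5, 1: 0.5})
second = cache.observe(1, {0: 0.0, 1: 0.6, 2: 0.4})
print(first, second, sorted(cache.score))
```

Position 0 stops receiving attention at step 1, so its decayed score falls below the others and it is the entry evicted.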
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionThe "memory wall" problem continues to constrain modern applications, making scalable memory expansion via Compute Express Link (CXL) increasingly critical. While DRAM-Flash hybrid CXL systems offer a capacity solution, their performance remains hampered by inefficient access models.
This paper presents CtxMem, an OS-hardware co-designed memory expansion system built on byte-addressable CXL interconnect and inexpensive flash storage. Unlike existing approaches, CtxMem introduces an asynchronous page fault-based mechanism that offloads the entire data plane of page fault handling to the CXL device. This is achieved through three key innovations: a dynamic direct-mapped cache that maximizes DRAM utilization, a lightweight hardware profiling unit enabling accurate and rapid hot/cold page identification with minimal overhead, and an OS-hardware co-designed page fault handling procedure that efficiently offloads data migration. Our GEM5-based evaluation shows CtxMem outperforms conventional CXL-SSD by 1.4x in execution time and doubles system throughput, demonstrating efficient large-scale memory expansion.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionOptimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries.
While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations.
This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution.
Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization.
Experimental results show that kernels generated by cuPilot achieve an average speedup of 3.09× over PyTorch on a benchmark of 100 kernels.
On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units.
The generated kernels are open-sourced at https://anonymous.4open.science/r/cuPilot-Kernels-1656.
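For readers unfamiliar with the roofline model behind roofline-guided prompting, the sketch below classifies a kernel as compute-bound or memory-bound from its arithmetic intensity. The peak throughput and bandwidth figures are hypothetical placeholders, and how cuPilot turns such a classification into a prompt is not described in the abstract.

```python
def roofline(flops, bytes_moved, peak_flops, mem_bw):
    """Return (arithmetic intensity, attainable FLOP/s, limiting resource)."""
    intensity = flops / bytes_moved                 # FLOP per byte
    attainable = min(peak_flops, mem_bw * intensity)
    bound = "compute" if mem_bw * intensity >= peak_flops else "memory"
    return intensity, attainable, bound

PEAK = 100e12   # hypothetical peak throughput, FLOP/s
BW = 1e12       # hypothetical memory bandwidth, bytes/s

n = 4096
# Square FP16 GEMM: 2*n^3 FLOPs, three n-by-n matrices of 2-byte elements.
gemm = roofline(2 * n**3, 3 * n * n * 2, PEAK, BW)
# FP32 vector add: n FLOPs, three n-element vectors of 4-byte elements.
vadd = roofline(n, 3 * n * 4, PEAK, BW)
print(gemm, vadd)
```

With these placeholder numbers the GEMM sits on the compute roof while the vector add is limited by memory bandwidth, which suggests different optimization strategies for each.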
Engineering Presentation
Design
EDA
Systems
DescriptionHigh-speed digital blocks integrated within traditional Place & Route flows pose verification and implementation challenges that analog-centric methodologies cannot fully address. As process technologies scale and timing margins shrink, these blocks exhibit strong sensitivity to parasitics, layout-dependent effects, and high-frequency interactions, demanding enhanced automation and predictability. While the Cadence Virtuoso studio provides unmatched control for custom transistor-level design and leverages Cadence Innovus for digital automation, it lacks capabilities for automated Structured Data Path (SDP) file and routing constraint creation from source.
This work introduces a unified implementation approach that bridges this gap by enabling SDP-based placement for custom standard cells, auto-generating the SDP file from the schematic or Verilog, and creating routing constraints such as matching and shielding based on net prefixes/suffixes in the schematic or Verilog. The methodology improves design scalability, reduces manual effort, and strengthens flow automation for high-speed digital blocks. By combining Virtuoso studio custom design flexibility with Innovus-driven automation, the proposed solution delivers a robust, layout-aware implementation flow suited for next-generation high-performance SoC and PHY architectures.
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionSparse-dense matrix multiplication (SpMM) is a core component in many critical applications such as deep learning and scientific computing. Existing SpMM accelerators employ the COO format to compress sparse matrices and rely on high-bandwidth off-chip memory for performance gains. However, four major challenges remain unaddressed: 1) Fixed 2-D partitioning strategies lead to workload imbalance among processing nodes due to irregular distribution of non-zero elements in sparse matrices. 2) Redundant metadata from the COO format results in a communication bottleneck that hinders the scalability of existing SpMM accelerators. 3) To accommodate irregular memory access patterns, the use of multiple data replicas significantly increases the pressure on on-chip storage resources. 4) Handling RAW hazards from floating-point adders in software incurs substantial pre-processing overhead. To address these challenges, we propose DAP, the first 2-D multi-chip-based architecture composed of dedicated accelerators, and SPU, a novel communication-friendly SpMM accelerator. To mitigate load imbalance caused by irregular non-zero distribution, we design a two-level matrix partitioning framework that effectively balances workloads across nodes in a 2-D computing array, achieving a performance improvement of 1.43x. Furthermore, the SPU adopts CSC over COO format to reduce communication traffic. SPU also minimizes redundant on-chip storage from data replication, with only a 0.067x performance penalty due to memory access conflicts. By employing a reservation buffer, the SPU resolves RAW dependencies without time-consuming pre-processing. Our simulation-based evaluation demonstrates DAP achieves geometric mean throughputs of 2.69x relative to NVIDIA A100 GPU at 500MHz frequency.
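The metadata argument for CSC over COO can be checked with a small conversion: COO stores a row and a column index per non-zero, while CSC stores one row index per non-zero plus one pointer per column boundary. The sketch below is a generic format conversion on a made-up 4x3 matrix, not the SPU's on-chip encoding.

```python
def coo_to_csc(rows, cols, vals, n_cols):
    """Convert COO triplets to CSC arrays (col_ptr, row_idx, data)."""
    # Order the non-zeros column by column, then by row inside each column.
    order = sorted(range(len(vals)), key=lambda i: (cols[i], rows[i]))
    col_ptr = [0] * (n_cols + 1)
    for c in cols:
        col_ptr[c + 1] += 1            # count the non-zeros in each column
    for c in range(n_cols):
        col_ptr[c + 1] += col_ptr[c]   # prefix sum gives the column offsets
    row_idx = [rows[i] for i in order]
    data = [vals[i] for i in order]
    return col_ptr, row_idx, data

# Made-up 4x3 matrix with 8 non-zeros.
rows = [0, 1, 2, 3, 0, 2, 1, 3]
cols = [0, 0, 0, 0, 1, 1, 2, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
col_ptr, row_idx, data = coo_to_csc(rows, cols, vals, n_cols=3)

coo_index_words = len(rows) + len(cols)         # two indices per non-zero
csc_index_words = len(row_idx) + len(col_ptr)   # one index per non-zero + pointers
print(col_ptr, row_idx, coo_index_words, csc_index_words)
```

The saving grows as the number of non-zeros per column increases, since the pointer array depends only on the column count.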
People
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionNon-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also impose a significant impact on system performance and energy efficiency.
In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures by exploiting the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment.
Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16x and decreases DSP utilization by 16x while maintaining comparable or better performance across vision Transformers and GPT-2 models.
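The idea of spending more segments where pre-activations are likely can be sketched with a piecewise-linear GELU whose breakpoints sit at equal-probability quantiles of an assumed distribution. The standard-normal assumption, the clipping range of [-4, 4], and the segment count are illustrative choices only; the sketch omits DAPA's quantization step and is not the authors' fitting procedure.

```python
import math
from bisect import bisect_right
from statistics import NormalDist

def gelu(x):
    """Exact GELU using the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantile_breakpoints(dist, n_segments, lo, hi):
    """Breakpoints at equal-probability quantiles: dense where data is likely."""
    inner = [dist.inv_cdf(k / n_segments) for k in range(1, n_segments)]
    return [lo] + [min(max(b, lo), hi) for b in inner] + [hi]

def piecewise_gelu(x, bps):
    """Linear interpolation of GELU between neighbouring breakpoints."""
    if x <= bps[0]:
        return 0.0        # GELU is close to 0 far to the left
    if x >= bps[-1]:
        return x          # GELU is close to identity far to the right
    i = bisect_right(bps, x) - 1
    x0, x1 = bps[i], bps[i + 1]
    y0, y1 = gelu(x0), gelu(x1)
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Assumption: pre-activations roughly follow a standard normal distribution.
bps = quantile_breakpoints(NormalDist(0.0, 1.0), n_segments=16, lo=-4.0, hi=4.0)
grid = [i / 100.0 for i in range(-500, 501)]
max_err = max(abs(piecewise_gelu(x, bps) - gelu(x)) for x in grid)
central_err = max(abs(piecewise_gelu(x, bps) - gelu(x)) for x in grid if abs(x) <= 1.0)
print(len(bps), max_err, central_err)
```

Because the quantiles cluster near zero, the approximation is tightest in the central region where most inputs fall, and the largest errors land in the rarely visited tails.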
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionRegister Transfer Level (RTL) simulation is key to pre-silicon verification, yet its performance has stagnated in the face of rapidly escalating SoC complexity. Although recent CPU- and GPU-based simulators improve throughput through per-stimulus optimization or batch parallelism, they execute each stimulus in isolation and therefore fail to exploit the substantial computation locality that emerges across stimuli. As many stimuli converge to identical internal states, large portions of the circuit are redundantly re-evaluated, forming a principal bottleneck to scalable RTL simulation.
This paper introduces Dart, a Directed Acyclic Graph (DAG)-driven RTL simulation framework that systematically eliminates cross-stimulus redundancy. Dart constructs a DAG-based intermediate representation that makes structural commonality and shared subexpressions across stimuli explicit, enabling principled redundancy elimination through systematic sub-DAG merging. A computation-centric execution engine evaluates shared logic once and amortizes its results across all stimuli that traverse the corresponding state, while a lightweight state-reconstruction mechanism preserves per-stimulus correctness with negligible overhead. Across a suite of industrial RTL designs, Dart delivers speedups of up to 136.7x over Verilator and 4.1x over RTLflow, respectively.
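The cross-stimulus redundancy described above can be illustrated with a much simpler mechanism than sub-DAG merging: caching the next-state function on (state, input) pairs so that stimuli passing through the same state reuse one evaluation. The function names and the counter design below are hypothetical, and whole-state memoization is only a stand-in for Dart's DAG-based representation.

```python
def simulate_batch(step, init_state, stimuli):
    """Simulate every stimulus, evaluating each (state, input) pair only once."""
    cache = {}
    evaluations = 0
    traces = []
    for stimulus in stimuli:
        state, trace = init_state, []
        for inp in stimulus:
            key = (state, inp)
            if key not in cache:
                cache[key] = step(state, inp)   # the only real logic evaluation
                evaluations += 1
            state = cache[key]
            trace.append(state)
        traces.append(trace)
    return traces, evaluations

def counter_step(state, enable):
    """Hypothetical design under test: a 4-bit counter with an enable input."""
    return (state + enable) % 16

stimuli = [(1, 1, 0, 1), (1, 1, 1, 0), (1, 1, 0, 0)]
traces, evaluations = simulate_batch(counter_step, 0, stimuli)
naive_evaluations = sum(len(s) for s in stimuli)
print(traces, evaluations, naive_evaluations)
```

The three stimuli share their first cycles, so only 5 of the 12 cycle evaluations are actually computed while each stimulus still gets its own trace.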
People
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionHigh-level synthesis (HLS) has been widely adopted to map high-level languages to hardware, significantly improving hardware design productivity. Existing HLS tools have gaps in generating custom hardware due to data dependencies and differences in intermediate representation (IR) levels.
Appropriate IRs can be created for each by separating the algorithmic behaviour and the microarchitecture description, which is an effective way to reduce development costs and achieve competitive hardware designs.
In this paper, we propose DataFlowGen, an open-source framework built on MLIR for efficient dataflow accelerator generation. DataFlowGen explicitly introduces a two-level IR to perform operations at suitable abstraction levels, capturing dataflow characteristics and multi-level hierarchy.
Leveraging these representations, we develop an automated optimizer that outlines the application kernel and performs dataflow transformations to derive a hardware-oriented control dataflow graph (H-CDFG). It enables concise representation and resource efficiency of hardware architectures.
Experiments show that DataFlowGen achieves a performance improvement with significant resource reduction compared to state-of-the-art HLS tools. The results also show that our optimizer effectively leverages the expressive power of the IRs to capture kernel parallelism.
Workshop
DescriptionWorkshop Website: https://dcgaa.dk-lab.xyz/2026
In the rapidly evolving domain of computational technologies, the transformative impact of AI continues to shape the future. DCgAA (Deep Learning-Hardware Co-Design for Generative AI Acceleration) 2026 builds on the success of its inaugural edition by diving deeper into the frontier of deep learning (DL) and hardware co-design, with an amplified focus on real-world deployment challenges and next-generation innovations for generative AI applications. This third iteration of the workshop emphasizes expanding the scope beyond foundational discussions, addressing emerging paradigms in generative AI, including multimodal fusion, real-time adaptive processing, and decentralized edge applications. Acknowledging the growing role of foundation models, diffusion models, and large-scale generative systems, this workshop prioritizes optimizing these technologies for sustainable scalability, balancing performance, energy efficiency, and accessibility across diverse computing environments such as edge devices, AR/VR platforms, and ubiquitous IoT systems. Through a blended format of keynotes, paper presentations, and poster presentations, and by engaging thought leaders, researchers, and practitioners across academia and industry, DCgAA 2026 seeks to redefine the boundaries of DL-hardware integration and promises to set new benchmarks for hardware-aware generative AI, driving innovation that is efficient, scalable, and impactful in the real world.
Learn More: https://dcgaa.dk-lab.xyz/2026
People
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionClock Tree Synthesis (CTS) constitutes a complex, discrete, and combinatorial multi-objective optimization (MOO) problem. Traditional flows fragment it into sequential steps, including clustering, topology generation, and buffering, which leads to suboptimal results due to local optima. Despite their significant potential for MOO, differentiable methods are inherently limited in representing the dynamic topological adjustments that occur during CTS. To address this, we propose DCTS, an end-to-end differentiable CTS framework based on a Probabilistic Graphical Model (PGM) that re-parameterizes the discrete topological search into a continuous gradient-based problem, enabling co-optimization of clock tree topology and buffer sizing within a global design space. DCTS was evaluated on ISCAS'89 and OpenCores benchmark circuits under the ASAP7 technology node. Experimental results show that it achieves competitive power, performance, and area (PPA) metrics against a leading commercial tool, along with a 2.78× speedup on large-scale circuits. Furthermore, compared to state-of-the-art academic solutions, DCTS delivers minimum improvements of 20% in delay, 17% in skew, 1% in power, and 17% in area, along with a minimum speedup of 1.98× on large-scale designs.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionFault localization in modern processor design code is a critical yet time-consuming step during processor verification.
While recent advances in LLM-based techniques for module-level hardware design have shown promising results, automatically localizing bugs in large-scale, project-level processor designs remains challenging.
In this paper, we present BluesFL, a novel block-level LLM-based fault localization framework for processor designs.
Inspired by the way engineers debug processors, we first propose a dataflow-based code blockization approach to guide LLMs to focus on critical local code context.
We further propose a Block-level Instruction-Oriented Slicing (Blues) algorithm that enables LLMs to mimic human reasoning by analyzing instruction execution paths and processor states.
We evaluate BluesFL on a real-world RISC-V processor core comprising 19K lines of SystemVerilog code.
Experimental results demonstrate that BluesFL correctly localizes 24 bugs at Top-1, achieving 242.9% improvement over the existing state-of-the-art (7 bugs). Cost analysis shows that BluesFL requires an average of only $0.257 to localize a single bug.
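The reported gain can be checked from the two Top-1 counts quoted in the abstract. The sketch below is only that arithmetic; the function name `relative_improvement` is illustrative and does not come from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Top-1 counts quoted in the abstract: 24 bugs (BluesFL) vs. 7 bugs (prior work).
gain = relative_improvement(24, 7)
print(f"{gain:.1f}%")  # prints "242.9%"
```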
People
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionLong-term data remanence in SRAMs can pose serious security risks when discarded ICs retain sensitive information. Unlike DRAM and Flash memories, SRAMs have been largely overlooked in this field due to their very short retention periods. Prior work demonstrated that aging-induced imprints enable partial data recovery in SRAMs, but only when recorded initial power-up states are available. In this paper, we propose a recovery approach that eliminates this requirement by reconstructing the initial states through controlled aging. By analyzing the reconstructed and aged power-up states, we demonstrate near-complete data recovery from SRAM chips after 12 hours of controlled aging using only 32 copies.
Engineering Presentation
Design
EDA
DescriptionWe propose a tool that automates analog design sizing at scale in an industrial setting and meets sign-off quality. Several academic papers address analog circuit optimization, but most operate under academic assumptions that keep them from being viable in an industrial setting, where there are complex device models, PVT and mismatch (MC) simulations, and complex topologies with a large number of design variables and specifications that must be met in a given priority order.
The proposed tool addresses all these challenges and is proven within our company to be a "Real" Analog Design Optimization tool.
We use Deep Reinforcement Learning and unique reward-shaping algorithms to tune device parameters so that they meet the design specifications provided by the designers. The tool optimizes the design across all PVT (Process, Voltage, Temperature) corners and Mismatch (Monte Carlo) corners to produce an optimized circuit that is indistinguishable from a manually fine-tuned circuit, except that it does this much faster and arrives at the best possible solution (at least as good as the designer's). It does this within a practical timeframe while being computationally economical.
At TI, the solution is widely deployed, and 100+ analog circuits/blocks have been optimized with it.
Work in Progress
DescriptionClient-side deduplication can reduce redundant data transfer, but its duplicate check may expose the file existence status to attackers. Existing software-only defenses provide limited protection and impose substantial computation overhead. Accordingly, we present RSCD, a Raptor-code and SGX Co-Designed framework that mitigates side-channel leakage in client-side deduplication. RSCD employs sparse XOR-based Raptor coding to obfuscate deduplication patterns with low overhead and uses a dirty-chunk detector protected by an SGX enclave to identify probing through feature analysis. Our formal analysis shows near-optimal privacy, and the experimental results demonstrate that RSCD consistently achieves lower latency, higher throughput, and higher attack-detection accuracy than SOTA defenses.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionExisting work on large language model (LLM) decomposition mainly focuses on improving performance on downstream tasks, but ignores the poor parallel inference performance that arises when scaling up the model size. To mitigate this performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It combines multiple optimizations to maximize performance while remaining compatible with state-of-the-art optimization techniques. Extensive experiments evaluate DeInfer's performance, and the results demonstrate its superiority, suggesting that it can greatly facilitate the parallel inference of decomposed LLMs.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionThis paper presents a new approach for minimising Boolean circuits subject to delay constraints.
The proposed approach extends the efficient but purely area-based Boolean circuit minimiser eSLIM.
The eSLIM minimiser reduces circuits by iteratively computing optimal replacements for small subcircuits on the fly.
We extend the SAT encoding used for synthesising these replacements to take delay into account.
While the additional constraints on delay restrict the set of possible candidates for the replacement, we can still harness the flexibility of making use of Boolean relations and multi-output circuits.
We implemented the proposed method as part of the industrial-strength tool ABC.
Surprisingly, in an experimental evaluation our implementation showed only a rather small deterioration in terms of area improvement, but substantial improvements in terms of delay compared to the purely area-based eSLIM approach on different benchmark sets.
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionThe high energy cost of video AIoT systems stems from redundant operations across sensing, transmission, and computation. While prior work optimizes individual stages, the lack of cross-stage coordination forces each stage to re-detect redundancy, limiting overall efficiency. We present DeltaSight, an algorithm–hardware co-design architecture that establishes a sensor-side unified redundancy criterion serving as the common basis for redundancy elimination throughout the pipeline. Algorithmically, we generate this criterion via sensor-side block-level redundancy detection whose output matches the granularity of downstream computation, complemented by a semantic-aware sampling strategy that adapts precision to task relevance. An architecture is designed to support this algorithm with minimal hardware additions. DeltaSight gains 2.4x sensor-side and 1.7x end-to-end energy efficiency, with slight accuracy improvements.
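To make the idea of block-level redundancy detection concrete, the sketch below flags frame blocks that barely changed since the previous frame, so that later stages could skip them. It uses plain frame differencing with a mean-absolute-difference threshold, which is a generic stand-in and not necessarily the criterion DeltaSight uses; the function name, block size, and threshold are all assumptions for illustration.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame as rows of pixel intensities


def redundant_blocks(prev: Frame, curr: Frame, block: int,
                     threshold: float) -> List[List[bool]]:
    """Return a per-block mask: True where a block changed little since `prev`.

    Hypothetical criterion: mean absolute pixel difference below `threshold`.
    """
    rows, cols = len(curr), len(curr[0])
    mask = []
    for by in range(0, rows, block):
        mask_row = []
        for bx in range(0, cols, block):
            diffs = [
                abs(curr[y][x] - prev[y][x])
                for y in range(by, min(by + block, rows))
                for x in range(bx, min(bx + block, cols))
            ]
            mask_row.append(sum(diffs) / len(diffs) < threshold)
        mask.append(mask_row)
    return mask


prev = [[10] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[0][0] = 200  # only the top-left 2x2 block changes
mask = redundant_blocks(prev, curr, block=2, threshold=5.0)
print(mask)  # [[False, True], [True, True]]
```

Choosing the block size to match the tile size of the downstream computation is what would let one mask be reused across sensing, transmission, and compute.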
People
Keynote
AI
DescriptionAI is emerging as the foundational capability that defines the user experience and is reshaping the semiconductor industry into a high-volume, multi-market business spanning an unprecedented diversity of power envelopes and form factors, from sub-5W battery-powered devices like AI pins to 500W cloud servers. This transition is also driving distributed, hybrid AI across personal devices, edge, and cloud, enabling personal AI systems that deliver low latency, improved privacy, and higher reliability, which are essential benefits of Edge AI.
To design silicon products that address the diverse AI workloads, power budgets, and form factors of these markets, we can leverage a common, modular IP library, including CPU, GPU, AI accelerators, connectivity, and security. However, the current process of adapting each IP and integrating it into a new SoC demands significant human effort. To minimize the NRE, we need tools that can support and automate the process, enabling engineering teams to quickly pivot and assemble SoCs that meet the full spectrum of needs, from very low power to very high performance.
Moreover, we face additional challenges with the slowing of Moore's law and the rise of chiplets. This necessitates new tools for 3DIC floor planning, package design, and thermal design.
During the talk, I will focus on the gaps between the current state of EDA tools and the needs of the semiconductor industry to leverage common IP across diverse products. I will also share perspectives on approaches that help to close these gaps.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionRigid-flex printed circuit boards present unique verification challenges that conventional Design Rule Check (DRC) tools fail to address, resulting in significant first-pass manufacturing failure rates and substantial respin costs per design. The fundamental problem stems from heterogeneous stackup transitions where layer count, material properties, and physical constraints vary spatially across the board—a scenario traditional 2D geometric DRC cannot comprehend.
We present a novel context-aware verification framework that integrates region-dependent rule evaluation with programmable custom DRC. Our approach combines enhanced constraint management providing real-time, location-aware design guidance with RAVEL-based custom DRC implementing sophisticated multi-domain rules that validate electrical, mechanical, and manufacturing constraints simultaneously across varying stackup configurations.
Experimental validation on multiple designs demonstrates substantial improvement in violation detection compared to standard DRC, identifying critical inter-layer issues including routing on non-existent layers, invalid via structures in flex zones, and impedance discontinuities at material transitions. Deployment results show dramatic improvements in first-pass success rates, significant reductions in verification time, and considerable project cost savings with rapid payback periods.
Keywords: Design rule checking, rigid-flex, rigid-flex PCB, heterogeneous stackup verification, RAVEL, constraint management, inter-layer validation
Engineering Presentation
AI
EDA
Systems
DescriptionAccurate Die-Size estimation during the design specification stage is a pivotal factor in determining the cost, manufacturability, and market competitiveness of modern Automotive System-on-Chip (SoC) products. Traditional estimation approaches, which rely on linear models and heuristics, often fall short due to the increasing complexity and diversity of SoC architectures, especially as they integrate numerous IP blocks with varying requirements. This presentation introduces a machine learning (ML) framework, leveraging Random Forest algorithms, to address these challenges by learning from historical project data - including RTL structure, hard macro specifications, architectural parameters, and physical design metrics (PNR data). The proposed workflow encompasses data collection, feature engineering, model training, and prediction phases, enabling module-level area estimation and aggregation to the full SoC die-size. Empirical results, based on data from three completed SoC projects and over 1800 unique RTL modules, demonstrate the framework's robustness, achieving an R² accuracy of up to 0.95 and a mean absolute percentage error (MAPE) of 16%. This ML-based approach aims to empower chip architects and system design engineers to perform high-confidence, early die-size planning, facilitating design exploration, area recovery analysis, and informed decision-making for competitive product development.
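The two reported metrics, R² and MAPE, can be computed as in the sketch below. The module areas are made-up numbers, not data from the presentation, and summing per-module predictions into a die-size estimate is a simplification assumed here (it ignores utilization, I/O ring, and routing overheads).

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical per-module areas in mm^2 (illustrative only).
actual_area = [1.0, 2.0, 4.0, 8.0]
predicted_area = [1.1, 1.8, 4.4, 7.6]

print(f"MAPE = {mape(actual_area, predicted_area):.2f}%")
print(f"R^2  = {r_squared(actual_area, predicted_area):.3f}")
# Simplified aggregation: full-chip estimate as the sum of module estimates.
print(f"Estimated total = {sum(predicted_area):.1f} mm^2")
```

Note that a high R² and a double-digit MAPE can coexist: R² is dominated by the large modules, while MAPE weights every module's relative error equally.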
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionWith the rapidly growing demand for cloud computing and large-scale AI models, many-core systems face the challenge of longer global interconnect distances in the Network-on-Chip (NoC). Although conventional 2D NoCs can use relatively high metal layers for global routing, the extensive repeater insertion needed for long-distance transmission causes a significant amount of via stacking, leading to performance degradation. Targeting high-performance CPU clusters at the 3nm node, this work adopts a design–technology co-optimization (DTCO) framework to evaluate long-distance NoC interconnects across four implementation schemes: conventional frontside 2D (F2D), frontside 3D (F3D) with M3D integration, F2D with backside power delivery network (BSPDN), and backside 2D (B2D) leveraging wafer backside signal routing & PDN. Based on post-layout extraction of the ARM Neoverse CSS N2 computing tile, we incorporate realistic PDN characteristics, technology-dependent RC modeling, and IR-drop-aware circuit simulation. Results show that F3D and B2D reduce delay by 53% and 68%, and energy–delay product (EDP) by 32% and 63%, respectively, compared with F2D. F2D-BSPDN achieves performance comparable to F3D. System-level NoC evaluations further demonstrate that F3D/B2D enable feasible link frequencies 2.1×/3.1× those of F2D and lower average NoC latency by 23%/35% relative to F2D. The DTCO analysis indicates that while F2D remains adequate for small cores (Cortex-A76), B2D is the optimal choice for mid-core (Cortex-A720, X4) and large-core (Cortex-X925, CSS N2) clusters, with F3D providing secondary benefits through repeater relocation. These findings identify backside interconnect as
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern PCB designs contain thousands of signal nets operating up to multi-GHz frequencies, coupled with components requiring tight supply tolerances and increasing power demands through complex power distribution networks. These designs require comprehensive signal and power integrity (SIPI) verification to achieve first-pass design success and minimize board iterations. However, traditional SIPI workflows create significant bottlenecks, requiring extensive electromagnetic (EM) extraction and simulation setup expertise and manual intervention across multiple tool interfaces. Conventional analysis consumes weeks to months per design iteration while limiting verification to specialized engineers—a critical constraint for organizations.
This work presents an integrated SIPI verification framework that automates the complete workflow from EM extraction through circuit simulation to performance evaluation. The methodology introduces: (1) automated mesh generation and solver configuration reducing setup time from hours to minutes, (2) template-based circuit testbenches enabling seamless SPICE and IBIS model integration with extracted parasitic networks, and (3) automated post-processing with customizable pass/fail criteria and standardized reporting templates.
Validation across several production board designs demonstrates a reduction in revision time from 24% to 3% while achieving a 100% first-pass prototype success rate. This framework enables engineers with basic SIPI knowledge to perform comprehensive verification and optimization, expanding team capacity and design coverage.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAheadComputing is developing a high-performance RISC-V core where delivering a bug-free memory subsystem is a top verification priority. The memory page walker (PW) is especially critical because it must conform to the open RISC-V Privileged Architecture specification while also meeting design-specific requirements.
This submission presents a design-for-FV methodology applied early to enforce a clean architectural boundary that improves formal reachability and convergence, enabling shift-left bug discovery. In this case, all design-specific features are isolated in wrapper RTL modules, while the page walker remains a pure implementation of the RISC-V spec. From the earliest stages, the FV team can build a SystemVerilog reference model for the PW as an independent implementation of the same architectural specification that the RTL implements. Embedded SVA checkers in the reference model enable rapid formal closure, exposing RTL bugs and spec-translation issues that are difficult to catch with simulation alone. At higher integration levels, proof scalability is further improved by substituting the PW RTL with a PW abstraction model, a simplified reference model that preserves architectural behavior while reducing formal complexity.
In a short case study, we show how this flow drove interface and micro-architecture refinements, improved maintainability, and accelerated confidence in spec-compliance for a critical RISC-V memory component.
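As an illustration of what a spec-only page-walker reference model captures, the sketch below follows the Sv39 translation steps of the RISC-V Privileged Architecture in Python. It is not the authors' SystemVerilog model, and the abstract does not say which translation mode the core uses; Sv39 is assumed here. Permission checks, A/D-bit updates, the canonical-address check, and PMP are deliberately left out, and the names `sv39_walk`, `PageFault`, and the dictionary-based memory are hypothetical.

```python
PAGE_SHIFT = 12          # 4 KiB pages
PTE_SIZE = 8             # bytes per Sv39 page-table entry
LEVELS = 3               # Sv39 uses a three-level table
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


class PageFault(Exception):
    pass


def sv39_walk(mem: dict, root_ppn: int, va: int) -> int:
    """Translate `va` to a physical address.

    `mem` maps the physical address of a PTE to its 64-bit value
    (missing entries read as 0, i.e. invalid).
    """
    vpn = [(va >> (PAGE_SHIFT + 9 * i)) & 0x1FF for i in range(LEVELS)]
    table = root_ppn << PAGE_SHIFT
    for level in range(LEVELS - 1, -1, -1):
        pte = mem.get(table + vpn[level] * PTE_SIZE, 0)
        # Invalid entry, or the reserved encoding W=1 with R=0.
        if not (pte & PTE_V) or ((pte & PTE_W) and not (pte & PTE_R)):
            raise PageFault(f"invalid PTE at level {level}")
        ppn = (pte >> 10) & ((1 << 44) - 1)
        if pte & (PTE_R | PTE_X):                 # leaf PTE
            low_mask = (1 << (9 * level)) - 1     # covers PPN[level-1:0]
            if ppn & low_mask:
                raise PageFault("misaligned superpage")
            # For superpages, the low PPN fields come from the VA's VPN fields.
            low_vpn = (va >> PAGE_SHIFT) & low_mask
            return ((ppn | low_vpn) << PAGE_SHIFT) | (va & 0xFFF)
        table = ppn << PAGE_SHIFT                 # pointer to next level
    raise PageFault("no leaf PTE found")


# Tiny three-level mapping: VPN[2]=1, VPN[1]=2, VPN[0]=3, offset 0xABC.
va = (1 << 30) | (2 << 21) | (3 << 12) | 0xABC
mem = {
    0x80000000 + 1 * 8: (0x80001 << 10) | PTE_V,
    0x80001000 + 2 * 8: (0x80002 << 10) | PTE_V,
    0x80002000 + 3 * 8: (0x12345 << 10) | PTE_V | PTE_R | PTE_W,
}
print(hex(sv39_walk(mem, 0x80000, va)))  # 0x12345abc
```

Keeping the model this close to the specification text is what makes it useful as an independent check: design-specific behavior lives elsewhere, so any mismatch points at either an RTL bug or a misreading of the spec.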
Engineering Presentation
Design
EDA
Security
Systems
DescriptionEdge AI is increasingly popular in applications requiring real-time decision making and autonomous operation. Different from NPUs for cloud platforms, edge AI processors can be made application-specific. By tuning their ISA and memory architecture to the network models required by the application, power consumption and silicon area are drastically reduced.
Tools for application-specific instruction-set processors (ASIPs) can be used to design custom NPUs for edge AI. We present the design of "SmarT", an ASIP with a RISC-V ISA augmented with specialized vector units for convolutions and quantization, with 64 MACs. It supports circular gather/scatter addressing of vector data in parallel with computations. Low-overhead DMA moves data blocks from external to local memory.
ASIP tools enable a software path from TensorFlow using LiteRT. We optimized selected LiteRT kernels in conjunction with the processor architecture. SmarT uses only 200K gates while delivering 100 GMAC/s, making it well suited to many low-power sensor, audio, and video applications.
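As a back-of-the-envelope check, the quoted throughput and MAC count imply a clock rate if one assumes that 100 GMAC/s is a peak figure with all 64 MACs busy every cycle. That assumption is not stated in the abstract, which gives no clock frequency, so the result below is only indicative.

```python
# Figures quoted in the abstract.
macs_per_cycle = 64
throughput_mac_per_s = 100e9

# Assumption: peak throughput = MAC units x clock frequency.
implied_clock_hz = throughput_mac_per_s / macs_per_cycle
print(f"{implied_clock_hz / 1e9:.4f} GHz")  # prints "1.5625 GHz"
```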
Engineering Special Session
AI
Chiplet
Design
EDA
Systems
DescriptionAs chiplet architectures, 3D integration, and HBM‑class systems push thermal‑mechanical‑electrical coupling into the critical path, engineering teams are discovering that traditional simulation workflows—built around manual setup, sparse sampling, and human‑driven iteration—cannot scale to manufacturing‑resolution design. In this regime, nondeterminism, approximation, and workflow variability become failure modes rather than accelerants.
This session examines what it takes to make physics‑based AI viable for production engineering workflows, where deterministic execution, solver‑accurate validation, and reproducible physics reasoning are essential. We will explore what breaks when probabilistic or approximate methods are applied to high‑stakes physical design, the technical and operational criteria required for trustable results, and which elements of chips‑to‑systems workflows can be automated once physics reasoning operates continuously at machine scale. Drawing on semiconductor, advanced packaging, thermal, and multiphysics domains, a panel of industry and academic experts will discuss how deterministic, physics‑grounded AI enables broader design exploration within real development timelines.
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionGate-level Hardware Trojan (HT) detection using Graph Neural Networks (GNNs) often suffers from limited accuracy due to its reliance on structure-only features and fragmented node-level predictions. We propose a GNN-based framework, DeTrojan, to overcome these challenges through three key innovations:
(1) functionality-aware feature engineering,
(2) an edge-aware GNN architecture with Jumping Knowledge and global context aggregation, and
(3) accurate Trojan circuit localization with community-aware classification refinement.
Evaluated on the 2025 ICCAD CAD Contest Hardware Trojan benchmarks, DeTrojan achieves a 16.7% improvement in circuit-level prediction accuracy and a 47.3% relative increase in gate-level F1-score over state-of-the-art machine learning (ML)-based methods. Furthermore, incorporating our proposed features and refinement modules into existing approaches yields up to a 16.6% gain in accuracy and a 2× improvement in F1-score, demonstrating the effectiveness and generality of our framework.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionChip development timelines are often limited by device (transistor) readiness. When devices miss targets, schedules slip by additional shuttle fabrication runs. This risk is higher when teams rely on newly characterized devices in a given process, which forces process and device engineers to tune measured current-voltage (I-V) and capacitance-voltage (C-V) behavior by sweeping split parameters of device structure. Currently, this split-parameter search is largely trial-and-error: each iteration takes months, only a small fraction of the split space is measured, and design teams often run circuit simulations based on device data from non-optimized structures. We present an explainable machine-learning framework that learns from sparse tapeout measurements to predict measured LDMOS I-V/C-V curves directly from split parameters. Data augmentation and ensemble modeling improve robustness in the low-data regime. Compared with commonly used modeling baselines, the framework achieves very high accuracy (low mean relative error) while providing interpretable attributions across operating regions to identify which split parameters drive key behaviors. The trained model then screens up to billions of candidate split parameters and supports inverse design to recommend parameters that meet user-defined device performance, such as breakdown voltage. Overall, the approach converts limited silicon data into decision-ready guidance that accelerates split exploration, reduces shuttle iterations, and shortens the silicon-to-design cycle.
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
DescriptionAutonomous driving systems increasingly rely on deep neural network (DNN)-based multi-task perception models for reliable, real-time scene understanding. At nanoscale technology nodes, these workloads are highly susceptible to timing errors arising from temperature fluctuations, voltage droop, and device aging. Among these, temperature poses a critical challenge: prolonged high thermal stress exacerbates delay faults, degrading perception accuracy and endangering safety-critical operation.
We present DFA DRIVE, a cross-layer Delay Fault Analysis Framework for Autonomous Driving that bridges circuit-level timing analysis with system-level resilience evaluation. DFA DRIVE quantifies how temperature-induced timing failures propagate through object detection, drivable area segmentation, and lane line segmentation, exposing task-level reliability bottlenecks.
Building on this foundation, we introduce DFA-OPT, an adaptive DNN hardware mapping algorithm that dynamically reassigns systolic-array resources based on layer-level and application-level thermal sensitivity. Targeting the automotive reliability envelopes of AEC-Q100 Grade 0 (–40 °C to 150 °C) and Grade 1 (–40 °C to 125 °C), DFA-OPT restores the near-baseline accuracy of small, high-reliability systolic arrays (e.g., 4×4) even when large systolic arrays (e.g., 256×256) experience accuracy drops of up to 4% at 150 °C, achieving comparable accuracy with up to 92% fewer computation cycles.
Engineering Presentation
EDA
DescriptionAs semiconductor designs scale in complexity, Design-for-Test (DFT) has become a critical bottleneck, often delayed until the post-RTL freeze stages. This paper introduces a novel "DFT-Ready Design" methodology using SOC Canvas to bridge the chronic gap between system design and DFT. Unlike conventional flows where DFT engineers rely on fragmented pre-DFT information, our approach captures system-level functional intent—such as power, clock, and I/O configurations—via an intuitive GUI at the earliest architectural stages. SOC Canvas then automatically generates IEEE 1687-compliant hardware structures and integrated control logic, ensuring total data consistency across RTL, synthesis, and DFT. We validated this methodology in a large-scale AI accelerator project, successfully reducing the total design cycle by 25% while eliminating the manual iterations typically required for DFT logic insertion. By shifting DFT responsibilities to the left and automating the generation of test-control infrastructures, this work provides a scalable foundation for modern SoC and emerging chiplet ecosystems, ensuring that designs are inherently "DFT-ready" before synthesis begins.
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionDynamic-iterative aggregation (DIA) in graph neural networks updates node states sequentially and asynchronously. This expands the receptive field without adding layers and improves accuracy-per-operation versus layer-synchronous models. However, DIA introduces fine-grained serial dependencies and highly irregular sparse traffic that undermine conventional SIMD/accelerator designs. We present DIA-CIM, a 28-nm compute-in-memory (CIM) macro co-designed for DIA. DIA-CIM employs a CSR-driven, output-stationary dataflow to keep partial sums local while streaming edges, and a sparsity-priority BF16 pipeline that exploits bit/value sparsity. Fabricated in 28-nm CMOS, DIA-CIM reaches 60.01 TFLOPS/W. On representative DIA workloads, it delivers >3.56× lower energy and >2.32× lower latency.
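As a rough software analogy for the CSR-driven, output-stationary dataflow mentioned above, the sketch below keeps one accumulator per output node while that node's edges are streamed from CSR arrays. The function name `csr_aggregate`, the toy graph, and the zero-weight skip are illustrative assumptions; the sketch does not model DIA's asynchronous updates, BF16 arithmetic, or the DIA-CIM circuit.

```python
def csr_aggregate(indptr, indices, values, features):
    """Compute out[i] = sum_k values[k] * features[indices[k]] over row i's edges.

    Hypothetical helper: the accumulator for one output row stays local
    (output-stationary) while the edges of that row are streamed in CSR order.
    """
    dim = len(features[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0.0] * dim  # partial sum for this output node only
        for k in range(indptr[row], indptr[row + 1]):
            w = values[k]
            if w == 0.0:
                continue  # value-sparsity skip: zero weights cost no multiply
            src = features[indices[k]]
            for d in range(dim):
                acc[d] += w * src[d]
        out.append(acc)
    return out


# Toy 3-node graph in CSR form (assumed data, for illustration only).
indptr = [0, 2, 3, 5]
indices = [1, 2, 0, 0, 1]
values = [1.0, 0.5, 2.0, 0.0, 1.0]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
aggregated = csr_aggregate(indptr, indices, values, features)
```

Each output row is written exactly once, which is the property an output-stationary schedule relies on to avoid moving partial sums.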
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionPost-layout optimization is a critical step in modern chip design. However, existing Machine Learning (ML)-assisted methods struggle to capture the cross-stage reasoning dependencies that underlie the optimization challenges. While large language models (LLMs) excel at semantic reasoning, the lack of structural and physical awareness limits their effectiveness in post-layout optimization. To address these limitations, we propose DiffDEG, a diffusion-enhanced, reasoning-aware foundation model that bridges LLM-based semantic reasoning with circuit-level structural and physical representations. DiffDEG reformulates the conventional timing graph into a Design Evolution Graph (DEG), enabling cross-stage reasoning through text-annotated netlists. By leveraging directional diffusion and self-supervised pretraining, DiffDEG jointly interprets semantic, timing, and physical information, forming a unified representation adaptable to diverse optimization tasks. Experimental results show that DiffDEG consistently enhances optimization outcomes across multiple optimization paradigms, achieving an average 12.5% performance improvement and 4x runtime speedup over commercial tools.
People
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionFill insertion needs to consider not only density uniformity but also delay. Existing algorithms reduce such delay implicitly by minimizing proxies (fill amounts, overlays between fills, etc.), but there remains a misalignment between the reduction of proxies and the improvement of signal delay. We propose DiffFill, a novel differentiable framework for fill insertion that explicitly optimizes both uniformity and delay. At the heart of DiffFill is the CapFormer, a Transformer-based capacitance extractor that estimates capacitance values used to construct a differentiable delay objective based on the Elmore delay formulation. DiffFill significantly outperforms the state-of-the-art methods.
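For readers unfamiliar with the Elmore delay formulation referenced above, here is a minimal sketch of the standard computation on an RC tree: the delay to a node is the sum, over the edges on its path from the driver, of edge resistance times downstream capacitance. The function name `elmore_delays`, the tree encoding, and the numbers are assumptions for illustration; this is not DiffFill's objective or the CapFormer model.

```python
def elmore_delays(parent, resistance, capacitance):
    """Elmore delay from the root (node 0) to every node of an RC tree.

    parent[i] is the parent of node i (root has -1); nodes are numbered so
    that parent[i] < i. resistance[i] is the resistance of edge parent[i]->i
    and capacitance[i] is the capacitance lumped at node i.
    """
    n = len(parent)
    downstream = list(capacitance)
    for i in range(n - 1, 0, -1):  # accumulate subtree capacitance bottom-up
        downstream[parent[i]] += downstream[i]
    delay = [0.0] * n
    for i in range(1, n):  # propagate delay top-down
        delay[i] = delay[parent[i]] + resistance[i] * downstream[i]
    return delay


# Assumed toy net: 0 -> 1, then 1 -> 2 and 1 -> 3 (arbitrary units).
parent = [-1, 0, 1, 1]
resistance = [0.0, 2.0, 1.0, 3.0]
capacitance = [0.0, 1.0, 2.0, 1.0]
base = elmore_delays(parent, resistance, capacitance)

# Extra coupling capacitance at node 2 (e.g. from a nearby fill) raises delay.
with_fill = elmore_delays(parent, resistance, [0.0, 1.0, 2.5, 1.0])
```

Because the result is a sum of products of resistances and capacitances, it is differentiable with respect to each capacitance, which is what makes it usable as a gradient-based objective.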
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSoC flat IR/EM signoff is generally run over multiple cycles and typically takes 2+ days to cover a single scenario. Coverage is therefore limited, and the flow is expensive, since even small ECO fixes trigger a full repeat of the analysis with the same resources. In addition, for designs with multiple instantiations of the same hierarchical block, flat analysis is massively computationally redundant.
Reduced Order Model (ROM) Flow: Hierarchical Abstraction for IR/EM Signoff:
ROM eliminates computational redundancy by using abstract representations of pre-verified blocks. It uses a tweaked SoC flat analysis that retains the appropriate block-level details, enabling 10-20× faster SoC turnaround and broader scenario coverage. Its mechanism is as follows:
The Common Connection Layer (CCL) acts as the electrical boundary between the block and the SoC top. ROM preserves full detail only at the CCL and CCL-1, while the lower metal layers (M0 to CCL-2) are rolled up into an equivalent impedance model to maintain signoff accuracy.
Designers can mix detailed instances of a critical block with reduced instances of the same block to optimize resource usage, as shown in Fig 1.
The Validation Problem with ROM (Trust Gap):
Context Mismatch: ROMs are generated in standalone conditions, failing to account for top-level grid impedance and adjacent block coupling.
Fidelity & Coverage Loss: Abstracting 12-14 layers can mask local voltage violations. Current manual spot-checks are insufficient because they fail to quantify whether CCL node voltages in all ROM instances match their power-domain-specific and scenario-specific simulation values.
Objective of this work:
Systematic validation across all ROM instances and all power domains that is quick (wall time of ~minutes for an SoC); otherwise it would offset the ROM runtime benefits.
Quantitative fidelity metrics with low violation thresholds and spatial coverage for debug, to understand the root cause of localised errors.
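One way to picture the kind of per-instance fidelity check described in the objectives is sketched below: CCL node voltages from a ROM run are compared with reference values for each (instance, power domain) pair. The function `ccl_fidelity`, the dictionary layout, the 1 mV tolerance, and the sample values are all hypothetical and are not taken from the presented flow.

```python
def ccl_fidelity(reference_mv, rom_mv, tolerance_mv=1.0):
    """Compare CCL node voltages (mV) per (instance, power_domain) key.

    Both inputs map (instance, domain) -> {node_name: voltage_mV}.
    Returns, per key, the worst absolute error, the nodes above tolerance,
    and the reference nodes that the ROM run did not report.
    """
    report = {}
    for key, ref_nodes in reference_mv.items():
        rom_nodes = rom_mv.get(key, {})
        errors = {n: abs(rom_nodes[n] - v)
                  for n, v in ref_nodes.items() if n in rom_nodes}
        report[key] = {
            "max_error_mv": max(errors.values(), default=0.0),
            "violations": sorted(n for n, e in errors.items() if e > tolerance_mv),
            "missing": sorted(n for n in ref_nodes if n not in rom_nodes),
        }
    return report


# Assumed sample data: two instances of one block in the same power domain.
reference = {("blkA_0", "VDD"): {"n1": 750.0, "n2": 748.0},
             ("blkA_1", "VDD"): {"n1": 749.0}}
rom = {("blkA_0", "VDD"): {"n1": 750.5, "n2": 745.0},
       ("blkA_1", "VDD"): {}}
fidelity = ccl_fidelity(reference, rom)
```

Reporting missing nodes separately from violations keeps a coverage gap from being mistaken for a clean result.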
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionAnalog placement with compact representations is traditionally solved using heuristic or simulated annealing methods that are difficult to integrate with a differentiable optimization engine.
This paper introduces DiffSP, a differentiable sequence-pair-based analog placement method that bridges discrete combinatorial representation with continuous gradient optimization.
We derive a smooth relaxation of the sequence pair constraint graph via Gumbel–Sinkhorn relaxation, which allows area, wirelength, and symmetry objectives to be jointly optimized via automatic gradient calculation.
A MILP-based legalization stage then enforces exact geometric and symmetry constraints.
Experiments on industrial-level OTA benchmarks show that DiffSP achieves better placement quality and post-layout performance metrics than state-of-the-art analog placers with significantly reduced runtime.
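The Gumbel–Sinkhorn relaxation named above can be sketched generically: Gumbel noise is added to a score matrix, and alternating row and column normalizations in the log domain push it toward a doubly stochastic matrix, a continuous stand-in for a permutation. The function names, temperature, and scores below are assumptions; how DiffSP maps this onto the sequence-pair constraint graph is not described in the abstract and is not modelled here.

```python
import math
import random


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def gumbel_sinkhorn(scores, tau=1.0, n_iters=50, noise=1.0, rng=None):
    """Relax an n x n score matrix to an approximately doubly stochastic one."""
    rng = rng or random.Random(0)
    n = len(scores)

    def gumbel():
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        return -math.log(-math.log(u))

    log_p = [[(scores[i][j] + noise * gumbel()) / tau for j in range(n)]
             for i in range(n)]
    for _ in range(n_iters):
        for i in range(n):  # row normalization
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        for j in range(n):  # column normalization
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]


# Assumed scores that favour the identity permutation.
scores = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
soft_perm = gumbel_sinkhorn(scores, tau=0.5, noise=0.0)
noisy_perm = gumbel_sinkhorn(scores, tau=0.5, noise=1.0, rng=random.Random(0))
```

Lowering `tau` sharpens the output toward a hard permutation, while the noise term lets repeated runs sample different orderings.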
People
Engineering Special Session
AI
Design
EDA
DescriptionThis talk explores the concept of "Digital Inside Analog" through the lens of the continuous-time pipelined analog-to-digital converter (ADC). This sophisticated architecture is an example where digital signal processing works intimately within the analog signal path to achieve orders-of-magnitude improvement in performance for the same power dissipation. Unlike traditional discrete-time pipelines, the continuous-time approach eliminates power-hungry front-end sample-and-hold amplifiers while providing inherent anti-aliasing filtering. We delve into the unique challenges of this architecture, and how analog imperfections are mitigated using "digital-inside" calibration. By leveraging digital assistance, the continuous-time pipeline achieves lower power dissipation, higher linearity, and lower noise than a traditional signal chain that consists of an anti-aliasing filter followed by an ADC. Measurements from prototype lowpass and bandpass CTP ICs will be given.
People
Engineering Special Session
AI
Design
EDA
DescriptionOver the past two decades, digital logic density has increased by more than three orders of magnitude, while analog metrics such as gain, linearity, and noise have improved only incrementally. Digital calibration and compensation can cut analog design margins by up to 50%, enabling smaller and more power-efficient circuits, while techniques such as dynamic voltage and frequency scaling, algorithmic noise shaping, and digital redundancy consistently deliver order‑of‑magnitude energy savings over analog-only solutions. Furthermore, digital subsystems inherently support in‑situ testability, with scan-based and BIST methods achieving production fault coverage exceeding 99%. Reflecting this shift, modern SoCs are now more than 80% digital—even in RF, clocking, and sensor interfaces. This talk explores how digitally assisted analog techniques are reshaping mixed‑signal design, enabling scalable, low-power, and highly testable solutions for next‑generation systems.
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionCo-optimizing timing and power in modern VLSI designs remains challenging under realistic static timing analysis and standard-cell libraries. Classical gate sizing often scales poorly, while learning-based sizers behave as expensive black boxes with limited generality. Recent differentiable physical optimization enables gradient-based design flows, but existing approaches still struggle to stay aligned with library-based implementations and to provide controlled timing–power trade-offs. We propose a library-native quad-gradient gate sizing framework that leverages differentiable timing to derive structured guidance for timing and power, enabling more systematic and interpretable co-optimization in the standard-cell sizing space. On the ICCAD 2025 contest benchmarks, our framework achieves, on average, a 40.4 percentage-point (%pt) larger reduction in TNS and a 16.2 %pt better total power change than the 1st-place contest flow.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionWith the rapid growth of generative models like diffusion models (DMs), distinguishing authentic images from forged ones has become increasingly challenging, raising privacy, security, and ethical concerns. Watermarking offers an effective solution for authentication and traceability of generated images. Unlike traditional methods, emerging watermarking techniques for DMs enable marking AI-generated content with resilience to commonly used watermark weakening or erasing techniques. However, these methods demand high computational resources and latency, posing challenges for practical use, especially on edge devices. This work presents DM-MARK, a software-hardware co-optimized diffusion framework supporting efficient watermark generation and reverse detection. DM-MARK is implemented in 12nm FinFET technology, achieving robust watermarking with improved quality and reduced overhead. Evaluations show 11.33% higher detection accuracy, 18× latency speedup over GPU, 3.1× on-chip memory savings, and an average 4.56× EMA reduction. It also achieves 212.5× speedup over the baseline ASIC design with negligible accuracy loss. The proposed DM-MARK scheme offers a scalable and practical solution for protecting AI-generated content in real time on edge devices.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionDetailed routing, despite its long history of study, is considered one of the most challenging problems in Electronic Design Automation, due to complex design rules and enormous scale. In this work, we propose a versatile maze routing algorithm to deal with the various challenges in detailed routing. By introducing a hybrid grid graph, our maze routing algorithm can create both on-track and off-track wires during path search. By adopting a scalable track-based resource model and techniques like adaptive grid graph sparsification, it can handle both large guide-based and small region-based grid graphs. Moreover, we also propose a simple and effective rip-up and reroute strategy. As a result, we achieve design-rule-violation-free results on most designs (9/10) in the ISPD 2018 detailed routing benchmarks, with a 2.5% better score, 4.6% fewer vias, and 14.7% shorter runtime on average, and significantly lower non-preferred usage, compared with the state-of-the-art approaches.
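As background for the maze routing discussed above, the following is a minimal sketch of classic cost-based maze routing: a Dijkstra search over a multi-layer on-track grid, with extra cost for vias and for wires in a layer's non-preferred direction. The function name, cost values, and layer-direction convention are assumptions; the hybrid grid graph, off-track wires, sparsification, and rip-up and reroute strategy of the presented work are not modelled.

```python
import heapq


def maze_route(width, height, layers, blocked, src, dst,
               via_cost=2, wrong_way_cost=3):
    """Cheapest path between grid nodes (x, y, layer), or (None, None).

    Assumed convention: even layers prefer horizontal wires, odd layers
    prefer vertical wires; a preferred-direction step costs 1.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue  # stale heap entry
        x, y, z = node
        moves = [((x + 1, y, z), "h"), ((x - 1, y, z), "h"),
                 ((x, y + 1, z), "v"), ((x, y - 1, z), "v"),
                 ((x, y, z + 1), "via"), ((x, y, z - 1), "via")]
        for nxt, kind in moves:
            nx, ny, nz = nxt
            if not (0 <= nx < width and 0 <= ny < height and 0 <= nz < layers):
                continue
            if nxt in blocked:
                continue
            if kind == "via":
                step = via_cost
            else:
                preferred = "h" if z % 2 == 0 else "v"
                step = 1 if kind == preferred else wrong_way_cost
            nd = d + step
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None, None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[dst], path


# Assumed 4x4 grid with two layers and no obstacles.
cost, path = maze_route(4, 4, 2, set(), (0, 0, 0), (3, 3, 0))
```

With these assumed costs, the search prefers switching to the vertical layer for the vertical run, which mirrors how non-preferred usage and via count trade off in a routing score.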
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionOptical proximity correction (OPC) is essential for mitigating lithographic distortions in semiconductor manufacturing. A standard OPC iteration involves the rasterization, lithography simulation, and correction of mask patterns. However, the non-differentiable nature of rasterization limits both optimization efficiency and flexibility. In this paper, we propose DR. OPC, a fully differentiable OPC pipeline enabled by differentiable rasterization. It natively supports advanced features like curvilinear patterns, multi-segment solving, process window improvement, and mask rule violation correction.
Experiments demonstrate that DR. OPC achieves reductions of 52.1%, 16.0%, and 21.1% in L2 error, PVB, and EPE, respectively.
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionDRAM-based processing-in-memory (PIM) has emerged as a promising approach to alleviate the memory wall by executing massively parallel bitwise operations inside DRAM arrays. However, most prior designs operate on a single bitline and leave the dual-rail complementary signals naturally available on each bitline pair underutilized. We present DRCA, a novel dual-rail compute-and-access scheme that exploits both rails of a bitline pair for computation and data access. DRCA integrates two dual-contact compute cells, DRCA-OR and DRCA-XOR, which leverage full-swing dual-rail signaling on the bitline pair to perform bitwise logic with high reliability. Furthermore, the design enables concurrent operations on a single bitline pair, improving the efficiency of complex bitwise logic processing. Our evaluation shows that DRCA reduces failure rate by 2.29x compared with the most robust prior PIM design, while delivering superior average performance on basic bitwise operations.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionAs semiconductor manufacturing advances toward leading-edge nodes, such as 7 nm and below, Design Rule Checking (DRC) and violation correction have emerged as critical bottlenecks in achieving design closure. This escalation arises from intricate geometric constraints and complex rule dependencies, which traditional template-based or heuristic approaches are inadequate to resolve efficiently. These methods offer limited adaptability when migrating to new process nodes. To overcome these challenges, we propose drcAgent, a novel framework for automated DRC violation correction. The framework integrates a multimodal Large Language Model (LLM) agent with a Retrieval-Augmented Generation (RAG) mechanism grounded in a Design Rule Knowledge Graph (DRKG). We develop a self-play adversarial multi-turn reinforcement learning framework where a Generator agent and a Fixer agent co-evolve, enabling the agent to iteratively improve its correction policy through real interactions with commercial EDA tools. Experimental results on real industrial-scale design cases demonstrate that the proposed framework can be effective at fixing violations, offering a promising new learning-based solution for chip design.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionVLSI layout pattern generation plays a crucial role in design for manufacturability (DFM). Patterns that incorporate specific design rule checking (DRC) violations are valuable for applications such as foundry design-rule deck development, EDA tool validation, and generating training data for AI-based DRC research. However, existing research primarily focuses on producing DRC-clean layouts by enforcing rule compliance through constraints or filtering, while ignoring the generation of controllable violations. To address this gap, we propose DRCGen, a controllable framework for generating DRC violation patterns with cross-PDK transferability. DRCGen is based on a diffusion model with conditional control, generating structured control hints to encode spatial and semantic violation information for fine-tuning the model. The model can generate violation patterns of user-specified types within a target region based on a natural-language prompt, while maintaining DRC-compliance outside the region. Additionally, we incorporate few-shot learning to facilitate rapid transfer across different PDKs. Experimental results show that, compared to state-of-the-art methods, our approach achieves a 2.27× increase in topological diversity and a 1.02× increase in geometric diversity. Compared to Calibre LSG, we achieve 3.31× and 1.15× improvements in topological and geometric diversity, respectively.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionDiffusion model deployment has been suffering from high energy consumption and inference latency despite its superior performance in visual generation tasks. Dynamic voltage and frequency scaling (DVFS) offers a promising solution to exploit the potential of the underlying accelerators. However, existing approaches often lead to either limited efficiency gains or degraded output quality because they overlook the inherent fault tolerance of the diffusion model. Therefore, in this paper, we propose DRIFT, a novel algorithm-architecture co-optimization framework that harnesses the fault tolerance for efficient and reliable diffusion model inference. We first perform a comprehensive resilience analysis on representative diffusion models. Building on these observations, we introduce a fine-grained, resilience-aware DVFS strategy that selectively protects error-sensitive network blocks, and a rollback-ABFT mechanism that adaptively corrects only critical errors by reverting to previous timesteps. We further optimize offloading intervals and reorganize data layouts to reduce memory overhead. Experiments across diverse models and datasets show that DRIFT can achieve on average 36% energy savings through voltage underscaling or 1.7x speedup via overclocking while maintaining generation quality.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionMasking is an effective defense against side-channel attacks, yet it remains costly under hardware constraints. The Caliptra Root-of-Trust is a representative case, where its masked ML-DSA implementation incurs about 6× area overhead. We propose a novel first-order masking solution that optimizes Caliptra, achieving significant improvements in area–delay efficiency. Compared to Caliptra's ML-DSA reduction, our design achieves a 12.1× speedup, reducing LUTs by 86.7% and FFs by 94.5%, while improving area–delay efficiency by 91×. The optimized architecture increases signing throughput by 1.32×. TVLA, with over 1,000,000 traces, shows no first-order leakage, satisfies Caliptra's security requirements, and significantly improves implementation efficiency.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionIC3 is the state-of-the-art model checking algorithm, where generalizing cubes by dropping literals one by one is the most computationally expensive step. We propose multi-literal drop strategies that eliminate two or more literals simultaneously. A successful n-drop saves n−1 SAT invocations. To mitigate performance losses from failures, we introduce deduction mechanisms that analyze counterexamples to generalization and identify non-droppable literals early. With these, failed multi-drop attempts are sometimes as useful as conventional single-drops. Additionally, we analyze the diminishing returns of higher-order drops. Our implementation on ABC solves 28 unique and 16 more cases than vanilla ABC, and our implementation on rIC3 runs 6% faster.
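The trade-off between single-literal and multi-literal drops can be illustrated with a toy model in which the SAT query is replaced by a callback. In the sketch below, `is_blocked` stands in for the solver check that a reduced cube is still acceptable, and a list of failed candidates is used to skip queries whose failure is already implied. The function, the monotone toy oracle, and this subset-based deduction are simplifying assumptions, not the counterexample analysis of the presented work.

```python
def generalize(cube, is_blocked, group=2):
    """Drop literals from `cube` in chunks of `group`; return (cube, solver_calls).

    Assumes is_blocked is monotone: if a literal set fails, every subset of
    it fails too, so such candidates are rejected without a solver call.
    """
    current = list(cube)
    failed = []  # literal sets already known to fail
    calls = 0

    def query(candidate):
        nonlocal calls
        cand = frozenset(candidate)
        if not cand or any(cand <= f for f in failed):
            return False  # deduced, no solver call needed
        calls += 1
        if is_blocked(cand):
            return True
        failed.append(cand)
        return False

    i = 0
    while i < len(current):
        chunk = current[i:i + group]
        rest = current[:i] + current[i + len(chunk):]
        if query(rest):
            current = rest  # whole chunk dropped with a single call
            continue
        if len(chunk) == 1:
            i += 1
            continue
        for lit in chunk:  # fall back to single drops inside the chunk
            trial = [l for l in current if l != lit]
            if query(trial):
                current = trial
            else:
                i += 1  # literal is needed; keep it and move on
    return current, calls


# Toy oracle (assumed): a cube is acceptable iff it still contains literal 6.
needs_six = lambda lits: 6 in lits
double_drop = generalize([1, 2, 3, 4, 5, 6], needs_six, group=2)
single_drop = generalize([1, 2, 3, 4, 5, 6], needs_six, group=1)
```

In this toy case both strategies reach the same cube, but the 2-drop variant needs fewer oracle calls because each successful pair drop replaces two single queries.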
Late Breaking Results
DescriptionNear-memory processing (NMP) mitigates the overhead of host-memory data movement while maintaining efficient data access.
We introduce DScNMP, an architecture-dataflow co-design that employs dataflow scheduling to optimize NMP executions.
DScNMP incorporates dynamic workload scheduling, intra-cycle coordinated control, and state supervision units, ensuring efficient resource management.
Evaluation results show that DScNMP occupies 0.006384 mm² in 14 nm, achieves 2.7× lower data access latency, requires up to 8.9× and 4.9× fewer cycles for external data movement and memory-bound workloads, respectively, and delivers 2.4× higher multiply-accumulate efficiency than Armv8.1-M.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionIn recent years, DeepSeek has achieved strong inference performance but remains hard to deploy on energy-constrained edge devices. This paper presents the DeepSeek Processing Element (DSPE), an edge-oriented architecture that alleviates the model's heavy computational and energy demands. DSPE introduces three techniques: the MerkleTree-based Incremental Pruning Scheme (MIPS) for secure redundant-vector reduction, the Multi-Stage Boothing Lookup Method (MBLM) for bit-flip–aware approximate multiplication, and the Dynamic Adaptive Posit Processing Mechanism (DAPPM), which introduces a new DA-Posit format and its corresponding hardware multiplication architecture. Implemented in TSMC 28nm CMOS, DSPE achieves 91.7 TFLOPS/W energy efficiency compared with state-of-the-art designs and offers a scalable foundation for edge deployment.
People
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionLarge language models operate in distinct compute-bound prefill followed by memory bandwidth-bound decode phases. Hybrid Mamba–Transformer models inherit this asymmetry while adding state-space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package utilizes systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package utilizes vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector–matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens over the B200 GPU.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionTransformer models achieve exceptional performance, but face high computational costs that hinder their deployment on resource-constrained devices. While bit-column sparsity (BCS) offers promising post-training acceleration, existing distribution-agnostic methods neglect natural sparsity in the most significant bits (MSBs) from Gaussian-like weight distributions, requiring aggressive accuracy-degrading modifications on the least significant bits (LSBs). This paper presents Duet, a bit-serial accelerator fully exploiting High-Order Lossless Sparsity in BCS for load-balanced Transformer acceleration. At the algorithm level, Distribution-Aware Pruning (DAP) partitions weights by a hyperparameter to maximize lossless MSB pruning opportunities, while Fixed Redundant Hierarchical Search (FRHS) optimally handles remaining compression, achieving 3.13/3.25 effective bits with negligible accuracy loss. At the architecture level, our Duet accelerator addresses four key challenges: (1) two-level shifter resolves bit significance mismatch in Duet encoding format; (2) parallel Metadata-Weight pipelines support variable bitwidths while completely hiding metadata processing overhead; (3) Activation Sum Generator supports time-multiplexed Metadata Pipeline; (4) dual mode operation handles both linear and attention layers in Transformer models. Duet accelerator ensures load-balanced execution for all Processing Elements (PEs). Experiments on BERT and ViT demonstrate 1.45× ~ 4.30× speedup and 1.32× ~ 2.94× energy improvement over SOTA accelerators.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionSuperconductor electronics have increasingly shifted away from RSFQ and its variants toward logic families that eliminate explicit gate-level clocking. While this transition enables simpler circuits and more efficient architectures, it also introduces an implicit reliance on dual-rail codes, resulting in inherent gate duplication. This work presents a duplication-aware retiming methodology for Josephson junction (JJ) count minimization, co-optimizing register placement and polarity assignment. The approach applies beyond SFQ to any monotonic circuit. We further identify cell interfaces—specifically, interconnect drivers, receivers, and fanout (FO) elements—as dominant contributors to JJ count in each cell. A new amplifier design is introduced to reduce these costs, integrated within existing SFQ cells, experimentally verified, and characterized to form a new SFQ cell library. Our results demonstrate a 63-71% JJ count reduction in single-cycle implementations and 41-66% reduction in multi-cycle implementations compared to the prior state-of-the-art. The latter establishes a new Pareto frontier, achieving both shorter critical paths and lower JJ counts than the best-to-date single-cycle implementations.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAsynchronous clock domains in modern SoC designs present critical verification challenges, as Clock Domain Crossing (CDC) issues remain the second most common cause of silicon re-spins. While static CDC ensures structural synchronization, it relies on manual constraints and waivers that may be inaccurate or hide functional bugs. Furthermore, formal analysis often leaves properties "partially proven" due to computational complexity.
This paper proposes a closed-loop methodology to validate these "design intent" assumptions dynamically. The flow begins by running structural CDC and filtering results to identify waived paths. These waivers—including those translated from TCL to constraint formats—are converted into protocol-aware SystemVerilog Assertions (SVA). By integrating these SVAs into dynamic simulation, engineers can stress-test protocols like data-hold and glitch protection.
If simulations pass, waiver validity is confirmed for sign-off; failures trigger design or constraint refinement. This methodology is a "MUST" for SoC design, uncovering corner-case bugs unreachable via manual reviews. Key advantages include significantly reduced Turnaround Time (TAT) and high-quality sign-off guaranteed by measurable assertion coverage.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionFracturable LUTs (FLUTs) create variable logic consumption for LUT implementations, since two LUTs can be merged into one FLUT under certain constraints. Traditional technology mapping algorithms fail to exploit this feature due to their static-cost area models. To bridge this gap, we introduce merging probability, a quantitative, mapping-stage metric that predicts the likelihood of LUT merging during the subsequent packing phase. Based on this, we present a dynamic-cost LUT area model, enabling area recovery better suited for FLUTs. Experimental results on EPFL benchmarks demonstrate that our method reduces FLUT usage by up to 10.3% on mainstream commercial FPGAs, compared to the state-of-the-art technology mapping algorithm, without any performance degradation.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionDynamic Graph Pattern Mining (DGPM) has been widely applied in various domains. However, existing solutions still suffer from severe memory access bottlenecks due to the irregular and data-intensive nature of DGPM workloads. In this paper, we propose DyPamear, the first full-stack hardware-software co-designed system for accelerating DGPM on practical Processing-in-Memory (PIM) hardware. DyPamear is built atop UPMEM, an emerging commercially available PIM platform. To fully exploit UPMEM's bandwidth and parallelism, DyPamear introduces a cross-layer design that integrates load-aware task distribution, data-driven asynchronous execution, and a degree-adaptive set intersection kernel to balance load and alleviate architectural constraints. Evaluations on real UPMEM hardware show that DyPamear achieves average speedups of 267.38x, 82.52x, and 8.78x over Cheetah, PimPam, and PSMiner, respectively, and scales nearly linearly to 20,480 DPUs. The source codes are available at https://github.com/DyPamear-AE/DyPamear-AE.
People
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionWe propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To skip layers without sacrificing accuracy, we introduce a prior-post skipping guidance mechanism that determines when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves a 2.1% improvement in success length over Deer-VLA on the Calvin dataset.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionRecent advances in large language models (LLMs) have demonstrated significant potential in hardware design automation, particularly in using natural language to synthesize Register-Transfer Level (RTL) code. Despite this progress, a gap remains between model capability and the demands of real-world RTL design, including syntax errors, functional hallucinations, and weak alignment to designer intent. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising approach to bridge this gap, as hardware provides executable and formally checkable signals that can be used to further align model outputs with design intent. However, in long, structured RTL code sequences, not all tokens contribute equally to functional correctness, and naïvely spreading gradients across all tokens dilutes learning signals. A key insight from our entropy analysis in RTL generation is that only a small fraction of tokens (e.g., always, if, assign, posedge) exhibit high uncertainty and largely influence control flow and module structure. To address these challenges, we present EARL, an Entropy-Aware Reinforcement Learning framework for Verilog generation. EARL performs policy optimization using verifiable reward signals and introduces entropy-guided selective updates that gate policy gradients to high-entropy tokens. This approach preserves training stability and concentrates gradient updates on functionally important regions of code. Our experiments on VerilogEval and RTLLM show that EARL improves functional pass rates over prior LLM baselines by up to 14.7%, while reducing unnecessary updates and improving training stability. These results indicate that focusing RL on critical, high-uncertainty tokens enables more reliable and targeted policy improvement for structured RTL code generation. We will release the code upon acceptance. An anonymized repository for review is available at https://anonymous.4open.science/r/EARL-1C25.
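The entropy-guided gating described above can be illustrated with a minimal sketch. It is not the EARL implementation: the function names, the `keep_fraction` parameter, the top-k selection rule, and the REINFORCE-style surrogate loss are all assumptions chosen for illustration. The sketch computes the Shannon entropy of each next-token distribution and lets only the highest-entropy positions contribute to the loss.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gate(distributions, keep_fraction=0.2):
    """Return a 0/1 mask keeping only the highest-entropy token positions.

    keep_fraction is a hypothetical hyperparameter, not a value from the abstract.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(keep_fraction * len(entropies)))
    # Indices of the k most uncertain positions.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(entropies))]

def gated_policy_loss(log_probs, advantages, mask):
    """Policy-gradient surrogate averaged over the gated positions only."""
    total = sum(m * (-lp * a) for lp, a, m in zip(log_probs, advantages, mask))
    return total / max(1, sum(mask))

# Five token positions over a toy 4-word vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],      # confident token
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain token
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],      # fairly uncertain token
    [0.99, 0.005, 0.003, 0.002],
]
mask = entropy_gate(dists, keep_fraction=0.4)
loss = gated_policy_loss([-0.1, -1.2, -0.2, -0.9, -0.05], [1.0] * 5, mask)
```

Here only positions 1 and 3 pass the gate, so the confident tokens contribute no gradient signal.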
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAntenna verification is a crucial step in physical design signoff to prevent transistor damage during IC fabrication. Traditional antenna DRC methods, however, are time-consuming and rely on short-free designs, making early analysis difficult when routing shorts or incomplete implementations exist. SoC teams often require early antenna assessments to provide actionable feedback to block owners on interface-level violations, enabling timely corrective measures. Conventional flows force teams to wait for clean designs, delaying detection and slowing overall progress.
Shift-left antenna methodologies address these challenges by enabling early, incremental analysis with 2–6X faster runtimes. Short-aware techniques, combined with selective net and block verification, allow teams to focus on critical areas even when shorts are present. These approaches help identify violations sooner, refine routing strategies proactively, and reduce late-stage fixes. By integrating these flows early in the implementation process,
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionHierarchical power grid shorts represent a critical challenge in modern chip design, typically discovered late in the schedule during physical verification and LVS analysis at integration, causing significant delays and expensive fixes near tapeout. This work presents an innovative DEF-based methodology to detect power grid shorts early in the PNR flow with exact coordinate reporting. The approach analyzes power and ground net shapes to identify unintended collisions between different net types, particularly for multi-voltage domain crossings. By processing DEF data from individual PNR blocks in the context of their hierarchical placement, the methodology enables shorts identification in hours rather than days. This system integrated at the floorplan level of the PNR implementation flow significantly reduces physical verification churn, eliminates weeks of LVS iteration, and improves design schedule predictability while providing precise location information for rapid resolution.
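The core geometric check can be sketched as follows, assuming the power and ground shapes have already been parsed from each block's DEF into rectangles. The `Shape` record, the helper names, and the example coordinates are hypothetical, and DEF parsing and via connectivity are deliberately left out. The sketch translates block-local shapes by the block's placement origin and reports overlaps between different nets on the same layer, with exact coordinates.

```python
from itertools import combinations
from typing import NamedTuple

class Shape(NamedTuple):
    """One rectangular power/ground shape (hypothetical record, not a DEF API)."""
    net: str
    layer: str
    x1: int
    y1: int
    x2: int
    y2: int

def place(shape, dx, dy):
    """Translate a block-local shape into top-level coordinates."""
    return shape._replace(x1=shape.x1 + dx, y1=shape.y1 + dy,
                          x2=shape.x2 + dx, y2=shape.y2 + dy)

def overlap(a, b):
    """Return the overlapping box of two same-layer shapes, or None."""
    if a.layer != b.layer:
        return None
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    x2, y2 = min(a.x2, b.x2), min(a.y2, b.y2)
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def find_shorts(shapes):
    """Report every overlap between shapes that belong to different nets."""
    shorts = []
    for a, b in combinations(shapes, 2):
        if a.net == b.net:
            continue
        box = overlap(a, b)
        if box is not None:
            shorts.append((a.net, b.net, a.layer, box))
    return shorts

# Block A sits at the origin; block B is placed at (90, -20) in the top level.
top_level = [
    place(Shape("VDD", "M5", 0, 0, 100, 10), 0, 0),
    place(Shape("VDD_AON", "M5", 0, 0, 10, 50), 90, -20),
    place(Shape("VSS", "M4", 0, 0, 100, 10), 0, 0),   # other layer: no short
]
shorts = find_shorts(top_level)
```

The all-pairs loop is quadratic in the number of shapes; a production flow would more likely use a sweep line or spatial index, which this sketch omits for brevity.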
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionAs semiconductor technologies continue to scale toward advanced nodes, the sharp increase in transistor density has led to a substantial rise in on-chip power density, making thermal effects a first-order design concern. Traditional thermal mitigation techniques (e.g., thermal-aware coarse-grained floorplanning, thermal-aware (post-route) cell adjustment, and structural cooling enhancements) suffer from thermal prediction inaccuracy or incur high fabrication cost, limiting their practical applicability to modern SoC designs. To overcome these limitations, we present an ML-based early-stage thermal prediction and mitigation framework that enables proactive thermal management during the physical design process. Specifically, our approach (1) first predicts the power density map, which is the underlying source of thermodynamic behavior, during the global placement stage using machine learning models, and then (2) accurately estimates the steady-state thermal map through physics-guided thermal interpolation. The predicted temperature is subsequently leveraged by (3) an ML-based optimization engine that adjusts the placement solution to minimize thermal hotspots without timing degradation. Experimental results demonstrate that our proposed model achieves 49.0% and 34.9% accuracy improvements in power density and thermal prediction, respectively, compared to tool estimation. When applied to thermal-aware placement optimization, the framework reduces the maximum chip temperature by 9.97 °C while maintaining equivalent timing and area. These results confirm the effectiveness of our proposed early-stage thermal modeling and optimization framework in improving thermal reliability for modern power-hungry SoCs.
Engineering Presentation
Chiplet
EDA
DescriptionFace-to-face (F2F) 3DIC integration introduces new thermal challenges due to increased power density, inter-die heat coupling, and limited vertical heat dissipation, making early thermal visibility critical during physical design. This work presents an early-stage thermal analysis methodology jointly developed by the Synopsys RedHawk-ET team and the Broadcom APD AI team to rapidly identify thermal hotspots in F2F 3DIC stackups. The proposed flow leverages early floorplan and power information to enable fast thermal evaluation, allowing designers to proactively adjust floorplanning, power distribution, or die placement to mitigate trapped heat. The methodology is lightweight enough to execute in under a few hours, making it suitable for iterative use during early implementation stages. Correlation with post-silicon thermal measurements demonstrates prediction accuracy within 3 °C across both top and bottom dice. This work highlights practical considerations, tradeoffs, and lessons learned in deploying early thermal analysis as part of a production 3DIC physical design flow.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDue to the increasing size and complexity of high-performance integrated circuits (ICs), especially in advanced technology nodes, the influence of parasitics on the layout has become dominant. Traditional signoff tools and internal scripts often suffer from slow execution, challenging maintenance, and insufficient precision, which limits their effectiveness in addressing the stringent requirements of precision ICs.
To overcome these limitations, a systematic flow for early IC layout parasitic analysis has been developed for quick and efficient detection and debug of parasitic violations. Our shift-left methodology establishes a framework for defining constraints upfront on all critical design nets, enables comparison of multiple layout revisions and provides unique capabilities for verifying large top-level design hierarchies - ensuring robust coverage across diverse design teams and technology nodes.
The flow's scalability allows its adoption for various design styles, facilitating early detection, debugging, and resolution of even minute parasitic violations that would otherwise require lengthy simulation cycles. As a result, design time is significantly reduced by minimizing iterative simulation runs, leading to substantial savings in hardware resources and software license requirements for simulation tools. This systematic early parasitic analysis flow represents a transformative advancement for precision IC development, delivering enhanced design quality, reliability, and productivity.
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionGraph spectral sparsification plays an important role in a wide range of EDA applications. For preconditioned conjugate gradient (PCG) solvers, graph spectral sparsification is a promising preconditioning technique in both theory and practice. In this paper, a highly efficient and stable graph sparsification algorithm based on spectral probability is proposed. In addition, to minimize the total solution time of linear systems with multiple right-hand sides, an efficient self-tuning PCG framework powered by neural networks is proposed. Combining the proposed techniques, an efficient and self-tuning graph-sparsification-based PCG solver, named EastPCG, is developed. Extensive experiments on various benchmarks demonstrate the advantages of the proposed algorithms over existing counterparts.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionZero-knowledge proofs (ZKPs) enable parties to prove possession of information without disclosure, strengthening privacy and security. However, ZKP proof generation suffers from high computational and memory overheads. We present EASY-ZKP, an end-to-end FPGA-accelerated ZKP system with multi-scalar multiplication (MSM) and number theoretic transform (NTT) architectures that improve performance and balance resources for efficient FPGA co-deployment. We further develop an automated design space exploration framework that minimizes latency under resource constraints. Prototyped on a Xilinx Alveo U280, EASY-ZKP achieves up to 19.5× speedup over a CPU implementation and up to 7.7× better energy efficiency than a GPU implementation.
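For readers unfamiliar with the number theoretic transform named above, the following toy sketch shows what an NTT computes. It is a naive quadratic-time version over the small prime 17, chosen only so the arithmetic can be checked by hand. The modulus, transform size, and function names are assumptions for illustration and have no relation to the field sizes or architecture used by EASY-ZKP. It requires Python 3.8 or later for modular inverses via `pow`.

```python
P = 17  # toy prime modulus (real ZKP fields are hundreds of bits wide)
W = 4   # primitive 4th root of unity mod 17: 4**2 % 17 == 16, 4**4 % 17 == 1

def ntt(values, root=W, modulus=P):
    """Naive O(n^2) forward transform: X[k] = sum_j x[j] * root^(j*k) mod p."""
    n = len(values)
    return [
        sum(v * pow(root, j * k, modulus) for j, v in enumerate(values)) % modulus
        for k in range(n)
    ]

def intt(values, root=W, modulus=P):
    """Inverse transform: run the NTT with root^-1, then scale by n^-1."""
    inv_root = pow(root, -1, modulus)
    inv_n = pow(len(values), -1, modulus)
    return [(x * inv_n) % modulus for x in ntt(values, inv_root, modulus)]

def cyclic_multiply(a, b):
    """Multiply two polynomials mod (x^4 - 1) via pointwise products."""
    fa, fb = ntt(a), ntt(b)
    return intt([(x * y) % P for x, y in zip(fa, fb)])

# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
product = cyclic_multiply([1, 2, 0, 0], [3, 4, 0, 0])
```

Hardware NTT designs typically use butterfly networks with O(n log n) operations; the quadratic form here only makes the definition explicit.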
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe SSIR Customer Design Service (CDS) team supports projects spanning big die AI, automotive, and HPC SoCs. Schedules and designs demand static timing analysis (STA) and engineering change order (ECO) flows that close designs quickly. An ECO strategy for a large die AI SoC is difficult: the design contains about 1.6 billion instances, must meet a 1.6 GHz target, and relies heavily on multiply-instantiated-modules (MIM). MIMs complicate violation fixes because each change must be evaluated for impact on other MIMs. Additional challenges include clock path skew, extensive boundary models (BMs), and long data paths, all hindering a sustainable ECO methodology.
To address these issues, the team adopted a "divide and conquer" policy. Multiple STA models with BMs were generated for the top level, enabling parallel analysis and reducing compute load and turnaround time. The CERTUS platform, with its advanced high capacity (HC) ECO engine, prunes the netlist by identifying violations, making the flow lightweight and efficient. At the block level, Twopass and Paradigm were used as ECO tools to converge the design. SSIR worked closely with Cadence field and R&D groups to tailor the flow, creating a recipe that significantly improves fix rates while preserving timing, logical DRCs, and legalization.
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionPrefix KV cache is widely used to accelerate LLM serving by trading more storage for less computation, and state-of-the-art methods often replicate hotspot caches for load balancing. However, we observe that the few nodes that have cache replicas still lead to severe load imbalance. This paper presents ECPrefix: a new prefix KV cache framework based on erasure coding (instead of replication), which distributes encoded blocks of hot prefix caches (organized as profile-guided objects) across nodes, along with adaptive striping and pipelined reading optimizations. Evaluation shows that ECPrefix reduces TTFT by up to 52.3% over existing systems.
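The contrast between erasure coding and replication can be illustrated with the simplest possible code: k data blocks plus one XOR parity block, where any single lost block can be rebuilt from the survivors. This is only a sketch of the general idea. The abstract does not state which code ECPrefix uses, and the function names and block layout below are assumptions.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal blocks (zero-padded) and append one parity block."""
    block_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(block_len * k, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(k)]
    parity = bytes(block_len)
    for block in blocks:
        parity = xor_blocks(parity, block)
    return blocks + [parity]

def recover(stripe: list, missing: int) -> bytes:
    """Rebuild the block at index `missing` by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != missing]
    rebuilt = bytes(len(survivors[0]))
    for block in survivors:
        rebuilt = xor_blocks(rebuilt, block)
    return rebuilt

# Spread a cached prefix over 3 data nodes plus 1 parity node.
payload = b"prefix-kv-cache-block"
stripe = encode(payload, k=3)
rebuilt = recover(stripe, missing=1)  # pretend node 1 is unavailable
```

With this layout each node stores one third of the object and reads can be spread over all four nodes, whereas full replication stores whole copies on a few nodes.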
People
SKYTalk
AI
EDA
DescriptionThe rapid growth of artificial intelligence (AI) applications is driving unprecedented demand for high‑performance, energy‑efficient AI accelerators, placing new and complex demands on Electronic Design Automation (EDA) flows. Scaling these designs from chips to full systems introduces challenges that traditional EDA methodologies were not built to handle.
With increasing accelerator size and complexity, traditional EDA tools and methodologies face challenges spanning advanced process nodes, large‑scale parallelism, and system‑level performance and data movement. These issues increasingly emerge at the boundaries between compute, memory, interconnect, and software behavior, where system‑level interactions have a significant impact on design outcomes. In addition, the speaker will discuss the evolving landscape of EDA tools, highlighting the augmentation of conventional methodologies with AI-driven approaches to better address the increasing intricacy and verification demands in hardware design. Emerging trends such as hardware‑software co‑design, faster iteration through prototyping, and comprehensive design‑space exploration are discussed as key approaches to addressing these challenges.
Attendees will leave with practical insights into current limitations of EDA at AI scale, emerging solutions being deployed in production environments, and the implications for next‑generation accelerator and system design – from silicon architects to EDA practitioners working across the full chips‑to‑systems stack.
People
Work in Progress
DescriptionEdge workloads such as keyword spotting and activity recognition must process continuous FP32 sensor data under tight power–performance–area budgets, making full-precision GEMM units impractical. EdgeQ-GEMM is a compact, processor-integrated INT4/INT8 mixed-precision GEMM accelerator that performs on-the-fly quantization and computation without FP hardware. It supports two modes: adaptive mode, dynamically quantizing FP32 activations to INT4 or INT8 using a design-time QCM to expose a tunable accuracy–energy design space, and layer-wise mode, executing QAT/PTQ models with layer-mixed INT4/INT8 weights by applying on-the-fly INT32-to-INT8 activation quantization. Implemented in edge AI processors and validated via FPGA and 45nm synthesis, EdgeQ-GEMM enables efficient, flexible precision adaptation for diverse workloads.
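The on-the-fly quantization path can be sketched in a few lines, assuming symmetric per-tensor scaling. The abstract does not specify the quantization scheme or how the QCM selects a bitwidth, so the functions below are hypothetical and only show the arithmetic: FP32 values are mapped to INT8 or INT4 codes, the dot product accumulates in integers, and the result is rescaled once at the end.

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def int_dot(a_codes, w_codes):
    """Integer multiply-accumulate, as an INT32 accumulator would perform it."""
    return sum(a * w for a, w in zip(a_codes, w_codes))

def quantized_dot(activations, weights, bits):
    """Quantize both operands, accumulate in integers, then rescale once."""
    a_codes, a_scale = quantize(activations, bits)
    w_codes, w_scale = quantize(weights, bits)
    return int_dot(a_codes, w_codes) * a_scale * w_scale

activations = [1.27, -0.64, 0.0, 0.32]
weights = [1.0, -1.0, 1.0, -1.0]
codes8, _ = quantize(activations, bits=8)
codes4, _ = quantize(activations, bits=4)
result8 = quantized_dot(activations, weights, bits=8)
result4 = quantized_dot(activations, weights, bits=4)
exact = sum(a * w for a, w in zip(activations, weights))
```

In this example the INT8 result matches the exact value of 1.59 to within rounding, while the INT4 result is coarser, which is the accuracy-energy trade-off a mixed-precision design exposes.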
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionStochastic computing (SC) enables compact and low-complexity hardware but remains underexplored for vision applications. We propose EdgeSC, a unified stochastic framework that extends edge detection to diverse and composite operators. Pixel intensities are encoded as stochastic bitstreams, and gradients are computed through finite-state machine (FSM) ensembles operating in the probability domain. A differentiable MUX mapping learns operator-specific behaviors without changing the architecture. Fabricated in 28-nm CMOS, EdgeSC achieves 15.9× smaller area, 4.4× lower power, and 6.3× better area-delay product than 8-bit baselines while maintaining comparable accuracy and throughput.
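As background on stochastic bitstreams, the sketch below encodes two pixel intensities as bitstreams and estimates their absolute difference, the basic ingredient of a gradient, with a single XOR gate. This uses the well-known correlated-stream XOR trick rather than the FSM ensembles or learned MUX mapping described in the abstract, and the function names, stream length, and seed are assumptions.

```python
import random

def encode_pair(a, b, length, rng):
    """Encode two probabilities using one shared random sequence.

    Sharing the random numbers makes the streams maximally correlated,
    which is what lets XOR compute |a - b|.
    """
    stream_a, stream_b = [], []
    for _ in range(length):
        r = rng.random()
        stream_a.append(1 if r < a else 0)
        stream_b.append(1 if r < b else 0)
    return stream_a, stream_b

def abs_diff(stream_a, stream_b):
    """Bitwise XOR: a bit is 1 exactly when r falls between the two values."""
    return [x ^ y for x, y in zip(stream_a, stream_b)]

def decode(stream):
    """Recover the encoded probability as the fraction of 1 bits."""
    return sum(stream) / len(stream)

# Two neighbouring 8-bit pixel intensities, scaled to [0, 1].
left, right = 200 / 255, 50 / 255
sa, sb = encode_pair(left, right, length=4096, rng=random.Random(0))
gradient = decode(abs_diff(sa, sb))   # approximates |200 - 50| / 255
```

The estimate converges to the true difference as the stream grows, so accuracy is traded against latency rather than against gate count.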
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionQuantum computing is a promising technology for cryptography and optimization problems. Additionally, rapid single flux quantum (RSFQ) circuits are a promising technology for an interface with the quantum computer in the cryogenic chamber to reduce the number of connections required. Design of these superconducting systems requires accurate modeling of the resonant frequency of the quantum computing readout resonators, and the inductance of the RSFQ interconnect. One challenge in the design process is the time required to determine the layout of the system to achieve performance targets. Also, most commercially available extraction tools do not model the kinetic inductance present at cryogenic temperatures. Some extraction tools do exist which model kinetic inductance, and these can be used to model arbitrary cryogenic structures, but these tools are not integrated into the standard integrated circuit (IC) design platforms, require expertise to run, and have limited capacity.
We propose using a high-capacity partial element equivalent circuit (PEEC) solver with automated geometry simplification and integrable with common IC design platforms to quickly design the superconducting structures to verify circuit performance. We demonstrate accuracy by comparing with measured data and extractions with other tools. We demonstrate capacity by extracting a representative shift register.
People
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionHigh-level synthesis (HLS) and multi-die FPGAs have been widely applied in large-scale accelerator design. To address cross-die boundary delays and local congestion issues in HLS designs on multi-die FPGAs, prior work proposed a coarse-grained floorplanning and pipelining method to improve frequency performance. However, achieving higher frequency still requires multiple iterative implementations in Vitis to tune parameters, incurring significant time overhead. To accelerate this process, we propose a graph neural network (GNN) based floorplan quality predictor at the FPGA slot level, achieving an accuracy of 84.13% and an F1-score of 0.86 in the congestion prediction task, with an average inference latency of only 0.58 ms. Compared with the traditional tool flow that requires tens of hours, our method enables millisecond-level floorplan parameter fine-tuning, improving an unroutable case and a 318.2 MHz case to 329.2 MHz and 330.3 MHz, respectively. Furthermore, we integrate our method into the HLS design-space exploration framework, achieving an average frequency improvement of 11.2%, with the maximum improvement reaching 23.8%. The execution time of all benchmarks is reduced by 30.9%.
People
Engineering Presentation
EDA
Systems
DescriptionAs design sizes and complexities grow, IR-drop is increasingly becoming a long-pole activity towards final signoff of digital designs. Manual fixing of IR issues in complex grids is iterative, time-consuming, and often non-converging. Industry has used various IR-fixing ECO strategies, but these are custom and cannot be generalized. With pre-trained IR analytics and full integration into a P&R engine, Cadence Voltus InsightAI is expected to improve the turnaround time of power-grid closure for IR by automatically reinforcing the grid with additional stripes where needed. For that to work, a detailed design impact analysis is needed to create a comprehensive methodology that encompasses the tool's capabilities and works around its limitations. This paper discusses that analysis.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEfficient power management is essential to address the growing demand for low-power connectivity solutions. It aligns with mobile compute, always-on, and compute-everywhere initiatives, offers increased battery lifetime, and strengthens the competitiveness of our IPs. We have developed an innovative power management architecture that standardizes processes for low-power entry and exit across various protocol adapters in USB4 routers. This streamlined approach not only enhances overall energy efficiency but also simplifies the router's power management architecture. The successful integration of our USB4 router IP, featuring the power management function, into a major customer's silicon has yielded remarkable outcomes, validating the effectiveness of our solution.
Engineering Presentation
Design
EDA
DescriptionStandard LCBs (Local Clock Buffers) drive a single clock gating domain and are controlled by a single clock enable signal. Even for smaller domains, the same LCB is used for clock gating, leading to a large number of under-loaded LCBs. Multi-domain clock gating circuits (or Micro Clock Gated, MCG) were introduced to handle such scenarios and to save power. MCG LCBs drive multiple domains and have additional enable signals for separate control of each domain. Power modeling and analysis of micro-gating has its own set of challenges. Simpler approaches (averaging, etc.) do not give an accurate picture of power dissipation, making it difficult to gauge the power savings obtained using micro clock gating. For correct power estimation and assessment of the right set of tradeoffs during optimization, we need 1) a power model accounting for all modes of operation, 2) micro-domain power granularity for accuracy, 3) a manageable power model size for efficiency in both modeling and analysis, and 4) one model for use across different power flows with appropriate activity signatures. In this presentation, we propose efficient and accurate power modeling techniques for MCG circuits using IEEE 2416 artifacts.
Engineering Presentation
EDA
DescriptionPower, clock, and tie networks in modern designs form massive, multi-layer, and hierarchically distributed metal connected components with millions of shapes. Accurately modelling and maintaining connectivity for such nets is critical throughout the physical design flow, yet existing approaches rely on static spatial partitioning, leading to poor scalability, excessive runtime, and incorrect connectivity artifacts in hierarchical layouts.
We present a bottom-up, shape-driven method to construct the smallest possible metal connectivity subgraphs using dynamic region queries and disjoint-set union structures. The approach enables scalable, parallel connectivity modelling while preserving synchronization between physical and logical connectivity. To handle hierarchical pin and power interactions, we introduce a pin-accessibility graph that explicitly models reachability and filters redundant guides using a graph-based algorithm. The resulting subgraphs enable independent and parallel execution of connectivity repair, tie-net routing, and traversal operations. Experimental results on production designs demonstrate up to a 60× runtime reduction with near-linear multi-thread scalability. The method is deployed in industrial server and ASIC design flows, enabling fast, correct, and scalable connectivity processing for large-scale metal networks.
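To make the disjoint-set idea concrete, here is a minimal sketch of grouping metal shapes into connected components with a disjoint-set union. It is not the authors' method: the all-pairs loop is a simple stand-in for their dynamic region queries, vias between layers and the pin-accessibility graph are not modelled, and the shape tuple layout `(layer, x1, y1, x2, y2)` and all names are assumptions for illustration.

```python
from collections import defaultdict

class DisjointSet:
    """Union-find with path halving and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra          # attach the smaller tree to the larger
        self.size[ra] += self.size[rb]

def touches(r, s):
    # Shapes are (layer, x1, y1, x2, y2); same-layer rectangles that overlap
    # or abut are treated as electrically connected.
    return (r[0] == s[0] and r[1] <= s[3] and s[1] <= r[3]
            and r[2] <= s[4] and s[2] <= r[4])

def connected_components(shapes):
    dsu = DisjointSet(len(shapes))
    # Brute-force pair check; a spatial index would replace this at scale.
    for i in range(len(shapes)):
        for j in range(i + 1, len(shapes)):
            if touches(shapes[i], shapes[j]):
                dsu.union(i, j)
    groups = defaultdict(list)
    for i in range(len(shapes)):
        groups[dsu.find(i)].append(i)
    return sorted(groups.values())

shapes = [
    (1, 0, 0, 10, 1),     # horizontal stripe on layer 1
    (1, 10, 0, 11, 5),    # abuts the stripe at x = 10
    (1, 20, 20, 30, 21),  # isolated shape on layer 1
    (2, 0, 0, 10, 1),     # same footprint, different layer, no via modelled
]
components = connected_components(shapes)
print(components)
```

Because each component ends up as an independent index set, downstream work such as repair or traversal could in principle be dispatched per component, which is the property the abstract relies on for parallel execution.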
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionIn this work, we propose an efficient heterogeneous probabilistic computing (EHPC) architecture based on volatile RRAM to accelerate combinatorial optimization. We fabricated the OxRAM-multiplexed EHPC circuit and successfully used it to solve the max-cut problem. In hardware simulations of max-cut problems, EHPC achieves superior solution quality to existing works in under 1 s, using the same computational resources. Results on floorplanning problems demonstrate that our EHPC architecture boosts computation speed by 20× to 1,500×, with area expansion <1.7% compared to the best-performing conventional methods, highlighting its advantages in speed, efficiency, and scalability for solving combinatorial optimization problems (COPs).
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionGPU memory management is critical for efficient Large Language Model (LLM) serving. LLM memory usage primarily comprises weights, activations, and KV caches. While weights are static, activations and KV caches exhibit dynamic and unpredictable behavior, posing significant memory management challenges. Modern LLM serving systems address this through a dual-level approach: activations inherit static tensor abstractions from deep learning frameworks, while KV caches employ specialized page-table virtualization (i.e., PagedAttention). Although this reduces KV cache fragmentation, the fundamental isolation between activation and KV cache management prevents memory sharing across these spaces, leading to suboptimal utilization and 20% throughput degradation.
To address these limitations, we propose eLLM, an elastic memory management framework. The core components of eLLM include: (1) Virtual Tensor Abstraction: decouples the virtual address space of tensors from physical GPU memory, creating a unified and flexible memory pool; (2) Elastic Memory Mechanism: dynamically adjusts memory allocation through runtime memory inflation and deflation, and leverages CPU memory as an extensible buffer; (3) Lightweight Scheduling Strategy: employs Service-Level Objective (SLO)-aware policies to optimize memory utilization and effectively balance performance trade-offs under stringent SLO constraints.
Comprehensive evaluations demonstrate that eLLM outperforms state-of-the-art systems, achieving up to 2.32× higher throughput.
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionAdiabatic Quantum-Flux-Parametron (AQFP) is a promising superconducting logic family that combines ultra-low power consumption with high-speed switching. However, the extensive insertion of buffers and splitters (B/S) to satisfy fan-out and synchronization constraints significantly increases circuit area and depth, becoming a key bottleneck in AQFP synthesis. Existing heuristic approaches are lightweight but prone to local optima, while global or exact methods achieve higher solution quality at the expense of runtime and scalability. In this work, we propose the first Large Neighborhood Search (LNS)-guided framework for AQFP B/S insertion, which combines multi-granularity group movement with a destruct-and-repair paradigm to systematically escape local minima while ensuring legality through constraint-aware repair. Extensive experiments on ISCAS'85 and EPFL benchmarks show that our framework achieves up to 14.0% fewer B/S insertions and 13.1% fewer junctions on large EPFL circuits, while yielding the lowest circuit depth among state-of-the-art methods. It also attains up to a 3.1× runtime speedup on large benchmarks, and on the ISCAS'85 suite reduces B/S and JJs by 13.0% and 8.1% respectively.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionEmulation-based dynamic analysis detects malicious software by observing runtime behavior in a controlled environment.
Malware increasingly adopts evasion techniques to recognize such environments and hide malicious activities.
In this paper, we propose EmuDRop, a minimalistic emulation-detection attack that exploits Intel reserved opcodes.
EmuDRop leverages microarchitectural differences between real hardware and emulators to identify emulated execution.
We reverse-engineer reserved opcodes, characterize their microarchitectural effects, and evaluate EmuDRop on five Intel CPU cores and QEMU, a widely used and representative emulator.
The results show that EmuDRop reliably identifies emulated environments.
People
Engineering Presentation
AI
Design
EDA
DescriptionVerification remains the most time-consuming phase of hardware development, with millions of simulation jobs generating diverse fail signatures. Manual triage and root cause analysis (RCA) of these failures is a critical bottleneck, consuming significant engineering effort and requiring specialized expertise.
We present an AI-driven approach that leverages agentic workflows to automate and accelerate fail triage and debug. Our solution integrates heterogeneous verification context (design specifications, HDL, waveforms, coverage data, and prior issues) through Model Context Protocol (MCP) servers, enabling agents to retrieve and correlate information efficiently. Customized agents iteratively process simulation traces, extract cone-of-influence logic, and correlate transactions with waveform and coverage data. Deployed in IBM Z hardware verification, these flows have demonstrated substantial productivity improvements, with an estimated 15-40% reduction in manual effort and enhanced overall verification throughput. This work also provides insights into MCP integration challenges, such as context bloat and API granularity, and outlines best practices for implementing AI-driven debug solutions in complex hardware environments.
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionZero-knowledge proof (ZKP) provers remain costly because multi-scalar multiplication (MSM) and number-theoretic transforms (NTTs) dominate runtime as they need significant computation. AI ASICs such as TPUs provide massive matrix throughput and SotA energy efficiency. We present MORPH, the first framework that reformulates ZKP kernels to match AI-ASIC execution. We introduce Big-T complexity, a hardware-aware complexity model that exposes heterogeneous bottlenecks and layout-transformation costs ignored by Big-O. Guided by this analysis, (1) at the arithmetic level, MORPH develops an MXU-centric extended-RNS lazy reduction that converts high-precision modular arithmetic into dense low-precision GEMMs, eliminating all carry chains, and (2) at the dataflow level, MORPH constructs a layout-stationary CPU–TPU Pippenger MSM and an optimized 3/5-step NTT that avoid on-TPU shuffles and maintain full matrix-unit utilization. Implemented in JAX/XLA, MORPH enables TPUv5p to deliver better energy efficiency than, and performance comparable to, SotA GPU implementations of MSM and NTT.
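The carry-free property attributed to residue number systems (RNS) can be seen in a small sketch: an integer is split into residues modulo pairwise-coprime small moduli, multiplication runs independently per residue channel, and the Chinese Remainder Theorem recovers the result. This is plain textbook RNS only; MORPH's extended-RNS lazy reduction and its mapping onto GEMMs are not reproduced here, and the moduli and function names are assumptions for illustration.

```python
from math import prod

# Pairwise-coprime moduli that each fit in 8 bits (illustrative choice):
# 251 is prime, 253 = 11 * 23, 255 = 3 * 5 * 17, 256 = 2**8.
MODULI = (251, 253, 255, 256)
M = prod(MODULI)  # dynamic range: values in [0, M) are represented uniquely

def to_rns(x):
    """Split an integer into one small residue per modulus."""
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # Each channel is an independent small multiply; no carries cross channels.
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(residues):
    """Reconstruct the integer with the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % M

a, b = 12345, 54321          # product stays below M, so it is recoverable
result = from_rns(rns_mul(to_rns(a), to_rns(b)))
print(result == a * b)
```

The independence of the channels is what makes the representation attractive for wide, low-precision hardware: the per-channel products can be computed side by side, with the expensive reconstruction deferred to the end.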
People
Engineering Presentation
Design
EDA
Systems
DescriptionAutomated software testing that integrates Android driver features is widely adopted in post-silicon environments to improve software robustness through unattended flashing, reboot, recovery, and failure analysis workflows. Such automation workflows leverage multiple Android interfaces, including ADB and fastboot, to support device control, recovery, and post-failure log collection. Applying these automated testing workflows in pre-silicon environments, particularly emulation-based platforms, is highly desirable, as it enables real software bugs and robustness issues to be discovered and mitigated early in the development cycle, prior to silicon availability.
However, extending automated software testing to pre-silicon emulation environments faces two independent challenges. First, fastboot is not natively available in pre-silicon platforms. Although fastboot is technically feasible via USB, it is prohibitively slow for use in emulation environments. In contrast, ADB demonstrates that host–target connectivity can be realized through virtualized communication mechanisms in pre-silicon–like environments, motivating a similar approach for fastboot. Second, even when failure detection is automated, test platforms still struggle to determine whether a detected kernel panic represents a genuine software bug or a non-critical event, such as a bring-up artifact. Moreover, the use of external large language models, such as ChatGPT, is often restricted by security and data confidentiality requirements.
To address these challenges, we present an unattended end-to-end software test automation framework for pre-silicon emulation environments that integrates fastboot virtualization with an on-premises LLM-based failure analysis pipeline. The proposed framework has been validated in an ongoing flagship SoC project, demonstrating early discovery of real software bugs and improved system readiness at silicon bring-up.
Engineering Presentation
EDA
Systems
DescriptionAs semiconductor designs scale toward higher power densities and heterogeneous integration, predicting thermal behavior at an early design stage becomes essential for both performance and reliability. Conventional board-level thermal models use simplified uniform power assumptions, which fail to capture localized hotspots. High-resolution power maps, by contrast, require detailed design collateral that is available only at sign-off, and they are computationally expensive to generate.
We overcome these limitations by creating a high-resolution, tile-based power map using the instance power and location files from the implementation tools and on-die routing information from the GDSII file of a previous design. This generates the detailed power map nearly 2x faster, with less than 2% loss in accuracy, compared to currently available generation methods that use EM-IR sign-off tools.
We demonstrated this method on a 2DIC design whose power map consists of multiple blocks of different functionalities stitched together. The package model and heat sink system are included. The junction temperatures obtained using this method were 4-5°C more accurate than those from simplified block-power-based thermal analysis. This methodology helps in analyzing the thermal behavior of the chip at very early design stages and in designing appropriate system-level cooling solutions.
Engineering Presentation
AI
EDA
Systems
DescriptionModern SoC designs increasingly exhibit functional corner-case bugs that remain hidden until silicon validation. The failure to uncover corner-case bugs can be primarily attributed to two factors: SoC-level specification incompleteness and limitations in the verification environment.
Leveraging the fact that IP-XACT is widely adopted in integration flows to manage SoC-level design metadata (hierarchy, interface protocol, connectivity, address, signal behaviors), we propose an automated methodology that generates a complete PSS (Portable Test and Stimulus Standard) model directly from IEEE-1685 IP-XACT metadata, thereby eliminating manual RTL-path coding effort and expanding functional coverage.
By parsing multiple IP-XACT XML files, we construct a unified JSON-based modeling database that facilitates a seamless transition to PSS structured models. From this database, the structural PSS components that serve as base models for generating test scenarios are built automatically.
By associating the SoC-level behaviors extracted from IP-XACT specifications with the structured PSS models and IP-level action libraries, our approach systematically generates test cases targeting specific ports. In a realistic case study, the resulting model enables systematic behavior exploration and significantly expands functional coverage, achieving approximately a 30% increase in total coverage bins through automatic enumeration of design-derived function conditions.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionComputing-in-memory (CIM) architectures alleviate the data movement bottleneck by performing neural computations directly within memory arrays, which significantly addresses the resource constraints for edge AI applications. As edge applications increasingly demand user-specific adaptation, on-device deep model personalization has become essential for supporting evolving environments, privacy-sensitive data, and low-latency intelligence. However, in CIM-based systems, deep model parameters cannot be efficiently extracted for external fine-tuning due to the high cost of reading, digitizing, and transferring analog device states. Consequently, personalization is expected to be executed directly on the CIM hardware through in-situ weight updates. Prior works have focused primarily on improving inference robustness under non-ideal CIM conditions, leaving the problem of robust on-device fine-tuning fundamentally unaddressed. To fill the gap, we propose CIM-MP, a hardware–software co-optimized framework enabling stable and accurate on-device personalization under noisy CIM conditions. CIM-MP introduces a pulse-based mapping strategy that ensures convergence during in-memory weight updates. To further enhance robustness, we propose a Feature Variation Elimination (FVE) mechanism to mitigate feature-map noise in forward propagation, and a Gradient Adaptive Purification (GAP) mechanism to refine gradients during backpropagation. Experiments show that CIM-MP achieves up to 35.1% accuracy improvement over state-of-the-art approaches, demonstrating the feasibility of efficient and robust on-device learning directly on CIM platforms.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionTime is a fundamental physical resource in silicon, yet in most system architectures it remains an implicit and locally confined assumption. While monolithic cyber physical systems operate effectively with stable local clocks and design time temporal constraints, these abstractions break down as systems scale across distributed nodes and networks. Network uncertainty amplifies timing errors, and the lack of system level time control limits scalable coordination and collaborative intelligence. This work presents an agile time aware system architecture that elevates time from a hidden local property to an explicit and controllable system resource. The architecture is enabled by a flexible silicon level time primitive that provides unified control and observation of clock frequency, phase, and time across hardware, runtime, network, and application layers. The proposed architecture enables fine grained frequency and phase control while maintaining architectural composability beyond protocol centric synchronization mechanisms. The effectiveness of the approach is demonstrated through hardware based multi node synchronization experiments, showing a transition from passive drift to active convergence under varying node counts and uncertainty conditions. The results indicate improved scalability, robustness, and design reuse without architectural redesign, providing a practical foundation for time aware system design that bridges the silicon system boundary.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionTracking moving targets through clutter and occlusions demands real-time sensorimotor control that adapts to dynamic uncertainty. Model predictive control (MPC) incurs high computational cost, while reinforcement learning requires task-specific retraining. We introduce ENACT (Ensemble Neural Attractor Components for Tracking), a modular framework decomposing reactive control into coordinated Dynamic Neural Fields (DNFs). Each DNF module (attention, gating, memory, context) addresses a distinct sensorimotor primitive through continuous attractor dynamics, enabling runtime reconfiguration without retraining. We benchmark ENACT both in simulation and on a Cortex-M7 microcontroller, showing up to 86% lower tracking RMS error, ~2.8x faster disturbance-recovery times, predictable sub-millisecond control latency, and ~9x lower SRAM footprint compared to an MPC baseline.
Research Special Session
EDA
DescriptionThis talk presents an open-source end-to-end physical design automation flow for yield-optimized, inverse-designed EPICs. We integrate three key components: (1) AI-augmented photonic inverse design and inverse lithography framework; (2) GPU-accelerated routability-optimized PIC placement; and (3) curvy-aware photonic routing with electrical-optical co-routing. This flow synthesizes EPIC netlists into fabrication-ready, yield-robust GDS layouts within an hour, enabling scalable photonic tensor cores and interconnects for next-generation AI systems. This work establishes a first-of-its-kind open-source foundation for large-scale, manufacturable EPIC design.
Tutorial
DescriptionThis tutorial provides an accessible entry point for designers across the low-power design space working on power-constrained on-device intelligence. It covers the landscape of low-power edge intelligence from algorithm, architecture, circuit, and application perspectives. The four-talk progression addresses: low-power on-device ML, memory-efficient algorithms for edge inference, neuro-inspired approaches to edge intelligence, and streaming processing in autonomous edge platforms. The tutorial serves a broad DAC audience in machine learning, including those working in edge AI or neuromorphic computing, while highlighting emerging energy-efficient edge computing and system research.
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionAmbient energy harvesting technologies offer the promise of perpetual operation for batteryless Internet of Things (IoT) devices; however, their execution is frequently halted by unpredictable power failures. Traditional methods to ensure correctness, such as software checkpointing (e.g., Mementos) and newer systems for deep neural networks (e.g., DynBal), are burdened by substantial energy and latency overheads for progress preservation, limiting their use on ultra-low-power platforms. This paper introduces EnergyHDC, an energy-aware inference system designed for intermittent operation by integrating Hyperdimensional Computing (HDC). The system pairs HDC's algorithmic robustness with fine-grained energy adaptation through three key contributions: (1) energy-proportional checkpointing that optimizes preservation granularity against available energy; (2) dimension-first reordering that slashes per-checkpoint storage; and (3) priority-based pruning that allows for compile-time energy-accuracy trade-offs. Evaluated on an MSP432P401R microcontroller platform under intermittent power, EnergyHDC demonstrates up to 214x speedup compared to Mementos. It also achieves 19.6x faster inference than DynBal under similar accuracy and realistic energy-harvesting conditions. These results validate that a co-design approach, coupling energy-aware execution with HDC's intrinsic robustness, can reframe intermittence from a system constraint into an opportunity for efficient edge intelligence.
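The robustness that HDC brings to this setting can be illustrated with a toy classifier: class prototypes are formed by bundling noisy bipolar hypervectors, and a query is matched by similarity. The sketch is generic HDC, not EnergyHDC's pipeline; in particular, evaluating only the first few dimensions is a crude stand-in for dimension pruning rather than the paper's priority-based scheme, and every name, dimension, and noise rate below is an assumption for illustration.

```python
import random

def random_hv(dim, rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def flip_noise(hv, rate, rng):
    """Flip each component independently with the given probability."""
    return [-v if rng.random() < rate else v for v in hv]

def bundle(hvs):
    # Element-wise majority vote; ties resolve to +1.
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def similarity(a, b, dims=None):
    # Normalized dot product over the first `dims` components (all if None).
    n = len(a) if dims is None else dims
    return sum(x * y for x, y in zip(a[:n], b[:n])) / n

def classify(query, prototypes, dims=None):
    return max(prototypes,
               key=lambda label: similarity(query, prototypes[label], dims))

rng = random.Random(1)       # fixed seed so the run is repeatable
DIM = 2048
bases = {"a": random_hv(DIM, rng), "b": random_hv(DIM, rng)}
# Each prototype bundles five noisy observations of its class vector.
prototypes = {
    label: bundle([flip_noise(hv, 0.2, rng) for _ in range(5)])
    for label, hv in bases.items()
}

query = flip_noise(bases["a"], 0.2, rng)
full = classify(query, prototypes)                 # uses all 2048 dimensions
truncated = classify(query, prototypes, dims=256)  # uses only the first 256
print(full, truncated)
```

Because information is spread evenly across dimensions, a partial similarity computed over a subset of them still tends to pick the right class, which is what makes partial progress useful when power is lost mid-inference.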
Keynote
Design
DescriptionIn a keynote address at the 2007 Design Automation Conference, I discussed the potential of an emerging technology known as synthetic biology and the role integrated circuit design methodology could play in shaping this nascent field.
Now, almost 20 years later, synthetic biology has matured significantly. Today it offers the opportunity not only of addressing genetic diseases, but also of engineering life itself. In fact, we are witnessing the simultaneous emergence and explosive growth of three technologies that are bound to shape the future of humanity: genetic engineering (such as enabled by CRISPR-Cas9), artificial intelligence, and brain-machine interfaces (BMIs).
While all three are equally influential, this presentation will focus primarily on the last – that is, direct interfaces with the human brain. Brain-machine interfaces are systems that create a direct communication pathway between the brain and external devices. By translating neural signals into actionable information, BMIs can be used to restore lost abilities, augment human capabilities, and enable entirely new forms of interaction with technology. They hold the promise of revolutionizing fields such as medicine, rehabilitation, neuroscience, and even human-machine collaboration.
We will review the state of the art, identify progress, speculate where it may lead, and address some pertinent questions such as privacy and morality implications. But most importantly, we will explore the essential role that the design community can, and must, play in the conception, design, integration, and deployment of these groundbreaking technologies.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern high-performance digital circuits face significant challenges in achieving optimal timing closure due to the complexity of designs with millions of transistors. While APR tools provide sophisticated optimization capabilities, they often compromise on local optimizations due to global constraints, tool limitations, and PPA trade-offs. Non-optimized structures residing on critical paths directly impact the achievable operating frequency, creating performance bottlenecks. During the ECO phase, setup-violating paths undergo re-ordering, creating opportunities for logic optimization. The positive-slack paths, which were not optimization targets during synthesis, become exposed during ECOs and offer greater potential for logic optimization to achieve timing targets. However, current solutions lack PrimeTime compatibility and direct APR tool integration, necessitating manual intervention that extends design closure timelines. EAGLE overcomes these constraints through a comprehensive PrimeTime-integrated framework that leverages ECO-phase optimization opportunities and delivers APR tool-compatible ECOs directly.
Research Special Session
AI
DescriptionNASA's Habitable Worlds Observatory (HWO) will directly image Earth-like exoplanets by suppressing starlight by a factor of 10⁻¹⁰ using a coronagraph instrument. Maintaining this suppression requires a closed-loop control system, high-order wavefront sensing and control (HOWFSC), that continuously corrects optical aberrations by commanding deformable mirrors. The dominant computational kernel is a dense matrix-vector multiply (GEMV) with a precomputed gain matrix exceeding 106 GB in double precision at flight scale. The memory bandwidth demanded by this kernel at the required control frequency exceeds radiation-hardened processors by orders of magnitude, motivating deployment on commercial off-the-shelf (COTS) hardware aboard a co-flying satellite at Sun-Earth L2, outside Earth's magnetosphere, where single-event upsets (SEUs) from galactic cosmic rays and solar particles can corrupt computation. We apply Algorithm-Based Fault Tolerance (ABFT) to protect this GEMV. The gain matrix is severely ill-conditioned (singular values spanning 66 decades), causing checksum noise floors that challenge naive ABFT. We address this with row-scaling preconditioning and analytically derived Higham-bound adaptive thresholds that require no empirical tuning. Using physically realistic gain matrices from a FALCO coronagraph model at two actuator scales, we demonstrate 100% detection of all science-threatening faults with zero false positives and ∼6 orders of magnitude of margin between the noise floor and the dangerous-fault threshold. We further show that the gain matrix can be stored in reduced precision (as few as 23 mantissa bits with block floating point), reducing memory by 62% while consuming less than 0.03% of the contrast error budget.
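The checksum idea behind ABFT for a matrix-vector product can be sketched as follows. This is a minimal illustration in plain Python, assuming a basic column-checksum scheme with a fixed relative tolerance. The function names and the tolerance value are hypothetical, and the row-scaling preconditioning and Higham-bound thresholds mentioned in the abstract are not reproduced here.

```python
def gemv(matrix, x):
    """Dense matrix-vector product y = A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in matrix]


def column_sums(matrix, absolute=False):
    """Per-column sums of A (or of |A|), computed once ahead of time."""
    return [sum(abs(a) if absolute else a for a in col) for col in zip(*matrix)]


def abft_ok(checksum, abs_checksum, x, y, rel_tol=1e-9):
    """Check that sum(y) matches checksum . x within a scaled tolerance.

    Since sum_i (A x)_i == (1^T A) x, a corrupted entry of y shows up as a
    mismatch between the two sides. rel_tol is an assumed, untuned value.
    """
    expected = sum(c * xj for c, xj in zip(checksum, x))
    observed = sum(y)
    # Magnitude of the terms being summed, used to scale the tolerance.
    scale = sum(c * abs(xj) for c, xj in zip(abs_checksum, x))
    return abs(observed - expected) <= rel_tol * scale


A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
c, c_abs = column_sums(A), column_sums(A, absolute=True)

y = gemv(A, x)
clean = abft_ok(c, c_abs, x, y)      # fault-free result passes the check

y_bad = list(y)
y_bad[0] += 0.5                      # simulate a corrupted output element
faulty = abft_ok(c, c_abs, x, y_bad)
```

With an ill-conditioned matrix, a fixed tolerance like this one would either miss faults or raise false alarms, which is the problem the adaptive thresholds in the abstract are meant to address.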
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionUHF RFID tagging is the fastest-growing segment of the RFID market, connecting billions of items worldwide. By enabling internet connectivity and real-time visibility into everyday objects—such as apparel, medical supplies, automotive parts, and food—Impinj RAIN RFID facilitates truly scalable Internet of Things (IoT) applications. As performance requirements for Impinj RAIN RFID systems continue to increase, designs must achieve higher sensitivity and accuracy while maintaining low power consumption. Meeting these performance requirements for billions of Impinj RAIN RFID tag chips presents significant verification challenges, where achieving high-sigma verification across all process, voltage, and temperature (PVT) corners typically requires a large number of simulations and hence long turnaround times.
To address these challenges, Impinj and Siemens EDA have collaborated on an advanced verification workflow that enables efficient high-sigma verification across PVT corners. Siemens EDA's Solido PVTMC tool intelligently identifies worst-case corners at higher sigma levels, significantly reducing the number of required simulations. When combined with the Solido SPICE simulator, this approach allowed Impinj to accelerate verification runtimes while meeting stringent accuracy and reliability targets. This presentation describes the workflow and highlights its impact on Impinj RAIN RFID design verification.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAnalog Compute-In-Memory performance in 3D heterogeneous chiplet integration is highly sensitive to spatial temperature gradients and temporal temperature fluctuations. Conventional thermal sign-off based on the maximum junction temperature is insufficient for ensuring the algorithmic precision of the memory across diverse operational phases. This study proposes a System-Technology Co-Optimization (STCO) framework featuring Thermal Avoidance and Thermal Compensation strategies to enhance spatial-temporal temperature uniformity. The approach combines hardware-level spatial optimization, such as selective transistor density reduction and controllable heating element deployment, with software-level temporal regulation, such as load-rate modulation and dummy workflow supplementation. The proposed strategies are validated by thermal simulations of an 8-layer memory stack on a high-power SoC using a granular Chip Thermal Model (CTM). The results demonstrate that the SoC-adjacent memory die has the highest thermal risk and that the strategies can reduce its spatial temperature gradient by up to 25%. This work provides a practical, early-stage STCO solution for maintaining memory reliability and functional integrity in advanced 3D integrated systems.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEmbedded non-volatile memory (NVM) interfaces in next-generation microcontrollers demand rigorous verification of security and privilege semantics. While the UVM Register Abstraction Layer (RAL) standardizes register access, it offers limited control over bus-level attributes required for validating secure and protected transactions. This work introduces an enhanced user adapter layer for Cadence AMBA AHB VIP, enabling verification of four distinct access modes: Privileged-Secure, Privileged-Nonsecure, Unprivileged-Secure, and Unprivileged-Nonsecure.
Our methodology integrates an extension class within the UVM RAL flow to propagate HPROT and HNONSEC attributes through reg2bus APIs, preserving abstraction while ensuring full control over secure and protected semantics. The solution extends the default adapter, which traditionally transfers data using a fixed uvm_reg_bus_op structure, by declaring an extension class handle in reg2bus. Results demonstrate improved regression stability, reduced debug effort, and systematic coverage closure without sacrificing portability. This approach strengthens verification efficiency and confidence for advanced memory interfaces in safety-critical and secure applications.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionAs superconducting processors scale, understanding how physical layout shapes qubit interactions is essential for architectural reliability. Existing methods offer limited insight into how electromagnetic design choices translate into execution-level behavior. We present EPAR, an electromagnetic-to-architecture framework that predicts robustness early, directly from the physical design, by reconstructing how design distortion modifies the effective Hamiltonian, reroutes mediated connectivity, and influences control-pulse response. Across all tested layouts, EPAR's structural scores show 100% agreement with two-qubit error trends yet reveal over 10x robustness differences among edges with identical calibrated error rates, going beyond conventional metrics to provide improved and actionable compiler guidance.
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionIn advanced process nodes, the pursuit of extreme PPA optimization has driven an explosion in the demand for customized standard cells. To satisfy this demand, automated layout synthesis has been increasingly adopted to explore vast design spaces. However, this paradigm shifts the bottleneck from design creation to verification, as characterizing the massive volume of generated variants via SPICE is computationally prohibitive.
Meanwhile, conventional geometric heuristics fail to proxy PPA at advanced nodes due to dominant layout effects. Existing learning-based surrogates often lack the fidelity to capture these complex dependencies. To bridge this gap, we propose EPiCell, an electro-physical co-modeling framework for rapid PPA estimation. EPiCell features a Heterogeneous Graph Transformer (HGT) that explicitly models transistors, routing metals, and supply rails as distinct entities, unifying circuit topology with fine-grained layout geometry. By employing relation-aware attention, it effectively captures the non-local electro-physical interactions governing cell performance. Validated on a dataset of over 18,000 auto-generated layouts based on ASAP7, EPiCell achieves high fidelity against SPICE simulations, with low average prediction errors of 1.82% for leakage power, 4.01% for internal power, 3.06% for delay, and 3.29% for transition. Crucially, it demonstrates superior ranking consistency with SPICE, attaining a median Spearman Rank Correlation Coefficient of 0.90 for internal power, 0.81 for delay, and 0.70 for transition. This offers a scalable surrogate model to enable efficient design space exploration.
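The ranking-consistency metric quoted above, the Spearman rank correlation coefficient, can be computed as the Pearson correlation of the rank vectors. The sketch below is a minimal plain-Python version, assuming average ranks for ties. The delay values are made-up illustrative numbers, not data from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(ra, rb))
    var_a = sum((p - mean_a) ** 2 for p in ra)
    var_b = sum((q - mean_b) ** 2 for q in rb)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical predicted vs. SPICE-simulated delays for four cell layouts.
predicted = [10.2, 11.0, 9.8, 12.5]
simulated = [10.0, 11.4, 10.1, 12.0]
rho = spearman(predicted, simulated)  # 0.8: one adjacent pair is swapped
```

A surrogate used for design space exploration mainly needs to order candidates correctly, which is why a rank-based metric is reported alongside the absolute prediction errors.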
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
DescriptionThis paper presents an Elegant Scan Chain Activity Probability Establishment (ESCAPE) architecture for programmable low-power LBIST. In the designed low-power control circuit, a programmable probability generator periodically outputs a user-configurable sequence of probabilities to drive hold registers. A 2D AND-gate array formed by two groups of hold registers then produces a rich set of low-power control signals. By modulating the enable probability of the lockups situated before the phase shifter, ESCAPE achieves precise per-chain power management. Experimental results on industrial-scale designs demonstrate that the proposed method achieves high coverage while reducing peak power consumption and incurring lower hardware overhead.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
Description3D field-programmable gate arrays (FPGAs) promise higher performance through vertical integration. However, existing placement tools, largely inherited from 2D frameworks, fail to capture the unique delay characteristics and optimization dynamics of 3D fabrics. We introduce a 3D FPGA placement flow that integrates partitioning-based initialization, adaptive cost scheduling, refined delay estimation, and a simulated annealing move set — all targeted at 3D FPGA architecture. Together, these enhancements improve timing estimates and the exploration of layer assignments during placement. Compared to Verilog-To-Routing (VTR), our experiments show geometric-mean (max) critical-path delay reductions of ∼3% (∼7%), ∼2% (∼4%), ∼3% (∼8%), and ∼6% (∼18%) for four 3D architectures: 3D CB, 3D CB-O, 3D CB-I, and 3D SB, respectively. We also achieve geometric-mean (max) routed wirelength reductions of ∼1% (∼3%), ∼2% (∼8%), < 1% (∼5%), and ∼5% (∼10%), respectively. Our work will be permissively open-sourced on GitHub.
Work in Progress
DescriptionWith the scaling down of integrated circuit dimensions and the increasing complexity of transistor structures, the role of etching in manufacturing has become increasingly critical. We propose an etching simulation approach based on a video generation model, which models the evolution of the etching process as a video generation task. By embedding frames into quantized latent codeword representations with a VQ-VAE (Vector Quantized Variational Autoencoder) and leveraging a temporal autoregressive prediction model, we obtain a generative model of the etching process. We validate the effectiveness of our model on both simulated and experimental data. Our approach achieves a 6,000× speedup over the Monte Carlo method while reducing the simulation MAE (Mean Absolute Error) by 14.4% compared with the state-of-the-art video generation model. Furthermore, results generated by our video-based model show strong agreement with experimental data.
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionEthereum's massive data requires auxiliary proofs (e.g., Merkle proofs) for trusted queries, creating significant I/O and data movement overhead.
We introduce EtherSSD, an innovative in-storage Ethereum analytics platform with computational storage devices (CSDs) designed for real-time data analysis. EtherSSD bridges the semantic gap between the host and CSDs, dissolving authenticated data queries into flattened (highly concurrent) in-CSD page accesses with slashed I/O operations from the host side. Additionally, EtherSSD incorporates an authentication engine that offloads cryptographic verification computations from host CPUs. Evaluations under real-world workloads demonstrate that it reduces authenticated query execution time, particularly the I/O and authentication overhead.
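To illustrate the kind of auxiliary proof an authenticated query has to verify, here is a minimal sketch of Merkle proof generation and checking. It assumes a plain binary Merkle tree with SHA-256 and duplication of the last node on odd levels. Ethereum itself uses Merkle Patricia tries with Keccak-256, so this is a simplified stand-in rather than the structure EtherSSD operates on, and all function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from hashed leaves up to the single root."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:                  # pad odd levels by repeating the last node
            cur = cur + [cur[-1]]
            levels[-1] = cur
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def make_proof(levels, index):
    """Collect the sibling hash at every level below the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the path to the root; one hash per tree level."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


records = [b"account-0", b"account-1", b"account-2"]
levels = build_levels(records)
root = levels[-1][0]
proof = make_proof(levels, 2)
valid = verify(b"account-2", proof, root)
forged = verify(b"account-X", proof, root)
```

Each verification touches one node per tree level, so a query over many records turns into many small reads plus hashing, which is the I/O and authentication cost the abstract refers to.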
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionMulti-step image editing with diffusion models typically requires repeatedly executing the inversion–denoising paradigm, which leads to severe challenges in both image quality and computational efficiency. Repeated inversion introduces errors that accumulate across editing steps, degrading image quality, while regeneration of unchanged background regions incurs substantial computational overhead. In this paper, we present ExCave, a training-free multi-step editing framework that improves both image quality and computational efficiency by excavating consistency across editing steps. ExCave introduces an inversion sharing mechanism that performs inversion once and reuses its consistent features across subsequent edits, thereby significantly reducing errors. To eliminate redundant computation, we propose the CacheDiff method that regenerates only the edited regions while reusing consistent features from unchanged background regions. Finally, we design GPU-oriented optimizations to translate theoretical gains into practical reductions in end-to-end latency. Extensive experiments demonstrate that ExCave achieves superior image quality and dramatically reduces inference latency, establishing a new paradigm for accurate and efficient multi-step editing.
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionLarge-scale Mixture-of-Experts (MoE) models are pivotal in modern AI, yet their massive parameter size creates a "storage wall" for fault tolerance, where limited bandwidth restricts checkpoint frequency and risks significant wasted computation. We present "EXPCheck", a dynamic expert-aware checkpointing system designed to resolve the conflict between massive MoE states and limited persistence bandwidth. Grounded in the observation that expert activation is highly imbalanced, EXPCheck employs a novel "Aging-then-Greedy Expert Selection (AGES)" policy. AGES first enforces an age-based refresh for overdue "cold" experts to prevent indefinite staleness, and then greedily allocates the remaining persistence budget to frequently updated "hot" experts. Implemented on a production-scale training stack, EXPCheck significantly reduces persistence traffic and increases checkpoint frequency by up to 5× compared to full checkpointing, while maintaining downstream model accuracy comparable to standard methods.
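The two-phase selection described above can be sketched as follows. This is a minimal illustration, assuming the budget is counted in experts per checkpoint, that "age" means checkpoint rounds since an expert was last persisted, and that "hotness" is an update count. The function name, parameters, and tie-breaking are hypothetical and not taken from EXPCheck.

```python
def select_experts(update_counts, ages, budget, max_age):
    """Pick which experts to persist in this checkpoint round.

    update_counts[i]: updates to expert i since its last persisted copy
    ages[i]:          checkpoint rounds since expert i was last persisted
    budget:           maximum number of experts that can be written now
    max_age:          staleness limit that forces a refresh
    """
    n = len(update_counts)

    # Aging phase: overdue experts are refreshed first, oldest first.
    overdue = sorted((i for i in range(n) if ages[i] >= max_age),
                     key=lambda i: -ages[i])
    chosen = overdue[:budget]

    # Greedy phase: spend what is left on the most frequently updated experts.
    taken = set(chosen)
    hot = sorted((i for i in range(n) if i not in taken),
                 key=lambda i: -update_counts[i])
    chosen += hot[:budget - len(chosen)]
    return chosen


# Expert 1 is cold but overdue; experts 0 and 2 are hot.
picked = select_experts(update_counts=[100, 2, 50, 1],
                        ages=[0, 5, 1, 1],
                        budget=2,
                        max_age=4)
# picked == [1, 0]
```

Running the aging phase first bounds how stale any expert's persisted copy can become, at the cost of occasionally displacing a hotter expert from the round.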
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionSparse Mixture-of-Experts (MoE) models can outperform dense LLMs at similar computation, but their large expert parameters create high memory demand, making single-GPU deployment difficult. Offloading addresses this by storing inactive experts on CPU, yet static caches ignore dynamic routing and existing predictors for expert usage are often inaccurate or costly. We present ExpertFlow, a lightweight MoE inference system with a routing path predictor, a routing-aware token scheduler, and a predictive expert cache. Together, these components enable efficient expert loading and execution, reducing GPU memory by 93.72% and improving throughput by up to 10× on a single GPU.
People
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionWith the rapid advancement of Artificial Intelligence, the Graphics Processing Unit (GPU) has become increasingly essential across a growing number of safety-critical application domains.
Using a GPU is indispensable for parallel computing; however, the complex data dependencies and resource contention across kernels within a GPU task may unpredictably prolong its execution time.
To address these problems, this paper presents a scheduling and analysis method for Directed Acyclic Graph (DAG)-structured GPU tasks.
Given a DAG representation, the proposed scheduling scales the kernel-level parallelism and establishes inter-kernel dependencies to provide a reduced and predictable DAG response time.
The corresponding timing analysis yields a safe yet non-pessimistic makespan bound without any assumption on kernel priorities.
The proposed method is implemented using the standard CUDA API, requiring no additional software or hardware support. Experimental results under synthetic and real-world benchmarks demonstrate that the proposed approach reduces the worst-case makespan and the measured task execution time by up to 32.8% and 21.3%, respectively, compared to existing methods.
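For context on what a makespan bound for a DAG task looks like, the sketch below computes the classical Graham bound, length + (volume - length) / m, which holds for any work-conserving scheduler on m identical execution units. This is a textbook baseline and not the analysis proposed in the paper. Treating a GPU as m identical units is a simplifying assumption, and the kernel names and execution times are made up.

```python
from graphlib import TopologicalSorter


def critical_path_length(wcet, preds):
    """Longest path through the DAG, weighted by worst-case execution times."""
    graph = {node: preds.get(node, ()) for node in wcet}
    finish = {}
    for node in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[node]), default=0.0)
        finish[node] = start + wcet[node]
    return max(finish.values())


def graham_bound(wcet, preds, m):
    """Classical bound on DAG response time with m identical execution units."""
    length = critical_path_length(wcet, preds)
    volume = sum(wcet.values())
    return length + (volume - length) / m


# Hypothetical four-kernel task: a -> {b, c} -> d.
wcet = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 1.0}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}

length = critical_path_length(wcet, preds)  # a -> c -> d
bound = graham_bound(wcet, preds, m=2)
```

The bound makes no assumption about priorities among ready nodes, which is the same property the abstract claims for its own, tighter analysis.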
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionAnalog circuit optimization is typically framed as black-box search over arbitrary smooth functions, yet device physics constrains performance mappings to structured families: exponential device laws, rational transfer functions, and regime-dependent dynamics. Off-the-shelf Gaussian-process surrogates impose globally smooth, stationary priors that are misaligned with these regime-switching primitives and can severely misfit highly nonlinear circuits at realistic sample sizes (50–100 evaluations). We demonstrate that pretrained tabular models encoding these primitives enable reliable optimization without per-circuit engineering. Circuit Prior Network (CPN) combines a tabular foundation model (TabPFN v2) with Direct Expected Improvement (DEI), computing expected improvement exactly under discrete posteriors rather than Gaussian approximations. Across 6 circuits and 25 baselines, structure-matched priors achieve R² ≈ 0.99 in small-sample regimes where GP-Matérn attains only R² = 0.16 on Bandgap, deliver 1.05–3.81× higher FoM with 3.34–11.89× fewer iterations, and suggest a shift from handcrafting models as priors toward systematic physics-informed structure identification. Our code will be made publicly available upon paper acceptance.
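Computing expected improvement exactly under a discrete posterior reduces to a finite sum. The sketch below assumes the surrogate returns, for a candidate design, a list of support values for the figure of merit and their probabilities, and that the figure of merit is maximized. The function name and the numbers are hypothetical, and this is not the DEI implementation from the paper.

```python
def discrete_expected_improvement(values, probs, best_so_far):
    """EI = sum_k p_k * max(0, v_k - best_so_far) for a maximization problem."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * max(0.0, v - best_so_far) for v, p in zip(values, probs))


# Hypothetical predictive distribution over the figure of merit for one candidate.
support = [0.0, 1.0, 2.0, 3.0]
weights = [0.1, 0.2, 0.3, 0.4]

ei = discrete_expected_improvement(support, weights, best_so_far=1.5)
# Only the outcomes above 1.5 contribute: 0.3 * 0.5 + 0.4 * 1.5
```

Because the sum runs over the actual predictive mass, a skewed or multimodal posterior is handled directly, with no need to fit a mean and variance first.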
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionLattice surgery is among the leading schemes for fault-tolerant quantum computing motivated by superconducting hardware. Conventional lattice surgery compilation schemes follow a place-and-route paradigm, where logical qubits remain statically fixed in space throughout the computation. In this work, we introduce a paradigm shift by exploiting movable logical qubits via teleportation during the lattice surgery CNOT gate. We propose a proof-of-concept compilation scheme leveraging these movements, which can substantially reduce the routed circuit depth. This demonstrates that movable logical qubits can be used even on hardware with static physical qubits. An open-source implementation will be made available on GitHub.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionMulticore processors are increasingly adopted in embedded systems to meet growing performance demands. However, physical side-channel analysis of multicore architectures remains underexplored, as obtaining usable leakage is inherently challenging. Consequently, side-channel security research on such systems has lagged far behind, leaving a critical security gap. To address this gap, we reveal the electromagnetic leakage mechanisms in multicore architectures and, for the first time, demonstrate per-core leakage exploitation, thereby enabling physical side-channel analysis for these systems. As a practical extension, we present a non-intrusive side-channel monitoring method that achieves per-core granularity. To validate its feasibility and practicality, we implement a prototype on a heterogeneous SoC platform with an RF front-end, and evaluate on a commercial off-the-shelf quad-core embedded system, the Raspberry Pi 4B with ARM Cortex-A72 cores.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionMixture-of-Experts (MoE) large language models (LLMs) leverage dynamic routing and sparse activation to improve efficiency and scalability, achieving high performance with reduced computational cost. However, their complex architecture and large memory footprint pose significant challenges for deployment, particularly on resource-constrained hardware.
Post-training quantization (PTQ) is a widely used technique to reduce model size and memory usage. Existing PTQ approaches for MoE models are predominantly layer-wise and task-agnostic, optimizing reconstruction error independently within each layer. As a result, they ignore cross-layer differences in expert importance and fail to leverage task-specific signals, causing pivotal experts to be over-compressed, rarely activated experts to be over-provisioned, and overall accuracy to degrade.
To overcome these limitations, we propose ExQuant, a PTQ framework for MoE LLMs that enables global, expert-level mixed-precision quantization. ExQuant first constructs a Globally Comparable Expert Importance Metric by integrating expert routing frequency and post-ablation performance. Based on this metric, it assigns tiered bit-widths to experts and employs a precision-aware load balancing strategy to dynamically schedule computation across processing elements, fully exploiting slack between low- and high-precision workloads. Experiments demonstrate that ExQuant significantly reduces memory footprint, improves inference efficiency, and achieves an accuracy improvement of 2.87–5.93% over existing MoE quantization methods. These results validate the effectiveness of global, expert-level mixed-precision quantization for efficient and accurate deployment of MoE LLMs.
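The tiered assignment step can be sketched as follows. The abstract does not say how routing frequency and post-ablation performance are combined, so the weighted sum of max-normalized signals, the `alpha` weight, and the tier fractions below are all assumptions made for illustration, as are the function and expert names.

```python
def assign_bit_widths(routing_freq, ablation_drop,
                      tiers=((0.25, 8), (0.5, 4), (1.0, 2)), alpha=0.5):
    """Rank all experts globally and map rank fractions to bit-widths.

    routing_freq[e]:  how often expert e is selected by the router
    ablation_drop[e]: accuracy lost when expert e is removed
    tiers:            (cumulative fraction of experts, bit-width) pairs
    """
    max_freq = max(routing_freq.values()) or 1.0
    max_drop = max(ablation_drop.values()) or 1.0
    # Assumed importance score: weighted sum of the two normalized signals.
    score = {e: alpha * routing_freq[e] / max_freq
                + (1 - alpha) * ablation_drop[e] / max_drop
             for e in routing_freq}

    ranked = sorted(score, key=score.get, reverse=True)
    widths = {}
    for rank, expert in enumerate(ranked):
        fraction = (rank + 1) / len(ranked)
        widths[expert] = next(bits for cutoff, bits in tiers if fraction <= cutoff)
    return widths


# Keys identify experts across all layers, so the ranking is global.
freq = {"L0.E0": 0.4, "L0.E1": 0.3, "L1.E0": 0.2, "L1.E1": 0.1}
drop = {"L0.E0": 0.8, "L0.E1": 0.1, "L1.E0": 0.4, "L1.E1": 0.0}
widths = assign_bit_widths(freq, drop)
```

In this example the rarely routed but ablation-sensitive expert `L1.E0` ends up in a higher tier than the more frequently routed `L0.E1`, which is the kind of cross-layer distinction a purely layer-wise scheme would miss.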
Research Manuscript
Systems
SYS4. Embedded System Design Tools and Methodologies
DescriptionModern processor architectures typically employ a fixed, small register file, which is well-suited for most computations due to its simplicity, energy efficiency, and ease of implementation. However, data-intensive applications often suffer from limited register availability; simply enlarging the register file increases code size, pressures the instruction cache, complicates decoding, and raises power consumption. To address these challenges, we propose an extension to the legacy RISC-V Instruction Set Architecture (ISA) that supports an expandable register file. Our design partitions the register file into multiple logical banks, each mirroring the standard 32-register configuration, allowing operands and destination registers to reside in different banks concurrently. We introduce instruction extensions, overhead reduction mechanisms, and exception-handling infrastructure to fully exploit the expanded register space on a scalar processor. The approach is implemented on the CVA6 CPU, a 6-stage RISC-V processor, and deployed on an FPGA with only 27% hardware overhead. Experimental results demonstrate substantial performance improvements: matrix multiplication achieves 60% speed-up with 17% energy reduction, convolutions improve by 48% with 22% energy reduction, and convolutional neural networks such as ResNet-50 achieve 83.5% speed-up with 45% energy reduction.
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionNeural networks are widely deployed at the edge to process high-dimensional sensor data, but they are susceptible to burst errors that can corrupt weights and degrade inference accuracy. Conventional error-correcting codes (ECC) mitigate errors but incur significant memory overhead. Recent ECC methods for neural networks overwrite the least significant bits (LSBs) of the model weights with parity bits, providing zero-overhead resilience at the expense of slightly reduced inference accuracy. In this paper, we propose E3-CODE, a framework for Embedded and Efficient Error-Correcting Code for Error-Resilient Neural Networks. The proposed method embeds multi-bit parity within the entire weight representation rather than modifying only the LSBs of the weights. To minimize the negative impact of the parity-embedded ECC, weight and parity assignments are jointly optimized via a mixed-integer linear programming (MILP) formulation. We also propose a hybrid ECC scheme that combines the embedded ECC with conventional ECC, trading minor memory overhead for significantly improved resilience. The experimental evaluation on the ImageNet and CIFAR-10 datasets using ResNet, MobileNetV2, and EfficientNet-B0 demonstrates that E3-CODE maintains software-level accuracy in the presence of burst errors. Compared with prior methods, the lifetime of the edge system is extended by 4.8× with no memory overhead and 10× with less than 2% memory overhead.
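For context, the LSB-overwriting baseline that the abstract contrasts against can be sketched in a few lines. This is not E3-CODE itself: it is a minimal single-parity-bit illustration, assuming 8-bit unsigned weights and even parity, and the names `embed_parity` and `parity_ok` are hypothetical.

```python
def embed_parity(weight: int) -> int:
    """Overwrite the LSB of an 8-bit weight with even parity over its upper 7 bits."""
    upper = weight & 0xFE                 # clear the LSB
    parity = bin(upper).count("1") & 1    # 1 if the upper bits hold an odd number of ones
    return upper | parity                 # stored word always has an even number of ones


def parity_ok(stored: int) -> bool:
    """Return True when the stored 8-bit word still has even parity."""
    return bin(stored & 0xFF).count("1") % 2 == 0


if __name__ == "__main__":
    w = 0b10110101
    stored = embed_parity(w)
    print(stored, abs(stored - w))        # stored value and the quantization cost
    print(parity_ok(stored))              # intact word passes the check
    print(parity_ok(stored ^ 0x40))       # a single flipped bit is detected
```

The example makes the trade-off visible: no extra memory is used, but each weight can change by one LSB, and a single parity bit only detects an odd number of flipped bits. Correcting burst errors needs multi-bit codes, which is the regime the abstract targets.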
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionFloating-point multiply-accumulate (FPMAC) is crucial in scientific computing, machine learning, and graphics but remains a major performance and energy bottleneck. Conventional IEEE-754 FP-FMA schemes optimize single multiply-add fusion but overlook long-chain MAC patterns in GEMM, leading to redundant rounding and normalization, accumulated numerical error, and high delay and power overhead from wide carry-propagate adders (CPAs). We propose Factored-FPMAC (F-FPMAC), which employs 4:2 carry-save adders (CSAs) and CSA-based redundant representation inside systolic arrays to eliminate CPAs from processing elements and defer normalization to a unified post-processing stage. To prevent accumulated intermediate value overflow (AIVO) under deferred normalization, we introduce a lightweight hierarchical risk-aware boundary protection mechanism. To further reduce register overhead from redundant representation, we replace per-PE dual buffers with an array-shared buffer pool. Experimental results indicate that F-FPMAC reduces critical-path delay by 61.9%, lowers power consumption by 22.6%, improves energy efficiency by 3×, achieves nearly two orders of magnitude lower numerical error, and decreases overflow events by up to 44.5%.
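The idea of keeping an accumulator in carry-save (redundant) form and deferring carry propagation can be shown on plain integers. The sketch below is only an illustration of that principle, assuming non-negative Python integers instead of floating-point mantissas; the function names are hypothetical, and the 4:2 compressor is built here from two 3:2 compressors, which is one common construction and not necessarily the one used in F-FPMAC.

```python
def csa_3to2(a: int, b: int, c: int):
    """3:2 carry-save compressor: a + b + c == s + carry, with no carry chain."""
    s = a ^ b ^ c                                   # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1      # majority bits, shifted left
    return s, carry


def csa_4to2(a: int, b: int, c: int, d: int):
    """4:2 compressor built from two cascaded 3:2 compressors."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)


def accumulate(values):
    """Accumulate in redundant (sum, carry) form and resolve carries once at the end."""
    s, c = 0, 0
    for v in values:
        s, c = csa_3to2(s, c, v)
    return s + c    # the only carry-propagate addition


if __name__ == "__main__":
    print(accumulate([3, 5, 7, 11]))
    print(sum(csa_4to2(9, 6, 13, 2)))
```

Each loop iteration uses only bitwise operations, so in hardware its delay does not grow with operand width. The single wide addition at the end plays the role of the deferred post-processing stage.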
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern digital systems rely heavily on interconnects capable of servicing multiple requesters efficiently and fairly. Traditional arbitration logic often fails in handling complex, multi-flit, and overlapping transactions typical of today's system-on-chip (SoC) designs. To address these limitations, arbiter-multiplexer (arbmux) architectures have emerged as a promising solution, combining arbiters with multiplexers to handle variable-length, multi-port transfers. However, the increased complexity of arbmux structures poses substantial challenges for formal verification, particularly when conventional fairness properties yield spurious failures or miss critical bugs. This work presents a comprehensive comparative study and methodological advancement in the formal verification of arbmux fairness.
We introduce a novel formal verification approach. Unlike traditional methods, this technique enables assertion of liveness and fairness properties only during meaningful protocol intervals, sharply reducing assertion noise and increasing the precision of bug detection. Parameterized assertion code, supported by symbolic indices, further boosts the scalability and generality of the method, facilitating application to arbitrarily complex, multi-port arbmuxes. Through practical case studies, we demonstrate that our approach detects subtle protocol violations, including premature grant switching and potential requester starvation, issues that elude prior techniques due to inadequate property focus or over-constrained assumptions.
Quantitative and qualitative results establish the superiority of our enhanced methodology: bug counterexamples are more interpretable, assertion code is more maintainable, and formal runs achieve greater coverage within manageable computational resources. Critically, we show that tracking active transactions is essential to distinguish between acceptable protocol behavior and real fairness failures, enabling designers to close the gap between specification and implementation. Ultimately, this work lays the foundation for a reusable plug-in framework for arbmux verification, providing the industry with scalable, robust, and insightful mechanisms to ensure fairness and liveness in next-generation communication fabrics.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionQuantum circuit optimization is a key step toward the efficient execution of quantum algorithms. Template matching has emerged as one of the dominant approaches for simplifying quantum circuits, yet it confronts the intrinsic challenge of accommodating gate commutativity. The state-of-the-art template-matching algorithm relies on a directed acyclic graph (DAG) representation. While this DAG-based technique handles gate commutativity satisfactorily, it fails to preserve the local connectivity inherent to quantum circuits, thereby leading to relatively high matching complexity.
In this paper, we introduce a hypergraph representation (HG), a commutativity-aware representation that collapses commuting gates into a single super-node while retaining all local connectivity. This enables matches to be extended locally without revisiting the rest of the circuit, with incremental updates limited to the immediate neighborhood.
Experimental results demonstrate that our HG matcher achieves an 11× to 606× speed-up over the DAG-based implementation in Qiskit (Iten et al., 2022) on both random and arithmetic benchmarks, while maintaining the same optimization quality. The acceleration increases with circuit size, confirming that preserving locality and connectivity is the key to scalable quantum circuit optimization.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionA significant bottleneck in modern physical design is the runtime performance of Place-and-Route (P&R) based filler and DCAP insertion. Current methodologies often require a full removal and reinsertion of fillers during every Engineering Change Order (ECO) cycle. Near tape-out, where multiple iterations are frequent, these excessive delays substantially increase Turnaround Time (TAT) and hinder final design convergence.
To address this, a core innovation has been developed: an OASIS-based filler/DCAP insertion methodology that integrates results back to the P&R environment via Incremental DEF. This flow offers user-controllable options for coverage percentages and filler-type selection—including multi-height and ECO-specific fillers—across diverse designs. Notably, the methodology scales effectively with design complexity; the larger the block or IP, the more significant the runtime benefits achieved, whereas smaller blocks see a more modest value proposition.
This Correct-by-Construction approach leverages signoff-level rule awareness to guarantee DRC-clean insertion, effectively eliminating manual ECO loops and improving predictability. By utilizing RealTime Digital TCL APIs, the flow ensures seamless integration and full compatibility with any standard P&R environment. This optimized approach provides a streamlined path to timing closure and drastically reduces overhead during critical design phases.
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
DescriptionMixture-of-Agents (MoA) is a widely adopted multi-agent paradigm, but existing MoA systems face two major challenges: excessive agent-to-agent connectivity and poor hardware efficiency. To address these two issues, we propose Faster-MoA, a unified algorithm-system co-design for efficient MoA serving. Faster-MoA has three innovations. First, we replace the conventional all-to-all topology with a hierarchical tree structure that introduces structured sparsity in agent connections. Second, we develop a run-time dynamic agent early-exit mechanism that prunes unnecessary agent connections based on output semantic similarity and answer confidence. Third, we propose an agent-dependency-aware incremental prefilling mechanism that overlaps prefilling and decoding among agents with data dependencies to reduce inference latency. Together, these three innovations enable Faster-MoA to reduce end-to-end serving latency by up to 90% while achieving similar (only ±1% variation) or even higher task accuracy compared with MoA baselines using all-to-all agent connection.
People
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionRNS-CKKS is a fully homomorphic encryption scheme supporting fixed-point arithmetic, widely used in privacy-preserving convolutional neural network (CNN) inference.
However, its significant computational overhead, especially from bootstrapping—the most costly operation—raises deployment costs for CNN inference over RNS-CKKS.
While sparsity has proven effective in reducing computational overhead for unencrypted CNN inference, its application to large datasets (e.g., ImageNet) with RNS-CKKS-based CNN inference remains under-explored, particularly in optimizing bootstrapping operations that dominate computation time.
In this work, we observe that sparsity in CNN can be exploited to reduce the bootstrapping overhead in RNS-CKKS-based CNN inference.
Based on this observation, we propose FBS, a framework that accelerates CNN inference over RNS-CKKS by leveraging Fewer Bootstrapping Sparsity to reduce bootstrapping costs.
We propose two sparsity patterns, the eliminate-missing-input sparsity pattern and the channel sparsity pattern, to reduce the number of bootstrapping calls during CNN inference.
An iterative latency optimization framework is then presented to identify the key layers for pruning and determine the sparsity patterns to achieve effective performance.
Results show that FBS can accelerate CNN inference over RNS-CKKS by up to 1.91 times with negligible accuracy loss.
FBS will be open-sourced.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionWith the development of deep neural network (DNN) enabled applications, achieving high hardware resource efficiency on diverse workloads is non-trivial in heterogeneous computing platforms.
Prior works discuss dedicated architectures to achieve maximal resource efficiency. However, a mismatch between hardware and workloads always exists when the workloads are diverse.
Other works discuss overlay architectures that can dynamically switch dataflow for different workloads.
However, these works are still limited by the granularity of their flexibility and incur considerable resource inefficiency.
To solve this problem, we propose a flexible composing architecture, FILCO, that can efficiently match diverse workloads to achieve the optimal storage and computation resource efficiency. FILCO can be reconfigured in real time and flexibly composed into a unified accelerator or multiple independent accelerators. We also propose the FILCO framework, including an analytical model with a two-stage design space exploration (DSE) that can achieve the optimal design point. We evaluate the FILCO framework on the 7nm AMD Versal VCK190 board. Compared with prior works, our design achieves 1.3× to 5× the throughput and hardware efficiency on diverse workloads.
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionIoT messaging protocols face critical security risks from behavior bugs - specification violations that enable unauthorized data access and device compromise. Detecting such bugs requires comprehensive understanding of protocol specifications and communication semantics. This paper introduces ARES, a fully automated framework that extracts executable behavior oracles from protocol specifications for real-time compliance monitoring using LLM-driven behavior filtering. ARES evaluates six widely-used IoT protocol implementations, identifying 25 new bugs with 87.5% precision, including 21 behavior bugs. Of these, 18 have been confirmed or fixed and 10 CVEs assigned due to their severity.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPower signoff demands high-quality test selection, yet teams submit comprehensive suites to power estimation tools like Synopsys PrimePower without quantitative pre-validation. We present a real-time power validation framework with parallel analysis engines extracting multi-dimensional metrics—clock behavior, gating patterns, switching activity, bandwidth profiles—from design verification artifacts during active DV cycles.
Core technical innovation: multi-factor validation under live traffic reveals issues invisible to traditional approaches. Frequency correctness measured under realistic workloads detects violations missed by static checks. Dynamic gating analysis discovers transient efficiency degradation during burst-to-idle transitions—overlooked by idle-only validation. Rolling window bandwidth analysis surgically identifies peak power windows, while activity profiling quantifies switching intensity with documented confidence scores. Packet trace integration enables perf/watt correlation and traffic quality validation.
The framework enables Week 1-2 issue detection versus Week 5-6+ post-formal-tool feedback—fixing when design is fluid versus frozen. Test selection algorithms reduce submissions from dozens to 5-8 representative vectors with quantified coverage. Rapid test swapping completes in minutes versus multi-day formal tool queuing, enabling same-day hypothesis testing via pre-computed scores. Self-service architecture eliminates cross-team overhead, scaling from small IP blocks (minutes) to complex SoCs (hours).
Production deployment demonstrates 4× reduction in formal tool compute, Week 1 detection of dynamic gating issues, and surgical peak-window submission enabling first-time-right results.
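The rolling-window step described above can be sketched as a sliding sum over per-interval bandwidth samples. This is a minimal sketch under stated assumptions: the input is a plain list of numeric samples, the window width is given in samples, and the function name `peak_window` is hypothetical and not part of the framework.

```python
def peak_window(samples, width):
    """Return (start_index, window_sum) of the highest-activity window.

    Uses a running sum so each sample is added and removed exactly once.
    Ties are resolved in favor of the earliest window.
    """
    if width <= 0 or width > len(samples):
        raise ValueError("width must be between 1 and len(samples)")

    window_sum = sum(samples[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(samples) - width + 1):
        # Slide the window by one sample: add the new one, drop the old one.
        window_sum += samples[start + width - 1] - samples[start - 1]
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum


if __name__ == "__main__":
    bandwidth = [1, 3, 2, 8, 9, 1, 0]
    print(peak_window(bandwidth, 2))
```

Only the window returned here would need to be submitted for detailed power estimation, which is the sense in which the abstract speaks of peak-window submission.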
People
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionHigh-accuracy thermal simulation is essential for modern 3D integrated circuits (ICs), but its high computational cost often hinders early-stage, thermal-aware design. To address this, we propose FLASH3D, a fast and versatile analytical simulator for 3D steady-state thermal analysis. FLASH3D integrates spectral modal decomposition, the transfer matrix method, and an accelerated power decomposition algorithm to compute 3D temperature distributions efficiently and accurately. Compared with COMSOL, FLASH3D achieves over four orders of magnitude speedup, reducing computation time from minutes to milliseconds while maintaining a maximum absolute error below 0.5 K. Compared to the state-of-the-art machine learning (ML) method DeepOHeat, within a single inference time, FLASH3D can compute the temperature distribution of roughly 2000 slices and attains approximately 10× lower error. Furthermore, FLASH3D supports complex boundary conditions and fine-grained power maps, including curved-edge and standard-cell-level distributions, overcoming the limitations of conventional analytical methods. These features make FLASH3D an efficient, reliable, and scalable tool for early-stage thermal-aware design, providing a solid foundation for thermal optimization of large-scale 3D ICs.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionPoint-based Neural Networks (PNNs) have become a key approach for point cloud processing.
However, a core operation in these models, Farthest Point Sampling (FPS), often introduces significant inference latency, especially for large-scale processing.
Despite existing CUDA- and hardware-level optimizations, FPS remains a major bottleneck due to exhaustive computations across multiple network layers in PNNs, which hinders scalability.
Through systematic analysis, we identify three substantial redundancies in FPS: unnecessary full-cloud computations, redundant late-stage iterations, and predictable inter-layer outputs that make later FPS computations avoidable.
To address these, we propose FlashFPS, a hardware-agnostic, plug-and-play framework for FPS acceleration, composed of FPS-Prune and FPS-Cache. FPS-Prune introduces candidate pruning and iteration pruning to reduce redundant computations in FPS while preserving sampling quality, and FPS-Cache eliminates layer-wise redundancy via cache-and-reuse. Integrated into existing CUDA libraries and state-of-the-art PNN accelerators, FlashFPS achieves 5.16× speedup over the standard CUDA baseline on GPU and 2.69× on PNN accelerators, with negligible accuracy loss, enabling efficient and scalable PNN inference.
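For readers unfamiliar with the baseline operation, the standard farthest point sampling loop is sketched below in plain Python. It shows the unoptimized algorithm only and none of the FlashFPS pruning or caching; the function name and the tuple-based point format are assumptions made for illustration.

```python
def farthest_point_sampling(points, k, start=0):
    """Select k point indices so that each new point is farthest from those already chosen.

    points: sequence of equal-length coordinate tuples.
    Every iteration scans the full cloud, which is the cost the abstract refers to.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    selected = [start]
    # Squared distance from each point to its nearest selected point so far.
    min_dist = [float("inf")] * n
    while len(selected) < k:
        last = points[selected[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
        # The next sample is the point farthest from the current selection.
        selected.append(max(range(n), key=lambda i: min_dist[i]))
    return selected


if __name__ == "__main__":
    cloud = [(0, 0), (1, 0), (10, 0), (0, 10), (5, 5)]
    print(farthest_point_sampling(cloud, 3))
```

The loop performs k full passes over n points, so its cost grows as O(n·k) per layer. That is why the abstract singles out full-cloud computation and late-stage iterations as targets for pruning.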
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionThe rapid growth of DNA data makes exact sequence matching a central task in bioinformatics. These applications are both compute- and memory-intensive due to exhaustive database scans. However, conventional systems suffer from high I/O overhead caused by frequent page faults when memory capacity is limited. We propose FlashHD, an in-storage sequence matching framework that executes exact matching directly inside 3D NAND flash. FlashHD employs a hierarchical hyperdimensional computing architecture that filters dissimilar sequences across multiple stages while preserving full recall. FlashHD introduces a hierarchy-aware hyperdimensional architecture search that automatically tunes HDC hyperparameters for low latency and energy to optimize the hierarchical searching. Our evaluation reveals that FlashHD significantly outperforms other state-of-the-art systems.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionValidating high-performance AI/HPC network designs presents significant challenges, particularly when verifying multi-path features like ECMP and congestion control that standard Point-to-Point (P2P) environments cannot adequately cover. We propose a comprehensive, SystemVerilog/UVM-based network modeling framework designed to maximize controllability and verification efficiency. This framework integrates four key mechanisms: fine-grained packet control for precise error injection (e.g., drops, reordering), scalable topology modeling to simulate complex structures like Clos networks, per-path link control for bandwidth and delay manipulation, and an adaptive feedback system. By shifting from inefficient random testing to model-driven steering, our approach eliminates structural blind spots and enables the deterministic verification of deep corner cases. This method significantly reduces engineering costs while ensuring robust coverage for complex network IPs requiring physical path diversity.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern on‑chip network (NoC) IP must support extensive configurability to meet the needs of increasingly diverse systems. This flexibility spans high‑level parameters as well as fine‑grain control over topology, channel assignments, virtual channels, link widths, router microarchitecture, and routing policies. While powerful, this level of configurability increases the complexity of IP setup, architectural decision‑making, and system‑level validation.
We present a software‑defined methodology for configuring and validating highly flexible NoC IP. Using Baya Systems' Fabric Studio as an example, we illustrate how a software‑based input model captures design intent and system constraints, enabling automated generation of NoC implementations with complete topology control while ensuring correctness and deadlock avoidance across standalone and multi‑chiplet systems.
A key part of the methodology is a fast C++‑based simulation engine that models traffic flows, bandwidth demands, latency targets, and quality‑of‑service objectives. This allows rapid evaluation of NoC architectures prior to RTL development and supports extensive design‑space exploration that would be impractical using RTL‑centric flows.
By shifting configuration complexity and early performance validation into software, this approach improves scalability, reduces integration risk, and enables engineering teams to deliver highly configurable NoC IP with greater confidence. The presentation summarizes lessons learned in developing and deploying this methodology for industrial NoC design.
People
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionFace-to-Face 3D ICs enable advanced vertical integration but pose critical challenges for clock tree synthesis (CTS). Existing pseudo-3D flows rigidly partition clock paths across dies, fragmenting common paths and crippling Common Path Pessimism Removal (CPPR), resulting in excessive skew and inefficient hybrid bonding terminal (HBT) utilization. We present FlexiCTS, a correct-by-construction CPPR-aware framework featuring adaptive cross-die buffer assignment to maximize path sharing and optimize HBT allocation. Experimental results show that FlexiCTS achieves 4.1× skew reduction and 88% fewer HBTs than state-of-the-art methods, while simultaneously matching 2D timing quality with superior resource efficiency.
People
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionThis paper presents a comprehensive Flip-FET-based VLSI design framework that leverages both front-side and back-side metal layers for congestion-aware routing optimization. Starting from the placement result, we formulate an integer linear programming (ILP) model to determine standard cells' pin assignments that minimize dual-side nets and alleviate regional congestion simultaneously. Routing is then performed with optimal nTSV insertion for dual-side nets using an RSMT construction approach. Experimental results demonstrate that the proposed framework achieves 78.8% reduction in maximum congestion and 68.6% reduction in DRVs over a commercial design tool.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionAccurate computation of the Perturbation Projection Vector (PPV) is essential for analyzing oscillator phase noise. The PPV corresponds to the steady-state solution of the adjoint system in small-signal oscillator analysis, but in large time-constant oscillators, it is often obscured by numerous slow-decaying modes. To address this, we propose a Floquet-based subspace projection (FSP) method, which confines PPV computation to a low-dimensional subspace spanned by the PPV and slow-decaying solutions. Utilizing the biorthogonality condition, we construct a linear system within this subspace to lock the PPV, since slow-decaying solutions are orthogonal to the large-signal derivative. This linear system is small-scale and can be solved directly, which results in high-accuracy PPVs. Numerical results demonstrate that FSP significantly enhances both accuracy and computational efficiency.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionThe rapid evolution of graph neural networks (GNNs) has introduced diverse computational patterns beyond the message-passing paradigm. To support end-to-end inference, existing accelerators need to integrate heterogeneous, loosely-coupled components, incurring significant inter-accelerator communication and low performance density. This paper introduces Florella, an acceleration framework for unifying inconsistent computational patterns in end-to-end GNN inference. We first propose the sliding reduction convention, a declarative language that provides a flexible and hardware-friendly representation for diverse GNN operations. Building upon this, we design a versatile architecture that enables atomic mapping of macro-operations. This architecture is centered around a novel Jacobian-logarithm unit, which enables high hardware reuse across operators by leveraging logarithmic transformation and approximation. Evaluated across a range of GNN models, Florella achieves an average speedup of 2.2x and reduces memory traffic by 2x compared to four state-of-the-art accelerators, while improving performance density by 3.3x.
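The abstract does not describe the internals of the Jacobian-logarithm unit, so the following is only a minimal sketch of the standard identity that name refers to, log(e^a + e^b) = max(a, b) + log(1 + e^(-|a - b|)). The function names are hypothetical and are not taken from Florella.

```python
import math


def log_add_exact(a: float, b: float) -> float:
    """Return log(exp(a) + exp(b)) using the Jacobian logarithm identity."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def log_add_maxlog(a: float, b: float) -> float:
    """Approximation that drops the correction term; the error is at most log(2)."""
    return max(a, b)


def log_mul(a: float, b: float) -> float:
    """In the log domain a multiplication becomes an addition."""
    return a + b
```

With values stored as logarithms, sums and products reduce to a max, an add, and a small correction term, which is one plausible reason such a unit can be shared across different operators.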
People
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionChip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based solutions, current methods have the following limitations: they use random synthetic data for pre-training, require long sampling times, and often result in overlaps due to their dependence on gradient-based solvers during the sampling process. To overcome these issues, we propose FlowPlace, which features mask-guided synthetic data generation, flow-based efficient training with flexible prior injection, and hard constraint sampling for overlap-free layouts. Experiments on OpenROAD and ICCAD 2015 benchmarks show FlowPlace achieves better PPA metrics, 10-50x faster sampling efficiency, and zero overlaps.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIntel has performed full formal verification of CPU execution datapaths for over 25 years. As a case study in verification engineering, we present the formal verification of the integer subset of the VNNI instruction set, which was introduced as part of Intel Deep Learning Boost. An example of an Int16 SIMD MUL (SiMUL) VNNI instruction is VPDPWSSD, which multiplies the individual signed words of the first source operand by the corresponding signed words of the second source operand, producing intermediate signed doubleword results. The adjacent doubleword results are then summed and accumulated in the destination operand. These Int16 and Int8 instructions come in different flavors, depending on whether the sources are signed or unsigned and whether an overflowing intermediate sum is saturated to the maximum magnitude with the correct sign. The underlying FV engine is powered by symbolic simulation, which gives the verification engineer the ability to concretely debug verification complexity and to program around it. Our verification flow, built on top of symbolic simulation using an "FV as first-class software" approach, has enabled us to develop flexible, reusable proofs while providing full data-space coverage for all flavors of SiMUL VNNI instructions.
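The VPDPWSSD behavior described above can be illustrated with a small reference model of a single 32-bit lane. This is a sketch for illustration, not Intel's specification or proof code: the function names are hypothetical, and the `saturate` flag models the saturating flavor the abstract mentions.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def vpdpwssd_lane(dst: int, src1: int, src2: int, saturate: bool = False) -> int:
    """Model one doubleword lane; all arguments are 32-bit bit patterns."""
    total = to_signed(dst, 32)
    for i in range(2):
        # Multiply corresponding signed words, then accumulate both products.
        a = to_signed(src1 >> (16 * i), 16)
        b = to_signed(src2 >> (16 * i), 16)
        total += a * b
    if saturate:
        # Clamp to the signed 32-bit range instead of wrapping.
        total = max(-(1 << 31), min((1 << 31) - 1, total))
    return total & 0xFFFFFFFF
```

For example, `vpdpwssd_lane(0, 0x00020003, 0x00040005)` accumulates 3*5 + 2*4. A model of this kind is the sort of specification a datapath proof would compare the hardware against.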
Engineering Presentation
Design
EDA
DescriptionFormal verification of CRC and ECC hardware does not scale with conventional techniques due to large datapaths, extensive lookup tables, and complex Galois-field arithmetic. Existing approaches rely on monolithic end-to-end properties, bounded proofs, or coarse linearity abstractions, and even then, they typically break down beyond ~512-bit widths.
This paper introduces a theorem-guided formal verification methodology that scales to production-class CRC and ECC designs. The key novelty is a systematic decomposition of functional correctness into reusable, mathematically grounded theorems capturing structural and algebraic invariants, such as linearity, syndrome correctness and consistency, and error-propagation properties of locator polynomials. These theorems are proved independently, and the overall proof is composed incrementally using assume–guarantee reasoning within an industry-standard formal tool (VC Formal).
We have validated the approach on two industrial designs: an IEEE 802.3 CRC with a 5120-bit pipelined datapath, and a BCH DECTED ECC with m = 2047 and t = 2. Prior methods time out, whereas our methodology achieves complete functional verification.
To our knowledge, this is the first demonstration of scalable and complete formal verification of CRC/ECC designs at this scale using a production-ready formal tool, making the approach directly applicable to industrial verification flows.
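The linearity invariant named above can be illustrated in software. The sketch below assumes a CRC with a zero initial value and no final XOR, for which crc(a XOR b) = crc(a) XOR crc(b) holds for equal-length messages. It checks the property on random inputs only, which is not a formal proof and is not the authors' methodology; all names are hypothetical.

```python
import random


def crc_bits(data: bytes, poly: int = 0x04C11DB7, width: int = 32) -> int:
    """Bitwise MSB-first CRC with zero initial value and no final XOR."""
    mask = (1 << width) - 1
    top = 1 << (width - 1)
    reg = 0
    for byte in data:
        reg ^= byte << (width - 8)
        for _ in range(8):
            reg = ((reg << 1) ^ poly) & mask if reg & top else (reg << 1) & mask
    return reg


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def check_linearity(trials: int = 100, length: int = 64, seed: int = 0) -> bool:
    """Sample random message pairs and test crc(a ^ b) == crc(a) ^ crc(b)."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = bytes(rng.randrange(256) for _ in range(length))
        b = bytes(rng.randrange(256) for _ in range(length))
        if crc_bits(xor_bytes(a, b)) != crc_bits(a) ^ crc_bits(b):
            return False
    return True
```

A formal flow would state the same relation as a property over all inputs and then reuse it as an assumption when proving larger end-to-end properties.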
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionTrillion-parameter Transformers across vision, language, and action increasingly rely on reduced-precision floating-point formats (e.g., FP8) for dynamic range and efficiency. While their compute-intensive operations might make in-memory solutions attractive, the incompatibility of FP operations with analog computing renders many optimizations untenable for emerging compute-in/near-memory (CIM/CNM) platforms. To address this incompatibility, we present MANTIS, a foundry-SRAM-compatible, mixed-signal CNM/CIM macro that performs vector integer multiply-accumulate operations in the analog charge domain, while applying per-vector FP8 scaling factors digitally via microscaling (MXFP). The proposed design, in a commercially available 12 nm node, reuses the digital-to-analog converter (DAC) in a successive approximation register (SAR) analog-to-digital converter (ADC) as both a bit-plane charge-domain accumulator and as part of the ADC. The design is implemented as a 4kb macro, operating at 0.8V. Our 8b ADC delivers 7.62 effective number of bits (ENOB), resulting in a precision-scalable efficiency of 22.5 TOPS/W (for MXINT3×MXINT3) to 6.43 TOPS/W (for MXINT8×MXINT3). In end-to-end evaluations on MMLU (5-shot) and HellaSwag (0-shot), our design incurs only a 0.15 percentage point accuracy degradation across multiple open-weight LLMs (vs. the quantized baseline with MXINT8 activations × MXINT3 weights). For reduced-precision activations (MXINT6 / MXINT3 + FP8 scalar), our results track within ±0.3% of the quantized models, with minimal impact from analog computation.
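The split between integer multiply-accumulate and per-vector scaling can be sketched in software. This is a simplified illustration, not the MANTIS datapath: it assumes a power-of-two shared scale per vector rather than a full FP8 scale, the integer sum stands in for the analog charge-domain accumulation, and the function names are hypothetical.

```python
import math


def mx_quantize(values, bits):
    """Quantize a vector to signed `bits`-bit integers with one shared scale."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 1.0
    # Assumption: power-of-two scale chosen so the largest value fits in qmax.
    scale = 2.0 ** math.ceil(math.log2(peak / qmax))
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def mx_dot(acts, weights, act_bits=8, weight_bits=3):
    """Integer dot product followed by a single digital scaling step."""
    qa, sa = mx_quantize(acts, act_bits)
    qw, sw = mx_quantize(weights, weight_bits)
    acc = sum(a * w for a, w in zip(qa, qw))  # integer-only accumulation
    return acc * sa * sw                      # scales applied once per vector
```

Because the scales are applied once per vector pair, the element-wise work stays purely integer, which is what makes it a candidate for analog accumulation.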
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern FPGA designs are highly parameterized, hierarchical, and optimized for top-down methodologies, while ASIC designs demand bottom-up flows that emphasize physical layout and timing closure. Bridging these paradigms is nontrivial due to differences in hierarchy handling, parameter resolution, and constant signal treatment. We present Morph-Localize, an automated methodology integrated into EDA toolchains that converts FPGA designs into ASIC-ready implementations. The approach introduces hierarchical boundary wrapping to preserve design hierarchy while enabling physical partitioning, parameter resolution and uniquification to eliminate runtime variability, and constant propagation with port pruning to reduce routing congestion and improve synthesis efficiency. A top-down equivalence check ensures functional correctness. Applied within IBM's Nexus chip flow, Morph-Localize demonstrates significant improvements in synthesis runtime, area utilization, and timing closure, reducing manual intervention and accelerating time-to-market for complex ASIC projects.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDespite significant advances in routing automation, manual routing remains a critical and time-consuming task in modern VLSI design, particularly for small layout experiments, debugging, and structured or analog routing. This paper introduces Free Hand Routing, a new paradigm that enables designers to sketch routing topologies directly using free-form drawings, while automatically transforming these sketches into technology-correct, DRC-clean, and manufacturable layouts.
The proposed approach extracts shapes and connectivity from free-hand drawings, maps them to technology-specific width and spacing rules, and applies a connectivity-aware geometric transformation flow. This flow aligns shapes to legal routing tracks, restores connectivity using a graph-driven shape extension mechanism, and inserts vias to ensure full 3D connectivity across metal layers. The final layout is exported using a structured, tool-agnostic JSON representation, enabling seamless integration with existing physical design tool chains.
Integrated into an industrial physical design environment, the approach has been used to generate hundreds of test layouts, reducing per-case effort from hours of manual coding to seconds of intuitive sketching.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLow-bit large language models (LLMs) use quantization to compress weights to 1/2/4 bits, significantly shrinking model size while preserving accuracy. Existing work leverages the reduced precision to cut computation. However, roofline analysis on the NVIDIA A100 shows limited inference speedup even with a 6.7× reduction in computation. This limitation is caused by insufficient memory bandwidth. Processing-in-memory (PIM) provides a promising solution for the memory bottleneck by integrating compute near data. To support efficient low-bit LLM inference, PIM should be carefully designed with joint software and hardware optimizations.
This paper presents FreeBit, a PIM-based architecture that unleashes the performance potential of low-bit LLMs. The objective is to capture the low-bit nature to better exploit PIM through hardware-software co-design. At the hardware level, a lookup-table (LUT)-centric architecture is designed to support quantized computation and minimize redundant computation. A sparsity-aware memory optimization is introduced to optimize memory access and leverage PIM bandwidth. At the software level, a static-dynamic decoupled scheduling strategy is presented to exploit PIM parallelism. Experimental results show that FreeBit effectively reduces redundant computation and memory access, and delivers notable performance improvements over CPUs, GPUs, and prior PIM baselines.
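One way a lookup table can replace low-bit multiplications is sketched below for 1-bit weights. This is a generic illustration, not FreeBit's architecture: it assumes weights are encoded as +1/-1 sign bits packed four per index and that the activation length is a multiple of four. All names are hypothetical.

```python
def build_lut(acts4):
    """All 16 signed sums of four activations.

    Bit k of the index set means +acts4[k]; bit k clear means -acts4[k].
    """
    return [
        sum(a if (idx >> k) & 1 else -a for k, a in enumerate(acts4))
        for idx in range(16)
    ]


def lut_matvec(weight_rows, acts):
    """Each weight row is a list of 4-bit indices, one per group of 4 activations."""
    groups = [acts[i:i + 4] for i in range(0, len(acts), 4)]
    luts = [build_lut(g) for g in groups]  # built once, shared by every row
    return [sum(lut[idx] for lut, idx in zip(luts, row)) for row in weight_rows]


def reference_matvec(weight_rows, acts):
    """Straightforward multiply-accumulate used to cross-check the LUT path."""
    out = []
    for row in weight_rows:
        total = 0
        for g, idx in enumerate(row):
            for k in range(4):
                w = 1 if (idx >> k) & 1 else -1
                total += w * acts[4 * g + k]
        out.append(total)
    return out
```

The tables depend only on the activations, so their cost is amortized across all weight rows, and each row then needs one lookup and one add per group instead of four multiply-accumulates.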
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionThe zone interface has emerged as a promising standard in consumer devices to support high random read performance, as its design reduces read amplification and enables efficient read parallelism. However, to support the underlying zone interface, the kernel's I/O subsystem must enforce strictly sequential writes per zone by logically ordering requests, which results in reduced write parallelism at the kernel level. Although prior work mitigates this by offloading the request reordering process to the on-device write buffer, the limited buffer size in consumer devices cannot accommodate the growing volume of requests that require reordering. This paper proposes FreeZone, an on-device write-reordering solution that extends the write buffer size by using the high-performance SLC flash region of storage. FreeZone allows the kernel to submit I/O requests freely, improving parallel write performance. Evaluation demonstrates that FreeZone achieves up to a 2.3× IOPS improvement compared with existing zoned storage architectures.
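The reordering idea can be sketched as a buffer that accepts writes to one zone in any order and commits them only when they line up with the zone's write pointer. This is a toy model, not FreeZone's design: an in-memory dictionary stands in for the SLC staging region, and the class and method names are hypothetical.

```python
class ZoneReorderBuffer:
    """Accept out-of-order writes and commit them strictly sequentially."""

    def __init__(self, start_lba=0):
        self.write_pointer = start_lba
        self.pending = {}    # start LBA -> list of blocks awaiting commit
        self.committed = []  # blocks written to the zone, in order

    def submit(self, lba, blocks):
        """Stage a write; it is committed once all earlier blocks have arrived."""
        if lba < self.write_pointer:
            raise ValueError("write lands behind the zone write pointer")
        self.pending[lba] = list(blocks)
        self._drain()

    def _drain(self):
        # Commit every staged write that now starts exactly at the write pointer.
        while self.write_pointer in self.pending:
            blocks = self.pending.pop(self.write_pointer)
            self.committed.extend(blocks)
            self.write_pointer += len(blocks)
```

The host can call `submit` in whatever order requests complete; the sequential-write rule is enforced in one place, and the amount of staging space bounds how far out of order requests may arrive.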
People
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
DescriptionIn semiconductor manufacturing, contour extraction from scanning electron microscope (SEM) images is essential for accurate lithographic metrology and process optimization.
However, the low contrast, high noise, and complex structures of SEM images lead to blurred contours, making precise contour extraction extremely challenging.
Existing methods struggle to capture detailed geometric features and fail to meet the high-precision requirements.
In this paper, we propose FreqSEM, a frequency-aware contour extraction framework that achieves sub-nanometer-level precision, incorporating Fast Fourier Transform (FFT) for edge feature enhancement and Segment Anything Model 2 (SAM2) for edge localization.
To exploit the characteristics of different frequency bands, we first apply FFT to extract the low- and mid-frequency components of the input image.
Subsequently, the low-frequency component, which preserves the overall brightness distribution and large-scale structures, is used to generate SEM-adapted prompts for SAM2.
Meanwhile, the mid-frequency component, which retains edges, textures, and other fine-scale details, is injected into the image encoder to enhance edge representation.
In terms of the training strategy, to further enhance SAM2's focus on edges, we design an edge-aware loss function with a weight map emphasizing boundaries and a Sobel gradient loss.
With limited training data, our method achieves an EPE mean of only 0.334 nm and a standard deviation of 0.661 nm at the minimum line width of 81 nm, significantly outperforming existing approaches.
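The frequency-band separation can be illustrated on a single scanline. The sketch below uses a naive one-dimensional DFT from the standard library rather than a 2-D FFT on an image, and the band cutoffs are arbitrary assumptions. It shows only the decomposition step, not FreqSEM's prompt generation or encoder injection.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a short scanline)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def band(signal, lo, hi):
    """Keep only bins whose frequency index (distance from DC) is in [lo, hi)."""
    n = len(signal)
    spectrum = dft(signal)
    kept = [c if lo <= min(k, n - k) < hi else 0 for k, c in enumerate(spectrum)]
    return idft(kept)
```

For a step-like scanline such as `[0.0] * 8 + [1.0] * 8`, the low band carries the overall brightness level and the mid band carries the transition at the edge; the low, mid, and remaining high bands sum back to the original signal.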
SKYTalk
Design
DescriptionIn this talk, we examine how the semiconductor industry handled key inflections in the past decade and share perspectives on the path to enabling 7Å in the upcoming years, anticipating the next key inflection point.
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionLog-structured file systems (LFSs) have shown promising potential on emerging Zoned Namespace (ZNS) SSDs. However, current ZNS-enabled LFSs still follow the block-based abstraction inherited from traditional SSDs. This abstraction enforces block-aligned writes and triggers costly read-modify-write operations for unaligned writes.
Since ZNS SSDs natively support arbitrary-sized writes, such block alignment is unnecessary and instead results in significant performance degradation.
To address this issue, we present RangeLFS, an LFS that replaces fixed-size blocks with flexible ranges so that arbitrary-sized file segments can be written directly without alignment.
First, we propose a range-based file structure, where each range directly maps an arbitrary-sized file segment to its device address. These ranges in a file are maintained in an enhanced B+ tree, enabling efficient lookups while reducing cascading address updates and structural modifications.
Second, we propose a range-augmented page cache to support partial-page writes and perform range-based writeback, eliminating the traditional page-granularity constraint that would otherwise prevent arbitrary-sized writes.
Evaluated on a production ZNS SSD, RangeLFS improves write bandwidth by up to 6.95x compared with block-based LFSs, while maintaining comparable read performance across diverse workloads.
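The range-based mapping can be sketched as a sorted index from file offsets to device addresses. This is a simplified stand-in, not RangeLFS's structure: a sorted list searched with `bisect` replaces the enhanced B+ tree, ranges are assumed not to overlap, and all names are hypothetical.

```python
import bisect


class RangeMap:
    """Map arbitrary-sized file ranges to device addresses."""

    def __init__(self):
        self.starts = []  # sorted file offsets, one per range
        self.ranges = []  # parallel list of (file_offset, length, device_addr)

    def insert(self, file_offset, length, device_addr):
        """Record that `length` bytes at file_offset live at device_addr."""
        i = bisect.bisect_left(self.starts, file_offset)
        self.starts.insert(i, file_offset)
        self.ranges.insert(i, (file_offset, length, device_addr))

    def lookup(self, offset):
        """Translate a file offset to a device address, or None if unmapped."""
        i = bisect.bisect_right(self.starts, offset) - 1
        if i < 0:
            return None
        start, length, device_addr = self.ranges[i]
        if offset >= start + length:
            return None
        return device_addr + (offset - start)
```

A 37-byte write is stored as a 37-byte range, so there is no padding to a block boundary and no read-modify-write of the surrounding block.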
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern system-on-chip (SoC) designs face exponential complexity with billions of geometries and hierarchical integration. Using traditional layout versus schematic (LVS) verification in early phases presents significant bottlenecks in complex SoC designs, leading to prolonged design cycles and increased costs across the entire verification flow, since shorts are fundamental connectivity flaws that critically impact and bottleneck essential verification flows such as ESD (electrostatic discharge), LUP (latch-up), and antenna checks. A clean "shorts-free" database is a prerequisite for accurate and efficient execution of these flows.
In this paper, we introduce a novel Targeted Short Isolation (SI) verification methodology designed to transform LVS from a bottleneck into a breakthrough. Our shift-left approach accelerates the full verification flow by enabling early detection and remediation of shorts during the Floorplan and Place & Route stages. By decoupling targeted SI from full LVS complexity, designers can achieve focused remediation without the overhead of comprehensive connectivity checking. Our plug-and-play solution offers remarkable improvements, including up to 14X faster iterations and up to 90% memory savings, significantly reducing hardware requirements and improving design quality through early detection. Ultimately, this leads to a cleaner, "shorts-free" database across all hierarchies earlier in the design flow, enabling faster time-to-market for complex SoC designs.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionLarge-scale hardware verification relies on massive regression testing across thousands of tests, seeds, and configurations. When a previously passing regression begins to fail, identifying the exact code change that introduced the failure is extremely difficult due to non-determinism, merge commits, and the absence of a known clean baseline. Traditional debugging methods require engineers to manually inspect large commit histories, often taking days or weeks and leading to ambiguity in fault attribution.
This work presents a two-stage regression-aware root-cause isolation framework that automatically identifies the precise commit responsible for a regression failure. The framework first locates a valid good baseline within the commit history using regression-validated inputs, and then performs controlled, deterministic isolation between the good and bad states to find the first failing commit. The approach is fully merge-aware, supports non-deterministic regressions, and integrates directly into production regression flows. All decisions are validated using real regression results and logged for full auditability.
The framework reduces root-cause identification from weeks to hours, eliminates manual guesswork, and provides reliable commit-level failure attribution for large-scale verification environments.
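The two stages described above can be sketched as a backward search for a good baseline followed by a binary search for the first failing commit. This is a minimal illustration, not the presented framework. It assumes a linearized commit history in which every commit after the culprit also fails, and it treats a commit as good only if several repeated runs all pass, as a simple guard against non-determinism. The `run_regression` callback and all function names are hypothetical.

```python
def is_good(commit, run_regression, repeats=3):
    """A commit counts as good only if every repeated regression run passes."""
    return all(run_regression(commit) for _ in range(repeats))


def find_good_baseline(history, bad_index, run_regression):
    """Stage 1: step back from the failing commit with doubling strides."""
    stride = 1
    idx = bad_index
    while idx > 0:
        idx = max(0, idx - stride)
        if is_good(history[idx], run_regression):
            return idx
        stride *= 2
    return None  # no passing commit found in the history


def first_bad_commit(history, good_index, bad_index, run_regression):
    """Stage 2: binary search keeping history[lo] good and history[hi] bad."""
    lo, hi = good_index, bad_index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(history[mid], run_regression):
            lo = mid
        else:
            hi = mid
    return history[hi]
```

The doubling stride keeps stage 1 to a logarithmic number of regression runs when the last good commit is far back, and stage 2 is then a standard bisection between the two validated endpoints.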
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionBlock floating point (BFP) is emerging as an attractive data format for edge NPUs, combining wide dynamic range with high hardware efficiency. However, its behavior under hardware faults and its suitability for safety-critical deployments remain largely underexplored. Here, we present the first in-depth empirical reliability study of BFP-based NPUs. Using RTL-level fault injection on NPUs, our bit- and path-level analysis reveals pronounced heterogeneous vulnerabilities and shows that the conventional end-to-end check becomes largely ineffective under nonlinear block scaling. Guided by these insights, we design a fault-tolerant BFP-based NPU microarchitecture that aligns the BFP computational semantics with reliability constraints. The design uses a row/column-wise blocking strategy to decouple the fixed-point mantissa computations from the scalar exponent path, and introduces ultra-lightweight protection mechanisms for each. Experimental results demonstrate that our design achieves near-dual-modular-redundancy reliability with only 3.55% geometric mean performance overhead and less than 2% hardware cost.
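Why the mantissa and exponent paths can differ in vulnerability is easy to see in a toy software model. The sketch assumes a block of integer mantissas that share one power-of-two exponent and compares a single bit flip in one mantissa with a single bit flip in the shared exponent. It does not reproduce the RTL-level fault injection from the study, and the function names are hypothetical.

```python
import math


def bfp_encode(values, mant_bits=8):
    """Encode a block as signed integer mantissas sharing one exponent."""
    qmax = (1 << (mant_bits - 1)) - 1
    peak = max(abs(v) for v in values)
    exp = math.ceil(math.log2(peak / qmax)) if peak else 0
    mants = [round(v / 2.0 ** exp) for v in values]
    return mants, exp


def bfp_decode(mants, exp):
    """Reconstruct real values from mantissas and the shared exponent."""
    return [m * 2.0 ** exp for m in mants]


def flip_bit(value, bit):
    """Model a single-bit fault in an integer field."""
    return value ^ (1 << bit)


def corrupted_count(faulty, golden):
    """Number of block elements whose decoded value changed."""
    return sum(a != b for a, b in zip(faulty, golden))
```

In this model a mantissa fault perturbs one element, while an exponent fault rescales every element in the block, which is consistent with protecting the two paths separately.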
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionProviding accurate Power, Performance, and Area (PPA) feedback for RTL changes in conventional RTL-to-GDS flows is slow and resource-intensive, often taking 5-7 days for 2-3M instance block-level runs. This process requires significant compute resources, extensive cross-team coordination, and lacks standardized reporting, making quick design iterations challenging. Furthermore, RTL teams frequently need relative PPA comparisons between early and refined RTL releases, but the long physical design loop impacts schedules and decision-making.
This presentation introduces a Rapid RTL Evaluation and PPA Feedback methodology using Joules RTL Design Studio. The approach enables fast, reproducible RTL evaluation and PPA analysis, reducing turnaround time from days to hours. By leveraging consistent setup, floorplan prediction, and incremental synthesis, the methodology minimizes cross-team dependencies while ensuring correlation with production flows. Additionally, integrated features like the RTL Debug Assistant System (RDAS) provide standardized reporting, analysis, and tracking, improving decision accuracy and efficiency. This solution empowers RTL teams to make informed edits quickly, accelerating design convergence and RTL signoff while reducing overall project risk.
Engineering Presentation
Design
EDA
Security
Systems
DescriptionWe present a practical pre-silicon methodology for identifying power side-channel leakage in elliptic-curve cryptography (ECC) hardware before tape-out. The approach leverages sign-off–quality activity and power artifacts to expose data-dependent leakage early in the design flow. Cryptography-aware stimuli are applied to ECC RTL to generate switching activity, followed by time-resolved dynamic power estimation and automated Welch's t-test (TVLA) analysis. The target design is an Elliptic Curve Diffie–Hellman (ECDH) core, an ECC-based IP that relies on scalar multiplication, implemented with a Karatsuba-based field multiplier optimized for area–delay efficiency. Initial analysis of the unprotected implementation reveals clear and repeatable leakage during the first cycle of the design, corresponding to public key computation. Although ECC is mathematically robust, complex arithmetic and data-dependent behavior can unintentionally reveal sensitive information through physical side channels if not addressed during implementation. Our goal is to surface such risks using analysis flows already trusted for power, performance, and area validation.
Beyond pass/fail screening, the methodology localizes dominant leakage contributors in time and logic, enabling targeted mitigation. The flow is fully scriptable, reuses standard sign-off tools, and integrates naturally into implementation regressions. Although demonstrated on ECC, the method generalizes to other cryptographic accelerators and security-critical hardware, providing a scalable path to pre-silicon side-channel hardening.
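The Welch's t-test step of a TVLA-style screen can be sketched as follows. This is a minimal illustration, not the authors' flow: it assumes two groups of simulated power traces (fixed-input and random-input) given as plain lists, and the threshold of 4.5 is the commonly used TVLA convention rather than a value stated in the abstract. The names `welch_t` and `leaky_samples` are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    denom = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if denom == 0.0:
        # Degenerate case: no spread in either group.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / denom


def leaky_samples(fixed_traces, random_traces, threshold=4.5):
    """Return the time indices whose |t| exceeds the threshold (assumed 4.5)."""
    n_samples = len(fixed_traces[0])
    flagged = []
    for t in range(n_samples):
        fixed_col = [trace[t] for trace in fixed_traces]
        random_col = [trace[t] for trace in random_traces]
        if abs(welch_t(fixed_col, random_col)) > threshold:
            flagged.append(t)
    return flagged


# Toy traces: sample 0 depends on the input class, sample 1 does not.
fixed = [[10.0 + 0.1 * (i % 3), 1.0 + 0.1 * (i % 3)] for i in range(30)]
rand = [[5.0 + 0.1 * (i % 3), 1.0 + 0.1 * ((i + 1) % 3)] for i in range(30)]
print(leaky_samples(fixed, rand))
```

Because the statistic is computed per time sample, the flagged indices point at when leakage occurs, which is the kind of localization in time the abstract describes.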
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionInverse lithography technology (ILT) generates curvilinear masks for optimal wafer patterns and process windows. In practice, fabs employ ILT based on optical lithography (OL) models to improve wafer pattern fidelity, while mask shops perform mask data preparation (MDP) to ensure mask writability. The MDP process applies geometry-level mask process correction (MPC) guided by electron-beam lithography (EBL) simulations, followed by shot fracturing. However, this disjointed workflow, both between fabs and mask shops, as well as within MDP between MPC and shot fracturing, often results in suboptimal wafer patterns and inefficient mask preparation. This paper presents a novel, unified, end-to-end differentiable framework for co-optimizing wafer pattern fidelity and mask manufacturability. We achieve differentiability by formulating a physics-aware EBL simulator using the inherently differentiable error function (erf), which precisely captures energy deposition at the VSB shot level, enabling exact gradient computation. With OL simulation embedded in the optimization loop, the framework enables refinement of shot parameters to achieve wafer-level optimality. The proposed method unifies the workflow from fab to mask shop, aligning wafer-side lithography objectives with mask-side writability constraints within an end-to-end optimization flow, thereby improving mask manufacturability while preserving wafer pattern fidelity.
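One way to see why an erf-based shot model is differentiable is the sketch below. It assumes a rectangular shot blurred by a single Gaussian point-spread function of width `sigma`; the paper's actual simulator may differ, and the function names and the shot tuple layout are illustrative assumptions only.

```python
import math


def edge_profile(x, lo, hi, sigma):
    """1-D indicator of [lo, hi] convolved with a unit-area Gaussian."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((x - lo) / s) - math.erf((x - hi) / s))


def shot_energy(x, y, shot, sigma):
    """Energy deposited at (x, y) by one rectangular shot (separable in x, y)."""
    x0, y0, x1, y1, dose = shot
    return dose * edge_profile(x, x0, x1, sigma) * edge_profile(y, y0, y1, sigma)


def d_edge_d_lo(x, lo, sigma):
    """Closed-form derivative of edge_profile with respect to its lower edge."""
    return -math.exp(-((x - lo) ** 2) / (2.0 * sigma**2)) / (
        sigma * math.sqrt(2.0 * math.pi)
    )


shot = (0.0, 0.0, 100.0, 100.0, 1.0)  # hypothetical units, e.g. nm
print(shot_energy(50.0, 50.0, shot, 5.0))  # interior of the shot
print(shot_energy(0.0, 50.0, shot, 5.0))   # on the left edge
print(d_edge_d_lo(2.0, 0.0, 5.0))          # sensitivity to moving that edge
```

Since every shot edge enters through a smooth function, a gradient with respect to shot position, size, or dose exists everywhere, which is what lets shot parameters be refined by gradient-based optimization.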
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionMachine learning can reduce 3D-IC thermal analysis runtime from hours to seconds, yet most existing ML-based thermal models train each chip independently from scratch, demanding large datasets while failing to exploit the mathematical equivalence between heat conduction and diffusion-type PDEs. We demonstrate that foundation neural operators pretrained on diverse PDE families enable effective transfer to 3D-IC thermal simulation. Building on this insight, we develop PNO-Therm, a specialized model obtained through targeted fine-tuning, which surpasses the previous state-of-the-art method using less than 20% of the training data. At equal dataset sizes, our method achieves 6–10× lower MAE, 3.5× reduced GPU memory consumption, and over 3× faster training while maintaining approximately 940× speedup versus FEM solvers. Validated across three representative 3D-IC designs, PNO-Therm establishes that pretrained neural operators provide a scalable pathway for high-accuracy thermal modeling under limited data.
People
Keynote
Quantum
Systems
DescriptionIn the span of four decades, quantum computation has evolved from an intellectual curiosity to a potentially realizable technology. I will describe my thesis work on macroscopic quantum tunneling and energy-level quantization that led to the Nobel Prize, as well as several other important experiments that advanced superconducting qubits. Nevertheless, the path toward a full-stack scalable technology is a work in progress. There are significant outstanding quantum hardware, fabrication, architecture, and algorithmic challenges that are either unresolved or overlooked. Here, we show how the road to scaling could be paved by adopting existing semiconductor technology to build much higher-quality qubits and employing system engineering approaches.
People
TechTalk
AI
DescriptionArtificial intelligence is now deeply embedded in the semiconductor design conversation, but separating real engineering impact from aspirational hype remains a challenge for both technical and business leaders. This TechTalk provides a clear, grounded view of where Agentic AI is actually delivering measurable value in chip design workflows today, and where expectations still exceed practical reality.
The talk will examine concrete use cases across the design lifecycle, including architectural exploration, RTL quality analysis, verification acceleration, physical design optimization, and design closure. Rather than focusing on specific tools or products, the presentation will emphasize underlying techniques, deployment patterns, and organizational lessons learned from real-world adoption. Key questions addressed include: What types of design problems are well-suited for modern agentic AI approaches? Where do data availability and model generalization break down? How do teams integrate AI into existing EDA flows without disrupting proven methodologies?
The session will also explore the implications for engineering productivity, time-to-market, and design risk, helping business stakeholders understand where Agentic AI investments pay off, and where caution is warranted. By bridging technical depth with strategic insight, this TechTalk aims to equip DAC attendees with a realistic framework for evaluating and deploying AI in semiconductor design. This presentation aligns with DAC’s AI, Design, and EDA pillars and is intended for design engineers, CAD managers, and technology decision-makers seeking practical guidance rather than marketing narratives.
DAC Pavilion Panel
DescriptionSecurity can no longer be treated as a late-stage feature, a firmware patch, or a checklist item before tapeout. As chips become more complex, more connected, and more dependent on third-party IP, advanced packaging, global supply chains, and long field lifetimes, trust must be established and maintained across the entire silicon lifecycle: from architecture and RTL design through verification, implementation, manufacturing, provisioning, deployment, monitoring, and update.
This panel will examine what it means to build a secure silicon lifecycle in practice. Where should security requirements be captured? Can security verification become part of mainstream EDA signoff? How should teams reason about third-party IP, FPGA bitstreams, hardware roots of trust, device identity, key provisioning, and post-deployment assurance? And who is accountable when a vulnerability appears after a device has shipped?
AI introduces a new silicon-security paradox: it can help designers create hardware faster than ever, potentially scaling vulnerabilities across complex SoCs, while also giving verification and security teams new ways to detect, explain, and remediate those vulnerabilities before tapeout.
Bringing together perspectives from academia, security technology providers, and industry practitioners, this discussion will ask whether the industry needs new tools, standards, processes, or ownership models to make security a continuous discipline across the lifetime of a chip.
SKYTalk
Systems
DescriptionThere is a rapid transition occurring in the world today - companies are driving from an internet economy to an AI economy. AI and HPC performance demands are increasing at a rapid clip and newer innovations in Si, Package and System designs are needed to meet the increased demand and to fully realize the power of AI. The demand for performance surged by nearly 10x over a two-year period (2021-2022), rapidly outpacing Moore’s Law, which traditionally doubles transistor counts roughly every 18 months. Performance and cost at the system level are key vectors that will require optimization across the entire stack to meet these demands. Heterogeneous integration of chiplets in an advanced package is essential to pave the way for sustainable scaling in the AI era.
In this presentation, I will cover our approach to chiplet design, advanced packaging, and interconnect technologies to ensure a fully optimized solution for AI needs. Intel as a system foundry is driving System Technology Co-Optimization (STCO) across the entire stack to achieve these goals. Intel has played a pioneering role in advancing chiplet-based architecture through its leadership in interconnect standardization, packaging innovation, and disaggregated design strategies and I will address key aspects in this talk.
People
TechTalk
AI
DescriptionAs design complexity accelerates, time-to-market pressures intensify, and the specialized EDA workforce declines, the traditional approach of using EDA tools in isolation and then iterating on the design is reaching its limit. Specifically, this shrinking workforce cannot handle iterating on multiple design options while ensuring designs are verified and DRC-clean in ever-shorter timeframes.
This presentation outlines how EDA AI agents can work individually and collaboratively—using advanced reasoning, multi-tool integration, and data analysis—to enable a fully autonomous end-to-end EDA workflow by leveraging direct API integrations, MCP, and Agent skills.
We demonstrate this across the complete EDA workflow: (a) C-to-RTL-to-GDS flows generate RTL and deploy PPA optimization agents; (b) custom IC and analog/mixed-signal workflows leverage autonomous setup builders and intelligent debugging with data flywheel approaches for continuous agent refinement; (c) physical verification benefits from intelligent DRC remediation, run optimization, and SVRF deck generation. Each domain showcases how agentic AI eliminates manual intervention while improving outcomes, both in-tool and across multiple tools.
To tie everything together, we then showcase how Siemens EDA is solving these challenges with the newly launched Fuse EDA AI System and the Fuse Agent, which are purpose-built to enable AI-powered automations in EDA. Lastly, we highlight that completing this transformation represents a new design philosophy: one in which engineers and AI agents collaborate as co-designers, each amplifying the other’s strengths across the full EDA workflow.
Tutorial
DescriptionLarge language models are moving beyond chat interfaces into agentic systems capable of executing complex engineering workflows. Achieving reliable, production-quality outcomes — especially in hardware design — requires disciplined system design, not prompting tricks. This three-hour tutorial takes participants from agentic system fundamentals to a concrete hardware-oriented outcome. In the first phase, attendees learn how to design agentic workflows: defining agents, building custom tools, decomposing tasks, managing context and state, and implementing feedback loops that enable self-correction under ambiguity. In the second phase, the tutorial bridges standard C code and optimized high-level synthesis (HLS) through a structured refinement prompting workflow, where users provide behavioral code alongside hardware constraints such as latency, throughput, and area. Using a 4-bit modular processor and a systolic-array ML accelerator as benchmarks, the tutorial shows how a code agent is guided from generic C to synthesis-ready hardware via microarchitectural directives, bit-accurate data types, hardware-aligned memory interfaces, and LLM-assisted feedback loops to resolve synthesis and timing issues.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionTo perform design space exploration (DSE) of spatial architectures for LLM workloads, we propose FSGen, an agile generator and estimation framework. FSGen targets the key optimization spaces of fused-operator dataflows and multi-level sparsity. Fused operators boost performance by more than 10x with minimal effect on other PPA metrics, and power is improved by 1.4x with similar performance. We propose early-stage models with less than 12.8% error, which drastically reduce DSE runtime. We identify optimal designs and compare their optimized performance against several hardware generators, achieving 58x better efficiency. The source code of FSGen is available online.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDesigning passive components for RF and high-speed ICs at advanced nodes presents significant challenges: manual layout creation, complex DRC compliance, fragmented flows, and limited early accuracy due to lumped models. These issues lead to long iterations, manual S-parameter handling, and increased risk of silicon re-spins.
EMX Designer addresses these challenges with an integrated, automated solution inside Cadence Virtuoso and ADE. It synthesizes DRC-clean parametric cells for inductors using scalable models, meeting electrical targets such as inductance, Q-factor, SRF, and bandwidth. Analysis mode leverages EMX Solver for full electromagnetic accuracy, enabling fine-tuning of all physical parameters (such as shield, dummy fill, and isolation ring) and attaching per-instance S-parameters seamlessly.
Further, EM-in-the-loop optimization through ADE's Advanced Optimization Platform (AOP) with machine learning (Neuron) delivers multi-parameter optimization directly in the design environment. This capability accelerates convergence, explores design space efficiently, and ensures golden EM accuracy without leaving Virtuoso.
By automating synthesis, analysis, testbench setup, and optimization, EMX Designer reduces turnaround from days to hours, improves productivity, and guarantees first-pass silicon success. Its combination of speed, accuracy, and flexibility empowers designers to meet aggressive performance targets at advanced nodes with 10X productivity gain.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionTechnology Computer-Aided Design (TCAD) solves coupled partial differential equations (PDEs) to obtain spatially resolved physical fields that are essential for comprehending device behavior. However, the computationally intensive numerical solution procedures in traditional TCAD make the iterative simulation-optimization loop prohibitively costly. Existing machine learning surrogates typically regress scalar metrics directly from device conditions, consequently discarding the field-level physical information crucial for further analysis. To address this challenge, we propose FUSE-TCAD, a diffusion-based surrogate that generates the joint distribution of physical fields while supporting continuous control over geometry and bias through conditional injection mechanisms. By leveraging the probabilistic generative formulation of diffusion models, FUSE-TCAD captures the underlying statistics of coupled physical fields and preserves cross-field consistency, thereby achieving highly accurate field predictions across diverse device conditions. A physics-aware Sobolev-edge regularization strategy enforces gradient consistency to ensure high fidelity in junction regions during sampling. On SOI-FET datasets, FUSE-TCAD demonstrates robust transferability and produces high-fidelity fields using only 20% target-domain data through efficient transfer learning. Extensive experiments demonstrate that FUSE-TCAD achieves more than 30× speedup over commercial TCAD tools and maintains relative error within 0.8%, thereby supporting scalable, near-real-time device design exploration.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionWeight quantization enables efficient LLM inference but requires mixed-precision computation (e.g., FP16xINT4), which is not efficient in general-purpose processors. Existing accelerators integrate custom mixed-precision arithmetic units but follow the dot-product paradigm, which suffers from numerous multiplications. This work exploits multiplication fusion in low-bit quantization and introduces MFDP, a novel dot-product paradigm that fuses identical-weight multiplications into a single operation. Based on MFDP, we propose FuseDot, an accelerator that optimizes the fusion process by incorporating index-driven generalized matrix multiplication and a hierarchical dual-phase reordering algorithm. Additionally, a cross-PE multiplier-sharing architecture with a half-multiplier scheme further amortizes multiplier costs across multiple PEs. This work achieves 1.51–1.98x speedup and 1.19–1.51x higher energy efficiency over state-of-the-art accelerators.
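The idea of fusing identical-weight multiplications can be illustrated with a small sketch. It assumes that low-bit weights take only a few distinct values, so activations sharing a weight value can be summed first and multiplied once; this is a reading of the abstract, not the FuseDot hardware algorithm, and `naive_dot` and `fused_dot` are hypothetical names.

```python
from collections import defaultdict


def naive_dot(acts, weights):
    """Conventional dot product: one multiplication per element."""
    return sum(a * w for a, w in zip(acts, weights))


def fused_dot(acts, weights):
    """Group activations by weight value, then multiply once per distinct weight.

    Returns (result, number_of_multiplications).
    """
    buckets = defaultdict(float)
    for a, w in zip(acts, weights):
        buckets[w] += a  # accumulate only, no multiplication yet
    result = sum(w * s for w, s in buckets.items())
    return result, len(buckets)


acts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # stand-ins for FP16 activations
weights = [3, -2, 3, 3, -2, 0]         # stand-ins for INT4 weights
print(naive_dot(acts, weights))
print(fused_dot(acts, weights))
```

With INT4 weights there are at most 16 distinct values, so in this sketch the multiplication count is bounded by 16 per dot product regardless of vector length, at the cost of extra additions and index handling.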
People
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionPattern matching has become one of the most predominant industrial solutions in IC design-to-manufacturing flow, with broad applications in layout verification, hotspot detection and Optical Proximity Correction (OPC). However, conventional CPU-based approaches, which rely on rule-based extraction, suffer from excessive false matches and high runtime overhead. These inefficiencies create bottlenecks in both verification and manufacturing workflows at advanced nodes. To address these limitations, we introduce the first fully GPU-accelerated pattern matching framework G-Matcher, with a novel multi-stage, multi-level architecture. Our GPU-oriented development exploits massive parallelism to process ultra large scale layout data. The multi-stage pipeline filters false matches, while the multi-level paradigm utilizes specialized pattern and layout representations to accelerate the search. On industrial datasets, our framework demonstrates a 407× speedup over the 64-thread commercial tool Calibre. On public benchmarks, it achieves 7.1× speedup over a state-of-the-art CPU implementation. Importantly, G-Matcher maintains 100% matching accuracy with zero false matches across all evaluations.
Research Manuscript
G-Power: Architecture-Level GPU Power Modeling with Aggregated Knowledge Foundations from Known GPUs
2:47pm - 3:00pm PDT Monday, July 27 Mtg Room 203C
Systems
SYS4. Embedded System Design Tools and Methodologies
DescriptionGraphics Processing Units (GPUs) have been serving as critical computation resources for large-scale parallel computations. With increasing chip complexity, power efficiency has become an important design objective for modern GPUs. GPU power optimization relies on fast power evaluation, which requires an architecture-level GPU power model. However, because power label collection is time-consuming, only simple microbenchmarks are adopted for training. This reliance on microbenchmarks as training data leads to low accuracy in existing architecture-level GPU power models like AccelWattch.
To address the limitation of microbenchmarks as training data, we propose G-Power, an architecture-level GPU power modeling framework that utilizes additional known GPU chips to provide additional knowledge. G-Power utilizes the aggregated knowledge foundation from additional known GPU chips and then performs fine-tuning on our target GPU. To provide foundations with additional known GPU chips and capture the similarity to utilize these foundations for fine-tuning, G-Power adopts a three-phase algorithm consisting of 1) pre-training with additional known chips, 2) attention-inspired aggregation, and 3) fine-tuning on our target GPU. We evaluate G-Power on four modern NVIDIA GPUs, demonstrating high accuracy. G-Power can achieve a low MAPE of 14% and a high correlation coefficient R of 0.88 on average, which are 22% lower MAPE and 0.36 higher R than AccelWattch.
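For reference, the two accuracy metrics quoted above (MAPE and the correlation coefficient R) can be computed as in this minimal sketch. The power values are made-up illustrative numbers, not data from the paper, and R is taken here to be the Pearson correlation coefficient, which is an assumption.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


measured_w = [100.0, 200.0, 400.0]   # hypothetical measured power (W)
predicted_w = [110.0, 180.0, 400.0]  # hypothetical model output (W)
print(mape(measured_w, predicted_w))
print(pearson_r(measured_w, predicted_w))
```

MAPE penalizes relative error per workload, while R only measures how well the predictions track the trend, so a model can score well on one and poorly on the other; reporting both gives a fuller picture.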
People
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
DescriptionDiffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications.
We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve 1.87× and 2.37× speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
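The statement that CFG doubles the required compute follows from the standard guidance formula, sketched below with a toy stand-in for the denoiser. `CountingModel` and `cfg_step` are hypothetical names, and the sketch shows plain classifier-free guidance only, not GalaxyDiT's reuse scheme.

```python
class CountingModel:
    """Toy denoiser that records how many forward passes it performs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, prompt):
        self.calls += 1
        shift = 1.0 if prompt is not None else 0.0
        return [xi + shift for xi in x]


def cfg_step(model, x, prompt, scale):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    eps_uncond = model(x, None)    # forward pass 1: unconditional
    eps_cond = model(x, prompt)    # forward pass 2: conditioned on the prompt
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


model = CountingModel()
steps = 4
guided = None
for _ in range(steps):
    guided = cfg_step(model, [0.0, 2.0], "a prompt", scale=7.5)
print(guided, model.calls)
```

Each denoising step invokes the model twice, so any scheme that can reuse one of the two branches across steps cuts cost directly, which is why reuse metrics matter for guided video generation.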
Engineering Special Session
AI
EDA
Systems
DescriptionModel-Based Systems Engineering (MBSE) has transformed the design of aerospace and defense (A&D) systems, but a critical gap remains between system-level models and integrated circuit design. This session explores why connecting models across all levels of hardware abstraction is necessary to meet today's modern defense design challenges, what prevents it, and the practical approaches teams are taking today to achieve this level of design maturity.
This session brings together experts from defense systems engineering, EDA tool development, semiconductor design, and MBSE methodology to tackle this integration challenge. Modern defense systems depend on custom silicon that must meet stringent requirements for performance, power, security and real-time operation. When system-level MBSE models can't directly inform chip-level design and verification, programs experience increased risk, cost overruns and integration failures.
This session begins with a 30-minute presentation that examines the current state, the approaches teams are using, and where handoffs typically break down. A 60-minute panel discussion then explores practical paths forward. Rather than waiting for the perfect integrated solution, what bridges can we build now? How can requirements traceability improve? What role can emerging technologies, such as digital twins and AI-assisted translation, play in this context? What can teams do today to reduce risk, and what requires industry-wide collaboration?
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionLogic optimization is a crucial step in digital circuit synthesis, directly impacting the final power, performance, and area (PPA) of integrated circuits. While reinforcement learning (RL) has shown promise in generating high-quality optimization flows, its limited generalization ability necessitates retraining for each new circuit, hindering practical deployment. To address this, we propose a novel RL-based framework that achieves strong zero-shot generalization across unseen circuits. Our work introduces three key innovations: (1) logic cone extraction to reduce input complexity and enable efficient reward estimation; (2) a cut-weighting mechanism that models global timing effects from local subgraphs; and (3) the integration of Policy Similarity Metric (PSM) to enhance state representation and improve zero-shot transfer. Evaluated on a set of unseen benchmark circuits, our work outperforms state-of-the-art methods, achieving 31.7% lower worst negative slack (WNS) and 32.5% lower total negative slack (TNS), while running in only 13% of the time required by prior approaches. This work demonstrates that generalizable RL can enable fast, high-quality logic optimization without circuit-specific retraining.
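The two timing metrics reported above can be computed from per-endpoint slacks as in this short sketch. The slack values are invented for illustration, and the convention used here (WNS as the minimum slack, TNS as the sum of negative slacks only) is the common one but is an assumption about how the paper reports them.

```python
def worst_negative_slack(slacks):
    """WNS: the smallest (most negative) endpoint slack."""
    return min(slacks)


def total_negative_slack(slacks):
    """TNS: the sum of all negative endpoint slacks; passing endpoints add 0."""
    return sum(s for s in slacks if s < 0.0)


endpoint_slacks_ns = [0.25, -0.5, -0.125, 0.0]  # hypothetical endpoint slacks (ns)
print(worst_negative_slack(endpoint_slacks_ns))
print(total_negative_slack(endpoint_slacks_ns))
```

WNS captures the single worst path, while TNS reflects how widespread the violations are, so an optimization flow is usually judged on both.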
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionThe combined effects of electromigration (EM) and IR drop critically affect the reliability and power integrity of power grids (PGs).
Existing numerical methods are computationally prohibitive for large-scale designs, while most machine learning (ML) approaches analyze EM and IR drop in isolation, overlooking their shared physical structure and correlations in practical flows.
To address this limitation, we propose GEMIR, a graph-based multi-task learning framework for the joint prediction of node-level static IR drop and edge-level EM-induced stress.
GEMIR employs a cross-layer node-edge attention mechanism to effectively capture the mutual dependence between these two physical fields and integrates a physics-informed neural network (PINN) to enhance physical consistency in the EM path.
Furthermore, we establish a composite optimization objective by incorporating physics-informed constraints that embed Kirchhoff's current law (KCL) and Korhonen's PDE to enhance model interpretability.
To manage the inherently coupled yet sometimes conflicting optimization dynamics resulting from these constraints, we then develop a Conflict-Gated (CG) multi-task optimization that adaptively fuses or decouples task gradients based on their alignment, thereby achieving mutual optimality.
Extensive experiments demonstrate that GEMIR outperforms existing single-task and multi-task baselines in accuracy and generalization.
Specifically, it reduces the IR drop MAE by 40.79% and EM-induced stress RMSE by 19.35%, while maintaining high computational efficiency.
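One way to read "adaptively fuses or decouples task gradients based on their alignment" is a cosine-similarity gate. The sketch below is an assumption for illustration only, not GEMIR's published algorithm: the name `conflict_gated_update`, the threshold `tau`, and the decoupling rule (removing each gradient's component along the other, in the style of gradient-surgery methods) are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conflict_gated_update(g_ir, g_em, tau=0.0):
    """Combine two task gradients (plain lists of floats).

    Hypothetical gate: if the cosine similarity is at least `tau`, the
    gradients are summed ("fused"). Otherwise each gradient has its
    component along the other removed before summing ("decoupled").
    """
    sq_ir, sq_em = dot(g_ir, g_ir), dot(g_em, g_em)
    if sq_ir == 0.0 or sq_em == 0.0:
        return [a + b for a, b in zip(g_ir, g_em)], "fused"
    d = dot(g_ir, g_em)
    cos = d / math.sqrt(sq_ir * sq_em)
    if cos >= tau:
        return [a + b for a, b in zip(g_ir, g_em)], "fused"
    # Project out the conflicting component of each gradient.
    ir_proj = [a - (d / sq_em) * b for a, b in zip(g_ir, g_em)]
    em_proj = [b - (d / sq_ir) * a for a, b in zip(g_ir, g_em)]
    return [a + b for a, b in zip(ir_proj, em_proj)], "decoupled"


aligned, mode_aligned = conflict_gated_update([1.0, 0.0], [0.0, 1.0])
combined, mode_conflict = conflict_gated_update([1.0, 0.0], [-1.0, 1.0])
```

In the conflicting case the combined direction no longer opposes either task gradient, which is the intent behind gating on alignment.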
People
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
Description3D Gaussian Splatting (3DGS) has emerged as a popular representation technique for efficient novel view synthesis due to its explicit Gaussian-based formulation. However, achieving real-time rendering at 90 frames per second (FPS) remains challenging. Prior hardware accelerators have explored custom ASIC or FPGA implementations that deliver higher performance than GPUs, but these designs often rely on advanced technology nodes (e.g., 7nm) and substantial on-chip storage, which significantly increase deployment cost and limit practicality. Meanwhile, existing studies largely overlook the potential of modern GPUs, as the 3DGS pipeline lacks native General Matrix Multiplication (GEMM) operations, which are essential for utilizing Tensor Cores. In this paper, we propose GEMM-GS, a GPU acceleration framework that enables efficient 3DGS execution on Tensor Cores through a GEMM-compatible reformulation of the blending stage. GEMM-GS transforms the original blending process into matrix-multiplication-friendly operations and incorporates a high-performance CUDA kernel with a three-stage double-buffered pipeline to overlap computation and memory accesses. The design is plug-and-play on commodity GPUs, requiring no hardware modifications and thus ensuring practical deployability. Experimental results demonstrate that GEMM-GS achieves a 1.42× speedup over the baseline 3DGS implementation and provides an additional 1.47× average speedup when integrated with existing acceleration techniques.
Research Panel
AI
DescriptionGenerative AI is increasingly being applied across electronic design automation (EDA), from architecture exploration to RTL and physical design. At the same time, aerospace and defense chip designs are becoming more complex, with growing reliance on heterogeneous integration, chiplet-based architectures, and advanced packaging to support AI-driven workloads such as sensor fusion and real-time processing. These systems are often tightly coupled to larger platforms and operating environments, placing strong emphasis on architectural choices, integration boundaries, data movement, and security assumptions across chiplet and packaging interfaces early in design. They are also expected to operate reliably under harsh environmental conditions and over long service lifetimes, which further constrains architectural flexibility and integration tradeoffs.
This panel explores how these conditions shape the role of GenAI in aerospace and defense chip design. Panelists will discuss where GenAI fits within chip architecture, integration, and implementation tasks, and whether it can influence decisions that affect system behavior, performance metrics, and integration complexity. The discussion will also examine constraints on data access and sharing, the use of synthetic data, the interaction between GenAI-driven design approaches and emerging trends such as chiplets and 3D heterogeneous integration, the trustworthiness and security of GenAI-based design tools themselves, and the role of emerging GenAI training and optimization approaches, such as iterative self-refinement and multi-step reinforcement learning, in chip design.
1. In which stages of aerospace and defense chip design has GenAI demonstrated practical value, and which stages remain dominated by conventional EDA methods?
2. Can GenAI influence early architectural and integration decisions, such as chiplet partitioning, or does it primarily operate after these choices are set?
3. How does chiplet-based and 3D integration change the role of GenAI across the chip design flow?
4. How do data availability and data sensitivity affect the application of GenAI in defense-oriented chip design, and what role can synthetic data realistically play in this context?
5. Do chiplet-based architectures introduce security assumptions that GenAI struggles to capture at design time, and what security risks arise from relying on GenAI systems that cannot themselves be fully trusted?
6. What technical risks associated with GenAI in chip design deserve the most attention in aerospace and defense applications, and how can these risks be detected, bounded, or mitigated?
7. How might broader adoption of GenAI influence the skill sets required for aerospace and defense chip designers?
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionThe prevailing surrogate-guided paradigm for CPU design space exploration (DSE) suffers from a fundamental credit assignment problem, leading to high sample complexity. This work introduces GenDSE, which reformulates DSE as a sequential decision process. Its contributions are: (1) A Markov decision process that decomposes design configuration generation into context-dependent steps. (2) A GFlowNet-based configuration generator for implicit credit assignment, linking individual decisions to final outcomes. (3) Progressive Sketching, a training paradigm to overcome GFlowNet's data hunger. Experiments show GenDSE reduces simulation costs by 90% on average while achieving superior solution quality.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAt advanced nodes beyond 3nm, IR-drop challenges intensify, demanding solutions that combine signoff-quality results with design cycle efficiency. We present an evolved power grid enhancement methodology that seamlessly integrates with leading place-and-route tools, enabling automated invocation during iterative design stages. Our approach uniquely combines OASIS and LEF/DEF formats: leveraging OASIS for maximum via and stripe insertion with signoff-quality DRC compliance, while LEF/DEF enables operation on designs with power grid shorts common in pre-closure iterations.
The methodology incorporates multi-dimensional timing-aware optimization, including net-type-based spacing, automatic track skipping around clock nets, highly timing-critical net avoidance, and delta spacing across metal layers. These protections enable aggressive power grid enhancement with minimal timing impact. Additionally, enhanced via stapling reinforces vertical connectivity in congested layers using an optimized algorithm that maximizes insertion rate while maintaining fast turnaround time. Applied to multiple designs, via stapling achieves 36.6% insertion improvement over P&R tools with up to 13.68× faster runtime, while power grid stripe insertion achieves 10% average IR-drop reduction (up to 17.3%) with only 2% TNS impact. Both maintain signoff-clean DRC results.
This evolution of our DAC 2024 work delivers signoff-quality fixes earlier in the design cycle and tolerates power grid shorts, streamlining the designer experience and accelerating closure.
Exhibitor Forum
DescriptionAs semiconductor designs advance toward trillion-transistor systems, general-purpose infrastructure can become a bottleneck, slowing down design velocity and impacting your bottom line. In this session, we'll provide a technical deep dive into the workload-optimized EDA stack that Google uses to tape out its own data center chips, including the Ironwood TPU and ARM-based Axion processors, entirely on Google Cloud.
We'll explore how to deploy high-performance compute and storage foundations to meet the intense demands of modern chip design. We'll detail how our H4D and C4A instances, paired with Google Cloud NetApp Volumes (GCNV), are engineered to handle the rigorous I/O and low-latency requirements of large-scale verification and physical design. Furthermore, we'll demonstrate how Alphabet is integrating AI directly into the production flow, including using agents to optimize Verilog code and achieve targeted improvements in Power, Performance, and Area (PPA), effectively de-risking complex design cycles through a cloud-native, AI-integrated approach.
People
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionGPU memory errors are a critical threat to deep learning (DL) frameworks, leading to crashes or even security issues. We introduce GPU-FUZZ, a fuzzer that addresses this issue by modeling operator parameters as formal constraints. GPU-FUZZ utilizes a constraint solver to generate test cases that systematically probe error-prone boundary conditions in GPU kernels. Applied to PyTorch, TensorFlow, and PaddlePaddle, GPU-FUZZ uncovered 13 previously unknown bugs, demonstrating its effectiveness in finding memory errors.
People
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionIn modern high-density Printed Circuit Board (PCB) design, the conventional sequential workflow often leads to a fractured ground plane, compromising signal integrity and system performance. Existing Electronic Design Automation (EDA) tools passively fill unoccupied space with copper post-routing, lacking the ability to modify wire segments to improve ground plane continuity. In this paper, we propose GRACE, a Ground Plane Generation and Re-routing Aware Co-design Engine: a novel, topology-aware framework that shifts the paradigm from passive filling to an integrated co-design process. GRACE formulates ground plane generation as an optimization problem tightly coupled with automated rerouting. It partitions the layout into copper-pourable regions and blockages, and abstracts this into a novel weighted graph model, the Ground Plane Graph (GPGraph). We then formulate an Integer Linear Programming (ILP) problem on the GPGraph to identify a globally optimal set of blockages to eliminate. By strategically rerouting a minimal number of power and signal wire segments, GRACE merges isolated copper islands to create a maximal, contiguous ground plane. Experimental results demonstrate that GRACE increases the area of the ground plane by an average of 33.06% and reduces the number of ground plane polygons by 54.34% compared to open-source EDA tools. Moreover, our method achieves a significant speed-up over manual refinement while attaining an equivalent ground plane area and number of ground plane polygons. To the best of our knowledge, this is the first work to automate ground plane unification through an integrated rerouting-aware co-design methodology.
People
Engineering Presentation
Design
EDA
DescriptionCircuit comparison is a fundamental task to determine whether two circuit representations are structurally and electrically equivalent. Modern EDA flows frequently introduce renaming, reordering, and symmetry-preserving transformations even when functionality is unchanged. Existing pattern-matching approaches typically focus on exact topological matches and return a binary match or no-match result, providing little insight into the cause of the mismatch. Consequently, engineers must perform manual, time-consuming root-cause analysis, slowing verification cycles and increasing development cost.
This work presents a graph-isomorphism–based framework for explainable transistor-level circuit comparison. Each circuit is modeled as a labeled graph in which nodes encode device and net semantics (PFET/NFET, supply classification, externality) and edges encode transistor terminal roles (gate/drain/source). Structural correspondence is computed using the VF2 graph isomorphism algorithm, independent of naming. Candidate mappings are validated using semantics-aware electrical rules that preserve gate connectivity, allow drain–source symmetry, and enforce supply and external/internal net constraints. The framework classifies differences into meaningful categories, including exact matches, accepted connect mismatches, label mismatches, and missing or extra elements. The proposed approach enables precise circuit comparison for LVS-style verification, ECO intent validation, and IP-version diffing, significantly reducing false mismatches and manual debug effort while reliably identifying true structural and electrical changes.
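The labeled-graph model described above can be sketched in a few lines. The code below is an assumption-laden illustration, not the presenters' implementation: it uses a brute-force backtracking search as a stand-in for VF2, and the label strings (`"PFET"`, `"supply_high"`, `"external"`), the role tags (`"g"`, `"ds"`), and the helper names are hypothetical. Drain and source share one role tag, which is one simple way to allow drain-source symmetry.

```python
def add_fet(nodes, edges, name, kind, gate, drain, source, net_labels):
    """Add one transistor and its three terminal connections."""
    nodes[name] = kind
    for net, role in ((gate, "g"), (drain, "ds"), (source, "ds")):
        nodes[net] = net_labels[net]
        key = frozenset((name, net))
        # Several terminals may hit the same net, so keep a sorted tuple.
        edges[key] = tuple(sorted(edges.get(key, ()) + (role,)))


def find_mapping(nodes_a, edges_a, nodes_b, edges_b):
    """Return a name-independent node mapping from A to B, or None."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return None
    order = list(nodes_a)

    def consistent(mapping, u, v):
        if nodes_a[u] != nodes_b[v]:
            return False
        return all(
            edges_a.get(frozenset((u, w))) == edges_b.get(frozenset((v, x)))
            for w, x in mapping.items()
        )

    def extend(mapping, used):
        if len(mapping) == len(order):
            return dict(mapping)
        u = order[len(mapping)]
        for v in nodes_b:
            if v not in used and consistent(mapping, u, v):
                mapping[u] = v
                used.add(v)
                found = extend(mapping, used)
                if found is not None:
                    return found
                del mapping[u]
                used.discard(v)
        return None

    return extend({}, set())


# Reference inverter.
labels_a = {"in": "external", "out": "external",
            "vdd": "supply_high", "gnd": "supply_low"}
nodes_a, edges_a = {}, {}
add_fet(nodes_a, edges_a, "MP", "PFET", "in", "out", "vdd", labels_a)
add_fet(nodes_a, edges_a, "MN", "NFET", "in", "out", "gnd", labels_a)

# Same inverter: renamed, reordered, NFET drain and source swapped.
labels_b = {"a": "external", "y": "external",
            "VDD": "supply_high", "VSS": "supply_low"}
nodes_b, edges_b = {}, {}
add_fet(nodes_b, edges_b, "X2", "NFET", "a", "VSS", "y", labels_b)
add_fet(nodes_b, edges_b, "X1", "PFET", "a", "y", "VDD", labels_b)

# Broken variant: the NFET gate is tied to the output instead of the input.
nodes_c, edges_c = {}, {}
add_fet(nodes_c, edges_c, "X2", "NFET", "y", "y", "VSS", labels_b)
add_fet(nodes_c, edges_c, "X1", "PFET", "a", "y", "VDD", labels_b)

mapping = find_mapping(nodes_a, edges_a, nodes_b, edges_b)
broken = find_mapping(nodes_a, edges_a, nodes_c, edges_c)
```

This sketch still returns only match or no-match; classifying why a mapping fails, as the framework above does, would need additional bookkeeping.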
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionThe expanding connectivity of modern vehicle networks has widened their attack surface. It creates critical vulnerabilities to multi-stage attacks where an initial compromise can lead to control over core vehicle functions. Addressing these vulnerabilities requires precise attack traceback to avoid costly security responses. While traceback is well established in traditional networks, its application to automotive systems faces fundamental barriers. International homologation standards (UN R155, ISO/SAE 21434) require network architectures to maintain certification compliance, rendering traditional traceback methods that modify communication stacks (e.g., packet marking) legally and technically infeasible. The limited resources of production vehicles further exacerbate this challenge. We present the first practical solution to this impasse: a gateway-resident traceback system operating entirely within certification boundaries. Our key insight is to reconstruct attack chains solely from existing intrusion detection system (IDS) alerts across host, Ethernet, and Controller Area Network (CAN) domains by modeling these alerts as nodes in a temporal graph. Edges are admitted only after four plausibility checks (topological, temporal, semantic, and kinematic), ensuring all reconstructed chains are physically and functionally plausible. Our system achieves a 0.90 edge-level F1 score (64% higher than baselines) with a 0.086 false discovery rate, representing a 72% reduction, while using only 1.2% CPU and 26 MB of memory on a production Leapmotor C10. In addition, we release the dataset for traceback.
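The edge-admission idea can be sketched as a filter over ordered alert pairs. Everything specific below is an assumption for illustration rather than the authors' design: the `Alert` fields, the `REACHABLE` domain table, the 5-second window, and the stage-rank semantic rule are hypothetical, and the kinematic check is omitted because the abstract does not say what it inspects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alert_id: str
    domain: str       # "host", "ethernet", or "can"
    timestamp: float  # seconds
    stage: int        # hypothetical attack-stage rank


# Hypothetical topology: which domain can reach which via the gateway.
REACHABLE = {
    ("host", "host"), ("host", "ethernet"),
    ("ethernet", "ethernet"), ("ethernet", "can"),
    ("can", "can"),
}


def topological_ok(a, b):
    return (a.domain, b.domain) in REACHABLE


def temporal_ok(a, b, window=5.0):
    # The cause must precede the effect, within a bounded window.
    return 0 < b.timestamp - a.timestamp <= window


def semantic_ok(a, b):
    # An attack chain should not move back to an earlier stage.
    return b.stage >= a.stage


def build_edges(alerts, checks):
    """Admit edge a -> b only if every plausibility check passes."""
    return [
        (a.alert_id, b.alert_id)
        for a in alerts
        for b in alerts
        if a is not b and all(check(a, b) for check in checks)
    ]


alerts = [
    Alert("h1", "host", 0.0, 0),
    Alert("e1", "ethernet", 2.0, 1),
    Alert("c1", "can", 3.0, 2),
    Alert("c2", "can", 20.0, 2),
]
edges = build_edges(alerts, [topological_ok, temporal_ok, semantic_ok])
```

Because the checks are plain predicates, a fourth check can be appended to the list without changing `build_edges`.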
Workshop
DescriptionThe GREAT Workshop features three independent, GenAI-powered red team challenges. You're invited to take on one, two, or all three. With GenAI lowering the bar to hardware hacking, participation is open to everyone, so DAC attendees, seize the opportunity.
Challenge 1: The AI Hardware Attack (AHA!) Challenge.
Description: The AHA! Challenge invites teams to use generative AI to insert hardware Trojans into a given design. This iteration of the challenge will be targeting the FPGA on the Hackster board from Calico Computer (https://calico.computer). The first phase of the challenge will be entirely virtual and can be done in simulation. However, finalists will be asked to attend the GREAT workshop at DAC 2026 where they will be given the Hackster board to demonstrate attacks in situ.
This challenge is open to academic teams and members of industry. Teams may consist of no more than 4 people, with academic teams expected to include an additional advisor.
Registration form: https://forms.gle/isvikvty5w5sRdnV7
Discord link: https://discord.gg/mVhGtUmu8
The majority of communication for the challenge will be handled over our Discord server, so please join and accept the rules so you can participate in the discussions.
Timeline:
- 23 May: AHA Challenge launch
- 13 June: Phase 1 due
- 17 June: Finalists notified
- 22 June: Phase 2 released
- 26 July: GREAT workshop @ DAC
Organizers: Jason Blocklove, Weihua Xiao, and Ramesh Karri
----------------------------------------------------------------------------------------------------------------
Challenge 2: GenAI-based Hardware Intellectual Property (IP) Redaction Challenge
Description: Hardware IP redaction is emerging as a targeted protection mechanism for confidentiality threats in the modern IC supply chain. Instead of exposing a complete RTL or netlist to an untrusted collaborator, redaction selectively hides critical functions, proprietary IP, and implementation details while preserving enough behavior for simulation, testing, integration, or verification. GenAI can be used by the blue team or by the red team. In the blue-team role, GenAI can identify what should be redacted. In the red-team role, GenAI can help an attacker reconstruct the critical functionality and proprietary IP that has been redacted.
The competition is open to researchers/engineers from both academia and industry.
Registration Form: https://docs.google.com/forms/d/e/1FAIpQLSckFpSZbCj4XOdB4PI-UQiKacB3Qlxbq17sFi1s3VsLmQf5ZA/viewform?usp=publish-editor
Visit Challenge Website for detailed timeline and submission guidelines:
https://sites.google.com/nyu.edu/dac-great-redaction/home
Timeline:
- 23 May: Redaction Challenge launch
- 5 June: Qualification Submission Deadline
- 9 June: Finalists notification
- 10 June: Phase 2 release
- 28 June: Phase 2 Submission Deadline
- 26 July: Redaction Challenge Finals @ DAC
Organizers: Akashdeep Saha, Prithwish Basu Roy, and Ramesh Karri
----------------------------------------------------------------------------------------------------------------
Challenge 3: Autonomous LLM Silicon Red-Teaming & Hardware Exploitation Challenge
Description: As Large Language Models (LLMs) and multi-agent frameworks advance from simple coding assistants to fully autonomous engineering agents, a critical security boundary is crossed: Can AI break silicon security before it even hits the fab? This attack-only challenge focuses on utilizing autonomous LLM agentic flows to exploit, reverse-engineer, and compromise chip designs across the entire hardware lifecycle—from high-level RTL down to the physical gate-level netlist.
Participants will act as the red-team, developing and deploying agentic systems that interface with hardware design tools and simulators to autonomously attack silicon targets (such as identifying flaws, injecting functional Hardware Trojans, or defeating netlist obfuscation). Finalists will be required to showcase a live operational dashboard during the finals that visualizes the LLM's real-time reasoning, tool-use calls, and exploit-success telemetry.
The competition is open to researchers/engineers from both academia and industry, bridging the gap between chip designers and LLM agentic flow developers.
Registration Form: https://forms.gle/uczGvoiWLUHwv1X56
Visit Challenge Website for detailed timeline and submission guidelines (including the 2-page abstract requirements):
https://seth.engr.tamu.edu/group-members/dac-llm-silicon-challenge/
Timeline:
- 23 May: Silicon Red-Teaming Challenge launch
- 15 June: 2-Page Abstract & Qualification Submission Deadline
- 24 June: Finalists notification (Exactly 1-month notice prior to the finals)
- 25 June: Phase 2 / Live Dashboard Environment release
- 15 July: Phase 2 Final Code & Dashboard Freeze
- 26 July: Silicon Red-Teaming Challenge Finals @ DAC 2026 (Long Beach, CA)
Organizers: JV Rajendran and Stephen Muttathil
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionThe workload scheduling and placement problem has been at the core of resource management in various distributed systems, including modern cloud computing systems. As modern cloud systems become larger, more heterogeneous, dynamic, complicated, and energy-hungry, developing a scalable, flexible, adaptive, and effective energy-efficient resource manager for such systems is becoming very challenging. Specifically, existing machine learning-based resource managers cannot scale well, and existing scalable resource managers based on heuristic algorithms do not generate the most energy-efficient solutions. To address these limitations, we propose GREEN, a graph neural network-reinforcement learning (GNN-RL) based cloud resource manager that generates highly energy-efficient workload scheduling and placement solutions and scales up to large cloud systems with hundreds to thousands of servers. GREEN reduces this challenging problem to a graph optimization problem and then uses our novel RL formulation and GNN architecture to generate scheduling and placement solutions for cloud systems at various scales. Through extensive cloud simulations using COSCO and real-world experiments using CloudLab, we found that GREEN's solutions save energy by up to 2.17x compared with those generated by the best previous state-of-the-art (SOTA) resource managers, without compromising Service Level Objective (SLO) metrics. Most importantly, GREEN can generate similarly high-quality scheduling and placement solutions on systems with 100 to 1000 servers in COSCO cloud simulations.
People
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
DescriptionLarge Language Models (LLMs) are rapidly becoming the backbone of modern cloud services, yet their inference costs are dominated by energy consumption on GPUs. Unlike traditional GPU workloads, LLM inference consists of two distinct stages with different characteristics: the prefill phase, which is latency-sensitive and scales quadratically with prompt length, and the decode phase, which progresses token by token with undetermined length. Current GPU power governors (for example, the NVIDIA default) overlook this asymmetry and treat both phases uniformly. The result is mismatched voltage/frequency settings, head-of-line blocking, and excessive energy consumption.
We introduce GreenLLM, a Service-Level Objective (SLO)-aware serving framework that minimizes GPU energy by explicitly separating prefill and decode control. At ingress, requests are routed into length-based queues so short prompts avoid head-of-line blocking, tightening time to first token (TTFT). For prefill, GreenLLM collects short traces on a GPU node, fits compact latency–power models over SM frequency, and solves a queueing-aware optimization to pick energy-minimal clocks per class. During decode, a lightweight dual-loop controller tracks throughput (tokens per second) and adjusts frequency with hysteretic, fine-grained steps to hold tail time between tokens (TBT) within target bounds. Across Alibaba and Azure trace replays, GreenLLM achieves up to a 34% reduction in total energy consumption compared to the default DVFS baseline, with no loss of throughput and an increase in SLO violations of less than 3.5%, demonstrating its effectiveness for energy-efficient LLM serving.
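The hysteretic decode-phase control described above can be illustrated with a minimal sketch. It is a simplified single-loop controller, not GreenLLM's dual-loop design; the class name, dead-band width, step size, and clock limits are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HystereticClockController:
    """Hypothetical governor: nudge the SM clock to hold TBT near a target."""

    target_tbt_ms: float      # tail time-between-tokens target (assumed value)
    band: float = 0.10        # dead band below the target, as a fraction
    step_mhz: int = 15        # fine-grained frequency step (assumed)
    min_mhz: int = 600        # assumed lower clock limit
    max_mhz: int = 1800       # assumed upper clock limit
    freq_mhz: int = 1800      # start at the highest clock

    def update(self, observed_tbt_ms: float) -> int:
        """Return the new clock after one observation of tail TBT."""
        lower = self.target_tbt_ms * (1.0 - self.band)
        if observed_tbt_ms > self.target_tbt_ms:
            # Target violated: raise the clock by one step.
            self.freq_mhz = min(self.max_mhz, self.freq_mhz + self.step_mhz)
        elif observed_tbt_ms < lower:
            # Comfortable slack: lower the clock by one step to save energy.
            self.freq_mhz = max(self.min_mhz, self.freq_mhz - self.step_mhz)
        # Inside the dead band: hold the clock, which avoids oscillation.
        return self.freq_mhz
```

The dead band is what makes the controller hysteretic: small fluctuations of TBT around the target do not trigger a frequency change.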
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionGraph neural networks (GNNs) show strong promise for circuit analysis, but scaling to modern large-scale circuit graphs is limited by GPU memory and training cost, especially for deep models. We revisit deep GNNs for circuit graphs and show that, when trainable, they significantly outperform shallow architectures, motivating an efficient, domain-specific training framework. We propose Grouped-Sparse-Reversible GNN (GSR-GNN), which enables training GNNs with up to hundreds of layers while reducing both compute and memory overhead. GSR-GNN integrates reversible residual modules with a group-wise sparse nonlinear operator that compresses node embeddings without sacrificing task-relevant information, and employs an optimized execution pipeline to eliminate fragmented activation storage and reduce data movement. On sampled circuit graphs, GSR-GNN achieves up to 87.2% peak memory reduction and over 30× training speedup with negligible degradation in correlation-based quality metrics, making deep GNNs practical for large-scale EDA workloads.
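To show why reversible residual modules save activation memory, here is a minimal sketch of the standard additive-coupling reversible block. Whether GSR-GNN uses exactly this coupling is an assumption; the sub-functions `f` and `g` are hypothetical stand-ins for GNN sub-layers.

```python
def f(x):
    """Hypothetical stand-in for one GNN sub-layer."""
    return [0.5 * v + 1.0 for v in x]


def g(x):
    """Hypothetical stand-in for a second GNN sub-layer."""
    return [v * v for v in x]


def forward(x1, x2):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Recompute the inputs from the outputs, so inputs need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2
```

Because `inverse` recovers the inputs from the outputs, a backward pass can reconstruct intermediate activations layer by layer rather than keeping them all in memory.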
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionWe present GTX, a graph-transformer–based framework for accurate parasitic prediction in analog circuits. GTX combines a customized parasitic modeling scheme with HGNNs and global attention to capture both local device interactions and long-range interconnect dependencies. Experiments show GTX achieves higher prediction accuracy than prior ML-based methods and generates predicted-parasitics netlists whose post-layout simulations closely match commercial parasitic-extracted netlists. Moreover, parasitic inference is 232.7x faster than commercial PEX, and simulations using predicted-parasitics netlists are up to 8.4x faster while closely matching the results of commercial parasitic-extracted netlists, demonstrating GTX's effectiveness in accelerating layout-aware analog design.
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionDisaggregated memory systems replicate data across memory nodes to ensure strong consistency and durability, but synchronous replication quickly saturates the compute node's RDMA NIC as the replica count grows. We propose GWrite, a hardware-oriented RDMA primitive, and GWrap, a co-designed replication protocol that offload replication fan-out to memory-side RNICs while preserving one-sided RDMA semantics. A software prototype that emulates GWrite on commodity RNICs achieves up to 2.18× higher throughput in a distributed transaction system and reduces replication-induced throughput degradation from 46.7% to 19.7% in a replicated hash table, showing that exploiting memory-side RNIC outbound capacity can substantially alleviate compute-side IOPS bottlenecks in write-intensive disaggregated deployments.
Work in Progress
DescriptionModern GPUs face a memory capacity wall as large-scale workloads outgrow practical device-memory limits. Scale-up designs with heterogeneous memory (HBM plus DDR and CXL) offer larger footprints, but under memory oversubscription current systems rely on Inter-memory relocation (IMR), which drives high relocation I/O traffic over off-package links and creates severe I/O bottlenecks. We propose GZswap, a hardware-managed, on-device zswap scheme that reserves a portion of GPU memory as a compressed swap region (Zpool). Instead of evicting pages to host memory, GZswap compresses and keeps them on-device, then transparently decompresses them upon reuse, converting relocation I/O traffic over off-package interconnects into on-package memory accesses. As a result, GZswap significantly reduces inter-memory I/O traffic and improves performance and energy efficiency under oversubscription, without requiring any changes to applications.
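As a software analogy only, the compress-on-evict and decompress-on-reuse behaviour can be sketched as below. GZswap itself is hardware-managed; the class `CompressedPool`, its method names, and the use of `zlib` are illustrative assumptions.

```python
import zlib


class CompressedPool:
    """Hypothetical software stand-in for an on-device compressed swap region."""

    def __init__(self):
        self._slots = {}  # page id -> compressed bytes

    def evict(self, page_id, data: bytes) -> None:
        """Compress a page and keep it in the pool instead of moving it off-device."""
        self._slots[page_id] = zlib.compress(data)

    def fault_in(self, page_id) -> bytes:
        """Decompress a page on reuse and free its pool slot."""
        return zlib.decompress(self._slots.pop(page_id))

    def stored_bytes(self) -> int:
        """Total compressed footprint currently held in the pool."""
        return sum(len(blob) for blob in self._slots.values())
```

The benefit depends on how compressible the pages are: a highly compressible page occupies far less pool space than its original size, while incompressible data gains nothing.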
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionWith the advancement of fully homomorphic encryption (FHE), encrypted applications exhibit diverse encryption parameters and computational characteristics. However, existing FHE accelerator architectures are typically fixed and lack the flexibility to efficiently support this diversity, which places urgent demands on the flexibility and efficiency of accelerator design. To address this challenge, we propose Hades, an automated framework for FHE accelerator generation. Hades analyzes the dataflow graph and encryption parameters of a given FHE application, establishes a mapping from the application to hardware architecture, and automatically searches for accelerator configurations optimized for the target workload. The automation capability of Hades allows for flexible hardware realization on FPGAs and also provides design guidance for ASIC-based accelerators. We evaluate Hades on a range of FHE applications with diverse characteristics. Experimental results demonstrate that Hades can effectively exploit application-specific features to automatically generate efficient hardware architectures. We highlight the following results: (1) compared with the state-of-the-art FPGA accelerators Poseidon and FAB, Hades achieves a 1.99× to 6.58× speedup while reducing resource consumption by 50%; (2) compared with the state-of-the-art ASIC accelerators SHARP and CraterLake, it achieves a speedup of more than 3×; (3) it achieves 65%-94% hardware utilization on multiple FHE applications, more than a 2× improvement over manually designed accelerators.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionTiming loops across non-resettable flops represent a subtle yet critical class of design bugs that can escape traditional verification flows and manifest as silicon failures. In NVIDIA's complex SoC and GPU architectures, such loops can lead to unpredictable behavior, deadlocks, and even security vulnerabilities, especially when reset signals are not propagated to all sequential elements. Our innovative solution introduces a formal, automated methodology for identifying timing loops involving non-resettable flops directly at the RTL phase, leveraging advanced static analysis and machine learning-based anomaly detection. By integrating this solution into the early design flow, we enable designers to catch these issues before synthesis, dramatically reducing the risk of costly silicon re-spins.
People
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionPost-training quantization (PTQ) avoids an expensive training process and is effective in accelerating Deep Neural Networks (DNNs). Since the representation capacity of a quantizer decreases exponentially with precision, PTQ encounters fundamental limitations in the low-precision domain. As an alternative, adaptive-precision quantization (APQ) leverages the sparsity in data distribution to reduce precision without sacrificing resolution. This is especially suitable for activations, as they tend to be sparser and more vulnerable than weights. However, existing APQ approaches require data to cluster heavily near zero, which is not guaranteed in asymmetric quantization, the scheme preferred for activations. Furthermore, the varying precision can lead to an unbalanced computational workload, making it difficult to effectively harvest the theoretical performance gain. To solve these problems, we propose a novel quantization and accelerator design, harnessing adaptive-precision (HAP). It leverages a per-group dynamic zero-point to generalize APQ. Moreover, its intra-channel grouping strategy makes it possible to balance variable-precision workload via reordering. Leveraging a channel-level dual-precision weight quantization scheme, it achieves superior accuracy compared with existing PTQ solutions at the level of 4 or 5 bits for a variety of DNN families. On hardware, a novel bit-serial accelerator featuring a lightweight reorder engine is developed. Results show it achieves a 2.65x speedup and a 55% energy reduction on average compared to existing accelerators.
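The idea of a per-group dynamic zero-point can be sketched with plain affine (asymmetric) quantization applied group by group. This is a generic illustration, not HAP's actual scheme; the function names, the floating-point offset used as the zero-point, and the group size are assumptions.

```python
def quantize_group(values, bits):
    """Affine-quantize one group; the group's minimum serves as its zero-point."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [c * scale + zero_point for c in codes]


def quantize_per_group(values, group_size, bits):
    """Split values into consecutive groups, each with its own zero-point."""
    return [
        quantize_group(values[i:i + group_size], bits)
        for i in range(0, len(values), group_size)
    ]
```

With one zero-point per group, a group whose values sit far from zero still uses its full code range, which is the situation a single shared zero-point handles poorly.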
Engineering Presentation
Design
EDA
DescriptionNegative space verification is crucial for hardware security: inputs that are invalid or unexpected can trigger silent and delayed failures that designers often treat as "don't care". If exploitable, these failures can cause security issues. We show that, unlike software fuzzing, formal verification (FV) is well suited to hardware negative space verification because of hardware's hierarchical structure, structured inputs, and subtle failure modes. Our recommended workflow uses Compliance Monitors to organize constraints and checks, with specific expertise deployed in selectively disabling assumptions to enable a particular negative space. Custom properties are used when Compliance Monitors are unavailable, or to mitigate state explosion. We applied this approach to three interface protocols across multiple IPs: several IPs proved clean, early bugs were found and fixed in others, and custom FV enabled efficient post-fix proofs. Benefits include reduced pre-silicon escape costs, faster IP-level verification turnaround, improved specification clarity, and higher implementation quality. For cases where negative space specification is unavailable, we are experimenting with GenAI-based automatic negative specification mining, and our early results are encouraging. Attendees will learn practical patterns and choices, how to document and prove negative behaviors effectively and robustly, and our experiments with GenAI for mining negative space specification.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionBird's-eye-view (BEV) fusion of multi-modal sensor data is a leading approach for high-performance 3D object detection. However, state-of-the-art models rarely exceed 10 FPS even on powerful GPUs, limiting their use in real-time applications. In this paper, we present a neural architecture search (NAS) framework that identifies models achieving both high accuracy and real-time inference. The search space is explicitly designed with hardware constraints to balance accuracy and computational efficiency on resource-limited platforms. To better align LiDAR and camera features in the BEV conversion module, we introduce a lightweight depth estimation network and a LiDAR–camera cross-attention mechanism that enhances detection accuracy with minimal overhead. On the challenging nuScenes benchmark, our model achieves 70.1% mAP, a 2.3-percentage-point improvement over the baseline model, while running at 35.6 FPS on the Nvidia Jetson AGX Orin, demonstrating its suitability for real-time autonomous driving.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionWe propose HBSpec, a hybrid-bonding (HB)-based heterogeneous accelerator for tree-based speculative LLM inference. Unlike existing near-memory processing (NMP)-enabled heterogeneous accelerators that place processing elements (PEs) on DRAM dies with limited computation capability, HBSpec customizes PEs on the incorporated logic die to enhance NMP computation capacity. HBSpec is also equipped with a communication-aware data mapping policy to reduce cross-bank access overhead, an arithmetic intensity-aware scheduler to dynamically assign operators to the most suitable hardware units, and a Branch-KV pool to eliminate memory fragmentation. Experimental results show that HBSpec outperforms NPU-only, NPU+LPDDR, and NPU+HB baselines by 15.56x, 2.36x, and 2.13x, respectively.
People
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionHyperdimensional Computing (HDC) encodes information and data into high-dimensional distributed vectors that can be manipulated using simple bitwise operations and similarity searches, offering parallelism, low-precision hardware friendliness, and strong robustness to noise. These properties are a natural fit for SQL database workloads dominated by predicate evaluation and scans, which demand low energy and low latency over large fact tables. Notably, HDC's noise-tolerance maps well onto emerging ferroelectric NAND (FeNAND) memories, which provide ultra-high density and in-storage compute capability but suffer from elevated raw bit-error rates. In this work, we propose HDDB, a hardware–software co-design that combines HDC with FeNAND multi-level cells (MLC) to perform in-storage SQL predicate evaluation and analytics with massive parallelism and minimal data movement. Particularly, we introduce novel HDC encoding techniques for standard SQL data tables and formulate predicate-based filtering and aggregation as highly efficient HDC operations that can happen in-storage. By exploiting the intrinsic redundancy of HDC, HDDB maintains correct predicate and decode outcomes under substantial device noise (up to 10% randomly corrupted TLC cells) without explicit error-correction overheads. Experiments on TPC-DS fact tables show that HDDB achieves up to 80.6× lower latency and 12,636× lower energy consumption compared to conventional CPU/GPU SQL database engines, suggesting that HDDB provides a practical substrate for noise-robust, memory-centric database processing.
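The noise tolerance that makes HDC attractive for error-prone storage can be illustrated with a small, generic sketch using bipolar hypervectors. It does not reproduce HDDB's encodings; the dimension, the binding of a column vector with a value vector, and all function names are illustrative assumptions.

```python
import random

DIM = 2048  # hypervector dimension (assumed for illustration)


def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Bind two hypervectors by element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]


def similarity(a, b):
    """Normalized dot product: 1.0 for identical vectors, near 0 for unrelated ones."""
    return sum(x * y for x, y in zip(a, b)) / DIM


def corrupt(hv, fraction, rng):
    """Flip a random fraction of entries to emulate raw bit errors."""
    flipped = set(rng.sample(range(DIM), int(fraction * DIM)))
    return [-x if i in flipped else x for i, x in enumerate(hv)]


def match_predicate(stored, column_hv, candidates):
    """Return the candidate value whose encoding is most similar to the stored cell."""
    return max(
        candidates,
        key=lambda name: similarity(stored, bind(column_hv, candidates[name])),
    )
```

Flipping 10% of the entries lowers the similarity to the correct encoding to about 0.8, while unrelated encodings stay near zero, so an equality predicate still resolves correctly without error correction.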
People
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionHeRo is a framework for efficiently running agentic RAG on resource-limited, heterogeneous mobile devices. It builds a performance model that accounts for workload, accelerator features, and memory contention, then uses a lightweight, dependency-guided scheduler with shape-aware decomposition, criticality estimation, and concurrency control. This approach adapts to dynamic workloads and significantly reduces latency, achieving up to 10.94× speedup over prior methods.
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionModern cloud servers routinely co-locate multiple latency-sensitive microservice instances to improve resource efficiency.
However, the diversity of microservice behaviors, coupled with mutual performance interference under simultaneous multithreading (SMT), makes large-scale placement increasingly complex.
Existing interference-aware schedulers and isolation techniques rely on coarse core-level profiling or static resource partitioning, leaving asymmetric hyperthread-level heterogeneity and SMT contention dynamics largely unmodeled.
We present Hestia, a hyperthread-level, interference-aware scheduling framework powered by self-attention. Through an extensive analysis of production traces encompassing 32,408 instances across 3,132 servers, we identify two dominant contention patterns, sharing-core (SC) and sharing-socket (SS), and reveal strong asymmetry in their impact. Guided by these insights, Hestia incorporates (1) a self-attention-based CPU usage predictor that models SC/SS contention and hardware heterogeneity, and (2) an interference scoring model that estimates pairwise contention risks to guide scheduling decisions.
We evaluate Hestia through large-scale simulation and a real production deployment. Hestia reduces the 95th-percentile service latency by up to 80%, lowers overall CPU consumption by 2.3% under the same workload, and surpasses five state-of-the-art schedulers by up to 30.65% across diverse contention scenarios.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionHybrid-bonding 3D DRAM (HB-DRAM) delivers massive in-situ bandwidth but shifts LLM accelerator bottlenecks from off-chip memory to the Network-on-Chip, where decentralized bank-local bandwidth must be coordinated across arrays. We present HiBand, a bandwidth-aware HB-DRAM accelerator that co-designs array-level execution with a four-mode reconfigurable NoC. HiBand combines grouped tensor parallelism and cross-head array-level co-execution to confine most traffic to short-range links while overlapping bandwidth-bound attention with compute-bound feed-forward layers. On LLM models mapped to a 32-array HB-DRAM accelerator, HiBand achieves up to 4.28× speedup over an HBM3 GPU-style accelerator and 1.67× speedup over a state-of-the-art HB-DRAM–based NMP design.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionIn the integrated circuit design flow, a high-level specification is synthesized into a gate-level netlist. However, this synthesis often results in the loss of the association between the high-level and gate-level designs. Maintaining signal correspondences between these two levels is crucial for preserving the design hierarchy, which in turn facilitates design debugging, engineering change orders (ECOs), cross-abstraction-level model training, and other synthesis and verification tasks. In this work, we present an automated approach for recovering the hierarchical boundaries of high-level components in gate-level netlists, even in cases where the synthesis process has completely removed the original boundary signals. The method is implemented as an open-source tool, and the experimental results demonstrate its robustness in recovering hierarchical boundaries. Our results may benefit various potential applications in synthesis and verification.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIR (voltage) drop signoff is a critical bottleneck in modern SOC designs. In large-scale designs (e.g., GPUs, AI), traditional flat IR analysis has become computationally infeasible due to datacenter limitations. The alternative, fragmenting the design and running flat IR drop analysis on multiple partitions in parallel, is time-consuming and risks compromised QoR due to missed boundary conditions, assumed symmetry, and coordination overhead.
This paper proposes a hierarchical methodology to address capacity and performance in the context of multiple concurrent designs. First, an initial EMIR run is completed at the block level with adequate coverage. Second, a hierarchical model of the blocks, called the reduced order model (ROM), is created from these block runs. Finally, the full-chip IR analysis (complete metal stack) is run with these models. The methodology delivers a multiple-fold performance improvement while keeping accuracy within a small margin of error of traditional flat analysis. It thereby enables full-chip and 3DIC IR closure for AI SOCs that was previously impossible.
Keywords: Dynamic IR, hierarchical modelling, hierarchical signoff
Engineering Presentation
EDA
Security
DescriptionAs System-on-Chip (SoC) designs transition toward massive scale and multi-die architectures, achieving near-zero Defective Parts Per Million (DPPM) is increasingly hampered by the limitations of traditional structural Design-for-Test (DFT). Scan-based testing frequently leaves critical coverage gaps in custom circuits, security blocks, and hard-to-detect logic—gaps that often manifest as expensive field returns. While functional verification patterns provide a solution to fill these voids, traditional "flat" SoC-level fault grading is currently rendered unfeasible by prohibitive runtimes that stretch into weeks and memory requirements that dwarf standard server capacities.
This paper introduces a paradigm shift in functional fault grading through a hierarchical, parallelized framework optimized for complex SoCs. Our methodology features two primary innovations:
Automated Identification of Optimal Observation Points: A smart algorithm that pinpoints the internal nodes with the highest impact on fault coverage, significantly reducing the manual effort required for coverage closure.
Hierarchical Partitioning for Parallel Simulation: A modular flow that breaks complex SoCs into sub-scopes, allowing fault grading to run in parallel and dramatically reducing the compute and memory footprint.
The effectiveness of this methodology is validated on a production-grade Tensor SoC. Our results demonstrate a transformative improvement in verification efficiency: a simulation of 10,000 faults, previously considered unfeasible due to convergence and runtime, was completed in hours using standard hardware. By converting weeks of simulation into a predictable two-day workflow, this scalable framework enables faster coverage closure and ensures superior silicon quality for the next generation of high-reliability SoC designs.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionTo prove an IC concept early, critical areas of the design are synthesized using a PD synthesis flow.
IP block data may be incomplete, and IP blocks may be added to a design later. Manually defining physical parameters of IP blocks based on functional requirements is an inaccurate and time-consuming step during synthesis. To enable synthesis for timing-critical subsets of incomplete IC designs, a novel PD synthesis methodology demonstrates the detection of any IP blocks of an IC design during synthesis, the scaling of IP blocks to proper sizes, and the placement of the respective blocks, which increases PD synthesis efficiency by 60%. The system comprises a configuration database storing IP block configuration data indexed using block-specific hierarchical identifiers. The method comprises retrieving initial configuration data for IP blocks from the configuration database using the block-specific hierarchical identifiers. A recommender system fetches existing IC configuration data and recommends sizes for blocks of the incomplete IC design. In addition, size changes for IP blocks can be provided to automatically generate updated IP block configuration data by modifying the initial IP block configuration. Based on the changes, the configuration database is updated with the new IP block configuration.
Work in Progress
DescriptionWe present a learning-based transistor-level design optimization framework for digital logic circuits that fully regenerates schematics and layouts after synthesis or place-and-route. Unlike conventional gate-level optimization limited to fixed standard-cell variants, it directly tunes transistor parameters. Bayesian learning with physics-guided regression refines delay and power characteristics, while a layout-in-the-loop engine ensures DRC-clean results through probabilistic search and geometric sampling. To enable fast convergence, the framework adopts a hierarchical structure that synchronizes local optimizations with global updates. Across multiplier, divider, and adder-tree circuits, it achieves up to 12% dynamic-power reduction and 5–10% delay improvement within 28–52 hours using four-core parallelism.
People
Late Breaking Results
DescriptionSoC designs face GPIO encroachment, which forces irregular redistribution layer (RDL) and bump rearrangements on the uppermost BEOL metal layers. Optimising density, pitch, and power grid mesh lacks automated methods. We present a hierarchical two-stage ML framework for rapid RDL and bump encroachment modelling in GPIO-constrained flip-chip SoCs. Stage 1 employs classification and regression with sliding-window feature aggregation to predict instance-level IR drop. These predictions are aggregated spatially into tile-level features that feed Stage 2, which predicts top-metal voltages for each power net. Validated on industrial sub-micron designs with 25M+ instances, the framework achieves 20-50x acceleration with MAE ≤ 1 mV and R² ≥ 0.995. This enables rapid exploration of hundreds of configurations, saving up to 25% in die-area and package-penalty costs for volume production.
Research Manuscript
Systems
SYS4. Embedded System Design Tools and Methodologies
DescriptionReal-time automotive workloads modeled as directed acyclic graphs (DAGs) with probabilistic execution times face severe latency bottlenecks, where shared tasks couple multiple sensor-to-actuator paths.
Minimizing end-to-end latency under such variability requires navigating a complex joint space of task mappings and offsets. To address this, we propose a Pareto-prompted metaheuristic with hierarchical mapping-then-offset refinement, utilizing Pareto dominance to search for non-dominated schedules.
Evaluated on production navigation-on-autopilot workloads running on NVIDIA Orin-N–based vehicles, our method satisfies all constraints and significantly outperforms an industrial static baseline, reducing mean latency by up to 12% and 99th-percentile latency by up to 15%.
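The Pareto-dominance test used to keep non-dominated schedules can be sketched as follows. Each schedule is reduced to a tuple of objectives to be minimized, here assumed to be (mean latency, 99th-percentile latency); the function names and the sample values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (mean latency ms, p99 latency ms) pairs for candidate schedules.
schedules = [(10.0, 30.0), (12.0, 25.0), (11.0, 31.0), (15.0, 24.0), (12.0, 26.0)]
front = pareto_front(schedules)
```

In this sample, (11.0, 31.0) is dominated by (10.0, 30.0) and (12.0, 26.0) by (12.0, 25.0), so three schedules remain on the front for further refinement.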
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionDiffusion Transformer (DiT)-based video generation models (VGMs) achieve state-of-the-art visual quality through global attention modeling but incur heavy computational overhead due to the quadratic complexity of attention. In high-resolution or long-duration videos, attention often dominates inference latency. However, much of this computation is redundant: many attention scores contribute little to the output and can be safely skipped or approximated. Fully exploiting this redundancy requires (1) identifying important regions in large attention maps, (2) determining adaptive retention ratios across heads and blocks, and (3) handling scores of varying importance efficiently. We present HierPAS, a hardware–software co-optimized design that accelerates VGMs using hierarchical precision and adaptive sparsity. HierPAS employs a lightweight eager attention method to estimate attention patterns and a sampling-based entropy analysis to derive head-wise retention ratios with minimal cost. It applies progressively reduced precision to less critical regions and integrates a configurable top-k engine with a unified multi-precision GEMM engine supporting multiple precisions in one datapath. Evaluations show that HierPAS improves energy efficiency by up to 178× over NVIDIA H20 and 7.7× over state-of-the-art accelerators, with negligible loss in video quality.
Work in Progress
DescriptionThis brief identifies a linear increase in the Elmore delay with fan-in size for Dynamic Leakage Suppression (DLS) logic structures, unlike the quadratic increase for static CMOS logic. We leverage this property to design high fan-in (HFI) DLS standard cells (stdcells) that improve the power, performance, and area (PPA) of ultra-low power (ULP) chips for energy harvesting applications, an important domain not addressed by previous HFI works. We present important stdcell design considerations (timing, voltage swing) and PPA improvements using modeling and post-synthesis results on multiple designs (including M0+/RISC-V based SoCs), which show over 53% power, 62% delay, and 6% area reduction.
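The quadratic trend cited for static CMOS can be seen in the textbook Elmore delay of a series transistor stack. The sketch below is only that textbook RC-ladder model, with unit resistance and capacitance as assumed values; it does not model DLS logic, whose linear behaviour is the paper's own result.

```python
def elmore_series_stack(fan_in: int, r: float = 1.0, c: float = 1.0) -> float:
    """Elmore delay at the output of `fan_in` series transistors.

    Node i (counted from the rail) sees i*r of upstream resistance and
    carries capacitance c, so the delay is the sum of i*r*c over all nodes.
    """
    return sum(i * r * c for i in range(1, fan_in + 1))


# The delay follows r*c*n*(n+1)/2, i.e. it grows quadratically with fan-in.
for n in (2, 4, 8):
    print(n, elmore_series_stack(n))
```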
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionNeutral atom quantum computers (NAQCs) are among the most promising computational platforms for quantum computing. Controlling and measuring individual atoms and their states, which often requires multiple imaging and image-analysis procedures, is typically the most time-consuming task during computation and contributes significantly to overall cycle times.
To resolve this challenge, we propose a highly-parallel atom-detection accelerator for tweezer-based NAQCs. Our design builds on an existing state-reconstruction method and combines an algorithm-level optimization with a Field Programmable Gate Array (FPGA) implementation to maximize parallelism and reduce the run time of the image-analysis process. We identify and overcome several challenges for an FPGA implementation, such as introducing a prefetching mechanism to improve scalability and customizing bus transfers to support large bandwidths.
Tested on a Xilinx UltraScale+ FPGA, our design can analyze a 256×256-pixel fluorescence image in just 115 𝜇s, achieving 34.9× and 6.3× speedups over the original and optimized CPU baseline, respectively. Moreover, our accelerator can maintain consistent resource utilization across various atom array sizes, contributing to the ongoing efforts toward scalable and fully integrated FPGA-based control systems for NAQCs.
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionReal-time path planning is essential for autonomous mobile robots operating in complex and dynamic real-world environments. Conventional path planning algorithms are limited by the need to store the entire environment map in memory and to maintain either the cost values of explored nodes or their connectivity relationships. Consequently, both memory and computational loads increase rapidly with the scale and complexity of the environment. This overhead is particularly exacerbated when replanning is frequent due to unpredictable dynamic obstacles. In this paper, we propose the hippocampus-inspired periodic mapping and navigation (HIP-MaN) algorithm. This employs multi-periodic grid modules to encode unique spatial locations as phase combinations, effectively compressing the entire environment into a compact periodic representation. HIP-MaN directly computes the goal direction based solely on the phase differences between grid modules and generates detours only when a collision is predicted, minimizing replanning costs. In a 200x200 m^2 environment, HIP-MaN demonstrates near-optimal path planning quality, showing only a 5-22% increase over the optimal path length while achieving 3-50x and 5-346x faster path generation in static and dynamic environments, respectively.
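To make the idea of encoding a location as a combination of phases more concrete, here is a minimal one-dimensional sketch. It assumes integer positions and three pairwise coprime module periods, and it recovers the signed offset to the goal from per-module phase differences alone. The periods and function names are hypothetical, and this is not the HIP-MaN algorithm itself.

```python
from math import prod
from typing import Tuple

PERIODS = (3, 5, 7)          # hypothetical, pairwise coprime module periods
RANGE = prod(PERIODS)        # 105 distinct positions can be encoded


def encode(x: int) -> Tuple[int, ...]:
    """Phase of position x in each periodic module."""
    return tuple(x % p for p in PERIODS)


def displacement(current: Tuple[int, ...], goal: Tuple[int, ...]) -> int:
    """Signed offset to the goal, recovered from per-module phase differences only."""
    diff = tuple((g - c) % p for g, c, p in zip(goal, current, PERIODS))
    # Brute-force Chinese-remainder decode; fine for a small illustrative range.
    d = next(k for k in range(RANGE) if encode(k) == diff)
    return d if d <= RANGE // 2 else d - RANGE


print(displacement(encode(40), encode(52)))
print(displacement(encode(52), encode(40)))
```

Three small phases are enough to distinguish 105 positions here, which is the sense in which a periodic code can be compact.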
People
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionPrivacy-preserving machine learning aims to defend against adversaries without sacrificing task accuracy.
In latency-critical and resource-constrained settings, existing cryptographic and encoding approaches incur heavy overheads that translate into intolerable delays and energy costs.
We present HoloCode, a hybrid optical–electronic encoding pipeline that delivers strong privacy with sub-5ms latency at a fraction of the energy of prior state-of-the-art.
HoloCode encodes only task-relevant signals while shielding sensitive features, resists inversion attacks, and locks models with a private key to prevent misuse.
HoloCode builds on an edge–cloud collaboration framework, where inference is pushed to the edge to cut latency, at the cost of higher edge energy.
To break this trade-off, we adopt an optical–digital hybrid pipeline that leverages zero-energy optical processing to reduce latency and edge energy simultaneously.
Against strong privacy-preserving baselines, HoloCode achieves 10x faster inference and 50% lower edge energy, while preserving accuracy and resisting private-feature leakage and reconstruction attacks.
People
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionChip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training focuses primarily on wirelength optimization, so they often fail to achieve expert-quality layouts. We identify reward design as the primary cause of the performance gap with experts. Instead of formalizing intricate processes, we circumvent this by learning directly from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionTransformers excel at sequential modeling, but attention and large matrix computations incur high latency and energy. The potential of the hybrid ReRAM-SRAM computing-in-memory (CIM) accelerator is constrained by the redundant attention sparsity. In prior works, exploiting sparsity either introduces significant overhead in Top-K query-key identification or causes accuracy degradation. We present HP-CIM, a hybrid accelerator with a ReRAM-based hash predictor (ReHP) that exploits device variability for low-cost projections and couples ReRAM CAM with a K-winner-take-all (K-WTA) circuit for Top-K selection. Furthermore, an optimizable bias-softmax mechanism compensates for information loss. Across diverse tasks, HP-CIM delivers 9.05–310.04× energy efficiency and 2.48–16.93× speedups over state-of-the-art CIM-based transformer accelerators.
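One software analogue of hash-based Top-K prediction is sign random projection followed by a Hamming-distance search, sketched below. This is an assumption about the general technique, not a description of the ReHP circuit: the Gaussian `planes` merely stand in for the random projections that the abstract attributes to device variability, and all names and sizes are illustrative.

```python
import random
from typing import List


def sign_hash(vec: List[float], planes: List[List[float]]) -> List[int]:
    """One bit per random hyperplane: 1 if the projection is non-negative."""
    return [1 if sum(v * w for v, w in zip(vec, plane)) >= 0 else 0 for plane in planes]


def predict_top_k(query: List[float], keys: List[List[float]],
                  planes: List[List[float]], k: int) -> List[int]:
    """Indices of the k keys whose hash codes are closest to the query's (Hamming distance)."""
    q = sign_hash(query, planes)
    dist = [sum(a != b for a, b in zip(q, sign_hash(key, planes))) for key in keys]
    return sorted(range(len(keys)), key=lambda i: (dist[i], i))[:k]


rng = random.Random(0)
dim, bits = 16, 32
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
query = list(keys[0])        # same vector as keys[0], so its Hamming distance is 0
top = predict_top_k(query, keys, planes, k=3)
print(top)
```

Only the selected keys would then go through exact attention; the Hamming comparison is the kind of operation a CAM can evaluate in parallel.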
People
Research Special Session
AI
DescriptionAutomatic synthesis of circuit topologies is challenging due to the large combinatorial design space, strict connectivity and component constraints, and costly simulator evaluations. Existing approaches, including genetic algorithms and graph- or LLM-based methods, often produce invalid or suboptimal netlists and fail to exploit reusable design abstractions. We present HRLTopo, a hierarchical RL framework with both high-level and low-level policies to synthesize valid, high-performance circuit topologies. Policies are implemented with LLMs and optimized through RL, where the high-level policy learns to propose semantic subgoals that guide exploration, and the low-level policy learns to generate incident-encoded netlists that satisfy constraints while achieving downstream objectives. A structured reward is also proposed to drive this hierarchy. Experiments on power converter topology synthesis show that HRL-Topo produces more valid and higher-quality circuits, improves sample efficiency, and outperforms baselines. The proposed hierarchical RL, incorporating LLM-based goal generation and constraint-aware netlist construction, provides an effective and scalable approach for design automation.
People
Engineering Presentation
EDA
Security
DescriptionGlitch power is the next major optimization frontier and can account for up to 25% of total dynamic power. By the P&R stage it is too late and too difficult to change algorithms or architecture, so it is important to identify such risks and optimize the design at the RTL stage.
In this paper, we propose a new data-driven glitch methodology to analyze glitch power at the RTL stage. The delay-aware glitch propagation is faster and easier to set up than other methodologies. With the glitch source report, designers can identify the highest-impact sources and optimize their design to minimize glitch power.
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionModern large language model (LLM) serving platforms deploy multiple models across different GPUs, requiring routers to direct incoming queries to appropriate LLMs. However, existing routing approaches primarily rely on static model attributes such as size or FLOPs to estimate serving costs. This static cost modeling fails to capture the dynamic behavior of real deployments, where the same model can exhibit vastly different inference latencies depending on hardware type (e.g., H100 vs. V100), current system load (e.g., running and waiting queue lengths), and resource contention (e.g., KV-cache usage and GPU utilization). Such hardware-agnostic routing leads to suboptimal decisions, resulting in SLO violations, queue buildup, and underutilized GPUs. To address these challenges, we present HW-Router, a dynamic routing framework that integrates real-time hardware signals into model selection to enable accurate latency prediction and intelligent, SLO-aware routing decisions. Our approach incorporates model-specific features (architecture, size, input length) alongside hardware metrics including queue lengths, KV-cache utilization, and recent TTFT/TPOT performance, and uses a lightweight latency predictor to estimate per-model-per-GPU serving time. Evaluations across diverse workloads show that HW-Router achieves 3.4–3.9× lower end-to-end latency, 46–48 percentage points higher SLO attainment, 6–8× lower GPU load skew, and a 3.1–3.4× reduction in waiting-queue fraction compared to state-of-the-art router baselines, CARROT and IRT, with only ~200 μs of additional routing overhead and no loss in output quality. These results highlight the importance of real-time hardware feedback for scalable, predictable, and well-balanced multi-LLM serving.
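The routing idea can be sketched with a toy latency predictor. Everything below is an assumption made for illustration: the `Replica` fields, the formula in `predict_latency_ms`, and the numbers are hypothetical and much simpler than the learned predictor the abstract describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str            # hypothetical model/GPU pair
    ttft_ms: float       # recent time to first token
    tpot_ms: float       # recent time per output token
    queue_len: int       # requests waiting on this replica


def predict_latency_ms(r: Replica, out_tokens: int) -> float:
    """Toy predictor: each queued request delays us by roughly one full service time."""
    service = r.ttft_ms + r.tpot_ms * out_tokens
    return (r.queue_len + 1) * service


def route(replicas: List[Replica], out_tokens: int, slo_ms: float) -> Optional[Replica]:
    """Pick the replica with the lowest predicted latency that still meets the SLO."""
    ok = [r for r in replicas if predict_latency_ms(r, out_tokens) <= slo_ms]
    return min(ok, key=lambda r: predict_latency_ms(r, out_tokens)) if ok else None


replicas = [
    Replica("small@V100", ttft_ms=300.0, tpot_ms=40.0, queue_len=0),
    Replica("small@H100", ttft_ms=100.0, tpot_ms=10.0, queue_len=3),
]
choice = route(replicas, out_tokens=100, slo_ms=5000.0)
print(choice.name if choice else None)
```

In this made-up case the idle V100 (4300 ms predicted) beats the faster but busy H100 (4400 ms predicted), which is the kind of decision a router using only static attributes such as model size would miss.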
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionRecent studies show that incorporating Processing-in-Memory (PIM) into neural processing units (NPUs) can significantly improve the efficiency of large language model (LLM) inference on edge devices.
However, the limited compute capability of DRAM-PIM and the use of fixed dataflow strategies still constrain overall performance gains.
To address these limitations, we propose a hybrid PIM architecture that integrates non-volatile spin-orbit torque magnetic random-access memory (SOT-MRAM)-based PIM into existing NPU-PIM systems, fully exploiting the high compute capability and bandwidth of SOT-MRAM together with the large storage capacity of DRAM.
On top of this architecture, we further design an adaptive compute scheduling and dataflow optimization framework.
Using NSGA-II-based multi-objective dataflow space exploration, our framework identifies Pareto-optimal hardware resource allocations and dataflow configurations under different deployment objectives.
Experimental results show that our approach reduces inference latency by up to 8.72× and power consumption by 11.74× compared to NPUs, and further achieves 6.24× lower latency and 7.76× higher power efficiency than NPU-PIM baselines.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionIncremental placement optimization improves power, performance, and area (PPA) by refining cell locations, gate sizing, and buffering under strict physical constraints. Conventional incremental flows, composed of discrete heuristic stages with isolated static timing analyses, often converge slowly and yield inconsistent PPA gains due to the lack of a unified formulation. Recent advances in differentiable placement enable gradient-based refinement on smooth surrogates of wirelength, density, and timing, offering high scalability but limited robustness near legality and timing-closure boundaries. To combine their strengths, we propose HyPAS, a hybrid optimization framework that integrates discrete sign-off optimization with differentiable gradient refinement for placement and sizing co-optimization. The discrete stage, implemented with OpenROAD, ensures legality, timing closure, and buffer-aware repair, while the differentiable stage, built on DREAMPlace, performs gradient-guided placement and sizing using a GNN-based timing-power surrogate. A Straight-Through Estimator bridges continuous gradients with discrete library parameters, enabling end-to-end physical co-optimization. Experiments show HyPAS delivers superior PPA gains with competitive runtime against 2025 ICCAD CAD Contest results and SOTA methods.
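A straight-through estimator can be illustrated without any ML framework. In the sketch below the forward pass snaps a continuous size to the nearest library value, and the backward pass applies the gradient of a toy loss directly to the continuous variable. The size list, the loss 16/q + q (a stand-in for a delay/power trade-off), and the step size are all assumptions; HyPAS itself uses a GNN-based surrogate rather than a closed-form loss.

```python
SIZES = [1.0, 2.0, 4.0, 8.0]   # hypothetical discrete drive strengths in a cell library


def snap(x: float) -> float:
    """Forward pass: round the continuous size to the nearest library value."""
    return min(SIZES, key=lambda s: abs(s - x))


def loss_grad(q: float) -> float:
    """Gradient of the toy loss 16/q + q with respect to the snapped size q."""
    return -16.0 / (q * q) + 1.0


x = 1.0                         # continuous latent size being optimized
for _ in range(20):
    q = snap(x)                 # non-differentiable rounding step
    x -= 0.1 * loss_grad(q)     # straight-through: treat d(snap)/dx as 1
final = snap(x)
print(final)
```

The loop settles on the library size 4.0, where the toy gradient is zero. When the continuous optimum falls between two library values, a plain STE loop like this can oscillate across the rounding boundary.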
People
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionComputing-in-Memory (CIM) is a promising solution to the memory wall, yet most prior studies optimize only one level—macro, accelerator, or architecture—rather than the full stack. This work presents HYPER-CIM, a hierarchical predictive exploration and realizable design flow that integrates all three levels. We build a scalable, fully digital CIM template in 28 nm with tunable accumulation length, local storage length, precision, parallel channels, and pipeline depth. More than 8k silicon-consistent design points were used to train a multi-head hierarchical circuit-optimization (MHCO) surrogate model, which predicts power, performance, and area (PPA) across 297M configurations. The resulting CIM "white-box" model offers circuit-faithful visibility for architecture-level design-space exploration (DSE) and Pareto search. Guided by this flow, we fabricated and silicon-verified four processing-element (PE)-flow CIM macros and one cross-level-flow CIM (CF-CIM) macro in 28 nm CMOS technology. Based on chip test results, the best-performing macro achieves 90.8 TOPS/W and 1.23 TOPS/mm², yielding a figure-of-merit (FoM) improvement of 31.12×–3.8×10⁶× over prior CIM designs. Under identical specifications, the CF-CIM improves energy efficiency from 52.13 TOPS/W to 67.9 TOPS/W and area efficiency from 0.41 TOPS/mm² to 0.51 TOPS/mm².
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionAnalog circuits exhibit rich multi-terminal relationships that are poorly captured by conventional graph-based learning methods. We introduce \coin, a domain-aware foundation model that operates directly on a hypergraph representation tailored to transistor-level behavior. \coin integrates specialized transistor modeling, voltage-aware positional encoding derived from topological distances to VDD/VSS, and a transformer-enhanced hypergraph neural network to improve expressivity and structural discrimination. We evaluate \coin on circuit classification and specification regression.
Experimental results show that \coin significantly outperforms both graph- and hypergraph-based baselines, proving the efficacy of our domain-aware approach.
Research Manuscript
HyperEFs: Adaptive Clique Sampling for Scalable Effective Resistance Estimation in Large Hypergraphs
3:42pm - 3:54pm PDT Tuesday, July 28 Mtg Room 203AB
EDA
EDA7-II. Physical Design and Verification
DescriptionWe propose HyperEFs, a scalable framework for effective resistance estimation in large hypergraphs based on adaptive clique sampling. The proposed sampling strategy enables smooth transitions between star and clique expansions of hyperedges, substantially improving the quality of Krylov subspace embeddings. HyperEFs estimates pairwise effective-resistance distances in large hypergraphs in nearly linear time and integrates seamlessly into state-of-the-art multilevel hypergraph partitioning frameworks, yielding significantly improved solution quality. Extensive experiments on VLSI benchmarks demonstrate up to 100× speedup over HyperEF 2.0 with greatly reduced memory usage. On large-scale hypergraph datasets such as Titan23, our framework achieves superior cut sizes, demonstrating exceptional runtime scalability and efficiency.
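For readers unfamiliar with the two expansions named above, the sketch below builds both for a tiny hypergraph. The 1/(|e|-1) clique weight is one common convention and is an assumption here, as are the function names; the adaptive sampling that moves between the two expansions is the paper's contribution and is not reproduced.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def clique_expansion(hyperedges: List[List[str]]) -> Dict[Edge, float]:
    """Replace each hyperedge with a clique over its nodes, weight 1/(|e|-1) per pair."""
    graph: Dict[Edge, float] = {}
    for e in hyperedges:
        if len(e) < 2:
            continue
        for u, v in combinations(sorted(e), 2):
            graph[(u, v)] = graph.get((u, v), 0.0) + 1.0 / (len(e) - 1)
    return graph


def star_expansion(hyperedges: List[List[str]]) -> Dict[Edge, float]:
    """Replace each hyperedge with an auxiliary hub node connected to every member."""
    graph: Dict[Edge, float] = {}
    for idx, e in enumerate(hyperedges):
        for v in e:
            graph[(f"hub{idx}", v)] = 1.0
    return graph


nets = [["a", "b", "c"], ["c", "d"]]
clique = clique_expansion(nets)
star = star_expansion(nets)
print(len(clique), len(star))
```

A hyperedge with n nodes costs n(n-1)/2 edges as a clique but only n edges as a star, which is why the choice of expansion matters for very large nets.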
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionLiDAR semantic segmentation plays a pivotal role in scene understanding for edge applications such as autonomous driving. Once deployed in the real world, the model must adapt to its surrounding environment through rapid adaptive training and updates, even with limited compute and energy resources on an edge device. Existing segmentation models rely on large neural networks, which need significant memory and computing resources for post-deployment adaptation. To address the above challenges, we introduce HyperLiDAR, the first lightweight LiDAR segmentation framework based on Hyperdimensional Computing (HDC) for adapting to streaming point cloud scans after deployment. HDC is a brain-inspired approach well-suited for efficient on-device learning. HyperLiDAR combines a pretrained feature extractor with HDC training to support lightweight adaptation on edge devices. We further design a buffer selection strategy to mitigate the high data volume in each scan. Extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and three representative devices demonstrate that HyperLiDAR outperforms state-of-the-art segmentation methods and accelerates retraining speed by up to 13.8×.
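The basic HDC training step, bundling encoded samples into one prototype per class and classifying by similarity, can be sketched as follows. Random bipolar hypervectors with injected noise stand in for encoded point features, so the dimensionality, labels, and noise level are all assumptions; the pretrained feature extractor and buffer selection of HyperLiDAR are not modeled.

```python
import random
from typing import Dict, List

D = 2000                      # hypervector dimensionality (hypothetical choice)
rng = random.Random(0)


def random_hv() -> List[int]:
    """Random bipolar hypervector with components in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def noisy(hv: List[int], flip: float) -> List[int]:
    """Flip a fraction of components to imitate a noisy encoded sample."""
    return [-x if rng.random() < flip else x for x in hv]


def train(samples: Dict[str, List[List[int]]]) -> Dict[str, List[int]]:
    """HDC training: bundle (element-wise sum) all encoded samples of a class."""
    return {label: [sum(col) for col in zip(*hvs)] for label, hvs in samples.items()}


def classify(model: Dict[str, List[int]], hv: List[int]) -> str:
    """Predict the class whose prototype has the largest dot product with hv."""
    return max(model, key=lambda label: sum(a * b for a, b in zip(model[label], hv)))


protos = {"road": random_hv(), "car": random_hv()}
data = {label: [noisy(hv, 0.2) for _ in range(5)] for label, hv in protos.items()}
model = train(data)
prediction = classify(model, noisy(protos["car"], 0.2))
print(prediction)
```

Training here is a single pass of additions with no backpropagation, which is what makes this style of learning attractive for on-device adaptation.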
People
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionSparse-matrix sparse-matrix multiplication (SpGEMM) is a key kernel across many domains, yet its performance on modern architectures is dominated by highly irregular data movement. Wafer-scale chips (WSCs) offer unprecedented bandwidth and massive parallelism, making them a compelling platform for SpGEMM. However, sparsity-induced irregularity generates substantial communication overheads, resulting in severe underutilization.
In this work, we present HyperWafer, the first communication-aware SpGEMM framework for WSCs, combining a hypergraph-guided execution model with dedicated architectural support. HyperWafer captures SpGEMM's inherently high-order dataflow dependencies, which arise from multiway row sharing and overlapping reduction scopes, using a weighted hypergraph abstraction that enables communication-centric partitioning and topology-aware mapping aligned with the wafer's physical bandwidth distribution. To efficiently realize this model, HyperWafer integrates a sparsity-aware SpGEMM execution engine that sustains high per-PE throughput under irregular workloads together with a lightweight, congestion-sensitive runtime routing substrate that preserves effective on-wafer communication along hypergraph-optimized paths. Across diverse workloads, HyperWafer delivers average speedups of 979.97x, 125.23x, 12.14x, and 5.14x over state-of-the-art CPU, GPU, FPGA, and WSC SpGEMM implementations.
People
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionModern VLSI designs contain tens of billions of components, making scalable hypergraph partitioning essential for parallel or hierarchical optimization. Although the multilevel partitioning paradigm remains effective, coarsening can distort structural information—especially in hypergraphs with many high-degree hyperedges—leading to substantial refinement overhead and limited scalability. Recent works incorporate spectral information, but only heuristically and without directly targeting the partitioning objective or enforcing constraints, leaving refinement to recover quality. We introduce HySpecPro, a single-level hypergraph partitioner that performs end-to-end optimization in a spectral embedding space. HySpecPro constructs embeddings from a bipartite Laplacian and performs efficient projection-based search, supported by a fully GPU-accelerated implementation. Experiments show that HySpecPro delivers cut quality comparable to state-of-the-art multilevel methods while scaling linearly with the total hyperedge degree.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionThis work proposes a self-compensating p-bit unit and constructs an in-situ adaptive diffusion sampling (iADS) architecture. To the best of our knowledge, this is the first demonstration of in-situ Gaussian noise generation and scaling in hardware. An iADS circuit composed of 256 p-bit units was fabricated and generated MNIST images in experiments. Furthermore, a hardware-calibrated 1,048,576 p-bit iADS architecture, combined with hardware adaptive scaling, achieved training and generation for CIFAR10 (FID=9.96) and CelebA-HQ. This approach achieves an approximately 8× reduction in energy consumption and a roughly 143× reduction in area relative to PRNGs, underscoring its potential for highly efficient acceleration of diffusion models (DMs).
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionDeploying MoEs for multi-user inference demands low latency with tight cost. MoE acceleration strategies cache and prefetch experts, assuming temporal locality and predictable routing. When these fail, wrong experts inflate latency, enabling DoS. We expose this vulnerability in GPU-centric MoE offloading and present Icarus, a gradient-based universal attack injecting an adversarial prefix embedding to disable such acceleration. Icarus combines Temporal-Locality Minimization and Expert-Prefetch Misleading to perturb decoding, plus a scheduler balancing targets with active exploration. Across models/devices, Icarus raises cache replacements by 85.4%, cuts prefetch accuracy by 12.5%, slows decoding to 0.7×, and drives outputs to maximum length.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionWhile microarchitectural exploration for well-established processors relies on cycle-accurate simulators (CAS) for accurate evaluation, exploring without knowing both a microarchitectural optimization's capability to eliminate critical-path latency and its theoretical upper bound often leads to blind iteration. To address this, we propose IdealSim, an efficient and general simulation framework for constructing ideal scenarios on CAS, enabling early-stage identification of promising designs. Applied to the RTL-validated CAS of the XiangShan processor to explore the ideal critical-path value predictor, IdealSim achieves 7.93% higher modeling accuracy than prior work and shows that the critical-path value predictor can deliver an average performance gain of 5.61% (and up to 37.92%) in ideal scenarios. Our results provide an accurate performance upper bound, guiding subsequent microarchitectural trade-offs and optimization of the XiangShan processor.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAdvanced-node standard cell libraries often include thousands of cells characterized across 50+ PVT corners.
Identifying high-risk cells, those most prone to LVF variation or exhibiting abnormal trends across drive strength, VT, or gate length, is critical for:
- Guiding PnR and STA tools to apply Don't Use / Don't Touch (DUDT) constraints
- Estimating SoC-level PPA impact
- Including derates when such cells are used
- Prioritizing cells or arcs for redesign and improvement.
We propose a cell scoring methodology that ranks cells based on Liberty-derived parameters such as nominal delay combined with LVF sigma variation, leakage power, CCSP dynamic current (peak and area), CCSN noise susceptibility and more.
Through the automated ranking system, cells with the worst LVF spread and CCS anomalies were quickly identified, which enabled straightforward analysis, guidance for SoC designers, and subsequent recommendations to aid improvements at the cell level.
Beyond this key value, the work enabled NXP to improve the efficiency of its QA by 1.5x compared with its previous manual ranking methodology.
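A minimal sketch of how such a ranking could be computed, assuming each cell has already been reduced to a few Liberty-derived numbers. The field names, the weights, and the min-max normalization are illustrative assumptions, not the scoring formula used in the work.

```python
from dataclasses import dataclass


@dataclass
class CellMetrics:
    # Hypothetical per-cell summary extracted from Liberty data.
    name: str
    nominal_delay_ps: float
    lvf_sigma_ps: float
    leakage_nw: float
    ccsp_peak_ma: float


# Assumed weights; a real flow would tune these per library and corner.
WEIGHTS = {"sigma_ratio": 0.5, "leakage_nw": 0.2, "ccsp_peak_ma": 0.3}


def _normalize(values):
    """Min-max normalize to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_cells(cells):
    """Return (name, score) pairs sorted from highest risk to lowest."""
    raw = {
        # LVF sigma relative to nominal delay, so slow cells are not penalized twice.
        "sigma_ratio": [c.lvf_sigma_ps / c.nominal_delay_ps for c in cells],
        "leakage_nw": [c.leakage_nw for c in cells],
        "ccsp_peak_ma": [c.ccsp_peak_ma for c in cells],
    }
    norm = {key: _normalize(vals) for key, vals in raw.items()}
    scores = [
        sum(WEIGHTS[key] * norm[key][i] for key in WEIGHTS)
        for i in range(len(cells))
    ]
    names = [c.name for c in cells]
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
```

The cells at the top of the returned list would be the first candidates for DUDT constraints or redesign review.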
Additional Meeting
DescriptionJuly 27th, 2:30 PM – 3:30 PM
Following the successful completion and publication of IEEE 2416-2025, which incorporates significant enhancements in multi-supply modeling, AMS extensions, and improved interoperability across abstraction levels, we will be initiating the next revision cycle of the standard through a new IEEE PAR. This effort is driven by ballot feedback, emerging use cases, and ongoing industry adoption needs around system-level power modeling and signoff integration. To kick off this next phase, we are organizing a one-hour face-to-face working group meeting at DAC, bringing together contributors from industry and academia to align on scope, technical direction, and key focus areas for the upcoming revision. This session is open to interested participants and is intended to foster collaboration and early engagement as we shape the future evolution of IEEE 2416.
Watch for Si2 DAC updates!
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionNeural circuit simulators promise fast alternatives to SPICE but remain impractical due to three fundamental limitations: per-circuit hyperparameter tuning, cold-start problems from lacking cross-task knowledge sharing, and prohibitive online retraining costs that negate computational advantages. We introduce IGNITE, a pretrained neural simulator that addresses these challenges through in-context learning. Built on a self-supervised topology encoder and Prior-data Fitted Network decoder, IGNITE adapts to new circuits via a single forward pass conditioned on a small context set, eliminating online training overhead. An information-theoretic active learning strategy enables intelligent sample selection, maximizing data efficiency. IGNITE transfers knowledge across topologies and technology nodes, achieving R² > 0.9 with only 95 samples on average, a 15.8× improvement over SOTA. Integration with optimization frameworks yields 2-30× speedup for transistor sizing and more than 5× speedup for yield optimization, demonstrating practical viability for industrial analog circuit design.
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionRecent advances in GPU-accelerated hypergraph partitioning have achieved substantial performance gains but remain limited to full partitioning. In particular, the lack of support for incrementality is a critical limitation for many CAD applications, where circuit hypergraphs iteratively undergo incremental modifications as part of optimization loops. To overcome this limitation, we present iHyperG, the first GPU-parallel incremental k-way hypergraph partitioner. iHyperG introduces a scalable delta-based hypergraph data structure for efficient incremental modifications on a GPU, along with an effective incremental partitioning algorithm that rebalances partitions in a single pass and refines only cut-critical vertices. Experimental results show that iHyperG achieves average speedups of 190x for modification and 83x for partitioning over a state-of-the-art GPU-parallel hypergraph partitioner, while maintaining comparable partitioning quality.
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionThe problem of k-way hypergraph partitioning is fundamental with significant applications in various fields, including VLSI design and scientific computing. State-of-the-art hypergraph partitioners commonly employ a multi-level framework encompassing coarsening, initial partitioning, uncoarsening, and refinement phases. However, many existing methods do not scale well to problems requiring a large number of partitions (i.e., large k). In pursuit of exceptionally high solution quality, existing memetic approaches often execute their two key operations, recombination and mutation, by invoking separate, standalone multi-level partitioners. This design choice, however, renders them significantly more time-consuming than standard multi-level partitioners. To make such memetic approaches more practical, we propose an advanced memetic framework, IMPart, which introduces novel recombination and mutation operators and integrates them directly into the uncoarsening phase of a single multi-level framework. This transforms the local searches of different granularities in the traditional multi-level framework into a sophisticated, collaborative search. Experimental results on multiple standard benchmarks demonstrate our framework more effectively escapes local optima and explores the global solution space for higher-quality solutions, substantially outperforming all existing hypergraph partitioners for large-k-way hypergraph partitioning. Our framework highlights a new paradigm for the development of advanced hypergraph partitioners.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSystems in Package (SiP) pose the interesting problem of behaving as one entity whilst being made up of multiple chip(let)s.
From a cybersecurity standpoint, this entails the capability to present itself to the user as if it were a single Root of Trust (RoT).
In this presentation we present mechanisms to implement distributed trustworthiness. It is possible to bootstrap trust in the various constituent chip(let)s during manufacturing.
The initial trust arises from a preliminary authentication, through classical methods, such as:
1. Pre-shared secret
2. Pre-installed certificate
These allow the perimeter of trust to be extended across discrete chiplets, thereby re-aggregating the system. After pairwise authentication, hierarchical services can be launched, such as secure boot, attestation and secure storage.
Secure boot is the function that reports the status of the SiP as it starts up. At minimum it is binary ("genuine" or not), but in general the status is more verbose:
- list of activated / deactivated functions and capabilities
- limitations of some functions
- rescue modes
- version of the BOM
It makes it possible to determine how the system can be used (fully or partially), and lets the user decide on the course of action:
- diagnose / repair
- update for weakness fix or security patching, etc.
Remote attestation is the mechanism to inquire about the healthy state of the system. It can be coupled with some derived services, such as:
- derivation of an SBOM-based identity
- retrieval of sealed data (stored at rest for self)
- attestation of the next layer when the system trustworthiness is hierarchical
Attestation is a well-documented service, described by the IETF and the TCG on the one hand, and evaluated on the other hand under the guidance of the new ISO/IEC 19790:2025 (expected to be the forthcoming FIPS 140-4).
Secure storage is the service that stores information at rest while preserving its integrity and confidentiality. The challenge in a SiP architecture is that one secure file system can be hosted in a non-volatile memory (say an HBM) addressable by different chiplets. The prior mutual authentication of the chiplets makes it possible to manage storage root keys in such a way that access control and isolation are ensured cryptographically. Such a mechanism is akin to L.O.C.K. (Layered Open-source Cryptographic Key management) for CALIPTRA.
The SiP implements holistic security services transparently, despite the underlying disaggregation of the design into multiple chiplets (which have been duly provisioned to be able to authenticate with one another). The authentication can work in different topologies, such as those defined by the Foundation Chiplet System Architecture (FCSA) specification:
- star = "compute and hub system"
- daisy-chain = "compute tile system"
This makes it possible to comply with regulations, such as Platform Firmware Resiliency (SP 800-193) or the European Cyber Resilience Act (EU CRA), by being both (1) secure by design and (2) manageable in the field.
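As an illustration of the pairwise authentication step from a pre-shared secret, the following is a minimal challenge-response sketch using HMAC-SHA256 from the Python standard library. The message layout, the chiplet identifiers, and the function names are assumptions made for this example; the presentation does not specify a protocol, and a production design would add key derivation, replay protection across sessions, and hardware key storage.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Fresh random nonce so that a recorded response cannot be replayed."""
    return secrets.token_bytes(16)


def respond(shared_secret: bytes, challenge: bytes, responder_id: bytes) -> bytes:
    """Prove knowledge of the secret by MACing the responder identity and nonce."""
    return hmac.new(shared_secret, responder_id + challenge, hashlib.sha256).digest()


def verify(shared_secret: bytes, challenge: bytes, responder_id: bytes,
           response: bytes) -> bool:
    """Recompute the expected MAC and compare in constant time."""
    expected = respond(shared_secret, challenge, responder_id)
    return hmac.compare_digest(expected, response)


def mutual_authenticate(secret_a: bytes, secret_b: bytes,
                        id_a: bytes = b"chiplet-A",
                        id_b: bytes = b"chiplet-B") -> bool:
    """Each side challenges the other; both checks must pass."""
    # A challenges B: B answers with its provisioned secret, A checks with its own.
    nonce_for_b = make_challenge()
    b_is_genuine = verify(secret_a, nonce_for_b, id_b,
                          respond(secret_b, nonce_for_b, id_b))
    # B challenges A with an independent nonce.
    nonce_for_a = make_challenge()
    a_is_genuine = verify(secret_b, nonce_for_a, id_a,
                          respond(secret_a, nonce_for_a, id_a))
    return a_is_genuine and b_is_genuine
```

Only after this exchange succeeds for each chiplet pair would hierarchical services such as secure boot or attestation be started in this model.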
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis presentation details a methodology to achieve ISO 26262 ASIL-D compliance for Digital Signal Processors (DSPs) used in automotive Battery Management Systems (BMS). ASIL-D is the most stringent automotive safety level, requiring a 99% Single Point Fault Metric (SPFM). The primary safety mechanism employed is Dual Core Lock Step (DCLS). To validate and improve diagnostic coverage, the team used an ISO 26262 certified tool, which can be used for FuSa design verification up to ASIL-D, to perform exhaustive gate-level fault injection analysis. Diagnostic software libraries were developed and continuously refined to exercise all DSP core functionalities, with each iteration verified to evaluate the diagnostic fault coverage. Constant analysis identified SAFE faults, those related to unused logic blocked by configuration registers. Since these faults cannot violate safety goals, marking them as SAFE improves the SPFM and reduces debug and development effort. Analysis of "End of Trace" (EOT) faults revealed areas where data was not toggling sufficiently, so software vectors were improved using random data to ensure fault propagation through proprietary hardware. In summary, by combining hardware DCLS with advanced constant analysis and refined software workloads, the methodology successfully reaches the 99% SPFM required for automotive safety.
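The effect of reclassifying faults as SAFE can be illustrated with a simplified SPFM calculation. The sketch assumes every injected fault carries the same failure rate, so the metric reduces to one minus the fraction of residual (undetected, safety-relevant) faults. The fault counts used in the usage note are invented for illustration and are not results from the presentation.

```python
def spfm(safe: int, detected: int, residual: int) -> float:
    """Simplified SPFM for a fault-injection campaign.

    Assumes equal failure rates per fault, so
    SPFM = 1 - residual / (safe + detected + residual).
    """
    total = safe + detected + residual
    if total == 0:
        raise ValueError("fault campaign is empty")
    return 1.0 - residual / total


def reclassify_as_safe(safe: int, detected: int, residual: int, proven_safe: int):
    """Move faults shown to be harmless (e.g. by constant analysis) to SAFE."""
    if not 0 <= proven_safe <= residual:
        raise ValueError("cannot reclassify more faults than are residual")
    return safe + proven_safe, detected, residual - proven_safe
```

With hypothetical counts, `spfm(0, 98200, 1800)` is about 0.982, below the 99% target; after `reclassify_as_safe(0, 98200, 1800, 1000)` the same campaign evaluates to about 0.992 without any change to the hardware.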
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThis paper presents Boundary Suppressed K-Means Quantization (BS-KMQ), a nonlinear (NL) quantization method to reduce analog-to-digital converter (ADC) resolution in in-memory computing (IMC) systems. ReLU and clamping often cause value accumulation near distribution edges, leading to biased clustering and suboptimal quantization. BS-KMQ mitigates this by removing such outliers before clustering, yielding more informative quantization levels. It achieves at least 3X lower quantization error compared to linear, Lloyd–Max, CDF and K-means methods. The resulting NL references are implemented using a reconfigurable in-memory NL-ADC with a 7X area improvement compared to two previous works. Evaluated on ResNet-18, VGG-16, Inception-V3, and DistilBERT, BS-KMQ achieves up to 66.8%, 25.4%, 66.6%, and 67.7% higher post-training quantization accuracy, respectively, compared to linear quantization. After low-bit finetuning, it maintains competitive accuracy with significantly fewer ADC levels (3/3/4/4b). System-level simulation on ResNet-18 (6/2/3b) shows up to 4X speedup and 24X energy efficiency over existing IMC accelerators.
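The core idea, dropping the values piled up at the clamp boundaries before running k-means, can be sketched in a few lines of plain Python. This is an illustrative reading of the abstract only: the quantile initialization, the `eps` margin, and the fallback when too few interior samples remain are assumptions, and how BS-KMQ treats the suppressed boundary values at quantization time is not described here and is left out.

```python
def kmeans_1d(samples, k, iters=50):
    """Plain Lloyd's algorithm on scalars with deterministic quantile init."""
    data = sorted(samples)
    n = len(data)
    centers = [data[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            buckets[nearest].append(x)
        # An empty cluster keeps its previous center.
        updated = [sum(b) / len(b) if b else centers[j]
                   for j, b in enumerate(buckets)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def boundary_suppressed_levels(samples, k, lo, hi, eps=1e-9):
    """Cluster only samples strictly inside the clamp range (lo, hi)."""
    interior = [x for x in samples if lo + eps < x < hi - eps]
    if len(interior) < k:  # assumed fallback: too few interior points
        interior = list(samples)
    return kmeans_1d(interior, k)
```

On a distribution with heavy mass at both clamp values, plain k-means places its centers almost on the boundaries, while the suppressed variant spends its levels on the interior values.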
Workshop
DescriptionModern computer architectures and the device technologies used to manufacture them are facing significant challenges, limiting their ability to meet the performance demands of complex applications such as Big Data processing and Artificial Intelligence (AI). The In-Memory Architectures and Computing Applications Workshop (iMACAW) seeks to provide a platform for discussing In-Memory Computing (IMC) as an alternative architectural approach and its potential applications. Adopting a cross-layer and cross-technology perspective, the workshop will cover state-of-the-art research utilizing various memory technologies, including SRAM, DRAM, FLASH, RRAM, PCM, MRAM, and FeFET. Additionally, the workshop aims to strengthen the IMC community and offer a comprehensive view of this emerging computing paradigm to design automation professionals. Attendees will have the opportunity to engage with invited speakers, who are pioneers in the field, learn from their expertise, ask questions, and participate in panel discussions.
Engineering Presentation
Design
EDA
Security
Systems
DescriptionPhysical vulnerabilities in off-chip interfaces between sensors and secure microcontrollers expose systems to attacks such as data manipulation, spoofing, and hardware Trojans. Traditional countermeasures like shielding and tamper-evident enclosures are costly and impractical for low-cost designs. This work proposes an all-digital, in-situ architecture that leverages inherent system variability—arising from IC packaging, PCB traces, and process mismatches—to generate unique timing signatures for each sensor-controller pair during factory calibration. These signatures, derived from pulse width and skew measurements of data bus transitions, are stored in non-volatile memory and continuously monitored during in-field transactions. Deviations from expected signatures indicate tampering, triggering alerts or shutdowns. The solution is protocol-independent, incurs minimal area and power overhead, and integrates seamlessly into existing microcontroller interfaces. Results demonstrate reliable detection of interface anomalies without additional mechanical protection, offering a robust, low-cost alternative for securing sensor channels against physical attacks.
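The calibrate-then-monitor flow can be sketched as a simple statistical check. The sketch assumes that a signature is the mean and standard deviation of pulse-width and skew measurements and that tampering is flagged when a live measurement leaves an n-sigma window; the actual signature format and decision rule of the proposed architecture are not given in the abstract, and all names and thresholds below are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Signature:
    # Per sensor-controller pair, stored in non-volatile memory after calibration.
    pulse_mean_ns: float
    pulse_std_ns: float
    skew_mean_ns: float
    skew_std_ns: float


def calibrate(pulse_widths_ns, skews_ns) -> Signature:
    """Build the reference signature from factory measurements."""
    return Signature(mean(pulse_widths_ns), pstdev(pulse_widths_ns),
                     mean(skews_ns), pstdev(skews_ns))


def is_tampered(sig: Signature, pulse_ns: float, skew_ns: float,
                n_sigma: float = 4.0, floor_ns: float = 0.01) -> bool:
    """Flag a transaction whose timing leaves the calibrated window.

    floor_ns keeps the window from collapsing when calibration noise is tiny.
    """
    pulse_tol = max(n_sigma * sig.pulse_std_ns, floor_ns)
    skew_tol = max(n_sigma * sig.skew_std_ns, floor_ns)
    return (abs(pulse_ns - sig.pulse_mean_ns) > pulse_tol
            or abs(skew_ns - sig.skew_mean_ns) > skew_tol)
```

A monitor built this way would call `is_tampered` on each in-field transaction and raise an alert or shutdown when it returns `True`.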
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
DescriptionThe deployment of large language models (LLMs) is increasingly targeting heterogeneous unified-memory architectures (HUMA) for edge and cost-sensitive computing. However, GPU-centric inference systems perform suboptimally on HUMA due to shared DRAM bandwidth contention and inefficient static operator placement. This paper introduces InferWeave, a bandwidth-aware inference framework that co-designs offline planning and runtime scheduling for HUMA. InferWeave's offline planner formulates operator placement and data staging as a unified optimization problem with DRAM bandwidth as a first-class constraint. At runtime, a bandwidth-aware asynchronous pipeline scheduler enforces these plans by regulating memory access rates, enabling non-blocking execution across host and device. Evaluated on LLaMA and GPT models using the MT-3000 processor, InferWeave achieves up to 2.3x higher throughput than FlexGen and llama2.c, demonstrating its effectiveness in enabling efficient LLM inference on integrated architectures.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe rapid evolution of automotive System-on-Chips (SoCs) integrating 10BaseT Ethernet subsystems demands advanced verification methodologies to ensure reliability, performance, and safety compliance. This paper presents an innovative multiplatform verification framework designed to address the unique challenges posed by next-generation automotive Ethernet SoCs. By combining simulation, emulation, and formal verification techniques, the framework leverages the strengths of each platform to achieve comprehensive coverage and accelerate verification cycles.
Simulation offers detailed visibility for debugging complex corner cases, emulation enables high-speed execution of large-scale transaction scenarios, and formal methods provide exhaustive protocol property checking. This hybrid approach not only improves verification thoroughness but also significantly reduces time-to-market pressures. The framework supports scalable verification of multidrop topologies and multiple Ethernet instances, essential for diverse automotive applications ranging from advanced driver-assistance systems (ADAS) to infotainment.
A case study on an automotive 10BaseT Ethernet subsystem demonstrates a 15% increase in coverage and a 40% reduction in verification cycle time compared to traditional single-platform approaches. Early detection of critical integration bugs further validates the framework's effectiveness. Future enhancements include integration of AI/ML-driven test generation and extension to emerging automotive Ethernet standards such as 100BaseT1 and Time-Sensitive Networking (TSN). This work establishes a robust foundation for reliable and efficient verification of automotive Ethernet SoCs, supporting the next wave of connected and autonomous vehicles.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionLarge language models (LLMs) have achieved strong results in software programming tasks, which motivates their application to hardware design, especially in generating Verilog code from natural language specifications. Existing methods mainly use instruction tuning, which optimizes token-level probability and does not match the need for functional correctness in Verilog generation. To address this, we use reinforcement learning (RL) with feedback from verification tools, so that the training objective directly reflects functional correctness. RL training with verification feedback is limited by the lack of functional verification code, or testbenches. To solve this, we propose an automatic testbench generation framework that addresses the problems of hallucination and low coverage in LLM-generated testbenches by decomposing the task and using additional information from electronic design automation tools. We then train Verilog generation LLMs using reinforcement learning with testbench feedback, achieving state-of-the-art performance. Our method gives consistent gains across different base models and open-source Verilog generation LLMs, showing its generalizability. All datasets, code, and models are released to support further research in this area.
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionPower side-channel attacks allow adversaries to recover executed instructions and reconstruct program flow from power traces. However, existing approaches have primarily focused on low-frequency embedded platforms with relatively simple processor designs. When applied to modern commercial CPUs, complex microarchitectural behaviors, such as out-of-order execution and deep pipelining, introduce substantial execution variability and noise, severely hindering reliable instruction recovery from power measurements.
In this work, we present the first systematic demonstration of instruction-level side-channel disassembly on a commercial x86 processor operating at 3.1 GHz. We construct a large-scale dataset containing 645 instructions—including AVX/SSE extensions—and more than 50,000 voltage waveform samples collected under realistic multi-threaded operating conditions. To capture cross-cycle dependencies, we propose a novel end-to-end sequence-to-sequence framework that directly reconstructs instruction sequences from voltage waveforms, integrating CNN-Conformer encoders for feature learning and Transformer decoders for sequence generation. Experiments show that our method achieves 34.70% sequence prediction accuracy, outperforming static segmentation and classifier-based baselines.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn advanced semiconductor technology nodes, designing a robust and area-efficient I/O ring poses significant challenges due to tighter design margins, increased integration density, and stringent reliability requirements. Addressing these challenges requires a methodology that delivers high performance and robustness while maintaining efficient area utilization in I/O ring design for advanced process nodes. This paper presents an integrated design approach that optimizes layout and system-level parameters to achieve high performance and robustness in an area-efficient I/O ring. The integrated approach results in a significant reduction in I/O ring area and power routing complexity while maintaining compliance with ESD, electromigration, and noise tolerance specifications.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionWith the deceleration of Moore's Law, three-dimensional (3D) integration via hybrid bonding enables higher performance. However, existing 3D placers exhibit limitations: pseudo-3D placers suffer from modeling inaccuracies, while true-3D placers neglect timing closure. Prior 3D static timing analysis (STA) lacks accurate parasitic modeling, impeding timing optimization. This paper presents the first timing-driven true-3D placement framework for hybrid-bonding-based integrated circuits. We develop a 3D STA engine with accurate parasitic modeling and integrate a sample-based 3D wirelength model with hybrid weighting for global placement, followed by Lagrangian-based detailed placement. Results demonstrate 30-1002x reduction in TNS and 3-14x reduction in WNS over state-of-the-art placers.
People
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionDevising a design specification in a Hardware Description Language, like Verilog or VHDL, is the first step in the automated design of hardware circuits and systems. Traditionally, design specification has been a manual process, thus cumbersome and error prone. To address these limitations, this paper proposes a methodology to automatically generate Verilog code for a problem description in natural language. It combines the capabilities of Large Language Models, Retrieval Augmented Generation, classifiers, heuristics, simulation, and reasoning to understand, adaptively prompt, and elaborate, in addition to traditional learning, as means to bridge the semantic gaps in code generation. The methodology integrates the top-down code generation part with the bottom-up functional correctness and performance evaluation, e.g., code reviews, design parallelism, critical path, and hardware resource sharing. Experiments compare the proposed methodology with state-of-the-art Verilog/VHDL code generation methods.
Engineering Presentation
EDA
Security
DescriptionModern SoCs integrate multiple asynchronous clock domains, making Clock Domain Crossing (CDC) for control signals a critical design challenge. Conventional n-flop synchronizers, though robust, remain continuously clocked, causing significant dynamic power dissipation and area overhead—especially when control signals toggle sparsely. This paper introduces an activity-aware CDC architecture that intelligently gates synchronizer clocks, activating only during signal transitions. By grouping control signals into synchronizer banks and employing transition detection, the proposed method drastically reduces idle power while preserving CDC fidelity. As demonstrated in 28 nm FDSOI at 600 MHz, the architecture achieves up to 90% dynamic power savings for low toggle rates and over 30% area reduction compared to traditional synchronizer designs. Results highlight scalability for large SoCs with thousands of control signals, delivering exceptional energy efficiency without functional compromise. Fully digital and technology-agnostic, this solution enables automated integration into RTL flows, making it ideal for next-generation low-power, high-performance systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionConventional Reset Domain Crossing (RDC) practices rely on pessimistic structural analysis, often ignoring critical design intent. This oversight causes two major issues: millions of false metastable paths are reported due to ignored hierarchical reset sequencing, and tools fail to recognize reset-driven clock-gaters that physically prevent metastability. Consequently, designers are burdened by a manual, trial-and-error process of defining ignore-paths and constraints to silence this noise.
This paper presents an intelligent RDC methodology that utilizes boolean reasoning to automatically extract design intent from the hardware. The proposed flow introduces native reset sequencing awareness to learn assertion ordering and reset-driven clock-off awareness to determine when reset assertions disable capture clocks. By automatically filtering functionally impossible paths, the methodology provides accurate reporting with dramatically reduced noise.
The outcome is a practical, adoption-ready RDC solution that eliminates the burden of manual false-path definitions and constraint tuning. This approach ensures that RDC analysis is aligned with actual silicon behavior rather than testing a designer's ability to manually write complex constraints. Overall, the flow improves verification quality while significantly reducing the manual effort required for RDC closure.
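The filtering step can be pictured with a small model. In the sketch below, the two kinds of design intent named above (reset assertion ordering and reset-driven clock gating) are supplied as plain sets, whereas the described methodology extracts them automatically through boolean reasoning. The data model and all names are assumptions for illustration, not the tool's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crossing:
    # A structural path from a flop reset by src_reset to a flop that is
    # reset by dst_reset and captures on dst_clock.
    src_reset: str
    dst_reset: str
    dst_clock: str


def filter_crossings(crossings, asserts_before, gates_clock):
    """Return only the crossings that can actually go metastable.

    asserts_before: set of (first, second) pairs meaning reset `first` is
        always asserted no later than reset `second`.
    gates_clock: set of (reset, clock) pairs meaning asserting `reset`
        switches `clock` off.
    """
    reported = []
    for c in crossings:
        if c.src_reset == c.dst_reset:
            continue  # same reset domain, not a crossing
        # Destination is already held in reset when the source resets.
        dst_already_in_reset = (c.dst_reset, c.src_reset) in asserts_before
        # Destination cannot capture because its clock is off.
        capture_clock_off = (c.src_reset, c.dst_clock) in gates_clock
        if not (dst_already_in_reset or capture_clock_off):
            reported.append(c)
    return reported
```

Every crossing removed by this filter corresponds to an ignore-path or constraint that would otherwise have to be written by hand.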
Research Special Session
AI
DescriptionAndroid powers billions of devices across an array of form factors that move, live, and are part of the everyday life of its global community. Across these devices and form factors different protocols and sensors are used to produce, share, and consume media. As wearables advance in size, connectivity, and compute they are transforming how users consume and produce content. This is in conjunction with a landscape where ambient sensing and processing is critical to how these emerging form factors support multimodal media interactions. This talk will unpack security & privacy considerations related to ambient processing, public and semi-private (e.g., shared devices, TV) media consumption, and the importance of these considerations across the Android ecosystem of form factors.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionFormal verification based on contract-based and assume–guarantee reasoning is a powerful technique for scalable system validation, yet it often suffers from state-space explosion when implicit design invariants are not explicitly captured. In practice, modern SMT-based model checkers spend significant effort exploring unreachable or irrelevant regions of the state space, leading to non-convergence even for correct designs. This work proposes an Invariant-Augmented Contract-Based Formal Verification methodology that systematically integrates State Space Tunneling (SST) as a counterexample-driven refinement mechanism. When a target property fails to converge, tunneling is used to generate non-reset counterexamples that expose unreachable but solver-visible behaviors. These behaviors are analyzed to extract missing inductive invariants, which are encoded as helper assertions and fed back into component contracts or assertion sets. Once proven, the invariants are safely assumed in subsequent proofs, effectively pruning unreachable state regions without over-constraining the design. The approach introduces an iterative refinement loop that progressively strengthens contracts, improves compositional reasoning, and accelerates convergence of complex proofs. By combining industrial SST workflows with contract-based verification, the proposed method bridges the gap between architectural intent and low-level formal analysis. This work demonstrates how invariant augmentation transforms SST from a debugging aid into a systematic scalability mechanism for large, complex hardware designs.
People
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionPhysical design is a critical step in the electronic design automation (EDA) process, where the gate-level netlist from logic synthesis is converted into a GDS-II file. Recently, AI methods have been increasingly employed to address challenges in physical design. This paper proposes a new paradigm that integrates routing metric evaluation and optimization on chip layouts, forming a "Generate–Evaluate–Optimize–Generate" (GEOG) closed loop. In our experiments, the pre-routing generation model achieves a mean relative error (MRE) of only 1%, with a mean absolute error (MAE) below one movement unit. Post-processing reduces both MRE and MAE to zero while ensuring connectivity. The evaluation model performs comparably to iSTA on the test set, with a 336× speedup. Furthermore, the optimization strategy effectively improves wirelength, delay, slew, R, and C by 10.49%, 13.03%, 14.85%, 10.51%, and 6.46%, respectively, and produces routing results superior to commercial tools in key metrics.
People
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionThis paper presents an analytical global placement framework that concurrently integrates PDN optimization, effectively addressing both global and local IR-drop challenges. To enable accurate and efficient voltage drop estimation, we implement a fast, GPU-accelerated IR-drop analysis engine based on Modified Nodal Analysis (MNA). For PDN optimization, we introduce an iterative strategy that enhances power delivery robustness by inserting additional power delivery paths to reduce global IR-drop. To further address localized IR-drop issues, we propose a novel density control mechanism that constrains cell placement based on the maximum tolerable power load at each PDN node, as determined by IR-drop severity. Experimental results show that the proposed methodology significantly reduces IR-drop violations without incurring additional wirelength, runtime, or routability overhead compared to state-of-the-art IR-drop-aware placement techniques.
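As a rough illustration of the Modified Nodal Analysis (MNA) formulation mentioned in this abstract, the sketch below solves a tiny one-dimensional power rail in plain Python. Everything in it is an assumption made for illustration: the chain topology, the function names `solve_linear` and `ir_drop_chain`, and the dense Gaussian-elimination solver. It is not the paper's GPU-accelerated engine.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting is required: the MNA matrix has a zero on its diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ir_drop_chain(vdd, r_seg, loads):
    """IR drop on a hypothetical rail: pad at node 0, equal segment resistances.

    loads[k] is the current (A) drawn at node k + 1. Returns the drop per node.
    """
    n = len(loads) + 1          # number of rail nodes
    size = n + 1                # node voltages plus one source-branch current
    g = 1.0 / r_seg
    a = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    # Stamp each segment conductance into the nodal block.
    for k in range(n - 1):
        a[k][k] += g
        a[k + 1][k + 1] += g
        a[k][k + 1] -= g
        a[k + 1][k] -= g
    # Loads draw current out of their nodes.
    for k, i_load in enumerate(loads):
        b[k + 1] = -i_load
    # MNA extension: an ideal voltage source pins node 0 to vdd.
    a[0][n] = 1.0
    a[n][0] = 1.0
    b[n] = vdd
    voltages = solve_linear(a, b)[:n]
    return [vdd - v for v in voltages]


drops = ir_drop_chain(vdd=1.0, r_seg=0.1, loads=[0.5, 0.3])
print([round(d, 3) for d in drops])
```

In this toy case the drop grows along the rail because each segment carries the current of every load downstream of it. That accumulation is the kind of severity signal a placer could feed into a per-node power-load limit.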
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
Description3D Gaussian Splatting (3DGS) has emerged as the state-of-the-art (SOTA) technique for novel view synthesis. This paper introduces an algorithm that employs a two-stage pipeline. First, a 2D Gaussian (2DGS) model is rendered via the nearest Gaussian's color. Subsequently, this depth map is utilized for the second stage: a sort-free, weighted rendering of the complete 3DGS model. We further propose CODA, a unified hardware accelerator co-optimized to execute this 2D-3D hybrid pipeline. Experimental results demonstrate that CODA exhibits an 8.4× to 11.6× speedup over the NVIDIA RTX 3060M and a 2.7× to 3.9× speedup over the NVIDIA RTX 4090.
DAC Pavilion Panel
DescriptionMachine Learning (ML) and Reinforcement Learning (RL) are now standard in EDA, and Generative and Agentic AI is growing rapidly. Engineering leaders now face a critical question: beyond the hype, what is the actual Return on Investment (ROI) in chip design? This panel brings together industry experts to share real stories and measure the true value of AI over the past few years. We cover the full workflow, from HLS to digital implementation and verification.
Speed-ups vs. productivity boosts: We distinguish between raw speed improvements, such as faster simulations and verification, and true productivity gains from GenAI tools used for copilot-style documentation assistance, tool setup, and debugging. The discussion moves away from simple metrics (e.g., lines of code per hour) to the real drivers of ROI: reducing design cycles and getting to working silicon faster.
Junior vs. senior chip designers: The panel will also examine the impact across different experience levels. For juniors, AI speeds up work but risks weakening their grasp of the basics. For seniors, does AI handle repetitive tasks, or does it turn these senior architects into "error checkers" for AI-generated code? We compare cases where AI saved weeks of effort with those where poor code or settings required expensive manual fixes, and we discuss the strategies (e.g., data flywheel, model fine-tuning, human training) needed to reduce these risks.
AI Agents and MCP: Finally, we discuss the shift to autonomous EDA AI Agents that can automate certain flows and, in the future, drive full-flow automations. Since the current roadblock is that disconnected tools stall progress, Model Context Protocol (MCP) and Agent-to-Agent (A2A) are emerging as standards for agentic workflows and, potentially, for RTL-to-GDSII flows. We debate whether this "glue" can truly work, the investment needed to build it, whether fully autonomous workflows are realistic or just a dream, and how they unlock further ROI.
Research Panel
EDA
DescriptionAs 3D heterogeneous integration moves from research prototypes to production systems, long-standing assumptions in EDA are being stress-tested by strong cross-die coupling, multi-physics interactions, and manufacturing-driven constraints that extend beyond traditional 2D and 2.5D abstractions.
This panel examines whether today's EDA tools can be incrementally adapted to meet these demands or whether their underlying architectures are fundamentally ill-suited for true 3D integration. By bringing together perspectives from EDA vendors, system designers, and academia, the discussion aims to identify critical gaps, debate the limits of current approaches, and outline what a credible next-generation EDA paradigm for 3D-heterogeneous systems must look like.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionElliptic curve cryptography (ECC) has become a preferred choice for securing resource-constrained Internet-of-Things (IoT) devices, offering strong security with compact key sizes. As the computational core of ECC, elliptic curve point multiplication (ECPM) dominates both the performance and silicon area of ECC coprocessors. However, implementing ECPM in area-constrained applications remains challenging due to its significant arithmetic complexity and the diversity of field parameters. Most existing hardware accelerators are tailored to a specific elliptic curve modulus, limiting their scalability and forcing costly redesigns to support multiple curves. To overcome this inflexibility, we propose ISA-PM, a unified instruction set architecture (ISA) extension and microarchitecture that supports flexible ECPM computation across various elliptic curves, including widely used Montgomery and short Weierstrass forms. To effectively reduce the design area of ISA-PM, we propose a three-phase computation scheme that decomposes complex ECPM operations into iterative execution of small bit-width modular operations. Furthermore, we develop a design space exploration method based on bit-width partitioning and operation parallelism to balance resource utilization and computational latency, ensuring efficient implementation under area constraints. Experimental results based on AMD Virtex-7 FPGA demonstrate that ISA-PM delivers up to 8.37x improvement in area–time product (ATP) over the state-of-the-art lightweight ECPM hardware accelerators.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionWe propose a novel protection diode architecture (aka lightning rods) that enables process-induced discharge while eliminating the leakage current caused by protection diode placement. The lightning rods are placed on the periphery of the die, outside of the Integrated Passive Devices (IPD) circuitry, hence eliminating the leakage path between the capacitor electrodes and the protection diodes. Historically, protection diodes have had to be placed in the active region of IPDs such as capacitors, inductors, heating elements, resistive temperature detectors (RTDs), interposers, and IO connectors (e.g., EMIB) to allow for process-induced discharge. Process-induced charging is proportional to the cumulative area of floating metal. When the charge accumulation reaches a certain level, discharge occurs, often leading to defects such as burn marks, voids, and metal shorting. Protection diodes are typically connected to the metal plates to eliminate the defects due to process-induced discharge (PID), and hence the diode becomes part of the active circuitry and increases current leakage. Therefore, placing the protection diodes in the vicinity of the active circuit but not directly connected to it eliminates leakage. In this case, the protection diodes act as lightning rods, similar to atmospheric discharge through lightning rods.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionWe present Janus, a compiler-based security framework that mitigates transient execution attacks like Spectre and control-flow hijacking on ARM64 platforms. Janus integrates speculative execution and control flow dependencies with PA modifiers, using PA and BTI microarchitectural features to prevent control-flow speculation attacks and secure both control flow and speculative execution through existing control-flow integrity mechanisms. To optimize performance, Janus minimizes overhead by merging defense operations across different defense layers (modifier fusion) and reusing registers of protected variables (carrier reuse), while maintaining strong security guarantees. Evaluation on SPEC CPU2017 shows an average performance overhead of 3.85%, with real-world applications exhibiting overheads ranging from 2.97% to 7.80%. Janus offers effective speculative execution security and low performance and code size overhead, making it a robust solution for ARM-based systems.
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionSuperconducting Rapid Single Flux Quantum (RSFQ) logic is a promising candidate for high-performance computing due to its ultra-high speed and ultra-low power. Since most RSFQ logic gates require synchronization with a single clock pulse, RSFQ circuits have a clock-driven gate-level pipelined architecture in nature. However, the gate-level pipelined architecture leads to layouts with a high width-to-height ratio, especially in large circuits, because the layout width increases with the number of logic stages while the height is determined by the tallest column. To address this, we propose JSPlace, a shape-controllable and length-matching placement algorithm for RSFQ circuits. Our algorithm folds the multi-stage pipelined placement into three segments and merges columns via mixed-integer linear programming (MILP) to achieve the target width-to-height ratio. Subsequently, dynamic programming determines the vertical positions of nodes within each column, followed by connectivity-driven repositioning to minimize total vertical wirelength. Experimental results on ISCAS85 and EPFL benchmarks demonstrate that JSPlace achieves effective control of layout width-to-height ratio while significantly outperforming the state-of-the-art in layout area and wirelength.
People
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionThree-dimensional (3D) point cloud models have been widely employed in modern 3D perception tasks such as robotics, autonomous driving, and virtual reality. The k-nearest-neighbor (KNN) search serves as a cornerstone operation for point cloud models, providing the essential mechanism for defining and exploiting local spatial relationships within unstructured data. However, the massive scale of point cloud data and the high computational complexity of Euclidean distance-based top-k searches in KNN impose substantial computational overhead. Conventional edge computing platforms struggle to achieve real-time, high-efficiency point cloud processing. In this work, we propose KCD-CAM, a KNN accelerator using ReRAM-based charge-domain content-addressable memory (CAM) for efficient point cloud processing. The proposed KCD-CAM employs a 4T2R CAM cell capable of performing in-situ range search, effectively replacing complex Euclidean distance calculations with massively parallel operations. In addition, a corner-clipped (CC) iterative top-k search scheme and dual-granularity voxel hashing (DG-VH) are employed to enhance accuracy and parallelism. Performance benchmarks on real-world datasets demonstrate that KCD-CAM achieves 279.79× higher speed and 3282× greater energy efficiency than GPU implementations. Compared to the SOTA KNN accelerators, KCD-CAM also achieves 8.51× and 4.76× improvements in speed and energy efficiency, respectively.
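To make the idea of answering a top-k query through range searches more concrete, here is a small software sketch. It is only one plausible reading of the abstract: the function name `knn_by_range_search`, the radius-growth schedule, and the way box corners are filtered out are assumptions, and the paper's corner-clipped scheme and CAM hardware may work quite differently.

```python
import math


def knn_by_range_search(points, query, k, r0=0.5, growth=2.0):
    """Return the k nearest points using repeated range queries.

    Each round runs an axis-aligned box query (the kind of per-dimension
    range match a CAM could evaluate in parallel), then drops hits lying
    in the box corners, i.e. farther than r in Euclidean distance.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    r = r0
    while True:
        box_hits = [p for p in points
                    if all(abs(a - b) <= r for a, b in zip(p, query))]
        ball_hits = [p for p in box_hits if math.dist(p, query) <= r]
        # If the radius-r ball holds at least k points, the true k nearest
        # neighbours must all be inside it, so the result is exact.
        if len(ball_hits) >= k:
            ball_hits.sort(key=lambda p: math.dist(p, query))
            return ball_hits[:k]
        r *= growth  # too few candidates: widen the range and retry


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0),
         (3.0, 3.0, 3.0), (0.5, 0.5, 0.5), (-4.0, 1.0, 0.0)]
print(knn_by_range_search(cloud, query=(0.1, 0.1, 0.1), k=3))
```

In this software version the final sort over the surviving candidates still computes distances. The point of the sketch is only that most of the point cloud is discarded by cheap range comparisons before any ranking happens.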
People
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
DescriptionWe propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. To reduce cache invalidation caused by memory updates, we propose Static-Dynamic Memory Construction, which groups memory by update frequency and manages memory at different granularities. To reduce the impact of cross-attention ignorance between different groups, we propose Multi-hop Memory Re-computation, which dynamically identifies and recovers critical memory interactions through iterative memory importance propagation. We also propose Layer-balanced Memory Loading, a scheduling strategy that eliminates unbalanced KV loading and computation overhead between different layers caused by KV recomputation.
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLarge language models (LLMs), such as GPT-3 and Llama-2, impose extreme memory-bandwidth demands, creating severe data movement bottlenecks. Processing-in-Memory (PIM) mitigates this by performing computations near memory; however, current PCIe-based PIM systems deliver only a small fraction of their theoretical throughput. On our multi-channel PIM emulation platform, PIM cores are active for only 6.55% of execution time, primarily due to (1) serialized host–device communication that limits channel-level parallelism, (2) insufficient request-generation capability in conventional DMA engines, and (3) per-bank microarchitectures that serialize batch execution.
We address these bottlenecks with a co-designed memory system and microarchitectural solution: Channel-Level Burst (CL) for autonomous per-channel operand generation, PDMA for high-throughput PIM-oriented request scheduling, and an integrated PIM core for enabling true multi-batch parallelism.
Across GEMM microbenchmarks and four LLMs, CL provides 9.10x speedup, PDMA adds 29.6x, and multi-batch execution contributes 102.4x. Together, they deliver a 145.8x improvement over a baseline multi-channel PIM system, enabling PIM to surpass CPU throughput in memory-bound LLM decode kernels and approach the efficiency of GPUs.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionVision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV.
We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27% to 37% acceleration with almost no loss in Success Rate.
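For readers unfamiliar with Kalman filtering as a motion predictor, the following minimal sketch tracks one scalar action dimension with a constant-velocity model. The class name, the noise parameters `q` and `r`, and the constant-velocity assumption are all illustrative choices; the abstract does not specify KERV's actual kinematic model or how its predictions are merged with decoded tokens.

```python
class ConstantVelocityKalman:
    """1-D Kalman filter with state (position p, velocity v)."""

    def __init__(self, dt=1.0, q=1e-4, r=1e-2):
        self.dt, self.q, self.r = dt, q, r
        self.p, self.v = 0.0, 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance

    def predict(self):
        """Advance one time step and return the predicted position."""
        dt = self.dt
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        # P <- F P F^T + q I, with F = [[1, dt], [0, 1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with an observed position z."""
        (a, b), (c, d) = self.P
        s = a + self.r                      # innovation variance
        k0, k1 = a / s, c / s               # Kalman gain
        y = z - self.p                      # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


kf = ConstantVelocityKalman()
for t in range(1, 51):
    kf.predict()
    kf.update(2.0 * t)      # observed action value at step t
next_action = kf.predict()  # kinematic guess for step 51
print(round(next_action, 2))
```

One way such a predictor could be used, stated here purely as an assumption, is as a cheap fallback: when a drafted action token is rejected, the filter's prediction stands in for the value that would otherwise require another model pass.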
Exhibitor Forum
DescriptionAI-driven methodologies are transforming custom RF and microwave design by overcoming long-standing challenges in scalability, complexity, and time-to-market. This session highlights Keysight's advances in surrogate modeling, secure AI/ML pipelines, and agentic automation that capture expert knowledge and accelerate co-design and verification.
Through practical case studies, we'll explore multi-physics integration across packaging, multi-physics, and system-level PHY waveform domains, along with high-dimensional optimization and reinforcement learning. We'll also showcase how digital twins are streamlining workflows and closing validation loops in wireless and defense applications.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionMixture of Experts (MoE) improves LLM expressiveness without proportional compute, yet its large parameter footprint hinders deployment in storage- and latency-constrained environments. Existing pruning methods attempt to reduce redundancy but often suffer severe accuracy loss at high pruning ratios due to limited pruning metrics. To tackle this challenge, we propose **KL-MoE**, a hierarchical pruning framework based on a KL-divergence scoring mechanism. KL-MoE first clusters experts by their similarity. Then we develop a **KL-based Scoring** mechanism to retain the most representative expert within each cluster by jointly capturing the *local* and *global* functionality. In addition, we introduce the **Linear Restore** strategy, a lightweight mapping strategy that refines the outputs of the pruned MoE layer to approximate the original layer, thus recovering the accuracy of the pruned models. Extensive experiments across multiple models and tasks demonstrate that KL-MoE yields average gains of **12.89%**, **7.24%** and **6.14%** over state-of-the-art methods O-Prune, MoE-I$^2$ and HC-SMoE, respectively, while delivering a 1.31x inference speedup. Our code is available at https://anonymous.4open.science/r/KL-MoE-a.
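The sketch below shows one simple way a KL-divergence score could pick a representative expert inside a single cluster. It is not taken from the paper or its repository: the scoring rule (divergence from the cluster-average output distribution), the helper names, and the toy logits are all assumptions, and the sketch covers only a within-cluster view rather than the joint local and global scoring the abstract describes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def pick_representative(cluster_logits):
    """Pick the expert whose output is closest to the cluster average.

    cluster_logits maps a (hypothetical) expert id to that expert's
    logits on a shared probe input.
    """
    dists = {e: softmax(l) for e, l in cluster_logits.items()}
    dim = len(next(iter(dists.values())))
    mean = [sum(d[i] for d in dists.values()) / len(dists) for i in range(dim)]
    scores = {e: kl_divergence(mean, d) for e, d in dists.items()}
    return min(scores, key=scores.get), scores


cluster = {
    "expert_a": [2.0, 1.0, 0.0],
    "expert_b": [2.1, 0.9, 0.0],
    "expert_c": [0.0, 0.0, 3.0],   # behaves unlike the other two
}
keep, scores = pick_representative(cluster)
print(keep, {e: round(s, 3) for e, s in scores.items()})
```

In practice the score would presumably be averaged over many calibration inputs rather than a single probe, which this sketch omits for brevity.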
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionA hybrid solid-state drive (SSD) integrates flash cells in different modes (e.g., single-level cell (SLC) and quad-level cell (QLC)) and allows conversion between them, achieving both high performance and high storage capacity. However, this hybrid design introduces a significantly larger design space than traditional SSDs, with additional design factors such as flash conversion and data migration across different flash modes, leading to higher optimization complexity. Efficient management of such complexity requires deep hybrid SSD knowledge and dynamic adjustment mechanisms. Large language models (LLMs) offer a promising solution through their contextual reasoning and adaptive coordination capabilities.
In this work, we explore the potential of using LLMs in understanding and efficiently managing hybrid SSD design space. We find that leveraging LLMs for knowledge-guided optimization of management parameters enables substantial performance gains. Building on these insights, we propose LLM-hybridSSD, an integrated optimization framework that formulates hybrid SSD management as a parameter-tuning problem, employs an LLM-based tuner for adaptive configuration, and applies reinforcement learning-based fine-tuning to align local lightweight models with domain-specific knowledge. Experimental results show an average 58.92% increase in throughput and a 28.56% reduction in write amplification (WA) compared with state-of-the-art schemes under different real-world workloads.
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionWe present L2L, exploring end-to-end design optimization from logic to layout. The L2L framework includes logic, circuit, and layout stages. Logic exploration introduces two-stage transistor network synthesis. At the circuit stage, we propose two SMT models for transistor network synthesis: Connected-Diffusion (arbitrary, stack-height-bounded, multi-solution) and Grid-Scaffold (series-parallel). At the layout stage, an ILP-based flow co-optimizes circuit topology and double-height layouts, producing two 3-input P-class libraries (area- and metal-optimized). Block-level experiments demonstrate that full inclusion of 3-input P-class functions achieves up to 1.3% lower power, 5.6% higher performance, and 4.1% smaller area (average: 0.8%, 0.6%, 2.6%), outperforming the best prior baseline.
Late Breaking Results
DescriptionMicrofluidic devices enable miniaturized, automated laboratory operations, but their design typically requires specialized expertise and labor-intensive CAD workflows. We present a language-driven framework that synthesizes manufacturable microfluidic designs directly from natural-language prompts. The framework introduces an LLM-based high-level synthesis paradigm, using domain-specific fine-tuned models to translate user intent into a structured JSON design specification for subsequent layout synthesis. An automated validation and correction pipeline detects and fixes geometric and physical inconsistencies to ensure design viability.
People
Late Breaking Results
DescriptionHeterogeneous integration via hybrid bonding optimizes power, performance, and area (PPA) in multi-die systems. A core challenge is assigning circuit blocks to dies in different process technologies. Existing analytical formulations employ smoothly varying sigmoid proxies, causing assignment oscillation, or hard step functions that undermine robustness. We introduce a soft die indicator based on a piecewise-smooth surrogate with a strategically scheduled gradient decay, preserving early differentiability while progressively sharpening decisions. Building on this indicator, we develop (i) a smoothed multi-die wirelength model that considers pins both inside and on net bounding boxes to guide optimization precisely, unlike most previous models, which consider only pins on the boxes, and (ii) a size-compensated density formulation for heterogeneous dimensions. An adaptive preconditioner further balances gradients between macros and standard cells. Experimental results on the 2023 ICCAD CAD Contest benchmarks show that our placer achieves the smallest inter-die wirelength and fastest runtime among published academic placers.
Late Breaking Results
DescriptionThis paper introduces a coarse-to-fine printed circuit board (PCB) placement framework bridging high-level functional design intent and physical implementation. First, a large language model (LLM)-guided component grouping engine utilizes multi-modal feature serialization, topology-enriched retrieval-augmented generation (RAG), and multi-agent reasoning to capture functional constraints missed by standard netlists. Subsequently, a group-aware placement phase employs a hierarchical initial placement strategy via B*-tree packing to physically realize these modules. This prevents the pitfalls of blind initialization, providing an optimal starting point for group-aware global placement and constraint graph-based mixed integer linear programming (MILP) legalization. Experimental results demonstrate that the proposed framework significantly surpasses advanced placers in post-routing quality.
Late Breaking Results
DescriptionGlobal placement relies heavily on differentiable wirelength models to guide the optimization process. While the half-perimeter wirelength (HPWL) is the gold standard due to its computational efficiency and smooth approximations, it suffers from a significant fidelity gap because it fails to capture the internal routing topology of multi-pin nets. In this paper, we propose a novel differentiable wirelength model based on the rectilinear minimum spanning tree (RMST). By leveraging the matrix-tree theorem and the log-sum-exp function, we transform the combinatorial RMST problem into a smooth, continuous objective suitable for global placement. Experimental results demonstrate that our model significantly reduces the fidelity gap, achieving a 3.4% reduction in RMST wirelength and a 3.2% reduction in routed wirelength compared to the widely used HPWL-based baseline, while maintaining acceptable runtime.
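For readers unfamiliar with the log-sum-exp function mentioned in this abstract, the following is a minimal sketch of how it is commonly used to smooth the HPWL baseline. It is not the RMST model proposed in the paper; the function names and the `gamma` smoothing parameter are assumptions chosen for illustration.

```python
import math

def smooth_max(values, gamma):
    """Log-sum-exp approximation of max(values); smaller gamma is sharper."""
    m = max(values)  # subtract the max for numerical stability
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))

def smooth_min(values, gamma):
    """Smooth minimum, obtained from smooth_max on negated values."""
    return -smooth_max([-v for v in values], gamma)

def hpwl(pins):
    """Exact half-perimeter wirelength of one net; pins is a list of (x, y)."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def lse_wirelength(pins, gamma=1.0):
    """Differentiable upper bound on HPWL built from smooth max and min."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (smooth_max(xs, gamma) - smooth_min(xs, gamma)
            + smooth_max(ys, gamma) - smooth_min(ys, gamma))
```

The smooth value never falls below the exact HPWL and exceeds it by at most `4 * gamma * log(n)` for a net with `n` pins, so shrinking `gamma` tightens the approximation at the cost of steeper gradients.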
Late Breaking Results
DescriptionThis paper presents a high-performance programmable bootstrapping (PBS) accelerator for privacy-preserving computation under the Torus Fully Homomorphic Encryption (TFHE) scheme. The proposed design features a heterogeneous processing architecture, including a barrel-shifter-based polynomial rotation unit, integer processing elements (PEs) for level decomposition, and complex-number arithmetic PEs for polynomial multiplication. Our proposed design completes one PBS in 47 μs, an over 8× latency reduction, while consuming hardware resources comparable to state-of-the-art designs.
Work in Progress
DescriptionSpiking Vision Transformers (SViTs) are developed as an energy-efficient alternative to conventional ViTs. To maximize the efficiency gains of SViT processing, we propose MorphAtt, a novel digital SViT accelerator that expedites inference through streamlined processing. Specifically, it processes multi-head self-attention (MHSA) operations using spiking Query-Key-Value, spiking self-attention, and reparameterization modules, as well as inter-module buffers to mitigate traffic congestion in on-chip memory accesses. Experimental results show that MorphAtt achieves 792-1605 GOPS of throughput while incurring ∼39-55 mW of power consumption and 1.5 mm² of area, yielding 20.3-29.1 TOPS/W of energy efficiency. These results demonstrate that MorphAtt offers better performance and efficiency trade-offs than state-of-the-art designs, thus enabling highly energy-efficient vision-based systems at the edge.
Work in Progress
DescriptionNear-term (NISQ) quantum processors are limited in qubit count, connectivity, and coherence, which constrains the size of quantum neural networks (QNNs) that can run monolithically on a single device. We propose a heterogeneous chiplet ensemble architecture that targets a realistic setting with access to many small devices but no coherent interconnect: we assign one chiplet per model and use intra-model circuit cutting to decompose each logical QNN into resource-feasible subcircuits. Cross-chip quantum dependencies are replaced by classical stitching and ensemble aggregation, converting device heterogeneity into predictive diversity while avoiding coherent inter-chip communication. We further develop a resource-constrained co-design flow that jointly selects cut locations, chiplet/model assignment, and sampling budgets under qubit and noise constraints. Across MNIST, Fashion-MNIST, and Digits, our framework scales to larger logical models under strict per-device limits and yields consistent accuracy improvements (up to 3-8%) in both ideal simulation and noise-calibrated backends, motivating new CAD-style challenges in partitioning, mapping, and sampling-aware cost modeling for quantum chiplet systems.
Work in Progress
Late Breaking Results: A Scalable and Efficient Multi-Layer 3D-Printed Microfluidic Design Synthesis
5:46pm - 5:47pm PDT Monday, July 27 Exhibit HallDescriptionAdditive manufacturing, or 3D printing, holds great promise for microfluidic fabrication, enabling complex multi-layer 3D layouts, yet practical physical synthesis remains computationally prohibitive under strict timing and volumetric constraints. In 3D-printed microfluidic devices, precise control over fluid behavior is essential, demanding careful coordination of both arrival timing and delivered volume. The state-of-the-art approach couples placement, routing, and fluidic constraints into a monolithic optimization problem, which scales poorly and can require hours per iteration. We introduce a novel two-stage framework, augmented with a hierarchical conflict-resolution mechanism to ensure high routability. Compared to the state-of-the-art, our results show orders-of-magnitude speedups in complex cases, demonstrating its high efficiency.
Late Breaking Results
DescriptionThis paper presents a unified analytical multi-bit flip-flop (MBFF) clustering and placement framework that addresses industrial compatibility constraints while jointly optimizing timing, power, and area. Our framework combines (1) efficient timing and power models, (2) prioritized multi-density maps, and (3) a clustering-affinity mechanism. Together, they guide flip-flops toward merging-friendly regions and enable integrated placement and clustering. Our framework outperforms all teams in the 2025 ICCAD CAD Contest on Power and Timing Optimization Using Multi-Bit Flip-Flops, being the only approach to generate valid solutions across all testcases. Moreover, we refined the imbalanced contest library to better reflect clustering performance. Under this refined library, our method remains consistently superior.
Work in Progress
DescriptionAutomated chip design techniques are commonplace in digital IC design. Hardware description languages and register-transfer level (RTL) code allow a design to be abstracted beyond a specific process. However, analog circuit performance is inherently tied to the specific technology and tradeoffs made during the design process. Developments in machine and reinforcement learning show promise for automating the analog design process, though these algorithms require an objective function to optimize. In this work, we demonstrate the use of Procrustes distance as an analog circuit sizing heuristic suitable for machine learning optimization, sizing an actively-loaded differential amplifier with a constant-gm bias circuit in an open-source 130 nm process, achieving over 60% increase in unity gain frequency over a manually-sized benchmark.
Work in Progress
DescriptionAnalog sizing remains challenging since circuit parameters vary significantly across different Process Design Kits (PDKs) and circuit topologies. While recent AI-based approaches attempt to automate this process, many rely on large models or struggle to generalize across technology nodes. This paper proposes EasySize, a lightweight LLM-assisted sizing framework that leverages the difficulty of satisfying performance metrics within the search space to dynamically construct task-aware loss functions guiding heuristic search. EasySize generalizes to multiple operational amplifiers across 180nm–22nm nodes and achieves competitive performance while reducing simulation cost.
Late Breaking Results
DescriptionRecent studies show that abstract syntax trees (ASTs) can improve the syntactic validity of large language model (LLM)-generated high-level synthesis (HLS) code. However, structure-enhanced methods often degrade at inference due to reliance on explicit structural signals during training. We show that robustness depends on how structure is introduced, rather than simply whether it is used. We propose ASTFusion, a learn-then-utilize framework that decouples structural acquisition from structural conditioning. ASTFusion internalizes hierarchical and dependency-aware structure through reconstruction from incomplete implementations, and exploits partial structural cues as optional guidance during generation. On HLSEval, ASTFusion improves Func@1 from 40.00% to 72.34% over standard HLS fine-tuning and from 57.69% to 72.34% over prior AST-enhanced fine-tuning, while reaching 98.71% Synth@10. These results show that progressive structural enhancement is effective for robust HLS code generation under realistic inference constraints.
Late Breaking Results
DescriptionVision-language-action (VLA) models enable robots to follow instructions given as typed text, which limits broader deployment with natural speech. The traditional solution of prepending a speech recognition stage to the VLA model to convert speech into text incurs additional latency and propagates errors. To address this, we present Speech-pi_0.5, which maps raw speech directly to motor commands without explicit textual transcription. Specifically, building on pi_0.5, we use a cross-modal projection to convert voice frames directly into audio tokens, significantly reducing response latency. To further improve accuracy, we adopt split LoRA adaptation with dedicated audio and task adapters under two-stage training. Speech-pi_0.5 achieves 3× lower response latency than an ASR pipeline while maintaining competitive task success on LIBERO, and generalizes robustly to unseen accents where a single adapter degrades significantly.
Work in Progress
DescriptionModern memory-intensive applications increasingly adopt non-volatile memories (NVMs). However, relying on manufacturer datasheets or simulation-based models leads to discrepancies with real-world energy profiles. We propose an adaptable measurement framework designed for the precise characterization of read and write energy in commercial off-the-shelf NVM chips. By utilizing an FPGA-MPSoC with a custom memory controller and an FMC extension board, our setup enables power measurements via an oscilloscope. Unlike traditional estimation methods, this framework captures direct physical measurements from MRAM, STT-RAM, and FRAM devices, providing a ground-truth reference for energy-efficient system design.
Work in Progress
DescriptionTo mitigate workload-driven thermal throttling, advanced mobile packages adopt heat escape structures, which can bias the memory interface to one die side and significantly tighten the trade-off between thermally-constrained performance and interconnect length. We formulate thermal throttling-aware floorplanning that jointly optimizes die area, peak temperature, and xPU-to-memory-interface wirelength under such structure-induced asymmetry. We construct Pareto fronts via spatio-thermal co-exploration and extract a knee point to produce an actionable budgeting rule: a modest +10% area margin yields a −6.7°C peak temperature reduction with only +4.7% wirelength overhead, effectively guiding early-stage floorplanning.
Late Breaking Results
DescriptionWe propose a dynamic routing strategy that adapts expert usage to input difficulty while reducing overall computation. We also quantize the gating weights using a multiplier-free scheme and apply post-training calibration to preserve routing decisions. Compared with Top-K routing, dynamic routing improves multi-task performance by 5.07%. During inference on the ZCU102, it reduces MoE expert FFN computation by 24.71% and lowers end-to-end latency by 4.3%, while gating quantization reduces DSP usage by 85%.
People
Late Breaking Results
DescriptionBackside power delivery networks (BSPDNs) alleviate front-side routing congestion, yet they introduce new challenges, including high resistance of buried power rails (BPRs) and nano-through-silicon vias (TSVs), as well as stringent TSV spacing constraints. This paper presents the first power-aware BSPDN synthesis framework that simultaneously accounts for post-placement power demand and strict TSV spacing rules. The proposed framework incorporates a lightweight resistive-network model to accurately evaluate effective instance voltage (EIV) and employs a three-phase optimization strategy to strategically align TSV allocation with power-intensive regions. Experimental results show that our approach reduces the maximum EIV drop by up to 25% compared with a state-of-the-art post-placement BSPDN method.
Work in Progress
DescriptionTrue Random Number Generators (TRNGs) are essential for hardware security, yet existing DRAM-based designs primarily extract entropy into external buffers for cryptographic use, increasing exposure to leakage and profiling attacks. This work introduces Encoding-In-Memory TRNG (EIM-TRNG), a DRAM-based framework that leverages RowHammer-induced charge leakage to generate true random one-time entropy directly within commodity DDR4 memory. By driving selected cells into metastable states, intrinsic thermal noise and process variability produce inherently unpredictable bit outcomes without requiring hardware modification or protocol changes. Unlike prior approaches that export random bits, EIM-TRNG confines entropy generation and application inside memory, reducing attack surfaces and eliminating the need for persistent key storage. Experimental characterization of real DDR4 modules demonstrates robust randomness behavior and practical deployability, transforming a reliability vulnerability into a hardware-efficient security primitive.
Work in Progress
DescriptionOpen-vocabulary object detection enables language-driven object search for mobile robots, but on embedded GPUs a fixed 8 Hz detector can dominate the mission energy budget. We propose an energy-aware runtime that couples a tri-state scheduler (SLEEP/EXPLORE/TRACK at 1.5/4/8 Hz) with dual-precision switching between INT8 and FP16 YOLO-World-S on a Jetson Orin Nano robot. A tegrastats-based logger provides mission energy and a task-level energy-per-detection metric. Across 48 real-robot missions, the scheduler cuts mission energy by 7.5% relative to a fixed 8 Hz FP16 baseline while using 33% fewer inferences, and achieves similar success rate and time-to-detect.
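As a rough illustration of the tri-state scheduling idea, the sketch below maps each mode to the detector rate quoted in the abstract and counts inferences over a mission timeline. The transition rule in `next_mode` and the timeline values are hypothetical assumptions, not the paper's actual policy or data.

```python
from enum import Enum

class Mode(Enum):
    # Detector rates in Hz, as quoted in the abstract.
    SLEEP = 1.5
    EXPLORE = 4.0
    TRACK = 8.0

def next_mode(robot_moving, target_seen):
    """Hypothetical transition rule (an assumption, not from the paper)."""
    if target_seen:
        return Mode.TRACK
    return Mode.EXPLORE if robot_moving else Mode.SLEEP

def inference_count(timeline):
    """timeline: list of (mode, seconds); returns total detector invocations."""
    return sum(mode.value * seconds for mode, seconds in timeline)

# Made-up mission: 10 s in each mode, compared with a fixed 8 Hz detector.
timeline = [(Mode.SLEEP, 10), (Mode.EXPLORE, 10), (Mode.TRACK, 10)]
scheduled = inference_count(timeline)
fixed = 8.0 * sum(seconds for _, seconds in timeline)
```

Under this made-up timeline the scheduler issues 135 inferences against 240 for the fixed-rate detector; the real savings depend on how much time a mission spends in each mode.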
Work in Progress
DescriptionFPGAs provide customizable hardware acceleration that enables efficient, low-latency execution of machine learning inference through application-specific parallelism. While resource utilization and latency can typically be estimated early in the design process, accurate power consumption analysis generally requires completing the full hardware design flow, which may take several hours. In this paper, we present a framework that rapidly identifies high-quality, energy-efficient FPGA designs without requiring full compilation. The proposed approach converges to an optimal design point within seconds, achieving up to a 1000× speedup compared to conventional methods.
Work in Progress
DescriptionFeMFET multi-level cell (MLC) operation promises high-density weight storage for spiking neural network (SNN) inference in compute-in-memory (CIM) systems. We characterise 60 FeMFET devices and show that 4-state operation produces an S2-S3 threshold-voltage separation of only 1.12 sigma (21.5% read error), rendering 4-state weights information-theoretically insufficient for reliable classification regardless of training method. Using an information-theoretic bit-budget framework, we derive BB_min ≈ 6.64 bits as the minimum information capacity for 99%-accurate 4-class inference. We validate the framework on a synthetic 4-class IR fall-detection task: weight precision of 4 bits or higher consistently exceeds the bound and achieves 96-99% accuracy under post-training quantization, while 1-bit precision collapses to chance (25%) regardless of timesteps. 3-state FeMFET operation (separation ≥ 3.35 sigma, read error probability = 0.001) achieves 74.9% under post-training quantization, whereas 4-state (read error probability = 0.215) degrades to 43.2% ± 10.6% under physical device noise. The framework translates directly into a hardware specification, providing a principled pre-training viability criterion for any CIM system with characterised device distributions.
Late Breaking Results
DescriptionWe propose FOCUS, a hardware-aware PTQ recovery framework for low-bit LLM inference. FOCUS uses Fisher information to select approximately 1.5% structurally critical row-column intersections and optimizes their corrections via ridge regression to match output error, enabling a small set of parameters to compensate global quantization error. Unlike fixed-rank SVD updates that distribute correction capacity uniformly, FOCUS concentrates precision on error-critical locations while remaining lightweight. It supports latency-hidden CPU-GPU co-execution, where sparse compensation is offloaded to the CPU. Experiments show that FOCUS improves perplexity and downstream accuracy under both 2-bit and 4-bit PTQ, achieving up to 29% higher token throughput on the Llama-3.2-3B model.
Work in Progress
DescriptionThe scale of Neutral Atom (NA) quantum computers requires automated compilation tools. Designing the required heuristic methods demands a deep understanding of complex hardware trade-offs, for which visualizations can provide crucial insights. This work introduces NAViz, the first publicly available app to visualize quantum computations on NA devices in real-time. A case study demonstrates how NAViz was instrumental in identifying and resolving inefficiencies in an existing compilation strategy, leading to a new, more performant one. The tool is available in open source.
Late Breaking Results
DescriptionVariational quantum circuits (VQCs) are typically evaluated at the logical design level when analyzing trainability. However, execution on real quantum devices requires hardware-aware compilation to satisfy qubit connectivity and gate constraints.
In this paper, we examine how transpilation alone alters gradient statistics. Using parameter-shift differentiation and gradient variance estimation, we compare logical and transpiled circuits across three representative ansatz families: EfficientSU2 (dense entanglement), TTN (tree tensor network), and RealAmplitudes (linear entanglement). We observe architecture-dependent trainability shifts: densely entangling circuits exhibit pronounced gradient reshaping in shallow regimes, structured tensor-network circuits remain comparatively robust, and linear architectures show mixed behavior. Deep circuits across all families display minimal sensitivity to compilation. These findings demonstrate that hardware mapping acts as an implicit structural transformation of the optimization landscape, motivating compilation-aware analysis and co-design for VQCs.
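The parameter-shift differentiation mentioned here can be illustrated on the smallest possible case. The sketch below assumes a single-qubit circuit that applies RY(θ) to |0⟩ and measures Z; the function names are illustrative and the example is not the paper's experimental setup.

```python
import math

def expectation_z(theta):
    """<Z> after applying RY(theta) to |0>, computed from the state vector."""
    amp0 = math.cos(theta / 2.0)  # amplitude of |0>
    amp1 = math.sin(theta / 2.0)  # amplitude of |1>
    return amp0 ** 2 - amp1 ** 2  # equals cos(theta)

def parameter_shift_grad(f, theta):
    """Gradient from two evaluations shifted by +/- pi/2.

    Exact for gates generated by a Pauli operator, such as RY.
    """
    shift = math.pi / 2.0
    return (f(theta + shift) - f(theta - shift)) / 2.0

theta = 0.7
grad = parameter_shift_grad(expectation_z, theta)
```

For this circuit the rule reproduces the analytic derivative, -sin(θ), using only circuit evaluations, which is why it can be applied unchanged to both logical and transpiled circuits.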
Work in Progress
DescriptionThis paper proposes an OPQ-based two-stage search framework for efficient sparse coding in KV cache compression. By integrating Optimized Product Quantization (OPQ) with a filter-and-refine strategy, our framework identifies a compact candidate subset using compressed metadata before performing exact refinement. Experimental results show that the proposed method reduces computational volume by up to 2.46x and memory traffic by 4.5x compared to the Batch-OMP baseline, while maintaining exact support recovery with only 12.5% memory overhead.
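The filter-and-refine pattern described above can be sketched generically for the atom-selection step of a greedy sparse coder. In this sketch a coarse scalar quantizer stands in for OPQ codes, and all names (`quantize`, `filter_and_refine`, `num_candidates`) are assumptions for illustration rather than the paper's implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantize(vec, step=0.5):
    """Coarse scalar quantization, a simple stand-in for OPQ codes."""
    return [round(x / step) * step for x in vec]

def filter_and_refine(atoms, residual, num_candidates=2):
    """Return the index of the atom with the largest |<atom, residual>|."""
    coarse = [quantize(a) for a in atoms]  # would be precomputed in practice
    # Filter: rank atoms by approximate correlation from the compressed copy.
    ranked = sorted(range(len(atoms)),
                    key=lambda i: abs(dot(coarse[i], residual)),
                    reverse=True)
    candidates = ranked[:num_candidates]
    # Refine: compute the exact correlation only on the short list.
    return max(candidates, key=lambda i: abs(dot(atoms[i], residual)))

atoms = [[1.0, 0.1], [0.2, 0.9], [0.6, 0.7], [-0.9, 0.3]]
residual = [1.0, 0.2]
best = filter_and_refine(atoms, residual)
```

The refine step touches only `num_candidates` full-precision atoms, which is where the savings in computation and memory traffic come from. In this toy version the correct atom is found only if it survives the filter, so the candidate count trades cost against recall.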
Late Breaking Results
DescriptionDue to rising electricity demand, accurate short-term load forecasting is increasingly important for grid stability and efficient energy management, particularly in resource-constrained edge settings. We present a hardware-efficient Quantum Reservoir Computing (QRC) framework based on a fixed, untrained quantum circuit with Chebyshev feature encoding, brickwork entanglement, and single- and two-qubit Pauli measurements, avoiding quantum backpropagation entirely. Using the Tetouan City Power Consumption dataset, we examine the effect of post-training fixed-point quantization on the classical readout layer, with the reservoir architecture selected through a genetic search over 18 candidate configurations. Under finite-shot evaluation, 8-bit and 6-bit quantization maintain forecasting accuracy within 1% of the FP32 baseline while reducing readout memory by 75% and 81%, respectively. These results suggest that quantized readout can improve the hardware efficiency and deployment practicality of QRC for memory-constrained energy forecasting.
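The quoted memory reductions are consistent with a simple bit-width ratio against 32-bit weights. The short check below assumes the readout memory scales linearly with bits per weight and ignores any scale factors or metadata; the function name is illustrative.

```python
def readout_memory_reduction(bits, baseline_bits=32):
    """Percent memory saved by storing each readout weight in `bits` bits."""
    return 100.0 * (1.0 - bits / baseline_bits)

saving_8bit = readout_memory_reduction(8)  # 75.0
saving_6bit = readout_memory_reduction(6)  # 81.25
```

Under that assumption, 8-bit storage saves 75% and 6-bit storage saves 81.25%, which rounds to the 81% reported in the abstract.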
Work in Progress
DescriptionMicrofluidic biochips enable complex experiments but lack simulators for coupled physical fields, causing fatal flaws like mixing failure to evade detection until post-fabrication. We propose a framework integrating a high-fidelity fluid-solute-coupled simulator with a graph-transformer forecaster. Experiments demonstrate that the simulator accurately captures complex interactions, while the forecaster achieves 36,000x speedup at 98% accuracy, bridging the gap between fidelity and efficiency for optimized biochip design.
People
Work in Progress
DescriptionEvaluation of Electronic Design Automation (EDA) tools, circuits, and systems often relies on outdated hardware benchmarks. Existing benchmark designs, which are predominantly RISC-V CPUs, provide limited representation of components found in modern SoCs. This hinders (i) evaluating the capabilities of design tools and flows and (ii) providing diverse datasets to train machine learning (ML) models for EDA. We introduce HighTide (repository URL omitted for anonymous review), a benchmarking suite that tracks open-source designs and evolves to maintain a contemporary evaluation suite.
Late Breaking Results
DescriptionStatic worst-case execution time (WCET) analysis for multicore real-time systems remains pessimistic because write-invalidate coherence perturbs remote cache states at run time, preventing per-core hit/miss classification. We present HUSH, a write-update coherence protocol that silences interference while bounding update traffic with time-based self-invalidation. The timeout defines a sharing window and is tuned offline to balance predictability and performance. On SPLASH-2 in gem5, HUSH reduces worst-case total memory latency (WCML) by 30.7x and 8.4x vs. PMSI and PENDULUM*, respectively, while keeping average performance within 1.17x of the MSI baseline.
People
Late Breaking Results
DescriptionMasked autoregressive (MAR) models have recently shown strong potential in visual generation tasks. Unlike traditional autoregressive models that generate tokens sequentially, MAR predicts multiple tokens in parallel at each generation step (stage), resulting in non-linear growth in computation and memory across stages. To meet these demands, we introduce hybrid-bonding (HB) architectures to provide high memory bandwidth and scalable compute capability through 3D integration of DRAM and logic banks. However, existing static mapping strategies cannot effectively adapt to the stage-wise workload shifts due to limited remote-bank bandwidth and constrained per-bank compute capacity. In this work, we propose a stage-aware mapping strategy for MAR inference on HB-based architectures that dynamically balances communication and computation costs across stages. Experimental results show up to 1.75× speedup and 1.56× energy efficiency compared with baselines.
Work in Progress
DescriptionScarcity and noise in open-source datasets severely limit Large Language Models (LLMs) in Register Transfer Level (RTL) design. To address this, we propose a targeted data selection framework using Low-rank Gradient Similarity Search (LESS). By leveraging gradient-based influence estimation, LESS filters detrimental data by selecting training examples that align with the target task's gradient trajectory. Experiments show that fine-tuning on just 5% of LESS-selected data matches full-dataset training performance. Furthermore, using LESS-selected data for second-stage fine-tuning outperforms fully trained models, whereas random selection degrades them. Prioritizing data quality over quantity thus offers a promising path to state-of-the-art automated hardware design.
Work in Progress
DescriptionRecent advances in topology synthesis demonstrate strong potential to improve standard cell power, performance, and area (PPA). However, the state-of-the-art method prioritizes area and relies on topology-level timing estimates that ignore the dominant impact of post-layout interconnect parasitics. We address this limitation with a layout-aware synthesis framework that embeds an in-loop, half-perimeter wirelength (HPWL)-based parasitic proxy during synthesis. By pruning candidates with high physical costs during topology generation, our approach ensures that topological benefits effectively translate into realized post-layout performance. Validated in a 4nm technology, our framework matches prior best-in-class area reduction while improving timing by up to 13.8% over topology-only synthesis and 25.05% over industrial baselines.
Late Breaking Results
DescriptionAnalog integrated circuit (IC) layout automation is fundamentally hindered by the semantic gap in constraint extraction and the premature binding of device shapes. Traditional serial flows rely on simplistic, netlist-driven constraint handling and decouple device generation from floorplanning. This rigidity drastically restricts the feasible solution space, often leading to severe dead-space overhead and performance degradation. To address these bottlenecks, we propose CoLLM-DDM, a novel analog layout automation framework. It synergizes a Constraint-aware Large Language Model (CoLLM) engine, which intelligently extracts comprehensive topological and signal-flow constraints from both schematics and netlists, with a Deferred Decision Making (DDM) based floorplanning algorithm. By dynamically exploring full-spectrum device folding schemes and deferring geometric binding to the global placement phase, CoLLM-DDM maximally preserves optimization freedom. Experimental results on industrial-grade analog circuits demonstrate that, compared to the state-of-the-art framework, CoLLM-DDM achieves a 63% reduction in dead space and a 7% improvement in core net half-perimeter wirelength (HPWL), all while maintaining a sub-1.5-second runtime.
Late Breaking Results
DescriptionAs transistor scaling advances beyond the 3 nm node, the Flip-FET (FFET) architecture has emerged to provide dual-sided pin accessibility. However, ensuring routable dual-sided connectivity in compact FFET standard cells remains challenging. This work presents a scalable FFET standard cell layout synthesis framework that jointly optimizes transistor placement, pin accessibility, and in-cell routing through two tightly coupled optimization phases: a satisfiability modulo theories (SMT)-based merge-aware transistor placement phase, followed by a dynamic programming (DP)-based wave propagation routing phase that integrates dynamic pin assignment. Experimental results on 3.5T FFET cells show reductions of 5%, 10%, and 22% in cell width, via count, and wirelength over DAC-25, and 1%, 6%, and 14% over ASPDAC-26. Furthermore, the proposed framework successfully routes all 2.5T FFET cells, achieving up to 27% area reduction compared to the corresponding 3.5T designs.
Work in Progress
DescriptionSemi-structured N:M sparsity theoretically reduces arithmetic cost but often fails to deliver proportional end-to-end latency reduction on modern GPUs. We identify that a key bottleneck lies in irregular execution paths and poor Tensor Core utilization in existing sparse kernels. This work presents MD-SpMM, a Tensor-Core-native N:M sparse CUDA kernel that restructures sparse computation into micro-dense MMA-aligned dataflow. By decoupling sparsity handling from the execution-critical path and introducing inference-scale-aware adaptive parallelism, MD-SpMM restores dense-like execution efficiency while preserving sparsity benefits. Preliminary results across multiple GPU platforms demonstrate up to 2x average speedup over dense cuBLAS and significant gains over state-of-the-art N:M sparse kernels, validating the practicality of Tensor-Core-aligned N:M acceleration.
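To make the N:M pattern concrete, the sketch below prunes a weight row so that each group of M consecutive values keeps only its N largest-magnitude entries (2:4 by default). It illustrates the sparsity format only, not the MD-SpMM kernel; the function name and the magnitude-based selection rule are assumptions.

```python
def prune_n_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out

pruned = prune_n_m([0.1, -0.9, 0.4, 0.05, 1.0, 2.0, -3.0, 0.5])
```

Because every group of four holds exactly two non-zeros, the arithmetic cost halves in theory; the abstract's point is that realizing this on a GPU also requires a kernel whose dataflow keeps the Tensor Cores busy.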
Late Breaking Results
DescriptionAs resistance and routing congestion intensify at advanced nodes, backside power delivery networks (BSPDNs) have emerged to offload power routing from congested frontside metal. In BSPDNs, nano-through-silicon vias (n-TSVs) bridge backside power layers to frontside transistors, and their placement critically impacts IR-drop, thermomechanical stress, and manufacturability. Unlike traditional frontside PDNs, where TSV candidates can be placed in general whitespace, BSPDN n-TSVs are constrained to buried power rail (BPR) locations, and their stress effects are non-negligible. Prior staggered assignment schemes mitigate congestion but introduce manufacturing complexity and excessive via counts. We present the first IR-drop-aware n-TSV optimization flow for BSPDNs under n-TSV density constraints that enhance packaging manufacturability. We provide a mixed integer linear programming (MILP) formulation based on our circuit model for backside power delivery networks. We use efficient model transformation and legalization methods to obtain our final assignment solutions. Compared with the state-of-the-art staggered assignment method, we reduce the required number of n-TSVs by an average of 56.68% to satisfy the given IR-drop constraints within a reasonable runtime.
Late Breaking Results
Late Breaking Results: Novel Qubit Mapping for Two-Dimensional Trapped-Ion Quantum Computing Systems
6:56pm - 7:00pm PDT Monday, July 27 Exhibit HallDescriptionEffective qubit mapping facilitates quantum algorithm implementations in physical quantum computing architectures. Recent work reported promising qubit mapping on one-dimensional trapped-ion systems. Due to manufacturing complexity, however, it is insufficient to map all logical qubits into one single qubit array in a large-scale quantum circuit; instead, we shall divide a quantum circuit into subcircuits for extending to two-dimensional trapped-ion systems. Nevertheless, operating a two-qubit gate between different subcircuits reduces the circuit fidelity. As a result, it is desirable to develop an effective partitioning algorithm that can maintain fidelity and minimize the execution time. This paper develops an effective divide-and-conquer algorithm for each array. Unlike traditional min-cut partitioning, which balances the qubit number on each array, we propose a depth-aware partitioning algorithm to expand the solution space for all mapping solutions and optimize the execution time on each array. In addition, we develop a satisfiability modulo theories-based algorithm to optimize the mapping solution for each array. Experimental results show that our algorithm achieves, on average, a 14% total time step reduction and a 30% total fidelity improvement for commonly used benchmarks.
Late Breaking Results
DescriptionAs conventional standard-cell methodologies increasingly limit Design Technology Co-Optimization (DTCO) in advanced nodes, direct transistor-level placement emerges as a crucial solution for minimizing wirelength and area. However, existing analytical placers optimize continuous coordinates, struggling to capture the strictly discrete and directional nature of active breaks.
To overcome this, we propose a physics-aware generative framework featuring a novel Augmented Split-Graph to explicitly model complex active break constraints. By employing a continuous latent diffusion process that deterministically decodes into discrete Sequence Pairs, our method bridges continuous generative modeling and discrete physical constraints, guaranteeing strictly overlap-free layouts. Experimental results demonstrate that our framework significantly outperforms a leading commercial standard-cell-based baseline, achieving a 12.4% reduction in average wirelength, a 13.4% increase in active-sharing count, and a 7.4% overall reduction in layout area.
Late Breaking Results
DescriptionA 3D-IC architecture packs a design with enhanced functionality and density into a small footprint while improving performance and lowering costs. By leveraging through-silicon via (TSV) and die-to-die bump technology, power can be efficiently delivered to the top and bottom dies. Moreover, advanced backside power delivery technology enables a more streamlined power delivery network for 3D-IC. However, deploying TSVs and power/ground bumps while achieving fast and accurate verification further complicates power delivery network (PDN) construction. To tackle this challenge, we collect golden IR drop results from commercial tools for different configurations and densities of TSVs, backside metals, die-to-die bumps, and C4 bumps, then apply and compare efficient tree-based methods (Random Forest and XGBoost) and deep neural networks (U-Net and Inception U-Net). For runtime, compared to a heuristic manual approach, Random Forest achieves a 5.8X speedup (including commercial tool run time). For reducing the power mesh metal ratio, we devise an Inception U-Net combined with image rotation to generate end-to-end IR-drop prediction results for 3D-IC, achieving an MSE of 0.002/0.005 (bottom/top die) and an R² of 0.97/0.95 (bottom/top die), saving 7.6% metal usage, and delivering more than a 5% improvement in IR-drop results, which are very close to the golden results reported by commercial tools. The experiments are conducted on industrial homogeneous design integrations manufactured in a 3nm process. Our results show that our approach can quickly and accurately predict IR drop, establish a PDN with small resource usage, and greatly reduce the time for back-and-forth verification.
Work in Progress
DescriptionQuantum sensing promises measurement sensitivities beyond classical limits, but the study of sensing protocols remains challenging due to limited access to specialized quantum sensing hardware. Existing approaches evaluate sensing protocols using either circuit-level simulations or physics-based sensor models, leaving a gap between protocol design and experimental validation. This work introduces Q-STEP, a cross-layer framework for systematically evaluating quantum sensing protocols using both reference models and experiment-oriented workflows. In Q-STEP, circuit-level simulations implemented in Qiskit generate a reference model that predicts the expected behavior of sensing sequences. Experimental sensing datasets are generated using the Qudi control framework and analyzed with physics-based NV center models in QuTiP and QuaCCAToo. Using this workflow, we evaluate Ramsey, Hahn echo, and CPMG sensing sequences and examine how coherence decay behavior observed in experiment-oriented datasets compares to circuit-level predictions. By enabling cross-layer validation between circuit-level models and physics-based sensor analysis, Q-STEP provides a structured methodology for validating quantum sensing protocols prior to deployment on quantum sensing platforms.
Work in Progress
DescriptionVariational quantum circuits (VQCs) are promising yet costly to deploy because redundant parameters and deep entangling layers amplify noise and runtime. We present Q-Compression, a post-training pipeline that masks small-magnitude angles, snaps low-sensitivity parameters to a low-bit grid using a curvature proxy, and freezes noise-dominated updates via gradient-variance tracking. Across MNIST, FashionMNIST, and KMNIST (binary 0-vs-rest), Q-Compression preserves accuracy and often slightly improves it while reducing circuit operations and depth. Best settings (e.g., keep ratio 0.75, 16 levels) maintain accuracy above 0.997 with substantial complexity savings. These reductions yield shallower, uniform circuits that are better suited to NISQ execution and tuning.
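As a rough illustration of the first two steps named in the abstract (magnitude masking and snapping to a low-bit grid), the following Python sketch may help. It is not the authors' pipeline: the function names, the uniform grid over [-π, π], and the sample angles are assumptions made here, and the curvature proxy and gradient-variance freezing are not modeled.

```python
import math

def mask_small_angles(angles, keep_ratio):
    """Zero out the smallest-magnitude angles, keeping about keep_ratio of them."""
    n_keep = round(len(angles) * keep_ratio)
    # Indices sorted by descending magnitude; the first n_keep survive.
    order = sorted(range(len(angles)), key=lambda i: abs(angles[i]), reverse=True)
    keep = set(order[:n_keep])
    return [a if i in keep else 0.0 for i, a in enumerate(angles)]

def snap_to_grid(angles, levels, lo=-math.pi, hi=math.pi):
    """Round each unmasked angle to the nearest point of a uniform grid.

    Assumption: the grid is uniform over [lo, hi] with `levels` points.
    Masked (exactly zero) angles are left untouched so they stay removable.
    """
    step = (hi - lo) / (levels - 1)
    return [a if a == 0.0 else lo + round((a - lo) / step) * step for a in angles]

# Hypothetical rotation angles, for illustration only.
angles = [0.02, -1.30, 0.75, -0.01, 2.10, 0.40, -2.80, 0.05]
masked = mask_small_angles(angles, keep_ratio=0.75)   # keeps 6 of 8 angles
compressed = snap_to_grid(masked, levels=16)          # 16 levels = 4 bits
```

With a keep ratio of 0.75 on eight angles, the two smallest magnitudes are zeroed, and every surviving angle moves by at most half a grid step.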
Late Breaking Results
DescriptionThis work presents a novel approach to formulating Logic Equivalence Checking (LEC) and Test Pattern Generation (TPG) problems as Quadratic Unconstrained Binary Optimization (QUBO) problems, allowing them to be solved with quantum algorithms such as Quantum Annealing and the Quantum Approximate Optimization Algorithm. We propose an advanced QUBO formulation approach that significantly reduces the qubit, quantum gate, and quantum circuit depth requirements, thereby improving the scalability and efficiency of quantum algorithms. To the best of our knowledge, this is the first work that formulates LEC and TPG problems as QUBO. This work opens up new possibilities for leveraging quantum computing in the domains of verification and testing.
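To make the idea of encoding logic as QUBO concrete, here is a minimal Python sketch of the standard textbook penalty for a single AND gate. It is not the formulation proposed in this work, which the abstract does not detail; the function name `and_penalty` is chosen here for illustration.

```python
from itertools import product

def and_penalty(x, y, z):
    """Quadratic penalty over binary x, y, z.

    It evaluates to 0 exactly when z == (x AND y) and is positive otherwise,
    so minimizing it forces z to behave like the gate output.
    """
    return x * y - 2 * (x + y) * z + 3 * z

# Enumerate all assignments and keep those with zero penalty.
zero_energy = [
    (x, y, z)
    for x, y, z in product((0, 1), repeat=3)
    if and_penalty(x, y, z) == 0
]
```

The zero-penalty assignments are exactly the rows of the AND truth table. Summing such penalties over every gate of a circuit gives one QUBO objective whose minima correspond to consistent circuit evaluations.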
People
Late Breaking Results
DescriptionLayer-wise N:M sparsity balances accuracy and hardware acceleration for Vision Transformers (ViTs), yet identifying effective configurations is costly due to fine-tuning overhead and latency-induced fragmentation. We present HaLSpar, a hardware-aware framework that couples a recoverability-driven zero-shot proxy (RCG) with a latency-constrained search strategy. By estimating recovery potential without repeated fine-tuning, HaLSpar directly optimizes sparsity configurations. On ImageNet-1K with multiple ViT and Swin models, it achieves up to 270x faster search while delivering 1.5x–2.5x speedups with minimal accuracy degradation.
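For readers unfamiliar with N:M sparsity, the following Python sketch shows the basic pattern: in every consecutive group of M weights, only the N largest-magnitude weights are kept. It illustrates the sparsity format only, not HaLSpar's proxy or search; the function name and the sample weights are assumptions made here.

```python
def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude weights in every consecutive group of m."""
    if len(weights) % m != 0:
        raise ValueError("weight count must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest magnitudes inside this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

# Hypothetical weight row, for illustration only.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_2_4 = prune_n_m(row, n=2, m=4)
```

A layer-wise scheme would choose a different N (and possibly M) for each layer, which is the configuration space the abstract says is costly to search.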
Late Breaking Results
DescriptionLocal Binary Pattern Network (LBPNet) concentrates representational power in a small set of learned spatial sampling offsets, creating a high-leverage fault surface. We propose ROAST, a white-box reverse-training attack that updates only offsets to maximize the loss, then maps adversarial offsets to a minimal-bit-flip schedule using truncated Hamming distance on FP32 encodings. On MNIST and SVHN, ROAST induces >70% and 64% accuracy drops while flipping only ~4–5% of offset bits, outperforming BFA in damage-per-bit and avoiding border-saturation artifacts.
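The cost of moving a stored offset from its clean value to an adversarial one can be measured as the number of differing bits between the two FP32 encodings. The Python sketch below computes the plain Hamming distance; the truncated variant mentioned in the abstract is not specified there and is not reproduced. The function names are chosen here for illustration.

```python
import struct

def fp32_bits(value):
    """Return the IEEE 754 single-precision encoding of value as a 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def hamming_fp32(a, b):
    """Number of bit positions in which the FP32 encodings of a and b differ."""
    return bin(fp32_bits(a) ^ fp32_bits(b)).count("1")

# Flipping only the sign of an offset costs a single bit...
sign_flip_cost = hamming_fp32(1.0, -1.0)
# ...while moving from 1.0 to 2.0 changes all eight exponent bits.
doubling_cost = hamming_fp32(1.0, 2.0)
```

The example shows why numeric closeness and bit-flip cost are different things: a small change in value can need many flips, and a large change can need only one.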
Late Breaking Results
DescriptionEfficient OARSMT construction is a critical bottleneck for modern VLSI routing. Current solutions either achieve high speed but ignore obstacles, or guarantee legality at the expense of prohibitive runtimes. In this paper, we introduce a scalable framework that predictively compresses the routing search space without sacrificing wirelength quality. We first present a learning-guided candidate pruning to condense the obstacle-expanded graph into a set of high-confidence Steiner candidates. Then, we establish a novel Delaunay-driven topological sparsification that geometrically confines any optimal obstacle-avoiding rectilinear tree to a linear-size subspace. Evaluations on extreme-density benchmarks demonstrate a 1.63× speedup and a 3.2% wirelength reduction over the state-of-the-art solver. Furthermore, integrating our framework into TritonRoute-WXL reduces the total routing runtime by 7.0% and wirelength by 0.3% on ICCAD 2019 benchmarks, without introducing any additional DRC violations.
People
Late Breaking Results
DescriptionHybrid GPU–CPU deployment of large Mixture-of-Experts (MoE) models is often latency-bound by CPU-side expert execution and cross-device orchestration overhead. To overcome this limitation, we propose a structured expert routing strategy, implemented via asymmetric expert skipping, which reduces expert activation in latency-critical layers while applying lightweight magnitude compensation and norm calibration to preserve model accuracy. Experiments on quantized Qwen3 (30B and 235B) models show that our method improves end-to-end generation throughput by up to 60% while achieving 3.6% higher accuracy than uniform expert reduction under the same compute budget.
People
Work in Progress
DescriptionMemristive devices are promising for neuromorphic and in-memory computing, but practical analog implementations often face variability and fabrication challenges. This work presents a synthesizable 8-bit digital memristor emulator implemented in Verilog and targeted for FPGA realization. The design employs a discrete-time, threshold-based state update mechanism with Q1.15 fixed-point arithmetic at 10 kHz sampling. Simulation results demonstrate frequency-dependent pinched hysteresis consistent with memristive behavior, as well as programmable SET/RESET resistance modulation and stable non-volatile retention under zero excitation. The bounded 8-bit state ensures deterministic operation suitable for hardware deployment. The proposed architecture provides a compact and reproducible digital platform for architectural exploration of memristor-inspired computing systems, with scalability toward FPGA-based arrays for neuromorphic and edge computing applications.
People
Work in Progress
DescriptionCompile success has become the dominant evaluation metric in LLM-for-hardware research. We demonstrate that at the protocol level, compile success is nearly uncorrelated with functional correctness. We introduce three formal metrics, Repair Efficiency Score (RES), Verification Gap (VG), and Specification Coverage Ratio (SCR), that operationalize the compile-to-correctness gap, and report their values from an instrumented case study of LLM-generated UVM testbench generation for an AHB2APB bridge. A compilation-driven agentic repair loop resolved 37 compile errors in 15 calls (RES = 2.47), yet VG = 0.80 remained after full automation: 80% of functional failures were invisible to the compiler. A taxonomy of eight protocol-level failure modes characterizes precisely where current LLMs reach their limit. The functionally corrected testbench detected a previously unknown RTL race condition in the bridge's registered pipeline logic.
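The reported RES value is consistent with reading the metric as compile errors resolved per repair call. The short Python check below makes that arithmetic explicit; the formula is inferred from the numbers in the abstract and is an assumption, not a definition taken from the paper.

```python
def repair_efficiency_score(errors_resolved, repair_calls):
    """Assumed definition: compile errors resolved per repair call."""
    return errors_resolved / repair_calls

# Figures quoted in the abstract: 37 compile errors resolved in 15 calls.
res = repair_efficiency_score(37, 15)
res_rounded = round(res, 2)
```

Under this reading, 37 / 15 rounds to the 2.47 stated above.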
People
Work in Progress
DescriptionWe present a lightweight, mapless path-planning framework for resource-constrained robots operating in dynamic environments. The method incrementally builds small local maps from onboard sensors, selects intermediate subgoals, and applies A* planning only within these local regions to navigate step-by-step toward the goal, ensuring constant memory and bounded computation. Deployed on an Arduino Nano 33 BLE Rev2, the framework achieves per-step planning in 1.3 to 10.6 ms using 88 kB of flash and 205 kB of RAM, outperforming reactive and incremental baselines that fail under identical conditions.
Late Breaking Results
DescriptionThermal management in 3DIC designs is critical due to high power density from vertically stacked dies. While FEM-based thermal analysis provides accurate results, it is computationally prohibitive for design space exploration. We present a generic 3DIC multi-physics foundation model architecture that structurally supports multi-modal 3DIC inputs and task-specific decoders. We instantiate this architecture for 3DIC thermal analysis, supporting arbitrary HTC distributions, design sizes, and power patterns. Our model achieves 7% MAPE compared with golden FEM simulations on unseen cases. Compared to an FEM solver (2 hours, 16GB), our approach runs in <1 second using 8MB, achieving a 7200X speedup with a 2000X memory reduction for a representative test case.
Late Breaking Results
DescriptionIn recent years, customized and low-precision CNNs have emerged, tailored for TinyML applications and well-suited for FPGA deployment. DSP Packing has been proposed to increase the computational density of the limited DSP blocks on modern FPGAs, but current solutions under-utilize key DSP components, like the pre-adder. In this work, we introduce R-Pack, a novel activation- and weight-packing technique, able to fully utilize DSP blocks, doubling their multiplication density to boost inference performance. Our evaluation showcases our framework's capabilities in reducing the inference latency by 80% on average, for a mean accuracy loss of only 2.23%, compared to the baseline state-of-the-art hls4ml tool.
Late Breaking Results
DescriptionAccurate timing-library generation across PVT conditions has become increasingly expensive in advanced technology nodes. This paper introduces a robust, training-free method for inferring CCS library entries at target PVT corners by interpolating voltage waveforms, rather than currents, using shape-preserving interpolation. By working in the voltage domain, the approach avoids the instability and outliers often seen in ML-based current prediction and produces physically consistent results. Experiments on the ASAP7 and CAD Contest datasets show that our method outperforms leading ML regressors, reducing CCS error by more than 100×. The result is faster, safer digital signoff with minimal additional characterization overhead.
Late Breaking Results
DescriptionModern printed circuit boards (PCBs), especially high-power PCB designs, impose diverse design requirements that demand automated routing. Although grid-based data structures can enforce design rules, they often incur excessive via usage and long runtimes due to insufficient global planning. This paper presents a constraint-driven routing framework that systematically addresses these challenges. Our framework features three key techniques: (1) a crossing-aware global routing method that models and enforces pair-spacing and width-range constraints, (2) a spatial-order preservation strategy that expands the solution space by avoiding premature pruning of alternative routing states, and (3) a piecewise net-sizing technique that allocates net widths according to routing resources while maintaining rule compliance. Experimental results demonstrate the effectiveness of our proposed framework while satisfying all design rules.
Work in Progress
DescriptionNitrogen-vacancy (NV) centers in diamond provide a highly sensitive platform for quantum sensing. However, extracting meaningful information from noisy and lossy measurement data remains a major challenge. Quantum machine learning (QML) offers a powerful framework for parameter estimation by learning complex relationships between quantum sensing data and underlying physical signals. In this work, we demonstrate the role of QML in enhancing magnetic field estimation through an NV-center-inspired magnetometry experiment. We formulate the sensing task as a regression problem and compare classical machine learning models trained on classical measurement data with quantum kernel-based models trained on pre-measurement coherent quantum states. Our results establish a theoretical upper bound on sensing performance achievable when learning from coherent quantum states. These findings indicate that the effectiveness of QML in quantum sensing critically depends on access to coherent quantum information. They also motivate future sensing architectures that integrate quantum sensors with quantum-native learning pipelines to unlock improved sensing performance.
Late Breaking Results
DescriptionAccurate prediction of future blood glucose (BG) levels is critical for the effective management of type 1 diabetes. Existing approaches for glucose prediction often rely on deep learning models that are computationally intensive. In this work, we propose an adaptive and lightweight recursive least squares (RLS)-based BG predictor that enables online model updates while maintaining low computational complexity. Our RLS implementation on a tiny FPGA delivers 1.55 µs latency and 51.3 µJ energy consumption per prediction. Evaluation on the OhioT1DM dataset demonstrates an RMSE of 18.83 and 32.12 for 30- and 60-minute prediction horizons, respectively. Our FPGA BG predictor outperforms the state-of-the-art (SoA) far-edge implementations across all metrics: latency, energy, and error.
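For context on why RLS suits online updates, here is a minimal Python sketch of the textbook recursive least squares step with a forgetting factor. It is not the authors' FPGA design: the feature choice, the forgetting factor of 0.99, the initial covariance, and the synthetic noise-free data are all assumptions made here for illustration.

```python
def rls_update(theta, P, x, y, lam=0.99):
    """One recursive least squares step with forgetting factor lam.

    theta: current weights; P: inverse-correlation matrix (list of lists,
    symmetric); x: feature vector; y: observed target.
    Returns the updated (theta, P).
    """
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    # Prediction error computed with the weights from before this update.
    err = y - sum(theta[i] * x[i] for i in range(n))
    theta = [theta[i] + gain[i] * err for i in range(n)]
    # P <- (P - gain * x^T P) / lam; x^T P equals Px because P is symmetric.
    P = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(n)] for i in range(n)]
    return theta, P

# Synthetic stream generated by y = 2*x0 + 3*x1 (hypothetical, noise-free).
theta = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]
for t in range(20):
    x = [float(t % 5), float((t * 3) % 7)]
    y = 2.0 * x[0] + 3.0 * x[1]
    theta, P = rls_update(theta, P, x, y)
```

Each step costs a fixed number of multiply-accumulate operations for a given feature count and needs no stored history, which is what makes the method attractive for small hardware targets.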
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionDifferentiable techniques for timing optimization have become the dominant paradigm in global placement, surpassing weight-tuning heuristics by directly leveraging gradients that encompass full-chip context. However, in detailed placement, state-of-the-art flows still rely on rule-based or enumeration-centric methods that ignore the existing gradient landscape, sacrificing significant wirelength for only marginal timing gains. To overcome this issue, we present LATTE, the first legality-assured, end-to-end differentiable timing-driven detailed placer that fuses continuous gradient smoothness with detailed discrete granularity. In particular, at each iteration, LATTE relaxes local density constraints around selected timing-critical cells, steers relocation using exact timing and wirelength gradients with step sizes on annealed schedules, and enforces end-of-place legality via incremental legalization and bad-move filtering. Across input placements from mainstream commercial and academic placers, LATTE consistently delivers significant timing improvement with minimal routed wirelength impact while ensuring legal final solutions. On 15 designs in a 7nm technology node, LATTE on average improves state-of-the-art timing-driven detailed placers including DREAMPlace-4.0 TCAD and Rsyn by 29.7% in TNS and 44.4% in routed wirelength, with all metrics verified by an industry-leading commercial tool.
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionAccurate power analysis is critical in VLSI design, as it directly impacts power optimization strategies. However, traditional approaches are often hindered by the substantial runtime required for per-cycle toggle propagation in the netlist, which propagates register toggle information through combinational logic. To address this, we propose LEAP, the first work to enable per-cycle toggle propagation prediction with both high accuracy and efficiency. This is achieved through a novel, linear-complexity graph transformer capable of simulating toggle propagation, along with specially designed self-supervised pre-training tasks that enable the model to capture circuit structure and functionality. LEAP achieves a 7.6x speedup over the EDA tool in toggle propagation, and attains a near-perfect area under the Precision-Recall curve (PR-AUC) of 0.99 for prediction results. Moreover, LEAP can be seamlessly integrated with other machine learning based power models into LEAP‑Power. This integration enables precise per‑cycle layout power prediction directly from post‑synthesis netlists, achieving a mean absolute percentage error (MAPE) of only 4.55%. By bypassing toggle propagation in the netlist, LEAP‑Power delivers substantial runtime gains, running 5.3x faster than the model without LEAP.
People
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionAs modern CPUs grow increasingly configurable, design space exploration (DSE) has become essential for navigating complex architectural trade-offs. However, existing DSE approaches predominantly rely on statistical correlations, resulting in opaque decision processes. They cannot clearly explain how configurations affect PPA outcomes, which in turn reduces designers' confidence in automated recommendations. To address this limitation, we introduce causal learning into the DSE pipeline and develop CAL-DSE, a framework that constructs a validated causal graph combining statistical evidence with LLM-informed domain knowledge. Building on this structure, the causal graph decomposes the high-dimensional design space, enabling both interpretability and efficient exploration. Experimental results on a RISC-V processor show that CAL-DSE achieves up to 4.12× hypervolume improvement while revealing validated causal pathways between design parameters and PPA outcomes.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionTiming closure is a key objective in FPGA placement. As designs scale, heterogeneous interconnections induce strong RC effects, weakening delay–wirelength correlation and limiting existing placers with inaccurate timing models and fixed weighting. We propose LS-Placer, a learning-based, slack-aware timing-driven placement framework that jointly models net and logic delays through a graph representation and integrates the learned timing model into a dynamic global placement flow with adaptive TNS-guided weighting. On average, LS-Placer improves WNS and TNS by ~11% and ~19% and achieves 3% shorter critical path delay than Vivado 2021.2.
People
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionThe IC3/PDR algorithm is a cornerstone of hardware model checking, and machine learning (ML) has been explored to guide its critical inductive generalization step. However, prior ML methods are built upon a per-clause graph analysis paradigm, requiring repetitive and costly graph processing for every clause, creating a severe scalability bottleneck. Therefore, we introduce LeGend, a framework that completely replaces this paradigm with one-time global representation learning. LeGend architects a domain-adapted self-supervised learning task to generate latch embeddings that encode global circuit properties. These pre-computed embeddings enable a lightweight model to predict high-quality lemmas with negligible overhead, effectively decoupling expensive learning from fast inference. Experiments show our approach accelerates two state-of-the-art IC3/PDR engines across a diverse set of benchmarks, presenting a promising path to scale up formal verification.
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionThermal management critically affects the reliability of chiplet-based heterogeneous integrated circuits in the post-Moore era. The design process demands hundreds of thermal simulations across varying power distributions, cooling conditions, and structural configurations. Conventional methods based on the finite element method (FEM) and compact thermal models (CTMs) lack the flexibility for efficient multi-scenario evaluations and incur prohibitive computational cost when modeling complex vertical stacks with cross-scale interconnects. We propose LegoTherm, a modular thermal modeling framework that decomposes chiplet-based systems into reusable reduced-order components. The framework exploits the inherent hierarchical structure of heterogeneous integration to construct high-fidelity 3D meshes capturing micro-interconnection details through modular discretization, then generates reduced-order modules by aggregating thermally coupled interface ports. This approach preserves critical hotspot accuracy while enabling rapid reassembly for different design scenarios. Evaluated on industrial-scale 2.5D and 3D packaging benchmarks, LegoTherm achieves up to 10.39X speedup compared to COMSOL while maintaining mean relative error below 0.40%. The framework reduces thermal design iteration cycles from weeks to half a day, addressing a critical bottleneck in modern IC design.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionLearning Parity with Noise (LPN) has emerged as a promising foundational primitive for post-quantum cryptography (PQC). However, existing LPN accelerators primarily optimize isolated computation while overlooking execution flow and suffering from data transfer bottlenecks. Recent advances in AI accelerators have demonstrated remarkable capabilities attributed to their novel architecture paradigms. Inspired by this insight, this work presents LHPA, the first AI-inspired hardware accelerator for LPN acceleration on FPGAs. LHPA implements a heterogeneous architecture comprising configurable hardware engines. Then, LHPA employs a stage-level scheduling strategy and establishes a bandwidth-oriented performance model. LHPA achieves a 52.80x performance gain over the state-of-the-art LPN architecture.
Engineering Presentation
AI
Design
EDA
DescriptionMemory design in deep sub-micron technologies is riddled with challenges. Dealing with device variability is extremely difficult, and the problem is aggravated in the ultra-low-voltage regime. Peripheral circuits associated with the memory cell, such as read-assist using wordline underdrive (WLUD), directly impact the cell current and hence the performance of the memory. Memory cell current is extremely sensitive to the wordline voltage level. Peripheral circuits such as the sense amplifier, write-assist, replica tracking, and other adaptive circuits need to be designed carefully to minimize variability.
We have demonstrated the use of AI-augmented circuit optimization using ASO.ai technology on a variability-prone read-assist circuit. AI-based circuit optimization has been used extensively over the last 2-3 years for analog circuits such as bandgap and VCO, which are typically not constrained by area. Here, we demonstrate that the same technology can be used for critical memory circuits.
The optimizer runs a highly reduced set of simulations to optimize the target measurement. Because the WLUD circuit is PVT adaptive, optimization across PVT corners is needed. Manual optimization is difficult because of the large number of independent variables. We optimized the read-assist circuit using ASO.ai with a turnaround time of ~2 weeks, including iterations in layout design and back-annotation, compared to ~5 weeks with the manual approach. We further reduced variability by 30% compared to the manual approach, which resulted in a performance gain of ~10%. ASO.ai technology optimizes not only for performance but also for area and dynamic/static power metrics.
We will also discuss our ongoing work on bit-cell optimization, where SNM (Static Noise Margin) and WM (Write Margin) are optimized to meet the required target sigma. Here we trade off SNM against WM while maintaining the same bit-cell area.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionRISC-V has recently ratified a vector cryptography extension. Exhaustive formal verification of hardware designs that implement this extension is crucial for security. However, scaling verification for designs with such large bit-widths is challenging. We present the first formal verification of Marian, an open-source implementation of the RISC-V vector cryptography extensions. We show that proof modularization enables us to obtain an unbounded proof for Marian. Together with our systematic invariant identification, we reach an additional speedup of 174%. Our evaluation shows that invariants that assert properties of counters, such as bounds or directions, or handshakes are particularly effective in improving the verification times. We show the generalizability and reusability of our invariant identification methodology by formally verifying another custom implementation of RISC-V vector cryptography extensions. During verification, we found a violation that turned out to be a flaw in the specification, which is now being updated.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn recent years, the demand for high-efficiency cameras has increased significantly in the mobile industry. There is a need to support fast and efficient signal transmission between camera devices and mobile processors. Signal transmission and processing occur across multiple layers derived from the seven layers of the OSI (Open Systems Interconnection) model. This paper focuses on the first layer, the Physical Layer, whose implementation is known as PHY. Key functions of the PHY include data transmission, signal conversion, and signal modulation. This paper addresses the use of Liberty Modeling techniques to validate critical inter-lane skew and sync symbol specifications for multi-lane high-speed PHY transmitters. This modeling enables early and accurate SoC-PHY timing checks within the existing digital flow, reducing the risk of missing validation, avoiding potential 1–2-month re-implementation cycles, and improving overall time-to-market and design robustness for high-speed SoCs. The study explains how constraints are modeled in Liberty files to ensure proper validation during timing analysis. These techniques are scalable and applicable to similar constraints in Liberty models of other multi-lane PHYs.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionState-of-the-art large language models rely on the bfloat16 (BF16) format for stable training, effectively avoiding underflow and overflow. However, inference is heavily bottlenecked by data movement across chiplet systems. While floating-point standards allocate fixed bits, our profiling reveals that BF16 exponent streams exhibit surprisingly low Shannon entropy (less than 3 bits), indicating high inherent compressibility. To exploit this, we propose LEXI, a novel lossless, exponent-only scheme based on Huffman coding. LEXI enables on-the-fly compression of activations and caches and stores weights in compressed form for just-in-time decompression near compute, without degrading overall system throughput. Integrated into a GF 22 nm Simba architecture, LEXI reduces inter-chiplet communication time by 33–45% and end-to-end LLM inference latency by 30–35% on modern models (Jamba, Zamba, Qwen). With a minimal 0.09% area overhead, LEXI marks a critical step toward efficient LLM deployment on modular chiplet systems.
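The exponent-entropy observation can be illustrated with a minimal Python sketch. It is not the LEXI implementation: the synthetic Gaussian values, the helper names (`bf16_exponent`, `huffman_code`, `encode`, `decode`), and the software bitstring representation are all assumptions made for illustration. The sketch extracts the 8-bit exponent field (BF16 shares float32's sign and exponent layout, assuming truncation rather than rounding), measures its Shannon entropy, and checks that a Huffman code over the exponents is lossless.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x):
    """8-bit exponent field of x after truncating float32 to bfloat16."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    return (bits >> 23) & 0xFF  # BF16 keeps float32's 8 exponent bits


def shannon_entropy(counts):
    """Entropy in bits per symbol of an empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code(counts):
    """Build a prefix code {symbol: bitstring} from symbol counts."""
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (count, unique tie-breaker, partial code table).
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, a = heapq.heappop(heap)
        c2, _, b = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in a.items()}
        merged.update({s: "1" + code for s, code in b.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return out


# Synthetic stand-in for activations or weights (an assumption, not real model data).
random.seed(0)
exponents = [bf16_exponent(random.gauss(0.0, 0.02)) for _ in range(5000)]
table = huffman_code(Counter(exponents))
stream = encode(exponents, table)
print(f"entropy: {shannon_entropy(Counter(exponents)):.2f} bits, "
      f"Huffman average: {len(stream) / len(exponents):.2f} bits (fixed field: 8 bits)")
```

Only the exponent stream is coded here; the sign and mantissa bits would be stored unchanged, which is what makes an exponent-only scheme lossless.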
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionVarious real-world applications rely on in-memory dynamic graphs that must efficiently handle frequent updates while supporting low-latency analytics on evolving structures. Achieving both objectives remains challenging due to the trade-off between update efficiency and traversal locality, particularly under highly skewed degree distributions. This motivates the design of graph indexing schemes optimized for in-memory graph management on modern multi-core CPUs. We present LHGstore, a degree-aware Learned Hierarchical Graph storage that, for the first time, integrates learned indexing into graph management. LHGstore designs a two-level hierarchy that decouples vertex and edge access and further organizes each vertex's edges using data structures adaptive to its degree. Lightweight arrays are used for low-degree vertices to maximize traversal locality, while learned indexes are applied to high-degree vertices to improve update throughput. Extensive experiments show that LHGstore achieves 5.9-28.2× higher throughput and significantly faster analytics than SOTA in-memory graph storage systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionFoundational IPs like standard cells, IOs and memories are vital for SoC development, with the Liberty (.lib) format central to capturing timing, power, and statistical variation. However, modern libraries are increasingly complex and foundry-specific, posing significant challenges for IP evaluation and selection.
Key hurdles include:
- High Cell Count: Advanced libraries contain thousands of cells, each with multiple timing arcs and conditions.
- Formatting Inconsistencies: Variations in naming conventions, data models, and variants across IP providers complicate analysis.
- Iterative PPA Estimation: Traditional Power, Performance, and Area (PPA) assessment is a time-consuming, iterative process involving Synthesis, Static Timing Analysis (STA), and Place & Route.
- Early Selection Impact: Choosing the wrong library early can lead to timing violations, power inefficiencies, area overruns, and costly redesign cycles.
To address these issues, we present a methodology for Liberty-based profiling that enables early, aligned analysis of IP views, including support for statistical models (LVF). It facilitates the comparison of timing and power trends across different variants (e.g., threshold-voltage (Vt) corners, base vs. derived cells), streamlining the library selection process.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionStandard cell technology is advancing rapidly, with each new generation introducing greater complexity and heightened design challenges. As designs evolve, meeting critical design goals across new processes becomes increasingly demanding, making Design-Technology Co-Optimization (DTCO) more important than ever. To support these goals and address rising expectations for performance, efficiency, and scale, designers rely on quantifying and comparing IPs using Power, Performance, and Area (PPA) metrics. These PPA values, extracted from Liberty (.lib) data, play a crucial role in enabling early IP selection during the design flow.
A key to measuring the progress of each technology node is the ability to compare it against past designs and processes. Historically, specialized tools for precise cell-to-cell PPA analysis across different technologies did not exist, making trend verification and anomaly detection challenging. Effective cross-technology comparisons would help designers confirm expected improvements or identify outlier results.
This paper presents a novel approach utilizing Siemens' Solido Library Profiler solution to perform robust PPA analysis across multiple technologies. By efficiently extracting, loading, and storing PPA data, this tool enables quick, visual comparison of new and legacy technologies, supporting comprehensive DTCO activities and confident development direction.
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionDeploying Vision Transformers (ViTs) on near-sensor analog accelerators demands training pipelines that are explicitly aligned with device-level noise and energy constraints. We introduce a compact framework for silicon-photonic execution of ViTs that integrates measured hardware noise, robust attention training, and an energy-aware processing flow. We first characterize bank-level noise in microring-resonator (MR) arrays, including fabrication variation, thermal drift, and amplitude noise, and convert these measurements into closed-form, activation-dependent variance proxies for attention logits and feed-forward activations. Using these proxies, we develop Chance-Constrained Training (CCT), which enforces variance-normalized logit margins to bound attention rank flips, and a noise-aware LayerNorm that stabilizes feature statistics without changing the optical schedule. These components yield a practical "measure → model → train → run" pipeline that optimizes accuracy under noise while respecting system energy limits. Hardware-in-the-loop experiments with MR photonic banks show that our approach restores near-clean accuracy under realistic noise budgets, with no in-situ learning or additional optical MACs.
People
Work in Progress
DescriptionIn this paper, we introduce Lighthouse RL, a sample-efficient reinforcement learning (RL) approach for analog circuit sizing. Traditional methods lack generalization across different performance targets, while standard RL approaches waste resources exploring unpromising regions. Our method addresses these inefficiencies through a reset strategy that initializes episodes from high-performing configurations discovered during training, called "lighthouses". These states, which are closer to the target objectives, guide exploration toward promising regions. We compare our approach with RL and Bayesian optimization methods from the literature on a 2D benchmark problem and on two analog circuits, showing significant improvements in sample efficiency (up to 1.72× faster), optimization performance (100% vs. 0-87% success rate), generalization (75% vs. 0-50% extrapolation success), and objective maximization. This efficiency is particularly valuable for computationally expensive black-box optimization problems, and our reset strategy can be used as a plug-and-play enhancement for any RL-based optimization approach.
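The reset idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' algorithm: the top-k buffer, the reset probability `p_lighthouse`, the class name `LighthouseBuffer`, and the random-walk "policy" on a 2D toy problem are all hypothetical stand-ins.

```python
import heapq
import random


class LighthouseBuffer:
    """Keeps the k best-scoring states seen so far (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # min-heap of (score, insertion_id, state)
        self._next_id = 0  # tie-breaker so states themselves are never compared

    def add(self, state, score):
        item = (score, self._next_id, state)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict the current worst

    def states(self):
        """Stored states, best score first."""
        return [state for _, _, state in sorted(self._heap, reverse=True)]

    def reset_state(self, rng, random_state, p_lighthouse=0.5):
        """Start from a stored lighthouse with probability p, else from scratch."""
        if self._heap and rng.random() < p_lighthouse:
            return rng.choice(self._heap)[2]
        return random_state()


def run_episode(start, target, rng, steps=20, step_size=0.1):
    """Random-walk stand-in for an RL policy; returns (state, score) pairs."""
    x, y = start
    visited = []
    for _ in range(steps):
        x += rng.uniform(-step_size, step_size)
        y += rng.uniform(-step_size, step_size)
        score = -((x - target[0]) ** 2 + (y - target[1]) ** 2) ** 0.5
        visited.append(((x, y), score))
    return visited


def search(target, episodes=50, seed=0):
    rng = random.Random(seed)
    buffer = LighthouseBuffer(capacity=5)
    for _ in range(episodes):
        start = buffer.reset_state(
            rng, lambda: (rng.uniform(-1, 1), rng.uniform(-1, 1))
        )
        for state, score in run_episode(start, target, rng):
            buffer.add(state, score)
    return buffer
```

Because the buffer only decides where an episode starts, it can wrap any episodic optimizer without changing the policy update itself.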
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionMetasurface-based photonic computing delivers ultra-high computing power density and enables ultrafast, low-power matrix-vector multiplication (MVM). However, inverse design and system-level simulation remain prohibitively expensive due to large-scale Maxwell solvers. We present LightningFNO, a dual Fourier Neural Operator (FNO) framework that unifies inverse design and fast forward modeling for photonic MVM units. Within this framework, an inverse FNO generates candidate topologies, and a convolutional FNO subsequently predicts their optical responses, thereby facilitating deployment at the neural network level. LightningFNO achieved a speedup of 1,245,550 times compared to adjoint-based optimization for 4×4 designs while maintaining comparable accuracy. Furthermore, fabricated 3×3 prototypes demonstrated a root mean square error (RMSE) of 0.043, and the system attained 97.69% accuracy on the MNIST dataset using metasurface-deployed kernels.
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionExact synthesis of CNOT circuits is a core primitive in quantum compilers, but existing SAT-based and database-based methods suffer from encoding overhead, memory bottlenecks, and poor scalability. We propose Lin-search, a framework for exact CNOT synthesis based on iterative deepening search. Lin-search explores circuits in non-decreasing gate count while pruning partial solutions using variable-mismatch constraints, matrix-based lower bounds, and canonicalization under local circuit equivalences. In random benchmarks, Lin-search finds optimal circuits with up to 15 qubits and 15 CNOT gates in 600 seconds on a laptop, achieving speedups of 100×-1000× and scaling to 3 more qubits than a SAT-based baseline incorporated in Qiskit. When integrated as a rewriting kernel in a Clifford+T optimization flow, Lin-search further reduces CNOT count by 16% on average compared to the original circuits and decreases timeout counts by 30% compared to the SAT-based method, demonstrating its effectiveness as a practical exact synthesis engine for quantum compilers.
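Iterative deepening for exact CNOT synthesis can be sketched as follows. This is a toy Python illustration, not Lin-search itself: the row-mismatch bound and the "no identical gate twice in a row" rule are simple stand-ins chosen here for the paper's pruning rules, and the names `synthesize`, `apply_cnot`, and `run_circuit` are hypothetical. An n-qubit CNOT circuit is modeled as an invertible matrix over GF(2), stored as a tuple of row bitmasks; a CNOT with control c and target t XORs row c into row t.

```python
from itertools import permutations


def apply_cnot(state, control, target):
    """XOR the control row into the target row (one CNOT gate)."""
    rows = list(state)
    rows[target] ^= rows[control]
    return tuple(rows)


def run_circuit(n, circuit):
    """Replay a gate list from the identity and return the resulting matrix."""
    state = tuple(1 << i for i in range(n))
    for control, target in circuit:
        state = apply_cnot(state, control, target)
    return state


def mismatched_rows(state, target):
    """Admissible lower bound: each CNOT changes exactly one row."""
    return sum(a != b for a, b in zip(state, target))


def synthesize(target, max_gates=10):
    """Return a minimum-length CNOT list realizing target, or None."""
    n = len(target)
    gates = list(permutations(range(n), 2))  # all (control, target) pairs
    start = tuple(1 << i for i in range(n))

    def dfs(state, budget, last):
        if state == target:
            return []
        if mismatched_rows(state, target) > budget:
            return None  # cannot finish within the remaining gate budget
        for gate in gates:
            if gate == last:
                continue  # the same CNOT twice in a row cancels out
            tail = dfs(apply_cnot(state, *gate), budget - 1, gate)
            if tail is not None:
                return [gate] + tail
        return None

    # Try gate budgets in increasing order, so the first hit is optimal.
    for depth in range(max_gates + 1):
        result = dfs(start, depth, None)
        if result is not None:
            return result
    return None


# A 2-qubit SWAP, written as the row-permuted identity.
print(synthesize((0b10, 0b01)))
```

Searching budgets in increasing order is what makes the first solution found gate-optimal; stronger lower bounds only shrink the tree explored at each budget.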
Work in Progress
DescriptionIn this paper, we present DFT static verification and debug as a powerful methodology that is scalable to designs with billions of gates. The proposed methodology avoids the pitfalls of alternative approaches such as simulation, formal verification, and large language models (LLMs), and complements the design sign-off flow. We highlight four important DFT verification problems: sequential depth, scan isolation, DFT connectivity checking, and early RTL fault coverage analysis. For each problem, existing methodologies have different pitfalls that prevent arbitrary scaling in design size. We present efficient and scalable static algorithms that scale linearly with the size of the design graph (O(V+E) in the worst case, where V is the number of gates/instances and E is the number of point-to-point nets connecting the instances). We also present results on large-scale (100+ million gates) industrial designs from mobile SoC and AI/Edge application domains.
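A linear-time static check of this kind can be sketched with a breadth-first traversal of the design graph. The example below is an assumption-laden illustration, not the presented tool: the netlist is a hypothetical fan-out dictionary, and the check (every scan flop must be reachable from the scan-enable port) is one simple form of DFT connectivity checking.

```python
from collections import deque


def reachable(fanout, source):
    """All nodes reachable from source; each node and edge is visited once."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in fanout.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def unconnected_scan_flops(fanout, scan_enable_port, scan_flops):
    """Scan flops that the scan-enable port does not reach, sorted by name."""
    seen = reachable(fanout, scan_enable_port)
    return sorted(flop for flop in scan_flops if flop not in seen)


# Hypothetical netlist: driver -> list of driven instances.
netlist = {
    "SE": ["buf0"],
    "buf0": ["ff_a", "buf1"],
    "buf1": ["ff_b"],
    "clk": ["ff_a", "ff_b", "ff_c"],
}
print(unconnected_scan_flops(netlist, "SE", ["ff_a", "ff_b", "ff_c"]))
```

Since every instance enters the queue at most once and every net is followed at most once, the traversal costs O(V+E), matching the bound quoted above.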
Engineering Presentation
EDA
Security
DescriptionThe increasing complexity and scale of modern ASIC designs have made RTL linting a critical yet challenging aspect of the hardware development process. Traditional manual linting workflows are slow, error-prone, and often overwhelmed by the sheer volume of warnings and errors generated by industry-standard tools. This bottleneck not only delays design cycles but also forces teams into undesirable tradeoffs between speed and code quality.
In this work, we present a novel, fully automated RTL lint remediation platform that leverages codemod technology—previously successful in software engineering—and adapts it for hardware design. Our system proactively scans RTL codebases using nightly runs of industry-standard lint tools, applies context-aware automated fixes, and validates each change through rigorous RTL lint and design verification (DV) workflows. Disruptive or faulty changes are automatically triaged, while human-in-the-loop review ensures correctness and maintains expert oversight.
By automating repetitive error correction and integrating robust validation, our solution enables ASIC engineers to focus on high-value design tasks, significantly improving both efficiency and code quality.
This work represents a step-change in RTL design automation, bridging the gap between software and hardware codemod practices, and offers a scalable, adaptive, and reliable framework for next-generation ASIC development.
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionThree-dimensional (3D) printing enables low-cost, rapid prototyping of complex microfluidic biochips, but light penetration during printing distorts small, multi-layered structures. Existing solutions are either inaccessible or overlook localized geometric effects, leading to inaccuracies that degrade device performance. We propose a novel physics-aware, data-driven design-for-manufacturing approach to improve dimensional fidelity. Using a new dataset of design-fabrication discrepancies, our machine learning-assisted method predicts spatially resolved compensation values. Experiments on complex multi-layer devices demonstrate that our method significantly enhances geometric accuracy and consistently outperforms existing compensation strategies, even with low-cost hobby-grade printers and off-the-shelf resins.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionVideo LLMs have achieved remarkable performance in processing hour-long videos, yet they suffer from severe memory overhead and latency due to expanding KV caches—critical bottlenecks for real-world online applications. To tackle this, LiveVLM is proposed: a training-free, query-agnostic framework featuring two key mechanisms. The Vision Sink Bucketing (VSB) processes video streams in real time, retains long-term details, and eliminates redundant KVs, while the Position-agnostic KV Retrieval (PaR) decouples positional embeddings to reduce irrelevant context interference via efficient page-level retrieval. Extensive experiments confirm LiveVLM delivers state-of-the-art accuracy on LLaVA-OneVision, outperforming both training-free query-agnostic methods and training-based online models.
Additional Meeting
DescriptionJuly 27th, 1:00 PM – 2:30 PM
The Si2 LLM Benchmarking Coalition is an effort under Silicon Integration Initiative (Si2) focused on advancing empirical evaluation for AI in chip design. At this special session the coalition members will present live demos and strategic outlook discussions.
The goal of the coalition is to bring together high-quality AI-for-chip design datasets into a consistent, reproducible format, operate a centralized leaderboard maintained by Si2 staff, and, as a community, establish reporting standards for how these benchmarks are used in the literature. More broadly, we aim to align the field around shared evaluation, improving benchmarks collaboratively rather than having them compete in isolation. You can find more details here: https://si2.org/llm-benchmarking-coalition/. We are also in the process of launching a public leaderboard to track model and agent performance, with a regular cadence of updates as datasets and methodologies evolve.
Presentations will start at 1:00 PM. Lunch is available for the first fifty attendees.
Watch for Si2 DAC updates!
Research Special Session
AI
DescriptionThe growing complexity of modern integrated circuits has significantly increased the demands placed on hardware engineers, particularly within design, simulation, and verification workflows. These processes are inherently iterative and often rely on extensive manual effort, making them time-consuming and prone to errors. Consequently, there is an increasing need for more efficient and scalable Electronic Design Automation (EDA) solutions. Large Language Models (LLMs), which are trained on extensive human knowledge and interact naturally through text, present a promising opportunity to support front-end EDA tasks. By assisting with design generation and verification, LLMs have the potential to substantially improve circuit design productivity. In this talk, we demonstrate how LLMs can automate critical stages of the hardware development lifecycle. We introduce methods for automatically generating circuit designs and their corresponding testbenches from a shared specification, and we illustrate how LLMs can enhance design quality in established workflows such as high-level synthesis. Finally, we discuss the current challenges and limitations of LLM integration in EDA and outline future research opportunities for advancing LLM-enabled front-end design.
Research Special Session
AI
DescriptionAnalog/RF circuits are ubiquitous in electronic systems interfacing with the physical world. Their design, however, remains time-consuming and error-prone. An LLM-boosted agentic workflow in combination with RL-based optimization offers a route to the fully automated synthesis of analog/RF circuits from spec to layout. The flow covers topology selection, circuit sizing, and layout synthesis. Multiple collaborating agents execute an automatic reasoning-driven synthesis flow that calls LLMs and tools for RL-based optimization, layout, and simulation when needed, sequentially iterating through and optimizing the candidate netlists until a topology satisfies the target performance specifications. The combination with LLMs adds reasoning and explainability for the generated design solutions. The flow will be illustrated for several design examples.
People
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPlacement quality strongly impacts routability and timing in DRAM peripheral circuits. While expert designers manually identify structural patterns of cells to be clustered during placement in small scale circuits to enhance the quality of results (QoR), this approach does not scale to large DRAM peripheral designs.
This research proposes an automated cell clustering method using large language models (LLMs) for DRAM peripheral circuit placement. Expert knowledge describing the target structural patterns for clustering is written as natural language prompts. These prompts enable the LLM to interpret the netlist as a graph and identify the target structural patterns. The identified clusters are treated as single placement instances to achieve structure-aware placement. The proposed approach allows flexible detection of target structures and remains robust to structural variations without explicit rule-based programming.
Experimental results on a DRAM peripheral circuit demonstrate an 80.6% reduction in the number of DRVs. Setup WNS improves by 0.9% while hold WNS degrades by 128%, highlighting the importance of timing-aware clustering as future work.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionHigh-Level Synthesis (HLS) plays a pivotal role in AI accelerator design, but significant challenges remain in achieving full automation, particularly in areas like module partitioning, dataflow orchestration, and throughput maximization. Current solutions, such as compiler-based optimizations and large language models (LLMs), face limitations in adaptability, system-level optimization, and handling complex computational models. We propose a novel LLM-powered framework that automates the generation and optimization of system-level HLS architectures. The framework generates modular dataflow architectures, refines them through pattern-based optimizations, and produces synthesizable hardware implementations, demonstrating an average performance improvement of 11.78× over current SOTA approaches.
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionSignal Temporal Logic (STL) is a formal specification language for describing real-time and real-valued properties of Cyber-Physical Systems (CPS). Accurate and automatic translation of CPS specifications in natural language (NL) into formal STL formulas is crucial. Traditional methods relying on manual templates or deep-learning models are limited and lack flexibility. Recently, methods based on large language models (LLMs) have partially addressed these issues, but they do not exploit the inherent structure of STL, either for the translation itself or for evaluating the result. To address these issues, we propose STLGen, a novel LLM-enhanced automatic transformation framework from NL to STL, which introduces a two-stage generation process with a structured natural language, named NL2, as the intermediate representation. In Stage 1, NL is converted into well-defined NL2 through structured prompt engineering. In Stage 2, NL2 is converted into STL formulas using a converter. We leverage LLM-aided generation and closed-loop verification with matching algorithms, and fine-tune models in two stages with instruction fine-tuning and LoRA. Additionally, we introduce two evaluation metrics: structure accuracy, to assess the impact of STL syntax on logic, and STL-SCOTES, to evaluate semantic consistency via STL trajectories. Experimental results demonstrate that our method outperforms state-of-the-art methods on both the classic evaluation metrics and our proposed metrics.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionHigh-quality timing constraints decisively influence the results of logic synthesis, yet their creation still relies on time-consuming, expert-level manual work. Although large language models (LLMs) and multi-agent systems have recently delivered strong performance on related EDA automation tasks, most notably script generation, their capability for constraint authoring remains largely untapped. We propose an LLM-centric multi-agent framework that automatically produces Synopsys Design Constraint (SDC) files from natural-language specifications and the corresponding RTL. First, we introduce a rigorously aligned data-construction flow that couples specifications, RTL, and expert-validated SDCs. The resulting corpus simultaneously serves as a reproducible benchmark. Second, we instantiate three cooperative agents: (i) a specification-parser agent that extracts timing and design constraints from textual documents; (ii) a false-path discovery agent that identifies and annotates false and multi-cycle paths; and (iii) an SDC-generation agent that emits complete, tool-ready constraint files. Evaluated on open-source designs, the framework improves constraint correctness, completeness, and coverage over both rule-based baselines and state-of-the-art general-purpose LLMs by 9.8% on average, while reducing human effort from hours to minutes. These results substantiate the viability of LLM-powered constraint generation and represent a further step toward fully autonomous, AI-driven EDA toolchains.
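The final emission step of such a pipeline can be pictured with a small Python sketch. It is only an illustration under assumptions: the dictionary schema, the function name `emit_sdc`, and the example clock, port, and pin names are hypothetical, and the sketch covers just three common SDC commands (`create_clock`, `set_false_path`, `set_multicycle_path`) rather than a complete constraint set.

```python
def emit_sdc(spec):
    """Render a constraint spec (plain dict, hypothetical schema) as SDC text."""
    lines = []
    for clk in spec.get("clocks", []):
        lines.append(
            f"create_clock -name {clk['name']} -period {clk['period_ns']} "
            f"[get_ports {clk['port']}]"
        )
    for fp in spec.get("false_paths", []):
        lines.append(
            f"set_false_path -from [get_clocks {fp['from_clock']}] "
            f"-to [get_clocks {fp['to_clock']}]"
        )
    for mc in spec.get("multicycle_paths", []):
        lines.append(
            f"set_multicycle_path {mc['cycles']} -setup "
            f"-from [get_pins {mc['from_pin']}] -to [get_pins {mc['to_pin']}]"
        )
    return "\n".join(lines) + "\n"


# Example spec, as a parser stage might hand it over (all names are made up).
example = {
    "clocks": [
        {"name": "core_clk", "period_ns": 2.0, "port": "clk"},
        {"name": "dbg_clk", "period_ns": 10.0, "port": "tck"},
    ],
    "false_paths": [{"from_clock": "dbg_clk", "to_clock": "core_clk"}],
    "multicycle_paths": [
        {"cycles": 2, "from_pin": "u_mul/a_reg/Q", "to_pin": "u_mul/p_reg/D"}
    ],
}
print(emit_sdc(example), end="")
```

Keeping the intermediate spec as structured data, separate from the emitted text, makes each stage's output easy to check against expert-written constraints.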
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionModeling the characteristics of emerging devices is indispensable for integrating them into circuit design. The modeling process uses the Verilog-A language, drawing on the device's physical parameters and test results. This work demonstrates a large language model (LLM)-empowered Verilog-A iterative (LLMVA) modeling process for the spin-transfer-torque magnetic-tunnel-junction (STT-MTJ). The LLM speeds up interaction between device-level characteristics and circuit-level indicators and realizes design technology co-optimization (DTCO). At the macro level, a 4-Mb 28-nm magnetic-random-access-memory (MRAM) macro is designed and taped out. To the best of the authors' knowledge, this is the first work to use an LLM for device modeling and MRAM DTCO. The iterative test results show that the deviation between Front-test/Post-test results of the MTJ does not exceed 9.2%, demonstrating the effectiveness of the proposed LLMVA modeling process. With the proposed LLMVA agent, we have reduced our designer and time costs by about 50%.
People
Research Special Session
AI
DescriptionWhile AI advances are disrupting many creative and economic models in the arts and entertainment, most popular efforts in these fields have focused on the impact of cloud-hosted, centralized capabilities. This talk will explore the opportunities in creative applications for the distribution and decentralization of AI models onto user (audience) devices, which has the potential for cost savings, enhanced privacy, lower latency, operation with limited connectivity, and device-to-device collaboration for new audience experiences. It will cover current research at UCLA in future fan experiences supported by the ability to run AI inference, including media generation and computer vision, directly on mobile devices, and the tradeoffs and challenges in comparison to other approaches. As a case study, we will discuss the porting of 2D and 3D media generation models and pose recognition to NPUs on contemporary mobile phones, and proof-of-concept entertainment applications conceived for these capabilities. Through this case study we will discuss the opportunities for hardware support of local AI processing in service of novel human-computer interactions and the potential evolution of traditional media formats towards AI models.
Engineering Presentation
EDA
DescriptionLocal layout effects (LLEs) in advanced analog nodes cause variations in device threshold voltage (Vth) and mobility that depend heavily on the placement of neighboring structures and the stress in the FEOL stack. Conventional flows require LVS, PEX, and post‑simulation, making iterative compensation time‑consuming and impractical for early‑stage design. This work introduces an artificial neural network (ANN)‑based surrogate model that predicts ΔVth using only the relative distances between devices, independent of LVS status. By starting from the saturated Vth and applying the predicted ΔVth, device placement can be optimized to reduce dummy area and improve matching.
To train the models, 180 k samples per LLE item were generated, and input features were reduced through binning by device type, achieving a target relative error below 0.05 %. The solution was integrated with Calibre (LVS & xACT) and validated on a 2‑nm SF pilot run. In the test case of a StrongARM latch comparator, the ANN‑driven approach—compared with manual designs that relied solely on legacy process knowledge—maintained a Vth standard deviation of zero and eliminated 33 % of unnecessary dummy regions, all without invoking full post simulation.
Work in Progress
DescriptionGlobal routing is a crucial step in VLSI design that has become increasingly complex as chip sizes and design scales grow. Many global routers divide the process into several stages: two-pin decomposition, congestion map generation, maze routing, and layer assignment. Each of these stages requires routing thousands to millions of nets, providing opportunities for parallelization. In this paper, we demonstrate that global routing is not embarrassingly parallel and describe a lock-based approach to shared-memory parallel global routing that identifies net dependencies through the use of an R-tree. This lock framework is flexible enough to be applied to every stage of the global routing pipeline. We evaluate our global router on the ISPD 2008 and ISPD 2019 global routing contest benchmark suites. Our approach achieves a significant speedup in runtime without a reduction in quality.
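As a rough illustration of why net dependencies matter, the sketch below treats two nets as dependent when their bounding boxes overlap and greedily groups independent nets into batches that could be routed concurrently. It is not the paper's router: the lock framework and the R-tree are replaced here by a brute-force overlap test, and `Net`, `overlaps`, and `independent_batches` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Net:
    name: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def overlaps(a: Net, b: Net) -> bool:
    """True if the two bounding boxes share at least one grid cell."""
    return (a.x_min <= b.x_max and b.x_min <= a.x_max
            and a.y_min <= b.y_max and b.y_min <= a.y_max)


def independent_batches(nets):
    """Greedily place each net in the first batch where it conflicts with nothing."""
    batches = []
    for net in nets:
        for batch in batches:
            if not any(overlaps(net, other) for other in batch):
                batch.append(net)
                break
        else:
            # Every existing batch holds a conflicting net, so open a new one.
            batches.append([net])
    return batches


# Illustrative nets on a coarse routing grid.
nets = [
    Net("n1", 0, 0, 4, 4),
    Net("n2", 3, 3, 8, 8),      # overlaps n1, so it cannot share n1's batch
    Net("n3", 10, 10, 12, 12),  # overlaps nothing
    Net("n4", 7, 0, 9, 2),      # overlaps nothing
]
batch_names = [[n.name for n in batch] for batch in independent_batches(nets)]
print(batch_names)
```

A spatial index such as an R-tree would replace the quadratic overlap scan with range queries, which matters at the net counts the abstract mentions.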
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionGraph-based Approximate Nearest Neighbor Search (GANNS) has been widely adopted for vector retrieval owing to its high scalability and low latency. However, existing GANNS frameworks face severe latency bottlenecks on billion-scale datasets due to excessive SSD I/O. This work presents LOHA, a latency-optimized CPU–storage hybrid architecture for product quantization (PQ)-based billion-scale ANNS. LOHA decouples graph traversal and re-ranking workloads, executing the former on the CPU while offloading the latter to an in-storage accelerator that exploits SSD-level parallelism. Furthermore, a speculative re-ranking mechanism pipelines both stages to minimize idle time and reduce end-to-end latency. Experiments show that LOHA achieves throughput improvements of up to 9.6x, 6.2x, and 4.0x over CPU-, GPU-, and in-storage architectures, respectively.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis paper introduces a low power optimization flow based on time-driven power analysis to reduce dynamic power. Our approach is to integrate simulation-based toggle rate estimation, which was not actually introduced into commercial tools due to runtime and limited input vectors despite high accuracy, into real-world implementation. We replace the probabilistic approach using Switching Activity Interface Format (SAIF), which is commonly used in commercial tools for toggle rate estimation of combinational logic gates, with time-driven simulation using Fast Signal Data Base (FSDB), and evaluate the effects of power, timing, and runtime. In addition, to solve the runtime increase, which is the most critical limitation of time-based flow, we propose Activity Window Profiling, an innovative approach that applies a partial duration representing the entire duration of the FSDB. Our FSDB-based flow with activity window profiling is practically implemented in the state-of-the-art synthesis & physical design tool. We evaluated our method for industrial designs with large circuits and long simulation times. When applied in the P&R stage, our methodology achieved a dynamic power reduction of 5.02% on average compared to the RTL-SAIF flow. In addition, by selectively introducing activity window profiling, runtime was reduced without significant total power change, preventing the design TAT increase of timing-based simulation. Finally, our new approach was introduced into real-world implementations without exception.
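One way to picture choosing a partial duration that represents the whole simulation is to pick the window whose toggle rate is closest to the full-trace rate. The sketch below does exactly that on per-bin toggle counts. The selection criterion is an assumption made for illustration and is not taken from the paper, and `most_representative_window` is a hypothetical name.

```python
def most_representative_window(toggles_per_bin, window_bins):
    """Return the start index of the window whose mean toggle rate best matches the full trace."""
    overall_rate = sum(toggles_per_bin) / len(toggles_per_bin)
    starts = range(len(toggles_per_bin) - window_bins + 1)

    def rate_error(start):
        window = toggles_per_bin[start:start + window_bins]
        return abs(sum(window) / window_bins - overall_rate)

    # min() keeps the earliest start when several windows tie.
    return min(starts, key=rate_error)


# Illustrative toggle counts per time bin for one signal group.
toggles = [9, 9, 4, 6, 1, 1]
best_start = most_representative_window(toggles, window_bins=2)
print(best_start)
```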
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionAs the critical stage in chip design, routing comprises global routing and detailed routing, with track assignment (TA) serving as an intermediate step to bridge the two routing phases. However, existing methods consider only 2D mismatches and overlook 3D coordination, resulting in degraded routing quality. This paper proposes LRTA, a routability-driven 3D-aware track assignment framework with layer reassignment to coordinate the resources across layers while preserving connectivity. In addition, a GPU-accelerated TA scheme is proposed to construct candidates and dynamically assign tracks, enabling efficient concurrent evaluation. Experimental results demonstrate that, compared with existing works, the proposed LRTA achieves significant improvements in routability estimation with shorter runtime.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionQubit readout is a critical operation in quantum computing systems, which maps the analog response of qubits into discrete classical states. Deep neural networks (DNNs) have recently emerged as a promising solution to improve readout accuracy. Prior hardware implementations of DNN-based readout are resource-intensive and suffer from high inference latency, limiting their practical use in low-latency decoding and quantum error correction (QEC) loops.
This paper proposes LUNA, a fast and efficient superconducting qubit readout accelerator that combines low-cost integrator-based preprocessing with Look-Up Table (LUT) based neural networks for classification. The architecture uses simple integrators for dimensionality reduction with minimal hardware overhead, and employs LogicNets (DNNs synthesized into LUT logic) to drastically reduce resource usage while enabling ultra-low-latency inference. We integrate this with a differential evolution based exploration and optimization framework to identify high-quality design points. Our results show up to a 10.95× reduction in area and 30% lower latency with little to no loss in fidelity compared to the state-of-the-art. LUNA enables scalable, low-footprint, and high-speed qubit readout, supporting the development of larger and more reliable quantum computing systems.
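The integrator-based dimensionality reduction can be pictured as boxcar integration: a long readout trace is collapsed into a few window sums that then feed the classifier. The sketch below is a minimal version under that assumption. The window length, the trace values, and the name `integrate_windows` are illustrative and do not come from the paper.

```python
def integrate_windows(trace, window):
    """Sum consecutive, non-overlapping windows of samples; a trailing partial window is dropped."""
    return [sum(trace[i:i + window])
            for i in range(0, len(trace) - window + 1, window)]


# Illustrative in-phase (I) and quadrature (Q) samples of one readout pulse.
i_trace = [1, 2, 3, 4, 5, 6, 7, 8]
q_trace = [0, 1, 0, 1, 2, 2, 2, 2]

# Eight samples per channel shrink to two features per channel.
features = integrate_windows(i_trace, 4) + integrate_windows(q_trace, 4)
print(features)
```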
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionAs processor designs grow more complex, verification remains bottlenecked by slow software simulation and low-quality random test stimuli. Recent research has applied software fuzzers to hardware verification, but these rely on semantically blind random mutations that may generate shallow, low-quality stimuli unable to explore complex behaviors. These limitations result in slow coverage convergence and prohibitively high verification costs. In this paper, we present Lyra, a heterogeneous RISC-V verification framework that addresses both challenges by pairing hardware-accelerated verification with an ISA-aware generative model. Lyra executes the DUT and reference model concurrently on an FPGA SoC, enabling high-throughput differential checking and hardware-level coverage collection. Instead of creating verification stimuli randomly or through simple mutations, we train a domain-specialized generative model, LyraGen, with inherent semantic awareness to generate high-quality, semantically rich instruction sequences. Empirical results show Lyra achieves up to 1.27x higher coverage and accelerates end-to-end verification by up to 107x to 3343x compared to state-of-the-art software fuzzers, while consistently demonstrating lower convergence difficulty.
People
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionFully homomorphic encryption (FHE) enables direct computation on encrypted data that ensures robust privacy protection for machine learning systems. However, the substantial computational overhead of FHE necessitates efficient acceleration techniques for practical deployments.
In GPU-accelerated homomorphically encrypted neural network (HENN) inference, we observe that the primary performance bottleneck shifts from computation to memory transfer due to huge memory footprints w.r.t. the limited device memory. To address this, we propose a memory-aware design framework to fully utilize GPU memory by adaptively trading computation for reduced memory footprint. Our framework introduces a static model to estimate memory footprint and latency for HENN inference. We propose an automatic design space exploration framework to generate optimal cryptographic, bootstrapping, and encoding configurations, thereby effectively minimizing execution latency with the given GPU memory capacity. Experiments with homomorphically encrypted ResNet-20 and ResNet-18 across various GPU devices show up to 4.97× speedup with a 91.88% reduction in memory footprint. It also demonstrates the first deployment of full-fledged homomorphically encrypted ResNet-20/18 at 128-bit security level on an RTX 4060 Ti GPU with 16 GB of device memory.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionCompact models are a critical component of foundry process design kits (PDKs), enabling accurate simulation of device behavior across bias, geometry, and temperature conditions. As semiconductor technologies scale, compact models have rapidly increased in complexity, often involving hundreds of strongly coupled parameters. Traditional parameter extraction flows rely heavily on manual tuning and iterative optimization using global and local algorithms, frequently requiring weeks of expert effort, and suffering from convergence to sub-optimal local minima.
This work presents a machine learning (ML) optimizer–based automatic compact model extraction flow that significantly accelerates and simplifies model development. The proposed framework leverages a Python-based, ML-driven optimizer to construct generic, scalable, and repeatable extraction flows with substantially fewer optimization steps. The flow decomposes full global model extraction into structured stages, including CV extraction, wide large IV fitting, geometry scaling, minimum length effects, and temperature dependence, with optional log-scale sub-steps to capture secondary effects such as DIBL and GIDL, so that the full global model extraction flow is covered.
Key advantages of the optimizer are its derivative-free nature, its robustness in high-dimensional parameter spaces, and its ability to extract multiple parameters simultaneously using the Optuna optimization framework. Proof-of-concept results are demonstrated on GaN HEMT devices using the ASM HEMT DC model, with successful validation across production-grade compact models such as BSIM4 and PSP for IV and CV data. Overall, the proposed ML-based flow reduces extraction time from weeks to hours, lowers reliance on deep device modeling expertise, and enables faster compact model delivery for DTCO applications.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionPreemptive scheduling is gaining attention in GPU scheduling for distributed training because it reduces job completion times (JCT). However, our analysis of production traces reveals that, surprisingly, it can increase JCT in practice. We identify two key contributors to this inefficiency. First, preemptions become "futile" when a job is preempted right after being loaded, before it begins execution. We find this futile preemption wastes the load time and so inflates JCT ∼1.6×. Second, existing GPU schedulers run at fixed intervals (e.g., 360 s) rather than upon new job arrivals or when a job completes. So, new jobs must wait until the scheduler kicks in, which, we find, increases JCT ∼2.6×. To address the problems, we introduce Lazer, a novel job scheduler that predicts when to preempt jobs based on job-specific and cluster conditions. Lazer is designed to efficiently explore the scheduling space and adapt to diverse job and GPU cluster characteristics based on Bayesian optimization. Our extensive evaluation shows that Lazer significantly outperforms state-of-the-art schedulers—reducing JCT by 1.2×–233.3×, waiting time by 2×–11690×, and futile preemptions by 23×–67×.
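The cost of fixed-interval scheduling can be seen with simple arithmetic: a job that arrives between two scheduling rounds idles until the next one. The sketch below computes that idle time for the 360 s interval quoted above. The arrival times are made up for illustration and `wait_until_next_tick` is a hypothetical name.

```python
import math


def wait_until_next_tick(arrival_s, interval_s=360.0):
    """Seconds a newly arrived job idles before the next scheduling round."""
    next_tick = math.ceil(arrival_s / interval_s) * interval_s
    return next_tick - arrival_s


# Illustrative arrival times in seconds since the scheduler started.
arrivals = [0.0, 10.0, 180.0, 359.0, 400.0]
waits = [wait_until_next_tick(t) for t in arrivals]
print(waits)
```

A scheduler that also reacts to job arrivals and completions would bring each of these waits to zero, at the price of running more often.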
Research Manuscript
Systems
SYS4. Embedded System Design Tools and Methodologies
DescriptionGeneral Matrix Multiplication (GEMM) is a fundamental kernel in scientific computing and deep learning and often dominates both performance and energy consumption, particularly in edge deployments with strict power and resource constraints. AMD's Versal ACAP offers heterogeneous components (AIEs, PL, PS) that can address these challenges, but identifying efficient mappings across these units is challenging, with prior work largely overlooking power-performance trade-offs. We introduce MALIWAN, an automated framework that generates Pareto-optimal GEMM mappings on Versal ACAP devices. MALIWAN combines fast analytical-model–based sampling with data-driven ML, using on-board measurements to train a surrogate that drives large-scale Design Space Exploration. Based on a collection of ≈6,000 on-board experiments, we first provide a comprehensive analysis of how different mapping configurations affect performance and power. We then evaluate MALIWAN on the Versal VCK190, demonstrating geomean improvements of 1.23× (up to 2.5×) in throughput and 1.25× (up to 2.7×) in energy efficiency over state-of-the-art works. Compared to NVIDIA GPUs, MALIWAN achieves up to 2.5× energy efficiency
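For readers unfamiliar with the term, a mapping is Pareto-optimal when no other mapping is at least as good in both throughput and power and strictly better in one. The sketch below filters a candidate list down to that front. The numbers and the names `Mapping`, `dominates`, and `pareto_front` are illustrative and are not taken from MALIWAN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    name: str
    throughput: float  # higher is better
    power: float       # lower is better


def dominates(a, b):
    """True if a is no worse than b on both metrics and strictly better on at least one."""
    no_worse = a.throughput >= b.throughput and a.power <= b.power
    strictly_better = a.throughput > b.throughput or a.power < b.power
    return no_worse and strictly_better


def pareto_front(mappings):
    """Keep every mapping that no other mapping dominates."""
    return [m for m in mappings if not any(dominates(o, m) for o in mappings)]


# Illustrative candidates with arbitrary units.
candidates = [
    Mapping("A", 100.0, 20.0),
    Mapping("B", 120.0, 25.0),
    Mapping("C", 90.0, 22.0),   # dominated by A
    Mapping("D", 120.0, 30.0),  # dominated by B
    Mapping("E", 80.0, 15.0),
]
front_names = [m.name for m in pareto_front(candidates)]
print(front_names)
```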
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionNear-DRAM Processing (NDP) has demonstrated great potential for memory-bound operators in edge-side large language model (LLM) inference. Featuring an xPU-NDP heterogeneous system, existing designs concentrate on optimizing the execution flow within processing units, yet ignore the bottleneck arising from xPU-NDP data transfer. To tackle this challenge, we propose Maple, an efficient mapping exploration framework. Given a specific workload and NDP architecture, Maple adopts an address-mapping-based description method to construct a comprehensive search space that encompasses resource grouping, tensor partitioning, and transfer binding, thereby facilitating joint optimization of computation and data transfer. Experiments show that Maple achieves an improvement of up to 3.34× in performance and 1.49× in energy compared with existing approaches on mainstream NDP architectures.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionTechnology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimization scripts, their potential for core algorithm enhancement remains untapped. We introduce MappingEvolve, an open-source framework that pioneers the use of LLMs to directly evolve technology mapping code. Our method abstracts the mapping process into distinct optimization operators and employs a hierarchical agent-based architecture, comprising a Planner, Evolver, and Evaluator, to guide the evolutionary search. This structured approach enables strategic and effective code modifications. Experiments show our method significantly outperforms direct evolution and strong baselines, achieving 10.04% area reduction versus ABC and 7.93% versus mockturtle, with 46.6%–96.0% S_overall improvement on EPFL benchmarks, while explicitly navigating the area–delay trade-off. We have open-sourced our framework to foster reproducibility and further research.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionAs large language models (LLMs) are often deployed with long, context-rich prefixes, the prefix KV cache frequently exceeds GPU memory capacity. Although offloading the prefix KV cache to host memory or storage alleviates capacity pressure, it introduces severe I/O stalls that increase Time-to-First-Token (TTFT) latency. To address this challenge, we present Marlin, an I/O-efficient prefix KV cache retrieval system for long-prefix LLM inference. Marlin employs a dispersion-based token selector to precompute a compact, query-agnostic subset of important prefix tokens, and a sensitivity-guided head classifier that distinguishes prefix-sensitive from query-sensitive heads and assigns each class its own KV retrieval policy. An overlap-optimized attention pipeline further hides offload latency by overlapping head-specific KV transfers with attention computation. Experimental results demonstrate that Marlin significantly reduces TTFT compared to state-of-the-art methods while maintaining comparable model accuracy.
People
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
DescriptionIn integrated circuit manufacturing, the detection of nanoscale wafer defects is critical for yield improvement and root-cause analysis. However, scanning electron microscopy (SEM) defect datasets in modern production lines are typically long-tailed, with pronounced intra-class diversity and subtle inter-class differences. Most existing methods do not take these properties into account, leading to low recall on tail defects and frequent confusion between categories. To address this, we propose DiffTail, a framework that combines mask-guided diffusion with process-aware metric optimization. The mask-guided diffusion uses defect masks together with SEM-specific prompts that encode defect knowledge to synthesize tail defects at specified locations and of specified types, while latent-space fusion with normal SEM images preserves process-consistent background textures. The process-aware metric optimization module groups class prototypes based on image features that are correlated with process steps. It then applies inter-cluster separation and margin-based constraints to easily confused class pairs, making visually similar defects easier to distinguish. Extensive experiments show that DiffTail improves tail-class detection and segmentation over state-of-the-art methods. To facilitate reproducibility, a subset of the data is available at https://anonymous.4open.science/r/Dataset-107B, and the full dataset will be released upon acceptance.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionMasked diffusion enables region-specific image synthesis but suffers from computational redundancy, since the entire image is processed each timestep even though only the masked region requires generation. To address this, we introduce MASQ, a hardware–software co-designed accelerator for masked diffusion. Our approach performs stage-wise MXINT8/4/2 precision assignment that dynamically reflects spatial and semantic importance, complemented by timestep-aware scheduling and optimized non-matrix operations. MASQ features a block-wise multi-precision compute engine and mask management unit, efficiently handling our approach. It achieves up to 16.06x and 5.39x speedup and 4.18x and 4.93x energy-efficiency gain over A100 and Orin NX, respectively, while preserving quality.
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionDeploying DNNs on Systems-on-Chip (SoCs) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using an SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAdvanced Analog-to-Digital Converter (ADC) designs at 28nm and below demand stringent layout symmetry to ensure parasitic balance and electrical integrity. Conventional unit-cell based approaches for OpAmp arrays rely on manual placement and routing, which fail to preserve logical connectivity and require repetitive Engineering Change Orders (ECO) and Design Rule Check (DRC) corrections across multiple cells. This paper introduces a Design Intent-driven Group Array methodology that enables synchronous placement and routing of matched devices for high-performance ADCs. The proposed flow leverages Group Array constructs to replicate edits globally, maintain connectivity, and support operations such as stretch, chop, and cloning directly on the layout canvas. Nested Group Arrays further optimize symmetry for half-cell structures, while integrated routing assistants complete top-level interconnects. Experimental results demonstrate a 30% reduction in turnaround time (TAT) and significant improvement in DRC/ECO efficiency compared to conventional methods. By combining constraint-driven automation with synchronous array cloning, this approach delivers scalable, manufacturable, and electrically robust layouts for advanced-node ADC implementations.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAt advanced process nodes such as 5nm and 3nm, power efficiency is increasingly limited by glitch activity rather than functional switching alone. Aggressive timing closure, deep logic cones, and high fanout amplify delay mismatches, causing glitch power to contribute 15-25% of dynamic power in real silicon paths. When identified late in the flow, glitch-related inefficiencies are costly to fix and yield limited benefit. This work shows that early RTL-stage glitch optimization delivers significantly higher return on investment (ROI).
We present an RTL-level glitch analysis and optimization workflow using RTL Power Delay-Aware Glitch (DAG) Analysis, enabling designers to detect and address glitch behavior prior to synthesis. The methodology starts with top-level average power analysis, followed by hierarchical glitch reporting to localize glitch-prone logic. The Glitch Source Report ranks the top glitch-generating logic based on downstream impact, separating self-induced and propagated glitch power to guide prioritization.
The Detailed Glitch Report traces complete glitch propagation paths to terminating registers, revealing how logic depth, fanout, glitch event rates, and downstream capacitance amplify power loss. Across multiple advanced-node designs, targeted RTL fixes achieved 10-30% dynamic power reduction, demonstrating that early, delay-aware glitch optimization is a practical, high-ROI strategy for modern SoCs.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionSubstantial weight and KV cache transfers between the GPU and HBM severely constrain Large Language Model (LLM) inference system performance. While LLM compression techniques can reduce memory footprint and transfer volume, existing methods are often task-specific and compromise model accuracy. This work proposes a mega-buffer architecture (MBA) that integrates gigabyte-scale on-chip buffers utilizing the embedded DRAM (eDRAM) technology recently introduced by TSMC. To fully leverage this large on-chip memory, we introduce a KV cache prioritized mapping (KVP) scheme that minimizes inefficient KV cache traffic between the HBM and the chip. Furthermore, a highly efficient pipeline integrating a double buffering (DB) mechanism is co-designed with an iteration-aware eviction (IA) strategy to enhance data reuse and sustain high compute utilization. Evaluation results show that MBA attains 5.99x and 3.63x end-to-end speedup, and 8.86x and 4.93x energy efficiency improvement against GPU and LLM accelerator baselines, demonstrating a highly efficient architectural solution for LLM inference.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe collaboration between AWS and Siemens addresses the critical challenge of matching variable electronic design automation (EDA) computing demand with static on-premise resources, which is both difficult and capital intensive. Siemens' Solido Intelligent Custom IC Verification, enabled by AWS cloud infrastructure, delivers massive scalability and accelerates project schedules, empowering engineering teams to efficiently manage fluctuating workloads. This partnership has resulted in the development of "cloud flight plans" for rapid deployment of Siemens' Solido Custom IC solutions on AWS, providing application notes, best practices, and direct deployment support. The reference flow for deploying on the cloud ensures accelerated time to deployment, optimized workflows, and direct collaboration between Siemens EDA and AWS experts. By leveraging cloud resources, organizations can minimize capital expenditures, maximize engineering productivity, and reduce time-to-market for custom IC designs. This approach not only streamlines deployment but also supports sustainable operations and risk reduction, making it a transformative solution for modern EDA challenges.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionScale-up networks support memory semantics between pooled, disaggregated processors and memories; however, they trigger complex any-processor-to-any-memory accesses and create intricate contention among micro-architectural components. Existing simulators fail to adequately characterize this interference because they oversimplify memory access paths. This paper presents the MemFlow simulator. MemFlow features a modular, fine-grained hardware abstraction of disaggregated memory systems, provides a flow-centric simulation scheme to trace the complete memory access path from load-store queues through cache hierarchies to memory controllers, evaluates micro-architectural contention with detail and fidelity, and enables automatic design space exploration to mitigate contention-induced performance degradation.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSTL (Software Test Library) is a very common, if not the most common, Safety Mechanism (SM) used for ASIL-B designs, mostly for CPUs, DSPs, and the like. STL runs on the CPU in operational mode. The operational software is stopped to allow the STL to run.
Hence, if the STL run time is long, it reduces CPU availability and overall system performance. It is therefore critical to reduce the execution time of the STL to lower its impact on system performance.
On the other hand, to increase the Diagnostic Coverage (DC) to meet the ASIL-B requirements, the STL often needs to be long, as more and more sections are added to cover undetectable faults.
In this presentation, we present a novel methodology, technology, and results from an actual project, where the team was able to reduce the STL total run time to below 3% of what it originally was at the beginning of the project, while keeping the same DC results.
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionNeural networks from CNNs to LLMs achieve remarkable success, yet analog computing-in-memory (CIM) using memristor crossbars suffers from computational inaccuracies caused by device defects and circuit noise. While small errors can be tolerated, significant outliers must be corrected. We propose MECA-CIM, a Maximum Distance Separable (MDS) error correction framework requiring only two redundancy columns. Leveraging Hessian trace-based sensitivity quantification, we introduce layer-wise adaptive protection allocation. Our analog-domain syndrome decoder enables efficient error detection and correction with minimal latency. Experimental results demonstrate robust outlier resistance across CNNs and Transformers, achieving a large area-energy-delay product (AEDP) reduction compared to conventional ECC systems.
Exhibitor Forum
DescriptionMicrosoft Discovery brings agentic AI to chip design by combining specialized AI agents with a graph‑based knowledge engine to accelerate the entire specification‑to‑sign‑off journey. Running on Azure HPC and Azure NetApp Files for predictable performance at scale, Discovery complements the EDA tools teams use today and enhances time‑to‑results and quality‑of‑results through a partner‑first approach. Attendees will see how Discovery integrates seamlessly with industry ecosystems—highlighting recent advances with partners like Synopsys, from copilots to emerging agentic AI flows—and how organizations can turn tribal engineering know‑how into reusable, governed agents that operate consistently across projects. The session will also cover how to pilot agentic workflows, connect Discovery to existing toolchains, and take advantage of new GPU‑accelerated advances reshaping both design and manufacturing, all while keeping humans in control under enterprise‑grade governance on Azure.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionThe fundamental challenge of noise in quantum computing necessitates encoding quantum information into logical spaces protected by quantum error-correcting codes (QECCs). However, no single code supports a fully transversal, and thus fault-tolerant, implementation of all gates required for universality. A promising approach is code switching, where logical information is transferred between QECCs that together enable a universal gate set. Since switching introduces overhead and increases error rates, minimizing such operations is crucial. This work presents an efficient min-cut–based algorithm for compiling circuits with the minimal number of code switches, providing the first automated approach to code-switching compilation.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionTo leverage user-specific data, retrieval-augmented generation (RAG) is used in multimodal large language model (MLLM) applications. Conventional retrieval has low accuracy, and advanced multi-vector retrieval (MVR) still lacks optimal accuracy and efficiency due to ignored query-image alignment and redundant image segments. We propose MIRAGE, an efficient image retrieval framework. It uses a hierarchical paradigm for better alignment, cuts redundancy by leveraging cross-hierarchy ranking consistency and hierarchy sparsity, and auto-configures dataset parameters. Experiments show it boosts accuracy and reduces computation by up to 3.5 times versus existing MVR systems.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionKASLR is a critical defense on macOS, whose implementation on M-series silicon and the latest macOS is further strengthened by hardware–software mitigations, with no practical bypass. We perform a systematic reverse-engineering of the TLB on M-series silicon and uncover a novel user–kernel TLB side-channel, TLBarge, that enables precise monitoring of kernel instruction execution. Leveraging TLBarge, we demonstrate MistleTunnel, the first attack that compromises macOS 26's 15-bit KASLR by recovering its lower 8 bits. Operating solely through normal syscalls, MistleTunnel collapses the effective KASLR entropy to 0.39%, significantly reducing its intended security.
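The 0.39% figure is consistent with simple counting under one reading of the abstract: if the 15 bits of KASLR entropy are uniformly distributed and the low 8 bits are recovered exactly, the attacker is left with 2^7 of the original 2^15 candidate slides. Both the uniformity and the exact-recovery assumption are illustrative and are not stated in the abstract. A quick check:

```python
TOTAL_BITS = 15      # KASLR entropy stated in the abstract
RECOVERED_BITS = 8   # low bits recovered by the attack, per the abstract

total_slides = 2 ** TOTAL_BITS                   # all candidate slides
remaining = 2 ** (TOTAL_BITS - RECOVERED_BITS)   # candidates left after recovery
fraction = remaining / total_slides              # share of the original search space

print(f"{remaining} of {total_slides} candidates remain ({fraction:.2%})")
# prints "128 of 32768 candidates remain (0.39%)"
```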
People
Research Manuscript
Miter-Aware LUT Mapping: Aligning Structure and Solvability for Efficient Logic Equivalence Checking
11:18am - 11:30am PDT Wednesday, July 29 Mtg Room 202AB
EDA
EDA2. Design Verification and Validation
DescriptionLogic Equivalence Checking (LEC) is often bottlenecked by synthesis-induced structural perturbations and XOR-dense regions that hinder SAT solving. We contend that the *modeling* of the miter is as critical as the SAT solver itself and propose a miter-aware mapping framework that reformulates the problem before solving by constructing a LUT-based miter preserving structural correspondence and exposing high-level logic relations. It integrates equivalence-preserving mapping, Gaussian-guided XOR modeling, and solver-oriented LUT selection to produce solver-efficient representations. Experiments on comprehensive benchmarks show up to 92.1% reduction across state-of-the-art SAT solvers, highlighting the importance of solver-aware modeling in enhancing LEC efficiency.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDesigning circuits with a mix of transistor types to balance performance and power is a long-standing strategy. The semiconductor industry has handled transistor variability, and the miscorrelation between transistor types, through skewed corners and derates. A lack of automation for fixing the resulting violations has led to long cycle times, because they are addressed manually. We describe a significant improvement over the prevalent methodology. Cadence's Vt skew flow handles miscorrelation between transistors efficiently, eliminating the overhead of additional STA runs. The flow is supported in Innovus and Tempus, enabling both optimization and timing signoff. The solution relies on creating a bounded GBA, which introduces significant pessimism in Innovus and degrades optimization QoR. Signoff accuracy is not affected, because PBA analyzes this effect accurately by exploring all possible skew combinations for the transistor types in a path. The optimization challenge is overcome with a methodology that performs Vt skew analysis in both GBA and PBA modes and derives a margin. Finally, we present results indicating that the Vt skew flow identifies paths vulnerable to transistor skew and optimizes them. We have also benchmarked the analysis accuracy against SPICE.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionGlass interposers enable 5.5D chiplet stacking, offering an advance beyond silicon-based technology. However, routing in such dense, double-sided environments is challenging, and existing methods do not account for glass interposer design rules. To address this, we propose a 5.5D RDL routing algorithm. It begins by allocating routing resources, employs double-sided global routing for efficient guidance, and uses a novel pixel-based detailed routing to finalize results. Experiments show the proposed method achieves 100% routability and reduces wirelength by 37% compared to a 2.5D RDL router.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Description• Increasing design complexity, tighter noise margins, and higher PPA (Power, Performance, Area) expectations under shorter design cycles in advanced FinFET technology nodes demand faster, exhaustively optimal power delivery network (PDN) synthesis. These PDNs must balance EMIR budget needs against other design targets such as PPA and routability.
• Traditional PDN synthesis methodologies take an iterative hit-and-trial approach, sweeping a very limited random set of design parameters. As a result, they lack coverage, are sub-optimal for PPA, and consume long implementation cycles, which reduces confidence in PDN quality and hurts time-to-market.
• To overcome these limitations, we propose a faster, exhaustive ML-based design space exploration methodology for optimal PDN synthesis. It identifies the power-integrity-sensitive design parameters, such as critical power grid metal layers, metal width, metal pitch, and bump locations, and the value combinations that meet the specified design targets. The methodology offers better design coverage, because designers can sweep many more design parameters across a wide range in fine-grained steps, with faster run times to build and explore the analysis-based meta-models. Our results show a meta-model that correlates very well with simulation, and better width and pitch combinations for some of the specified metal layers (M6, M8, M9), yielding a PDN that meets the specified dynamic IR drop budgets with high confidence.
Work in Progress
ML Framework for Secondary Power Grid Routing Resistance Prediction with Piecewise Linear Regression
6:30pm - 6:31pm PDT Monday, July 27 Exhibit Hall
DescriptionSecondary power grids (Sec-PG) are critical for maintaining Always-On infrastructure, including isolation and retention logic, during primary power-down cycles. To accommodate Vdd-GATED and Vdd-AON supply rails, Sec-PG cells utilise double-height standard cells requiring specialised dual-rail topologies. While existing physical design flows leverage location-aware placement to cluster these instances, verifying routing resistance remains a prohibitive bottleneck due to the iterative nature of sign-off IR extraction. We present a hierarchical ML framework for a-priori Sec-PG resistance estimation using spatial tile-based partitioning and adaptive piecewise linear regression. Evaluated on industrial test chip designs with up to 405k instances, our method achieves R² ≥ 0.995 and MAE ≤ 1Ω. This enables at least 100× faster design iteration with < 1% accuracy loss compared to traditional R-effective calculation from commercial resistance extraction tools.
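As a rough illustration of piecewise linear regression in general, the sketch below fits two line segments to one-dimensional data by trying every breakpoint and keeping the split with the lowest squared error. It is not the authors' framework: the single feature, the exhaustive breakpoint search, the function names, and the synthetic data are all assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def squared_error(xs, ys, line):
    """Sum of squared residuals of the points against a (slope, intercept) line."""
    slope, intercept = line
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

def fit_two_segments(xs, ys, min_pts=3):
    """Try every split of the x-sorted data; keep the split with the lowest total error."""
    pts = sorted(zip(xs, ys))
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    best = None
    for i in range(min_pts, len(xs) - min_pts + 1):
        left = fit_line(xs[:i], ys[:i])
        right = fit_line(xs[i:], ys[i:])
        err = squared_error(xs[:i], ys[:i], left) + squared_error(xs[i:], ys[i:], right)
        if best is None or err < best[0]:
            best = (err, xs[i], left, right)
    return best[1], best[2], best[3]

# Synthetic data (hypothetical): the slope changes at x = 10.
xs = list(range(20))
ys = [2 * x + 1 if x < 10 else 0.5 * x + 20 for x in xs]
breakpoint_x, left_line, right_line = fit_two_segments(xs, ys)
print(breakpoint_x, left_line, right_line)
```

An adaptive variant would choose the number of segments per tile from the data rather than fixing it at two.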
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn advanced technology nodes, timing pessimism caused by power integrity variation has become a critical challenge for sign-off closure. In particular, Sigma AV reported by RedHawk analysis introduces excessive timing margins under non-uniform power density conditions. Conventional switch cell selection assumes uniform power density and applies a single switch cell across the design, failing to capture post-placement current variation.
This work presents a machine learning–based, Sigma AV–aware switch cell optimization methodology. Instead of directly predicting Sigma AV, the proposed approach learns per–switch-cell current behavior extracted from RedHawk analysis using a gradient-boosted decision tree model. Current-critical switch cell instances are identified and selectively optimized using mixed switch cells.
Experimental results demonstrate a clear reduction in Sigma AV variation and the associated Sigma AV–induced slack shift. As a result, worst-case timing slack at p1 endpoints is consistently improved under sign-off conditions, confirming the effectiveness and practicality of the proposed approach.
Work in Progress
DescriptionAccurate timing estimation in VLSI circuits is strongly influenced by interconnect parasitics. Current timing analysis models the distributed RC network as either a lumped-capacitance or a π model. In this work, the Double-π model is introduced as a new RC representation that captures higher-order distributed effects. Higher-order models offer more accuracy at the cost of increased computation. Using a single RC equivalent model for all interconnects often leads to either excessive computational cost or loss of accuracy. This work presents a machine learning–based framework that automatically and rapidly identifies the most suitable simplified RC representation (Lumped Capacitance, π, Double-π, or Distributed) for precise gate delay estimation. The framework determines the minimal RC model that maintains delay deviation within 1% of a fully distributed network. A dataset of 10,000 randomized RC networks was generated for each inverter size and timing arc to support model training and validation. Across all evaluated inverter sizes, the proposed framework achieved an average classification accuracy, based on the best-performing model per configuration, of 94% for T_p (propagation delay) and 93% for T_(rf-out) (output transition time). The proposed framework enables dynamic and optimal selection of interconnect models across a wide range of complexities and types, including both mathematical and circuit-based representations, thereby supporting accurate and scalable timing analysis for advanced VLSI design flows.
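The 1% selection rule can be sketched as a small labeling function: given the gate delay obtained with each candidate model for one RC network, return the simplest model whose delay stays within tolerance of the distributed reference. This is only one plausible way such class labels could be produced; the function name, the dictionary layout, and the delay values are hypothetical.

```python
# Candidate models ordered from simplest to most detailed, as listed in the abstract.
MODELS = ["lumped_capacitance", "pi", "double_pi", "distributed"]

def minimal_model(delays, tolerance=0.01):
    """Return the simplest model whose delay is within `tolerance` of the distributed one.

    `delays` maps each model name to a gate delay in any consistent unit.
    """
    reference = delays["distributed"]
    for name in MODELS:
        if abs(delays[name] - reference) <= tolerance * abs(reference):
            return name
    return "distributed"  # safe default; the reference always matches itself

# Hypothetical delays in picoseconds for a single RC network.
example = {"lumped_capacitance": 41.0, "pi": 39.0, "double_pi": 38.1, "distributed": 38.0}
label = minimal_model(example)
print(label)  # prints "double_pi"
```

Here the lumped-capacitance and π delays deviate by about 7.9% and 2.6%, so the Double-π model (about 0.26%) is the first to meet the 1% bound.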
Engineering Special Session
AI
Design
EDA
Systems
DescriptionTo address pacing threats and increasing system complexity, governments must shorten acquisition timelines while maintaining rigor despite constrained resources. To meet this challenge, Booz Allen developed a set of Model-Based Systems Engineering (MBSE) tools and workflows that accelerate both the interrogation of legacy documentation and the generation of new model-based artifacts. Using LLM+RAG techniques, the toolset extracts requirements from traditional document-based system products, while generation capabilities rapidly produce specification artifacts directly in model-native formats such as Cameo or DOORS. The agentic AI4SE tool validates requirements against INCOSE guidelines, improving the quality of newly generated content and recommending enhancements for requirements derived from legacy sources. With this capability, Booz Allen has extracted hundreds of requirements from unstructured and inconsistent documentation within minutes, providing system engineers with initial MBSE baselines and supporting the development of new acquisition artifacts to meet the accelerated timelines.
Research Special Session
EDA
DescriptionQuantum processors are starting to perform the primitives needed to build a fault-tolerant quantum computer, a critical step on the way to utility-scale quantum computers. These primitives are expensive, and they remain error-prone on the highly space-limited quantum processors available today. In this talk, I will show how detailed hierarchical modeling of fault-tolerant primitives leads to insight into how hardware noise mechanisms impact quantum error correction. Building upon that understanding of errors and drawing on fault-tolerant architecture and quantum algorithms, I will discuss how to benchmark quantum processors as they transition from the noisy intermediate scale to the early fault-tolerant scale.
Engineering Presentation
Design
EDA
DescriptionThe performance of large semiconductor devices, such as power MOSFETs that carry large currents, depends not only on their size or area but also on their layout and package implementation choices. Factors such as device aspect ratios, metal routing strategies, and pad locations influence current flow through the device and affect device characteristics such as on-resistance (Rdson) and reliability.
It has been challenging to model these layout- and configuration-dependent effects. Optimizing the device layouts and pad configurations has been a manual, iterative, and time-consuming task, often done towards the end of the design cycle, which increases the risk of sub-optimal designs.
We propose a new methodology that uses the features of the Ansys (now Synopsys) optiSLang tool to address this problem. We demonstrate a novel flow that integrates Cadence Virtuoso PCell/SKILL scripts for device layout generation with the Ansys Totem tool and its PMIC utility for layout extraction and analysis. We use built-in optiSLang algorithms to run sensitivity analysis and develop meta-models. We envision a flow where process technology or PDK development teams generate a device meta-model early on and export it as a Python function. Design or packaging teams can then use this simple function to identify and implement optimally sized devices for their product requirements.
We will present a couple of examples to describe the optiSLang based flows and present power MOSFET device modeling and optimization results.
Research Special Session
EDA
DescriptionThe demand for extreme-scale AI systems drives the need for advances in compute density and interconnect bandwidth. Co-packaged optics (CPO) offers a path forward by integrating electronic and photonic systems, but designing these tightly coupled platforms presents challenges. Traditional simulations excel at device-level accuracy but fail to capture system-level effects like reflections, spectral interactions, and packaging parasitics. This talk presents a co-simulation and verification methodology for CPO systems, emphasizing unified modeling of photonic and electronic components. By integrating these simulations, we enable more reliable predictions, faster iterations, and first-time-right designs for advanced CPO-based compute platforms.
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionIn modern very large-scale design, cell-level placement is prohibitively slow and memory-intensive, hindering rapid design-space exploration for integrated circuits. Cell-level placement consistently shows that cells within the same functional module tend to group together, making module-level placement a practical abstraction for fast yet accurate design estimation. To this end, we introduce a module-level placement framework with a force-directed soft-module deformation scheme, where functional modules are modeled as deformable entities that adapt aspect ratios under area preservation. This reduces the size of the problem by several orders of magnitude, while retaining high fidelity to cell-level placement. On industrial-scale benchmarks, our approach achieves a 56x average speedup over conventional placers, incurs only about 3% inter-module HPWL error, and recovers 93% of the top-10% longest nets. These results establish a practical and scalable solution for early-stage placement estimation.
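For readers unfamiliar with the metric, half-perimeter wirelength (HPWL) of a net is the half-perimeter of the bounding box around its pins. The sketch below computes it and a relative error between two placements. How the paper defines its inter-module HPWL error is not stated in the abstract, so the relative-error formula, the function names, and the coordinates here are assumptions for illustration.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its pin (x, y) coordinates."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_hpwl(nets):
    """Sum of HPWL over all nets in a placement."""
    return sum(hpwl(pins) for pins in nets)

# Hypothetical pin locations for the same two nets in two placements.
cell_level = [[(0, 0), (4, 3)], [(1, 1), (2, 5), (6, 2)]]
module_level = [[(0, 0), (4, 3)], [(1, 1), (2, 5), (6.5, 2)]]

reference = total_hpwl(cell_level)
estimate = total_hpwl(module_level)
relative_error = abs(estimate - reference) / reference
print(reference, estimate, f"{relative_error:.1%}")
```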
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionDespite the existence of various automated PCB placement frameworks, the industry still relies heavily on human engineers. This is because these frameworks cannot perform placement according to specific user requirements and preferences, which limits their flexibility and industrial adoption. Moreover, the existing frameworks lack a modular perspective in placement, resulting in outputs that often perform poorly and fail to meet practical requirements. To address these problems, we propose ModuPlace, leveraging the powerful capabilities of LLMs to perform modular PCB placement with preference-optimized constraint graph generation. In ModuPlace, the entire component set is partitioned into modules at different granularities, enabling hierarchical and modular placement. Moreover, ModuPlace utilizes fine-tuning and preference optimization to enhance the quality of constraint graph generation and align the results with those of human experts. Experimental results demonstrate that ModuPlace outperforms all baselines, achieving superior placement quality across all metrics.
People
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionSparse Mixture-of-Experts (MoE) models offer a powerful way to increase the capacity of large language models without proportionally increasing computation. Yet, their extremely large expert weight parameters create severe memory challenges for consumer-grade GPUs and edge devices. Existing offloading and hybrid CPU–GPU approaches fall short in fully utilizing CPU resources, and their coarse-grained scheduling and poorly overlapped CPU-GPU data transfers leave CPU cores frequently idle, thus causing pipeline bubbles and exacerbating overall system inefficiency. We present MoE-Balance, a system that efficiently utilizes heterogeneous CPU-GPU resources for MoE inference. The key innovations include a fine-grained dynamic scheduler that distributes expert tasks based on real-time load, and a cross-layer prefetching and chunked transfer mechanism that overlaps computation with weight movement via PCIe. Experiments on Mixtral-8x7B show that MoE-Balance achieves 1.6x higher throughput and 37.5% lower end-to-end latency compared with state-of-the-art hybrid baselines, demonstrating strong performance gains under memory-constrained settings.
Exhibitor Forum
DescriptionThe EDA industry is once again in a state of disruption as an emerging ecosystem sweeps through the chip development flow with new Agentic AI tools promising to bring chips and IP to market faster. This ecosystem is advanced by former hardware and verification engineers, now entrepreneurs, who are creating startups to build AI-based products in the hope of solving some of the more pressing design and verification challenges.
While perceived as disruptive and competitive, the landscape featuring new Agentic AI tools is also a collaborative one where startups are working alongside established EDA players to create a new market dynamic. The possibility of significant advancements has never been greater, promising improved productivity, shorter time to tape-out, and a faster chip development process overall.
A panel of knowledgeable experts will distill the excitement surrounding the innovation happening now in chip design and verification, and the broader implications for technological advancements. They will attempt to determine if Agentic AI is truly an industry disruption or merely a means of augmenting existing technology and methodologies. They will be questioned and challenged by noted verification expert and standards guru Tom Fitzpatrick, who will encourage audience participation.
The goal is a lively discussion on the expected changes within chip design as Agentic AI tools become more mainstream, the unexpected applications, and predictions for the future. Audience members will leave with a healthy appreciation for the value of the emerging Agentic AI tools and the collaborative environment of EDA today.
Engineering Presentation
EDA
Systems
DescriptionInserted logic from EDA flows (e.g., DFT, MBIST, implementation refinement) often lacks end-to-end assurance that it is structurally verified. We present Morph-Inspect, a cross-collateral methodology that guarantees every inserted construct is found and checked by structural tools. First, we perform an HDL semantic diff between baseline and post‑insertion designs to discover additions: modules, instances, ports, signals, expressions, and subexpressions. Each insertion is labeled and logged into a queryable database with file, line range, type, parent, timestamp, and content. Second, verification collateral (lint/DFT/structural reports) is ingested and mapped to those labels. A correlation engine computes Inserted-Logic Check Coverage, highlights omissions, and produces actionable recommendations. Optional AI assistance classifies checked vs. unchecked constructs and proposes rule updates. The approach integrates with CI to enforce coverage thresholds and emit diffs when gaps occur. Morph-Inspect flags missing checks (e.g., on scan_enable ports) and validates that checks were actually run, avoiding late-cycle rework. This methodology improves design quality, accelerates debug, and provides auditable evidence that inserted structures are faithfully verified before sign-off.
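The coverage idea in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: the `Insertion` record, the label format, the sample data, and the 0.9 threshold are hypothetical and are not taken from Morph-Inspect itself. The sketch only shows one plausible way to compute a coverage ratio from labeled insertions and the set of labels that verification reports were mapped to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insertion:
    """Hypothetical record for one construct found by the semantic diff."""
    label: str       # unique label assigned to the inserted construct
    kind: str        # e.g. "port", "instance", "signal"
    file: str
    line_start: int
    line_end: int


def check_coverage(insertions, checked_labels):
    """Return (coverage fraction, insertions that no report was mapped to)."""
    checked = set(checked_labels)
    missing = [ins for ins in insertions if ins.label not in checked]
    total = len(insertions)
    coverage = 1.0 if total == 0 else (total - len(missing)) / total
    return coverage, missing


# Hypothetical sample data: four inserted constructs, three with mapped checks.
insertions = [
    Insertion("ins-001", "port", "core.v", 12, 12),
    Insertion("ins-002", "instance", "core.v", 40, 47),
    Insertion("ins-003", "signal", "core.v", 52, 52),
    Insertion("ins-004", "port", "mem.v", 8, 8),
]
checked_labels = {"ins-001", "ins-002", "ins-003"}

coverage, missing = check_coverage(insertions, checked_labels)

# A CI gate could compare the ratio against a threshold (0.9 is an assumption).
THRESHOLD = 0.9
gate_passes = coverage >= THRESHOLD
```

In this sample the unmapped `ins-004` port is reported as an omission and the gate fails, which is the kind of gap the abstract says would be surfaced before sign-off.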
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis study focuses on electromagnetic fault injection attacks.
Custom methods are needed that consistently analyse both the electromagnetic irradiation and the resulting physical phenomena within the circuit.
Therefore, this research has developed an integrated analysis method combining an electromagnetic field solver and chip PDN modelling techniques.
The electromagnetic field solver models the EM coil and the metal forming the main power delivery network (PDN) within the chip in order to analyse induced currents.
The chip PDN model is an equivalent circuit model based on RLC networks created from design information.
This method was applied and analysed using a test chip.
The results showed that the heatmap changed depending on the position of the EM coil and the differences in the power delivery paths to the PDN.
Regarding the differences in power delivery paths, the voltage drop inside the chip during the attack was measured using an on-chip voltage monitoring circuit, confirming agreement between simulation and measurements.
As this model can include logic transition information, it is expected that it can be extended to digital circuit fault analysis.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionPrivacy-preserving machine learning via Fully Homomorphic Encryption (FHE) offers strong data confidentiality guarantees but suffers from prohibitive computational overhead. In this work, we improve the performance of private inference via fully homomorphic encryption on multi‑GPU environments. We achieve this through three key innovations: (1) an optimized tensor‑level FHE library based on the CKKS scheme that fuses low‑level encrypted primitives into high‑throughput GPU kernels, (2) an adaptive memory manager that dynamically orchestrates encrypted‑tensor placement to control peak memory usage, and (3) a multi‑GPU execution engine that partitions and balances workloads across devices to maximize utilization under constrained memory budgets. We evaluate our method on representative encrypted‑ML benchmarks and compare against state‑of‑the‑art CPU- and GPU‑based FHE systems, showing comparable single-GPU results since no existing work provides a multi-GPU implementation. Our eight-GPU configuration achieves strong scaling of 7.14x over the single-GPU execution, which in turn yields a 9.7x end-to-end cumulative improvement over the best prior work. We additionally implement all fused tensor-level operators within a full ResNet-20 inference pipeline on CIFAR-10, demonstrating that the method generalizes beyond MNIST-scale workloads. Results show efficient parallel scaling, with 8-GPU execution on ResNet-20 reaching 7.1x speedup over the state of the art under realistic memory constraints.
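The scaling figures quoted above can be related with a short calculation. The sketch assumes that the 9.7x cumulative figure is the product of a single-GPU gain and the 7.14x multi-GPU scaling; the abstract does not state this decomposition explicitly, so the implied single-GPU gain is only an estimate.

```python
# Figures quoted in the abstract.
gpus = 8
strong_scaling = 7.14   # eight-GPU speedup over single-GPU execution
cumulative = 9.7        # end-to-end improvement over the best prior work

# Parallel efficiency: fraction of ideal linear speedup actually achieved.
parallel_efficiency = strong_scaling / gpus

# Assumption: cumulative = single-GPU gain * multi-GPU scaling.
implied_single_gpu_gain = cumulative / strong_scaling

print(f"parallel efficiency: {parallel_efficiency:.1%}")
print(f"implied single-GPU gain: {implied_single_gpu_gain:.2f}x")
```

Under that assumption, the eight-GPU run reaches roughly 89% of ideal linear scaling, and most of the cumulative gain comes from multi-GPU execution rather than from single-GPU improvements.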
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionWith the advent of AI and the rise of Software-Defined Vehicles (SDV), automotive designs are growing exponentially in complexity and functionality. Ensuring safety, security, and early bug detection is more critical than ever. A promising approach is combining AI with formal verification to achieve both productivity and exhaustive coverage. Recent advances in local Large Language Models (LLMs) make it feasible to generate syntactically and functionally correct assertions without exposing proprietary data to external servers, addressing privacy concerns associated with frontier models.
This work proposes a multi-agent system using local LLMs to automate formal verification setup for digital IPs. Inputs include RTL filelists, assertion specifications, and a chosen formal technique. To simplify complexity, predefined "modes" guide the master agent with context-specific prompts (e.g., symbolic-data-checker mode). Specialized agents then: (1) create a formal testbench, (2) identify relevant DUT signals, (3) generate auxiliary code for symbolic variables and assumptions, (4) connect signals to symbolic variables, and (5) produce final assertions. An optional step validates syntax and provides feedback for refinement. The proposed tool supports multiple modes, aiming to improve formal checker quality and streamline verification flows for diverse IPs in the automotive domain.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAs integrated circuit designs scale, flat synthesis becomes impractical due to excessive runtime and memory demands, making hierarchical synthesis essential. However, traditional hierarchical flows often suffer from suboptimal Quality of Results (QoR) caused by floorplan–pin placement interdependencies, leading to timing violations and routing congestion. This work introduces a multi-agent framework leveraging Large Language Models (LLMs) to address these challenges through cross-boundary optimization in floorplan-aware hierarchical synthesis. The framework employs Orchestrator, SubBlock, and Negotiation agents that collaborate via a message-passing architecture to co-optimize block placement, orientation, and interface pin planning. By integrating iterative feedback loops from synthesis reports, the system autonomously refines timing and area metrics across iterations. Proof-of-concept validation on Cadence's DDR Multi-Channel Memory Controller demonstrates significant improvements, with critical path timing shifting from a -690 ps violation to +602 ps slack, enabling early QoR predictability in the design cycle. This approach aims to shift timing closure left, before PD handoff, ensuring clearer routing behavior and reducing integration risks early in the design cycle.
Engineering Presentation
AI
Design
EDA
DescriptionSecure Multi-Agent Generative AI Pipeline for Context-Aware Design Verification
The escalating complexity of modern System-on-Chip (SoC) designs has relegated approximately 70% of the total design cycle to functional verification. This critical phase is currently plagued by knowledge fragmentation, where engineers lose significant time manually parsing massive technical specifications (JEDEC, LPDDR, HBM), and the repetitive overhead of writing boilerplate UVM code and assertions. While Large Language Models (LLMs) offer a potential solution, public AI services present an unacceptable security risk for the leakage of proprietary RTL and confidential company data.
This presentation puts forth a novel, on-premise agentic AI pipeline designed to automate the verification lifecycle within a secure local environment. Our architecture utilizes a "Team of Experts" approach orchestrated by a central Classifier Agent that routes complex queries to specialized sub-agents for code generation, log parsing, and spec retrieval. By leveraging Retrieval-Augmented Generation (RAG) and decoupled fine-tuning, the system provides high-fidelity, context-aware assistance tailored to internal design methodologies without risking external data exposure. This modular framework significantly enhances engineering productivity, reduces time-to-market (TTM), and ensures the security of sensitive silicon IP.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe exponential growth of neural network architectures targeting human-level intelligence has driven the development of hyperscale, transformer-based AI accelerators, with training requirements reaching up to 8 × 10^23 floating-point operations (FLOPs). This has introduced extreme computational demands, resulting in significant simultaneous switching currents (SSC) and severe degradation of power integrity (PI). Traditional vector-based simulations, despite requiring millions of vectors, leave coverage for the worst-case scenario for Power Integrity (PI) highly uncertain and incur prohibitive runtimes. Existing vectorless approaches offer limited design coverage and lack alignment with local and global power budgets.
This work proposes a multi-cycle, non-propagation-based vectorless methodology that leverages user-defined instance-level power constraints to model switching scenarios across cycles. By incorporating timing windows from static timing analysis and enforcing regional power compliance, the approach achieves granular coverage without unnecessary pessimism. The methodology ensures >90% switching coverage across all instance categories within feasible simulation durations, enabling accurate PDN signoff and identification of grid weaknesses.
The proposed solution is highly scalable for flat, multi-billion SoC designs and addresses reliability concerns without extensive vector generation, making it a practical alternative for modern hyperscale AI systems.
Engineering Presentation
EDA
Security
DescriptionThe IBM Small to Large (S2L) methodology is a flow that transitions flat small block physical design models to abutted large block models. One of the biggest challenges in the S2L flow has been latch tree optimization. Latch trees are fully described in the input logic, so design changes often required resource-intensive manual work to update the latch trees in the input logic and re-optimize for timing and routing. Schedule constraints often inhibited designers' ability to perform or maintain this work, resulting in inefficient implementations. Multi Cycle Interconnect Synthesis (MCIS), which generates multi cycle latch trees, is integrated into the S2L flow to automate latch tree optimization and quickly adapt to logic or physical design changes. Parallel processing using a cost function that considers timing, power, and routing impact is implemented to ensure the best construction for each latch tree. Enhancements to MCIS are made to meet the requirements of the S2L flow, including built-in awareness of the large block hierarchical boundaries. Compared to the custom-built solution, this method achieves better timing and improved routability while using fewer cloned latches.
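The selection step described above, a cost function over timing, power, and routing evaluated in parallel per latch tree, might look roughly like the sketch below. The weights, the candidate records, and the normalised penalty scale (lower is better) are all hypothetical; the abstract does not give the actual MCIS cost function.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical weights; each metric is a normalised penalty (lower is better).
WEIGHTS = {"timing": 0.5, "power": 0.2, "routing": 0.3}


def cost(candidate):
    """Weighted sum of the three penalties for one candidate construction."""
    return sum(WEIGHTS[metric] * candidate[metric] for metric in WEIGHTS)


def best_construction(candidates):
    """Pick the lowest-cost candidate for a single latch tree."""
    return min(candidates, key=cost)


# Hypothetical candidate constructions for two latch trees.
latch_trees = {
    "tree_a": [
        {"name": "a0", "timing": 0.9, "power": 0.2, "routing": 0.4},
        {"name": "a1", "timing": 0.3, "power": 0.5, "routing": 0.5},
    ],
    "tree_b": [
        {"name": "b0", "timing": 0.2, "power": 0.9, "routing": 0.1},
        {"name": "b1", "timing": 0.4, "power": 0.1, "routing": 0.6},
    ],
}

# Evaluate each latch tree independently, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    winners = pool.map(best_construction, latch_trees.values())
    chosen = dict(zip(latch_trees, winners))
```

Because each tree is scored independently, the per-tree selection parallelises trivially; a thread pool is used here only to keep the sketch short.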
Engineering Presentation
AI
Design
EDA
DescriptionTraditional FSDB (Fast Signal Database) capture forces verification teams into an impossible choice: capture everything and face unsustainable storage costs, or capture selectively and risk missing critical debug data. We present a three-tier intelligent framework with live dashboard that eliminates this tradeoff, achieving >90% storage reduction while improving debug coverage through specification-aware adaptive intelligence.
Tier 1: Global Policy Intelligence - A centralized database enables architects to define capture strategies for thousands of tests through pattern-based rules, achieving 10× policy reuse vs. per-test hardcoding. This layer ensures critical baseline tests always capture while intelligently filtering stable tests.
Tier 2: Specification-Aware Runtime Adaptation - Compares actual performance against expected specifications in real-time. Bandwidth degradation, credit starvation, and protocol violations trigger targeted captures only when sustained deviations occur, reducing false triggers by >75% compared to traditional approaches.
Tier 3: Regression-Wide Cross-Test Learning - Enables fast tests to inform still-running tests about systemic issues, with hints containing precise timing windows for surgical FSDB chunks rather than full dumps. Engineers can manually override policies during live regressions or use automated steering for pending tests.
Demonstrated results include 92.5% storage reduction, 50× faster time-to-root-cause through cross-test coordination, and 10× policy reuse—all achieved with zero simulation overhead, enabling verification teams to break free from the traditional storage-debug tradeoff.
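The Tier 2 behaviour, capturing only when a deviation from the specification is sustained, can be sketched as follows. The function name, the 20% tolerance, the three-sample minimum, and the bandwidth trace are assumptions chosen for illustration, not details of the presented framework.

```python
def sustained_deviation_windows(samples, expected, tolerance, min_consecutive):
    """Return (start, end) index windows where the measured value stays below
    expected * (1 - tolerance) for at least min_consecutive samples."""
    threshold = expected * (1.0 - tolerance)
    windows = []
    run_start = None
    for i, value in enumerate(samples):
        deviating = value < threshold
        if deviating and run_start is None:
            run_start = i                      # a deviation run begins
        elif not deviating and run_start is not None:
            if i - run_start >= min_consecutive:
                windows.append((run_start, i - 1))
            run_start = None                   # run ended, sustained or not
    # Close a run that is still open at the end of the trace.
    if run_start is not None and len(samples) - run_start >= min_consecutive:
        windows.append((run_start, len(samples) - 1))
    return windows


# Hypothetical bandwidth trace (percent of the specified bandwidth).
bandwidth = [100, 98, 70, 99, 60, 55, 58, 62, 97, 100]
windows = sustained_deviation_windows(
    bandwidth, expected=100, tolerance=0.2, min_consecutive=3
)
```

The single dip at index 2 is ignored as a transient, while the four-sample dip produces one capture window. That is the sense in which requiring sustained deviations reduces false triggers.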
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionGoogle's Connect Disconnect Block (GCDB) manages interface connectivity for power management and clock gating. It orchestrates power state transitions (HWACG/HWAPG) and interface fencing to prevent data loss or coherency issues. Verification of GCDB must eliminate deadlocks, live-locks, and FSM hangs while ensuring protocol compliance, clock-gate integrity, correct FSM transitions and stability during Q-Stop clock randomization.
As the GCDB manages handshakes between physically separated blocks across asynchronous clock domains, signals experience non-deterministic physical routing latencies. These delays, combined with asynchronous clocks, can cause signals to arrive out of order, potentially breaking the design behaviors or causing system hangs. We must account for these path delays to ensure the handshake mechanisms don't fail, hang or cause data corruption.
This paper presents a multi-tiered verification strategy using Arch-Formal for high-level architecture verification and Formal Property Verification (FPV) for RTL analysis. To address physical realities, we enhanced FPV with custom FV modules to mimic signal skew and CDC uncertainties. In Arch-Formal, we modeled GCDB with unbounded path delays. Integrating these randomized arrival times proved instrumental in exposing elusive race conditions and deadlocks within both the Arch and design, ensuring robust operation against PVT variations and physical implementation constraints.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionGroup-wise quantization is an effective strategy for low-bit inference, where a shared scale is assigned to each smaller group of tensor values. However, existing formats struggle to capture the diverse value distributions found across modern workloads. In this work, we introduce MXP, a posit-inspired microscaling format that incorporates tapered precision to dynamically adjust exponent and fraction bits at the element level. MXP achieves an optimal balance between precision and dynamic range, thereby reducing quantization errors. We further develop a hardware accelerator tailored to MXP. Experimental results demonstrate that MXP incurs less performance degradation, while reducing area and power consumption.
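For readers unfamiliar with the baseline idea, the sketch below shows plain group-wise quantization with one shared scale per group. It is a generic symmetric integer scheme, not the MXP format: the 4-bit width, the group size, and the sample values are assumptions, and the tapered-precision element encoding described in the abstract is not modelled.

```python
def quantize_groupwise(values, group_size, bits=4):
    """Symmetric integer quantization with one shared scale per group."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed codes
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in group]
        groups.append((scale, codes))
    return groups


def dequantize_groupwise(groups):
    """Reconstruct approximate values from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]


# Hypothetical tensor: four small values followed by four large ones.
values = [0.01, -0.02, 0.03, 0.04, 5.0, -7.0, 3.5, 1.0]

per_group = quantize_groupwise(values, group_size=4)
per_tensor = quantize_groupwise(values, group_size=8)

restored = dequantize_groupwise(per_group)
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

With a single scale for the whole tensor, the four small values all collapse to code 0, whereas per-group scales keep them distinguishable. Formats such as the one proposed here target what remains, namely the spread of values inside each group.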
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionGenomic analysis workflows, such as single-cell RNA sequencing (scRNA-seq), demand massive sequence matching and classifications, yet are fundamentally limited by the data movement overhead and bandwidth bottleneck of von Neumann architectures. Although in-memory computing (IMC) has emerged as a promising solution, most existing IMC-based genome accelerators rely on digital encoding schemes, which introduce excessive hardware overhead and limit processing efficiency. This work presents a Multi-level Memristor-based Self-adaptive CAM (content addressable memory) Architecture (M²CAM) for genome processing acceleration. Unlike conventional digital approaches, the proposed design leverages analog CAM encoding based on multi-level memristor conductance states, enabling compact representation of nucleotides with significantly reduced device count and improved parallelism. Furthermore, a self-adaptive error correction mechanism dynamically adjusts matching precision through hierarchical operation modes, ensuring robust sequence matching under device variations. A parallelized genome classification framework is developed to demonstrate the system's efficiency, using single-cell RNA sequencing as a representative application. Experimental results show that the proposed architecture achieves a 50%~75% reduction in device usage, while on real scRNA-seq datasets, M²CAM delivers 131.2×~3088.9× and 161.2×~269.6× improvements in energy consumption and latency, compared with CPU/GPU and traditional bioinformatics tools (e.g., STAR).
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionAccurate and efficient lithography simulation is critical for virtual fabrication and shift-left manufacturing.
While thin-mask models and single-plane aerial images suffice at older nodes, advanced nodes require 3D in-resist aerial images to capture depth-dependent effects from mask thickness, sidewalls, and multilayer stacks.
Rigorous EM solvers provide this fidelity but are too slow for ILT or full-chip use, and thin-mask learning methods fail to model the underlying vector field needed for 3D propagation.
This paper introduces a new paradigm that reconstructs the full 3D in-resist vector field from the image-plane vector field and 3D mask/film-stack parameters.
Based on this insight, we develop N3Litho, a GPU-native numerical-neural framework that couples a GPU-accelerated vectorial Abbe solver with an NGP-based multi-resolution hash backbone for fast 3D field reconstruction.
The method achieves near-EM fidelity with orders-of-magnitude speedup, and the resulting 3D aerial images can be directly used by resist models, OPC/ILT pipelines, and hotspot detection flows.
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionDriven by the dominance of AI workloads and the resulting pressure on in-core SRAM, shared memory serves as a critical intermediate buffer in GPUs. We observe substantial data redundancy, with up to 60.2% of subsequent memory requests reaccessing data already loaded into shared memory. This redundancy remains largely underexplored, increasing bandwidth demands. We propose Nash, a lightweight mechanism that tracks data provenance via a Source Location Table (SLT) to detect and redirect redundant requests within or across neighboring SMs. Cycle-accurate simulation shows Nash achieves up to 18.3% performance improvement, 9.8% energy savings, and 25.3% reduction in global memory traffic.
People
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionMixture-of-Experts (MoE) models have emerged as the state-of-the-art paradigm for scaling up large language models (LLMs) without a proportional increase in computational cost. However, on-device deployment of MoE models still faces a critical challenge due to the large memory requirement for storing all expert parameters. In this work, we propose NASiC, a 3D NAND-based CAM-selected multibit CIM architecture developed through algorithm-hardware co-optimization and tailored to the high-density storage and sparse computation requirements of MoE models. The V-ASIC architecture achieves improved throughput and high area and energy efficiency, indicating its great potential for on-device MoE inference.
People
Research Special Session
Systems
DescriptionAs AI scales, both industry and academia have long pursued near-data processing as a path to major performance gains. Today, performance is no longer the only challenge—energy is the real bottleneck. Data movement now consumes orders of magnitude more energy than computation itself, making traditional architectures increasingly unsustainable for next-generation AI data centers as well as battery-constrained edge devices. In this new energy-limited computing era, near-data processing has moved from an intriguing idea to a practical necessity. Approaches span in-sensor, in-memory, in-storage, and in-network processing, depending on where data is created and resides. Among them, near-memory processing is rapidly becoming reality. This talk explores the shift toward tightly integrated compute-memory systems, including 3D architectures that dramatically cut data movement and energy use. I expect we will see more domain-specific architectures and heterogeneous systems built around that idea. It's really a move toward a memory-centric design for AI infrastructure.
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionQuantum Error Correction (QEC) demands efficient implementation of syndrome extraction circuits. However, existing compilers for neutral-atom processors largely miss the opportunity to co-optimize these circuits by exploiting both the structural properties of Quantum Error-Correcting Codes (QECCs) and the physical constraints of neutral-atom architectures. In this work, we introduce NEAT, an SMT-based compiler that jointly optimizes qubit mapping and syndrome extraction scheduling for a broad class of stabilizer-based QECCs, achieving depth-optimal execution with minimal shuttling overhead on neutral-atom platforms. Across a wide range of QECCs, NEAT consistently achieves near-optimal circuit depth and reduces atom movement by 3×–30× compared to the baseline compiler Enola. Logical-level simulations further demonstrate 2×–20× lower logical error rates under realistic hardware noise. A hierarchical symmetry-breaking formulation and relaxed parallel-motion constraints substantially improve solver scalability, yielding up to 100× speedup in compilation time. Together, these results show that NEAT produces depth-optimal, movement-efficient, and logically robust syndrome extraction schedules, while scaling effectively to large QECCs on neutral-atom hardware.
People
Additional Meeting
Need Pre-RTL Power? Have More Corners than You Can Afford? Test Drive an IEEE 2416 Power Calculator!
10:00am - 12:00pm PDT Monday, July 27 Mtg Room 104C
DescriptionJuly 27th, 10:00 AM – 11:45 AM
Running PVT corner simulations today means higher cost, effort, and time; IEEE 2416 changes that by modelling pre-RTL/system-level and gate-level power. You can build PVT-independent power models once and use them to quickly analyze power across design phases, abstraction levels, and operating conditions.
Si2 has been working in tandem with IEEE 2416 for many years, proving out use cases with proof-of-concept code and demos.
With the approval of IEEE 2416-2025, Si2 is proud to introduce the next generation of UPM PowerCalc, a new reference implementation for the standard.
This new power calculator has been written from the ground up with a focus on portability, modularity, and brevity. Using mature, MIT-licensed, open-source libraries has allowed the codebase to focus on the implementation of IEEE 2416 for power analysis.
In this workshop you will be introduced to IEEE 2416 and get hands-on with UPM PowerCalc in a demo environment. You will generate system-level power information for a provided sample design and be given several tasks of varying difficulty to help you understand and employ the IEEE power modelling standard.
What you will learn:
• Highlights of IEEE 2416 and industry use cases
• How to analyze power from architecture through implementation
• Ways to calculate system-level power on real examples at multiple PVTs
Designed for architects planning systems, designers analyzing power, and everyone in between.
Be one of the first to see the standard in action!
Watch for Si2 DAC updates!
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionGraph neural networks (GNNs) are widely used in hardware security but vulnerable to adversarial netlist rewrites. Existing adversarial approaches incur high overheads. We present NetDeTox, an automated framework combining large language models (LLMs) with reinforcement learning (RL) for efficient rewriting. The RL agent identifies GNN-critical components while the LLM devises functionality-preserving rewrites that diversify motifs. Iterative feedback minimizes overheads. Compared to state-of-the-art AttackGNN, NetDeTox degrades all security schemes with fewer rewrites and substantially lower area overheads (54.50%, 25.44%, and 41.04% reductions for GNN-RE, GNN4IP, and OMLA). For larger circuits, NetDeTox even reduces original area, demonstrating practical scalability.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionTransformer-based large language models are increasingly constrained by data movement as communication bandwidth drops sharply beyond the chip boundary. Wafer-scale integration using wafer-on-wafer hybrid bonding alleviates this limitation by providing ultra-high bandwidth between reticles on bonded wafers. In this paper, we investigate how the physical placement of reticles on wafers influences the achievable network topology and the resulting communication performance. Starting from a 2D mesh-like baseline, we propose four reticle placements (Aligned, Interleaved, Rotated, and Contoured) that improve throughput by up to 250%, reduce latency by up to 36%, and decrease energy per transmitted byte by up to 38%.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionLarge neural surrogate models accelerate scientific simulation but still face scalability limits from model size and memory.
We present FP-DDM (Foundation-model-based Physics-guided adaptation for the Domain Decomposition Method), a scalable framework for large-scale physics analysis. FP-DDM divides the domain into overlapping subdomains and enforces interface consistency by explicitly optimizing boundary values on overlaps via automatic differentiation, enabling fast alignment. Specifically, we propose a foundation architecture that effectively learns multi-physics and transfers to new domains.
The combination of foundation-model adaptation and physics-guided test-time adaptation enables generalization to new physics and unseen domains without labels. On thermal and stress problems, FP-DDM achieves numerical-solver-level accuracy, scales to 10-billion-cell domains, and supports layout design optimization, establishing its potential as a multi-physics platform for next-generation System–Technology Co-Optimization.
People
Late Breaking Results
DescriptionModern IC globalization and automated synthesis often produce flattened netlists that obscure design intent, complicating hardware security analysis and reverse engineering. Traditional decompilation methods based on exact subgraph isomorphism or fixed libraries are computationally expensive and struggle to detect unseen logic patterns. We propose a self-supervised neural subgraph matching framework that decomposes netlists into k-hop networks and embeds them into a latent order space to effectively identify repeated arithmetic primitives. Experiments on adders and multipliers show that the method generalizes from small training circuits to larger designs, enabling scalable recovery of high-level behavior from gate-level netlists.
Work in Progress
DescriptionThe traveling salesman problem (TSP) is a fundamental combinatorial optimization problem with significant importance in various commercial and industrial domains. Numerous approaches have been proposed to address the TSP, ranging from advanced algorithmic heuristics to specialized hardware architectures, but it remains challenging to implement, especially for large-scale problems, due to the O(N^4) synaptic weight requirement and device-level constraints. In this paper, we present a scalable digital neuromorphic Ising solver that integrates 8,192 signed spiking neurons and an interleaving simulated annealing strategy on an AMD FPGA device (XC7K160T). Virtual synaptic routing, implemented through address-event-based communication, eliminates the need for fixed physical connectivity. The interleaving annealing reduces DRAM-access latency, achieving a 15% reduction in per-iteration execution time, and reduces synaptic weight storage from O(N^4) to O(N) by storing only distance-dependent coupling terms. The single-core operation provides feasible solutions for TSPLIB instances with fewer than 32 cities, while the multi-core operation parallelizes the solution of larger TSPs by partitioning the problem into hierarchical subgraphs and merging the subgraph solutions using a constraint TSP (CTSP) algorithm. With minimal per-core FPGA resource usage and a scalable architecture using virtual synaptic routing, the neuromorphic Ising solver can be expanded to hundreds of neuromorphic Ising cores.
Research Panel
AI
DescriptionAI is rapidly reshaping electronic design automation, from architecture exploration and RTL generation to verification, physical design, and manufacturing optimization. But while the chip industry owns the design data and EDA vendors control the toolchains, academia has traditionally disrupted through startups.
Recent shifts, including the slowing of traditional scaling laws, the rise of agentic and non-gradient-based methods, open-source tooling, and autonomous design flows, raise fundamental questions: Who will ultimately drive breakthroughs in AI-led design?
This panel brings together researchers from startups, EDA, and academia to debate what it would take to build an ecosystem that can fully realize AI's promise in chip design. Will academia lead the way once again despite the barrier of data and compute? What research ideas will lead to breakthroughs or disruption – is the problem just data and compute or the need for new foundational models? What are current research ideas across startups, academia and EDA companies that they think will succeed and why?
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPower density and total power are higher in 3DIC designs than in monolithic system-on-chip (SoC) designs, making thermal integrity a greater concern. Given the increased risk of thermal issues, thermal analysis must start from the early stages of a design and be progressively refined as the design matures. It is desirable to embed the thermal analysis in the same tool/platform that 3DIC designers use and have an option to import the same thermal design into a CFD tool, which system thermal experts are familiar with. This facilitates efficient collaboration between electronics designers and thermal experts. The traditional thermal analysis workflow starts only after the chip design is completed and is therefore not suitable for 3DIC design needs. This paper proposes a new thermal analysis workflow for 3DIC designs that leverages Siemens IC-Packaging solutions and demonstrates it on several representative Intel designs and process technologies.
Exhibitor Forum
AI
EDA
Systems
DescriptionPartcl is building the next generation of electronic design automation by bringing GPU acceleration to core semiconductor workflows. Traditional EDA tools for static timing analysis, placement, and logic optimization were built for CPU-bound workflows and often take hours to complete critical iterations. Partcl rethinks this infrastructure from the ground up, enabling these same analyses to run in seconds instead of hours. This allows hardware teams to explore significantly more design iterations, improve power, performance, and area (PPA), and reduce overall tapeout risk and schedule pressure. Backed by Khosla Ventures, Partcl helps semiconductor companies move faster by turning traditionally slow, sequential design loops into high-throughput optimization systems.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionChiplet-based 3D disaggregation is the future of chip design and brings many benefits, such as better PPA (power, performance, area) and cost. The optimal chiplet stack, however, can be difficult to determine given the large number of design parameters to consider. Common questions include which process technology to use for each chiplet and how to partition the design across chiplets. In this work, we demonstrate an AI-driven system-technology co-optimization (STCO) platform that leverages IEEE P3537 3Dblox to define 3D chiplet stacks based on the available process technologies. This methodology automatically experiments with hundreds of different combinations and permutations of process technologies, die sizes, and aspect ratios for each chiplet, as well as optimizing for heterogeneous 3D design partitioning PPA in each 3DIC stack to obtain the best timing, power, IR, thermal, and other key metrics. The methodology uses user-defined relative wafer costs between process technologies to optimize for total die costs of each 3DIC stack instead of simple area metrics. With this AI-driven methodology, we automated the optimization of 3DIC process technology selection, 3D design partitioning with placement and bump assignment, and 3D design PPA and cost exploration in one single platform.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSemiconductor design at advanced process nodes is facing major hurdles, particularly the increased risk of IR-drop and electromigration (EM) failures. These issues are largely due to higher current densities and thinner BEOL stacks. Furthermore, complex modern lithography and the late identification of design flaws complicate the creation of reliable power delivery networks (PDN), frequently resulting in lengthy ECO cycles and delayed schedules.
To combat these PDN issues, the Calibre Design Enhancer Via Flow was developed as an automated, signoff-quality solution. This tool utilizes a technology-agnostic strategy supporting DEF and OASIS formats, enabling seamless integration across different process nodes and design stages. By incorporating signoff-level rule awareness, it ensures that via placements are DRC-clean and correct-by-construction.
The flow improves vertical connectivity and mitigates EM hotspots by strategically placing redundant vias at missed locations and critical junctions. This reduces IR-drop and increases signoff confidence. This scalable methodology, designed for sub-3nm processes, enhances PDN durability and removes the need for manual iterations.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionHigh-accuracy qubit-state readout is critical for quantum computation. While existing hardware-based qubit-state discriminators assume stable readout statistics across measurement shots, realistic long-running tasks like chemistry simulation suffer from slow noise fluctuations and drift that severely degrade fidelity over time. We present NIAQ, an adaptive software–hardware co-design combining Kalman filtering for drift tracking with MLP discriminators for robustness against noise, together with degradation rate (%/hour) as a novel metric to quantify long-term stability. NIAQ achieves 88% qubit-state-readout accuracy on a multi-qubit dataset equivalent to a 1-day runtime, compared to 61% for the baseline qubit-state discriminator, and yields a 19× lower degradation rate. Optimized for low-power real-time operation on ASIC, post-synthesis results in 28-nm CMOS show 148 mW power and 5 ns delay, achieving 7.7× power and 6.4× latency improvements compared to the state-of-the-art.
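As a rough illustration of the drift-tracking idea only (not NIAQ's actual design), a scalar Kalman filter can follow a slowly moving readout mean under a random-walk model. The function name `kalman_track`, the noise variances, and the synthetic drifting data below are all assumptions chosen for illustration.

```python
import random

def kalman_track(measurements, process_var=1e-4, meas_var=0.04, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model (illustrative only)."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + process_var      # predict: drift adds uncertainty to the state
        k = p / (p + meas_var)   # Kalman gain
        x = x + k * (z - x)      # update the estimate with the new measurement
        p = (1.0 - k) * p        # shrink the uncertainty after the update
        estimates.append(x)
    return estimates

# Synthetic example (assumed data): a readout mean drifting linearly,
# observed through Gaussian shot-to-shot noise.
rng = random.Random(0)
true_mean = [0.001 * t for t in range(2000)]
shots = [m + rng.gauss(0.0, 0.2) for m in true_mean]
est = kalman_track(shots)
```

A discriminator that re-centers its decision threshold on `est` rather than on a fixed calibration value would, in this toy setting, keep following the drift instead of degrading with it.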
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern computing systems are processor-centric, yet most hardware resources are devoted to storage and movement, which makes data transfer a dominant performance bottleneck.
Recently, processing-near-memory (PnM) architectures have emerged in both commercial products and research prototypes.
However, practical PnM adoption is bottlenecked by the lack of a deployable system-to-memory interface that minimizes host involvement while managing offloading and memory access.
To overcome processor-centric adoption barriers, we propose NM-IF, a deployable near-memory interface that enables non-intrusive offloading without requiring system redesign.
NM-IF adopts a data-centric classification of near-memory traffic to streamline the management of read-mostly resident parameters, streaming activations, and on-demand output writebacks with data-aware mapping.
Additionally, it executes parameters and activations in a deterministic phase schedule to enable parallelism, while processing outputs in an event-driven manner to avoid unnecessary periodic scheduling.
Experiments show that NM-IF eliminates the 3×/7× latency growth of 8-bit/4-bit unpacking while maintaining a constant per-fetch mapping cost. It enables effective amortization of the fixed datapath cost for 64-bit system-bus fetches packing beyond 21 elements, improving per-element efficiency.
NM-IF further overlaps parameter/activation handling (22 cycles) with a 1-cycle output writeback, resulting in 23 cycles total.
Overall, NM-IF offers a practical near-memory interface that enhances both performance and efficiency at runtime.
Engineering Presentation
Design
EDA
Systems
DescriptionPCIe Gen6/7 introduces major architectural enhancements to meet the bandwidth and latency demands of AI and datacentre workloads. Bandwidth doubles from 32 GT/s to 64 GT/s and then to 128 GT/s, supported by PAM4 signalling, flit‑based transfers, and multiple virtual channels (VCs). These capabilities also significantly increase the complexity of credit management, which governs buffer availability between transmitter and receiver. From Gen6 onwards, each VC maintains its own dedicated credits while additionally drawing from a dynamic pool of shared credits. Features such as infinite credits and merged‑credit modes further expand the state space, making correct credit handling essential for avoiding pipeline deadlock, VC starvation, and silent throughput loss.
Traditional simulation‑driven approaches are unable to exhaustively explore the combinatorial explosion of credit configurations, buffer‑availability permutations, and multi‑TLP corner cases. To address this gap, we propose a quiescence‑based formal verification methodology aimed at proving forward progress for all in‑flight TLPs whenever sufficient receiver credits are available. The method isolates the design by suppressing external stimuli and driving the system toward a stable quiesced state. It uses symbolic TLPs, precise modelling of PCIe credit semantics, and quiescence‑driven stabilization proofs to ensure that forward progress is guaranteed under all legal credit scenarios.
Applying this methodology uncovered a subtle but critical bug in credit revaluation logic—an issue deeply buried in mode‑transition sequences and effectively unreachable through conventional verification. Our results show that quiesced formal verification provides exhaustive coverage of credit modes, exposes otherwise‑inaccessible corner‑case failures, and delivers high‑certainty correctness for next‑generation PCIe interconnect designs.
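The dedicated-plus-shared credit idea described above can be sketched with a deliberately simplified toy model. This is an assumption-laden illustration, not the PCIe specification's credit rules and not the presenters' formal model: the class name `CreditPool`, the "dedicated first, then shared" consumption order, and the single undifferentiated credit type are all hypothetical.

```python
class CreditPool:
    """Toy model: per-VC dedicated credits plus one shared pool (not spec-accurate)."""

    def __init__(self, dedicated, shared):
        self.dedicated = dict(dedicated)  # maps VC id -> dedicated credits
        self.shared = shared              # credits any VC may draw from

    def can_send(self, vc, cost):
        # A TLP may go out if dedicated plus shared credits cover its cost.
        return self.dedicated[vc] + self.shared >= cost

    def send(self, vc, cost):
        # Consume dedicated credits first, then fall back to the shared pool.
        if not self.can_send(vc, cost):
            return False
        from_dedicated = min(self.dedicated[vc], cost)
        self.dedicated[vc] -= from_dedicated
        self.shared -= cost - from_dedicated
        return True

pool = CreditPool({0: 2, 1: 1}, shared=3)
first = pool.send(0, 4)   # uses 2 dedicated + 2 shared credits
second = pool.send(1, 3)  # blocked: only 1 dedicated + 1 shared remain
third = pool.send(1, 2)   # fits exactly into the remaining credits
```

Even in this toy form, VC 0 drawing on the shared pool is what blocks VC 1's larger packet, which hints at why starvation and forward-progress properties need exhaustive analysis once multiple credit modes interact.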
Engineering Presentation
EDA
Security
DescriptionModern SoCs incorporate multiple clock domains, making Clock Domain Crossing (CDC) verification critical for functional correctness. Metastability, an inherent phenomenon in asynchronous signal transfers, cannot be eliminated by synchronizers—only mitigated. Traditional RTL simulations assume ideal synchronizers, creating blind spots that overlook metastability-induced uncertainties, which can lead to silicon failures despite clean verification results. This paper introduces an RTL modeling framework that treats metastability as a first-class verification concern. The proposed approach replaces conventional synchronizers with metastability-aware models that selectively inject non-deterministic behaviors, such as cycle slips and missed captures, based on realistic timing conditions. Key parameters include clock skew threshold (ΔTc) and probabilistic event control via RNG mode, enabling stress testing across diverse CDC architectures. Evidence from FIFO-based communication demonstrates metastability-induced stalls and throughput drops, validating the model's effectiveness. The framework is scalable, non-intrusive, and applicable to control signals, status flags, handshake protocols, and reset release scenarios. By exposing metastability hazards early in RTL, this methodology improves design robustness, prevents reconvergence issues, and enhances CDC coverage. Future work includes extending support for multi-bit buses and integrating with formal verification for exhaustive metastability analysis.
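The idea of a metastability-aware synchronizer model can be sketched behaviourally in a few lines. This is a hypothetical illustration, not the framework from the presentation: the function name `sync_visible_cycle`, the two-flop latency, the 50% slip probability, and the use of `delta_tc` as a window before the capturing edge are all assumptions.

```python
import random

def sync_visible_cycle(t_change, clk_period, delta_tc, rng):
    """Destination-clock cycle at which a source toggle becomes visible
    after a two-flop synchronizer (behavioural toy model)."""
    edge = int(t_change // clk_period) + 1  # first capture edge strictly after the change
    gap = edge * clk_period - t_change      # time between the change and that edge
    if gap < delta_tc and rng.random() < 0.5:
        edge += 1                           # metastable capture modelled as a cycle slip
    return edge + 1                         # second flop adds one more cycle

rng = random.Random(1)
# A change far from the capture edge is always seen at the ideal cycle.
safe = sync_visible_cycle(0.5, 1.0, 0.1, rng)
# A change inside the window may be seen one cycle late on some trials.
risky = {sync_visible_cycle(0.95, 1.0, 0.1, rng) for _ in range(200)}
```

Feeding the randomized arrival cycle into downstream logic, instead of an ideal fixed latency, is one way a simulation could expose stalls or reconvergence problems that an ideal synchronizer model hides.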
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionAnalog in-Pixel computing (IPC) enables energy-efficient edge AI but faces critical noise challenges limiting computing accuracy. Existing design flows lack comprehensive device-aware noise modeling and cross-layer optimization. We introduce NoiseXplore, the first end-to-end design-technology co-optimization framework integrating: (1) physics-grounded noise models (thermal, shot, RTN, device mismatch); (2) device-aware SNR estimation and neural network co-simulation; (3) circuit-level performance modeling integrated into NeuroSim; and (4) automated Bayesian DTCO for multidimensional design space exploration. Demonstrated on noise-sensitive 1T FDSOI-based IPC, optimized configurations achieve 50 fps with 86.8% CIFAR-10 classification accuracy on a three-class subset (airplane/car/ship).
Exhibitor Forum
DescriptionAs large language models become increasingly powerful, they gain immense potential for complex, formal tasks in semiconductor design. But to unlock these capabilities, the models must be provided with the right theories and formalizations, giving them structured information to reason over rather than generate against. General-purpose LLMs, which optimize for human preference rather than formal correctness, inherently lack this foundational formal backbone. The result is outputs that score well on fluency but do not reliably compile, simulate, or maintain consistency across verification artifacts.
Larger models raise the floor on simple tasks, but the gap between statistically probable and provably correct widens as design complexity increases. Parameter count, context window, and training data do not address the absence of formal structure connecting specification to generated output.
We discuss how constructing machine-readable intent from multimodal design specifications including text, timing diagrams, state charts, and register maps along with auto-formalization gives AI the structured foundation that it needs. It transforms informal design intent into verifiable structure, enabling classical algorithms and reinforcement learning to guide generation toward provably correct outputs, drawing on the same principle behind DeepMind's AlphaProof.
We also look at where this intersection is heading: what it means to train models on formal modeling languages rather than natural language, how reinforcement learning on hardware description languages produces different capabilities than fine-tuning on code, and why the path to reliable AI for chip design runs through mathematical structure rather than scale.
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionLogic locking protects hardware designs across the global semiconductor supply-chain. However, recent machine learning (ML)-based attacks, especially graph learning-based ones, have undermined its security by learning circuit structures and gate compositions to reveal locked connections/gates. Although recent locking methods use explainability tools and adversarial perturbations, they remain vulnerable to ML attacks under robust training. Therefore, there is still a need for truly learning-resilient locking solutions.
We propose IsoLock, a locking scheme that performs interconnect obfuscation to secure the circuit's structure and functionality. IsoLock models the gate-level netlist as a graph and introduces MUXes for locking (through rewiring selected interconnects), generating isomorphic structures. These structures are indistinguishable to ML models and other structural attacks, effectively blocking all learnable/predictable features and preventing key deciphering.
For evaluation, first, we prove IsoLock's security against modeling attacks. Second, we lock standard ISCAS-85 and ITC-99 benchmarks and evaluate them against state-of-the-art ML attacks (SCOPE and MuxLink) and structural attacks (Redundancy and SAAM), all of which can decipher only 0–6% of the correct key on average, confirming IsoLock's resilience also in practice.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionLLMs have shown early promise in generating RTL code, yet evaluating their capabilities in realistic setups remains a challenge. So far, RTL benchmarks have been limited in scale, skewed toward trivial designs, offering minimal verification rigor, and vulnerable to data contamination. In this paper, we introduce NotSoTiny, a benchmark that assesses LLMs on structurally rich and context-aware RTL generation tasks, while being resilient to contamination. Built from hundreds of representative hardware designs produced by the TinyTapeout community, our automated pipeline removes duplicates, verifies correctness using simulation and formal equivalence checking, and continuously incorporates new designs to mitigate leakage. Evaluation results show that these tasks are significantly more challenging than prior benchmarks, emphasizing NotSoTiny's effectiveness in revealing the current limitations of LLMs applied to hardware design and in guiding the refinement of this promising technology.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAt the SoC level, customers require comprehensive electrical views of the memory IP we design to support power and reliability sign-off. Generating APL models requires a waveform database (FSDB) from SPICE simulation of the IP. Because the IP is very large, fully extracted macro simulations are not practical. We therefore use a hierarchical approach based on block-level simulations, without losing accuracy, and are able to generate the model in reasonable runtime. We compared the results of the hierarchical approach against those of a flat run for the smallest IP, and the results are comparable.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern microprocessors grow increasingly complex with each generation, making hierarchical design essential for managing scale and complexity. In Physical Design, this translates into processing smaller building blocks from RTL to GDS, which are later integrated at higher levels. A critical aspect is constructing the high-speed global clock network on a hierarchical clock grid.
Traditionally, this process begins by defining blockages for the global clock grid at the top level, followed by carving out blockages for smaller blocks. The global clock is then routed within these reserved areas and combined across levels. To ensure correctness, buildability, and functionality, global clock checks are performed both on individual blocks and at the highest level. However, this approach is time-intensive and relies heavily on expert knowledge for defining routing blockages.
To address these challenges, the IBM z-Microprocessor team has developed an automated, in-context reservation modification technique that enables building a functional global clock from the ground up. This innovation significantly reduces dependency on manual expertise and accelerates the design process, improving efficiency and scalability for future microprocessor designs.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionState-of-the-art microprocessors become more complex with every generation. Therefore, large microprocessor designs are implemented hierarchically. For Physical Design, this means smaller building blocks are processed from RTL to GDS first. These pre-built blocks are then integrated at higher levels, with a final chip-wide place and route as well as checking done at the end. This process requires a precise handoff, which for shapes is represented in the form of a physical abstract. In the past, abstracts were created upfront at a higher level to include in-context information and were published as OpenAccess data to be used as inputs for block construction. However, this process is very time-consuming and error-prone, and it adds extra dependencies. The IBM z-Microprocessor team has created a methodology in which the OpenAccess views are replaced by version-controlled text files. Generation of these text files is automated, depends on the latest RTL as well as the latest parent-level state of the design, and runs independently of other processes. This simplifies the hierarchical end-to-end pipeline from RTL to GDS because the block-specific OA abstracts required for construction are generated on the fly. This not only saves runtime but also increases the overall stability of the design.
Work in Progress
DescriptionIndustrial Algorithmic Pattern Generator (ALPG) code generation with large language models (LLMs) has shown strong results under high-cost configurations using proprietary models and GPU-based pipelines. However, such setups limit practical deployment in on-premise semiconductor environments. In this work, we ask how far a cost-constrained pipeline can approach industrial viability. We deploy an FP8-quantized open-source Llama 3.3 model on an emerging low-power NPU accelerator (Renegade) and evaluate domain-specific code generation. Compared to an NVIDIA A100 GPU, the NPU achieves comparable throughput (80–110 tokens/s) while reducing steady-state power by 50–70%, yielding 2–3× higher tokens-per-watt efficiency. Structural validity improves from 40% to 60–65% with lightweight rule-based correction. These results highlight the trade-off between energy efficiency and domain accuracy, suggesting that energy-efficient NPUs can approach practical industrial deployment without relying on high-cost model and hardware configurations.
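The efficiency figure in this abstract can be checked with simple arithmetic. The sketch below assumes "comparable throughput" means a throughput ratio of exactly 1.0, which is a simplification; the function name and that ratio are illustrative and not taken from the abstract.

```python
def efficiency_gain(throughput_ratio: float, power_reduction: float) -> float:
    """Tokens-per-watt gain relative to the baseline.

    throughput_ratio: accelerator throughput / baseline throughput (assumed 1.0 here).
    power_reduction:  fractional power saving, e.g. 0.5 for a 50% reduction.
    """
    power_ratio = 1.0 - power_reduction  # remaining fraction of baseline power
    return throughput_ratio / power_ratio


# The two ends of the stated 50-70% power-reduction range.
low = efficiency_gain(1.0, 0.50)
high = efficiency_gain(1.0, 0.70)
print(f"{low:.2f}x to {high:.2f}x tokens per watt")
```

Under that assumption the gain spans roughly 2× to 3.3×, which is in line with the 2–3× range stated in the abstract.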
Work in Progress
DescriptionLarge Language Models (LLMs) face significant compute and memory bottlenecks from massive key-value data in long-context inference. We present Nucleus, a configurable accelerator that applies online outlier-aware quantization to reduce TTFT and TBT. Its runtime-configurable outlier detector dynamically adapts to KV distribution patterns, enabling precision-aware computation. With dynamic outlier compaction and a reconfigurable fused multiply-accumulate (FMA) engine, Nucleus achieves near-ideal scaling across quantization levels. A network-on-chip (NoC) scheduler coordinates multiple Nucleus cores to overlap single-batch prefill and multi-batch decode, balancing variadic-length workloads. Fabricated in Samsung 5-nm technology, Nucleus outperforms state-of-the-art baselines such as NVIDIA A100 with KIVI, delivering latency-optimized long-context LLM inference.
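As a rough illustration of outlier-aware quantization in general, the sketch below keeps large-magnitude values at full precision and quantizes the rest to a low-bit integer grid. The magnitude threshold, the function names, and the symmetric 4-bit scheme are assumptions made for illustration; the abstract does not describe how the Nucleus detector or its compaction logic work.

```python
def quantize_outlier_aware(values, threshold, bits=4):
    """Quantize inliers to signed low-bit codes; keep outliers at full precision.

    An element is treated as an outlier when |v| > threshold (an assumed rule).
    Returns (codes, scale, outliers), where outliers maps index -> original value.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit codes
    inliers = [v for v in values if abs(v) <= threshold]
    peak = max((abs(v) for v in inliers), default=0.0)
    scale = peak / qmax if peak > 0.0 else 1.0

    codes, outliers = [], {}
    for i, v in enumerate(values):
        if abs(v) > threshold:
            outliers[i] = v      # stored separately, not quantized
            codes.append(0)      # placeholder code at the outlier position
        else:
            codes.append(round(v / scale))
    return codes, scale, outliers


def dequantize(codes, scale, outliers):
    """Rebuild approximate values, restoring outliers exactly."""
    return [outliers.get(i, c * scale) for i, c in enumerate(codes)]


kv = [0.1, -0.2, 0.05, 8.0, 0.3]
codes, scale, outliers = quantize_outlier_aware(kv, threshold=1.0)
restored = dequantize(codes, scale, outliers)
```

Because the single large value is excluded from the scale computation, the remaining values use the full 4-bit range and their reconstruction error stays within half a quantization step.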
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe rapid growth of LLMs demands high-throughput, memory-capacity–intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPUs and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage-access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads FFN computation into the Flash array while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, ECC units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, operating directly on raw NAND reads with integrated ECC. Attention weights remain in DRAM, and a KV-cache–aware scheduler maintains throughput as context grows. Evaluated on OPT and LLaMA models up to 30B parameters, NVLLM achieves 16.7× speedup over A800 out-of-core inference and 1.1×–4.7× improvement over SSD-like designs, with only 2.7% CMOS area overhead.
Research Manuscript
ODBP: Obfuscation Design Approach for Biochemical Protocols in Continuous-Flow Microfluidic Biochips
1:42pm - 1:55pm PDT Tuesday, July 28 Mtg Room 202C
Design
DES3. Emerging Models of Computation
DescriptionThe intellectual property (IP) of continuous-flow microfluidic biochips (CFMBs) faces the risk of reverse engineering (RE) attacks during the IP transfer process from designers to manufacturers. This IP comprises not only hardware layouts but also biochemical protocols that specify bioassay procedures, whose theft can lead to technology leakage, trust crises, and substantial economic losses. To address this risk, this paper presents ODBP, the first obfuscation design approach for biochemical protocols in CFMBs. This approach contains four complementary techniques: an anti-RE strength measurement, a dependency-preserving and function-covering strategy, an executable dummy scheme, and a position selection strategy. By incorporating these techniques into the position selection of dummy operations and dummy edges, ODBP generates obfuscated biochemical protocols that can effectively resist RE attacks. Experimental results on multiple benchmarks show that the proposed approach generates obfuscated biochemical protocols that reach the target anti-RE lifetime with low protocol-level cost. Moreover, architecture prediction based on these obfuscated protocols shows that the corresponding CFMB architectures can be realized at a lower overall cost.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionSparsity is essential for deploying large models on resource constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework specifically designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method effectively mitigates the cold-start latency inherent in traditional monolithic approaches. Experiments on a real-world autonomous driving platform demonstrate that our framework achieves superior switching efficiency, accelerating task switching by over 6.6× on average compared to existing sparsity methods.
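The idea of loading only the differential set of blocks can be sketched with plain set operations. The block identifiers and function name below are hypothetical; the abstract does not specify how blocks are indexed, aligned, or evicted.

```python
def plan_switch(resident: set, next_task: set):
    """Plan a task switch over block-granular weights.

    resident:  block IDs currently held in memory.
    next_task: block IDs the next task's sparse model needs.
    Returns (to_load, to_evict, reused).
    """
    to_load = next_task - resident   # differential set: fetched from storage
    to_evict = resident - next_task  # no longer needed, may be dropped
    reused = resident & next_task    # overlap: stays in memory, costs no I/O
    return to_load, to_evict, reused


# Hypothetical block sets for two tasks with aligned sparse structures.
detection = {0, 1, 2, 5, 7}
segmentation = {0, 1, 2, 6, 7}
to_load, to_evict, reused = plan_switch(detection, segmentation)
print(f"load {sorted(to_load)}, evict {sorted(to_evict)}, reuse {sorted(reused)}")
```

The more the two block sets overlap, the smaller `to_load` becomes, which is why aligning sparse structures across tasks reduces switching I/O.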
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionA transaction ensures an all-or-nothing guarantee for a set of read/write operations. While modern transaction processing systems provide atomicity and consistency for data center applications, their performance is fundamentally limited by a critical memory bandwidth bottleneck, caused by massive numbers of read/write operations. Processing-in-memory (PIM) architectures offer a promising solution with their high aggregate bandwidth. However, directly applying PIM becomes inefficient in practice due to load imbalance and costly data transfers among PIM processors. To bridge the gap between transaction processing and PIM architectures, this paper presents Onyx, a high-throughput transaction processing system that efficiently executes transactions on a real UPMEM PIM prototype. To fully leverage the high parallelism and bandwidth of PIM, Onyx orchestrates the transaction execution workflow to maximize local accesses while maintaining load balance across PIM processors. It further employs rank-level asynchronous data transfer to reduce communication overhead. Extensive evaluation using real-world YCSB and TPC-C benchmarks shows that Onyx significantly outperforms state-of-the-art transaction processing systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn analog-on-top devices featuring hundreds of digital test and trimming registers, routing complexity between the digital domain and analog IPs leads to high congestion. STMicroelectronics' patented solution No. US20250371237A1 relocates these registers inside the analog IP blocks, significantly reducing top-level interconnections. However, this requires executing the full digital design flow (from RTL generation to Place & Route) for each analog "box," increasing time and the risk of errors. To address this, the "OoB (Out-of-Box) Generator," a Python-based tool, was developed to fully automate the digital flow, integrating RTL generation, simulation, synthesis, Place & Route, and signoff through an intuitive GUI. This tool reduces development time from 3 days to 30 minutes per box, achieving a 17% improvement in logic density (kgates/mm²) and a 30% reduction in trim/test nets. OoB Generator also empowers analog designers to independently manage the digital flow, enhancing productivity and design quality. The approach has been silicon-validated and adopted in commercial products, with future plans to extend support to additional IPs and integrate AI for design decision assistance.
Additional Meeting
EDA
DescriptionThis “Open-Source EDA, Data and Collaboration” Birds-of-a-Feather session is the seventh in a series that began at DAC 2018. The DAC 2026 session serves as an informal forum for anyone who would like to hear or share ideas and the latest updates on emerging trends in our open-source ecosystem. To ensure a broad and inclusive discussion, topics of interest include, but are not limited to: Augmented AI in EDA (moving beyond traditional black-box hyperparameter search to include agentic workflows and AI-accelerated physical design), open-source ecosystem governance and initiatives, robust PDK development, open-source AMS design flows, as well as benchmarking and industrial usage. Additionally, we will focus on workforce development, discussing how the hardware and software design talent gap is being bridged by reshaping university curricula and training engineers on modern workflows. This BoF is designed as an open, interactive forum to surface challenges, share experiences, and collaborate. Everyone is welcome and admission is free for all! (Please send an email to abk@ucsd.edu with any questions.)
Tutorial
DescriptionQuantum computing hardware — including superconducting qubits, photonic circuits, and hybrid systems — presents unique physical design challenges that require specialized EDA workflows spanning electromagnetic simulation, layout generation, physical verification, and foundry integration. This tutorial provides a hands-on introduction to designing quantum computing circuits using open-source EDA tools built on the GDSFactory framework. Attendees will learn end-to-end design workflows from schematic capture through physical verification (DRC and LVS) and foundry tapeout, with a focus on qubit RF design and quantum circuit layout. The tutorial covers parametric cell development in Python for quantum device geometries, integration with electromagnetic and circuit simulators for qubit characterization, physical verification flows adapted for quantum fabrication processes, and foundry PDK integration for practical tapeout readiness. Through interactive Jupyter notebooks and an open-source quantum PDK, participants will gain hands-on experience with the complete design flow — from device specification to fabrication-ready layout. The tutorial includes a basic introduction to quantum circuit design principles to ensure accessibility for attendees without a quantum computing background.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionDigital Compute-in-Memory (DCiM) accelerates neural networks by reducing data movement. Approximate DCiM can further improve power–performance–area (PPA), but demands accuracy-constrained co-optimization across coupled architecture and transistor-level choices. Building on OpenYield, we introduce Accuracy-Constrained Co-Optimization (ACCO) and present OpenACMv2, an open framework that operationalizes ACCO via two-level optimization: (1) accuracy-constrained architecture search of compressor combinations and SRAM macro parameters, driven by a fast GNN-based surrogate for PPA and error; and (2) variation- and PVT-aware transistor sizing for standard cells and SRAM bitcells using Monte Carlo. By decoupling ACCO into architecture-level exploration and circuit-level sizing, OpenACMv2 integrates classic single- and multi-objective optimizers to deliver strong PPA–accuracy tradeoffs and robust convergence. The workflow is compatible with FreePDK45 and OpenROAD, supporting reproducible evaluation and easy adoption. Experiments demonstrate significant PPA improvements under controlled accuracy budgets, enabling rapid "what-if" exploration for approximate DCiM.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionThe deceleration of Moore's Law, combined with the rising computational demands of big-data applications and generative AI, is accelerating the development of scale-up systems. Unlike traditional scale-out architectures, scale-up systems typically extend shared memory across multiple nodes using high-bandwidth, load/store-based interconnects such as CXL, UALink, NVLink, and Unified-Bus.
Designing and optimizing such interconnected systems poses a multi-faceted challenge spanning computer architecture, interconnect technology, and software systems.
Although industry has introduced numerous commercial solutions, the academic community still lacks an open-source, hardware-based platform to support cross-stack research, such as exploring new protocols, topologies, I/O controller or switch architectures, and system-level software optimizations. To address this gap, we introduce OpenSUN, a novel FPGA-based multi-node system that supports customizable architecture and software systems. Its baseline architecture integrates a full-stack design, encompassing I/O controllers, switches, protocol stacks, and software systems. These components are all accessible through user-friendly interfaces at each layer to enable agile prototyping of novel hardware and software designs. By providing this open, community-accessible platform, OpenSUN aims to foster exploration and innovation in next-generation scale-up interconnect systems.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
Description3D object detection is critical for robotics and autonomous driving, yet inference latency remains a bottleneck for efficient deployment on resource-constrained platforms. Existing acceleration strategies for sparse 3D detectors fail to fully exploit the intrinsic sparsity of voxels. To address this challenge, we propose a systematic operator-level acceleration pattern that exploits structural properties of sparse voxels through three synergistic techniques: (1) cache-aware attention fusion that redesigns slot-based aggregation via tiled tensor decomposition, achieving 10.0–33.5× speedup on an RTX 4090 GPU and 12.5–50.8× speedup on an A100 GPU while enabling previously infeasible high-dimensional attention heads (d = 512–1024); (2) sparsity-driven convolution approximation that leverages attention-induced feature redundancy to replace 3D axial convolutions with lightweight 1D channel operations, delivering 1.2–2× acceleration with negligible accuracy loss (<4% mAP); and (3) structure-aware precision reduction employing profiling-guided mixed-precision assignment that selectively applies FP16 to high-cost operators with high tolerance to precision loss, while preserving FP32 for geometric encoders, achieving 1.93–2.19× speedup. All optimizations preserve network architecture and require no retraining. Experimental results show that our method achieves a 2.1×–2.5× end-to-end speedup over state-of-the-art baselines while maintaining accuracy within 4% mAP and 2.5% NDS, establishing a novel pattern for efficient sparse 3D perception.
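The third technique, profiling-guided mixed-precision assignment, can be illustrated with a small rule-based sketch. The operator names, profile fields, and thresholds are all hypothetical; the abstract states only that FP16 goes to high-cost, precision-tolerant operators and that geometric encoders stay in FP32.

```python
def assign_precision(profiles, min_cost_ms, max_fp16_error):
    """Choose a precision per operator from profiling data.

    profiles: name -> {"cost_ms": float, "fp16_error": float, "geometric": bool}
    An operator gets FP16 only if it is expensive enough to matter and its
    measured FP16 error is within tolerance; geometric encoders stay FP32.
    """
    plan = {}
    for name, p in profiles.items():
        if p["geometric"]:
            plan[name] = "fp32"
        elif p["cost_ms"] >= min_cost_ms and p["fp16_error"] <= max_fp16_error:
            plan[name] = "fp16"
        else:
            plan[name] = "fp32"
    return plan


# Hypothetical profiling results for four operators.
profiles = {
    "voxel_position_encoder": {"cost_ms": 3.0, "fp16_error": 0.001, "geometric": True},
    "sparse_attention":       {"cost_ms": 9.5, "fp16_error": 0.002, "geometric": False},
    "channel_mixer":          {"cost_ms": 4.2, "fp16_error": 0.050, "geometric": False},
    "layer_norm":             {"cost_ms": 0.3, "fp16_error": 0.001, "geometric": False},
}
plan = assign_precision(profiles, min_cost_ms=1.0, max_fp16_error=0.01)
```

Here only `sparse_attention` is moved to FP16: it is both costly and tolerant, while the other operators are geometric, too sensitive, or too cheap to be worth converting.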
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionQuantum computing has been demonstrated as a promising solution for computation-intensive power system electromagnetic transient (EMT) computations. However, existing work has focused on time-independent Hamiltonian evolution to solve ordinary differential equations (ODEs) with heavy classical interaction. This work proposes OHQS, an optimized time-dependent Hamiltonian evolution quantum solver, to enable high-accuracy computation for ODE-dominated EMT problems without classical iterations. The spatiotemporal mapping transformation is discretized to construct iterative quantum circuit evolution, leveraging the automatic optimizer enabled by spatiotemporal fixed points (SFPs) to reduce discretization errors. Experimental results demonstrate that OHQS can support time-dependent Hamiltonian evolution in practical applications with up to 99.91% accuracy and achieve up to 9.56× error reduction compared to classical methods. The observed exponential decay of the relative error implies that the number of system qubits scales as O(log(1/ξ)), where ξ is the target relative error tolerance.
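The logarithmic qubit scaling follows from the exponential error decay by a short calculation. As an illustration, assume the relative error with n system qubits is bounded by C e^{-αn}, where C > 0 and α > 0 are hypothetical fit constants that the abstract does not give. Requiring this bound to meet the tolerance ξ yields:

```latex
C\,e^{-\alpha n} \le \xi
\;\Longleftrightarrow\;
n \ge \frac{1}{\alpha}\,\ln\frac{C}{\xi}
\quad\Longrightarrow\quad
n = O\!\left(\log\frac{1}{\xi}\right)
```

Under this assumed error model, each tenfold tightening of ξ costs only a constant number of additional qubits, namely ln(10)/α.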
Engineering Presentation
AI
Design
EDA
DescriptionStandard cell libraries form the foundation of SoC design but impose significant limitations through fixed drive strengths, P/N ratios, and tapering schemes. While typical libraries offer approximately 30 variants per gate type, the theoretical design space contains millions of potential variants per gate. Traditional EDA tools cannot effectively explore this vast space, leaving substantial performance gains untapped.
This work introduces a novel methodology that optimizes power and performance through intelligent, selective transistor resizing at the individual cell level. The approach creates tailor-made cell variants while maintaining full compatibility with standard EDA flows. The method employs a two-stage process: a one-time, per-library precomputation phase that uses AI techniques to generate predictive models from SPICE models and library data, followed by a per-design optimization flow that produces custom cell netlists.
This smart exploration of the design space identifies high-impact optimization opportunities without overwhelming existing tool flows. By enabling transistor-level customization within the standard cell framework, this methodology unlocks previously inaccessible performance improvements. Results on real industrial designs demonstrate double-digit percentage improvements in power efficiency and timing while seamlessly integrating into conventional design methodologies.
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionDynamic-shape neural networks are essential for handling variable-length data in tasks, but their optimization is challenging. Existing methods, such as hand-tuned libraries or static kernel compilation, are inefficient due to a lack of flexibility or the high overhead from repeated compilations.
We propose DyTuner, an adaptive operator-level tuning framework for dynamic-shape neural networks. DyTuner uses a unified, system-compatible interface and an adaptive algorithm to efficiently tune kernels for varying shapes, significantly improving end-to-end performance and reducing redundant compilations. Experiments demonstrate that DyTuner effectively optimizes dynamic-shape workloads, achieving speedups of up to 1.41x and 1.64x over state-of-the-art solutions, BladeDISC and TorchInductor.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionRegister arrays in GPU design consume significant power due to clock tree complexity and addressing logic. Analysis of a key GPU gaming workload revealed that a group of register arrays account for 10.3% of the total sub-block power, making them a prime target for optimization.
Clock power can be reduced by shrinking entry bit width and/or decreasing array depth or count. However, this can impact performance due to reduced capacity increasing stalls, which can introduce backpressure within a design.
Access pattern analysis showed that consecutive arrays are often accessed together, allowing them to be folded into fewer, wider arrays, preserving capacity and avoiding performance loss. This approach, called access pattern-based array folding, reduces addressing logic and clock power. An example optimization achieved a 2.75% net power reduction, 0.63% reduction in standard cell area, and 13.15% reduction in total ICG count.
The solution has broad applicability across industries, including CPU, AI/ML, networking, and SoC, and provides a systematic, architecture‑agnostic method for global clock-tree simplification. By leveraging access patterns to optimize register array design, this novel approach can reduce power consumption while preserving capacity and performance, making it a valuable technique for various designs across the industry.
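A toy model of the folding idea is sketched below: groups of consecutive, equal-depth arrays are merged into one array whose entries are wider, so a single access returns the entries of all arrays in the group. The fold factor, data layout, and function names are assumptions for illustration; the abstract does not describe the actual hardware mapping.

```python
def fold_arrays(arrays, k):
    """Fold each group of k consecutive equal-depth arrays into one wider array.

    arrays: list of arrays, each a list of entries with the same depth.
    Returns a list of folded arrays whose rows are k-tuples (one lane per source array).
    """
    assert len(arrays) % k == 0, "array count must be a multiple of the fold factor"
    folded = []
    for start in range(0, len(arrays), k):
        group = arrays[start:start + k]
        depth = len(group[0])
        folded.append([tuple(a[row] for a in group) for row in range(depth)])
    return folded


def read_entry(folded, array_idx, row, k):
    """Read the entry that lived at arrays[array_idx][row] before folding."""
    return folded[array_idx // k][row][array_idx % k]


# Four arrays of depth 2 folded pairwise into two arrays with double-width rows.
arrays = [[10, 11], [20, 21], [30, 31], [40, 41]]
folded = fold_arrays(arrays, k=2)
```

Total capacity is unchanged (four arrays of two entries become two arrays of two double-width rows), while the number of separately addressed and clocked arrays is halved.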
Engineering Presentation
EDA
Systems
DescriptionAdvanced technology nodes with tighter BEOL metal widths and pitches have enabled area scaling with higher interconnect densities, but they have significantly increased electromigration (EM)-induced reliability risks due to higher power densities and the self-heating effects of the gate-all-around (GAA) architecture. Modern mobile SoCs are designed to ambitious performance targets and within very tight design windows, making traditional foundry-rule-based EM closure challenging.
In this presentation, we employ a Context and Thermally aware Statistical EM budgeting (SEB) flow as a sign-off methodology that prioritizes realistic physical and thermal environments, accurate correlation of models to silicon data, and precise failure-rate (FR) estimation to deliver efficient and reliable EM closure for a complex 2nm SoC design. Implementation of this risk-aware flow empowered designers to maximize performance while maintaining the FR within the allocated reliability budget. Consequently, this Design For Reliability (DfR) approach eliminated over two weeks of iterative manual re-design effort and saved computing resources, significantly improving the sign-off timeline.
The work can be summarized as:
1. Context-Thermal-Risk Aware Modeling: Predict accurate failure rates using SEB that incorporates physical environment and thermal coupling with accurate silicon-calibrated models.
2. Targeted Fixing: Enable performance optimization by shifting to risk-based EM closure, prioritizing fixes only for critical violations exceeding the reliability budget.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionDiffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal practical acceleration.
To overcome these limitations, we propose ORBIS, an SW–HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains.
To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.
Research Special Session
AI
DescriptionIncreasing demands for intelligent processing across space edge platforms and emerging orbital data centers—from commercial constellations to national security missions—require computing architectures that deliver extreme SWaP (size, weight, and power) efficiency while ensuring radiation hardness, reliability, and hardware-level trustworthiness. Future space systems must securely support both autonomous edge intelligence and scalable in-orbit computing infrastructure under harsh radiation and adversarial conditions. We present OrbitBrain, a secure and reliable chiplet-based architecture for space edge AI and orbital data centers. OrbitBrain integrates heterogeneous chiplets, including radiation-aware analog in-memory computing (IMC) chiplets based on non-volatile memory (NVM), dedicated spiking neural network (SNN) acceleration chiplets, digital control chiplets, and dedicated security monitoring chiplets. Further, it establishes the root-of-trust through a secure integration fabric. The architecture incorporates rad-hard design techniques, fault detection and mitigation mechanisms, and secure monitoring to ensure resilience against faults and hardware and/or software attacks. Its modular design supports ultra-efficient edge inference as well as scalable model execution and adaptive fine-tuning for data center-class workloads, providing a trustworthy and resilient foundation for next-generation space computing systems.
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionDiffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, their bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both of which contribute substantial inference cost and constrain the achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which hurts efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method that improves efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162× and 2.63-6.30× speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation seen in existing acceleration frameworks.
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionFormal verification rigorously explores all possible system states, offering unmatched coverage and the ability to detect subtle errors in complex designs. However, its exhaustive nature leads to the state space explosion problem, making verification computationally challenging. Reset abstraction mitigates this by simplifying the state space while preserving essential properties, but its application is often hindered by the difficulty of selecting the right abstraction level and ensuring accurate results. Many engineers view reset abstraction as unpredictable and hard to control, a perception compounded by the scarcity of practical literature detailing effective strategies, common pitfalls, and best practices. To address these challenges, we introduce an effective step-by-step Reset Abstraction methodology that has been validated through case studies on multiple design blocks from our PCIe verification project. Our methodology delivers a comprehensive framework, beginning with precise identification of the optimal signals for abstraction and extending through rigorous analysis of potential pitfalls. Proactively anticipating and managing the risks gives engineers greater control over the method and makes it more reliable. This not only enhances the practical effectiveness of reset abstraction but also lets engineers apply it with increased confidence and efficiency, maximizing its potential in real-world verification environments.
Engineering Presentation
Design
EDA
DescriptionTraditional design cycles include multiple tapeout and measurement cycles to achieve optimal performance. The resulting increase in cost and delay in product introduction are no longer acceptable given expensive advanced process nodes and condensed design cycles. Electromagnetic (Emag) simulation must play a role in reducing design time and cost. There are many challenges with on-die Emag extraction: difficult port assignment because of poorly defined return paths, high port counts, high computational cost (i.e., long simulation times and high RAM usage) because of the mesh density that full-wave solvers require to accurately model skin and proximity effects for the high aspect ratio geometry found on die, and long SPICE run times when using the S-parameter outputs of Emag simulations.
We propose a partial element equivalent circuit method with random walk to extract the layout and generate rational function models (RFMs) to address speed and capacity concerns at all stages of analysis. We prove efficacy by running Emag extractions for 3.55 to 7.99 GHz voltage-controlled oscillator (VCO) inductor and capacitor resonant tank circuits and using S-parameter and RFM outputs with SPICE simulations to show ~1.4X reduction in SPICE simulation time using RFMs and prove accuracy by matching measured results within 6%.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionNeuro-symbolic AI is gaining traction in domains such as large language models, scientific discovery, and autonomous systems due to its ability to combine perception with structured reasoning. However, its deployment is often constrained by high memory demands, diverse computation patterns, and complex hardware requirements. Existing hardware platforms struggle with large on-chip memory overheads, frequent pipeline stalls, limited I/O bandwidth, and inefficient handling of nonlinear operations. To address these key computational bottlenecks, we propose Overmind, a unified neuro-symbolic architecture with cross-layer optimizations. Overmind tackles these core bottlenecks through Padé approximations for universal nonlinear functions, preemptive memory bypass that eliminates costly on-chip caches, and a complete software stack that optimizes model deployment. By reconfiguring the Padé orders for approximating nonlinear functions, we also demonstrate adaptive accuracy-performance scaling. Overmind achieves an energy efficiency of 8.1 TOPS/W and a throughput of 410 GOPS for mixed neuro-symbolic workloads with minimal model accuracy loss. Compared to existing solutions, Overmind improves performance and efficiency with significantly fewer hardware resources.
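As a rough illustration of how the Padé order trades accuracy for cost, the sketch below compares two standard Padé approximants of tanh. It is not taken from the Overmind design: the choice of tanh, the two orders, and the function names are assumptions made for illustration only.

```python
import math

def tanh_pade_low(x: float) -> float:
    """[1/2] Pade approximant of tanh: 3x / (3 + x^2)."""
    return 3.0 * x / (3.0 + x * x)

def tanh_pade_high(x: float) -> float:
    """[3/2] Pade approximant of tanh: x(15 + x^2) / (15 + 6x^2)."""
    x2 = x * x
    return x * (15.0 + x2) / (15.0 + 6.0 * x2)

# Compare both orders against the library tanh at a few sample points.
for x in (0.25, 0.5, 1.0):
    exact = math.tanh(x)
    low_err = abs(tanh_pade_low(x) - exact)
    high_err = abs(tanh_pade_high(x) - exact)
    print(f"x={x}: low-order error={low_err:.2e}, high-order error={high_err:.2e}")
```

The higher-order form costs one extra multiply-add in the numerator and denominator but matches the Taylor series of tanh through the x^5 term, whereas the lower-order form matches only through x^3.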
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionAs large language models (LLMs) continue to scale in size and complexity, DRAM access has become a performance bottleneck, especially during the memory-bound decoding phase dominated by general matrix-vector multiplication (GEMV) operations. Recent work has explored near-/in-memory computing to address bandwidth limitations. Among these, Processing-Using-DRAM (PUD) offers a unique advantage as it performs logic operations entirely within DRAM peripherals with minimal circuit changes, enabling deployment on commercial off-the-shelf (COTS) DRAM. While prior work reported multi-input logic operations, its potential to accelerate GEMV operations in LLMs remains underexplored due to the limited compute capability of DRAM sense amplifiers (SAs).
To overcome this challenge, we propose PAGE, a PUD architecture that accelerates GEMV through four key optimizations: (1) mat-level parallelism that exploits row-level parallelism by activating SAs across multiple mats; (2) inversion-enabled SA supporting NOT operations within a mat; (3) AP-based row copy mechanism that performs simultaneous source–destination activation with only control-signal changes; and (4) adaptive adder-tree accumulation that reduces accumulation cycles. We demonstrate the baseline PUD architecture on a COTS DRAM–FPGA platform to verify the functionality and timing of PUD operations, implement the modified circuits in TSMC 16nm for circuit-level evaluation, and build an in-house simulator for system-level throughput and energy analysis. Overall, PAGE achieves up to 14× and 3.1× end-to-end throughput and energy improvement for GEMV workloads over the SoTA PUD designs, demonstrating its feasibility for accelerating memory-bound LLM workloads with PUD.
Research Special Session
AI
DescriptionThe design of analog circuits traditionally relies on manual interventions in topology synthesis, transistor sizing, and layout generation. In this work, we introduce PANDA, a performance-driven framework that leverages large language models (LLMs) to automate the analog design flow, bridging the gap between high-level design intent and final layout. PANDA incorporates cutting-edge techniques, including LLM-empowered topology synthesis, post-layout simulation-driven sizing, and layout generation. The layout generation process is driven by an LLM-based task-decomposition engine that translates high-level design intents into precise placement and routing actions. While the underlying P&R kernel, leveraging engines such as SAGERoute, honors geometric and electrical constraints, the LLM selects constraints, steers optimization, and adaptively refines the layout, shifting the flow from algorithm-centric execution to intent-centric design. Rather than merely automating routine steps, the integration of LLMs enables PANDA to internalize design principles, infer strategy-level decisions, and generate execution-ready design directives, effectively elevating automation from procedural assistance to informed, high-level co-design. Experiments demonstrate that PANDA significantly improves design performance while reducing time and effort in producing high-quality analog layouts.
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionPangenome graphs have emerged as the new standard for genomic reference, replacing linear references to capture population-level genetic variation. However, sequence-to-graph mapping tools exhibit severe memory bottlenecks, with intermediate data movement between computation stages consuming up to 86% of total DRAM bandwidth. Through systematic profiling of state-of-the-art tools, we identify that no single computational stage dominates across different implementations, necessitating cross-stage optimization approaches rather than accelerating individual kernels.
We present PANGAEA, a co-design framework that formulates cross-stage loop fusion as an optimization problem to minimize DRAM traffic under on-chip buffer constraints. Our approach fuses the three memory-intensive stages (i.e., linear chaining, graph chaining, and wavefront alignment). The framework automatically generates tiling parameters and scheduling schemes that maximize data reuse across genome analysis kernels with different dependency patterns. We design the PANGAEA accelerator with a unified tri-mode processing element array supporting sparse chaining and dense alignment operations. Compared to SOTA ASIC implementations, PANGAEA achieves an average of 1.47× higher throughput (bp/s), 1.73× better energy efficiency (bp/µJ), and 1.62× area efficiency (bp/s/mm²).
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionCombinational equivalence checking (CEC) is a key and often time-critical step in the IC design flow for verifying functional equivalence after synthesis and technology mapping. As designs scale, improving CEC efficiency has become a major challenge with wide-ranging consequences for the entire flow. In this paper, we propose a novel parallel CEC approach through factored form sharing. In particular, the factored form literal count (FFLC) closely reflects both the amount of shareable literal/CNF-encoding and the branching complexity encountered during satisfiability (SAT)-based verification. Therefore, FFLC, together with a lightweight branching complexity proxy, guides a sharing-aware seed-and-grow partitioning algorithm that exploits literal and CNF-encoding reuse while balancing estimated SAT-solving complexity. Implemented in ABC and evaluated on large-scale benchmarks with 40 physical CPU cores, our implementation delivers geometric-mean speedups of 45.57× (max. 6,595×) over single-threaded ABC, 1.53× (max. 17×) over a state-of-the-art (SOTA) CPU-parallel method, and 5.52× (max. 382×) over a SOTA GPU-parallel method.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionCombinational Equivalence Checking (CEC) is a cornerstone of modern IC design and verification flows, and state-of-the-art CEC solvers predominantly rely on SAT-sweeping frameworks. As circuit scale and complexity continue to increase, purely serial sweeping becomes a critical performance bottleneck, motivating the exploration of parallelism. However, existing parallel CEC approaches primarily focus on accelerating the verification of individual candidate pairs, while leaving the sweeping process itself essentially serial. This paper presents HydraCEC, a novel, general-purpose parallel CEC framework that, to the best of our knowledge, is the first to parallelize the sweeping process itself by concurrently executing its verification tasks. Its effectiveness is further enhanced by a dynamic benefit-aware scheduling policy guided by a Task Benefit Graph (TBG), and an asynchronous equivalence sharing mechanism that enables cooperation without global synchronization. Experimental results on ISCAS/ITC, Datapath, and Large-Scale benchmarks show that HydraCEC consistently outperforms existing parallel CEC solvers, particularly on large-scale instances, where it achieves a 10.5x speedup over the best competitor and exhibits excellent scalability, reaching 72.3x speedup with 128 cores.
People
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionModel checking, especially symbolic algorithms such as IC3/PDR, is a powerful verification technique but is increasingly constrained by single-threaded performance. Existing parallel model checking approaches either rely on redundant portfolio executions or suffer from scalability bottlenecks caused by centralized scheduling.
We propose a new parallel paradigm for model checking based on fine-grained task decomposition and a decentralized Work-pulling architecture. A distributed concurrent task repository replaces the central coordinator, enabling worker threads to autonomously acquire tasks and naturally achieve dynamic load balancing. We present the framework design, analyze its parallel mechanisms, and evaluate its performance on representative benchmarks. Our approach achieves notable speedup and demonstrates solid scalability over serial execution.
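The work-pulling idea can be sketched generically as follows. This is a minimal illustration, not the authors' framework: a single thread-safe queue stands in for the distributed task repository, the tasks are plain tuples rather than model-checking obligations, and `run_work_pulling`, `expand`, and `split_range` are hypothetical names.

```python
import queue
import threading

def run_work_pulling(root_tasks, expand, num_workers=4):
    """Run tasks with workers that pull from a shared repository.

    `expand(task)` returns (result, subtasks); subtasks are pushed back
    into the repository so any idle worker can pick them up.
    Tasks must not be None, which is reserved as the shutdown sentinel.
    """
    repo = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = repo.get()              # pull: no central scheduler assigns work
            if task is None:               # sentinel: shut down
                repo.task_done()
                return
            result, subtasks = expand(task)
            for sub in subtasks:           # enqueue children before marking parent done
                repo.put(sub)
            with results_lock:
                results.append(result)
            repo.task_done()

    for task in root_tasks:
        repo.put(task)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    repo.join()                            # waits for all tasks, including spawned ones
    for _ in threads:
        repo.put(None)
    for th in threads:
        th.join()
    return results

def split_range(task):
    """Toy decomposition: halve an interval until it holds one element."""
    lo, hi = task
    if hi - lo <= 1:
        return lo, []                      # leaf task yields a result
    mid = (lo + hi) // 2
    return None, [(lo, mid), (mid, hi)]    # inner task only spawns subtasks

leaves = sorted(r for r in run_work_pulling([(0, 8)], split_range) if r is not None)
print(leaves)
```

Because workers take the next task whenever they become free, load balancing follows from the pull discipline itself; a truly decentralized design would replace the single queue with per-worker or sharded repositories.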
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionThe combinatorial explosion of directives in high-level synthesis (HLS) creates an expansive design space. Previous efforts to construct design space exploration (DSE) methods with heuristics struggle to identify Pareto-optimal designs under tight search budgets. Therefore, we introduce ParetoPilot, an automated DSE framework, which leverages a dedicated LLM to adopt global optimization reasoning to navigate exploration. Guided by optimization strategies, ParetoPilot integrates directive scheduling and quality awareness to rapidly shape broad and concave Pareto fronts. Compared to a SOTA heuristic-based DSE method, ParetoPilot improves DSE performance by 49.5% and achieves 1.8× efficiency gain, outperforming the auxiliary LLM DeepSeek-R1 by 81.4%.
Work in Progress
DescriptionAsymmetric transistor aging is becoming increasingly significant in advanced process nodes due to rising thermal density and spatial workload variations. Clock distribution networks are particularly susceptible, as non-uniform aging across clock branches can distort clock skew and lead to setup or hold timing violations. This paper presents PASCAL, a Predictive Aging Skew Compensation Architecture for cLock trees. PASCAL integrates programmable delay lines within clock branches and an ML-based predictive maintenance controller that monitors temperature gradients and clock-gate activity to estimate aging-induced skew degradation. The controller dynamically tunes programmable skew to compensate asymmetric aging. Aging-aware simulations in a 16 nm FinFET process show that PASCAL eliminates over 95% of timing violations while maintaining timing margins over a 10-year lifetime.
Engineering Presentation
EDA
Security
DescriptionCDC & RDC verification faces a critical challenge: synthesis optimization can introduce glitches that are undetectable during RTL analysis but manifest as silicon failures, significantly impacting Time to Market. Improved Glitch Checker is a novel three-stage methodology addressing synthesis-induced CDC and RDC issues that traditional EDA tools fail to capture.
The core problem arises when synthesis tools perform Boolean algebraic optimizations that maintain logical equivalence but introduce timing hazards. For example, the expression A & EN may be rewritten as A & (~A + EN); by the distributive law this expands to (A & ~A) + (A & EN), which simplifies back to A & EN. While functionally equivalent, this optimization creates a critical timing vulnerability: when signal A transitions while EN=0, A and ~A change in opposite directions at nearly the same time, and any skew between them causes the intermediate AND term (A & ~A) to glitch momentarily before settling to zero. This glitch can propagate through the CDC and RDC path, violating metastability requirements and causing silicon failures that are impossible to detect during RTL-level CDC and RDC analysis.
Our three-stage solution employs: (1) Netlist Cone Extraction for targeted combinational logic, (2) Formal glitch analysis using Z3 satisfiability solving, and (3) Don't-Touch cell integration preventing synthesis optimization of critical CDC paths. Enablement of the solution demonstrates successful prevention of synthesis-induced CDC & RDC violations, reducing months of silicon debug time.
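The hazard in the A & EN example can be reproduced with a small simulation. This sketch is not the authors' checker and does not use Z3; it assumes a simple unit-delay model for the inverter that produces ~A, and all function names are hypothetical.

```python
from itertools import product

def original(a: int, en: int) -> int:
    """RTL intent: A & EN."""
    return a & en

def optimized(a: int, not_a: int, en: int) -> int:
    """Synthesized form: A & (~A + EN), with ~A supplied as its own wire."""
    return a & (not_a | en)

def logically_equivalent() -> bool:
    """Exhaustive check with an ideal (zero-delay) inverter."""
    return all(original(a, en) == optimized(a, 1 - a, en)
               for a, en in product((0, 1), repeat=2))

def simulate_rising_a(en: int, steps: int = 4, inverter_delay: int = 1):
    """Output waveform of the optimized logic when A rises at t=1.

    The inverter output lags A by `inverter_delay` time steps
    (an assumed delay model, chosen only to expose the hazard).
    """
    a_wave = [0] + [1] * (steps - 1)
    out = []
    for t in range(steps):
        a = a_wave[t]
        a_seen_by_inverter = a_wave[max(t - inverter_delay, 0)]
        not_a = 1 - a_seen_by_inverter
        out.append(optimized(a, not_a, en))
    return out

print(logically_equivalent())   # the two forms agree for every static input
print(simulate_rising_a(en=0))  # a one-step pulse appears although A & EN stays 0
print(simulate_rising_a(en=1))  # with EN=1 the output simply follows A
```

With EN=0 the original expression is constantly 0, so any 1 in the simulated waveform is a glitch introduced purely by the restructured logic and the inverter delay.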
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionPath-Based Static Timing Analysis (PBA) offers high accuracy but incurs high runtime due to redundant computations. We present a segment-level reuse framework that accelerates PBA by reducing its inherent complexity. Timing paths are decomposed into multi-fanin-bounded segments, enabling delay reuse across structurally identical subpaths. A dual-indexed SegHashMap and a slew-sensitive model with Sobolev supervision ensure safe, pessimism-preserving reuse. Our segment-centric engine reconstructs end-to-end delays with minimal overhead. Experiments show 195× average speedup over parallel CPU PBA and 2–3× gains over GPU methods—highlighting complexity reduction as a promising new direction for scalable timing sign-off.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionPath-Based Analysis (PBA) is accurate but computationally expensive, while Graph-Based Analysis (GBA) is fast but overly pessimistic. We propose PBA-rGAT-Edge, a residual edge-aware graph attention model that performs arc-level delay and slew prediction on a pin-level timing graph. The model uses a compact residual attention backbone with a lightweight fusion and dual-task head for stable and efficient learning. Experiments on million-scale industrial benchmarks show state-of-the-art accuracy (R²: 0.965 slew, 0.997 delay) and faster convergence compared to prior GNN-based methods, including DeepEdgeGAT (ASPDAC'23).
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionGenerating three-dimensional (3D) models for Printed Circuit Board (PCB) components from manufacturer datasheets is a foundational yet inefficient step in Electronic Design Automation (EDA) workflows. Current processes remain heavily manual, relying on skilled engineers to interpret 2D views, textual specs, and dimension tables. This manual dependence leads to high error rates and limited scalability. To tackle these challenges, we propose PCBgen3D, a self-correcting graph-based multimodal Large Language Model (MLLM) framework. It integrates a suite of coordinated modules, each enhanced by advanced MLLMs, to automate high-precision 3D modeling. These modules follow a structured pipeline of data extraction, task planning, and process refinement, overseen by an iterative self-correction loop. A core innovation is its dynamic task graph, which decomposes complex PCB modeling into adaptive sub-tasks and incorporates a self-correction mechanism to fix errors. Evaluated on a dataset containing components of various package types from different manufacturers, PCBgen3D outperforms state-of-the-art general-purpose MLLMs on industrial-grade PCB 3D modeling tasks.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionPrinted circuit board (PCB) design is increasingly critical as artificial intelligence (AI) systems demand higher integration density and stricter electrical constraints, while purely manual workflows struggle to keep up. We introduce PCBgen, a spec-to-schematic framework that integrates PCB knowledge graph (PCB-KG)-based retrieval, constraint-aware pre-filtering, a multi-modal intermediate representation (IR), and SPICE-in-the-loop, training-free Group Relative Policy Optimization (TF-GRPO) into a unified closed-loop pipeline for board-level design. This pipeline compresses an otherwise combinatorial search space (up to 10^20 candidates) to roughly 10^3 simulatable designs while preserving semantic and electrical validity. Extensive experiments for power supply designs demonstrate that, on a 336-case benchmark spanning six topology families and three design regimes, PCBgen improves topology adaptation (TA) by up to 57%, raises Pass@5 from 41% to 74%, and cuts token usage and wall-clock time by more than 50% compared with LLM-only baselines. In a 24-case comparison with human designers using existing vendor tools, it achieves 3.70x and 4.74x speedups in topology selection and schematic verification with comparable Pass@1, yielding an overall return on investment (ROI) of 4.39x and pointing toward a practical route to agentic, closed-loop PCB power supply design.
People
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionHeterogeneous cross-device side-channel attacks remain a critical yet underexplored challenge, as models trained on one device often fail to generalize across architectures. This paper presents PD-Net, a domain generalization framework that learns device-invariant features by disentangling algorithmic content from device-specific style and aligning feature distributions using prototypical and Maximum Mean Discrepancy (MMD) losses.
PD-Net is trained on nine heterogeneous source domains spanning ARM/AVR/FPGA and power/electromagnetic leakage modalities, including 32-bit ARM Cortex-M0/M1/M3/M4, 8-bit AVR ATmega (three series), and 128-bit Xilinx Virtex-5 FPGA, and evaluated in a zero-shot setting without target-specific adaptation.
Experimental results demonstrate robust zero-shot cross-architecture transfers between 8-bit and 32-bit devices, with consistent gains over existing generalization and transfer-learning approaches. In particular, PD-Net delivers 29 successful attacks with only 10 divergences across 70 settings, markedly outperforming the state of the art, which succeeds in only 4 cases and diverges 19 times.
To the best of our knowledge, this is the first domain generalization (DG)-based deep learning framework to systematically demonstrate practical zero-shot heterogeneous cross-device side-channel attacks.
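For readers unfamiliar with the Maximum Mean Discrepancy term mentioned above, the following sketch computes the standard biased estimate of squared MMD with an RBF kernel. It is a generic textbook formulation, not PD-Net's loss; the kernel choice, the `gamma` value, and the toy feature vectors are assumptions.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_k(us, vs):
        return sum(rbf(u, v, gamma) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Toy "features" from two hypothetical devices.
device_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
device_b_near = [(x + 0.1, y + 0.1) for x, y in device_a]
device_b_far = [(x + 3.0, y + 3.0) for x, y in device_a]

print(mmd2(device_a, device_b_near))  # small: distributions nearly aligned
print(mmd2(device_a, device_b_far))   # large: distributions far apart
```

Minimizing such a term during training pulls the feature distributions of different source devices toward each other, which is the intuition behind using it for device-invariant features.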
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionAggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. As prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making context length a first-order systems concern. Recent work with table-lookup-based ternary matrix multiplication on edge FPGAs exposes a fundamental prefill-decode asymmetry: prefill is compute-bound and dominated by dense matrix-matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context. We propose PD-Swap, a prefill-decode disaggregated LLM accelerator that uses Dynamic Partial Reconfiguration (DPR) to time-multiplex the attention module on edge FPGAs. The core table-lookup ternary matrix multiplication and weight-buffering engines remain static, while the attention subsystem is a reconfigurable partition with two phase-specialized dataflows: a compute-heavy, token-parallel prefill engine and a bandwidth-optimized, KV-cache-centric decoding engine. A roofline-inspired model and design space exploration jointly optimize reconfigurable-region size and parallelism under reconfiguration and routability constraints, and partial reconfiguration is overlapped with ongoing computation. Our design achieves state-of-the-art performance, with up to 27 tokens/s decoding throughput. Compared with the prior static design, PD-Swap achieves a 1.3x to 2.1x speedup, with larger gains at longer context lengths.
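The prefill-decode asymmetry can be made concrete with the classic roofline bound, attainable performance = min(peak compute, bandwidth × arithmetic intensity). The sketch below is illustrative only: the peak, bandwidth, and intensity figures are made-up placeholders, not measurements from the paper or from any specific FPGA.

```python
def attainable_gops(peak_gops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Classic roofline bound; intensity is in operations per byte moved."""
    return min(peak_gops, bandwidth_gbs * intensity)

def regime(peak_gops: float, bandwidth_gbs: float, intensity: float) -> str:
    """Classify a kernel by which roof limits it."""
    return "compute-bound" if bandwidth_gbs * intensity >= peak_gops else "memory-bound"

# Hypothetical platform and kernel numbers, chosen only to show the two regimes.
PEAK_GOPS = 500.0         # assumed peak compute
BANDWIDTH_GBS = 10.0      # assumed off-chip bandwidth
PREFILL_INTENSITY = 200.0 # dense matrix-matrix work reuses each byte many times
DECODE_INTENSITY = 2.0    # KV-cache streaming reuses each byte very little

for name, intensity in (("prefill", PREFILL_INTENSITY), ("decode", DECODE_INTENSITY)):
    print(name,
          regime(PEAK_GOPS, BANDWIDTH_GBS, intensity),
          attainable_gops(PEAK_GOPS, BANDWIDTH_GBS, intensity))
```

Under these assumed numbers the prefill kernel sits on the compute roof while decode is capped by bandwidth, which is why a single fixed dataflow tends to leave one of the two resources underused.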
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn automotive MCU design, multi-package compatibility is a mandatory requirement. However, supporting multiple package types significantly increases the complexity and cost for PDN (Power Distribution Network) signoff.
This paper introduces co-design optimization strategies for adaptive multi-package MCUs. By analyzing six package models for inductance, resistance, and IR drop, we identified package size, parasitic elements, and PG pad count as key factors impacting power integrity. To streamline verification, the worst package type was selected for final power integrity signoff, reducing signoff workload by approximately 80%. Furthermore, we implemented IR drop optimization strategies—including widening bondwires, adding ground ring, and deploying inner/intra pads—which improved IR drop by up to 10% and enabled die size reduction without compromising reliability. These innovations provide an effective optimization methodology for chip-package co-design, accelerating time-to-market while ensuring robust PDN performance.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionGeneral matrix–matrix multiplication (GEMM) remains the dominant compute kernel in transformer-based large language model (LLM) inference.
However, large-scale matrices exhibiting irregular sparsity fundamentally limit throughput and energy efficiency in existing systems.
To address these challenges, we propose PEACE, an energy-aware matrix compression framework for accelerating GEMM operations at the processing element (PE)-array granularity.
PEACE comprises a novel matrix compression algorithm, PEACE-Alg, which reorganizes large-scale matrices into hardware-friendly compressed formats by exploiting the column-wise energy distribution.
To support PEACE-Alg, we design a hardware microarchitecture, PEACE-Hw, that integrates RISC-V ISA extensions with a table-metadata fetcher and a metadata-driven operand loader to feed a reconfigurable systolic PE array, and a dedicated partial-sum adder for efficiently merging intermediate results.
Experimental results show that PEACE occupies 2.705 mm^2 in a 14 nm ASIC and delivers a peak INT8 throughput of 88 GOPS/W.
PEACE achieves 1.60×-1.67× speedup over a RISC-V core baseline across 14 transformer-based LLMs.
Compared with state-of-the-art designs, PEACE provides 4.4× higher PE density, 1.79× average speedup and 2.53× energy efficiency.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionFine-tuning Large Language Models (LLMs) on resource-constrained edge devices is a critical but challenging task, primarily due to the prohibitive memory and computational costs of backpropagation. While forward-only optimizers like MeZO mitigate these costs by eliminating the backward pass, they often suffer from slow and unstable convergence. To address this limitation, we introduce PeFoo, which integrates a carefully designed preconditioning strategy into the forward-only paradigm. Furthermore, we propose PeFoo-L to counteract the memory overhead introduced by the preconditioner. This approach constrains preconditioner storage and weight updates to a single layer per iteration, reducing the overall memory footprint and data traffic.
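For readers unfamiliar with forward-only optimization, the following is a minimal sketch of the MeZO-style zeroth-order step that this abstract builds on: the gradient is estimated from two forward passes along a random perturbation, so no backward pass is needed. It does not include the PeFoo preconditioner, whose design is not described here. The toy quadratic loss, learning rate, and function names are assumptions for illustration.

```python
import random


def mezo_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One forward-only update using a two-point (SPSA-style) gradient estimate."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]           # random direction
    plus = [p + eps * zi for p, zi in zip(params, z)]   # forward pass 1 input
    minus = [p - eps * zi for p, zi in zip(params, z)]  # forward pass 2 input
    # Scalar projection of the gradient onto z, from loss values only.
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [p - lr * scale * zi for p, zi in zip(params, z)]


def quadratic_loss(params):
    """Toy stand-in for a model loss, minimized at all-ones."""
    return sum((p - 1.0) ** 2 for p in params)


params = [0.0, 0.0, 0.0, 0.0]
initial_loss = quadratic_loss(params)
for step in range(200):
    params = mezo_step(params, quadratic_loss, seed=step)
final_loss = quadratic_loss(params)
print(initial_loss, final_loss)
```

Because each step moves along a single random direction, progress is noisy, which is the slow and unstable convergence that the abstract says preconditioning is meant to address.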
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionLarge Language Models (LLMs) are increasingly pushed to the edge to reduce latency, yet limited memory and compute make full-model deployment infeasible. Mixture-of-Experts (MoE) offers a lightweight alternative, but edge nodes can host only a few experts, creating persistent mismatches between dynamic query semantics and locally available expert models under resource constraints. We present Pegasus, a collaborative MoE inference system for multi-edge deployments that leverages the spatiotemporal correlations inherent in distributed query semantics. Pegasus integrates three innovations: (i) a similarity-based expert deployment mechanism using an efficient heuristic metric to assess the semantic relevance between queries and experts across nodes and over time; (ii) a personalized gating design that selectively fine-tunes a subset of gating parameters on each node to balance expert accuracy and communication latency; (iii) an intra-node online scheduling algorithm with adaptive batching for efficient memory utilization. Extensive performance evaluations corroborate that Pegasus achieves 11.8x lower inference latency than state-of-the-art distributed MoE frameworks, demonstrating high-throughput and communication-efficient edge LLM inference.
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionThe growing demand for low-latency, high-quality video encoding in wireless applications, such as online conferencing and cloud gaming, necessitates robust screen content capabilities in mezzanine codecs. However, some mezzanine codecs such as JPEG XS are inefficient when handling screen content, particularly text-rich scenes containing abundant high-frequency information. We therefore propose PEL, a low-complexity screen content enhancement layer that operates without modifying the core codec. The proposed enhancement layer is based on clustering and a palette algorithm, achieving an 18.13% BD-rate reduction on screen content datasets with 5H2V transform levels when validated with JPEG XS. The proposed method eliminates data dependencies and complex rate–distortion optimization processes, satisfying requirements for low complexity and low latency. To overcome hardware throughput bottlenecks, a scalable tri-core parallel architecture is developed for the palette algorithm. Implemented on the Xilinx ZCU106 evaluation board, the design achieves 4K@120FPS performance with only 19.7K LUTs and an additional latency of eight lines.
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionAsynchronous Traffic Shaping (ATS) bounds end-to-end delay in Time-Sensitive Networking (TSN) without relying on global synchronization, but its per-group queuing introduces inter-flow interference that causes large delay jitter for time-critical flows. To address this limitation, we present PFATS, a novel per-flow asynchronous traffic shaping architecture that integrates a per-flow frame eligibility time calculator, per-flow queues, and a Synchronous Shift Register Matrix (SSRM) scheduler. We implement PFATS on an FPGA-based TSN switch, and it supports 192 flows by incurring only 7.14% additional on-chip memory usage compared to conventional ATS. In realistic TSN scenarios, PFATS achieves microsecond-level end-to-end delay and bounds delay jitter within 5 μs, while maintaining high throughput.
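To make the idea of a per-flow frame eligibility time concrete, here is a simplified token-bucket sketch in the style of ATS, with one shaper state per flow. It is not the PFATS hardware design: the class and field names are hypothetical, the maximum-residence-time discard check of the standard is omitted, and time is measured in microseconds with the rate in bits per microsecond.

```python
from dataclasses import dataclass


@dataclass
class FlowShaper:
    """Per-flow token-bucket state (hypothetical names, simplified ATS-style)."""
    rate_bits_per_us: float
    burst_bits: float
    bucket_empty_time: float = 0.0
    last_eligible: float = 0.0

    def __post_init__(self):
        # Start with a full bucket: it would have been empty one fill-time ago.
        self.bucket_empty_time = -self.burst_bits / self.rate_bits_per_us

    def eligibility_time(self, arrival: float, frame_bits: float) -> float:
        recovery = frame_bits / self.rate_bits_per_us        # time to earn this frame's tokens
        empty_to_full = self.burst_bits / self.rate_bits_per_us
        scheduler_time = self.bucket_empty_time + recovery   # earliest time tokens suffice
        bucket_full_time = self.bucket_empty_time + empty_to_full
        eligible = max(arrival, self.last_eligible, scheduler_time)
        if eligible < bucket_full_time:
            self.bucket_empty_time = scheduler_time
        else:
            # The bucket was already full; tokens beyond the burst size are lost.
            self.bucket_empty_time = scheduler_time + eligible - bucket_full_time
        self.last_eligible = eligible
        return eligible


# One shaper per flow, so a burst on flow "a" does not delay flow "b".
shapers = {"a": FlowShaper(1.0, 2000.0), "b": FlowShaper(1.0, 2000.0)}
times_a = [shapers["a"].eligibility_time(0.0, 1000.0) for _ in range(4)]
time_b = shapers["b"].eligibility_time(0.0, 1000.0)
print(times_a, time_b)
```

In this example the first two frames of flow "a" fit within the burst allowance and later frames are spaced at the committed rate, while the frame on flow "b" is eligible immediately because its state is independent.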
Exhibitor Forum
DescriptionAs semiconductor and system complexity accelerates, critical IP artifacts—requirements, specifications, verification results, integration history, and compliance records—are often fragmented across teams and tools. An IP lifecycle management (IPLM) framework allows teams to manage all of these design artifacts as traceable metadata within a single system for complete visibility. When combined with Retrieval-Augmented Generation (RAG) and Model Context Protocol (MCP), IPLM transforms fragmented data into contextual, actionable intelligence. Engineering teams can query IP maturity, reuse readiness, dependency impact, and compliance status; compare versions and variants; and detect and resolve conflicts, while grounding every answer in authoritative sources. By enabling users to learn, query, and initiate workflows directly from validated lifecycle data, this approach delivers improved engineering efficiency, greater design reuse, and more informed decision-making across chips-to-systems development.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionWe address the challenges of 1:1 GR Tech., aiming to maximize its advantages and minimize its drawbacks. In this work, we propose 1) DR/PDN optimization, which minimizes the loss of routing resources by improving routability through more pin access points and better pin connections, and 2) an enhanced design flow that maximizes the RC benefit by increasing 1st Mx usage in critical paths, since 1st Mx has the best RC delay among the local routing layers. Our experiments show that a 0.9% performance gain can be achieved on a 580K industrial design using the DR/PDN optimization, which reduces total wire length by 1% and increases the Mx routing ratio by 2% compared to simple migration, and that a total 1.8% performance gain can be achieved on the 580K design through the enhanced design flow, which increases the Mx routing ratio in critical paths by an additional 4%.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionRecent AI-driven inverse design approaches have shown promise in synthesizing complex electromagnetic (EM) structures for radio-frequency (RF) circuits. However, existing methods suffer from two fundamental limitations: random topology generation often produces physically infeasible layouts, and pixel-based representations encode only topology while ignoring geometric dimensions, forcing complete dataset regeneration and model retraining whenever layout scale changes. To address these issues, we propose PeRIFi, a physics-constrained and geometry-aware inverse design framework built on three key innovations. (1) Feasibility-aware parameterization integrates B-splines, which guarantee direct-current (DC) connectivity, with level-set representations that enable flexible geometric variation, ensuring 100% physically feasible structure generation. (2) Explicit geometric encoding decouples topology from geometric dimensions, allowing a single surrogate model to generalize across multiple layout scales without regenerating datasets. (3) High-dimensional optimization employs Particle Swarm Optimization tailored to the proposed 268-dimensional feasible design space. Experimental results demonstrate substantial data-efficiency gains: with only 5k training samples, PeRIFi attains 73% lower MSE and a 6.7% higher R^2 than a pixel-based baseline trained on 20k samples. Furthermore, PeRIFi reduces optimization cost by 22.19%–34.71% and lowers prediction error (MAE) by 58.2% compared with state-of-the-art methods, enabling more accurate and scalable RF inverse design.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionSparse matrix multiplication (SpGEMM) is fundamental to various applications, yet conventional processors suffer from poor parallelism due to bandwidth limitations and pipeline stalls.
Therefore, we propose PF-GEMM, an accelerator that exploits hardware parallelism with three key methods. First, we introduce a novel sparse format named prefix-CSR (PFCSR), which merges redundant prefixes of nearby indices to reduce the amount of I/O. Second, with element-wise out-of-order scheduling, PF-GEMM achieves locally optimal data reuse to avoid pipeline stalls. Finally, PF-GEMM's cache employs a frequency-aware replacement policy to extend data residency. Overall, experiments demonstrate a gmean of 1.8x off-chip traffic reduction and 2.1x speedup over state-of-the-art accelerators.
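The abstract does not give the PFCSR layout, so the following is only one plausible reading of "merging redundant prefixes of nearby indices": split each column index into high-order prefix bits and low-order bits, and store the prefix once per run of consecutive indices that share it. The function names, the 8-bit low part, and the 8-bit run-length field are all assumptions for illustration.

```python
def encode_prefix_runs(indices, low_bits=8):
    """Group consecutive indices that share the same high-order prefix."""
    mask = (1 << low_bits) - 1
    runs = []  # list of (prefix, [low-order parts])
    for idx in indices:
        prefix, low = idx >> low_bits, idx & mask
        if runs and runs[-1][0] == prefix:
            runs[-1][1].append(low)       # reuse the stored prefix
        else:
            runs.append((prefix, [low]))  # start a new run
    return runs


def decode_prefix_runs(runs, low_bits=8):
    """Rebuild the original indices from (prefix, lows) runs."""
    return [(prefix << low_bits) | low for prefix, lows in runs for low in lows]


def encoded_bits(runs, index_bits=32, low_bits=8, count_bits=8):
    """Storage estimate: one prefix and run length per run, low bits per element.

    Ignores run-length overflow beyond what count_bits can represent.
    """
    per_run = (index_bits - low_bits) + count_bits
    elements = sum(len(lows) for _, lows in runs)
    return len(runs) * per_run + elements * low_bits


cols = [1024, 1027, 1030, 1100, 70000, 70001]  # column indices of one sparse row
runs = encode_prefix_runs(cols)
print(runs, encoded_bits(runs), 32 * len(cols))
```

The saving in this toy case comes entirely from index locality; rows whose indices rarely share a prefix would see little or no benefit under this assumed scheme.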
People
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionStandard cell library characterization is a critical bottleneck for timing closure. Existing machine learning (ML) surrogates reduce SPICE costs but are constrained by labeled data requirements, limiting accuracy. This work introduces PG-GNN, a Physics-Guided Graph Neural Network (GNN) framework integrating residual learning with uncertainty-driven active learning. It simulates sparse anchor PVT corners, builds a physics-consistent reference, and trains a GNN on residual errors. Active learning then queries high-uncertainty points to minimize labeling. In experiments on TSMC 16 nm libraries, PG-GNN reduces SPICE effort by 98.7% with 1.67% MAPE. When applied to ISCAS'89 benchmarks, it achieves 0.74% critical path delay mismatch versus foundry libraries. PG-GNN offers orders-of-magnitude speedups in characterization runtime while maintaining signoff-level accuracy, presenting a scalable solution for next-generation IC design flows.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionMulti-GPU systems rely on unified virtual memory to provide a global address space, which inevitably triggers page migrations. Our analysis shows that reducing migrations further is difficult even with advanced policies. These migrations invalidate page table entries (PTEs) across GPUs and generate migration-induced misses (M-misses), which constitute a large portion of all translation requests and significantly degrade performance. This paper introduces Phantom Walk, which accelerates M-miss address translation by reducing GPU-local page faults and page table walks through a shorter post-migration translation path. Our evaluation shows that Phantom Walk improves overall performance by 44.37% on average.
Work in Progress
DescriptionRecent advancements in large language models (LLMs) have demonstrated powerful capabilities across various application domains. At the same time, their quadratic computational complexity with respect to input sequence length remains a major bottleneck for efficient inference and deployment. To mitigate this problem, sparse attention methods that exploit the sparsity of the computationally intensive attention operations in LLMs have been proposed, but they still incur additional overhead to locate important tokens during inference, or suffer performance degradation because they use a fixed pattern that neglects differences in head-wise sparsity. In this paper, we propose a pre-analyzed head-wise attention pattern (PHAP) sparse attention method, which constructs head-specific patterns by analyzing position-dependent characteristics and the inherent sparsity variations across attention heads. This one-time pre-analyzed pattern construction eliminates runtime overhead during inference. In addition, the proposed method constructs static sparsity patterns tailored to the unique characteristics of each attention head, thereby maintaining the baseline performance. Experimental results show that attention computation is reduced by more than 75% without additional overhead while maintaining performance comparable to full attention in Llama-3. Moreover, the method achieves 1.17x faster inference than conventional sparse attention methods, demonstrating its effectiveness in maximizing the computational efficiency of LLMs.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSpatially partitioned heterogeneous accelerators (HAs) are increasingly adopted in embedded systems for their performance and flexibility. Yet most existing HA design frameworks optimize primarily for throughput or quality-of-service (QoS) metrics and often overlook safety-critical real-time requirements, including hardware support for predictable execution, real-time-aware design space exploration (DSE), and rigorous schedulability analysis. These requirements are essential in safety-critical applications such as smart transportation, where timing guarantees directly affect system safety. To address this gap, we present PHAROS, a real-time-centric HA design framework. PHAROS introduces preemption mechanisms and scheduler designs for spatially partitioned HAs under FIFO and earliest-deadline-first (EDF) policies. Leveraging modern real-time theory, we further develop a soft real-time (SRT) schedulability-oriented DSE with objectives and constraints tailored to timing correctness. Through comprehensive modeling, analysis, and evaluation across diverse applications, we show that PHAROS's schedulability-oriented DSE discovers more feasible configurations for a broader range of task sets than throughput-oriented DSE baselines, while delivering improved real-time performance. We also provide response-time analyses for the supported scheduling algorithms.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionMerkle trees, as a fundamental cryptographic primitive for ensuring user privacy, play a critical role in zero-knowledge proof (ZKP) systems. However, their construction involves numerous computationally intensive Poseidon hashes, creating the primary computational bottleneck in ZKP systems. To address this challenge, we propose PhotoMT, the first photonic-electronic collaborative Merkle tree engine. PhotoMT leverages a photonic microring resonator (MR) array for intensive matrix-vector multiplications to boost throughput and energy efficiency, while digital circuits handle control-intensive tasks. Furthermore, PhotoMT incorporates a multi-subtree interleaved execution strategy and an S-BOX bypass computation queue, which together improve hardware utilization and reduce memory overhead. Experiments show that PhotoMT boosts throughput by 18.8-20.5x over state-of-the-art ASIC-based designs and achieves an energy efficiency gain of 3-5 orders of magnitude over the CPU baseline.
Engineering Presentation
Chiplet
EDA
DescriptionThe proliferation of 3D Integrated Circuits (3DICs) and 2.5D chiplet-based architectures has introduced substantial complexity into physical signoff verification (PV) flows. These advanced designs feature heterogeneous die stacking, multi-node integration, Through-Silicon Vias (TSVs), micro-bumps, and silicon interposers—each posing unique verification challenges that exceed the capabilities of conventional EDA tools. Additionally, Electrostatic Discharge (ESD) signoff across stacked dies and inter-die interfaces requires novel modeling techniques and rule-checking strategies to ensure reliability. This paper presents Marvell's advanced PV methodology, which incorporates scalable rule-based and pattern-matching verification engines, automated topology extraction, and shift-left debugging strategies. The proposed framework enables early detection of physical and electrical violations across complex 3DIC topologies, supports multi-die and multi-foundry environments, and ensures comprehensive signoff coverage. Experimental results demonstrate significant improvements in verification throughput and a reduction in late-stage design iterations, validating the framework's effectiveness for next-generation heterogeneous systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDynamic IR drop in advanced nodes poses a critical challenge for SRAM memory placement, where dense instance packing and synchronized switching currents frequently cause localized power integrity violations. Conventional IR Drop analysis relies on full-chip simulation after placement, making it impractical for early-stage placement optimization.
This work presents a CNN-based IR Drop-risk prediction framework tailored for SRAM macros, leveraging physics-aware features including local 3×3 peak current patterns, bump-to-instance power resistance, bump-to-instance ground resistance, bump distance, and instance aspect ratio. The proposed model classifies IR Drop-critical SRAM instances with high recall, enabling conservative detection of worst-case power integrity risks.
By integrating the learned IR Drop-risk model into a placement-aware cost formulation, our approach identifies vulnerable SRAM regions without exhaustive simulation and provides actionable guidance for placement refinement. Experimental results demonstrate that the proposed method effectively captures worst-case IR Drop behavior in SRAM-dominated designs, offering a scalable and analysis-efficient alternative to traditional IR signoff flows.
This enables IR Drop-aware SRAM placement decisions at a fraction of the cost of traditional signoff-driven optimization.
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
DescriptionDesign of large-scale integrated circuits requires rigorous power integrity (PI) analysis to ensure reliable and robust operation. In particular, Direct Current (DC) PI analysis entails solving massive symmetric positive definite (SPD) systems derived from Kirchhoff's current law. Traditional PI evaluation methods are computationally expensive, making it difficult to satisfy the stringent turnaround-time requirements of modern design iterations. To address these challenges, we introduce PIANO, a physics-informed admittance neural operator for fast and high-fidelity DC PI analysis in 3D IC. PIANO first extracts an equivalent resistive network from the PDN layout and represents it as a graph. A graph neural network then learns the equivalent port admittance matrix between voltage-source ports and current sinks. The resulting reduced SPD system, formed from the predicted admittance matrix, is efficiently solved by a lightweight numerical solver to obtain port voltages. Experiments on industrial-scale PDNs show that PIANO achieves a 12.6x speedup over commercial simulators with only 1.95% voltage error, and a 17.2x acceleration when integrated into a PI-constrained design space exploration flow.
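The final step described in this abstract, solving a small symmetric positive definite (SPD) system for port voltages, can be illustrated with a plain Cholesky solve. This sketch covers only that step: the two-node conductance matrix is written by hand from Kirchhoff's current law and stands in for the admittance matrix that the paper predicts with a graph neural network. The network values and function names are assumptions for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L * L^T for an SPD matrix a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_spd(a, b):
    """Solve a x = b via Cholesky factorization and two triangular solves."""
    low = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


# Hypothetical network: a 1.0 V supply feeds node 1 through 10 S,
# node 1 feeds node 2 through 5 S, and node 2 sinks 0.5 A.
VDD, G_SUPPLY, G_12, I_SINK = 1.0, 10.0, 5.0, 0.5
conductance = [[G_SUPPLY + G_12, -G_12],
               [-G_12, G_12]]
injection = [G_SUPPLY * VDD, -I_SINK]
voltages = solve_spd(conductance, injection)
print(voltages)
```

The solution can be checked by hand: 0.5 A through 0.1 ohm drops 0.05 V to node 1, and a further 0.2 ohm drops another 0.1 V to node 2.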
People
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionHeterogeneous Graph Neural Networks (HGNNs) are widely used to capture structural and semantic information in heterogeneous graphs based on sequences of vertex and edge types (i.e., metapaths). To support HGNNs, several solutions have been proposed that adopt a source-centric approach, which concurrently processes metapath instances sharing the same source vertex. However, vertices and edges that are shared by different metapath instances and lie beyond the first hop of the same source vertex may be processed repeatedly, incurring irregular memory accesses and redundant edge computations. In this work, we observe that HGNN inference exhibits substantial vertex and edge overlaps across metapath instances, and that metapaths sharing a common pivot vertex (i.e., the center vertex bridging multiple metapaths) exhibit strong spatial similarity that effectively captures these overlaps. Based on these observations, we propose a pivot-centric accelerator named PiHG to effectively support HGNN inference. Specifically, PiHG introduces a novel pivot-centric execution model into accelerator design to concentrate feature accesses and computations around pivot vertices, which enables reuse of the feature vectors of overlapped vertices and edges across metapaths, thus eliminating redundant computations and reducing irregular off-chip memory accesses. Experimental results show that, compared with the state-of-the-art software and hardware solutions, PiHG achieves 13.1×∼236.8×, 2.2×∼11.7× speedups and 15.5×∼216.8×, 2.4×∼12.7× energy savings, respectively.
People
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionAccelerating SpMM and SDDMM on Tensor Cores is vital for GNNs and HPC. Current methods suffer from severe bottlenecks, including multi-level pointer chasing, lack of metadata coalescing for vectorization, and high format conversion overhead. We propose PillarSparse with the Pillar CSR format. It organizes non-zeros into "Pillar" (non-zero column) units and uses metadata coalescing to pack all metadata into two integers. This design removes a pointer-chasing level, enables a single vectorized load, and allows a high-speed conversion algorithm using Thrust primitives. The co-designed SpMM and SDDMM kernels use a warp-specialized producer-consumer pipeline and the tensor memory accelerator for efficient execution. Experiments show competitive performance.
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionDatabases using fully homomorphic encryption (FHE) protect outsourced data but suffer slow query speeds due to full scans of encrypted entries. To address this, we propose PIMA-SecDB, a processing-in-memory (PIM) architecture that accelerates FHE database operations. It adopts an integrated co-design of hardware and algorithms, featuring a multi-level, multi-channel structure for parallel processing and high bandwidth. Moreover, we remove costly circuit bootstrapping from the structured query language based on FHE and propose a quantitative method to select PIM-friendly encryption parameters, further reducing computational and data load. Experiments show PIMA-SecDB is 1.98 to 7.93 times faster than existing FHE accelerators.
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionGraph-based retrieval-augmented generation (RAG) improves the interpretability and factual consistency of large language models (LLMs) through structured knowledge graphs. Despite these benefits, graph-based RAG suffers from inefficient retrieval. The retrieval stage causes massive movement of vector and graph data between memory and processors. This movement leads to low arithmetic intensity and heavy pressure on memory bandwidth.
This work presents PIMGRAG, a heterogeneous architecture that accelerates graph-based RAG through hardware/software co-design.
At the hardware level, PIMGRAG designs a PIM architecture for the retrieval stage, which reduces off-chip data transfer by executing bandwidth-intensive operations near memory.
At the software level, PIMGRAG applies a lightweight scheduling method that orchestrates PIM and GPU execution and lowers idle time across stages.
Evaluation results show improvements in throughput, latency, and energy efficiency over CPU–GPU and existing PIM-based baselines.
People
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionProcess-in-memristor-memory (PiMM) offers promising in-/near-memory compute capability. However, existing PiMM designs support only a narrow set of model types and depend on coarse, static mapping strategies that overlook the vast design space, resulting in limited adaptability, scalability, and overall performance.
In this work, we propose a Process-in-Memristor-Memory Network on Chip (PiMM-NoC) architecture combined with a reinforcement learning (RL) based mapping framework to optimize on-chip latency for versatile AI workloads. PiMM-NoC integrates two types of tiles: PiMM tiles for weight-stationary operations and PU tiles for non-weight-stationary and nonlinear computations. A cycle-accurate architecture simulator is incorporated into an end-to-end hardware–software co-design framework that uses Monte Carlo Tree Search (MCTS) to automatically search efficient mapping strategies.
Experiments show that PiMM-NoC with the RL-based mapping framework achieves up to 3.45× speedup on DNNs and 3.85× on LLMs over existing mapping strategies, and up to 71.6× higher performance and 6.7× better energy efficiency compared to state-of-the-art AI accelerators.
People
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionEdge-side LLM inference is gaining importance, yet its single-query nature shifts the performance bottleneck from computation to memory traffic. Processing-in-Memory (PIM) can exploit high internal bandwidth, but on resource-constrained edge devices, the same DRAM must also serve host memory requests, causing frequent bank conflicts and global stalls. We present PIMony, a DRAM-PIM architecture that enables seamless co-execution of PIM and conventional memory operations. PIMony introduces (i) interrupt-based asynchronous PIM, which allows preemptible MAC execution to remove channel-wide stalls, and (ii) Dual-Path Subarray Access (DPSA), which permits concurrent PIM and memory access within a single bank through dual-row activation. These mechanisms jointly enable higher co-execution efficiency and stable LLM decoding latency under concurrent PIM and memory accesses.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
Description3D Gaussian Splatting (3DGS) delivers high quality and speed for 3D scene reconstruction. However, achieving real-time performance on edge platforms remains challenging due to pipeline inefficiencies. Our profiling reveals two major bottlenecks: (1) insufficient pipeline overlap due to the latency gap between sorting and rasterization, and (2) pipeline parallelism constrained by globally serialized preprocessing. To address these challenges, we present PipeGS, a 3DGS accelerator with algorithm–architecture co-design. At the algorithm level, we propose hierarchical Gaussian reuse, which shortens rasterization latency by eliminating redundant computation and removes 76.2% of pipeline bubbles. To further unlock full pipeline overlap, we introduce dynamic orthogonal partitioning, which breaks global serial dependencies and hides 63% of preprocessing latency. At the hardware level, PipeGS employs a customized layer-wise pipelined architecture that supports concurrent execution across stages. Implemented in a 28 nm technology, PipeGS achieves 1.96–3.34× higher area efficiency and 1.20–2.77× higher energy efficiency than SOTA 3DGS accelerators.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEmbedded RISC-V processors are often deployed in real-time and safety-critical systems due to their simplicity, predictability, and open architecture. However, these systems remain vulnerable to performance degradation caused by microarchitectural resource contention. This work investigates pipeline-based Denial of Service attacks on in-order embedded RISC-V cores, demonstrating how malicious firmware can exploit pipeline structures to significantly degrade system performance without violating functional correctness or privilege isolation. We design and implement representative attack primitives targeting instruction cache thrashing, long dependency chains, and branch misprediction, and compare them against baseline workloads. Both attack and baseline functions are simulated on an RTL-level RISC-V core. Performance counters such as cycle count, instructions retired, fetch stalls, and branch activity are logged and analyzed to quantify the performance degradation. Experimental results show performance degradation of up to 27% compared to baseline execution, with substantial amplification in instruction fetch stalls and pipeline inefficiencies. Our findings highlight the feasibility of user-mode pipeline-based DoS attacks in embedded RISC-V systems and motivate the need for lightweight mitigation strategies. We discuss potential defenses, including cache partitioning, performance-counter-based monitoring, and speculation throttling, and outline future research directions for secure and predictable RISC-V designs.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionZero-knowledge proofs, particularly succinct non-interactive variants (zk-SNARKs), are instrumental in verifiable computation without revealing private data. Zk-SNARKs are provably secure against cryptographic attacks; however, whether these systems can resist fault injection attacks requires further investigation. Previous related work focused on QAP-based zk-SNARKs such as Groth16. PLONK improves upon Groth16 with a universal and updatable trusted setup. In this work, we present PLONK-Hammer, which breaks the input privacy of PLONK via rowhammer. We inject faults into the secret inputs. Then we devise an algorithm to recover the secret using faulty proofs and polynomial commitment techniques. We evaluate PLONK-Hammer in gnark and successfully leak secrets with our attack.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionFront-end verification of complex SoCs requires realistic stimulus to validate protocol compliance and system-level functionality early in the design cycle. Traditional transaction-based verification methodologies rely on virtual transactors and extensive software testbenches to emulate peripheral behavior, which is time-consuming and error-prone for advanced protocols like USB. This paper introduces a novel approach that eliminates the need for custom device testbenches by leveraging real USB devices as stimulus in both simulation and emulation environments. The proposed solution integrates a USB Device Transactor with a Virtual Adapter built on the Linux USB framework and libusb library, enabling seamless communication between the DUT and a physical USB device. Demonstrated using USB mass storage devices, the methodology validates enumeration, control transfers, and bulk data exchanges without additional testbench logic. Experimental results show significant performance gains, with emulation achieving up to 300× speedup over simulation while maintaining accuracy. The approach is scalable to other USB device classes and offers a practical path for comprehensive system-level verification using real-world traffic.
People
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionThis work proposes a Parallel-Momentum Tempering Processing Architecture (PMTPA) featuring an elastic replica framework with time-division multiplexing and a system-level parallel pipeline. Key innovations include a similarity-driven local-field computation that halves local-field storage and simplifies arithmetic, an adaptive processing scheme supporting spin networks of varying scales, and a random-number-controlled shift mechanism to implement nonlinear functions, achieving a balance between computational accuracy and hardware cost. A prototype FPGA implementation supports up to 2,048 fully connected spins and 2–32 configurable replicas. Experimental results on large-scale Max-Cut and image-segmentation benchmarks demonstrate accuracy exceeding 99% and a 17,000× speedup compared with the CPU-based implementation.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionPrevious studies have verified that some special Performance Monitoring Unit (PMU) events are insecure. In this work, we further disclose that even when PMU events are securely designed and implemented, they can still be used to construct malicious attacks.
We develop and implement a fuzz testing framework to automatically identify and verify numerous PMU events. Using this framework, we discover 118 vulnerable PMU events. Leveraging these identified events, we introduce PMU-Faker, a precise timing mechanism that can implement most timing-based side-channel attacks without a timer. To accommodate variations in instruction-triggered cycle durations, PMU-Faker provides two timing mechanisms with different levels of granularity, thereby balancing precision and adaptability. Moreover, we use PMU-Faker to implement four attacks: executing transient execution attacks, extracting an AES key, attacking Intel Software Guard Extensions (SGX), and leaking information across virtual machines.
People
Late Breaking Results
DescriptionSparse kernels are widely used in applications ranging from scientific computing to machine learning. Among them, SpGEMM is particularly challenging due to irregular memory accesses, making it highly memory-bound and dominated by data movement. Processing-in-Memory architectures can mitigate this cost by performing computation within memory banks, but irregular sparsity often causes load imbalance and costly cross-bank accesses. To address these challenges, we propose Polaris, a PIM-aware mapping solution for efficient SpGEMM on HBM-based systems. Polaris leverages a row-wise product formulation and a sparsity-aware mapping strategy to improve locality and balance workloads across banks, reducing data movement and achieving, on average, 2× lower energy consumption and up to 8× speedup over prior designs.
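For readers unfamiliar with the row-wise product formulation mentioned in this abstract, the following is a minimal software sketch of it (often called Gustavson's algorithm). It is not the Polaris mapping itself: the dict-of-dicts matrix layout and the function name `spgemm_rowwise` are assumptions chosen for illustration, and no bank mapping or load balancing is modeled.

```python
from collections import defaultdict


def spgemm_rowwise(a_rows, b_rows):
    """Multiply sparse A by sparse B using the row-wise product formulation.

    Each matrix is a dict mapping a row index to a dict {column index: value}.
    This layout is an illustrative assumption, not a format from the abstract.
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = defaultdict(float)  # sparse accumulator for output row i
        for k, a_ik in a_row.items():
            # Row k of B is scaled by A[i, k] and merged into row i of C.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj
        if acc:
            c_rows[i] = dict(acc)
    return c_rows


if __name__ == "__main__":
    a = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
    b = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
    print(spgemm_rowwise(a, b))
```

Each output row depends only on one row of A and the rows of B that it references, which is why this formulation lends itself to distributing output rows across independent compute units.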
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionSequential-constraint characterization consumes up to 80% of library signoff runtime. Traditional path-based methods rely on internal-node observability and fragile topology analysis, requiring extra simulations. They fail fundamentally for toggle-unexpected arcs such as hold and removal, where no robust internal voltage transitions exist. We present power-path, an internal-probe-free method that extracts a physics-grounded supply-current signature from the mandatory reference-delay simulation, enabling universal constraint estimation across all arc types without additional SPICE costs. An affine calibration using lookup table vertices refines the raw estimator for nominal characterization; for statistical deployment, an online ridge regression progressively tightens search intervals across Monte Carlo samples. Evaluated on post-layout TSMC 5 nm and 12 nm libraries across 5 PVT corners, power-path achieves 1.5× speedup with affine calibration and 3.6× with statistical calibration, while maintaining signoff-grade accuracy. The approach requires no topology analysis or additional simulations. We open-source our code and data.
Work in Progress
DescriptionRowHammer (RH) remains a leading security threat in modern DRAM. Industry has standardized Per-Row Activation Counting (PRAC) for monitoring and Alert Back-Off (ABO) for mitigation, but ABO's pulse-only ALERT_n exposes little about which rows are at risk or how far recovery has progressed. Lacking actionable state, memory controllers fall back to coarse, pessimistic mitigation producing substantial performance loss. This paper presents PRAC-Auto, a practical, pin-compatible replacement for ABO that enables accurate, flexible, and efficient in-DRAM RH mitigation in PRAC-based systems. PRAC-Auto converts the standard ALERT_n into a lightweight handshake and introduces a subarray-status register read command, allowing the controller to discover which subarrays are affected and to track mitigation progress. With this finer-grained, progress-aware visibility, the controller continues serving requests to unaffected subarrays while the device autonomously refreshes only the vulnerable regions. Moreover, the autonomous mitigation can be extended or reduced via the handshake to reflect new traffic, improving security against adaptive performance attacks. Our evaluation demonstrates that PRAC-Auto nearly eliminates RH-mitigation performance overhead while providing robust protection.
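The idea of per-row activation counting with subarray-level visibility can be illustrated with a toy software model. This is only a sketch under stated assumptions: the class `PracDeviceModel`, the `affected_subarrays` status read, and the threshold handling are hypothetical names and simplifications, not the PRAC-Auto command set or the standardized PRAC/ABO behavior.

```python
class PracDeviceModel:
    """Toy model: per-row activation counters grouped into subarrays."""

    def __init__(self, rows_per_subarray, alert_threshold):
        self.rows_per_subarray = rows_per_subarray
        self.alert_threshold = alert_threshold
        self.counts = {}  # row index -> activations since last mitigation

    def activate(self, row):
        """Count one activation; return True if the row reached the threshold."""
        self.counts[row] = self.counts.get(row, 0) + 1
        return self.counts[row] >= self.alert_threshold

    def affected_subarrays(self):
        """Hypothetical status read: subarrays that hold at-risk rows."""
        return {
            row // self.rows_per_subarray
            for row, n in self.counts.items()
            if n >= self.alert_threshold
        }

    def mitigate(self, subarray):
        """Reset counters of at-risk rows in one subarray.

        The victim-row refresh itself is not modeled here.
        """
        for row, n in list(self.counts.items()):
            in_subarray = row // self.rows_per_subarray == subarray
            if in_subarray and n >= self.alert_threshold:
                self.counts[row] = 0


def can_serve(device, row):
    """Controller-side check: serve only rows in unaffected subarrays."""
    return row // device.rows_per_subarray not in device.affected_subarrays()
```

In this model a controller that knows which subarrays are affected can keep issuing requests to all other subarrays, rather than stalling the whole device while mitigation runs.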
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionTighter time-to-market requirements for modem SoCs drive real workload validation to move earlier into the pre-silicon phase. Legacy In-Circuit Emulation (ICE), a conventional approach for pre-silicon validation, was considered, but its dependence on physical testers and cabling causes high maintenance overhead, resource binding issues, and scalability limits.
This presentation describes our use of virtualization for integrating wireless network testers with hardware emulation. An existing virtual wireless UE solution has been used to support physical wireless testers without proprietary physical connections to the emulator, reducing maintenance overhead and relaxing resource binding. However, some hardware dependencies remained, and scalability was still constrained by physical equipment availability.
To resolve these limitations, we integrated a software-based wireless network tester, Anritsu Virtual ST, into the same virtual wireless UE solution. This represents the first integration of a commercial software-based wireless signaling tester with hardware emulation. This approach removes the need for physical testers and extra hardware, enabling flexible scalability while preserving the same architecture for both physical and software testers.
We demonstrated the feasibility of this integration through datapath integrity checks and basic functional tests, showing correct data transfer between the wireless network tester and Modem RTL on emulation.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionNVIDIA's Multi-Instance GPU (MIG) technology, which enables reconfiguration of large GPUs into smaller, independent slices, is a promising feature for high-performance AI inference servers. Our characterization reveals that the data preprocessing stage of AI inference causes a significant performance bottleneck in these MIG-based inference systems. We present PREBA, a hardware/software co-design targeting MIG inference servers. PREBA offloads data preprocessing to a latency-optimized, FPGA-based accelerator. Simultaneously, PREBA's analytical model-based dynamic batching system maximizes small vGPU utilization by creating optimal, input-aware batches. PREBA provides a 3.7x improvement in throughput, 3.5x in energy efficiency, and 3.0x in cost efficiency over a baseline that uses CPU-based preprocessing.
Research Manuscript
Systems
SYS4. Embedded System Design Tools and Methodologies
DescriptionDebugging complex FPGA prototypes of modern processors in embedded systems is challenging due to limited signal visibility and significant tracing overhead. Existing approaches struggle to balance execution efficiency with debugging capability, often requiring either expensive continuous tracing or heavyweight snapshots. We propose Prelude, a lightweight snapshot-based debugging framework that records only essential architectural states and memory footprint of the processor on FPGA. During replay, a short visibility warm-up reconstructs internal micro-architectural states, enabling cycle-accurate analysis. Implemented on BOOM and Rocket, Prelude provides comparable visibility to prior work while significantly improving debugging efficiency: 32.88× / 2191.2× speedup over DESSERT / ENCORE on BOOM, and 18.09× / 896.4× on Rocket.
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionComputation graphs are fundamental for GPUs to efficiently deploy workloads. They also bring a challenge of growing memory demands. Operator recomputation and operator fission are promising directions to mitigate peak memory usage. However, existing work relies on offline decisions in recomputation and fission, leading to an increase in computation latency for only a marginal reduction in memory usage. In this paper, we propose PRICE, which is a novel operator recomputation and fission framework for computation graphs. Central to this framework is the attainable arithmetic intensity that can effectively trade off between memory usage and computation latency. We design a compile-time modeling approach to parameterize arithmetic intensity with GPU workload characteristics. Then we develop a runtime instantiation approach to attain the arithmetic intensity for online decision-making. Experiments on NVIDIA RTX 4090 and A100 demonstrate that PRICE achieves 1.08× to 1.31× speedup while delivering a comparable reduction in peak memory usage, compared to the state-of-the-art work.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionAccurately forecasting GPU workloads is essential for AI infrastructure, enabling efficient scheduling, resource allocation, and power management. Modern workloads are highly volatile, multi-periodic, and heterogeneous, making them challenging for traditional predictors. We propose PRISM, a primitive-based compositional forecasting framework combining dictionary-driven temporal decomposition with adaptive spectral refinement. This dual representation extracts stable, interpretable workload signatures across diverse GPU jobs. Evaluated on large-scale production traces, PRISM achieves state-of-the-art results. It significantly reduces burst-phase errors, providing a robust, architecture-aware foundation for dynamic resource management in GPU-powered AI platforms.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionMicroscaling formats have emerged as prominent candidates for 4-bit quantization of modern AI models due to their fine-grained group-wise quantization granularity. However, such formats still exhibit fundamental accuracy degradations. In particular, MXFP4 and NVFP4 are limited by fixed shared-scale precision, and NVFP4 further cannot cover the full value range without an added FP32 scale. This paper presents PRISM, a microscaling format with a single 8-bit encoded group-level shared scale that adaptively allocates the shared scale's representation based on the relative importance of values. In our evaluation, PRISM surpasses the accuracy of conventional microscaling formats with only 0.62% area and 0.86% energy overhead.
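As background for the group-wise shared scale discussed in this abstract, the sketch below quantizes one group of values in the style of a conventional MXFP4-like baseline: a power-of-two scale shared by the group and 4-bit E2M1 element magnitudes. It does not implement PRISM's adaptive scale encoding, which the abstract does not specify. The function names and the round-to-nearest rule are assumptions made for illustration.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest magnitude: 6.0 = 1.5 * 2**2


def quantize_group(values):
    """Return (shared_scale, quantized_elements) for one group of floats.

    The shared scale is a power of two chosen from the largest magnitude
    in the group; each element is rounded to the nearest E2M1 magnitude.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        scaled = abs(v) / scale
        # Values above 6.0 saturate to 6.0 because it is the nearest entry.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
        quantized.append(math.copysign(nearest, v))
    return scale, quantized


def dequantize_group(scale, quantized):
    """Reconstruct approximate values from the shared scale and elements."""
    return [scale * q for q in quantized]
```

Because the scale is restricted to powers of two and shared across the group, small values in a group with one large outlier lose resolution, which is the kind of accuracy limitation the abstract attributes to fixed shared-scale precision.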
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionApproximate Nearest Neighbor Search (ANNS) at scale is constrained by the memory wall. By moving compute to memory, Processing-in-Memory (PIM) offers ample internal bandwidth but limited on-die compute, making arithmetic reduction crucial. We propose Prism, a PIM-based ANNS system co-optimizing vector pruning, distance evaluation, and host-PIM orchestration. It employs a proximity-aware vector pruner to leverage high intra-PIM bandwidth and dual-cluster affiliations to filter out distant vectors. Prism then performs sensitivity-ordered distance computation, prioritizing high-impact dimension segments and early terminating candidates once exclusion criteria are met. A stall-free host-PIM pipeline overlaps query preparation, PIM execution, and global ranking to sustain high throughput. Experiments show that Prism achieves 2.7-19.8x higher throughput over state-of-the-art systems.
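The early-termination idea in this abstract can be sketched in a few lines. The example below accumulates a squared L2 distance one dimension segment at a time and stops as soon as the partial sum exceeds a threshold. It is an illustrative sketch, not Prism's implementation: the function name, the fixed segment size, and the caller-supplied `segment_order` (standing in for a sensitivity ordering) are all assumptions.

```python
def bounded_sq_distance(query, vector, segment_order, segment_size, threshold):
    """Return the squared L2 distance, or None once it exceeds threshold.

    segment_order lists segment indices in the order they are evaluated;
    visiting high-impact segments first lets distant candidates be rejected
    after fewer arithmetic operations.
    """
    partial = 0.0
    for seg in segment_order:
        start = seg * segment_size
        stop = start + segment_size
        for q, v in zip(query[start:stop], vector[start:stop]):
            diff = q - v
            partial += diff * diff
        if partial > threshold:
            return None  # candidate excluded; remaining segments are skipped
    return partial
```

Since every term of a squared L2 distance is non-negative, the partial sum never decreases, so rejecting a candidate once the partial sum passes the threshold never discards a vector that would have qualified.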
People
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionRegister-Transfer Level (RTL) verification is a primary bottleneck, consuming 60–70% of development time. While Large Language Models (LLMs) show promise for RTL automation, their performance and research focus have overwhelmingly centered on RTL generation rather than verification. Current methods for RTL verification rely on large-scale proprietary models (e.g., GPT-4o) to generate Python-based functional references, incurring a high cost and raising data-privacy risks. To date, an end-to-end open-source solution for autonomous verification remains absent.
We introduce PRO-V-R1, the first trainable open-source agentic framework for autonomous RTL verification. Our contributions are threefold: (1) we design PRO-V sys, a modular agentic system that couples LLM-based reasoning with programmatic tool use for RTL verification; (2) we establish a data construction pipeline that leverages existing RTL datasets to build simulation-validated, expert-level trajectories tailored for supervised fine-tuning (SFT) RTL verification agents; and (3) we implement an efficient reinforcement learning (RL) algorithm that uses verification-specific rewards derived from program-tool feedback to optimize the end-to-end verification workflow. Our empirical evaluation demonstrates PRO-V-R1 achieves a 57.7% functional correctness rate and 34.0% in robust fault detection, significantly outperforming the base model's 25.7% and 21.8% (respectively) from the state-of-the-art (SOTA) automatic verification system. This configuration also outperforms large-scale proprietary LLMs in functional correctness and shows comparable robustness for fault detection.
People
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionProbabilistic computation plays an important role in trustworthy edge intelligence to quantify uncertainty, enhance robustness, reconstruct data, and protect privacy, but its adoption is limited by an orders-of-magnitude data throughput gap between Gaussian random number generation (GRNG) and computation, as well as by instruction overhead. This paper introduces probabilistic memory (p-MEM), a unified memory primitive that stores distribution parameters and samples directly at native memory bandwidth, where deterministic data becomes the zero-variance special case. Using a layout-validated p-MEM simulator, we comprehensively explore device choices, memory specifications, and technology nodes, showing that p-MEM can achieve >1000 GSa/s/mm² GRNG throughput (including memory arrays). Integrated into CPU/GPU systems, p-MEM reduces instruction count by up to 2.19×/4.37×, sampling latency by 562×/3.45×, and energy by 295.5×/3.53× for BNN workloads, providing a scalable hardware substrate for trustworthy probabilistic AI.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSparsity has become a defining characteristic of modern workloads, motivating the development of specialized accelerators to mitigate the challenges inherent to sparsity. Sparse streaming accelerators have proved to be an effective solution, yet current designs are restricted to single-workload execution. Additionally, they exhibit significant underutilization in processing elements (PEs) due to their underlying non-zero scheduling strategies. To address these limitations, we propose Procyon, a fine-grain multi-tenancy framework that fuses the PE instruction streams of multiple workloads into a unified execution schedule. This allows instructions from different workloads to execute on the same PE, thereby improving its utilization and enabling concurrent execution of multiple workloads on a single accelerator instance. We evaluate Procyon on an AMD Alveo U55C FPGA using workloads from the SuiteSparse dataset and show that it substantially reduces PE underutilization, resulting in a 3x speedup over state-of-the-art sparse streaming accelerators (Serpens and Chason) and a peak throughput of 61.2 GFLOP/s.
Research Special Session
AI
DescriptionIC design is inseparable from IC design automation, and in recent years has also become inseparable from AI and ML boosters for automation and optimization (and people). Here, we will review recent progress and a roadmap leading toward LLM-based, tool-using AI agents that function as automated EDA R&D software engineers. These agents will be expert developers of new EDA tools (not expert users of existing EDA tools). They will be capable of writing bespoke tool code to meet designer- and design-specific needs, accelerating innovation for both designers and EDA providers. The talk will present specific waypoints and proofpoints that span documentation, ideation, planning, code implementation that draws on research literature, and more. Among the key messages: (1) docs are foundational because they determine the quality of the LLM's context; (2) autonomous, closed-loop improvement of complex EDA heuristics and code is feasible today; and (3) open source is a crucial part of this trajectory.
Engineering Presentation
EDA
DescriptionTimeout failures are among the most resource-intensive issues in System-on-Chip (SoC) verification, frequently manifesting as silent execution hangs with no explicit diagnostic signature. In large-scale regression environments, such failures can consume a significant fraction of total simulation resources and account for a disproportionate share of debug effort. Conventional mechanisms, including global watchdogs, UVM heartbeats, and waveform-based analysis, typically provide only reactive termination without isolating the underlying root cause across RTL, testbench, or firmware layers.
This paper introduces a deterministic and protocol-aware framework for timeout root-cause isolation in UVM-based verification environments. The framework integrates distributed telemetry agents, localized watchdogs, and transaction-aging monitors across verification layers to enable real-time correlation of causal dependencies. Timeout events are automatically classified into architectural, testbench, and system-level handshake categories using temporal and protocol-aware analysis.
Across more than 500 large-scale regression runs, the proposed framework reduced average timeout debug effort by 65–75%, lowered re-run frequency by approximately 40%, and improved overall schedule predictability. By transforming non-diagnostic global hangs into localized, actionable failure signatures, this approach enables structured timeout debug and proactive failure prevention in modern SoC verification flows.
People
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionThe prevalence of nonlinear systems in safety-critical domains calls for controllers with safety guarantees, while vision-based control, which relies on high-dimensional images, complicates both decision-making and formal safety analysis under uncertainty. This paper proposes a provably safe controller synthesis method for vision-based neural network control systems. We first employ a conditional generative adversarial network (cGAN) to approximate the mapping from system states to visual observations and combine it with RL-based pretraining to build a verifiable closed-loop structure. A data-driven model quantifies uncertainties from environmental perturbations, while martingale theory guides the learning of a stochastic barrier certificate (SBC) to provide rigorous probabilistic safety bounds. Furthermore, counterexamples from verification are used to alternately refine both the controller and certificate networks, ultimately yielding a controller with formally provable probabilistic safety guarantees. Experimental results on widely studied benchmarks demonstrate the efficiency and effectiveness of our approach.
People
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionThe propositional Satisfiability (SAT) problem is fundamental to many applications in Electronic Design Automation (EDA). This paper presents PRS, an efficient and comprehensive parallel SAT framework. We introduce two key techniques to enhance solver performance. The first is a lightweight preprocessing method called Resolution Checking, which efficiently simplifies circuit-encoded CNFs. The second is a new hybrid diversification strategy that combines a Regular Shifting method for the initial branching order with a parallel local search to generate diverse initial variable phases. PRS also supports extensive preprocessing, dynamic clause sharing, reproducible parallel solving, and parallel proof generation. Extensive experiments on the SAT Competition 2025 (SC25) benchmark demonstrate the effectiveness and scalability of our framework: PRS outperforms the SC25 Parallel-Track winner MallobSAT by solving 12 more instances with an 11.4% better PAR2 score, while also achieving a 4.7x speedup on 64 cores.
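For readers unfamiliar with the PAR2 metric cited above, a minimal sketch follows. PAR2 is commonly computed as the average runtime where each unsolved instance is charged twice the time limit. The runtimes and the 5000-second limit below are illustrative values, not results from the paper.

```python
def par2(runtimes, timeout):
    """Penalized average runtime: unsolved runs (None or over the limit) cost 2 * timeout."""
    if not runtimes:
        raise ValueError("need at least one instance")
    total = 0.0
    for t in runtimes:
        solved = t is not None and t <= timeout
        total += t if solved else 2 * timeout
    return total / len(runtimes)


# Three illustrative instances: two solved, one unsolved (None).
print(par2([10.0, None, 100.0], timeout=5000))  # 3370.0
```

A lower PAR2 score is better, since it rewards both solving more instances and solving them faster.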
People
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionPoint-based neural networks have become the mainstream approach for point cloud processing. However, they rely heavily on computationally intensive neighbor search operations such as K-Nearest Neighbors (KNN) and ball query, which create significant bottlenecks for deployment in resource-constrained environments. In this paper, we present PS-CAM, a novel NOR-CAM-based accelerator designed to support both ball query and KNN efficiently. PS-CAM converts the ball-query operation into a set of range search problems, formulating a one-shot search via the RENÉ range query scheme. To enhance energy efficiency within this scheme, a Two-Level Region-Partitioned (TLRP) technique is adopted, which reduces the number of simultaneously activated CAM banks. For KNN search, PS-CAM employs a progressive method through a series of ball queries with increasing radius. The search iterations are bounded by the Density-Aware Radius Estimation (DARE) method, which rapidly approximates the KNN-equivalent ball radius (KEBR), thereby drastically reducing the need for repeated queries. Compared with prior works, PS-CAM demonstrates significant performance gains, achieving speedups of 1.25~1.33× and an energy efficiency improvement of 142× in ball query. For the KNN task, PS-CAM delivers a speedup of 4.6~27.2× and energy-efficiency improvements of 7.1~2578×.
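The progressive KNN idea (repeat ball queries with a growing radius until at least k points are found) can be sketched in software as follows. This is a brute-force illustration only: it does not model the CAM hardware, and the geometric radius growth (`r0`, `growth`) is a stand-in assumption for the DARE estimate, whose details are not given in the abstract.

```python
import math


def ball_query(points, center, radius):
    """Return indices of all points within `radius` of `center` (brute force)."""
    return [i for i, p in enumerate(points) if math.dist(p, center) <= radius]


def knn_by_growing_ball(points, center, k, r0=0.1, growth=2.0):
    """Find the k nearest neighbors via ball queries of increasing radius."""
    if k > len(points):
        raise ValueError("k exceeds the number of points")
    radius = r0
    hits = ball_query(points, center, radius)
    while len(hits) < k:            # too few neighbors: enlarge the ball and retry
        radius *= growth
        hits = ball_query(points, center, radius)
    # Every point outside the ball is farther than every point inside it,
    # so the k nearest neighbors are among the hits.
    hits.sort(key=lambda i: math.dist(points[i], center))
    return hits[:k]


cloud = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (5, 5, 5)]
print(knn_by_growing_ball(cloud, (0, 0, 0), k=2))  # [0, 1]
```

A better initial radius means fewer loop iterations, which is the role the abstract assigns to the radius estimation step.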
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe increasing integration of distributed energy resources, storage, and flexible loads makes Energy Management System (EMS) design a complex system-level problem involving strong interactions among control software, hardware, environmental uncertainty, and economic objectives. The volatility of renewable generation and demand creates large operational variability, making manual, scenario-driven design insufficient for exploring the constrained and high-dimensional EMS design space.
This paper presents a system-level design automation framework based on the Portable Stimulus
Standard (PSS) and constrained-random simulation. The framework models EMS controllers, storage, renewable sources, loads, and grid interfaces in a unified, executable specification and automatically generates diverse operational scenarios. Simulation results are analyzed using data-driven methods to extract performance, safety, and economic metrics and to guide directed, multi-objective design space exploration.
We introduce a CPU–Memory EMS Abstraction that models EMS control as CPU-like and storage as memory-like, enabling modular modeling and systematic scenario generation. The proposed approach transforms EMS design into a scalable, automated, and data-driven process and demonstrates the use of PSS beyond verification for system-level design optimization.
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
DescriptionInverse lithography (ILT) is critical for modern semiconductor manufacturing but faces the challenge of non-convex optimization, which can lead to sub-optimal local minima. Although advances in model architectures and loss functions have improved performance and reduced the number of post-ILT refinement iterations, two critical issues remain unresolved, leaving generative AI–based inverse solutions sub-optimal: 1) The training dataset is often sub-optimal; simply mimicking its behavior does not necessarily yield higher-quality masks. 2) Post-training ILT refinement involves navigating a highly non-convex manifold. To address this, we reformulate the generative model G as a learned distribution over the mask space conditioned on designs. Using a Style-Aware GAN pre-trained on a large design dataset, we introduce a fine-tuning stage that combines policy optimization with imitation learning. This trains the GAN to generate masks that are both high-quality and robust, requiring minimal subsequent numerical refinement. Our hybrid framework mitigates the sub-optimal traps of conventional ILT, improves mask quality, and reduces optimization time, offering advantages beyond what traditional solvers can achieve.
Research Special Session
EDA
DescriptionQuantum and AI are converging from chips to systems—a cross-layer opportunity. I'll position quantum-AI as design automation: noise-aware synthesis/compilation and verification mapped to real devices; interpretable QNNs (Quantum Grad-CAM) for debug and coverage; privacy-preserving and federated QML for secure, distributed training; "learning to measure" that treats observables as tunable design knobs; and multi-chip ensemble VQCs that improve trainability and robustness under realistic constraints. We'll show how these plug into EDA flows—accelerating design-space exploration, improving PPA/quality-of-results, and reducing calibration/iteration time—on workloads from HEP analytics to power-grid control. I'll close with an integration agenda for QAI toolchains, metrics, and benchmarks that hardware and CAD teams can use now while we scale to larger systems.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionIn fault-tolerant quantum circuit synthesis, T gates supplied via magic states dominate space–time cost, while Clifford gates incur negligible overhead. Conventional flows minimize AND count in an {XOR, AND, NOT} basis as a proxy for T, which neglects phase cancellation and can be far from T-optimal. We instead formulate an exact T synthesis problem and canonicalize Boolean functions under Clifford equivalence. By precomputing T-optimal implementations up to seven variables and developing a specialized mapper, we reduce the T count by up to 16% on EPFL benchmarks and improve the best-known T counts of several cryptographic modules.
Engineering Special Session
EDA
Quantum
DescriptionThis talk introduces a high-level approach to quantum software development that draws directly on EDA methodology. Rather than coding at the gate level, engineers describe quantum algorithms as functional models — specifying behavior, constraints, and optimization objectives — and allow a synthesis engine to automatically generate optimized, hardware-aware quantum circuits. The approach mirrors how RTL synthesis transformed chip design: abstract once, optimize and target anywhere.
People
Research Special Session
EDA
DescriptionThe performance of quantum processors is critically limited by the control stack, where traditional systems often present a black-box abstraction, hindering full-stack co-design. To address this, we present QubiC, a novel, open-source quantum control system designed to bridge this gap. QubiC provides a vertically integrated framework that exposes low-level hardware control while maintaining high-level abstractions. Its architecture features native, high-bandwidth links to GPUs, enabling advanced techniques such as AI-driven automated calibration and enhanced quantum state discrimination. We have successfully validated QubiC for high-fidelity control on superconducting processors. By removing these abstraction barriers, QubiC provides a powerful tool to accelerate the collective progress toward robust, high-performance quantum computers through hardware-software co-design.
Work in Progress
DescriptionEfficient inter-core communication (ICC) is vital for performance and scalability in multi-core systems. We present R5-Link (RISC-V Link), a hardware–software co-designed ICC mechanism enabling low-latency, contention-free message passing across a 2D mesh network. Unlike traditional shared-memory or scratchpad-based methods, R5-Link uses dedicated bi-directional hardware queues and lightweight MPI-compatible APIs for direct neighbor-to-neighbor data exchange. This design eliminates complex synchronization and memory contention, improving communication efficiency without interfering with core execution. Implemented on a PULP RISC-V SoC (16 nm FinFET), R5-Link achieves 6.6–14.8× lower latency and 6.6–41× higher throughput compared to software-based MPI communication.
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionRetrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating information retrieval from knowledge databases, significantly improving the generated results' accuracy, relevance, and contextual richness. An in-depth analysis of RAG reveals that its diverse operations are primarily constrained by memory bottlenecks. The diversity and continual evolution of RAG algorithms further increase system design complexity. In this paper, we introduce RAGNMP, a general-purpose Near-Memory Processing (NMP) accelerator designed for RAG. Specifically, we first propose an enhanced and quantified elimination tree variant that simultaneously explores data placement, task parallelism, and pipelining to better support RAG workloads on NMP architectures. It also remains adaptable to algorithm changes in RAG. We further propose a general-purpose NMP architecture with a flexible processing unit that efficiently supports diverse memory-bound operations in RAG. Experimental results show that RAGNMP outperforms the state-of-the-art RAG system and accelerator.
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionThis work presents a unified RISC-V extension for Post-Quantum Cryptography (PQC) that emphasizes versatility and openness. The design exposes a compact set of custom scalar instructions via the Core-V eXtension Interface (CV-X-IF) and a modular PQ-ALU that accelerates the dominant kernels across hash-, lattice-, and code-based schemes: Keccak processing, randomness sampling, modular and polynomial arithmetic, finite-field operations, and coefficient compression. The system supports standardized algorithms—ML-KEM, ML-DSA, SLH-DSA, and HQC—as well as candidates under evaluation. The full hardware and software stack is released as open source to enable reproducibility and community reuse. We provide ASIC results from post-synthesis characterization in 65 nm CMOS, reporting instruction-cycle counts, along with power and energy estimates for each custom operation.
Overall, the proposed extension delivers a compact (∼48 kGE) and practical path to crypto-agile, energy-efficient PQC on RISC-V while preserving software compatibility and a clean integration into existing cores.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionCoarse-grained reconfigurable arrays (CGRAs) are programmable hardware devices having large coarse-grained processing elements (PEs) and word-wide configurable interconnect. The interconnect
comprises a considerable fraction of total CGRA area, making it desirable to restrict interconnect connectivity richness. However, reducing interconnect richness makes application mapping more
difficult, and impossible in some cases. We propose techniques for mapping applications onto CGRAs with restricted routing architectures. For a target CGRA having restricted interconnect, we precompute the reachability and distance between all PE pairs, as well as an estimate of the number of available routing paths between PE pairs. The values are stored in tables, and used during the placement stage of the mapping flow to assess the potential routability of an intermediate placement. For benchmark applications mapped onto CGRA variants with restricted interconnect, results demonstrate higher mapping success, and lower mapping runtimes vs. recent state-of-the-art CGRA mappers.
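The precomputed tables described above can be sketched as a breadth-first search from every PE over the directed interconnect graph. In this sketch the path estimate is the number of distinct shortest paths, which is an assumption made for illustration; the abstract does not say how the authors estimate path availability. The function name `build_tables` and the four-PE example are hypothetical.

```python
from collections import deque


def build_tables(adj):
    """adj maps each PE to the PEs it can reach in one hop (directed links).

    Returns (dist, paths): dist[src][dst] is the hop count and paths[src][dst]
    is the number of distinct shortest paths. A missing dst key means unreachable.
    """
    dist, paths = {}, {}
    for src in adj:
        d = {src: 0}
        n = {src: 1}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:              # first visit fixes the shortest distance
                    d[v] = d[u] + 1
                    n[v] = n[u]
                    queue.append(v)
                elif d[v] == d[u] + 1:      # another shortest path into v
                    n[v] += n[u]
        dist[src], paths[src] = d, n
    return dist, paths


# Four PEs with one-directional links only: 0 -> 1 -> 3 and 0 -> 2 -> 3.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
dist, paths = build_tables(adj)
print(dist[0][3], paths[0][3], 0 in dist[3])  # 2 2 False
```

During placement, a lookup such as `dst in dist[src]` gives a quick reachability check before any routing is attempted.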
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionNext-generation scale-up and scale-out AI systems require high-radix switches that deliver high bandwidth, low latency, and non-blocking performance while scaling to hundreds of ports. Traditional switch fabrics are typically implemented using large crossbars, which satisfy performance requirements but present significant implementation and physical design (PD) challenges. As port counts increase, crossbars become increasingly difficult to develop, floorplan, route, and time, and pose fundamental barriers to scaling across chiplets for continued growth in number of ports.
This work presents NeuraScale, a scalable, non-blocking switch fabric architecture designed to address these challenges. The fabric employs a hybrid topology that combines the scalability and PD regularity of mesh-based structures with Clos-style connectivity to provide system-level non-blocking behavior. The architecture is constructed from repeatable, PD-friendly tiles that can be replicated to build large fabrics, including across chiplets, enabling rapid design iteration and deployment.
Performance evaluation of a 128×800G (102.4 Tb/s) AI switch under permutation traffic demonstrates 100% peak throughput with flat latency, while a conventional mesh saturates at 73% throughput with sharply increasing latency. This approach enables practical realization of high-radix, chiplet-ready AI switches for large-scale AI systems.
Research Special Session
Systems
DescriptionFlexible Electronics (FE) enable lightweight, conformable, and low-cost wearable devices. Still, their limited integration density and device dimensions impose tight area and power constraints, making on-substrate neural network classifiers challenging. A fully digital realization requires full-bandwidth sensor readout and power-hungry ADCs, whereas analog classifiers can operate directly on continuous-time signals on the flexible substrate and use only a simple comparator at the output. Wearable sensing pipelines must tolerate sensor noise, motion artifacts, and baseline drift, which places strong robustness demands on low-precision hardware. Existing analog neural networks in large-area electronics use smooth, monotonic activation functions that offer limited resilience to such disturbances. Radial Basis Function (RBF) kernels, by contrast, provide bounded, local similarity responses naturally compatible with current-mode analog implementations and help suppress signal excursions. We propose RBF-based activation kernels for analog neural networks in FE and demonstrate, via system-level experiments, improved robustness over standard activations at comparable circuit complexity.
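The contrast between a bounded, local RBF response and a monotonic activation can be seen numerically. The sketch below assumes a Gaussian RBF with hypothetical `center` and `width` parameters and uses tanh as a representative monotonic activation; neither choice is taken from the paper.

```python
import math


def rbf(x, center=0.0, width=1.0):
    """Gaussian radial basis function: peaks at `center`, decays toward 0 away from it."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))


for x in (0.0, 1.0, 10.0):   # 10.0 stands in for a large signal excursion
    print(f"x={x:5.1f}  rbf={rbf(x):.3e}  tanh={math.tanh(x):.3f}")
```

For the large input the RBF output collapses toward zero, while tanh stays saturated near one, which illustrates how a local kernel can suppress excursions such as motion artifacts.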
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionBackground:
Reset Domain Crossing (RDC) analysis is crucial for ensuring the robustness and reliability of complex digital designs and is part of the sign-off checks in today's designs. Because of design complexity, the large number of reset domains and asynchronous resets requires a proper RDC sign-off methodology to avoid metastability, glitches, and other functional problems. However, traditional RDC verification methodologies often generate excessive noise and violations, hindering accurate identification of genuine issues. This paper presents a novel approach to significantly reduce noise and enhance result accuracy in RDC analysis by leveraging two key features: Skip Resetless Flow and 3PIP Soft Reset Handling.
Methodology:
In the RDC flow, designers are expected to enable these features. The functionality of each feature is described below.
1. Skip Resetless Flow:
Modern designs use resetless flops to reduce power consumption and silicon area and to improve timing performance. In most cases these flops are placed such that the reset path does not contribute to metastability in the design.
The skip_resetless flow addresses the challenge of false violations arising from the propagation of data corruption through resetless flops in the current flow. In conventional RDC analysis, tools are primarily configured to propagate corruption and report violations on the first sequential element after the source, even if that element is a resetless flop. Most of the time, resetless flops do not contribute any metastability. This leads to an inflated number of violations, obscuring critical RDC issues and prolonging verification cycles.
Our proposed solution utilizes the skip_resetless_flops argument within the RDC flow. By setting this argument to true, we enable the "Skip Resetless Flow", effectively bypassing the propagation of corruption through resetless flops. This targeted approach significantly reduces the number of reported violations and focuses attention on genuine RDC hazards.
In the design below, if the skip_resetless_flops argument is set to true, the RDC tool skips the dest1, skip1, and skip2 resetless flops and reports a violation between src1 [rst1] and obs_seq [rst2].
Note: Sequential elements with asynchronous set/reset pins tied to inactive constant values are also treated as resetless flops, such as dest1 and skip2 in the above case.
2. 3rd-Party IPs Soft Reset Handling:
Handling 3rd-Party IPs (3PIPs) is the most crucial problem in our current designs, as many 3PIPs are used today. Most of the time these IPs are not run flat at block level, in order to reduce runtime and avoid unnecessary debugging within the IPs. In the current methodology, when these IPs are blackboxed, all reset interfaces are treated as asynchronous, which creates many violations between the 3PIP and block-level logic.
With this new custom configuration, users can perform soft-reset analysis for RDC paths that cross between the inside of the 3PIP and external logic, without treating all reset paths as asynchronous. The approach still treats these 3rd-Party IPs as blackboxes and analyzes them without affecting tool runtime.
Config Setting: configure_setup_check -enable_ip_block_check_for_soft_reset true
Usage: configure_ip_block -names { <3PIP list with space separated>}
The example scenarios below clarify what is and is not covered by this methodology.
1. Scenarios where soft-resets are generated and consumed internally to the 3PIP are not considered for the analysis.
2. Scenarios where resets or soft-resets are generated internally to the 3PIP but consumed externally are considered for the analysis.
3. Scenarios where resets or soft-resets are generated externally to the 3PIP but consumed internally in the 3PIP are considered for the analysis.
Results and Conclusions:
Soft-Resets Reported
Design         With configuration    Without configuration
SubSystem_1    403                   707
SubSystem_2    5                     13
SubSystem_3    67                    567
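The relative reduction implied by the reported counts can be worked out directly. The short sketch below uses only the numbers listed in the results; the percentages are derived here and are not stated in the original submission.

```python
# (with configuration, without configuration) soft-resets reported per design
results = {
    "SubSystem_1": (403, 707),
    "SubSystem_2": (5, 13),
    "SubSystem_3": (67, 567),
}


def reduction_percent(with_cfg, without_cfg):
    """Percentage drop in reported soft-resets when the configuration is enabled."""
    return round(100 * (without_cfg - with_cfg) / without_cfg, 1)


for design, (with_cfg, without_cfg) in results.items():
    print(design, reduction_percent(with_cfg, without_cfg))
# SubSystem_1 43.0
# SubSystem_2 61.5
# SubSystem_3 88.2
```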
In conclusion, this methodology demonstrates the usage of the "Skip Resetless Flow" and "3rd Party IPs Soft Reset Handling" in refining RDC analysis by minimizing noise and pinpointing genuine violations. By strategically bypassing resetless flops, this method offers a more precise and efficient approach to identifying and mitigating RDC hazards, ultimately leading to more reliable designs. Furthermore, we extend this enhancement by introducing custom configuration for "3rd Party IPs soft-reset handling." This feature allows users to selectively analyze RDC paths interacting between internal and external domains of a 3PIP, enabling focused verification of critical reset behaviors and also reducing the noise. The combined application of "Skip Resetless Flow" and tailored "3PIP Soft-reset handling" provides a comprehensive and efficient solution for robust RDC verification in complex digital designs.
Applicability:
Highly applicable to a broad range of users dealing with complex SoCs, especially those utilizing numerous reset domains, asynchronous resets, and 3rd-Party IPs.
The techniques are readily implementable within standard RDC tool flows.
Challenge/Idea:
The fundamental issue of false violations due to resetless flops and blackboxed 3rd-Party IPs reset handling is a common industry problem. However, the specific combination and implementation of "Skip Resetless Flow" and "3rd-Party IPs Soft Reset Handling" provides a novel and effective solution.
Novelty:
Traditional RDC analysis often treats all reset crossings as potential hazards, leading to a high volume of false positives. This submission introduces targeted filtering. The "Skip Resetless Flow" provides a precise mechanism to bypass known non-contributing resetless flops, significantly reducing noise.
The "3rd-Party IP Soft Reset Handling" provides a custom configuration that allows the user to perform soft-reset analysis for RDC paths that are interacting between internal to the 3PIP and the external paths. Standard industry practices often struggle to efficiently handle reset interactions with blackboxed 3rd-Party IPs, leading to either excessive violations or incomplete analysis. This combination provides a very detailed and customizable approach to RDC analysis that is not commonly seen.
RDC analysis is crucial for ensuring the robustness and reliability of complex digital designs. Reset Domain Crossing analysis is a part of Sign-Off checks in today's design. Due to complexity of the design, a huge number of reset domains and asynchronous resets require a proper reset domain crossing sign-off methodology to avoid metastability, glitches and other functional problems. However, traditional RDC verification methodologies often generate excessive noise and violations, hindering accurate identification of genuine issues. This paper presents a novel approach to significantly reduce noise and enhance result accuracy in RDC analysis by leveraging two key features, Skip Resetless Flow and 3PIP Soft Reset Handling.
Methodology:
In the RDC flow, designers are expected to enable these features. Let's understand the functionality of both of these features to start with,
1. Skip Resetless Flow:
In today's modern design we use restless flops to reduce the power consumption, silicon area and improve the timing performance. Most of the cases these flops are placed so to keep in mind that reset path will not contribute to the metastability of the design.
Using the skip_resetless flow, we address the challenge of false violations arising from the propagation of data corruption through resetless flops in current flow. In conventional RDC analysis the tools are primarily configured to propagate corruption and report violations on the first sequential element after the source, even if these elements are resetless flop. Most of the time restless flops are not contributing any metastability. This leads to an inflated number of violations, obscuring critical RDC issues and prolonging verification cycles.
Our proposed solution utilizes the skip_resetless_flops argument within the RDC flow. By setting this argument to true, we enable the "Skip Resetless Flow" effectively bypassing the propagation of corruption through resetless flops. This targeted approach significantly reduces the number of reported violations and focuses attention on genuine RDC hazards.
In the below design, if the skip_resetless_flops argument is set to true, RDC tool skips the dest1, skip1, and the skip2 resetless flops and reports a violation between the src1 [rst1] and obs_seq [rst2].
Note: Sequential elements with asynchronous set/reset pins tied to an inactive constant values are also treated as resetless flops, such as, dest1 and skip2, in the above case.
2. 3rd-Party IPs Soft Reset Handling:
Handling 3rd-Party IPs is the most crucial problem statement we have in our current design as lots of 3PIPs are used today. Most of the time these IPs are not run flat at block level to reduce the runtime and unnecessary debugging within the IPs. In current methodology, when we blackbox these IPs, all reset interfaces are treated as asynchronous and create lots of violations between 3PIP and block level logic.
With this new custom configuration, users will have the liberty to perform soft-reset analysis for RDC paths that are interacting between internal to the 3PIP and the external paths effectively without considering all reset paths as asynchronous. In this approach it will still consider these 3rd-Party IPs as blackbox and will analyse without affecting runtime of the tool.
Config Setting: configure_setup_check -enable_ip_block_check_for_soft_reset true
Usage: configure_ip_block -names { <3PIP list with space separated>}
Listed below are some example scenarios which will give more clarity on the scenarios that would be covered or not covered as part of this methodology.
1. Scenarios where soft-resets generated internally to the 3PIP and also consumed internally, won't be considered for the analysis.
2. Scenarios where resets or soft-resets generated internally to the 3PIP but getting consumed externally, will be considered for the analysis.
3. Scenarios where resets or soft-resets generated externally to the 3PIP but getting consumed internally in the 3PIP, will be considered for the analysis.
Results and Conclusions:
Design Soft-Resets Reported
With configuration Without configuration
SubSystem_1 403 707
SubSystem_2 5 13
SubSystem_3 67 567
In conclusion, this methodology demonstrates the usage of the "Skip Resetless Flow" and "3rd Party IPs Soft Reset Handling" in refining RDC analysis by minimizing noise and pinpointing genuine violations. By strategically bypassing resetless flops, this method offers a more precise and efficient approach to identifying and mitigating RDC hazards, ultimately leading to more reliable designs. Furthermore, we extend this enhancement by introducing custom configuration for "3rd Party IPs soft-reset handling." This feature allows users to selectively analyze RDC paths interacting between internal and external domains of a 3PIP, enabling focused verification of critical reset behaviors and also reducing the noise. The combined application of "Skip Resetless Flow" and tailored "3PIP Soft-reset handling" provides a comprehensive and efficient solution for robust RDC verification in complex digital designs.
Applicability:
Highly applicable to a broad range of users dealing with complex SoCs, especially those utilizing numerous reset domains, asynchronous resets, and 3rd-Party IPs.
The techniques are readily implementable within standard RDC tool flows.
Challenge/Idea:
The fundamental issue of false violations due to resetless flops and blackboxed 3rd-Party IPs reset handling is a common industry problem. However, the specific combination and implementation of "Skip Resetless Flow" and "3rd-Party IPs Soft Reset Handling" provides a novel and effective solution.
Novelty:
Traditional RDC analysis often treats all reset crossings as potential hazards, leading to a high volume of false positives. This submission introduces targeted filtering. The "Skip Resetless Flow" provides a precise mechanism to bypass known non-contributing resetless flops, significantly reducing noise.
The "3rd-Party IP Soft Reset Handling" provides a custom configuration that lets the user perform soft-reset analysis on RDC paths that cross between the inside of the 3PIP and the logic outside it. Standard industry practices often struggle to efficiently handle reset interactions with blackboxed 3rd-Party IPs, leading to either excessive violations or incomplete analysis. This combination provides a detailed and customizable approach to RDC analysis that is not commonly seen.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionTraining large language models (LLMs) poses considerable challenges due to long training times, extensive memory requirements, and substantial bandwidth demands. Activation compression is a promising approach to reduce the memory footprint since activation memory accounts for the majority of memory usage in LLM training with large batch sizes and long context lengths. However, because reducing data size increases the amount of information per bit, compressed data may become more sensitive to faults, which could adversely affect model convergence and accuracy. In this work, we demonstrate that compression can enhance robustness to bit errors and present REACT, a Rapid Error-tolerant Activation Compressor for efficient Transformer training, which not only reduces memory usage but also enhances error tolerance. Using base-delta encoding, which stores only the differences between data, it minimizes data size while significantly reducing the number of error-vulnerable bits. Additionally, an error-limiting encoding scheme is incorporated to address the remaining vulnerable bits. Experimental results show that REACT maintains accuracy under a bit error rate 1000 times higher while achieving a 4× reduction in activation memory, with negligible performance overhead.
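As a rough illustration of the base-delta idea mentioned above, the sketch below stores one base value per block and only the offsets of the other values from it. It is a generic example on small integers, not REACT's encoder: the block contents, the function names, and the bit-width estimate are assumptions, and the error-limiting scheme is not modeled.

```python
def base_delta_encode(values):
    """Return (base, deltas) for a non-empty block: first value plus later offsets."""
    base = values[0]
    return base, [v - base for v in values[1:]]


def base_delta_decode(base, deltas):
    """Rebuild the original block from the base and its deltas."""
    return [base] + [base + d for d in deltas]


def signed_bits(deltas):
    """Conservative two's-complement width that can hold every delta."""
    largest = max((abs(d) for d in deltas), default=0)
    return largest.bit_length() + 1


block = [1000, 1003, 998, 1001]          # hypothetical quantized activations
base, deltas = base_delta_encode(block)
assert base_delta_decode(base, deltas) == block

raw_bits = max(block).bit_length()        # width needed per uncompressed value
print(f"base={base}, deltas={deltas}")
print(f"{raw_bits} bits per raw value vs {signed_bits(deltas)} bits per delta")
```

Because neighbouring values are close, each delta needs far fewer bits than a raw value, which is the sense in which fewer stored bits also means fewer bits exposed to errors.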
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEnsuring accurate extraction of Local Layout Effects (LLE) parameters in LVS rule decks is critical for maintaining electrical yield as technology nodes scale below 5nm. While LLE parameter extraction is essential for foundry process integrity, the key design challenge is translating these physical parameters into electrical impacts, specifically threshold voltage (Vth) and mobility (u0) variations, before costly layout simulation. A new Real-Time LLE design-aware methodology is introduced that processes extracted LLE parameters with foundry-provided equations to compute per-device electrical impacts in real time. Our approach detects physically placed neighboring cells near target devices and translates LLE physical parameters into electrical parameters (Vth/u0), enabling immediate performance assessment. This methodology addresses two critical design scenarios. First, layout-integrated visualization after LVS runs enables designers to easily identify, review, and measure how LLE parameters affect device performance, efficiently guiding layout modifications for improved yield. Second, schematic designers gain insights about LLE parameter effects using reference layouts before physical design begins, allowing them to mimic optimal layout behavior during early design stages. Deployed across multiple technology nodes, Real-Time LLE analyzes millions of devices in hours rather than days, enabling early detection and mitigation of LLE-induced performance variations.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionThe Number Theoretic Transform (NTT) dominates the hardware latency of lattice-based post-quantum cryptography (PQC) and is vulnerable to soft errors and fault injection. Existing protection mechanisms either require re-computing, assume a Residue Number System (RNS) datapath, or provide error detection only without real-time correction. This work presents two lightweight architectures for real-time fault detection and correction for NTT modular multipliers targeting the non-RNS primes used in the CRYSTALS lattice-based PQC schemes. Method 1 exploits the pseudo-Mersenne structure of the CRYSTALS moduli to build ultra-lightweight dual-modulus residue paths for error detection. Method 2 introduces cross-basis single-modulus checks together with lookup-table-based single-bit correction and a mixed linear CRT scheme for multi-bit recovery. On Artix-7, the detection configurations incur 55–78% area overhead; single-bit detection achieves 100%, and multi-bit detection reaches at least 99.01%, or nearly 100%, under the tested settings. The design requires no re-computation, has a shorter critical path delay, and surpasses state-of-the-art and Triple Modular Redundancy (TMR) baselines in area, error coverage, and latency.
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe rise of embodied intelligence is driving the deployment of Vision Transformers (ViTs) for dense prediction (DP) tasks on resource-limited edge devices. Look-up-table (LUT)-based digital compute-in-memory (CiM) has emerged as a promising paradigm, delivering high accuracy with superior energy and area efficiency. However, practical ViT deployment on LUT‑CiM for edge DP faces dual hardware and software challenges: (1) LUT cost scales exponentially with bit-width, while NLP‑oriented quantization struggles to push convolution-heavy ViTs below 8 bits, especially in compact on‑chip CiM models; and (2) fixed LUT structures cannot accommodate ViTs' heterogeneous static/dynamic workloads, while unstructured mixed‑precision weights strain memory density and utilization. To overcome these, we propose a software-hardware co-designed low-precision ViT accelerator. Software-wise, we propose Greedy G-shuffle, a lightweight, general quantization method, paired with a PCA‑based, channel‑wise weight assignment, tailored for lightweight ViTs in edge DP tasks. Hardware-wise, at the architectural level, we introduce a reconfigurable LUT-based CiM macro that accelerates both static and dynamic matrix operations with superior energy efficiency, complemented by a high-utilization pipeline- and tensor-parallel dataflow that serves heterogeneous ViT layers. At the circuit level, we employ high-density 1TnR VRRAM with flexible read paths and channel-wise hardware remapping to support mixed-precision INT4/2/1 weights, minimizing area overhead. We demonstrate the first fully 4-bit ViT with INT4 activations and structured INT4/2/1 weights on a reconfigurable 3D 1TnR VRRAM LUT-based CiM macro, achieving a 26.7% improvement in mIoU and a 30.1% reduction in RMSE on the ADE20K and NYUDv2 compared to vanilla round-to-nearest quantization, along with up to 8.9× area efficiency and 9.7× energy efficiency improvements over the state-of-the-art.
People
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionSecure speculation schemes are critical for defeating speculative side-channel attacks, but often incur notable performance penalties, encouraging research into optimization strategies. One such optimization strategy, ReCon, improves performance by exploiting previous non-speculative data leakage to remove protections for already leaked data. However, we find that ReCon can leak secrets when an instruction depends on multiple sources of potentially secret data. We show how interactions between Speculative Taint Tracking (STT) and ReCon result in ReCon undermining STT's security guarantees. We address this oversight through a low-impact multi-source tracking mechanism that we call ReConnaissance. Together with a detailed microarchitecture, we show that, on average, ReConnaissance only lowers the overhead reduction of ReCon from 23.5% to 18.3% on SPEC2017.
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
DescriptionFixed-outline floorplanning remains a critical challenge in VLSI physical design, particularly for modern System-on-Chips (SoCs) that employ rectilinear soft modules to maximize area utilization and reduce wirelength. Existing methods adopt an incremental optimization strategy, introducing rectilinear shape considerations only after obtaining an initial rectangular solution. These approaches restrict module flexibility in early stages, limit the solution space, and degrade the final Quality of Results (QoR). We propose a unified floorplanning framework that, for the first time, integrates the optimization of rectilinear soft modules into both the global floorplanning and legalization stages. To address the computational challenges posed by rectilinear polygons, we introduce a differentiable polygon shaping model that represents and dynamically optimizes module shapes using a structured set of continuous variables during global floorplanning. We further design an augmented Lagrangian method to efficiently manage complex constraints with theoretical convergence guarantees, enhancing solution quality and stability. During legalization, we model the problem as a minimum-cost flow whose objective is to minimize the total displacement, thereby ensuring a legal and high-quality final floorplan. Experimental results on GSRC and MCNC benchmarks demonstrate that our method achieves better solution quality, reducing half-perimeter wirelength by an average of 4% and up to 14% compared with state-of-the-art floorplanners, while maintaining competitive runtime performance.
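The quality metric quoted above, half-perimeter wirelength (HPWL), is the sum over all nets of the half-perimeter of each net's pin bounding box. A minimal sketch of that standard definition follows; the net representation (a list of (x, y) pin coordinates per net) and the sample coordinates are assumptions for illustration and are unrelated to the benchmarks mentioned.

```python
def hpwl(nets):
    """Sum of bounding-box half-perimeters; each net is a list of (x, y) pins."""
    total = 0.0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        # width plus height of the smallest rectangle enclosing all pins
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Two hypothetical nets: a 2-pin net and a 3-pin net.
example_nets = [
    [(0, 0), (3, 4)],
    [(1, 1), (2, 5), (4, 2)],
]
print(hpwl(example_nets))
```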
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionElectromigration and IR drop (EMIR) signoff at advanced process nodes demands prohibitively large compute resources, often requiring thousands of cores and multi-day runtimes for full-chip flat analysis. As design sizes and scenario counts increase, traditional methodologies fail to scale, limiting coverage and delaying signoff closure. This work presents a scalable Reduced Order Modeling (ROM)–based hierarchical EMIR signoff flow that delivers flat-analysis-level accuracy with dramatic improvements in turnaround time and memory efficiency.
The proposed flow abstracts each block into a scenario-aware ROM containing geometry, circuit representation, demand currents, and decap models. These ROMs are generated once at block level and reused across all top-level instantiations, enabling flexible configuration of detail granularity per instance. In full-chip analysis, ROM consumption significantly reduces simulation complexity while preserving IR drop fidelity.
Results on a production-scale GPU design demonstrate 86% runtime reduction (30.4 hrs → 4.21 hrs) and 52% lower memory usage, using identical core counts. IR drop correlation remains exceptionally strong: 99.9% of instances within ±10 mV, with peak delta under 15 mV across vectorless and DFT vector scenarios. These results establish ROM-based EMIR analysis as a practical, high-accuracy alternative to flat signoff, enabling broader scenario coverage, faster iteration, and scalable deployment across SoC design teams.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionHardware design verification is essential for ensuring the functional correctness of electronic designs. Recent advances have explored using Large Language Models (LLMs) to automate hardware testbench synthesis. However, existing approaches primarily target single hardware modules and often fail to generalize to complex, system-level designs. As design complexity increases, both the LLM-generated testbench and the reference design are more likely to exhibit correlated or compensating errors, often producing false-positive outcomes that obscure true functional violations. To tackle these challenges, we propose ReflectBench, an agentic framework for automated system-level hardware design verification with logic analysis, consensus-based validation, feedback self-reflection, and testbench self-correction. To further minimize manual effort in constructing knowledge bases or preparing examples, we devise Self-Reflective Domain Knowledge Mining, an automated approach for constructing an effective retrieval database aligned with hardware semantics. We develop ReflectVDB, a benchmark consisting of 53 system-level hardware designs for evaluating LLM-based verification frameworks. Experimental results show that our approach outperforms the SOTA framework ConfiBench on both the ReflectVDB and AutoBench datasets, achieving 7.6% and 4.5% improvement in testbench generation success rate, and 3.8% and 1.2% improvement in coverage rate. The framework and ReflectVDB benchmark are open-sourced at https://anonymous.4open.science/r/ReflectBench/.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLarge language models (LLMs) face decoding bottlenecks as attention repeatedly accesses the key-value (KV) cache. Sparse attention and processing-in-memory (PIM) each reduce data movement, but their naive integration produces irregular KV accesses that span multiple DRAM rows, leading to unnecessary activations and cache rewrites. We present REFLEX, a rewrite-free sparse attention framework that colocates required KV entries in a single DRAM row and applies activation-aware scheduling for PIM execution. REFLEX preserves accuracy without hardware changes, achieving up to 1.64× throughput and 1.36× energy efficiency on PIM, and 1.37× throughput in GPU-PIM systems.
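The cost of scattered KV accesses described above can be illustrated by counting how many distinct DRAM rows a set of selected KV entries touches. The sketch below is a toy model, not REFLEX's layout or scheduler: the row capacity of 128 entries, the selected slot indices, and the function name `rows_activated` are all assumptions.

```python
def rows_activated(slots, entries_per_row):
    """Number of distinct DRAM rows touched by the given KV slot indices."""
    return len({slot // entries_per_row for slot in slots})


ENTRIES_PER_ROW = 128                      # hypothetical row capacity

# KV entries picked by sparse attention, stored at their original positions.
scattered = [3, 130, 257, 900]
# The same number of entries placed next to each other in one row.
colocated = list(range(len(scattered)))

print("scattered:", rows_activated(scattered, ENTRIES_PER_ROW), "row activations")
print("colocated:", rows_activated(colocated, ENTRIES_PER_ROW), "row activation")
```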
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionThis paper presents Relay-GS, an algorithm-hardware co-design that accelerates 4D Gaussian splatting (4DGS) rendering. Conventional methods process each frame independently and repeatedly execute the heavy sorting stage. This results in performance bottlenecks during real-time rendering. To mitigate this inefficiency, our approach introduces optimizations at both the algorithmic and architectural levels. At the algorithmic level, we propose selective Gaussian sorting (SGS), which reuses the sorted results of previous frames, and periodic correction to maintain rendering quality. At the architectural level, we design a fine-grained pipeline to hide sorting latency and a parallel rasterization unit that reduces operations through parameter sharing. Our experimental results show that the proposed accelerator achieves a speedup of 1.55x to 1.64x and up to 16.5% in energy savings over a baseline design, while maintaining a rendering quality with negligible degradation compared to that of full sorting.
Work in Progress
DescriptionWe present RELMAS-HRT, an online reinforcement learning scheduler for heterogeneous multi-accelerator systems (MAS) that guarantees hard real-time (HRT) deadlines while optimizing QoS for multi-tenant DNN inference. The framework admits HRT tasks through a WCET-based feasibility test, constructs a deterministic baseline schedule, and dynamically refines it at runtime using an RL policy whose decisions are accepted only when provably safe. RELMAS-HRT exploits slack and missing HRT releases to improve QoS-aware throughput without compromising guarantees. Experiments show 100% HRT deadline satisfaction and higher QoS adherence than baseline policies, enabling efficient mixed-criticality inference in industrial systems.
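The abstract does not say which feasibility test is used. As a stand-in, the sketch below shows one well-known WCET-based admission check: the utilization bound for preemptive EDF on a single resource with periodic tasks whose deadlines equal their periods, where a task set is schedulable exactly when total utilization does not exceed 1. The function name `admit` and the task tuples are hypothetical, and a real multi-accelerator test would be more involved.

```python
from fractions import Fraction


def admit(admitted, candidate):
    """Admit `candidate` if total utilization stays at or below 1.

    Tasks are (wcet, period) pairs of integers in the same time unit.
    Assumption: preemptive EDF, one resource, deadline equal to period.
    """
    tasks = list(admitted) + [candidate]
    # Exact rational arithmetic avoids floating-point error at the bound.
    utilization = sum(Fraction(wcet, period) for wcet, period in tasks)
    return utilization <= 1


running = [(2, 10), (3, 15)]               # hypothetical admitted HRT tasks
print(admit(running, (5, 10)))             # total utilization 9/10
print(admit(running, (7, 10)))             # total utilization 11/10
```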
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionThe growing prevalence of mixed workloads in datacenter servers has intensified memory contention, where auxiliary tasks interfere with main applications, leading to excessive page swapping and degraded performance. To address this, we propose a CXL SSD–based Workload-aware Reversed Memory Tiering System (WA-RMT) that redefines the memory hierarchy by dedicating CXL DRAM as the fast tier for main applications while isolating auxiliary tasks in host memory. Complementing this design, a Sparse-burst Resistant Page Replacement Policy (SBR-PRP) distinguishes short-lived bursty pages from truly hot ones to prevent memory pollution. Our implementation introduces cacheable and streamed CXL.io memory accesses with data reordering for reduced latency. Evaluations using mixed workloads such as K-means, Radiosity, and ResNet50 show that the proposed system reduces execution time by up to 35% and swap-in latency by over 60% compared to conventional tiering. Together, WA-RMT and SBR-PRP offer a robust, workload-aware memory management solution for CXL-based systems.
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionNon-Volatile Memory (NVM)-based Field Programmable Gate Arrays (FPGAs) have attracted attention due to lower power consumption, higher logic density, and instant-on capability from retained configuration data. However, NVM cells have limited write endurance and higher latency than SRAM, which can affect long-term reliability, especially under frequent partial reconfiguration (PR). Although PR improves design efficiency by updating only part of the logic, it imposes lifetime limitations, particularly on NVM-based Look Up Tables (LUTs).
In this paper, we introduce REPAIR, a framework that minimizes LUT reconfiguration through PR without changing the normal CAD flow. REPAIR uses a lifetime-aware fixed placement technique in the preprocessing stage to reduce LUT reconfigurations. Moreover, it includes a postprocessing tool that generates differential partial bitstreams, writing only the changed LUT configuration bits between consecutive configurations. Finally, at runtime it incorporates a low-overhead scheduler to optimize LUT updates within the reconfiguration path. Together, these mechanisms reduce write stress on NVM cells and enhance the lifetime and reliability of Non-Volatile FPGAs (NVFPGAs) while preserving the performance and flexibility benefits of PR. Our experimental results show that REPAIR extends the operational lifetime of NVFPGAs by at least 7× compared to a baseline PR using commercial tools.
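The idea of writing only the changed LUT configuration bits can be sketched with an XOR between consecutive configurations: set bits in the XOR mark the cells that need a write. This is an illustrative model only, not REPAIR's bitstream format. The 16-bit width (a 4-input LUT), the sample configurations, and the function name `changed_bits` are assumptions.

```python
def changed_bits(old_cfg, new_cfg, width=16):
    """Bit positions that differ between two LUT configurations."""
    diff = old_cfg ^ new_cfg               # 1 wherever the two configs disagree
    return [i for i in range(width) if (diff >> i) & 1]


old_lut = 0b1010_1010_1010_1010            # hypothetical current LUT contents
new_lut = 0b1010_1010_1010_0110            # hypothetical next LUT contents

writes = changed_bits(old_lut, new_lut)
print(f"cells to rewrite: {writes} ({len(writes)} of 16)")
```

Only the cells listed need to be written, so cells whose value is unchanged between configurations incur no additional write stress.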
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionTo optimize the accuracy and efficiency of warpage and stress analysis for chip packages during the early-stage chip design process, this paper conducts the following research steps:
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEMIR is a factor in chip performance and long-term reliability. Steps to address different causes of EMIR issues are taken across a number of tools in any design flow methodology. At the very last stage of the design cycle, via insertion by an EDA tool such as Calibre DesignEnhancer is the last attempt to further optimize via arrangements in the physical chip design. Most software approaches via insertion as an area-based computation. This paper discusses the benefits of a resistive model-based computation and suggests an implementation methodology. For proof of concept, the model used in this paper is fictional, but it illustrates the impact of resistance-based via selection on the ultimate aim of reducing EMIR as the number of via choices grows in future technologies.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAt advanced nodes, aggressive cell height scaling has made lower metal routing resources a primary limiter for block-level PPA, as these layers are simultaneously demanded by signal routing, pin access, and power delivery. While adding more routing layers can increase overall capacity, the benefit diminishes with deeper metal stacks, leaving lower metals as the most performance-critical and congested resources.
This work revisits power delivery at the middle-of-line (MOL) boundary and proposes a placement-aware MOL-PDN integration flow that redistributes PDN burden away from lower metals without altering the conventional P&R flow. The proposed approach introduces MOL-level PDN segments during pre-placement and further adapts them after routing, enabling effective lower-metal resource reallocation while preserving the conventional physical design methodology.
Experiments on a large ARM CPU sub-block show that the proposed MOL-PDN flow achieves up to 0.6% frequency improvement and up to 1.5% total wirelength reduction compared to a conventional lower-metal PDN scheme. Static IR behavior is maintained, while increased dynamic IR sensitivity is observed under signoff worst-case conditions.
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionThe growing demands of scientific computing and artificial intelligence require high-throughput, flexible and efficient matrix computation for dense, sparse, and convolutional workloads. This paper introduces a reconfigurable accelerator based on the fast inner product (FIP) algorithm, which reduces hardware resource consumption while supporting dense and sparse computation modes. The accelerator is further enhanced by a novel sparse encoding scheme that skips zero inputs with minimal overhead, and a convolution module that reduces data redundancy. The prototype is implemented in 28nm technology, achieving a 21.3% area reduction and up to 4.4× and 5.1× speedup for SpGEMM and SpMM over GEMM, respectively.
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionQuantum computing imposes stringent requirements for the precise control of large-scale qubit systems, including, for example, microsecond-latency feedback and nanosecond-precision timing of gigahertz signals – demands that far exceed the capabilities of conventional real-time systems. The rapidly evolving and highly diverse nature of quantum control necessitates the development of specialized hardware accelerators. While a few custom real-time systems have been developed to meet the tight timing constraints of specific quantum platforms, they face major challenges in scaling and adapting to increasingly complex control demands – largely due to fragmented toolchains and limited support for design automation.
To address these limitations, we present RISC-Q – an open-source flexible generator for Quantum Control System-on-Chip (QCSoC) designs, featuring a programming interface compatible with the RISC-V ecosystem. Developed using SpinalHDL, RISC-Q enables efficient automation of highly parameterized and modular QCSoC architectures, supporting agile and iterative development to meet the evolving demands of quantum control. We demonstrate that RISC-Q can surpass the performance of existing QCSoCs with significantly reduced development effort, facilitating efficient exploration of the hardware-software co-design space for rapid prototyping and customization.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionProcessors are the foundation of all computing systems, yet complex RTL and microarchitectural front ends make logic defects difficult to eliminate and extremely costly to fix after tape-out. Although RISC-V fuzzing has revealed many bugs in practice, existing approaches remain limited: they rely on manually maintained instruction models and generate insufficiently rich CPU test programs. Moreover, they lack precise register and memory monitoring, and either perform high-overhead per-instruction differential checks or coarse end-of-program comparisons, both of which make bug analysis and localization inefficient.
In this work, we propose RISCSmith (30K+ Rust LoC), a fuzzing framework aimed at detecting bugs in RISC-V CPUs. First, RISCSmith automatically builds rich instruction models by parsing the RISC-V UnifiedDB to extract structured instruction metadata, resolve inconsistencies, and generate strongly typed models capturing operand roles and runtime semantics. Second, RISCSmith instruments RISC-V implementations with lightweight logging to collect per-instruction register, memory, and exception data, performing on-the-fly differential analysis to pinpoint the first divergence, classify its cause, and minimize the reproducing test case. We implemented and evaluated RISCSmith on six widely used RISC-V CPUs. In total, it uncovered 18 previously unknown bugs. Compared to state-of-the-art CPU fuzzers like Cascade and RISCV-DV, RISCSmith detects 3.5x and 2.6x more bugs and covers 37% and 61% more branches, respectively.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDesigning with chiplets is attractive to enable easier configurability, while reducing costs. This paradigm is safe when a complex system is disaggregated into multiple chiplets.
Even greater savings can be achieved when the chiplets are sourced from different vendors. Integrating chiplets of various provenances, and hence of uneven quality, raises the question of the overall liability of the assembly. Indeed, the downside of subcontracting is that the overall quality is that of the poorest chiplet.
Worse still, the system in package (SiP) can be non-functional if even one of the chiplets fails. Other risks exist, such as the opportunity for a competitor to (illegitimately) copy a chiplet along with its firmware.
Determining how to mitigate these issues first requires a risk analysis. In this presentation, we give our findings, based on ISO 31000 and Common Criteria methodologies. Interestingly, the most serious risks concern the supply chain:
- Abuse of disaggregation to inject rogue FW / keys for backdoor access, by interjecting into the chain of custody;
- Counterfeiting by over-provisioning.
In response to those threats, we have been drafting a set of security assurance requirements (SAR) and security functional requirements (SFR), gathered in a Protection Profile (PP). We will present this PP, which is designed to complement other market/domain-specific security requirements. The official publication of this PP is scheduled for the end of Q2 2026.
Finally, we touch on architectural implications. At the hardware level, we analyze the need for and the role of a "Root of Trust" (RoT), responsible for implementing security by default. The RoT is also the hardware support for the enforcement of the ownership handover mechanism. Additionally, we explain the need for inter-chiplet communication protection, to ensure end-to-end data protection against both spoofing and alteration. At the software level, it is possible to compensate when some chiplets are not capable of managing security or are not provisioned. Software also provides security services, namely hardware bill of material (BoM) verification through aggregated secure boot, and software BoM verification through distributed remote attestation.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionLogic synthesis compilers play an indispensable role in FPGA design, translating register-transfer-level (RTL) descriptions into gate-level netlists. Defects in these compilers may compromise the security of the final hardware implementations. Existing testing methods often produce redundant test cases, which limits their ability to effectively explore compiler defects. To address this problem, we propose RL4HDL, a new method that uses reinforcement learning to guide logic synthesis compiler testing. Comprehensive experiments conducted over a three-month period demonstrate the practical effectiveness of our approach: we discovered 20 unique defects, including 12 previously unreported ones, all of which have been confirmed by the official developers.
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
DescriptionUAV trajectory tracking demands robustness and real-time performance, but traditional architectures suffer from high latency and fail to counter sudden disturbances. We propose an algorithm-hardware co-design framework to address this. We integrate reinforcement learning (RL) with model predictive control (MPC) for real-time online learning. On the hardware side, we design a dedicated analog compute-in-memory (ACIM) accelerator, RM-CIMA, mapping both RL learning and MPC optimization to the analog domain. RM-CIMA enables UAVs to counter fast disturbances, improving recovery times from strong wind by 7.9× over traditional NMPC. Furthermore, it reduces computational latency and energy consumption by 2995.6× and 983.1×, respectively, compared to traditional architectures.
Engineering Special Session
AI
Chiplet
EDA
Systems
DescriptionField-Programmable Gate Arrays (FPGAs) have become indispensable tools in modern Emulation and Prototyping systems, bridging the gap between conceptual design and silicon realization. This talk explores the pivotal role FPGAs play in accelerating hardware verification, enabling early software development, and reducing time-to-market for complex system-on-chip (SoC) designs. The speaker will discuss the architectural features of FPGAs that make them ideally suited for rapid prototyping, including their reconfigurability, parallelism, and scalability. Real-world case studies will highlight how FPGAs facilitate functional validation, hardware/software co-design, and iterative design refinement, offering a cost-effective and flexible alternative to traditional ASIC flows. Attendees will gain insights into current trends, emerging challenges, and best practices for leveraging FPGAs in both pre-silicon emulation and post-silicon validation environments. This presentation will provide a comprehensive overview of how FPGAs are shaping the future of electronic system development.
Engineering Special Session
AI
Chiplet
EDA
Systems
DescriptionThis technical talk explores the pivotal role that emulation and prototyping systems play in the semiconductor development lifecycle. Emulation platforms, built on reconfigurable hardware, allow design teams to validate functional correctness, verify hardware/software integration, and uncover system-level bugs early in the process. Their ability to execute real workloads at near-system speeds provides a significant advantage over traditional simulation, accelerating functional verification and improving overall project predictability. Complementing emulation, hardware prototyping, often using FPGAs, enables high-speed testing of design concepts and early software development. Prototyping platforms allow engineers to interact with the hardware in real time, facilitating in-depth performance analysis and system validation well before silicon is available. The talk will highlight practical integration strategies for these tools within a typical semiconductor design flow. Real-world case studies will illustrate how leading design teams leverage these technologies to mitigate risk, reduce costs, and deliver robust silicon products on schedule.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionThis work presents ROSA, a microring-based optical neural network architecture that improves robustness and energy efficiency using an optical shift-and-add (OSA) module and a layer-wise hybrid mapping strategy. It introduces a noise-aware voltage-to-weight model considering DAC and thermal variations, and a workload-aware framework to co-optimize MRR array size and layer-wise dataflow. Optimized arrays reduce aggregated relative energy-delay-product (EDP) by 64% and 26% compared to DEAP-CNNs and a general compact array, respectively. OSA contributes a further 29% EDP reduction. The proposed hybrid mapping strategy improves CIFAR-10 accuracy by 8.3% over weight-stationary mapping while achieving an average 54.7% lower EDP than DEAP-CNNs.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionWe introduce a new design method to optimize routability through flexible pins.
It contributes to securing routing resources for the entire block by using flexible pins, which provide multiple pin‑access points on the highly congested lower metal layers of advanced node technologies, together with a new integrated DRC and LVS checking flow.
We propose two methods: ① Flexible Pin Location and ② Flexible Pin Shape. Each method expands the set of pin‑connection candidates according to the surrounding routing congestion, thereby improving the overall routing resources of the block.
In this work, we observed a noticeable decrease in the detour ratio on critical paths, 16.05% for Flexible Pin Location and 63.60% for Flexible Pin Shape compared with the fixed‑pin case, achieved by increasing the number of connection candidates or by reducing displacement during cell moves.
In addition, the proposed methods reduce net delay on critical paths by 1.77% for Flexible Pin Location and 8.84% for Flexible Pin Shape, resulting in Fmax improvements in both cases, as demonstrated on an industrial test case with a seamless flow.
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionAlthough Federated Learning (FL) is becoming increasingly popular in designing Artificial Intelligence of Things (AIoT) applications, due to the varying computing and communication capabilities of resource-constrained devices, it suffers from the problems of slow convergence and poor training performance, especially when devices are powered by batteries. To address these issues, this paper introduces RTFL, a novel energy-aware Real-Time Federated Learning framework based on Multi-Agent Reinforcement Learning (MARL), aiming to enhance the knowledge sharing across AIoT devices within a specified training time constraint. Specifically, RTFL employs an Adaptive Quantization-based Multi-Agent Scheduling (AQMAS) strategy, enabling a team of agents to intelligently select devices with specific model quantization levels for each round of local training, taking into account the resource constraints (e.g., remaining battery power, computing capability, and communication bandwidth) of the current devices. By facilitating collaboration among agents through reinforcement learning, our approach enables devices to maximize their contributions to forming an optimal global model, while balancing the trade-off between the accuracy of quantized models and the limited resources available on each device. Comprehensive experiments show that RTFL not only accelerates the convergence of FL training, but also encourages devices to participate in more rounds of knowledge aggregation, thereby significantly improving overall training performance.
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionRTL-3D enables early-stage cross-tier timing analysis and optimization in 3D ICs.
The framework utilizes a Transformer-based pre-synthesis timing analysis model to directly steer its core partitioning method, which features differentiable register tier partitioning.
This creates a tight feedback loop between timing prediction and physical implementation, closing the loop for co-optimization.
The experimental evaluation demonstrates that RTL-3D achieves significant improvements in sign-off timing metrics, with a 59.3% reduction in WNS and a 75.4% reduction in TNS compared to state-of-the-art methods.
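For readers less familiar with the two sign-off metrics quoted above, the following sketch shows how WNS (worst negative slack) and TNS (total negative slack) are conventionally derived from per-endpoint slacks. The slack values and function names are hypothetical and are not taken from the RTL-3D evaluation.

```python
def wns(slacks_ps):
    """Worst negative slack: the most negative endpoint slack, or 0 if timing is met."""
    return min(0, min(slacks_ps))


def tns(slacks_ps):
    """Total negative slack: the sum of all negative endpoint slacks."""
    return sum(s for s in slacks_ps if s < 0)


# Hypothetical endpoint slacks in picoseconds (assumption, for illustration only).
endpoint_slacks = [-400, -100, 250, -500]

design_wns = wns(endpoint_slacks)  # only the single worst endpoint matters
design_tns = tns(endpoint_slacks)  # every violating endpoint contributes
```

WNS captures the single worst violating path, while TNS reflects how widespread the violations are, which is why the two are usually reported together.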
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionThis paper introduces RTL-BenchMT, an agentic framework for dynamic maintenance of RTL generation benchmarks. Large Language Models (LLMs) for automated RTL generation are one of the most important directions in EDA research, yet current RTL benchmarks face two critical challenges: (1) flawed cases in the benchmarks and (2) overfitting to the benchmarks. Both challenges are difficult to resolve purely by manual engineering effort. To address these issues and systematically reduce human maintenance cost, we propose an automated agentic framework, RTL-BenchMT. RTL-BenchMT focuses on two key applications: (1) automatically identifying and revising flawed benchmark cases and (2) automatically detecting and updating overfitting cases. With the assistance of RTL-BenchMT, we conduct a thorough, in-depth analysis of flawed and overfitting cases and produce a refined benchmark suite that will be open-sourced to the community.
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionAccurate timing prediction at the register-transfer level (RTL) is a longstanding challenge in design automation. Existing graph-based methods struggle with limited receptive fields, high complexity, and a lack of signal directionality. We present RTL-Sequencer, a novel sequence-based paradigm that enables scalable RTL timing prediction via linearizing logic cones by breadth-first traversal and applying modern linear sequence models. Furthermore, sequence models are customized by four synergistic techniques, including sequence shuffling, bidirectional modeling, differentiable modeling, and a hybrid graph-sequence architecture. Extensive experiments demonstrate significant improvements of RTL-Sequencer over state-of-the-art baselines, advancing early-stage timing optimization.
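As an illustration of linearizing a logic cone by breadth-first traversal, the sketch below walks the fan-in of a timing endpoint level by level and emits the visited nodes as a flat sequence. The netlist dictionary, the node names, and `linearize_cone` are hypothetical; the actual tokenization used by RTL-Sequencer is not described in this abstract.

```python
from collections import deque


def linearize_cone(fanin, endpoint):
    """Return the fan-in cone of `endpoint` as a list in breadth-first order."""
    order = []
    seen = {endpoint}
    queue = deque([endpoint])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Visit the drivers of this node; nodes without an entry are primary inputs.
        for driver in fanin.get(node, []):
            if driver not in seen:
                seen.add(driver)
                queue.append(driver)
    return order


# Hypothetical cone: reg_q is driven by and1, which is driven by xor1 and in_c.
fanin = {
    "reg_q": ["and1"],
    "and1": ["xor1", "in_c"],
    "xor1": ["in_a", "in_b"],
}
sequence = linearize_cone(fanin, "reg_q")
```

Nodes closer to the endpoint appear earlier in the sequence, so logic depth is encoded by position rather than by graph edges.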
People
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionFault injection attacks (FIAs) pose a severe threat to accelerators executing privacy-preserving homomorphic encryption (HE) neural networks. Defending against FIAs in the ciphertext domain is challenging due to opaque internal states and the difficulty of distinguishing inherent encryption noise from malicious perturbations. We propose Runtime-ADAR, a lightweight runtime framework enabling closed-loop detection and recovery fully within the ciphertext domain. By repurposing intrinsic HE noise as a high-fidelity diagnostic signal, Runtime-ADAR integrates a sparse-projection noise detector for statistical anomaly sensing and a gradient-field repair engine for in-situ weight compensation. This synergy transforms encryption noise from a vulnerability into a defensive asset. Evaluations on multiple encrypted models show that Runtime-ADAR mitigates over 90% of attack successes and restores up to 97% baseline accuracy with modest performance overhead on CNNs.
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
DescriptionEdge computing that offloads AI inference from the cloud to local devices can significantly reduce interaction latency and improve data privacy, while simultaneously imposing stringent requirements on both efficiency and security. Non-volatile compute-in-memory (nvCIM) boosts efficiency by executing multiply-accumulate (MAC) inside memory arrays, reducing data movement and eliminating weight refresh. However, deploying nvCIM in untrusted environments exposes both user inputs and model weights to multiple security threats: during storage, non-volatility keeps weights vulnerable to physical extraction; during computation, plaintext inputs must be transferred and exposed at runtime. Recent in-memory encryption CIM (IME-CIM) schemes integrate an XOR cipher and in-situ decryption to protect stored weights, yet still require plaintext keys and inputs during inference and often incur up to 2x array area overhead, leaving a security-efficiency gap. This paper proposes S²CIM, a FeFET-based nvCIM architecture with circuit-algorithm co-design to bridge this gap. For security, an Affine-Transform Splitting (ATS) scheme based on randomization strategies protects both offline weights (XORed weights) and online computation (obfuscated inputs) without explicitly exposing plaintext keys or inputs. For efficiency, a Drain-Input Gate-Scan (DIGS) based on 1FeFET-1R achieves low-variation signed multi-bit MAC in minimal cycles with high area efficiency comparable to common nvCIMs. We validate its functionality using an S²CIM macro with 16x16 arrays. Compared with recent IME-CIMs, S²CIM improves area efficiency by up to 16.8x and reduces energy per MAC by up to 6.7x, while ensuring 97.4% encrypted ViT inference accuracy and reducing attack success rates to 50% under all potential threat models.
People
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionAs VLSI designs grow in complexity, partitioning is widely adopted to accelerate physical design through parallel computing. However, traditional hypergraph partitioning methods often degrade in performance when applied to 2D layouts due to spatial constraints. For routers with post-placement locations, a spatial-aware partitioning method fully utilizing placement data is preferable. Existing works can only consider soft spatial constraints, leading to a scattered distribution in one partition. We propose SAAP, an analytic partitioning algorithm enforcing hard spatial constraints while efficiently minimizing cut sizes. It includes analytic boundary modeling with regularity-guided simulated annealing and region embedding. Given placed netlists, it generates timing-friendly k-way spatially continuous partitions for parallel routing. Experiments show that it can quickly provide several to dozens of times smaller spatial cut sizes than previous state-of-the-art, with better spatial continuity.
Research Manuscript
SAEM: Stage-Aware Expert Management for Memory-Efficient MoE Inference in Chain-of-Thought Reasoning
12:18pm - 12:30pm PDT Tuesday, July 28 Mtg Room 101A
AI
AI4-II. AI/ML Architecture Design
DescriptionChain-of-thought (CoT) prompting enhances LLM reasoning by decomposing tasks into intermediate steps, but it increases decoding latency and memory usage. Mixture-of-Experts (MoE) models improve scalability via sparse expert activation, yet expert weights often exceed GPU capacity. Existing runtimes treat tokens uniformly, missing structural regularities where consecutive reasoning stages exhibit coherent expert activation patterns. This causes inefficient expert caching and excessive GPU-CPU data movement. We present SAEM, a stage-aware inference runtime that detects reasoning stage boundaries and exploits stage-level coherence in expert activation. SAEM applies stage-aware caching for GPU-CPU placement, combining token packing with in-situ CPU execution. SAEM achieves 1.54× speedup over state-of-the-art baselines under memory constraints.
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionThe development of IoT devices imposes stringent requirements on data security. Existing CPU-based cryptographic schemes suffer from high latency, while dedicated hardware accelerators lack scalability and fail to support agile development. To address these challenges, we propose Safe-IoT, a memory-efficient HW/SW co-designed ML-DSA accelerator for IoT edge devices. Its core components include a memory-efficient IoT-oriented number theoretic transform (MI-NTT) circuit and a low-cost LUT-based modular multiplier. Experimental results show that, compared with state-of-the-art CPU-based designs, Safe-IoT achieves up to a 3.0× improvement in throughput. Compared with similar HW/SW co-design approaches, on-chip RAM usage is reduced by 2.6-8.3x, and overall performance, measured by area-time product (ATP), is improved by 4.28-176.53x.
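To make the role of the number theoretic transform concrete, here is a toy sketch of a naive cyclic NTT and its inverse, used to multiply two small polynomials modulo x^8 - 1. The parameters (q = 17, n = 8, root 2) are assumptions chosen only for readability; they are not the ML-DSA parameters, and the quadratic-time loop is not the MI-NTT circuit described in the paper.

```python
Q = 17  # toy prime modulus (assumption, not the ML-DSA modulus)
N = 8   # transform length
W = 2   # primitive 8th root of unity mod 17: 2**4 % 17 == 16 and 2**8 % 17 == 1


def ntt(a, root=W):
    """Naive O(N^2) forward transform: A[k] = sum_j a[j] * root^(j*k) mod Q."""
    return [sum(a[j] * pow(root, j * k, Q) for j in range(N)) % Q for k in range(N)]


def intt(A):
    """Inverse transform: evaluate with the inverse root, then scale by N^-1 mod Q."""
    inv_n = pow(N, -1, Q)      # modular inverse (Python 3.8+)
    inv_root = pow(W, -1, Q)
    return [(inv_n * x) % Q for x in ntt(A, inv_root)]


def cyclic_multiply(a, b):
    """Multiply two polynomials mod (x^N - 1, Q) via pointwise products in the NTT domain."""
    A, B = ntt(a), ntt(b)
    return intt([(x * y) % Q for x, y in zip(A, B)])


a = [1, 2, 0, 0, 0, 0, 0, 0]  # 1 + 2x
b = [3, 4, 0, 0, 0, 0, 0, 0]  # 3 + 4x
product = cyclic_multiply(a, b)  # coefficients of 3 + 10x + 8x^2
```

The transform turns polynomial multiplication into coefficient-wise multiplication, which is why NTT throughput and memory footprint tend to dominate lattice-based signature accelerators.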
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
DescriptionTraditional simulation-based fault analysis tends to be overly conservative and fails to reflect true fault criticality.
This paper presents SafeGen, an LLM-driven, formal-verification-assisted framework for functional-safety-oriented fault criticality assessment.
SafeGen employs large language models with a Hyper Knowledge Graph (HyperKG) to extract verifiable specifications and to evaluate their importance for overall system safety.
The HyperKG is then extended with register-transfer level information to guide the generation of Functional Safety Assertions.
A gate-to-RTL fault-mapping mechanism supporting both stuck-at and bridging faults, combined with formal property verification, enables semantic-level fault criticality grading. A digital–physical co-simulation platform for a field-oriented control system validates SafeGen.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSparse LU factorization is a critical kernel in scientific computing. Given its irregular data dependencies and complex computation patterns arising from inherent high sparsity, its acceleration remains largely unexplored on FPGAs. The high concurrency capability offered by High Bandwidth Memory (HBM) has recently presented new opportunities. Nevertheless, accelerating sparse LU factorization on HBM-equipped FPGAs remains non-trivial due to rigid dependency synchronization and imbalanced load.
In this paper, we propose SALUT, a high-performance sparse LU factorization accelerator on HBM FPGAs. An asynchronous task activation mechanism is developed to pre-compile complex runtime data dependency resolution into fine-grained dependency management, reducing synchronization overhead and maximizing computation parallelism. Additionally, we design a hardware architecture facilitating stream-aware parallel scheduling by decoupling burdensome dependency resolution from task activation, eliminating the serial bottleneck of a centralized scheduler. Finally, a locality-aware dual-queue load balancing strategy ensures data locality and high hardware utilization. Evaluations on 15 sparse matrices show that SALUT's geometric mean throughput and energy efficiency surpass the NVIDIA cuDSS solver by 4.0x and 6.5x on an RTX A6000 GPU, and by 3.7x and 4.7x on a Tesla V100 GPU, respectively.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionCPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints—existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance.
We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation.
Evaluated on five x86/ARM CPU platforms, Sandwich achieves an average 2.01× end-to-end speedup and up to 3.40× latency reduction over SOTA systems. Its kernels match static compiler performance with three orders of magnitude less tuning cost.
Late Breaking Results
DescriptionLarge-scale CNFs exceed LLM context windows, and direct rewriting can break equivalence. SAT-Helper is a zero-shot multi-agent CNF optimizer that samples sliding windows, selects small high-impact blocks and local strategies, rewrites only those blocks, and accepts updates only after SAT-based equivalence checking. This jointly addresses scalability, adaptivity, and correctness. On CNFgen and SAT Competition 2025, SAT-Helper cuts solving time by 68.0% and 80.2% while reducing memory by 18.7–25.7%. The results show that verified input-side optimization is a practical direction for LLM-assisted SAT solving.
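The accept-only-if-equivalent step can be illustrated with a small sketch. For brevity it checks equivalence by enumerating all assignments instead of calling a SAT solver, which is only feasible for small blocks; the clause encoding (DIMACS-style signed integers) and the function names are assumptions, not SAT-Helper's interface.

```python
from itertools import product


def evaluate(cnf, assignment):
    """A CNF is true when every clause has at least one satisfied literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )


def equivalent(cnf_a, cnf_b):
    """Brute-force check that two CNFs agree on every assignment of their variables."""
    variables = sorted(
        {abs(lit) for cnf in (cnf_a, cnf_b) for clause in cnf for lit in clause}
    )
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if evaluate(cnf_a, assignment) != evaluate(cnf_b, assignment):
            return False
    return True


# Hypothetical block: (x1 or x2) and (x1 or not x2) and (not x1 or x3)
original = [[1, 2], [1, -2], [-1, 3]]
rewritten = [[1], [3]]  # candidate rewrite: x1 and x3
broken = [[1]]          # candidate that drops the constraint on x3

accepted = equivalent(original, rewritten)  # rewrite preserves the block's meaning
rejected = equivalent(original, broken)     # rewrite changes the block's meaning
```

Gating every rewrite on such a check means a wrong suggestion costs only a rejected update, never an incorrect formula.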
Research Special Session
AI
DescriptionRadio-frequency (RF) signal separation and classification are increasingly vital for space sensing, scientific missions, and autonomous satellite operations, yet deep learning models capable of handling non-terrestrial emissions exceed the compute, memory, and energy limits of onboard processors. We introduce a hardware-aware neural network compression framework that uses knowledge distillation from a large teacher model trained on challenging RF scenes, combined with architecture search and lightweight quantization techniques. The resulting compact models run efficiently on constrained embedded platforms, including microcontrollers and Jetson-class devices, enabling practical in-space RF intelligence for communication monitoring, interference mitigation, and spectrum situational awareness.
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionGraph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited to single-GPU systems, leading to long training times or out-of-memory issues on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, as efficiency depends heavily on both the graph structure and system characteristics, such as bandwidth and memory capacity.
In this work, we introduce a distributed training framework for graph transformers, which automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks, our proposed framework achieves up to 6x speedup with system scaling up to 8 GPUs. These results demonstrate that the proposed framework improves the scalability of graph transformers, bringing them closer to serving as practical graph foundation models.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAt advanced process nodes, flat EMIR signoff for modern SoCs, which often exceed four billion instances and 50 billion nodes, faces prohibitive runtime and compute demands. This work introduces a production-proven hierarchical EMIR signoff methodology leveraging Reduced Order Model (ROM) technology to achieve scalable, resource-efficient IR/EM analysis for billion-instance systems. The proposed bottom-up flow generates compact ROMs that preserve electrical and geometric characteristics of child blocks across defined Common Connection Layers (CCL), enabling accurate hierarchical modeling without full flat complexity. Integrated automation ensures consistent ROM generation, validation, and consumption across multi-scenario, multi-PVT signoff environments. The resulting hierarchical analysis achieves near-flat accuracy with over 50% reduction in runtime and memory footprint, supporting realistic workload and packaging variations. Validation on a large-scale SoC demonstrates around 75% total compute-hour reduction, 55% disk space savings, and 35% memory improvement.
This methodology has been successfully deployed in production for next-generation Arm SoCs, enabling faster turnaround and efficient compute utilization in both on-premise and hybrid cloud environments. The flow establishes a scalable, practical, and high-fidelity EMIR signoff paradigm transforming hierarchical verification for large-scale, advanced-node SoCs.
Engineering Presentation
EDA
Systems
DescriptionReticle-scale SoCs and repetitive compute fabrics have fundamentally altered the scalability requirements of EMIR signoff. Modern designs composed of large, clustered subsystems and shared routing channels expose critical limitations in conventional flat EMIR analysis flows, including multi-day runtimes, compute memory saturation, and unmanageable activity data volumes. EDA-tool-based boundary abstraction partially restores current accuracy but ignores logic interaction and propagation, limiting its effectiveness for localized IR detection. LEF with current-profile abstraction of blocks improves scalability but smooths the current distribution, underestimating boundary bump currents and masking rail-sharing and temporal switching correlations.
This work presents a novel hierarchical and interaction-aware hybrid EMIR analysis methodology that combines region-based clustering with an abstraction strategy that selectively applies Reduced-Order EDA Models and LEF with current profile abstraction based on electrical sensitivity and interaction criticality. This mixed modeling approach enables accurate logic propagation, realistic switching correlation, and precise modeling of rail sharing and bump contention without incurring the cost of full flat simulation.
The proposed flow achieves up to 10× runtime reduction, 50% reduction in memory and database footprint and extremely good correlation with flat EM/IR signoff. The methodology is production-proven, easily adoptable within existing signoff flows, and directly applicable to next-generation reticle-scale, AI/ML SoC architectures.
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
DescriptionTraining autonomous agents on multiple tasks is crucial for adapting to diverse real-world environments. However, state-of-the-art methods rely on fixed task-switching intervals during training, which limits their performance and scalability. To address this, we propose SwitchMT, a novel methodology that employs adaptive task-switching for effective and scalable multi-task learning. It leverages a Deep Spiking Q-Network with active dendrites and a dueling structure that uses task-specific context signals to create specialized sub-networks, and it devises an adaptive task-switching policy that leverages rewards and the internal dynamics of the network parameters. Experimental results show that SwitchMT achieves better multi-task learning performance than state-of-the-art methods.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionDeep Neural Networks (DNNs) are being deployed in safety-critical applications, where resilience to transient faults is essential. Traditional fault injection methods often face challenges in scaling efficiently to larger models, whereas the majority of existing speed-up techniques is closely linked to specific hardware or architectural configurations.
To speed up the assessment of single-fault effects, an approach based on simultaneous injection of faults has demonstrated promising results, where multiple non-interacting faults are injected concurrently during a single workload execution. Nevertheless, the applicability of this method to DNNs has not been explored. In this study, we investigate the use of simultaneous injection of faults in DNNs and observe that faults can easily interact with one another due to DNNs' densely connected structure.
These fault interactions can create "artificial" masking effects, leading to the misclassification of faults as non-critical (called false negatives), ultimately compromising the accuracy of the reliability assessment.
To overcome this phenomenon, we propose an approach to mitigate the effects of such fault interaction during simultaneous injection of faults in DNNs, ensuring accurate assessment. Furthermore, we propose a strategy to further accelerate the assessment by pruning non-critical inputs from the DNN input batch during fault injection, further improving the speedup with negligible accuracy loss. To our knowledge, this is the first approach to enable accurate and efficient simultaneous injection of faults into DNNs, supporting fast reliability assessment applicable to different abstraction levels. We experiment with nearly 42 million injections at both software (SW) and RTL, achieving very low false negatives (as low as 0%, avg 0.2%) and an average injection time gain of 3.82x (RTL) and 5.29x (SW) over existing DNN fault injection approaches.
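The masking effect described above can be illustrated with a toy example. The sketch below is not the authors' method: it assumes a single-bit-flip fault model on float32 weights (the abstract does not name a fault model) and a hypothetical two-class linear classifier. The names `flip_bit`, `predict`, `is_critical`, `W`, and `X` are illustrative only.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) of the float32 encoding of value."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty

def predict(weights, x):
    """Toy linear classifier: return the index of the largest score."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))

def is_critical(weights, x, faults):
    """Inject all (row, col, bit) faults together; True if the prediction changes."""
    golden = predict(weights, x)
    faulty = [row[:] for row in weights]  # copy so the golden weights stay intact
    for row, col, bit in faults:
        faulty[row][col] = flip_bit(faulty[row][col], bit)
    return predict(faulty, x) != golden

# Hypothetical weights and input, chosen so that two faults interact.
W = [[1.0, 0.5], [0.5, -0.25]]
X = [1.0, 1.0]
FAULT_A = (0, 0, 31)  # sign flip in class-0 weights: critical when injected alone
FAULT_B = (1, 0, 31)  # sign flip in class-1 weights: benign when injected alone

alone_a = is_critical(W, X, [FAULT_A])
alone_b = is_critical(W, X, [FAULT_B])
together = is_critical(W, X, [FAULT_A, FAULT_B])
```

In this toy setup, fault A changes the prediction when injected alone, but injecting A and B in the same run leaves the prediction unchanged, so a naive simultaneous campaign would record A as non-critical (a false negative).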
Engineering Presentation
AI
EDA
Systems
DescriptionMarvell's latest designs are 3D chiplet-based architectures, where multiple dies are stacked into a single package. Our 3D platform is a dual-die project featuring advanced die-to-die (D2D) interconnect technology, with a top die and bottom die built on 3nm and 5nm process nodes. Verifying such systems is challenging because traditional design verification flows are built for single chips. Issues include common module names (like PLL) that differ across dies, and directory structures that don't easily support multi-die integration. To address these challenges, we introduced the System in Package (SIP) flow using sipreg, a tool that builds and executes one or more SIP simulations. Unlike legacy flows, SIP enables users to incorporate chiplet design components into a single monolithic simulation executable by leveraging the VCS 3-step flow - analysis, elaboration, and simulation - along with logical Verilog libraries. Logical libraries allow design components to be analyzed into named libraries (vlibs) and reused during elaboration, giving SIP project leads flexibility to select chiplet components as building blocks for system-level simulations. These solutions, now deployed on our multi-die test chip, enable scalable verification for chiplet-based designs and streamline future multi-die projects.
Engineering Presentation
EDA
Systems
DescriptionModern RISC-V CPU clusters are delivered as highly parameterized design families, where variations in core count, microarchitecture, cache hierarchy, interconnect fabric, and system IP produce hundreds or thousands of legal RTL configurations. Verifying only a small number of representative configurations is insufficient, as functional and memory-model bugs may manifest only under specific parameter combinations. This paper presents a configuration-oriented, emulation-friendly co-simulation testbench framework paired with a JSON-driven RTL generation flow that enables scalable verification across large RISC-V configuration spaces using a single reusable testbench. The testbench flow integrates RISC-V architectural checking, a memory consistency model interface, and a C++-based verification methodology that includes instruction-accurate co-simulation with an ISS and RVWMO memory-model validation. It remains design-agnostic with respect to core microarchitecture and cluster composition through a companion script, gen_rtl_top.py, which consumes JSON configuration descriptions to automatically synthesize consistent cluster top-level RTL for each configuration of cores, clusters, shared cache, and system IP. The same verification environment runs unmodified across hardware simulation, hardware emulation, early software bring-up on emulation, and open-source tools such as Verilator, enabling efficient, vendor-independent configuration-space exploration and high-throughput verification of complex RISC-V cluster families.
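To make the idea of JSON-driven top-level generation concrete, here is a minimal sketch. It is not the gen_rtl_top.py script named in the abstract, whose interface is not described; the JSON keys (`top_name`, `core_module`, `clusters`, `cores_per_cluster`, `shared_cache`), the module names, and the port list are all hypothetical.

```python
import json

def render_top(config_text: str) -> str:
    """Render a skeletal Verilog top level from a JSON configuration string."""
    cfg = json.loads(config_text)
    lines = [f"module {cfg['top_name']} (input wire clk, input wire rst_n);"]
    # One core instance per (cluster, core) pair, with a deterministic name.
    for cluster in range(cfg["clusters"]):
        for core in range(cfg["cores_per_cluster"]):
            inst = f"u_cl{cluster}_core{core}"
            lines.append(f"  {cfg['core_module']} {inst} (.clk(clk), .rst_n(rst_n));")
    # Optional shared cache, instantiated only when the configuration asks for it.
    if cfg.get("shared_cache", False):
        lines.append("  shared_cache u_shared_cache (.clk(clk), .rst_n(rst_n));")
    lines.append("endmodule")
    return "\n".join(lines) + "\n"

# Hypothetical configuration: 2 clusters of 2 cores with a shared cache.
EXAMPLE_CONFIG = json.dumps({
    "top_name": "cluster_top",
    "core_module": "rv_core",
    "clusters": 2,
    "cores_per_cluster": 2,
    "shared_cache": True,
})
EXAMPLE_RTL = render_top(EXAMPLE_CONFIG)
```

Because the instance names are derived only from the configuration, a testbench could in principle bind to them without knowing which configuration was generated; a real flow would also have to generate interconnect and port wiring, which this sketch omits.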
Engineering Presentation
Design
EDA
Systems
DescriptionModern multi-domain SoCs almost always integrate serial link IPs, making reliable system reset a critical challenge, especially when clock generation circuitry is compromised. Conventional out-of-band reset schemes require dedicated pins, limiting scalability, while existing in-band reset solutions depend on clock availability, rendering them unreliable in certain failure scenarios. This paper introduces a scalable, protocol-agnostic, clockless in-band reset architecture that ensures deterministic SoC reset through precise reset pulse width generation without relying on any clock source. The proposed design partitions functionality into PHY and controller blocks, incorporating a Reset Pattern Detector (RPD) for robust reset signaling. Demonstrated on an MIPI I3C controller subsystem, the architecture is easily portable to PCIe, eSPI, USB, SATA, DisplayPort, Ethernet, and other protocols. Key advantages include guaranteed reset duration, independence from SoC size or complexity, technology agnosticism, and non-intrusiveness to existing IP standards. This approach significantly enhances system reliability and scalability for heterogeneous SoCs, supporting error recovery, synchronization, and security-critical applications across diverse wired interfaces.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionRealizing on-device ML-based gesture detection under tight real-time performance, energy, and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel runtime-adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced-complexity detection. To evaluate the performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by ≈4× (from ≈6.9 mJ to ≈1.6 mJ) while maintaining high gesture-detection performance (event-level F1 ≈ 0.8–0.9) and low mean latency (≈6 ms).
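A run-time controller of this kind can be sketched as a constrained selection over precomputed profiles. The following is an illustrative assumption, not the Scale-Gest controller: the `AceProfile` fields, the linear battery-to-budget rule, and all numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AceProfile:
    name: str
    f1: float          # calibrated event-level F1 of this operating point
    energy_mj: float   # calibrated per-frame energy in millijoules
    latency_ms: float  # calibrated mean per-frame latency in milliseconds

def select_mode(profiles, battery_frac, max_latency_ms, full_budget_mj):
    """Pick the most accurate profile that fits the energy and latency limits.

    The energy budget shrinks linearly with the remaining battery fraction
    (an assumed policy). If nothing fits, fall back to the cheapest profile.
    """
    budget_mj = full_budget_mj * battery_frac
    feasible = [p for p in profiles
                if p.energy_mj <= budget_mj and p.latency_ms <= max_latency_ms]
    if not feasible:
        return min(profiles, key=lambda p: p.energy_mj)
    return max(feasible, key=lambda p: p.f1)

# Hypothetical profiles; the values are illustrative, not measured.
PROFILES = [
    AceProfile("large", f1=0.90, energy_mj=7.0, latency_ms=9.0),
    AceProfile("medium", f1=0.86, energy_mj=3.0, latency_ms=6.0),
    AceProfile("small", f1=0.80, energy_mj=1.5, latency_ms=4.0),
]

mode_full = select_mode(PROFILES, battery_frac=1.0, max_latency_ms=10.0, full_budget_mj=8.0)
mode_half = select_mode(PROFILES, battery_frac=0.5, max_latency_ms=10.0, full_budget_mj=8.0)
mode_low = select_mode(PROFILES, battery_frac=0.1, max_latency_ms=10.0, full_budget_mj=8.0)
```

With these made-up numbers the controller steps down from the large to the medium to the small detector as the battery drains, which is the qualitative behavior the abstract describes.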
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionIn real-time systems, both individual task execution and data propagation must meet strict timing constraints. Cause–effect (CE) chains are widely used to analyze such behavior in terms of end-to-end latency. However, timing anomalies (TAs), in which a local reduction in execution time leads to an increase in the overall end-to-end latency, can distort this analysis. As a result, precisely analyzing the upper bounds of the latency becomes challenging, and such systems typically exhibit larger upper bounds than TA-eliminated systems. Existing studies either eliminate TAs by completely sacrificing average latency to simplify analysis or, despite adopting complex safe analysis methods, fail to eliminate TAs effectively and still exhibit high latencies.
To address this issue, we identify two basic causes of TAs in end-to-end latency. Based on these causes, we propose the first treatment that eliminates TAs in the latency with negligible average latency loss using Deterministic Data Flow (DDF). We further formally prove its TA-free property. Therefore, we can get a precise upper bound for latency when all jobs execute with their worst-case execution times.
Experimental results show that it effectively reduces the maximum end-to-end latency, the average latency, and latency jitter compared with the state-of-the-art (SOTA) method.
People
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionLog schema extraction, the process of deriving human-readable templates that support clear log understanding from massive volumes of log data, is fundamental for automated debugging and performance analysis across modern Electronic Design Automation (EDA) tool chains. The challenge is amplified in advanced design flows, where logs interweave command scripts, multi-stage progress reports, timing updates, and deeply nested performance tables, rendering manual regular-expression design brittle and unscalable to evolving tool versions. We introduce SchemaCoder, a fully-automated LLM-driven schema extraction framework that constructs structured log schemas by synthesizing reusable parser code, enabling robust handling of arbitrary unstructured log files and complex EDA tool logs without manual regular-expression design. At its core, SchemaCoder uses a novel Question-Tree (Q-Tree) pattern code generation process to identify pattern codes and utilizes the extracted raw contents to drive a textual residual evolutionary optimizer in the inner loop without relying on a gold labeled dataset. In the outer loop, a residual Q-Tree boosting mechanism identifies additional pattern codes and iteratively refines the parser code. On EDA tool logs from OpenROAD and commercial tools, the structured log schema generated by SchemaCoder enables an average 11.7% improvement in agentic EDA tool log analysis QA tasks (pass@1) over a strong commercial baseline. SchemaCoder also achieves up to 7.3% improvement in the average scores and outperforms state-of-the-art baselines in 9 of 14 applications on LogHub-2.0. We will open-source the code and the EDA tool log QA benchmark for reproducibility upon acceptance.
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
DescriptionModern vehicles integrate multiple in-vehicle networks (IVNs), such as Controller Area Network (CAN), Automotive Ethernet, and host systems, to support advanced automation and connectivity. While these networks enable new functionalities, they also expand the attack surface across multiple domains. Sophisticated multi-stage attacks exploit these surfaces to propagate covertly, and existing vehicular intrusion detection systems (IDSs) cannot detect them because they cause only minor changes in IVNs. This paper presents SCIMI, a Semantic-aware Collaborative Intrusion detection system for Multi-domain In-vehicle networks, which learns inter-domain causal relationships to identify subtle, coordinated attacks that evade single-domain intrusion detection systems. SCIMI aligns asynchronous traffic patterns through a dual-window temporal model, extracts semantic features via domain-adaptive encoders, and applies cross-domain attention to correlate suspicious behaviors. A dataset is also released for collaborative IDS. Experiments show that SCIMI achieves over 99% F1-score with a 17% improvement over state-of-the-art methods, while maintaining low false positive alarms and real-time inference efficiency, underscoring its applicability to real-time automotive cyber-security.
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionSpecialized hardware accelerators facilitate timing guarantees for real-time DNNs, but high computational demands cause severe thermal stress. We present SCOUT, a real-time framework for Google Edge TPU that jointly satisfies timing/thermal constraints. The key idea couples task-level dynamic frequency scaling (DFS) with runtime on-chip SRAM allocation to offset DFS-induced performance loss. To this end, SCOUT provides i) a thermal model that reflects SRAM's thermal impact and ii) an SRAM reallocation technique that safely adjusts allocations while mitigating write overheads. Our experiments show SCOUT schedules up to 46% more DNN task sets than prior approaches, while adhering to thermal constraints.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionLarge language models encounter critical GPU memory capacity constraints during long-context inference, where KV cache memory consumption severely limits decode batch sizes. While existing research has explored offloading KV cache to DRAM, these approaches either demand frequent GPU-CPU data transfers or impose extensive CPU computation requirements, resulting in poor GPU utilization as the system waits for I/O operations or CPU processing to complete.
We propose ScoutAttention, a novel KV cache offloading framework that accelerates LLM inference through collaborative GPU-CPU attention computation. To prevent CPU computation from bottlenecking the system, ScoutAttention introduces GPU-CPU collaborative block-wise sparse attention that significantly reduces CPU load. Unlike conventional parallel computing approaches, our framework features a novel layer-ahead CPU pre-computation algorithm, enabling the CPU to initiate attention computation one layer in advance, complemented by asynchronous periodic recall mechanisms to maintain minimal CPU compute load. Experimental results demonstrate that ScoutAttention maintains accuracy within 2.4% of baseline while achieving 2.1x speedup compared to existing offloading methods.
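As background for the block-wise sparse attention mentioned above, the sketch below shows a generic version of the idea in plain Python: rank fixed-size key blocks by a cheap score and attend only to the top-ranked ones. It is not ScoutAttention's algorithm; the mean-key scoring rule, the function names, and the toy data are assumptions, and the layer-ahead pre-computation and recall mechanisms are not modeled.

```python
import math

def dense_attention(q, keys, values):
    """Single-query scaled dot-product attention over all keys."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend only to the top_k key blocks, ranked by q dotted with the block's mean key."""
    if block_size < 1 or top_k < 1:
        raise ValueError("block_size and top_k must be positive")
    n_blocks = (len(keys) + block_size - 1) // block_size
    blocks = [range(b * block_size, min((b + 1) * block_size, len(keys)))
              for b in range(n_blocks)]

    def block_score(idx):
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(len(q))]
        return sum(a * b for a, b in zip(q, mean_key))

    ranked = sorted(range(n_blocks), key=lambda b: block_score(blocks[b]), reverse=True)
    kept = sorted(ranked[:top_k])  # restore positional order of the kept blocks
    selected = [i for b in kept for i in blocks[b]]
    return dense_attention(q, [keys[i] for i in selected], [values[i] for i in selected])

# Toy data: the first block aligns with the query, the second opposes it.
Q = [1.0, 0.0]
KEYS = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
VALUES = [[1.0], [1.0], [0.0], [0.0]]

full = dense_attention(Q, KEYS, VALUES)
sparse_all = block_sparse_attention(Q, KEYS, VALUES, block_size=2, top_k=2)
sparse_one = block_sparse_attention(Q, KEYS, VALUES, block_size=2, top_k=1)
```

Keeping every block reproduces dense attention, while keeping one block touches half the keys and values at the cost of an approximation error; that trade-off is what makes sparse selection attractive when part of the KV cache lives off the GPU.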
People
Engineering Presentation
Design
EDA
Security
Systems
DescriptionWhile cryptographic modules are digitally verified for logical correctness, they remain susceptible to physical side-channel attacks (SCA). This paper proposes a verification flow to determine whether in-depth analog verification is required for digitally verified cores by identifying security vulnerabilities rooted in analog anomalies. By leveraging an efficient variation-aware analysis framework, we achieve fast exploration of Points of Interest (PoI) at the transistor level. We show that when analog quantities such as delay time and charge consumption are heuristically selected along the digital paths that carry secret information within a crypto core, outliers appear in their statistical distributions and call for further analog investigation. The effectiveness was demonstrated across 17 S-boxes in an AES-128 core, including one embedded with a Hardware Trojan (HT). Our statistical logic successfully flagged one unintentionally anomalous S-box and the intentionally HT-embedded one as PoI candidates for detailed analog verification. Subsequent Correlation Power Analysis (CPA) confirmed that these flagged PoIs revealed secret keys significantly faster than standard S-boxes. The proposed flow achieves signoff accuracy 10-1000x faster than brute-force SPICE simulations. This approach empowers designers to systematically and efficiently introduce analog verification processes for digital security modules.
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionInterpretable models like Decision Trees (DT) and Random Forests (RF) are ill-suited for edge hardware: they face von Neumann bottlenecks and are sensitive to device variations. We propose a hardware-algorithm co-design, the Soft Decision Machine (SDM). Algorithmically, SDM transforms a multivariate model into a hardware-friendly univariate form. In hardware, a compact 1-FeFET analog CAM natively computes the required sigmoid function. This "soft" probabilistic approach provides inherent variation tolerance. Under 40 mV device variation, SDM maintains 93.2% accuracy on MNIST (0.54% drop), outperforming DT (43.86% drop) and RF (16.06% drop), enabling robust and efficient interpretable computing for edge intelligence.
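The "soft" decision idea can be illustrated with a generic soft decision tree in which every split tests a single feature through a sigmoid. This is a textbook-style sketch under stated assumptions, not the SDM algorithm or its FeFET mapping; the tree layout, the `temperature` parameter, and the dictionary node format are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def leaf_probabilities(node, x, temperature, reach=1.0):
    """Return {label: probability} for a soft tree with univariate sigmoid splits.

    Each internal node sends the sample right with probability
    sigmoid((x[feature] - threshold) / temperature) and left otherwise.
    """
    if "label" in node:
        return {node["label"]: reach}
    p_right = sigmoid((x[node["feature"]] - node["threshold"]) / temperature)
    out = {}
    for child, p in ((node["left"], 1.0 - p_right), (node["right"], p_right)):
        for label, prob in leaf_probabilities(child, x, temperature, reach * p).items():
            out[label] = out.get(label, 0.0) + prob
    return out

# Hypothetical two-level tree over two features.
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": "A"},
    "right": {
        "feature": 1, "threshold": 0.5,
        "left": {"label": "B"},
        "right": {"label": "A"},
    },
}

probs = leaf_probabilities(TREE, [0.9, 0.1], temperature=0.05)
perturbed = leaf_probabilities(TREE, [0.88, 0.12], temperature=0.05)
```

Because each split contributes a smooth probability rather than a hard branch, a small shift in a feature or threshold changes the output only slightly, which gives some intuition for the variation tolerance claimed in the abstract.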
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionZoned neutral atom architectures are emerging as a promising platform for large-scale quantum computing. Their growing scale, however, creates a critical need for efficient and automated compilation solutions. Yet, existing methods fail to scale to the thousands of qubits these devices promise. State-of-the-art compilers, in particular, suffer from immense memory requirements that limit them to small-scale problems. This work proposes a scalable compilation strategy that "searches smarter, not harder". We introduce Iterative Diving Search (IDS), a goal-directed search algorithm that avoids the memory issues of previous methods, and relaxed routing, an optimization to mitigate atom rearrangement overhead. Our evaluation confirms that this approach compiles circuits with thousands of qubits and, in addition, reduces rearrangement overhead by 28.1% on average.
Workshop
DescriptionWorkshop Website: https://www.qlou.org/SPGAI/
Generative and agentic AI systems are rapidly transforming the semiconductor, electronic design automation (EDA) ecosystem and beyond. Modern design workflows increasingly rely on large language models (LLMs), multimodal generators, and autonomous agentic systems for RTL generation, verification, debugging, documentation, and design-space exploration. As these AI systems interact directly with proprietary IP, sensitive design assets, and automated toolflows, ensuring security, privacy, trustworthiness, and verifiability becomes essential for both academia and industry. This workshop provides the first dedicated DAC forum on secure and private generative and agentic AI, uniting experts across AI security, hardware design, cryptography, and EDA to address the urgent risks and opportunities emerging in this evolving domain.
The workshop covers a broad set of topics central to DAC. We examine model watermarking, fingerprinting, and provenance tracking as mechanisms for protecting AI-generated design artifacts and enforcing model copyright. We highlight recent advances in verifiable computing—such as zero-knowledge proofs and trusted hardware—that provide integrity guarantees for AI-assisted design flows. We also survey major progress in privacy-preserving AI, including fully homomorphic encryption (FHE), secure multi-party computation (MPC), and confidential accelerators capable of supporting scalable execution of LLMs and agentic systems. These technologies are increasingly important for semiconductor companies working across distributed, international, or multi-tenant IP environments.
A key part of the workshop focuses on the expanding threat landscape associated with AI-driven design workflows. We analyze backdoors, jailbreaks, prompt manipulation, data poisoning, model extraction, adversarial evasion, and reconstruction attacks—threats that pose immediate risks to chip design pipelines enabled by generative AI. We further explore hardware-assisted defenses, secure enclaves, runtime monitors, and EDA-driven verification flows. Beyond model-level issues, we emphasize agentic AI risks, including unsafe tool usage, insecure function-calling, compromised memory components, rogue autonomous tasks, and cross-agent interference, especially when such agents interface with simulators, synthesis tools, or automated test frameworks.
This workshop differs from traditional hardware security, which typically focuses on supply-chain risks, microarchitectural attacks, or hardware Trojans. Instead, we address AI-native security challenges arising from the behavior, training, and deployment of generative and agentic models themselves. Topics such as LLM copyright protection, trust boundaries for autonomous agents, cryptographic acceleration for secure inference, and IP protection for AI-generated content represent a new frontier that spans AI, cryptography, architecture, and design automation. This cross-layer approach is essential for trustworthy, AI-enabled semiconductor innovation.
The workshop is highly relevant to DAC as semiconductor and EDA companies integrate AI into core design workflows. Ensuring these AI systems are secure, private, and verifiable is now a fundamental requirement. With its combination of academic rigor, practical insights, leading speakers, and direct relevance to emerging design practices, the workshop will attract a broad audience across AI, hardware, security, and EDA communities. By bringing these domains together, it establishes foundational principles for secure and scalable generative and agentic AI systems.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionCryptographically secure neural network inference typically relies on secure computing techniques such as Secure Multi-Party Computation (MPC), enabling cloud servers to process client inputs without decrypting them. Although prior privacy-preserving inference systems co-design network optimizations with MPC, they remain slow and costly, limiting real-world deployment. A major bottleneck is their use of a single, fixed transformer model for all encrypted inputs, ignoring that different inputs require different model sizes to balance efficiency and accuracy. We present SecureRouter, an end-to-end encrypted routing and inference framework that accelerates secure transformer inference through input-adaptive model selection under encryption. SecureRouter establishes a unified encrypted pipeline that integrates a secure router with an MPC-optimized model pool, enabling coordinated routing, inference, and protocol execution while preserving full data and model confidentiality. The framework includes training-phase and inference-phase components: an MPC-cost-aware secure router that predicts per-model utility and cost from encrypted features, and an MPC-optimized model pool whose architectures and quantization schemes are co-trained to minimize MPC communication and computation overhead. Compared to prior work, SecureRouter reduces latency by 1.95× with negligible accuracy loss, offering a practical path toward scalable and efficient secure AI inference.
Tutorial
DescriptionThis tutorial addresses the unique security challenges of Embodied AI agents — autonomous systems operating in high-stakes cyber-physical, human-centric environments where failures can cause physical harm or privacy breaches. It focuses on the fragility of high-level reasoning and decision-making algorithms, proposing a co-design approach combining hardware architecture with algorithmic strategies for resilient, safety-aware agents. Topics include adversarial attack taxonomies and RL trajectory hijacking, verifiable agentic systems with mathematical guarantees, robust reinforcement learning for high policy integrity, differential privacy and secure federated learning for distributed coordination, and verifiable safety frameworks with robust evaluation metrics for real-world deployment.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionThe Performance Monitoring Unit (PMU) is a significant hardware module of modern processors. Previous studies have disclosed numerous hidden PMU events. However, their underlying microarchitectural semantics remain unexplored, which largely limits their utilization.
This paper proposes a reverse engineering framework for deeply understanding the microarchitectural semantics of hidden PMU events. For demonstration, we successfully utilize the framework to reveal microarchitectural behaviors of branch-related events and discover a hidden pattern of UMask Combination in Intel CPUs. Based on the reverse-engineering results, we implement a PMU-based covert channel for Spectre-v1 in SGX and enable a high-precision website fingerprinting attack under resource-constrained conditions.
People
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionReal-time thermal monitoring of 3D integrated circuits confronts a fundamental observability crisis: reconstructing high-resolution volumetric thermal fields from sparse, heterogeneous on-chip observations. We demonstrate that existing single-modality approaches are physically insufficient: performance counters capture heat generation but are blind to thermal transport, while sparse sensors capture local states but lack spatial coverage. To bridge this gap, we introduce a physics-informed conditional diffusion framework that synergistically fuses these complementary modalities, treating thermal reconstruction as a constrained inference problem over the heat equation. Our 3D architecture explicitly models vertical inter-layer coupling, reducing estimation error by over 94% (0.48°C vs. 13.08°C RMSE) compared to single-source baselines. Furthermore, by leveraging deterministic sampling, we accelerate inference 155× over finite-element simulation (0.95s vs. 147s) while maintaining high physical fidelity (R² > 0.99). This work establishes a new, observability-driven paradigm for thermal monitoring, enabling practical closed-loop dynamic thermal management for next-generation 3D systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSELECT (Sensitivity-based Exploration and Localisation of Event-wise Correlated Tokens for Test generation) targets a core limitation of coverage-directed generation (CDG): wasting simulations on low-impact parameters and invalid tests. SELECT learns which configuration tokens truly matter for activating rare but critical events and focuses generation effort around them. Using Random Forest–style feature importance, event-wise token–event correlation, and failure-aware penalties, SELECT assigns dynamic scores to each token as coverage evolves. A UCB-based multi-armed bandit policy then biases updates toward high-impact, high-uncertainty tokens while still exploring underused parameters. Deployed in production flows for next-generation IBM Power and Telum processor blocks, SELECT achieves about 13% more rare-event hits with ~10% fewer failing runs than a baseline that mutates all tokens uniformly. At the same time, it surfaces a ranked list of impactful tokens per functionality, giving verification engineers interpretable guidance on which configuration dimensions drive coverage growth and which mainly cause wasteful failures.
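The UCB-based token selection mentioned in this abstract can be illustrated with a minimal sketch. It uses textbook UCB1 over hypothetical token names and made-up reward statistics; it is an assumption-laden illustration, not SELECT's actual scoring, which the abstract says also combines feature importance, token-event correlation, and failure-aware penalties.

```python
import math


def ucb_scores(mean_reward, pulls, c=2.0):
    """Return a UCB1 score per token: mean reward plus an exploration bonus."""
    total = sum(pulls.values())
    scores = {}
    for token, n in pulls.items():
        if n == 0:
            # Tokens that were never mutated are explored first.
            scores[token] = math.inf
        else:
            scores[token] = mean_reward[token] + math.sqrt(c * math.log(total) / n)
    return scores


def pick_token(mean_reward, pulls, c=2.0):
    """Pick the token with the highest UCB1 score."""
    scores = ucb_scores(mean_reward, pulls, c)
    return max(scores, key=scores.get)


# Hypothetical statistics: average rare-event reward and mutation count per token.
mean_reward = {"cache_mode": 0.30, "prefetch_depth": 0.10, "ecc_inject": 0.25}
pulls = {"cache_mode": 50, "prefetch_depth": 40, "ecc_inject": 5}

chosen = pick_token(mean_reward, pulls)
print(chosen)
```

With these numbers the rarely mutated token wins because its exploration bonus outweighs the slightly higher mean reward of the frequently mutated one, which is the high-impact, high-uncertainty bias the abstract describes.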
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionAlthough increasingly deployed for in-orbit earth observation (EO) imagery, existing neural image compression, which follows designs made for natural images, struggles because (1) there is a clear gap in semantic structure and (2) computational inconsistencies between in-orbit and ground hardware induce decoding errors. Observing that EO imagery is dominated by large, semantically coherent regions (e.g., forest, urban), we propose SemCom, which uses a semantically partitioned codebook for efficient, region-aware bit allocation. SemCom introduces semantic-adaptive entropy coding with offline probability estimation, ensuring a fixed, platform-agnostic probability model for bit-exact decoding. Experiments on real-world satellite payloads show SemCom outperforms SOTA baselines with low in-orbit complexity.
Analyst Presentation
AI
DescriptionTrace a token through the data path and memory hierarchy of a modern AI accelerator, from tensor cores and cache hierarchies to floor plans and macros, and follow it down into silicon. The SemiAnalysis STEEL team links a chip's micro-architectural topology to its physical implementation, connecting design intent to as-built structure.
SemiAnalysis STEEL operates a full sample-preparation lab (mechanical, wet- and dry-chemical processing, FIB and plasma-FIB) and a comprehensive analytical suite: macro and micro optical microscopy, high-resolution X-ray tomography, field-emission SEM, TEM, and aberration-corrected STEM with EDS. From area budgets, die floor plans, and advanced-packaging architecture down to the structure and materials of individual transistors, we resolve how the most advanced designs on the market are actually built.
People
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionMicrocontrollers lack support for virtual addressing, limiting their ability to employ efficient memory protection mechanisms that rely on a spacious virtual address space. While integrating an MMU could overcome this limitation, it is often impractical due to increased hardware cost and unpredictable latency. To address this, we propose semi-virtual addressing, a lightweight technique that extends the effective address space of microcontrollers without full virtual memory support. This technique introduces a Lightweight Address Translation Module (LATM) that performs deterministic, computation-based address translation without page tables, enabling semi-virtual addressing with minimal translation overhead. Implemented on the RISC-V CORE-V-MCU, LATM expands its 575 KB physical address space to 2 GB of semi-virtual address space. We demonstrate the practicality of semi-virtual addressing by enabling heap randomization and type-safe allocation on the extended address space, efficiently enhancing memory protection.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionIn semiconductor manufacturing, defect analysis is essential, but manual inspection cannot scale, and existing automated inspection methods remain insufficient for root-cause analysis and process optimization. To this end, we present SePArate, a weakly supervised wafer defect segmentation method. SePArate enables pixel-level separation of patterns by leveraging only image-level annotations. Training proceeds in three phases: encoder pretraining, knowledge transfer to learn spatial cues, and training on synthetic mixed-defect data for accurate segmentation. Experiments demonstrate that SePArate significantly outperforms the baselines.
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionWe present a high-effort Boolean optimization framework that restructures multiple nodes in combinational logic simultaneously, enabling area reductions beyond the reach of conventional single-output transformations. Our method generalizes Boolean resubstitution on both nodes and edges to operate across shared logic while leveraging don't-care conditions. Our approach supports coordinated removal and resynthesis of shared logic regions and expands the optimization scope to multiple-output windows. When deployed in an industrial standard-cell design flow, our framework achieves up to 6.79% area reduction and 7.85% switching power reduction post-synthesis. Additionally, it establishes 19 new best results in the EPFL synthesis competition on LUT networks.
People
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionWhile state-of-the-art 3D object detectors achieve high accuracy, their computational cost hinders deployment. Knowledge Distillation (KD) offers compression, but existing methods use separate teacher-student networks, incurring memory overhead with static teachers that cannot adapt to student progress. We introduce SharedKD, unifying pruning and distillation within a single network. The full network acts as a dynamic teacher while a pruned sub-network serves as student, eliminating separate teachers, reducing memory, and enabling co-evolution for effective knowledge transfer. On nuScenes, SharedKD achieves 2.55% higher NDS than prior state-of-the-art at 75% pruning ratio, demonstrating exceptional accuracy-efficiency balance.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIntegrating Generative AI into Universal Verification Methodology (UVM) environments typically faces two critical bottlenecks: "Structural Chaos," where stochastic AI outputs lead to interface mismatches preventing reuse, and "Context Overload," where mixing pin-level details with high-level specifications causes hallucinations. This paper proposes SHELL, a standardized and decoupled framework that resolves these challenges through template-driven generation and a recursive container architecture.
By decoupling DUT Information (for structural connectivity) from Specification Information (for test logic), SHELL automates the creation of standardized wrapper containers while enabling AI agents to generate hallucination-free, spec-compliant sequences. This architecture supports a "Plug-and-Expand" recursive hierarchy, ensuring seamless vertical scalability from submodule to SoC. Validated on an AMD Xilinx Versal™ FPGA Ethernet Subsystem, the proposed framework demonstrates significant efficiency gains: it reduces the cycle from setup to 100% code coverage by 86% (shrinking a 2-week task to 2 days) and minimizes vertical integration overhead with a code modification rate of less than 5%. SHELL effectively transforms AI-driven verification from an experimental concept into a scalable, production-ready methodology.
Engineering Presentation
Design
EDA
Systems
DescriptionAs system behavior increasingly shifts into firmware, pre-silicon verification must extend beyond isolated hardware tests toward realistic system-level execution. Image sensor systems exemplify this challenge, as long-running and state-dependent behavior is largely governed by complex firmware-controlled configuration flows. In practice, however, conventional pre-silicon verification—largely based on short-running RTL simulation—does not scale to such scenarios, leaving critical portions of real usage insufficiently exercised.
This work presents a pre-silicon verification strategy that shifts system-level validation left by executing firmware-driven, randomized usage scenarios within the firmware execution environment. Instead of static, hand-crafted test configurations, realistic usage patterns are explored through firmware-controlled randomization, including frequent mode transitions, extended operational sequences, and long-horizon execution. UVM-accelerated emulation enables these scenarios to be executed at scale, making validation feasible beyond the practical limits of RTL simulation.
Despite the extended execution horizon, verification fidelity is preserved by observing firmware-driven hardware results and validating them against a reference C-model at the system level. As a result, production-equivalent firmware logic can be validated pre-silicon alongside the hardware, enabling realistic system behavior to be explored earlier without sacrificing verification rigor. While demonstrated using an image sensor system, the proposed strategy is broadly applicable to other firmware-controlled SoC subsystems requiring long-running, system-level verification.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionFalcon, a compact post-quantum digital signature scheme, has been selected by the National Institute of Standards and Technology (NIST) for the standardization of post-quantum cryptography. While it effectively mitigates emerging quantum threats, its susceptibility to fault attacks remains insufficiently understood. In this work, we present a novel Rowhammer-based fault attack against Falcon that needs only a single bit flip and reduces the number of required faulty signatures by 800× compared to the state-of-the-art attack. Our solution identifies a new vulnerable address offset in Falcon and optimizes the classical Hidden Parallelepiped Problem (HPP) attack. Our end-to-end attack on the Falcon-512 reference implementation successfully recovers the secret key with only 250k faulty signatures.
People
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionTraditional logic synthesis heuristics have plateaued. We introduce ShortCircuit, a novel transformer-based architecture generating AND-Inverter Graphs (AIGs) directly from truth tables. Unlike prior deep learning efforts, we employ a two-phase training process combining supervised learning and an AlphaZero variant to navigate the doubly exponential state space. Evaluated on 500 8-input truth tables from real-world circuits, ShortCircuit guarantees functional correctness and outperforms the state-of-the-art tool ABC in circuit size. Furthermore, our greedy rollout mechanism achieves a 31× speedup, demonstrating significant efficiency gains.
Additional Meeting
DescriptionJuly 27th, 12:00 PM – 1:00 PM
Come to the OpenAccess Forum 2026 and see what new EDA development is going on within the OpenAccess Coalition. We have several new members this year with some truly innovative ideas. The theme for DAC has been Artificial Intelligence for some time now and this year we expect to see more results from AI application.
Presentations will start at noon, and lunch is provided for the first fifty attendees.
Watch for Si2 DAC updates!
Moderator: Marshall Tiner, Si2
Presentations:
Primarius - DTCO (Design Technology Co-Optimization)
Pulsaris AI - LYRA – LaYout Reshaping Agent
Maieutic - GenAI-Powered Circuit Assistant
Astrus AI - The AI agent for physics-aware chip design
Frontier Design - AI Based Area Router
People
Exhibitor Forum
DescriptionFrom hand-drawn schematics on paper to billion-gate FPGA prototypes—the evolution of semiconductor design verification has been nothing short of revolutionary. This talk takes you on a journey from the early days of 5µm ASICs verified through team meetings and manual timing calculations to today's sophisticated FPGA-based prototyping platforms that handle AI/ML SoCs with unprecedented complexity.
We'll explore how the Veloce proFPGA platform transforms verification workflows through its magnificently modular architecture. Starting with FPGA modules containing cutting-edge devices like AMD's Versal Premium VP1902—the world's largest FPGA-based SoC—the platform scales from desktop configurations to rack-mounted systems supporting up to 60 FPGA modules and 3 billion gates of capacity.
Key innovations include the flexible motherboard ecosystem (Uno, Duo, Quad) that allows incremental capacity expansion, revolutionary blade architectures (QUAD and HEXA Blades) for multi-blade scaling, and breakthrough "Probeless Debug" capability that eliminates the traditional constraint of pre-defining debug signals before verification runs begin.
The session demonstrates how modern FPGA prototyping addresses the critical challenge of declining first silicon success rates while project schedules continue to slip. Through automated partitioning, memory modeling, and debug IP insertion, teams can verify complex designs at speeds impossible with traditional simulation approaches.
Join us to discover how FPGA-based prototyping has evolved from an unimaginable concept in the 1980s to an essential verification methodology enabling today's AI-driven semiconductor innovations. Learn practical strategies for implementing scalable prototyping solutions that accelerate time-to-market while ensuring design quality.
People
Exhibitor Forum
DescriptionAs AI/ML SoCs, chiplet-based architectures, and software-defined systems increase design complexity, verification is no longer defined only by design size. Increasingly, complexity comes from interaction, iteration, and execution context: how IP blocks, protocols, power states, software, and workloads behave when composed into a system. As a result, verification has become a feedback-driven process in which each run can reshape intent, change priorities, expose new assumptions, and determine what must happen next.
Much of today’s verification effort now lives both within the tools and between the runs: planning, static analysis, triage, interpretation, debug, coverage analysis, and re-steering. AI-enhanced verification engines can accelerate specific tasks, while agentic AI can help connect those tasks into context-aware workflows that are easier to guide, analyze, and close. Together, these approaches bring intelligence not only to individual verification engines, but also to the engineering decisions that connect them.
Rather than treating AI as a chatbot or isolated code generator, agentic verification connects design intent, verification plans, RTL, testbenches, assertions, coverage, logs, waveforms, and engine results so AI can help engineers plan, execute, analyze, and refine verification tasks under human-defined governance. The session will examine the foundations required for trusted AI-driven flows, including domain-scoped agents, tool-aware context, persistent verification knowledge, traceability from recommendation to evidence, and orchestration across simulation, formal, emulation, and hardware-assisted validation.
Attendees will leave with a framework for identifying where AI can deliver near-term productivity gains inside verification tools, where agentic workflows can reduce friction across iterations, and how engineering judgment remains central to trusted signoff.
People
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSigma Profiling offers a breakthrough power integrity signoff solution in chip design, overcoming coverage and runtime limitations present in traditional dynamic IR analysis techniques. Conventional methods, such as peak power or tile-based cycle selection, often fail to comprehensively capture high-risk instances due to their reliance on spatial or time averages, frequently missing critical timing paths and localized IR-drop hotspots. Sigma Profiling addresses these challenges by leveraging instance-level analysis with SigmaDvD technology, which efficiently evaluates aggressor-victim interactions and local voltage drop risks across long test vectors.
This advanced approach partitions long stimulus vectors into user-defined slices and calculates instance-level metrics—including logic activity and sigma-based voltage drop. A robust coverage scoring mechanism is then used to pinpoint and prioritize high-impact cycles and instances, enabling highly targeted dynamic analysis. As a result, Sigma Profiling delivers maximum coverage of critical and high-risk design regions, delivering more accurate and actionable signoff feedback.
By combining granular, high-coverage profiling with critical path information, Sigma Profiling facilitates comprehensive detection of major IR-drop hotspots and timing violations. This methodology ensures reliable, efficient power integrity verification for increasingly complex semiconductor devices.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionWeights in neural networks (NNs) are signed. Resistive Random Access Memory (RRAM), though a promising computing-in-memory (CiM) technology for NN acceleration, is inherently unipolar because conductance can only be positive, making it challenging to represent signed weights. Existing mainstream solutions utilize a differential pair of RRAM cells to represent a signed weight, which doubles the array size and significantly increases peripheral circuit cost, limiting the scalability and energy efficiency of CiM accelerators. In this work, instead of assigning a specific sign to each weight, we treat each column of the CiM crossbar as a unit and assign an identical sign to the entire column. To do this, we propose a Sign-Align Training and Mapping (SATM) framework that regularizes the network during training so that all weights mapped to the same column of a CiM crossbar share an identical sign. A per-column Sign Register (PSR) is added to the array's periphery to store the learned 1-bit sign configuration for each respective column. The experimental results show that the proposed SATM achieves up to 49.3% reduction in area overhead and up to 60.7% reduction in energy consumption for AlexNet, ResNet18, VGG16, ResNet50, and ViT-Base models compared to differential-pair implementations.
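The per-column sign idea can be sketched in a few lines. This is a minimal software model under stated assumptions, not the authors' hardware or training flow: the weight matrix is assumed to be already sign-aligned per column, `column_signs` stands in for the per-column sign register, and `crossbar_mvm` is a hypothetical helper that stores only magnitudes (the unipolar conductances) and applies one sign per column after accumulation.

```python
def column_signs(weights):
    """Return one sign (+1 or -1) per column; raise if a column mixes signs."""
    signs = []
    for j in range(len(weights[0])):
        column = [row[j] for row in weights]
        if all(w >= 0 for w in column):
            signs.append(1)
        elif all(w <= 0 for w in column):
            signs.append(-1)
        else:
            raise ValueError(f"column {j} is not sign-aligned")
    return signs


def crossbar_mvm(x, weights):
    """Matrix-vector product using unipolar magnitudes plus a 1-bit sign per column."""
    signs = column_signs(weights)
    magnitudes = [[abs(w) for w in row] for row in weights]  # one cell per weight
    outputs = []
    for j, sign in enumerate(signs):
        accumulated = sum(x[i] * magnitudes[i][j] for i in range(len(x)))
        outputs.append(sign * accumulated)
    return outputs


# Hypothetical 3x2 sign-aligned weight matrix: column 0 non-negative, column 1 non-positive.
weights = [[0.5, -0.2], [0.1, -0.4], [0.0, -0.3]]
x = [1.0, 2.0, 3.0]
y = crossbar_mvm(x, weights)
print(y)
```

In this model each weight occupies one cell plus one shared sign bit per column, whereas a differential-pair mapping would need two cells per weight.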
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionClock frequency modulation schemes are used in high-performance computing systems to mitigate fast di/dt events, which create voltage droop. Validation of such schemes at the pre-silicon and post-silicon levels is necessary to ensure the feature operates as intended. This presentation covers the clock modulation scheme, the method used in pre-silicon verification, and correlation data with post-silicon.
Exhibitor Forum
DescriptionPPA convergence and functional debug often take several months per silicon project when using traditional EDA workflows. In addition, fixing PPA issues can introduce new functional bugs—and vice versa—creating longer iterative cycles that increase schedule risk.
This talk introduces Silimate's AI-native PPA and debug copilots, which are used in production by chip teams at Fortune 500s and chip unicorns to accelerate PPA optimization and functional debugging. We will present the underlying approach along with real customer case studies, including the first public overview of Silimate's PPA copilot (Preqorsor) and debug copilot (SMDB).
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionThough quantum computing theoretically provides exponential computational advantage, an end-to-end quantum speedup should account for encoding, compilation, and many other peripheral processes that rely on classical computers. Among them, readout error mitigation turns out to be the bottleneck, requiring mitigation of noise errors to achieve high fidelity. To this end, leveraging the sparsity in the tensor product has emerged as an appealing approach to improve computational efficiency. However, existing approaches, based on intermediate data pruning or threshold pruning, suffer from limited acceleration and fidelity loss. In this paper, we propose SiTA, a quantum error mitigation accelerator that exploits the inherent sparsity of Hilbert space. The key insight is that the superposition states generated by quantum algorithms cover only a subset of the Hilbert space (output sparsity), and the basis states naturally exhibit bit-level sparsity (input sparsity). Therefore, we introduce a sparse dataflow that features probability-level and state-level parallelism, which effectively skips calculations related to zero values. Finally, we design a hardware architecture that supports various computational paradigms of the tensor product, enabling acceleration for readout error mitigation. Experiments show that SiTA achieves an end-to-end speedup of 4.9× to 1605× over prior mitigation methods, without sacrificing fidelity.
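To make the tensor-product structure concrete, here is a minimal software sketch of readout error mitigation that only visits observed bitstrings. It is a generic illustration, not SiTA's dataflow or hardware: it assumes independent per-qubit errors, and the calibration matrices, counts, and function names are hypothetical.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mitigate(counts, calib, targets):
    """Estimate mitigated quasi-probabilities for the target bitstrings.

    calib[q][measured][prepared] is the assumed per-qubit confusion matrix.
    The full inverse is a tensor product of per-qubit inverses, so each entry
    is a product of 2x2 entries. Only observed bitstrings contribute, which
    skips all zero entries of the noisy distribution.
    """
    shots = sum(counts.values())
    inverses = [invert_2x2(m) for m in calib]
    result = {}
    for target in targets:
        total = 0.0
        for observed, n in counts.items():
            weight = n / shots
            for q, inv in enumerate(inverses):
                weight *= inv[int(target[q])][int(observed[q])]
            total += weight
        result[target] = total
    return result


# Hypothetical calibration: P(read 1 | prepared 0) = 0.1, P(read 0 | prepared 1) = 0.2.
qubit_calib = [[0.9, 0.2], [0.1, 0.8]]
calib = [qubit_calib, qubit_calib]
# Noisy counts consistent with the ideal state 00 under that calibration.
counts = {"00": 8100, "01": 900, "10": 900, "11": 100}

mitigated = mitigate(counts, calib, ["00", "11"])
print(mitigated)
```

The cost scales with the number of observed bitstrings times the number of requested targets rather than with the full 2^n by 2^n inverse matrix.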
People
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionBuilding high-performance Spiking Neural Networks (SNNs) involves a trade-off among competing paradigms. While unsupervised rules like STDP offer efficiency and ANN-to-SNN conversion provides a simple path, direct training with Backpropagation Through Time (BPTT) is the most effective route for achieving high accuracy. However, BPTT's effectiveness comes at a steep price: its memory cost scales linearly with timesteps, creating a resource barrier that limits the information capacity and practicality of SNNs. This makes memory-efficient training methods critical for real-world deployment.
To this end, we propose the Skipping-Side-Tuning (SkiST) framework, which leverages both spatial and temporal redundancy in SNNs for memory-efficient fine-tuning. On the spatial side, SkiST introduces side networks and applies low-rank approximation with adaptive rank filtering to compress trainable parameters. On the temporal side, it employs Dynamic Sparse BPTT, selectively skipping non-critical timesteps during gradient propagation while compensating for information loss with exponentially decayed gradients. Experiments show SkiST reduces GPU memory by up to 50% over full fine-tuning while maintaining competitive accuracy. On an SNN-based language model, SkiST requires 23.2 GB of memory for 1024-token sequences, compared to 40.3 GB for the baseline, with minimal accuracy degradation. This work enables the deployment of adaptable, efficient spiking models on resource-constrained devices.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern ICs face severe IR-drop challenges, leading to timing violations and reliability issues. Traditional mitigation methods like power grid reinforcement or global buffer sizing incur significant area/power overheads, while IRECO often degrades performance due to tight timing budgets. This paper introduces a novel "load splitting" methodology to address localized IR-drop. It strategically decomposes large fanout drivers into multiple smaller drivers, each handling a subset of loads. This approach effectively reduces current density in power grid segments, alleviating voltage drops without extensive power delivery network redesign or consuming valuable timing budget. Experimental results show load splitting significantly reduces IR-drop violations with minimal impact on timing. A TSMC N2 testcase demonstrated a 20% higher fixing rate compared to existing IRECO features, positioning this method as a practical and scalable solution for robust high-performance digital circuits in advanced technology nodes.
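As a rough illustration of decomposing one large-fanout driver into several smaller ones, the sketch below assigns sinks to a fixed number of replacement drivers so that the capacitive load per driver is balanced. The abstract does not say how loads are partitioned (a real flow would presumably also weigh placement and the local power grid), so the greedy balancing heuristic, the function name `split_loads`, and the capacitance units are all assumptions.

```python
def split_loads(loads, num_drivers):
    """Greedy load splitting (hypothetical heuristic, for illustration only).

    loads:       dict mapping sink name -> input capacitance (e.g. in fF)
    num_drivers: number of smaller drivers replacing the original one
    Returns (groups, totals): the sinks assigned to each driver and the
    total capacitance each driver sees.
    """
    groups = [[] for _ in range(num_drivers)]
    totals = [0.0] * num_drivers
    # Place the heaviest sinks first, always on the least-loaded driver.
    for name, cap in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = totals.index(min(totals))
        groups[lightest].append(name)
        totals[lightest] += cap
    return groups, totals
```

For example, `split_loads({"a": 4, "b": 3, "c": 2, "d": 1}, 2)` assigns `a` and `d` to one driver and `b` and `c` to the other, so each driver sees a total of 5.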
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionReal-time (RT) traffic in modern SoCs is often assigned the highest priority to meet strict latency requirements during traffic bursts. However, aggressively servicing urgent RT requests forces immediate DRAM access, breaking row locality and increasing read/write direction switching. This degrades DRAM efficiency and causes non-real-time (NRT) traffic to be repeatedly delayed under shared resource contention, reducing overall system performance.
We propose an SLC-based real-time write buffering mechanism that decouples RT urgency from DRAM scheduling. RT write requests are absorbed by the system-level cache (SLC) and acknowledged immediately, preserving bounded RT latency while allowing RT traffic priority to be safely lowered.
When draining RT data buffered in the SLC to DRAM, deferred writebacks are issued only when the DRAM path is less congested. To further improve memory efficiency, the SLC groups writebacks targeting the same DRAM row and issues them consecutively. A small DRAM Row-Aware Buffer (DRAB) tracks unique row addresses of buffered RT writes and enables row-grouped writeback generation.
This approach reduces priority-driven interference, improves DRAM scheduling efficiency, and allows NRT traffic to make steady forward progress with minimal architectural changes.
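The row-grouped draining described above can be sketched in software as follows. This is a behavioral toy model, not the proposed hardware: the class name, the boolean congestion signal, and the address-to-row mapping (`addr >> 13`, i.e. 8 KiB rows) are assumptions made for illustration.

```python
class RowAwareBuffer:
    """Toy model of a buffer that groups deferred writebacks by DRAM row."""

    def __init__(self, row_shift=13):
        self.row_shift = row_shift  # assumed mapping: row id = addr >> row_shift
        self.rows = {}              # row id -> buffered write addresses

    def absorb(self, addr):
        """Buffer an RT write; the requester is acknowledged immediately."""
        self.rows.setdefault(addr >> self.row_shift, []).append(addr)

    def drain(self, dram_congested):
        """Issue writebacks row by row, but only when DRAM is not congested."""
        if dram_congested:
            return []
        issued = []
        for addrs in self.rows.values():
            issued.extend(addrs)    # same-row writes are issued back to back
        self.rows.clear()
        return issued
```

Writes that arrive interleaved across rows leave the buffer grouped by row, which is what lets the DRAM controller keep a row open across consecutive writebacks.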
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionCross-component cache side-channel attacks exploit shared cache resources across different processing units to infer sensitive information, posing a significant threat to modern heterogeneous computing systems. However, the effectiveness of fine-grained cache attacks across components on Apple silicon remains unexplored.
In this paper, we demonstrate that cross-component cache attacks mounted from the CPU can leak sensitive GPU information via the System-Level Cache (SLC) on the Apple M2 SoC. We first reverse-engineer the key SLC mechanisms required to enable cache attacks. Leveraging this knowledge, we introduce a GPU-to-CPU covert channel and an attack targeting the Large Language Model (LLM) embedding table during local LLM inference.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionMoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate–constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37× and 2.85×, respectively, and improves decode latency by up to 1.81× and 1.64×, while preserving near–high-bit accuracy. These results demonstrate that slice-level caching enables an efficient on-device MoE deployment.
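The idea of serving low-bit and high-bit versions of an expert from the same stored bits can be illustrated with plain bit slicing. The sketch below is a generic nested-truncation example and does not reproduce AMAT's asymmetric scheme, which the abstract does not detail; the 8-bit code width, the 4/4 split, and the function names are assumptions.

```python
def split_slices(code, low_bits=4):
    """Split an unsigned quantized code into (MSB slice, LSB slice)."""
    return code >> low_bits, code & ((1 << low_bits) - 1)


def reconstruct(msb, lsb=None, low_bits=4):
    """Rebuild a code from its slices.

    With only the MSB slice cached, the result is the truncated (low-bit)
    code; adding the LSB slice restores the full-precision code without
    storing a second copy of the weights.
    """
    if lsb is None:
        return msb << low_bits
    return (msb << low_bits) | lsb
```

For the code `0xB7`, the slices are `0xB` and `0x7`; the MSB slice alone reconstructs `0xB0`, and both slices together reconstruct `0xB7`.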
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDeploying production-grade compute infrastructure in resource-constrained IC design startups presents unique challenges not addressed by traditional HPC guidance. We present practical lessons from operating SLURM for 30 compute nodes serving 80 engineers running Cadence and Synopsys EDA workloads, where commercial schedulers such as LSF or Grid Engine impose high licensing costs and require dedicated administrative teams.
Our configuration-as-code approach, documented through 75 git commits over 18 months, addresses EDA-specific constraints including FlexLM license management, mixed interactive and batch workloads, and CI-driven DV regressions generating 5,000+ queued jobs. Key contributions include scheduler guardrails preventing controller deadlocks, license-aware backfill optimization, phased memory enforcement with user adaptation, and a four-layer observability model combining dashboards, automated alerts, preventive health checks, and forensic diagnostics.
Quantified results demonstrate 95% reduction in node failures, 80% backfill success rate, and 99.9% uptime. An LLM-assisted troubleshooting model resolves ~50% of configuration issues, enabling part-time administration. Economic analysis demonstrates $200K+ annual savings versus commercial alternatives, validating open-source SLURM as a viable production solution for startup IC design environments.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionThe increasing size of large language models (LLMs) has led to a surge in memory requirements during training, often exceeding the capacity of high-bandwidth memory (HBM). Swap-based memory optimization incurs neither accuracy loss nor additional end-to-end overhead when effectively overlapped, thus being an attractive solution. However, existing swap methods assume consistent operator sequences, which is impractical in Eager Mode, where operator sequences can vary across iterations.
We propose SmartSwap, which redesigns the end-to-end process of swap-based memory optimization and is the first work to consider varying operator sequences in Eager Mode. SmartSwap (i) introduces a lightweight online profiler to enable continuous profiling for monitoring operator sequences, (ii) generates effective swap policies with limited operator information, and (iii) optimizes the policy execution module for accurate policy application and better performance. Experimental results demonstrate that SmartSwap reduces profiling overhead by 84.25%, enables training models up to 4× larger than hardware memory while adapting to changes in operator sequences, and improves performance by up to 38.94% compared to recomputation or high-degree parallelism.
People
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
DescriptionMixture-of-Experts (MoE) models are attractive for large-scale inference, but sparse activation and skewed, time-varying expert popularity make them cost-inefficient to serve. Serverless computing has become an attractive option owing to its fine-grained resource elasticity and billing. Even so, achieving cost-efficient and SLO-compliant scaling for each expert, given intra- and inter-layer dependencies and concurrent requests, remains challenging. In this paper, we present sMoE, a topology-aware elastic auto-scaler for serverless MoE inference. Our insight is to treat the deployment of an MoE inference pipeline as a DAG of experts and non-MoE segments. Building on this, sMoE is designed as a deep reinforcement learning-based solution. Specifically, it encodes expert- and layer-level runtime features with DAGNN by propagating cross-layer semantics. Coupled with a layer-wise pointer network, sMoE captures intra-layer semantics to jointly decide vertical resources, replica counts, and concurrency settings for every expert at runtime. Experimental results show that sMoE reduces serving cost by 21.4%–39.2% compared to state-of-the-art solutions, while meeting stringent end-to-end P95 latency SLOs.
Work in Progress
DescriptionSpiking neural network (SNN) accelerators offer the promise of energy efficiency and edge computing excellence when compared to traditional neural networks. However, they are vulnerable to faults that can cause catastrophic failures in safety-critical applications. Manufacturing defects, process variations, and malicious perturbations can alter SNN dynamics, leading to significant performance degradation. To this end, we propose a resource-optimized testing methodology for SNNs using a normalized sparse matrix, and the ensemble of Mean Absolute Membrane Potential Deviation (MAMPD) and Mean Absolute Spike Count Difference (MASCD) metrics. Our approach is founded on the key insight that hardware faults induce measurable distribution shifts in neuronal activities, deviating faulty networks from their fault-free counterparts. Experimental validation shows that our sparse test matrix approach, achieving up to 99.994% sparsity, attains 100% fault coverage under bit flips, synaptic stuck-at, and neuron faults, while reducing memory requirements by 43×, test generation time by 150K×, and computational cost by 55× compared to state-of-the-art methods.
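The two metrics named above can be sketched as follows. The abstract gives only their names, so the exact definitions here are assumptions: MAMPD is taken as the mean absolute difference in membrane potential over all timesteps and neurons, and MASCD as the mean absolute difference in per-neuron spike counts, each comparing a network under test against a fault-free reference.

```python
def mampd(v_ref, v_test):
    """Mean Absolute Membrane Potential Deviation (assumed definition).

    v_ref, v_test: membrane potentials indexed as [timestep][neuron].
    """
    diffs = [
        abs(a - b)
        for row_ref, row_test in zip(v_ref, v_test)
        for a, b in zip(row_ref, row_test)
    ]
    return sum(diffs) / len(diffs)


def mascd(spikes_ref, spikes_test):
    """Mean Absolute Spike Count Difference (assumed definition).

    spikes_ref, spikes_test: 0/1 spike trains indexed as [timestep][neuron].
    """
    num_neurons = len(spikes_ref[0])
    counts_ref = [sum(step[i] for step in spikes_ref) for i in range(num_neurons)]
    counts_test = [sum(step[i] for step in spikes_test) for i in range(num_neurons)]
    return sum(abs(a - b) for a, b in zip(counts_ref, counts_test)) / num_neurons
```

A fault-free network yields zero for both metrics on any test input, so a nonzero value on the test stimulus flags a deviation from the reference behavior.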
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionHigh-Level Synthesis (HLS) is a pivotal electronic design automation (EDA) technology that enables the generation of hardware circuits from high-level language descriptions. A critical step in HLS is Design Space Exploration (DSE), which seeks to identify high-quality hardware architectures under given constraints. However, the enormous size of the design space makes DSE computationally prohibitive. Although numerous algorithms have been proposed to accelerate DSE, our extensive experimental studies reveal that no single algorithm consistently achieves Pareto dominance across all problem instances. Consequently, an automated selection mechanism is needed to identify the best-performing DSE algorithm for each specific case. To address this challenge, we propose the SoberDSE framework, which recommends a suitable algorithm based on benchmark characteristics. Experimental results demonstrate that SoberDSE significantly outperforms state-of-the-art heuristic-based DSE algorithms by up to 5.7× and state-of-the-art learning-based DSE methods by up to 4.2×. Furthermore, compared to conventional classification models, SoberDSE delivers superior accuracy in small-sample learning scenarios, with an average improvement of 35.57%. Code and models are available at https://anonymous.4open.science/r/Sober-4377.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis presentation describes DDR memory controller verification at the memory-subsystem level using hierarchical, design-aware Verification IP for recent memory generations such as DDR5 and LPDDR5, and shows how the Cadence memory controller team has used a generic solution to describe the interconnect hierarchy in a modular and simple way. The approach defines a feature and an associated construct to capture the memory subsystem, and implements a handshake mechanism with triggers (such as commands) so that each DRAM model instance gains visibility into the other DRAM devices in the design that share resources such as the data bus and ZQ registers. The presentation also gives examples of how this solution has been used by Cadence memory controller IP and by external customers to strengthen subsystem-level verification while checking protocol compliance against JEDEC-defined specifications for multi-rank memory subsystems in DDR5- and LPDDR5-based designs. The solution can also potentially be applied to other Verification IP models for higher-level protocol compliance checking.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionDiffractive optical neural networks (DONNs) leverage light propagation for efficient analog AI and signal processing. Advances in nanophotonic fabrication and metasurface-based wavefront engineering have opened new pathways to realize high-capacity DONNs across various spectral regimes. However, training such DONN systems to determine the metasurface structures remains challenging. Heuristic methods are fast but oversimplify metasurface modulation, often resulting in physically unrealizable designs and significant performance degradation. Simulation-in-the-loop training optimizes implementable metasurfaces via adjoint methods, but is computationally prohibitive and unscalable. To address these limitations, we propose SP2RINT, a spatially decoupled, progressive training framework that formulates DONN training as a PDE-constrained learning problem. Metasurface responses are first relaxed into freely trainable transfer matrices with a banded structure. We then progressively enforce physical constraints by alternating between transfer matrix training and adjoint-based inverse design, avoiding per-iteration PDE solves while ensuring final physical realizability. To further reduce runtime, we introduce a physics-inspired, spatially decoupled inverse design strategy based on the natural locality of field interactions. This approach partitions the metasurface into independently solvable patches, enabling scalable and parallel inverse design with system-level calibration. Evaluated across diverse DONN training tasks, SP2RINT achieves digital-comparable accuracy while being 1825 times faster than simulation-in-the-loop approaches. By bridging the gap between abstract DONN models and implementable photonic hardware, SP2RINT enables scalable, high-performance training of physically realizable meta-optical neural systems.
People
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionVideo diffusion transformers (vDiTs) generate high-quality video but pay a quadratic self-attention cost, making inference prohibitive at video-token scales. The challenge is input-adaptive sparsity: selecting critical Q/K/V tokens with negligible overhead and executing them for end-to-end gains. We present SPADE, a training-free sparse-attention engine with three parts: (i) vDiT-SSR, a specification defining 3D blocking candidates and formalizing dynamic masks via Summarizer/Estimator expressions; (ii) runtime scheme generation using SICS and a head-wise policy; and (iii) an executor with low-overhead index search, flash block-sparse attention, and kernel grouping. Across Hunyuan-Video and Wan 2.1/2.2, SPADE raises sparsity and speed while preserving quality, accelerating attention by 2.26×–3.40× and end-to-end inference by 1.49×–1.80×.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSparse tensor computation is widely used in deep learning and scientific computing, but its irregular computation and memory access patterns pose significant challenges for general-purpose accelerators such as GPUs and NPUs. FPGAs, owing to their reconfigurability, are inherently well-suited for handling sparse workloads. However, existing point-based manual design paradigms severely limit the performance potential of reconfigurable hardware across diverse sparse scenarios.
To address this, we propose SpArC—an automated compilation framework for sparse tensor computation that automatically generates high-performance FPGA accelerators.
SpArC consists of: (1) a primitive-based DSL serving as the design specification for sparse accelerators, providing a unified abstraction of data formats, iteration pattern, and hardware mapping strategies; (2) a two-stage mapping mechanism centered around sparse meta-operations, enabling arbitrary sparse dataflows to be efficiently mapped to hardware micro-architectures; and (3) a heuristic design space exploration method for joint software–hardware optimization. Experimental results show that SpArC can automatically generate FPGA accelerators for typical sparse workloads such as SpMSpV, Plus3, and SpMM. On the SuiteSparse benchmark dataset, for SpMSpV workloads, SpArC achieves 31.7×–130.4× performance improvement over automated frameworks such as ScaleHLS and Allo. On the same SuiteSparse SpMSpV workloads running on CPU platforms, SpArC further achieves 10.8×–34.7× speedup over TACO and 1.03×–1.89× speedup over MKL.
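For readers unfamiliar with the SpMSpV workload mentioned above (sparse matrix times sparse vector), a minimal software reference is sketched below. It only shows what the kernel computes and why its access pattern is irregular; it says nothing about SpArC's DSL or generated hardware. The column-major dictionary layout and the function name are assumptions.

```python
def spmspv(columns, x):
    """Compute y = A @ x where both A and x are sparse.

    columns: dict mapping column index -> list of (row index, value) pairs,
             i.e. a column-major (CSC-like) view of A
    x:       dict mapping index -> nonzero value
    Returns y as a dict of nonzero accumulations.
    """
    y = {}
    # Only columns selected by nonzeros of x are touched, and each one
    # scatters into data-dependent rows of y.
    for j, xj in x.items():
        for i, a in columns.get(j, ()):
            y[i] = y.get(i, 0.0) + a * xj
    return y
```

With `A = [[1, 0, 2], [0, 3, 0]]` stored as `{0: [(0, 1)], 1: [(1, 3)], 2: [(0, 2)]}` and `x = {0: 2, 2: 1}`, only row 0 receives contributions, and column 1 is never read.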
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionSparse Matrix Multiplication (SpMM) incurs prohibitive performance overheads in Privacy-Preserving computing based on Fully Homomorphic Encryption (FHE), since the encryption of the sparse matrix hides its inherent sparsity pattern, leading to dense matrix computations. To overcome this, we propose SparseE, a hardware-software co-design framework that enables efficient, sparsity-aware SpMM under FHE. Our novel algorithm recasts SpMM into a secure Scatter-Gather-Apply paradigm, using a homomorphic permutation network to perform the critical data gathering based on encrypted indices. This approach ties the computational cost directly to the number of non-zero elements while protecting the sparsity pattern. To further bridge this performance gap, we co-design a dedicated hardware accelerator. Its Homomorphic Permutation Engine adapts the network to a hardware-friendly Benes topology, enabling a deeply pipelined Radix-k MDC architecture that resolves the on-chip bandwidth bottleneck. Concurrently, its Homomorphic Expansion Engine performs on-the-fly decompression of compressed selector ciphertexts, mitigating the massive storage bottleneck. Experimental results demonstrate that SparseE achieves an average speedup of 401.8× and an average energy reduction of 2594.3× compared to state-of-the-art solutions.
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
DescriptionRecent progress in satellite formation flight in Low Earth Orbit (LEO) is driving demand for energy-efficient AI accelerators with strong soft-error resilience. This paper proposes a co-design of a fault-tolerant neural network and hardware that combines tolerance-aware temporal redundancy with sparse outer-product computation. A theoretical bound on single-bit-flip effects in residual-quantized dot products enables selective protection of only the most vulnerable computations. In addition, a zero-skip outer-product unit exploits activation sparsity created by residual quantization to reduce redundancy overhead. Implementation results show that the proposed design achieves a 16.0% speedup over the non-fault-tolerant model while keeping area overhead to 0.20–1.26%.
People
Engineering Special Session
AI
Design
EDA
Systems
DescriptionBronco AI's Spec Reviewer analyzes specs and docs, then identifies and resolves ambiguities and contradictions that would otherwise lead to costly downstream errors. Bronco Spec Reviewer reflects Bronco's vision of accelerating Design Verification from the start of a project to the finish.
People
Work in Progress
DescriptionVerification planning is a critical but time- and effort-intensive stage in modern hardware design flows, requiring engineers to interpret lengthy and complex specifications to derive comprehensive and high-quality verification plans. Moreover, manual plan authoring is an error-prone process that can lead to extended verification cycles and reduced productivity. We present Spec2Plan, a structured LLM-driven framework for generating detailed, coverage-oriented, and robust verification plans directly from hardware specifications. Spec2Plan combines four key components: (i) coverage-guided prompting, (ii) a two-pass plan-and-generate strategy, (iii) testing-technique injection, and (iv) an iterative coverage refinement loop. To assess plan quality, we developed a cross-LLM validation framework that evaluates four key aspects: specification coverage, expert-plan alignment, testing methodology robustness, and plan redundancy. Experiments on six OpenTitan IP blocks demonstrate that Spec2Plan improves significantly on all of the above metrics compared to a one-pass prompting baseline. These results establish Spec2Plan as a practical step toward trustworthy and automation-ready verification planning that can be seamlessly integrated into downstream verification workflows.
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionDisk-based Approximate Nearest Neighbor Search (ANNS) suffers from high I/O latency due to random node access, which dominates over 90% of search time. To address this, we propose SpecANNS, an In-Storage Computing (ISC)-based solution leveraging in-storage FPGAs for fast distance computation. SpecANNS speculatively identifies pages likely to be accessed in subsequent hops and exploits NAND flash parallelism to minimize page read latency. Implemented on real Computational Storage Device (CSD) hardware with a simple interface, SpecANNS significantly reduces query latency and improves energy efficiency compared to state-of-the-art disk-based ANNS methods.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionThe Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training. Our system improves inference throughput by up to 4.30×, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
Work in Progress
DescriptionReinforcement learning (RL) has shown promise in solving the analog circuit sizing problem from simulations, but it often suffers from low sample efficiency and long execution times. We introduce SPEEDY, an actor-critic RL framework that leverages a single-step formulation to improve sample efficiency. SPEEDY accelerates convergence by running parallel, low-cost simulations that incrementally refine the design range. Evaluated on two operational transconductance amplifier topologies and a state-of-the-art low-dropout regulator design, SPEEDY achieves up to 8× improvement in convergence time, while improving the figure of merit with respect to comparable baseline methods.
People
Work in Progress
DescriptionWe present SPICEMixer, a genetic algorithm that synthesizes circuits by directly evolving SPICE netlists. SPICEMixer operates on individual netlist lines, making it compatible with arbitrary components and subcircuits and enabling general-purpose genetic operators: crossover, mutation, and pruning, all applied directly at the netlist level. To support these operators, we normalize each netlist by enforcing consistent net naming (inputs, outputs, supplies, and internal nets) and by sorting components and nets into a fixed order, so that similar circuit structures appear at similar line positions. This normalized netlist format improves the effectiveness of crossover, mutation, and pruning. We demonstrate SPICEMixer by synthesizing standard cells (e.g., NAND2 and latch) and by designing OpAmps that meet specified targets. Across tasks, SPICEMixer matches or exceeds recent synthesis methods while requiring substantially fewer simulations.
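A line-level genetic operator on normalized netlists can be sketched as below. This is a simplified illustration rather than SPICEMixer's implementation: normalization here only collapses whitespace and sorts lines (the consistent net renaming described above is omitted), crossover is a plain single-point splice with the cut supplied by the caller, and both function names are hypothetical.

```python
def normalize(netlist):
    """Return netlist lines in a fixed order (simplified normalization).

    Whitespace is collapsed and lines are sorted so that similar components
    tend to land at similar line positions. Net renaming is NOT performed.
    """
    lines = [" ".join(line.split()) for line in netlist.splitlines() if line.strip()]
    return sorted(lines)


def crossover(parent_a, parent_b, cut):
    """Single-point crossover: lines before `cut` from A, the rest from B."""
    return parent_a[:cut] + parent_b[cut:]
```

Given an inverter-like parent with lines `M1 out in 0 0 nmos` and `M2 out in vdd vdd pmos`, and a second parent that replaces the PMOS with `R1 out vdd 10k`, a cut at position 1 yields a child that keeps the NMOS and takes the resistor load. In a real flow each child would then be simulated and scored.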
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionSpiking Neural Networks (SNNs) are energy-efficient and biologically plausible, ideal for embedded and security-critical systems, yet their adversarial robustness remains open. Existing adversarial attacks often overlook SNNs' bio-plausible dynamics. We propose Spike-PTSD, a biologically inspired adversarial attack framework modeled on abnormal neural firing in Post-Traumatic Stress Disorder (PTSD). It localizes decision-critical layers, selects neurons via hyper/hypoactivation signatures, and optimizes adversarial examples with dual objectives. Across six datasets, three encoding types, and four models, Spike-PTSD achieves over 99% success rates, systematically compromising SNN robustness. Code: https://anonymous.4open.science/r/Anonymous-testcode.
People
Work in Progress
DescriptionDeploying adaptive intelligence at the edge remains challenging due to the high computational and energy cost of training neural models. Spiking neural networks (SNNs) offer a promising alternative, but enabling on-device learning requires hardware–algorithm co-design. This paper presents SpikerLL, an FPGA-based SNN accelerator that extends the open-source Spiker+ inference architecture with efficient support for the STSF local learning rule. Through targeted microarchitectural extensions, SpikerLL performs inference and online learning with minimal overhead. Across MNIST, F-MNIST, and DIGITS, it achieves up to 93% accuracy, sub-millisecond latency, and <0.1 mJ per inference, while remaining DSP-free and highly scalable for edge-FPGA deployments.
Work in Progress
DescriptionWe propose a Split-and-Sync Bayesian learning framework for SRAM compiler optimization under macro-level timing and power constraints. The framework decomposes these global constraints into leaf-cell-level objectives, where each leaf cell (such as a controller, decoder, or peripheral unit) is locally optimized through constraint-aware tuning. Our flow directly adjusts transistor-level parameters (widths and threshold voltages) and automatically regenerates the schematic and layout for each candidate. A bank-adaptive search jointly explores single- and multi-bank organizations, overcoming fixed banking rules. Implemented in a 28nm SRAM compiler, the method achieves a 6.87%-15.8% reduction in dynamic power, a 25.07%-26.45% improvement in access speed, and a 4.5× reduction in total optimization cost.
People
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionDiffusion models have recently demonstrated strong performance in robotic manipulation, standing out for their ability to handle multi-modal action distributions and high-dimensional action spaces. However, their iterative computation limits real-time control on resource-constrained robot platforms. Some existing diffusion accelerators exploit sparsity based on the similarity of adjacent steps, but they suffer from low sparsity across the full network and from hardware inefficiency caused by under-utilization of processing elements (PEs) and neglect of non-matrix-multiplication operators (NMOs), which become more significant in sparse models. To address these issues, we propose SpRDA, an algorithm-architecture co-designed accelerator that achieves high-speed and energy-efficient inference for robotics diffusion models. A fine-grained, difference-aware cache algorithm that combines differential computing and caching is proposed to identify more redundant computation in robotics diffusion, achieving over 90% sparsity with no noticeable accuracy degradation. Furthermore, we apply two-level group sparsity adjustment and design a dynamic sparsity-aware matrix processing array to fully leverage sparsity in hardware. We also propose a graph-based vector unit to process various NMOs with higher throughput and lower area. Compared to state-of-the-art diffusion accelerators, SpRDA achieves 1.56-2.60× speedup and 6.45-7.85× energy efficiency improvement.
People
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionFlash wear raises raw bit error rate (RBER), limiting SSD lifetime by TBW or P/E cycles. We present SSD Reforge, which dynamically strengthens protection near end-of-life by adding intra-device parity, extending usable lifetime. To handle inter-block reliability heterogeneity, we build an RBER-based error-probability model and derive parity stripe grouping as a combinatorial optimization problem. We design Reliability-Aware Striping (RAS) for efficient grouping and Stripe UPER Leveling (SUPERL) to sustain RAS during subsequent writes. Implemented in FEMU and evaluated with real I/O traces, SSD Reforge increases user TBW by up to 40% on average versus traditional mode at the lifetime limit, significantly extending SSD lifespan.
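To illustrate why stripe grouping matters when blocks differ in reliability, here is a minimal sketch. It is not the SSD Reforge model: it assumes each stripe carries a single parity block (so it tolerates exactly one failed block), that block failures are independent, and that per-block failure probabilities are already known. The function names and the example probabilities are hypothetical.

```python
import math

def stripe_failure_prob(block_fail_probs):
    """Probability that a single-parity stripe is unrecoverable.

    Assumption: the stripe survives zero or one failed block, and
    block failures are independent.
    """
    none_fail = math.prod(1.0 - p for p in block_fail_probs)
    exactly_one = sum(
        p * math.prod(1.0 - q for j, q in enumerate(block_fail_probs) if j != i)
        for i, p in enumerate(block_fail_probs)
    )
    return 1.0 - none_fail - exactly_one

def any_stripe_fails(stripes):
    """Probability that at least one stripe in a grouping is unrecoverable."""
    return 1.0 - math.prod(1.0 - stripe_failure_prob(s) for s in stripes)

# Two weak blocks (0.2) and two strong blocks (0.01), grouped two ways.
weak_together = [[0.2, 0.2], [0.01, 0.01]]
weak_spread = [[0.2, 0.01], [0.2, 0.01]]
print(any_stripe_fails(weak_together), any_stripe_fails(weak_spread))
```

Under these assumptions, spreading the weak blocks across stripes gives a lower overall failure probability than pairing them, which is the intuition behind treating grouping as an optimization problem.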
Work in Progress
DescriptionStatic Timing Analysis (STA) plays a critical role in the design of high-performance digital integrated circuits, ensuring reliable operation under varying process conditions. This paper introduces a machine learning (ML) based approach to the problem of selecting relevant timing paths for STA accuracy validation. The proposed method significantly aids timing path classification, identification of important circuit features, and prediction of timing tool accuracy. The approach also provides relevant timing path samples that capture the timing characteristics of the entire design. The final analysis of the tested sample paths' accuracy informs the design methodology recommendations to EDA.
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionLLM reliability is compromised by soft errors, yet existing static defenses fail due to the "Outlier Dichotomy": non-uniform activation patterns between Prefill and Decode phases. We propose STAC, a phase-aware framework that resolves this asymmetry. STAC employs a Spatial Contextualizer to preserve massive functional outliers in Prefill and a token-predictive Temporal Contextualizer to track dynamic drift in Decode. By aligning protection with these intrinsic structures, STAC achieves high fault coverage with negligible overhead. Experiments confirm STAC significantly outperforms static baselines, simultaneously preserving model accuracy and ensuring robustness.
People
Engineering Presentation
EDA
Systems
DescriptionStaged snapshot-based collaborative debug across firmware and hardware environments bridges the gap between fast microcode simulators and slow, but fully observable, system simulations. Our approach captures a minimal architected state (registers and selected memory) from a firmware simulator at post-IML and replays it into RTL system simulation, emulation, and back, using profile-driven, protobuf-encoded snapshots and testbench drivers to reconstruct non-architected state. This decoupled, bidirectional flow enables HW and FW teams to debug the same failure in their native environments, without weeks of manual state reconstruction, reducing debug turnaround time from weeks to hours, while shrinking snapshot size and cutting engineer effort by over an order of magnitude. Deployed on IBM Z17 for HW–FW interaction bugs, the framework offers better debug state quality, and the schema-based design is generalizable to other firmware simulators and architectures, making it practical for next-generation systems.
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
DescriptionWe introduce a machine learning approach for distinguishing between wafers fabricated using a sanctioned/ratified chain of tools on a semiconductor manufacturing floor and unsanctioned/unratified versions of the tool-chain, based on metrology or wafer acceptance test data collected during manufacturing and testing. Our method exploits the systematic nature of process variation and captures the subtle causal relationships between tool exchanges and the resulting changes in physical dimensions or electrical characteristics of the silicon, which can then be used to enable wafer-level tool-chain attestation. The effectiveness of our solution is demonstrated on a dataset of inline and e-test measurements from 7000 wafers fabricated with multiple tool-chain variants in the GlobalFoundries 12LP FinFET technology node.
Research Manuscript
Systems
SYS4. Embedded System Design Tools and Methodologies
DescriptionThe increasing complexity of embedded software has made comprehensive manual testing impractical, motivating the use of automated techniques such as fuzzing. Coverage-guided fuzzers like AFL++ have shown strong results for conventional software but remain challenging to apply effectively in embedded contexts, where peripheral behaviors play critical roles. Existing approaches either use fast user-mode simulators, sacrificing peripheral realism, or rely on full-system simulators with manual instrumentation, limiting applicability to large-scale software.
In this work, we present a novel framework that integrates AFL++ with a stateful SystemC-TLM virtual prototype to enable realistic fuzzing of embedded software. Fuzzer-generated inputs are injected directly into peripheral models, allowing peripherals to trigger natural side effects such as interrupts and FIFO updates.
By integrating fuzzing with full-system simulation, our framework advances the effectiveness of pre-silicon testing for embedded systems. Results on embedded workloads show that our approach eliminates false positives while maintaining code coverage and execution performance comparable to state-of-the-art tools.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThis presentation introduces an AI-augmented Static Timing Analysis (STA) analytics platform that shifts timing closure from a reactive debugging process to a proactive, predictive, and data-driven workflow. The platform automatically ingests and structures large-scale STA datasets, enabling interactive visualization of multi-corner and multi-mode timing behavior. Machine-learning models are applied to predict slack evolution, identify high-risk paths and corners, detect timing anomalies, and estimate ECO effectiveness prior to implementation. By integrating deterministic STA results with AI-driven insights, the approach substantially reduces manual log analysis, accelerates timing closure cycles, and improves sign-off confidence. This methodology allows engineers to quickly prioritize critical timing issues, reduce regression risk, and efficiently scale STA analysis as design size and complexity continue to increase.
Research Manuscript
Systems
SYS4. Embedded System Design Tools and Methodologies
DescriptionSparse tensor operations underpin graph neural networks, scientific simulations, and data analytics, yet irregular sparsity leads to highly uneven tile densities and irregular memory access patterns, preventing GPUs from effectively utilizing Tensor Cores and CUDA cores. We present STCG, a compiler–runtime framework for sparse tensor code generation with adaptive tile scheduling on modern GPUs. STCG introduces a four-level IR that separates algebraic lowering, tiling, micro-kernel construction, and heterogeneous kernel assembly. The compiler emits a unified GPU kernel that embeds both WMMA-based Tensor Core micro-kernels and coordinate-indexed FMA kernels under a consistent memory and warp layout, ensuring stable register and shared-memory usage despite irregular sparsity. At runtime, lightweight tile descriptors enable constant-time, per-tile selection between the WMMA and FMA paths, providing data-aware load balancing and eliminating the need for kernel regeneration or manual tuning. On an NVIDIA A100, STCG achieves a 1.13× geometric-mean speedup over the per-dataset best existing method on SpMM and a 5.06× speedup on SpTTM, demonstrating the effectiveness and advancement of our approach across diverse sparsity patterns.
Engineering Presentation
EDA
DescriptionCoverage-Directed Generation (CDG) accelerates functional verification closure by iteratively optimizing constrained-random test templates that target under-exercised design functionalities, but it requires substantial compute for each optimization simulation phase. STEER (Simulation of Templates Engine for Efficient Resource usage) is a parallel, mini-batched framework that reduces simulations done by CDG while preserving coverage quality-of-results (QoR). STEER partitions each CDG phase into multiple mini-batches to use coverage feedback from completed mini-batches to adaptively select high-potential test templates for simulation. Test template selection is formulated as a multi-armed bandit policy using normalized target-coverage reward to balance exploration of uncertain templates with exploitation of strong performers. To preserve QoR, STEER applies dual stopping criteria during mini-batch selection: (i) a redundancy criterion requiring that at least α templates are recommended multiple times within a selection cycle, indicating a clear high-potential subset relative to unselected candidates that were not recommended; and (ii) a retention criterion requiring that the top η templates by historical average target coverage are always selected, ensuring they are not dropped while the algorithm explores. Across 20+ production design units, STEER reduces simulations by ~35% with practically no QoR degradation.
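The template-selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the bandit policy, so UCB1 is swapped in here as a standard choice, and the function name, parameters (`k`, `eta`, `c`), and example numbers are hypothetical. The redundancy criterion (α) is not modeled.

```python
import math

def select_mini_batch(mean_reward, pulls, k, eta, c=1.0):
    """Pick k template indices for the next mini-batch.

    mean_reward[i]: historical average normalized target coverage of template i.
    pulls[i]: how many times template i has been simulated so far.
    eta: number of top templates by historical average that are always kept
         (a stand-in for the retention criterion).
    Remaining slots are filled by UCB1 score (an assumed policy).
    """
    total = sum(pulls)
    n = len(mean_reward)

    def ucb(i):
        if pulls[i] == 0:
            return float("inf")  # never-simulated templates are explored first
        return mean_reward[i] + c * math.sqrt(2.0 * math.log(total) / pulls[i])

    # Retention: top-eta templates by historical average, ties by lower index.
    batch = sorted(range(n), key=lambda i: (-mean_reward[i], i))[:eta]
    # Exploration/exploitation: fill the rest in descending UCB1 order.
    for i in sorted(range(n), key=lambda i: (-ucb(i), i)):
        if len(batch) >= k:
            break
        if i not in batch:
            batch.append(i)
    return batch

# Four templates; template 3 has never been simulated.
print(select_mini_batch([0.9, 0.1, 0.5, 0.0], [10, 10, 5, 0], k=3, eta=1))
```

In this sketch the best historical performer is retained unconditionally, while the exploration bonus pulls in the untried template and the less-sampled one ahead of a well-sampled weak performer.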
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionFormal Verification (FV) relies on high-quality SystemVerilog Assertions (SVAs), but the manual writing process is slow and error-prone. Existing LLM-based approaches either generate assertions from scratch or ignore structural patterns in hardware designs and expert-crafted assertions. This paper presents STELLAR, the first framework that guides LLM-based SVA generation with structural similarity. STELLAR represents RTL blocks as AST structural fingerprints, retrieves structurally relevant (RTL, SVA) pairs from a knowledge base, and integrates them into structure-guided prompts. Experiments show that STELLAR achieves superior syntax correctness, stylistic alignment, and functional correctness, highlighting structure-aware retrieval as a promising direction for industrial FV.
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionMinkowski distance metrics (Lₚ) are widely used in AI. For hardware, digital designs incur limited parallelism and high energy/latency. Alternatively, content addressable memory (CAM) supports large-scale parallel distance metrics. However, existing CAM-based designs mainly support L₀/L₁, with limited scalability/configurability. In this work, we propose a segmented thermometer-encoded multibit CAM (STEM-CAM) architecture, enabling arbitrary-order Lₚ metrics. Segmenting distance norms with wildcard-augmented thermometer encoding, STEM-CAM enables a structured, scalable encoding for Lₚ. Moreover, an ultra-compact FeFET-based encoding-aware multibit CAM cell is proposed, providing efficient, high-precision implementation. Experiments demonstrate high throughput and area/energy efficiency for Lₚ, showing great potential for distance-based AI.
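As background for the thermometer encoding mentioned above, the sketch below shows the basic property that makes it attractive for CAM-style matching: the Hamming distance between two thermometer codes equals the absolute difference of the encoded values, so summing mismatches over elements gives the L₁ distance. This is plain thermometer coding only; the segmented, wildcard-augmented scheme that STEM-CAM uses for higher-order Lₚ is not reproduced here, and the function names are hypothetical.

```python
def thermometer(value, levels):
    """Encode an integer in [0, levels] as `levels` bits: `value` ones, then zeros."""
    if not 0 <= value <= levels:
        raise ValueError("value out of range")
    return [1] * value + [0] * (levels - value)

def hamming(a, b):
    """Number of bit positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def l1_via_thermometer(u, v, levels):
    """L1 distance between integer vectors, computed only from bit mismatches."""
    return sum(
        hamming(thermometer(a, levels), thermometer(b, levels))
        for a, b in zip(u, v)
    )

# Bit mismatches reproduce |3-1| + |0-4| + |7-7|.
print(l1_via_thermometer([3, 0, 7], [1, 4, 7], levels=7))
```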
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionAutomatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionVerification flows use Verification IPs (VIPs) to identify assertion violations. This is well-suited for on-chip protocols (e.g., AXI, AHB) with standard specifications (e.g., AMBA) for developers to write assertions that detect violations. Patching these violations, however, still requires manual effort. A good patch must not only fix the violation but also preserve functionality (e.g., pass the testsuite) without causing new violations. We propose Stitch, based on the intuition that, given an implementation, a violated assertion, and a counterexample, an LLM can try to synthesize patches. To evaluate Stitch, we develop a new dataset that comprises 100 violations in 11 implementations of 5 protocols (AXI, AHB, APB, Wishbone, and TileLink). Our first experiment reports a 19% success rate across 4 LLMs (GPT4, GPT5, Gemini, and Claude). By analyzing the failed cases, we devise 3 improvement strategies: patch localization, violation-specific context (cone of influence, counterexample), and iterative feedback from model checker outputs over candidate patches. Our experiments show these strategies help Stitch achieve 61% patch success, with GPT5 dominating with 56%. We validate Stitch patches through simulation and on real hardware. We compare Stitch to 3 state-of-the-art non-LLM tools and existing patches.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionAutomatic transistor sizing remains a significant challenge in modern analog and mixed-signal circuit design. While recent AI-driven methods show promise, they remain sample-inefficient, largely because they struggle to effectively incorporate expert knowledge and generalize across different designs. We propose Stochastic Diffusion Prior (SDP), a framework that systematically bridges this gap. SDP elicits domain expertise through an intuitive interface based on the established gm/ID methodology, formally integrating this knowledge into the optimization. Critically, SDP is not a static prior; it enables dynamic adaptation through a novel mutual information-guided diffusion process, allowing the prior to be refined and corrected by new simulation data. SDP can be integrated with virtually any optimization method, enhancing its sample efficiency and exploration capabilities. Experimental validation on practical analog circuits demonstrates SDP's superiority over state-of-the-art approaches. When integrated with a simple Vanilla BO, SDP achieves up to 92.3× speedup (27.9× on average) while improving key performance metrics by up to 2.0× (1.5× on average). The approach maintains high effectiveness even with imperfect prior knowledge and successfully enables knowledge transfer between technology nodes, positioning it as a versatile and robust enhancement for circuit optimization.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionThis paper proposes the Stone-Skimming microarchitectural attack, the first page table walk (PTW)-based attack targeting large-scale sensitive memory applications. Stone-Skimming leverages the correlation among multiple signals to establish novel side channel attacks. We develop the first microarchitectural attack targeting DNA sequence reconstruction. Multiple PTW entries generated by the victim are cross-validated, enabling robust DNA reconstruction where traditional attacks fail. Moreover, we discover a new stateful side effect of PTW on prefetch instructions that can be exploited to break KASLR, even with KPTI on. We also introduce the early-instruction-fetch (early-IF) attack capable of bypassing LFENCE and CPUID defense.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionAdversarial patches are physically realizable, localized noise that can hijack Vision Transformer (ViT) self-attention, pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that tokens corresponding to image regions containing adversarial noise have different statistical properties from tokens that do not overlap the adversarial perturbation. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric to segregate anomalous tokens in the Detection Phase and then applies randomized composite transformations to them in the Mitigation Phase to render the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter of the defense and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable, plug-and-play block within ViT architectures, for inference only, with minimal computational cost and no additional training cost or effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101) across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA, and RP2), and is found to provide robust accuracies within 2-3% of the clean baselines and to outperform the state-of-the-art.
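For readers unfamiliar with the metric, the sketch below computes the Jensen-Shannon divergence between discrete distributions and uses it to flag outlier tokens. The divergence itself is standard; how STRAP-ViT derives per-token distributions and what reference it compares against are not stated in the abstract, so the comparison against the mean distribution and the names `flag_tokens` and `k` are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def flag_tokens(token_dists, k):
    """Return indices of the k tokens whose distribution diverges most
    from the mean distribution over all tokens (assumed reference)."""
    n = len(token_dists)
    mean = [sum(col) / n for col in zip(*token_dists)]
    scores = [js_divergence(d, mean) for d in token_dists]
    return sorted(range(n), key=lambda i: (-scores[i], i))[:k]

# Three similar tokens and one outlier (index 3).
tokens = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]
print(flag_tokens(tokens, k=1))
```

Because the mixture `m` is nonzero wherever either input is nonzero, the divergence stays finite even when the two distributions have disjoint support, which is one reason it is often preferred over raw KL divergence for anomaly scoring.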
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionDNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent.
Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only a 1.04× slowdown and minimal hardware overhead.
People
Engineering Presentation
EDA
DescriptionModern Clock Tree Synthesis (CTS) relies on heterogeneous clock cells with wide variations in physical width, drive strength, and area. However, existing CTS placement controls such as density-based halos apply uniform or linearly scaled spacing that is largely agnostic to clock tree structure and stage-specific requirements. This often leads to suboptimal clock cell clustering, irregular latency gradients, increased power, and degraded clock skew consistency across corners.
This paper presents a structure-aware CTS methodology that leverages cell-width–based classification to guide clock cell placement using explicit CTS halo attributes. By mapping physical base-cell widths of buffers and inverters to non-uniform X-direction clock halos, the proposed approach implicitly encodes top, trunk, and leaf placement intent during CTS construction.
The proposed method is implemented entirely using tool-native CTS attributes and requires no modification to the CTS engine. Experimental results on a production-scale design demonstrate reductions in clock cell count, clock area, wirelength, and congestion, along with improved latency smoothness and multi-corner clock skew consistency, with negligible impact on target skew. This work shows that physical-aware halo modulation provides an effective and scalable mechanism for robust clock tree optimization.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionSystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of real-world SVA corpora and the lack of methods to promote NL-SVA semantic equivalence. For the former, large-scale RTL code is used to guide LLMs to generate real-world SVAs; for the latter, bidirectional NL-SVA translation maintains semantic consistency. With the synthesized data, we train SVACoder, a series of SVA generation models. Notably, SVACoder-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.
People
Work in Progress
DescriptionAutonomous Navigation Systems (ANS) incorporate many safety-critical functions, such as collision avoidance. Recent studies have shown how remote clock/voltage glitch injections pose an imminent threat to mission-sensitive modules in the autonomous navigation domain: timing/power perturbations in the perception stages can cascade into severe accuracy loss and latency drift for downstream tasks. In this paper, we present Swift-Healer, a firmware-reconfigurable self-healing architecture that unifies prediction-detection modules and an automated healing unit to mitigate remote clock/voltage glitches, while satisfying the application latency constraints. Our solution leverages a chiplet-based architecture that offers isolation from compromised hardware modules, while enabling self-healing in the firmware management layer. Our proposed design incorporates an autonomous monitor that is a combination of a glitch predictor and a reactive detector. We implement our design on a Zynq-7000 with a hardware accelerator, where Swift-Healer predicts glitches within autonomous driving kernels up to two real-time loop iterations earlier (≈ 0.06 ms), thereby giving abundant time for self-healing (typically closer to 0.005 ms); if a prediction is below the confidence threshold, the reactive detector flags the fault and deploys the healing module rapidly. The system restores perception to its regular 0.03 ms latency, holds steady-state power at 1.95 W, and exhibits transient peaks up to ∼2.44 W during the self-healing process.
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionFault-tolerant quantum computing relies on low-depth and high-fidelity syndrome extraction in quantum error correcting codes. We propose a scheduling strategy for syndrome extraction circuits in the broad class of quasi-abelian lifted-product codes, encompassing both hypergraph product and bivariate bicycle codes. Our approach constructs syndrome extraction circuits with CNOT depths no more than one layer above the fundamental lower bound, frequently achieving optimal depth. A pipelined variant further reduces the average depth per round, and the strategy generalizes naturally to higher-dimensional product codes. This scheduling framework enables computational speedups for quantum computing architectures built on these codes, all while preserving high-fidelity error correction.
Exhibitor Forum
DescriptionMemory is currently the biggest bottleneck for AI hardware. In the datacenter it is HBM bandwidth and capacity, and at the edge it is the energy cost of moving weights. While the datacenter race is crowded and well-funded, the edge is the next frontier for AI hardware, and a faster bus does not address the problem there. This session looks at a new class of startups attacking the problem at its root, by computing directly inside non-volatile memory so model weights stop making the costly trip to the processor, all on mature, standard CMOS nodes. The session will discuss AI Memory startup journeys from concept to customer adoption, and the business models that enable their success. Learn from startup founders and technical leaders how enabling CMOS compatibility, a clean path to integration, and ecosystem partnerships that make the IP real, drive end-customer adoption and a faster path to success.
Exhibitor Forum
DescriptionAs advanced-node and multi-die designs push more functionality into tighter integration, chips are increasingly constrained by tightly coupled thermal, electrical, and electromagnetic effects that are difficult to predict and can no longer be treated as late-stage concerns. Join Synopsys for a panel discussion on how multiphysics fusion enables a shift to physics-aware co-design—integrating analysis earlier and across domains—to improve accuracy, reduce unnecessary margins, and accelerate convergence for next-generation AI and high-performance computing systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionSemiconductor chips today form the technological foundation of nearly every industry, from consumer electronics and automotive systems to artificial intelligence and high-performance computing. This widespread demand has driven significant advancements in semiconductor manufacturing and design methodologies, while intensifying the challenges faced by engineers developing highly integrated System-on-Chip (SoC) architectures.
Modern SoCs comprise multiple sub-designs and blocks made up of hundreds of complex intellectual property (IP) blocks, CPUs, GPUs, NPUs, and advanced processing units. This intricate architecture requires a complex clock distribution network, with multiple asynchronous domains operating at varying frequencies and voltages. The need to deliver these complex, clock-intensive devices within aggressive schedules further amplifies the need for strategies that effectively verify timing constraints and achieve precise timing closure before final tape-out to ensure silicon correctness and long-term reliability. Among these, Static Timing Analysis (STA) and Gate-Level Simulation (GLS) represent two of the most crucial checkpoints. Any oversight in these phases can result in complex Engineering Change Orders (ECOs), post-silicon failures, or degradation of Power, Performance, and Area (PPA) margins.
However, STA lacks an intrinsic mechanism to verify the correctness or completeness of the applied timing constraints, relying instead on user-defined inputs that may not always reflect the true design intent. GLS, on the other hand, is computationally intensive and often involves multiple iterative debug cycles that increase turnaround time and jeopardize project schedules. To address these inefficiencies, there is a growing demand for enhanced methodologies that improve constraint efficiency, uphold timing quality, and meet aggressive time-to-market goals.
The Timing Constraint Verification (TCV) methodology addresses this need by providing a comprehensive framework that ensures the accuracy, consistency, and completeness of timing constraints across the RTL-to-GDSII design flow. By fostering close collaboration among design, verification, and backend teams, TCV facilitates the analysis, propagation, and integration of high-quality SDCs, leading to improved efficiency, reduced design risk, and better overall SoC reliability.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe methods traditionally used to analyse dynamic voltage drop on complex automotive SoCs are becoming increasingly inadequate in identifying important local simultaneous switching noise conditions that affect timing. Designers face significant challenges in EMIR signoff for such SoCs.
Coverage is severely limited for local switching noise, as most of the voltage drop arises from aggressor activity around a victim cell and far less from the cell's own switching. Transient EMIR analysis is very sensitive to the simulation timestep, which should be small (5-10 ps) to ensure accuracy; the resulting runtime limits how much local noise coverage can be achieved. For VCD-based flows, it is practically impossible to obtain vectors that give 100% IR-drop coverage.
To overcome these local noise coverage limitations in traditional EMIR analysis, we present a novel aggressor-coverage-based sigma methodology that calculates realistic dynamic IR drop on all instances by considering the collection of all possible aggressors for each victim instance. It gives designers debug visibility so that violations can be fixed early. Our methodology both reduces runtime and uncovers IR hotspots missed by the traditional approach, improving PDN robustness. Our results show high local noise coverage along with a 2X runtime improvement, allowing EMIR to be signed off with high confidence.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionWe propose TACo, a training-free, hardware-aware architecture search framework for Vision Transformers (ViTs), powered by a Hypervolume-based Unified Zero-Cost Score (HV-score) that integrates four accuracy- and hardware-related dimensions: a newly proposed zero-cost accuracy score (ZAS), latency, energy, and activation memory. TACo rapidly estimates the potential of candidate ViTs without training: ZAS captures layer-wise gradient stability, weight–gradient coupling, and gradient-modulated activation stability, while the HV-score guides multi-objective selection using Pareto and hypervolume principles. Experiments on CIFAR-10 and ImageNet-1K show that TACo reduces search time from GPU-days to GPU-hours while identifying a well-balanced ViT architecture in terms of accuracy and computational cost.
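As a rough illustration of the Pareto and hypervolume principles mentioned in this abstract, the sketch below computes a two-objective Pareto front and its hypervolume against a reference point. It is not TACo's HV-score, which combines four dimensions. The function names, the choice of two minimized objectives, and the sample values are assumptions made only for this example.

```python
def pareto_front(points):
    """Return the non-dominated points when both objectives are minimized."""
    front = []
    for p in points:
        # p is dominated if another point is no worse in both objectives.
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref):
    """Area dominated by the Pareto front and bounded by the reference point."""
    area = 0.0
    prev_y = ref[1]
    # The front is sorted by the first objective, so the second one decreases.
    for x, y in pareto_front(points):
        if x >= ref[0] or y >= ref[1]:
            continue  # points outside the reference box add no area
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area


# Hypothetical (latency, error) pairs for four candidate architectures.
candidates = [(1, 3), (2, 2), (3, 1), (3, 3)]
front = pareto_front(candidates)
volume = hypervolume_2d(candidates, ref=(4, 4))
```

A larger hypervolume means the candidate set covers more of the trade-off space, which is why the indicator is commonly used to rank sets of multi-objective solutions.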
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionGraph Neural Networks (GNNs) offer powerful graph-structured data modeling capabilities, yet their acceleration is challenging. Rigid hardware parallelism struggles to accommodate algorithmic diversity and the sparse, irregular topology inherent to GNNs. To resolve this conflict, we propose TAG, a topology-aware GNN accelerator that achieves high performance through synergistic innovations at three levels: dataflow, scheduling, and memory hierarchy. For algorithmic diversity, we employ a configurable, topology-driven dataflow that is aware of both algorithmic needs and graph structure. To mitigate irregularity, a contention-aware scheduler orchestrates irregular memory accesses by reordering them into a conflict-free stream. Furthermore, an algorithm-architecture co-designed memory hierarchy, combined with a coarse-to-fine graph partitioning algorithm, maximizes data reuse from sparse graphs and significantly reduces off-chip traffic. Evaluations demonstrate that TAG achieves an average 3.22x speedup and 3.04x higher energy efficiency over state-of-the-art GNN accelerators.
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionSecure multi-party computation (MPC) offers a practical foundation for privacy-preserving machine learning at the edge. However, current MPC systems rely heavily on communication- and computation-intensive primitives, such as secure comparison, for nonlinear operations, which are often impractical on resource-constrained platforms. To enable real-time secure inference, we introduce a highly efficient, TEE-accelerated framework for secure comparison. Specifically, we reduce communication cost by redesigning the core primitives, leaf comparison and merge, so that each completes in a single round of interaction, reducing the round complexity from log(n) to just 1 per operation. Furthermore, unlike prior work that heavily relies on Oblivious Transfer (OT), a well-known computational bottleneck, we leverage synchronized seeds inside the TEE to eliminate OT for the vast majority of our designs, along with a correlated-randomness reuse technique that keeps the new designs computationally lightweight. To fully realize this potential, we design a specialized accelerator that restructures the dataflow across stages to enable continuous, fine-grained streaming and high parallelism, reducing memory overhead. Our design achieves up to 4.86x speedup on ResNet-50 inference compared with state-of-the-art CNN frameworks, and up to 7.44x speedup on BERT-base inference compared with state-of-the-art LLM frameworks.
Engineering Presentation
Design
EDA
DescriptionFormal verification plays a critical role in modern hardware development by delivering exhaustive, mathematically rigorous validation to ensure functional correctness. As digital architectures scale in complexity, driven by advanced features and the growing demand for AI-centric workloads, formal verification faces major hurdles, including state space explosion and difficulty achieving convergence. GPU compression blocks exemplify this trend: originally designed for efficient data storage and bandwidth reduction, these blocks have evolved to include intricate control and data path logic, multi-cache line handling, and interdependent compression/decompression flows. The addition of AI-driven features has further increased gate counts and buffer sizes, amplifying verification complexity.
To address these challenges, we implemented a structured formal verification methodology that scales to complex designs. Our approach divides the system into smaller, focused blocks, applies abstraction and parameter reduction to manage state space, and leverages incremental verification and symbolic coding to deal with complexity. These techniques enabled us to improve convergence rates from under 6% to nearly 83% and to uncover critical issues that would be difficult to detect through simulation or without these techniques. This approach delivers a scalable, resilient verification framework that strengthens GPU architecture reliability while enabling secure, efficient processing for advanced AI-driven workloads in future systems.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionTargeted bit-flip attacks (BFAs) exploit hardware faults to manipulate model parameters, posing a significant security threat. While prior work targets single-step inference models (e.g., image classifiers), LLM-based agents with multi-stage pipelines and external tools present new attack surfaces, which remain unexplored. This work introduces Flip-Agent, the first targeted BFA framework for LLM-based agents, manipulating both final outputs and tool invocations. Our experiments show that Flip-Agent significantly outperforms existing targeted BFAs on real-world agent tasks, revealing a critical vulnerability in LLM-based agent systems.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn modern DFT (Design-for-Test) methodologies, wrappers play a crucial role in enabling core-level testing, test data reuse, and test compression. Traditionally, wrapper parameter tuning — such as chain partitioning, compression ratio, and wrapper insertion effort — relies on fixed heuristics or manual adjustments. This often leads to suboptimal reuse, high overhead, and multiple costly iterations during test integration.
The proposed idea introduces an ML-driven Smart Wrapper Tuning framework that learns from past runs and predicts optimized wrapper configurations for upcoming runs. By leveraging Bayesian Optimization, Reinforcement Learning, or Regression Models, the system can intelligently balance maximum reuse (to reduce redundant test logic) against minimum overhead (to maintain area, timing, and power budgets).
Each wrapper run generates logs (reuse %, overhead %, coverage, runtime, etc.), which are fed back into the learning engine. The engine then builds predictive models to guide future runs, effectively creating a closed-loop adaptive system. This reduces trial-and-error iterations, improves wrapper reuse efficiency, and accelerates convergence in large-scale SoCs.
The novelty lies in embedding data-driven intelligence within the DFT flow:
• A DFT-specific reward function (reuse – overhead)
• Cross-design learning for faster ramp-up on new projects
• Adaptive tuning that improves continuously across iterations
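The reward idea in the list above (reuse minus overhead) can be sketched in a few lines. This is a minimal illustration, not the authors' framework: the log field names, the optional `overhead_weight` parameter, and the sample runs are all hypothetical.

```python
def wrapper_reward(reuse_pct, overhead_pct, overhead_weight=1.0):
    """Score one wrapper run: reward reuse, penalize overhead."""
    return reuse_pct - overhead_weight * overhead_pct


def best_run(history, overhead_weight=1.0):
    """Pick the logged run with the highest reward."""
    return max(
        history,
        key=lambda run: wrapper_reward(
            run["reuse_pct"], run["overhead_pct"], overhead_weight
        ),
    )


# Hypothetical logs from three earlier wrapper runs.
history = [
    {"config": "A", "reuse_pct": 70.0, "overhead_pct": 5.0},
    {"config": "B", "reuse_pct": 85.0, "overhead_pct": 25.0},
    {"config": "C", "reuse_pct": 80.0, "overhead_pct": 8.0},
]
chosen = best_run(history)
```

In a closed-loop setup, a score of this kind would be the quantity that an optimizer or learned model tries to maximize when proposing the next configuration.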
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionDynamic Heterogeneous Graph Neural Networks (DHGNNs) effectively capture complex and evolving structural and semantic information through metapath learning. Despite notable advancements, current solutions still suffer from redundant metapath matching. To overcome this challenge, we introduce TDH-GNN, a topology-driven accelerator tailored for high-performance DHGNN inference. Specifically, we integrate an efficient hyperedge-centric incremental execution approach into the accelerator design, using hyperedges to encapsulate dependencies among metapath instances, which enables incremental metapath updates and reduces redundant matching and computation. Implemented on a Xilinx U280, TDH-GNN delivers average speedups of 4.3x and 2.8x, along with 5.9x and 3.7x energy savings, over leading HGNN accelerators.
People
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionFunction-as-a-Service (FaaS) has become a de facto paradigm for deploying cloud applications, yet it exposes sensitive data to untrusted infrastructure. Running functions on confidential virtual machines (CVMs) such as Intel TDX provides strong isolation for data but significantly increases cold-start latency due to CVM initialization. This paper presents TDSnap, the first TDX-aware function snapshot system that records a verified function state with its TD and restores it on demand. Using snapshot-bounded attestation and TD-aware working-set identification, TDSnap preserves TDX isolation while reducing cold-start latency. Evaluation shows that TDSnap reduces cold-start latency by up to 40x and matches non-TDX snapshot systems.
Work in Progress
DescriptionIn this paper, we present a process-aware Design–Technology Co-Optimization (DTCO) framework that rapidly integrates transistor-level process shifts into circuit-level Power–Performance–Area (PPA) analysis. The approach combines Neural Compact Models (NCM) with polar level-set clustering to identify and accelerate process-dependent optima within the PPA trade-off space. A semi-supervised NCM retargeting scheme reduces SPICE model development time by 98% while preserving model consistency and capturing median electrical shifts across diverse electrical targets, enabling the fast generation of technology-specific SPICE libraries and their corresponding PPA spaces. By mapping these libraries into a polar PPA domain, Pareto-based level-set clustering partitions technology behaviors into radial performance shells, laying the foundation for a radial multi-objective optimization that effectively reduces the dimensionality of subsequent design-space exploration and significantly increases the proportion of truly Pareto-optimal solutions within the elite population. This unified NCM–polar-clustering framework delivers interpretable, process-aware DTCO without TCAD or statistical sampling, and can be directly calibrated with fab data for immediate industrial deployment.
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
DescriptionRecent advances in Microsoft's ternary BitNet, restricting weights to {-1, 0, +1}, have highlighted the potential of low-bit large language models (LLMs).
Ideally, each ternary weight requires only log_2(3), approximately 1.58 bits. However, existing inference systems still use redundant bit representations for computational efficiency.
To push the compression boundary and maintain efficiency, we propose TernaInfer, a GPU inference framework for ternary LLMs that integrates lossless compression with ternary-optimized sparse matrix multiplication.
To our knowledge, TernaInfer is the first work to realize 1.58 bits per weight on the ternary BitNet while providing a 1.53x throughput improvement over half-precision GPU inference.
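To make the 1.58-bit figure concrete, the sketch below shows a simple fixed-width packing that stores five ternary weights per byte (3^5 = 243 fits in 256 values), giving 1.6 bits per weight, slightly above the log_2(3) bound. This is a generic base-3 packing chosen for illustration, not TernaInfer's compression scheme; the function names are hypothetical.

```python
import math


def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) five at a time into single bytes."""
    packed = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        # Build a base-3 number; the first weight ends up as the lowest digit.
        for w in reversed(weights[i:i + 5]):
            value = value * 3 + (w + 1)  # map {-1, 0, +1} to {0, 1, 2}
        packed.append(value)
    return bytes(packed)


def unpack_trits(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    out = []
    for byte in packed:
        for _ in range(5):
            out.append(byte % 3 - 1)
            byte //= 3
    return out[:count]


weights = [-1, 0, 1, 1, 0, -1, -1, 1]
packed = pack_trits(weights)
restored = unpack_trits(packed, len(weights))
ideal_bits = math.log2(3)   # information-theoretic bound per ternary weight
packed_bits = 8 / 5         # cost of the five-per-byte packing
```

Closing the remaining gap between 1.6 and roughly 1.585 bits per weight requires a denser encoding than a per-byte scheme like this one.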
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
DescriptionOptimal Test Point Insertion (TPI) to reduce pattern count is NP-hard. Traditional testability metrics fail in reconvergent fanout structures. Machine learning solutions suffer from costly ATPG labels (SL) or high runtime overhead and instability (RL). We propose an ATPG-free Test Point Insertion approach using self-supervised learning (SSL). We train an autoencoder on bit sequences to capture functional embeddings. A propagation model, trained via SSL, reliably propagates embeddings across reconvergent regions. Controllability is inferred from the embeddings; observability is estimated via saliency maps. Our pipeline eliminates ATPG dependency and RL overhead for effective TPI.
Engineering Special Session
AI
EDA
Description-
Work in Progress
DescriptionDeep Neural Network (DNN) edge devices are increasingly vulnerable to severe hardware security threats, particularly side-channel attacks (SCA) and fault injection attacks (FIA). This paper, for the first time, presents a combined side-channel and fault injection attack framework targeting the ArgMax unit—the hardware component responsible for converting output probabilities into the final class prediction and serving as the decision logic in DNN inference. An adversary can recover intermediate prediction values through SCA and subsequently manipulate the ArgMax decision via FIA to induce targeted misclassification. On an unprotected ArgMax, our attack achieved a 56.92% targeted misclassification success rate with full controllability across all classes, demonstrating its practical feasibility and high threat potential. To counter these threats, we introduce Shuffled-ArgMax, a lightweight defense scheme designed to resist combined SCA–FIA attacks. Using the power side-channel and voltage fault injection capabilities of the hardware security evaluation platform, CrackNuts, we evaluate a convolutional neural network deployed on an STM32F407 microcontroller. Experimental results demonstrate that Shuffled-ArgMax significantly suppresses side-channel leakage and enhances robustness against practical fault injection attacks.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionDigital verification of integrated circuits faces critical challenges such as achieving extremely high functional and code coverage with full reliability, managing the complexity and significant time investment required by formal and UVM environments, and accelerating debugging processes to meet increasingly tight project schedules.
This paper presents a well-structured digital verification flow built around the ChipAgentsAI platform, illustrating a complete methodology that leverages agentic AI to correctly execute all key steps in the IC design verification lifecycle: from understanding the specification, to generating verification code and test plans, through to debugging and results analysis.
By drastically reducing manual effort and minimizing human errors, the solution improves both efficiency and verification quality and delivers unprecedented levels of automation, effectively addressing the growing complexity of modern IC designs.
In practice, the application of agentic AI tools enabled the discovery of previously unnoticed specification bugs, the generation of a larger number of assertions and covergroups, and the attainment of very high code and functional coverage metrics.
Notably, specification reading time was reduced by 15×, formal assertion generation by 240×, and UVM environment creation by 400×, significantly shortening the overall verification timeline.
These gains are achieved under the full control of the verification engineer: the tool executes only what it is explicitly asked to do and, starting from a clear and complete specification, the engineer guides the flow step by step and remains central in using ChipAgentsAI as a powerful ally to simplify, accelerate, and increase the accuracy of the verification process.
Engineering Special Session
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe Next Decade of Semiconductors and Transformation Through Industry, Academia, and Government is a panel bringing together leaders from industry, academia, and government to discuss the trends shaping semiconductor innovation and the ecosystem required to scale it.
The panel will focus on four major themes defining the next decade: EDA and evolving design methodologies, including the growing role of open-source ecosystems; advanced memory and storage architectures needed to support AI-driven data movement and processing; post-silicon advances that extend performance beyond traditional scaling; and the future of semiconductor manufacturing and workforce readiness, including stronger fab connectivity, talent pipelines, and skills development to sustain global competitiveness and long-term innovation.
This panel session is organized by IEEE Global Semiconductors.
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionPeripheral Component Interconnect Express (PCIe) is the de facto interconnect standard for high-speed peripherals and CPUs. The development of PCIe devices for emerging applications requires realistic Transaction Layer Packet (TLP) traces that accurately simulate device-CPU interactions. While generative AI offers a promising avenue for synthesizing complex TLP sequences, it is prone to a critical challenge inherent in all generation tasks: hallucination. Naively applying these models often produces traces that violate fundamental PCIe protocol rules, such as ordering and causality, rendering them unusable for device simulation. To resolve this, our work introduces a methodology to bridge the gap between generative AI and high-fidelity device simulation. This paper presents Phantom, a framework that systematically addresses AI-generated hallucinations in TLP synthesis. Phantom achieves this by coupling a generative backbone with a novel post-processing filter that enforces PCIe-specific constraints, effectively eliminating invalid TLP sequences. We validate Phantom's effectiveness by synthesizing TLP traces for an actual PCIe network interface card. Experimental results show that Phantom produces practical, large-scale TLP traces, significantly outperforming existing models, with improvements of up to 1000× in task-specific metrics and up to 2.19× in Fréchet Inception Distance (FID) compared to backbone-only methods. The prototype implementation has been made open-source.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe problem of finding specification bugs during RTL verification is two-fold: (1) It takes weeks of effort to debug, identify, and fix architecture bugs. (2) Fixing the architecture specification can cause deviations of as much as ~20% in project timelines. Furthermore, hardware architecture is increasingly complex due to advanced memory schemes and multi-agent systems, and is therefore more prone to spec bugs.
These problems could be overcome by embracing architecture-level formal verification (ArchFV). ArchFV adds value to RTL verification not only by shifting left, but also by automating the RTL reference model development.
ArchFV is conducted by creating an executable formal model of specification represented as a collection of tables. The tables are exhaustively verified by model-checkers for passing safety properties and absence of deadlock.
Some of the key outcomes of adopting ArchFV at Intel are:
1. A critical IP comprising ~30 tables was formally verified with ArchFV in ~8 weeks
2. ~4050 safety properties proven and ~60 spec bugs fixed
3. ~7 stalled table transitions detected and rectified
4. Saved ~3 weeks of manual effort in reference model development
Overall, ArchFV prevents ripple effects in verification, improves design quality, and shortens product time to market.
Research Special Session
AI
DescriptionAMS/RF design has been among the most underserved EDA markets, but there has been great interest in changing the status quo. The last decade has seen several major efforts in advancing design automation in this field. This effort has been particularly boosted by the advent of artificial intelligence and machine learning (AI/ML): models such as convolutional neural networks, graph neural networks, transformers, and large language models have opened up possibilities that were difficult to achieve using prior methods. However, the blind use of ML models leaves behind decades of designer expertise and wisdom, and optimal solutions should attempt to harness both. This talk will present an overview of a set of AI/ML methods in AMS/RF design, and will discuss the quest to explore cost/benefit tradeoffs in deploying AI in this domain.
Research Special Session
Systems
DescriptionThe rapid proliferation of advanced packaging technologies, including 2.5D, 3D, and chiplet integration, presents unprecedented challenges for test engineers. As system complexity increases, conventional test methodologies often fall short, impacting yield, reliability, and time-to-market. This presentation delves into the multifaceted testing landscape of advanced packaging, offering a comprehensive exploration of key challenges and innovative solutions. It will cover critical aspects of advanced packaging test, starting with the intricacies of chiplet test, including strategies for known good die (KGD) assurance and inter-chiplet communication testing. We will then examine stack test methodologies, focusing on validating vertical interconnects, power delivery, and thermal performance within stacked die configurations. The discussion will also explore optimizing test application and test flow across the entire advanced packaging assembly, from wafer-level to final package test, emphasizing parallelization, adaptive testing, and data-driven approaches to reduce test time and cost. Furthermore, the presentation will address advanced defect screening for packaging components, covering interposer defects, microbumps, delamination, and cracks, and using sophisticated analytical techniques to ensure robust package quality and long-term reliability.
People
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
Description3D In-memory Computing (IMC) chips, with high density and bandwidth, offer a promising solution for AI workloads but suffer from thermal instability, which degrades weight precision and system throughput. This work proposes a cross-layer thermal stability strategy (TSS) and validates it on the first 3D eFlash analog IMC prototype with hybrid bonding, integrating a 12nm logic die and a 28nm eFlash die (36MB). TSS consists of: (1) a physics-guided thermal model achieving a Kullback-Leibler divergence of 0.088; (2) a temperature-insensitive differential programming (TIDP) scheme that stabilizes weights from –30°C to 60°C; and (3) an adaptive temperature calibration (ATC) algorithm that reduces output error standard deviation by 60.7%. With TSS, ResNet-18 and DCCRN maintain robust inference on the 3D eFlash IMC SoC across –15°C to 75°C with less than 10% degradation in end-to-end task accuracy. This is the first demonstration of thermally resilient analog IMC on a commercial-grade 3D SoC, paving the way for reliable AI deployment under dynamic thermal conditions.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionReasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20–30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
People
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionThis paper presents Thrend, the first defense that mitigates counting threads used in timing side-channel attacks on Arm and Apple CPUs through real-time monitoring and scheduling. Thrend leverages fuzzing to automatically generate high-quality counting threads for training, and detects them by sampling hardware performance counters. For suspected counting threads, Thrend prevents effective timing measurements by co-scheduling the attacker threads and the counting threads on the same core. We implement Thrend on four Arm and Apple devices. At a sampling rate of 100 Hz, Thrend achieves 99.71% detection accuracy and a 0% false-negative rate, and incurs less than 1.42% runtime overhead.
People
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionLarge Language Model (LLM) serving is crucial for applications like chatbots and code assistants, yet its implementation remains bottlenecked by low hardware utilization. Speculative decoding improves effective throughput, but still suffers from suboptimal resource partitioning. We introduce SmoothSpec, a speculative decoding system designed to maximize serving throughput by fully leveraging intra-device parallelism. SmoothSpec deploys draft and target models with the same parallelism configurations but allocates asymmetric compute resources at runtime. We design a novel scheduling strategy that dynamically adjusts both the draft parameters for the next step and the resource allocation ratio between the two models, based on the generation quality of draft tokens. We implement SmoothSpec on a single-node LLM serving system and demonstrate its effectiveness across diverse workloads.
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
DescriptionPerforming Approximate Nearest Neighbor Search (ANNS) on large-scale vector datasets is essential in the context of AI applications. Graph-based ANNS demonstrates superior performance and accuracy, positioning itself as a leading approach within the ANNS landscape. However, its inherent dependency on random data access mandates a memory-centric deployment strategy, which in turn presents significant scalability challenges.
Recent advancements in Compute Express Link (CXL) technologies, known for their high-bandwidth memory extension capabilities, offer a critical opportunity to enhance the scalability of graph-based ANNS. Nonetheless, the implications of the CXL-extended memory architecture when applied to graph-based ANNS remain largely unexplored.
In this paper, we present a CXL-Oriented Graph-based ANNS (COGA) system to achieve scalable, high-speed ANNS for extensive datasets. Our key observation is that the search pipeline can be enhanced through a CXL-tailored priority queue that maximizes CXL bandwidth utilization while bridging the latency gap. Furthermore, we propose a tiered data layout and placement strategy that leverages queue hints to facilitate speculative access to index data.
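For readers less familiar with graph-based ANNS, the role a priority queue plays in the search pipeline can be illustrated with a generic best-first search over a proximity graph. This is only a minimal sketch of the general technique, not the COGA design: the function names, the `ef` beam-width parameter, and the adjacency-list layout are assumptions made for illustration, and nothing here models CXL memory.

```python
import heapq

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(vectors, neighbors, query, entry, ef=4, k=1):
    """Best-first search over a proximity graph (hypothetical helper).

    vectors:   list of points, indexed by node id
    neighbors: dict mapping node id -> list of adjacent node ids
    ef:        how many of the best nodes seen so far are kept
    Returns the ids of the k closest nodes found, nearest first.
    """
    d0 = sq_dist(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap via negation: worst kept node on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                # nearest candidate is worse than every kept result
        for nb in neighbors[node]:   # each hop is a data-dependent, random access
            if nb in visited:
                continue
            visited.add(nb)
            dn = sq_dist(vectors[nb], query)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst
    ordered = sorted((-neg_d, n) for neg_d, n in results)
    return [n for _, n in ordered][:k]
```

The neighbor lookups inside the loop are the random accesses that the abstract identifies as the reason graph-based ANNS is memory-centric; the candidate queue determines which node is fetched next.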
People
Research Manuscript
Design
Quantum
DES6. Quantum Computing
DescriptionFrequent recalibration of multi-qubit systems is essential to maintain hardware fidelity, but calibration itself is costly and subject to rapid parameter drift. This work presents a novel framework for time-aware active calibration that combines Bayesian Optimization (BO) and Reinforcement Learning (RL) to adaptively schedule and prioritize calibration experiments under uncertainty. We model the calibration landscape as a directed acyclic graph (DAG), where nodes represent hardware parameters (e.g., frequency, amplitude ratio, DRAG, or phase) and edges capture inter-parameter dependencies extracted automatically from Qiskit Experiments metadata. Each parameter's temporal drift is modeled as an Ornstein-Uhlenbeck (OU) or exponential decay process with uncertainty bounds. The BO agent selects the next calibration experiment to maximize expected information gain per unit time while respecting system-level constraints. A parallel RL/ILP scheduler coordinates cross-qubit calibration and readout tasks to minimize idle time and avoid conflicts.
Our system further incorporates self-triggered recalibration based on online RB/XEB health metrics, enabling autonomous recovery from drift without human intervention. Experimental evaluation on multi-qubit testbeds demonstrates improved calibration efficiency, longer mean time between recalibrations (MTBC), and reduced overall calibration overhead compared to fixed or purely heuristic scheduling baselines. We provide an open-source implementation, Cal Orchestrator, which integrates DAG construction, BO-based experiment selection, and RL-based calibration scheduling for scalable quantum system maintenance.
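To make the drift model concrete, the sketch below uses the standard closed-form mean and variance of an Ornstein-Uhlenbeck process, dx = -theta (x - mu) dt + sigma dW, starting from a freshly calibrated value. It is an illustration of how such a model yields uncertainty bounds that grow with time since calibration, not the Cal Orchestrator implementation; the function names and the tolerance-based trigger are assumptions.

```python
import math

def ou_mean(x0, mu, theta, t):
    """Expected parameter value t time units after it was calibrated to x0."""
    return mu + (x0 - mu) * math.exp(-theta * t)

def ou_variance(theta, sigma, t):
    """Variance of the parameter t time units after calibration."""
    return sigma ** 2 / (2 * theta) * (1 - math.exp(-2 * theta * t))

def time_until_std_exceeds(theta, sigma, tol):
    """Earliest time at which the drift standard deviation reaches tol.

    Returns math.inf when the stationary spread never reaches tol, in which
    case this (hypothetical) rule would never request a recalibration.
    """
    stationary_var = sigma ** 2 / (2 * theta)
    if tol ** 2 >= stationary_var:
        return math.inf
    return -math.log(1 - tol ** 2 / stationary_var) / (2 * theta)
```

A scheduler could, for example, rank parameters by how soon each one is predicted to leave its tolerance band; whether the described system does this is not stated in the abstract.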
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPower integrity in high-performance computing systems is crucial for ensuring reliable and efficient operation. With each generation, high-performance CPU cores are significantly increasing in area and power to meet compute needs. Power integrity has become a critical factor in CPU silicon performance, as CPUs are starting to throttle before achieving the STA frequency due to power budgets and thermal limitations. General-purpose CPUs used in various product segments often face coverage issues, making it challenging to sign off on power integrity with a limited number of vectors. Therefore, a global solution is needed to ensure robust electrical performance and reliability.
Global PG-Fill provides reinforcement to the existing Power Delivery Network (PDN), improving PDN resistance, reducing voltage drops, and minimizing electromagnetic interference.
In this paper, we discuss how we utilized opportunistic track-based PG-Fill using Cadence Pegasus and optimally integrated it with the existing PDN to enhance power integrity while addressing challenges such as runtime and timing degradation. We propose our custom solution by optimally controlling STAMP lengths, adding additional PG VIAs, and improving the overall effective resistance of the design. We also address key challenges, such as timing degradation caused by PG-Fill, and how our custom solution helped to minimize these issues.
Using the proposed custom solution, we achieved an improvement of approximately 20% in overall PDN resistance and 88% cleanup in overall dynamic IR violations, resulting in a more robust design. All of this was achieved without introducing any additional DRC violations.
Research Manuscript
Design
DES3. Emerging Models of Computation
DescriptionFully Programmable Valve Array (FPVA) biochips offer high flexibility and reusability for executing complex biochemical assays. However, their reliability and performance are constrained by the time-sensitive nature of bioassays and the inherent complexity of pressure-driven fluid routing. Existing methods often overlook timing constraints and the effects of pressure paths during early design stages, leading to suboptimal scheduling and placement. To address these limitations, we propose a timing-driven scheduling and placement framework that explicitly incorporates pressure-path interference to optimize fluidic operations, thereby reducing delays and resource overhead. Experimental results demonstrate that the framework significantly decreases bioassay completion time, fluid path length, and delays, enhancing both operational efficiency and system robustness.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionThe efficiency of processor-based emulation (PBE) heavily depends on the scheduling of computations. In the scheduling process, the challenge in accurately modeling the time step stems from the complex instruction execution process, which is influenced by the inter-processor communication latency, the execution sequence of the allocated nodes, and scheduling constraints. Therefore, existing works are unable to optimize the time step directly. In this paper, inspired by the key insight of the inherent connection between the time step and the topological order balancing (TOB) of the netlist graph, we propose a TOB-driven scheduling algorithm. Our approach introduces mobility-prioritized node selection and efficient forward and backward propagation. In addition, a theorem is presented to reduce the time complexity of gain calculation to O(1). Experiments demonstrate significant improvements over the SOTA scheduling algorithm, achieving a 72% better TOB metric, a 22.5% reduction in time steps, and 55% faster runtime on public and open-source chip benchmarks, thus establishing TOB as a critical metric for the scheduling process and enhancing the efficiency and performance of PBE.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionProgrammable RISC-V soft cores are becoming more widespread in data-center scenarios, making it crucial to design efficient systems that deploy them on FPGA chips with HBM memory. Toki, released as open source, is the first hardware-software framework that enables profiling the performance of HBM on FPGA accelerator cards by jointly considering (i) the execution of workloads on RISC-V soft cores instantiated on the FPGA and (ii) the injection of memory traffic from the host system via DMA over PCIe, providing insights that cannot be obtained with synthetic traffic generators alone. Extensive experiments target an AMD Alveo U55C card, deploying up to 60 RISC-V compute cores and stressing its HBM2 memory through real-world applications and microbenchmarks with user-defined access patterns. Results showcase how Toki can effectively profile the impact on HBM performance of the compute cores' organization, the workload's memory access patterns, data locality, contention over the memory controllers, and host traffic.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionTransistor topology optimization is a critical step in standard cell design, directly dictating diffusion sharing efficiency and downstream routability. However, identifying optimal topologies remains a persistent bottleneck, as conventional exhaustive search methods become computationally intractable with increasing circuit complexity in advanced nodes. This paper introduces TOPCELL, a novel and scalable framework that reformulates high-dimensional topology exploration as a generative task using Large Language Models (LLMs). We employ Group Relative Policy Optimization (GRPO) to fine-tune the model, aligning its topology optimization strategy with logical (circuit) and spatial (layout) constraints. Experimental results within an industrial flow targeting an advanced 2nm technology node demonstrate that TOPCELL significantly outperforms foundation models in discovering routable, physically-aware topologies. When integrated into a state-of-the-art (SOTA) automation flow for a 7nm library generation task, TOPCELL exhibits robust zero-shot generalization and matches the layout quality of exhaustive solvers while achieving an 85.91X speedup.
People
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
DescriptionExisting GNN quantization methods suffer from considerable quantization overhead, which severely limits their practical usage in real-world scenarios. To this end, we present TopGQ, an accurate post-training quantization framework tailored for GNNs, alleviating the burden of redundant quantization overhead. We propose TopPIN, a proxy for nodes' local structure, and use it to group nodes with similar topology during quantization. On top of that, we introduce Dual-axis scale absorption, which enables activation quantization along both the outer and inner dimensions by merging one into the adjacency matrix. Experimental results show that TopGQ reduces quantization time by an order of magnitude while preserving accuracy.
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionTask-oriented object detection on CLIP enables open-vocabulary, prompt-driven semantics but dense alignment and memory traffic block real-time edge use. We propose TorR, a brain-inspired algorithm–architecture co-design that replaces CLIP-style dense matching with hyperdimensional associative reasoning and exploits temporal coherence. TorR combines HDC similarity, graph reasoning with query caching and delta updates, and a lane-scalable, precision-gated item memory with RT-30/RT-60 control. A 28 nm TorR accelerator delivers real-time detection with millijoule energy and competitive AP at orders-of-magnitude lower energy.
Research Special Session
AI
DescriptionAs technology nodes shrink, design rule checks established by foundries to ensure manufacturability have become more complex and stringent. As a result, fixing design rule violations (DRVs) in layouts has become very time-consuming and complex. In practice, DRV fixing is still performed manually during late design closure and often under intense tapeout pressure. Since DRC reports provide only limited information, such as rule names and DRV locations in the layout, engineers are forced to repeatedly cross-reference reports, layouts, and lengthy design rule manuals (DRMs) to identify the cause of DRVs. Recently, large language models (LLMs) have shown a strong ability to interpret rule constraints. To further drive their use for DRC-related tasks, we introduce a multimodal benchmark suite that serves as both a training dataset and an evaluation benchmark for LLM-driven workflows in DRC research. The suite pairs raw GDSII layouts and PNG screenshots with an extraction pipeline that converts GDSII into layout scripts for open-source tools such as KLayout, making the inputs compatible with LLMs. It also provides compressed DRC annotations in JSON format that record the rule name and the corresponding DRV location, as well as compressed multimodal information derived from the DRM, including images and textual data. The benchmark includes cases ranging from standard cell layouts to block-level designs built from open-source flows, including designs from OpenROAD and OpenCores synthesized using various technologies. The benchmark will enable reproducible evaluation across DRC tasks, including identifying DRVs, explaining their root causes in the layout, and fixing DRVs.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionThe rapid growth of AI workloads is fundamentally reshaping system-on-chip (SoC) design. Monolithic scaling is increasingly constrained by reticle limits, rising costs, power density, and yield challenges, while AI systems demand heterogeneous compute, high-bandwidth memory, and system-level co-design across silicon, packaging, and software. Chiplet-based architectures offer a compelling path forward, enabling modularity, reuse, and faster innovation. However, the lack of common system-level standards risks fragmenting the ecosystem and limiting multi-vendor interoperability.
This presentation introduces the Foundation Chiplet System Architecture (FCSA), an open, ISA-neutral, and architecture-agnostic framework designed to enable interoperable chiplet-based systems at scale. Building on the lessons and adoption of the Arm Chiplet System Architecture (CSA), FCSA defines a common foundation for chiplet systems independent of CPU instruction set, enabling reuse across diverse architectures and markets. Developed under the Open Compute Project (OCP), FCSA establishes a shared vocabulary of chiplet types—including compute, accelerator, I/O, memory, and system expansion—and specifies their expected behaviors, interfaces, and trust boundaries.
FCSA adopts a layered model spanning system partitioning, functional interfaces, protocol mappings, and physical integration, leveraging industry standards such as UCIe, AMBA CHI-C2C, PCIe, and CXL to ensure multi-vendor interoperability. By decoupling architecture-specific requirements from common system foundations, FCSA enables both ISA-specific optimization and broad ecosystem compatibility.
The Foundation Chiplet System Architecture represents a critical step toward an open chiplet marketplace, reducing integration friction, accelerating custom silicon development, and enabling scalable, heterogeneous systems for AI infrastructure, cloud, automotive, and edge computing.
Research Special Session
Systems
DescriptionThis talk will focus on how we can recover from defects in fanout wafer-level packaging (FOWLP) and hybrid bonding. Defects in FOWLP can arise from coefficient of thermal expansion mismatch, warpage, die shift, and post-molding protrusion, causing misalignment and imperfect bonding during redistribution layer (RDL) buildup. As a result, high-density back-end-of-line (BEOL) interconnects, RDLs, and through-mold vias are susceptible to warpage-induced stress, strain, and deformation. Hybrid bonding interfaces are vulnerable to nanoscale voids, contamination, and alignment-induced opens or shorts. The speaker will present an analysis of these defects, their fault models, and their impact on logic functionality and timing. The presentation will showcase recent breakthroughs in built-in self-test and repair for FOWLP and hybrid bonding, translating cutting-edge academic research into practical solutions.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionFace-to-face (F2F) three-dimensional integrated circuit (3D IC) design leverages advanced vertical interconnect technologies, such as hybrid bonding and micro-bump soldering, to vertically stack dies and preserve transistor-density scaling beyond Moore's law. Despite significant progress in 3D physical design methodologies, explicit timing optimization within 3D analytical global placement (GP) remains largely unexplored. In this work, we present the first timing-driven analytical GP framework that moves toward true-3D optimization for mixed-size F2F 3D ICs. We propose a comprehensive timing-driven net weighting formulation that integrates drive-strength-based cell delay, L-shaped RC-based net delay, and static timing analysis (STA)-based incremental timing criticality into the analytical GP model; this formulation serves as a backbone for timing optimization and adapts flexibly to both true-3D and multi-die 2D GP environments. To steer macros away from central congestion and provide an effective 3D initialization, we introduce a macro-boundary-aware true-3D initial placement approach that models macro-to-boundary interactions using a differentiable function. Then, we develop the first timing-driven mixed-size true-3D GP algorithm that jointly optimizes standard cells and macros within a unified 3D design space, enabling cross-die timing refinement and improving 3D timing closure. After die partitioning based on true-3D GP results, we further introduce a timing-driven multi-die 2D GP guided by 3D-aware STA, in which cross-die RC-trees are reconstructed to enable realistic 3D parasitic estimation for STA. Experimental results on OpenROAD benchmark suites demonstrate that, compared with existing open-source placement flows and a wirelength-driven 3D GP baseline, our timing-driven 3D GP framework achieves at least 33.2% and 43.2% average improvements in total negative slack (TNS) and worst negative slack (WNS), respectively.
TechTalk
AI
DescriptionDesign verification remains one of the most time-intensive and expertise-dependent stages of chip development, routinely consuming over 70% of total project effort. As AI agents mature from research prototypes into production tools, the DV community faces a pivotal question: are these systems actually ready for real tapeouts?
This talk presents lessons learned from deploying AI agents across production DV environments, from block-level designs to full-chip SoC verification. We describe an agentic architecture that autonomously produces root-cause analyses and suggested fixes in under 15 minutes while operating under the realities of security, efficiency, and integration with existing flows.
Across a diverse set of production issues, we demonstrate a 70% end-to-end success rate on first-pass debug. We discuss what makes agents succeed and fail, and share how an AI-native EDA toolstack plus advanced layers for planning and institutional knowledge are foundational to achieving this level of reliability. This includes rethinking how agents access waveforms, query design databases, and interact with simulation infrastructure at scale.
We argue that the bottleneck to an AI-native EDA stack is no longer model capability — it is infrastructure, integration, and trust.
People
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionAlways-on vision is essential for smart glasses, yet continuous visual-processing under stringent power budgets remains a major challenge. This work presents an always-on visual wakeup acceleration-engine for sub-milliwatt hand gesture recognition. A compact convolutional neural network (<13k parameters), trained on the HaGRID dataset, performs binary classification on 64×64 inputs achieving over 92% accuracy at 3b precision, requiring less than 9kB of memory. When executed on a flexible accelerator engine implemented in 7nm technology, at 100MHz, it consumes only 64nJ per frame, translating to always-on power of 11μW at 30FPS, enabling energy-efficient, always-on interaction for next-generation smart-glasses.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionTraining satellite-edge deep learning models remains constrained by limited onboard resources and high data download latency. Although split federated learning (SFL) offers a potential solution through model partitioning, its convergence and robustness are fundamentally compromised by the intermittent and asymmetric satellite–ground links. To address these issues, we introduce SatSFL as a novel SFL system for satellite-based computing networks. SatSFL employs an interpolated gradient approximation to emulate ground feedback during disconnections, markedly accelerating convergence while maintaining robustness under heterogeneous data. In addition, we design adaptive uplink compression under asymmetric bandwidth to ensure that balanced and critical gradients are reliably transmitted back to satellites. We implement and evaluate SatSFL on real-world LEO satellite systems and datasets, demonstrating superior accuracy and convergence speed compared to state-of-the-art methods.
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionThe security properties and practical efficiency of Trusted Execution Environments (TEEs) have made them play an increasingly critical role in cloud computing, e.g., Confidential Virtual Machines (CVMs). However, TEE implementations, closely tied to specific hardware, introduce compatibility challenges for upper cloud services. In this paper, we focus on Live Migration, which is widely used in Normal Virtual Machines (NVMs) by Cloud Service Providers (CSPs) to manage computing resources, e.g., upgrading the host system without service downtime. Currently, TEE vendor solutions support CVM migration only within homogeneous TEE stacks, without consideration of heterogeneous environments, rendering heterogeneous migration difficult or even impossible. To narrow the gap, we propose a generic framework applicable to various x86 TEEs, providing insights into achieving heterogeneous migration. In addition, we implement a migration system based on this framework that achieves migration between AMD SEV and Hygon CSV by introducing a trusted migration agent with a specific design. The key idea is to emulate the required migration commands of the TEE firmware and resolve compatibility issues through a helper agent. Our prototype offers security guarantees comparable to homogeneous CVM migration, with experiments demonstrating acceptable performance overhead.
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
DescriptionRowHammer is one of the most critical security threats in modern DRAM. The industry has introduced Per-Row Activation Counting (PRAC) to monitor activations, but this approach degrades performance by lengthening critical DRAM timing parameters. We propose TRAC (Transparent Row Activation Counting), a lightweight monitoring architecture that eliminates these drawbacks. TRAC leverages ACT-triggered update, robust in-subarray processing, and a Linear Feedback Shift Register (LFSR)-based counter to track activations transparently, without introducing timing overhead. Circuit-level simulations and system-level evaluations demonstrate TRAC's correctness, stability, and negligible area cost, establishing it as a practical and scalable defense against RowHammer.
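As a rough illustration of why an LFSR can serve as a compact activation counter, the sketch below steps a 4-bit LFSR once per activation and recovers the count through a precomputed lookup table. The width, tap positions, seed, and function names are assumptions chosen for illustration; the abstract does not describe TRAC's actual circuit.

```python
WIDTH = 4          # hypothetical counter width
TAPS = (4, 3)      # feedback taps for x^4 + x^3 + 1 (1-indexed bit positions)
SEED = 0b0001      # any non-zero start state works

def lfsr_step(state: int) -> int:
    """Advance the LFSR by one activation: shift left, feed back XOR of the taps."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & ((1 << WIDTH) - 1)

def build_decode_table() -> dict:
    """Map each LFSR state to the number of steps taken from SEED."""
    table = {}
    state, count = SEED, 0
    while state not in table:
        table[state] = count
        state = lfsr_step(state)
        count += 1
    return table

def state_after(activations: int) -> int:
    """State of the counter after the given number of activations."""
    state = SEED
    for _ in range(activations):
        state = lfsr_step(state)
    return state

DECODE = build_decode_table()
```

An LFSR needs only a shift and a few XOR gates per update rather than a carry chain, which is the usual reason for choosing it over a binary counter; the cost is that states are visited in a scrambled order, so a threshold is checked by comparing against a precomputed state rather than a number.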
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionLearning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. Its reliance on permutation-invariant aggregation, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation.
To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.
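One way to picture the function shift is with signal probabilities on a toy circuit. The sketch below assumes the "function" of a node is the probability that it outputs 1 under uniform random inputs, and uses a hypothetical circuit with reconvergent fanout; the abstract does not specify either choice, so this is an illustration, not the paper's formulation.

```python
from itertools import product

def true_prob(fn, n_inputs):
    """Exact probability that fn outputs 1 over all uniform input assignments."""
    rows = list(product([0, 1], repeat=n_inputs))
    return sum(fn(*row) for row in rows) / len(rows)

# Hypothetical toy circuit: out = (a AND b) AND (a OR b).
# Both fan-ins of the final gate depend on a and b, so they are correlated.
def g1(a, b):
    return a & b

def g2(a, b):
    return a | b

def out(a, b):
    return g1(a, b) & g2(a, b)

p_g1 = true_prob(g1, 2)
p_g2 = true_prob(g2, 2)

# Local approximation: treat the AND gate's fan-ins as independent.
local_estimate = p_g1 * p_g2

# True global value, obtained by exhaustive enumeration.
global_value = true_prob(out, 2)

# The function shift is the residual a model would be trained to predict.
function_shift = global_value - local_estimate
```

Here the local estimate is 0.1875 while the true value is 0.25, so the shift is 0.0625. The residual is non-zero only because of the correlation between the two fan-ins, which is the part a purely local rule cannot see.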
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern backend design teams lose significant time diagnosing disk alerts, locating stale data, and cleaning cloned PD execution flow tracks—effort that scales poorly with project size. We present Track Analyzer + Visualizer, an automated framework that identifies redundant dataout directories, quantifies reclaimable storage, and generates intuitive drill-down cleanup reports.
Cloned tracks for new experiments often copy large dataout release directories intended for consecutive sessions in the original track, creating unnecessary duplication and consuming disk space.
In one project, the tool analyzed ~240 TB of design release data and identified 105 TB (~44%) of recoverable space in under 2 hours, replacing days of manual effort. Weekly use across multiple projects has reclaimed 350+ TB, preventing disks from reaching critical capacity and avoiding workflow interruptions. These savings return engineering time and reduce full storage operating costs—including power, cooling, backups, and maintenance—while helping avoid costly expansion cycles.
This insight drove methodology improvements to minimize duplication while preserving flow integrity, complementing ongoing automated cleanup for sustainable data management. By automating a recurring burden, this solution improves productivity, lowers storage costs, and enhances scalability and business efficiency.
Research Manuscript
Design
DES4. Digital and Analog Circuits
DescriptionThis work introduces a machine-learning (ML)-based calibration framework for analog-to-digital converters (ADCs) that is demonstrated on an over-sampled, time-interleaved band-pass delta-sigma ADC (TI-BPADC) test-chip in 65 nm and a Nyquist successive approximation register (SAR) ADC test-chip in 28 nm. The proposed framework employs two models: a hybrid convolutional-recurrent network (ConvRec) and a residual convolutional network (ResConv) for suppressing both static and dynamic errors in ADCs, and presents trade-offs between the two models in terms of calibration accuracy and hardware cost. The proposed models improve signal-to-noise-and-distortion ratio (SNDR) and spurious-free dynamic range (SFDR) by more than 20 dB on the ADC test chips without requiring prior knowledge of circuit architecture, error mechanism or input statistics. This is in contrast to existing calibration techniques, which are either algorithmic and require prior knowledge of errors for calibration, or leverage ML for calibration but need knowledge of input statistics. The input- and error-agnostic property of the proposed ML framework is the key differentiation of this work over others and allows correction of errors, including errors that are unforeseen during design time.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionReducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-consuming in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that rely on manual or automatic synthesis without considering AI model contexts, we present TRAM, which jointly optimizes the AxM structure and AI model to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by 27.09% on vision transformers with ImageNet.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionPre‑silicon validation environments such as simulation, emulation, and FPGA prototyping increasingly require early driver bring‑up to enable system-level debug, performance analysis, and IP integration. However, writing bare‑metal drivers from scratch remains a slow and expertise-intensive process, often taking months, and leading to duplicated effort across teams and projects. At the same time, mature Linux drivers exist for virtually every peripheral IP, encapsulating years of development, design knowledge, and corner-case handling. Unfortunately, these drivers cannot be reused directly in bare‑metal flows due to deep dependencies on kernel services—including device enumeration, memory management, PCIe infrastructure, and interrupt routing—none of which exist in pre‑silicon setups.
We present a practical and reusable framework that enables systematic extraction and reuse of the device‑specific logic from Linux drivers by decoupling OS-bound components through a kernel-proxy layer, a lightweight hardware wrapper, and an SV‑DPI transport bridge. This architecture preserves proven driver behavior while providing deterministic execution, fast bring‑up, and portability across simulation, emulation, and prototyping platforms. Through case studies such as USB4, the framework demonstrates significant reductions in time‑to‑first‑IO, high reuse of driver code, and improved debug efficiency. The approach provides a scalable methodology for accelerating hardware‑aware software development and early system enablement in complex SoC programs.
People
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionSAT-based transistor network synthesis delivers optimal, minimum-transistor designs but faces exponential complexity, making it impractical for large circuits. This paper introduces TransSyn, a scalable transistor network synthesizer that brings the benefits of SAT-based synthesis to the macro-cell scale through a novel two-stage framework. First, a cut-based global synthesis stage reformulates the task as a cut-based mapping problem, rapidly decomposing the design into smaller, manageable sub-problems. This ensures scalability and produces a high-quality network in seconds. Then, a SAT-based detailed synthesis stage refines this network by systematically identifying and re-optimizing merged subcircuits across cut boundaries using a SAT-based approach, recapturing lost optimization opportunities. Experiments demonstrate that TransSyn significantly outperforms previous methods, achieving 11.2% and 10% reductions in transistor count compared to graph-based and hybrid-based approaches, respectively. Furthermore, when integrated with a transistor-level placer, TransSyn breaks the library and cell boundaries, yielding up to a 48.8% area reduction compared to conventional cell-based designs. TransSyn demonstrates its capability for scalable, high-quality transistor-network synthesis, successfully bridging the gap between the optimality of SAT-based synthesis and the practical demands of large-scale cell design.
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionTriple Sparse General Matrix-Matrix Multiplication (TriSpGEMM) is a major performance bottleneck in scientific computing and graph analytics. Existing domain-specific accelerators support only Sparse General Matrix-Matrix Multiplication (SpGEMM). Performing TriSpGEMM by sequencing two SpGEMM kernels introduces a global synchronization barrier, forces intermediate data to be written to and read from off-chip memory, and leads to frequent pipeline stalls, all of which significantly degrade performance. A natural solution is to design a fused architecture that executes TriSpGEMM directly, but such a design must overcome three key challenges, i.e., pipeline overflow caused by highly irregular and bursty sparse dataflows, locality inefficiency stemming from the trade-off between input reuse and output aggregation, and severe head-of-line blocking due to variable memory access latencies.
In this work, we present TRIDENT, a transient-data-centric accelerator for fused TriSpGEMM processing. TRIDENT integrates a hierarchical flow-control mechanism, a hybrid dataflow co-designed with a windowed sparse format, and an out-of-order decoupled execution scheme to ensure pipeline stability, locality efficiency, and latency tolerance.
Comprehensive evaluations demonstrate that TRIDENT achieves an average of 3.56x performance speedup over state-of-the-art SpGEMM accelerators, and up to 49.6x and 6.3x improvements over CPU and GPU baselines, respectively.
People
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionTo deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU–CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU–CPU–NDP architecture that fills this gap by synergistically leveraging an AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83× speedup over state-of-the-art solutions.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionMultimodal stacks mixing ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms due to heterogeneous compute and tight real-time constraints. We present TRINE, a single-bitstream FPGA accelerator and compiler that runs end-to-end multimodal inference without reconfiguration. It unifies layers as DDMM/SDDMM/SpMM on a mode-switchable PE array supporting weight/output-stationary systolic, 1×CS SIMD, and a routable adder tree with in-stream top-k token pruning. Dependency-aware layer offloading overlaps independent kernels across RPUs. On Alveo U50 and ZCU104, TRINE achieves up to 22.57× and 6.86× lower latency than RTX 4090 and Jetson Orin Nano at 20–21 W, with <2.5% accuracy loss and state-of-the-art efficiency.
People
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionThe growing need for environmental adaptability, privacy, and security in edge applications drives the search for alternatives to deep learning that are both computationally efficient and hardware-friendly. WiSARD Weightless Neural Networks (WNNs) meet this need through simple table lookups, offering low latency and minimal computation. In this work, we propose TsetlinWiSARD, an FPGA-based on-chip training architecture for WiSARD WNNs that leverages Tsetlin Automata (TAs) to enable probabilistic, feedback-driven learning. By mapping logical LUTs and TAs onto physical LUTs on FPGAs, TsetlinWiSARD achieves state-of-the-art accuracy, over 1000× faster training, and reduced resource use (22%), latency (93.3%), and power (64.2%).
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionCloud–edge AI deployments must balance compression and security on resource-limited devices. While Tensor-Train Decomposition (TTD) is widely used to compact models, encryption research targets dense networks, leaving the practicality of selective encryption under compression unclear. We introduce TT-SEAL, the first selective encryption scheme tailored to TT-decomposed networks. TT-SEAL ranks TT cores using a core-wise importance criterion and encrypts a minimal set of critical cores with AES, cutting decryption cost while retaining robustness to adversarial transfer comparable to full encryption. FPGA-based experiments show encrypting as little as 4.89% of parameters preserves robustness while reducing the share of AES decryption overhead in end-to-end inference to low single digits, enabling secure, low-latency inference.
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionScientific machine learning (SciML) field deployment faces communication bottlenecks from centralized architectures, while distributed approaches violate physical principles. We introduce EPIC, a hardware-physics co-guided framework using full-waveform inversion as a representative task. EPIC performs lightweight edge encoding and physics-aware central decoding, transmitting compact latents instead of raw data. Cross-attention preserves inter-receiver wavefield coupling while reducing communication costs. Evaluated on five Raspberry Pi devices across 10 OpenFWI datasets, EPIC reduces latency by 8.9× and communication energy by 33.8×, while improving reconstruction fidelity on 8 out of 10 datasets compared to centralized approaches.
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionSRAM-based digital in-memory computing excels in low-precision AI, but high-precision floating-point for scientific computing faces challenges: (1) an ultralarge mantissa increases area and reduces MAC speed; (2) alignment and accumulation impose excessive peripheral circuit overhead. We propose an ultralarge bit-width bit-serial mantissa MAC with 5-signal Booth encoding and a two-step carry-save adder, and an exponent-aligned accumulation module with an exponent difference generator and a FIFO adder tree. In 28 nm, UBFP-IMC with 168 MAC units and one accumulation module achieves 4.43×, 4.46×, 4.30×, and 8.53× EEF improvement, 27× lower accumulation latency, and an outstanding area advantage. These results indicate excellent scalability and demonstrate potential for accelerating scientific computing.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionFor AI/HPC applications, ASIC development with heterogeneous 2.5DIC integration through UCIe has been in critical demand in recent years. In revision 3.0 (August 2025), the UCIe consortium proposed an advanced packaging interface protocol with a 64-bit data width at 64GT/s, which also benefits workloads with huge data transmission requirements. However, such UCIe-A 64GT/s integration also brings strict SI/PI design challenges for interconnects and power distribution networks (PDNs).
In addition, various advanced packaging technologies for 2.5DIC have been proposed in recent decades, e.g., TSMC CoWoS and Intel EMIB. To integrate chiplets across these packaging technologies, signal/power integrity design plays a significant role in high-speed integration from pre-layout to post-layout.
This presentation discusses the critical design factors, performance merits, and EDA workflows for UCIe-A x64E, 64GT/s integration, based on TSMC's three critical interposer technologies, i.e., CoWoS-S/L/R, where the design using CoWoS-S can achieve good margins, with an eye width of 0.9UI and Vpp of 5.92mV. This work can provide an in-depth view and reference for successful signal/power-integrity-aware integration.
The content is organized as follows: (1) UCIe-A 64GT/s, x64E design scope and EDA workflows; (2) signal integrity comparison; (3) power integrity comparison; (4) summary.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionMatrix extensions are essential in modern CPUs, but existing implementations with tight pipeline coupling and fine-grained synchronous instructions hinder integration and high-performance kernel implementation.
We propose an adaptable matrix extension that carefully decouples matrix-unit from the CPU pipeline for low-overhead integration, unified software support, and reuse of existing compute and memory resources. It supports asynchronous operations and enables efficient matrix–vector overlap.
Integrated into four open-source RTL CPUs, the design achieves over 90% utilization and up to 2.31× speedup on ResNet50, BERT, and Llama3 over Intel AMX, with 30% gains from matrix–vector overlap, while a 4TFLOPS@14nm implementation occupies 0.53mm².
People
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionConventional Analog-to-Digital Converter (ADC) linearity testing separates data acquisition from model extraction, resulting in long post-processing times. This work unifies two complementary strategies, Uncertainty-Guided Live Measurement Sequencing (UGLMS) for adaptive test stimulus control and ultrafast Segmented Model Identification of Linearity Error (uSMILE) for error parameter extraction, into a single closed-loop test framework. The combined UGLMS–uSMILE method performs live estimation of MSB, ISB, and LSB-level nonlinearity during measurement, guided by covariance-based uncertainty metrics. This eliminates offline processing while retaining high accuracy and stability, enabling rapid, measurement-driven characterization of ADC linearity directly within production test environments. Experimental results on Successive Approximation Register (SAR) ADCs demonstrate sub-0.2 LSB accuracy and test times below 150 ms for 16-bit resolution.
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
DescriptionLarge language models (LLMs) require ever more GPU memory, necessitating efficient and extreme weight compression methods. Existing compression methods are either theoretically limited to 1 bit per weight or face severe performance degradation and inefficiency. To deploy LLMs in resource-constrained scenarios, we introduce UltraSketchLLM, which compresses LLMs with data sketches. It reduces peak GPU memory footprint with a high compression rate of up to 0.5 bit per weight. Combined with a hardware-friendly implementation, UltraSketchLLM keeps performance degradation tolerable and latency overhead extremely low, with a 14.9x speedup compared to a naive sketch solution.
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
DescriptionAnalog circuit design is time-consuming and heavily dependent on the expertise of human designers. This paper presents a universal circuit design hierarchical model (UNCHARTED). It is the first machine learning-based framework that emulates transient simulation by modeling physical behaviors of individual circuit components through the graph neural network. The sizing of circuit elements is reflected through low-rank adaptation, and inverse design is achieved by the back-propagation mechanism. UNCHARTED can accelerate simulation by up to 241.9x, with high similarity (>95%) to SPICE simulation in more than 70% of tested circuits. Inverse design examples of a delay chain and an amplifier also demonstrate its powerful capabilities in automatic design optimization.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionNonlinear activation functions (NAFs) are critical to deep neural networks (DNNs), yet their diverse and complex computational forms are inherently hardware-unfriendly, incurring substantial latency, area, and energy overheads. This work presents UNICON, a unified reconfigurable hardware architecture that efficiently supports diverse NAFs through a logarithmic-domain computing paradigm.
UNICON uncovers the intrinsic correlations among NAFs and decomposes complex nonlinear operations into lightweight shift–add operations. With modular design and dynamic reconfigurable dataflows, UNICON achieves high functional flexibility and resource efficiency without hardware duplication. As the first algorithm–architecture co-designed solution that unifies diverse NAFs within a logarithmic-domain framework, UNICON gains an average 1.55x speedup, 2.56x energy efficiency, and 2.74x area efficiency over state-of-the-art NAF architecture.
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
DescriptionThe growing demand for high-performance yet energy-efficient systems has elevated the need for coordinated power and signal routing in multilayer printed circuit boards (PCBs). While prior work optimizes either the power-delivery network (for DC IR-drop compliance) or signal routing (for routability), these separated flows often induce spatial contention and inefficient layer utilization. We propose the first unified rail-signal co-design framework for multilayer PCBs that jointly optimizes routing quality and power integrity. Our approach (1) generates routing guides to profile and allocate resources, (2) performs iterative rail-signal co-optimization that explicitly models crossing, overflow, and current density, and (3) conducts detailed routing via a guided, crossing-aware A* search that aligns with globally optimized guides. Experiments demonstrate that our proposed framework substantially improves routability while satisfying DC IR-drop constraints, delivering robust solutions with efficient resource usage.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionAs software complexity increases in modern SoC development, effective pre-silicon software validation has become critical. Although software development is performed early in the design flow, conventional emulation environments remain limited in realism, configurability, and performance, restricting their ability to capture silicon-relevant behavior.
This work introduces a pre-silicon software validation framework that addresses fundamental limitations of conventional hybrid emulation by combining a camera sensor model (CSM) with a new-generation emulator featuring multi-port memory support. The framework is implemented on the StratoCS emulator.
The sensor model enables high-fidelity camera behavioral validation by providing image data that closely reflects real hardware behavior in a hybrid emulation environment. To efficiently support high-resolution camera workloads, a high-bandwidth multi-port hybrid memory architecture is employed, enabling data transfer rates that approximate real sensor throughput while preserving functional correctness. In addition, virtual I2C enables user-driven runtime control of the sensor model, improving configurability and extending the practical applicability of sensor-based validation.
As a result, realistic validation can be performed earlier, allowing post-silicon efforts to focus on issue resolution as well as power and performance optimization. By overcoming conventional emulation limitations, the proposed framework establishes emulation as a practical platform for user-driven, DMA-intensive pre-silicon software validation.
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
DescriptionDeploying large language models (LLMs) on edge devices is critical for user privacy and low-latency inference. However, existing matrix-centric LLM accelerators suffer from limited throughput when handling fragmented non-linear operators due to tight edge-side resource constraints, and the absence of a general fusion mechanism frequently results in severe PE array underutilization during non-linear execution phases. To address these challenges, we propose UniNL, a unified execution framework designed to efficiently handle non-linear operators in LLMs. First, we introduce an abstraction model that decomposes non-linear operators into a sequence of primitives, leveraging polynomial approximation to enable point-wise operations to be executed on the PE array. Second, we implement lightweight microarchitectural extensions to the accelerator to efficiently support these primitives. Finally, we devise a primitive-aware fusion mechanism that effectively hides non-linear latency behind matrix operations. Experimental results demonstrate that UniNL achieves a 6.89× speedup and 7.26× energy-efficiency improvement on average compared to a Jetson AGX Orin baseline. Furthermore, UniNL outperforms state-of-the-art designs by 1.46× and 1.35× in average performance, respectively.
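The idea of running a point-wise non-linear function on multiply-accumulate hardware through polynomial approximation can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify UniNL's primitives, target functions, or polynomial degrees, so the choice of exp(x) on [-1, 0], the degree-4 Taylor coefficients, and the names `horner`, `exp_poly`, and `max_abs_error` are all hypothetical.

```python
import math

# Assumed example: degree-4 Taylor polynomial of exp(x) around 0,
# stored highest degree first so it can be evaluated with Horner's rule.
EXP_COEFFS = [1 / 24, 1 / 6, 1 / 2, 1.0, 1.0]


def horner(coeffs, x):
    """Evaluate a polynomial using only multiply-accumulate steps."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c  # one multiply-accumulate per coefficient
    return acc


def exp_poly(x):
    """Approximate exp(x) for x in [-1, 0] with the polynomial above."""
    return horner(EXP_COEFFS, x)


def max_abs_error(samples=101):
    """Largest deviation from math.exp on an even grid over [-1, 0]."""
    xs = [-1 + i / (samples - 1) for i in range(samples)]
    return max(abs(exp_poly(x) - math.exp(x)) for x in xs)
```

Each Horner step is a single multiply-accumulate, which is the kind of operation a matrix-oriented PE array already provides; a real design would likely pick coefficients and input ranges per operator rather than use a plain Taylor expansion.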
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
DescriptionMany-core neuromorphic systems serve as specialized accelerators for Spiking Neural Networks (SNNs). However, the communication mechanisms in existing neuromorphic systems incur significant traffic and energy overhead due to redundant address transmissions, a bottleneck exacerbated by the short payloads inherent to the spike-based communication nature of SNNs. Specifically, our characterization reveals that in current neuromorphic systems, duplicate address transmissions can account for up to 49% of the total traffic in representative workloads.
This paper presents UniSpike, a hardware-software co-design that eliminates address redundancy by aggregating spikes destined for the same core into single, compact packets. UniSpike implements a spike transmission scheduling strategy to enable efficient spike merging, supported by a dedicated hardware architecture for runtime packet assembly and dispatch, as well as a destination-aware SNN partitioning algorithm that maximizes address sharing opportunities. Experimental results demonstrate that on average, UniSpike reduces traffic volume by 1.93x, achieving 1.77x speedup and 1.50x energy efficiency improvement over state-of-the-art designs.
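A minimal sketch of the spike-merging idea, assuming each spike is a hypothetical `(dest_core, neuron_id)` pair and a packet carries its destination address once. The function names and the packet layout are illustrative assumptions, not UniSpike's actual format or scheduling strategy.

```python
from collections import defaultdict


def merge_spikes(spikes):
    """Group (dest_core, neuron_id) spikes into one packet per destination core."""
    packets = defaultdict(list)
    for dest_core, neuron_id in spikes:
        packets[dest_core].append(neuron_id)
    return dict(packets)


def address_fields_sent(spikes, merged):
    """Count destination-address fields: one per spike if unmerged, one per packet if merged."""
    return len(merge_spikes(spikes)) if merged else len(spikes)
```

For example, four spikes going to two cores need four address fields when sent individually and two when merged, which is the redundancy the abstract describes.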
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
DescriptionClock gating and data gating are established techniques for reducing dynamic power. However, performing automated data gating during logic synthesis remains challenging due to the difficulty of accurately estimating power savings and identifying common Observability Don't Care (ODC) conditions for the enable logic. This work presents a methodology that addresses both challenges: it uses a novel SAT-based technique to compute valid ODC conditions and employs a machine learning-based power model to predict power improvements with high accuracy. The approach is integrated into an industrial synthesis flow and achieves efficient, fully automated data gating. Experimental results demonstrate an average dynamic power reduction of -1% post place & route, with negligible area and runtime overhead.
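To make the notion of an Observability Don't Care condition concrete, the sketch below finds, by brute-force enumeration, the assignments under which one input of a small function cannot affect its output. Enumeration is swapped in here purely for illustration; the work described above uses a SAT-based technique, and the `mux` example and the name `odc_conditions` are assumptions.

```python
from itertools import product


def mux(a, b, sel):
    """2:1 multiplexer: the output follows a when sel is 1, otherwise b."""
    return a if sel else b


def odc_conditions(func, target, names):
    """Return assignments of the other inputs under which `target` cannot affect the output."""
    others = [n for n in names if n != target]
    result = []
    for values in product([0, 1], repeat=len(others)):
        env = dict(zip(others, values))
        out0 = func(**env, **{target: 0})
        out1 = func(**env, **{target: 1})
        if out0 == out1:  # flipping the target input is unobservable here
            result.append(env)
    return result
```

For the multiplexer, input `a` is unobservable whenever `sel` is 0, so `sel == 0` is a valid condition for gating the logic that drives `a`.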
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
DescriptionInterrupts are a key mechanism by which computer systems schedule hardware resources. We find that some hardware interrupts are poorly managed and can expose low-level events to unprivileged software, which introduces unanticipated security risks.
We first design an automated testing framework to identify the interrupts' working features and trigger conditions, which helps us discover six vulnerable interrupts. Next, we design a novel set of attack primitives that use these interrupts to leak secrets. Finally, we realize four attacks with the proposed attack primitives: leaking contents from a restricted directory, fingerprinting DNN model architectures, classifying processes, and enhancing Spectre attacks.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionIn AI-driven design optimization EDA flows, there is usually a pre-defined list of primitives (options or parameters) that the flow can experiment with. By fine-tuning primitive values using AI-driven algorithms and analysis, EDA flows aim to produce optimal PPA (performance, power, and area) results with little user guidance. The pre-defined primitive list, however, is unlikely to be comprehensive, and users may request additional primitives to further improve PPA. Adding new primitives to EDA flows can take a significant amount of time due to software development and qualification timelines. In this work, we propose a novel methodology that allows users to add customized primitives to existing AI-assisted EDA flows. We demonstrate a case study on 3DIC power grid optimization to showcase this capability. Power grids are usually product specific, which prevents EDA vendors from pre-defining power-grid primitives even for the same process technology. With this methodology, users can instantaneously add or change customized primitives to expand the capabilities of AI-driven optimization flows, exploring design and flow parameters that were not previously available and opening new possibilities for PPA improvement with fast turnaround time.
People
Engineering Presentation
Design
EDA
Systems
DescriptionPerformance is an eternal issue for virtual platforms. Executing target code directly on hosts with the same instruction-set architecture (ISA) by leveraging virtualization technology can potentially increase performance by an order of magnitude compared to a just-in-time (JIT) compiler. Since Arm-based hosts have become more available, we have integrated Arm-based native acceleration (ANA) with the VLAB™ virtual platform framework. The solution uses ANA opportunistically and allows mixing different ISAs and core types independently of the host computer. JIT and interpreter modes can be used with ANA to fill in execution modes that ANA cannot handle due to how Arm virtualization works currently. Using ANA does not affect standard VLAB features, and platforms are built in the same way as before. We present the implementation and evaluate the trade-offs and performance on a variety of hosts, across different types of software loads.
Engineering Presentation
Design
EDA
Security
Systems
DescriptionThis work presents a machine‑learning‑assisted approach to pre‑silicon side‑channel analysis (SCA) for AES‑GCM cryptographic IPs. Traditional SCA methods rely on explicit leakage models and manual point‑of‑interest selection, which can limit accuracy and scalability in early‑stage hardware security verification. In contrast, this study utilizes RTL‑generated power traces and profiling attacks powered by convolutional and multilayer perceptron models to automatically learn secret‑dependent leakage without prior assumptions. The proposed flow integrates RTL simulation, power‑trace preprocessing, and ML‑based key‑byte classification to evaluate leakage resilience before tape‑out. Experimental results using 10,000 simulated traces demonstrate successful key‑recovery performance, with a significant portion of AES‑GCM key bytes identified at low ranking positions across both ML models. These findings show that ML‑assisted SCA strengthens shift‑left security validation by revealing vulnerabilities early in the design cycle and reducing dependence on traditional leakage modeling techniques.
Engineering Special Session
EDA
Quantum
DescriptionQuantum computing progress is no longer limited by scientific discovery alone, but it is constrained by the broader chip and manufacturing ecosystem required to scale reliable quantum systems. The emerging quantum chip supply chain spans materials, wafer fabrication, cryogenic packaging, control electronics, and system-level integration, each introducing new engineering and manufacturing challenges. Key bottlenecks continue to slow the transition from laboratory prototypes to industrial-scale production, revealing the limitations of applying traditional semiconductor workflows to quantum technologies. Addressing these challenges requires deeper hardware–software co-design, where device physics, fabrication processes, control stacks, and algorithm requirements evolve together. Reframing quantum development as a full-stack engineering problem highlights the technical and ecosystem-level priorities necessary to move quantum hardware from bespoke research platforms into manufacturable, scalable, and commercially viable computing systems.
Engineering Presentation
EDA
DescriptionTraditional analog mixed-signal (MS) IP verification relies heavily on analog-centric simulations, such as Monte Carlo simulations, which focus solely on analog behavior. This approach has significant drawbacks, including the inability to accurately model loading effects, kickbacks, and reference/supply variations, which can result in silicon bugs. Furthermore, the use of Verilog-AMS models and connect modules introduces additional complexity, leading to prolonged simulation runtimes and limited functional coverage. To address these challenges, this paper proposes a novel UVM-MS-based verification methodology that leverages VIPs (Verification Intellectual Property) to drive and monitor analog signals, providing comprehensive coverage of digital-analog interactions. Our approach eliminates the need for connect modules, reduces simulation runtime, and enhances functional coverage, thereby improving the overall verification efficiency of MS IPs. By adopting this methodology, designers can verify their MS IPs more accurately and reliably, reducing the risk of silicon bugs and improving time-to-market.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionVerification presents a major bottleneck in Integrated Circuit (IC) development, consuming nearly 70% of total effort. While the Universal Verification Methodology (UVM) improves reuse through structured verification environments, constructing subsystem-level UVM testbenches and generating high-quality stimuli still require extensive manual coding, repeated EDA tool runs, and deep protocol and micro-architectural expertise. We present UVMarvel, an automated verification framework that leverages Large Language Models (LLMs) to build UVM testbenches for subsystem-level RTL. UVMarvel introduces an Intermediate Representation (IR) and a Bus Protocol Library to translate heterogeneous specifications into protocol-correct subsystem-level UVM testbenches, and employs a Signal Tracker and a Verilog Patching Library to guide LLM-based stimuli refinement. UVMarvel is the first framework capable of automatically constructing subsystem-level UVM testbenches across mainstream bus protocols, and it achieves an average code coverage of 95.65%, while reducing verification time from several human working days to a 4.5-hour automated execution.
People
Research Special Session
AI
DescriptionAI is approaching its next major infrastructure frontier: space. The critical needs for AI in space demand high-performance fault-tolerant computing to support future spacecraft autonomy, in-orbit data processing and AI training & inferencing in orbital data centers. Our space computing research evaluates the implementation of AI/ML algorithms on next-generation multi-core processors, including High Performance Spaceflight Computing (HPSC) processors, Snapdragon VOXL2, AMD Xilinx Versal, and other processors, with the objective of achieving real-time computation while improving power efficiency and system reliability onboard spacecraft. Beyond computational performance, this study strengthens fault tolerance by leveraging HPSC's dual-core lockstep architecture and multi-core redundancy computations to implement fault protection methods in "static" and "dynamic" situations. By implementing targeted fault mitigation strategies at critical points in the computation pipeline, the approach ensures mission-critical reliability by reducing the impact of radiation-induced failures and hardware malfunctions in deep-space operations. We discuss a scalable, energy-efficient, and fault-tolerant computing architecture for future planetary exploration missions. This study provides a comprehensive evaluation of next-generation space processors and co-processors, offering key insights into high-performance, autonomous space computing solutions that balance speed, power efficiency, and mission-critical reliability. The advancements presented are directly applicable to AI applied to space exploration missions and orbital data centers.
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
DescriptionHybrid Transactional and Analytical Processing (HTAP) workloads are widely deployed in production systems. Under HTAP workloads, frequent updates lengthen the version chains maintained in Multi-Version Concurrency Control (MVCC) systems. Longer version chains require Visible Version Searching (VVS) to scan more versions during analytical queries. In-Storage Computing (ISC) can alleviate the resulting I/O burden. We propose VALVE, which offloads the VVS process to a Computational Storage Device (CSD). VALVE first solves the consistency issue that arises under host-CSD concurrency. It then introduces a frugal skip list to further accelerate version-chain traversal. Experimental results show that VALVE reduces end-to-end scan latency and improves analytical throughput.
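Why longer version chains slow down analytical scans can be seen in a minimal sketch of visible version searching. It assumes a simplified visibility rule (a version is visible if its commit timestamp is at or before the reader's snapshot timestamp) and a newest-to-oldest linked chain; the `Version` class and function names are hypothetical and do not represent VALVE's data layout or its skip list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    commit_ts: int
    value: str
    older: Optional["Version"] = None


def build_chain(entries):
    """Build a chain from (commit_ts, value) pairs given oldest first; return the newest version."""
    head = None
    for commit_ts, value in entries:
        head = Version(commit_ts, value, older=head)
    return head


def visible_version(head, snapshot_ts):
    """Walk newest to oldest; return the first version visible at snapshot_ts and the hops taken."""
    hops = 0
    node = head
    while node is not None:
        hops += 1
        if node.commit_ts <= snapshot_ts:
            return node, hops
        node = node.older
    return None, hops
```

The hop count grows with the number of versions newer than the snapshot, so an old analytical snapshot on a frequently updated row pays for every intervening update.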
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
DescriptionRRAM-based in-memory computing (IMC) offers high energy efficiency but suffers from conductance drift that severely degrades long-term accuracy. Existing approaches including retraining, noise-aware training, and Batch Normalization-based calibration either require RRAM rewriting, demand large storage overhead, or rely on online correction. We propose VeRA+, a lightweight drift compensation framework that reuses shared projection matrices and introduces only two compact drift-specific vectors per drift level. A drift-aware scheduling algorithm offline-trains a small set of VeRA+ parameters and selects the appropriate set over time without any on-chip retraining or data replay. VeRA+ preserves up to 99.77% of the drift-free accuracy after ten years of simulated drift and reduces storage overhead by more than three orders of magnitude compared with BN-based calibration. To validate VeRA+ under realistic device behavior, we extract one-week drift statistics from measurements on our fabricated 1T1R RRAM devices and use them to simulate realistic drifted weights. Under these measured drift conditions, VeRA+ achieves accuracy close to the drift-free baseline, providing an efficient and practical solution for long-term drift resilience in RRAM-IMC.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionLarge language models are increasingly used to generate hardware verification artifacts from natural language prompts, yet evaluation is often limited to syntactic or build success. We present VeriScore, an open evaluation framework that measures end-to-end verification quality by executing generated verification artifacts on correct and systematically mutated RTL designs. VeriScore provides standardized prompts with explicit assumptions, golden RTL implementations, and controlled bug-injected mutants representing common hardware error patterns. Generated artifacts such as assertions, checkers, testbenches, and harness code are built and executed under fixed resource budgets and scored on their ability to detect injected bugs without producing false failures. In addition, VeriScore can optionally benchmark models against a curated suite of known-correct verification artifacts derived from design specifications, enabling direct comparison to reference assertions and checkers. The resulting report decomposes performance into buildability, soundness, sensitivity, and non-triviality and aggregates these into a transparent prompt-to-verification accuracy score. VeriScore enables reproducible, tool-agnostic comparison of verification pipelines and establishes a benchmark for evaluating verification effectiveness beyond compilation.
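The scoring principle described above (detect injected bugs without producing false failures) can be sketched as follows. The abstract does not give VeriScore's actual formulas, so the function name, the input shape, and the rule that detections count only for sound artifacts are assumptions made for illustration.

```python
def score_artifact(passes_golden, mutant_failures):
    """Score one artifact from its result on the golden RTL and on each mutant.

    passes_golden: True if the artifact reports no failure on the golden design.
    mutant_failures: list of booleans, True where the artifact flagged that mutant.
    """
    sound = passes_golden
    # Assumed rule: detections only count when the artifact is sound, so a
    # checker that fails on every design cannot score well.
    detected = sum(mutant_failures) if sound else 0
    sensitivity = detected / len(mutant_failures) if mutant_failures else 0.0
    return {"sound": sound, "sensitivity": sensitivity}
```

Under this sketch, an artifact that passes on the golden design and flags three of four mutants has a sensitivity of 0.75, while one that also fails on the golden design scores zero regardless of how many mutants it flags.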
Exhibitor Forum
DescriptionDriven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced “Design Conductor” (or just “Conductor”), a system capable of building a 1.5 GHz, 5-stage Linux-capable RISC-V CPU running in 12 hours. In this work, we introduce an updated multi-agent harness powered by frontier models released in April 2026, which is able to handle 80x larger tasks, at higher quality, fully autonomously. We examined 4 designs that the system produced autonomously, including “VerTQ”, an LLM inference accelerator which hard-wires support for TurboQuant in a 240-cycle pipeline, starting from the TurboQuant arXiv paper published by Google Research on March 24, 2026. VerTQ includes heavy compute processing, with 5,129 FP16/32 units; the design was mapped to an FPGA at 125 MHz and consumes 5.7 mm² in the TSMC 16FF node (8 attention pipes). We review the key new characteristics that enabled these results. Finally, we analyze Design Conductor’s token usage and other empirical characteristics, including its limitations.
People
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionEnsuring reliable chip operation at minimum voltage (Vmin) demands robust analysis of timing, thermal, and power delivery networks (PDN). Traditional transient analysis methods are computationally intensive and impractical for evaluating PDN across all critical path scenarios. This work introduces the Path-aware SigmaAV (PSAV) methodology, which leverages Sigma technology in RedHawk-SC to efficiently simulate worst-case voltage drops. Critical paths extracted via PrimeTime are analyzed using aggressor voltage impact data, enabling realistic scenario generation without exhaustive transient simulations. The PSAV flow back-annotates worst-case drops into PrimeTime for slack analysis, guiding targeted PDN reinforcement. Applied to a 100M-instance design, PSAV simulated 1.9M path segments in 6 hours using 127 workers, outperforming traditional VCD-based methods in both coverage and accuracy. Over 90% of critical path instances were better analyzed in a single PSAV run, revealing slack violations missed by conventional techniques. Future enhancements aim to refine aggressor selection for broader coverage. PSAV presents a scalable, performance-efficient solution for Vmin reliability assurance.
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
DescriptionWhile deep neural networks (DNNs) have achieved state-of-the-art performance in fields from computer vision to natural language processing, efficiently running these computationally demanding models requires specialized hardware accelerators. However, designing these accelerators is a time-consuming, labor-intensive process that does not scale well across multiple design points. While prior efforts have sought to automate DNN accelerator generation, they typically offer limited parameterization, cannot produce high-performance, tapeout-ready designs, provide limited support for multiple datatypes and quantization schemes, and lack an integrated, end-to-end software compiler.
This work proposes Voyager, a high-level synthesis (HLS)-based framework for rapid design space exploration and generation of DNN accelerators. Voyager overcomes the limitations of prior work by offering extensive configurability across technology nodes, clock frequencies, and scales, with customizable parameters such as number of processing elements, on-chip buffer sizes, and external memory bandwidth. Voyager supports a much wider variety of datatypes and quantization schemes versus prior work, including both built-in arbitrary-length floating-point, posit and integer formats, as well as user-defined custom formats with both per-tensor scaling and microscaling quantization. Voyager's PyTorch-based compiler efficiently maps neural networks end-to-end on the generated hardware, with support for quantization, operation fusion, and tiling.
We evaluate Voyager on state-of-the-art vision and language models. Voyager enables fast design-space exploration with full-dataset accuracy evaluation for different datatypes and quantization schemes. Generated designs achieve a high utilization across models and scales, up to 99.8%, and outperform prior generators with up to 61% lower latency and 56% lower area. Compared to hand-crafted accelerators, Voyager achieves comparable performance, while offering much greater automation in design and workload mapping.
Research Manuscript
EDA
EDA2. Design Verification and Validation
DescriptionThe semantic gap between tensor-centric software models and signal-level hardware testbenches creates significant productivity bottlenecks in verifying Domain-Specific Accelerators (DSAs). Existing frameworks like UVM and Cocotb suffer from prohibitive synchronization overheads due to fine-grained interactions. To address this, we propose vTen, a data-centric framework that strictly decouples verification intent from execution mechanics. By leveraging a declarative DSL and kernel-granular batching, vTen minimizes host-simulator interaction frequency. Evaluation on a production-scale 3D U-Net accelerator demonstrates that vTen achieves a 2x performance improvement in simulation latency and a 60.3% reduction in code complexity compared to Cocotb.
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
DescriptionStatic timing analysis (STA) is crucial for Electronic Design Automation (EDA) flows but remains a computational bottleneck. While existing GPU-based STA engines are faster than CPU-based ones, they suffer from inefficiencies, particularly intra-warp load imbalance caused by irregular circuit graphs. This paper introduces Warp-STAR, a novel GPU-accelerated STA engine that eliminates this imbalance by orchestrating parallel computations at the warp level. This approach achieves a 2.4X speedup over the previous state-of-the-art (SoTA) GPU-based STA engine. When integrated into a timing-driven global placement framework, Warp-STAR delivers a 1.7X speedup over SoTA frameworks. The method also proves effective for differentiable gradient analysis with minimal overhead.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionTechnology Computer-Aided Design (TCAD) simulation of semiconductor devices is time-consuming. Existing machine learning surrogate models offer acceleration but require extensive labeled data from these slow TCAD simulations. To address this data dependency issue, we propose the Weak-Form Physics-Informed Neural Network (WF-PINN), a self-supervised method driven solely by physical laws and Dirichlet boundary conditions. WF-PINN utilizes an integral-based weak-form loss to eliminate the need for internal labeled solution data. Experimental results show that WF-PINN effectively simulates the physical characteristics of Fin Field-Effect Transistors, with all solution errors less than 5.36×10⁻². Furthermore, WF-PINN's inference speed is 2.16×10⁴ times that of TCAD simulation.
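As an illustration of what an integral-based weak-form loss can look like, consider a Poisson-type equation for a potential $\phi$. The choice of equation, the symbols $\varepsilon$ and $\rho$, the test functions $v_k$, and the squared-residual form of the loss are all assumptions made for this sketch; the abstract does not state which device equations or test functions WF-PINN uses.

```latex
% Assumed strong form on the domain \Omega:
%   -\nabla \cdot (\varepsilon \nabla \phi) = \rho
% Multiply by a test function v_k that vanishes on \partial\Omega and
% integrate by parts; the boundary term drops out:
\begin{align}
  \int_{\Omega} \varepsilon \, \nabla \phi \cdot \nabla v_k \, d\Omega
    &= \int_{\Omega} \rho \, v_k \, d\Omega, \qquad k = 1, \dots, K \\
% A weak-form loss for a network output \phi_\theta sums the squared residuals:
  \mathcal{L}(\theta)
    &= \sum_{k=1}^{K} \left(
         \int_{\Omega} \varepsilon \, \nabla \phi_\theta \cdot \nabla v_k \, d\Omega
         - \int_{\Omega} \rho \, v_k \, d\Omega
       \right)^{2}
\end{align}
```

A loss of this shape needs only first derivatives of the network output and no labeled interior solution values, which is consistent with the self-supervised setting described above.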
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
DescriptionMixture-of-Experts (MoE) models achieve remarkable performance through sparse expert activation, yet resource-constrained deployment encounters challenges as parameter sizes exceed device capacity, necessitating CPU offloading. Existing systems leave CPU underutilized during GPU attention computation. We present Weaver, a stratified scheduling framework exploiting this idle window through score-stratified expert treatment. Our insight reveals asymmetric expert importance: low-scoring experts tolerate input approximation while high-scoring ones remain accuracy-critical. Weaver proactively executes low-score experts on CPU during attention using prior-layer inputs, initiates stratified prefetching, and reactively balances remaining workload. Evaluations on three MoE models demonstrate 1.47-3.58x speedup over state-of-the-art offloading systems while maintaining model quality.
Engineering Presentation
Design
EDA
DescriptionLPDDR6 delivers substantial gains in bandwidth, power efficiency, and scalability through phase‑aware command scheduling, multi‑sub‑channel architectures, and data rates reaching 12.8 GT/s. These architectural innovations significantly complicate the verification of minimum timing constraints and performance‑critical behaviours in memory controllers, where simulation‑based approaches fail to expose phase‑dependent corner cases and sub‑optimal scheduling interactions.
This paper presents a scalable Formal Property Verification (FPV) framework to exhaustively validate LPDDR6 timing correctness and systematically analyze controller‑level performance limitations. The approach employs a generic timing reference model augmented with frequency‑mode‑aware phase counters, constrained timing registers, and state‑driven SystemVerilog Assertions (SVA) to precisely track command scheduling across 1:2:4 and 1:4:8 clocking modes. The formal methodology detects minimum‑timing constraint violations and LPDDR6 performance degradation, ensuring strict minimum‑timing compliance while improving memory‑controller scheduling efficiency.
Applied across all 16 LPDDR6 supported data rates, the framework uncovered multiple timing and performance defects, improved scheduling robustness, and enabled early detection of corner‑case issues. These results demonstrate that FPV is an effective and scalable solution for ensuring LPDDR6 timing compliance while maximizing memory‑controller performance.
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
DescriptionDefending DNNs against bit-flip attacks (BFAs) typically incurs prohibitive computational overhead, a critical barrier for deployment in safety-critical systems. We challenge this "one-size-fits-all" defense paradigm by introducing the Weight Function Hierarchy (WFH), a framework that deconstructs DNNs into functionally distinct tiers: a compact Anchor Core for foundational representation, a Class-Sensitive layer for decision-making, and a Reservoir for redundancy. Based on WFH, our synergistic framework, WFH-BFs, couples offline preparation with online adaptation. Offline, we differentially harden the Anchor Core while embedding pruning tolerance elsewhere. Online, a lightweight integrity checker safeguards the core, while a dynamic pruning mechanism reconfigures the network to neutralize attacks. Experiments show that WFH-BFs increases the cost of BFAs by 17.5x and maintains near-original accuracy (<1.4% drop) under high-rate random errors. Critically, our dynamic defense achieves this robust security while simultaneously reducing the overall computational load (GFLOPs), breaking the security-efficiency trade-off.
DAC Pavilion Panel
DescriptionChip design, EDA, and adjacent fields are experiencing a major renaissance, driving VCs to aggressively invest in a new wave of startups. Investors are now actively attending DAC to spot emerging trends and meet entrepreneurs. In this panel, we'll explore investors' theses, uncover what ideas they're backing and what they'd like to see more of, and hear their advice to current and aspiring founders.
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionTrusted Execution Environments (TEEs) provide confidentiality and integrity, but their reliance on untrusted schedulers makes them vulnerable to CPU Denial-of-Service (DoS) attacks, compromising availability.
Existing solutions either enlarge the Trusted Computing Base (TCB), rely on static closed-world workload assumptions that block dynamic enclave admission, or require hardware modifications.
We propose AvaTEE, a lightweight framework for verifiable CPU availability in TEEs.
AvaTEE uniquely integrates resource negotiation into remote attestation, providing a verifiable resource commitment pre-deployment.
At runtime, a trusted scheduler performs dynamic monitoring and preemptive arbitration to mitigate DoS attacks.
We evaluate AvaTEE on an FPGA.
The results demonstrate that it (1) provides robust protection under contention, where native TEEs suffer a 94.8% performance loss and up to a 19.7x slowdown; (2) incurs negligible overhead (less than 2%) during enclave startup and normal runtime; and (3) maintains near-native performance, with only 1.70% average overhead compared to Keystone.
People
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionA major challenge in 3DIC design is determining the top-level floorplan of a chip stack while accounting for die-to-die (D2D) interfaces. Existing approaches often model modules as simple rectangles and optimize area and wirelength under limited constraints, diverging from real scenarios where hard modules have fixed outlines and some have alignment constraints, while soft modules have undefined shapes. To address these issues, we propose a two-stage, whitespace-free rectilinear 3D floorplanning process. The first stage places hard modules and determines the spatial relationship of soft modules using a force-directed algorithm; the second tessellates each soft module into a rectilinear shape with minimal area error using POLT. The method integrates constrained hard modules and flexible soft modules to generate rectilinear, whitespace-free floorplans that meet target areas and alignment constraints.
Experimental results show that our method achieves an average 5.6% reduction in HPWL on 3D floorplans with alignment constraints and reduces area error by 98% compared to existing rectilinear floorplanners, while maintaining less than 0.16% area error across all 2D and 3D stacking scenarios.
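For readers unfamiliar with the metric, half-perimeter wirelength (HPWL) sums, over every net, the half-perimeter of the bounding box enclosing that net's pins. The following is a minimal 2D sketch only: the `net_hpwl` and `hpwl` helpers and the pin coordinates are illustrative assumptions, not the authors' implementation, and the sketch ignores any die-to-die (vertical) component.

```python
from typing import Iterable, List, Tuple

Pin = Tuple[float, float]  # (x, y) location of one pin; hypothetical data model


def net_hpwl(pins: Iterable[Pin]) -> float:
    """Half-perimeter of the bounding box around one net's pins (net must be non-empty)."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def hpwl(nets: Iterable[List[Pin]]) -> float:
    """Total HPWL: the sum of per-net half-perimeters."""
    return sum(net_hpwl(net) for net in nets)


# Two made-up nets, used only to exercise the helpers.
example_nets = [
    [(0.0, 0.0), (3.0, 4.0)],              # bounding box 3 x 4
    [(1.0, 1.0), (2.0, 5.0), (4.0, 2.0)],  # bounding box 3 x 4
]
total = hpwl(example_nets)
```

A reported percentage reduction in HPWL would then compare this total between two floorplans of the same netlist.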
Research Panel
Design
DescriptionChiplet-based integration has become a driving force behind the scale-up of leading systems in high-performance computing (HPC) and artificial intelligence (AI). This migration introduces a spectrum of design choices and trade-offs, particularly in inter-chiplet signaling and interconnection. As chiplet systems scale out to hundreds or even thousands of compute and memory tiles, a new ecosystem is rapidly taking shape, spanning advanced packaging and substrate technologies to novel signaling protocols and physical interfaces.
While high-speed electrical links remain the workhorse of today's chiplet architectures, photonic interconnects present a compelling alternative, promising massive bandwidth and energy efficiency over longer distances. Yet, photonics still faces significant challenges: integration complexity, electro-optical conversion overhead, and thermal sensitivity. Electrical links, in contrast, offer greater maturity, lower cost, and proven scalability.
The question now stands: as chiplet systems continue to scale, will photonic links become essential, or will electrical interconnects maintain their dominance?
Research Panel
AI
DescriptionThe Electronic Design Automation (EDA) industry is currently undergoing a "Cambrian Explosion" of agentic AI, where Large Language Models (LLMs) and Reinforcement Learning (RL) agents are no longer just assistants but primary drivers of testing, debugging, and synthesis. While these autonomous systems promise to solve the engineering talent gap and optimize PPA (Power, Performance, Area) beyond human capability, they introduce a radical non-determinism into a field that has historically demanded absolute precision. We are essentially replacing verified, hard-coded algorithms with probabilistic "black boxes" that are susceptible to hallucinations, subtle functional inaccuracies, and adversarial breaches that can bypass traditional verification guardrails.
The integrity of this new AI-driven flow is further threatened by the "Ouroboros Effect": the massive influx of generative synthetic data used to train the next generation of EDA models. As autonomous agents begin to train on the outputs of their predecessors, the risk of "self-poisoning" or model collapse increases, leading to a loss of the rare, "edge-case" wisdom that human engineers provide. This panel will confront the central paradox of modern design: if our verification tools are now driven by the same fallible AI models they are meant to check, we face a crisis of trust. We must determine if we are building a more efficient future or simply automating the creation of "silent" hardware vulnerabilities that will only manifest in silicon.
People
Work in Progress
DescriptionWith the growing interest in using AI (Artificial Intelligence) for RTL (Register-Transfer Level) hardware development, robust and comprehensive verification has become more important than ever. As Large Language Models (LLMs) increasingly assist in creating and modifying RTL designs, ensuring that these changes preserve existing functionality is paramount for building trust in AI-driven workflows. We present SVApshot, a fully automated framework that leverages LLMs to generate and systematically attempt to correct SystemVerilog Assertions (SVA) for RTL modules, completely removing the human from the generation loop. SVApshot introduces a novel snapshot methodology that captures the current functionality of a design as a comprehensive set of formal assertions, creating a regression suite for validating future changes, whether made by humans or AI. The framework features automated duplicate detection, assertion set expansion, and iterative repair of failing assertions using LLM-guided debugging with formal verification feedback. Our experimental evaluation across diverse RTL modules demonstrates high coverage for complex modules, with successful detection of both manually injected and AI-introduced bugs. Additionally, this framework opens the door for automatic generation of formal testbenches to benchmark LLMs for hardware design, moving beyond binary pass/fail metrics towards nuanced assertion-based scoring.
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
Description3D QLC NAND flash suffers from a slow two-step programming process.
The problem is exacerbated by write invalidation on wordlines, which occurs when pages are updated between the two steps, wasting programming effort and degrading performance.
While prior work offers partial mitigation, it fails to fundamentally address the root cause: the misalignment of page invalidation time within a wordline.
In this paper, we propose WILL, a proactive page mapping scheme that enables write invalidation skipping for QLC NAND flash.
The basic idea is to group pages with similar invalidation times into the same wordline, turning partially-invalid wordlines into fully-valid or fully-invalid ones so that programming can be skipped on fully-invalid wordlines.
To achieve this, WILL first predicts page invalidation time, then organizes pages accordingly to maximize the opportunities of write invalidation skipping.
Finally, WILL schedules short-lifespan data to wordlines requiring high-latency fine-step programming to increase the probability of skipping high programming latencies for further programming performance enhancement.
In addition, read parallelism is maintained by identifying and excluding read-hot data from this mapping.
Experimental results show that WILL reduces programming execution time by 10.0% compared with the state-of-the-art, along with skipping an additional 5.03% of fully-invalid wordlines and lowering the partially-invalid ratio by 15.9% on average, at a minimal read performance cost of 2.59%.
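The grouping idea described above can be illustrated with a small sketch. It is not the WILL algorithm itself: the predicted invalidation times, the page IDs, and both helper functions are hypothetical, and the sketch assumes four pages per wordline (one per bit stored in a QLC cell).

```python
from typing import Dict, List, Set

PAGES_PER_WORDLINE = 4  # assumption: QLC stores four bits per cell, one page per bit


def group_by_invalidation_time(predicted: Dict[int, float]) -> List[List[int]]:
    """Sort pages by predicted invalidation time and fill wordlines in that order."""
    ordered = sorted(predicted, key=lambda page: predicted[page])
    return [ordered[i:i + PAGES_PER_WORDLINE]
            for i in range(0, len(ordered), PAGES_PER_WORDLINE)]


def wordline_state(pages: List[int], invalid: Set[int]) -> str:
    """Classify a wordline by how many of its pages are already invalid."""
    count = sum(page in invalid for page in pages)
    if count == 0:
        return "fully-valid"
    if count == len(pages):
        return "fully-invalid"
    return "partially-invalid"


# Hypothetical predictions: even-numbered pages are short-lived, odd ones long-lived.
predicted = {0: 5.0, 1: 90.0, 2: 6.0, 3: 95.0, 4: 4.0, 5: 92.0, 6: 7.0, 7: 91.0}
invalid_between_steps = {0, 2, 4, 6}  # pages updated between the two programming steps

arrival_order = [[0, 1, 2, 3], [4, 5, 6, 7]]
grouped = group_by_invalidation_time(predicted)

arrival_states = [wordline_state(w, invalid_between_steps) for w in arrival_order]
grouped_states = [wordline_state(w, invalid_between_steps) for w in grouped]
```

In this toy case, arrival-order mapping leaves both wordlines partially invalid, while the grouped mapping yields one fully-invalid wordline (whose second programming step could be skipped) and one fully-valid wordline.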
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
DescriptionIntermittent computing systems suffer from limited observability and lack effective runtime analysis mechanisms. Unlike conventional systems, they are vulnerable not only to typical programming errors but also to a distinct class of failures known as intermittence bugs, which emerge due to their fragmented, non-linear execution model caused by frequent power loss. Traditional debugging tools are ill-suited for this context, as they assume persistent power and continuous connectivity. To address these challenges, we present a new Wireless In-Vivo/Ex-Vivo runtime analyzer called WINNER. It is a lightweight task management layer that records fine-grained execution traces (such as interrupts and memory/register states) into non-volatile memory during regular operation (in-vivo), and exposes them only during debug sessions (ex-vivo) triggered automatically by a user-defined property violation, such as exceeding a bound on repeated task execution. Building on this, WINNER introduces a new software-based communication paradigm that leverages controllable electromagnetic emanations (side-channel) for frequent but low-rate uplink communication, enabling software-controlled ultra-low-energy data transmission (up to 4x lower than BLE). We evaluate our solution across six applications and five platforms, demonstrating its effectiveness in detecting various bugs. The introduced overhead is minimal, with a runtime increase of less than 1% and an average binary size increase of 6%. Our tool is open-sourced.
Workshop
DescriptionContemporary microelectronic design is facing tremendous challenges in memory bandwidth, processing speed and power consumption. Although recent advances in monolithic design (e.g. near-memory and in-memory computing) help relieve some issues, the scaling trend is still lagging behind the ever-increasing demand of AI, HPC and other applications. In this context, technological innovations beyond a monolithic chip, such as 2.5D and 3D packaging at the macro and micro levels, are critical to enabling heterogeneous integration with various types of chiplets and bringing significant performance and cost benefits for future systems. Such a paradigm shift further drives new innovations on chiplet IPs, heterogeneous architectures and system mapping.
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionAs Moore's Law nears its physical limits, wafer-scale chips (WSCs) are emerging as a promising platform for LLM workloads, offering extreme performance and favorable cost. However, prior work seldom quantifies WSC cost end-to-end, hindering principled design-space exploration. We introduce WSC-Cost, a unified WSC cost model with two orthogonal dimensions. Horizontally, WSC-Cost quantifies mask-stitching costs across multiple metal layers. Vertically, it accounts for interposer fabrication and assembly costs. We validate WSC-Cost using open-source cost data and an in-house WSC prototype. WSC-Cost enables the co-optimization of wafer-scale systems and further provides cost-driven insights to unleash the full potential of wafer-scale integration.
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionModern chip design teams are increasingly exploring Gen-AI to improve productivity in front-end design, yet practical adoption remains challenging. Most existing Gen-AI tools focus on code generation or downstream assistance, while critical front-end tasks before RTL, such as system partitioning, interface definition, and verification readiness, are still handled manually. As a result, many Gen-AI generated designs suffer from validation gaps or insufficient functional coverage, with issues only exposed during late-stage simulation or integration.
This work presents XChip, a guardrailed, agentic Gen-AI front-end design flow that enables disciplined intent-to-RTL design with a supported path to GDSII via existing backend tools. XChip treats system-level design as a first-class abstraction and introduces Natural-Level Synthesis (NLS), a design stage that bridges high-level intent and RTL by generating structured system partitions, interfaces, and constraints. Within the NLS stage, XChip incorporates Generation Rule Check (GRC) as a built-in verification mechanism, performing rule-based validation and verification at the same abstraction level.
XChip supports two complementary front-end flows: a direct NLS-to-RTL flow, and an NLS-to-HLS-to-RTL flow that integrates with existing HLS tools for incremental adoption. Experimental results show significant reductions in manual effort and iteration cycles, measured by fewer human interventions, reduced errors, and improved functional coverage, while preserving compatibility with established RTL, HLS, and backend toolchains.
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
DescriptionEnsuring real-time and safety requirements is essential in mixed-criticality systems. While some studies have explored applying fault-tolerance techniques in multiple abstraction layers separately, such isolated, layer-specific fault mitigation incurs high overheads in peak power, energy, and timing. In this regard, we propose a novel scheme to address cross-layer reliability in mixed-criticality systems at design time by providing application-specific, low-cost fault tolerance. This approach distributes fault mitigation across multiple layers to minimize it at the hardware layer, ultimately resulting in more cost-effective system designs. To achieve this, we introduce a machine-learning-based approach to select efficient fault-tolerance methods across multiple layers, reducing overheads while meeting real-time and reliability requirements.
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
DescriptionDesign space exploration for microarchitecture parameters is a critical aspect of processor design, requiring a balance between evaluation cost and accuracy. New designs can be evaluated using either coarse-grained simulation (high efficiency, low accuracy) or fine-grained simulation (low efficiency, high accuracy). To address this efficiency-accuracy tradeoff, we propose XSearch, which employs multi-fidelity Bayesian optimization by cross-evaluating across simulators with varying fidelity levels. This approach leverages the efficiency of low-fidelity simulations to explore more design points while ensuring accuracy through calibration using high-fidelity simulation data. XSearch has been successfully applied to explore the design space of the large-scale open-source high-performance processor core XiangShan.
Research Manuscript
Systems
SYS3. Embedded Software
DescriptionZoned Namespace (ZNS) solid-state drives (SSDs) have been widely adopted in database systems, file systems, and large-scale data centers because they expose part of the physical storage layout to the host system. By leveraging this visibility, host software can access correlated data in parallel and eliminate valid-data copying during garbage collection, thereby improving I/O efficiency. To further integrate ZNS SSDs into memory management, prior work developed a Linux-based ZNS swapping mechanism that reserves multiple zones as swap space, extending virtual memory capacity. However, when deploying Heterogeneous Graph Neural Networks (HetGNNs), this mechanism fails to fully exploit ZNS parallelism due to HetGNN's highly irregular and non-sequential access patterns, which lead to chip-level congestion and long-tail latency. To address this issue, we propose Z-ParaSwap, a ZNS-based Parallelism-Aware Swapping Management framework for HetGNN applications. Z-ParaSwap identifies access correlations in graph data and distributes highly correlated pages across different chips to enhance parallelism and mitigate I/O contention. Experimental results show that Z-ParaSwap reduces average swapping latency by 34% and tail latency by 44%, significantly improving overall HetGNN execution efficiency on ZNS-based systems.
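The placement idea, spreading highly correlated pages across different chips, can be sketched with a simple greedy heuristic. This is an illustration under stated assumptions rather than the Z-ParaSwap algorithm: the correlation weights, the `assign_pages` helper, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, List, Tuple


def assign_pages(pages: List[int],
                 corr: Dict[Tuple[int, int], float],
                 num_chips: int) -> Dict[int, int]:
    """Greedily place each page on the chip holding the least-correlated pages."""
    # Make the correlation lookup symmetric.
    weight: Dict[Tuple[int, int], float] = {}
    for (a, b), w in corr.items():
        weight[(a, b)] = w
        weight[(b, a)] = w

    placement: Dict[int, int] = {}
    chip_pages: List[List[int]] = [[] for _ in range(num_chips)]

    for page in pages:
        def cost(chip: int) -> Tuple[float, int, int]:
            # Primary key: correlation with pages already on this chip.
            conflict = sum(weight.get((page, other), 0.0) for other in chip_pages[chip])
            # Tie-breakers: lighter load first, then lower chip index.
            return (conflict, len(chip_pages[chip]), chip)

        best = min(range(num_chips), key=cost)
        placement[page] = best
        chip_pages[best].append(page)

    return placement


# Hypothetical access correlations between page pairs.
correlations = {(0, 1): 0.9, (2, 3): 0.8, (0, 2): 0.1}
placement = assign_pages([0, 1, 2, 3], correlations, num_chips=2)
```

With these made-up weights, the strongly correlated pairs (0, 1) and (2, 3) end up on different chips, so a request touching both pages of a pair can be served in parallel.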
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
DescriptionGlitches in synthesized netlists continue to pose significant risks to functional correctness, especially in deeply optimized, high‑performance digital designs. While structural glitch analysis provides fast, design‑wide coverage, it alone cannot fully capture the functional conditions under which hazards become observable. Conversely, formal verification and dynamic simulation each offer deeper semantic insight but suffer from scalability or stimulus limitations. This work presents a unified methodology that combines static/structural glitch detection with formal and simulation-based verification to deliver end‑to‑end confidence in glitch‑free implementations.
We introduce techniques for cross‑domain correlation, where structural analysis first identifies potential hazard sites, which are then refined through formal trace generation or targeted simulation. For example, in control-heavy FSM logic, structural + simulation verification enables exhaustive proof that identified glitch cones cannot propagate to state elements under any reachable condition. Similarly, in an asynchronous FIFO, formal techniques are used to identify glitches in the Empty signal. Experimental results show that the combined flow significantly reduces false positives, improves coverage, and enables practical signoff for glitch vulnerabilities in complex SoC blocks.
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
DescriptionLLM inference stresses GPU memory with large model weights and KV caches, which requires efficient compression techniques.
But existing approaches either sacrifice accuracy (e.g., quantization) or apply serial entropy coding (e.g., DFloat11) that limits throughput. Both constrain the overall inference efficiency.
In this work, we propose ZipFloat, an LLM-oriented floating-point compressor that achieves massively parallel compression and decompression while preserving full precision.
We observe that weights and KV-cache values in LLMs exhibit strong statistical redundancies, yet existing floating-point formats mask this redundancy and hinder effective compression.
To address this, ZipFloat employs (1) Exponent Sparsification, which redefines the binary representation of floating-point values to restore compressibility, and (2) Bit-Matrix Packing, which leverages this restored structure with GPU-native parallelism to deliver extreme throughput.
Evaluations show that ZipFloat delivers up to 700 GB/s compression and decompression throughput, outperforming SOTA methods (e.g., DFloat11) by several orders of magnitude while maintaining comparable compression ratios, thereby reducing TTFT by over 95% and improving inference throughput by over 4x in LLM systems.
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
DescriptionZero-knowledge proofs (ZKP) enable verification without revealing private data, but proof generation remains compute-intensive, dominated by polynomial (POLY) and elliptic-curve (EC) operations over large-bitwidth fields. Efficient acceleration requires flexible multi-precision arithmetic and high utilization across shifting POLY and EC workloads, yet existing reconfigurable designs address these demands only partially. We propose ZK-Flex, a software–hardware co-designed framework that reduces computation through hardware- and workload-aware POLY and EC optimizers, and employs TCore, a Toom–Cook–based multi-precision core supporting diverse bitwidths. Across representative benchmarks, ZK-Flex delivers 5-11x speedup and up to 3.8x higher area efficiency than prior accelerators.
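As background on the Toom-Cook family mentioned above, the sketch below shows Toom-2 (better known as Karatsuba), the simplest member: it replaces four half-width multiplications with three. The abstract does not say which Toom-Cook variant TCore implements, so the function name, the 64-bit base-case threshold, and the operand sizes here are assumptions chosen purely for illustration.

```python
def toom2_multiply(x: int, y: int, base_bits: int = 64) -> int:
    """Multiply non-negative integers using Toom-2 (Karatsuba) splitting."""
    # Base case: small operands are multiplied directly.
    if x < (1 << base_bits) or y < (1 << base_bits):
        return x * y

    # Split both operands at the same bit position.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half

    # Three recursive products instead of four.
    low = toom2_multiply(x_lo, y_lo, base_bits)
    high = toom2_multiply(x_hi, y_hi, base_bits)
    mid = toom2_multiply(x_lo + x_hi, y_lo + y_hi, base_bits) - low - high

    # Recombine: x*y = high * 2^(2*half) + mid * 2^half + low.
    return (high << (2 * half)) + (mid << half) + low


# Operands of different bitwidths, loosely mimicking mixed-precision field elements.
a = (1 << 381) - 12345
b = (1 << 255) - 19
product = toom2_multiply(a, b)
```

Higher-order variants such as Toom-3 split each operand into more pieces and trade additional additions and interpolation work for fewer multiplications.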
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionWhile ZKP hardware acceleration has focused on backend proving, we identify frontend trace generation as the new system bottleneck through system-level profiling. To address this, we propose ZK-Tracer, the first hardware accelerator architecture for the zkVM frontend. ZK-Tracer features a novel heterogeneous design that couples a Main Trace Unit with parallel Permutation Trace Units, all managed by a lightweight ISA extension for efficient offloading. Our ASIC implementation shows that ZK-Tracer accelerates trace generation by 1829x, delivering a 963x end-to-end system speedup. This work rebalances the ZKP system by eliminating the emerging frontend bottleneck.
Sessions
Engineering Presentation
EDA
Systems
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
Student
ACM SIGDA/IEEE CEDA University Demonstration (part 3)
10:00am - 4:00pm PDT Wednesday, July 29 Exhibit Hall
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
Additional Meeting
Additional Meeting
Engineering Special Session
AI
Chiplet
EDA
Systems
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
Research Manuscript
Design
DES3. Emerging Models of Computation
Engineering Special Session
AI
Design
EDA
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
Engineering Presentation
AI
Design
EDA
Research Manuscript
Design
DES4. Digital and Analog Circuits
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
Research Manuscript
AI
AI4-I. AI/ML Architecture Design
Engineering Presentation
EDA
Systems
Research Manuscript
Design
Quantum
DES6. Quantum Computing
Engineering Presentation
EDA
Research Manuscript
EDA
EDA4. Power Analysis and Optimization
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
Research Manuscript
AI
AI3-I. AI/ML Application and Infrastructure
Research Manuscript
Systems
SYS2. Design of Cyber-Physical Systems and IoT
Research Manuscript
EDA
EDA8. Design for Manufacturability and Reliability
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
Research Manuscript
AI
AI4-II. AI/ML Architecture Design
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Engineering Poster
Engineering Presentation
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Engineering Poster
Engineering Poster Gladiator
Engineering Presentation
Engineering Track Award Ceremony
3:00pm - 4:00pm PDT Wednesday, July 29 DAC Pavilion, Exhibit Floor
Research Special Session
AI
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
Research Manuscript
EDA
EDA2. Design Verification and Validation
Engineering Presentation
Design
EDA
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
Engineering Special Session
EDA
Quantum
Engineering Presentation
AI
Design
EDA
Research Manuscript
EDA
EDA2. Design Verification and Validation
Research Manuscript
EDA
EDA6. Analog CAD, Simulation, Verification and Test
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Engineering Poster Gladiator
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Hack@DAC
9:00am - 5:00pm PDT Sunday, July 26 Mtg Room 103C
Networking Events
Hack@DAC
10:30am - 5:30pm PDT Monday, July 27 Mtg Room 103C
Research Manuscript
Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
Additional Meeting
ICLAD Hackathon
9:00am - 5:00pm PDT Sunday, July 26 Mtg Room 201A
Research Special Session
AI
Research Manuscript
Systems
SYS3. Embedded Software
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript
AI
AI2-II. AI/ML Algorithms and Models
Research Manuscript
Chiplet
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
Engineering Presentation
Design
EDA
Systems
Late Breaking Results
Research Manuscript
EDA
EDA9. Test, Validation and Silicon Lifecycle Management
Engineering Presentation
EDA
Security
Research Special Session
AI
Research Manuscript
Design
DES2A. In-memory and Near-memory Computing Circuits
Research Manuscript
Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript
Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems
Networking Events
Networking Reception
6:00pm - 7:00pm PDT Monday, July 27 Exhibit Hall
Networking Events
Networking Reception
6:00pm - 7:30pm PDT Tuesday, July 28 Level 1 Lobby & Promenade
Networking Events
NVIDIA Welcome Reception
4:30pm - 5:00pm PDT Sunday, July 26 Level 1 Foyer
Keynote
Opening: Awards
8:30am - 9:00am PDT Monday, July 27 Grand Ballroom
Keynote
Opening: Awards
8:30am - 9:00am PDT Wednesday, July 29 Grand Ballroom
Keynote
Opening: Awards
8:30am - 9:00am PDT Tuesday, July 28 Grand Ballroom
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
Research Manuscript
EDA
EDA3. Timing Analysis and Optimization
Research Manuscript
Design
Quantum
DES6. Quantum Computing
Research Manuscript
EDA
EDA7-I. Physical Design and Verification
Engineering Presentation
Design
EDA
Security
Systems
Engineering Presentation
EDA
Research Manuscript
Systems
SYS6. Time-Critical and Fault-Tolerant System Design
Research Special Session
EDA
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
Research Manuscript
EDA
EDA5. RTL/Logic Level and High-level Synthesis
Research Manuscript
AI
Quantum
DES2B-II. In-memory and Near-memory Computing Architectures
AI3-II. AI/ML Application and Infrastructure; EDA
EDA9. Test, Validation and Silicon Lifecycle Management; Design
Applications and Systems; Design
DES6. Quantum Computing; Security
SEC1. AI/ML Security/Privacy; Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
Research Manuscript
Quantum
DES3. Emerging Models of Computation
EDA
EDA8. Design for Manufacturability and Reliability; Design
DES6. Quantum Computing; Chiplet
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package; Systems
SYS5. Embedded Memory and Storage Systems; AI
AI3-II. AI/ML Application and Infrastructure; Design
Research Manuscript
Quantum
DES6. Quantum Computing
Systems
DES2B-I. In-memory and Near-memory Computing Architectures
SYS2. Design of Cyber-Physical Systems and IoT; AI
AI4-II. AI/ML Architecture Design; AI
AI5-II. AI/ML System and Platform Design; Design
Applications and Systems; Security
SEC3-II. Hardware Security: Attack and Defense; Security
SEC3-I. Hardware Security: Attack and Defense; Design
Research Manuscript
Systems
EDA6. Analog CAD, Simulation, Verification and Test; Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures; AI
AI2-II. AI/ML Algorithms and Models; EDA
EDA7-I. Physical Design and Verification; AI
SEC2. Hardware Security: Primitives
Architecture
SYS5. Embedded Memory and Storage Systems; EDA
AI5-II. AI/ML System and Platform Design; Security
Design & Test
Research Manuscript
AI
DES5. Emerging Device and Interconnect Technologies
AI4-I. AI/ML Architecture Design; EDA
EDA5. RTL/Logic Level and High-level Synthesis; Security
SEC1. AI/ML Security/Privacy; AI
AI1. AI/ML Frontiers for Hardware Design; Design
DES2A. In-memory and Near-memory Computing Circuits; EDA
EDA3. Timing Analysis and Optimization; Design
Research Manuscript
AI
AI5-I. AI/ML System and Platform Design
AI4-II. AI/ML Architecture Design; AI
AI4-I. AI/ML Architecture Design; Design
DES2B-II. In-memory and Near-memory Computing Architectures, Applications and Systems; EDA
EDA2. Design Verification and Validation; Systems
SYS3. Embedded Software; Systems
SYS6. Time-Critical and Fault-Tolerant System Design; AI
Research Manuscript
EDA7-II. Physical Design and Verification
Design
EDA5. RTL/Logic Level and High-level Synthesis; Security
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures; AI
AI2-II. AI/ML Algorithms and Models; EDA
EDA7-I. Physical Design and Verification; AI
AI5-I. AI/ML System and Platform Design; EDA
SEC2. Hardware Security: Primitives, Architecture, Design & Test; EDA
Research Manuscript
AI2-I. AI/ML Algorithms and Models
Design
DES4. Digital and Analog Circuits; Design
DES1-I. SoC, Heterogeneous, and Reconfigurable Architectures; EDA
EDA4. Power Analysis and Optimization; EDA
EDA2. Design Verification and Validation; Security
SEC4. Embedded and Cross-Layer Security; AI
AI1. AI/ML Frontiers for Hardware Design; AI
Research Manuscript
Chiplet
SYS4. Embedded System Design Tools and Methodologies
EDA
EDA1. Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package; Design
DES3. Emerging Models of Computation; AI
AI3-I. AI/ML Application and Infrastructure; EDA
EDA6. Analog CAD, Simulation, Verification and Test; Design
DES2B-I. In-memory and Near-memory Computing Architectures, Applications and Systems; AI
AI2-I. AI/ML Algorithms and Models; Systems
Research Manuscript
Security
SEC1. AI/ML Security/Privacy
Research Manuscript
Systems
SYS1. Autonomous Systems (Automotive, Robotics, Drones)
Research Manuscript
Systems
SYS5. Embedded Memory and Storage Systems
Research Manuscript
Security
SEC4. Embedded and Cross-Layer Security
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
Engineering Presentation
Chiplet
EDA
Research Manuscript
AI
AI5-II. AI/ML System and Platform Design
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
Research Manuscript
Security
SEC3-II. Hardware Security: Attack and Defense
Research Manuscript
Security
SEC3-I. Hardware Security: Attack and Defense
Research Manuscript
AI
AI1. AI/ML Frontiers for Hardware Design
Research Manuscript
Security
SEC2. Hardware Security: Primitives, Architecture, Design & Test
Research Special Session
Systems
Engineering Special Session
AI
Design
EDA
Systems
Research Special Session
Systems
Engineering Presentation
EDA
Security
Research Manuscript
Systems
SYS4. Embedded System Design Tools and Methodologies
Engineering Presentation
Design
EDA
Research Manuscript
EDA
EDA7-II. Physical Design and Verification
Engineering Presentation
Design
EDA
Systems
Engineering Presentation
AI
EDA
Systems
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
Research Manuscript
Design
DES5. Emerging Device and Interconnect Technologies
Research Manuscript
AI
AI3-II. AI/ML Application and Infrastructure
Engineering Special Session
AI
EDA
Engineering Special Session
AI
Chiplet
Design
EDA
Quantum
Security
Systems
Exhibitor Forum
AI
EDA
Systems
Exhibitor Forum
AI
EDA
Systems
Research Manuscript
Design
Quantum
DES6. Quantum Computing
Research Special Session
EDA
Networking Events
Welcome Reception
5:45pm - 7:00pm PDT Sunday, July 26 Level 1 Foyer
Research Special Session
AI
Research Manuscript
Design
DES3. Emerging Models of Computation
Research Manuscript
AI
AI2-I. AI/ML Algorithms and Models
Women in Engineering Session
5:00pm - 7:00pm PDT Monday, July 27 Exhibit Hall
Late Breaking Results
Student
Work in Progress
Student
Work in Progress
Young Fellow Award Ceremony
3:30pm - 4:30pm PDT Wednesday, July 29 Mtg Room 104A
Networking Events
Young Fellows
9:00am - 5:00pm PDT Sunday, July 26 Mtg Room 104A
Contributors
