All Software Is Hardware-Dependent

Any software that claims to be independent of hardware is inefficient, bloated software. The time for such software development is over.

I was lucky in my early career to find two sets of great mentors. The first came shortly after I graduated, when I joined the Hilo development team. Members of that team included Phil Moorby, Simon Davidmann, Peter Flake, and others. They all had very different coding personalities, but most importantly, they worked as a team and used good foundational processes.

One outcome of that was that the code was written from the ground up to be portable. In an era when every computer was different and ran a different operating system, this was important because there were no dominant providers. One of my early jobs was to port the software onto new platforms, and that process generally took a day, from the moment I arrived with magnetic tape in hand to finished, tested software. This was normally a shock to the computer companies because they expected me to be on-site for days.

Another aid to porting was the language used. This was before the C language became dominant, and the simulator was coded in a language called BCPL. Its language constructs were very similar to C's, but it had only two data types: integer and pointer to integer. Strings had to be treated as arrays of characters packed into integers. Structures were created using macros, where each element was addressed as the base address of the instance plus an offset.
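
For readers who never ran into that style, here is a minimal C sketch of the idea; the field names and values are purely illustrative, not taken from the original Hilo code. A "structure" is just a base address into a pool of words, and every field access is the base plus an offset.

```c
/* Illustrative only: a "structure" is a base pointer into a word array,
 * with each field addressed as base + offset, in the BCPL-era style. */
#include <stdio.h>

#define GATE_TYPE    0   /* word 0: gate type code             */
#define GATE_DELAY   1   /* word 1: propagation delay          */
#define GATE_OUTPUT  2   /* word 2: index of the output signal */
#define GATE_WORDS   3   /* words per gate record              */

#define FIELD(base, off) ((base)[(off)])

int main(void) {
    long pool[10 * GATE_WORDS];      /* storage for up to 10 gate records */
    long *gate = &pool[0];           /* an "instance" is a base address   */

    FIELD(gate, GATE_TYPE)   = 2;    /* e.g. 2 = AND gate */
    FIELD(gate, GATE_DELAY)  = 5;    /* 5 time units      */
    FIELD(gate, GATE_OUTPUT) = 17;   /* drives signal 17  */

    printf("type=%ld delay=%ld out=%ld\n",
           FIELD(gate, GATE_TYPE),
           FIELD(gate, GATE_DELAY),
           FIELD(gate, GATE_OUTPUT));
    return 0;
}
```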

One of the things this taught me was the importance of data alignment, because if you got it wrong and misaligned it, loading an integer would take twice as long. That was an important performance issue.
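
To make the alignment point concrete, here is a small C sketch; the types are made up, and the packed attribute is a GCC/Clang extension. The compiler pads the first struct so the integer field lands on an aligned boundary, while the packed version saves the padding but leaves the field misaligned, which can cost extra cycles, or even fault, on some architectures.

```c
#include <stdio.h>

struct aligned_rec {
    char tag;
    int  value;          /* compiler inserts padding so this is aligned */
};

struct __attribute__((packed)) packed_rec {
    char tag;
    int  value;          /* starts at offset 1: misaligned */
};

int main(void) {
    printf("aligned: %zu bytes, packed: %zu bytes\n",
           sizeof(struct aligned_rec), sizeof(struct packed_rec));
    return 0;
}
```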

Another important lesson was that overall performance was not tied to processor MIPS. You also had to consider the memory subsystem. A large cache with a good algorithm behind it was important. I was doing a port for a large computer manufacturer. With the port complete, its performance was measured. They then wanted me to try their brand-new machine that was just about to be launched. It was meant to have 3X the performance. We eventually gave up because it was so slow. Tests, which normally took an hour to run, remained incomplete at the end of the day. They had either skimped on the cache or changed the paging algorithm, and the machine spent the whole time thrashing memory.

The second set of great mentors came when I joined one of the large EDA companies. They had spent a large amount of time and effort creating a foundation library on which the rest of the software was developed. They spent time making sure it was solid and fast. Some of this replaced the standard C libraries, which, while highly generalized, were not optimized for the task at hand.

When I moved to another large EDA company, I was put in charge of a large piece of software that came from a sequence of acquisitions. All of the original developers had left at some point during those acquisitions. I would describe that development team as highly talented prima donnas with no time for process. The software was great, but quirky, and that resulted in lots of bugs. I inherited a very long bug list and some very important customers who were getting upset.

One of the things I did was to take an inventory of the software to get a picture of its structure. What I found was amazing. For example, there were seven different hashing routines. Every developer had written their own rather than trusting one written by another team member, or perhaps they were not even aware that others existed. We started a process of spending one day a week looking at these foundational building blocks. We wrote test routines and used these to assess each of the candidates. We found that some of them had bugs, and others had huge performance issues. We selected the best and then spent more time ensuring its quality and performance.
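
A minimal sketch in C of that kind of harness might look like the following; the hash function, key list, and bucket count are illustrative stand-ins, not the actual candidates we evaluated.

```c
#include <stdio.h>

#define BUCKETS 64

/* One candidate: a classic multiply-and-add string hash. */
static unsigned hash_str(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return h % BUCKETS;
}

int main(void) {
    const char *keys[] = { "clk", "reset", "data_in", "data_out",
                           "enable", "valid", "ready", "addr" };
    const size_t nkeys = sizeof(keys) / sizeof(keys[0]);
    int counts[BUCKETS] = { 0 };
    int collisions = 0;

    /* Hash every key, then count how many share a bucket. */
    for (size_t i = 0; i < nkeys; i++)
        counts[hash_str(keys[i])]++;
    for (int b = 0; b < BUCKETS; b++)
        if (counts[b] > 1)
            collisions += counts[b] - 1;

    printf("%zu keys, %d collisions in %d buckets\n",
           nkeys, collisions, BUCKETS);
    return 0;
}
```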

One of the routines we tackled was memory allocation. This is a classic example of a standard library routine that is written to be highly general, but which can destroy rather than enhance data locality. It also fragments memory over time, wasting an increasing amount of space when lots of small blocks of memory are allocated and freed during a run. Data locality is important for performance, especially in software like simulators that exhibit randomness in their data access patterns. Many of the structures of a simulator are regular, and so by converting them to a pool of memory packets, we could not only allocate and free faster, but also pull frequently accessed elements together. That resulted in less paging. We also added a number of debug features to our memory manager that could detect overflows, reads of uninitialized data, memory leaks, and more.
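
A minimal sketch of the fixed-size pool idea is shown below; the packet size, pool size, and names are illustrative, not the actual simulator code. Packets of one size come from contiguous chunks, so allocation and freeing are constant-time free-list operations, and frequently used records tend to stay close together in memory.

```c
#include <stdio.h>
#include <stdlib.h>

#define PACKET_SIZE  32      /* bytes per packet; must hold a pointer */
#define POOL_PACKETS 1024    /* packets requested from malloc at once */

typedef union packet {
    union packet *next;               /* link while on the free list */
    char          payload[PACKET_SIZE];
} packet_t;

static packet_t *free_list = NULL;

static void pool_grow(void) {
    packet_t *chunk = malloc(POOL_PACKETS * sizeof(packet_t));
    if (!chunk) { perror("malloc"); exit(1); }
    for (int i = 0; i < POOL_PACKETS; i++) {  /* thread chunk onto free list */
        chunk[i].next = free_list;
        free_list = &chunk[i];
    }
}

static void *pool_alloc(void) {
    if (!free_list)
        pool_grow();
    packet_t *p = free_list;
    free_list = p->next;
    return p;
}

static void pool_free(void *ptr) {
    packet_t *p = ptr;                /* O(1), and the pool never fragments */
    p->next = free_list;
    free_list = p;
}

int main(void) {
    void *a = pool_alloc();
    void *b = pool_alloc();
    pool_free(a);
    pool_free(b);
    printf("allocated and freed two packets from the pool\n");
    return 0;
}
```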

The process took a few months, but when we had finished that first level of routines, we found that a large number of the bugs had disappeared. We also had a significant performance boost that highly pleased the customers.

Many of these lessons might seem unimportant these days. Software productivity is deemed more important than anything else, and that has resulted in layer upon layer of routines, languages that hide what is going on, and little understanding of the impact that software has on hardware. An increasing number of the articles I write these days focus on these problems, and at least the hardware community is saying this has to change.

The memory wall is getting worse, and that means it becomes more important to think about how tasks fit into memory, or at least to understand the implications of a task that does not fit into fast memory and how to work with the other memories in the system. Memory structure should become a first-class citizen of software architecture.
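
A tiny C example of why access patterns matter once a task outgrows fast memory: the two loops below do identical arithmetic, but one walks the array in its memory layout order while the other strides across it, so nearly every access in the second form can miss the cache. The array size is arbitrary for the example.

```c
#include <stdio.h>

#define N 2048

static double a[N][N];               /* ~32 MB, larger than most caches */

int main(void) {
    double sum = 0.0;
    int i, j;

    /* Cache-friendly: the inner loop follows the row-major layout. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-hostile: same arithmetic, but each access jumps N*8 bytes. */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```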

The second big area that may force change is power and thermal. There are two parts to this. The first is about memory again. You have to minimize memory transfers because they are so expensive in terms of both time and power, while compute is almost free. The EDA industry had to deal with this when wires became more impactful than gates. Every algorithm had to change. The focus had to change.
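
As a small illustration of trading cheap compute against expensive data movement, the fused loop below produces the same results as the two separate passes while streaming the array through the memory system only once. The array and the operations are arbitrary for the example.

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    float *x = malloc(N * sizeof *x);
    if (!x) return 1;
    for (int i = 0; i < N; i++) x[i] = (float)i;

    /* Two passes: the data is streamed from memory twice. */
    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < N; i++) sum   += x[i];
    for (int i = 0; i < N; i++) sumsq += (double)x[i] * x[i];

    /* Fused pass: same arithmetic, half the memory traffic. */
    double sum2 = 0.0, sumsq2 = 0.0;
    for (int i = 0; i < N; i++) {
        sum2   += x[i];
        sumsq2 += (double)x[i] * x[i];
    }

    printf("%.0f %.0f vs %.0f %.0f\n", sum, sumsq, sum2, sumsq2);
    free(x);
    return 0;
}
```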

The second aspect is that energy consumption is not a metric that software understands. In some cases, it doesn’t matter if one algorithm is faster than another if it consumes more energy. If that additional energy consumption raises the temperature, then performance will drop and perhaps become slower than the other algorithm. Also, if that faster algorithm did not take into account what could happen in the memory system, always assuming the memory it needed was available in cache, then its performance could become highly variable, which for some industries is also a big issue.

Some of the lessons I learned may seem archaic today, but they were very important at the time — and they may be in the throes of coming back. Software must become more aware of the hardware it runs on — especially the memory.



1 comment

Tom Fitzpatrick says:

Many years ago, I worked at an EDA startup that had a graphical tool to define control flow, from which we generated RTL for synthesis. I often held design reviews for hardware groups at customers, and my first question was always, “What is the hardware going to look like?”
If they could answer me, I knew the design was good. If not, they were in trouble. You always have to understand the hardware.
That’s the difference between hardware engineers and software engineers. HW folks have always had that basis and have had to learn more SW over the years. It’s nice to see that SW folks now have to bridge that gap too. 😉
