The goal of this blog is to inform you of a business and/or career opportunity that has rapidly emerged over the past 2 years, and looks like it is accelerating. Herein, I am not trying to change your mind about climate change. What you believe is your responsibility, not mine. I am just informing you of an opportunity.
Last week Microsoft released its sustainability guidance. In a blog about that they said “The customer signal is clear—sustainability is now a business imperative. In a study of over 1,230 business leaders across 113 countries, 81 percent of CEOs have increased their sustainability investments. Sustainability is a top-10 business priority for the first time ever, and carbon emissions are forecasted to become a top-three criterion for cloud purchases by 2025. The number of large cities with net zero targets has doubled since December 2020 — from 115 to 235 and the global market for green data centers is projected to grow to more than $181.9B by 2026.” Thus, learning about Sustainable Software Engineering (aka green software) now might be a valuable asset to you professionally and the businesses you work with in the very near future.
Here are some links from the above blog where you can learn more:
1. Here is a very high level view of what hundreds, if not thousands, of large companies are doing about sustainability, and the career and business opportunities that are quickly opening up — https://blogs.microsoft.com/on-the-issues/2022/11/02/closing-sustainability-skills-gap/
2. Here is the blog where the above quote (and these links) came from, which has many more links as well — https://azure.microsoft.com/en-us/blog/improve-your-energy-and-carbon-efficiency-with-azure-sustainability-guidance/
3. The Green Software Foundation (part of the Linux Foundation) has green software principles, best practices, standards, and tooling — https://greensoftware.foundation/
4. Azure Well-Architected Framework Sustainability Guidance — https://learn.microsoft.com/en-us/azure/architecture/framework/sustainability/
5. Azure Kubernetes Service (AKS) Sustainable Software Engineering Practices (some of which will likely apply to other K8s implementations as well) — https://learn.microsoft.com/en-us/azure/aks/concepts-sustainable-software-engineering
6. A 33 minute simple and fun online course from Azure Learn that summarizes much of the above — https://learn.microsoft.com/en-us/training/modules/sustainable-software-engineering-overview/
Good luck!
George Stevens
“Sustainability is now a business imperative” © 2022 by George Stevens is licensed under CC BY 4.0 You have permission to copy this blog so long as you copy the previous sentence into your copy of it, under the Creative Commons attribution license in the preceding link.
By George Stevens, Software Architect and Senior Software Engineer at Solid Value Software
The Well Architected Framework (WAF) is supported by all 3 of the largest cloud providers, as shown in a link later in this blog. Thus it acts as a generally accepted body of knowledge that aids product owners, architects, developers, devops engineers, and quality assurance engineers to identify, design, implement, and test the non-functional requirements (NFRs) necessary for highly effective cloud software.
What is a non-functional requirement? “a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. They are contrasted with functional requirements that specify behavior or functions. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture, because they are usually architecturally significant requirements.” From https://en.wikipedia.org/wiki/Non-functional_requirement.
The WAF can be used in at least three ways:
- As a general education tool to promote understanding of all the kinds of cross-cutting concerns one must effectively deal with in cloud native apps. Please see the first part of my blog Best Practices for Azure Software Development for the reasons why cloud native apps are unique.
- As a reliable source for discovering the necessary Non-Functional Requirements. In other words one might wonder what kind of non-functional requirements are necessary for security, reliability, etc. for cloud native apps.
- As a reliable source for finding and understanding the best practices for implementing any cloud native NFR. In other words, given a Non-Functional Requirement, what are the best practices available for implementing that requirement in cloud native apps. For example, if an app needs data security the WAF can identify the options available and best practices (given a specific cloud provider) for implementing data encryption, plus other forms of data security if necessary.
In my opinion (based upon my experience plus observation of other teams) people and dev teams should start learning and using the WAF as soon as they start considering building a cloud native app. In this case use it for “general education” since one must do many, many more things dealing with NFRs for cloud native apps than with on-prem systems. You’ve really got to get ahead of this body of knowledge early since it is critical to having your nice, new cloud app work as it needs to without doing expensive and time consuming redos. Further, once you have committed to building a cloud app, you should frequently use the WAF for both NFR discovery, plus understanding the best practices for implementing an NFR. This will save time and make things easier.
Finally, it is highly appropriate to start detailing all the NFRs in your new cloud app and how they map to best practices and specific cloud technologies during the earliest phases in a cloud development project, aka the Fuzzy Front End. In other words, shift this whole NFR area left! Doing this sets you into a position of knowledge when it comes time to do detailed design and implementation, rather than having to scramble to overcome your ignorance of the NFR best practices of your chosen cloud provider. Such scrambling often does not work out well.
Also, it is critical to plan for the amount of time it will take to do the detailed design and implementation of all NFRs (which requires knowing what all the NFRs you need in the first place). Doing this near the beginning of a development project in the Fuzzy Front End allows for the identification of all NFR discovery, implementation, and testing work activities and their expected duration, which can then be put into the backlog of implementation work for the project. All this NFR work can amount to a whole lot in cloud native apps, much more so than in on-prem systems, and thus needs to be planned for in advance so as to avoid blowing out the schedule, budget and/or quality of the project and its deliverables.
For example, the Azure WAF is based on 5 pillars of excellence, as follows, and closely approximates those of the other cloud providers:
- Reliability
- Security
- Cost Optimization
- Operational Excellence
- Performance Efficiency
For each pillar the Azure WAF provides guidance on how to design and implement cloud services using best practices relevant to that pillar. The WAFs for AWS and Google are much the same, with fairly minor differences as detailed in Analysis of Cloud Well Architected Frameworks.
Here are links to the Azure, AWS, and Google WAF top level documents:
- Azure — https://docs.microsoft.com/en-us/azure/architecture/framework/
- AWS — https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html
- Google — https://cloud.google.com/architecture/framework
Azure has another amazingly useful way of learning the details of their WAF. That is the Azure Learn set of lessons called Build great solutions with the Microsoft Azure Well-Architected Framework. It consists of an introduction lesson, plus one lesson for each of the 5 pillars of excellence — 6 lessons total. These lessons are free, very well designed, easy, fun, and effective learning tools that can be used to learn in bite sized chunks. And you do not have to know any coding techniques. Having taken them all in 2020, I highly recommend them. I was initially awed at the breadth and number of NFRs required for typical cloud native apps. However, I was reassured by the structure, depth, and detail of the guidance the Azure WAF lessons provided.
The first Introduction lesson takes about an hour and provides an excellent way for all kinds of people to understand the value of the WAF approach and the specific topics it deals with within each pillar of excellence. For example, a VP of IT, CTO, Engineering Manager, Product Owner, or Project Manager will likely find great value in taking the 1 hour Introduction lesson, as will all architect, developer, devops, and quality assurance personnel. I suggest that people in the architect role take all 6 lessons, that developers take at least the Introduction, Security, Reliability, and Performance Efficiency lessons, and that devops engineers take at least the Introduction, Operational Excellence, Security, and perhaps Cost Optimization lessons. The goal here is to become familiar with the underlying cloud native NFR concepts, terms, and techniques before they hit you in the face when you need to use them during development.
I hope you find the Well Architected Framework as helpful as I have!
George Stevens
The Well Architected Framework: One Stop Shopping for Cloud Native Non-Functional Requirements © 2022 by George Stevens is licensed under CC BY 4.0 You have permission to copy this blog so long as you copy the previous sentence into your copy of it, under the Creative Commons attribution license in the preceding link.
By George Stevens, Software Architect and Senior Software Engineer at Solid Value Software
The Case for Sustainable Software Engineering
Stepping back from software, “sustainability” is about the long term sustainability of industrialized societies like ours which are dependent on the current natural systems and processes of planet earth. Sustainability is concerned with the things we have to do to prevent the irreversible degradation and possible subsequent collapse of some of earth’s natural systems and processes that support “life as we know it” in our industrial societies. If some key natural systems and processes of planet earth become degraded to the extent they are no longer able to support industrial societies, then humanity will face a lot of self-made trouble as the “goldilocks” environment that has nurtured us thus far fades away.
“For over two centuries and into modern times human ingenuity and the ongoing industrial revolution has led to great advancements, although with significant negative impact on our world’s climate. Greenhouse gas emissions due to our industrious achievements have contributed to an increase in the earth’s temperature, such that an additional increase of 1.5 degrees Celsius could result in catastrophic outcomes.” Ref 1.
Climate change, aka global warming, is a key disruptor here since it has the potential to disrupt multiple natural systems and processes that we depend on. A central cause of climate change is the increasing amount of “greenhouse gasses” (mostly CO2) that our industrial societies emit into the atmosphere. Elevated levels of CO2 trap heat in the atmosphere, preventing significant quantities of it from naturally radiating back out into space as it did back in the “goldilocks” days. In turn, the trapped heat increases the temperature of the earth, its atmosphere, and its oceans, disrupting many natural systems and processes we depend upon. Thus, reducing the “carbon footprint” of industrial societies (the carbon it refers to is the C in CO2 gas) is the antidote to climate change and its disruptions. I hope the above section provides a useful frame of reference for readers of this paper. A more detailed, and highly readable, explanation of climate change can be found in Saving Us by Katharine Hayhoe, a well-known climate scientist. Ref 14.
Throughout this paper I use the term “workload” to mean “some kind of software running on some sort of computer”. A workload is a software application, a software service, or a database system running on a computer or server. The total carbon footprint of a workload is the sum of:
- Carbon emissions created by the generation of the electricity required to run a computer hosting a workload.
- Plus carbon emissions resulting from manufacturing the computer which hosts a workload. This is known as embodied carbon.
In this paper the focus is on ways to reduce the carbon emissions caused by generating the electric power used to run computers hosting workloads. Ways to reduce the embodied carbon footprint are out of scope of this paper.
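The two components above add up to a simple sum. The sketch below makes that arithmetic concrete; all numbers and names in it are hypothetical illustrations, not measurements from any real workload.

```python
# Sketch of the total carbon footprint of a workload, per the two
# components above. All numbers here are hypothetical illustrations.

def workload_carbon_footprint(energy_kwh, grid_carbon_intensity, embodied_kg):
    """Total footprint (kg CO2-equivalent) = operational + embodied carbon.

    energy_kwh            -- electricity the host consumes running the workload
    grid_carbon_intensity -- kg CO2e emitted per kWh generated (varies by region)
    embodied_kg           -- share of the host hardware's manufacturing emissions
                             attributed to this workload
    """
    operational = energy_kwh * grid_carbon_intensity
    return operational + embodied_kg

# Example: 1,000 kWh/year on a grid emitting 0.4 kg CO2e/kWh,
# plus 50 kg/year of amortized embodied carbon.
print(workload_carbon_footprint(1000, 0.4, 50))  # 450.0 kg CO2e/year
```

Note that only the first term is in scope for this paper; the embodied term is fixed once the hardware is built.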
The information contained herein comes from various sources including my own education in environmental science (BS) and computer science (MS), plus my decades-long professional experience in software engineering and software architecture. I explicitly cite all sources from outside my direct experience. Please see the References section at the end of the paper for sources I reference in the text. And I thank all the individuals and organizations that contributed to the references cited in this paper.
The goal of this paper is to enable software architects and developers to more easily and effectively decrease the carbon footprint of the workloads they design, implement, and use, so as to reduce the impact of these workloads in driving climate change. Often we use, build, or are responsible for workloads that run 24/7 in data centers and clouds across the globe. Taken as a whole this consumes a lot of electric power, resulting in a significant carbon footprint.
The Value Proposition of Sustainable Software Engineering
How big is the carbon footprint of our software workloads? As of 3/2020 “The carbon footprint of our gadgets, the internet and the systems supporting them account for about 3.7% of global greenhouse emissions…. It is similar to the amount produced by the airline industry globally…. And these emissions are predicted to double by 2025”, Ref 2. The preceding refers to what is known as the ICT industry: Information and Communications Technology. This includes cloud computing, which is projected to continue to grow rapidly (double digits per year).
Thus, applying sustainable software engineering practices to workloads running on servers around the globe, i.e. converting them to “sustainable, green workloads”, can have a significant impact in reducing global carbon emissions now, and even more so in the future. Doing so is indeed worth the effort, given the negative disruptive consequences of doing nothing. And, as you will see in detail later, the effort required to produce “sustainable, green workload” designs and implementations is not really that large, nor that difficult, when compared with the “traditional” unsustainable techniques in widespread use today. Indeed, a number of the sustainable software engineering strategies, tactics, and techniques discussed herein also can potentially provide other valuable benefits like reducing operating costs by saving energy, plus reducing the lifetime costs of maintaining and evolving a software code base by reducing its complexity.
This line of thinking reveals the Sustainable Software Engineering Value Proposition. Effectively using Sustainable Software Engineering (SSE) practices typically produces two kinds of value:
- Survival Value — The decrease in the carbon emissions achieved via “sustainable, green workloads”. Please note that the elapsed time to convert existing workloads to “sustainable, green workloads”, or to build new ones, is on the order of a few months to a few years. This is typically a much faster time to realizing the survival value of decreasing carbon emissions than the time it usually takes to build or modify physical infrastructure like transportation systems, sustainable power systems, supply chains, water systems, agricultural systems, etc. Thus, the widespread use of SSE practices can be a key way to buy time while other sectors with longer time-to-value are ramping up their contribution to other significant reductions in carbon emissions.
- Economic Value — The potential decrease in the operating costs, plus perhaps also reductions in future software development costs, of workloads that may result as a side effect of using SSE practices to decrease the carbon emissions. Where they exist (and some projects will not have them) such cost reductions result from many of the SSE practices outlined herein decreasing the energy wasted by a workload as the main way of reducing the carbon footprint. And reducing wasted energy often acts to reduce cost. For example, after an SSE “greening” project the number of Virtual Machines (VMs) required to host multiple workloads will likely be reduced since the VMs should now be running workloads at a much higher workload density on each VM. This reduces the cost of the VMs and prevents high levels of unused capacity (waste). Exactly where this applies is outlined in the section “Strategies with their Tactics, and Techniques” below.
This is a really good win/win value proposition — The environment and our society wins, and the individual business wins! And, some workload “greening” projects have the potential of eventually paying for themselves over time due to cost savings. I expect that many projects aimed at producing “sustainable, green workloads” by using the SSE practices will have an excellent return on investment (ROI). That alone is often enough to put an SSE project near the top of an organization’s Todo list. And this, in turn, acts to increase the pace of counteracting climate change — A virtuous feedback loop.
Another definition of sustainability worth mentioning is broader, “Microsoft has made bold sustainability commitments to be carbon negative, zero waste, and water positive by 2030.” Ref 3. These are excellent areas with which to measure and manage sustainability. Combined, they encompass most of the key areas that contribute to long term sustainability of industrialized societies like ours. Note, however, this paper is solely focused on sustainability via carbon negativity since that falls directly within the realm of software engineering.
Strategies, Tactics, and Techniques of Sustainable Software Engineering
Here are the software architecture and engineering strategies, tactics, and techniques that reduce the carbon footprint of local and global workloads, realizing the SSE Value Proposition. The following eight Principles of Green Software Engineering by Asim Hussain provide an excellent high level overview of what must be done to build software systems and workloads with a reduced carbon footprint, Ref 4. However, this paper focuses on subsets of these eight principles so that it can directly state the specific techniques that software architecture and software engineering contribute, without going into many of the details of energy, carbon, sustainability, etc. Please examine the links below for detailed definitions of the terms used within the principles of SSE, plus how they impact sustainable software systems.
Here are the Principles of Green Software Engineering, quoted from Ref 4.
- “Carbon: Build applications that are carbon efficient.”
- “Electricity: Build applications that are energy efficient.”
- “Carbon Intensity: Consume electricity with the lowest carbon intensity.”
- “Embodied Carbon: Build applications that are hardware efficient.”
- “Energy Proportionality: Maximize the energy efficiency of hardware.”
- “Networking: Reduce the amount of data and distance it must travel across the network.”
- “Demand Shaping: Build carbon-aware applications.”
- “Measurement & Optimization: Focus on step-by-step optimizations that increase the overall carbon efficiency.”
Much the same information as above can also be obtained from the highly useful online Microsoft Learn course https://docs.microsoft.com/en-us/learn/modules/sustainable-software-engineering-overview/. Ref 5.
The newly formed (2020) discipline of Sustainable Software Engineering, aka Green Software Engineering, in part seeks to reduce the carbon footprint of workloads. There are several strategies for doing this, accompanied by a number of tactics and techniques for each strategy, presented below. Breaking up the information into strategies, tactics, and techniques provides for using both a large and small granularity of thought and action in the quest to identify areas to change that will reduce the carbon footprint of workloads. And then to design, plan, and make the required changes as well.
- Strategies are a few high level, broad scope actions that can reduce a workload’s carbon footprint.
- Tactics are important focus points and general actions for implementing one or more strategies.
- Techniques are quite specific actions that can be taken to reduce the carbon footprint of workloads.
Strategies
There are at least four clear high level strategies for reducing the carbon footprint of a workload, listed below. And, they are not mutually exclusive! Indeed, they can all be applied to suitable workloads so as to compound the sustainability gains achieved. Or, when necessary, a strategy may be applied individually as well. These are from the main categories in Ref 8.
- Measure the energy efficiency and carbon footprint of a workload, before and after any optimizations.
- Shape the energy demand and mix of a workload, without changing the workload’s software or data, resulting in a lower carbon footprint. This is done by shifting the demand for the workload to times and regions having excess capacity.
- Reduce network load and the distance data travels resulting in a lower carbon footprint. This may or may not require software or data changes.
- Increase resource utilization and decrease waste resulting in a lower carbon footprint regardless of where the workload runs. This may or may not require software or data changes. This is done mainly by reducing idle time.
Strategies with their Tactics and Techniques
Tactics and Techniques that have references were obtained from documents and links cited in the References section below. They are cited in the following text by the reference number in the References section, e.g. Ref 3. In a few cases below, the same Technique can be applied to more than one Tactic. Thus, I mostly duplicate the text describing such a tactic so that the reader does not have to page back and forth when reading the tactics and their related techniques.
For each item listed below, be sure to check with your cloud provider’s documentation and guidance for ways that you can reduce costs while at the same time building “sustainable, green workloads”.
Strategy — Measure the energy efficiency and carbon footprint of a workload, before and after any optimizations.
The whole area of SSE, aka Green Software Engineering, is quite new. Articles, blogs, and tools have just started widely appearing since mid-2020 or so. The terms, measurements, and tools of SSE are now just being defined and produced.
- Tactic — Use the emerging standard terms and measurements of the Green Software Engineering industry.
- Technique — Standard terminology: Use the standard terminology that has emerged, shown above and at this link Principles of Green Software Engineering, Ref 4.
- Technique — Standard measurements: Use the standard measurements. However, currently there are few standardized measurements, and fewer tools. The Green Software Foundation has committed to provide open source tools in the future, as well as standard measurements, and is doing just that, Ref 10.
- Tactic — Use the emerging measuring tools of the Green Software Engineering industry.
- Technique — Measuring tools: Use green software measuring tools. Currently the only Workload Energy Efficiency oriented tool that I have found (having done a little looking around, but not much) is Microsoft’s free Emissions Impact Dashboard, Ref 6. I have not yet investigated it. It measures the carbon footprint of Microsoft Azure resources.
Strategy — Shape the Energy Demand and Mix of a Workload without changing the workload’s software or data, resulting in a lower carbon footprint.
The carbon footprint of a running workload can be decreased by adjusting when, where and upon what compute resources it runs, as follows:
- Tactic — Utilize a data center’s “excess capacity” so as to increase the utilization of idle resources by a workload. Potential operations cost reduction.
- Technique — Change run time or region: “Changing the time or regions for a batch job to run” so they use the excess capacity available during that time or in that region. Ref 8.
- Technique — Use “spot pools”. Ref 8. “A spot node pool is a node pool backed by a spot virtual machine scale set. Using spot VMs for nodes with your AKS cluster allows you to take advantage of unutilized capacity in Azure at a significant cost savings. The amount of available unutilized capacity will vary based on many factors, including node size, region, and time of day.” https://docs.microsoft.com/en-us/azure/aks/spot-node-pool
- Tactic — Significantly increase the percentage of renewable energy used by a workload. Potential operations cost reduction.
- Technique — Move to a better region: Move the workload to a location that is powered by a much higher percentage of renewable energy, like solar, wind, or hydroelectric power. In other words, run a workload in a region with lower carbon intensity, and thus having a lower carbon footprint. Ref 8.
- Technique — Run at a better time: A given data center may have different levels of carbon intensity at different times of the day. For example, on summer nights a data center may have much lower carbon intensity since a region’s air conditioners will be demanding much less and not requiring “overcapacity generating units” be on line that exclusively use non-renewable resources. Thus, one can run some workloads in such data centers at night when the carbon intensity is the lowest. Ref 5.
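The “run at a better time” technique above amounts to scheduling deferrable work into the lowest-carbon-intensity window. The sketch below illustrates the selection step; the forecast numbers are made up, and in a real system they would come from a grid operator or a carbon-data service, not be hard-coded.

```python
# Sketch: pick the lowest-carbon-intensity hour from a (hypothetical)
# regional forecast before scheduling a deferrable batch job.
# The intensity values below are illustrative, not real data.

def best_hour_to_run(forecast):
    """forecast: dict of hour-of-day -> grid carbon intensity (g CO2e/kWh).
    Returns the hour with the lowest forecast intensity."""
    return min(forecast, key=forecast.get)

forecast = {
    9: 450,   # morning peak; fossil "overcapacity" units online
    14: 300,  # midday, solar contributing
    23: 180,  # summer night, air conditioning demand falling
    3: 150,   # overnight minimum
}
print(best_hour_to_run(forecast))  # 3
```

A real scheduler would also respect the job’s deadline, only shifting work within the window the business requirements allow.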
- Tactic — Favor running workloads in a public cloud over running them in on-prem data centers. Ref 3. Potential operations cost reduction.
- Technique — The document “The carbon benefits of cloud computing” in Ref 3 shows the results of a 2018 study that found:
- “the Microsoft Cloud is between 22 and 93 percent more energy efficient than traditional enterprise datacenters.”
- And it gets even better “when taking into account (Microsoft’s) renewable energy purchases, the Microsoft Cloud is between 72 and 98 percent more carbon efficient.”
Strategy — Reduce network load and the distance data travels resulting in a lower carbon footprint. This may or may not require software or data changes.
The carbon footprint of a workload’s network usage can be decreased by reducing the overall load on the network. This is done in two ways; however, “never reduce network travel at the cost of meeting workload requirements”, from Ref 8.
- Tactic — “Reduce the amount of network traffic your architecture creates per operation”, Ref 5.
- Tactic — Reduce the distance traveled over the network by each request and response, Ref 5, paraphrased.
- For both of the above Tactics — The reductions in network load and travel distance can be achieved by the following Techniques, most of which apply to both tactics:
- Technique — Use caches: Use caches to store static assets, as a general rule, to both reduce the network traffic per operation and also reduce the distance the data has to travel. Ref 5, paraphrased, above and below.
- This includes using browser caches and proxy caches via caching headers.
- It also includes using Content Delivery Networks (CDNs) to move static assets nearer the application that is requesting them.
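The caching-header point above can be made concrete with a small sketch. The function below builds the response headers a server might attach to a fingerprinted static asset so that browser caches, proxy caches, and a CDN can serve repeat requests without another trip over the network. The function name and the one-year lifetime are illustrative choices, not a prescribed standard.

```python
# Sketch: long-lived caching headers for fingerprinted static assets,
# letting browser/proxy caches and CDNs serve repeats without re-fetching.

def static_asset_headers(max_age_seconds=31536000):
    """Headers telling browsers, proxies, and CDNs to cache an asset.

    Safe for fingerprinted filenames (e.g. app.3f9c2a.js) whose content
    never changes; a new deploy ships a new filename, so the old entry
    can be cached for a full year (31,536,000 seconds) with 'immutable'.
    """
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}, immutable",
    }

print(static_asset_headers())
```

Every request answered from a nearby cache is network traffic, and therefore energy, that was never spent.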
- Technique — Reduce payload size: “Reduce the size and optimize your bundles and static assets.” Ref 5.
- “Compression and decompression (of) the data you transmit over the network” typically “takes less overall energy than transmitting uncompressed data.” Ref 5.
- Technique — Transmit binary data rather than characters: Use data transfer objects that are serialized into binary formats rather than character data, reducing the amount of data traveling over the network and thus consuming less network bandwidth. From my professional experience and https://particular.net/blog/transport-cost-is-zero, Ref 11. Potential operations cost reduction.
- The serialization/deserialization compute work required to use binary wire formats takes less energy than transmitting the more bulky character data. And XML is even worse than character data.
- For example, the use of Protobuf in gRPC is more energy efficient than using standard HTTP.
- It is usually far easier and more effective to dynamically scale (based on load) compute resources required for serialization/deserialization into binary payloads than it is to scale network bandwidth to support bulky character or XML payloads.
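The size difference between character and binary serialization is easy to see with the standard library. The sketch below uses Python’s `struct` module as a stand-in for a real binary wire format like Protobuf; the record and its field names are invented for illustration.

```python
# Sketch: the same record serialized as JSON text vs. a packed binary
# format (stdlib struct here as a stand-in for Protobuf/gRPC encoding).
import json
import struct

record = {"sensor_id": 42, "temperature": 21.5, "humidity": 0.55}

as_json = json.dumps(record).encode("utf-8")
# "<iff": little-endian int32 + two float32s = 12 bytes total.
as_binary = struct.pack(
    "<iff", record["sensor_id"], record["temperature"], record["humidity"]
)

print(len(as_json), len(as_binary))  # binary is a small fraction of the text size
```

Real binary formats also carry a schema (as Protobuf does), so both ends agree on field order and types, which is what the fixed format string stands in for here.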
- Technique — Shorten the distance between senders and receivers: Use technology to shorten the distance between the source of most of the requests and the location of the compute that serves the requests. All below from Ref 8.
- Create compute machines and clusters near the source of most of the network traffic.
- Use Azure Traffic Manager to route HTTP traffic to the closest compute.
- Use “proximity placement groups” to reduce the distance between Azure resources.
- Technique — Favor fire-and-forget messaging over RPC (remote procedure calls): Where practical, given the functional and performance requirements of a workload, use fire-and-forget messaging rather than RPC. From my own experience, plus Vasters, Ref 7. Potential operations cost reduction.
- With fire-and-forget messaging (aka queueing or topics) the sender does not have to do a number of send retries when the receiver is too busy or offline. Thus, using messaging in significant load situations can greatly reduce the amount of network traffic by reducing retry traffic, thereby decreasing the carbon footprint.
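The retry-avoidance point above can be sketched in-process: the sender hands a message to a queue and moves on immediately, while a slow or busy receiver simply drains the queue later instead of forcing the sender to retry. The in-memory `queue.Queue` below is a deliberately simplified stand-in for a real broker such as Azure Service Bus.

```python
# Sketch: fire-and-forget via an in-process queue (a stand-in for a real
# message broker). The sender enqueues and returns at once; it never sits
# in memory waiting, and the receiver drains messages at its own pace.
import queue
import threading

work = queue.Queue()
processed = []

def receiver():
    while True:
        msg = work.get()
        if msg is None:          # sentinel: shut down the worker
            break
        processed.append(msg)    # slow business processing would happen here
        work.task_done()

worker = threading.Thread(target=receiver)
worker.start()

for i in range(3):
    work.put(f"order-{i}")       # sender returns immediately: fire and forget

work.put(None)                   # tell the worker to stop after the backlog
worker.join()
print(processed)  # ['order-0', 'order-1', 'order-2']
```

With a real broker the queue also survives receiver restarts, which is what removes the retry traffic the bullet above describes.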
Strategy — Increase resource utilization and decrease waste resulting in a lower carbon footprint regardless of where the workload runs. This may or may not require software or data changes. This is done mainly by reducing idle time.
A key part of the rationale behind increasing resource utilization is that a running compute resource incurs a minimum “up-and-running overhead” carbon cost just due to the fact the resource is powered up and running. Therefore, it is better to have such resources more highly utilized since that reduces the “up-and-running overhead” carbon cost per workload. When one has numerous underutilized resources, each workload will have a higher “up-and-running overhead” and therefore a total higher carbon footprint.
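The “up-and-running overhead” argument above is just division, but it is worth seeing with numbers. The sketch below uses invented figures purely to illustrate why packing workloads densely onto fewer VMs lowers the fixed carbon cost attributed to each workload.

```python
# Sketch of the "up-and-running overhead" argument, with made-up numbers:
# each powered-on VM carries a fixed idle carbon cost, so packing workloads
# densely spreads that fixed cost across more workloads.

def overhead_per_workload(vm_idle_kg_per_year, workloads_per_vm):
    """Fixed 'up-and-running' carbon cost attributed to each workload."""
    return vm_idle_kg_per_year / workloads_per_vm

# One workload per VM (low utilization) vs. five packed per VM:
print(overhead_per_workload(200, 1))  # 200.0 kg CO2e per workload per year
print(overhead_per_workload(200, 5))  # 40.0 kg CO2e per workload per year
```

The same consolidation that divides the overhead also divides the VM bill, which is where the “potential operations cost reduction” notes above come from.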
Resource utilization can be increased and waste can be decreased by using the following Tactics and Techniques
- Tactic — Design and implement Software Architectures that have a specific requirement of increasing resource utilization and decreasing waste. Potential operations cost reduction.
- Technique — Favor fire-and-forget messaging over RPC (remote procedure calls): Where practical, given the functional and performance requirements of a workload, use fire and forget messaging rather than RPC. From my own professional experience, plus Vasters Ref 7.
- With fire-and-forget messaging (aka queueing or topics) the sender does not have to wait for the receiver to respond, all the while occupying memory while doing nothing but waiting. This gets even worse when the sender has to retry sending the request because the receiver is overloaded, slow, or errored, or when networking glitches happen. Thus, using fire-and-forget messaging in significant load situations greatly reduces the amount of memory consumed by code that is doing nothing but waiting for a response.
- The following quotes from Clemens Vasters in https://vasters.com/blog/use-queues/ make the point:
- “CPU cycles are short. Computers get a lot work done in little time. Compared, network transfers take forever. Having a program sit around, hogging memory, while it waits milliseconds for a computation “over there” that only takes microseconds to be performed is a huge waste.” Ref 7.
- While Vasters does not directly mention this, the waste he identifies consists of wasted energy which directly increases the workload’s carbon footprint over what it could be when using queues (fire-and-forget messaging).
- “Business processes are long. From placing a purchase order to delivery, days may pass because physical goods must be moved. Having hundreds of program instances compete to rush through processing the accompanying digital steps in microseconds is a waste.” Ref 7.
- Again, the waste is wasted energy. Why have very fast processing burning up electricity when it is not a business requirement?
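To make the contrast concrete, here is a minimal Python sketch of the fire-and-forget pattern, using an in-process queue as a stand-in for a real message broker (all names are hypothetical; a production system would use something like Azure Service Bus or a Kafka topic):

```python
import queue

# In-process stand-in for a durable message broker queue or topic.
work_queue: "queue.Queue[str]" = queue.Queue()

def send_fire_and_forget(message: str) -> None:
    """Enqueue and return immediately; the sender holds no memory
    waiting for a reply, and needs no retry loop if the receiver is busy."""
    work_queue.put(message)

def receiver_worker() -> list:
    """Drains the queue at its own pace, fully decoupled from senders."""
    processed = []
    while not work_queue.empty():
        processed.append(work_queue.get())
        work_queue.task_done()
    return processed

# Senders complete instantly, even while the receiver is busy or offline.
for i in range(3):
    send_fire_and_forget(f"order-{i}")

results = receiver_worker()  # the receiver catches up later
```

The key point is that `send_fire_and_forget` returns at once, whereas an RPC caller would sit in memory for the full network round trip.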
- Technique — Use asynchronous operations for the backend worker processes to run longer running jobs. Ref 4.
- This way a web app or WebAPI does not have to wait for the backend worker process to finish processing. Rather when the processing is done the backend can send an asynchronous notification to the end user via email, text messaging, SignalR, etc. Memory and CPU is freed up by the front end processes not having to wait in memory for the backend to complete its work.
- Note that this fits in with the above technique, as well as with many web apps.
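Here is a minimal Python sketch of that pattern, with an in-process queue standing in for a real job queue and a list standing in for an email/SMS/SignalR notification channel (all names are hypothetical):

```python
import queue

job_queue: "queue.Queue[tuple]" = queue.Queue()
notifications = []  # stand-in for an email/SMS/SignalR channel

def notify_user(user: str, message: str) -> None:
    """Stand-in for pushing an async notification to the end user."""
    notifications.append((user, message))

def submit_job(user: str, payload: str) -> None:
    """The web front end enqueues the job and returns at once; no thread
    or memory is held open waiting for the long-running job to finish."""
    job_queue.put((user, payload))

def backend_worker() -> None:
    """Runs the long job later, then pushes an async notification."""
    while not job_queue.empty():
        user, payload = job_queue.get()
        result = payload.upper()  # stand-in for real long-running processing
        notify_user(user, f"done: {result}")

submit_job("alice", "report-q3")  # the front end is free immediately
backend_worker()                  # the backend completes on its own schedule
```

The front end's memory and CPU are released as soon as `submit_job` returns; the user learns of completion via the notification, not by waiting.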
- Technique — Run backend worker processes at a lower priority than the front end processes. Ref 4.
- This keeps resource utilization high by running worker processes that are not in high demand by front end processes at a lower priority, rather than having to scale out when both are running at the same priority.
- Note that this fits in with the above techniques, as well as with many web apps.
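On Unix-like hosts, one common way to lower a worker's priority is the process "nice" value; here is a small sketch (Windows would use a different mechanism, such as process priority classes, not shown here):

```python
import os

def start_low_priority_worker() -> int:
    """Lower this process's scheduling priority so front-end processes are
    served first. On Unix, a higher 'nice' value means lower priority."""
    if hasattr(os, "nice"):
        # Request 10 levels lower priority; returns the new niceness.
        return os.nice(10)
    # Non-Unix platforms would need an OS-specific call instead.
    return 0

new_priority = start_low_priority_worker()
```

With the worker "niced," front-end processes get CPU first under contention, so both can share the same nodes at high utilization instead of scaling out.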
- Technique — Reduce the number of microservices used by using a level of conceptual abstraction in the high level design that limits overall complexity plus makes it easier to increase resource utilization. Ref 4.
- “Consider combining services, logically or physically, where similar scale points exist to reduce the footprint of the overall architecture”. Ref 4.
- “If two or more microservices are highly coupled, consider co-locating to reduce network congestion and latency”. Ref 4.
- “Use languages and technology stacks that optimize the efficiency of a specific microservice function. The independence and abstraction of functionality to an API layer means you are free to make the technical decisions that maximize utilization in your technical stack for each microservice.” Ref 4.
- Use high level software architecture design techniques like subsystems composed of multiple individual services, systems composed of multiple individual subsystems, and systems-of-systems composed of multiple individual systems to implement conceptual abstractions that greatly reduce complexity plus the work required to implement the above techniques. From my own professional experience, plus the IDesign Method, Ref 13.
- Tactic — Use fewer compute and memory resources at a higher rate of resource utilization. Note that increased utilization of these resources means decreasing the amount of idle time or unused capacity, Refs 4, 8. Potential operations cost reduction.
- Technique — Use hosts that provide the ability to easily fine tune resource utilization by workloads so as to be able to maximize utilization while having the required performance. This will provide the highest capability to reduce the carbon footprint by increasing utilization. All below from Ref 8.
- Use cluster computing hosts like Kubernetes and Service Fabric that excel at providing very high resource utilization, coupled with the ability to automatically scale in and out to meet shifting demand.
- This involves:
- Node sizing.
- Automated cluster scaling.
- The use of “spot pools”.
- Setting effective CPU and memory resource requests and limits per workload.
- Example — Assume you have a cluster with 4 nodes, each of which runs a given workload with 50% unused capacity. By increasing the density of the workloads on the cluster you can reduce the node count to 3 nodes, which decreases the unused capacity to 33%. This increased resource utilization reduces the carbon footprint of the workloads. Ref 8.
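The arithmetic in that example can be checked with a couple of lines of Python:

```python
def unused_capacity(nodes: int, used_node_equivalents: float) -> float:
    """Fraction of provisioned node capacity sitting idle."""
    return 1 - used_node_equivalents / nodes

# 4 nodes each 50% busy carry 2 "node-equivalents" of real work.
before = unused_capacity(nodes=4, used_node_equivalents=2)  # 0.50
# Packing the same 2 node-equivalents onto 3 nodes leaves only 1/3 idle.
after = unused_capacity(nodes=3, used_node_equivalents=2)   # ~0.33
```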
- Other hosting approaches that increase utilization include:
- Use PaaS to host your workloads. PaaS typically has higher resource utilization rates than bare VMs.
- If you are using VMs, use smaller VMs to increase your rate of resource utilization.
- In all of the above, use auto-scaling and/or burst mode to keep the rate of resource utilization high rather than relying upon statically provisioned capacity.
- Tactic — Increase utilization and decrease waste of Data Storage and Databases. Potential operations cost reduction.
- Technique — Optimizing which database you use, as well as how the data is stored, can reduce the energy used to run the database and decrease idle time spent waiting for queries to complete. Ref 4.
- “Ensure you are using the best database for interacting with your data set. For example, if you are running many relational queries on your data set, a relational database is better suited and likely more efficient to use than NoSQL database”. Ref 4.
- “If no single database is designed to handle all the ways you interact with your data set, consider keeping redundant copies of your data in different databases and using each database for the subset of interactions best suited for that database”. Ref 4.
- “Consider using an index if your database offers it”. Ref 4.
- “Consider evaluating and optimizing your queries”. Ref 4.
- “Consider using a database cache. In some cases, caching can reduce redundant queries to the database and decrease energy usage by the database, especially for complex or compute-intensive queries”. Ref 4.
- “Consider running any resource-intensive microservices in a region with a lower carbon intensity” or utilizing more renewable energy. Ref 4.
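As a concrete illustration of the database-caching point, here is a minimal Python sketch of a query cache (the query and result are stand-ins; a production system would use something like Redis and would also handle cache invalidation, which is omitted here):

```python
query_log = []  # records every actual database round trip

def run_query(sql: str) -> str:
    """Stand-in for an expensive, energy-consuming database round trip."""
    query_log.append(sql)
    return f"result-of({sql})"

_cache: dict = {}

def cached_query(sql: str) -> str:
    """Serve repeated identical queries from memory, so the database
    (and the energy it burns) is only hit once per distinct query."""
    if sql not in _cache:
        _cache[sql] = run_query(sql)
    return _cache[sql]

cached_query("SELECT * FROM orders")
cached_query("SELECT * FROM orders")  # cache hit; no second round trip
```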
- Tactic — Make adjustments to Workload Requirements, if practical, to promote higher resource utilization and decrease waste. Potential operations cost reduction.
- Making adjustments to a workload’s requirements involves 2 areas:
- Functional Requirements — The functionality provided by a workload.
- Performance Requirements — Latency, Availability, etc.
- A number of the previously covered Tactics and Techniques can be used when adjusting a workload’s requirements. But here the approach is different: it involves looking closely at the Functional and Performance requirements and asking if any can be relaxed in a way that still satisfies users while also significantly decreasing the carbon footprint. For example:
- Do all the user/UI/Server interactions really need to have the lowest possible latency? Perhaps some interactions can still satisfy user needs but be slower or take the form of asynchronous notifications, including push technologies like SignalR, text messaging, or even email.
- Can some processing that is asynchronous in nature be run at different times, or in different regions that have lower-carbon-intensity power available?
- It basically comes down to scrubbing the Functional and Performance requirements to find high return ways to decrease the carbon footprint while still creating satisfied users of an app.
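The second question above, shifting deferrable work toward lower-carbon regions or times, can be sketched in a few lines of Python. The region names and intensity figures below are purely hypothetical; in practice they would come from a grid-data provider:

```python
# Hypothetical carbon-intensity figures (gCO2/kWh) per region. In practice
# these would be fetched from a grid-data service, not hard-coded.
carbon_intensity = {
    "region-a": 450,
    "region-b": 120,
    "region-c": 300,
}

def pick_greenest_region(intensities: dict) -> str:
    """For deferrable, asynchronous work, choose the region whose grid
    currently has the lowest carbon intensity."""
    return min(intensities, key=intensities.get)

best = pick_greenest_region(carbon_intensity)
```

The same selection logic applies to time-shifting: replace regions with candidate time windows and pick the window with the lowest forecast intensity.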
I hope this document aids you in reducing the carbon footprint of your workloads, thus moving the world a little farther away from the consequences of climate change.
References
1. “A new standard for Scope 3 emissions transparency” by Microsoft, Copyright © 2021. https://go.microsoft.com/fwlink/p/?linkid=2161861
2. https://www.bbc.com/future/article/20200305-why-your-internet-habits-are-not-as-clean-as-you-think and, for confirmation, see https://www.sciencedirect.com/science/article/pii/S2666389921001884 and https://www.upi.com/Science_News/2021/09/10/communications-tech-carbon-emissions/8771631295350/
3. “The carbon benefits of cloud computing”, a 2020 Microsoft white paper based on a 2018 study. https://download.microsoft.com/download/7/3/9/739BC4AD-A855-436E-961D-9C95EB51DAF9/Microsoft_Cloud_Carbon_Study_2018.pdf
4. Principles of Green Software Engineering, https://principles.green/ — “A personal project by Asim Hussain, Green Cloud Advocacy Lead at Microsoft. Further training based on the Principles of Green Software Engineering is available on the Microsoft Learn platform.” The principles.green material is copyrighted under the MIT License in a GitHub repo that contains the code for the principles.green website, https://github.com/jawache/principles-green.
5. Microsoft Learn’s “Principles of Sustainable Software Engineering” course, authored by Asim Hussain. https://docs.microsoft.com/en-us/learn/modules/sustainable-software-engineering-overview/
6. Microsoft Emissions Impact Dashboard — https://www.microsoft.com/en-us/sustainability/emissions-impact-dashboard
- Here is an informative blog about this dashboard: https://azure.microsoft.com/en-us/blog/empowering-cloud-sustainability-with-the-microsoft-emissions-impact-dashboard/
7. “Queues” by Clemens Vasters, May 15, 2020. https://vasters.com/blog/use-queues/
8. Sustainable Software Engineering Principles in Azure Kubernetes Service (AKS). https://docs.microsoft.com/en-us/azure/aks/concepts-sustainable-software-engineering and https://docs.microsoft.com/en-us/azure/aks/concepts-sustainable-software-engineering#reduce-network-travel
9. Advancing reliability through a resilient cloud supply chain. This blog mentions Microsoft’s sustainability commitments.
10. The Green Software Foundation — https://greensoftware.foundation/ — “The Green Software Foundation is a non-profit with the mission to create a trusted ecosystem of people, standards, tooling and best practices for building green software.”
- “The Green Software Foundation was born out of a mutual desire and need to collaborate across the software industry. Organizations with a shared commitment to sustainability and an interest in green software development principles are encouraged to join the Foundation to help grow the field of green software engineering, contribute to standards for the industry, and work together to reduce the carbon emissions of software.”
- “OUR FOCUS AREAS
- Standards: Agreeing on standards, best practices and patterns for building green software.
- Innovation: Nurture the creation of trusted open-source and open-data projects and supporting academic research.
- Community: Facilitate the growth of a diverse and international community of green software ambassadors.”
11. “Fallacy #7: Transport cost is zero” by David Boike, 3/2/21, in his blog https://particular.net/blog/transport-cost-is-zero. This article is part of the 8-article series “Fallacies of Distributed Computing”.
12. Microsoft Green Design Principles, https://wxcteam.microsoft.com/download/Microsoft-Green-Design-Principles.pdf. This reference is not quoted in this paper, but it is quite useful as a guide to things to consider when starting a software project. Of great importance, it presents some highly useful UX principles that can reduce the carbon footprint of an app, especially when used with the service-oriented techniques focused on in this paper.
13. Righting Software: A Method for System and Project Design by Juval Lowy, Copyright © 2020 Pearson Education, Inc.
14. Saving Us: A Climate Scientist’s Case for Hope and Healing in a Divided World by Katharine Hayhoe, Copyright © 2021 by Katharine Hayhoe. Published by Atria Books, New York, NY.
Sustainable Software Engineering’s Technical Strategies, Tactics, and Techniques © 2022 by George Stevens is licensed under CC BY 4.0. You have permission to copy this blog so long as you copy the previous sentence into your copy of it, under the Creative Commons attribution license in the preceding link.
Helm 3.0 Named Templates provide an easy way to reduce the amount of “copy/paste” and “yaml writing” required when authoring the yaml for Kubernetes (K8s) Manifests. K8s Manifests are used to deploy app services, K8s services and other things to Kubernetes clusters. Each individual Daprized app service (having a dapr sidecar) deployed to K8s needs roughly the same Dapr Annotations, and also typically requires minor changes to the detailed value settings of each annotation. Thus, the yaml for Dapr Annotations is an ideal place to use Helm Named Templates. For information about the various annotations required to deploy app services with dapr sidecars (aka Daprized app services) to Kubernetes clusters see https://docs.dapr.io/operations/configuration/configuration-overview/.
This article presents a highly focused example of the streamlining that results when Helm Named Templates are used for Dapr Annotations. It does not discuss what Helm is, or what Dapr is, or what Dapr Annotations are, or what Kubernetes deployments are, or what Kubernetes is. Please see the References section at the end of this paper for links to this kind of information.
The Helm files used in this document are also available in my DiscoverDapr/k8s-dapr-annotations-with-helm3 GitHub repository. However, be warned that the K8s Manifests generated by these Helm files may not run correctly when deployed to a K8s cluster! I derived them from a set of microservices that does successfully deploy to and run on K8s clusters without error, but since these derived examples are only used here to demonstrate the concepts and techniques of using Named Templates for Dapr Annotations, I have not fully debugged them by deploying them to a K8s cluster and verifying the app services run correctly. I leave that task to the reader. These Helm files do, however, correctly generate the desired K8s Manifests when run through the “helm upgrade” command with the “--dry-run” flag set. Whether or not the app services deployed by the Manifests shown herein run correctly on K8s is another question.
At this point, defining key terms is necessary since all the files involved contain yaml, with many of them having the suffix “.yaml”. The below terms keep all these yaml artifacts differentiated from one another.
Helm Template Files — For example deployment.yaml, service.yaml, etc. Helm Templates are written in yaml. They are used as inputs to Helm which uses these files to produce yaml text that is output to K8s Manifests. Also note that Helm Named Templates are also Helm Template Files, although they use a “.tpl” suffix even though they are written in yaml as well.
Helm Values Files — For example values.yaml. Helm Values Files contain the yaml that defines detailed value settings consumed by the Helm Template Files. The separation of concerns between templates and the values that populate various items within a template allows a single Helm Template File to be used for multiple releases, i.e. dev, test, staging, production via using different values.yaml files for each release. Please refer to the Helm documentation and/or articles in the below Reference section for the exact details of how this reuse of Helm Template Files is achieved.
K8s Manifest — For example app-servicea.yaml, deployment.yaml, or service.yaml, etc. Confusing? These K8s Manifests are written in yaml either automatically by Helm, or manually by hand. K8s Manifests are used as inputs to the K8s Api Server (sometimes manually via the kubectl command) which creates the K8s objects defined in each manifest, and runs the objects in the K8s cluster. Thus, a K8s Manifest defines the desired state of a K8s cluster, or at least part of that cluster. The K8s runtime’s job is to continually bring the actual state of the cluster into the desired state as defined by all K8s Manifests processed by the API Server.
For example, this article uses a Helm Template File, a Helm Named Template file, plus a Helm Values File, to generate a K8s Manifest that is in turn fed into the K8s API Server by Helm. This process creates a running K8s pod containing an app-servicea container and its dapr sidecar container, as shown subsequently.
Below is an example of the “end result” K8s Manifest yaml generated by Helm 3.0 using the Helm Template File, the Helm Named Template, and the Helm Values File shown later in this paper. The Named Template defines the Dapr Annotations for a Daprized app service that runs in K8s. For the sake of simplicity, the below example shows the minimal yaml required to run the app-servicea and its dapr sidecar containers. Much more yaml is required to run a production app service in K8s. Please ignore everything in the below yaml except the annotations section, since it is the main focus of this article.
---
#Source: blog-testsys/charts/app-servicea/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: blog-testsys-app-servicea
labels:
helm.sh/chart: app-servicea-0.1.0
app.kubernetes.io/name: app-servicea
app.kubernetes.io/instance: blog-testsys
app.kubernetes.io/version: "0.2.0"
app.kubernetes.io/managed-by: Helm
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: app-servicea
app.kubernetes.io/instance: blog-testsys
template:
metadata:
annotations:
dapr.io/app-id: "blog-testsys-app-servicea"
dapr.io/enabled: "true"
dapr.io/app-port: "3500"
dapr.io/app-protocol: "http"
dapr.io/config: "blog-testsys-config"
dapr.io/log-level: "debug"
labels:
app.kubernetes.io/name: app-servicea
app.kubernetes.io/instance: blog-testsys
spec:
serviceAccountName: default
securityContext:
{}
containers:
- name: app-servicea
securityContext:
{}
image: app-servicea-demo:dev
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 80
protocol: TCP
env:
- name: "ASPNETCORE_URLS"
value: "http://+:80"
command:
- dotnet
- AppServiceA.dll
resources:
{}
---
Example 1. The K8s Manifest yaml generated by Helm when using a Helm Named Template to produce Dapr Annotations.
I have placed the above yaml in the GitHub code samples file charts\blog-testsys\K8sManifestProducedFor-app-servicea.txt so that you can inspect it. Also note that the GitHub code samples for this blog produce more than the minimal K8s Manifest yaml shown above. In addition to the yaml to deploy app-servicea and app-serviceb and their dapr sidecars, they also include the yaml for 2 K8s NodePort services that are used by test clients to talk to app-servicea and app-serviceb when run on a local dev cluster. All of this yaml is visible in your command window when you run the following Helm upgrade command.
When you use the “helm install” or “helm upgrade” command, Helm directly communicates with the K8s API Server. The K8s Manifest yaml generated by Helm is sent directly to the K8s API server and is not deposited in a file, although it is visible in the command window. If you want to visually check the generated yaml before you actually deploy it (a wise practice to make sure it is as you expect) use the “--dry-run” switch, as shown below. The below helm command will work on the files in the GitHub code sample. It is in the GitHub code samples file charts\blog-testsys\MakeHelmChartsForBlog.txt. The below command must be run in your ..\charts\blog-testsys directory on your development machine, which must have Helm 3.0 and kubectl installed. And your K8s config must be pointing at a running K8s cluster, done with the kubectl config command.
helm upgrade --install blog-testsys . --namespace default ^ --set app-servicea.image.tag="dev" ^ --set app-serviceb.image.tag="dev" ^ --debug --dry-run
Moving back to the main point of this blog — Instead of writing the yaml for the Dapr Annotations in the above K8s Manifest from scratch, or copy/pasting it from other files, for each and every Daprized app service one deploys to Kubernetes, we can instead use a Helm Template that uses a Helm Named Template File, plus a Helm Values File, to do most of the yaml text generation. Using a Named Template saves time and greatly reduces errors and rework.
Here is the Helm Named Template that produced the dapr annotations in the K8s Manifest yaml in Example 1, above. This Helm Named Template is located in a file in the GitHub code samples named charts\blog-testsys\templates\_dapr-helpers.tpl.
{{/*
Dapr annotations
*/}}
{{- define "dapr.annotations" -}}
dapr.io/enabled: {{ .Values.daprAnnotations.enabled | quote }}
dapr.io/app-port: {{ .Values.daprAnnotations.appPort | quote }}
dapr.io/app-protocol: {{ .Values.daprAnnotations.appProtocol | quote }}
dapr.io/config: "{{ .Release.Name }}-config"
dapr.io/log-level: {{ .Values.daprAnnotations.logLevel | quote }}
{{- end }}
Example 2. The Helm Named Template yaml that produces most of the bold faced Dapr Annotations yaml in Example 1.
The above Named Template is named dapr.annotations. It contains a single line for each Dapr Annotation. The text in each Dapr Annotation line is copied to the output K8s Manifest after each Helm “variable name” (enclosed in the curly braces) has had its text replaced with the actual value associated with that variable name. In this example most of these actual values are defined in charts\blog-testsys\values.yaml, while others are defined by Helm in its release object.
This Named Template is used in the Helm Template File (named deployment.yaml) shown in part below. This “dapr.annotations” Helm Template does the generation of the actual K8s Manifest yaml shown in Example 1 above.
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "app-servicea.fullname" . }}
labels:
{{- include "app-servicea.labels" . | nindent 4 }}
spec:
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}
selector:
matchLabels:
{{- include "app-servicea.selectorLabels" . | nindent 6 }}
template:
metadata:
annotations:
dapr.io/app-id: {{ include "app-servicea.fullname" . | quote }}
{{- include "dapr.annotations" . | nindent 8}}
{{- with .Values.podAnnotations }}
{{- toYaml . | nindent 8 }}
{{- end }}
labels:
{{- include "app-servicea.selectorLabels" . | nindent 8 }}
spec:
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
serviceAccountName: {{ include "app-servicea.serviceAccountName" . }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
securityContext:
{{- toYaml .Values.securityContext | nindent 12 }}
image: {{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: 80
protocol: TCP
.
.
.
Example 3. A Helm Template File that includes the above “dapr.annotations” Named Template to create the app service deployment K8s Manifest yaml shown in Example 1.
The following include line in the Example 3 yaml above, i.e.
{{- include "dapr.annotations" . | nindent 8}}
invokes the Helm Named Template “dapr.annotations” to produce the final Dapr Annotations yaml shown in Example 1.
Note that I have “hard coded” the production of the first Dapr Annotation, app-id, directly in the above Helm Template, like this:
dapr.io/app-id: {{ include "app-servicea.fullname" . | quote }}
I hard coded this because I did not want to take the time to figure out how to pass the string returned by “app-servicea.fullname” into the dapr.annotations Named Template. If you happen to figure out how to do that, please let me know.
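For what it is worth, one commonly used Helm pattern for this is to build a dict with Helm's built-in `dict` function and pass it to the Named Template as its context. I have not deployed this variant myself, so treat it as a sketch, not a verified implementation:

```yaml
{{/* In deployment.yaml -- pass the computed name alongside the root context: */}}
annotations:
  {{- include "dapr.annotations" (dict "root" . "appId" (include "app-servicea.fullname" .)) | nindent 8 }}

{{/* In _dapr-helpers.tpl -- read .appId directly, and reach Values via .root: */}}
{{- define "dapr.annotations" -}}
dapr.io/app-id: {{ .appId | quote }}
dapr.io/enabled: {{ .root.Values.daprAnnotations.enabled | quote }}
dapr.io/app-port: {{ .root.Values.daprAnnotations.appPort | quote }}
{{- end }}
```

The trade-off is that the Named Template no longer receives the root context directly, so every `.Values` and `.Release` reference inside it must go through the key (here `root`) under which the root context was passed.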
At this point the only missing item is the detail of how Helm supplies the “values” for the Dapr Annotation settings in the generated K8s Manifest. Here are the portions of the values.yaml file relevant to app-servicea. The entire file is in the GitHub code samples file charts\blog-testsys\values.yaml. This values.yaml file contains values used by all the Helm Template Files and Named Templates in blog-testsys.
# app-servicea config
app-servicea:
replicaCount: 1
image:
repository: app-servicea-demo
pullPolicy: IfNotPresent
tag: ""
command: ["dotnet", "AppServiceA.dll"]
daprAnnotations:
enabled: true
appPort: 3500 #The app-port dapr listens to.
appProtocol: http #The protocol over the app-port.
logLevel: debug #debug, info, error, warning.
.
.
.
Example 4. A portion of the Helm Values File (values.yaml) that defines the values needed by app-servicea to be deposited into the detailed settings for each Dapr Annotation.
As you can see above, the “daprAnnotations” section of the values.yaml file contains the detailed settings for each dapr annotation. A key part of Helm’s function is to populate the parts of Helm Template Files having “variable names” that match items in the Helm Values File values.yaml. Please see the Helm book and Helm references for more detail on the relation between Helm Template Files and Helm Values Files.
Finally, note that both the Helm Template Files (deployment.yaml and service.yaml) and the Helm Values File were initialized by Helm automatically generating the yaml (called scaffolding by Helm) via the “helm create blog-testsys” command. Thus, the amount of hand-written Helm Template File yaml is minimal. One has only to make modest adjustments to the Helm-generated yaml (mostly to values.yaml) to fit one’s purposes. Please see https://andrewlock.net/deploying-asp-net-core-applications-to-kubernetes-part-4-creating-a-helm-chart-for-an-aspnetcore-app/ for the details on how to use Helm like this. This is an excellent series of articles.
References
- The code samples (yaml files) used in this article are available at my GitHub repo: https://github.com/georgestevens99/DiscoverDapr/tree/main/k8s-dapr-annotations-with-helm3.
- In depth information about Dapr, including using Dapr on Kubernetes and Dapr Annotations. https://docs.dapr.io/
- Andrew Lock has an excellent 12-part blog series about “Deploying ASP.NET Core Applications to Kubernetes”. Below are 3 articles in his series that are relevant to this blog:
- A blog by Austin Dewey https://austindewey.com/2020/08/09/how-to-reduce-helm-chart-boilerplate-with-named-templates/ is very clear about how to use Helm Named Templates.
- The book Learning Helm: Managing Apps on Kubernetes by Matt Butcher, Matt Farine, and Jost Dolitsky by O’Reilly Media, 2021 is an excellent way to learn Helm. https://www.oreilly.com/library/view/learning-helm/9781492083641/
- The Helm open source website is at https://helm.sh/
- The Kubernetes open source website is at https://kubernetes.io/
- There are numerous quite helpful books about Kubernetes in print. I have found the following to be most helpful when it comes to the topics related to this blog:
- Kubernetes Best Practices: Blueprints for Building Successful Applications on Kubernetes, by Brendan Burns, Eddie Villalba, Dave Strebel and Lachlan Evenson. https://www.oreilly.com/library/view/kubernetes-best-practices/9781492056461/
- Kubernetes: Up & Running, by Brendan Burns, Joe Beda, and Kelsey Hightower. https://www.oreilly.com/library/view/kubernetes-up-and/9781491935668/
I hope you find the above sources as useful as I have, and thanks for taking the time to read this blog.
George Stevens
Software Architect, Sr. Software Engineer for Solid Value Software, LLC.
Microsoft Azure Certified Developer
This article explains the concept of the “Dapr-Mesh” plus provides links to a .NET C# code sample of 1 or 2 Dapr-Meshes in my DiscoverDapr GitHub repo — DiscoverDapr/grpc-pubsub-azservicebus.
Please see the following links for explanations of what Dapr is: Its features and capabilities, how it works, and the benefits of using Dapr:
- Dapr documentation, which is quite rich and useful — https://docs.dapr.io/
- My blog “Discover Dapr: What Is Dapr? A new way to increase developer productivity!”
The Dapr-Mesh
“Dapr-Mesh” is an abbreviation for “a Dapr-Mesh of Services”. A Dapr-Mesh is a collection of collaborating Daprized services, each using its own instance of a Dapr sidecar to enable collaboration, all within a ‘nearby network neighborhood’. Collaboration here means that, within the same Dapr-Mesh, a Daprized service initiates requests to other Daprized services, and/or responds to requests from other Daprized services. The role of Dapr is that of an intermediary that enables such collaboration, along with providing additional work-saving features outlined in the above blog Discover Dapr: What Is Dapr? A new way to increase developer productivity! Interactions with services external to a given Dapr-Mesh are outlined a little later in this article.
Here is a diagram of a generalized Dapr-Mesh.

Figure 1 — Generalized Diagram of a Dapr-Mesh Implementation
“Note that a Dapr-Mesh is very different from Service Meshes like Linkerd, Istio, etc. in 2 ways:
- Dapr works at the Application Level while Service Meshes operate at the Network Level. This is shown above with Dapr using application level HTTP and gRPC (based on HTTP/2) communications to do its job.
- Dapr provides capabilities to services using it that are far beyond the capabilities offered by Service Meshes (via Dapr’s Building Blocks and Components). For example, Dapr provides capabilities like State Management, Pub/Sub, Resource Bindings and Triggers, Virtual Actors, and more. These capabilities are broad “runtime” capabilities that also include service-to-service invocation of service operations, which is the main focus of Service Meshes. Please see this link for more about Dapr capabilities https://docs.dapr.io/concepts/overview/.” Quoted from DiscoverDapr/grpc-pubsub-azservicebus/RunMe.MD .
A word of caution may be helpful here. Note that Dapr-Mesh diagrams are at the “implementation level” rather than at the “logical architecture level”: they show how a specific implementation of collaborating services is organized using Dapr. It is best to first define the interactions between services at the “logical architecture level”, without considering any implementation boundaries, runtime, or hosting considerations. This allows you to more cleanly and efficiently work out logical architecture design issues. After the “logical architecture level” diagrams are correct, then map them to “implementation level” Dapr-Mesh diagrams, making whatever adjustments are necessary to correctly implement the logical architecture.
Working first at the abstract “logical architecture level” hides many implementation details that often get in the way of effectively reasoning about the essential nature of each service and its interactions with other services. Maintaining a good separation of concerns between the logical structure of a system and its implementation makes both the design and implementation go faster, with much less rework. One form of a “logical architecture level” diagram is the Data Flow Diagram. You can see a high level example of that for an IoT system I designed, here at this link: https://dotnetsilverlightprism.wordpress.com/2017/01/31/why-i-like-modeling-iot-devices-with-azure-service-fabric-actors/.
There are also interactions between services other than those within a single Dapr-Mesh, as follows:
- Trans Dapr-Mesh interactions — A Daprized service within a Dapr-Mesh collaborates with other Daprized services in other Dapr-Meshes.
- By definition this interaction always happens between two or more Dapr-Meshes.
- Conceptually this involves the services within multiple Dapr-Meshes collaborating with each other. In other words, a Mesh-of-Dapr-Meshes, which itself is a higher order abstraction.
- External Dapr-Mesh interactions — These happen in 2 ways:
- Inside Dapr-Mesh to External Service interactions — A Daprized service within a Dapr-Mesh collaborates with non-Daprized services outside of any Dapr-Mesh (External Services).
- For example a Daprized service saves data (via a Dapr Component) to a cloud blob External Service.
- External Service to Inside Dapr-Mesh interactions — A non-Daprized service outside of any Dapr-Mesh (an External Service) collaborates with a Daprized service within a Dapr-Mesh
- For example a cloud pubsub External Service topic pushes a message it just received to a Daprized service (in a Dapr-Mesh) which is a subscriber to the cloud topic via its Dapr sidecar’s PubSub Component.
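As a concrete sketch of the pub/sub case above: a Daprized subscriber declares its topic subscriptions to its own sidecar, and the sidecar then delivers each matching message by POSTing to the declared route. Below is a minimal Python sketch of the programmatic subscription payload a service can return when its sidecar calls GET /dapr/subscribe at startup; the Component name `messagebackbone`, the topic, and the route are hypothetical.

```python
import json

def dapr_subscriptions():
    """Subscriptions a Daprized service declares to its sidecar.

    The sidecar calls GET /dapr/subscribe on the service at startup and
    then delivers each matching message with an HTTP POST to 'route'.
    """
    return [
        {
            "pubsubname": "messagebackbone",  # hypothetical PubSub Component name
            "topic": "liftoff-events",        # hypothetical topic
            "route": "/liftoff",              # endpoint the sidecar will POST to
        }
    ]

print(json.dumps(dapr_subscriptions()))
```

The service code never touches the cloud pub/sub service directly; the sidecar’s PubSub Component handles the external connection.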
The Mesh-of-Dapr-Meshes
It is worthwhile to briefly consider the implementation architecture of a set of collaborating Dapr-Meshes from this point-of-view. At minimum it will clarify your thinking. Below is a conceptual diagram of 4 collaborating Dapr-Meshes. The minute details of how they collaborate are intentionally omitted so that one can focus on the bigger picture. Once such high level Trans Dapr-Mesh collaborations have been sketched, the detailed mechanisms of such collaboration (including the details of each separate Dapr-Mesh) can be more easily diagrammed, and then implemented in code and YAML.

Figure 2 — Conceptual Diagram of a Mesh of Dapr-Meshes Implementation
To give some real life context to the above diagram, it could represent a Space Craft Launch Center (#1 Dapr-Mesh) which publishes the “Lift Off Completed” event over an external cloud based messaging backbone service when a space rocket clears the launching pad during blast off. This event informs numerous Space Craft Tracking Stations across the earth (#2, #3, … #N Dapr-Meshes) to start their copy of the “In Flight Monitoring Workflow” program for the just-launched spacecraft.
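The launch-center scenario can be sketched in a few lines of plain Python (a conceptual simulation, not Dapr code): one publishing mesh fans a single event out to N subscribing meshes through a shared messaging backbone. All class and topic names below are illustrative.

```python
class MessageBackbone:
    """Stands in for an external cloud messaging backbone service."""
    def __init__(self):
        self.subscribers = {}              # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, event):
        for deliver in self.subscribers.get(topic, []):
            deliver(event)                 # push the event to each subscriber

started = []                               # records which stations reacted

def make_tracking_station(station_id):
    # Each Tracking Station mesh starts its monitoring workflow on the event.
    def on_liftoff(event):
        started.append((station_id, event))
    return on_liftoff

backbone = MessageBackbone()
for n in (2, 3, 4):                        # Tracking Station meshes #2..#4
    backbone.subscribe("lift-off-completed", make_tracking_station(n))

# The Launch Center mesh (#1) publishes once; every subscriber receives it.
backbone.publish("lift-off-completed", "Lift Off Completed")
print(sorted(s for s, _ in started))       # [2, 3, 4]
```

In a real Mesh-of-Dapr-Meshes the backbone would be a cloud service such as Azure Service Bus Topics, reached through each mesh’s PubSub Components rather than direct calls.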
The code sample that accompanies this Dapr-Mesh article can be found at DiscoverDapr/grpc-pubsub-azservicebus. ReadMe.MD provides links to the InstallMe.MD and RunMe.MD files that aid you in understanding, setting up, and running the code sample. And, for more detail about the Mesh-of-Dapr-Meshes concept, please see the PowerPoint slide deck Thinking Outside the Dapr-Mesh with Dapr, gRPC.
I hope the concepts of the Dapr-Mesh and collaborating Dapr-Meshes (a Mesh-of-Dapr-Meshes), plus the diagrams herein, aid you in understanding the potential value of Dapr as a key building block of both small and large scale distributed systems.
George Stevens
Software Architect, Sr. Software Engineer for Solid Value Software, LLC.
Microsoft Certified Azure Developer
The first stable, enterprise-ready version of Dapr was released today, Wednesday 2/17/21!
Here is the link to the Dapr blog: Announcing Dapr v1.0 — The distributed application runtime is now production ready. It contains a sketch of:
- What Dapr is
- The Dapr community
- How Dapr is being used to build distributed software systems by businesses across the globe
- A sketch of the v1.0 feature set
- Excellent diagrams that nicely illustrate how Dapr fits in with modern microservice based distributed systems
For an in depth discussion of Dapr, the value it provides, and a list of valuable links please see my blog of 1/8/21, Discover Dapr: What Is Dapr? A new way to increase developer productivity!
Thanks for taking the time to read this blog.
George Stevens
Software Architect, Sr. Software Engineer for Solid Value Software, LLC.
Microsoft Certified Azure Developer
“Discover Dapr: Version 1.0 is released and ready to use” by George Stevens is licensed under Attribution 4.0 International

You need to know about Dapr when you are developing distributed systems and cloud software since it will likely make your job much easier and more productive. The term “Dapr” is an acronym for Distributed Application Runtime. It is a new open source project, approaching its V 1.0 release.
This article presents an overview of Dapr from an “increase software development productivity” point-of-view, i.e. reducing the amount of human time and work required to produce a given result. Dapr is rich in capabilities that have the potential to significantly increase the productivity of software developers and dev teams. Increased productivity usually saves time and money.
The primary productivity increasing benefits Dapr provides are:
- Reduced technical debt — Achieved via producing and using software that encapsulates areas having high rates of change, has an excellent separation of concerns, and has widespread decoupling.
- Decreased amount of coding work required — Achieved via a “lower code” approach that provides many commonly required prebuilt software parts, especially parts implementing lower business value commodity plumbing functionality, e.g. code connecting services together.
- Increased developer focus on producing high value business logic — Achieved by reducing the need to spend a lot of time producing commodity plumbing code and/or using manual processes and tools where automated ones can be used.
- Effective coordination of concurrent access to shared state — In many situations the Actor Model can be used to tame this notoriously difficult area of distributed systems.
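The last benefit above can be illustrated with a tiny plain-Python sketch of the Actor Model’s “turn based concurrency” (this is the concept only, not the Dapr Actors API): state lives inside the actor, and each request takes a turn, so concurrent callers cannot corrupt shared state.

```python
import threading

class CounterActor:
    """Toy actor: state plus the code that mutates it, accessed one turn at a time."""
    def __init__(self):
        self._count = 0                    # state lives inside the actor
        self._turn = threading.Lock()      # one caller gets a "turn" at a time

    def increment(self):
        with self._turn:                   # serialize access: no lost updates
            current = self._count
            self._count = current + 1

actor = CounterActor()
threads = [threading.Thread(target=lambda: [actor.increment() for _ in range(1000)])
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(actor._count)   # 8000: turn-based access kept the state consistent
```

Without the per-turn serialization, the read-then-write in increment() could interleave across threads and silently lose updates; the actor makes that class of bug structurally impossible.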
So what is Dapr? If I told you that Dapr is an intermediary you would not get the full picture, by far. But this is a key idea to keep in mind. We will revisit the view of “Dapr as an intermediary” later.
If I told you that Dapr, as an intermediary, functions as a decoupler in distributed systems of software services you would know a key core value Dapr provides. Decoupling the services of distributed systems from each other makes developing, extending and maintaining software much more time and cost efficient, and easier as well. Why? Decoupling pieces of software from each other lets their internal code content and code structure vary independently of each other, significantly reducing the amount of work required for code changes when requirements change. Such decoupling is one of the best ways to reduce technical debt, and keep it low as the years go by, thus increasing productivity over the long run.
One key way Dapr produces decoupling is through its Building Blocks, each defining a conceptual interface to a capability commonly used by distributed systems. For most Building Blocks Dapr also provides a number of prebuilt plug-in Components, each implementing all or parts of the Building Block interface for a specific instance of the Building Block concept. Using prebuilt plug-in Components also acts to decrease the amount of coding work required. Without Dapr, how many times do developers have to write much-the-same low level plumbing code over and over to connect to and interact with each database or cloud service their code uses? Many, many times! With Dapr’s Building Block/Component approach the answer is zero! The use of prebuilt plug-in Components saves significant time, freeing developers to focus on higher value work. The list of Dapr links at the end of this article has details about Building Blocks.
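The Building Block/Component idea can be sketched in plain Python (this illustrates the decoupling pattern, not actual Dapr internals): a Building Block is a conceptual interface, a Component is a pluggable implementation of it, and business code depends only on the interface. The `StateStore` and `InMemoryStateStore` names are hypothetical.

```python
from abc import ABC, abstractmethod

class StateStore(ABC):                 # the "Building Block": a conceptual interface
    @abstractmethod
    def save(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class InMemoryStateStore(StateStore):  # one plug-in "Component" implementation
    def __init__(self):
        self._data = {}
    def save(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

def record_order(store: StateStore, order_id, status):
    # Business logic codes against the interface, never the backend,
    # so swapping Components requires no change here.
    store.save(order_id, status)
    return store.get(order_id)

print(record_order(InMemoryStateStore(), "order-42", "shipped"))
```

A Redis-backed or cloud-backed implementation could be swapped in behind the same interface, which is exactly the substitution Dapr performs declaratively via its Component configuration files.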
If I told you that Dapr is a Swiss Army Knife utility service for distributed systems you will most likely want to discover Dapr’s many capabilities, given its role as both an intermediary and a decoupler. The main capabilities offered by Dapr are as follows (many, though not all, are implemented via Building Blocks and Components):
- State Storage — Dapr defines a conceptual interface for a key/value pair State Store Building Block, and then provides a number of prebuilt plug-in state storage provider Components each connecting to a specific external key-value pair State Store like Redis storage or other popular key/value pair databases. The intent of the Dapr State Store is to provide low-latency storage, like a cache. For support of general purpose storage or database management systems see Resource Bindings and Triggers, below.
- Pub/Sub — Similar to State Storage, Dapr defines a conceptual interface for a Pub/Sub Building Block, plus provides a number of prebuilt plug-in Pub/Sub Components each of which connects to an external Pub/Sub messaging service like Azure Service Bus Topics, or Redis Streams for example.
- Secure Secret Access — The same idea as above, but applied to a variety of external Secret Stores like Key Vaults.
- Resource Bindings and Triggers — The same idea as above, but applied to a variety of external resources (many are cloud resources) like queues, event hubs, service meshes, blob storage, some databases, and more. Support of SQL databases is planned subsequent to the initial V 1.0 release.
- Service-to-Service Invocation — This Building Block allows “Daprized” services to communicate with each other via RPC with service-name-plus-method-name addressing, rather than by HTTP or gRPC addresses. This decouples service-to-service communications from specific network endpoints. It is required when using cluster computing hosts, and timesaving with other hosts as well.
- Actors — This Building Block allows each “Daprized” service to use the Actor Model to take advantage of its unique characteristics of 1) keeping state and the code that manipulates the state within the same entity and 2) “turn based concurrency” that prevents state from getting out of sync when multiple clients concurrently use the same Actor.
- Observability — The Observability Building Block conceptual interface provides tracing, metrics, and health monitoring of the flow of service-to-service interactions (including details of various Components) in a distributed system, plus sending data to external aggregators like Azure Monitor, Application Insights, and Zipkin for example.
- Effective Security — Provides high levels of often configurable security across the breadth and depth of a “Daprized” service, and also throughout a system of collaborating “Daprized” services.
- Middleware Pipelines — Allows custom “middleware pipeline component” code to be declaratively “plugged-in” to a Dapr request/response processing pipeline. This allows Dapr to orchestrate developer defined custom processing of the communications between a service and Dapr, and vice versa. For example Dapr provides a ready-to-use OAuth 2.0 middleware pipeline component.
- Massive extensibility — This is due to Dapr’s decoupling, interface based design, and its componentized plug-in architecture. Note that within a given Building Block developers may write their own code to implement a Dapr Component customized for their particular needs.
- HTTP and gRPC Communication, plus support of most popular programming languages and cloud providers.
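Returning to Service-to-Service Invocation above: the caller addresses the target by app-id plus method name, and its local sidecar resolves the actual network endpoint behind the scenes. Here is a minimal Python sketch of forming the sidecar invocation URL, following the pattern documented by Dapr; the app-id `orderservice`, the method `neworder`, and the default HTTP port 3500 are assumptions for the example.

```python
def dapr_invoke_url(app_id, method, dapr_http_port=3500):
    """Build the local sidecar URL for Dapr service-to-service invocation.

    The caller addresses the target by app-id plus method name; the
    sidecar resolves the real network endpoint, so the calling code is
    decoupled from any specific host or address.
    """
    return f"http://localhost:{dapr_http_port}/v1.0/invoke/{app_id}/method/{method}"

# The calling service only ever talks to its own local sidecar:
print(dapr_invoke_url("orderservice", "neworder"))
# http://localhost:3500/v1.0/invoke/orderservice/method/neworder
```

Note that no address of the target service appears anywhere in the caller; this is the endpoint decoupling described above.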
Please note that all the above prebuilt plug-in Components are also configurable. And the act of “plugging-in” a particular Component is simply providing a declarative configuration file in a standard Components directory. Dapr takes care of the work required to load the Component’s code and “hook-it-up”.
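For instance, a Redis state store Component could be plugged in with a single YAML file dropped into the components directory. This sketch follows the component schema in the Dapr documentation; the Component name and connection values are purely illustrative.

```yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore           # the name services use; backend details stay hidden
spec:
  type: state.redis          # which prebuilt Component to load
  version: v1
  metadata:
  - name: redisHost
    value: localhost:6379    # illustrative connection values
  - name: redisPassword
    value: ""
```

No service code references Redis directly; services see only the State Store Building Block interface, while this file tells the sidecar what sits behind it.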
If I told you that Dapr is a Sidecar you would know that typically a single instance of Dapr is paired with a single instance of a service, with each running in its own process. Dapr is currently not a code library. A Dapr Sidecar instance and the service instance that uses it communicate with each other across their process boundaries via HTTP or gRPC (using HTTP/2) as shown below in Figure 1. When you pair an instance of a service with a Dapr Sidecar you essentially “Daprize” your service, a term used throughout this article.
Further, each Dapr Sidecar instance is aware of all the other Dapr Sidecar instances in a system of collaborating “Daprized” services. All these collaborating Dapr Sidecar instances communicate with each other completely behind the scenes on a separate Sidecar-only communication channel using gRPC, as shown below in Figure 1. Communication between Dapr Sidecars this way is key to Dapr’s Pub/Sub, Service Invocation and Actor Model capabilities. A Dapr Sidecar and the service that uses it are typically run in separate containers or as separate stand-alone processes.
Figure 1 — Typical arrangement of “Daprized” Services and their Dapr Sidecars.
A “Daprized” service usually only interacts with its single, private Dapr Sidecar, as shown above in Figure 1, leaving all the messy plumbing details and knowledge of how to communicate with other services, storage, secrets, etc. to the Dapr Sidecars themselves, plus the Dapr Components used within the Dapr Sidecar instances. This makes the service code much less complicated and provides a great separation of concerns between the business logic code in the service and the plumbing logic code (aka infrastructure code) in the Dapr Sidecar and its Components. In Figure 1 above, the yellow services contain all the business logic, while the pink Dapr Sidecars contain most, if not all, of the plumbing code. This separation of concerns between business logic code and plumbing logic code is one key to the significant reduction in technical debt resulting from the use of Dapr. Thus, when requirements inevitably change usually much less code has to be changed than is typically the case when plumbing code is intermixed and intertwined with business logic code.
Finally, having a strong separation of concerns between business logic and plumbing code, coupled with Components, also focuses more developer time on the high value business logic, and less time on low value commodity plumbing code. While low level plumbing code is absolutely necessary, developing the low level plumbing code is complex, time consuming, and requires a high level of experience — all of which cost time and money. Using prebuilt “plug-in” components instead allows much of that developer time and skill to be redirected into developing the business logic that produces the value most directly relevant to the software’s end users.
Looking again at Dapr as an intermediary, one can now see how a Dapr Sidecar (plus the Components it uses) stands between the service that uses the Sidecar and all the other services in a distributed system, plus all the possible cloud services or on-prem services that a “Daprized” service can connect to. That can be a lot of services! Thus, Dapr’s role as an intermediary is vital when understanding the 3 following main usage scenarios for “Daprized” services.
- Portable services — Write once/Run anywhere. Dapr excels in this scenario, allowing developers to simply plug in different Components (configured declaratively) that the service uses to interact with external services and resources. Putting a service in a container, along with its Dapr Sidecar instance in a separate container, provides a very high degree of isolation from hard coded external dependencies. This allows the same “Daprized” and containerized service to run on-prem, in the cloud, or on Edge devices (like an IoT field gateway) without having to change the service code. Rather, changes may only be required to the declarative definitions of the Dapr Components interfacing to external dependencies. Essentially integrating a containerized Dapr Sidecar with a containerized service virtually decouples all external dependencies of the service, allowing maximum portability with a low cost of work required to do a port to a different hosting environment or to connect with different external services.
- Polyglot systems of services. With Dapr’s Service Invocation, Pub/Sub, Secrets, plus State Storage and Resource Bindings and Triggers, services written in a wide variety of languages can communicate with each other without having to rewrite a lot of code and without developers having to learn a bunch of other languages. To facilitate this Dapr supplies Software Development Kits (SDKs) for many popular languages like .NET C#, Java, JavaScript, Python, and Go to mention a few. Essentially developers using any supported language and SDK will program against the same standardized Dapr interfaces rather than a bunch of ad hoc language specific or external service interfaces. This facilitates the other 2 usage scenarios as well.
- Extending the life of legacy software also falls under this usage scenario. Note, however, that the legacy software absolutely must support HTTP interactions with its Dapr Sidecar. If that is the case, then it may be feasible to “Daprize” the legacy software to allow it to more cost effectively be part of a system of services, communicating with other services and resources via a Dapr Sidecar and its Building Blocks and Components.
- Static services with dynamic dependencies. Need to change your Pub/Sub messaging backbone from say Redis to Azure Service Bus? This often happens when organizations need to adapt to change. With a “Daprized” system of services the per service cost of making this change (i.e. changing the declarative Component definitions) may be quite small compared to the cost of rewriting many lines of code that implements Pub/Sub messaging across many individual services without using Dapr. The same applies to the other Building Blocks and their Components.
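A sketch of such a swap, assuming the component type strings from the Dapr documentation (all names and values are illustrative): both definitions keep the same Component name, so the services that reference it keep working unchanged, and only this declarative file differs.

```yaml
# Before: Redis as the pub/sub messaging backbone
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: messagebackbone      # services reference only this name
spec:
  type: pubsub.redis
  version: v1
  metadata:
  - name: redisHost
    value: localhost:6379
---
# After: Azure Service Bus Topics; same Component name, so no service
# code changes are needed, only this declarative definition
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: messagebackbone
spec:
  type: pubsub.azure.servicebus.topics
  version: v1
  metadata:
  - name: connectionString
    value: "<Azure Service Bus connection string>"
```

This is the per-service cost described above: editing one configuration file per environment rather than rewriting messaging code in every service.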
Also note that “Daprized” services can be hosted in containers on Kubernetes or on other hosts that support containers like Docker, including using Docker Compose where appropriate. “Daprized” services can also be hosted without using containers as standalone processes on a variety of compute hosts, including your own development system.
Looking at the history of software, the potential time and labor saving characteristics of Dapr are really a big deal! There has not yet been anything quite like Dapr, which can run almost anywhere and also offers massive decoupling to the services of distributed systems, plus componentization and an excellent separation of concerns. All these decrease the work required in initial development and also result in significantly lower technical debt than usual in the long run. It all adds up to potentially significant increases in software development productivity in the short and long run, reducing the amount of work needing to be done, saving time and money.
If I told you that Dapr will be the glue that holds together the distributed systems of the near future, would you believe me? Time will tell. I hope this blog spurs you to investigate Dapr to see what it can do for you.
Again, what is Dapr? Dapr is an open source project initially sponsored by Microsoft. As of the date of writing this blog the Dapr development team has just released their second Release Candidate, RC 2. This means that the first stable release of Dapr, version 1.0, can be expected in the first quarter of 2021.
Above I have glossed over many key details, so please read the following links that provide in depth information. Plus, the Dapr open source team is very open to questions, concerns, bug reports, and feature requests, and is available at some of the links below.
- Dapr Building Blocks — Contains the details of Dapr Building Blocks and Components
- Dapr Documentation at https://docs.dapr.io/
- Dapr GitHub Repo at https://github.com/dapr/dapr
- Dapr Blog at https://blog.dapr.io/posts
- Dapr Community Chat for answers from the Dapr dev team https://discord.com/invite/ptHhX6jc34
- Dapr community call info, plus more https://github.com/dapr/dapr/blob/master/README.md#community
- In case you are wondering about performance, a Dapr sidecar has very low latency (i.e. is really fast) and consumes relatively few resources. See https://docs.dapr.io/operations/performance-and-scalability/perf-service-invocation/
- The book Learning Dapr: Building distributed cloud native applications by Haishi Bai and Yaron Schneider
- The book Practical Microservices with Dapr and .NET by Davide Bedin
- The free ebook Dapr for .NET Developers by Robert Vettor, Sander Molenkamp, and Edwin van Wijk
- The above links cover both hosting in Kubernetes and hosting as stand-alone processes or containers.
I hope you find the above links as useful as I have, and thanks for taking the time to read this blog.
George Stevens
Software Architect, Sr. Software Engineer for Solid Value Software, LLC.
Microsoft Certified Azure Developer
P.S. In December 2019, after hearing good things about Dapr, I started regularly attending the hour long Dapr Community Meetings every two weeks in order to ascertain whether Dapr actually had the potential it seemed that it could have. Then from mid-November 2020 through the end of December 2020 I averaged about 30 hours per week learning to use the basics of Dapr (via the Dapr .NET SDK) by doing exploratory coding to develop a system of 4 collaborating ASP.NET Core gRPC microservices. This includes writing C# .NET code to use many of the above listed capabilities, coming up to speed on the basics of gRPC as implemented in ASP.NET Core, and doing in depth reading and study of the Dapr documentation and some code examples. My goal was to be able to assess whether Dapr is a highly useful technology that I can recommend. While I still have a few areas to explore in the very near future, my assessment of Dapr is highly favorable as this article indicates. However, as with all distributed systems development, allow sufficient time to learn to use the underlying technologies before making firm software delivery time commitments.
“Discover Dapr: What Is Dapr? A new way to increase developer productivity!” by George Stevens is licensed under Attribution 4.0 International

A portion of my current work involves learning how to use AI in responsible ways as a software architect and software engineer. For the past couple of months I have been coding a Proof-of-Concept (POC) in .NET C#, in part as a way to experiment with various new techniques to verify they are indeed useful and can be used in a software project for one of my clients. Since a POC does not produce real, production code, it is an appropriate place to try new things. If things go sideways, no big deal, since it is only experimental code.
A week into the project I enabled a free subscription to GitHub Copilot, which bolts into my Visual Studio Professional code editor. Thereafter I used the Code Completion feature a lot (sometimes I turn it off when the automated suggestions get in the way), plus I used the Chat feature on a couple of occasions to answer questions in text or perform simple coding tasks. Stay tuned for another blog, yet to be written, about my very positive experience with GitHub Copilot’s Code Completion.
However, while using Copilot I kept worrying about AI “hallucinations”, i.e. generating inaccurate or untrue output. When AI Code Completion generates smallish code snippets of 1 to 6 lines, perhaps occasionally a little more, it is not difficult to closely scrutinize the code by eye to verify it is accurate. And then, of course, test it! Doing these 2 steps is sufficient to avoid using hallucinated Code Completion output. But any more code than that, or when Copilot generates several paragraphs of text, requires much more time and effort to detect potential hallucinations and either correct or reject them.
So why not have a conversation with the free Desktop Copilot concerning AI Hallucinations in its underlying ChatGPT LLM to deepen my understanding of both the underlying problem and how best to deal with it? I hope you find the below transcription of my conversation informative and useful. However, I do wonder if there are any hallucinations in the below answers!
The Conversation
The below “Copilot Response” text was copied directly from the Copilot output window and pasted in here verbatim, with no editing on my part. You can query Copilot from your web browser at this link https://copilot.microsoft.com/chats/v4TAErBjGsF4GPr4ELYdu if you need to ask other questions.
———
My Prompt — What is the overall probability of ChatGPT producing a hallucination?
Copilot Response — The probability of ChatGPT producing a hallucination—meaning a confident but incorrect or fabricated response—varies significantly depending on the task and model version. According to recent evaluations:
- On simple factual questions, the GPT-4o model had a hallucination rate of about 51%, meaning it got roughly half of those questions wrong.
- For questions about public figures, the hallucination rate dropped to around 33%, with a 59% accuracy rate.
- In more structured academic tasks, such as generating references for systematic reviews, hallucination rates ranged from 28.6% to 39.6% for GPT-4 and GPT-3.5, respectively.
These numbers highlight that hallucination is not a rare glitch—it’s a known limitation, especially in tasks requiring precise facts or citations. The good news is that accuracy improves with clearer prompts, domain-specific context, and retrieval tools that ground the model in real data.
———
My Prompt — What are the implications of hallucinations in AI?
Copilot Response — AI hallucinations—when a model generates plausible-sounding but false or misleading information—have wide-ranging implications, and not just technical ones. They ripple through trust, safety, ethics, and even creativity:
1. Trust and Reliability
When AI confidently presents falsehoods, it undermines user trust. This is especially dangerous in high-stakes fields like medicine, law, or finance, where hallucinated advice can lead to misdiagnoses, legal missteps, or financial loss.
2. Misinformation and Harm
Hallucinations can spread misinformation at scale. There have been real-world cases where AI tools fabricated academic citations, recommended dangerous health practices, or even encouraged harmful behavior in mental health contexts.
3. Legal and Ethical Risks
If an AI system generates false legal precedents or misrepresents facts in journalism, it can lead to lawsuits, reputational damage, or regulatory scrutiny. The American Psychological Association has even called for regulation of AI in mental health due to these risks.
4. Bias and Discrimination
Hallucinations often reflect the biases in training data. This can perpetuate stereotypes or marginalize underrepresented groups, especially when AI is used in hiring, policing, or social services.
5. Security Vulnerabilities
Bad actors can exploit hallucinations through adversarial attacks—feeding misleading inputs to trigger false outputs. This is a growing concern in cybersecurity and autonomous systems.
6. Creative Potential
On the flip side, hallucinations can be a feature in creative domains. They can inspire surreal art, imaginative writing, or novel design ideas—so long as users understand they’re fiction, not fact.
7. Technical and Research Challenges
Hallucinations expose the limits of current AI architectures. They’ve sparked a wave of innovation, like retrieval-augmented generation (RAG), which grounds responses in real-time data to reduce fabrication.
In short, hallucinations are both a bug and a feature—depending on the context.
———
My Prompt — How can we spot AI hallucinations in outputs?
Copilot Response — Spotting AI hallucinations—those confident but incorrect or fabricated outputs—can be tricky, but there are some telltale signs and smart strategies to help you catch them:
🔍 Red Flags to Watch For
- Overly Specific but Unverifiable Claims — If an AI gives a precise statistic, quote, or citation that you can’t find anywhere else, it might be hallucinating. For example, a made-up academic paper with a real-sounding title and author.
- Inconsistent Logic or Contradictions — Hallucinations often include internal contradictions or statements that don’t follow logically. If something feels off, it probably is.
- Fake Citations or Sources — AI might invent journal articles, legal cases, or news stories. Always verify references—especially if they sound too perfect.
- Confident Tone, Dubious Content — Hallucinated content is often delivered with absolute certainty. Don’t confuse fluency with factuality.
- Answers That Change on Repetition — Ask the same question multiple times. If the answers vary wildly, especially in factual content, that’s a red flag.
🛠️ How to Detect and Prevent Hallucinations
- Cross-Check with Trusted Sources — Use search engines or databases to verify facts, quotes, and references.
- Use Retrieval-Augmented Tools — Some AI systems (like those using RAG—Retrieval-Augmented Generation) pull from real-time data to ground their answers in verifiable sources.
- Ask for Sources — Prompt the AI to cite its sources. If it can’t—or if the sources don’t exist—that’s a clue.
- Use Domain-Specific Prompts — The more specific and structured your prompt, the less likely the AI is to improvise.
- Look for Semantic Entropy — Researchers have developed methods to detect hallucinations by checking how consistent multiple AI-generated answers are. If they vary in meaning, the AI may be confabulating.
———
My Prompt — What are advanced techniques for detecting AI hallucinations?
Copilot Response — Detecting AI hallucinations—especially in large language models—has become a high-stakes challenge, and researchers are getting creative with some cutting-edge techniques. Here’s a breakdown of the most advanced methods being used today:
🧠 1. Factual Consistency Checks
This involves comparing AI-generated statements with trusted knowledge bases (like Wikidata or proprietary databases). Techniques include:
- Semantic triplet extraction (subject–verb–object) from AI output.
- Vector similarity using embeddings (e.g., BERT) to match facts.
- Retrieval models (like dual encoders) to fetch relevant facts in real time.
🔍 2. Probabilistic and Entropy-Based Analysis
By analyzing the token probability distributions, researchers can flag outputs with high uncertainty or entropy—often a sign of hallucination. This includes:
- Perplexity scores to measure how “surprised” the model is by its own output.
- Out-of-distribution token detection to catch when the model strays from known data.
🧬 3. Semantic Coherence and Logical Flow
Advanced models assess whether the generated content makes sense semantically and logically:
- BERTScore and semantic similarity metrics compare generated text to reference content.
- Causal and logical consistency checks ensure arguments follow a valid structure.
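BERTScore matches tokens between a candidate and a reference using contextual embeddings and reports a precision/recall/F1-style score. The sketch below substitutes exact token matching for embeddings, so treat it as a toy illustration of the metric's shape rather than the real thing:

```python
def overlap_f1(candidate: str, reference: str) -> float:
    """Toy BERTScore-shaped metric: precision and recall over exact
    token matches. Real BERTScore uses cosine similarity between
    contextual embeddings instead of exact matches."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    matched = sum(1 for tok in cand if tok in ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = sum(1 for tok in ref if tok in cand) / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the eiffel tower is in paris"
print(overlap_f1("the eiffel tower is in paris", reference))   # 1.0
print(overlap_f1("the eiffel tower is in berlin", reference))  # < 1.0
```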
🧪 4. Knowledge Graph Validation
Tools like Pythia use structured knowledge graphs to validate claims in real time. They break down content into knowledge triplets and verify them against known facts.
🧰 5. Specialized Detection Tools
Several tools are purpose-built for hallucination detection:
- SelfCheckGPT: Compares multiple outputs from the same prompt to detect inconsistencies.
- FacTool and RefChecker: Evaluate factual grounding and citation accuracy.
- Guardrail AI and Galileo: Provide real-time hallucination detection in production environments.
🧭 6. Hybrid Human-AI Systems
Some systems combine automated checks with human-in-the-loop validation, especially in high-stakes domains like healthcare or law. These setups often include:
- Custom rule-based filters
- Confidence thresholds
- Manual review dashboards
These techniques are often layered together to maximize reliability. Want to try one of these methods in action or explore how to build your own hallucination detector?
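The hybrid human-AI setups described in point 6 often reduce to a routing rule: answers below a confidence threshold go to a manual review queue instead of the user. A minimal sketch, where the threshold value and field names are illustrative assumptions:

```python
REVIEW_THRESHOLD = 0.75  # illustrative; tune per domain and risk level

def route_answer(answer: str, confidence: float) -> dict:
    """Confidence-threshold filter: low-confidence answers are routed
    to a human review queue rather than delivered directly."""
    if confidence >= REVIEW_THRESHOLD:
        return {"action": "deliver", "answer": answer}
    return {
        "action": "human_review",
        "answer": answer,
        "reason": f"confidence {confidence:.2f} below {REVIEW_THRESHOLD}",
    }

result = route_answer("The dosage is 500 mg", confidence=0.62)
print(result["action"])  # human_review -- a person checks it first
```

In high-stakes domains the threshold is typically set from measured error rates, and the review queue feeds corrections back into the rule-based filters.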
———
My Prompt — How are these techniques applied in real-world AI systems?
Copilot Response — In real-world AI systems, hallucination detection and mitigation techniques are being woven into the fabric of how AI is deployed—especially in high-stakes environments like healthcare, law, finance, and enterprise search. Here’s how these advanced methods are being applied:
🏥 Healthcare
Hospitals and biotech firms use retrieval-augmented generation (RAG) and human-in-the-loop (HITL) systems to ensure clinical accuracy. For example, AI tools that assist with diagnoses or summarize patient records are paired with medical databases and reviewed by professionals before decisions are made.
⚖️ Legal Tech
Legal AI assistants like Lexis+ AI and Westlaw’s AI-Assisted Research integrate fact-checking pipelines and citation validation to reduce hallucinations. Despite this, studies show even these tools can hallucinate in 17–34% of cases, especially when misinterpreting user prompts or citing non-existent cases.
💼 Enterprise AI
Companies use tools like Knostic to ground AI outputs in internal knowledge bases, enforce access boundaries, and detect outdated or overshared content. These systems often include telemetry analysis, semantic drift detection, and feedback loops to catch hallucinations early and prevent them from spreading across teams.
🧪 Scientific Research
AI models used in academia or pharma are paired with knowledge graph validation and semantic entropy analysis to detect “confabulations”—fluent but false outputs. These systems often use tools like SelfCheckGPT or FacTool to compare multiple generations and flag inconsistencies.
🧠 AI Copilots and Chat Interfaces
In customer service, HR, and productivity tools, hallucination mitigation is handled through:
- Prompt engineering with context enforcement
- Guardrails that limit speculative answers
- Real-time retrieval from trusted sources
For example, HR copilots are trained to avoid giving outdated or incorrect policy advice by anchoring responses to live company documents.
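That anchoring pattern can be sketched as retrieve-then-answer, with a guardrail that refuses to speculate when retrieval comes up empty. The document store, word-overlap retriever, and wording below are all toy stand-ins for a real vector index and policy corpus:

```python
POLICY_DOCS = [
    "Employees accrue 1.5 vacation days per month of service.",
    "Remote work requires manager approval and a signed agreement.",
]

def retrieve(question: str, docs, min_overlap: int = 2):
    """Toy retriever: return docs sharing enough words with the
    question. A production system would use a vector index instead."""
    q = set(question.lower().split())
    return [d for d in docs if len(q & set(d.lower().split())) >= min_overlap]

def answer(question: str) -> str:
    sources = retrieve(question, POLICY_DOCS)
    if not sources:
        # Guardrail: no grounding found, so do not speculate.
        return "I can't find that in the policy documents; escalating to HR."
    return f"Per policy: {sources[0]}"

print(answer("How many vacation days do employees accrue per month?"))
print(answer("What is the dress code on Fridays?"))  # escalates
```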
🚨 Real-World Consequences
- Google Bard’s $100B mistake: Bard hallucinated a fact about the James Webb Telescope during a public demo, causing a massive stock drop.
- Legal sanctions: A New York lawyer was penalized for submitting a brief with fake case citations generated by AI.
- Medical risk: GPT-3.5 once recommended tetracycline for a pregnant patient with Lyme disease—an unsafe choice—highlighting the need for domain-specific safeguards.
These examples show that hallucination detection isn’t just a research problem—it’s a frontline defense in making AI trustworthy. Want to explore how to build one of these systems yourself or see how hallucination rates compare across models?
———
Conclusions
I hope you have found the above conversation useful! As you can see, it may take significant time, further Copilot queries, the use of specialized techniques, etc., to ensure the correctness of a given Copilot response. Much work is currently being done in this area, so improvements will arrive in the coming months and years.
This conversation has helped me formulate a tentative approach to using AI in mission-critical software systems. Based on my past experience in industries that explicitly manage significant mission-critical risks (Nuclear Power, Test & Measurement, Investment Management, Medical Instruments), it appears to me that one must focus on explicitly controlling the risk of loss (of money, of customers, of people’s health and well-being, etc.) due to AI hallucinations. This can be achieved as follows:
1) Develop a deep understanding of each particular risk unique to a business using AI, in each specific way it actually uses AI. This includes significant experimentation and testing.
2) Develop one or more ways of detecting, controlling, and/or offsetting or mitigating each particular risk, perhaps by using some of the techniques Copilot listed above.
3) Write down the above information (1 and 2) in an AI Hallucination Risk Control Plan that is used by personnel throughout the organization to inform and enhance their day-to-day business processes and/or product/service production activities that involve AI.
4) Assign appropriate people to be accountable for the planning, monitoring and management of each of the risk areas in the AI Hallucination Risk Control Plan.
5) Ensure continuous effective collaboration on all above areas by Software Engineering, AI Engineering, Software Operations, plus specific business areas involved, like Marketing, Risk Management, Finance, etc.
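Steps 1 through 4 above amount to maintaining a living risk register. One possible shape for such a register is sketched below; every field name and value here is an illustrative assumption, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class HallucinationRisk:
    """One entry in an AI Hallucination Risk Control Plan."""
    description: str  # step 1: the specific risk, deeply understood
    detection: list   # step 2: how the risk is detected
    mitigation: list  # step 2: how the risk is controlled or offset
    owner: str        # step 4: who is accountable for this risk

plan = [
    HallucinationRisk(
        description="Support copilot cites a nonexistent refund policy",
        detection=["retrieval grounding check", "citation validation"],
        mitigation=["restrict answers to retrieved policy documents",
                    "human review below a confidence threshold"],
        owner="Head of Customer Support",
    ),
]

for risk in plan:
    print(f"{risk.owner} owns: {risk.description}")
```

The point is less the data structure than the discipline: each risk has a named detection method, a named mitigation, and a named accountable owner, per steps 2 through 4.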
That is the only way I currently see to avoid getting blindsided by AI hallucinations in mission-critical businesses until AI becomes more mature, AI risks become better known, and the techniques for managing them, including automation, become more developed and refined.
As we used to say back in the day when working in mission critical industries: “If you cannot tell me exactly how you are managing such-and-such a risk, you are not managing it. Rather, that risk may end up managing you!”
Please stay tuned for more articles on AI as I continue exploring how to responsibly use AI in software systems and projects.
Thanks,
George Stevens, Software Architect, Solid Value Software LLC
It has been about 18 months since I posted an article on this blog. During this time I redirected my blogging time and energy towards preparing for the Microsoft Certified Azure Developer exam, and passed it in late November 2019. This article outlines the sources of information I used in this successful endeavor.
Microsoft has a certification program that spans an amazingly wide range of subject areas, and roles within each subject area. Click this link to see a complete list of Microsoft Certification exams, and click this link for the 10 certification exams for Azure. Click this link to see the 4 Azure role-based certifications, each of which requires taking two exams.
I chose the Microsoft Certified Azure Developer certification, which requires taking the Developing Solutions for Microsoft Azure exam (AZ-203), since I am a software engineer and software architect first and foremost. I felt that having a very solid grounding in Azure from a software-first perspective would be of the greatest value to the people who use my services. The AZ-203 exam subject matter helps me both in my role as a Software Architect (designing service-oriented systems in Azure) and in my role as a Sr. Software Engineer (implementing software in Azure). In addition, I sometimes work as a mentor to other software developers, and I have never seen the blind leading the blind work out very well when dealing with complex software systems. This is especially the case with cloud software development, which is new and can be quite different from what most software developers are used to. Azure is a wide-ranging, forward-looking, very powerful cloud software and system platform that often requires unique, specialized software development techniques, skills, and knowledge. To get the most from Azure, keep this new mental model in mind: the Azure cloud is a platform with numerous advanced services and capabilities that you can use to save time and do things you cannot easily do on-prem, rather than just some other place to run code. Preparing for the Developing Solutions for Microsoft Azure certification exam has set me up well to meet these challenges.
Below I share links I found useful in preparing for the exam.
Exam Skills Outline — As mentioned above, click the following link for details about the Developing Solutions for Microsoft Azure exam (AZ-203). This page has a link to a PDF listing all of the detailed areas the exam covers, aka the Skills Measured. Click the “Download exam skills outline” link to see the entire outline. I suggest you use this exam skills outline as a detailed guide for your AZ-203 studies.
The Azure Documentation — Read key parts of the Azure Documentation for each Skill Measured in the above “exam skills outline”. This includes the Overview document, plus likely some or all of the multiple documents listed under Concepts in the index of the Azure Documentation for each particular Skill Area. It may also pay off to read and study other documents listed in the index as well. Finally, one of the things you should learn for each Azure service covered by an “exam skill” is the C# “programming model”, plus often an Azure CLI “deployment model”. You will need this knowledge to effectively write code in Azure. I found it helped to print out some of the Quickstarts so I could review the code of the programming model before I used it in writing learning code later. Note that this is all free, except for your cost of printer paper and ink when you print documentation. You will likely find reading these documents quite interesting as you explore and learn the capabilities of the Azure platform.
Microsoft Learn Courses — Another really good free resource is the set of Microsoft Learn modules and learning paths relevant to the AZ-203 exam. You can find a list of these on the AZ-203 exam page described above at this link: https://docs.microsoft.com/en-us/learn/certifications/exams/az-203. Some of the learning paths and modules are listed under the “Online — Free” tab. Others relevant to this exam are listed at this link: https://docs.microsoft.com/en-us/learn/browse/?products=azure&resource_type=learning%20path. You will have to pick out the learning paths (and, within a given path, the specific modules) that fit the Skills Measured in the detailed areas PDF described above. Be sure to use the nice “Filter” control on the left side of the screen to select the learning paths and modules you are interested in viewing.
Notice there are dozens and dozens of Microsoft Learn courses! I really, really like them since they give you a hands-on coding, Azure CLI, and Azure Portal learning experience in the specific area of the course module. They are an amazingly effective way to learn the basics of the hands-on skills you will need to develop software on Azure. Essentially, a module in a learning path gives you a free 4-hour “sandbox” subscription to Azure, and then guides you step by step through all the “moves” you need to work the hands-on exercise portion of the module. In a typical course the development tools you use are Visual Studio Code as an editor, the Cloud Shell, the Azure CLI, the Azure Portal, and perhaps also PowerShell. Before you start the hands-on exercise, you are presented with one to three concept-teaching units that are well written, sometimes include videos, and often have quite nice and effective diagrams. Overall, I am a big fan of the Microsoft Learn courses for their effectiveness and high quality. And they are fun! Good job, Microsoft.
Write Some Code — The above will get you started, but it is likely not enough to get you to pass the exam or to develop code effectively on Azure. After all, the end goal of AZ-203 certification is to become effective at developing code on Azure. You will need to get your hands on the code in each of the Skills Measured listed in the above PDF. Just write some learning code, perhaps following an example from the Azure Documentation, someone’s blog, or one of the online courses discussed below, especially in areas you are unfamiliar with. Work with the basic coding techniques and concepts until you feel comfortable that you can implement production code in each given area without significant problems.
Pluralsight Online Courses — I found it very useful to take the free Pluralsight courses that come with my Visual Studio Professional subscription (formerly MSDN). This link will take you to the relevant Pluralsight page: https://www.pluralsight.com/partners/microsoft/azure. Before you can access the videos, you will need to activate this benefit by logging in to your Visual Studio subscription.
There are about 24 videos in the Pluralsight Azure Developer series that cover most of the required areas for the AZ-203 exam. All of the videos feature an instructor who writes code, using the programming model for the particular Azure service(s) the video features, and walks you through the code, programming models, concepts, etc. These videos are highly useful for just watching and learning, taking notes, or “following along by coding with the instructor”. The courses also include code examples, and many include practice test questions and answers. I found the Pluralsight practice tests very useful for drilling myself on the key concepts. Thanks to Pluralsight for the quite good course material, and to both Microsoft and Pluralsight for partnering to make these courses free for those with a Visual Studio Professional (or above) subscription.
Other Courses — Coupled with a lot of self-driven, hands-on coding on my part, the learning resources listed above were sufficient for me to learn the required material. However, there are plenty of other online courses you can use to prepare for the AZ-203 exam. There are also instructor-led courses you can attend in person or online, at times allowing you to ask questions and have a dialog with the instructor. Click the “Instructor-led — Paid” link on the AZ-203 exam page described above to learn about these resources.
Practice Exams — I used the MeasureUp “official practice test” for the AZ-203 exam and found it helpful in a couple of ways. First, it gives you good practice taking a test like this, which is not something one does very often. Second, it has excellent references for each question, including links to the Azure Documentation. Thus it is a good learning resource as well as practice at test taking in the subject matter area. I found the MeasureUp practice test for AZ-203 quite worthwhile.
I hope you find the above links as useful as I have. And good luck when you take the AZ-203 certification test!
George Stevens
Software Architect, Practitioner of the IDesign Method, and Sr. Software Engineer at Solid Value Software, LLC.
dotnetsilverlightprism blog by George Stevens is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Based on a work at dotnetsilverlightprism.wordpress.com.
Developing software in the cloud requires a completely new mind-set from that typically used in on-prem development. Effective cloud software development must allow for the following, to mention a few:
- A relatively high rate of failure that you must learn to embrace, much of it from transient faults.
- Eventual consistency is much more common.
- Using new security models, i.e. in the cloud identity becomes the primary boundary rather than “the network” as in on-prem systems.
- The cloud’s ability to support software systems of a truly vast scale due to the lower cost of cloud compute and storage, plus the broad global reach of cloud networks and data centers. Dealing with global scale systems requires a new mind-set all by itself.
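Embracing transient faults in practice usually means retrying with exponential backoff and jitter rather than failing on the first error. The Azure SDKs ship their own configurable retry policies; the sketch below just illustrates the underlying pattern:

```python
import random
import time

def with_retries(operation, max_attempts=4, base_delay=0.5):
    """Retry a flaky operation with exponential backoff plus jitter --
    the standard response to transient faults in cloud systems."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # fault is not transient after all; surface it
            # Delay doubles each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Simulated flaky dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

print(with_retries(flaky_call, base_delay=0.01))  # 'ok' after two retries
```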
The details of exactly how and why software architecture and development for the cloud require a completely different mind-set are very well presented in the Architecting Distributed Cloud Applications video series by Jeffrey Richter. The first few minutes of the series summarize the new mind-set required, and are well worth watching. I consider the remainder of the series, which presents the details, a “must watch” for people serious about developing high quality cloud-based software.
Using best practices in developing software typically produces the best results, reducing both the amount of rework required and the number and severity of bugs. While the above video will get you going, this blog aims to present links to many of the best practices for developing software in Azure. These links are listed below, along with related links, giving you an easy way to see exactly what information is available and then get to that information with a click.
In addition to the below info sources on Azure Best Practices it often pays off to search for best practices for any narrowly scoped area of interest to you. For example “Azure Table Storage Best Practices” yields some great material in the first 10 items returned, as does “Azure SQL Best Practices”.
Azure security best practices and patterns — An excellent wide ranging catalog of security best practices.
- Azure boundary security best practices
- Azure database security best practices
- Azure data security and encryption best practices
- Azure identity management and access control security best practices
- Azure network security best practices
- Azure operational security best practices
- Azure PaaS Best Practices
- Azure Service Fabric security best practices
- Best practices for Azure VM security
- Implementing a secure hybrid network architecture in Azure
- Internet of Things security best practices
- Securing PaaS databases in Azure
- Securing PaaS web and mobile applications using Azure App Service
- Securing PaaS web and mobile applications using Azure Storage
- Security best practices for IaaS workloads in Azure
Azure architecture center — This is a particularly rich area for best practices, plus other information aimed at aiding your transition from “on-prem” thinking to cloud thinking.
Best Practices
- API design
- API implementation
- Autoscaling
- Background jobs
- Caching
- Content Delivery Network
- Data partitioning
- Monitoring and diagnostics
- Naming conventions
- Transient fault handling
- Retry guidance for specific services
Application Architecture Guide
- Architecture styles
- Choosing a compute service
- Choosing a data store
- Design principles
- Pillars of software quality
- Design patterns
- App Service web applications
- Hybrid networks
- Identity management
- Jenkins server
- N-tier application
- Network DMZ
- SAP on Azure
- SharePoint Server 2016
- Availability patterns
- Data management patterns
- Design and Implementation patterns
- Management and Monitoring patterns
- Messaging patterns
- Performance and scalability patterns
- Resiliency patterns
- Security patterns
- Busy Database
- Busy Front End
- Chatty I/O
- Extraneous Fetching
- Improper Instantiation
- Monolithic Persistence
- No Caching
- Synchronous I/O
Design Review Checklists
Design for Resiliency
Scenario guides
- Azure for AWS Professionals
- Build microservices on Azure
- Manage identity in multitenant applications
- Migrate to Service Fabric from Cloud Services
- Extend Azure Resource Manager template functionality
Resources
- Azure Roadmap
- Microsoft Trust Center for Azure
- Icons and diagrams
- Calendar of Azure updates
- Training
- SQLCAT blog
- AzureCAT blog
- White papers
I hope you find the above links as useful as I have.
George Stevens
Software Architect, Sr. Software Engineer at Solid Value Software, LLC.

