Building highly available, scalable, and resilient software that runs in the cloud is quite different from building such systems to run on-premises. Why? In the cloud you must plan for your software to encounter a much higher rate of failures than on-premises systems usually see. This article provides links describing techniques and best practices for building cloud software that effectively deals with such frequent failures.
Here is a rough sketch of the sources of these failures:
- Cloud hardware failures – The cloud uses vast numbers of cheap, commodity compute, storage, and network hardware units to host both the cloud provider’s PaaS services and customer services and apps. This cheap hardware fails more frequently than on-premises systems, which generally use expensive, top-of-the-line compute, storage, and network hardware. On-premises hardware is designed to achieve a high Mean-Time-Between-Failures (MTBF) so that software running on it does not have to deal with a high rate of hardware failures. The cloud is the opposite, having a low hardware MTBF due to the much higher failure rate of its cheap hardware. These routine hardware failures are very common and can happen multiple times a day to a single cloud service. The cloud control software (known as the “fabric”) is programmed to recover the software affected by hardware failures, both customer software and cloud provider service software. The “fabric” recovery happens in the background, out of sight. While recovering from these routine hardware failures, the cloud provider’s services return a “not available” signal to customer software using the service. The duration of such “not available” failures is typically measured in seconds, perhaps minutes, rarely longer. This requires that customer software running in the cloud be designed to 1) gracefully handle the higher rate of routine, short-term failures of both hardware and the cloud provider services it uses, and 2) have a low Mean-Time-To-Recovery from non-routine failures as well. The much higher rate of such routine failures is the big difference between cloud and on-premises software. Note that the cost savings of using cheap, commodity hardware are passed on by cloud providers to customers.
- Cloud hardware overloading – Many cloud provider services are multitenant (software-as-a-service), i.e. they share blocks of hardware (nodes) among the multiple customers utilizing the service. For example, Azure SQL is a multitenant cloud provider service used by multiple customer services and apps. A multitenant cloud provider service shares hardware amongst customers to reduce costs, with the savings passed on to the customer. When some customer’s software becomes very heavily loaded it may consume too many resources of a cloud provider service sharing compute, storage, or network nodes. In this heavily loaded situation the cloud provider service itself and/or the “fabric” control software will start throttling the service to protect it and its hardware from becoming fatally overloaded and crashing. Such throttling appears to the customer’s software as if the cloud provider service is temporarily unavailable. In other words, it appears as if the service has failed, since it will be unresponsive for a few seconds or minutes until the throttling stops. This intermittent protective throttling affects all customer software utilizing that cloud provider service. Throttling is a very common occurrence, happening as often as several times per hour, or more during heavy usage periods, with a typical duration of seconds per occurrence, but occasionally longer. Customer software must be written so it can effectively deal with such throttling to remain resilient and available. Note that some cloud providers offer non-shared (single tenant) PaaS services for a premium price. Using such premium services will sidestep throttling issues, other than the throttling you should build into your own services to avoid hard crashes due to overloading.
- Cloud catastrophic failures – Compared to the above failures, catastrophic failures are very rare. They occur perhaps a few times per year and typically involve the loss of one or more cloud provider services for use by customers for a half hour, several hours, or for a day or so in extreme cases. Such failures are caused by 1) Physical disasters, like earthquakes or terrorism, affecting data centers or their network infrastructure, 2) Massive hardware failures, 3) Massive software failures or bugs, or 4) Operational failures, i.e. the cloud provider operations staff making a big mistake or a series of smaller mistakes which cascade into a big outage. Mission critical customer services and apps must be designed to withstand these longer duration failures as well as the above shorter duration failures. One way to achieve such “high availability” is for customer software to “failover” to another data center located in a different geographical area. Note that this situation is quite similar to what can happen in an on-premises data center, and is also addressed by the links that follow.
The routine short-term failures described above are known as Transient Faults in Azure. Please see the item called “Retry General Guidance” in the “Azure Cloud Application Design and Implementation Guidance” link below for a full description of how Transient Faults happen and best practices to deal with them.
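To make the retry idea concrete, here is a minimal sketch (in Python for brevity, with a hypothetical TransientError standing in for a service’s “not available” signal) of retrying a transient-fault-prone call with exponential backoff and jitter, the general approach the retry guidance describes:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for the 'not available'/throttling signal a cloud service returns."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0):
    """Retry a transient-fault-prone operation with exponential backoff and jitter.
    `operation` is any callable that raises TransientError on a short-term failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # fault persisted; let the caller decide what to do next
            # Exponential backoff with jitter spreads retries out so many clients
            # do not hammer a throttled service in lock step.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Production retry logic should also distinguish transient from non-transient errors and cap the total retry time; the guidance and the Azure client libraries cover those details.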
The good news in the area of failure is that the cloud “fabric” control software is very intelligent and will usually be able to automatically heal cloud hardware failures and hardware overloading failures. For these, the healing process may take a few seconds, or a minute, or generally some time that is within the Service Level Agreement (SLA) for a particular cloud service like Azure SQL or Azure Storage. A Service Level Agreement is a legal agreement between customers and a cloud provider that gives the cloud provider a financial incentive to provide a stated level of service to customers. Each cloud service usually has its own unique SLA. Typically, if the cloud provider is not able to fulfill the terms of the SLA for a particular cloud service, it will refund the customer’s payments for the services used to some stated extent. Below are the typical levels of service one can expect from an Azure SLA, usually measured on a monthly basis in terms of minutes of availability per month.
So, how much failure time per month can one expect from different SLAs?
- An SLA of “three 9s” (a cloud service is available 99.9% of the minutes in a month) results in a maximum unavailability time of 43.2 minutes per month, or 10.1 minutes per week.
- An SLA of “four 9s” (a cloud service is available 99.99% of the minutes in a month) results in a maximum unavailability time of 4.32 minutes per month, or 1.01 minutes per week.
- Many cloud services have a 99.9% availability. Some are a little higher, some a little lower.
- For more on Azure SLAs, please see the “Characteristics of Resilient Cloud Applications – Availability” section of the below link to “Disaster Recovery and High Availability for Azure Applications”.
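The downtime figures above are straightforward arithmetic on the SLA percentage. Here is a quick sketch (a 30-day month is assumed for the monthly figure; real SLAs define their own measurement periods and rules):

```python
def max_downtime_minutes(availability_pct, minutes_in_period):
    """Maximum allowed unavailability for a given SLA availability percentage."""
    return minutes_in_period * (100.0 - availability_pct) / 100.0


MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month
MINUTES_PER_WEEK = 7 * 24 * 60    # 10,080 minutes in a week

# "three 9s": 43.2 minutes/month, ~10.1 minutes/week
# "four 9s":  4.32 minutes/month, ~1.01 minutes/week
```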
Conclusion
- With up to 10.1 minutes of unavailability per week allowed under a typical SLA, appearing to customer software running in the cloud as if a cloud provider service has failed, you absolutely must build your cloud software to effectively deal with frequent failures of all kinds. Failure is a normal part of cloud computing. It is not exceptional at all.
- Plus, mission critical services and apps running in the cloud must also be built for high availability so that they can gracefully withstand a catastrophic failure as well, and very rapidly come back on line, ideally in seconds to minutes.
The info sources presented below describe specific techniques to deal with such failures.
Azure Cloud Application Design and Implementation Guidance by Microsoft Patterns and Practices — Over the past year Microsoft has pulled together its key Azure best practices into one place. This makes it so much easier to draw upon when building software to run in Azure. The Guidance contains links to 13 focused areas. In my opinion the “must reads” in the Guidance are as follows. They are required to gain a minimal effective understanding of what it takes to build “Highly Available, Scalable, Resilient Azure Services and Apps”.
- Retry General Guidance (this has more detail on why there are lots more failures in the cloud)
- Availability Check List
- Scalability Check List
- Monitoring and Diagnostics Guidance
- Background Job Guidance
Disaster Recovery and High Availability for Azure Applications – This Microsoft document covers strategies and design patterns for implementing high availability across geographic regions to cope with catastrophic failures. These patterns allow an Azure app or service to remain available even if an entire data center hosting the app or service ceases to function. They also aid in reducing the Mean-Time-To-Recovery for your cloud hosted software.
Hardening Azure Applications – A book by Suren Machiraju and Suraj Gaurav published by APress in 2015. It does a great job of identifying techniques to build “Highly Available, Scalable, Resilient Azure Services and Apps”, as well as including security, latency, throughput, disaster recovery, instrumentation and monitoring, and the “economics of 9s” in SLAs. It is invaluable in defining requirements and dealing with the business in these areas. The target audience is Architects and CIOs, but Senior Developers and Technical Leads will also benefit from it. We all have a steep cloud learning curve to climb in the area of understanding and defining an organization’s non-functional requirements for cloud services and apps, plus the techniques required to meet those requirements. This book speeds one on their way.
Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications – An online and paperback book by Microsoft Patterns and Practices, published in 2014. This provides excellent primers on key cloud topics like Data Consistency and Asynchronous Messaging, plus an excellent section with in-depth explanations of a number of “Problem Areas in the Cloud”. So if you are unsure of terminology or technology terms, this is a good place to learn the basics.
Finally, a new way to aid building “Highly Available, Scalable, Resilient Azure Services and Apps” has just become available in Azure. It is called Service Fabric. I will cover that in future blogs.
George Stevens
dotnetsilverlightprism blog by George Stevens is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Based on a work at dotnetsilverlightprism.wordpress.com.
I’ve been able to get quite fast development time using Azure Stream Analytics (ASA) to analyze streams of unstructured data and to transform the format of such data, i.e. breaking the data up into different streams and/or reconstituting it into different structures and streams. These are things we often need to do, and now we do not always have to write programs to do them. In some cases we can use ASA instead.
The learning curve is quite manageable for ASA. I found the longest part of the learning curve was working with ASA’s SQL-like query language, particularly learning how to use its ability to do real time analysis of data streams via the Tumbling, Hopping, and Sliding time windows it offers. But if you know the basics of SQL this only takes an hour or so to learn, with good examples at hand (in the links below). I hope the links to ASA info sources will shorten your learning curve as much as they shortened mine, plus open your eyes to the possibilities ASA offers — It is a powerful, yet easy to use tool.
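To illustrate what a tumbling window does (outside of ASA itself), here is a hypothetical Python sketch that groups timestamped events into fixed, non-overlapping windows — the same grouping an ASA TumblingWindow performs inside a query:

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp_seconds, value) events into fixed, non-overlapping
    windows and count the events in each one. Returns {window_start: count}.
    This mirrors the grouping a TumblingWindow(second, N) clause performs."""
    counts = defaultdict(int)
    for ts, _value in events:
        # Every event falls into exactly one window; windows never overlap.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

Hopping windows differ by overlapping at a fixed hop interval, and sliding windows move with each event; the query-examples link below shows all three in ASA’s own query language.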
Here is a basic introductory example showing the process of building an ASA job and its query in the Azure Portal – “Get started with Azure Stream Analytics to process data from IoT devices” by Jeff Stokes of Microsoft. The screen shots of the Azure Portal for ASA in this link will give you an understanding of how to work with ASA and its query language. Note that you need not write external code to get things working. All your work, including writing and debugging the query, is done in the Azure Portal UI. Note that you may need to write some C# code later for production monitoring of the ASA job and any Event Hubs it gets data from.
At the time of writing this blog article, ASA can input and output data from the following Azure services:
ASA Input Sources
Blob
Event Hub
IoT Hub
Reference Data in a Blob
ASA Output Destinations
SQL database
Blob
Event Hub
Table Storage
Document DB
Service Bus Queue or Topic
PowerBI
These inputs and outputs provide an amazing array of options for processing data at rest (residing in a Blob) or data in motion (streaming into an Event Hub or IoT hub).
Here are 2 common usage scenarios of ASA:
- Searching for patterns in log files or data streams
- This can include using ASA to analyze log files that are programmatically created by one’s software to look for errors and warnings of certain kinds, or for telltale evidence of security problems. “SQL Server intrusion detection using Azure Event Hub, Azure Stream Analytics and PowerBI” by Francesco Cogno of Microsoft is an example of such a usage scenario.
- Since ASA works on live data streams contained in Azure Event Hubs it can be used to search for patterns in telemetry data from the outside world, e.g. IoT systems. For example one could find each item in the input stream that had “Alert” in the field named “EventType” and place that record into a Service Bus Queue read by a Worker Role whose job it was to push alert messages to a UI.
- Calculating real time statistics on-the-fly
- An example is calculating moving averages, standard deviations, and being able to create alert records sent to an Alerts queue when such a calculation exceeds some preset level. “Using Stream Analytics with Event Hubs” by Kirk Evans of Microsoft presents an example of this usage scenario as does the first link, above.
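As a rough illustration of the second scenario, here is a small Python sketch (hypothetical readings and threshold) that computes a simple moving average and emits alert records when it exceeds a preset level — the kind of calculation the linked examples perform with ASA’s windowing functions:

```python
from collections import deque


def moving_average_alerts(readings, window_size, threshold):
    """Compute a moving average over the last `window_size` readings and
    emit an alert record whenever the average exceeds `threshold`."""
    window = deque(maxlen=window_size)  # oldest reading drops out automatically
    alerts = []
    for i, value in enumerate(readings):
        window.append(value)
        avg = sum(window) / len(window)
        if avg > threshold:
            alerts.append({"index": i, "average": avg})
    return alerts
```

In a real deployment the alert records would go to an output like a Service Bus Queue, as described in the first usage scenario above.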
Other Useful Info Sources
“How to debug your ASA job, step by step” by Venkat Chilakala of Microsoft. This can save lots of time when debugging.
“Query examples for common Stream Analytics usage patterns” by Jeff Stokes of Microsoft. For both simple and complex query techniques by example.
“Scale Azure Stream Analytics jobs to increase stream data processing throughput” by Jeff Stokes of Microsoft. This will give you in depth knowledge of ASA.
“Stream Analytics & Power BI: A real-time analytics dashboard for streaming data” by Jeff Stokes of Microsoft. How to quickly display charts from data output by ASA.
“Azure Stream Analytics Forum” on MSDN. I have found this forum to contain some really useful posts. Plus you can ask questions as well.
I hope you find these info sources as useful as I did in opening up a new world of cloud-based data analysis and transformation!
George Stevens
One of my current technology explorations is polyglot persistence. I am now mostly through the reading stage and it is quite clear that NoSQL databases can be quite useful in certain situations, as can relational databases. Using both NoSQL and relational databases together in the same solution, each according to its strengths, is the essence of the polyglot persistence idea.
Here are some sources of information I’ve found to be most useful on NoSQL databases, their strengths, weaknesses, and when and how they can be best used:
- Martin Fowler’s book NoSQL Distilled (2013) has been immensely helpful in gaining an understanding of the various DBs, their strengths and weaknesses, and key underlying issues like eventual consistency, sharding, replication, data models, versioning, etc. It is a short little book that is truly distilled. If you read only one thing, this should be it.
- Also very useful is Data Access for Highly-Scalable Solutions (2013) from Microsoft Press and the Patterns and Practices group. It is written with a cloud mindset, contains code examples, and goes into much more detail than Fowler’s book. Importantly, it shows examples of how to design for NoSQL DBs. I found the first few pages of its Chapter 8, “Building a Polyglot Solution”, to be an excellent summary of the strengths, weaknesses, and issues one must deal with in using a NoSQL database. That chapter also presents an excellent, succinct summary of general guidelines on when to use a Key-Value DB, a Document DB, a Column-Family DB, and a Graph DB on page 194 of the book.
- The blog article I posted several months ago, CQRS Info Sources, contains links to good articles on techniques that themselves use NoSQL persistence (sometimes by implication). Reading these links aided me in seeing areas where NoSQL DBs could be useful.
- Microsoft Press’s book Cloud Design Patterns contains a lot of useful information on patterns that can use NoSQL DBs; guidance on things like Data Partitioning and Data Replication; plus a primer on Data Consistency that promotes a good understanding of eventual consistency versus strong consistency (usually available with a relational DB via transactions). Some of the patterns it describes that can be implemented with a NoSQL DB are Event Sourcing, CQRS, Sharding, and the Materialized View.
Finally, keep in mind that both books listed above advise that relational databases will typically be the best choice for the majority of database needs in a system, and that NoSQL DBs should be used only when there are strong reasons to do so. The costs of not using a relational DB, with its capability to automatically roll back transactions spanning multiple tables, can be quite substantial due to the complexity of programming the error compensation (rollbacks) by hand.
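To see why hand-coded compensation is costly, here is a minimal sketch (in Python, with hypothetical do/undo steps) of the compensating actions one must write by hand when updates span stores that lack cross-store transactions:

```python
def run_with_compensation(steps):
    """Execute a sequence of (do, undo) operations against stores that lack
    cross-store transactions. If any `do` fails, run the `undo` actions for
    the already-completed steps in reverse order -- the hand-coded rollback
    a relational transaction would otherwise give you for free."""
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        # Best-effort compensation; real systems must also handle an undo
        # failing partway through, which is where the complexity piles up.
        for undo in reversed(completed):
            undo()
        raise
```

Every `undo` must be written, tested, and maintained alongside its `do`, which is exactly the extra work the books warn about.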
George Stevens
In just a single year a major change has happened in the multiple waves of technology change that have been washing over the computer and software industries for the last 7 years or so — The Cloud Wave is growing in size at a rate much faster than any of the other waves of change I described a year ago in my blog article “Waves of Technology Change: Grab Your Surfboard or Life Jacket?”
Job Trend Data
Last year I identified the following 4 waves of change from the information in Indeed’s Leading Tech Job Trends, based on the top 10 “fast growing tech key words found in online job postings” (quote is from Indeed Job Trends page):
- New Web Wave – HTML5 and jQuery
- Mobile Wave
- Big Data Wave
- Cloud Wave
The above waves were identified from data shown by Indeed on 2/9/2015. Please see my 2015 blog article (same as that above) for the data this categorization was based on.
During the past year Indeed has modified its Job Trends page. Now (February 2016) it displays only the top 5 tech job trends, rather than the top 10 as in 2015. Below is a comparison of the top 10 Leading Tech Job Trends of 2015 versus the top 5 of 2016, both listed in rank order of how fast the key word is growing in online job postings.
2015 (top 10):
- HTML5
- Mongo DB
- iOS
- Android
- Mobile app
- Puppet
- Hadoop
- jQuery
- PaaS
- Social Media
2016 (top 5):
- Data Scientist
- Devops
- Puppet
- PaaS
- Hadoop
In the above 2016 data we have the following classification, mapping job key words to waves of technology change:
- Big Data Wave – Data Scientist (new in 2016) and Hadoop.
- Cloud Wave – Devops (new in 2016), Puppet, and PaaS.
Conclusion — The major change from 2015 is that the Cloud and Big Data waves have taken over the top 5 fastest growing jobs, completely displacing the Mobile and the New Web waves! And, since Big Data is heavily Cloud based these days, you can also say that the overall Cloud wave is the fastest growing wave of technology change washing over us right now.
Survey Research Data
The RightScale “2016 State of the Cloud Report” adds deeper insight to this conclusion. It is a survey of 1,060 technology professionals (executives, managers, and practitioners) from a large cross section of organizations concerning their adoption of cloud technology. I encourage you to examine the details in the report itself. Below are some key findings from this report:
- The use of “Any Cloud” increased from 93% to 95%. Note that all the data includes experimental projects as well as production systems. Wow, almost all respondents are using the cloud somehow!
- In the last year respondents’ use of “Hybrid Clouds” increased from 58% to 71%.
- Respondents typically use more than one cloud provider, both public and private.
- “Lack of resources/expertise” has replaced security as the top “cloud challenge” since 2015. Concern about security is now the number two “cloud challenge”.
- Percent of respondents running apps in the cloud in 2015 versus 2016 are shown below by cloud provider:
- AWS: 57% in 2015, 57% in 2016 (0% change)
- Azure IaaS: 12% in 2015, 17% in 2016 (+5%)
- Azure PaaS: 9% in 2015, 13% in 2016 (+4%)
- VMWare, Google, IBM, etc. were all between 4% and 8% in each year.
The above clearly shows that Microsoft’s Azure (with 4% to 5% growth) is taking market share from AWS (with 0% growth). By the way, grabbing market share from competitors is one key characteristic of a market leader.
What to Do? Grab a Cloud and Get Up to Speed
If you are a software development professional (whether executive, manager, architect, or developer) it should be clear that there is a high probability you will be called upon to participate in cloud based projects in the next few years.
My own cloud learning journey has thus far resulted in me learning how to architect and develop industrial strength cloud services and hybrid systems (using both cloud and on-premises systems) using the Azure Service Bus, Azure Storage, and Azure Cloud Services. After a number of months of full and part time study and development, I became proficient enough to successfully use these skills in my job in July 2015. It has required a substantial amount of time and effort to learn the basic skills, develop the vital “cloud mindset”, and integrate these together.
Developing cloud based software requires a very different mindset than developing software for on-premises systems. A “cloud mindset” is required – One has to specifically design for failure and eventual consistency, plus other incongruities as well. This has much farther reaching implications than you might first imagine. Some of the things one routinely practices in on-premises system development are anti-patterns and anti-practices in the cloud! So not only do you have to learn new things to do high quality cloud software development, you also have to unlearn things you already know that do not work well in the cloud.
Below are a few information sources I’ve found most valuable on my cloud learning journey. They will help you on your learning journey should you choose Azure.
- “Microsoft Azure — The Big Picture”, by Tony Meleg, MSDN Magazine, October 2015. This article provides an excellent overview of what Azure has to offer from a software developer’s point of view.
- Exam Ref 70-532 Developing Microsoft Azure Solutions, March 2015, by Zoiner Tejada, Michele Leroux Bustamante, and Ike Ellis. At first I found the breadth of information required to develop software on Azure overwhelming. This book solved that problem, bringing it all together in one place so you do not have to spend hours sifting through online documentation and tutorials (save the excellent tutorials for after you’ve read the book). This book provides you with all the basic details needed to start developing software for Azure. It has a wide breadth that covers all the key features of Azure you’ll have to deal with. Plus it goes into a reasonable depth with code examples, and has good references to more in-depth sources. It helps you learn to use PowerShell. And you don’t have to study for the certification exam and take it if you don’t want to! You can use it solely as a source book and learning guide.
- Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications by Homer, Sharp, Brader, et al. Copyright 2014, Microsoft Patterns and Practices. This is available in paperback (for a fee), or as a PDF (free download), or as a set of web pages. It contains 24 patterns, plus 10 guidance topics. There are also code snippets and samples provided as separate downloads. This book has been extremely helpful in showing me the bigger picture and the “cloud mindset” that one must absolutely learn to work with the cloud – like considering eventual consistency, designing for failure, scaling, replication, partitioning, etc. And it provides explicit guidance on how to effectively deal with these areas as well.
- Since about 2012 MSDN Magazine has published quite a number of well written articles on specific Azure software development technologies, most including code examples. Google “Azure MSDN Magazine” for a list of these articles. Of special interest are the articles published between 2014 and 2016, during the release of an astounding number of innovative and powerful new Azure capabilities that are also very well integrated with Microsoft’s software development tools like Visual Studio. Integration of Visual Studio with Azure capabilities measurably reduces development time and costs. These capabilities and tools, along with competitive pricing, are making Microsoft’s Azure cloud a clear market leader.
Good luck on your cloud learning journey.
George Stevens
Given the challenges of developing apps for modern distributed systems outlined in my previous blog SO Apps 4, Coping with the Rapid Rise of Distributed Systems, exactly what techniques can be used to decrease Time-To-Market (TTM) of these systems and apps? Below, I list specific techniques I have found that will speed your TTM, both in the development of the initial release and subsequent releases. Many of these are root causes of slow TTM.
Software Structure
- Use Volatility based decomposition as a basis for designing your software architecture – The architectural decomposition of a system into microservices and their components needs to be driven by the goal of encapsulating the most volatile areas of the system into separate parts. This decouples high volatility areas from each other so they can vary independently. A volatile area is some aspect of the domain or system that has a high probability of change at some point in the life of the system — change large enough to severely disrupt the system architecture if the architecture were not designed to encapsulate it. Code changes are typically caused by changing requirements or fixing bugs. Encapsulating volatile areas prevents such code changes from rippling through large swaths of the code base. When such code changes are well contained within a microservice and/or its components much, much less work needs to be done to make the change, and a faster TTM results in both initial and follow on development phases.
- Control the expansion of complexity – Tightly constrain the number of interconnections to prevent the non-linear acceleration of complexity from soon burying the project in excess code, slowing TTM more and more over time. See Figure 2 in my SO Apps 4, Coping with the Rapid Rise of Distributed Systems article for a diagram and explanation of this accelerating non-linear effect. Controlling this complexity is readily achieved by constraining interconnections as follows:
- Limit and manage the number of interconnections between components within a service or microservice. A closed layered architecture works very well for this.
- Limit and manage the number of interconnections between services or microservices themselves.
- Avoid nano services (very tiny, fine-grained services), which inevitably result in more interconnections, creating more non-linearly expanding complexity.
- Avoid fine-grained service contracts that require a lot of service operations, since these inevitably create more interconnections, and thus more non-linearly expanding complexity. Instead favor service contracts with fewer coarse-grained service operations and “chunky” data contracts.
- Focus your business logic in your services, not in your UI or split between the UI and services — With multiple UIs (web, mobile, etc) why set yourself up to have to make code changes in multiple places due to the business logic being sprinkled around? Rather put all business logic in services as described above, leaving the UI to implement the presentation logic. You’ll get a shorter TTM this way.
- Strongly separate system concerns from business concerns – In all your code keep most of the developers focused on consistently adding the highest value by writing code that directly implements business logic, rather than having to write plumbing code that deals with system concerns while they are also writing business logic code. System concerns, implemented by plumbing code, are required for messaging, pushing data to clients, logging, auditing, etc. Push the plumbing code implementing system concerns down into utilities and infrastructure modules that the business logic developers can call. Having most of the developers spending significant time repeatedly writing plumbing code that can be done by a framework, base classes, or utility services will greatly slow your TTM. It is worth having this sort of system code developed very early in a project by a few highly skilled developers. That investment will quickly pay off in a faster TTM throughout the remainder of development.
- Do the above 4 things at once and you have an Agile Codebase, one that supports increased Business Agility – An Agile Codebase will support future changes being done with a much lower TTM than is the current practice. Why? Because 4 of the root causes of poor TTM have largely been eliminated. The resulting Business Agility allows a business to adapt to changes of all sorts much more quickly, be the change an opportunity, a threat, or from rapidly changing technology.
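As one small illustration of separating system concerns from business concerns, here is a hypothetical Python sketch where plumbing (logging here; retries, auditing, and metrics in practice) is factored into a reusable decorator so the business logic stays clean:

```python
import functools
import logging


def plumbing(func):
    """Infrastructure concern factored into one reusable decorator so
    business-logic developers never interleave it with their code by hand."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logging.info("calling %s", func.__name__)
        result = func(*args, **kwargs)
        logging.info("finished %s", func.__name__)
        return result
    return wrapper


@plumbing
def approve_order(order_total):
    # Pure business logic: no logging, retry, or audit code in sight.
    # The 10,000 limit is a made-up business rule for illustration only.
    return order_total < 10000
```

In a .NET service the same separation is typically achieved with base classes, message-pipeline behaviors, or a framework built by a few skilled developers early in the project, as described above.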
Project Organization
- Determine the Critical Path of the project (the sequence of development activities that adds up to the longest duration in the network of the dependencies of development activities) since it determines the soonest time a project will be done, i.e. the minimum TTM. Given the work that must be done and the sequence in which it needs to be done, without knowing the longest path how can one schedule work and even hope to achieve the shortest possible TTM? The Critical Path will affect your project whether or not you choose to use it to your advantage. This is key to creating realistic expectations in all project stakeholders, and hence credibility.
- Put the very best developers on the Critical Path activities – Your best developers have the highest probability of getting these TTM determining activities done sooner.
- Test early and test often – Service oriented distributed systems require intense, repeated integration testing since bugs are much more difficult to detect and fix than in monolithic apps. Integration tests need to be written and run for each individual module, and also for each subsequent integration of tested modules into larger components or microservices. Do not delay integration testing to the latter part of the project when there is insufficient time to find and fix bugs. That is a sure way to increase TTM, and decrease quality.
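Finding the Critical Path is a longest-path computation over the dependency graph. Here is a minimal Python sketch with hypothetical task names and durations:

```python
def critical_path_length(durations, deps):
    """Longest path through a dependency DAG.
    durations: {task: days}; deps: {task: [prerequisite tasks]}.
    The result is the minimum possible schedule length (minimum TTM)."""
    finish = {}  # memoized earliest finish time per task

    def earliest_finish(task):
        if task not in finish:
            # A task can start only after its slowest prerequisite finishes.
            start = max((earliest_finish(d) for d in deps.get(task, [])), default=0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(earliest_finish(t) for t in durations)
```

With, say, design (3 days) feeding api (5 days) and ui (4 days), both feeding test (2 days), the design → api → test chain is the critical path, so the project cannot finish in fewer than 10 days no matter how the other work is scheduled.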
Tools and Technologies
- Favor pre-integrated sets of development tools and frameworks – It typically takes significant developer time to integrate a bunch of disparate tools and frameworks. And when a new release of them comes out with bug fixes it often takes additional developer time to integrate the new release into existing tools, frameworks, and code. All this acts to slow TTM. Much of this work can be avoided by choosing pre-integrated tools and frameworks.
- Avoid using brand new “preview release” technologies just out of the box – While definitely interesting and alluring, preview releases of new technologies and frameworks tend to be incomplete, have more bugs than usual, require many workarounds, come with sparse and sketchy documentation, and require developers to spend significant time learning their basics and even more time learning best practices. Adopt brand new technologies in their solid production releases, subsequent to the previews, and speed up TTM as a result.
Classic Mistakes of Software Development
- Avoiding the “Classic Mistakes” will definitely result in a faster TTM – To be forewarned is to be forearmed. How many of the “Classic Mistakes” are going to happen on your next project? They will slow your TTM, and they are avoidable! Below are some links I’ve found helpful in this area:
- In 1996 Steve McConnell listed software development’s classic mistakes in his book Rapid Development. In 2008 his company, Construx, conducted a survey of 500 developers who rated the severity of the mistakes, then published a white paper listing the mistakes and summarizing the survey results. This white paper is definitely worth reading. You’ll have to register and log in to download a copy.
- Peter Kretzman’s blog “CTO/CIO Perspectives: Intensely practical tips on information technology management” has a relevant article that looks at the role of senior management in this area: Software development’s classic mistakes and the role of the CTO/CIO
- Jim Bird’s blog article Classic Mistakes in Software Development and Maintenance presents some of McConnell’s material, plus additional useful material from Capers Jones.
Near the end of that article, note that real world research from Capers Jones shows that “Not identifying and cleaning up error-prone code – the 20% of code that contains 80% of bugs” results in a 50% decrease in developer productivity. Cleaning up such code is an excellent way to decrease TTM.
I hope this list aids you in decreasing the TTM in developing your software as much as it has helped me.
George Stevens
dotnetsilverlightprism blog by George Stevens is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Based on a work at dotnetsilverlightprism.wordpress.com.
With AMQP (Advanced Message Queueing Protocol) multiple software technologies located anywhere there is an internet connection can now collaborate – sharing data and behavior.
In many scenarios, asynchronous message queueing is the favored mechanism for the communication and collaboration required between the components of distributed systems and service oriented apps. AMQP is being widely adopted as a standard way to implement this behavior. It was first developed in 2003 at JPMorgan Chase; was approved as an OASIS standard in 2012; and was approved as an ISO and IEC International Standard in 2014 (Wikipedia: AMQP).
AMQP is a wire-level application protocol, similar in concept to HTTP, but aimed at asynchronous message queueing scenarios rather than the request-response situations HTTP serves. AMQP’s asynchronous message queueing provides the capability to interconnect software components implemented in a number of different technologies (Java, .NET, etc.), which may be running in various cloud or on-premises locations.
Why Use Message Queueing?
In considering AMQP it is useful to first review why message queueing is the favored inter-component collaboration mechanism in distributed systems and service oriented apps. The following summary is paraphrased from Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions by Gregor Hohpe and Bobby Woolf, Copyright 2004 by Pearson Education, Inc. The rationale behind the use of messaging is presented in the section called “Messaging” (pp 53 – 56) of Chapter 2, “Integration Styles”:
There are generally 3 ways that the components of distributed systems can collaborate with each other:
- Sharing Data – This form of collaboration is achieved by transferring files or using a shared database. Its main limitation is that it does not allow sharing the functionality implemented by each component in the system.
- Remote Procedure Calls – Data plus the functionality of individual components is shared through remote procedure calls to each other, either with one-way or request-response scenarios. The main limitation here is that both the sending and receiving components must be running simultaneously for a successful collaboration to occur. This tightly couples the components together, often resulting in the overall system becoming unacceptably unreliable.
- Asynchronous Messaging – Asynchronous message queues combine sharing both data and the functionality of individual components, but without tightly coupling the components involved. With queues the sender and receiver of queued messages do not need to be active simultaneously. And messaging systems can also be designed to provide timely request-response scenarios. Thus, message based distributed systems are typically far more resilient and reliable than distributed systems based on Remote Procedure Calls alone.
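The temporal decoupling that makes asynchronous messaging so resilient can be sketched in a few lines. This is only an in-process illustration using Python's standard `queue` module, standing in for a real broker such as an AMQP queue; the function names are hypothetical:

```python
import queue

# The queue decouples sender from receiver in time: the sender enqueues
# work while no receiver is running, and a receiver drains it later.
work = queue.Queue()

def send(message):
    work.put(message)  # returns immediately; no receiver need be active

def receive_all():
    """A receiver that starts later still sees every message, in order."""
    results = []
    while not work.empty():
        results.append(work.get())
    return results
```

Contrast this with a remote procedure call, where `send` would fail outright if the receiver happened to be down at that moment. The queue absorbs the messages until a receiver is ready, which is precisely the reliability gain described above.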
The following quote from an article on the International Association of Software Architects (IASA) web site adds more depth – “From the Field: Escaping Appland” by Monty Montgomery, Master Architect at IDesign.
“The demands placed on modern systems mandate elasticity. It is no longer enough to scale out. Your system must also shrink as well. The easiest way to address this need is to use a queuing technology. Queues normalize system load in a very reliable, predictable and efficient way. They absorb unexpected spikes and cyclical rises in load at lower cost. Queues are also the recognized building code for high throughput. And most importantly, queues provide the essential temporal decoupling you will need between your subsystem level microservices to extend technical empathy to DevOps.”
“The rise of numerous lightweight, mature queue-based messaging technologies clearly indicates the value and need for queuing. And it’s no mistake that all of the queuing technologies that matter now also support AMQP. All the modern software systems that you know and love to use employ some type of queuing technology as the backbone of their system architecture. And queuing is of course the cornerstone of the Cloud.”
Here are the most common asynchronous messaging scenarios, from Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications, Copyright 2014 Microsoft. Please see pp 169 – 171 for the details of each scenario below:
- Decoupling workloads from each other.
- Temporal decoupling, so the sender and receiver do not have to be active simultaneously.
- Load balancing by spreading the load between multiple components, each servicing the same queue.
- Load leveling, using a queue as a buffer to prevent spikes in incoming communications from overwhelming the system.
- Cross-platform integration.
- Asynchronous workflow, allowing the workflow components to be decoupled from each other.
- Deferred processing, to facilitate scheduling and better use of compute resources.
- Reliable messaging, guaranteeing the delivery of an enqueued message so that it never gets lost.
- Resilient message handling, allowing multiple receive attempts to be made for a given message until at least one attempt is successful.
- Non-blocking message receivers.
AMQP Info Sources
Here are the AMQP information sources I have found most helpful:
- First, there is the OASIS AMQP website. The About page of this website gives an excellent one page rundown of the capabilities, key features, and business cases for the use of AMQP, plus links to other informative pages as well. Don’t miss this quick and useful way of coming up to speed on AMQP basics.
- Microsoft has widely adopted AMQP and is actively using it in the Azure Service Bus, Azure Event Hubs, and also in the Azure IoT Hub.
- For the Service Bus, “AMQP 1.0 Support in Service Bus” covers the details and has a lot of links to other relevant Microsoft documentation.
- For Event Hubs, “Getting Started with Event Hubs” introduces how AMQP is utilized.
- For IoT Hubs please see “Azure IoT Hub Developer Guide” for where AMQP fits in and for links to other articles.
- Clemens Vasters (Lead IoT Architect at Microsoft) has several excellent video blogs about AMQP. These will give you an in-depth understanding of the AMQP protocol and what it can do. As you will see, AMQP is very full featured and supports a variety of communication modes beyond simply sending a message.
- Overview, and a conversation with David Ingham of Microsoft, the co-editor of the AMQP 1.0 spec – “Announcing the General Availability of AMQP 1.0 in Windows Azure Service Bus!”
- A series of six 15 minute videos exploring the in-depth technical details and capabilities of AMQP.
- Apache has developed Qpid, a messaging framework based on AMQP for Java, C, and Python clients.
- Plus, there are other AMQP messaging systems as well — Just google “AMQP client” to list them.
So there you have it! The rationale behind the use of asynchronous message queueing in distributed systems, plus a lot of the basic information you’ll need to understand the capabilities of AMQP and put it to use.
George Stevens
During 2015 my readings on software structure and architecture have taken me into Event Sourcing and CQRS. I’ve found a few very helpful sources which are well stated, and give me a useful conceptual model without overwhelming detail. Here they are:
- Martin Fowler’s blog article on CQRS is a great starting point.
- Also by Martin Fowler, and mentioned in the above article, are elaborations on where and how CQRS can be used in systems: the ReportingDatabase pattern and the EagerReadDerivation pattern. These 2 patterns demonstrate how CQRS concepts can be used in a common real life scenario: providing data to queries specifically targeted at populating the ViewModels of user interfaces.
- And, the above 2 articles are a good “pairing” to the Aggregated Reporting pattern of Arnon Rotem-Gal-Oz, from his SOA Patterns book.
- At the end of Fowler’s CQRS article there is a link to a short article by Greg Young, “CQRS, Task Based UIs, Event Sourcing agh!”, that succinctly nails CQRS. Another link leads to an insightful article by Udi Dahan that provides valuable details, “Clarified CQRS”. I found both of these very helpful as well.
Interestingly, most of the above were written between 2009 and 2011, before all this became quite popular.
- I have also found a number of sections of Microsoft’s Patterns and Practices CQRS Journey book useful, especially the latter concept oriented sections.
- Finally, Martin Fowler’s blog article on the Event Sourcing pattern is also relevant.
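The core idea running through the readings above can be sketched in a few lines; all names are hypothetical. The write side appends events to an append-only log (Event Sourcing), and the read side projects that log into a view shaped for queries (the CQRS split):

```python
events = []  # the event store: an append-only log of what happened

def deposit(account, amount):
    """Command side: record the fact, never mutate a balance directly."""
    events.append(("deposited", account, amount))

def withdraw(account, amount):
    events.append(("withdrew", account, amount))

def balances():
    """Query side: rebuild a read model by replaying the event log."""
    view = {}
    for kind, account, amount in events:
        delta = amount if kind == "deposited" else -amount
        view[account] = view.get(account, 0) + delta
    return view
```

Because the log is the source of truth, new read models (a reporting view, a UI-specific projection) can be added later simply by replaying the same events with a different projection function.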
As many of the above authors point out, CQRS is only for specialized situations and not a generalized approach for everything!
I hope you find these readings as helpful as I have.
George Stevens
We have been building distributed computing systems since about 1970 [Rotem-Gal-Oz]. Their use has accelerated in the last 15 years, and has accelerated even more in the last 5 years (in part due to the demise of Moore’s Law). Distributed systems get their work done by distributing the computing work required by an app over a number of separate computers, rather than doing it all on a single computer. A specific part of the work is intentionally done on one computer, and other specific portions of the work are done on other computers. Some of the reasons for using distributed computing are to facilitate the reuse of the capabilities of a piece of software by multiple apps and services, and to produce extensibility and/or location transparency in a system.
Back in the “old days” most apps did all their computing work on only one or two computers. For example, in 2008, Word or Excel would typically run on your own desktop computer. And back then a typical website usually involved a browser on your desktop computer communicating with a server running on a single remote computer – just 2 computers.
Nowadays a single app is likely to use a number of different services and servers to get its work done, with an individual service often running on its own separate computer. The computers used by today’s apps can include: a computer to run the browser or user interface (UI); another computer to run the primary server used by the browser/UI computer; still other computers that the primary server itself calls upon for specialized services; plus multiple additional servers (running on separate computers) that the app in the browser/UI uses directly, in addition to the primary server. Then add in the cloud (an interconnected system of distributed computers), with its elastic ability to dynamically scale the computers used by apps to handle varying workloads, and you’ll find even more computers being used by a distributed system.
Internet-of-Things (IoT) systems are complex distributed computing systems as well. IoT system development will fuel the accelerating rise of distributed systems for years to come. Figure 1 below shows how much things have changed.
Figure 1 — A conceptual sketch of the increase in complexity of today’s apps and their supporting distributed systems, versus typical apps of 10 to 15 years ago.
And take note of this: All the different services and computers used by the apps of today are usually completely hidden from the end user, and rightly so. App users don’t want to be distracted by all these details. They just want to get their work done via a good user experience. So the accelerating rise of distributed computing is hidden from app users and public awareness. Out of sight and out of mind.
Thoughtful consideration of the accelerating rise of distributed systems produces three vital conclusions:
1. Distributed computing has become a permanent, disruptive part of a “new era” of software and systems, rather than being something only banks and the Department of Defense used when it first started four and a half decades ago [Rotem-Gal-Oz]. Today almost every time you use an app it involves at least 2 computers, and typically more – in spades for IoT systems. Distributed computing is becoming the rule, rather than the exception as it was prior to 2008 or so. The “new era of ubiquitous distributed computing” is a significant paradigm shift that is well underway. You cannot afford to let this disruptive change go unnoticed.
2. The accelerating use of distributed computing has largely been off the radar of general awareness and has created a “knowledge gap” in the software industry:
- We in the software industry currently do not have widespread knowledge of the best practices of designing, developing, planning, and maintaining the software of distributed systems and apps. Why? 1) The backlog of software needing to be developed is vast, creating very high demand for software developers that need to be put to work ASAP. And 2) software technology is changing so fast on so many fronts at once in recent years that keeping up with all the changes is very time consuming. So new topics fall through the cracks, especially those not at the forefront of awareness and which are not specific products and technologies sold by software and systems vendors. Thus, a relatively small percentage of software industry participants now possess the knowledge of the best practices of developing distributed systems. And these best practices are increasingly required for successful development projects as more and more projects involve distributed systems.
3. Apps and software systems have become much, much more complex since their logic (contained in their software) is now spread over multiple computers that are connected by networks. And complex systems require more work and time to develop. The complexity and amount of work to develop distributed apps increases as follows:
- Apps themselves are both more complex and more numerous than in decades past. How many apps are on your smart phone? How many apps were on your desktop computer in 2008? There is more complexity since we have more apps. Plus each app typically requires more functionality and usability than in the “old days”, resulting in more complexity.
- The app software is now distributed, rather than most of it being on one or two computers – It takes substantial extra work to manage, coordinate, secure, deploy, do robust error recovery, debug, and adequately test the app’s software logic that is now spread over a number of distributed computers.
- Apps now use far more networks and connectivity – It takes substantial extra work to manage, coordinate, secure, deploy, do robust error recovery, debug, and adequately test the networks and connections used by the interconnected distributed computers running the distributed software.
- The software, computers, networks, and connectivity all must be effectively integrated – It takes significant extra work to integrate, coordinate, secure, deploy, do robust error recovery, debug, and adequately test all of the above so that the software, computers, networks, and connections cooperatively interact to behave as if the app were running as a single unit on a single computer.
- To summarize, in distributed systems the whole is indeed greater than the sum of its many parts, including the whole amount of work it takes to design, plan, develop, test, debug, secure, deploy, operate, and maintain complex distributed systems and apps.
- Note that most design and planning methods, plus development processes, in use today do not account for the significant increase in complexity, and the resultant increase in work, caused by distributing an app’s computing work over many computers. Addressing this “missing link” is key to a much less bumpy transition into the “new era”.
- And the increase in complexity and work required is not a simple linear increase; rather it is an upward accelerating curve, as shown below in Figure 2. A crude, yet fairly accurate, way of measuring the increase in “distributed” complexity today, as compared to the “old days”, is to count the number of arrows (aka connections) and the number of “Server X Distributed Computer” boxes (aka functionality) in each of the diagrams in Figure 1 above. Then add the arrow count to the box count for each diagram [Sessions], and finally add 1 to represent the UI computer at the top of each diagram, as follows:
In the “old days” we have: 1 arrow + 1 box + 1 UI = 3 total complexity.
Today we have: 7 arrows + 6 boxes + 1 UI = 14 total complexity.
By this measure the complexity has increased by a factor of 14/3 ≈ 4.7 (a 367% increase)! Note that I have held the “functionality” measure of complexity per box constant above for brevity.
This is not to say the work required to produce and maintain the app will always increase this much. But it does say that the work required will not increase like a straight line. Rather, the work increases along a non-linear curve, since it is directly related to the non-linear increase in complexity of the app and its supporting distributed system. In other words, the work required to produce a distributed system app is a function of the complexity of the connections in the system (arrows) plus the complexity of the required functionality implemented in each part (boxes) of the system. Therefore, beware of the temptation to estimate the effort required to build a distributed system by extrapolating from the actual effort previously used to build non-distributed systems. Your estimate will fall far short.
Below is a family of curves showing how a typical measure of interconnection complexity increases. Note what happens when the number of items (n, the number of services or distributed computers, i.e. boxes) is doubled. Double the items (a 2x increase) and the complexity can increase by 3x, 4x, or more — Even when the number of connections is tightly constrained (e.g. by a 75% reduction in connections). This is the driving force behind how the amount of work required in a complex distributed systems development project can undergo non-linear expansion behind the scenes over several months, eventually catching everyone by surprise when they find the project is buried in unanticipated work. This also applies to the features of an app – Research shows adding 25% more features can increase the complexity by 100% [Sessions], thus non-linearly increasing the work needed to build the app.
Figure 2 – Graph of formulas of interconnection complexity — Curves that constantly accelerate upward.
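The non-linear growth behind those curves follows directly from the standard interconnection formula: n items have at most n(n-1)/2 pairwise connections. A quick calculation shows what doubling n does, using hypothetical service counts:

```python
def connections(n):
    """Maximum number of pairwise connections among n items."""
    return n * (n - 1) // 2

# Doubling the number of services from 4 to 8 does not double the
# potential interconnection complexity; it multiplies it by 4.7x:
# connections(4) = 6, connections(8) = 28.
```

So a 2x increase in services yields nearly a 5x increase in potential connections, which is why constraining connections through deliberate architectural design matters so much.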
As shown above, the complexity of software systems and distributed systems (as well as of the communication paths in groups of people) increases along an ever accelerating upward curve. This upward acceleration will happen unless specific complexity reducing software design techniques are used to prevent it. Such simplifying techniques are key parts of the best practices for the software architectural design of distributed systems listed below. Reductions in complexity significantly reduce the amount of work that must be done in a project. They also reduce the number of failures and security vulnerabilities [Sessions]. That’s a lot of return for the effort spent reducing complexity!
The Way Forward
Welcome to the “new era of ubiquitous distributed computing” being used by more and more new apps, plus IoT systems. Happily, despite the obstacles of the non-linear expansion of complexity and work, we have learned how to effectively develop distributed systems. Below I outline how to adapt to the disruptions caused by the “new era”, rather than being at their mercy.
If you are developing distributed systems you should not assume your development project will proceed like projects did in the past with non-distributed systems. Many of the techniques we are accustomed to routinely using in software projects were developed back in the “old days”, before the explosion of complexity and work in distributed systems, and many of them are not up to the task required of them today. Things have changed. Not mitigating the risk of these no-longer-effective techniques will almost surely result in troubled distributed system development projects.
Here is what to do to have successful software projects that produce robust distributed apps and systems which deliver the functionality, usability, business agility, data security, and operational efficiency required of software systems today.
- Adopt a distributed system mindset, a “new era” mindset, recognizing there is an important disruptive paradigm shift underway. Delay in adopting this mindset will only cause you problems and pain. Instead, consider this an opportunity to move ahead of the pack.
- Accept that the techniques used in the past in project estimating, planning, organization, and sequencing of activities will need modification to work well in developing distributed systems. Then use new techniques that adequately quantify cost, risk, the extra work due to complexity, the new development activities, their new sequencing, and the interaction of all of these with the overall project cost, schedule, and risk [PDMC].
- Recognize the key differences between complex distributed apps and simpler “old style” apps. You’ll not have to do things differently when developing “old style” apps that do not use distributed computing, although applying some of the best practices of distributed systems to such apps can add significant value.
- Use proven Service Oriented software architecture and design techniques, including those that reduce the complexity of a system. These, combined with the improvements to project planning and organization, can significantly reduce the amount of work required to build a distributed system by reining in the explosion of complexity. They will also keep the level of complexity, and the amount of work required to extend and maintain the system, much lower for years throughout the life of the software system, in addition to speeding up the time-to-market of subsequent releases [AMC], [PDMC].
- Executives, product owners, project managers, architects, developers, and test engineers must explicitly focus beyond the features, the software and its logic, and the data (all of which were the main focus in the “old days”). Now they must also fully embrace dealing with complexity, the network, connectivity, plus the substantial integration and testing work (on multiple levels) required to make it all play well together. And please do this long before the end of the project, when you are out of time. This new focus will have an impact on both the amount of work required (more work) and the sequencing of activities in a project. However, it will also produce a noticeable increase in quality, security, and being on schedule and on budget [AMC], [PDMC].
- Plan on the participants of a software development project having to climb a substantial learning curve. Not only for learning any new technologies involved (the usual focus of new learning), but also plan for learning the new techniques and best practices required to deal with complexity, the distributed software, the network, the connectivity, the integration, security, deployment, automated health monitoring of the app, automated scaling, distributed error recovery, distributed testing, plus new ways of project estimation, planning, and project design.
- Augment your staff with temporary outside experts in key areas as necessary. Not only can they immediately add great value to design, planning, organization, and implementation, they can also significantly reduce risk, plus mentor your staff to bring them up to speed faster and more thoroughly than some other forms of learning. An example is utilizing the services of a security expert to design the security for your new app and its supporting distributed systems, plus training your staff in the security design and its implementation [PDMC].
You can learn the details of many of the above items in the IDesign Architects Master Class and Project Design Master Class. These classes are developed and taught by Juval Lowy, an internationally recognized expert in distributed systems and service oriented design, who has been architecting and planning such systems for decades.
For more information on the details of any of the above items, please contact me by posting a comment on this blog. For more detailed information on aspects of building distributed systems please see the following, some of which also serve as references:
- An article on the additional effort it takes to build cloud apps in my December 2014 blog post: “A Perspective for More Accurately Estimating Cloud App Development Costs”.
- My February 2012 blog article “Software Structure Can Reduce Costs and Time-to-Market” shows that the post-initial-development cost of a software system can vary by over 400%. Now you see one key source of that variation – the complexity of a software system. Another key source is the encapsulation of volatility, but I’ll save that for another article.
- A must read article looking at the current mindset and architectural practices, and then showing diagrams of decoupled, scalable software architectures that work with both distributed and not-so-distributed systems: “From the Field: Escaping Appland” by Monty Montgomery, Master Architect at IDesign.
- [Rotem-Gal-Oz] An excellent in-depth article on 8 specific challenges in building distributed systems – the reliability, latency, bandwidth, security, topology, administration, transport, and homogeneity of networks: “Fallacies of Distributed Computing Explained” by Arnon Rotem-Gal-Oz, Service Oriented Architect.
- [Sessions] An article on system complexity by Roger Sessions, an expert in IT complexity who has been working in this area since at least 2008: “Thirteen Laws of Highly Complex IT Systems”.
- [AMC] IDesign Architects Master Class.
- [PDMC] IDesign Project Design Master Class.
I hope this article has been helpful to you, despite its length. In presenting the “whole story” in rather broad brush strokes I have glossed over a number of details, but I believe it is vital to comprehend the big picture here. Major technology paradigm changes do not happen often, and when they do, it is often not apparent until well after the fact. The “new era of ubiquitous distributed computing” has clearly taken shape. Understanding the big picture and the “way forward” can put you well ahead of the curve as this wave of technology change washes over human societies for decades to come.
Be mindful of the significance of this “new era” in all of human history – In the twenty-teens the human race definitively crossed the threshold into the “era of ubiquitous distributed computing”. Just connect to the internet from anywhere on earth and the awesome capabilities of multiple, powerful distributed information processing engines are available! This is a clear milestone of major technological and social change that will be visible far into the future as humans look back on their past. One cannot even begin to imagine the cascading changes that will result and how they will affect the course of history.
George Stevens
Here are some of the most useful sources of information on Internet-of-Things (IoT) systems that I’ve run across in 2014 and 2015. When reading this, please keep in mind the key points made in my blog article “Reinventing the Wheel is Not Necessary for IoT Software Architecture”:
- It’s best to use an end-to-end system perspective when thinking about IoT Systems. They can be much more complex than just the internet and things.
- “When developing IoT Systems we can use all of the software structural (aka software architecture) knowledge we’ve gained over the past decade from developing secure, mission critical distributed systems, and Service Oriented Architectures (SOA), and Cloud Systems.”
The following info sources often apply the above perspective and techniques since they generally serve to facilitate the timely development of high quality IoT Systems.
General Info Sources
The IEEE Internet of Things web site is full of IoT articles and links. You can join the IEEE IoT Technical Community for free by clicking this link.
In June 2015 the Industrial Internet Consortium (IIC) released its Industrial Internet Reference Architecture (IIRA) document. Click to download the document. It outlines the requirements and the conceptual system architecture needed to build industrial strength IoT systems. This is about a lot more than hooking up your toaster to the internet! The 5 founding members of IIC are AT&T, Cisco, GE, Intel, and IBM. Note that most of them have deep experience in distributed systems, SOA, and/or Cloud Systems.
Looking through the Reference Architecture document, especially the diagrams, will give you an idea of the potential complexity of industrial IoT Systems and how our knowledge of the software architecture of distributed systems, SOA, and Cloud Systems can be applied to effectively manage this complexity. Figures 7-2 and 7-3 on pages 38 and 39 show the software required for an entire IoT System. Note that the “things” account for less than a quarter of the Figure 7-2 diagram! And notice the software “Architecture Patterns” listed on page 37 that can be brought to bear to manage some of this complexity. And this is just a partial list of the potentially applicable software architecture patterns.
Please read Section 1.1 “Rationale and Context: The Industrial Internet” on page 9. It will give you an appreciation for how sweeping the “Industrial Internet Revolution” will be for our world over the next few decades. In my opinion, the Industrial IoT has the potential to revolutionize the way humans make things and supply basic services, with as much, or more, impact on society than the invention of the assembly line by Henry Ford in 1913 – provided we can effectively deal with the security and quality issues that can unravel complex systems.
Here is a realistic, sobering view on some of the security, safety, reliability, and quality issues with software based distributed IoT systems by William Buchanan — https://www.linkedin.com/pulse/iot-your-car-get-auto-update-features-william-buchanan. When designing and developing IoT systems, and any other distributed system software, you need to have a specific actionable plan to counteract the kind of problems listed in this article. If you do not have such a plan, and the ability to carry out that plan to the letter, then these sorts of software bugs will definitely occur.
Software Oriented Info Sources
Clemens Vasters, currently Lead Architect of Microsoft Azure IoT Services, is a good source from which to learn about IoT software architecture and security. Few people have such vast, deep experience with and knowledge of the challenges and techniques of building resilient, performant, scalable distributed systems using SOA. Mr. Vasters led the team that developed Microsoft’s internal AppFabric prototype and then turned it into a product Microsoft released in June 2010. This was the original component of what is now Microsoft’s Azure Cloud System. Development of what was AppFabric (and has become Azure) continued with Mr. Vasters as Principal Technical Lead. He subsequently served as Lead Architect of Azure Service Bus Messaging before taking his current position.
Since 2012 Mr. Vasters has published a wide variety of material on the Internet of Things, drawing on his deep experience with mission critical distributed systems, SOA and Cloud Systems. Also, he is Microsoft’s representative at the Industrial Internet Consortium, described above.
“Internet of Things — Using Microsoft Azure Service Bus for … Things!”, by Clemens Vasters in MSDN Magazine, June 2012. I consider this required reading for understanding the basics of the message-based software architecture required for industrial IoT applications.
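As a taste of that message-based style, here is a minimal Python sketch of the pattern: a device wraps its readings in self-describing messages and pushes them onto a queue, and the back end pulls them off and dispatches on the message type. An in-memory queue stands in for a real broker such as Azure Service Bus, and the message fields here are my own illustrative assumptions, not the article’s actual schema.

```python
import json
import queue

# Stand-in for a broker queue such as Azure Service Bus; in a real
# system the device and the back end would not share memory.
telemetry_queue = queue.Queue()

def device_send(device_id, temperature_c):
    """Device side: wrap a reading in a self-describing message."""
    msg = {"type": "telemetry", "deviceId": device_id,
           "temperatureC": temperature_c}
    telemetry_queue.put(json.dumps(msg))

def backend_receive():
    """Back-end side: pull one message and dispatch on its type."""
    msg = json.loads(telemetry_queue.get())
    if msg["type"] == "telemetry":
        return (msg["deviceId"], msg["temperatureC"])
    raise ValueError("unknown message type: " + msg["type"])

device_send("pi-01", 21.5)
print(backend_receive())  # -> ('pi-01', 21.5)
```

The value of the queue in the middle is decoupling: the device does not need the back end to be reachable at the instant it sends, which is exactly what intermittently connected “things” require.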
There is an 8-part series of 15-to-30-minute videos featuring Mr. Vasters in which he gives hands-on examples of the challenges to be met in hooking up a device (a Raspberry Pi) to the cloud. The links are listed below. I found those marked with “**” to be very useful. And please don’t miss the video “Why End-to-End Security Matters”.
Part 1 – Prototyping Platforms
** Part 2 – Pattern Overview and Commands with HTTP
** Part 3 – Safer Commands via a Cloud Gateway
Part 4 – Intermediated, Service Assisted Connectivity
Part 5 – Tunnel in Tunnel in Tunnel and Other Security Witchcraft
** Part 6 – Why End-to-End Security Matters
Part 8 – The Pi Code. OBDII to AMQP to Cloud
Clemens Vasters has other useful videos on IoT subjects in his Subscribe video blog and on Channel 9. He also has valuable IoT material in his written blogs as well: http://blogs.microsoft.com/iot/author/clemensvasters/ and http://blogs.msdn.com/b/clemensv/.
MSDN Magazine has published a number of informative software oriented articles on the Internet of Things and how Microsoft products enable development of end-to-end IoT Systems. For a list of such articles please Google “msdn magazine internet of things”. In particular, Bruno Terkaly has written a number of articles along these lines that I have found to be most useful.
Microsoft’s IoT Blogs offer articles on Microsoft IoT techniques and products, plus examples of some of the IoT Systems their customers have built. These articles are excellent sources for ideas on how IoT may apply to your needs, and on how to build IoT Systems.
AllJoyn® – “a collaborative open-source software framework that makes it easy for devices and apps to discover and communicate with each other.” Out of the box, Windows 10 IoT versions will include software that allows apps running on them to use AllJoyn’s capabilities. Windows 10 will be released in mid-2015. AllJoyn® combined with Windows 10 may prove to be a fast way to develop some IoT capabilities.
Microsoft Build 2015 IoT Videos and Articles
Every spring Microsoft hosts a large developer conference called Build, in which it shows the latest technologies being offered and new technologies it will introduce in the next year. Build 2015 was unique since Microsoft had an unusually large number of major technology areas with significant new offerings to be introduced in 2015 and 2016 – Windows 10, Service Fabric for Azure and Windows Server 2016, the Azure IoT Suite and other related IoT software support, plus many new Azure PaaS and IaaS features. Each of the hour-long sessions was videotaped and is available online at https://channel9.msdn.com/Events/build/2015. Here are a few of the IoT-related articles and videos from Build 2015 that I found most informative.
“Best Practices for Creating IoT Solutions with Azure” by Kevin Miller, Principal Program Manager for Azure IoT, has an excellent 2-page summary document that details 4 IoT best practices Microsoft has discovered in working with their customers. Don’t miss this little article. The Build 2015 video of Mr. Miller’s presentation on this topic also lays out Microsoft’s IoT product and service strategy – it introduces the Azure IoT Suite and the new IoT Hub service that will be made available this year. The IoT Hub is a highly integrated cloud service that supports most of the things one needs to do for IoT systems – provisioning and device management, data ingestion and analysis, security, etc. No longer will one have to build each of these areas from scratch. This is a very significant development.
Azure IoT Security — Clemens Vasters discusses IoT security in this must-see video. This article summarizes Mr. Vasters’ points on IoT security, plus has some good security links. But if you have the time, watch the video. It explains the key details that a short article cannot. The basic idea is “The most foundational principle for building secure IoT scenarios is to follow a defense-in-depth strategy, applying appropriate security measures at each layer – from the physical environment to digital hardware, and to the data that gets stored and collected”.
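To make one of those layers concrete, here is a hedged Python sketch of expiring, per-device credentials: a token signed with HMAC-SHA256 using a key known only to one device and the gateway, similar in spirit to the shared-access signatures used by cloud gateways. The token format and field layout below are my own illustrative assumptions, not any product’s actual wire format.

```python
import hashlib
import hmac
import time

def make_device_token(device_id, device_key, ttl_seconds=3600, now=None):
    """Sign an expiring token with a per-device key (HMAC-SHA256).
    A compromised key revokes only one device, not the whole fleet."""
    expiry = int((now if now is not None else time.time()) + ttl_seconds)
    payload = f"{device_id}\n{expiry}"
    sig = hmac.new(device_key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{device_id}|{expiry}|{sig}"

def verify_device_token(token, device_key, now=None):
    """Gateway side: check the signature, then check expiry."""
    device_id, expiry, sig = token.split("|")
    payload = f"{device_id}\n{expiry}"
    expected = hmac.new(device_key, payload.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # forged token or wrong per-device key
    return (now if now is not None else time.time()) < int(expiry)

key = b"per-device-secret"
token = make_device_token("pi-01", key)
print(verify_device_token(token, key))  # True
```

Note the constant-time comparison (`hmac.compare_digest`) and the short lifetime: both are small pieces of the defense-in-depth picture, not a substitute for transport security or a secured physical environment.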
Connecting Your Devices to the Azure IoT Suite — The details of the IoT Suite and IoT Hub can be seen in this video.
I hope these info sources aid you in understanding the key technologies and issues involved in developing industrial strength IoT systems. This understanding will:
- Speed development of your IoT Systems.
- Help you avoid unnecessary, costly mistakes.
- Increase IoT System quality, security, safety, and reliability.
- And, create highly satisfied customers.
The focus of this article is on the software and its structure required for end-to-end Internet-of-Things (IoT) systems. Despite its name, IoT is not just about the internet and “things”. Rather it is about systems – complex, end-to-end hardware and software systems consisting of numerous components collaborating to fulfill a common goal.
The software of end-to-end IoT Systems typically involves the following:
- Taking data from the “things”, putting it into storage, and perhaps analyzing it on the fly.
- Using Big Data to further analyze the data.
- Using data analysis results to determine when to send control commands back to the “things”, creating feedback loops.
- Presenting analysis results to people via User Interface technologies so they can make effective decisions.
- Plus keeping audit trails, ensuring there is proper security, provisioning and managing the “things”, and more.
Adopting such a systems perspective allows one to more accurately assess the amount of effort, time, money, and skill sets required to develop a given IoT System.
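The bullet list above can be sketched as a toy pipeline: ingest telemetry, store it, analyze it on the fly, and send a control command back to the “thing” when the analysis calls for it. This is only an illustration of the end-to-end shape; the threshold, the “throttle” command, and the in-memory lists standing in for cloud storage and the command channel are all my own assumptions.

```python
# Hypothetical end-to-end flow: ingest -> store -> analyze -> command.
store = []     # stand-in for cloud storage of raw telemetry
commands = []  # stand-in for the command channel back to devices

def ingest(device_id, reading):
    """Persist a reading, analyze recent history, and close the
    feedback loop by issuing a command when a threshold is crossed."""
    store.append((device_id, reading))                  # store raw data
    recent = [r for d, r in store if d == device_id][-3:]
    avg = sum(recent) / len(recent)                     # on-the-fly analysis
    if avg > 75.0:                                      # analysis drives control
        commands.append((device_id, "throttle"))        # feedback to the thing

for r in (70, 80, 90):
    ingest("pump-7", r)
print(commands)  # [('pump-7', 'throttle')]
```

Even this toy shows why the “things” are a minority of the system: most of the code is storage, analysis, and the feedback path, and a real system adds security, provisioning, audit trails, and UI on top.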
Despite the complexity involved in an end-to-end perspective, the good news is that when developing IoT Systems we can use all of the software structural (aka software architecture) knowledge we’ve gained over the past decade from developing mission critical distributed systems, Service Oriented Architectures (SOA), and Cloud Systems. We do not have to reinvent the wheel for IoT system software architecture and software. The structure of much of an IoT System’s software is like that of typical Service Oriented Apps and Hybrid Cloud Apps (apps with components running both in a data center and in a cloud). This is why I have included this article in the blog’s SO Apps series. Applying the software architecture knowledge from distributed systems, Service Oriented Architecture, and Cloud Systems can decrease the cost and Time-to-Market of developing IoT Systems, while at the same time increasing their quality, reliability, and safety.
Here are basic definitions of these terms to aid you in adopting this system perspective.
Definitions – Distributed System, SOA, Cloud Systems
IoT Systems are distributed systems by their very nature, defined by Wikipedia as “a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.” From http://en.wikipedia.org/wiki/Distributed_computing. We have been developing distributed systems for over 20 years, and we have learned a lot (see the book Enterprise Integration Patterns for examples).
During the past 10 years or so we have also learned that using Service Oriented Architecture makes developing and maintaining distributed systems much more time and cost effective. This translates to a faster Time-to-Market for both initial development and for post-release enhancements to systems. SOA is defined in Wikipedia as “an architectural pattern in computer software design in which application components provide services to other components via a communications protocol, typically over a network. The principles of service-orientation are independent of any vendor, product or technology.” From http://en.wikipedia.org/wiki/Service-oriented_architecture. We have learned a lot about effective ways to use SOA during this time (see the book SOA Patterns for examples).
Finally, all of today’s leading commercial Cloud Systems are built on top of distributed systems and use Service Oriented Architectures in their internal implementation. Cloud Systems are unique in that they are self-healing and self-scaling. In other words, when one of the computers in a cloud crashes, the Cloud System software (like Microsoft’s Azure) is smart enough to automatically detect the crash and move all the customer apps running on the crashed computer to another good computer. And when customer apps using cloud Platform-as-a-Service require more compute power, the Cloud System software is smart enough to automatically start up more of its internal computers and then move customer apps to the newly started computers to meet customer demand. The advanced distributed systems and SOA techniques used to build Cloud Systems that do these things can be applied to IoT Systems to enhance their security, safety, reliability, resiliency, and capacity scalability. You can see more references concerning some of the techniques and challenges of building Cloud Systems at https://dotnetsilverlightprism.wordpress.com/2014/03/16/build-cloud-apps-that-deliver-superior-business-value/.
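One small, widely used technique from this body of knowledge is retrying calls that hit a service’s short-lived “not available” responses, backing off exponentially between attempts while the cloud fabric heals itself in the background. The sketch below is a generic Python illustration, not any particular SDK’s API; the TransientError type, attempt count, and delay values are my own assumptions.

```python
import time

class TransientError(Exception):
    """Stand-in for a cloud service's short-lived 'not available' response."""

def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry an operation that may fail transiently, with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                           # persistent failure: give up
            sleep(base_delay * (2 ** attempt))  # back off: 0.1s, 0.2s, 0.4s, ...

# Simulated flaky cloud call: fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError()
    return "ok"

print(call_with_retry(flaky, sleep=lambda s: None))  # ok (on the third attempt)
```

The design point is the distinction between routine transient failures, which the caller absorbs by retrying, and persistent failures, which must be surfaced so the system can fail over or alert an operator.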
Please see the subsequent article in this SO Apps series for a list of IoT System Info Sources that I’ve found helpful. To be published in July 2015.
I hope the systems perspective presented above, coupled with modern service oriented software architecture patterns and practices, will aid you in developing highly useful IoT Systems that can be delivered on schedule, on budget, and on quality.
