{"id":522,"date":"2021-02-06T20:41:56","date_gmt":"2021-02-06T20:41:56","guid":{"rendered":"https:\/\/thinkingtester.com\/?p=522"},"modified":"2021-02-06T20:41:56","modified_gmt":"2021-02-06T20:41:56","slug":"reliability-engineering","status":"publish","type":"post","link":"https:\/\/thinkingtester.com\/reliability-engineering\/","title":{"rendered":"Reliability Engineering"},"content":{"rendered":"\n<p>Imagine that you are working on a team that is creating a new feature that allows users to submit and watch videos.  The application uses a third-party service, which we&#8217;ll call Encodurz, that encodes the videos that are uploaded to your application.  You work hard to test the feature, making sure that the UI is flawless, that the user journeys make sense, and that the videos play correctly, among many other things.  <\/p>\n\n\n\n<p>It&#8217;s time for your new feature to be released to the world, and you&#8217;re excited!  But on the day of the release, Encodurz has an outage.  No videos can be encoded, so none of the uploaded videos will display!  Your users don&#8217;t know that it&#8217;s not your fault; all they see is that the new feature doesn&#8217;t work.  They call and complain to customer service and they post complaints on Twitter.  
<em>This<\/em> is why reliability engineering is important!<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/thinkingtester.com\/wp-content\/uploads\/2021\/02\/laptop-5906264_640.png\" alt=\"\" class=\"wp-image-523\" width=\"226\" height=\"140\" srcset=\"https:\/\/thinkingtester.com\/wp-content\/uploads\/2021\/02\/laptop-5906264_640.png 640w, https:\/\/thinkingtester.com\/wp-content\/uploads\/2021\/02\/laptop-5906264_640-300x186.png 300w, https:\/\/thinkingtester.com\/wp-content\/uploads\/2021\/02\/laptop-5906264_640-600x371.png 600w\" sizes=\"auto, (max-width: 226px) 100vw, 226px\" \/><\/figure><\/div>\n\n\n\n<p><strong>Reliability engineering<\/strong> focuses on the ability of applications to be as available as possible.  It aims to offer a good user experience, even in the following situations:<br>* The server goes down<br>* The database is unavailable<br>* An API that your application relies on is unavailable<br>* A third-party provider that your application depends on is unavailable<\/p>\n\n\n\n<p>Do you know how your application will behave in those scenarios?  If not, it sounds like it&#8217;s a good time to test those scenarios!  There are two ways to test:<\/p>\n\n\n\n<p><strong>Bring the service down<\/strong><br>You can bring a server down by unplugging it, but chances are your server is not nearby for you to do that.  But you can also bring a server, a webservice, or a database down by shutting it down using scripted commands.  If you don&#8217;t have permission to do that, you can find someone in DevOps at your company who has the correct permissions.  <\/p>\n\n\n\n<p><strong>Change your connection strings<\/strong><br>It&#8217;s really easy to simulate an outage of a server, a database, an API or any other service your application depends on simply by changing the way your application connects to it.  
For example, if your app connects to your company database with a username and password, all you need to do is send in a bad password.  Or you could change the URI needed for the connection so that it&#8217;s incorrect.<\/p>\n\n\n\n<p>NOTE that it is a bad idea to try either of the above strategies in Production, at least until your application is VERY resilient.  You will want to do this testing in your test environment.  <\/p>\n\n\n\n<p>Once you have discovered what happens when your application or one of its dependencies has an outage, it&#8217;s time to make your app as resilient as possible.  Here are seven strategies for doing that:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Use a &#8220;circuit-breaker&#8221;<\/strong><br>This method puts logic in the code that tries connecting to a resource a few times, and when it repeatedly fails to connect, stops sending requests there for a while and switches over to a different resource.  For example, if your application usually points to Server A, and you have a backup server called Server B, when the circuit-breaker is tripped the connection changes over to Server B.<\/li><li><strong>Use retries<\/strong><br>Sometimes a third-party app will fail temporarily for an unknown reason.  You don&#8217;t want your request to the app to fail and never try again.  So you can build in some retries; perhaps if the request fails, you wait 30 seconds and try again; if it fails again, you wait 60 seconds and try again; and so on.  You don&#8217;t want to retry indefinitely; instead, set some sort of time limit so that if the request still hasn&#8217;t succeeded when the limit is reached, an error is returned.<\/li><li><strong>Use cached data<\/strong><br>It&#8217;s a great idea to have some kind of caching service that will be able to serve up data if a request fails for some reason.  
When the request fails, your application just grabs the slightly-stale cached data and returns that instead.<\/li><li><strong>Enter into read-only mode<\/strong><br>If your application detects that there&#8217;s a problem writing to a data source, you can configure it to go into a read-only mode so that your users can at least see their data.  You should set a message to display when this is the case to explain to users why they can&#8217;t update their data at the moment.  <\/li><li><strong>Provide messaging that something isn&#8217;t quite right<\/strong><br>It is so annoying to get a cryptic error like &#8220;Error: T-128556&#8221; when using an application.  That&#8217;s not helpful at all!  Instead, provide your users with as much detail as you can about what&#8217;s wrong.  In the example at the beginning of this post, there could have been a message that read &#8220;Sorry, we are having an issue connecting to our video encoding software at the moment.  Please try again in a few minutes.&#8221;<\/li><li><strong>Have a status page that explains what is going on<\/strong><br>If your application goes down completely or is very degraded, it&#8217;s a great idea to have a status page (hosted on a <em>different<\/em> server) that provides a way to communicate to your end users what&#8217;s happening.  You could include a timestamp with your time zone, a list of the features that are affected, and a message about what&#8217;s going wrong.  Then you can keep the status page updated at regular intervals until the problem is fixed.  <\/li><li><strong>After the problem has ended, do a post-mortem to see what lessons you&#8217;ve learned<\/strong><br>If the outage was very brief or only affected a few users, you might not need to make the post-mortem public.  But if it was a big outage, it&#8217;s a great idea to communicate to customers what happened, why it happened, and how you are going to prevent it in the future.  
<a rel=\"noreferrer noopener\" href=\"https:\/\/slack.engineering\/slacks-outage-on-january-4th-2021\/\" data-type=\"URL\" data-id=\"https:\/\/slack.engineering\/slacks-outage-on-january-4th-2021\/\" target=\"_blank\">See Slack&#8217;s message about their January 4th outage<\/a> for a really good example of this.  <br><\/li><\/ol>\n\n\n\n<p>No application can run perfectly 100% of the time; servers are imperfect and our apps are almost always dependent upon outside forces.  But it&#8217;s important to know exactly what will happen when services are down and to figure out the best ways to respond to those issues <em>before<\/em> they happen.  As testers, we can encourage our team to participate in this process.  <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Imagine that you are working on a team that is creating a new feature that allows users to submit and watch videos. The application uses a third-party service, which we&#8217;ll call Encodurz, that encodes the videos that are uploaded to your application. 
You work hard to test the feature; making sure that the UI is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"hide_page_title":"","footnotes":""},"categories":[],"tags":[],"class_list":["post-522","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/posts\/522","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/comments?post=522"}],"version-history":[{"count":8,"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/posts\/522\/revisions"}],"predecessor-version":[{"id":531,"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/posts\/522\/revisions\/531"}],"wp:attachment":[{"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/media?parent=522"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/categories?post=522"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thinkingtester.com\/wp-json\/wp\/v2\/tags?post=522"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}