Why Testing Needs Explainable Artificial Intelligence April 19, 2021
Posted by Peter Varhol in Algorithms, Machine Learning, Software development. Tags: Algorithms, explainable artificial intelligence, Machine Learning, neural networks, testing, XAI
Many artificial intelligence/machine learning (AI/ML) applications produce results that are not easily understandable from their training and input data. This is because these systems are largely black boxes that use multiple algorithms (sometimes hundreds) to process data and return a result. Tracing how this data is processed, in mathematical algorithms, is an impossible task for a person.
Further, these algorithms were “trained,” or adjusted, based on the data used as the foundation of learning. What is really happening is that the data adjusts the algorithms to reflect what we already know about the relationship between inputs and outputs. In other words, we are doing a very complex type of nonlinear regression, without any inherent knowledge of a causal relationship between inputs and outputs.
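The regression analogy can be made concrete with a toy sketch. This is not how any production ML framework trains, just a minimal illustration of the point: training nudges parameters until outputs match known data, with no notion of why the relationship holds.

```python
# Toy illustration: "training" adjusts parameters so outputs match known
# input/output pairs. Nothing here encodes causality, only error reduction.

def train(data, steps=5000, lr=0.01):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            err = (w * x + b) - y
            gw += 2 * err * x
            gb += 2 * err
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

# The data happens to follow y = 3x + 1; the model "learns" that
# relationship without knowing why it holds.
data = [(0, 1), (1, 4), (2, 7), (3, 10)]
w, b = train(data)
print(round(w, 2), round(b, 2))  # close to 3.0 and 1.0
```

A real neural network does the same thing at vastly larger scale, which is exactly why the result resists tracing by hand.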
At worst, the outputs from AI systems can sometimes seem nonsensical, based on what is known about the problem domain. Yet because those outputs come from software, we are inclined to trust them and apply them without question. Maybe we shouldn’t.
But it can be more subtle than that. The results could reflect a systemic bias that makes outputs seem correct, or at least plausible, when they are not, or at least not ethically right. And users rarely have recourse to question the outputs of what is, to them, a black box.
This is where explainable AI (XAI) comes in. In cases where the relationship between inputs and outputs is complex and not especially apparent, users need the application to explain why it delivered a certain output. It’s a matter of trusting the software to do what we think it is doing. Ethical AI also plays into this concept.
So how does XAI work? There is a long way to go here, but a couple of techniques show some promise. XAI operates on the principles of transparency, interpretability, and explainability. Transparency means that we need to be able to look into the algorithms to clearly discern how they are processing input data. While that may not tell us how those algorithms were trained, it provides insight into the path to the results, and it is intended for interpretation by the design and development team.
Interpretability is how the results might be presented for human understanding. In other words, if you have an application and are getting a particular result, you should be able to see and understand how that result was achieved, based on the input data and processing algorithms. There should be a logical pathway between data inputs and result outputs.
Explainability remains a vague concept while researchers try to define exactly how it might work. We might want to support queries into our results, or to get detailed explanations into more specific phases of the processing. But until there is better consensus, this feature remains a gray area.
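One simple form of interpretability is a model whose prediction decomposes into per-feature contributions, so the "logical pathway" from inputs to output is visible. The sketch below uses invented feature names and weights purely for illustration; real interpretability tooling is far more involved.

```python
# A minimal sketch of interpretability: a linear scorer whose output
# decomposes into per-feature contributions. The feature names and
# weights are illustrative, not from any real system.

WEIGHTS = {"income": 0.5, "debt": -0.8, "years_employed": 0.3}

def score(applicant):
    contributions = {f: WEIGHTS[f] * applicant[f] for f in WEIGHTS}
    return sum(contributions.values()), contributions

total, why = score({"income": 4.0, "debt": 2.0, "years_employed": 5.0})
print(round(total, 2))  # 1.9
# Show the contributions, largest magnitude first, as an "explanation".
for feature, c in sorted(why.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature}: {c:+.2f}")
```

A deep network offers no such decomposition out of the box, which is why interpretability has to be designed in rather than bolted on.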
The latter two characteristics are more important to testers and users. How you do this depends on the application. Facial recognition software can usually be built to describe facial characteristics and how they match up to values in an identification database. It becomes possible to build at least interpretability into the software.
But interpretability and explainability are not as easy when the problem domain is more ambiguous. How can we interpret an e-commerce recommendation that may or may not have anything to do with our product purchase? I have received recommendations on Amazon that clearly bear little relationship to what I have purchased or examined, so we don’t always have a good path between source and destination.
So how do we implement and test XAI?
Where Testing Gets Involved
Testing AI applications tends to be very different than testing traditional software. Testers often don’t know what the right answer is supposed to be. XAI can be very helpful in that regard, but it’s not the complete answer.
Here’s where XAI can help. If the application is developed and trained in a way where the algorithms show their steps in going from problem to solution, then we have something that is testable.
Rule-based systems can make it easier, because the rules form a big part of the knowledge. In neural networks, however, the algorithms rule, and they bear little relationship to the underlying intelligence. But rule-based intelligence is much less common today, so we have to go back to the data and algorithms.
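The advantage of rule-based systems can be sketched in a few lines: each rule that fires can be recorded, giving a human-readable trace from input to output. The rules below are invented for illustration.

```python
# Sketch of why rule-based systems are easier to explain: every rule
# that fires is logged, so the path from input to output is auditable.

RULES = [
    ("temperature > 100", lambda d: d["temperature"] > 100, "overheat alarm"),
    ("pressure > 50",     lambda d: d["pressure"] > 50,     "pressure alarm"),
]

def evaluate(data):
    trace, outputs = [], []
    for name, condition, action in RULES:
        if condition(data):
            trace.append(f"rule '{name}' fired -> {action}")
            outputs.append(action)
    return outputs, trace

outputs, trace = evaluate({"temperature": 120, "pressure": 30})
print(outputs)  # ['overheat alarm']
print(trace)    # ["rule 'temperature > 100' fired -> overheat alarm"]
```

A neural network has no equivalent of this trace; its "rules" are distributed across thousands of weights.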
Testers often don’t have control over how AI systems work to create results. But they can delve deeply into both data and algorithms to come up with ways to understand and test the quality of systems. It should not be a black box to testers or to users. How do we make it otherwise?
Years ago, I wrote a couple of neural network AI applications that simply adjusted the algorithms in response to training, without any insight on how that happened. While this may work in cases where the connection isn’t important, knowing how our algorithms contribute to our results has become vital.
Sometimes AI applications “cheat”, using cues that do not accurately reflect the knowledge within the problem domain. For example, it may be possible to facially recognize people, not through their characteristics, but through their surroundings. You may have data to indicate that I live in Boston, and use the Boston Garden in the background as your cue, rather than my own face. That may be accurate (or may not be), but it’s not facial recognition.
A tester can use an XAI application here to help tell the difference. That’s why developers need to build in this technology. But testers need deep insight into both the data and the algorithms.
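One concrete way a tester can probe for this kind of cheating is a metamorphic test: change only the background and check that the identification is unchanged. The model below is a deliberately broken stub that keys on the background, so the test catches it; `recognize` is hypothetical, not a real API.

```python
# Metamorphic test for background "cheating": the same face against two
# different backgrounds should yield the same identification.

def recognize(face, background):
    # Broken on purpose: this stub uses the background as its cue,
    # which is exactly the failure mode described above.
    return "peter" if background == "boston_garden" else "unknown"

def background_invariant():
    at_home = recognize("peter_face", "boston_garden")
    elsewhere = recognize("peter_face", "airport")
    return at_home == elsewhere  # should hold for real facial recognition

print(background_invariant())  # False -- the model cheated
```

Without XAI, a tester can only observe that the invariant fails; with it, the tester can also see *why* the background drove the result.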
Overall, a human in the loop remains critical. Unless someone looks critically at the results, they can be wrong, and quality will suffer.
There’s no one correct answer here. Instead, testers need to be intimately involved in the development of AI applications, and insist on explanatory architecture. Without that, there is no way of comprehending the quality that these applications need to deliver actionable results.
Should Testers Learn to Code? The Definitive Answer September 23, 2020
Posted by Peter Varhol in Software development, Strategy. Tags: Agile, coding, testing
I came across this fundamental question yet again today, and am long since weary of reading answers. Those who ask the question are predisposed to a particular answer (almost always in the affirmative). Tired of the mountains of answers that are there to be climbed, I decided to cogitate on the definitive answer for all time, to bury this question in the sands of time.
Before my answer sends gravity ripples across Known Space, let me say that I like to take contrarian viewpoints when possible. A few years ago, my friends at TestRail published a blog post on the topic, and I responded with my own post entitled “Should Coders Learn to Test?” Regrettably, my levity was not well received.
The answer to the fundamental question, however, is yes. It’s clear that in a world without constraints, testers should learn to code. More knowledge is always better than less, even if the value of that knowledge is indeterminate at the time it is acquired.
But we live in a world of constraints, whether it be time, other alternatives, inclination, aptitude, or other. These constraints are almost always a deciding factor on the actions we take in navigating our professional lives. How we respond to those constraints defines the directions we take at various points in life.
Is it possible that not learning to code can have detrimental effects on testers in both the short and long term? If team and management expectations are that testers provide the types of error detection and analysis that assume an in-depth knowledge of the code, then does not having those skills, or the inclination to acquire them, penalize their standing with the team and their career prospects?
But it is just as possible that other knowledge could be just as effective in project and career success. Testers may be experts at the domain, and be able to offer invaluable advice on how the software will really be used; they may write the best documentation; or have the best problem-solving skills. Yet culturally we denigrate them because they can’t code?
I’ll explore that thought later in more detail, but it occurs to me that we as software professionals are more or less stuck in an Agile-ish way of thinking about our projects. We bandy about terms like Scrum, sprints, product owner, Jira, and retrospective as if they magically convey a skill and efficiency on the team that was not there in the past. And we truly believe that Agile team members don’t specialize; we are all just team members, which enables us to do any task required by the project. I would like to question that assumption.
I casually follow American football during the season, and listen to the talking heads praise coaches (Scrum masters???) such as the New England Patriots’ Bill Belichick for adapting to the strengths of his individual players, and designing offensive and defensive strategies that take advantage of those strengths. In contrast, many other coaches seem to fixate on their preferred “systems”, remaining in their comfort zones and forcing the players to adapt to their preferences. In many cases, those coaches don’t seem to last very long.
Continuing that analogy, can we adapt our project teams to take advantage of the strengths of individual team members, rather than always force them into an Agile methodology? Perhaps we need to study more about team structure and interpersonal dynamics rather than how to properly formulate and carry out Scrum. Let’s use our people’s strengths, rather than simply use our people.
<To be continued>
Automation Can Be Dangerous December 6, 2018
Posted by Peter Varhol in Software development, Software tools, Strategy, Uncategorized. Tags: aircraft, automation, testing
Boeing has a great way to prevent aerodynamic stalls in their 737 MAX aircraft. A set of sensors determines, through airspeed and angle of attack, that an aircraft is about to stall (that is, lose lift on its wings), and the system automatically pitches the nose down to recover.
Apparently malfunctioning sensors on Lion Air Flight 610 caused the aircraft’s nose to pitch down sharply absent any indication of a stall. Preliminary analysis indicates that the pilots were unable to overcome the nose-down attitude, and the aircraft dove into the sea. Boeing’s solution to this automation fault was explicit, even if its documentation wasn’t: turn off the system.
And this is what the software developers, testers, and their bosses don’t get. Everyone thinks that automation is the silver bullet. Automation is inherently superior to manual testing. Automation will speed up testing, reduce costs, and increase quality. We must have more automation engineers, and everyone not an automation engineer should just go away now.
There are many lessons here for software teams. Automation is great when consistency in operation is required. Automation will execute exactly the same steps until the cows come home. That’s a great feature to have.
But many testing activities are not at all about consistency in operation. In fact, relatively few are. It would be good for smoke tests and regression tests to be consistent. Synthetic testing in production also benefits from automation and consistency.
Other types of testing? Not so much. The purpose of regression testing, smoke testing, and testing in production is to validate the integrity of the application, and to make sure nothing bad is currently happening. Those are valid goals, but they are only the start of testing.
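The consistency-friendly checks described above can be sketched in a few lines. The check names and the build dictionary below are illustrative stand-ins for whatever a real smoke suite would probe.

```python
# A minimal smoke-test sketch: the kind of check that benefits from
# automation precisely because it should run identically on every build.

def smoke_test(app):
    checks = {
        "starts": app.get("status") == "running",
        "db_connected": app.get("db") == "ok",
        "version_present": bool(app.get("version")),
    }
    # Return the names of any failed checks; an empty list means pass.
    return [name for name, ok in checks.items() if not ok]

build = {"status": "running", "db": "ok", "version": "2.1.0"}
print(smoke_test(build))  # [] -- all checks pass
```

The value here is repetition without drift: the identical checks run on build one thousand exactly as they did on build one.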
Instead, testing is really about individual users and how they interact with an application. Every person does things on a computer just a little differently, so it behooves testers to do the same. This isn’t harkening back to the days of weeks or months of testing, but rather acknowledging that the purpose of testing is to ensure an application is fit for use. Human use.
And sometimes, whether through fault or misuse, automation breaks down, as in the case of the Lion Air 737. And teams need to know what to do when that happens.
Now, when you are deploying software perhaps multiple times a day, it seems like it can take forever to sit down and actually use the product. But remember the thousands more who are depending on the software and the efforts that go behind it.
In addition to knowing when and how to use automation in software testing, we also need to know when to shut it off, and use our own analytical skills to solve a problem. Instead, all too often we shut down our own analytical skills in favor of automation.
We Forget What We Don’t Use April 17, 2018
Posted by Peter Varhol in Software platforms, Strategy. Tags: aircrews, DevOps, IT, testing
Years ago, I was a pilot. SEL, as we said: single-engine land. Once during my instruction, we spent about an hour going over what my instructor called recovery from unusual attitudes. I went “under the hood,” putting on a plastic device that blocked my vision while he placed the plane in various situations. Then he would lift the hood just enough that I could see only the instruments.
I became quite good at this, focusing on two instruments – turn and bank, and airspeed. Based on these instruments, I was able to recover to straight and level flight within seconds.
My instructor realized what I was doing, and he was a lot smarter than I was. The next time, my shortcut didn’t work; it made things worse, actually. I panicked, and in a real-life scenario I may well have crashed.
Today, I have a presentation I generically call “What Aircrews Can Teach IT” (the title changes based on the audience makeup). It is focused on Crew Resource Management, a structured way of working and communicating so that responsibilities are understood and concerns are voiced.
But there is more that aircrews can teach us. We panic when we have not seen a situation before. Aircrews do too. That’s why they practice, in a simulator, with a check pilot, hundreds of hours a year. That’s why we have few commercial airline accidents today. When accidents do happen, it is almost always because of crew error, often because the crew is unfamiliar with the situation.
It’s the same in IT. If we are faced with a situation we haven’t encountered before, chances are we will react emotionally and incorrectly to it. The consequences may not be a fatal accident, but we can still do better.
I preach situational awareness in all aspects of life. We need to understand our surroundings, pay attention to people and events that may affect us, and in general be prepared to react based on our reading of a situation.
In many professional jobs, we’ve forgotten about the value of training. I don’t mean going to a class; I mean practicing scenarios, again and again, until they become second nature. That’s what aircrews do. And that’s what soldiers do. And when we have something on the line, that is more valuable than anything else we could be doing. And eventually it will pay off.
SpamCast on Machine Learning September 20, 2017
Posted by Peter Varhol in Software platforms. Tags: Machine Learning, testing
Not really spam, of course, but Software Process and Measurement, the weekly podcast from Tom Cagley, who I met at the QUEST conference this past spring. This turned out surprisingly well, and Tom posted it this past weekend. If you have a few minutes, listen in. It’s a good introduction to machine learning and the issues of testing machine learning systems, as well as skills needed to understand and work with these systems. http://spamcast.libsyn.com/spamcast-460-peter-varhol-machine-learning-ai-testing-careers
The Human In the Loop September 19, 2017
Posted by Peter Varhol in Software development, Strategy, Technology and Culture. Tags: software failures, testing
A couple of years ago, I did a presentation entitled “Famous Software Failures”. It described six events in history where poor quality or untested software caused significant damage, monetary loss, or death.
It was really more about system failures in general, or about the interaction between hardware and software. And ultimately it was about learning from these failures to help prevent future ones.
I mention this because the protagonist in one of these failures passed away earlier this year: Stanislav Petrov, the Soviet military officer who declined to report a launch of five ICBMs from the United States, as reported by Soviet defense systems. Believing that a real American offensive would involve many more missiles, Lieutenant Colonel Petrov refused to acknowledge the threat as legitimate and contended to his superiors that it was a false alarm (he was reprimanded for his actions, incidentally, and permitted to retire at his then-current rank). The false alarm had been created by a rare alignment of sunlight on high-altitude clouds above North Dakota.
There is also a novel by Daniel Suarez, entitled Kill Decision, that postulates the rise of autonomous military drones that are empowered to make a decision on an attack without human input and intervention. Suarez, an outstanding thriller writer, writes graphically and in detail of weapons and battles that we are convinced must be right around the next technology bend, or even here today.
As we move into a world where critical decisions have to be made instantaneously, we cannot underestimate the value of the human in the loop. Whether the decision is made with a focus on logic (“They wouldn’t launch just five missiles”) or emotion (“I will not be remembered for starting a war”), it puts any decision in a larger and far more real context than a collection of anonymous algorithms.
The human can certainly be wrong, of course. And no one person should be responsible for a decision that can cause the death of millions of people. And we may find ourselves outmaneuvered by an adversary who relies successfully on instantaneous, autonomous decisions (as almost happened in Kill Decision).
As algorithms and intelligent systems become faster and better, human decisions aren’t necessarily needed or even desirable in a growing number of split-second situations. But while they may be pushed to the edges, human decisions should not be pushed entirely off the page.
Coding Error or Testing Error? July 12, 2016
Posted by Peter Varhol in Publishing. Tags: Pokemon, testing
Today the news is full of the revelation that the mobile game Pokemon Go takes full permissions if you sign in using your Google account. Niantic, the game’s developer, acknowledged this and said it was merely a coding error. Neither Niantic nor Nintendo had any plans for full access to anyone’s Google account, and apparently a patch is being prepared to fix the error.
I am copacetic with the explanation (I am not a gamer; if I were, I might feel otherwise), and I see how it can happen. But is it an error in coding, or an error in testing? I call it a testing error; here’s why.
Developers do whatever it takes to make an application functional. On more than one occasion in the past (I no longer code, except for fun), I have allocated too much memory, declared too many variables, and kept objects alive too long in order to ensure that the application works as I expect it to. My job is to get the application and its features working.
Grabbing too many permissions is similar. I have seen teams bypass security restrictions because that seemed to be the only way to add the needed functionality. For example, an application may require that the user account have local admin privileges. Are these required? Probably not, but in many cases local user privileges didn’t work. Rather than diagnose why, the developers simply opened everything up. Their job is to get it working, after all.
I don’t know if Niantic has a formal testing program driven by professional testers, but this sort of problem is all too common. And testers need to step up and take responsibility for issues that fall into the category of access rights and security.
Granted, this isn’t functional UI testing, which even today many testers believe represents the scope of their responsibilities. Few testers look for permissions issues, and these are almost never discovered on development or testing computers, which typically have full permissions.
But they should. I was in a development lab that lost a major enterprise sale because our software required local machine admin rights in order to install. That enterprise didn’t give any staff employees local machine admin rights, and simply installing our software would have required an IT person to go around to hundreds of computers to adjust permissions.
This is the sort of thing that testing is all about. Understand your customer. And by your customer, I also mean the organizations you are selling into. Test the login process, not only to make sure it works, but also to make sure it doesn’t create a security failing.
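A check like the one that would have caught that lost sale can be automated. The sketch below verifies that an application's configured write locations fall under the user's home directory rather than system paths; the path names are illustrative, not from any real product.

```python
# A sketch of a permissions check a tester could automate: flag any
# configured write path that lies outside the user's home directory,
# since writing there typically requires admin rights.

from pathlib import PurePosixPath

def writes_need_admin(write_paths, home):
    """Return the paths that fall outside the user's home directory."""
    home = PurePosixPath(home)
    return [str(p) for p in map(PurePosixPath, write_paths)
            if home not in p.parents]

paths = ["/home/user/.myapp/config", "/etc/myapp/settings"]
print(writes_need_admin(paths, "/home/user"))  # ['/etc/myapp/settings']
```

Run against a locked-down test account rather than a developer machine with full permissions, a check like this surfaces exactly the class of problem described above before a customer does.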
Yes, this is a failure of testing. To define this as a coding error is misleading. A competent and curious tester should have caught this before it went out the door. If you are missing this kind of problem, it behooves you to rethink how you do your job.
Testing and the Aftermath of Brussels March 26, 2016
Posted by Peter Varhol in Software development. Tags: terror, testing
My collaborator Gerie Owen and I had the honor and privilege of speaking twice at the Belgium Testing Days conference. We are very familiar with the areas that suffered from indiscriminate terrorist bombings earlier this week, and our hearts go out to the victims and survivors.
Much can be said about this. I will limit my comments to its impact on the testing community in Belgium and in Europe in general. While the organizers of Belgium Testing Days have confirmed to us that they and their loved ones are well and safe, it seems like any organization that is trying to bring people together in Belgium, and to perhaps a lesser extent in other parts of Europe, is at risk.
And that sets back the testing community immensely. One of the important things that we do as a profession is to gather people of similar interest and enthusiasm to exchange knowledge and ideas to advance the field. It has been my perception that Belgium Testing Days was an important part of that exchange in the European testing community.
No matter where we work or live, we are influenced by outside events. In this case, these events can stunt the growth of our community, and the professional development of individuals.
Gathering in large groups may not be what we want to do right now, because of the risk of being involved in a terror attack. We all make our own decisions, of course. But I am going to continue to participate in testing and DevOps conferences, in Europe and beyond.



