Why Testing Needs Explainable Artificial Intelligence April 19, 2021

Posted by Peter Varhol in Algorithms, Machine Learning, Software development.

Many artificial intelligence/machine learning (AI/ML) applications produce results that are not easily understandable from their training and input data.  This is because these systems are largely black boxes that use multiple algorithms (sometimes hundreds) to process data and return a result.  Tracing how this data is processed through those mathematical algorithms is an impossible task for a person.

Further, these algorithms were “trained” or adjusted based on the data used as the foundation of learning.  What is really happening there is that the data is adjusting algorithms to reflect what we already know about the relationship between inputs and outputs.  In other words, we are doing a very complex type of nonlinear regression, without any inherent knowledge of a causal relationship between inputs and outputs.

At worst, the outputs from AI systems can sometimes seem nonsensical, based on what is known about the problem domain.  Yet because those outputs come from software, we are inclined to trust them and apply them without question.  Maybe we shouldn’t.

But it can be more subtle than that.  The results could embody a systemic bias that makes outputs seem correct, or at least plausible, when they are not, or at least are not ethically right.  And users rarely have recourse to question the outputs, making the system a black box.

This is where explainable AI (XAI) comes in.  In cases where the relationship between inputs and outputs is complex and not especially apparent, users need the application to explain why it delivered a certain output.  It’s a matter of trusting the software to do what we think it is doing.  Ethical AI also plays into this concept.

So how does XAI work?  There is a long way to go here, but a couple of techniques show some promise.  XAI operates on the principles of transparency, interpretability, and explainability.  Transparency means that we need to be able to look into the algorithms to clearly discern how they are processing input data.  While that may not tell us how those algorithms are trained, it provides insight into the path to the results, and is intended for interpretation by the design and development team.

Interpretability is how the results might be presented for human understanding.  In other words, if you have an application and are getting a particular result, you should be able to see and understand how that result was achieved, based on the input data and processing algorithms.  There should be a logical pathway between data inputs and result outputs.
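To make the idea of a logical pathway concrete, here is a minimal sketch of interpretability for a simple linear scoring model.  The feature names, weights, and input values are hypothetical, invented purely for illustration; real ML models are far more complex, but the principle of attributing the output back to each input is the same.

```python
# A minimal interpretability sketch: attribute a linear model's score
# back to each input feature, so a user can see how the result was reached.
# Feature names, weights, and inputs below are hypothetical.

def explain_prediction(weights, inputs):
    """Return the final score and each feature's contribution to it,
    giving a logical pathway from data inputs to result outputs."""
    contributions = {name: weights[name] * inputs[name] for name in weights}
    score = sum(contributions.values())
    return score, contributions

weights = {"age": 0.5, "income": 0.3, "debt": -0.8}
inputs = {"age": 2.0, "income": 3.0, "debt": 1.0}
score, contributions = explain_prediction(weights, inputs)
# Each contribution shows how much that feature pushed the score up or down,
# and the contributions add up exactly to the final score.
```

For a linear model this attribution is exact; for deep networks, techniques in this spirit approximate the model locally rather than decomposing it precisely.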

Explainability remains a vague concept while researchers try to define exactly how it might work.  We might want to support queries into our results, or to get detailed explanations into more specific phases of the processing.  But until there is better consensus, this feature remains a gray area.

The latter two characteristics are more important to testers and users.  How you do this depends on the application.  Facial recognition software can usually be built to describe facial characteristics and how they match up to values in an identification database.  It becomes possible to build at least interpretability into the software.

But interpretability and explainability are not as easy when the problem domain is more ambiguous.  How can we interpret an e-commerce recommendation that may or may not have anything to do with our product purchase?  I have received recommendations on Amazon that clearly bear little relationship to what I have purchased or examined, so we don’t always have a good path between source and destination.

So how do we implement and test XAI? 

Where Testing Gets Involved

Testing AI applications tends to be very different than testing traditional software.  Testers often don’t know what the right answer is supposed to be.  XAI can be very helpful in that regard, but it’s not the complete answer.

Here’s where XAI can help.  If the application is developed and trained in a way where algorithms show their steps in coming from problem to solution, then we have something that is testable.
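What “showing its steps” might look like can be sketched with a toy rule-based classifier that records a trace of its decision path.  The rules and labels here are invented for illustration; the point is that a tester can assert on the path the system took, not just the answer it gave.

```python
# A sketch of a testable decision process: a hypothetical rule-based
# classifier that records which rules it evaluated and which one fired.
# Rules, conditions, and labels are invented for illustration.

def classify_with_trace(sample, rules):
    """Apply rules in order; return the label plus a trace of every
    rule evaluated, so the path to the result is visible and testable."""
    trace = []
    for name, condition, label in rules:
        fired = condition(sample)
        trace.append((name, fired))
        if fired:
            return label, trace
    return "unknown", trace

rules = [
    ("high_debt",  lambda s: s["debt"] > 0.5,   "deny"),
    ("low_income", lambda s: s["income"] < 1.0, "deny"),
    ("default",    lambda s: True,              "approve"),
]
label, trace = classify_with_trace({"debt": 0.2, "income": 2.0}, rules)
# A tester can now verify not only that label == "approve", but also
# that neither deny rule fired along the way.
```

A neural network offers no such built-in trace, which is exactly why explanatory architecture has to be designed in rather than bolted on.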

Rule-based systems can make it easier, because the rules form a big part of the knowledge.  In neural networks, however, the algorithms rule, and they bear little relationship to the underlying intelligence.  But rule-based intelligence is much less common today, so we have to go back to the data and algorithms.

Testers often don’t have control over how AI systems work to create results.  But they can delve deeply into both data and algorithms to come up with ways to understand and test the quality of systems.  It should not be a black box to testers or to users.  How do we make it otherwise?

Years ago, I wrote a couple of neural network AI applications that simply adjusted the algorithms in response to training, without any insight on how that happened.  While this may work in cases where the connection isn’t important, knowing how our algorithms contribute to our results has become vital.

Sometimes AI applications “cheat”, using cues that do not accurately reflect the knowledge within the problem domain.  For example, a system may identify people not through their facial characteristics, but through their surroundings.  You may have data to indicate that I live in Boston, and use the Boston Garden in the background as your cue, rather than my own face.  That may be accurate (or may not be), but it’s not facial recognition.

A tester can use an XAI application here to help tell the difference.  That’s why developers need to build in this technology.  But testers need deep insight into both the data and the algorithms.

Overall, a human in the loop remains critical.  Unless someone is looking critically at the results, they can be wrong, and quality will suffer.

There’s no one correct answer here.  Instead, testers need to be intimately involved in the development of AI applications, and insist on explanatory architecture.  Without that, there is no way of comprehending the quality that these applications need to deliver actionable results.

Cognitive Bias in Machine Learning June 8, 2018

Posted by Peter Varhol in Algorithms, Machine Learning.

I’ve danced around this topic over the last eight months or so, and now think I’ve learned enough to say something definitive.

So here is the problem.  Neural networks are sets of layered algorithms.  A network might have three layers, or it might have over a hundred.  These algorithms, which can be as simple as polynomials, or as complex as partial derivatives, process incoming data and pass it up to the next level for further processing.

Where do these layers of algorithms come from?  Well, that’s a much longer story.  For the time being, let’s just say they are the secret sauce of the data scientists.

The entire goal is to produce an output that accurately models the real-life outcome.  So we run our independent variables through the layers of algorithms and compare the output to the reality.
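The layered structure described above can be sketched in a few lines.  This is a toy forward pass, not a trained model: the weights are arbitrary illustrative values, and real networks learn theirs from data.

```python
import math

# A toy sketch of a layered network: each layer computes weighted sums
# of its inputs and squashes them with a sigmoid, passing the results
# up to the next layer.  Weights are arbitrary illustrative values.

def unit(values, weights):
    # Weighted sum followed by a sigmoid nonlinearity.
    total = sum(v * w for v, w in zip(values, weights))
    return 1.0 / (1.0 + math.exp(-total))

def forward(inputs, layers):
    values = inputs
    for layer_weights in layers:
        values = [unit(values, w) for w in layer_weights]
    return values

# Two inputs -> a hidden layer of two units -> a single output.
layers = [
    [[0.4, -0.6], [0.7, 0.2]],   # hidden layer weights
    [[1.0, -1.0]],               # output layer weights
]
output = forward([1.0, 0.5], layers)
# Training would compare output[0] against the known real-life outcome
# and nudge the weights to shrink the difference.
```

Stack enough of these layers and the network can bend itself to fit almost any training set, which is precisely where the trouble described next begins.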

There is a problem with this.  Given a complex enough neural network, it is entirely possible that any data set can be trained to provide an acceptable output, even if it’s not related to the problem domain.

And that’s the problem.  If any random data set will work for training, then choosing a truly representative data set can be a real challenge.  Of course, we would never use a random data set for training; we would use something that was related to the problem domain.  And here is where the potential for bias creeps in.

Bias is disproportionate weight in favor of or against one thing, person, or group compared with another.  It’s when we make one choice over another for emotional rather than logical reasons.  Of course, computers can’t show emotion, but they can reflect the biases of their data, and the biases of their designers.  So we have data scientists either working with data sets that don’t completely represent the problem domain, or making incorrect assumptions about the relationships between data and results.

In fact, depending on the data, the bias can be drastic.  MIT researchers have recently demonstrated Norman, the psychopathic AI.  Norman was trained with written captions describing graphic images about death from the darkest corners of Reddit.  Norman sees only violent imagery in Rorschach inkblot cards.  And of course there was Tay, the artificial intelligence chatbot originally released by Microsoft Corporation on Twitter.  After less than a day, Twitter users discovered that Tay could be trained with tweets, and trained it to be obnoxious and racist.

So the data we use to train our neural networks can make a big difference in the results.  We might pick out terrorists based on their appearance or religious affiliation, rather than any behavior or criminal record.  Or we might deny loans to people based on where they live, rather than their ability to pay.

On the one hand, biases may make machine learning systems seem more, well, human.  On the other, we want outcomes from our machine learning systems that accurately reflect the problem domain, not our biases.  We don’t want our human biases to be inherited by our computers.

In the Clutch September 28, 2017

Posted by Peter Varhol in Algorithms, Machine Learning, Software development, Technology and Culture.

I wrote a little while back about how some people are able to recognize the importance of the right decision or action in a given situation, and respond in a positive fashion.  We often call that delivering in the clutch.  This is as opposed to machine intelligence, which at least right now is not equipped to understand and respond to anything regarding the importance of a particular event in a sequence.

The question is whether these systems will ever be able to tell that a particular event has outsized importance, and if they can use this information to, um, try harder.

I have no doubt that we will be able to come up with metrics that can inform a machine learning system of a particularly critical event or events.  Taking an example from Moneyball of an at-bat, we can incorporate the inning, score, number of hits, and so on.  In other problem domains, such as application monitoring, we may not yet be collecting the data that we need, but given a little thought and creativity, I’m sure we can do so.
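A metric like the one described could be sketched as follows.  This is loosely in the spirit of the Moneyball at-bat example; the weighting scheme is invented purely for illustration and is not a real sabermetric leverage formula.

```python
# A hedged sketch of a "criticality" score for an at-bat, combining the
# inning, score margin, runners on base, and outs.  The weighting scheme
# is invented for illustration, not a real leverage statistic.

def at_bat_criticality(inning, score_margin, runners_on, outs):
    late_game = inning / 9.0                       # later innings matter more
    close_game = 1.0 / (1.0 + abs(score_margin))   # tight scores matter more
    pressure = (runners_on + 1) * (3 - outs) / 9.0 # runners aboard, outs left
    return late_game * close_game * pressure

# A bases-loaded at-bat in a tied ninth inning scores far higher than a
# routine second-inning at-bat in a six-run blowout.
clutch = at_bat_criticality(inning=9, score_margin=0, runners_on=3, outs=1)
routine = at_bat_criticality(inning=2, score_margin=6, runners_on=0, outs=0)
```

Feeding such a score into a learning system as another input feature is straightforward; what the system could usefully *do* with it is the harder question taken up below.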

But I have difficulty imagining that machine learning systems will be able to rise to the occasion.  There is simply no mechanism in computer programming for that to happen.  You don’t save your best algorithms for important events; you use them all the time.  For a long-running computation, it may be helpful to add to the server farm, so you can finish more quickly or process more data, but most learning systems won’t be able or equipped to do that.

But code is not intelligence.  Algorithms cannot feel a sense of urgency to perform at the highest level; they are already performing at the highest level of which they are capable.

To be fair, at some indeterminate point in the future, it may be possible for algorithms to detect the need for new code pathways, and call subroutines to make those pathways a reality (or ask for humans to program them).  They may recognize that a particular result is suboptimal, and “ask” for additional data to make it better.  But why would that happen only for critical events?  We would create our systems to do that for any event.

Today, we don’t live in the world of Asimov’s positronic brains and the Three Laws of Robotics.  It will be a while before science is at that point, if ever.

Is this where human achievement can perform better than an algorithm?  Possibly, if we have the requisite human expertise.  There are a number of well-known examples where humans have had to take over when machines failed, some successfully, some unsuccessfully.  But the human has to be there, and has to be equipped professionally and mentally to do so.  That is why I am a strong believer in the human in the loop.

The Future is Now June 23, 2017

Posted by Peter Varhol in Algorithms, Technology and Culture.

And it is messy.  This article notes that it has been 15 years since the release of Minority Report, and today we are using predictive analytics to determine who might commit a crime, and where.

Perhaps it is the sign of the times.  Despite being safer than ever, we are also more afraid than ever.  We may not let our electronics onto commercial planes (though they are presumably okay in cargo).  We want to flag and restrict contact with people deemed high-risk.  We want to stay home.  We want the police to have more powers.

In a way it’s understandable.  This is a bias described aptly by Daniel Kahneman.  We can extrapolate from the general to the particular, but not from the particular to the general.  And there is also the primacy bias.  When we see a mass attack, we are likely to instinctively interpret that as an increase in attacks in general, rather than looking at the trends over time.

I’m reminded of the Buffalo Springfield song: “Paranoia strikes deep, into your lives it will creep.”

But there is a problem using predictive analytics in this fashion, as Tom Cruise discovered.  And this gets back to Nicholas Carr’s point – we can’t effectively automate what we can’t do ourselves.  If a human cannot draw the same or more accurate conclusions, we have no right to rely blindly on analytics.

I suspect that we are going to see increased misuses of analytics in the future, and that is regrettable.  We have to have data scientists, economists, and computer professionals step up and say that a particular application is inappropriate.

I will do so when I can.  I hope others will, too.
