
Why Testing Needs Explainable Artificial Intelligence April 19, 2021

Posted by Peter Varhol in Algorithms, Machine Learning, Software development.

Many artificial intelligence/machine learning (AI/ML) applications produce results that are not easily understandable from their training and input data.  This is because these systems are largely black boxes that use multiple algorithms (sometimes hundreds) to process data and return a result.  Tracing how that data is processed through the mathematical algorithms is an impossible task for a person.

Further, these algorithms were “trained” or adjusted based on the data used as the foundation of learning.  What is really happening is that the data is adjusting the algorithms to reflect what we already know about the relationship between inputs and outputs.  In other words, we are doing a very complex type of nonlinear regression, without any inherent knowledge of a causal relationship between inputs and outputs.
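To make that concrete, here is a minimal sketch of the idea (the data is invented, and a simple polynomial fit stands in for a full network): training just adjusts coefficients until the outputs match the data, with no notion of why the relationship holds.

```python
import numpy as np

# Hypothetical data: inputs x and outputs y that happen to follow y = x^2,
# though the fitting procedure has no knowledge of why.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = x**2 + rng.normal(0, 0.1, size=x.shape)

# "Training" here is just adjusting polynomial coefficients to fit the data --
# a nonlinear regression with no notion of cause and effect.
coeffs = np.polyfit(x, y, deg=2)
predicted = np.polyval(coeffs, x)

# The fit is good, but the model only encodes correlation, not causation.
print(round(float(coeffs[0]), 2))  # close to the true value of 1.0
```

The fitted coefficients tell us nothing about whether x causes y; they only summarize the relationship already present in the data.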

At worst, the outputs from AI systems can sometimes seem nonsensical, based on what is known about the problem domain.  Yet because those outputs come from software, we are inclined to trust them and apply them without question.  Maybe we shouldn’t.

But it can be more subtle than that.  The results could reflect a systemic bias that makes outputs seem correct, or at least plausible, when they are not, or at least not ethically right.  And users rarely have recourse to question the outputs, making the system a black box.

This is where explainable AI (XAI) comes in.  In cases where the relationship between inputs and outputs is complex and not especially apparent, users need the application to explain why it delivered a certain output.  It’s a matter of trusting the software to do what we think it is doing.  Ethical AI also plays into this concept.

So how does XAI work?  There is a long way to go here, but a couple of techniques show some promise.  XAI operates on the principles of transparency, interpretability, and explainability.  Transparency means that we need to be able to look into the algorithms to clearly discern how they are processing input data.  While that may not tell us how those algorithms were trained, it provides insight into the path to the results, and is intended for interpretation by the design and development team.

Interpretability is how the results might be presented for human understanding.  In other words, if you have an application and are getting a particular result, you should be able to see and understand how that result was achieved, based on the input data and processing algorithms.  There should be a logical pathway between data inputs and result outputs.
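One way to picture that logical pathway is a model simple enough to decompose its own output.  This sketch (the feature names and weights are hypothetical) shows a linear score broken down into per-feature contributions, so a user can see how the result was achieved:

```python
# A minimal sketch of interpretability: a linear scoring model whose output
# decomposes into per-feature contributions, giving a logical pathway from
# data inputs to result outputs.  Names and weights are invented.

features = {"income": 52_000, "debt": 8_000, "years_employed": 6}
weights  = {"income": 0.004, "debt": -0.01, "years_employed": 5.0}

# Each feature's contribution to the score is visible and auditable.
contributions = {name: weights[name] * value for name, value in features.items()}
score = sum(contributions.values())

for name, contrib in contributions.items():
    print(f"{name:>15}: {contrib:+.1f}")
print(f"{'total score':>15}: {score:+.1f}")
```

A deep neural network offers no such decomposition out of the box, which is exactly why interpretability has to be designed in.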

Explainability remains a vague concept while researchers try to define exactly how it might work.  We might want to support queries into our results, or to get detailed explanations into more specific phases of the processing.  But until there is better consensus, this feature remains a gray area.

The latter two characteristics are more important to testers and users.  How you do this depends on the application.  Facial recognition software can usually be built to describe facial characteristics and how they match up to values in an identification database.  It becomes possible to build at least interpretability into the software.

But interpretability and explainability are not as easy when the problem domain is more ambiguous.  How can we interpret an e-commerce recommendation that may or may not have anything to do with our product purchase?  I have received recommendations on Amazon that clearly bear little relationship to what I have purchased or examined, so we don’t always have a good path between source and destination.

So how do we implement and test XAI? 

Where Testing Gets Involved

Testing AI applications tends to be very different than testing traditional software.  Testers often don’t know what the right answer is supposed to be.  XAI can be very helpful in that regard, but it’s not the complete answer.

Here’s where XAI can help.  If the application is developed and trained in a way where algorithms show their steps in coming from problem to solution, then we have something that is testable.

Rule-based systems can make it easier, because the rules form a big part of the knowledge.  In neural networks, however, the algorithms rule, and they bear little relationship to the underlying intelligence.  But rule-based intelligence is much less common today, so we have to go back to the data and algorithms.

Testers often don’t have control over how AI systems work to create results.  But they can delve deeply into both data and algorithms to come up with ways to understand and test the quality of systems.  It should not be a black box to testers or to users.  How do we make it otherwise?

Years ago, I wrote a couple of neural network AI applications that simply adjusted the algorithms in response to training, without any insight on how that happened.  While this may work in cases where the connection isn’t important, knowing how our algorithms contribute to our results has become vital.

Sometimes AI applications “cheat”, using cues that do not accurately reflect the knowledge within the problem domain.  For example, it may be possible to facially recognize people, not through their characteristics, but through their surroundings.  You may have data to indicate that I live in Boston, and use the Boston Garden in the background as your cue, rather than my own face.  That may be accurate (or may not be), but it’s not facial recognition.

A tester can use an XAI application here to help tell the difference.  That’s why developers need to build in this technology.  But testers need deep insight into both the data and the algorithms.

Overall, a human in the loop remains critical.  Unless someone is looking critically at the results, they can be wrong, and quality will suffer.

There’s no one correct answer here.  Instead, testers need to be intimately involved in the development of AI applications, and insist on explanatory architecture.  Without that, there is no way of comprehending the quality that these applications need to deliver actionable results.

Will We Have Completely Autonomous Airliners? January 2, 2020

Posted by Peter Varhol in aviation, Machine Learning, Technology and Culture.

This has been the long-term trend, and two recent stories have added to the debate.  First, the new FAA appropriations bill includes a directive to study single-pilot airliners used for cargo.  Second is this story in the Wall Street Journal (paywall), discussing how the Boeing 737 MAX crashes have caused the company to advocate even more for fully autonomous airliners.

I have issues with that.  First, Boeing’s reasoning is fallacious.  The 737 MAX crashes were not pilot error, but rather design and implementation errors, and inadequate information for documentation and training.  Boeing as a culture apparently still refuses to acknowledge that.

Second, as I have said many times before, automation is great when used in normal operations.  When something goes wrong, automation more often than not does the opposite of the right thing, attempting to continue normal operations in an abnormal situation.

As for a single pilot, when things go wrong, a single pilot is likely to be too focused on the most immediate problem, rather than carrying out a division of labor.  It seems that in an emergency situation, two experienced heads are better than one.  And there are instances, albeit rare, where a pilot becomes incapacitated and a second person is needed.

Boeing is claiming that AI will provide the equivalent of a competent second pilot.  That’s not what AI is all about.  Despite the ability to learn, a machine learning system would have to have seen the circumstances of the failure before, and have a solution, or at least an approximation of a solution, as a part of its training.  This is not black magic, as Boeing seems to think.  It is a straightforward process of data and training.

AI does only what it is trained to do.  Boeing says that pilot error is the leading cause of airliner incidents.  They are correct, but it’s not as simple as that.  Pilot error is a catch-all term that includes a number of different things, including wrong decisions, poor information, or inadequate training, among others.  While these can easily be traced back to the pilot, they are the result of several different kinds of errors and omissions.

So I have my doubts as to whether full automation is possible or even desirable.  And the same applies to a single pilot.  Under normal operations, it might be a good approach.  But life is full of unexpected surprises.

Cognitive Bias in Machine Learning June 8, 2018

Posted by Peter Varhol in Algorithms, Machine Learning.

I’ve danced around this topic over the last eight months or so, and now think I’ve learned enough to say something definitive.

So here is the problem.  Neural networks are sets of layered algorithms.  A network might have three layers, or it might have over a hundred.  These algorithms, which can be as simple as polynomials, or as complex as partial derivatives, process incoming data and pass it up to the next level for further processing.
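The layered structure is easy to sketch.  Here is a toy three-layer network (the weights are random stand-ins, not a trained model): each layer is just a matrix of weights plus a simple nonlinearity, handing its output up to the next layer.

```python
import numpy as np

# A toy three-layer network.  Each layer is a weight matrix followed by a
# nonlinearity; the output of one layer is the input to the next.
rng = np.random.default_rng(42)
layers = [rng.normal(size=(4, 8)),   # layer 1: 4 inputs -> 8 units
          rng.normal(size=(8, 8)),   # layer 2: 8 units  -> 8 units
          rng.normal(size=(8, 1))]   # layer 3: 8 units  -> 1 output

def forward(x):
    for w in layers:
        x = np.tanh(x @ w)  # process, then pass the result up to the next layer
    return x

output = forward(np.array([[0.5, -1.0, 0.25, 0.0]]))
print(output.shape)  # a single scalar prediction: shape (1, 1)
```

Even at this toy scale, tracing why the output has the value it does means unwinding every multiplication through every layer.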

Where do these layers of algorithms come from?  Well, that’s a much longer story.  For the time being, let’s just say they are the secret sauce of the data scientists.

The entire goal is to produce an output that accurately models the real-life outcome.  So we run our independent variables through the layers of algorithms and compare the output to the reality.

There is a problem with this.  Given a complex enough neural network, it is entirely possible that any data set can be trained to provide an acceptable output, even if it’s not related to the problem domain.
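A toy demonstration of the point (the data is pure noise, and a polynomial stands in for an over-parameterized network): give the model enough free parameters and it will produce an "acceptable" fit to data that has no relationship to any problem domain.

```python
import numpy as np

# Eight points of pure noise -- no relationship to any problem domain.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 8)
y = rng.normal(size=8)

# A model with enough free parameters (a degree-7 polynomial through 8 points)
# fits the noise essentially perfectly.
coeffs = np.polyfit(x, y, deg=7)
residuals = y - np.polyval(coeffs, x)

print(float(np.max(np.abs(residuals))))  # essentially zero: a "perfect" fit to noise
```

The fit looks excellent by any numeric measure, which is precisely why a good-looking training result tells us nothing about whether the data was representative.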

And that’s the problem.  If any random data set will work for training, then choosing a truly representative data set can be a real challenge.  Of course, we would never use a random data set for training; we would use something that was related to the problem domain.  And here is where the potential for bias creeps in.

Bias is disproportionate weight in favor of or against one thing, person, or group compared with another.  It’s when we make one choice over another for emotional rather than logical reasons.  Of course, computers can’t show emotion, but they can reflect the biases of their data, and the biases of their designers.  So we have data scientists either working with data sets that don’t completely represent the problem domain, or making incorrect assumptions about the relationships between data and results.

In fact, depending on the data, the bias can be drastic.  MIT researchers have recently demonstrated Norman, the psychopathic AI.  Norman was trained with written captions describing graphic images about death from the darkest corners of Reddit.  Norman sees only violent imagery in Rorschach inkblot cards.  And of course there was Tay, the artificial intelligence chatter bot that was originally released by Microsoft Corporation on Twitter.  After less than a day, Twitter users discovered that Tay could be trained with tweets, and trained it to be obnoxious and racist.

So the data we use to train our neural networks can make a big difference in the results.  We might pick out terrorists based on their appearance or religious affiliation, rather than any behavior or criminal record.  Or we might deny loans to people based on where they live, rather than their ability to pay.

On the one hand, biases may make machine learning systems seem more, well, human.  On the other, we want outcomes from our machine learning systems that accurately reflect the problem domain, not our biases.  We don’t want our human biases to be inherited by our computers.

Can Machines Learn Cause and Effect? June 6, 2018

Posted by Peter Varhol in Algorithms, Machine Learning.

Judea Pearl is one of the giants of what started as an offshoot of classical statistics, but has evolved into the machine learning area of study.  His actual contributions deal with Bayesian statistics, along with prior and conditional probabilities.

If it sounds like a mouthful, it is.  Bayes Theorem and its accompanying statistical models are at the same time surprisingly intuitive and mind-blowingly abstruse (at least to me, of course).  Bayes Theorem describes the probability of a particular outcome, based on prior knowledge of conditions that might be related to the outcome.  Further, we update that probability when we have new information, so it is dynamic.
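The update mechanics are simpler than the prose suggests.  Here is a minimal sketch of a Bayesian update (the probabilities are invented for illustration):

```python
# Bayes' Theorem: update the probability of a hypothesis as evidence arrives.
# P(H | E) = P(E | H) * P(H) / P(E).  Numbers below are illustrative only.

def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# Prior: 1% chance the condition is present.  The evidence shows up 90% of
# the time when it is present, and 5% of the time when it is not.
posterior = bayes_update(0.01, 0.90, 0.05)
print(round(posterior, 3))  # about 0.154

# The estimate is dynamic: the posterior becomes the prior for the next
# piece of evidence, and the probability keeps shifting as data arrives.
posterior2 = bayes_update(posterior, 0.90, 0.05)
```

Notice how a single piece of fairly reliable evidence moves a 1% prior to about 15%, and a second observation moves it much further still.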

So when Judea Pearl talks, I listen carefully.  In this interview, he is pointing out that machine learning and AI as practiced today are limited by the techniques we are using.  In particular, he claims that neural networks simply “do curve fitting,” rather than understand relationships.  His goal is for machines to discern cause and effect between variables, that is, “A causes B to happen, B causes C to happen, but C does not cause A or B.”  He thinks that Bayesian inference is ultimately a way to do this.

It’s a provocative statement to say that we can teach machines about cause and effect.  Cause and effect is a very situational concept.  Even most humans stumble over it.  For example, does more education cause people to have a higher income?  Well maybe.  Or it may be that more intelligence causes a higher income, but more intelligent people also tend to have more education.  I’m simply not sure about how we would go about training a machine, using only quantitative data, about cause and effect.

As for neural networks being mere curve-fitting, well, okay, in a way.  He is correct to point out that what we are doing with these algorithms is not finding Truth, or cause and effect, but rather looking at the best way of expressing a relationship between our data and the outcome produced (or desired, in the case of unsupervised learning).

All that says is that there is a relationship between the data and the outcome.  Is it causal?  It’s entirely possible that not even a human knows.

And it’s not at all clear to me that this is what Bayesian inference is saying.  In fact, I don’t see anything in any statistical technique that allows us to assume cause and effect.  Right now, the closest we come to this in simple correlation is R-squared, which tells us how much of the variation in the outcome is “explained” by the model.  But “explained” doesn’t mean what you think it means.
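To see how little "explained" actually claims, here is R-squared computed from scratch for a small invented data set: it measures only how well a fitted line tracks the data, and says nothing about causation.

```python
# R-squared: the fraction of variance in y accounted for by a fitted line.
# The data below is invented; the point is what the statistic does and
# does not tell us.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))  # very high, yet nothing here implies causation
```

An R-squared near 1.0 means the line fits well, full stop; x could cause y, y could cause x, or both could be driven by something we never measured.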

As for teaching machines cause and effect, I don’t discount it eventually.  Human intelligence and free will is an existence proof; we exhibit those characteristics, at least some of the time, so it is not unreasonable to think that machines might someday also do so.  That said, it certainly won’t happen in my lifetime.

And about data.  We fool ourselves here too.  More on this in the next post.

More on AI and the Turing Test May 20, 2018

Posted by Peter Varhol in Architectures, Machine Learning, Strategy, Uncategorized.

It turns out that most people who care to comment are, to use the common phrase, creeped out at the thought of not knowing whether they are talking to an AI or a human being.  I get that, although I don’t think I’m myself bothered by such a notion.  After all, what do we know about people during a casual phone conversation?  Many of them probably sound like robots to us anyway.

And this article in the New York Times notes that Google was only able to accomplish this feat by severely limiting the domain in which the AI could interact – in this case, making dinner reservations or a hair appointment.  The demonstration was still significant, but isn’t a truly practical application, even within a limited domain space.

Well, that’s true.  The era of an AI program interacting like a human across multiple domains is far away, even with the advances we’ve seen over the last few years.  And this is why I even doubt the viability of self-driving cars anytime soon.  The problem domains encountered by cars are enormously complex, far more so than any current tests have attempted.  From road surface to traffic situation to weather to individual preferences, today’s self-driving cars can’t deal with being in the wild.

You may retort that all of these conditions are objective and highly quantifiable, making it possible to anticipate and program for them.  But we come across driving situations almost daily that have new elements that must be instinctively integrated into our body of knowledge and acted upon.  Computers certainly have the speed to do so, but they lack a good learning framework to identify critical data and integrate that data into their neural network to respond in real time.

Author Gary Marcus notes that what this means is that the deep learning approach to AI has failed.  I laughed when I came to the solution proposed by Dr. Marcus – that we return to the backward-chaining rules-based approach of two decades ago.  This was what I learned during much of my graduate studies, and was largely given up on in the 1990s as unworkable.  Building layer upon layer of interacting rules was tedious and error-prone, and it required an exacting understanding of just how backward chaining worked.
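For readers who never met backward chaining, here is a minimal sketch (the rules and facts are invented): to prove a goal, find a rule that concludes it and recursively prove that rule's premises, bottoming out in known facts.  Scaling this to layer upon layer of interacting rules is exactly where the tedium and errors crept in.

```python
# A minimal backward-chaining sketch.  Each rule maps a conclusion to the
# premises that must all hold for it.  Rule names are invented.

rules = {
    "engine_failure": ["loss_of_thrust", "abnormal_vibration"],
    "loss_of_thrust": ["low_fuel_flow"],
}
facts = {"low_fuel_flow", "abnormal_vibration"}

def prove(goal):
    """Work backward from the goal to known facts."""
    if goal in facts:
        return True
    premises = rules.get(goal)
    return premises is not None and all(prove(p) for p in premises)

print(prove("engine_failure"))  # True: both premises chain back to known facts
```

With two rules this is trivial; with thousands of interacting rules, each addition risks breaking chains you didn't know existed, which is why the approach was largely abandoned.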

Ultimately, I think that the next generation of AI will incorporate both types of approaches.  The neural network to process data and come to a decision, and a rules-based system to provide the learning foundation and structure.

Google AI and the Turing Test May 12, 2018

Posted by Peter Varhol in Algorithms, Machine Learning, Software development, Technology and Culture, Uncategorized.

Alan Turing was a renowned mathematician in Britain, and during World War II worked at Bletchley Park on cryptography.  He was an early computer pioneer, and today is probably best known for the Turing Test, a way of distinguishing between computers and humans (hypothetical at the time).

More specifically, the Turing Test was designed to see if a computer could pass for a human being, and was based on having a conversation with the computer.  If the human could not distinguish between talking to a human and talking to a computer, the computer was said to have passed the Turing Test.  No computer has ever done so, although Joseph Weizenbaum’s Eliza psychology therapist in the 1960s was pretty clever (think Carl Rogers).
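Eliza's cleverness was almost entirely surface-level pattern matching.  A toy sketch in that spirit (the patterns are illustrative; the real Eliza also swapped pronouns and had many more rules) shows how a few keyword rules can fake a reflective conversation with no understanding at all:

```python
import re

# A tiny Eliza-style responder: keyword rules transform the user's statement
# into a reflective question, with no grasp of meaning whatsoever.

rules = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
]

def respond(statement):
    for pattern, template in rules:
        match = pattern.search(statement)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."  # default when no rule matches

print(respond("I am tired"))  # prints "Why do you say you are tired?"
```

That a program this shallow could hold users' attention says as much about human conversation as it does about the software.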

The Google AI passes the Turing Test.  https://www.youtube.com/watch?v=D5VN56jQMWM&feature=youtu.be.

I’m of two minds about this.  First, it is a great technical and scientific achievement.  This is a problem that for decades was thought to be intractable.  Syntax has definite structure and is relatively easy to parse.  While humans seem to understand language semantics instinctively, there are ambiguities that can only be learned through training.  That’s where deep learning through neural networks comes in.  And to respond in real time is a testament to today’s computing power.

Second, do we need this because we don’t want to have phone conversations?  Of course, the potential applications go far beyond calling to make a hair appointment.  For a computer to understand human speech and respond intelligently to the semantics of human words, it requires significant training in human conversation.  That certainly implies deep learning, along with highly sophisticated algorithms.  It can apply to many different types of human interaction.

But no computing technology is without tradeoffs, and intelligent AI conversation is no exception.  I’m reminded of Sherry Turkle’s book Reclaiming Conversation.  It posits that people are increasingly afraid of having spontaneous conversations with one another, mostly because we cede control of the situation.  We prefer communications where we can script our responses ahead of time to conform to our expectations of ourselves.

Having our “AI assistant” conduct many of those conversations for us seems like simply one more step in our abdication as human beings, unwilling to face other human beings in unscripted communications.  Also, it is a way of reducing friction in our daily lives, something I have written about several times in the past.

Reducing friction is also a tradeoff.  It seems worthwhile to make day to day activities easier, but as we do, we also fail to grow as human beings.  I’m not sure where the balance lies here, but we should not strive single-mindedly to eliminate friction from our lives.

5/14 Update:  “Google Assistant making calls pretending to be human not only without disclosing that it’s a bot, but adding “ummm” and “aaah” to deceive the human on the other end with the room cheering it… horrifying. Silicon Valley is ethically lost, rudderless and has not learned a thing…As digital technologies become better at doing human things, the focus has to be on how to protect humans, how to delineate humans and machines, and how to create reliable signals of each—see 2016. This is straight up, deliberate deception. Not okay.” – Zeynep Tufekci, Professor & Writer 

Let’s Have a Frank Discussion About Complexity December 7, 2017

Posted by Peter Varhol in Algorithms, Machine Learning, Strategy, Uncategorized.

And let’s start with the human memory.  “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information” is one of the most highly cited papers in psychology.  The title is rhetorical, of course; there is nothing magical about the number seven.  But the paper and associated psychological studies explicitly define the ability of the human mind to process increasingly complex information.

The short answer is that the human mind is a wonderful mechanism for some types of processing.  We can very rapidly process a large amount of sensory input, and draw some very quick but not terribly accurate conclusions (Kahneman’s Type 1 thinking), but we can’t handle an overwhelming amount of quantitative data and expect to make any sense out of it.

In discussing machine learning systems, I often say that we as humans have too much data to reliably process ourselves.  So we set (mostly artificial) boundaries that let us ignore a large amount of data, so that we can pay attention when the data clearly signify a change in the status quo.

The point is that I don’t think there is a way for humans to deal directly with a lot of complexity.  And if we employ systems to evaluate that complexity and present it in human-understandable concepts, we are necessarily losing information in the process.

This, I think, is a corollary of Joel Spolsky’s Law of Leaky Abstractions, which says that anytime you abstract away from what is really happening with hardware and software, you lose information.  In many cases, that information is fairly trivial, but in some cases, it is critically valuable.  If we miss it, it can cause a serious problem.

While Joel was describing abstraction in a technical sense, I think that his law applies beyond that.  Any time that you add layers in order to better understand a scenario, you out of necessity lose information.  We look at the Dow Jones Industrial Average as a measure of the stock market, for example, rather than minutely examine every stock traded on the New York Stock Exchange.

That’s not a bad thing.  Abstraction makes it possible for us to better comprehend the world around us.

But it also means that we are losing information.  Most times, that’s not a disaster.  Sometimes it can lead us to disastrously bad decisions.

So what is the answer?  Well, abstract, but doubt.  And verify.

Are Engineering and Ethics Orthogonal Concepts? November 18, 2017

Posted by Peter Varhol in Algorithms, Technology and Culture.

Let me explain through example.  Facebook has a “fake news” problem.  Users sign up for a free account, then post, well, just about anything.  If it violates Facebook’s rules, the platform generally relies on users to report, although Facebook also has teams of editors and is increasingly using machine learning techniques to try to (emphasis on try) be proactive about flagging content.

(Developing machine learning algorithms is a capital expense, after all, while employing people is an operational one.  But I digress.)

But something can be clearly false while not violating Facebook guidelines.  Facebook is in the very early stages of attempting to authenticate the veracity of news (it will take many years, if it can be done at all), but it almost certainly won’t remove that content.  It will be flagged as possibly false, but still available for those who want to consume it.

It used to be that we as a society confined our fake news to outlets such as The Globe or the National Enquirer, tabloid papers typically sold at check-out lines at supermarkets.  Content was mostly about entertainment personalities, and consumption was limited to those who bothered to purchase it.

Now, however, anyone can be a publisher*.  And can publish anything.  Even at reputable news sources, copy editors and fact checkers have gone the way of the dodo bird.

It gets worse.  Now entire companies exist to write and publish fake news and outrageous views online.  Thanks to Google’s ad placement strategy, the more successful ones may actually get paid by Google to do so.

By orthogonal, I don’t mean contradictory.  At the fundamental level, orthogonal means “at right angles to.”  Variables that are orthogonal are statistically independent, in that changes in one don’t at all affect the other.

So let’s translate that to my point here.  Facebook, Google, and the others don’t see this as a societal problem, which is difficult and messy.  Rather they see it entirely as an engineering problem, solvable with the appropriate application of high technology.

At best, it’s both.  At worst, it is entirely a societal problem, to be solved with an appropriate (and messy) application of understanding, negotiation, and compromise.  That’s not Silicon Valley’s strong suit.

So they try to address it with their strength, rather than acknowledging that their societal skills as they exist today are inadequate to the immense task.  I would be happy to wait, if Silicon Valley showed any inclination to acknowledge this and try to develop those skills, but all I hear is crickets chirping.

These are very smart people, certainly smarter than me.  One can hope that age and wisdom will help them recognize and overcome their blind spots.  One can hope, can’t one?

*(Disclaimer:  I mostly publish my opinions on my blog.  When I use a fact, I try to verify it.  However, as I don’t make any money from this blog, I may occasionally cite something I believe to be a fact, but is actually wrong.  I apologize.)
