AI Testing 101: Practical Tips for Testing AI Systems
AI-based systems, many of which are built on neural networks (NN), are “systems” like any other app, and as such they require testing. This article will guide you through testing AI and NN-based systems and understanding the relevant concepts.
What Is Different About Testing AI Systems?
“Traditional” software is built on a deterministic algorithm. For example, a system that converts degrees Celsius to Fahrenheit will use the simple formula F = 1.8C + 32.
AI is used in cases where the “formula” is unknown, but you have enough examples of inputs and outputs to estimate the formula based on examples.
Ultimately, AI does not derive the formula; it builds a network of decisions based on previous examples. If the formula is known, there is very little value in creating an AI to solve the problem. For illustration, though, we can create an AI to determine “C” from “F.” Given a few thousand examples, we will get results that are close to the real results. If we look inside the AI, we will likely find figures such as “31.96595” or “32.003024,” close to the constant in the conversion formula (32). However, those figures are determined from examples, not from knowledge.
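To make the point concrete, here is a minimal sketch that estimates the F-to-C relationship from examples only. It uses ordinary least squares as a simple stand-in for a neural network, and the sample size, noise level, and seed are assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


rng = random.Random(0)  # fixed seed so the run is repeatable

# A few thousand example inputs (assumed range: -40 F to 212 F).
fahrenheit = [rng.uniform(-40.0, 212.0) for _ in range(5000)]
# Example outputs: the true conversion plus small measurement noise (assumed).
celsius = [(f - 32.0) / 1.8 + rng.gauss(0.0, 0.5) for f in fahrenheit]

slope, intercept = fit_line(fahrenheit, celsius)

# Rewrite C = slope * F + intercept as C = slope * (F - offset).
learned_offset = -intercept / slope

print(f"learned slope:  {slope:.6f} (formula: {1 / 1.8:.6f})")
print(f"learned offset: {learned_offset:.6f} (formula: 32)")
```

The learned offset lands near 32 without ever being exactly 32, because it comes from the examples rather than from the formula.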
Can we always use a formula? The short answer is, “No, we can’t.” In the above example, we obviously can, but what about more complex examples like, Is this a penguin in this picture? There is no simple formula to determine what a penguin looks like inside a picture. There are endless examples of “pictures of penguins” and they vary in size, position, colors, lighting, types, etc.
What does a penguin look like? There is no clear-cut formula. Enter AI.
AI practically mimics the human brain’s way of operation in terms of training and gives its best guess (i.e. accuracy) based on previously learned examples. With humans, we treat our ability to determine a “penguin” as part of our intelligence.
Just like humans, AI can make mistakes or be tricked. This is where “testing AI” comes into play. Take a look at the examples above and below. Is this a penguin or maybe, if you look at it upside down, a giraffe? It might be problematic if you take a trip to the Serengeti, stand on your head, and suddenly start thinking you’re in Antarctica.
Penguin or giraffe? AI can be tricked, just like humans.
Testing AI Applications: Important Considerations
AI will give results with a certain level of accuracy. It’s very rare to get 100% accuracy for positive results and it’s very rare to get 0% accuracy for negative results.
Testing for Accuracy
A good AI will show a significant delta between the accuracy it reports for positive results and the accuracy it reports for negative results, rather than simply a positive result close to absolute accuracy (100%). When you’re testing, you will get different levels of accuracy. That’s normal, but if you’re getting a 99.99% positive result on object A and a 98% negative result on object B, it might be problematic to determine which is positive and which is not.
99% is not always better than 90%. It is relative to the other results. If positive is 80% and negative is 30% and below, your AI is OK. If positive is above 99% and negative is below 98%, that’s problematic to determine. Remember, one can NEVER test all inputs, so the tester’s role is to determine the QUALITY of the AI.
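The idea that the gap matters more than the absolute number can be sketched as a small check. The function names, the example scores, and the 0.2 minimum margin are all assumptions for illustration, not values from a real system.

```python
def separation_margin(positive_scores, negative_scores):
    """Gap between the weakest positive score and the strongest negative score."""
    return min(positive_scores) - max(negative_scores)


def is_separable(positive_scores, negative_scores, min_margin=0.2):
    """True when positives and negatives are at least min_margin apart (assumed threshold)."""
    return separation_margin(positive_scores, negative_scores) >= min_margin


# Hypothetical scores in the range [0, 1].
healthy_pos = [0.80, 0.85, 0.91]      # positives around 80% and above
healthy_neg = [0.10, 0.22, 0.30]      # negatives at 30% and below
crowded_pos = [0.9999, 0.995, 0.992]  # positives above 99%
crowded_neg = [0.98, 0.97, 0.975]     # negatives just below 98%

print("healthy margin:", separation_margin(healthy_pos, healthy_neg))
print("crowded margin:", separation_margin(crowded_pos, crowded_neg))
print("healthy separable:", is_separable(healthy_pos, healthy_neg))
print("crowded separable:", is_separable(crowded_pos, crowded_neg))
```

The “healthy” set has lower positive scores than the “crowded” set, yet it is the easier one to make decisions on.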
Static or Dynamic AI?
Static AI is provided “as-is” to the application. It will have the SAME results given the same inputs until it’s updated. Static AI is commonly provided by external vendors. For example, your app might be using an image recognition or NLP engine provided by a third party.
Testing Static AI
From a testing perspective, static AI is important to test mainly as part of the development process (as acceptance) and as part of a version release sanity test. But, being static, developers and testers don’t really need to test them again and again. Whatever your existing strategy is for OEM, including external third-party components, it should be the same strategy for testing static AI.
Testing Dynamic AI
Dynamic AI is constantly improving itself. It starts the same way as static AI, but once released, the verified output is injected into the AI again as additional “teaching data” to increase accuracy. This is very similar to the way our brain works.
As with our brain, more is not always better. “Improving” might have a negative impact on the AI and testers should always perform “production testing” to ensure that the AI is indeed improving or at least stays as it used to be.
To do this, use a static set of testing data and track the accuracy number (if available). You can reuse the 20% of data that was held out for testing when the original AI was developed, as it is NOT part of the teaching data. The testing data should produce the same or better results. The cadence is usually related to the percentage increase in teaching data. A good starting point would be 1%.
For example, if the original AI teaching data has 100,000 entries, run the test data and check the results after every 1,000 new entries introduced to improve the AI. An increase of less than 1% will likely not have a significant effect on the AI values.
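A minimal sketch of this production check follows. The helper names, the toy models, and the frozen test set are hypothetical; a real setup would call the deployed model instead of the lambdas used here.

```python
def accuracy(predict, test_set):
    """Fraction of frozen test examples that the model labels correctly."""
    correct = sum(1 for model_input, expected in test_set if predict(model_input) == expected)
    return correct / len(test_set)


def retest_due(size_at_last_test, current_size, threshold_percent=1):
    """True once teaching data has grown by threshold_percent since the last test run."""
    growth = current_size - size_at_last_test
    return growth * 100 >= threshold_percent * size_at_last_test


def has_regressed(baseline_accuracy, new_accuracy, tolerance=0.0):
    """True when the retrained model scores worse than the baseline."""
    return new_accuracy < baseline_accuracy - tolerance


# Frozen test set (hypothetical): label is True when the input is 5 or more.
frozen_test_set = [(value, value >= 5) for value in range(10)]

original_model = lambda value: value >= 5   # stand-in for the released AI
retrained_model = lambda value: value >= 6  # stand-in for the AI after new teaching data

if retest_due(size_at_last_test=100_000, current_size=101_000):
    baseline = accuracy(original_model, frozen_test_set)
    current = accuracy(retrained_model, frozen_test_set)
    print("baseline:", baseline, "current:", current)
    print("regressed:", has_regressed(baseline, current))
```

Integer arithmetic is used in `retest_due` so that the 1% boundary (1,000 new entries on 100,000) is hit exactly.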
Single or Multi-NN?
This is a very important question which might be tricky to understand. Let’s take a chatbot as an example. A chatbot might be based on a messaging platform or a voice platform. In the case of a voice platform, there is an NN-based speech-to-text step before any NLP is used to determine the context of the conversation.
This means that there are TWO NN at play here. In some cases, this might be tricky. For example, a collision detection system might use AI to analyze the base images and a relatively simple algorithm to determine if a collision is possible.
In this case, the tester needs to answer the very basic question, “What exactly am I testing?” Here are some hints:
- In most cases of multi-NN, you are actually testing only ONE NN and you rely on the rest to provide basic information.
- Black box is still black box. You need to test the overall quality of the system. However, as in all testing, you should be making risk-based decisions. You can’t test everything, so FOCUS.
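One way to focus on a single NN in a voice chatbot is to bypass the speech-to-text stage and feed the NLP stage fixed transcripts. The sketch below assumes a hypothetical `detect_intent` function (a keyword stub standing in for a real NLP model) and hypothetical intent labels.

```python
def detect_intent(text):
    """NN under test. This keyword stub is a hypothetical stand-in for a real model."""
    lowered = text.lower()
    if "balance" in lowered:
        return "check_balance"
    if "transfer" in lowered:
        return "transfer_money"
    return "unknown"


def stage_accuracy(stage, labelled_inputs):
    """Accuracy of one stage when given fixed, already-verified upstream output."""
    hits = sum(1 for stage_input, expected in labelled_inputs if stage(stage_input) == expected)
    return hits / len(labelled_inputs)


# Fixed transcripts replace the speech-to-text NN, which is relied on, not tested.
transcripts = [
    ("What is my balance?", "check_balance"),
    ("Please transfer 50 dollars to Dana", "transfer_money"),
    ("Tell me a joke", "unknown"),
    ("Move money to savings", "transfer_money"),  # the stub misses this phrasing
]

print("NLP stage accuracy:", stage_accuracy(detect_intent, transcripts))
```

Any failure found this way belongs to the NLP stage alone, which keeps the question “What exactly am I testing?” answerable.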
False or Fraud: the Security Aspect of AI
We will dive into this in more detail when discussing AI completeness. But in almost all cases, AI systems have potential attack vectors that can be used for fraud. In related research, an example was given of how a “red traffic light” plus an additional 11 white pixels can be classified as an “oven.”
Even slight changes to an image can confuse AI, making it susceptible to fraud.
Testing for Secure AI
To better define your testing needs, consider the following:
Do I expect fraud inputs? Why?
For instance, in the traffic light case above, someone who wants to cause a car crash might use this anomaly to fool the detection of a given traffic light.
However, for chatbots, if the input is not recognized correctly, the results are likely not fraudulent by nature. This means that you can cause a false recognition, but for what reason?
What’s the cost of a false detection?
In the traffic light example, fraudulent or not, the results of bad detection can be catastrophic. It might be caused by someone with evil intentions or by a few drops of rain. Good testing should detect such anomalies because of the possible high cost of false detection.
In the chatbot example, a false detection will usually result in a “Sorry, I didn’t get that” response, and besides an annoying interface, there’s no harm done. While you obviously want to make sure that false detections are minimized, the cost of a false detection is not catastrophic.
Is the system autonomous or not?
In most cases, the cost of false detection is higher with autonomous systems. It’s not always a life-threatening situation like in the traffic light example above, but it might still result in a high cost.
A false license plate detection might mean the carpark barrier will not go up in time or a driver could falsely be charged for toll roads.
If the FLOW of the system includes a human who can “fix” AI mistakes, the cost of false detection is usually much lower.
Number of Possible Inputs
In most cases, AI is used where the number of possible inputs is extremely large or practically infinite. For example, in a system used to determine if a given picture is a penguin, the possible inputs are “any picture” by definition.
Practically, it’s not important to understand how many possible inputs there are. And obviously, you can’t test all of them. What is required is to determine a solid testing data strategy.
Testing for Number of Inputs
There are a few factors that can help reduce the number of test inputs.
The Inputs You Are Interested in Testing
Previously, we discussed multi-NN systems. If, for example, your system relies on a computer vision (CV) component to identify objects (for example, a component which returns a list of animals in a given picture), you don’t really need to test that component too much or too often. This can significantly reduce the list of inputs, from “any image” to “list of animals.”
A sanity check here is to ask the developers what their code is doing (their code, not the OEM code). If their code starts from inputs such as an animal name, this is your focus, not “any image.”
NNs are created in such a way that they group inputs based on the inputs’ low-level values. This might be too complicated to explain here, but if we are searching for penguins, a possible grouping might be “not animals,” “other animals,” or “penguins.” Staying focused on context, if your system should detect penguins, there is little difference between a “chair” and a “table,” meaning there’s no point in testing all furniture.
Other groupings could be lighting conditions, size, position, colors, etc.
NNs tend to be sensitive to slight changes in the inputs. If your system is exposed to this kind of risk (mainly autonomous systems), add some tests which loop through a certain parameter. For example, use the same image, but with different lighting conditions.
This is also helpful for non-CV, non-audio inputs. For example, if an NN parameter is an age, try to give vectors of dates with a frequency of a single day.
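The date sweep described above can be sketched as follows. The `toy_score` function is a hypothetical stand-in for a model that takes a birth date, and it contains a deliberate defect so the sweep has something to find; the reference day and the 0.05 step limit are also assumptions.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Yield every date from start to end inclusive, one day apart."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def find_jumps(score, inputs, max_step):
    """Return (previous_input, input, delta) wherever the output moves more than max_step."""
    jumps = []
    previous_input = None
    previous_score = None
    for item in inputs:
        current_score = score(item)
        if previous_score is not None and abs(current_score - previous_score) > max_step:
            jumps.append((previous_input, item, current_score - previous_score))
        previous_input, previous_score = item, current_score
    return jumps


REFERENCE_DAY = date(2024, 1, 1)  # assumed "today" so the run is repeatable


def toy_score(birth_date):
    """Hypothetical model output in [0, 1] that should change smoothly with age."""
    age_days = (REFERENCE_DAY - birth_date).days
    score = min(age_days / 36500, 1.0)
    if age_days >= 18 * 365:
        score -= 0.3  # deliberate defect: abrupt drop near the 18th birthday
    return score


jumps = find_jumps(toy_score, daily_dates(date(2005, 1, 1), date(2007, 1, 1)), max_step=0.05)
for before, after, delta in jumps:
    print(f"output jumped by {delta:+.3f} between {before} and {after}")
```

A sweep like this does not need expected outputs for every date; it only asserts that neighboring inputs produce neighboring outputs.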
Trick the System
Part of your “false checking” should include noise-level testing, which includes positive inputs with an added level of noise, such as image noise, audio noise, etc.
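Noise-level testing can be sketched as a loop that adds increasing noise to a known positive input and records how often the prediction survives. The classifier below is a hypothetical toy that works on a flat list of pixel values; the noise levels, trial count, and seed are assumptions.

```python
import random


def add_noise(values, level, rng):
    """Return a copy of values with Gaussian noise of standard deviation `level` added."""
    return [value + rng.gauss(0.0, level) for value in values]


def stable_fraction(predict, sample, level, trials=200, seed=0):
    """Fraction of noisy variants that keep the prediction made for the clean sample."""
    rng = random.Random(seed)  # fixed seed so failures can be reproduced
    expected = predict(sample)
    kept = sum(1 for _ in range(trials) if predict(add_noise(sample, level, rng)) == expected)
    return kept / trials


def toy_classifier(pixels):
    """Hypothetical stand-in for an image model: 'bright' when the mean pixel exceeds 0.5."""
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


clean_image = [0.6] * 64  # known positive input for 'bright'

for level in (0.01, 0.1, 0.5, 1.0):
    fraction = stable_fraction(toy_classifier, clean_image, level)
    print(f"noise level {level}: prediction kept in {fraction:.0%} of trials")
```

The noise level at which the kept fraction starts to drop gives an objective number that can be tracked from release to release.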
While AI testing seems like it can’t be automated, that’s not true. Most testing can be automated if given objective measurements.
- If you have a set of known inputs and have a set of known outputs (even if those are ranges of numbers), it can be automated.
- If you are sitting in front of the system and thinking about how to fail the system, you are doing something wrong.
- If you did it once, it can be added to known inputs and outputs.
- AI is not considered heavy on processing. While there are many possible inputs, the AI is extremely optimized, and an AI decision should usually take very little time (in many cases, measured in milliseconds or less).
- If testing is taking too much time, you might delay the CI cycle, so consider daily and weekly cycles. However, make this decision ONLY if you are suffering from poor performance, not before. As said, AI processing is usually very fast.
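The first and third points above can be sketched as a small table-driven check, where each known input is paired with an acceptable output range. The case names, the ranges, and the `toy_penguin_score` lookup are hypothetical; in practice the score would come from the AI under test.

```python
def run_cases(score, cases):
    """Return (name, actual) for every case whose score falls outside its expected range."""
    failures = []
    for name, model_input, low, high in cases:
        actual = score(model_input)
        if not (low <= actual <= high):
            failures.append((name, actual))
    return failures


# Hypothetical stand-in for the AI: maps an image name to a "penguin" score.
_TOY_SCORES = {
    "penguin_front.jpg": 0.93,
    "penguin_dark.jpg": 0.71,
    "giraffe.jpg": 0.12,
    "giraffe_upside_down.jpg": 0.64,
}


def toy_penguin_score(image_name):
    return _TOY_SCORES[image_name]


# Known inputs with known output ranges: (name, input, lowest allowed, highest allowed).
cases = [
    ("clear penguin", "penguin_front.jpg", 0.80, 1.00),
    ("penguin in low light", "penguin_dark.jpg", 0.60, 1.00),
    ("giraffe", "giraffe.jpg", 0.00, 0.30),
    # Found once by hand, then added to the known inputs so it is checked on every run.
    ("upside-down giraffe", "giraffe_upside_down.jpg", 0.00, 0.30),
]

failures = run_cases(toy_penguin_score, cases)
for name, actual in failures:
    print(f"FAIL {name}: score {actual} is outside the expected range")
```

Because the expectations are ranges rather than exact values, the same cases keep working when the AI is retrained and its scores shift slightly.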