More and more AI/ML algorithms are being embedded within the development process. The UI may look the same, but underneath, massive algorithms are gathering insights to help serve clients better.
For example, Twitter/X, Amazon, and most news websites leverage content-personalization algorithms based on users' tastes, preferences, and browsing history. It is all based on a learning system that collects these data points.
The use of AI/ML in a new type of app, the AI-infused application (AIIA), will create a new breed of sophisticated software defects. Here are the top six new categories that should be considered for AI/ML defect classification.
An AI-based algorithm makes predictions based on how the user trains its engine. The algorithm labels things according to the data it is trained on, and it cannot judge whether that data is correct. For example, if the algorithm is trained on data that reflects racism or sexism, its predictions will mirror that bias rather than correct it automatically. Therefore, one needs to make sure that the algorithms are fair, especially when they affect both private individuals and corporations.
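One way such a fairness requirement can be turned into a testable check is sketched below: compare the model's positive-prediction rate across two groups (a demographic-parity check). The data, group names, and threshold are hypothetical; a real test plan would use the team's own fairness metrics.

```python
# Hypothetical sketch of a fairness check: compare the model's
# positive-prediction rate across two groups of users.
predictions = [
    ("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
    ("group_b", 1), ("group_b", 0), ("group_b", 0), ("group_b", 0),
]

def positive_rate(preds, group):
    """Share of positive outputs the model produced for one group."""
    outcomes = [label for g, label in preds if g == group]
    return sum(outcomes) / len(outcomes)

# Demographic-parity gap: a large gap flags a potential bias defect.
gap = abs(positive_rate(predictions, "group_a")
          - positive_rate(predictions, "group_b"))
print(f"parity gap: {gap:.2f}")  # 0.75 - 0.25 = 0.50 for this toy data
```

A test suite could then assert that the gap stays below a pre-agreed threshold and classify any violation under this defect category.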
This defect category means that training the AI engine should also include a dedicated set of rules and data that refers to ethics, depending on the target market segments, geographies, and exposure of the app or website.
Such a category needs to be included in the test planning and applied upon detection of relevant issues. It will also require the ability to perform all sorts of testing within the lifecycle of the app (unit, API, UI, data inputs, etc.).
In a recent article in Harvard Magazine, the author gave an example involving autonomous cars and passengers who have an arrest warrant against them. The dilemma is whether the car should drive the suspect directly to the nearest police station, even when a potentially life-threatening emergency requires different behavior. In this example, the AI algorithm needs various rules and conditions that let it make a decision matching reality as closely as possible.
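The kind of prioritized rules and conditions described here could be sketched as a simple ordered rule table, where higher-priority conditions (a medical emergency) override lower ones (an outstanding warrant). The function, rules, and destinations below are purely illustrative, not a real autonomous-vehicle policy.

```python
# Hypothetical sketch of prioritized decision rules, NOT a real AV policy.
def choose_destination(has_warrant: bool, medical_emergency: bool) -> str:
    rules = [
        (medical_emergency, "hospital"),       # life-threatening cases win
        (has_warrant, "police_station"),       # legal obligation comes next
        (True, "requested_destination"),       # default behavior
    ]
    # First matching condition in priority order decides the outcome.
    for condition, destination in rules:
        if condition:
            return destination

print(choose_destination(has_warrant=True, medical_emergency=True))
# prints "hospital": the emergency outranks the warrant
```

Testing this category means exercising exactly these conflicting-condition combinations and asserting that the higher-priority rule always wins.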
This challenge may occur when data is not labeled but can be divided into groups based on similarity and other measures of natural structure in the data. An example is organizing pictures by faces without names, where the human user has to assign names to the groups, as iPhoto does on Macs.
The complexity here is to get the groups right and continuously expand the data in a correct manner. In machine learning, there are various ways of clustering data sets, such as K-means, density-based methods, and others. An additional clustering use case applies to the biology field, where an ML solution can help classify different species of plants and animals.
Here is an example of the K-means algorithm:
Initialize k means with random values
For a given number of iterations:
    Iterate through items:
        Find the mean closest to the item
        Assign item to mean
    Update mean
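A minimal runnable version of these steps, sketched in Python on a one-dimensional toy data set (illustrative only, not production code):

```python
import random

def k_means(items, k, iterations=10, seed=0):
    """Toy 1-D K-means following the pseudocode above."""
    rng = random.Random(seed)
    # Initialize k means with random values (here: sampled from the data).
    means = rng.sample(items, k)
    for _ in range(iterations):
        # Iterate through items: find the closest mean, assign item to it.
        clusters = [[] for _ in range(k)]
        for x in items:
            closest = min(range(k), key=lambda i: abs(x - means[i]))
            clusters[closest].append(x)
        # Update each mean to the average of its assigned items.
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return sorted(means)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print([round(m, 3) for m in k_means(data, k=2)])  # [1.0, 10.0]
```

On this well-separated toy data the algorithm converges to the two natural cluster centers; real data sets need more care with initialization and the choice of k.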
The algorithms that are being developed and used must be based on the right characteristics. They must be trained based on large and cohesive data sets.
In addition to adding a new category to the classified test failures, such a persona must challenge the clustering logic as much as possible. This can be done through testing with parallel data sets to obtain as many outputs as possible in order to build trust in the clusters. In addition, as the product matures, new clusters as well as new data points will be added; this needs to be continuously tested and fed back into the testing processes.
Machine learning and AI algorithms are in many cases not well designed to deal with stochastic events (e.g., determining a weather forecast through analysis of massive data points from satellites and other sensors). ML can be limited and generate wrong outputs because it does not have the physical constraints of "real" platforms that are led by humans. As technology evolves, such limitations may shrink. For now, however, this is a category that requires the awareness of developers and testers.
They will need to understand the limitations and constraints of the algorithms in edge cases and situations such as the examples above, and either reroute the app to an alternative source or avoid using the algorithm altogether.
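One common way to implement such a reroute is a guard that falls back to an alternative source when an input falls outside the conditions the model was trained for. The names, ranges, and stand-in model below are hypothetical:

```python
# Hypothetical guard-rail pattern: route around the model when an input
# is outside the range it was trained on (all names are illustrative).
TRAINED_RANGE = (-10.0, 45.0)   # e.g. temperatures the model has seen

def model_forecast(temp):
    return f"model says: {temp + 1.0:.1f}"   # stand-in for a real model

def fallback_forecast(temp):
    return "deferring to human/alternative source"

def forecast(temp):
    low, high = TRAINED_RANGE
    if low <= temp <= high:
        return model_forecast(temp)
    # Stochastic/extreme event: avoid the algorithm altogether.
    return fallback_forecast(temp)

print(forecast(20.0))   # in range: the model answers
print(forecast(80.0))   # out of range: rerouted to the fallback
```

Test engineers can then assert that out-of-range inputs always trigger the fallback path rather than a confident but wrong model output.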
Test engineers will need to include the “human” scenarios in such use cases and challenge the apps in various happy and negative paths toward a trustworthy algorithm.
In this book, this word might be the most repeated one, since data resides at the core of all major AI/ML algorithms, and it determines the success or failure of apps that leverage them. When thinking of data, a few types of data-related failures can occur, as outlined below:
The algorithms must be trained with large, accurate sets of data that are relevant to the problems being handled and robust enough to cover varying conditions. Such algorithms also need to consider the failure types described above and below, such as ethics, deterministic approaches, stochastics, and more.
The entire test plan must include the right level of scenarios that challenge the apps and websites through various data points — good or bad. The test plan must also place proper assertions so that developers can understand the data-specific root cause of failure. Maintaining the tests over time and updating the test data is of course something that must be included in the test planning.
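A data-driven test with explicit assertions might look like the sketch below; `classify_sentiment` and the expected outputs are hypothetical stand-ins for a real model under test:

```python
# Illustrative sketch of data-driven test cases with explicit assertions.
def classify_sentiment(text):
    """Hypothetical stand-in for a trained model under test."""
    positives = {"great", "love", "excellent"}
    return "positive" if any(w in text.lower() for w in positives) else "negative"

# Each case pairs an input (good or bad data) with the expected output,
# so a failure points straight at the data-specific root cause.
test_cases = [
    ("I love this app", "positive"),
    ("excellent service", "positive"),
    ("", "negative"),                 # empty/edge-case data point
    ("!!!###", "negative"),           # garbage input must not crash
]

for text, expected in test_cases:
    actual = classify_sentiment(text)
    assert actual == expected, f"input {text!r}: expected {expected}, got {actual}"
print("all data-driven cases passed")
```

Keeping the cases in a table like this also makes the ongoing test-data maintenance mentioned above a matter of editing data, not code.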
This category refers to the validity and quality of the trained engine in handling unexpected results or outputs. This can be a real issue that in many cases overlaps with the above-mentioned categories. AI/ML models and algorithms are designed from the beginning to handle such issues, but when they fail, this category needs to be clearly and properly reported to developers. The inability of a model to recognize a relationship between two variables may result in a spurious inference, which will then result in a wrong output. Sometimes such an output can have serious consequences.
When developing the algorithm, teams must avoid pitfalls like p-hacking (also called data dredging or data fishing) and instead follow sound statistical practices, basing outputs on mountains of data only when a correlation between variables shows a statistically consistent result.
Testers must model the applications so that they are challenged by multiple variables from various angles, testing the reliability of the model, the relevancy of the outputs, and their consistency over time and use cases.
One of the common failures of ML/AI algorithms in this category is false positive results. These need to be identified and eliminated in the testing phases. A recommendation to testers is not to test the system after "looking" at the data, but rather to test using statistical approaches and pre-registered data, and only then analyze the app.
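The "pre-register, then test" idea can be illustrated with a small sketch: the correlation threshold is fixed before the data is examined, an unrelated variable fails the bar, and a genuinely related one passes. All names, data, and thresholds here are illustrative.

```python
import random

# Sketch of pre-registered statistical testing: fix the hypothesis and
# threshold BEFORE looking at results, instead of hunting for correlations.
PREREGISTERED_THRESHOLD = 0.3   # hypothetical value, decided up front

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(42)
feature = [rng.random() for _ in range(200)]
noise = [rng.random() for _ in range(200)]          # unrelated variable
signal = [x + 0.1 * rng.random() for x in feature]  # genuinely related

# The spurious pair should fail the pre-registered bar; the real one passes.
print(f"noise corr:  {pearson(feature, noise):.2f}")
print(f"signal corr: {pearson(feature, signal):.2f}")
```

Asserting against the pre-registered threshold, rather than whatever correlation happens to look strongest after the fact, is what keeps false positives out of the results.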
This is a very important failure category that is not only technical but also quite business related. If the selected model is not interpretable, it does not serve its purpose and will cause major regressions. Interpretability is a paramount quality that machine learning methods should aim for if they are to be applied in practice. As an example, if a model cannot produce simple, relevant, and understandable outputs for clients, it won't be used or accepted by them.
Models must translate the algorithm outputs in a meaningful and simple manner back to the users. Once developers can achieve this objective, they will get back relevant feedback from the users, together with growth of usage and system adoption.
Testers must focus on the business outcomes of such embedded ML/AI algorithms, so the product meets its purpose and drives back happy customers. Testing for unclear strings, chatbot outputs, translation problems, context-related issues, and more must be covered and reported back to the developers.
These are the defect types that are expected to pop up as more ML/AI models are embedded into our existing mobile and web applications. A single defect management system with proper classification can help developers distinguish the root causes of defects and resolve them fast.
Another key here is to properly divide and segment the two types of tests and validations within a single software iteration: test scoping that covers the platforms and the functional and non-functional tests, as well as the AI-specific cases. This will be the new normal for DevOps teams.
VP of Product Management, Perfecto
Tzvika Shahaf is the VP of Product Management at Perfecto. His experience includes business development, strategy, and investment in technology companies and venture capital firms. His passion is building new, powerful, and effective ways to collaborate with Global 2000 enterprises in order to resolve high-impact business problems using data-driven processes and analytics. Tzvika is partnering with leading DevOps teams to revolutionize the testing space by making it smarter, faster, and cost effective, with a clear goal of maturing the software delivery lifecycle. Tzvika is a keynote speaker at industry-leading events, a blogger, and a co-author of the book "Continuous Testing for DevOps Professionals: A Practical Guide from Industry Experts."