Tuesday, September 27, 2011
Metaautomation: Is this a failure that we’re better off ignoring for the moment?
Yesterday’s post was about automated analysis of test failures. Today’s post is about accelerating business value from automated tests even further: moving past transient failures in systems outside the team’s ownership, before the results are even reported.
In the age of information, systems are highly interconnected, and external dependencies are everywhere. Systems fail sometimes or just take too long for all sorts of reasons. Other systems (if done well) handle these hiccups gracefully.
Suppose the system that your team is working on is a high-impact system with external dependencies. It’s almost inevitable that there will be failures in the external systems that you really can’t do anything about (except maybe adjust your locally-defined timeouts). Will your system report these failures as potential action items? In the interest of generating business value from the software quality team, should it?
IMO this is a great opportunity to skip past common non-actionable failures to increase the chances of finding something that really is actionable. People are smart, but they’re expensive to keep around, so we don’t want them to spend time on something that can be automated away.
Notice I didn’t say to toss out the artifacts from failures that can be quickly judged non-actionable. These artifacts could be useful later, and as noted earlier, storage space is cheap :)
Suppose the team’s system is doing a financial transaction. An external authentication system (call it EAS) does a robust authentication for us. Sometimes, this external system times out, e.g. on the big shopping days after Thanksgiving or before Christmas. The system that the team owns handles this for external customers as gracefully as can be expected, and suggests trying again.
From the point of view of automated tests running on a version of the team’s system in a lab (or maybe, a live system with some automated tests as well) the EAS fails maybe several times an hour. The quality team doesn’t want to spend any time on these failures if it can avoid it, and it also doesn’t want to slow down the rest of the automated testing just because of a simple timeout; the team wants to get right to the more interesting test results if it can.
Suppose the engine that schedules and runs your tests is capable of repeating a test, or of running sets of atomic tests (see this post http://metaautomation.blogspot.com/2011/09/atomic-tests.html) and reporting on the set. In that case, on a failed test, it can try again and see whether the error is reproducible:
1. Stop retrying after N sequential failures that correlate with each other at strength C or better (and report the reproduced error as a single test failure, linked to the other failures in the set)
2. Stop retrying after M sequential failures (and report the failure set at a lower priority than a reproduced error; all failures are linked together for potential follow-up)
3. Stop retrying on success (and report the success, linked to any failed tries)
N could be configured as 2, and M might be 5. C could be 80%, or could be a set of correlation strengths depending on the correlation type, if multiple correlation types are used.
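The three stop conditions above can be sketched as a retry loop. This is a minimal illustration, not the blog’s actual engine: `run_test` and `correlate` are hypothetical stand-ins for whatever the test-distribution engine uses to execute a test and to measure how strongly two failures correlate.

```python
def run_with_retries(run_test, correlate, n=2, m=5, c=0.8):
    """Re-run a failed test until one of three stop conditions is met.

    n: stop after n sequential failures correlating at strength >= c
       (a reproduced error, reported as a single linked failure)
    m: stop after m sequential failures regardless of correlation
       (reported at lower priority, all tries linked together)
    A success stops immediately and is linked to any failed tries.
    """
    failures = []
    while True:
        result = run_test()
        if result.passed:
            # Stop condition 3: success; link any earlier failed tries.
            return {"verdict": "pass", "linked_failures": failures}
        failures.append(result)
        recent = failures[-n:]
        if len(recent) == n and all(
            correlate(a, b) >= c for a, b in zip(recent, recent[1:])
        ):
            # Stop condition 1: reproduced error, report as one failure.
            return {"verdict": "reproduced", "linked_failures": failures}
        if len(failures) >= m:
            # Stop condition 2: m tries with no reproduction.
            return {"verdict": "unreproduced", "linked_failures": failures}
```

With n=2, m=5, and c=0.8 as suggested, two back-to-back timeouts with matching signatures would be reported as one reproduced failure, while a string of unrelated flaky failures would stop at five tries and be filed at lower priority.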
All artifacts get saved. All individual test tries get analyzed by the system, because here the engine that distributes and runs the tests is also looking at correlations between the tests.
One-off failures don’t require immediate attention from anybody, although it would be useful to go back and examine them in a batch to make sure that the failure pattern doesn’t show some larger problem that the team needs to take action on.
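That batch examination could be as simple as grouping the archived one-off failures and flagging any signature that keeps recurring. A small sketch, assuming the saved artifacts carry a hypothetical `signature` field identifying the failure mode:

```python
from collections import Counter

def batch_review(saved_failures, threshold=3):
    """Group archived one-off failures by signature and return any
    signatures recurring often enough to suggest a larger problem
    the team should act on."""
    counts = Counter(f["signature"] for f in saved_failures)
    return [sig for sig, n in counts.items() if n >= threshold]
```

Run periodically over the saved artifacts, this keeps one-off failures out of anybody’s way day to day, while still surfacing a pattern like a steadily recurring external timeout.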
Fewer failures now end up on testers’ plates. People are much less likely to get bored and preoccupied with repeated test failures on external dependencies. The team is more productive, and looks more responsive and productive to internal customers. Product quality is more effectively measured and monitored, and for a high-impact product, that’s where you want to be!
Can this be applied to your project?