Mind maps are an effective way to quickly gather information and organize those ideas. For software testing, they are a great tool for sharing test plans with product owners and testers in an easy-to-comprehend manner. The visual clarity of mind maps presents content in a usable way.
Hexawise allows you to import and export mind maps. You can brainstorm ideas together (users, business analysts, product owners, testers, managers...) and agree on the important items to test. You can then import the mind map into Hexawise, and it will generate an optimized test plan with efficient combinatorial coverage (enhanced pairwise testing that exercises the interactions between parameters and parameter values).
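For readers new to the idea, the reason a pairwise plan stays small is that each test covers many parameter-value pairs at once. Here is a minimal sketch (hypothetical parameters, plain Python rather than Hexawise itself) that counts how many exhaustive combinations a small plan implies versus how many distinct pairs actually need covering:

```python
from itertools import combinations, product

# Hypothetical parameters for illustration only (not an actual Hexawise plan).
parameters = {
    "Browser":   ["Chrome", "Firefox", "Safari"],
    "OS":        ["Windows", "macOS", "Linux"],
    "Language":  ["English", "Spanish"],
    "User Type": ["Admin", "Guest"],
}

# Exhaustive testing needs one test per combination of values: 3 * 3 * 2 * 2 = 36.
exhaustive = 1
for values in parameters.values():
    exhaustive *= len(values)

# Pairwise (2-way) coverage only requires that every pair of values from every
# pair of parameters appears together in at least one test.
all_pairs = set()
for (p1, v1s), (p2, v2s) in combinations(parameters.items(), 2):
    all_pairs.update(frozenset([(p1, a), (p2, b)]) for a, b in product(v1s, v2s))

print(f"Exhaustive combinations: {exhaustive}")
print(f"Distinct pairs to cover: {len(all_pairs)}")
# Each single test of these four parameters covers 6 pairs at once, which is
# why far fewer than 36 well-chosen tests are needed to cover every pair.
```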
This interview with Mike Bland is part of our series of “Testing Smarter with…” interviews. Our goal with these interviews is to highlight insights and experiences as told by many of the software testing field’s leading thinkers.
Mike Bland aims to produce a culture of transparency, autonomy, and collaboration, in which “Instigators” are inspired and encouraged to make creative use of existing systems to drive improvement throughout an organization. The ultimate goal of such efforts is to make the right thing the easy thing. He's followed this path since 2005, when he helped drive adoption of automated testing throughout Google as part of the Testing Grouplet, the Test Mercenaries, and the Fixit Grouplet.
Hexawise: If you could write a letter and send it back in time to yourself when you were first getting into software testing, what advice would you include in it?
Mike: When I first started practicing automated testing and had a lot of success with it, I couldn’t understand why people on my team wouldn’t adopt it despite its “obvious” benefits. One of the biggest things that experience, reading, and reflection have afforded me is the perspective to realize now that different people adopt change differently, at different rates and for different reasons, and that you’ve got to create the space for everyone to adapt accordingly.
As I say in my most recent presentation, “The Rainbow of Death”, metrics and arguments are far from sufficient to inspire action in either the skeptical or the powerless, and the greater challenge is to create the cultural space necessary for lasting change.
Oh, and you’ve got to repeat yourself and say the same thing different ways multiple times—a lot.
Hexawise: What change management lessons did you learn while driving adoption of test automation methods at Google between 2005 and 2010? Which of those lessons were applicable when you were involved in the recent U.S. federal government effort to bring in talented tech people to introduce new ways of working with technology into the government? Which of those lessons were not?
Mike: The top objective is to make the right thing the easy thing. Once people have the knowledge and power to do the right thing the right way, they won’t require regulation, manipulation, or coercion—doing things any other way will cease to make any sense.
Most of what I learned in terms of specific approaches to supplying the necessary knowledge and power has come from trying different things and seeing what sticks—and I’m still working to make sense of why certain things stuck, years after the fact. The most important insight, as mentioned earlier, is that different people adopt change differently, for different reasons, and as a result of different stimuli. Geoffrey A. Moore’s Crossing the Chasm was the biggest eye-opener in this regard. Then, years later, when I saw fellow ex-Googler Albert Wong present his “Framework for Helping” to describe his first experience in the U.S. Digital Service, I instantly saw it snapping in place across the chasm, describing how the Innovators and Early Adopters from Moore’s model—who I like to call “Instigators”—need to fulfill an array of functions in order to connect with and empower the Early Majority on the other side of the chasm. Of course, through the filter of my own twisted sense of humor, I thought “Rainbow of Death” might make the model stick in people’s brains a little better.
So these models helped provide context for why the specific things the Testing Grouplet did worked; and how, despite the fact that there were many scattered, parallel efforts underway, they ultimately served to reinforce one another, rather than creating confusion and chaos. Of course, Google’s open communication channels and the Testing Grouplet’s shared vision—that emerged two years into our five year run—helped keep everything aligned. The point being, don’t wait for the clear vision and perfect plan up-front—start doing things and pay attention to what’s working, and why, and develop your plans as you go. That’s just good Agile practice, isn’t it?
And did I mention that you have to reiterate things you’ve already said in different language over and over—like, a lot?
Every lesson applied, in that the real lessons were about human nature, not technology. Google disabused me of the notion that one metric, one tool, or one method of persuasion would suffice to change an entire population's behavior. In other words, there's no silver bullet.
See the full interview for useful details.
The top objective is to make the right thing the easy thing. Once people have the knowledge and power to do the right thing the right way, they won’t require regulation, manipulation, or coercion—doing things any other way will cease to make any sense.
Hexawise: Describe a testing experience you are especially proud of. What discovery did you make while testing and how did you share this information so improvements could be made to the software?
Mike: Probably like many folks, I remember my first time the most vividly. Immediately on the heels of a death march—when my team barely got a steaming pile of other people’s code to meet a critical spec by a harsh deadline and very nearly would’ve killed one another were it not for Strongbad’s Emails to keep us one hair’s breadth away from going completely insane—we got some time and freedom to try to make the program faster.
I’d gotten the idea that we needed to rewrite a particular subsystem to take advantage of data we weren’t even using, and at about the same time, I happened to read an issue of the C/C++ Users Journal that had an article on using CppUnitLite, I believe. Unit testing sounded like a neat idea, so I practiced it at the same time I started rewriting this subsystem from scratch.
See the full interview for additional useful details.
Hexawise: In watching your videos and reading your content online, your ideas resonate with those of W. Edwards Deming, Russell Ackoff and Peter Senge from management, culture change and systems thinking perspectives. Who are your greatest influences in this area?
Mike: I’m a little ashamed to admit I haven’t read any of their stuff, or at least not much. Certainly what little I’ve gleaned of Deming resonates with my experience. I’ve begun reading Senge’s The Fifth Discipline, and while the introduction resonated very clearly, I’ve not yet read further. Ackoff is a new name for me (and thanks for the tip!). That said, it is gratifying when I do read an established author and find that, yep, more learned minds than mine have clearly articulated widely-accepted concepts that I’ve only figured out due to trial, error, and intuition.
In fact, one of the things I’m trying to do moving forward is to go back through the literature and connect it to the experiences I’ve had—not just for my own validation, but to reassure my audience and clients that the things I’ve done and the things I recommend aren’t all crazy talk. I’ve got Geoffrey A. Moore’s Crossing the Chasm model combined with Albert Wong’s model to form the Rainbow of Death, which comprises the core of my narrative now; and I’ve also recently added a very high-level view of Kurt Lewin’s theory of social change, which someone only recently suggested to me.
Views on Software Testing
Hexawise: In your online presentation, Making the Right Thing the Easy Thing, you note: [Use] "amplifying feedback loops to make sure knowledge is shared where needed as quickly and clearly as possible." How do you suggest this idea be applied by those involved with software testing?
Mike: Heh, that’s a paraphrase of the Second Way of DevOps (out of Three), originally articulated by Gene Kim. Clearly, the more you can automate testable cases, making them easy and fast to run, the better. People need to know that there are different kinds of automated tests for different levels of the software—they need to learn how to do the right thing the right way. Once developers in particular have gotten some traction with writing automated tests, then folks performing manual or system-level automated testing won’t waste their time catching (and re-catching!) bugs the developers could’ve easily caught, and can focus on truly pushing the limits of the software—reporting not just on whether it meets functional and nonfunctional requirements, but on the overall quality of the product.
In other words, a healthy balance of automated testing and manual testing plays to the strengths of all the humans and machines involved. When you’ve got optimal resource utilization happening, you eliminate a lot of both physical (in terms of slowness) and human friction, and a feeling of true partnership can take hold. Testers aren’t just the people reminding you that your code isn’t perfect—you’ve already reminded yourself of that through your own automated tests!—they’re the ones helping you make it even better.
it’s not about defects; it’s about feedback and collaboration. If you arrange incentives to produce an adversarial relationship between team members, e.g. if developers are incentivized to minimize defects and testers are incentivized to report defects, then that’s a house divided against itself.
Hexawise: What do you wish more developers, business analysts, and project managers understood about software testing?
Mike: Oh my. For one, it’s not about defects; it’s about feedback and collaboration. If you arrange incentives to produce an adversarial relationship between team members, e.g. if developers are incentivized to minimize defects and testers are incentivized to report defects, then that’s a house divided against itself. Some people think a degree of competition and/or adversarialism is a good thing, but when it comes to producing a product as a team—i.e. achieving a mission—you should keep it to a minimum in favor of fostering a spirit of collaboration.
Collaboration doesn’t mean blind consensus; it means communicating honestly in an environment in which we feel safe to do so, in which we share criticism in a spirit of mutual self-interest, not cutthroat competition.
One test type does not fit all. First, in terms of automated tests, unit testing can find a truly large number of errors, very quickly and cheaply, and tends to encourage better code quality (i.e. readability, maintainability, extensibility) overall. Integration tests can shake out errors and ambiguities between component contracts. High-level, developer-written system tests (as opposed to more extensive system tests developed by a dedicated tester) can quickly affirm that the entire product is in a buildable, runnable state. All of this “white box” testing by the developers is essential to giving the testers as high-quality a product as possible, so they can apply their “black box” techniques to push the product to its limits, rather than waste time alerting developers to defects they could’ve much more quickly, easily, and cheaply discovered themselves.
To this last point, I like to point to the examples of goto fail and Heartbleed. So many Internet “experts” threw up their hands and claimed that bugs like these were “too hard to test”. In both cases, after 2.5 years out of the industry (another story), I spent an evening diving into code I’d never seen before and wrote a test to reproduce each bug and validate its fix. After that, some liked to say, “Oh well, lots of other tools and techniques could’ve found these bugs.”
My claim isn’t that automated testing would’ve been the only way; my claim is that the discipline of automated testing likely would’ve prevented these bugs from ever existing even before writing a single test. With goto fail, the offending block of code was copied and pasted throughout the file six times! It was just that one of the six contained the errant “goto fail” line. But as I demonstrated with my version of the “fix”, extracting a common function and testing that six ways from Sunday likely would’ve avoided the problem entirely. In the case of Heartbleed, it was a failure to validate that an input buffer was actually as long as the user-supplied length indicated. Testing that kind of corner condition is unit testing 101, and the kind of thing you become more sensitive to every time you write a line of code once you’re in the habit of testing.
Hence, as difficult as it would’ve been for manual testing to discover these errors, and as long as it took for them to get shaken out months or years after their widespread deployment—Heartbleed via third-party fuzz testing, goto fail who knows how—both very, very likely could’ve been stopped dead in their tracks (or never would’ve existed!) if the developers were in the everyday habit of unit testing their code.
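To make that corner condition concrete, here is a minimal sketch of the kind of "unit testing 101" check described above, written in Python against a hypothetical build_heartbeat_response function (a stand-in for illustration, not the actual OpenSSL code or the regression test mentioned in the interview):

```python
import unittest

def build_heartbeat_response(payload: bytes, claimed_length: int) -> bytes:
    """Hypothetical handler standing in for the real heartbeat code.

    The Heartbleed-class mistake is trusting claimed_length; the fix is to
    validate it against the bytes actually received before echoing anything.
    """
    if claimed_length > len(payload):
        raise ValueError("claimed payload length exceeds actual payload size")
    return payload[:claimed_length]

class HeartbeatBoundsTest(unittest.TestCase):
    def test_honest_length_is_echoed(self):
        self.assertEqual(build_heartbeat_response(b"bird", 4), b"bird")

    def test_overstated_length_is_rejected(self):
        # The corner condition: the sender claims more bytes than it actually
        # provided. A correct implementation must refuse rather than read
        # past the end of the buffer.
        with self.assertRaises(ValueError):
            build_heartbeat_response(b"bird", 64_000)

if __name__ == "__main__":
    unittest.main()
```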
Hexawise: Our CTO, Sean Johnson, shared your memorably-named "Rainbow of Death" presentation with our management team. We absolutely loved it. In your presentation, you describe a series of concrete, practical steps you and your colleagues at Google took over the course of 5+ years to overcome resistance to change, educate teams, and successfully achieve broad adoption of automated testing efforts at Google across many teams, including lots of teams that were initially very change resistant. Can you please describe for our readers 2 or 3 noteworthy aspects of that change management journey?
Mike: What I hope the Rainbow of Death model, in combination with Geoffrey A. Moore’s Crossing the Chasm model, makes apparent is that different people adopt change differently. There are many needs to be met by and for many different people, and the chances of figuring out the perfect plan to execute before taking any action are practically zero. After all, don’t the Agile and DevOps models that are all the rage comprise tools and practices for adapting to change, for performing experiments and adjusting course based on feedback? Organizational change is no different, yet many people remain conditioned to expect waterfall-like solutions to their social problems.
Also, I mention in the talk that “The problem you want to solve may not be the problem you have to solve first.” In our case, we wanted to solve the problem of developers not writing enough automated tests. But first, we had to solve two other problems: People back then had very little exposure to or experience with automated testing, leading to the “My code is too hard to test” excuse, because they had no idea how to test it, or to write testable code to begin with.
The second problem was that the tools at the time couldn’t keep up with the growth of the company, its products, and its code base. It was growing ever more painful to write any code to begin with, yet delivery pressure was intense and Imposter Syndrome was rampant—on top of the fear of admitting your code might contain flaws, how could you make any time to learn how to write automated tests to begin with? Hence the “I don’t have time to test” excuse.
So we couldn’t just say “Testing is good! Yay testing! Please write moar testz!”
I think this mix of perspective, empathy, creativity, collaboration, tenacity, and patience is crucial to changing not only tech organizations, but society at large. I hope to put this notion, and the Rainbow of Death model in particular, to the test continuously throughout the remainder of my career.
my advice to both developers and testers is to identify the priorities, the social structures and dynamics at play in the organization. How can you work with these structures and dynamics instead of against them—or do you need to create a culture of open communication and collaboration in parallel with (or even before) communicating the testing message?
Industry Observations / Industry Trends
Hexawise: Large companies often discount the importance of software testing. What advice do you have for software testers to help their organizations understand the importance of expecting more from the software testing efforts in the organization?
Mike: Sadly, there’s no one message that works for every company, every culture, everywhere. It’s up to the Instigators in each environment to take the timeless principles I believe are essential—that testing is about feedback and collaboration, that different types of tests all catch different and important bugs, that developers and testers have different and mutually-reinforcing roles to play—and find the right cultural hooks to hang those messages on. In the case of Google, it took the Testing Grouplet five years to figure out and successfully implement, and it took an array of parallel efforts across multiple groups to saturate the culture with the message, not just one magical tool or technique or team to bind them all.
So my advice to both developers and testers is to identify the priorities, the social structures and dynamics at play in the organization. How can you work with these structures and dynamics instead of against them—or do you need to create a culture of open communication and collaboration in parallel with (or even before) communicating the testing message? This is the punchline of my Rainbow of Death presentation: The problem you want to solve may not be the problem you have to solve first, and the Standard Narrative from which all the problems emerged will not produce any solutions—though it may provide the keys necessary to unlock effective solutions.
At Google, it was the Testing Grouplet’s Test Certified program and all the other education and tooling efforts supporting it that provided the right hook—after two years of experimentation and reflection! But don’t focus on what Test Certified comprised: focus on why that approach worked for us, and see if that reflection inspires an approach that will work for your company.
In addition to that, it probably wouldn’t hurt to remind anyone who’ll listen of goto fail and Heartbleed, and how basic unit testing practices and the coding habits they encourage could’ve prevented these potentially catastrophic defects from even being written in the first place.
Hexawise: The story of your team's journey is fantastic. We highly recommend it to IT organizations embarking on any large improvement effort. Thank you for sharing it. We've recently started using elements of your approach to help our clients successfully adopt test optimization approaches at scale in their organizations.
Mike: That’s incredibly gratifying to hear! Please keep me in the loop of how well it’s working for you and your clients. Just as the impact of the Testing Grouplet’s efforts was far greater than the sum of any individual part, and as my Rainbow of Death presentation benefited enormously from the input of trusted fellow Instigators to help illustrate it, I’m sure there are many more insights and improvements waiting to emerge from the model once more people have applied it farther and wider than I ever could on my own!
Hexawise: Do you believe the DevOps movement is resulting in better software testing within organizations? Do you see any other trends that software testers could leverage to promote improved application of software testing practices?
See full interview for Mike’s response.
Staying Current / Learning
Hexawise: What software testing-related books would you recommend should be on a tester’s bookshelf?
Mike: Sadly, I’m not current enough to make any solid recommendations. In my career, I’ve moved more into the culture change space than being a 100% in-the-trenches practitioner. That said, I certainly did years of quality time with my old “library” of programming and algorithms books, and was fortunate to be part of a culture that itself generated a broad swath of automated testing knowledge. Though I don’t keep up with the details of the latest developments, that time spent internalizing the core principles has served me very well throughout my career.
That said, I’m sure that great books exist, and people dedicated to the craft would do themselves a great service by discovering them and spending years plumbing their depths, rather than trying to read every book on the subject forevermore. That’s the model that worked for me, at least; but perhaps a more voracious reading regimen suits you better. Everybody’s different.
Mike aims to produce a culture of transparency, autonomy, and collaboration, in which “Instigators” are inspired and encouraged to make creative use of existing systems to drive improvement throughout an organization. The ultimate goal of such efforts is to make the right thing the easy thing. He's followed this path since 2005, when he helped drive adoption of automated testing throughout Google as part of the Testing Grouplet, the Test Mercenaries, and the Fixit Grouplet. He was instrumental in the execution of Test Certified and Testing on the Toilet, and the four company-wide Fixits he organized led to the development and rollout of the Test Automation Platform. His account of Google’s automated testing adoption also appears as a case study in The DevOps Handbook by Gene Kim, et al.
He also served as a member of the Websearch Infrastructure team, which practiced DevOps before he was aware it had a name. Frequently working in concert with other indexing infrastructure teams, he also worked closely with Release Engineers and Site Reliability Engineers to package, release, deploy, and monitor multiple indexing services.
Most recently he served as Practice Director at 18F, a technology team within the U.S. General Services Administration, where he personally launched and drove several initiatives to increase 18F’s capability as a learning organization, including the Pages platform, the Guides series, and the Handbook.
This interview with James Bach is the first in our series of “Testing Smarter with…” interviews. Our goal with these interviews is to highlight insights and experiences as told by many of the software testing field’s leading thinkers.
James Bach, one of the most well-known and controversial leaders in the software testing community, challenges himself and others to continually develop their software testing approaches. James believes that excellent testing is a craft that requires many skills and ongoing practice and focus to develop and maintain those skills. The skills of testing include general systems analysis and critical thinking, but also social skills. In some sense, any child can test. But children and other amateurs cannot test systematically, nor can they provide professional self-assessment and reporting on the testing they do.
James Bach, Founder and CEO of Satisfice Inc
Hexawise: What one or two software testing-related experiences have you found to be most personally satisfying in your career?
James Bach: In 2002, Microsoft complained to a federal judge that I hadn’t given it a power cord. Yes, an ordinary power cord of the kind you can pull out of the back of any standard desktop computer. Yes, to a federal judge. No, I am not making this up. Yes, I also thought it was bizarre-- bizarre but kind of satisfying.
This happened during the Microsoft Remedies Trial, wherein nine American states were suing Microsoft and the government because they wanted a tougher punishment for Microsoft after it lost its big antitrust case. The states hired me as an expert witness to find out if Microsoft was telling the truth when it claimed it was “technically infeasible” to remove IE and the Windows Media Player. I gathered a team and went to work. I soon discovered that it was possible to remove these things-- using only public information and Microsoft’s own helpful tech support people to set up my testing to prove it. (The tech support people did not realize the purpose of my questions, and cheerily gave me all that I needed to know.)
When I revealed my results, Microsoft demanded that I turn over all the materials necessary to reproduce my results. So I gave them one of my test systems. At the last moment, I pulled out the power cord, thinking that I was causing some low-level techie 30 seconds of annoyance. But the next day Microsoft was in court acting like I had withheld the Golden Power Cord of Truth.
It was satisfying because Microsoft never even attempted to prove me wrong on the facts. They used lawyer tricks to stop my truth bombs, instead. Way to go, Bill Gates.
Another satisfying moment was watching my son find a catastrophic bug in a life-critical piece of medical equipment. He didn’t ask for a spec or a test case specification. He used video-gamer techniques to confuse the system until it overrode its own safety features and melted itself inside the simulated patient. Yeah. Melted. That was a $3000 piece of equipment he ruined. I’ve rarely been prouder of myself for having such foresight as to create a son like him: he is such a good test tool.
people who don’t “embrace exploratory testing” are, to me, not even testers. They are fact checkers, maybe. I think that’s not good enough.
Hexawise: Failures can often lead to interesting lessons learned. Do you have any noteworthy failure stories that you’d be willing to share?
James: How about the time I tried to set up my corporate server? I got it all working. Then I moved it from the conference room to the server room. I couldn’t get it to come online after that. For 12 hours I worked on it, all through the night. At long last I relented to my brother’s suggestion-- that we move it back to the conference room. I had refused to do that because it didn’t make any sense. The conference room simply connected to the server room. How would adding an extraneous variable like that change anything? But, zoom, we were back online.
After a moment of “wha??” the solution flashed into my mind: we must have two feeds to the Internet instead of one. It turned out that the conference room was patched into the open net, but the other port I had been using in the server room was routed through a firewall which in turn connected to the net. Nobody told me this during the buildout of my office space. It was a completely missing possibility in my mind.
So what did I learn? I learned about the importance of de-focusing, which includes trying apparently silly things to solve problems. I’m more open to that now.
Here is another interesting failure. I recently wrote a report involving the calculation of percentages. A non-technical person (a lawyer I worked with) checked my math and found it to be wrong. In fact, every one of her calculations was wrong. But in the process of refuting her claims, I discovered a different error in one of my own numbers. So, isn’t that interesting? Even if a critique of your work is incorrect, it could still be a useful stimulant to help you find your own problems.
Hexawise: What kinds of activities do you enjoy when you’re not at work?
James: I run a business, so I feel like I’m always at work. But I guess I do take little bits of time off each day. What do I do? I daydream. I read science news. I solve math and logic puzzles. I try to walk each day. And I watch videos with my wife. We binge on English television series, mostly.
Hexawise: Describe a testing experience you are especially proud of. What discovery did you make while testing and how did you share this information so improvements could be made to the software?
James: Well, many of those things I now use as testing exercises for my students, so I don’t want to spoil them.
But, hmm, here’s one. I was given one day to break into an invoicing system for a large pharmaceutical company. I found three ways to do it. One of the methods I used was to get one of the sales engineers to sit with me while I tested. I asked him to demo the system to me and then he hung around while I tried to break in. The first time I broke in (using a traversal attack if you follow such things) I didn’t even know I had done it until the sales engineer said “hey you aren’t supposed to see that data.” Good thing he was there, huh? So part of testing can be charming people into helping you, and you never know what that help will bring.
Views on Software Testing
Hexawise: Some of the thought-provoking ideas you and Michael Bolton have come up with, like the important distinction between Testing vs. Checking have received a great deal of attention within the community. Other intellectual contributions to the community you have made are not as well known but are arguably equally important and insightful. One such contribution that comes to mind which really resonates with us at Hexawise is the exploratory-scripted (or formality) continuum you and your brother Jon described.
Do you have one or two intellectual contributions to the community that you wish were more widely known?
James: I wish that more people understood the folly, the sheer silliness, of counting test cases and calculating pass rates.
I don’t care if you have 80 test cases or 8 million of them. That number tells me nothing about you or your testing. It tells me nothing by itself, and it tells me nothing in conjunction with other information (except in rare cases not worth talking about). It’s like telling me that you have broken your day into 27 tasks, or 1353 tasks, or whatever. Just stop. Instead of fake science smoke rings, tell me what you actually did. Here’s a simple suggestion: instead of giving me a number, give me a list: a list of test ideas, test cases, test activities, bugs, features, people… I can do something with a list. But if you give me a number I just have to say “show me the things you are counting.”
Hexawise: Can you describe a view or opinion about software testing that you have changed your mind about in the last few years? What caused you to change your mind?
James: I used to think it was useful to talk about exploratory testing. But now I think it’s more helpful to say that all testing is exploratory. To say “exploratory testing” is the same as saying “testing.” Instead, I speak of how testing can be more or less formal, but it is always informal to some degree or else it ceases to be testing.
Also, in the last few years I have concluded that term “test automation” is toxic and should be avoided. It deposits a little poison in the mind whenever it is uttered. An angel loses its wings every time someone calls himself an “automated tester.”
I changed my mind on both things as the result of ongoing attempts to teach students and hearing their questions and seeing where they get confused. That, and deep conversations with Michael Bolton.
Hexawise: How do/would you test very complex systems such as genetic algorithm systems and evolutionary systems? How do you test systems when we don't understand how they work? It seems kind of like medical differential diagnosis: poke, observe, learn, hypothesize, poke again. Or is there a better way?
James: I test them using social science methods. That, after all, is how scientists attempt to test their theories about social life. That means an emphasis on qualitative analysis, but bringing in statistical methods whenever applicable.
I agree that the medical world is a good example of where statistical methods and heuristic approaches are also needed.
In testing complex things, some of what you need to do includes:
You must use time to your advantage-- observing systems over time the way primatologists observe chimps in the wild.
You must use Grounded Theory, beginning with immersion and observation, until patterns begin to reveal themselves.
You must focus on testability, creating an environment where you can control and observe more of what is there.
You must pay attention to clues. Many, many clues. Stop looking for simplistic “test cases” that will “prove” that the software works.
You must become expert at data wrangling, since these systems usually involve huge amounts of data.
Let other people help you.
Forge partnerships with users.
If it’s a training gig then my objective is to show them what testing can be, show them a path to get there, and encourage them to walk that path. A lot of that is about removing the obstacles to moving along.
Hexawise: It is clear from your writings and frequent presentations, that you feel passionately that the software testing community would greatly benefit if far more testers embraced Exploratory Testing. It’s a deeply held conviction. What particular testing practice(s) do you most wish the software testing community would embrace?
James: Testing is exploratory. So, people who don’t “embrace exploratory testing” are, to me, not even testers. They are fact checkers, maybe. I think that’s not good enough.
I wish more testers were mathematically inclined. You must see this, too, at Hexawise-- the widespread math-phobia in our field. I want to talk about Karnaugh maps and the value of de Bruijn sequences. But I have to keep that stuff out of my classes or I will freak most people out. It’s not that I am a mathematics expert. I’m just an enthusiast who wants to be held to a higher standard. But even my dalliances in Bayesian belief nets sound like high elf incantations to most testers. Mathematical disabilities, in general, make our craft prey to quackery and fraud of all kinds.
At the same time, I want to be inclusive. Mathophobes have a lot to offer and I don’t want them to think I don’t welcome them. But must they necessarily be the majority of testers? I guess I’m saying I want a cure for mathophobia, please.
Hexawise: What do you wish more developers, business analysts, and project managers understood about software testing?
James: I wish they understood that it benefits from specialization. When software people get heart disease they don’t limit themselves to a GP, do they? If they need surgery they don’t insist on being operated on by a rotating team of generic medical people who took a three day class in “Agile medicine” do they?
I get to be a specialist tester mainly because I do it on my own time. My clients are paying me, usually, for training, not for doing testing. So I usually test in my “free time” to sharpen my skills. I recently did have a wonderful and lucrative testing-related gig (because this particular project knew that it needed the best tester and analyst it could possibly get and became convinced I was the Chosen One), but those gigs are few and far between for someone like me.
Hexawise: When individual companies hire you for consulting engagements, how would you describe what it is that you usually seek to provide to them?
James: If it’s a training gig then my objective is to show them what testing can be, show them a path to get there, and encourage them to walk that path. A lot of that is about removing the obstacles to moving along. Chief among those obstacles is lack of confidence. So I do a lot of pep talking. Another obstacle is the very primitive, mechanical way that people think about testing. I have to replace that with systems thinking.
If it’s a testing gig then my objective is usually to provide deep, exemplary testing, that is transparent to my client. I want them to feel that they see their product in a beautiful focus. The danger I am always in when I test is that I will get too deep (and therefore be too slow and expensive). But for me, deep testing is the most fun, so it’s a constant struggle to hold myself back from using my most penetrating methods and tools.
Industry Observations / Industry Trends
Hexawise: As Artificial Intelligence increases in capability (for example, the strides made by Watson), do you foresee an increase in the capabilities of computer checking? I am thinking not of an elimination of the difference between human-led testing and what can be done without people, but of the extent to which you see the possibility for AI to do a progressively much better job of checking.
James: I foresee a collapse of critical thinking about these complex systems, followed by some sort of disaster, followed by a new realization of the risks of surrendering human judgment to a machine. I foresee that this will be an ongoing cycle.
This collapse of critical thinking will lead to more shallow testing and perfunctory checking, presented as if it were deep. For an example of what I mean, see this old computer commercial:
Pay attention starting at 2:05. Oh look, the computer is assuring us that it has no errors. Everyone relax! Its “electronic brain” can be trusted!
Hexawise: Do you have any predictions about how large an impact Artificial Intelligence and Machine Learning will have on software testing in the next 5-10 years?
James: I don’t think it will have any impact on testing as such, except inasmuch as many people (not skeptical testers, but people who might otherwise hire testers) will trust black boxes when they should be challenging them.
I suppose as machine learning becomes more available to the masses, someone might try to train one to recognize bad output of some kind. That’s a sort of test tool. But it would only apply to well-established kinds of badness.
If you think about it, anti-spam systems are a sort of test system. Machine learning is used in spam filters. So, I guess testing is already using machine learning in that sense. But I don’t see the average tester applying machine learning methods to testing. I don’t see a developer doing that, either. It’s too involved and complicated; too narrow in application.
I hear that people at Google are going to “put coders out of business” with a system that writes code based on people just talking. You know what that’s called? A compiler. They are inventing a high level compiler. Now the people who talk will be called developers and will have to learn to talk properly, because it will emerge that normal people can’t say what they actually want.
Hexawise: Have you seen a particularly effective process where the software testing team was integrated into the feedback from a deployed software application (getting feedback from users on problems, exploring issues the software noted as possible bugs...)? What was so effective about that instance?
James: Not really. What I see is developers ignoring feedback. It’s too overwhelming. I suspect there are people who are really good at doing that. But I haven’t run across any.
One of the things that has happened with DevOps is a de-emphasis on testing and more of an emphasis on overall risk management. That’s a valid strategy, of course, but it has interesting blind spots. Whenever I hear a developer speak about wanting feedback from users I immediately think about how abusive and incompetent most users are about reporting problems. No, my developer friends, you really don’t want to read all those Internet comments on your software. You will be demoralized. But testers? We love reading that stuff. It’s our wheelhouse. We get clues and then we can reproduce the problems and make them sensible for the devs.
My brother, at eBay, with his testing mentality, loves going over the user feedback and bringing it to the teams there. But he will tell you it’s a constant struggle to get the attention of the dev teams.
attend a conference. Don’t bother to go to the talks, though. Most of the talks are full of fluff. Instead, find people and talk to them. Compare notes, make friends. Go to the testing lab.
Hexawise: Often one of the major roadblocks to software testers is their own management. Do you think this is a fair statement? Do you have suggestions for how testers can attempt to improve the situation? My background is strongly influenced by W. Edwards Deming, so I have a tendency to look at the organization as a system and see room to improve the management system. It seems to me the biggest gains are often not possible if we keep departments separate (software development, software testing, marketing, customer service...). We can make improvements in software testing even if it is largely seen as separate from the organization, but in doing so we miss much greater potential improvement.
James: The collapse of the test management industry is a terrible problem. It’s getting harder to find any kind of test manager out there. Do they even exist in Silicon Valley any more or have they all been hunted down by parasitic wasps who lay “scrum master” eggs in their living carcasses?
People who seem to know little about management or testing tell me that test managers are not needed. Okay, that means a whole lot of things that test managers do will not get done. This includes: providing a protected place for testers to work, free of harassment; negotiating for testability; negotiating for resources; assuring that schedules are reasonable; assuring that testing gets the respect it requires in order to attract and keep talented people; assuring that deadlines are met; explaining testing to management; assuring that testers are properly trained. When those things aren’t happening, testers tend to become more zombie-like and reactive (I’m not speaking of those fast zombies); or they become cheerleaders for the devs, instead of critical thinkers.
Staying Current / Learning
Hexawise: How do you stay current on improvements in software testing practices?
James: I’m not convinced there are improvements in testing practices in the absence of improvements in the thinking and social systems that drive practices-- and those things don’t improve much, as I’m sure you’ve noticed. Seems to me that the current nonsense in our craft is very similar to the old nonsense. Maybe some of the buzzwords have changed, but not much else.
The landscape of testing has definitely changed. Agile and Lean have aggressively colonized a lot of the testing space. Since most testers are young people (and test management has been eviscerated) they are easy pickings for the Agile Universalists (the people who think that we don’t need testers because we can just all test whenever we feel like it).
What this means is that testing remains rather primitive wherever I go (with a few interesting exceptions, driven inevitably by a single enlightened Elrond-like or Galadriel-like manager, who always seems to disappear off into the Grey Havens within a couple of years of me meeting him or her).
How I become aware of new and interesting ideas is through my community. For instance, a student told me about Karnaugh maps the other day and now I am trying to find a use for them.
Hexawise: How would you suggest testers stay current?
James: I don’t think currency is a thing in testing, except with respect to learning about certain emerging technologies and buzzwords.
The bigger thing in testing is to push us forward, which not enough testers are trying to do. Don’t worry about currency, worry about whether you truly understand testing, and keep working on that study.
Read widely about science. Get ideas from that. And play with the ideas. For instance, I read on Hacker News about 350,000 free images being released by the Metropolitan Museum of Art. I decided to experiment with turning that into a practical resource for test data. This led to playing with data wrangling and image analysis tools.
Also, attend a conference. Don’t bother to go to the talks, though. Most of the talks are full of fluff. Instead, find people and talk to them. Compare notes, make friends. Go to the testing lab.
Or host a little conference. Invite testers to a small gathering where you can share experience reports.
Hexawise: What software testing-related books would you recommend should be on a tester’s bookshelf?
James Bach has authored two books and consulted and presented on software testing worldwide. It is difficult to put into words how unique and insightful James is. In order to get a feel, we suggest listening to his presentations yourself and reading his excellent blog.
what areas did we spend a lot of effort on that did not result in learning much?
what lessons learned should we incorporate for next time?
The idea of deliberately examining your software development and testing practices will be familiar to those using agile retrospectives. The power of continually improving the development practices used within the organization is hard to appreciate, but it is immense. The gains compound over time, so the initial benefits are only a glimpse of what can be achieved by continuing to iterate and improve.
Performance testing examines how the software performs (normally "how fast") in various situations.
Performance testing does not just result in one value. You normally performance test various aspects of the software in differing conditions to learn about the overall performance characteristics. It can well be that certain changes will improve the performance results for some conditions (say a powerful laptop with a fiber connection) and greatly degrade the performance for other use cases. And often the software can be coded to attempt to provide different solutions under different conditions.
All this makes performance testing complex. But trying to over-simplify performance testing removes much of its value.
Another form of performance testing is done on sub-components of a system to determine which solutions may be best. These are often server-based issues. They likely don't depend on individual user conditions but can be impacted by other things: for example, under normal usage option 1 provides great performance, but under larger load option 1 slows down a great deal and option 2 is better.
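As a rough illustration of that kind of comparison, here is a small, self-contained benchmark sketch (the two "options" are toy stand-ins, not real server components) that times each option under a normal load and a larger load:

```python
import time

def lookups_with_list(items, queries):
    """Option 1: linear scans. Cheap to set up; degrades as the data grows."""
    return sum(1 for q in queries if q in items)

def lookups_with_set(items, queries):
    """Option 2: pays an upfront cost to build a set, then each lookup is O(1)."""
    index = set(items)
    return sum(1 for q in queries if q in index)

def measure(fn, items, queries, repeats=3):
    """Best-of-N wall-clock time for fn under the given load."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(items, queries)
        best = min(best, time.perf_counter() - start)
    return best

for size in (10, 2_000):  # "normal usage" versus "larger load"
    items = list(range(size))
    queries = list(range(0, size * 2, 2))
    t_list = measure(lookups_with_list, items, queries)
    t_set = measure(lookups_with_set, items, queries)
    print(f"n={size:>6}: option_1 (list)={t_list:.5f}s  option_2 (set)={t_set:.5f}s")
```

At small sizes the two options look interchangeable; as the load grows, the relative results change, which is exactly why performance results have to be reported per condition rather than as a single number.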
Software testing concepts help us compartmentalize the complexity that we face in testing software. Breaking the testing domain into various areas (such as usability testing, performance testing, functional testing, etc.) helps us organize and focus our efforts.
But those concepts are constructs that often have fuzzy boundaries. What matters isn't where we should place certain software testing efforts. What matters is helping create software that users find worthwhile and, hopefully, enjoyable.
One of the frustrations I have faced in using internet-based software in the last few years is that it often seems to be tested without considering that some users will not have fiber connections (and might have high-latency connections). I am not certain latency (combined perhaps with lower bandwidth) is the issue, but I have often found websites either physically unusable or practically unusable (way too frustrating to use).
It might be that the user experience I face (on the poorly performing sites) is just as bad for all users, but my guess is that it is a decent user experience on the fiber connections the managers have when they decide this is an OK solution. It is a usability issue, but it is also a performance issue in my opinion.
It is certainly possible to test performance results on powerful laptops with great internet connections and get good performance results for web applications that will provide bad performance results on smart phones via wifi or less than ideal cell connections. This failure to understand the real user conditions is a significant problem and an area of testing that should be improved.
I consider this an interaction between performance testing and user-experience testing (I use "user-experience" to distinguish it from "usability testing", since I can test aspects of the user experience without users testing the software). The page may load in under 1 second on a laptop with a fiber connection, but that isn't the only measure of performance. What about your users who are connecting via a wifi connection with high latency? What if the performance in that case is that it takes 8 seconds to load, and your various interactive features either barely work or won't work at all given the high latency?
In some cases ignoring the performance for some users may be OK. But if you care about a system that delivers fast load times to users you need to consider the performance not just for a subset of users but consider how it performs for users overall. The extent you will prioritize various use cases will depend on your specific situation.
I have a large bias for keeping the basic experience very good for all users. If I add fancy features that are useful, I do not like to accept meaningful degradation to any user's experience - graceful degradation is very important to me. That is less important to many of the sites that I use, unfortunately. What priority you place on it is a decision that impacts your software development and software testing process.
Hexawise attempts to add features that are useful while at the same time paying close attention to making sure we don't make things worse for users that don't care about the new feature. Making sure the interface remains clear and easy to use is very important to us. It is also a challenge when you have powerful and fairly complex software to keep the usability high. It is very easy to slip and degrade the users experience. Sean Johnson does a great job making sure we avoid doing that.
Maintaining the responsiveness of Hexawise is a huge effort on our part given the heavy computation required in generating tests in large test case scenarios.
You also have to realize where you cannot be all things to all people. Using Hexawise on a smart phone is just not going to be a great experience. Hexawise is just not suited to that use case at all and therefore we wouldn't test such a use case.
For important performance characteristics it may well be that you should create a separate Hexawise test plan to test the performance under several different conditions (relating to latency, bandwidth, and perhaps phone operating system). It could be done within a single test plan; it just seems to me that separate test plans would be more effective most of the time. It may well be that you have the primary test plan cover many functional aspects and a much smaller test plan just to check that several things work fine in a high-latency, smart-phone use case.
Within that plan you may well want to test out various parameter values for certain parameters.
Of course, what should be tested depends on the software being tested. If none of the items above matter in your case they shouldn't be used. If you are concerned about a large user base you may well be concerned about performance on various Android versions since the upgrade cycle to new versions is so slow (while most iOS users are on the latest version fairly quickly).
If latency has a big impact on performance then including a parameter on latency would be worthwhile and testing various parameter values for it could be sensible (maybe high, medium and low). And the same with testing various levels of bandwidth (again, depending on your situation).
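One hedged sketch of how such a plan might be consumed in an automated suite: a pytest test parametrized over a hand-picked sample of conditions. The values, the simulate_page_load stand-in, and the 8-second budget are all hypothetical; a real suite would throttle a real connection (via your proxy or browser-automation tooling) and measure a real page load.

```python
import pytest

# Hand-picked sample of network/device conditions (hypothetical values; a real
# set could be exported from your test design tool).
CONDITIONS = [
    # (latency, bandwidth, device)
    ("low",    "fiber", "laptop"),
    ("high",   "3g",    "older-android"),
    ("medium", "wifi",  "iphone"),
    ("high",   "fiber", "iphone"),
    ("low",    "3g",    "laptop"),
]

def simulate_page_load(latency, bandwidth, device):
    # Stand-in so the sketch runs; replace with code that applies the given
    # network conditions and measures the actual load time.
    return {"low": 1.0, "medium": 2.5, "high": 6.0}[latency]

@pytest.mark.parametrize("latency,bandwidth,device", CONDITIONS)
def test_page_loads_within_budget(latency, bandwidth, device):
    load_time_seconds = simulate_page_load(latency, bandwidth, device)
    assert load_time_seconds < 8.0, f"too slow on {device} ({latency} latency, {bandwidth})"
```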
Usability testing is the practice of having actual users try the software. Outcomes include data on the tasks given to the user to complete (successful completion, time to complete, etc.), comments the users make, and expert evaluation of their use of the software (noticing, for example, that none of the users follow the intended path to complete a task, or that many users looked for a different way to complete a task and, failing to find one, eventually found a way to succeed).
Usability testing involves systematic evaluation of real people using the software. This can be done in a testing lab where an expert can watch the user, but this is expensive. Remote monitoring (watching the screen of the user; communication via voice between the user and the expert; and viewing a webcam showing the user) is also commonly used.
In these settings the user will be given specific tasks to complete and the testing expert will watch what the user does. The expert will also ask the user questions about what they found difficult and confusing (in addition to what they liked) about the software.
The concept of usability testing is to get feedback from real users. If you can't test with the actual users of a system, it is important to consider whether your usability testers fairly accurately represent that population. If the users of the system are fairly unsophisticated and you use usability testers who are very computer savvy, they may well not provide good feedback (as their use of the software may be very different from that of the actual users).
"Usability testing" does not encompass experts evaluating the software based on known usability best practices and common problems. This form of expert knowledge of wise usability practices is important but it is not considered part of "usability testing."
When you outsource testing responsibilities to a third party, there are several key points to consider beyond the three most frequently asked questions of:
What is the cost per hour of your services?
What evidence can you provide of your firm’s level of expertise in our industry and/or the technical skills we feel you’ll need? and
What is your plan for increasing testing automation?

In particular, it is worth developing a thorough understanding of how the vendor will identify what kinds of things should be tested, how those things should be tested, and how that information will be communicated to you.
How do you ensure testing is performed thoroughly - beyond just “happy path”/ confirmatory tests?
[Someone] said “I have only confirmed that it works. I haven’t tested it.”
Beware of answers along the lines of “we have two positive tests and one negative test” for every requirement. That’s too simplistic of an approach that potentially leaves too much of your system untested.
Do you use Exploratory Testing broadly? Would you recommend we use it? Exploratory Testing is a well-established testing approach.
Champions of exploratory testing include James Bach and Michael Bolton.
If your vendor hasn’t heard of exploratory testing and starts to explain the disadvantages of undisciplined “ad hoc testing,” it’s a warning sign that they might have rigid tunnel vision about what good testing entails. Exploratory testing is not ad hoc testing, but those who haven't learned about exploratory testing may confuse the two.
How do you communicate the diminishing marginal returns that exist when you execute sets of test scripts?
In Hexawise, the Analyze Tests screen is how we help show this concept. That screen shows both the percentage of parameter-value pairs tested and the coverage matrix showing which parameter-value pairs have been tested and which remain untested.
This view can be seen for any specific number of tests in the test plan (using the adjustable bar at the top of the image). This image shows that 80% of the paired values have been tested after 10 test cases. Using the slider, this view lets you see exactly which parameter-value pairs are untested at each point. In this example, 15 tests provide 96% pairwise coverage and 19 tests cover all pairwise combinations.
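A rough, self-contained sketch of the same idea (hypothetical parameters and a hand-written test list, not Hexawise's algorithm or UI) shows the coverage percentage climbing quickly at first and then flattening:

```python
from itertools import combinations, product

# Hypothetical 3-parameter plan (illustration only).
parameters = {
    "Browser": ["Chrome", "Firefox", "Safari"],
    "OS":      ["Windows", "macOS"],
    "Payment": ["Card", "PayPal"],
}

# Every parameter-value pair that 2-way coverage must hit at least once.
all_pairs = set()
for (p1, v1s), (p2, v2s) in combinations(parameters.items(), 2):
    all_pairs.update(frozenset([(p1, a), (p2, b)]) for a, b in product(v1s, v2s))

# A hand-written test sequence (each test assigns one value per parameter).
tests = [
    {"Browser": "Chrome",  "OS": "Windows", "Payment": "Card"},
    {"Browser": "Firefox", "OS": "macOS",   "Payment": "PayPal"},
    {"Browser": "Safari",  "OS": "Windows", "Payment": "PayPal"},
    {"Browser": "Chrome",  "OS": "macOS",   "Payment": "Card"},
    {"Browser": "Firefox", "OS": "Windows", "Payment": "Card"},
    {"Browser": "Safari",  "OS": "macOS",   "Payment": "Card"},
]

covered = set()
for i, test in enumerate(tests, start=1):
    covered.update(frozenset(pair) for pair in combinations(test.items(), 2))
    pct = 100 * len(covered & all_pairs) / len(all_pairs)
    print(f"after test {i}: {pct:.0f}% of pairwise combinations covered")
```

In this hand-written set the numbers climb fast and then flatten, and one pair (Chrome with PayPal) is still uncovered after all six tests, which is exactly the kind of gap the coverage matrix makes visible.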
Be conscious that your vendor has a strong conflict of interest here. 80% 2-way coverage might be sufficient for your needs on a given project, but testing to that level of thoroughness (instead of running the complete set of tests) could cut their billable hours roughly in half, in which case they may push for 100% 2-way coverage (to increase billable hours). Of course, 100% 2-way coverage may genuinely be called for; these decisions have to be made with an understanding of the software being tested.
Implications: ideally, you would like a testing vendor that is happy to openly share and discuss the tradeoffs that exist and their strategy for getting you the best results given your situation.
Do your testers use a structured test design approach such as pairwise test design?
If so, how many of your testers regularly use pairwise test design or similar methods to design tests?
During the process of selling you their testing services, vendors have a strong incentive to highlight their efficiency-enhancing capabilities. As soon as the ink dries on the contract though they have an incentive to keep billable hours high.
Why (and in what types of testing) do they use it?
Pairwise and combinatorial test design approaches can be widely used in virtually all kinds of software testing.
If your vendor indicates that they use these approaches only in narrow areas (e.g., most likely functional testing or configuration testing), that is a large red flag that they don’t understand these test design approaches very well at all.
If your vendor does not understand how to apply these test design methods broadly (including in systems integration testing, performance testing, etc.) then you can safely assume that there will be considerable wasteful redundancy in any areas where these test design approaches are not used.
If we gave you 50 test scripts we already have, could you generate a set of pairwise tests for us that could test the same system more thoroughly in significantly fewer tests?
A knowledgeable vendor should be able to analyze your tests and provide a draft set for your review within 24 hours.
If the vendor takes multiple days and needs to search far and wide for an available expert to create an optimized set of tests for you, that indicates that the actual number of testers capable of performing this kind of test design is probably pretty low.
Another practical way to double-check vendor claims is to search sites like LinkedIn for keywords such as pairwise testing, orthogonal array testing, and/or Hexawise.
Making sure your software testing process stays current with the best ideas in software testing is an important factor in creating great software solutions that your customers love. Companies often understand the need to stay current on successful software coding practices, but fewer organizations pay attention to good practices in software testing. This often means there is a good deal to be gained by spending some time examining and improving your software testing practices.
When thinking about how to test software, these questions can help you think of ideas you might otherwise miss. These tips are useful for anyone testing software; we do include some tips that are specifically relevant to those using the Hexawise software testing application.
What additional things could I add that probably won't matter (but might)?
What are the two parameters in the plan with the most values? And should I shrink the number of parameter values for either of them? If you have a parameter or two with many more values than the other parameters, that can greatly increase the number of test cases needed to explore every pairwise (or higher) combination. If those values are critical to test, then keep them, but often a high number of parameter values is an indication that they could be reduced. Using value expansions may well be a wise choice.
Do the tests cover all the high priority scenarios that stakeholders want covered? Once you generate the tests, think about high priority scenarios and, if they are missing, add them as required test cases. It may well be worth adding tests for the most common scenarios. Often this can be done without requiring extra tests when using Hexawise: Hexawise will take those requirements into account and then create test cases to match your criteria, which often won't require an extra test. Sometimes it will add a test or a few tests to the total, in which case you can decide whether the benefit is worth the added cost of those tests.
If you're testing a process, what are the if/then points where the process diverges? Make sure the alternative processes are being tested.
Might it make a difference to how the system operates if you were to include additional variation ideas related to: who, what, when, where, why, how and how many? Those questions are often helpful in identifying parameters that should be tested (and occasionally in identifying parameter values to test).
Have you accidentally included values in your plan that everyone is confident won't impact the system? You want to test everything that is important, but testing things that will not impact whether the software works adds cost with no benefit.
Have you clearly thought through the implications of the distinction between impossible-to-test-for combinations and combinations that will lead the system to display error messages? Impossible-to-test combinations can be avoided using the invalid pair feature. But situations where users can enter values that should result in error messages should be tested, to validate that those error messages appear.
Have you considered the cost/benefit trade-off of executing, say, 50% of the Hexawise-generated test set vs 75% vs 100%? It is best to think about what is the most sensible business decision instead of just automatically going for some specific percentage. Depending on the software being tested and the impact on your business and customers, different levels of testing will make sense.
Too often software testing fails to emphasize the importance of experienced software testers using their brains, insight, experience and knowledge to make judgements about the best course of action. Hexawise is a wonderful tool to help software testers, but the tool needs a smart and thoughtful person to wield it and produce the results we all want to see.
The amount of detail used in a test plan will depend upon your particular context. When I'm asked how much detail to include by clients, I've started drawing a simple 2 X 2 matrix that looks like this:
There is simply too much variation between different teams of testers and business contexts to provide a one-size-fits-all answer to this question. There are good and valid reasons that different teams around the world use very different test plan documentation approaches when it comes to test case writing styles.
You and your colleagues should familiarize yourselves with several different approaches that other teams use. Think about the pros and cons of using formalized and detailed script structures as opposed to the pros and cons of other "documentation-light" approaches. Once you're aware of the options available, you should consciously adopt the approach that works best for your context.
These sources are directly on point and provide more examples for your consideration than I can get into here:
Dr. Cem Kaner's excellent piece "What's a Good Test Case?"
A presentation I gave at an STP conference called "Documenting Software Testing Instructions - A Survey of Successful Approaches"
We have written before about the general question of "Should I Use One Test Plan or Multiple Plans?" This post addresses the same question with a focus on plans that have a relatively large percentage of constraints (e.g., a relatively large number of Invalid Pairs and/or Married Pairs).
Hexawise creates a Plan Scorecard that analyzes every set of tests Hexawise users generate. The Plan Scorecard exists to help you identify potential problems with plans you create and to make you aware of possible ways to improve your sets of tests. One of the notifications the Plan Scorecard provides goes like this:
Consideration: "53% of the parameter values are directly or indirectly constrained."
Explanation: Test plans with more than half of the parameter values constrained are often trying to do too much. They may be better broken into more than one test plan.
Constraints are used to prevent "impossible to test for" scenarios from being generated by the Hexawise test case generator. For more information about entering constraints into Hexawise, see these explanations of Invalid Pairs and Married Pairs. If you do not use Invalid Pairs or Married Pairs in a plan, 0% of your parameter values would be constrained.
What should you do if you get a notification that your plan is highly constrained? Simple: consider your options. Specifically, consider the pros and cons of splitting the plan you're working on into multiple separate plans.
Why can heavily-constrained plans be a problem?
A number above 50% or so indicates that it might make sense to consider breaking your plan into 2 or more plans.
Why? Because with more and more constraints in a plan, keeping track of them all, making sure they're all accurate, and making sure the constraints in certain parts of your plan do not conflict with constraints in other parts of your plan in unintended ways, can start to take a lot of mental energy.
Furthermore, if your constraints do begin to conflict with one another, that could make it impossible for the Hexawise algorithm to identify valid values to populate in some parts of some of your tests. When that happens, instead of an actual value appearing in a test case, you will see the words "No Possible Value" appear.
Why are multiple simpler plans often preferable to one more complex one?
It is often much easier and quicker (from a modeling standpoint) to create two different plans. Creating two separate plans instead of one single plan often makes it possible to eliminate the need for the majority of the constraints in your plan.
For example, if you had a pizza ordering application where there were a lot of constraints around the value "meat pizza" and a lot of constraints around the value "vegetarian pizza," it could be attractive to create one plan (e.g., one set of tests) for meat pizzas and a different set of tests for vegetarian pizzas.
Simpler plans with fewer constraints tend to be easier to understand, modify, and maintain.
What are practical considerations when splitting a single plan into multiple ones?
To determine where / how to split a plan, begin by asking "what values have the most constraints associated with them?"
In the example above, "meat pizza" and "veggie pizza" had the most constraints; creating one plan for meat pizzas and a separate plan for veggie pizzas was the way to go. It would not have made sense to split the plan into one plan for scenarios paid for in cash and another for scenarios paid for with credit cards if payment type did not have many Invalid Pairs or Married Pairs associated with it.
We were recently talking to a client where 58% of their plan's parameter values were constrained. We helped them look at where most of the constraints were coming from. It turned out that "Timing of Loan Payment" was the main culprit. As a result, we suggested they consider three separate plans: one for Delinquent Payment Scenarios, one for Regular/Timely Loan Payment Scenarios, and one for Loan Pre-Payment Scenarios.
While working with another client that was dealing with a highly constrained plan, "Type of User" was the source of most of the constraints. Super-Users were allowed to perform all kinds of activities on the System Under Test. "Regular Users" were able to perform a far more limited number of actions. It made sense in that case to break the original combined plan into two separate plans; one plan for Admin User Scenarios and one plan for Regular User scenarios.
After determining where to split a plan, the next steps tend to be relatively straightforward:
If you're starting with one combined plan and want to break it into two plans, we would recommend these steps:
Start by creating 3 copies of the same plan:
Make a copy of the original combined plan so you can easily go back to an earlier known version if things start to go horribly wrong (or if you realize that the multiple plan strategy results in the creation of significantly more tests than the original single plan version)
Make a copy that you will modify for, e.g., "Regular User Scenarios"
Make a copy that you will modify for, e.g., "Admin User Scenarios"
Take advantage of Hexawise's Bulk Edit feature and tailor each plan as needed.
Delete any unnecessary Parameters, Invalid Pairs, Married Pairs, Requirements, and Expected results
Some of those using Hexawise use Gherkin as their testing framework. Gherkin is based on a given [a], when [b] --> then [c] format. The idea is that this helps make communication clear and ensures business rules are understood properly. Portions of this post may be a bit confusing for new Hexawise users; links are provided for more details on various topics. But if you don't need to create output for Gherkin and you are confused, you can just skip this post.
A simple Gherkin scenario: Making an ATM withdrawal
Given a regular account
And the account was originally opened at Goliath National
And the account has a balance of $500
When using a Goliath National Bank ATM
And making a withdrawal of $200
Then the withdrawal should be handled appropriately
Hexawise users want to be able to specify the parameters (used in given and when statements) and then import the set of Hexawise generated test cases into a Gherkin style output.
In this example we will use the Hexawise sample test plan (Gherkin example), which you can access in your Hexawise account.
Below, I'll get into how to export Hexawise-created test plans so they can be used to create Gherkin data tables (we do this ourselves at Hexawise).
In the then field we default to an expected value of "the withdrawal should be handled appropriately." This is something that may benefit from some explanation.
If we wanted to provide exact details on exactly what happens for every variation of parameter values, those details would have to be manually created for each test script. That creates a great deal of work that has very little value. And it is an expensive way to manage for the long term, as each of those details has to be updated every time the system changes. So in general, using a "behaves as expected" default value is best, and then providing extra details when worthwhile.
For some people, this way of thinking can be a bit difficult to take in at first and they have to keep reminding themselves how to best use Hexawise to improve efficiency and effectiveness.
To enter the default expected value mouse-over the final step in the auto scripts screen. When you mouse over that step you will see the "Add Expected Results" link. Click that and add your expected result text.
The expected value entered on the last step with no conditions (the when drop down box is blank) will be the default value used for the export (and therefore the one imported into Gherkin).
In those cases when providing special notes to testers is deemed worth the extra effort, Hexawise has 2 ways of doing this. If a special expected value exists for the particular conditions in the individual test case, then that special expected value content will be exported (and therefore used for Gherkin).
Or we can use the requirements feature when we want to require a specific set of parameter values to be tested. If we choose 2-way coverage (the default, pairwise coverage), every pair of parameter values will be tested at least once.
But if we want a specific set of, say, 3 exact parameter values ([account type] = VIP, [withdrawal ATM] = bank-owned ATM, [withdrawal amount] = $600), then we need to include that as a requirement. Each required test script added also includes the option to include an expected result. The sample plan includes a required test case with those parameters and an expected result of "The normal limit of $400 is raised to $600 in the special case of a VIP account using a Goliath National Bank owned ATM."
So the most effective way to use Hexawise to create a pairwise (or higher strength) test plan that will then be used to create Gherkin data tables is to have the then case be similar to "behaves as expected," and, when special expected-result details are needed, to use the auto script or requirements features to include those details. Doing so will result in the expected result entered for that special case being the value used in the Gherkin table for then.
When you click the auto script button the tests are generated, and you can download them using the export icon.
Then select the option to download as a CSV file.
You will download a zip file that you can then unzip to get 2 folders with various files. The file you want to use for this is the combinations.txt file in the csv directory.
We use a short Ruby script to convert the commas in that file to the pipes (|) used for Gherkin data tables.
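A minimal sketch of that kind of conversion, assuming combinations.txt is a standard comma-separated file (the file names and paths here are illustrative), looks like this:

require 'csv'

# Read the exported combinations file (path is illustrative).
rows = CSV.read('csv/combinations.txt')

File.open('gherkin_examples.txt', 'w') do |out|
  rows.each do |row|
    # Each Gherkin data table row is the cell values joined by pipes.
    out.puts "| #{row.map { |cell| cell.to_s.strip }.join(' | ')} |"
  end
end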
Of course, you can use whatever method you wish to convert the format; this is just what we use. See this explanation for a few more details on the process.
Now you have your Gherkin file to use however you wish. And as the code is changed over time (perhaps adding parameter value options, new parameters, etc.) you can just regenerate the test plan and export it. Then convert it and the updated Gherkin test plan is available.
I'm coining a new term today, "grapefruit juice bugs."
My inspiration for this term is a blog post in the New York Times that David Pogue wrote. I was fascinated by the post and it got me to thinking about a particular kind of bugs in software that are more common than most people may realize. You could say that these bugs are surprisingly common. In fact, if you wanted to be more precise, you could even say that this term applies to a specific type of "surprisingly common type of surprising bugs." Let me explain.
There's something about the chemical makeup of grapefruit juice that makes it interact with our biology and a large number of different drugs in ways which result in dangerous conditions. For example, certain drugs lose their effectiveness dramatically when interacting with grapefruit juice which can have life-threatening consequences. Other times, the interactions with grapefruit juice can dramatically increase a drug's potency. This can result in "safe doses" becoming very unsafe.
The 42-year-old was barely responding when her husband brought her to the emergency room. Her heart rate was slowing, and her blood pressure was falling. Doctors had to insert a breathing tube, and then a pacemaker, to revive her.
They were mystified: The patient’s husband said she suffered from migraines and was taking a blood pressure drug called verapamil to help prevent the headaches. But blood tests showed she had an alarming amount of the drug in her system, five times the safe level.
Did she overdose? Was she trying to commit suicide? It was only after she recovered that doctors were able to piece the story together.
“The culprit was grapefruit juice,” said Dr. Unni Pillai, a nephrologist in St. Louis, Mo. ...
The previous week, she had been subsisting mainly on grapefruit juice. Then she took verapamil, one of dozens of drugs whose potency is dramatically increased if taken with grapefruit. In her case, the interaction was life-threatening.
Last month, Dr. David Bailey, a Canadian researcher who first described this interaction more than two decades ago, released an updated list of medications affected by grapefruit. There are now 85 such drugs on the market, he noted, including common cholesterol-lowering drugs, new anticancer agents, and some synthetic opiates and psychiatric drugs, as well as certain immunosuppressant medications taken by organ transplant patients, some AIDS medications, and some birth control pills and estrogen treatments.
Under normal circumstances, the drugs are metabolized in the gastrointestinal tract, and relatively little is absorbed, because an enzyme in the gut called CYP3A4 deactivates them. But grapefruit contains natural chemicals called furanocoumarins, that inhibit the enzyme, and without it the gut absorbs much more of a drug and blood levels rise dramatically.
For example, someone taking simvastatin (brand name Zocor) who also drinks a small 200-milliliter, or 6.7 ounces, glass of grapefruit juice once a day for three days could see blood levels of the drug triple, increasing the risk for rhabdomyolysis, a breakdown of muscle that can cause kidney damage.
So what do interactions between grapefruit juice and drugs have to do with software testing?
Like grapefruit juice's impact on prescription drugs, software testing involves critical interactions between different parts of the system. And risks exist when these different parts interact with one another. This is true whether you're talking about "large parts" interacting in System Testing or "small parts" interacting in Unit Testing.
Interactions between things are a very rich source of bugs in software. As anyone who has heard the infernal phrase "works on my machine" can tell you, software features and functions often work perfectly fine in many usage scenarios, hardware and software configurations, etc. - only to fail to work in ever-so-slightly different situations.
The difference between plain old every-day "Dual-Mode Faults" and "Grapefruit Juice Bugs"
A dual-mode fault occurs whenever two test inputs must both be present to trigger a defect. Most software testers start encountering them quite frequently within days of starting their jobs. Some examples:
This "buy" button works fine. Except when the customer is a "new user." (First, action = "click on the buy button" and Second, customer = "new user")
Transaction prices for share purchases are calculated correctly. Except when denominated in Japanese Yen. (First, Action = "sell shares" and Second, Currency = "Japanese Yen")
While all grapefruit juice bugs are dual-mode faults, not all dual-mode faults are Grapefruit Juice Bugs:
Grapefruit juice bugs have got to have a little of the element of surprise in them. When you explain them to a developer, their first reaction should be "Huh? How is that even possible?" or at least "Hmmm... That's odd. Let me investigate."
Anything along the lines of "This feature usually works, except in IE6, when..." is almost definitely not a grapefruit juice bug. Problematic interactions with IE6 are an incredibly common type of dual-mode fault, not a surprising one.
Whenever you hear "works on my machine" replies to your bug reports, and it takes a while for the issue to be replicated, odds are pretty good that a grapefruit juice bug might be involved.
Here's an example of an especially surprising grapefruit juice bug. This excerpt is from Apple's online help files, posted after users of the original iPad complained about problems with Wi-Fi connectivity. Certain screen brightness settings were causing problems with the Wi-Fi signals. I'm not even going to begin to guess how one would have anything to do with the other.
How to identify grapefruit juice bugs during your testing?
What is a tester to do when faced with more potential grapefruit juice bugs than he can handle using traditional methods?
If you're a software tester trying to do your best to determine whether a feature or function in your System Under Test will work "on everyone's machine," you've got a nightmare on your hands. Really nasty combinatorial explosions arise when you consider all of the possible combinations that would be required to test multiple hardware options, multiple software options, multiple usage scenarios, multiple test data inputs (and multiple combinations of the test data itself), multiple ways in which users enter data, and all of the rest of the "stuff that could vary" when people use your application. If you take the time to think expansively about the possible variations in a medium-sized application, quadrillions of possible tests often result.
While not eating grapefruit and not drinking grapefruit juice might be wise if you are taking drugs, there is rarely, if ever, such an easy method for eliminating the possibility of negative results due to software interactions. Refusing to support IE 6 in order to avoid the disproportionate number of grapefruit juice-like problematic interactions associated with IE6 would be as close as you could come in the world of software.
Design of Experiments-based test design methods can help testers come to grips with this challenge. Orthogonal array software testing (often referred to as OATS or simply OA testing) is a test design strategy that allows us to efficiently detect bugs created by interactions within the system. Orthogonal array software testing is based on the principles of multifactor designed experiments as first explored by Sir RA Fisher.
Design of Experiments-based test design methods are very-closely related to pairwise testing (AKA allpairs testing, all pairs testing, and pairwise-testing). Any of these test design strategies will allow a software tester to quickly generate a set of tests that includes tests for every single pair of test inputs.
This approach to test design often has multiple advantages, including faster test creation, more varied test scenarios, 100% coverage of all potential dual-mode faults (including hard-to-predict grapefruit juice bugs), and often a smaller resulting set of tests that will be quicker to execute. Having said that, it is by no means a magical silver bullet. This approach to test design requires test designers with above average analytical abilities to identify the appropriate Parameters and Values for their system under test; this is sometimes easier said than done because it requires a new mindset from test designers.
Software testers can take solace that the challenges of software testing, while significant, are simple when compared to trying to understand the effects of drug interactions in people.
Combinatorial testing can look at bugs created by the interaction between multiple (3, 4, 5, 6...) variables. So if there was a bug that didn't get triggered just by using Chrome on Windows, but did get triggered when you also tried to replace an existing photo in your profile with a new one (test idea number 3), then pairwise testing might not catch it. Pairwise test design would create a set of tests that would include at least one test for each of these pairs:
Chrome & Windows and
Chrome & replace photo and
Windows & replace photo, but...
A set of pairwise tests, however, might fail to include the specific combination of all three of those test inputs in the same test. With combinatorial test design approaches, you could create test plans with 100% coverage of 3-way (or 4-way) interactions and be sure that all such interactions are covered. When you create sets of 3-way tests, 4-way tests, 5-way tests, and 6-way tests though, you'll quickly discover that the number of tests required starts to balloon.
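To make that distinction concrete, here is a small sketch using made-up parameters. The four tests below cover every pair of values, yet no single test combines Chrome, Windows, and "replace photo" at the same time:

require 'set'

tests = [
  { browser: "Chrome",  os: "Windows", action: "upload photo"  },
  { browser: "Chrome",  os: "macOS",   action: "replace photo" },
  { browser: "Firefox", os: "Windows", action: "replace photo" },
  { browser: "Firefox", os: "macOS",   action: "upload photo"  }
]

# Every pair of (parameter, value) settings that appears in some test.
covered_pairs = tests.flat_map { |t| t.to_a.combination(2).to_a }.to_set

pair   = [[:browser, "Chrome"], [:os, "Windows"]]
triple = { browser: "Chrome", os: "Windows", action: "replace photo" }

puts covered_pairs.include?(pair)                        # true  - this pair is covered
puts tests.any? { |t| triple.all? { |k, v| t[k] == v } } # false - the 3-way combination never appears in one test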
Hexawise allows you to create test plans with the coverage interactions you desire. This allows you to create sets of tests from 2-way all the way up to phenomenally-thorough 6-way sets of tests. In fact, it even lets you generate clever sets of risk-based tests that will, say, prioritize comprehensive 4-way coverage on 4 sets of Parameter Values while ensuring only pairwise coverage of the other, lower-priority, interactions in your system under test. In other words, Hexawise lets you create mixed strength test plans: if there are certain factors you are very concerned about, you can set the interaction levels for those at a higher level to cover more of their possible interactions.
Since creating Hexawise, I've worked with executives at companies around the world who have found themselves convinced of the value of pairwise testing. And then they need to convince their organization of the value.
They often follow a similar path: first thinking "pairwise testing is a nice method in theory, but not applicable in our case," then "pairwise is nice in theory and might be applicable in our case," then "pairwise is applicable in our case," and finally "how do I convince my organization?"
In this post I review my history helping convince organizations to try and then adopt pairwise, and combinatorial, software testing methods.
About 8 years ago, I was working at a large systems integration firm and was asked to help the firm differentiate its testing offerings from the testing services provided by other firms.
I Googled "Design of Experiments Software Testing." That search led me to Dr. Madhav Phadke (who, by coincidence, had been a former student of my father). More than 20 years ago now, Dr. Phadke and his colleagues at AT&T Bell Labs had asked the question you're asking now. They did an experiment using pairwise test design / orthogonal array test design to identify a subset of tests for AT&T's StarMail system. The results of that testing effort were extraordinarily successful and well-documented.
Shortly after doing that, while working at that systems integration firm, I began to advocate to anyone and everyone who would listen that this approach to designing tests promised to (a) be more thorough and (b) require (in most but not all cases) significantly fewer tests. Results from 17 straight projects confirmed that both of these statements were true. Consistently.
Repeatable Steps to Confirm Whether This Approach Delivers Efficiency and Thoroughness Improvement (and/or document a business case/ROI calculation)
How did we demonstrate that this test design approach led to both more thorough testing and much more efficient testing? We followed these steps:
Take an existing set of 30 - 100 tests that had already been created, reviewed, and approved for testing (but which had not yet been executed).
Using the test ideas included in those existing tests, design a set of pairwise tests (often approximately half as many tests as were in the original set). When putting your tests together, if there are particular, known, high-priority scenarios that stakeholders believe are important to test, it is important to make sure that you "force" your pairwise test generator to include such high-priority scenarios.
Have two different testers execute both sets of tests at the same time (e.g., before developers start fixing any defects that are uncovered by testers executing either set of tests)
Document the following:
How long did it take to execute each set of tests?
How many unique defects were identified by each set of tests?
How long did it take to create and document each set of tests?*
*This third measurement was usually an estimate because a significant number of teams had not tracked the amount of time it took to create the original set of tests.
The results in 17 different pairwise testing "bake-off" projects conducted at my old firm included:
Defects found per tester hour during test execution: when pairwise tests were used, more than twice as many defects were found per tester hour
Total defects found: pairwise tests found as many or more defects in every single project (despite the fact that in almost every case there were significantly more tests in each original set of tests)
Defects found by pairwise tests but missed by traditional tests: a lot (I forget the exact number)
Defects found by traditional tests but missed by pairwise tests: zero
Amount of time to select and document tests: much less time required when a pairwise test generator was used (As mentioned above, precise measurements were difficult to gather here)
More recent project benefits have included these:
Those experiences - combined with the realization that many Fortune 500 firms were starting to try to implement smarter test design methods to achieve these kinds of benefits but were struggling to find a test design tool that was user-friendly and would integrate into their other tools - led me to the decision to create Hexawise.
Additional Advice and Lessons Learned Based on My Experiences
Once the value of pairwise software testing has been demonstrated at a specific organization, it is very common for the proponent of pairwise testing to find themselves saying:
I have already elaborated some test plans that would save us up to 50% effort with that method. But now my boss and other colleagues are asking me for a proof that these pairwise test cases suffice to make sure our software is running well.
In that case, my advice is three-fold:
First, appreciate how your own thinking has evolved and understand that other people will need to follow a similar journey (and that others won't have as much time to devote as you have had to experience learnings first-hand).
When I was creating Hexawise, George Box, a Design of Experiments expert with decades of experience explaining to skeptical executives how Design of Experiments could deliver transformational improvements to their organizations' efficiency and effectiveness, told me "Justin, you'll experience three phases of acceptance and acceptance will happen more gradually than you would expect it to happen.
First, people will tell you 'It won't work.'
Next, they'll say 'It won't work here.'
Eventually, he said with a smile, they'll tell you 'Of course this works. I thought of it first!'
When people hear that you can double their test execution productivity with a new approach, they won't initially believe you. They'll be skeptical. Most people you're explaining this approach to will start with the thought that "it is nice in theory but not applicable to what I'm doing." It will take some time and experience for people to understand and appreciate how dramatic the benefits are.
Case in point: I will be talking to a senior executive at a large capital markets firm soon about how our tool can help them transform the efficiency and effectiveness of their testing group. And I can introduce them to a client of ours that is using our test design tool extensively in every single one of their most important IT projects. Will that executive take me up on my offer? I hope so, but based on past experience, I suspect odds are good that he'll instead react with 'Yes, yes, sure, if companies were people, that company would be our company's identical twin, but still... It won't work here.'
Third, at the end of the day, the most effective approach I have found to address that understandable skepticism and to secure organizational-level buy-in and commitment is to gather hard, indisputable evidence, on multiple projects, that the approach works at the company itself through a bake-off approach (e.g., following the four steps outlined above). A few words of advice, though.
My proposed approach isn't for the faint of heart. If you're working at a large company with established approaches, you'll need patience and persistence.
Even after you gather evidence that this approach works in Business Unit A, and B and C, someone from Business Unit D will be unconvinced by the compelling and irrefutable evidence you have gathered and tell you 'It won't work here. Business Unit D is unique.' The same objections will likely arise with results from "Type of Testing" A, B, and C.
As powerful and widely-applicable as this test design approach is, always remember (and be clear with stakeholders) that it is not a magical silver bullet.
James Bach raises several valid limitations with using this approach. In particular, this approach won't work unless you have testers who have relatively strong analytical skills driving the test design process. Since pairwise test case generating tools are dependent upon thoughtful test designers to identify appropriate test inputs to vary, this approach (like all test design approaches) is subject to a "garbage in / garbage out" risk.
Project leads will resist "duplicating effort." But unless you do an actual bake-off stakeholders won't appreciate how broken their existing process is. There's inevitably far more wasteful repetition hidden away in standard tests than people realize. When you start reporting a doubling of tester productivity on several projects, smart managers will take notice and want to get involved. At that point - hopefully - your perseverance should be rewarded.
Some benefits data and case studies that you might find useful:
If you can't change your company, consider changing companies
Lastly, remember that your new-found skills are in high demand whether or not they're valued at your current company. And know that, despite your best efforts and intentions, you might not convince the skeptics. Some people inevitably won't be willing to take the time to understand. If you find yourself in a situation where you want to use this test design approach (because you know these approaches are powerful, practical, and widely-applicable) but you don't have management buy-in, then consider whether it would be worth leaving your current employer to join a company that will let you use your new-found skills.
Most of our clients, for example, are actively looking for software test designers with well developed pairwise and combinatorial test design skills. And they're even willing to pay a salary premium for highly analytical test designers who are able to design sets of powerful tests. (We publicize such job openings in the LinkedIn Hexawise Guru group for testers who have achieved "Guru" level status in the self-paced computer-based-training modules in the tool).
We have created a new site to highlight Hexawise videos on combinatorial, pairwise + orthogonal array software testing. We have posted videos on a variety of software testing topics including: selecting appropriate test inputs for pairwise and combinatorial software test design, how to select the proper inputs to create a pairwise test plan, using value expansions for values in the same equivalence classes.
Here is a video with an introduction to Hexawise:
Subscribe to the Hexawise TV blog. And if you haven't subscribed to the RSS feed for the main Hexawise blog, do so now.
Software testers should be test pilots. Too many people think software testing is the pre-flight checklist an airline pilot uses.
The checklists airline pilots use before each flight are critical. Checklists are extremely valuable tools that help assure steps in a process are followed. Checklists are valuable in many professions, as Atul Gawande explains in "The Checklist – If something so simple can transform intensive care, what else can it do?":
Sick people are phenomenally more various than airplanes. A study of forty-one thousand trauma patients—just trauma patients—found that they had 1,224 different injury-related diagnoses in 32,261 unique combinations for teams to attend to. That’s like having 32,261 kinds of airplane to land. Mapping out the proper steps for each is not possible, and physicians have been skeptical that a piece of paper with a bunch of little boxes would improve matters much. In 2001, though, a critical-care specialist at Johns Hopkins Hospital named Peter Pronovost decided to give it a try.
Pronovost and his colleagues monitored what happened for a year afterward. The results were so dramatic that they weren’t sure whether to believe them: the ten-day line-infection rate went from eleven per cent to zero. So they followed patients for fifteen more months. Only two line infections occurred during the entire period. They calculated that, in this one hospital, the checklist had prevented forty-three infections and eight deaths, and saved two million dollars in costs.
Checklists are extremely useful in software development. And using checklist-type automated tests is a valuable part of maintaining and developing software. But those pass-fail tests are equivalent to checklists - they provide a standardized way to check that planned checks pass. They are not equivalent to thoughtful testing by a software testing professional.
I have been learning about software testing for the last few years. This distinction between testing and checking software was not one I had before. Reading experts in the field, especially James Bach and Michael Bolton is where I learned about this idea.
Testing is the process of evaluating a product by learning about it through experimentation, which includes to some degree: questioning, study, modeling, observation and inference.
(A test is an instance of testing.)
Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product.
I think this is a valuable distinction to understand when looking to produce reliable and useful software. Both are necessary. Both are done too little in practice. But testing (as defined above) is especially underused - in the last 5 years checking has been increasing significantly, which is good. But now we really need to focus on software testing - thoughtful experimenting.
Hexawise allows you to adjust testing coverage to focus more thorough coverage on selected, high-priority areas. Mixed strength test plans allow you to select different levels of coverage for different parameters.
Increasing from pairwise to "triples" (3-way) coverage expands the test plan so that bugs resulting from 3 parameters interacting can be found. That is a good thing. But the tradeoff is that it requires more tests to catch those interactions.
The mixed-strength option that Hexawise provides lets you select a higher coverage level for some parameters in your test plan. That lets you balance increased test thoroughness against the workload created by additional tests.
As that example shows, Hexawise allows you to focus additional thoroughness on the 3 highest priority parameters with just 120 tests while also providing full pairwise coverage on all factors. Mixed strength test plans are a great tool to provide extra benefit to your test plans.
A computer glitch involving the new health care law may mean that some smokers won’t bear the full brunt of tobacco-user penalties that would have made their premiums much higher — at least, not for next year.
The Obama administration has quietly notified insurers that a computer system problem will limit penalties that the law says the companies may charge smokers, The Associated Press reported Tuesday. A fix will take at least a year.
Tip of the Iceberg
This defect was entirely avoidable and predictable. It's safe to expect that hundreds (if not thousands) of similar defects related to Obamacare IT projects will emerge in the weeks and months to come. Had testers used straightforward software test design prioritization techniques, bugs like these would have been easily found. Let me explain.
There's no Way to Test Everything
If the developers and/or testers were asked how this bug could sneak past testing, they might at first say something defensive, along the lines of: "We can't test everything! Do you know how many possible combinations there are?" If you include 40 variables (demographic information, pre-existing conditions, etc.) in the scope of this software application, there would be roughly 41 quadrillion possible scenarios to test. That's not a typo: 41 QUADRILLION possible combinations. It would take 13 million years to execute those tests if we could execute 100 tests every second. There's no way we can test all possible combinations. So bugs like these are inevitably going to sneak through testing undetected.
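The arithmetic behind that estimate is simple to check. A quick back-of-the-envelope calculation, taking 41 quadrillion combinations and 100 test executions per second as given:

# Back-of-the-envelope check of the "13 million years" figure.
combinations     = 41_000_000_000_000_000    # roughly 41 quadrillion scenarios
tests_per_second = 100
seconds_per_year = 60 * 60 * 24 * 365

years = combinations.to_f / tests_per_second / seconds_per_year
puts "about #{(years / 1_000_000).round} million years"  # => about 13 million years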
The Wrong Question
When the developers and testers of a system say there is no way they could realistically test all the possible scenarios, they're addressing the wrong challenge. "How long would it take to execute every test we can think of?" is the wrong question. It is interesting but ultimately irrelevant that it would take 13 million years to execute those tests.
The Right Question
A much more important question is "Given the limited time and resources we have available for testing, how can we test this system as thoroughly as possible?" Most teams of developers and software testers are extremely bad at addressing this question. And they don't realize nearly how bad they are. The Dunning-Kruger effect often prevents people from understanding the extent of their incompetence; that's a different post for a different day. After documenting a few thousand tests designed to cover all of the significant business rules and requirements they can think of, testers will run out of ideas, shrug their shoulders in the face of the overwhelming number of total possible scenarios and declare their testing strategy to be sufficiently comprehensive. Whenever you're talking about hundreds or thousands of tests, that test selection strategy is a recipe for incredibly inefficient testing that both misses large numbers of easily avoidable defects and wastes time by testing certain things again and again. There's a better way.
The Straightforward, Effective Solution to this Common Testing Challenge: Testers Should Use Intelligent Test Prioritization Strategies
If you create a well-designed test plan using scientific prioritization approaches, you can reduce the number of individual tests tremendously. It comes down to testing the system as thoroughly as possible in the time that's available for testing. There are well-proven methods for doing just that.
There are Two Kinds of Software Bugs in the World
Bugs that don't get found by testers sneak into production for one of two main reasons, namely:
"We never thought about testing that" - An example that illustrates this type of defect is one James Bach told me about. Faulty calculations were being caused by an overheated server that got that way because of a blocked vent. You can't really blame a tester who doesn't think of including a test involving a scenario with a blocked vent.
"We tested A; it worked. We tested B; it worked too.... But we never tested A and B together." This type of bug sneaks by testers all too often. Bugs like this should not sneak past testers. They are often very quick and easy to find. And they're so common as to be highly predictable.
Let's revisit the high-profile Obamacare bug that will impact millions of people and take more than a year to fix. Here's all that would have been required to find it:
Include an applicant with a relatively high (pre-Medicare) age. Oh, and they smoke.
Was the system tested with a scenario involving an applicant who had a relatively high age? I'm assuming it must have been.
Was the system tested with a scenario involving an applicant who smoked? Again, I'm assuming it must have been.
Was the system tested with a scenario involving an applicant who had a relatively high age who also smoked? That's what triggers this important bug; apparently it wasn't found during testing (or found early enough).
If You Have Limited Time, Test All Pairs
Let's revisit the claim of "we can't execute all 13 million-years-worth of tests. Combinations like these are bound to sneak through, untested. How could we be expected to test all 13 million-years-worth of tests?" The last two sentences are preposterous.
"Combinations like these are bound to sneak through, untested." Nonsense. In a system like this, at a minimum, every pair of test inputs should be tested together. Why? The vast majority of defects in production today would be found simply by testing every possible pair of test inputs together at least once.
"How could we be expected to test all 13 million-years-worth of tests?" Wrong question. Start by testing all possible pairs of test inputs you've identified. Time-wise, that's easily achievable; its also a proven way to cover a system quite thoroughly in a very limited amount of time.
Design of Experiments is an Established Field that was Created to Solve Problems Exactly Like This; Testers are Crazy Not to Use Design of Experiments-Based Prioritization Approaches
The almost 100 year-old field of Design of Experiments is focused on finding out as much actionable information as possible in as few experiments as possible. These prioritization approaches have been very widely used with great success in many industries, including advertising, manufacturing, drug development, agriculture, and many more. Design of Experiments test design techniques (such as pairwise testing and orthogonal array testing / OA testing) are increasingly being used by software testing teams, but far more teams could benefit from these smart test prioritization approaches. We've written posts about how Design of Experiments methods are highly applicable to software testing here and here, and put an "Intro to Pairwise Testing" video here. Perhaps the reason this powerful and practical test prioritization strategy remains woefully underutilized by the software testing industry at large is that there are too few real-world examples explaining "this is what inevitably happens when this approach is not used... And here's how easy it would be to avoid this from happening to you in your next project." Hopefully this post helps raise awareness.
Let's Imagine We've Got One Second for Testing, Not 13 Million Years; Which Tests Should We Execute?
Remember how we said it would take 13 million years to execute all of the 41 quadrillion possible tests? That calculation assumed we could execute 100 tests a second. Let's assume we only have one second to execute tests from those 13 million years worth of tests. How should we use that second? Which 100 tests should we execute if our goal is to find as many defects as possible?
By setting the 40 different parameter values intelligently, we can maximize the testing coverage achieved in a very small number of tests. In fact, in our example, you would only need to execute only 90 tests to cover every single pairwise combination.
The number of total possible combinations (or "tests") that could be generated will depend on how many parameters (items/factors) and how many options (parameter values) there are for each parameter. In this case, the number of total possible combinations of parameters and values equals 41 quadrillion.
This screen shot shows a portion of the test conditions that would be included in the first 4 of the 90 tests needed to provide full pairwise coverage. Sometimes people are not clear about what "test every pair" means. To make this more concrete, by way of a few specific examples, pairs of values tested together in the first part of test number 1 include:
Plan Type = A tested together with Deductible Amount = High
Plan Type = A tested together with Gender = Male
Plan Type = A tested together with Spouse = Yes
Gender = Male tested together with State = California
Spouse = Yes tested together with Yes (and over 5 years)
And lots of other pairs not listed here
This screen shot shows a portion of the later tests. You'll notice that some values are shown in purple italics. Values listed in purple italics are not providing new pairwise coverage. You will note that in the first tests every single parameter value provides new pairwise coverage, while toward the end few parameter value settings provide new pairwise coverage. Once a specific pair has been tested, retesting it doesn't provide additional pairwise coverage. Sets of Hexawise tests are "front loaded for coverage." In other words, if you need to stop testing at any point before the end of the complete set of tests, you will have achieved as much coverage as possible in the limited time you have to execute your tests (whether that is 10 tests or 30 tests or 83). The pairwise coverage chart below makes this point visually; the decreasing number of newly tested pairs of values that appear in each test accounts for the diminishing marginal returns per test.
You Can Even Prioritize Your First "Half Second" of Tests To Cover As Much As Possible!
This graph shows how Hexawise orders the test plan to provide the greatest coverage quickly. So if you get through 37 of the 90 tests needed for full pairwise coverage, you have already covered over 90% of all the pairwise combinations. The implication? Even if just 37 tests were executed, there would be a 90% chance that any given pair of values that you might select at random would be tested together in the same test case by that point.
Was Missing This Serious Defect an Understandable Oversight (Because of Quadrillions of Possible Combinations Exist) or was it Negligent (Because Only 90 Intelligently Selected Tests Would Have Detected it)?
A generous interpretation of this situation would be that it was "unwise" for testers to fail to execute the 90 tests that would have uncovered this defect.
A less generous interpretation would be that it was idiotic not to conduct this kind of testing.
The health care reform act will introduce many such changes as this. At an absolute minimum, health insurance firms should be conducting pairwise tests of their systems. Given the defect finding effectiveness of pairwise testing coverage, testing systems any less thoroughly is patently irresponsible. And for health insurance software testing it is often wiser to expand to test all triples or all quadruples given the interaction between many variables in health insurance software.
Incidentally, to get full 3 way test coverage (using the same example as above) would require 2,090 tests.
tl;dr: When you have parameters that only have sensible values depending on certain conditions you should include a value like "N/A" or "Does not appear" for those parameters.
You can try this example out yourself using your Hexawise account. If you do not have an account yet you can create a demo account for free that lets you create effective test plans.
Let's take a simple, made up example from version 1 of a restaurant ordering system that has 3 parameters:
Entree: Steak, Chicken, Salmon
Salad: Caesar, House
Side: Fries, Green Beans, Carrots, Broccoli
Everything is just fine with our test plan for version 1, but then let's suppose the business decides that in version 2, people that order "Chicken" don't get a "Salad". Easy enough, we just make an invalid pair between "Chicken" and "Caesar" and "Chicken" and "House", correct? No, Hexawise won't let us. Why? Because then it has no value available for "Salad" to pair with "Chicken" as the "Entree".
But that's what we want! "Salad" will disappear from the order screen as soon as we select "Chicken". So there is no value. That's OK. We just need to add that as the value:
Entree: Steak, Chicken, Salmon
Salad: Caesar, House, Not Available
Side: Fries, Green Beans, Carrots, Broccoli
At this point we could create the invalid pairs between "Chicken" and "Caesar" and "Chicken" and "House", and Hexawise will allow it because there is still a parameter value, "Not Available", left to pair with "Chicken" in the "Salad" parameter.
If we do this though, we'll find that Hexawise will force a pairing between "Steak" and "Not Available" and "Salmon" and "Not Available". Not exactly what we wanted! So we can also add an invalid pair between "Steak" and "Not Available" and "Salmon" and "Not Available".
With these four invalid pairs, we have a working test plan for version 2, but rather than the four invalid pairs, this scenario is exactly why Hexawise has bi-directional married pairs. A bi-directional married pair between "Chicken" and "Not Available" tells Hexawise that every time "Entree" is "Chicken", "Salad" must be "Not Available" and every time "Salad" is "Not Available", "Entree" must be "Chicken". So it gives us precisely what we want for this scenario by creating just one bi-directional married pair rather than four invalid pairs.
Now let's suppose version 3 of the menu system comes out, and now there is a fourth Entree, "Pork". And "Pork", being the other white meat, also does not have a salad option:
Entree: Steak, Chicken, Salmon, Pork
Salad: Caesar, House, Not Available
Side: Fries, Green Beans, Carrots, Broccoli
When we go to connect "Entree" as "Pork" and "Salad" as "Not Available" with a bi-directional married pair, Hexawise will rightly stop us. While we can logically say that every time "Entree" is "Chicken", "Salad" is "Not Available" and every time "Entree is Pork", "Salad" is "Not Available", we can't say the reverse. It's nonsensical to say that every time "Salad" is "Not Available", "Entree" is "Chicken" and every time "Salad" is "Not Available", "Entree" is "Pork".
This is precisely why Hexawise has uni-directional married pairs. What we do in this case is create a uni-directional married pair between "Chicken" and "Not Available," which says that every time "Entree" is "Chicken", "Salad" is "Not Available", but it's not the case that every time "Salad" is "Not Available", "Entree" is "Chicken". This of course leaves us free to create a uni-directional married pair between "Pork" and "Not Available". With this design, we're back to Hexawise wanting to pair "Steak" with "Not Available" and "Salmon" with "Not Available" since our uni-directional married pairs don't prohibit that, so we need to add our invalid pairs for those two pairings.
So our final solution for version 3 looks like:
Entree: Steak, Chicken, Salmon, Pork
Salad: Caesar, House, Not Available
Side: Fries, Green Beans, Carrots, Broccoli
Uni-directional Married Pair - Entree:Chicken → Salad:Not Available
Uni-directional Married Pair - Entree:Pork → Salad:Not Available
Invalid Pair - Entree:Steak ↔ Salad:Not Available
Invalid Pair - Entree:Salmon ↔ Salad:Not Available
Let's suppose the specifications for version 4 now hit our desks, and they specify that those that chose the "House" "Salad" get a choice of two dressings, "Ranch" or "Italian". We can then end up with a dependent value that's dependent on another dependent value. That's ok. We've got this!
Entree: Steak, Chicken, Salmon, Pork
Salad: Caesar, House, Not Available
Dressing: Caesar, Ranch, Italian, Not Available
Side: Fries, Green Beans, Carrots, Broccoli
Uni-directional Married Pair - Entree:Chicken → Salad:Not Available
Uni-directional Married Pair - Entree:Pork → Salad:Not Available
Uni-directional Married Pair - Entree:Chicken → Dressing:Not Available
Uni-directional Married Pair - Entree:Pork → Dressing:Not Available
Bi-directional Married Pair - Salad:Caesar ↔ Dressing:Caesar
Bi-directional Married Pair - Salad:Not Available ↔ Dressing:Not Available
Invalid Pair - Entree:Steak ↔ Salad:Not Available
Invalid Pair - Entree:Salmon ↔ Salad:Not Available
Invalid Pair - Entree:Steak ↔ Dressing:Not Available
Invalid Pair - Entree:Salmon ↔ Dressing:Not Available
Hexawise tests can uncover any pair-wise defects in the identified parameters for version 4 of our hypothetical menu ordering system in just 20 tests out of a possible 192. We just saved ourselves from executing 172 extra tests or missing some defects!
Here is a wonderful webcast that provides a very quick, and informative, overview of rapid software testing.
Software testing is a person winding their way through a space, searching that space for important information.
James Bach starts by providing a definition of software testing to set the proper thinking for the overview.
Rapid software testing is a set of heuristics [and a set of skills]. Heuristics live at the border of explicit and tacit knowledge... Heuristics solve problems when they are under the control of a skilled human... It takes skill to use the heuristics effectively - to solve the problems of testing. Rapid software testing focuses on the tester... Tacit skills are developed through practice.
Automated software tests are useful but limited. In the context of rapid software testing only a human tester can do software testing (automated checks are defined as "software checking"). See his blog post: Testing and Checking Refined.
Many teams are trying to generate unusually powerful and varied sets of software tests by using Design of Experiments-based methods to generate many or most of their tests. The two most popular software test design methods are orthogonal array testing and pairwise testing. This article describes how these two approaches are similar but different and suggests that in most cases, pairwise testing is preferable.
Before advancing, it may be worth pointing out that Orthogonal Array Testing is also known as OA or OATS. Similarly, pairwise testing is sometimes referred to as all pairs testing, allpairs testing, pair testing, pair-wise testing, or simply 2-way testing. The difference between these two very similar approaches of pairwise vs. orthogonal array is that orthogonal array-based solutions require the same coverage goal that pairwise solutions do (e.g., that every pair of inputs is tested at least once) plus an additional hurdle/characteristic, that there be a uniform distribution throughout the domain.
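To make the pairwise coverage goal concrete, here is a rough plain-Python sketch of a greedy 2-way test generator. This is just an illustration of the coverage goal, not the algorithm Hexawise or any other tool actually uses, and the parameters are hypothetical; brute-forcing the full Cartesian product like this only makes sense for tiny examples.

```python
from itertools import combinations, product

def pairwise_tests(parameters):
    """Greedy sketch of 2-way (pairwise) test selection.

    The goal being illustrated: every pair of values from every pair of
    parameters appears together in at least one selected test."""
    names = list(parameters)
    uncovered = set()
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        for va, vb in product(parameters[a], parameters[b]):
            uncovered.add((i, va, j, vb))

    tests = []
    while uncovered:
        best, best_gain = None, -1
        # Brute-force scan of the full Cartesian product; only sensible for tiny plans
        for candidate in product(*(parameters[n] for n in names)):
            gain = sum(1 for (i, va, j, vb) in uncovered
                       if candidate[i] == va and candidate[j] == vb)
            if gain > best_gain:
                best, best_gain = candidate, gain
        tests.append(best)
        uncovered = {(i, va, j, vb) for (i, va, j, vb) in uncovered
                     if not (best[i] == va and best[j] == vb)}
    return tests

# Hypothetical parameters, just to show the shape of the output
plan = {
    "Browser": ["Chrome", "Firefox", "Safari"],
    "OS": ["Windows", "macOS"],
    "Payment": ["Visa", "PayPal", "Gift Card"],
}
for test in pairwise_tests(plan):
    print(test)
```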
I have studied the question of how software testing inputs can be combined most efficiently and effectively pretty steadily for the last 7 years. I started by searching the web for "Design of Experiments" and "software testing" and found references to Dr. Madhav Phadke (who, by coincidence, turns out to have been a student of my father's).
I discovered that Dr. Phadke had designed RDExpert which, although it had been primarily created to help with Research & Design projects in manufacturing settings, could also be used to select small sets of powerful test sets in software testing projects, using the Orthogonal Array-based test selection criteria.
I used RDExpert to create test sets (and compared those test sets against sets of tests that had been selected manually by software testers).
I gathered results by asking one tester to execute the manually selected tests and another tester to execute the Orthogonal Array-based tests; the OA-based tests dramatically outperformed the manually selected ones in terms of defects found per tester hour and defects found overall.
So, in short, I had confirmed to my satisfaction that an OA-based test data combination strategy was far more effective than manually selecting combinations for the kinds of projects I was working on, but I was curious if other techniques worked better.
After more study I have concluded that:
Pairwise is more efficient and effective than orthogonal arrays for software testing.
Orthogonal Arrays are more efficient and effective for manufacturing, and agriculture, and advertising, and many other settings.
Why is a pairwise testing strategy better than an orthogonal array strategy?
Pairwise testing almost always requires fewer tests than orthogonal array-based solutions (it is possible, in some situations, for them to have an equal number of tests).
Remember, the reason that orthogonal array-based solutions require more tests than a pairwise solution to reach the coverage goal of testing all pairs of test conditions together in at least one test is the additional hurdle/characteristic that orthogonal array testing has, namely, that there be a uniform distribution throughout the domain.
The "cost" of the extra tests (AKA experiments) is worth paying in many settings outside of the software testing industry because the results are non-binary in those tests. Someone seeking a desired darkness and gloss and luminosity and luster for a particular shade of green in the processing of film, for example, would benefit from with the information obtained from the added information gathered from orthogonal arrays.
In software testing, however, the added costs imposed by the the extra tests are not worth it. You're generally not seeking some ideal point in a continuum; you're looking to see what two specific pieces of data will trigger a defect when they appear in the same transaction. To identify that binary approach most efficiently and effectively, what you want is a pairwise solution (with fewer tests), not a longer list of orthogonal array-based tests.
Let me also add these points.
First, unlike some of my other views on combinatorial test design, my opinion on this narrow subject is not based on multiple empirical studies; it is based on (a) the reasoning I laid out above, (b) a dozen or so conversations I've had with PhDs who specialize in the intersection of "Design of Experiments" and software test design, and (c) anecdotal evidence from using both methods.
Secondly, to my knowledge, very few, if any, studies have gathered empirical data showing benefits of pairwise solutions vs. orthogonal array-based solutions in software testing scenarios.
Thirdly, I strongly suspect that if you asked Dr. Phadke, he would give you his reasons for why orthogonal array-based solutions are appropriate (and even preferable) to pairwise test case selection methods for certain kinds of software projects. I have a huge amount of respect for both him and his son.
Time doesn't allow me to get into this last point much now, but "mixed strength" tests are another, even more powerful, test design approach for you to be aware of. With mixed strength testing solutions, the test designer is able to select a default coverage strength for the entire plan (e.g., pairwise / AKA 2-way coverage) and, in the same set of tests, select certain high priority values to receive higher coverage strength. For example, 4-way coverage strength selected for "Credit Rating," "Income," "Loan Amount," and "Loan to Value Ratio" would give you a plan that achieved pairwise coverage for everything in the plan plus comprehensive coverage for every imaginable combination of values from those four high priority parameters. This approach allows you to focus on risk-based testing considerations.
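As a quick illustration of what that higher strength means for those four parameters, here is a tiny plain-Python sketch. The parameter values are hypothetical; the point is only that 4-way strength obligates every combination of values across those four parameters to appear in at least one test, while everything else in the plan still only needs pairwise coverage.

```python
from itertools import product

# Hypothetical values for the four high-priority parameters named above
credit_rating = ["Excellent", "Good", "Poor"]
income = ["Low", "Medium", "High"]
loan_amount = ["Small", "Large"]
loan_to_value = ["Under 80%", "80% or more"]

# 4-way strength for these parameters means every one of these combinations
# must appear in at least one test; the rest of the plan stays at 2-way strength.
required = list(product(credit_rating, income, loan_amount, loan_to_value))
print(len(required))   # 36 combinations that must each show up somewhere
```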
Sorry if I got a bit long-winded. It's a topic I'm passionate about.
Originally posted on Stack Exchange. Additional note added after the first 3 comments were submitted:
@Hannibal, @Peter K., and @MichaelF, Thanks for your comments! If you'd like to read more about this stuff, I recommend the multiple links available through this "bundle of links" about pairwise testing and combinatorial testing. In particular, Michael Bolton's article on pairwise testing is directly relevant and very clearly written. It is one of the few introductory articles around that accurately describes the difference between orthogonal array-based solutions and pairwise solutions. If I remember correctly though, the example Michael uses is a rare exception to the rule; the OA solution has the same number of tests as an optimal pairwise solution does.
More than 100 Fortune 500 firms use Hexawise to design their software tests. While large companies pay six figures per year for enterprise licenses, Hexawise is available for free to schools, open source projects, other non-profits, and teams of up to 5 users from any kind of company. Sign up for your Hexawise account.
The test cases in a given test plan should be sufficiently similar and should not have wildly divergent paths depending on the value of parameters in a test case.
When you do find that your test case flow diverges too much, you often want to break your test plan down into a few different test plans, so that you have a plan for each different kind of pass through the system.
A similar approach is to decrease the scope of your test plan a bit so that you end up with test cases that are all similar within the plan.
Lastly, let's say the flows aren't wildly divergent, but only slightly so. As a silly example let's say you were testing a recipe that varied based on the fruit selected.
Fruit: Apple, Grape, Orange, Banana
And then you wanted a step for how the peeling was done.
Peeling: By hand, By manual peeling tool, By automated peeler
Peeler type: Hand crank, Battery powered, AC powered
Now... our testing flow here has some divergence. Grapes and Apples don't get peeled in this recipe, so they never enter that flow. And Bananas are always peeled by hand so they only get a part of that flow. If this was just the tip of the iceberg of the divergence, we should create a test plan for Grapes and Apples and a different one for Oranges and Bananas.
But if this is the entire extent of the divergent flow, then we want to take advantage of N/A values and married and invalid pairs.
Peeling: By hand, By manual peeling tool, By automated peeler, N/A
Peeler type: Hand crank, Battery powered, AC powered, N/A
We marry Grape and Apple (uni-directionally) to the two N/A's so they don't participate in the peeling flow. We marry Banana (uni-directionally) to "By hand" and the second N/A so it has a partial and circumscribed pass through the peeling flow.
Lastly, we use invalid pairs so that Orange can't be paired with either N/A.
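Here is a small plain-Python sketch of those constraints, written as a simple predicate. It is purely illustrative; it mirrors the married and invalid pairs described above rather than anything about how Hexawise stores them.

```python
# Purely illustrative predicate mirroring the married and invalid pairs described above
def peeling_allowed(fruit, peeling, peeler_type):
    if fruit in ("Grape", "Apple"):
        # Married (uni-directionally) to the two N/A values: no peeling flow at all
        return peeling == "N/A" and peeler_type == "N/A"
    if fruit == "Banana":
        # Married (uni-directionally) to "By hand" and the second N/A: a partial pass
        return peeling == "By hand" and peeler_type == "N/A"
    # Orange: invalid pairs forbid N/A for either peeling parameter
    return peeling != "N/A" and peeler_type != "N/A"
```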
That's how a slight flow variation can be accommodated. Please comment with any questions about any of these approaches to your problem.
I saw these words of advice from Conrad Fujimoto in an email and thought they were worth passing on. I'm using them with Conrad's permission:
Over the years, I’ve taught many software testing courses. Trainees are appreciative of the ideas, insights, and techniques presented to them. They are convinced that the principles and methods taught are useful and effective. Yet, often I hear the phrase “but, that won’t work here.”
Some of the reasons given for such pessimism are resource constraints, organizational politics, lack of testing focus, and little management understanding and support. The trainees knew what adjustments needed to be made, but they felt powerless to effect any meaningful change. Fortunately, much can be accomplished by planning strategically and being aware of opportunities.
When no one is taking a leadership role in improving the process, consider assuming that role (people are often happy to see someone take charge).
Seek opportunities to form relationships and work with others who share the same concerns about the existing process.
Establish your authority and credentials for speaking on testing matters by recognizing and promoting the successes of your testing team.
Be proactive and constantly monitor and report the progress of both development and testing against published schedules.
Be ready to implement corrective actions or invoke contingency plans in the event of schedule slippages; where appropriate, suggest process changes that reduce future slippages.
Always perform test closure activities and ensure that lessons learned are recorded and reported.
Get testing representation in requirement review meetings and on the change control board.
Foster an attitude of continuous improvement; build on your successes.
As software testers, we have a professional obligation to do our best in assisting our organizations to build quality software. We may not necessarily have the term “manager” in our job title, but we still have the ability to be leaders. We can guide our organizations to creating better software.
Conrad Fujimoto is an expert instructor and consultant. He teaches Software Tester Certification for SQE Training.
John's book, and blog, discuss the challenges of actually getting improvements put into action in the workplace. Getting past the resistance to new ideas, new ways of working and change is more difficult than it should be. But there are practical steps you can take to get improvements adopted, including those mentioned above.
A client informed us that they had created (and used) approximately 3,500 test cases to test the search functionality of their application. They had strong suspicions that (a) they should be able to test the search functionality of their application with fewer tests, (b) the tests they had created accidentally omitted many hundreds of plausible combinations of values that would be useful to test (but they did not know how to precisely identify where those gaps were without a huge amount of work), and (c) many of these tests were quite inefficient in that they repeated many steps that had already been tested in other tests in the plan (even if they were not 100% duplicative of any other single test in the plan).
This client should have spoken to Lanette Creamer before they got into that situation. Lanette is a testing expert and blogger with ideas worth paying attention to. For example, her paper, Reducing Test Case Bloat, is well worth reading as is her blog.
There are times when what you cut may not be bloat. There are some situations where the decisions are the equivalent of “Do we cut off the arm or the head?” Well, a person can live without an arm. If you are in a situation where you are so time constrained that critical areas will be untested, you can still communicate the risk, be transparent and use a strategy to test the most important areas first. It is possible to plan for and do testing for a very time constrained project.
Of course, avoiding this situation is best. Improving testing processes to use the best ideas and tools is a better option. Cutting the bloat can allow resources to be applied to the areas where they are really needed. Often, though, people are scared of trying new ideas and cling to old methods, even if that results in the organization taking on increased risk by failing to test critical areas sufficiently. It feels safer to keep saying "we need more funding if you want more testing" than to try a new approach.
As part of your plan to reduce bloat, it can be helpful to state your assumptions about who is important and where you are placing testing priority and why. When reducing test case bloat you are taking a calculated risk. You are weighing the risk of being unable to test new features by insisting on testing every legacy case against the risk of purposefully not running some tests. When you share your starting assumptions with your stakeholders you offer them the chance to counter with their own assumptions and often you can clarify the boundaries of your testing this way to avoid gaps in testing or duplication.
See the full article for more good ideas on how to get better results for the existing testing resources available to your organization.
As background to my answers: The studies and dozens of proof of concept pilot projects that I’ve been directly involved with have sought to answer these 3 questions:
1) Is it actually faster to generate tests with Hexawise than creating and documenting them manually?
Consistent findings: Yes. It takes, on average, about 40% less time to create and document tests using Hexawise because using Hexawise allows testers to partially automate test selection and test documentation steps.
2) Is it possible to generate smaller sets of tests that will be as thorough or more thorough than larger sets of manually created tests and allow testers to find more defects in less test execution time?
Consistent findings: Yes. Typically more than twice as many defects per tester hour. See, e.g., the IEEE Computer article written with 3 PhDs showing an increase in defects found per tester hour of 2.4 times. A more recent set of 10 proof of concept pilot projects at an insurance firm revealed 3.0 times as many defects per tester hour. See: Does pairwise testing really work? Evidence, data, and case studies.
This is because Hexawise-generated tests (or any pairwise tests, for that matter) consistently have dramatically less wasteful repetition than manually selected tests will and because Hexawise-generated tests leave no potential dual-mode faults untested (that is, no potential pairwise defects involving test inputs that have been contemplated by the tester and included in their models).
3) Finding more defects per tester hour is certainly nice, but do Hexawise-generated tests find MORE defects?
Consistent findings: Yes. Much smaller sets of Hexawise tests have consistently found more defects, on average by about 13%.
In answer to specific questions:
Phil Kirkham: “Seems a very basic measure of a testers productivity How about the severity of the defects ?”
In my experience of more than 5 years helping teams conduct these proof of concept pilot projects, pairwise and Hexawise-generated tests are just plain more effective at finding defects. They find more of ALL kinds of defects. My experience has been that the types of defects being found are not skewed towards less significant types of defects, nor do the tests miss severe defects. A case in point: at my old firm, one of the early adopters of orthogonal array testing approaches ran pilot project after pilot project with teams of testers reporting into him. I can’t remember the exact number of pilots he had conducted, perhaps 20 pilot projects or so, before he experienced a single defect that escaped the Hexawise-generated tests and was found by the much longer set of manually selected tests. So, a short, blunt, honest answer to your excellent question is: “Believe it or not, Phil, it almost never matters. This approach will find ALL of the defects you otherwise would have found. Plus additional ones.”*
*Major caveat here that calls into question my specific answer here (to your question concerning severity) as well as all of the results from all of the studies I’ve been involved with. Testers like you and me are strong proponents of Exploratory Testing. These studies, though, treat the test inputs and test cases as “frozen.” You have the test cases in list A (created manually) and the test cases in list B (created by using Hexawise). The ideas about what can be changed from test to test (parameters) and how each of those things can be changed (values) are identical in both lists. The difference is that list A has lots of wasteful repetition and lots of gaps in coverage. List B has neither. That’s the only difference. Then one tester executes the tests from list A and another tester executes the tests from list B. But what if you have an unskilled tester following rote scripts executing one set of tests and someone like you, Rob Sabourin, Michael Bolton, James Bach, Shmuel Gershon, Ajay Balamurugadas, etc., executing the second set? Whoa! All bets are off. What would happen is that skilled Exploratory Testers would use the Hexawise-generated test ideas (which they would not want to be overly-detailed), and go “off script” to explore interesting test ideas that they cooked up in real time as they were doing their testing. So skilled Exploratory Testers would be able to find defects (presumably including serious ones) that the written test cases, regardless of whether they were manually created or created by Hexawise, would not lead them to directly. That’s an important topic for another time. I’ll be talking about Exploratory Combinatorial Testing at the Conference of the Association of Software Testing – CAST – this year in my home town of Madison, Wisconsin. Since you’re also going, perhaps we could collaborate and you could share your experiences (good or bad) with the attendees. I’ll happily give you 10 minutes of my speaking time to share your experiences if you’d like.
PK: Are you only counting functional defects ?
JH: Not explicitly. The directions I give to teams running these pilot projects are: try to answer the 3 questions above, and report defects. We’re not after a count of “failed test cases.” We’re after a number of defects. Having said that, as you might suspect, most of the defects reported tend to be functional defects.
PK: What about all the other types of defect ?
JH: They’re not reported as often but we count them too. In situations where one tester reports a “hard to spot” bug (e.g., one that might take a more experienced tester to identify), it raises the possibility that the bug is being reported not because one set of tests is superior to the other but because one tester is better than the other. Accordingly, in an effort to keep an apples to apples comparison, we talk with the tester and try to determine with the tester’s input whether the tester would have found that same defect with the other set of tests. If the answer is yes, we’d report the defect as “found” in both sets of tests. This doesn’t happen as often as you might suspect it would.
PK: Does the project type matter ?
JH: Yes. Benefits tend to be relatively smaller when there are a disproportionately high percentage of small, discrete, one-off tests. And higher when there are more than 5 parameters that interact in meaningful ways. And easier to capture when the System Under Test does not have a lot of conditional branching logic.
PK: How about the devs they are working with and the practices they follow ?
JH: I don’t have enough empirical evidence to say definitively. It’s used successfully by thousands of testers in waterfall projects and thousands of testers in Agile projects.
PK: What about the experience of the tester, does that make a difference?
JH: Even more important than the experience level of the tester are, in order, (1) analytical ability, (2) willingness to try new things, and (3) willingness to ask questions. By my estimates, about 50% of the testers I come across at our clients (almost all at Fortune 2000 firms) would not be able to design excellent sets of pairwise tests from scratch. This is because above average analytical ability is required for testers to select parameters and values from their Systems Under Test in a thoughtful way. Getting back to experience level, some of our strongest users at our clients are straight out of college. They start work, get exposed to Hexawise, “get it,” and don’t look back. Interestingly, some testers who have been testing for, say, 10 years or more – while experienced – sometimes seem to be too set in their ways to embrace this rather different approach to designing tests.
PK: If they are working with top of the range developers (as some lucky testers are cough cough) then there aren’t that many functional bugs to be found and you’re looking at browser compatibility, usability, race conditions – is combinatorial testing going to find these more quickly ?
JH: Yes. Absolutely. If you’d like to collaborate to test that and help gather empirical evidence that you could share at CAST, I would be happy to work with you to do just that. If your experience contradicts what I’m saying here, you’d have the floor to tell CAST participants what your actual experience was.
PK: I read the study in the link – 97% of defects could be found by pairwise combinatorial testing ? Really ? ALL types of defects ? Really ? How can pairwise find a defect caused by a missing or ambiguous or inconsistent requirement, or a performance or security ?
JH: The statistics I quote are a lot lower than that. The pie chart I use averages out several studies done by PhDs that have found, on average, 84% of defects could be triggered by testing for all combinations of 2 test inputs. The 97% figure is eyebrow-raising on its own (regardless of industry). Given that it came from the medical device industry in the United States (one of the most litigious areas in the history of the world?), that statistic is particularly mind-boggling. What the PhDs in that study did was take a look at all of the medical devices that had been taken off of the market in the United States as a result of software defects. Then they investigated how many test inputs would be required to trigger each of those defects. The authors of that study found that an astonishing 97% of those defects could have been triggered by just 2 test inputs.
PK: Love your passion and enthusiasm and I do have a beta of Hexawise to see if it can do anything for my productivity – and I might agree that there is a lack of empirical studies, not just among the testing community but the s/w community as a whole into the effectiveness or not of how software is produced
JH: Thanks. I hope you have positive experiences with using Hexawise and I’m happy to help you if ever have any questions about using Hexawise on your projects.
Attempting to assess the relative benefits of more than 200 software development practices is not for the faint of heart. Context-specific considerations run the risk of confounding the conclusions at every turn. Even so, Capers Jones, a software development expert with dozens of years of experience and nearly twenty books related to software development to his credit, recently attempted the task. He's literally devoted decades of his career to assessing such things for clients. We're quite pleased with how using Hexawise fared in the analysis.
Software development, maintenance, and software management have dozens of methodologies and hundreds of tools available that are beneficial. In addition, there are quite a few methods and practices that have been shown to be harmful, based on depositions and court documents in litigation for software project failures.
In order to evaluate the effectiveness or harm of these numerous and disparate factors, a simple scoring method has been developed. The scoring method runs from +10 for maximum benefits to -10 for maximum harm.
The scoring method is based on quality and productivity improvements or losses compared to a mid-point. The mid point is traditional waterfall development carried out by projects at about level 1 on the Software Engineering Institute capability maturity model (CMMI) using low-level programming languages. Methods and practices that improve on this mid point are assigned positive scores, while methods and practices that show declines are assigned negative scores.
The data for the scoring comes from observations among about 150 Fortune 500 companies, some 50 smaller companies, and 30 government organizations. Negative scores also include data from 15 lawsuits.
The article provides guidance, based on the results achieved by many, and varied, organizations with respect to software projects.
finding and fixing bugs is overall the most expensive activity in software development. Quality leads and productivity follows. Attempts to improve productivity without improving quality first are not effective.
This is an extremely important point for business managers to understand. Those involved in software development professionally don't find this surprising. But business people often greatly underestimate the costs of maintaining and updating software. The costs of bugs introduced by fairly minor feature requests to a system that doesn't have good software test coverage or test plans often create far more trouble than business managers expect.
This is especially true because there is a high correlation between software applications that have poor software testing processes (including poor test coverage and poor or completely missing test plans) and those applications that were designed without long term maintenance in mind. Both deficiencies result from decisions made to minimize initial development costs and time. They both show a lack of appreciation for wise software engineering practices and software application project management.
The article discusses a complicating factor in assessing the most effective software development practices: the extremely wide differences in software engineering scope. Projects range from simple applications one software developer can create in a short period of time to massive applications requiring thousands of developer-years of effort.
In order to be considered a “best practice” a method or tool has to have some quantitative proof that it actually provides value in terms of quality improvement, productivity improvement, maintainability improvement, or some other tangible factors.
Looking at the situation from the other end, there are also methods, practices, and social issues that have been demonstrated to be harmful and should always be avoided.
Although the author’s book Software Engineering Best Practices dealt with methods and practices by size and by type, it might be of interest to show the complete range of factors ranked in descending order, with the ones having the widest and most convincing proof of usefulness at the top of the list. Table 2 lists a total of 220 methodologies, practices, and social issues that have an impact on software applications and projects.
The average scores shown in table 2 are actually based on the composite average of six separate evaluations:
Small applications < 1000 function points
Medium applications between 1000 and 10,000 function points
Large applications > 10,000 function points
Information technology and web applications
Commercial, systems, and embedded applications
Government and military applications
The data for the scoring comes from observations among about 150 Fortune 500 companies, some 50 smaller companies, and 30 government organizations and around 13,000 total projects. Negative scores also include data from 15 lawsuits.
The scoring method does not have high precision and the placement is somewhat subjective.
Top 10 tools and practices listed in the article:
1. Reusability (> 85% zero-defect materials)
2. Requirements patterns - InteGreat
3. Defect potentials < 3.00 per function point
4. Requirements modeling (T-VEC)
5. Defect removal efficiency > 95%
6. Personal Software Process (PSP)
7. Team Software Process (TSP)
8. Automated static analysis - code
8. Mathematical test case design (Hexawise)
10. Inspections (code)
We are obviously thrilled that Hexawise is listed. We have seen the value our customers have achieved using mathematical based combinatorial software test plans (see several Hexawise case studies). It is great to see that value recognized in comparison to other software development practices and judged to be of such high value to software development projects.
The article makes it clear that the importance of the results is not in "the precision of the rankings, which are somewhat subjective, but in the ability of the simple scoring method to show the overall sweep of many disparate topics using a single scale."
The methodology behind the results shown in the article can be used to evaluate your organization's software development practices and determine opportunities for improvement. But, as stated above, software projects cover a huge range of scopes. The specific project's needs will drive which practices are most critical to achieving success for that project. The list in the article, of which practices have provided huge value and which practices have resulted in great harm, is a very helpful resource, but project managers, software developers, and testers need to apply their judgement to the information the article provides in order to achieve success.
A leading company will deploy methods that, when summed, total to more than 250 and average more than 5.5. Lagging organizations and lagging projects will sum to less than 100 and average below 4.0.
The use of Hexawise has been growing, and that has helped increase the number of software projects using best practices (those that score 9 or higher); however, as the article states, there is still quite a need for improvement.
From data and observations on the usage patterns of software methods and practices, it is distressing to note that practices in the harmful or worst set are actually found on about 65% of U.S. Software projects as noted when doing assessments. Conversely, best practices that score 9 or higher have only been noted on about 14% of U.S. Software projects. It is no wonder that failures far outnumber successes for large software applications!
A score of 9 to 10 for a practice means that the practice results in a 20-30% improvement in the quality and productivity of software projects.
Conclusion: while your individual mileage may vary, this report provides further evidence that using Hexawise really does lead to large, measurable improvements in efficiency and effectiveness.
We are very proud of the success of Hexawise thus far; as a new year starts we see huge potential to help many organizations improve their software development efforts.
The article also includes a valuable list of references and suggested readings.
The video makes the case that the value to be gained from human-computer cooperation is being ignored far too often. A focus on maximizing the results based on improving the ability to cooperate is worthwhile.
What this means in practice is people taking more responsibility for using computers as tools to accomplish what is needed. This already happens a great deal, but often in an unexamined way, and therefore the current methods leave a great deal of room for improvement. We rarely focus on how to enhance the cooperation; we mainly see the software as one separate part of the process and a person's contribution as another separate part. Focusing on computers (and software) as tools used by people to accomplish objectives is helpful.
Weaknesses in how people use a product, service, or software are often weaknesses in focusing on the way people will really use it versus how it is "supposed" to be used. By understanding that the process that matters is a person and a computer together adding value, we can create more effective software applications.
People often try to design software solutions that remove the need for humans to be involved. For complex problems, though, it is often much more effective to design solutions where people take advantage of computer tools to achieve results. People should use computers to automate the things that make sense to automate, keep track of data, and make calculations, thus leaving themselves free to use their superior insight, vision, intuition, and flexibility in making judgements.
Hexawise is built to take advantage of this type of cooperation. Even though it is a "test design tool," Hexawise doesn't take the lead role in designing the tests. Humans do. Humans do the things that they're better than computers at, such as (a) thinking up clever test ideas and test inputs, and (b) identifying, from dozens of possible parameters, which are the ones that are most important to vary in order to achieve potentially interesting interactions from test to test. Computer algorithms aren't nearly as good as humans at such tasks. Computers, though, will run circles around any human who tries to construct a set of tests such that (a) the variation between each test is as different as possible, (b) the wasteful repetition of combinations of values that appear together in different tests is minimized, (c) gaps in coverage are minimized (by, e.g., ensuring that every single pair or every single 3-way combination of tests appears in at least one test case), and (d) all of the above objectives are achieved in the fewest possible test cases. Computer algorithms eat those kind of challenges for breakfast. And complete them without error. In seconds.
Said a different way, people are better at figuring out interesting ideas to test. Once those are identified, those test conditions and other test ideas need to be combined together and put into tests. Generating a highly efficient, maximally varied, minimally repetitive set of tests based on a given set of test inputs is something computer algorithms are more effective at than a person.
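One concrete example of the kind of bookkeeping that computers do effortlessly and people do badly is simply measuring how well a set of tests covers value pairs and how much it repeats itself. Here is a rough plain-Python sketch; the three-parameter example is hypothetical and the counting approach is only an illustration, not how Hexawise computes coverage.

```python
from itertools import combinations

def pair_coverage(tests):
    """Sketch: count distinct value pairs covered and wasteful repetitions.

    Each test is a tuple of values, one per parameter, in a fixed order."""
    seen = {}
    for test in tests:
        for (i, a), (j, b) in combinations(enumerate(test), 2):
            seen[(i, a, j, b)] = seen.get((i, a, j, b), 0) + 1
    distinct_pairs = len(seen)
    repeated_pairs = sum(count - 1 for count in seen.values())
    return distinct_pairs, repeated_pairs

# Hypothetical hand-picked tests over (Browser, OS, Language)
tests = [
    ("Chrome", "Windows", "English"),
    ("Chrome", "Windows", "Spanish"),   # repeats the Chrome/Windows pair
    ("Firefox", "macOS", "English"),
]
print(pair_coverage(tests))   # (8, 1): 8 distinct pairs covered, 1 wasted repetition
```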
Hexawise is not intended to eliminate the need for software testing experts. Hexawise is designed to allow software testing experts to focus on what they do well and to make everything else easy (creating test plans based on the experts' inputs, and so on). This allows software testing experts to spend their time thinking, by removing the time consuming tasks they would otherwise have to do. Hexawise also creates test plan coverage that is simply beyond the ability of people to create no matter how much time they are given. And software testing experts provide the inputs that the Hexawise software could not create on its own, no matter how much time it was given.
Hexawise is designed to optimize the ease of cooperation. We spend a great deal of time optimizing the software to make it most useful for people. The design decisions made in creating a software application are very different when the users are meant to thoughtfully interact with the application rather than merely operate it.
We see Hexawise as an extension of the software tester. We seek to optimize how a person can use Hexawise to create the most value. The measure is how much more effective the testing solutions are, not just how Hexawise performs in isolation from the user.
A great deal of our time has been spent on how to help software testing experts use Hexawise most effectively. These efforts often take the form of many, many small improvements that add up to an experience more like cooperation between two parties with different strengths than the more typical user experience of being forced to do whatever the software demands.
Based on my experience, over dozens of pilot projects where we've gathered hard data, many software testers would literally more than double their productivity overnight on many projects if they used combinatorial test design methods intelligently (in comparison to selecting test case conditions by hand).
In this 10 project study, Combinatorial Software Testing Case Studies, we found 2.4 times more defects per tester hour on average when we compared testers who executed manually-selected test cases to testers who executed test cases created by a combinatorial testing algorithm designed to achieve as much coverage as possible in as few tests as possible.
How many participating testers thought they would see dramatic increases before they gathered the data? Almost none (even testers who had been told about the prior experiences of their colleagues on similar projects). How many participating testers are glad that they took the time to use the scientific method?
Every one of them.
What stops more people from using the scientific method on their projects and gathering data to prove or disprove hypotheses like the one addressed in the study above? A pilot could take one person's time for less than 2 days. If past experience is any indication of future results (and granted, it isn't always), odds would appear pretty good that results would show that productivity would double (as measured in defects found per tester hour).
What's stopping the testing community from doing more such analysis? Perhaps more importantly, what is stopping you from gathering this kind of data on your project?
Additional empirical studies on the effectiveness of software testing strategies would greatly benefit the software testing community.
My experience indicates that an effective way to increase the likelihood that you will trigger such defects (without explicitly looking for them) is to try to maximize the variation between each test case you execute.
A case in point: when I sat down to dinner with James Bach a year or so ago in Boston at a testing conference, he gave me a quick testing challenge (as he is fond of doing with testers he meets for the first time, to see how we think). He asked how I would test a very simple calendar entry application that allowed users to record the start and end times of diary events. Key inputs to use as test conditions for these tests included start times and end times.
I proposed a set of times to try that were designed to provide as much variety as possible from one test case to the next. As inputs into the start and end times, I used a small number of different times spread throughout the morning, afternoon, and evening as well. The strategy I used quickly identified the testing defect the puzzle was designed to uncover in a small handful of tests. What was most memorable about the experience from my perspective was not that I "succeeded" in triggering the bug but that the tests I created triggered a type of bug that was, in Kaner's words, a "side effect of the tests rather than explicitly planned foci of the tests."
The business logic in the calendar application that should have identified invalid beginning and end time combinations was coded incorrectly. Instead of comparing the times as numbers, the business logic was ordering them alphabetically, as text. I was not consciously looking to identify that kind of flaw in the business logic, but by maximizing the variation from test case to test case, I maximized my odds of finding it.
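For readers who want to see the flavor of that defect, here is a tiny Python sketch. I'm assuming the times are held as simple "H:MM" strings, which is enough to show how an alphabetical comparison goes wrong even though the real application's representation may have differed.

```python
# Assumed representation: start and end times held as "H:MM" strings
start, end = "9:00", "10:30"

# Buggy check: comparing the raw strings orders them alphabetically,
# so "9:00" sorts after "10:30" and a perfectly valid event looks invalid
print(start < end)   # False -- wrong

# Correct check: compare the times as numbers (minutes since midnight)
def minutes(time_string):
    hours, mins = time_string.split(":")
    return int(hours) * 60 + int(mins)

print(minutes(start) < minutes(end))   # True -- as intended
```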
Efficiently achieving structured variation is difficult because it is hard for a human brain to remember whether dozens of different test conditions have already been tested together (or whether we're accidentally repeating ourselves). This is where pairwise and combinatorial test case generating tools like our Hexawise tool come in. They are designed to achieve as much variation from test case to test case as possible. One of the relatively unsung benefits of this approach is that doing so will help find bugs, like these, that you aren't even consciously looking for.
Malcolm Gladwell's three best-selling books are filled with fascinating examples of all sorts of things to underscore the interesting points he makes about a wide variety of topics. I'm currently reading his book "Blink," about how people process information and make decisions in the "blink of an eye."
The example he shares about an overcrowded hospital Emergency Room's decision making process really resonated with me. Brendan Reilly, named chairman of Medicine at Cook County Hospital in Chicago in 1996, was in dire need of ways to improve conditions at the hospital when he arrived. The Emergency Room faced extreme resource pressure in particular; there were too many patients needing care and too few doctors and beds to serve them all promptly. One of the most resource-sapping processes was the treatment of patients who entered the ER complaining of chest pains.
"From the beginning, the question of how to deal with heart attacks was front and center." About 30 people a day came into the ER worried that they were having a heart attack, "and those thirty used more than their share of beds and nurses and doctors and stayed around a lot longer than other patients."
"Reilly's first act was to turn to the work of a cardiologist named Lee Goldman. In the 1970's, Goldman...p. 133
He... announced that he was holding a bake-off. For the first few months, the staff would use their own judgment in evaluating chest pain, the way they always had. Then they would use Goldman's algorithm, and the diagnosis and outcome of every patient treated under the two systems would be compared.
For two years, data were collected, and in the end, the result wasn't even close. Goldman's rule won hands down in two directions: it was a whopping 70 percent better than the old method at recognizing patients who weren't actually having a heart attack. At the same time, it was safer. The whole point of chest pain prediction is to make sure that patients who end up having major complications are assigned right away to the coronary and intermediate units. Left to their own devices, the doctors guessed right on the most serious patients somewhere between 75 and 89 percent of the time. The algorithm guessed right more than 95 percent of the time.... He went to the ED and changed the rules."
I suspect that if you asked 100 cardiologists whether they would perform more accurate assessments than the 3-question Goldman algorithm would, you would find that all 100 of those cardiologists would feel confident in their ability to outperform the 3-question algorithm. After all, the algorithm didn't allow for subject matter expertise.
How is this Relevant to Software Testing?
When many business leaders think of software testing, what they want is something that catches bugs before the product is released to the customer. To some extent this can be automated. Knowledge of previous bugs (and experience with bugs in other software) allows us to create specific tests that will identify known kinds of bugs.
So if we have a form on the web application, we can test various scenarios and make sure the form is processed properly: submitting the form when it should or displaying the proper error message when it should do that. It is wise to have test plans to check for these bugs.
Creating simple easy to follow guidelines to address known issues is a wise action. This is true for creating a simple checklist to use in screening for heart attacks and for creating test plans that cover the known issues.
While those are wise, they are by no means sufficient. Physicians bring a great deal of expertise to evaluating patients for a huge number of symptoms. And they can use their knowledge and experience to judge where they need more evidence (for example, a diagnostic x-ray or an analysis of a blood sample) to make a judgement. Likewise, to judge whether software will work as users wish, simply following a test plan robotically is not sufficient. The experience and expertise of software testers allow them to probe the software application and spot not only bugs but weaknesses that should be addressed so users of the software have the best experience.
The appropriate methods depend upon the need. And sometimes the expert is not as effective as simply using tools or a check sheet that has been developed specifically for that purpose. It is important to design processes in your organization to find the most effective strategies given the local circumstances.
Experimenting, by using evidence based management practices, is needed to determine what is most effective. Assuming that the best solution is always relying on the experts is not accurate (as the example Gladwell used shows). A great tool to aid in determining what practices work best is the Plan-Do-Study-Act cycle made famous by Dr. W. Edwards Deming.
Removing inefficiency is good, sure, but it is not why Design of Experiments is so friggin' powerful. Saying DoE is interesting to know about because it can help identify and remove specific inefficiencies is a bit like saying Canada is a good country to visit because you can sometimes find a good cup of coffee there. To my mind, saying DoE is primarily about removing inefficiency misses the main point.
Design of Experiments is so powerful because it allows practitioners to predictably, systematically, and consistently find out more useful, actionable information in much less time than they would otherwise take to obtain this information (if they could find it at all with their less-structured approaches).
In manufacturing circles (e.g., when engineers produce new prototypes), DoE's ability to do this is no longer questioned. This is because leaders like George Box taught people in industry how to apply DoE and they gathered conclusive evidence that DoE allowed manufacturers to learn much faster through techniques like applying factorial designs. Box and other DoE experts (Taguchi, Montgomery, my dad, etc.) dealt with skeptical manufacturing engineers for four decades by showing them the facts and using DoE on the skeptics' own projects right under their noses. The evidence that DoE allows manufacturers to learn much faster (about a wide variety of learning goals) than the other methods they used prior to 1960 is incontrovertible.
In 2010, in the gradually maturing field of software testing, Design of Experiments-based methods of test case design have not caught on much at all yet. As an industry, software testing's adoption of DoE-based approaches is roughly where manufacturing was in 1960. Most software testers, even very good ones, don't know anything at all about how DoE can help them. Many other software testers have heard a bit about pairwise but mistakenly think that pairwise and related, structured, DoE-based test case selection methods can't help them.
Even some of the best testers in the world who have written some of the most clearly-written and well-reasoned articles about pairwise approaches do not (in my view) seem to fully understand: (a) how powerful the benefits are, (b) how often the approach can be applied / in how many diverse kinds of testing situations it can be utilized, and/or (c) how consistently the efficiency and effectiveness benefits are generated when these methods are used properly. DoE methods, including pairwise and n-wise and mixed strength automatic test condition generation (made possible by tools like our Hexawise tool and also, to a great extent, by James Bach's free AllPairs tool), allow software testers to learn much faster about critically important questions like: (1) where are the bugs?, (2) what is causing the bugs to appear?, (3) am I confident I have efficiently tested for a huge range of combinations of values in the System Under Test that might trigger defects?, (4) am I succeeding in avoiding redundant repetition of steps in many test cases?, (5) how many bugs would we be likely to find if we were to continue to run the next 100 tests?, etc.
In summary, the reason for the existence of Design of Experiments methods (whether we're talking about their applicability to testing software as efficiently and effectively as possible, or DoE methods' applicability to a huge variety of other objectives) - and, for that matter, the reason that they have been continuously refined and improved for 40+ years - is that DoE methods consistently and predictably allow users to learn actionable results as quickly as possible.
I am passionate about pairwise software testing techniques. I have helped dozens of teams, for example, carefully measure the benefits that can be created when teams of testers adopt pairwise and related combinatorial testing approaches to identify the test cases they will execute (as compared to manual test case identification methods). What usually happens is that tester productivity doubles. (See Combinatorial Software Testing - pdf download).
I believe these approaches will be much more widely adopted in a few years than they are now, for the simple reason that they consistently deliver dramatic benefits to both the speed of software test design and the efficiency and thoroughness of software test execution. As more teams try these methods for themselves, and measure the benefits they achieve with them, broader adoption seems highly likely to me.*
I see three main barriers to broader adoption by the testing community at large:
The first barrier is that testers will not make an attempt to apply this method to their testing projects, so they will never find out how effective it is.
The second barrier is that even testers who use the approach effectively a few times will not realize how much more effective it is making them. A dismissive thought process guilty of this might sound something like: "Those 11 bugs I just found? Yeah. I found them because I'm a good tester; the fact that I happened to use pairwise tests just now? That's largely irrelevant. I'm sure I would have found them regardless."
The third barrier is that testers unfamiliar with the basics of pairwise testing principles will design test cases without thinking about what they are doing, and achieve "garbage in / garbage out" results. The benefits that would have been so easily achieved in the testing project - like Lindsey Jacobellis' opportunity to win a gold medal for Snow Boarding - disappear in a groan-worthy moment of bone-headed stupidity.
This blog post addresses this third barrier. When testers sabotage their own test plans with a poor choice of inputs, they may well blame the test design strategy rather than themselves, which would be unfortunate. Here's one common problem I see (exaggerated a bit in this example to make my point).
Objective: create a set of tests that will check to see if the underwriting engine for a car insurance firm is calculating premium estimates correctly.
Our aspiring pairwise test designer enters stage left and identifies a set of parameters:
First Name, Last Name, Age of Primary Driver, Credit Score, Number of Cars, Number of Accidents, Number of Speeding Tickets, and Number of Additional Drivers
So far so good. We now have the initial ingredients for a thing of beauty; we have a set of parameters that could quickly result in a combinatorial explosion of possibilities and, ready to save the day, we have a test designer who has correctly identified this as an opportunity to achieve efficiency and thoroughness benefits through the application of pairwise testing methods. Our potential hero is a couple minutes away from creating a concise set of tests that will confirm not only that each of the data points in the plan works as it should, but that each works as it should in combination with each of the other data points in the test plan.
In other words, the plan will not only confirm that "Number of Accidents = 3" will impact premiums as it should on its own, but also that "Number of Accidents = 3" will work as it should when tested in combination with the other relevant inputs in the application, e.g.,: 3 accidents with every relevant input for "Age of Primary Driver," 3 accidents with every relevant input for "Credit Score," 3 accidents with every relevant input for "Number of Cars," 3 accidents with every relevant combination for "Number of Speeding Tickets," and 3 accidents with every relevant input for "Number of Drivers."
He's seen the Promised Land of improved efficiency and effectiveness and he's ready to enter. Unfortunately, with his next move, he demonstrates he's a doofus. Entry to Promised Land denied. Check out the values he chose to enter for each of his parameters.
Notice anything wrong here?
Just for fun, let's take a close up look at Lindsey's disastrous Snow Boarding maneuver here.
... and let's break down our shame-faced test designer's bone-headed move here. Can you notice what is wrong with his choices of values?
There are nine different parameters in the mix here. Of those, two ("First Name" and "Last Name") are the least important to our current objective of looking for problems in the underwriting engine calculations. And yet...
He's added ten values to each of them. Oops! Whenever you put together a pairwise (or 2-way) test plan, the number of tests required will never be lower than the product of the value counts of the two parameters with the most values. In plain English, that high-falutin' previous sentence means: when your other parameters have at most 4 values each, "10 largely irrelevant values X 10 largely irrelevant values = you're a big fat idiot," because you'll create a test plan with at least 100 test cases (as compared to a plan that could have covered the System Under Test more effectively with fewer than a quarter of the tests you've just created).
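If you want to sanity-check your own plans, the arithmetic behind that lower bound takes a few lines of Python. The value counts below are hypothetical but roughly match the example (two 10-value name fields alongside parameters with at most 4 values each):

```python
# A pairwise plan can never have fewer tests than the product of the two
# largest value counts among its parameters.
counts = [10, 10, 4, 4, 4, 4, 4]             # hypothetical: two 10-value name fields
largest, second = sorted(counts, reverse=True)[:2]
print(largest * second)                      # 100 tests, driven entirely by the names

counts = [2, 2, 4, 4, 4, 4, 4]               # trim the name fields to 2 values each
largest, second = sorted(counts, reverse=True)[:2]
print(largest * second)                      # 16 -- now the real parameters set the floor
```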
For more information on pairwise and combinatorial testing, I would recommend the following sources:
*The manufacturing industry followed a similar adoption pattern with methods that consistently delivered dramatic efficiency and effectiveness benefits. It took decades for multi-variate Design of Experiments methods to be widely adopted by manufacturers, long after the benefits had been proven dramatic and repeatable to anyone who looked at the clear, unambiguous, objectively-measurable evidence. Today, it is impossible to find a Fortune 500 manufacturing firm that does not regularly use multi-variate Design of Experiments in its manufacturing processes. One day the same will be true of Fortune 500 firms and their adoption of multi-variate Design of Experiments methods for software testing.
Elisabeth Hendrickson's new book, Explore It!, will begin shipping from Amazon in a week. If you're interested in software testing, I highly recommend it without reservation. It's outstanding. It is currently available for sale on Prag Prog and for pre-order on Amazon. The paper version will be published on January 22nd. Since Amazon apparently doesn't allow people to review books until they officially go on sale, I can't yet post my review on Amazon, but here, one week early, is my glowing review:
Explore It! is one of the very best software testing books ever written. It is packed with great ideas and Elisabeth Hendrickson's writing style makes it very enjoyable to read. Elisabeth Hendrickson has a well-deserved global reputation in the software testing community as someone who has the enviable ability to clearly communicate highly-practical, well-thought-out ideas. Tens of thousands of software testers who have already read her "Test Heuristics Cheat Sheet" no doubt already appreciate her uncanny ability to clearly convey an impressive number of actionable ideas with a minimal use of ink and paper. A pdf download of the cheat sheet is available here. If you're impressed by how much useful stuff Hendrickson can pack into one double-sided sheet of paper, you should see what she can do with 160 pages.
Testers at all levels of experience will benefit from this book. Like the best TED talks, Explore It! contains advanced ideas, yet those ideas are presented in way that is both interesting and accessible to a broad audience. Beginning testers will benefit from learning about the fundamentals of Exploratory Testing (an important and incredibly useful approach to software testing that is increasingly getting the respect it deserves). Experienced testers will benefit from practical insights, frameworks for thinking about challenges that bedevil all of us, and Hendrickson's unmatched ability to clearly explain important aspects of testing (including her superb explanations of test design principles).
Chapter 4 "Find Interesting Variations" in itself is worth far more than the price of the book. It is my favorite chapter in any software testing book I have ever read. A large part of the reason I have so much appreciation for this chapter is that I have personally been teaching software testers how to create interesting variations in their testing efforts for the last six years and know from experience that it can be a challenging topic to explain. I was excited to see how thoroughly Hendrickson covered this important topic because relatively few software testing books address it. I was humbled by how effortlessly Hendrickson seemed to make this complex topic easy to understand.
Buy it. You won't regret it. I'm buying multiple copies to give to developers and testers at my company as well as multiple copies to give to our clients.
A combinatorial explosion occurs when the configuration settings, user actions, data entered, and so on make it impossible to test everything. The number of tests required to individually cover every single possibility is many thousands of times greater than could realistically be executed.
Taking over an existing software application without a good test suite (or any test plan) is often daunting, and combinatorial explosion confronts you with an unfathomable number of possible tests. Hexawise is a software-as-a-service tool that helps software testers deal with this dilemma. It creates test plans that provide far better coverage than is typically seen in practice, with a tiny fraction of the tests required for complete combinatorial coverage (that is, testing every possible combination [pairwise or 3, 4, 5... way] individually).
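As a rough, hypothetical illustration of the scale involved (the parameter counts below are invented, not taken from a real plan), the exhaustive count is simply the product of every parameter's value count:

```python
from math import prod

# Hypothetical model: 10 parameters, each with 4 equivalence-class values.
value_counts = [4] * 10

# Testing every possible combination individually:
print(prod(value_counts))                # 1,048,576 tests

# By contrast, the hard floor for a pairwise (2-way) plan is only the
# product of the two largest value counts; real pairwise plans for a
# model like this typically run to a few dozen tests.
two_largest = sorted(value_counts, reverse=True)[:2]
print(two_largest[0] * two_largest[1])   # 16 tests at minimum
```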
The Google Maps test plan provides a good example of the combinatorial explosion faced by testers (in this case, those who tested Google Maps). Take a look at the Google Maps test plan by logging in to your Hexawise account (creating a demo account is free and simple). The Google Maps test plan is one of 9 samples currently provided in Hexawise.
When creating your own test plan, while you are exploring the software application and testing it out to find "where the weak points are," you will probably find it useful to vary things as much as possible and repeat your actions as little as possible. Those points are true whether you're doing relatively informal, lightly documented exploratory testing or more heavily documented test scripts. In addition, since a large percentage of defects can be triggered by the interaction of just two test inputs, it would be nice, if you had time, to test every single possible combination involving two test inputs; that's the rationale behind allpairs, pairwise, and orthogonal array-based test case prioritization methods.
To recreate a similar - very early draft - plan for yourself, I'd suggest going through the following steps to put together a relatively small number of highly informative end-to-end-ish tests:
Ask what can change as users go through the system. Think about configuration settings, user actions, data formats, data ranges, etc. Even throw in more "creative" ideas like user personas. Let your creativity and common sense guide you. Enter those in as parameters.
Ask how those parameters can change (for the parameter "Browser," enter IE7, IE8, FF, etc.). Put those in as values under each parameter (entering constraints as required).
Ask does that variation matter? When possible (when it doesn't matter as much) use equivalence classes and be biased towards fewer values - at least for your early draft tests.
Ask what special paths through the system you want to be sure to include (the most common happy path, paths to trigger certain business rules, etc.).
Click the Create Tests button in Hexawise and you'll instantly get a very nice draft starter set of highly varied tests. If they look relatively interesting and don't miss hugely important things, start informally executing them; as you do, you'll learn more about the system's weak points, which will send you back to those draft tests to iterate on them, make them stronger, and cover more.
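If you'd like to tinker with the same modeling steps outside of Hexawise, here is a minimal sketch using the open-source allpairspy library. The parameters, values, and constraint below are invented placeholders for whatever your own answers to the questions above turn out to be.

```python
from allpairspy import AllPairs

# Parameters and their values (placeholders; substitute your own).
parameters = [
    ["IE7", "IE8", "FF", "Chrome"],          # Browser
    ["New user", "Returning user"],          # User persona
    ["Credit card", "PayPal", "Gift card"],  # Payment method
    ["Standard", "Expedited"],               # Shipping
]

# A constraint, e.g. suppose gift cards can't be used with expedited shipping.
def is_valid(row):
    if len(row) >= 4 and row[2] == "Gift card" and row[3] == "Expedited":
        return False
    return True

# Generate a small, highly varied set of tests covering every valid pair of values.
for i, test in enumerate(AllPairs(parameters, filter_func=is_valid), start=1):
    print(i, test)
```

For this toy model the result is roughly a dozen tests instead of the 48 exhaustive combinations; Hexawise does the equivalent generation (plus weighting, mixed-strength coverage, and tester instructions) for you.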
To learn a bit more about using this approach, see our case studies. Hexawise TV provides narrated videos showing how to make your life easier as a software tester.
Creating test plans for create, read, update and delete (CRUD) functionality is a very common requirement. There are a few different ways to model it. Let's review the simplest solution first then move to a few optional extra ideas that you could use in addition.
Option 1: Only one C, R, U, D action tested in each of your tests
Have one Parameter called "CRUD action" with 4 different values: Create, Read, Update, Delete
Option 2: Include two CRUD actions in (most of) your tests
Create 2 Parameters with the following 4 and 3 Values, respectively:
First CRUD action: Create, Read, Update, Delete
Second CRUD action: Read, Update, Delete
You may want to add an invalid pair between the first CRUD action = Delete and each of the three values of the second CRUD action (in which case you'd need to add a 4th Value called N/A and leave it as the only available option for scenarios that delete a record in the first action).
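Here is a minimal sketch of Option 2 in plain Python; the N/A handling follows the assumption just described, and with only two parameters the set of valid pairs is the whole plan:

```python
from itertools import product

first_actions = ["Create", "Read", "Update", "Delete"]
second_actions = ["Read", "Update", "Delete", "N/A"]

def is_valid(first, second):
    # Nothing can follow a Delete in the same test, so Delete pairs only with N/A;
    # every other first action should be followed by a real second action.
    if first == "Delete":
        return second == "N/A"
    return second != "N/A"

tests = [pair for pair in product(first_actions, second_actions) if is_valid(*pair)]
for i, (first, second) in enumerate(tests, start=1):
    print(i, first, second)
# 10 tests: 3 non-Delete first actions x 3 second actions, plus Delete + N/A
```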
Option 3: Up to 4 CRUD actions in each of your tests
Create a new record: create, don't create
Read a record: read, don't read
Update a record: update, don't update
Delete a record: delete, don't delete
That's the most basic modeling decision (e.g., one CRUD action per test, two CRUD actions per test, or potentially 4 CRUD actions per test).
What other things might be interesting to vary?
You might want to think about "newspaper reporter questions" (who, what, when, where, why, how, how much?).
Which user type (or which system) is performing the CRUD action? Are certain types of users or systems not allowed to perform certain actions? Include test inputs to make sure they can't. Are others allowed to? Include test inputs in your model to make sure they can.
What kind of CRUD actions are being made? Valid ones? Invalid ones? How many invalid types of actions can you think of? What is supposed to happen when special characters are entered? What is supposed to happen when blanks are entered? What is supposed to happen when extremely long names are entered (e.g., are there concatenation rules?)
When are the updates being made? Are there any system downtimes? What happens to CRUD actions that are attempted then? What happens if the time stamp on a CRUD action is from the future? From the past? Originates in a different time zone?
When? ("version 2")
When are the CRUD actions tested in relation to one another? E.g., update a record and then try to read it, or read it and then try to update it? If you wanted to "mix up the order" of the Option 3 actions, for example, you could add a parameter called "perform CRUD actions in what order?" with values such as "normal create then read then update then delete order," "the reverse of the normal order as much as possible," and "start with read or update." This would not give you a "perfect solution" covering every timing combination, but it would add variability within your tests if you wanted it.
Why is a CRUD action being conducted? Do any interesting testing ideas come to mind when you think about this angle? How about bad or malicious motivations?
How are the actions being undertaken? By a keyboard? By a mouse? By a batch processing system?
Lastly, if you have an account you created on Hexawise.com, check out sample test plan "08. Modifications to a Database" for some additional ideas.
Bottom line: when test designers create tests (whether they use Hexawise to generate their tests or whether they select and document their tests by hand), they will have a lot of discretion over what kinds of ideas to include in / exclude from their tests. Using Hexawise offers these advantages over selecting tests by hand:
You will maximize variation in your tests
You will minimize wasteful repetition in your tests
You will achieve 100% coverage of the combinations you tell the tool to include in your solution (e.g., pairwise testing coverage, three-way coverage, risk based testing coverage, etc.)
You will be able to select those combinations and document tester instructions faster and with fewer errors in the documented tests.
This post is based on a user question. See the Hexawise user forum page for some additional details related to the user's follow-up questions.
Bet that got your attention. It's true, but let me qualify it: Running test cases over and over in the hope that bugs will manifest sucks. It’s boring, uncreative work and since half the world thinks that is all testing is about, it is no great wonder few people covet testing positions. Testing is either too tedious and repetitive or it’s downright too hard. Either way, who would want to stay in such a position?
The hard parts of the testing process, like deciding what to test, determining test completeness, crafting user scenarios, and so forth, are creative and interesting work. Testers who spend their time categorizing tests and developing strategy (the interesting part) are more focused on testing and spend less time mechanically running tests (the boring part).
So all the managers out there need to ask themselves what they've done lately to make their testers more creative. If you don't have an answer, then testing isn't the only thing that sucks.
One of the great benefits of Hexawise is that it takes care of figuring out the best test plan to provide the coverage needed. The software test planner needs to use their knowledge, experience, and creativity to determine what factors and parameters are critical to test. Then Hexawise generates a test plan that provides maximum coverage with the fewest possible tests. If people try to manually create test plans that address the interactions between the factors to be tested, it is not only extremely time consuming and not much fun, it is essentially impossible to do well.
Some things are just so complex, or so effectively handled with well-designed software, that people cannot compete. Designing software test plan coverage is one of those areas.
Hexawise also lets the software tester easily tune the test coverage based on what is most important. Certain factors can be emphasized, others can be de-emphasized. Knowledge is needed to decide which factors are most important, but after that, designing a test plan based on that knowledge shouldn't take up staff time; good software can take care of that time-consuming and difficult task.
Another nice feature included with Hexawise is that detailed tester instructions are generated automatically. And you can easily provide customized text to ensure the test instructions and the expected outcomes are clear and complete.
Hexawise greatly reduces the number of tests that need to be run by creating powerful test plans that provide more coverage with fewer tests. This, again, frees up tester time to focus on value-added activities.
Allowing testers to focus on adding value is a key aim of ours. We strive to automate what we can and allow testers to apply their knowledge, experience, and creativity to helping create great software. Hexawise grew out of the work of George Box, William Hunter (the founder's father), and W. Edwards Deming, who sought to use statistical tools to free people to focus on creative tasks. For example, read Managing Our Way to Economic Success: Two Untapped Resources by William G. Hunter - "Two resources, largely untapped in American organizations, are potential information and employee creativity."
Hexawise includes an array of sample plans when a new user account is created. These provide concrete examples of how to categorize items when creating combinatorial test plans (also called pairwise test plans, orthogonal array-based test plans, etc.). Once you [sign in to your Hexawise account](http://hexawise.com/) (or set up a new, free account), looking at this [sample test plan](https://app.hexawise.com/share/HT3UG7M8) (which is similar to the situation raised in the question that follows) might be useful.
Within your Hexawise account you can copy the sample test plans that you are provided with and then make adjustments to them. This lets you quickly see what effects changes you make have on real test plans. And it also lets you see how easy it is to adjust as changes in priorities are made, or gaps are found in the existing test plan.
A Hexawise user sent us the following question.
What is the recommended approach to configuring a parameter with one or more values?
I have two parameters which are related.
If Parameter 1 = Yes, Parameter 2 allows the user to select one or more values out of a list of 25 - most of which are not equivalent.
For Parameter 2, is the recommended approach to handle this to create separate parameters each with a yes/no value? i.e. create one parameter for each non-equivalent value, and one parameter for the equivalent values. Then link each of these as a married pair to Parameter 1.
I'm open to suggestions as to alternatives.
Here's the screen in question. Parameter 1 = "Pilot", Parameter 2 = checkboxes for types of plans.
I would recommend that you use a separate parameter for each option (e.g., "Scheduled Commercial" as a parameter with "Selected, Not Selected" as its Values).
Also, I'd recommend following these 3 strategies to maximize the effectiveness of your tests.
First, consider using adjusted weightings. You may find it useful to weight certain values multiple times, e.g., have 4 values such as "Select, Do Not Select, Do Not Select, Do Not Select" to create 3 times as many tests with "Do Not Select" as "Select."
Second, use the MECE principle. The MECE principle states that you should define your Values in a way that makes each of them "Mutually Exclusive" from the others in the list (no value should overlap with or subsume any other; no overlaps) and "Collectively Exhaustive" as a group (the set of values, taken together, should fully encompass all the possibilities; no gaps).
Third, avoid "ands" in your value names. As a general rule it is unwise to define values like "Old and Male" or "Young and Female", etc. A better strategy is to break those ideas into two separate Parameters, like so:
First Parameter = "Age" --- Values for "Age" = Old / Young
Second Parameter = "Gender" --- Values for "Gender" = Male / Female
It's common to have a test plan where the possible values of one parameter depend on the value of another parameter. There are many options for how you can represent this scenario in Hexawise: some involve using value expansions (when there is equivalence) and others do not (when there is not equivalence).
Using Value Expansions in Hexawise
The general rule of thumb for value expansions is that they are for setting up equivalence classes. The key there being the equivalence. The expectations of the system should be the same for every value listed in that particular value expansion.
Let's consider a real world example involving a classification parameter with a value that is dependent on the value of a role parameter:
So if the Role parameter has a value of Student, then the Classification parameter must have a value of Freshman, Sophomore, Junior or Senior, but if the Role parameter has a value of Staff, then the Classification parameter must have a value of Adjunct, Assistant, Professor or Administrator.
Using value expansions in this case might be a good option. You could set up your inputs, value expansions, and value pairs this way:
Value Expansions
Student Classification: Freshman, Sophomore, Junior, Senior
Staff Classification: Adjunct, Assistant, Professor, Administrator
When Role=Student Always Classification=Student Classification
When Role=Staff Always Classification=Staff Classification
You would use this approach if there were no important differences in the business logic or expected behavior of the system when the different expansions of the value were used. If Freshman versus Sophomore is an important label for the users to be able to enter and see, but the system under test doesn't change its behavior based on which value is selected, then those expansions of the value are equivalent and don't need to be tested individually for how they might interact with other parts of the system and create bugs. If this equivalence scenario is true, then you will greatly simplify things for yourself and create fewer tests that are just as powerful by using value expansions.
In the scenario that would support using value expansions, the system might have different behavior for a Junior versus an Adjunct Professor, but not for a Freshman versus a Senior. A Freshman and a Senior are always equivalent in the system, so they can be combined in a value expansion.
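As a rough sketch of what that equivalence assumption buys you (the expansion table below is just an illustration, not Hexawise's internal implementation), any expansion can stand in for its equivalence-class value when a generated test is written up:

```python
import random

# Hypothetical expansion table mirroring the setup above.
value_expansions = {
    "Student Classification": ["Freshman", "Sophomore", "Junior", "Senior"],
    "Staff Classification": ["Adjunct", "Assistant", "Professor", "Administrator"],
}

def expand(test):
    """Swap each equivalence-class value for any one of its concrete expansions."""
    return {parameter: random.choice(value_expansions.get(value, [value]))
            for parameter, value in test.items()}

generated_test = {"Role": "Student", "Classification": "Student Classification"}
print(expand(generated_test))
# e.g. {'Role': 'Student', 'Classification': 'Junior'} -- which expansion appears
# doesn't matter, because by assumption the system treats them all the same.
```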
However, if the expectations are not the same, then a value expansion should not be used. For example, let's suppose this hypothetical system has business logic giving priority class scheduling to Seniors and only last available scheduling priority to Administrators. In this case, using value expansions as described above would probably be a mistake. Why? Because a Sophomore and a Senior aren't treated the same way by the system, yet Hexawise considers all the expansions of the Student Classification value as equivalent. As long as you've got a test that has paired a value expansion of the Student Classification value with the Overbooked value of the Class Status parameter, then Hexawise won't insist on pairing all the other value expansions for the Student Classification value with Class Status = Overbooked in other tests. You could therefore miss a bug that only occurs when a Senior signs up for an overbooked class.
"One to many" or "multi-valued" married pair model
If the system under test does not consider the values to be equivalent and has requirements and business logic to behave differently, then using value expansions to signal equivalency to Hexawise when there isn't equivalency is probably a mistake.
So what would you do in that case?
We've decided that it might be nice to be able to set up your inputs and value pairs like this:
When Role=Student Always Classification=Freshman, Sophomore, Junior, or Senior
When Role=Staff Always Classification=Adjunct, Assistant, Professor, or Administrator
Unfortunately, this kind of "one to many" or "multi-valued" value pair is something we've only recently realized would be very helpful, and it is something we have on the drawing board for Hexawise, but it is not a feature of Hexawise today. In the meantime, you could model the situation with three parameters (Role, Student Classification, and Staff Classification) and constraints like:
When Role=Student Always Staff Classification=N/A
When Role=Staff Always Student Classification=N/A
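Here is a minimal sketch of that three-parameter workaround in plain Python; the parameter names and the N/A handling follow the constraints above, and everything else is illustrative:

```python
from itertools import product

roles = ["Student", "Staff"]
student_classifications = ["Freshman", "Sophomore", "Junior", "Senior", "N/A"]
staff_classifications = ["Adjunct", "Assistant", "Professor", "Administrator", "N/A"]

def is_valid(role, student_cls, staff_cls):
    # When Role=Student, Staff Classification must be N/A (and vice versa).
    if role == "Student":
        return student_cls != "N/A" and staff_cls == "N/A"
    return student_cls == "N/A" and staff_cls != "N/A"

valid = [combo
         for combo in product(roles, student_classifications, staff_classifications)
         if is_valid(*combo)]
for combo in valid:
    print(combo)
# 8 valid combinations. In a model this small, exhaustive and pairwise coverage
# coincide; in a larger model, a generator like Hexawise would pick a much
# smaller covering subset of the valid combinations.
```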
Another modeling option to consider, if there is only special logic for Administrator and for Seniors, but the rest of the values we've been discussing are equivalent, is to use value expansions for just the equivalent values:
Inputs
Role: Student, Staff
Classification: Underclassman, Senior, Professor, Administrator
Value Expansions
Underclassman: Freshman, Sophomore, Junior
Professor: Adjunct, Assistant, Full
When Role=Student Never Classification=Professor
When Role=Student Never Classification=Administrator
When Role=Staff Never Classification=Underclassman
When Role=Staff Never Classification=Senior
I hope this helps you understand the role of value expansions in Hexawise, when to use them (in cases of equivalency) and when to avoid them, and how value pairs and value expansions can be used together to handle cases of dependent parameter values. Value expansions are a powerful tool to help you decrease the number of tests you need to execute, so take advantage of them, and if you have any questions, just let us know!