A few years ago I was working at a software company that was releasing a new product for the iPad. Our initial release of the product went to a small customer and was used lightly by a couple of people. They were happy, and the product was doing most of what it should. We considered the release a success, and rapidly moved on to larger customers with more users. More customers using the product in a wider variety of scenarios and heavier load on our servers led to some issues being found in production.
Customers experienced data loss, performance issues, and seemingly unexplainable errors on the iPad.
My manager started collecting some numbers: bugs documented per tester, bugs reported per feature, bugs reported by customer, defect removal efficiency (DRE), and so on. He thought this would help him understand product quality, but the main result was making management think that the testing group was missing important bugs and not keeping up with the pace of development. These measurements didn’t end up helping to improve the product, or my team’s reputation. What could we have done differently?
Single Use Measurement as a Question
Most of the measurement programs I have experience with are attempts to capture every data point possible. The leadership at companies using large measurement programs was generally busy with things that kept them away from the development team, such as customer site visits, planning meetings, resource allocation, and so on. Despite being detached from the process of making software, these managers still needed information to guide the team and to report to their supervisors. Every week, the test team would update an Excel spreadsheet containing all manner of measurements. Every once in a while, after reading the latest “best practices” article about measuring test team performance, a few more measures would be added to the sheet. The result, of course, was a large data set. It was nearly impossible to understand any piece of data without an hour-long conversation describing the measurement and current team context.
My preference today is a much simpler approach involving one measurement at a time, and throwing away past data when it isn’t useful anymore.
A good manager or test lead will have a gut feeling about specific things that can be improved on a team because they are there to observe the work. Customers reporting large numbers of problems in production is an easy problem to spot, so we will use this as an example. Let’s say you are working on a new product, and every release it feels like more and more bugs are being reported by customers over email, in phone calls to the implementations people, in the bug tracker, and in frustrated late night calls to the CEO. We want to find out the truth, so we start collecting data: number of bugs reported each day, severity, who reported them, what part of the product the problem was associated with, and what browser the customer was using at the time.
We can learn some very important things from this data. The majority of these issues were being reported by a new customer who was going through an implementation plagued by configuration problems. Several of the reports gave us nothing to act on: they said things like ‘shopping cart broke’, ‘page doesn’t load’, and ‘I can’t log in’. We discovered one interesting category of problems: our customers were using a version of Internet Explorer that we were not testing on. Our testers and developers had relatively new computers and were testing in Chrome, Firefox, and Internet Explorer 9 and up. This new customer, like many large companies, supported several very old software platforms internally that would only run consistently on Internet Explorer 8 or lower. They were unable to upgrade even if they wanted to.
Once this implementation was complete, the number of reported issues dropped off significantly. The test team added a virtual machine with older browser versions so that we could add that to our strategy. Having one measure to focus on allowed us to pinpoint a couple of real problems and get them fixed. When the reported numbers started dropping off, we stopped collecting the data and moved on to the next initiative.
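A single-purpose tally like the one in this example does not need a spreadsheet full of charts. Here is a minimal Python sketch of the idea. The records, field names, and customer labels are all hypothetical and stand in for whatever your bug tracker exports.

```python
from collections import Counter

# Hypothetical bug reports; every name and value here is made up for illustration.
reports = [
    {"customer": "Customer A", "area": "checkout", "browser": "IE 8"},
    {"customer": "Customer A", "area": "login", "browser": "IE 8"},
    {"customer": "Customer A", "area": "checkout", "browser": "IE 8"},
    {"customer": "Customer A", "area": "search", "browser": "IE 9"},
    {"customer": "Customer B", "area": "login", "browser": "Chrome"},
    {"customer": "Customer C", "area": "search", "browser": "Firefox"},
]


def tally(records, field):
    """Count reports per distinct value of `field`, most frequent first."""
    return Counter(record[field] for record in records).most_common()


# Look at one question at a time: who is reporting, and from which browser?
for field in ("customer", "browser"):
    print(field, tally(reports, field))
```

With this made-up data, one customer and one browser stand out immediately, which is the cue to go and ask what is different about that implementation. Once the question is answered, the script and the data can be thrown away.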
Good measurement is tricky. Say you collect all the data you can find, stick it in an Excel sheet, create charts and diagrams, and have weekly meetings to talk about what the graphs say. That approach is a bit like a fortune teller trying to read tea leaves to predict the future. Without an understanding of the problem, people will tend to invent stories to explain the measurements, which can actually create more dysfunction. With some understanding of the problem, you can select a very small number of measures (one is a good number) to collect while you make some changes and observe behavior. Assuming you can actually measure the right thing, that is.
Another option is to not measure anything at all.
Here is a common scenario. You are a tester working with a development group that is striving for increased agility. You are releasing every two weeks. Usually around the second day of the second week, there is a big delivery of new code changes on the test system. Testers test until they run out of time, and developers spend the remaining time in that week fixing any reported bugs and getting prepared for the next release cycle. Any automation work that happens is done on a one-week lag while development is making new changes from the sprint. The problem is that testing lags behind development and then has to scramble at the end of the second week to get everything done.
A measurement-based approach would start with figuring out which data points to collect, probably the burndown, average time from start to development done on a feature, and bugs reported per feature. This of course takes time away from staff who are already busy.
The thing is, you already know the problem.
Personally, I would consider an alternative: Start with an experiment. There are, after all, a wide variety of things we can do immediately to improve the situation. The first thing I ask in this situation is “why can’t we test earlier?” That might mean doing TDD, or pairing testers with developers through the development process, or teaching testers to look at testing problems without a fully formed user interface. There are plenty of options.
Picking any one of these potential solutions, trying it for a sprint or two, and then doing a debrief will tell you more than a measurement program. Teams can move quickly from identifying a problem to making real improvements. This is a tool for self-directed and empowered teams. In some cases it might make sense to measure something, such as the day of the sprint that code is first delivered to test, and then to see whether the policy change improves the measure. Once the experiment is done, you can find a new hypothesis and come up with an appropriate measure. This approach was formalized at the University of Maryland as Goal Question Metric, or GQM.
Question, Not Control
The phrase ‘be careful what you wish for, you just might get it’ always comes to mind when I talk about measurement and metrics. Measuring the number of bugs a person documents during a release cycle is a slippery slope. Managers often jump to using this as a way to judge performance. The people that document more bugs are obviously better testers, right?
You can laugh a bit here; the article will wait.
Testers who encounter this management philosophy do the obvious thing: they race to see who can document the most bugs, whether or not those bugs are actually useful. This is what measurement turns into when it is used to control people.
For the sake of argument, let’s assume more measurement is a management directive. In that case you will be using metrics; you just want them to be as cheap and valuable as possible. Instead of pursuing a goal of management by spreadsheet, without human intervention, you can turn the concept around and see the numbers as a place to start asking questions. A significant spike or dip in any number can start an interesting conversation. Why did the number of bugs reported go up by 30% in this release? Why did you deliver 40% fewer features in the last two weeks? Why are you seeing so many more issues from production today?
The answer to any one of these questions will probably make sense once you talk about what is happening in the project.
Sometimes, that answer might even lead to an idea that makes the next release better.
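If you want a cheap way to find those conversation starters, a few lines of Python will do. This is only a sketch under stated assumptions: the release names, bug counts, and the 30% threshold are hypothetical, and the output is a list of questions to ask, not a verdict on anyone’s performance.

```python
def flag_changes(counts, threshold=0.30):
    """Return (release, previous, current, percent_change) for each
    release whose count moved by at least `threshold` versus the
    release before it. `counts` is an ordered list of (release, count)."""
    flagged = []
    for (_, previous), (release, current) in zip(counts, counts[1:]):
        if previous == 0:
            continue  # percent change is undefined from a zero baseline
        change = (current - previous) / previous
        if abs(change) >= threshold:
            flagged.append((release, previous, current, round(change * 100)))
    return flagged


# Hypothetical bugs reported per release, oldest first.
bugs_per_release = [("1.0", 40), ("1.1", 42), ("1.2", 56), ("1.3", 30)]

for release, previous, current, pct in flag_changes(bugs_per_release):
    print(f"Release {release}: {previous} -> {current} ({pct:+d}%). What changed?")
```

The threshold is deliberately crude. Its only job is to point at a release worth talking about; the explanation still has to come from the people on the project.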
This is a guest posting by Justin Rohrman. Justin has been a professional software tester in various capacities since 2005. In his current role, Justin is a consulting software tester and writer working with Excelon Development. Outside of work, he is currently serving on the Association For Software Testing Board of Directors as President, helping to facilitate and develop various projects.