I did something dumb yesterday: I got into an argument with someone on Twitter. I usually try to stay away from that; it never ends well. The other person usually walks away thinking I'm dumb, and I leave not thinking much better of them. This time, the argument started with the claim that software development teams doing estimation with story points and task hours have 250% better quality.

Martin is referencing the information presented in a white paper, ImpactofAgileQuantified2015, where quality is defined in terms of the number of bugs found downstream. The white paper was behind an email wall, and I balked at the claim all over Twitter without having read it first. Skin in the game is important. I registered, downloaded my copy, and will talk about their working definition of quality with some guidance from material I have read on measurement, as well as the work of Dr. Kaner.

Reliability and Validity

Meaningful measures have what we call reliability and validity. A measure is reliable to the extent that it can be performed many times by different people, and each person will get the same result. There are three types of validity; I am concerned with construct validity right now. Construct validity is the extent to which a measure (example: 250 downstream bugs reported over a period of 1 week) corresponds to a theory or idea like quality.

Any kind of bug count as a measure of quality, including downstream bugs found per increment of time, has problems of both reliability and validity.

Reliability Example

Let's say that I release a software product to the market, customers use it for a period of two weeks, and over those two weeks 45 bugs are logged into a tracking system by the people using the product. Members of the development staff review those bugs over a few days and categorize 6 of them as feature requests, 3 as unable to reproduce, and 4 as not a bug. The remaining 32 issues are categorized by priority (how important it is to fix them) and severity (how bad the failure is).

The varying reports in this scenario point to unreliability. The customer(s) feel that they experienced 45 different problems while using the product. The development group, after reviewing those issues, feels that only 32 of them are bugs. Which party is right? There are also a few hidden questions to consider here:

  1. Did the customer experience bugs and not report them?
  2. Did every bug reported get documented and tracked? Maybe some got lost in email threads.
  3. If a different customer found more bugs over the same period of time, does that mean product quality is worse?
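To make the arithmetic of that triage concrete, here is a minimal sketch in Python. The labels and counts mirror the scenario above and are hypothetical; nothing here comes from the white paper.

```python
# A minimal sketch of how the "bug count" for the same two-week period
# depends entirely on triage decisions, not on what customers experienced.
from collections import Counter

# 45 raw customer reports, labeled after the development team's review.
triage = (
    ["bug"] * 32
    + ["feature request"] * 6
    + ["cannot reproduce"] * 3
    + ["not a bug"] * 4
)

counts = Counter(triage)
print("Reports filed by customers:", len(triage))    # 45
print("Bugs according to the team:", counts["bug"])  # 32
```

Both numbers describe the same two weeks of use; which one is "the" bug count depends entirely on whose categorization you accept.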

Validity Example

Assume we have a product that is being used by 10 people over a period of two weeks. During that time, those 10 people report 20 bugs. In another experiment, that same product is used by 1,000 people for two weeks, and that group reports 200 bugs. If the first group that used the product was happy and wanted to continue being paying customers despite the fact that the second group found 200 bugs, does the product have good quality or bad quality?
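To show how the same two experiments can be read in opposite directions, here is a small sketch using the hypothetical numbers from the example above.

```python
# Hypothetical numbers from the example above: same product, two groups.
groups = [
    {"name": "group 1", "users": 10,   "bugs_reported": 20},
    {"name": "group 2", "users": 1000, "bugs_reported": 200},
]

for g in groups:
    rate = g["bugs_reported"] / g["users"]
    print(f"{g['name']}: {g['bugs_reported']} bugs total, "
          f"{rate:.1f} bugs per user over two weeks")

# group 1: 20 bugs total, 2.0 bugs per user over two weeks
# group 2: 200 bugs total, 0.2 bugs per user over two weeks
```

The raw count says quality got ten times worse, the per-user rate says the opposite, and neither number says anything about whether group 1 still wants to pay for the product.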

Quality and Bugs Are Social

This points to the idea that quality is a social construct: it is a judgement people make based on their personal value system. A product can be valuable for one person and terrible for another at the same time. The concept of a bug is also social, and there is no standard definition for what it means. If you have ever taken part in a bug triage meeting where reported issues are routinely re-categorized as features, or as working as designed, then you have experienced this firsthand. Bugs are also not equal things. A server crash is not the same as a form failing to submit because of a special character, and neither of those is the same as a performance problem. Counting bugs and converting that number into quality usually pretends that bugs are consistent things. If they were, we wouldn't need descriptors like severity and priority to help the business decide what to fix and what to leave be.
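Here is a tiny illustration of that pretense, with invented bug lists; the point is only that equal counts can hide very unequal failures.

```python
# Two hypothetical releases with the same "bug count" but very different
# consequences for the people using the product.
release_a = ["server crash on login", "data loss during save", "checkout broken"]
release_b = ["typo in tooltip", "icon slightly misaligned", "report export is slow"]

print(len(release_a), len(release_b))  # 3 3 -- the counts are identical
# A quality metric built on these counts treats the two releases as equivalent,
# which is exactly why severity and priority exist as separate descriptors.
```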

Numbers are intoxicating. I can look at them and they seem to tell a clear, simple story. There isn't one story there, though; there are several. And when I look at data, I am creating a story around what I see. That story may or may not represent a reality for the people using the product. I think it was Edwin Boring who said measurements are used to construct a reality for people who were not there to observe it. Whose reality are you constructing?

Inquiry or Control

I get that managers, directors, and C-level people need to understand how a project is going. They cannot be there to understand the reality of building software, because they are busy tending to other aspects of the business. Sometimes measurement is how they get that feeling. I think using measurement as a place to begin asking questions is healthy. If a team I am working on puts a new product version into production for two weeks and gets 5 bug reports, and then two weeks later releases another new version and gets 100 bug reports, the right thing to do is ask, "What's going on here?" The change is probably a signal that something is happening with the development team, or something is happening with the customer, or maybe both. But if we don't start with a question, we'll never know.
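As a sketch of what "measurement as a place to begin asking questions" might look like, here is a small example; the threshold and the numbers are made up, not a recommended rule.

```python
# Hypothetical release data: downstream bug reports per two-week period.
previous_release_bugs = 5
current_release_bugs = 100

# The measurement is a prompt for inquiry, not a verdict on quality.
if current_release_bugs > 3 * previous_release_bugs:
    print(f"Bug reports went from {previous_release_bugs} to {current_release_bugs}."
          " What's going on here?")
    # Questions to take to the team and the customers, not conclusions:
    # - Did something change in how the team built or tested this release?
    # - Did the customer base or their usage change?
    # - Did the way bugs get reported or triaged change?
```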

So, back to the moral of the story: it is not possible to measure a 250% improvement in quality based on a reduction in downstream bugs. It probably isn't possible to measure a 250% quality improvement, period. If there is some way, I suspect it would be a very complicated (and nuanced) anthropology experiment that would be too expensive for any company to bother performing.

Here are some of the books I have read (incompletely) that shaped the way I think about measurement.

(image: measurement books)