Some agile practitioners interpret Goodhart's law as saying formalization of metrics is bad. They're wrong.

Have you ever seen a salesperson boast about meeting their revenue targets without caring about whether the sale was profitable?

Have your kids proudly shown you an A on their geometry test while ignoring the fact that they haven't turned their homework in for a month?

Has a delivery team bragged about their agile velocity leading the enterprise but failed to mention that their defect backlog just keeps growing?

These behaviors are manifestations of Goodhart's law — "when a measure becomes a target, it ceases to be a good measure."

When people believe they are judged or compensated by a singular metric, they try to maximize their performance against that metric — even at the expense of other factors that are critical for holistic success.

No Metric Is an Island

While Goodhart was correct, some practitioners in the agile movement have interpreted the law as saying that the formalization of metrics inside an enterprise technology shop is bad. This interpretation is unfortunate because of one inconvenient truth: Size matters.

Allow me to demonstrate.

Size matters for many metrics in agile product development shops, including throughput, test coverage and A/B experiment sample size. When trying to apply the lessons of Goodhart's law to your business, the key words to consider are "a metric." Not "metrics" but "a metric."

The lack of a plural is critical here. When you are trying to drive exceptional performance out of a business, a system, a platform or an organization, any individual metric you use needs to be measured within the context of others in order to have any value.

All metrics have complementary hedging metrics that they can be paired with to avoid both weaponization from outside the system and gamification from within. 

Consistently presenting complementary metrics in context is what prevents an individual metric from becoming a target, because the complementary metrics provide "hedges" against each other. For example, profit can be hedged against profit margin and/or growth rate to allow for a more nuanced view of the actual value of the profit.
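As a toy illustration, a few lines of Python (all figures are hypothetical) show how a profit number that looks healthy on its own reads very differently once its hedge metrics are presented alongside it:

    # Toy example: the same "profit is up" headline reads differently once
    # margin is shown alongside it. All numbers are hypothetical and exist
    # only to illustrate hedging one metric with another.
    quarters = [
        # (quarter, revenue, profit)
        ("Q1", 1_000_000, 150_000),
        ("Q2", 1_400_000, 160_000),  # profit "grew" quarter over quarter
    ]

    for (q_prev, rev_prev, profit_prev), (q_curr, rev_curr, profit_curr) in zip(quarters, quarters[1:]):
        profit_growth = (profit_curr - profit_prev) / profit_prev
        margin_prev = profit_prev / rev_prev
        margin_curr = profit_curr / rev_curr
        print(f"{q_curr}: profit {profit_curr:,} (+{profit_growth:.1%} vs. {q_prev})")
        print(f"    margin {margin_curr:.1%} (was {margin_prev:.1%})")

    # Output: profit rose roughly 6.7 percent, but margin fell from 15.0 percent
    # to about 11.4 percent -- the hedge exposes what the headline number hides.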

Although profit vs. margin vs. growth rate is a simple example that most people are familiar with, many still struggle to turn this concept into reality within their own complex environments. Real product engineering environments have metrics that are not as easily understood, much less applied, as profit and profit margin.

Metrics Are Your Friends 

With these complexity and applicability problems in mind, let's go through three examples, each with a different challenge.

Throughput/Velocity

Velocity is a key metric for agile development teams, but many organizations hesitate to formalize its tracking across a portfolio or enterprise. This is due to the high degree of subjectivity that individual teams can introduce (i.e., gamification) and the possibility for comparison of individuals and teams (i.e., weaponization). 

Just because the metric may be misused doesn't justify ignoring it. Here's why, with some examples of complementary metrics to reduce any perceived risks:

Value — Higher levels of velocity within a product engineering shop indicate smaller cycle times, smaller batch sizes, faster time to market and, ultimately, greater organizational agility to respond to feedback from the marketplace.

Complementary Metrics for Presentation (a minimal pairing sketch follows this list):

  • Defect Escape Rate — Higher velocity with a high rate of defects per release indicates a lack of controls within the development process, which allows for an artificially high velocity. When both velocity and defect escape rate are high, a leadership team may choose to refine the definition of done for a team or raise the priority of automated testing scope to hedge against the imbalance.
  • Function Points — Function points measured by an objective backfiring tool help normalize anomalous velocity numbers that come from differences in estimation processes and assumptions. When velocity is high and function points are anomalously low, it could mean that teams are sandbagging estimates. While this outcome can occur without malicious behavior, it's still worth watching, if only to be able to reward teams for hitting a near-impossible target.
  • Defect MTTR — Mean time to resolution (MTTR) for defects is a proxy metric for gauging the health of a product engineering team. A high velocity paired with an elongated timeframe for defect resolution indicates either a lack of balance within an agile team or a system/platform that has grown too complex via tight coupling or some other architectural fault.
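To make this concrete, here is a minimal Python sketch of how velocity might be paired with its hedges before anyone reports it on its own. The team data, field names and thresholds are hypothetical assumptions for illustration, not recommended values:

    from dataclasses import dataclass

    @dataclass
    class SprintMetrics:
        team: str
        velocity: float            # story points completed per sprint
        defect_escape_rate: float  # defects reaching production per release
        function_points: float     # from an objective backfiring tool
        defect_mttr_days: float    # mean time to resolve a defect

    def velocity_flags(m: SprintMetrics) -> list[str]:
        """Return the anomalies that should accompany any presentation of velocity."""
        flags = []
        if m.velocity > 60 and m.defect_escape_rate > 5:
            flags.append("high velocity + high escape rate: revisit the definition of done")
        if m.velocity > 60 and m.function_points < 20:
            flags.append("high velocity + low function points: estimates may be sandbagged")
        if m.velocity > 60 and m.defect_mttr_days > 10:
            flags.append("high velocity + slow defect resolution: check team balance and coupling")
        return flags

    sample = SprintMetrics("checkout", velocity=72, defect_escape_rate=9,
                           function_points=35, defect_mttr_days=14)
    for flag in velocity_flags(sample):
        print(f"[{sample.team}] {flag}")

The specific thresholds matter less than the shape: velocity is only ever evaluated together with the hedges that keep it honest.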

Automated QA Test Coverage 

There's an ongoing debate amongst practitioners about the value of automated QA test coverage as a formalized metric. The back and forth on this topic is a distraction from what matters. 

Achieving 80 percent code coverage in testing suites is a reasonable goal, and QA coverage metrics should only ever be presented alongside complementary metrics that provide appropriate context.

Value — Higher levels of coverage within a system/platform simultaneously drive smaller cycle times and higher levels of quality in changes and new features. The promises of a DevOps transformation cannot be achieved without high levels of automated test coverage, because lower test coverage numbers must — by definition — mean either longer QA cycle times or forgoing testing. Either of these options is unacceptable for any moderately complex system or platform.

Complementary Metrics for Presentation (a minimal pairing sketch follows this list):

  • Defect Escape Rate — High coverage percentages with a high rate of defects per release are an anomaly worth investigating. Unless other systems or processes are introducing the possibility for defects (e.g., poor environmental controls, changes made but not tested, etc.), it likely means that the measurement tools are not functioning as intended or that a team has figured out how to artificially inflate its test coverage number.
  • Velocity/Cycle Time — High coverage percentages paired with a lower than average velocity metric could mean that teams have significant differences in their estimate assumptions or that a team has been unable to break down epics and stories to a small enough level. While achieving high velocity with low coverage is possible, it's usually associated with a higher defect escape rate than is desirable, as well as a longer MTTR.
  • Defect MTTR — High coverage percentages paired with an elongated timeframe for defect resolution indicate either a lack of balance within an agile team (e.g., prioritization being overly skewed towards new feature/functionality, falling short at decomposing work into small enough chunks) or a system that has grown too complex.
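A similar sketch works for coverage. The 80 percent target comes from above; the escape-rate, cycle-time and MTTR thresholds are illustrative assumptions only:

    COVERAGE_TARGET = 0.80  # the "reasonable goal" discussed above

    def coverage_report(coverage: float, defect_escape_rate: float,
                        cycle_time_days: float, defect_mttr_days: float) -> str:
        """Summarize coverage together with its hedge metrics."""
        lines = [f"coverage {coverage:.0%} (target {COVERAGE_TARGET:.0%})"]
        if coverage >= COVERAGE_TARGET and defect_escape_rate > 5:
            lines.append("high coverage but defects keep escaping: check environmental "
                         "controls or whether the coverage number is being gamed")
        if coverage >= COVERAGE_TARGET and cycle_time_days > 15:
            lines.append("high coverage but slow cycle time: look at estimation "
                         "assumptions and story decomposition")
        if coverage >= COVERAGE_TARGET and defect_mttr_days > 10:
            lines.append("high coverage but slow defect resolution: team balance or "
                         "system complexity may be the issue")
        return "\n".join(lines)

    print(coverage_report(coverage=0.85, defect_escape_rate=8,
                          cycle_time_days=20, defect_mttr_days=12))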

Metrics Can Provide a Means to Discovery

Sample Size

Unlike velocity or test coverage, sample size is not really a trendable or reportable metric. Instead, it acts as a contextual constraint that establishes the significance of other metrics.

Agile enterprises champion cultures of experimentation. Within this context, A/B testing is usually an indispensable tactic because of the evidence-based objectivity it provides. 

Noting this method's prevalence, vendors now include tools and suites to help shops manage this process and ultimately lower the barriers to experimentation.

What the software vendors often fail to mention during their pitch is that sample size limitations make whole classes of proposed tests non-starters. The time it takes to get statistically valid insights from a test is inversely related to the volume of traffic available to sample (e.g., an experiment at Google can yield statistically valid results in an hour, while the same experiment at Joe's Search Shack could run for years without ever reaching statistical validity).
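To make that relationship concrete, here is a rough Python sketch. The 2 percent baseline conversion rate, the 10 percent relative lift and the daily traffic figures are hypothetical; the sample-size calculation is the standard two-proportion approximation:

    from statistics import NormalDist

    def required_visitors_per_variant(p_base: float, p_variant: float,
                                      alpha: float = 0.05, power: float = 0.80) -> int:
        """Approximate visitors needed per variant for a two-sided test."""
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_beta = NormalDist().inv_cdf(power)
        p_bar = (p_base + p_variant) / 2
        numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                     + z_beta * (p_base * (1 - p_base) + p_variant * (1 - p_variant)) ** 0.5) ** 2
        return int(numerator / (p_variant - p_base) ** 2) + 1

    # Detecting a lift from a 2.0% to a 2.2% conversion rate (10% relative).
    n = required_visitors_per_variant(p_base=0.020, p_variant=0.022)

    for site, daily_visitors in [("large search engine", 5_000_000), ("Joe's Search Shack", 400)]:
        days = (2 * n) / daily_visitors  # both variants need to be filled
        if days < 1:
            print(f"{site}: ~{n:,} visitors per variant, about {days * 24:.0f} hour(s) of traffic")
        else:
            print(f"{site}: ~{n:,} visitors per variant, about {days:,.0f} days of traffic")

With these assumptions the large site finishes in about an hour, while Joe's Search Shack needs more than a year of every visitor flowing through the test.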

Assuming your site has less volume than Google but more than Joe's Search Shack, take a look at the control points in your experiment to inform both its design and your expectations for it.

For example, an experiment on an ecommerce site aimed at identifying interface changes for higher conversion rates has to account for the variability introduced by the different products being sold, the seasonality of buying trends, the geo-locations of users and a host of other things (i.e., "Did the red button raise our sales, or was it the positive economic news combined with the start of the holiday shopping season and the increased demand for media streaming devices compared to lower demand for buggy whips?").

Furthermore, a test between two alternatives can yield results, but you won't be able to determine what influenced the result unless the differences between the contexts are very small. As each of these aspects makes your comparable sets smaller and smaller, your experiment has to run for months on end to deliver insights with any statistical validity.

Flying Blind Helps No One

Two key points to remember: First, an individual metric means very little (if anything) without understanding its context. In fact, a singular metric with no context could reasonably be considered worse than no metric at all, as it makes you liable to act on false premises.

And second, having formalized metrics is better than having none. Flying blind not only lowers contextual understanding for those removed from the day-to-day work, it also lowers trust and erodes craft-based cultures by removing the possibility of organizational recognition (e.g., an inability to see trends within an overall portfolio, an inability to drive rewards to high achievers, an inability to drive help to underserved teams, etc.).

In other words, while the gamification and weaponization of metrics may be bad, blinding oneself is far, far worse.