You know how it is. It’s already hard enough to squeeze everything you need into a sprint without trying to find an extra 10%–20% of engineering time to pay back technical debt. If you’ve ever argued for carving out time for this, you know that it can feel like a crusade of epic proportions.
But it can be done, and in this guide we’ll find out how to do it.
I've seen too many meetings waste too many hours on this very topic. Evidence is usually anecdotal and emotions can run high, with the opinion of the loudest person in the room prevailing.
The dilemma is this: if business pressures take over, your company risks taking on too much technical debt—engineers become demotivated, the company goes technically bankrupt, and your competitors win. If engineering pressures take over, the company risks taking on too little technical debt, so your competitors ship products and features faster, capture the market, and use that cash to pay back their technical debt later. Again, you lose.
Conventional wisdom says that engineering teams should build an intuitive sense for the codebase, for where its technical debt lies, and for the effects that debt will have on the company, and in doing so earn the organisation's trust. If your founding Chief Architect tells you to refactor core code right now, you (usually) just do it.
It makes sense to try to retain our engineers and to create a culture of knowledge, sharing, and trust. But it takes years of hard work, and we still come out the other end of a refactoring effort with virtually no idea of whether our time was well spent. Did we just save days of future engineering time? Or could we have waited a little longer to repay the technical debt and shipped a few more features instead? We'll never know for sure, and we’ll chalk it up to product development being more art than science.
Well, it's about time we injected more science into it.
What Site Reliability and tech debt budgets have in common
Properly managed, deliberately planned technical debt can be an invaluable tool. Like financial debt, we can use it for extra leverage when we’re aware of what we’re doing. But if we unknowingly take on too much—for example, without really understanding the terms of the deal (i.e. the impact on our codebase, customers, team, and business)—it can lead to our company's demise.
The best Site Reliability Engineering teams think about their site reliability budget in terms of managed technical debt. Site Reliability Engineering, a discipline popularised by Google, is responsible for keeping software products up and running. Interestingly, companies like Google don't aim for 100% uptime. That's because 99.99% uptime is enough for Google products to appear supremely reliable to real-world users. That last 0.01% is disproportionately harder to reach, and it simply isn't worth fighting for.
Consequently, if a 99.99% target allows roughly 52 minutes of downtime per year (0.01% of the 525,600 minutes in a year), Google will want to get as close to that as possible. Anything less than 52 minutes is a missed opportunity to take extra risks and deliver more ambitious features to customers faster.
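To make the arithmetic behind that figure concrete, here is a minimal sketch. The function names are ours, chosen for illustration, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_target: float) -> float:
    """Allowed downtime per year, in minutes, for a target such as 0.9999."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return (1.0 - availability_target) * MINUTES_PER_YEAR


def remaining_budget_minutes(availability_target: float, downtime_so_far: float) -> float:
    """Budget left this year; a negative value means the budget is overspent."""
    return downtime_budget_minutes(availability_target) - downtime_so_far


if __name__ == "__main__":
    # A 99.99% target leaves a little over 52 minutes per year.
    print(round(downtime_budget_minutes(0.9999), 2))
    # After 20 minutes of outages, this much room for risk remains.
    print(round(remaining_budget_minutes(0.9999, 20.0), 2))
```

A team that ends the year with most of this budget unspent has, by the logic above, played it safer than it needed to.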
Think of your technical debt budget like your site reliability budget. Provided the technical debt you're taking on is prudent, and you remain below the maximum amount of tech debt you can tolerate before it affects your customers and business, you should be taking on more of it so that you can take more risks and beat your competitors.
When your technical debt budget is in the red, pay some of that debt back. If it's in the green, you can afford to take more risks and take on more debt. Your goal is to constantly stay as close to your ideal amount of tech debt as you can. In other words, if you're at the peak of the red portion of the graph, the ideal tech debt budget is A ⇒ B. If you’re at the peak of the green portion, it's B ⇒ C. Just remember that A ⇒ C is too big a budget.
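The red/green rule can be sketched in a few lines. This assumes, purely for illustration, that your measured debt and your maximum tolerable debt are expressed as single numbers in the same (hypothetical) unit; the names `debt_headroom` and `recommend` are ours.

```python
from typing import Tuple


def debt_headroom(current_debt: float, max_tolerable_debt: float) -> float:
    """Positive: room to take on more debt. Negative: amount over the limit."""
    return max_tolerable_debt - current_debt


def recommend(current_debt: float, max_tolerable_debt: float) -> Tuple[str, float]:
    """Return the budget status and the size of the budget in debt units."""
    headroom = debt_headroom(current_debt, max_tolerable_debt)
    if headroom < 0:
        # In the red: the budget is the amount of debt to pay back.
        return ("red", abs(headroom))
    if headroom > 0:
        # In the green: the budget is the extra debt you can afford to take on.
        return ("green", headroom)
    return ("on target", 0.0)


if __name__ == "__main__":
    print(recommend(120.0, 100.0))  # over the limit: pay debt back
    print(recommend(70.0, 100.0))   # under the limit: room for more risk
```

Either way, the budget only ever moves you towards the limit, never across the whole range in one go.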
Because technical debt can now be measured (a subject we wrote about in another article), this is no longer just conceptual; it's fully practical.
How to make the most of your tech debt budget
You should aim for a tech debt budget that brings you back down, or up, to the maximum amount of technical debt you can tolerate. To define that budget, identify the areas of your codebase where tech debt is worth paying back immediately, i.e., debt that will prevent your company from reaching its current objectives. You don't want to pay back too little debt, but you don’t want to pay back too much either.
Andreas Klinger, Head of Remote at AngelList, puts it well in his article ‘Refactoring larger Legacy Codebases’:
‘Not everything needs refactoring. If it’s not critical, or nobody needs to improve its functionality in the next months, or it’s just too complicated, consider acknowledging it as tech debt.’
Put simply, your goal is to identify the intersection of things you'll work on this sprint, month, or quarter, and the parts of your codebase that have tech debt. Pay off the debt in that intersection, but not outside of it.
And that's where the science complements the art. You can use data to identify areas where you need to pay back tech debt soon:
- Identify files in your codebase with weak ownership, because code ownership is a leading indicator of your codebase's health. You can discover more on this in our article ‘The one cultural characteristic you need for a healthy codebase’.
- Measure cohesion and coupling for these files and prune your list to a set of files with weak ownership, low cohesion, and high coupling. You can find out more about each of these metrics in our article ‘Use research from industry leaders to measure technical debt’.
- Calculate churn for each of these files to identify the subset of problem files. As Microsoft Research has shown, ‘active files constitute only between 2%-8% of the total system size, contribute 20%-40% of system file changes, and are responsible for 60%-90% of all bugs’.
- Compare these files with your roadmap for the quarter. Will any of the features listed on your roadmap require engineers to work on the subset of problem files you've identified? If so, target these files for refactoring, estimate the work required, and assign it to the engineers—who should be the files’ owners. Bake this job into your plans.
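The four steps above can be sketched as a simple filter pipeline. Everything here is an illustrative assumption rather than a prescribed method: the `FileMetrics` fields, the 0 to 1 scales, and the default thresholds are hypothetical, and you would feed in whatever ownership, cohesion, coupling, and churn numbers your own tooling produces.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class FileMetrics:
    path: str
    ownership: float  # assumed scale 0..1, higher means stronger ownership
    cohesion: float   # assumed scale 0..1, higher is better
    coupling: float   # assumed scale 0..1, higher is worse
    churn: int        # assumed: commits touching the file in the period


def refactoring_targets(
    files: Iterable[FileMetrics],
    roadmap_paths: Set[str],
    *,
    min_ownership: float = 0.5,  # hypothetical thresholds, tune to your codebase
    min_cohesion: float = 0.5,
    max_coupling: float = 0.5,
    min_churn: int = 10,
) -> List[str]:
    """Return paths of problem files that the roadmap will also touch."""
    # Step 1: files with weak ownership.
    weak = [f for f in files if f.ownership < min_ownership]
    # Step 2: of those, keep files with low cohesion and high coupling.
    unhealthy = [f for f in weak if f.cohesion < min_cohesion and f.coupling > max_coupling]
    # Step 3: of those, keep the actively changing files.
    active = [f for f in unhealthy if f.churn >= min_churn]
    # Step 4: intersect with the files the roadmap requires work on.
    return sorted(f.path for f in active if f.path in roadmap_paths)


if __name__ == "__main__":
    sample = [
        FileMetrics("billing/invoice.py", 0.3, 0.2, 0.8, 42),
        FileMetrics("auth/session.py", 0.9, 0.2, 0.8, 50),   # strong ownership
        FileMetrics("search/index.py", 0.2, 0.3, 0.9, 2),    # rarely changes
        FileMetrics("legacy/report.py", 0.1, 0.1, 0.9, 30),  # not on the roadmap
    ]
    roadmap = {"billing/invoice.py", "auth/session.py", "search/index.py"}
    print(refactoring_targets(sample, roadmap))
```

In this made-up sample only `billing/invoice.py` survives all four filters, so it is the one file worth baking into the quarter's plans; `legacy/report.py` is just as unhealthy, but nobody needs to touch it yet.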
Get into a long-term relationship with tech debt
We’ve implemented this data-driven approach at Stepsize, as well as with many world-class software companies. Not only is the topic of technical debt a lot easier to broach now, but we also know how much debt we're willing to take on, and when and how to pay it back. We rarely wonder whether we’ve made the right trade-off between new features and tech debt. We've removed a big chunk of guesswork and a lot of the fear and anxiety that went with it.
To be clear, this is no silver bullet to use once a year and forget about. You need to get intimate with your tech debt. Track your progress on all the metrics each sprint, and keep improving the whole process to reach technical wealth.