The probability that you will change a piece of code in the near future increases when you make changes to that code or to code in its vicinity.
Recently, we were about to make changes to parts of a system that hadn’t been changed in about ten years. When we discussed how to approach the change, someone argued that we probably wouldn’t change that code again in the foreseeable future. So why should we bother to refactor it and make sure it was testable? The cost of refactoring that code was deemed extremely high, an investment that would not pay off any time soon.
We opted for the quick fix and changed as little of the code as possible. The reasons for not touching the code more than necessary seemed overwhelming.
Well, over the following weeks we found ourselves making changes to that code over and over again. While testing that part of the code, we found that we needed a few more changes after all. One change led to another, and it was as if we’d brought that code back to life and now had to take care of it. Since we never invested much in code quality at any stage, telling ourselves each time that “this is the last change in ten years, promise”, the cost of each change was very high.
I don’t think this is an uncommon phenomenon. My theory is that once you make a change to a file, the probability that you’ll revisit that file increases. Whenever you make changes to a piece of code, you start testing it. When you start testing it, you might find things that need to be fixed. When your customers see your new functionality, they might come up with new suggestions. All this leads to new changes to the code, and the process starts all over again.
Finding Some Data
To test this theory, I created a small Ruby application that reads a Subversion log. For each path in the log, it records the dates on which that path was changed. It then computes the time between each pair of consecutive changes to each file and counts how often the different intervals occur. The counts go into a histogram in which each bar represents a week. If the time between two consecutive changes to a file is two and a half weeks, the bin that represents two-week intervals is increased by one. Files that only occur once in the log are discarded, as they have only been added to the repository but never worked on (this might be an oversight on my part, but I don’t think so).
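The binning described above can be sketched roughly as follows. This is not the original Ruby application; it is a minimal Python sketch that assumes the log has already been parsed into (path, date) pairs. The function name `interval_histogram`, the `OVERFLOW_WEEK` constant, and the sample paths are all made up for illustration, and lumping intervals of a year or more into one final bin is my reading of how the far-right bar in the histogram is built.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Assumption: intervals of a year or more all land in one final bin.
OVERFLOW_WEEK = 52


def interval_histogram(changes):
    """Count intervals between consecutive changes to each file, in whole weeks.

    `changes` is an iterable of (path, datetime) pairs, one pair per path
    touched by each log entry.
    """
    dates_by_path = defaultdict(list)
    for path, changed_at in changes:
        dates_by_path[path].append(changed_at)

    histogram = Counter()
    for dates in dates_by_path.values():
        if len(dates) < 2:
            continue  # added to the repository but never changed again
        dates.sort()
        for earlier, later in zip(dates, dates[1:]):
            weeks = (later - earlier).days // 7  # 2.5 weeks falls in bin 2
            histogram[min(weeks, OVERFLOW_WEEK)] += 1
    return histogram


# Hypothetical log entries, only to show the shape of the input.
log = [
    ("src/a.rb", datetime(2010, 1, 1)),
    ("src/a.rb", datetime(2010, 1, 4)),   # 3 days later, counted in bin 0
    ("src/a.rb", datetime(2010, 1, 22)),  # 18 days later, counted in bin 2
    ("src/b.rb", datetime(2010, 1, 1)),   # single change, discarded
]
histogram = interval_histogram(log)
```

Keeping the bins as plain week numbers makes it easy to hand the result to any plotting tool afterwards.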
Armed with the histogram, we can plot the probability density of the time interval between changes to a file.
When I ran this against our Subversion repository as well as two open source repositories (GCC and Python), the resulting histograms showed similar characteristics (the histogram below shows the results for GCC). The bar on the far left shows the number of changes that were followed by another change to the same file within a week. The bar on the far right represents the sum of all changes with over a year between them. Because this bar covers an interval much wider than a week, it should probably be drawn as a lower and wider bar.

Histogram for gcc svn log
The data from these repositories shows a 50% chance that a file will be revisited within four to five weeks, and around a 30% chance that it will be revisited within the same week.
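Percentages like these can be read off the histogram by summing the bins up to a given week and dividing by the total. The sketch below shows one way to do that; the function name `revisit_probability` is hypothetical, and the counts in `example` are invented to keep the arithmetic easy to follow, not taken from any of the repositories. Note that because files with a single change are discarded, the result is the share of observed intervals, not of all files.

```python
def revisit_probability(histogram, within_weeks):
    """Fraction of observed intervals shorter than `within_weeks` weeks.

    `histogram` maps a week number (0 = same week) to a count of intervals.
    """
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    hits = sum(count for week, count in histogram.items() if week < within_weeks)
    return hits / total


# Invented counts, for illustration only: ten intervals in total.
example = {0: 3, 1: 1, 2: 1, 7: 5}

same_week = revisit_probability(example, 1)     # 3 of 10 intervals
within_three = revisit_probability(example, 3)  # 5 of 10 intervals
```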
Implications
So, if we’re correct in our assumption that each change to a file increases the chance of us revisiting it, might that affect how we go about making those changes?
I think that one thing this shows us is that the probability that an investment in code quality made now will pay off in the near future is quite high. Invest a bit each time you change something and chances are that you’ll reap the benefits of that investment within a few weeks.
I think it also shows that you might get away with smaller refactorings each time you change code than you might think. As you’ll probably work on the same code in the near future, you can opt to spread that investment in code quality over a couple of visits rather than doing it all at once. Hence you can strike a balance between making code better and moving forward.
In our particular case with the ten-year-old corpse brought back to life, the rationale for the quick fixes was that since we wouldn’t touch this code often, it did not matter that much if each change cost a bit more than usual. Besides, it would have meant a lot of work to make the code better, and we saw little benefit from that work in the near future.
What in fact happened was that we ended up with quite a few changes to the code, each of them hard because no investment had been made to make subsequent changes easier. The accumulated cost was quite high. Had we worked from the beginning under the assumption that we would revisit this code regularly, things might have gone differently.
We hopefully learned our lesson.


