The probability that you will change a piece of code in the near future increases when you make changes to that code or to code in its vicinity.
Recently, we were about to make changes to parts of a system that hadn’t been changed in about ten years. When we discussed how to approach the change, someone argued that we probably wouldn’t change that code again in the foreseeable future. So why should we bother to refactor it and make sure it was testable? The cost of refactoring that code was deemed extremely high, an investment that would not pay off any time soon.
We opted for the quick fix and changed as little of the code as possible. The reasons for not touching the code more than necessary seemed overwhelming.
Well, over the following weeks we found ourselves making changes to that code over and over again. While testing that part of the code, we found that we needed a few more changes after all. One change led to another; it was as if we’d brought that code back to life and now had to take care of it. Because we told ourselves each time that “this is the last change in ten years, promise”, we never invested much in code quality, and the cost of each change was very high.
I don’t think this is an uncommon phenomenon. My theory is that once you make a change to a file, the probability that you’ll revisit that file increases. Whenever you make changes to a piece of code you start testing it. When you start testing it you might find things that need to be fixed. When your customers see your new functionality they might come up with new suggestions. All this leads to new changes to the code and the process starts all over again.
Finding Some Data
To test this theory out, I created a small Ruby application that reads a Subversion log. For each path in the log, it records the dates on which that path was changed. It then computes the time between each pair of consecutive changes to a file and counts how often each interval occurs. The counts go into a histogram in which each bar represents one week. If the time between two consecutive changes to a file is two and a half weeks, the bin that represents two-week intervals is increased by one. Files that occur only once in the log are discarded, as they have been added to the repository but never worked on (this might be an oversight on my part, but I don’t think so).
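The binning described above can be sketched in a few lines. This is a Python illustration, not the Ruby tool itself. It assumes the log has already been parsed into (path, datetime) pairs, and the function name, the `max_weeks` parameter and the sample paths are all made up for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def interval_histogram(changes, max_weeks=52):
    """Count gaps between consecutive changes to the same path.

    `changes` is an iterable of (path, datetime) pairs (an assumed input
    format). Keys of the result are gaps in whole weeks, rounded down;
    gaps of `max_weeks` or more share a single overflow bin.
    """
    dates_by_path = defaultdict(list)
    for path, when in changes:
        dates_by_path[path].append(when)

    histogram = Counter()
    for dates in dates_by_path.values():
        dates.sort()
        # A path seen only once produces no pairs, so it drops out here.
        for earlier, later in zip(dates, dates[1:]):
            weeks = (later - earlier) // timedelta(weeks=1)
            histogram[min(weeks, max_weeks)] += 1
    return histogram


# Hypothetical sample data, only to show the shape of the input.
sample = [
    ("a.c", datetime(2009, 1, 1)),
    ("a.c", datetime(2009, 1, 3)),   # 2 days later: bin 0
    ("a.c", datetime(2009, 1, 21)),  # 18 days later: bin 2
    ("b.c", datetime(2009, 1, 1)),   # changed once: discarded
    ("c.c", datetime(2007, 1, 1)),
    ("c.c", datetime(2009, 1, 1)),   # two years later: overflow bin
]
print(sorted(interval_histogram(sample).items()))
```

Rounding down to whole weeks matches the "two and a half weeks goes into the two-week bin" rule, and the overflow bin corresponds to lumping everything over a year into one bar.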
Armed with the histogram, we can plot the probability density of the time interval between changes to a file.
When I ran this against our Subversion repository as well as two open source repositories (gcc and python), the resulting histograms showed similar characteristics (the histogram below shows the results for gcc). The bar on the far left shows the number of changes made to a file within a week of its previous change. The bar on the far right represents the sum of all changes with over a year between them. Because this bar covers an interval far longer than a week, it should really be drawn as a lower and wider bar.

Histogram for gcc svn log
The data from these repositories shows roughly a 50% chance that a file will be revisited within four to five weeks of a change, and around a 30% chance that it will be revisited within the same week.
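Percentages like these come from reading the histogram cumulatively. Here is a minimal Python sketch, assuming a histogram that maps whole-week gaps to counts with one overflow bin. The function name and the toy counts are invented for illustration and are not the gcc figures.

```python
from itertools import accumulate


def cumulative_revisit_fraction(histogram, max_weeks=52):
    """Return a list f where f[w] is the fraction of observed gaps
    shorter than w + 1 weeks. The last entry includes the overflow bin."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram contains no intervals")
    counts = [histogram.get(week, 0) for week in range(max_weeks + 1)]
    return [running / total for running in accumulate(counts)]


# Toy counts chosen so the arithmetic is easy to check by hand.
toy = {0: 3, 1: 1, 4: 1, 52: 5}
fractions = cumulative_revisit_fraction(toy)
print(fractions[0], fractions[4])
```

Note that the fractions are taken over observed gaps only, so files that were changed just once do not enter the calculation at all.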
Implications
So, if we’re correct in our assumption that each change to a file increases the chance of us revisiting it, might that affect how we go about making those changes?
I think that one thing this shows us is that the probability that an investment in code quality made now will pay off in the near future is quite high. Invest a bit each time you change something and chances are that you’ll reap the benefits of that investment within a few weeks.
I think it also shows that you might get away with smaller refactorings each time you change code than you might think. As you’ll probably work on the same code in the near future, you can opt to spread that investment in code quality over a couple of visits rather than doing it all at once. Hence you can strike a balance between making code better and moving forward.
In our particular case with the ten-year-old corpse brought back to life, the rationale for the quick fixes was that since we wouldn’t touch this code often, it did not matter that much if each change cost a bit more than usual. Besides, it would have meant a lot of work to make the code better, and we saw little benefit from that work in the near future.
What in fact happened was that we ended up with quite a few changes to the code that were each quite hard because no investment was made to make subsequent changes easier. The accumulated cost was quite high. Had we from the beginning worked under the assumption that we would revisit this code regularly, things might have gone differently.
We hopefully learned our lesson.
13 Comments
Very cool!
I wonder about omitting files that are never changed, though. Shouldn’t they be counted in the same bucket as the ones that are very infrequently changed?
That should sway your percentages a little; it’d be interesting to see how much.
Cool. Data presented on a gut-feeling I’ve had for years. Way to go!
@Kim: Good thinking. Will check that.
I wonder if there is any particular characteristic that files that are only added, but never changed share with each other.
Will need to check way more repositories as well.
Anecdotally I disagree with this. Was the high mod count on recently changed files because active development (not maintenance) was happening there?
This analysis of yours gives a great insight.
I wonder how the shape of the histogram is related to the file sizes. I would expect that the bigger the average file size is, the less pronounced the “decay” would be.
@SuperlativeMan:
Thanks for the feedback.
I am not sure what difference you see between active development and maintenance, though.
How do you think that would affect the interpretation of the data?
Stefano wrote:
I’d like to see a histogram for files < 50 lines, 50 – 100, 100 – 500, etc. (or some such grouping).
I suspect the size would strongly correlate with change.
This kind of behaviour has been observed and described repeatedly at conferences like Mining Software Repositories, International Conference on Program Comprehension, PROMISE, ICSE, ICSM, etc.
Co-change is also observed. If you changed two files together in the past, you will probably change those two files together again in the future.
@MSR:
Wow! I never heard of MSR before. Thanks for the info.
Are there any specific articles or other materials about this that you could point me at?
@Brett, @Stefano:
Interesting tip. Thanks.
I also wonder if there is a way to display which files end up where in the histogram. Are we just seeing the same few files being changed over and over again, or is there a more interesting distribution?
Great information! Thank you for the research that you did. There seems to be something to the results you found.
I don’t think that histogram is very conclusive. You get an exponential falloff if you plot a histogram of waiting times between events from a Poisson process. A scatter plot comparing successive commit intervals might show the correlation you are looking for.
Yes, I think I see your point. My data just shows that there is a distribution of waiting times between changes to code, but it fails to show anything about whether changes to dormant files really lead to a higher frequency of revisits. Right?
3 Trackbacks/Pingbacks
[...] the time it would have been better getting the code under test right away. And a few days I found somebody who actually took the time to get some numbers proving why you should get the code under test as fast as possible. The interesting fact is that in [...]
[...] Karlsson says in his post The Locality of Code Changes “The probability that you will change a piece of code in the near future increases when you [...]
[...] though it is not very scientifically conclusive Joakim Karlsson has some very interesting results on the locality of code changes. As Joakim discusses this relates closely to how you evaluate whether the code you are currently [...]