Source code versioning systems are tools that make developers' lives easier. Basically, they keep a list of all the changes made to the source code and let you navigate the history and recover old versions of the project. Each of those changes is called a commit, and it can be considered the atomic piece of information in these systems.
Nowadays, commits are considered a “good” metric for getting an initial idea of the total effort put into a project. However, this is not as simple as it seems: each versioning system, and even each project with its particularities, may distort this metric. So we all need to be a bit careful before raising it to “the most wonderful, marvelous and incredible metric in the world”.
So, in the first place, what kind of information can we find in a commit? Typically, a commit records the time when the change took place, the files affected by the change, the lines added, removed or modified, the author of the commit, and maybe extra information such as the reviewer, specific acknowledgements and others. The following example shows information that can be found in a specific commit (using the git log command):
Author: Daniel Izquierdo <email@example.com>
Date: Fri Mar 7 13:32:25 2014 +0100
Add turnover in ITS and SCR
diff --git a/vizGrimoireJS/alerts.py b/vizGrimoireJS/alerts.py
index ff5a703..12b1de6 100755
@@ -82,15 +82,29 @@ if __name__ == '__main__':
However, the definition of a commit is really specific to the versioning system. For example, a commit in CVS is a modification to one file, so N modified files imply N commits. Subversion or Git, on the other hand, may have several “touched” files in the same commit. Are projects that use different versioning systems comparable at the level of commits? The answer is probably that they are not comparable by simply counting commits. You need a slightly more advanced way to count them.
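One possible "more advanced way" is to regroup CVS file-level commits into change sets before counting. The sketch below is only a heuristic under stated assumptions: it treats file-level commits with the same author and the same log message, each at most a few minutes after the previous one, as a single change. The function name, the five-minute window and the sample data are all hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta


def group_file_commits(file_commits, window=timedelta(minutes=5)):
    """Group per-file commits into change sets (heuristic, not exact).

    file_commits: iterable of (author, message, timestamp, path) tuples.
    Two file-level commits end up in the same change set when author and
    message match and the gap to the previous one is at most `window`.
    """
    changesets = []
    latest = {}  # (author, message) -> most recent change set for that key
    for author, message, when, path in sorted(file_commits, key=lambda c: c[2]):
        key = (author, message)
        current = latest.get(key)
        if current is not None and when - current["end"] <= window:
            current["files"].append(path)
            current["end"] = when
        else:
            current = {"author": author, "message": message,
                       "start": when, "end": when, "files": [path]}
            changesets.append(current)
            latest[key] = current
    return changesets


# Hypothetical data: three file-level commits, two of them part of one change.
sample = [
    ("alice", "Fix parser", datetime(2014, 3, 7, 13, 32, 25), "parser.c"),
    ("alice", "Fix parser", datetime(2014, 3, 7, 13, 32, 27), "parser.h"),
    ("bob", "Update docs", datetime(2014, 3, 7, 15, 0, 0), "README"),
]
changesets = group_file_commits(sample)
print(len(sample), "file-level commits ->", len(changesets), "change sets")
```

The window size is a judgment call: too small splits one change into several, too large merges unrelated changes that happen to share a message.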
Moreover, distributed versioning systems such as Git offer more detailed information. Do you really know when your developers are working? Any versioning system has a field named “date”, and this field indicates the time when the commit took place. Are you sure? Again, we should take this with a pinch of salt when counting commits. As an example, Subversion shows the time when the commit was made on the centralized server. In Git, however, it is possible to have the local time when the commit took place on each developer’s own computer. Thus, with Git (and other distributed systems) we know when developers are committing and even their time zone (let’s assume that this option is well configured), while with SVN we only know when each commit reaches the server.
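As a small illustration, the date string from the commit shown earlier can be parsed together with its UTC offset. This is a minimal sketch that assumes the default date layout printed by git log and an English locale for the day and month names; the helper name is hypothetical.

```python
from datetime import datetime, timezone

# Layout of the default "Date:" field in git log output,
# e.g. "Fri Mar 7 13:32:25 2014 +0100" (assumes English day/month names).
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def parse_git_date(text):
    """Return a timezone-aware datetime for a git log date string."""
    return datetime.strptime(text, GIT_DATE_FORMAT)


local = parse_git_date("Fri Mar 7 13:32:25 2014 +0100")
utc = local.astimezone(timezone.utc)

print("local hour of the commit:", local.hour)
print("offset from UTC:", local.utcoffset())
print("hour in UTC:", utc.hour)
```

Keeping the offset lets you study local working hours (the developer committed at 13:32 local time), while converting to UTC gives a common timeline across contributors. Note that the offset is only as reliable as the clock and time zone settings on the developer’s machine.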
Going a step further, we find other types of daily activity, specifically branches, merges, reviews, etc. Should we count a merge as a proper commit? In some cases, an automated process takes place: a developer submits a change to the source code and tools such as Jenkins check whether it works (it compiles, it complies with the code style, the resulting merge has no conflicts, and so on). So, in most cases, merges are actions that bring parallel development into the source code, but no actual work takes place on the source code: typically no files are ‘touched’ in a merge. Thus, they exist, but we should probably not measure them. The following figure shows this example for three projects of the OpenStack Foundation: Nova, Swift and Neutron. We can see how, due to the development policy of the Foundation, merges are common, to the point that they are close to 40% of the activity.
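In Git, a merge can be recognized as a commit with more than one parent. The following sketch separates merges from regular commits, assuming the input is text with one commit per line in the form "hash parent1 parent2 ...", such as the output of git log --pretty=format:"%H %P". The function name and the short sample hashes are hypothetical.

```python
def split_merges(log_text):
    """Split commits into (regular, merges) based on their parent count.

    log_text: one commit per line, "<hash> <parent> <parent> ...".
    A commit with two or more parents is treated as a merge.
    """
    regular, merges = [], []
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        commit, parents = fields[0], fields[1:]
        if len(parents) > 1:
            merges.append(commit)
        else:
            regular.append(commit)
    return regular, merges


# Hypothetical history: c4 merges branch commit b1 into the main line.
sample_log = """\
c4 c3 b1
b1 c2
c3 c2
c2 c1
c1
"""
regular, merges = split_merges(sample_log)
share = len(merges) / (len(regular) + len(merges))
print(f"{len(merges)} merge(s), {share:.0%} of all commits")
```

If only the filtered count is needed, git log also accepts the --no-merges option, which leaves merge commits out of the listing.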
As for branches, they are used to develop, fix or improve specific features or parts of the source code. Again, they exist, but in some cases they are not so important. Creating branches is cheap in systems such as Git (a bit heavier in others such as SVN), and there are communities that use a branch to develop almost any small piece of source code. This is again the case of the OpenStack community, where if you want to change the code, you have to create a branch and start a review process.
And what if we work with lines of code or files? One advantage of using files or lines of code is that the metric is finer-grained. However, in some specific cases, and in almost all of the projects we have studied at some point, there is a big peak of file movements (and thus of lines as well). This could be the result of a refactoring of the code: maybe one developer was tidying up the project and moving files around. Is this real code? To some extent it is, since this developer needs to understand the code to make those modifications. But if a list of top contributors is retrieved, this developer would be one of them (and maybe he was just reformatting the code!). When using commits, however, this type of action is diluted: even when the change is one of the biggest ones, it is a single commit. Hence, what’s better? Our perception is that commits are a fair (enough) metric to count activity and effort in the community as a first step.
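The effect of a single large refactoring on a contributor ranking can be seen with a toy example. The data below is entirely made up (three hypothetical authors, one of whom moved files around in a single 12,000-line commit); it only illustrates how the two metrics diverge.

```python
from collections import Counter

# Hypothetical commits as (author, lines_changed) pairs.
# "eva" made one huge commit that only moved files around.
commits = [
    ("ana", 40), ("ana", 25), ("ana", 60),
    ("luis", 30), ("luis", 15),
    ("eva", 12000),
]


def rank(commits):
    """Return two rankings: by number of commits and by lines changed."""
    by_commits = Counter()
    by_lines = Counter()
    for author, lines in commits:
        by_commits[author] += 1
        by_lines[author] += lines
    return by_commits.most_common(), by_lines.most_common()


by_commits, by_lines = rank(commits)
print("by commits:", by_commits)
print("by lines:  ", by_lines)
```

Ranked by lines, the refactoring author comes first by a wide margin; ranked by commits, the same author comes last. Neither view is the whole truth, which is why commits are suggested here only as a first step.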
Finally, it is worth mentioning that new tools are on the software development horizon. Gerrit, a tool to manage the review process, is one of them. This type of tool greatly complements the information found in the versioning systems. A review typically consists of a submitted patch, a review process, maybe some iterations improving the patch, and finally its acceptance or abandonment. However, by the time a patch is accepted it may have mutated, and the previous versions are usually lost and not recorded by the versioning system. This is even more true of abandoned patches. Yet all of this is effort and time-consuming work.
To summarize, measuring commits implies understanding their limitations. Filters should be applied to clean the dataset, and specific actions are required to remove oddities and noise. The type of versioning system, the way developers work and even bots should be studied to fully understand how a community works and to provide useful metrics, of which commit activity is one.