In a previous post (Commits: that metric), we discussed the different flavors of commits that should be taken into account when measuring them.
In the example provided there, and depending on the development policy of the project, commits ignoring merges represented only around 50% of the total activity that we can find.
CVSAnalY is one of the tools used as input for our dashboards. It specializes in versioning systems and parses the logs produced by some of the most widely used ones in the open source world. It does this with the invaluable help of Repository Handler, which is in charge of adding a transparency layer.
Its procedure is simple: CVSAnalY reads a log from SVN, CVS or Git, then builds and feeds a relational database. For other distributed versioning systems, such as Mercurial or Bazaar, there are hooks to migrate them to Git.
To illustrate this post, the publicly available database for the OpenStack project is used. This database is the basis of the dashboard that can be seen at the OpenStack Activity Dashboard page. Bitergia provides this database and updates it daily, so this analysis uses the dataset as of today.
Four tables are of interest for this analysis: scmlog, actions, branches and repositories. The scmlog table is the central table around which the rest of the schema revolves. Each row of the scmlog table is a commit in the original repository, and the date, author and message left by the developer are available.
Then, the actions table contains the changes made to each of the files of the repository. A commit may change several files, and each change to each file is an action. Files can be added, removed, copied, moved, modified, etc. All these actions are registered in this table.
Finally, the repositories and branches tables. The branches table is linked to the actions table. Files are linked to a commit and a branch (through the actions table), and commits are linked to specific repositories. It is possible to aggregate several repositories in the same database schema.
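The relationships described above can be sketched with a toy database. This is a minimal sketch using Python's sqlite3 module and a simplified schema: the join columns (repository_id, commit_id, branch_id, author_id) are the ones used by the queries in this post, while the remaining columns and all of the sample rows are assumptions made for illustration. The real CVSAnalY tables carry more columns.

```python
import sqlite3

# Simplified, assumed subset of the CVSAnalY schema (the real tables have more columns).
SCHEMA = """
CREATE TABLE repositories (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE branches     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE scmlog (
    id            INTEGER PRIMARY KEY,
    author_id     INTEGER,
    date          TEXT,
    message       TEXT,
    repository_id INTEGER REFERENCES repositories(id)
);
CREATE TABLE actions (
    id        INTEGER PRIMARY KEY,
    type      TEXT,     -- assumed column: kind of change (added, modified, ...)
    file_id   INTEGER,  -- assumed column: the file that was touched
    commit_id INTEGER REFERENCES scmlog(id),
    branch_id INTEGER REFERENCES branches(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample data: one repository, one branch, one commit.
conn.execute("INSERT INTO repositories VALUES (1, 'nova.git')")
conn.execute("INSERT INTO branches VALUES (1, 'master')")
conn.execute(
    "INSERT INTO scmlog VALUES (1, 10, '2013-05-01 10:00:00', 'Fix typo', 1)"
)
# A single commit touching two files produces two rows in actions.
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?, ?)",
    [(1, "M", 100, 1, 1), (2, "A", 101, 1, 1)],
)

actions_per_commit = conn.execute(
    "SELECT commit_id, count(*) FROM actions GROUP BY commit_id"
).fetchall()
print(actions_per_commit)
```

With this sample data, commit 1 shows up with two actions, which is why the queries later in this post count distinct commit ids when joining with the actions table.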
Playing with the data.
Five examples are provided in this section, each using different filters. They are intended to show how easy it is to filter the dataset once we have the CVSAnalY database.
If no filters are applied, counting commits is easy. We just need to calculate commits between two dates and that’s it. Let’s assume we are not interested in filtering by date. The total number of commits is 106,712.
SELECT count(*) FROM scmlog;
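The query above counts every commit. If a period is of interest, a condition on the commit date can be added. The following is a minimal sketch on a toy table, assuming that the commit date is stored in a column named date in scmlog; the sample rows and the count_commits helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy scmlog table with only the columns needed here (assumed names).
conn.execute("CREATE TABLE scmlog (id INTEGER PRIMARY KEY, date TEXT)")
conn.executemany(
    "INSERT INTO scmlog VALUES (?, ?)",
    [
        (1, "2012-12-31 23:59:59"),
        (2, "2013-01-01 00:00:00"),
        (3, "2013-06-15 12:30:00"),
        (4, "2014-01-01 00:00:00"),
    ],
)


def count_commits(conn, start, end):
    """Count commits with start <= date < end (half-open interval)."""
    row = conn.execute(
        "SELECT count(*) FROM scmlog WHERE date >= ? AND date < ?",
        (start, end),
    ).fetchone()
    return row[0]


commits_2013 = count_commits(conn, "2013-01-01 00:00:00", "2014-01-01 00:00:00")
print(commits_2013)
```

A half-open interval is used so that consecutive periods can be counted without including a commit twice.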
If specific filters are required for repositories, we need to filter by the field repository_id in the scmlog table.
Let’s assume we need to count commits for the nova repository in the OpenStack project. With this filter the total number of commits is 28,288.
SELECT count(distinct(s.id)) FROM scmlog s, repositories r WHERE s.repository_id = r.id and r.name='nova.git';
Merges are a bit different. Typically, a merge is a commit with no “real” activity, where “real” activity means that files are touched. Going back to the CVSAnalY schema described above, ignoring merges means counting only those commits that are linked to rows in the actions table. Filtering out merges, the total number of commits goes from 106,712 to 59,941.
SELECT count(distinct(s.id)) FROM scmlog s, actions a WHERE a.commit_id = s.id;
And now using the nova repository (the total number of commits decreases from 28,288 to 16,519):
SELECT count(distinct(s.id)) FROM scmlog s, repositories r, actions a WHERE a.commit_id = s.id and s.repository_id = r.id and r.name='nova.git';
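To see why joining with the actions table drops merges, here is a small sketch on made-up data in which one commit has no actions. The second query, which counts the merges themselves with a LEFT JOIN, is not part of the original analysis; it is an assumption about how the complementary number could be obtained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scmlog  (id INTEGER PRIMARY KEY, message TEXT);
CREATE TABLE actions (id INTEGER PRIMARY KEY, commit_id INTEGER);
-- Made-up data: commit 2 touches no files, so it has no rows in actions.
INSERT INTO scmlog VALUES (1, 'Add API'), (2, 'Merge branch'), (3, 'Fix bug');
INSERT INTO actions VALUES (1, 1), (2, 1), (3, 3);
""")

# Commits that touch at least one file (same join as in the post).
non_merges = conn.execute(
    "SELECT count(DISTINCT s.id) FROM scmlog s, actions a "
    "WHERE a.commit_id = s.id"
).fetchone()[0]

# Commits without any action, which this post treats as merges.
merges = conn.execute(
    "SELECT count(*) FROM scmlog s "
    "LEFT JOIN actions a ON a.commit_id = s.id "
    "WHERE a.id IS NULL"
).fetchone()[0]

total = conn.execute("SELECT count(*) FROM scmlog").fetchone()[0]
print(total, non_merges, merges)
```

The distinct count matters here: commit 1 has two actions and would otherwise be counted twice.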
Bots are specific authors that submit changes to the versioning system. In some cases their activity is not worth counting, so it needs to be ignored. How do we do it? In the case of the OpenStack Foundation, we at Bitergia provide extra tables to filter such activity, but with a raw database schema from CVSAnalY this can be done by filtering out specific authors. For this, we use the people table, which stores all of the identities found in the repository, including the bots.
Jenkins is the bot that merges all of the changes in the source code. And of course, if no filter is applied, this would be the most important developer of your community, so be careful with this type of automatic activity!
In the case of the OpenStack Foundation, this bot is found as “Jenkins” and as “OpenStack Jenkins” in the CVSAnalY schema.
If we filter them, the total number of commits goes from 106,712 to 75,323.
SELECT count(distinct(s.id)) FROM people p, scmlog s WHERE p.id = s.author_id and p.name <> 'Jenkins' and p.name <> 'OpenStack Jenkins';
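The effect of a bot on a ranking of authors, and the filter above, can be tried on a toy dataset. This is a sketch under assumptions: the people and scmlog tables are reduced to the columns used in the query, the author 'Alice' and all rows are invented, and the list of bot names is passed as parameters rather than hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE scmlog (id INTEGER PRIMARY KEY, author_id INTEGER);
-- Made-up data: one human author and the two bot identities.
INSERT INTO people VALUES (1, 'Alice'), (2, 'Jenkins'), (3, 'OpenStack Jenkins');
INSERT INTO scmlog VALUES (1, 1), (2, 2), (3, 2), (4, 3), (5, 1), (6, 2);
""")

BOTS = ("Jenkins", "OpenStack Jenkins")

# Without any filter, the bot ranks as the most active "developer".
top_authors = conn.execute("""
    SELECT p.name, count(DISTINCT s.id) AS commits
    FROM people p JOIN scmlog s ON p.id = s.author_id
    GROUP BY p.name
    ORDER BY commits DESC, p.name
""").fetchall()

# Same count as in the post, with the bot names given as parameters.
human_commits = conn.execute(
    "SELECT count(DISTINCT s.id) FROM people p, scmlog s "
    "WHERE p.id = s.author_id AND p.name NOT IN (?, ?)",
    BOTS,
).fetchone()[0]

print(top_authors)
print(human_commits)
```

Using NOT IN keeps the query readable if more bot identities turn up later; the number of placeholders has to match the number of names.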
Finally, branches. As mentioned, this filter can be applied through the actions table, in a similar way to the merge filter, adding a condition on the specific branch. Given the idiosyncrasies of the OpenStack Foundation, branches are rarely found, since they are either merged or abandoned. But they exist, and we can filter by them. This example counts commits only from the stable/grizzly branch.
SELECT count(distinct(s.id)) FROM branches b, actions a, scmlog s WHERE b.id=a.branch_id and a.commit_id=s.id and b.name='stable/grizzly';
A final example puts all of these ways of counting and filtering commits in context. Let’s use again the case of the Nova project from the OpenStack Foundation. Four ways of counting commits are compared in this example:
- All commits from the Nova project: 28,288 commits
- All commits from the Nova project removing merges: 16,519 commits
- All commits from the Nova project removing bots: 20,073 commits
- All commits from the Nova project removing bots and merges: 16,358 commits
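The queries behind the last two figures are not listed in this post. A possible way to combine the repository, merge and bot filters is sketched below on a toy dataset. The count_commits helper, its parameters, the author 'Alice' and every row are assumptions for illustration, so the numbers it prints are unrelated to the Nova figures above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repositories (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE people  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE scmlog  (id INTEGER PRIMARY KEY, author_id INTEGER, repository_id INTEGER);
CREATE TABLE actions (id INTEGER PRIMARY KEY, commit_id INTEGER);

INSERT INTO repositories VALUES (1, 'nova.git'), (2, 'swift.git');
INSERT INTO people VALUES (1, 'Alice'), (2, 'Jenkins');
-- Commits 1-4 belong to nova.git, commit 5 to swift.git.
-- Commits 2 and 4 have no actions (merges); commits 2 and 3 are by the bot.
INSERT INTO scmlog VALUES (1, 1, 1), (2, 2, 1), (3, 2, 1), (4, 1, 1), (5, 1, 2);
INSERT INTO actions VALUES (1, 1), (2, 1), (3, 3), (4, 5);
""")

BOTS = ("Jenkins", "OpenStack Jenkins")


def count_commits(conn, repo, skip_merges=False, skip_bots=False):
    """Count distinct commits of one repository, optionally filtered."""
    sql = [
        "SELECT count(DISTINCT s.id) FROM scmlog s",
        "JOIN repositories r ON s.repository_id = r.id",
    ]
    params = [repo]
    if skip_merges:
        # An inner join keeps only commits with at least one action.
        sql.append("JOIN actions a ON a.commit_id = s.id")
    if skip_bots:
        sql.append("JOIN people p ON p.id = s.author_id")
    sql.append("WHERE r.name = ?")
    if skip_bots:
        sql.append("AND p.name NOT IN (?, ?)")
        params.extend(BOTS)
    return conn.execute(" ".join(sql), params).fetchone()[0]


results = {
    "all": count_commits(conn, "nova.git"),
    "no merges": count_commits(conn, "nova.git", skip_merges=True),
    "no bots": count_commits(conn, "nova.git", skip_bots=True),
    "no bots, no merges": count_commits(
        conn, "nova.git", skip_merges=True, skip_bots=True
    ),
}
print(results)
```

On this toy data the four counts are 4, 2, 2 and 1. As in the Nova figures, the bot and merge filters overlap, so their effects cannot simply be added up.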
CVSAnalY is a great tool to calculate activity. Among other things, this post shows how to use it to count commits and how specific filters can be applied to obtain more accurate data. However, it is necessary to understand the domain of analysis and to check the peculiarities of each project being analyzed.