[Update: We have finally published the complete report about OpenStack Folsom. Although the details mentioned here are still relevant, the numbers are much more complete and accurate in that final report.]
OpenStack has been one of our pet projects for several months now. We already contributed some stats to their weekly newsletter back in April, and the project was also a subject of study in Analyzing Risks associated to FLOSS Communities, one of our LinuxTag 2012 talks. Now we come back to it with a preview of a wider study we’re preparing. This one looks at how companies are contributing to the maintenance and improvement of OpenStack, based on the analysis of its (many) git repositories [see the full preview of the study].
We extracted all information related to commits, and who performed them, from git metadata. Then we used some heuristics and manual analysis to detect bots and to determine the companies committers work for. Based on that information, we have produced separate charts with the activity of specific companies.
For each company, we are producing charts showing the number of commits and active committers per month, which may give an idea of how active the company is in the project (commits and active committers are just proxies for activity, so your mileage may vary). We also provide the number of different repositories (each OpenStack git repository roughly corresponds to a subproject) and files touched per month, which suggests how wide the company’s contributions are (some companies concentrate on specific parts of OpenStack, while others are spread all over the project).
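As a rough illustration of the monthly counts described above, here is a minimal sketch in Python. It is not the tooling used for the study. It assumes the log was exported with something like `git log --pretty=format:'%ae|%aI'` (author email and strict ISO 8601 author date), and the function name `monthly_activity` is made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def monthly_activity(log_lines):
    """Count commits and distinct committers per month.

    Each line is assumed to look like "<email>|<ISO 8601 date>".
    Returns {"YYYY-MM": (commits, active_committers)}.
    """
    commits = defaultdict(int)
    committers = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        email, stamp = line.rsplit("|", 1)
        # The month is taken in the author's own timezone.
        month = datetime.fromisoformat(stamp).strftime("%Y-%m")
        commits[month] += 1
        committers[month].add(email.lower())
    return {m: (commits[m], len(committers[m])) for m in sorted(commits)}


# Made-up sample data, only to show the shape of the input.
sample = [
    "alice@example.com|2012-07-30T09:00:00+02:00",
    "bob@example.com|2012-07-31T18:30:00-07:00",
    "alice@example.com|2012-08-01T10:00:00+02:00",
    "Alice@example.com|2012-08-02T11:00:00+02:00",
]
activity = monthly_activity(sample)
```

Filtering the input lines by company before calling the function would give the per-company series that the charts plot.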
Finally, we include two more metrics related to how the committers of each company behave: the ratio of commits per developer, and the hourly pattern of contributions. The former gives us an idea of the mean individual effort by developers (although not all commits are equal), while the latter is a first look at the working hours of the company’s developers. Since git records times in the developer’s timezone, you can see whether they work mostly during office hours, outside office hours, etc.
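Both metrics are simple to compute once each commit is reduced to a developer and a timestamp. The sketch below is only an illustration under assumptions of ours: records are `(developer, ISO 8601 date)` pairs with the timezone offset preserved, and the 9 to 18 office-hours window is an arbitrary choice, not one taken from the study.

```python
from collections import Counter
from datetime import datetime


def commits_per_developer(records):
    """Mean number of commits per distinct developer."""
    developers = {dev for dev, _ in records}
    return len(records) / len(developers) if developers else 0.0


def hourly_pattern(records):
    """Commits per hour of day (0..23), in each author's local time."""
    hours = Counter(datetime.fromisoformat(stamp).hour for _, stamp in records)
    return [hours.get(h, 0) for h in range(24)]


def office_hours_share(pattern, start=9, end=18):
    """Fraction of commits made in [start, end); the window is an assumption."""
    total = sum(pattern)
    return sum(pattern[start:end]) / total if total else 0.0


# Made-up records for one hypothetical company.
records = [
    ("alice", "2012-08-01T10:00:00+02:00"),
    ("alice", "2012-08-01T23:15:00+02:00"),
    ("bob", "2012-08-02T10:30:00-07:00"),
]
ratio = commits_per_developer(records)
pattern = hourly_pattern(records)
share = office_hours_share(pattern)
```

Because the hour is read from the timestamp as the author recorded it, a commit at 10:30 in California and one at 10:00 in Madrid both land in the same bucket, which is what makes the pattern comparable across timezones.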
The charts for each company are absolute (that is, they show the total number of commits, active committers, touched files, and so on), and therefore have to be considered in the context of the total activity of the whole project. For example, the apparent decrease in activity for many companies since October 2011 has to be put into the context of a lower number of commits for the whole project (see the charts for aggregated data at the top of the main page for the study). That said, and keeping an eye on the scale of the Y axis, you can also observe how some companies are clearly decreasing their activity, while others are picking up the baton and have increased theirs during the last months.
For this study, we have analyzed a total of 33 git repositories, all shown on GitHub as associated with OpenStack on August 8th, the date of the data collection. We did our best to identify unique developers (e.g., by joining the activity performed by the same developer under different identities), to filter out bots (we were interested in activity by humans) and to assign developers to companies (by looking at the domains in the email addresses, some other heuristics and, in some cases, manual inspection). But some errors surely remain (in fact, if you suspect errors because of your knowledge of the project, please let us know, we would love to track those down).
We also removed developers for whom we couldn’t identify a company, and those related to the domain openstack.org (which seem to be bots). With all of this, we managed to identify developers contributing well over 95% of the total number of commits to the OpenStack git repositories, which makes us feel confident that the data, although it surely contains some errors, is representative at least for the main actors in the project.
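The identity merging, company assignment and coverage check described above could be sketched roughly as follows. This is a simplified illustration, not the heuristics we actually used: it assumes two identities belong to the same person when they share a name or an email address, and the domain table, the example addresses and all function names are hypothetical.

```python
def merge_identities(identities):
    """Group (name, email) pairs that share a name or an email address."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for name, email in identities:
        a = find(("name", name.strip().lower()))
        b = find(("email", email.strip().lower()))
        parent[a] = b  # link the name's group to the email's group

    groups = {}
    for name, email in identities:
        root = find(("email", email.strip().lower()))
        groups.setdefault(root, set()).add((name, email))
    return list(groups.values())


def company_for(email, domain_map, excluded_domains=()):
    """Map an email to a company by domain; None if excluded or unknown."""
    domain = email.strip().lower().rpartition("@")[2]
    if domain in excluded_domains:
        return None
    return domain_map.get(domain)


def coverage(commit_emails, domain_map, excluded_domains=()):
    """Fraction of commits whose author could be assigned to a company."""
    if not commit_emails:
        return 0.0
    known = sum(
        1 for e in commit_emails
        if company_for(e, domain_map, excluded_domains) is not None
    )
    return known / len(commit_emails)


# Hypothetical domain table and identities, for illustration only.
DOMAINS = {"corp.example": "Corp"}
people = merge_identities([
    ("Jane Doe", "jane@corp.example"),
    ("Jane Doe", "jane@home.example"),
    ("J. Doe", "jane@home.example"),
    ("John Roe", "john@corp.example"),
])
```

A rule this simple will both over-merge (two different people with the same name) and under-merge (one person with unrelated names and addresses), which is why manual inspection was still needed.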
If you have any kind of feedback, especially about which other kinds of data you would be interested in, or which data you find useful or not, please comment…
If you’re looking for transparency metrics on corporate contributions to an open source community, you may want to take a look at dash.eclipse.org. Maybe there’s something there that would be of use to you (as in inspiration, not implementation).
Thanks for the pointer, Mike! Pretty interesting indeed :).
The addition of lines information is pretty useful, and we are currently working on it. This is something that CVSAnalY at MetricsGrimoire is already doing, although this takes quite long for CVS or SVN repositories.
Thanks once again for your comments and hope to see you around!
Thanks a lot. Yes, part of what we’re implementing has similar goals. The use for transparency (especially in large projects with many companies contributing) is an aspect we haven’t worked on much up to now, thanks for the heads up.
Looks very promising guys. The main error I see is that Ansolabs is considered an independent company, but it was acquired by Rackspace and some developers may still use their @ansolabs.com email addresses now that they work for Nebula.
Thanks! The case of Ansolabs, Rackspace & Nebula is indeed quite interesting. First of all, it probably makes sense to keep them as separate “entities” in the analysis, at least up to the point of acquisition. But from that point on, we should reassign developers to the “new” company, even if they still use their old email address… That’s definitely something we have to work on.
Also, I contribute from an openstack.org address and I’m not a bot 🙂
Hi Thierry, thanks for your comments!
Most of the developers using an @openstack.org email address were updated to their real company. That is, the assignment was semi-automatic and was later manually updated for all committers and authors in the community.
In your specific case, it seems that your affiliation in our database is still openstack.org, while it should be Canonical as far as I know (right?) (in any case, you can check the dataset we provide for this purpose at http://bitergia.com/public/previews/2012_08_openstack/db/)
We’ll update results as soon as possible :).
Great stuff! It’s a rather impressive analysis.