Methodology used to analyze OpenStack repositories

Some days ago, we published a report on the making-of of OpenStack Folsom and Essex. Now, we are publishing some details about the methodology we followed, to clarify how we retrieved data from the OpenStack repositories and how we analyzed it.
To bootstrap the study, we started with the information maintained by the project itself:

  • Source code management repositories (version control system, in this case git). The OpenStack GitHub webpage lists 35 repositories at the time of this post.
  • Issue tracking system (ticketing, bug tracking system, in this case Launchpad). The Launchpad webpage for OpenStack lists 23 projects at the time of this post.
[Figure: List of git repositories considered for the reports, as seen in the reports themselves.]

We didn’t consider all of these repositories, but tried to focus on those that seem to represent the most prominent part of OpenStack:

  • Selection of git repositories (the whole list is available in the left column of the main webpage of the reports):
  1. “Core” projects: Nova, Swift, Glance, Keystone, Horizon, Quantum, Cinder
  2. “Libraries”: Openstack-common, Python-novaclient, Python-swiftclient, Python-glanceclient, Python-keystoneclient, Python-quantumclient, Python-cinderclient
  3. “Gating”: Tempest
  4. Documentation: Openstack-manuals
  • Selection of Launchpad trackers: Given that it is easy to retrieve all of the information from a Launchpad account, the information about all the tickets in all of the trackers was retrieved. At some points of the report, we later distinguished between tickets in different trackers, for instance between core and non-core projects.

In both cases, data sources were mined using the MetricsGrimoire toolset: more specifically, CVSAnalY for the git repositories and Bicho for the Launchpad trackers.
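As an illustration, the sketch below shows how the mining step could be driven for the selected repositories. The database names and credentials are placeholders, and the exact cvsanaly2 and bicho flag names are assumptions (they follow the MetricsGrimoire documentation of the time and may differ in your installed versions).

```python
import subprocess

# Selected git repositories (see the list above); the GitHub paths are
# assumptions based on the usual github.com/openstack/<name> layout.
REPOS = [
    "nova", "swift", "glance", "keystone", "horizon", "quantum", "cinder",
    "openstack-common", "python-novaclient", "python-swiftclient",
    "python-glanceclient", "python-keystoneclient", "python-quantumclient",
    "python-cinderclient", "tempest", "openstack-manuals",
]

for name in REPOS:
    # Clone the repository, then parse its full history into one shared
    # MySQL database.
    subprocess.check_call(
        ["git", "clone", "https://github.com/openstack/%s.git" % name])
    subprocess.check_call(
        ["cvsanaly2", "--db-user", "mininguser", "--db-password", "secret",
         "--db-database", "openstack_scm", name])

# The Launchpad trackers are mined analogously with Bicho, along the lines of:
# bicho --db-user-out mininguser --db-password-out secret \
#       --db-database-out openstack_its -b lp -u https://bugs.launchpad.net/nova
```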

CVSAnalY retrieves information from the source code management system and stores it in a MySQL database. The information obtained is organized as a collection of “commits” (or change sets). For this analysis, the following selections and filters were applied:

  • Selection of branches: All of the branches leading to master are included. Branches are usually created to work in parallel on specific releases of the source code or on the introduction of new features, among other reasons. In the case of OpenStack, specific branches are created for specific releases (e.g., Diablo or Essex). This analysis considers all of the activity found in those branches together with the master branch.
  • Dates of activity: The reports were focused on the analysis of the Folsom and Essex releases. For Folsom, the period considered was from 2012-04-05 (included) to 2012-09-27 (not included). For Essex, it was from 2011-09-22 (included) to 2012-04-05 (not included). Commits outside those periods were not taken into account. A distributed environment such as git provides local information from each developer (in other words, commits are registered with the local time of the developer that submitted the changes to the git repository), so we rely on the correctness of this information (in this case, with respect to correct time/date settings).
  • Analysis of bots: Several bots have been detected, among others “Gerrit Code Review”, “Tarmac” and “OpenStack Jenkins”. By far, Gerrit and Jenkins are the ones with the highest number of commits: around 3,000. None of the commits performed by those bots were considered for the study. Thus, no activity from bots should appear in the reports.
  • Selection of committers as the basic piece of information: Git makes a difference between committers and authors. Authors are the developers who actually modified a piece of source code (or added a new one), while committers are the developers with the right to submit changes to the git repository. This study has used only committers, since we were interested in showing the activity of the people actually taking on the responsibility of making changes to the code. The date, bot and committer filters are illustrated in the sketch after this list.
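To make the filters concrete, here is a minimal sketch of a query over the database produced by CVSAnalY, counting Folsom commits per committer while excluding the bots. The scmlog and people table and column names follow the CVSAnalY schema we used (adjust for your version); the connection parameters are placeholders.

```python
import MySQLdb  # provided by the MySQL-python package

# Folsom window and bot names as described above; bots are matched by the
# name they use when committing.
QUERY = """
SELECT p.name, COUNT(s.id) AS commits
FROM scmlog s
JOIN people p ON p.id = s.committer_id    -- committers, not authors
WHERE s.date >= '2012-04-05'              -- Folsom start (included)
  AND s.date <  '2012-09-27'              -- Folsom end (not included)
  AND p.name NOT IN
      ('Gerrit Code Review', 'Tarmac', 'OpenStack Jenkins')
GROUP BY p.name
ORDER BY commits DESC
"""

db = MySQLdb.connect(user="mininguser", passwd="secret", db="openstack_scm")
cursor = db.cursor()
cursor.execute(QUERY)
for name, commits in cursor.fetchall():
    print("%s: %d commits" % (name, commits))
```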

Bicho retrieves information from the issue tracking system and stores it in a MySQL database. No filter was applied: all tickets were considered. When selecting tickets for the different analyses, the following conditions were applied (always within the same periods already mentioned for CVSAnalY):

  • The charts on evolution (open and closed tickets per week, for example) refer to tickets opened, changed and closed during the corresponding week (even if they were opened before the start of the Essex or Folsom period); see the sketch after this list.
  • The bar charts on totals for the whole Essex or Folsom periods refer to tickets opened during the corresponding period, and to tickets closed during that period among those opened during that same period.
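For instance, the weekly evolution data can be obtained from the database produced by Bicho with queries along these lines. The issues and changes tables follow the Bicho schema we used; treating the “Fix Committed” and “Fix Released” Launchpad statuses as “closed” is an assumption of this sketch, as are the connection parameters.

```python
import MySQLdb

# Tickets opened per week during the Folsom period.
OPENED = """
SELECT YEARWEEK(submitted_on) AS week, COUNT(*) AS opened
FROM issues
WHERE submitted_on >= '2012-04-05' AND submitted_on < '2012-09-27'
GROUP BY week
ORDER BY week
"""

# Tickets closed per week: a status change to a "closed" state, counted in
# the week of the change even if the ticket was opened before the period.
CLOSED = """
SELECT YEARWEEK(changed_on) AS week, COUNT(DISTINCT issue_id) AS closed
FROM changes
WHERE field = 'status'
  AND new_value IN ('Fix Committed', 'Fix Released')
  AND changed_on >= '2012-04-05' AND changed_on < '2012-09-27'
GROUP BY week
ORDER BY week
"""

db = MySQLdb.connect(user="mininguser", passwd="secret", db="openstack_its")
cursor = db.cursor()
for label, sql in (("opened", OPENED), ("closed", CLOSED)):
    cursor.execute(sql)
    for week, count in cursor.fetchall():
        print("%s %s: %d" % (label, week, count))
```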

The analysis of companies is based only on the activity in the git repositories. It was aimed at providing traceable information about the affiliations of the developers and the typical activity of each of the companies. The following assumptions and decisions were made:

  • Coverage of the analysis: For all of the committers with more than 14 commits, we tried to identify the company they work for. We succeeded for all of them but 10, who account for 0.8% of the total number of commits. Several sources of information (from their email addresses to information about their names in several web services) were used for this identification. For the rest, a rough identification based on their email address was done: those with a generic email address (such as “gmail.com” or “hotmail.com”) were considered “unknown”; those with domains that we could link to a company (such as “rackspace.com” or “redhat.com”) were assigned to those companies; the rest were also considered “unknown” (this fallback heuristic is sketched after this list). The population of committers for whom a company was identified accounts for more than 97% of the total number of commits.
  • Companies and other institutions in the study: Academic institutions were also considered. Institutions such as ISI (Information Sciences Institute) or the University of Melbourne were therefore treated as “companies” for the purposes of the study.
  • Movements among companies: A small number of contributors were detected moving from one company to another. During the Folsom release cycle, at least five developers moved to another company but kept contributing to OpenStack (in fact, two of them were top contributors). This has been taken into account in the study.
  • Developers and companies: All top contributors, and some companies which were found to have a large number of developers with “generic” email addresses, were specially reviewed, using the information provided by the openstack-gitdm project.
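The fallback heuristic for committers with few commits can be summarized in a few lines. The domain lists below are just illustrative samples, not the full mapping used for the reports.

```python
# Illustrative samples only; the real mapping covered many more domains.
GENERIC_DOMAINS = set(["gmail.com", "hotmail.com", "yahoo.com"])
COMPANY_DOMAINS = {
    "rackspace.com": "Rackspace",
    "redhat.com": "Red Hat",
    "hp.com": "HP",
}

def affiliation(email):
    """Map an email address to a company name, or to "unknown"."""
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in GENERIC_DOMAINS:
        return "unknown"
    return COMPANY_DOMAINS.get(domain, "unknown")

print(affiliation("jane@rackspace.com"))  # -> Rackspace
print(affiliation("joe@gmail.com"))       # -> unknown
```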

Maybe these details help to clarify what the data we presented in our reports actually mean, and also to understand the differences with other analyses of OpenStack development, such as the very interesting one produced using openstack-gitdm, which shows somewhat different, but comparable (and in several aspects more detailed) results. And remember, the databases we used are available, so you can always download them and perform your own data filtering, selection and analysis.
