Complete (basic) analysis of MediaWiki

We at Bitergia are improving our reporting and visualization tools. To demonstrate some of their new capabilities, we’ve done a complete activity analysis of the MediaWiki project, including its git repository, Bugzilla system and mailing lists. The analysis is complete in the sense that it includes all these sources, but it is basic in the sense that only bare automatic processing has been performed: no filtering of bots, no identification of special practices and data, no manual processing, etc.

Analysis of MediaWiki project

The analysis, in addition to showing several aspects of the activity in the MediaWiki project, is also a test of some of the new capabilities of our vizGrimoireJS visualization suite, specialized in showing the data collected with MetricsGrimoire tools from software development repositories.

We have structured the report in four sections: a summary of activity, and separate analyses of the changes to the source code (git), of the movements in the ticketing system (Bugzilla), and of the activity in mailing lists.

Summary chart of MediaWiki evolution

In the summary, a chart with the evolution of the main parameters of the git, Bugzilla and mailing list repositories gives a first view of the activity of the project over time. For most of the data sources there is public information for almost 10 years, which allows for a long-term perspective. We have also annotated the charts with the release numbers, which helps to relate activity to release times. The sliders at the bottom of the chart allow for the selection of specific periods of time, which can then be seen in more detail.

The upper part of the chart shows activity per month in the git repository (formerly an SVN repository, whose history was imported during the migration). The middle part refers to Bugzilla (opened and closed tickets, and developers closing tickets, per month). The bottom part refers to aggregated parameters for all mailing lists (messages sent and senders per month).

This summary page also shows some tables with top developers and some other charts that show the evolution of some parameters (some of them are already present in the summary chart, but are offered separately for convenience). All of those tables and charts can also be seen in context on their corresponding pages.

One of the areas where this report shows new capabilities is the presentation of mailing lists. On this page, we now have a summary of the aggregation of all mailing lists (top of the page), and a selector (left column) that lets you specify which of the many (in this case 14) lists are to be shown in separate charts. This is very convenient, since traffic in mailing lists varies a lot depending on their use.

Mailing lists

Another novelty is the inclusion of automatically generated tables in several parts of the report, showing top developers (by number of commits, by number of tickets closed, etc.). However, beware: these tables are generated automatically, directly from the data in the repositories. That means, for example, that bots are not filtered out.

Top committers for last year

This is the reason why, for example, in the table of top committers for the last year, gerrit comes first, well above “real” committers. l10n-bot can also be found in it. Of course, a more useful analysis of commits would mean finding out which committers are real developers and which are bots performing operations on their behalf, and correctly attributing commits to them. In these cases, this would also mean finding out how code review is done and how localization information is updated, not only to improve the accounting of top developers, but also to properly calculate other parameters where these actions have to be taken into account (such as ticket management).

Therefore, take all these “top tables” with a grain of salt.
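As a rough illustration of the simplest kind of filtering mentioned above, here is a minimal Python sketch that drops known bot accounts before ranking committers. It is not part of our tools: the function name `top_committers`, the `BOT_AUTHORS` set and the sample data are all hypothetical, and the only bot names used are the two that show up in the table.

```python
from collections import Counter

# Hypothetical bot list: these two names are the ones seen in the table;
# a real analysis would need a curated, project-specific list.
BOT_AUTHORS = {"gerrit", "l10n-bot"}


def top_committers(commit_authors, bots=BOT_AUTHORS, limit=10):
    """Return the top (author, commits) pairs, skipping known bot accounts."""
    counts = Counter(a for a in commit_authors if a.lower() not in bots)
    return counts.most_common(limit)


# Made-up sample data: one entry per commit.
sample = ["gerrit"] * 5 + ["alice"] * 3 + ["l10n-bot"] * 2 + ["bob"]
print(top_committers(sample))  # [('alice', 3), ('bob', 1)]
```

Note that this only removes the bots from the ranking. It does not attribute their commits to the developers on whose behalf they acted, which is the harder part of the problem.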

We have also included in the analysis of Bugzilla some charts on the evolution of the .99 and .95 quantiles of the time to close for closed tickets. The .95 quantile is probably the most interesting, since it shows how for some periods a large majority of the closed tickets are closed in relatively short periods of time, well below the maximum (remember that the maximum is always the number of days that have passed since the ticket was opened; that’s why quantiles are always below the line with a slope of -45 degrees).

Tickets closed: .95 quantile

The .50 quantile tells an interesting story as well. It shows how half the tickets opened in almost any month are closed very quickly, usually in a matter of days. With the exception of the first months in the life of the project, and some peaks in early 2009, the project is closing at least half of the tickets in less than five days, and in many cases in less than 24 or 48 hours. Half the tickets is a lot of tickets, especially taking into account that many of them may be quite difficult to close, or even to reproduce (because, for example, they may correspond to an old release that is no longer maintained).
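To make the quantile charts more concrete, here is a minimal Python sketch of how such numbers could be computed. It rests on several assumptions that are ours for the purpose of illustration: tickets are represented as hypothetical `(opened, closed)` date pairs, they are grouped by the month in which they were opened, and a simple nearest-rank definition of quantile is used. The actual tools may compute this differently.

```python
import math
from collections import defaultdict
from datetime import date


def quantile(values, q):
    """Nearest-rank quantile: the smallest value such that at least
    a fraction q of the data is less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def time_to_close_quantiles(tickets, qs=(0.50, 0.95, 0.99)):
    """Group closed tickets by opening month (assumption) and compute
    quantiles of the number of days each ticket took to close."""
    by_month = defaultdict(list)
    for opened, closed in tickets:
        if closed is None:  # still open: not considered at all
            continue
        by_month[(opened.year, opened.month)].append((closed - opened).days)
    return {
        month: {q: quantile(days, q) for q in qs}
        for month, days in sorted(by_month.items())
    }


# Made-up sample: four closed tickets (1, 2, 4 and 100 days) and one open.
tickets = [
    (date(2012, 1, 2), date(2012, 1, 3)),
    (date(2012, 1, 5), date(2012, 1, 7)),
    (date(2012, 1, 10), date(2012, 1, 14)),
    (date(2012, 1, 20), date(2012, 4, 29)),
    (date(2012, 1, 25), None),
]
print(time_to_close_quantiles(tickets))
```

In this made-up sample the .50 quantile is 2 days while the .95 quantile is 100 days: a single slow ticket dominates the upper quantiles but leaves the median untouched, which is the kind of difference the charts show.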

However, we’re still looking for better ways to show parameters related to performance in closing tickets. First of all, these charts show just a part of the story: they only consider closed tickets, which means that a good time-to-close for a given month could hide many tickets that are still open. Second, they treat all tickets as equal, although a bug report is quite different from a code review request, just to mention two cases. And third, the charts are not that intuitive: interpreting them requires an explanation which, if missing, could lead to wrong conclusions.
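One possible complement for the first problem, sketched here in Python, is to report next to each month's quantiles the share of tickets opened that month that are still open. This is only an illustration under our own assumptions: tickets are hypothetical `(opened, closed)` date pairs, with `closed` set to `None` for tickets that remain open, and `open_share_by_month` is not a function of our tools.

```python
from collections import defaultdict
from datetime import date


def open_share_by_month(tickets):
    """For each opening month, return the fraction of tickets still open."""
    totals = defaultdict(int)
    still_open = defaultdict(int)
    for opened, closed in tickets:
        month = (opened.year, opened.month)
        totals[month] += 1
        if closed is None:
            still_open[month] += 1
    return {month: still_open[month] / totals[month] for month in sorted(totals)}


# Made-up sample: January has one open ticket out of four,
# February has only open tickets.
tickets = [
    (date(2012, 1, 2), date(2012, 1, 3)),
    (date(2012, 1, 5), date(2012, 1, 7)),
    (date(2012, 1, 10), date(2012, 1, 14)),
    (date(2012, 1, 25), None),
    (date(2012, 2, 1), None),
    (date(2012, 2, 9), None),
]
print(open_share_by_month(tickets))  # {(2012, 1): 0.25, (2012, 2): 1.0}
```

A month with an excellent median time-to-close but a high share of open tickets would then be easy to spot.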

Evolution of the 0.50 quantile for time to close for tickets closed per month

All in all, with this report we provide yet another example of the capabilities of the tools we’re using (all of which are free / open source software that you can use), but also of how a first view of the activity of a complex project can be produced with relatively little effort. This first view is, of course, a good starting point for our experts to dig much deeper into what is happening and how the project is reacting to specific conditions.
