SortingHat: Wizardry on Software Project Members

SortingHat is an open source tool that simplifies the management of project members' identities and related information such as gender, country and organization enrollments. It is one of the key components of GrimoireLab and is used daily at Bitergia to track and visualize information about project members.

What is an identity?

In recent years, members of software projects, and of open source projects in particular, have come to rely on a growing number of software development tools (e.g., Git, GitHub, IRC) to support, coordinate and promote their daily activities. To interact with these tools, every member is usually required to set up an identity.

An identity generally includes some combination of email address, full name and username. Examples are the commit signatures (i.e., full names and email addresses) of committers and authors in Git repositories, or GitHub and Slack usernames. However, the identities used by the same member may differ across the tools used in the project. A project member may also use more than one identity for the same tool (typically in version control systems and mailing lists), sometimes one after another, sometimes simultaneously. Moreover, an identity can be shared by several project members, for example during pair programming (i.e., the same email address for both members).

Meeting SortingHat

SortingHat was created to help companies and projects produce meaningful statistics about their communities. By linking the identities that belong to the same member, it prevents individual contributions from being split up and underestimated.

SortingHat maintains a relational database with identities and related information extracted from different software development tools via Perceval. An identity is a tuple composed of a name, an email, a username and the name of the source from which it was extracted. Tuples are converted to unique identifiers (i.e., uuids), which provide a quick means of comparing identities with each other. By default, SortingHat considers every identity to be unique. Heuristics then automatically merge identities based on perfect matches on uuid, name, email or username. In case of a positive match, one identity is randomly selected as the unique one, and the other identities are linked to it.
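The identity model described above can be illustrated with a small sketch. It is not SortingHat's actual code: the SHA-1 hash over the tuple fields, the `Identity` class and the `merge_by_email` helper are assumptions made for illustration only. For simplicity, the sketch matches on email alone and keeps the first identity seen as the unique one instead of picking one at random.

```python
import hashlib
from collections import defaultdict
from typing import NamedTuple, Optional


class Identity(NamedTuple):
    """An identity as described above: name, email, username and source."""
    name: Optional[str]
    email: Optional[str]
    username: Optional[str]
    source: str


def identity_uuid(identity: Identity) -> str:
    """Derive a stable identifier from the tuple (assumed scheme: SHA-1)."""
    fields = (identity.source, identity.email, identity.name, identity.username)
    # Normalize so that case differences do not produce distinct identifiers.
    key = ":".join((field or "").strip().lower() for field in fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def merge_by_email(identities):
    """Group identities that share exactly the same email address.

    Returns a dict mapping the uuid of the identity kept as the unique one
    to the list of uuids linked to it (itself included).
    """
    groups = defaultdict(list)
    for identity in identities:
        # Identities without an email cannot match; key them by their own uuid.
        key = identity.email.lower() if identity.email else identity_uuid(identity)
        groups[key].append(identity_uuid(identity))
    return {uuids[0]: uuids for uuids in groups.values()}


# Hypothetical sample data: the same person seen in Git and GitHub.
git_id = Identity("Jane Roe", "jroe@example.com", None, "git")
github_id = Identity("jroe", "jroe@example.com", "jroe", "github")
other_id = Identity("John Doe", "jdoe@example.com", None, "git")

merged = merge_by_email([git_id, github_id, other_id])
```

Here `merged` has two entries: the two identities of Jane Roe are linked to a single unique identity, while John Doe remains on his own.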

Identities can be manipulated interactively via shell commands, which hide low-level implementation details from the user and thus decouple the shell from the database technology in use. Each command is translated into one or more API calls that deal with the specifics of the database. Furthermore, identities can be loaded into SortingHat via batch files written in specific formats (gitdm, mailmap, Stackalytics, etc.), which speeds up identity imports for projects with large communities. Batch files are processed by parsers and inserted into the underlying database via API calls.

An overview of SortingHat is shown in the figure below. It is composed of three components: Database, Commands and API.

sortinghat.png

Database

The conceptual schema of the SortingHat database is shown below. Unique identities (i.e., Uidentities) are first-class citizens. Each has a profile, which summarizes the member's data, and can be linked to more than one identity and organization; these are automatically extracted from the software development tools of your project. Note that organizations or identities can easily be excluded from SortingHat by registering their names, emails or usernames in a matching blacklist. The filter associated with the blacklist is executed every time an identity is inserted into the database or modified.

db-schema
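The blacklist filter mentioned above could work roughly as in the following sketch. The function names, the dictionary-based identities and the case-insensitive exact matching are assumptions for illustration; SortingHat's real filter may match differently.

```python
def is_blacklisted(identity, blacklist):
    """Return True if the identity's name, email or username is blacklisted.

    `identity` is a dict with optional "name", "email" and "username" keys;
    `blacklist` is a set of lowercase strings (both hypothetical structures).
    """
    values = (identity.get("name"), identity.get("email"), identity.get("username"))
    return any(value is not None and value.lower() in blacklist for value in values)


def insert_identity(database, identity, blacklist):
    """Insert the identity unless the blacklist filter rejects it."""
    if is_blacklisted(identity, blacklist):
        return False
    database.append(identity)
    return True


# Hypothetical usage: a CI bot is excluded, a regular member is stored.
database = []
blacklist = {"ci-bot@example.com"}
bot_stored = insert_identity(database, {"email": "CI-Bot@example.com"}, blacklist)
member_stored = insert_identity(
    database, {"name": "Jane Roe", "email": "jroe@example.com"}, blacklist
)
```

Running the filter on every insertion, as the text describes, means excluded entries never reach the database. A modification path would call the same check.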

Commands

SortingHat provides more than 20 commands to manipulate identity data, including manual and heuristic-based ones. The former let you add, show and merge identities, as well as list organizations and enroll members in or withdraw them from organizations. The latter automatically perform operations such as merging identities or affiliating them to organizations. An interesting one among them is autogender, which adds gender information using the Genderize.io API. The full list of commands is available in the SortingHat repository.

API

The shell commands are processed by the SortingHat API, which is based on a three-layer architecture that promotes modularization and decoupling. The first layer consists of basic methods that interact with the database and implement CRUD operations such as additions, deletions or searches. The second layer contains composed methods, which build on the basic methods. Finally, the top layer includes complex methods that have a one-to-one correspondence with the shell commands and rely on the composed methods.
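The layering can be pictured with a toy example. This is only a sketch of the idea: the function names are invented and a plain dictionary stands in for the relational database, so none of it reflects SortingHat's real API.

```python
# The "database" is a dict mapping each identity id to its unique identity id.

# Layer 1: basic methods, the only ones that touch the database directly.
def add_identity(db, identity_id):
    """Create an identity that is initially its own unique identity."""
    if identity_id in db:
        raise ValueError(f"{identity_id} already exists")
    db[identity_id] = identity_id


def find_identities(db, unique_id):
    """Return the ids of all identities linked to a unique identity."""
    return sorted(i for i, linked_to in db.items() if linked_to == unique_id)


def update_identity(db, identity_id, unique_id):
    """Link an identity to a different unique identity."""
    db[identity_id] = unique_id


# Layer 2: composed methods, built only from basic methods.
def move_identities(db, from_unique_id, to_unique_id):
    """Relink every identity of one unique identity to another one."""
    moved = find_identities(db, from_unique_id)
    for identity_id in moved:
        update_identity(db, identity_id, to_unique_id)
    return moved


# Layer 3: complex methods, one per shell command.
def merge(db, from_unique_id, to_unique_id):
    """Back a hypothetical 'merge' command; returns how many identities moved."""
    if from_unique_id == to_unique_id:
        return 0
    if not find_identities(db, to_unique_id):
        raise KeyError(f"{to_unique_id} not found")
    return len(move_identities(db, from_unique_id, to_unique_id))


# Hypothetical usage.
db = {}
add_identity(db, "a")
add_identity(db, "b")
moved_count = merge(db, "a", "b")
```

Because only the first layer knows how data is stored, replacing the dictionary with a real database would leave the upper two layers untouched, which is the decoupling the text refers to.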

SortingHat in action

SortingHat is developed in Python 3 and tested mainly on GNU/Linux platforms, so it can be installed directly with pip:

$ pip3 install sortinghat

If needed, you can also install it from source code or from a Docker image.

Once installed, SortingHat can be used as a stand-alone program via its shell commands, as a Python library, or via HatStall.

Using SortingHat as a stand-alone program requires only some basic knowledge of GNU/Linux shell commands.

Finally, if you are not a big fan of shell commands and Python code, you can use HatStall, a web application that interacts with SortingHat through a graphical web interface. HatStall is fully open source, available as a Docker image, and can easily be plugged into GrimoireLab. The figure below shows the page of a member (Miguel Angel) of CHAOSS, a Linux Foundation project in which Bitergia is involved. The page contains his profile data, enrollments and identities, plus widgets to modify them.

hatstall-profile-cut.png

The following video shows a clear example of how SortingHat works with identities, both from the command line and from HatStall.

You can also easily embed SortingHat in your scripts and applications. As an example, have a look at ELK (an ElasticSearch wrapper and part of GrimoireLab), which uses SortingHat to process identities included in Perceval data.
 

Try it, fork it and join the wizardry

SortingHat is a valuable and mature tool to track and manage the identities of the members of your project. At Bitergia, we rely on it to handle member information coming from more than 30 different development tools, and to produce useful visualizations and reports for our customers.

SortingHat is completely free and open source, and is one of the GrimoireLab tools. Feel free to try it, fork it and submit issues or pull requests.
