Graal: The Quest for Source Code Knowledge

Graal is an open source tool for customizable, scalable and incremental analysis of source code. I’ve been working on it as a side project alongside my work at Bitergia on GrimoireLab. Let’s have a look at it.

Extending GrimoireLab with source code analysis

Development on GrimoireLab never stops. After the recent advances in visualizations and metrics, I am happy to share Graal with you. It enables the platform to monitor the evolution of source code information, including quantitative analysis (LOC, cyclomatic complexity, etc.).

It combines and manipulates the output of existing source code analysis tools, and relies on Perceval, one of GrimoireLab’s key components.

Perceval simplifies the collection of project data by covering more than 30 well-known tools and platforms used in software development. Through its backends, it retrieves collections of homogeneous items (i.e., categories) from multiple sources in an easy and consistent way. For instance, issues and pull requests are the categories extracted by the GitHub backend, e-mail messages are obtained via one of the supported backends (e.g., Hyperkitty, MBox archives, NNTP and Pipermail) and commits are collected by the Git backend.

However, Perceval’s Git backend does not perform source code analysis; it only returns the commit metadata. Perceval delivers its results in a flexible JSON format that is easy to extend, and it supports incremental retrieval.


{"backend_name": "Git",
"backend_version": "0.10.2",
"category": "commit",
"data": {
  "Author": "Santiago Duenas ",
  "AuthorDate": "Tue Aug 18 18:08:27 2015 +0200",
  "Commit": "Santiago Duenas ",
  "CommitDate": "Tue Aug 18 18:08:27 2015 +0200",
  "commit": "dc78c254e464ff334892e0448a23e4cfbfc637a3",
  "files": [{"action": "A",
    "added": "10",
    "file": ".gitignore", ... }, ...],
  "message": "Initial import", ...
  "parents": [],
  "refs": []},
"origin": "...github.com/chaoss/grimoirelab-perceval", ...}

The listing above shows an excerpt of a JSON document produced by Perceval. Each document contains metadata useful for filtering and debugging (e.g., backend version and category) as well as commit-related information: the authored and committed dates, the names of the author and committer, the hash and message of the commit, and the references (i.e., tags or branches) where the commit appears. It also contains the hashes of the commit parents and the list of file paths affected by the commit, with the corresponding actions.
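To make the structure concrete, here is a minimal sketch that reads such a document with Python’s standard json module. The sample is a hand-trimmed version of the excerpt above (elided fields are simply left out), and the helper name summarize_commit is hypothetical; it is not part of Perceval.

```python
import json

# Hand-trimmed version of the excerpt above; elided fields are omitted.
RAW_DOCUMENT = """
{"backend_name": "Git",
 "backend_version": "0.10.2",
 "category": "commit",
 "data": {
   "AuthorDate": "Tue Aug 18 18:08:27 2015 +0200",
   "CommitDate": "Tue Aug 18 18:08:27 2015 +0200",
   "commit": "dc78c254e464ff334892e0448a23e4cfbfc637a3",
   "files": [{"action": "A", "added": "10", "file": ".gitignore"}],
   "message": "Initial import",
   "parents": [],
   "refs": []}}
"""


def summarize_commit(document):
    """Pick out the fields most useful for filtering (hypothetical helper)."""
    data = document["data"]
    return {
        "backend": document["backend_name"],
        "category": document["category"],
        "hash": data["commit"],
        "message": data["message"],
        "files": [entry["file"] for entry in data["files"]],
        # A commit without parents is the root of the history.
        "is_root": not data["parents"],
    }


summary = summarize_commit(json.loads(RAW_DOCUMENT))
print(summary)
```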

Extending Perceval’s Git backend with Graal

Graal repository in the GrimoireLab structure

Graal leverages the incremental functionality provided by Perceval and extends the logic that handles Git repositories so that their source code can be processed. The overall view of Graal and its connection with Perceval is summarized in the figure above: the Git backend creates a local mirror of a Git repository (local or remote) and fetches its commits in chronological order. Several parameters control the execution; for instance, from_date and to_date select commits authored since and before a given date, branches fetches commits only from specific branches, and latest_items returns only those commits that are new since the last fetch operation.
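The date parameters described above can be illustrated with a small stand-alone sketch. This is not how Perceval implements them; it only mimics the semantics on already-fetched documents, assuming that from_date is inclusive and to_date is exclusive. The function name in_window and the shortened hashes are made up for the example.

```python
from datetime import datetime, timezone

# Format of the AuthorDate strings shown in the Perceval excerpt.
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def in_window(document, from_date=None, to_date=None):
    """Return True if the commit was authored inside [from_date, to_date)."""
    authored = datetime.strptime(document["data"]["AuthorDate"], GIT_DATE_FORMAT)
    if from_date is not None and authored < from_date:
        return False
    if to_date is not None and authored >= to_date:
        return False
    return True


# Two illustrative documents; the hashes are placeholders.
DOCUMENTS = [
    {"data": {"commit": "aaa", "AuthorDate": "Tue Aug 18 18:08:27 2015 +0200"}},
    {"data": {"commit": "bbb", "AuthorDate": "Mon May 28 10:15:53 2018 +0200"}},
]

since_2018 = datetime(2018, 1, 1, tzinfo=timezone.utc)
selected = [d["data"]["commit"] for d in DOCUMENTS if in_window(d, from_date=since_2018)]
print(selected)
```

The bounds must be timezone-aware datetime objects, because the parsed commit dates carry their own UTC offset.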

Graal extends Perceval’s Git backend by creating (and pruning) a working tree, which makes it possible to perform checkout operations that are not possible on a Git mirror. It also adds parameters that drive the analysis: in_paths and out_paths filter files and directories of the repository in or out, while others set the entrypoint and define the level of detail of the analysis (useful when analyzing large software projects).

Following the philosophy of Perceval, the output of the Git backend execution is a list of JSON documents (one per commit). Therefore, Graal intercepts each document, replaces some metadata information (e.g., backend name, category) and enables the user to perform the following steps: (i) filter, (ii) analyze and (iii) post-process, which are described below.

Filter

The filtering is used to select or discard commits based on the information available in the JSON document and/or via the Graal parameters (e.g., the commits authored by a given user or targeting a given software component). For any selected commit, Graal executes a checkout on the working tree using the commit hash, thus setting the state of the working tree at that given revision.
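As a rough illustration of this step, the sketch below selects commits by author and by the paths they touch. The function keep_commit, its parameters and the sample data are hypothetical; the in_paths and out_paths arguments only borrow the names of the Graal parameters, and the rule "keep the commit if at least one file survives the path filters" is an assumption made for the example.

```python
def keep_commit(document, author=None, in_paths=(), out_paths=()):
    """Decide whether a commit document should be analyzed (illustrative rule)."""
    data = document["data"]
    if author is not None and author not in data.get("Author", ""):
        return False
    files = [entry["file"] for entry in data.get("files", [])]
    if out_paths:
        # Discard files under any excluded prefix.
        files = [f for f in files if not f.startswith(tuple(out_paths))]
    if in_paths:
        # Keep only files under one of the included prefixes.
        files = [f for f in files if f.startswith(tuple(in_paths))]
    return bool(files)


# Made-up commits; names and hashes are placeholders.
COMMITS = [
    {"data": {"commit": "aaa", "Author": "Alice Example",
              "files": [{"file": "perceval/backend.py"}]}},
    {"data": {"commit": "bbb", "Author": "Bob Example",
              "files": [{"file": "docs/index.md"}]}},
    {"data": {"commit": "ccc", "Author": "Alice Example",
              "files": [{"file": "tests/test_backend.py"}]}},
]

selected = [c["data"]["commit"] for c in COMMITS
            if keep_commit(c, author="Alice", out_paths=["tests/"])]
print(selected)
```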

Analyze

The analysis takes the document and the current working tree and enables the user to set up ad-hoc source code analysis by plugging existing tools through system calls or their Python interfaces, when possible. The results of the analysis are parsed and manipulated by the user and then automatically embedded in the JSON document.

It is worth noting that in this step the user can rely on some predefined functionalities of Graal to deal with the repository snapshot (e.g., listing files, creating archives).
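The sketch below gives a feel for the analysis step, under simplifying assumptions. Instead of calling a real tool, it uses a naive line classifier written for this example, and a temporary directory stands in for the working tree checked out at the commit. The function names, the placeholder hash and the "analysis" key layout are illustrative, not Graal’s actual interface.

```python
import os
import tempfile


def count_lines(source):
    """Naive classifier: blank lines, '#' comment lines, everything else is LOC."""
    stats = {"blanks": 0, "comments": 0, "loc": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            stats["blanks"] += 1
        elif stripped.startswith("#"):
            stats["comments"] += 1
        else:
            stats["loc"] += 1
    return stats


def analyze(document, worktree):
    """Embed per-file statistics for the Python files touched by the commit."""
    results = []
    for entry in document["data"]["files"]:
        path = os.path.join(worktree, entry["file"])
        if not entry["file"].endswith(".py") or not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            stats = count_lines(handle.read())
        stats["file_path"] = entry["file"]
        results.append(stats)
    document["data"]["analysis"] = results
    return document


# A temporary directory plays the role of the checked-out working tree.
with tempfile.TemporaryDirectory() as worktree:
    with open(os.path.join(worktree, "example.py"), "w", encoding="utf-8") as handle:
        handle.write("# a comment\n\nprint('hello')\n")
    doc = {"data": {"commit": "0000000",  # placeholder hash
                    "files": [{"file": "example.py"}, {"file": "README.md"}]}}
    analyzed = analyze(doc, worktree)

print(analyzed["data"]["analysis"])
```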

Post-process

In the final step, the inflated JSON document can be optionally processed to alter (e.g., renaming, removing) its attributes, thus granting the user complete control over the output of Graal executions.
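A post-processing step of this kind might look like the following sketch, which removes some attributes and renames others. The function post_process, its defaults and the sample document are assumptions made for illustration; Graal’s own hook may be shaped differently.

```python
def post_process(document, drop=("files", "refs", "parents"), rename=None):
    """Return a copy of the document with attributes removed or renamed."""
    data = {key: value for key, value in document["data"].items() if key not in drop}
    for old, new in (rename or {}).items():
        if old in data:
            data[new] = data.pop(old)
    # Build a new document so that the input is left untouched.
    return {**document, "data": data}


# Made-up document with placeholder values.
doc = {
    "backend_name": "CoCom",
    "data": {"commit": "abc", "message": "m", "files": [],
             "refs": [], "parents": [], "analysis": []},
}

result = post_process(doc, rename={"analysis": "metrics"})
print(sorted(result["data"]))
```

Returning a new dictionary rather than mutating the input keeps the original document available, which helps when comparing the output before and after the step.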

Graal Backends

Several Graal backends have already been developed. They build on existing source code analysis tools, which are executed via system calls or through their Python interfaces.

Currently, the backends mostly target Python code, but other backends can easily be developed to cover other programming languages.

The currently available backends are:

  • CoCom gathers data about code complexity (e.g., cyclomatic complexity, LOC) from projects written in popular programming languages such as C/C++, Java, Scala, JavaScript, Ruby and Python. It relies on Cloc and Lizard; the former is a Linux package used to count blank lines, comment lines and LOC, while the latter is a code complexity analyzer written in Python.
  • CoDep extracts package and class dependencies of a Python module and serializes them as JSON structures composed of edges and nodes, thus easing the bridging with front-end technologies for graph visualizations. It combines PyReverse, a reverse engineering tool able to generate UML-like diagrams, with NetworkX, a library to create, manipulate and study complex networks.
  • CoQua retrieves code quality insights, such as checks on line length, well-formed variable names, unused imported modules and code clones. It uses PyLint, a code, bug and quality checker for Python.
  • CoVuln scans the code to identify security vulnerabilities such as potential SQL and Shell injections, hard-coded passwords and weak cryptographic key size. It relies on Bandit, a tool designed to find common security issues in Python code.
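To give an idea of the node-and-edge structures mentioned for CoDep, the following sketch builds a tiny import graph from a Python source string. It deliberately uses the standard ast module rather than PyReverse or NetworkX, and the output layout ("nodes" and "edges" with "source" and "target") is an assumption, not CoDep’s actual schema.

```python
import ast
import json


def import_graph(module_name, source):
    """Collect the modules imported by `source` as a node/edge structure."""
    targets = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets.add(node.module)
    return {
        "nodes": sorted(targets | {module_name}),
        "edges": [{"source": module_name, "target": t} for t in sorted(targets)],
    }


# Illustrative module body; "example" is a made-up module name.
SOURCE = "import json\nfrom os import path\n"
graph = import_graph("example", SOURCE)
print(json.dumps(graph, indent=2))
```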

Once installed, Graal backends can be used as a stand-alone program or Python library. Let’s showcase these two types of executions by fetching code complexity data using the CoCom backend.

The listing below shows an excerpt of a JSON document produced by CoCom.


{"backend_name": "CoCom",
"backend_version": "0.2.1",
"category": "code_complexity",
"data": { ...,
  "AuthorDate": "Mon May 28 10:15:53 2018 +0200",
  "CommitDate": "Tue May 29 11:21:23 2018 +0200",
  "commit": "dc78c254e464ff334892e0448a23e4cfbfc637a3",
  "analysis": [{
    "avg_ccn": 2.42,
    "avg_loc": 6.27,
    "avg_tokens": 44.36,
    "blanks": 138,
    "ccn": 80,
    "comments": 153,
    "file_path": "perceval/backend.py",
    "loc": 341,
    "num_funs": 33,
    "tokens": 1867}, ...],
  "message": "Increase-coverage-pipermail..."},
"origin": "...github.com/chaoss/grimoirelab-perceval", ...}

As can be seen, the document contains the dates, hash and message of the commit (files, references and parents have been stripped out in the post-process step), metadata, and the result of the analysis for the file perceval/backend.py. That file has 341 LOC, 153 comment lines and 138 blank lines, and a total cyclomatic complexity of 80. It contains 33 functions, which have on average a cyclomatic complexity of 2.42, 6.27 LOC and 44.36 tokens.
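The per-file entries lend themselves to simple aggregation. The sketch below sums a few metrics over an analysis list and checks that the reported avg_ccn matches ccn divided by num_funs (80 / 33 ≈ 2.42). The first entry copies values from the excerpt above; the second entry and the helper name totals are hypothetical and exist only to make the sum visible.

```python
ANALYSIS = [
    # Values copied from the excerpt above.
    {"file_path": "perceval/backend.py", "loc": 341, "comments": 153,
     "blanks": 138, "ccn": 80, "num_funs": 33, "avg_ccn": 2.42},
    # Hypothetical second file, added only for illustration.
    {"file_path": "example/module.py", "loc": 100, "comments": 20,
     "blanks": 10, "ccn": 12, "num_funs": 4},
]


def totals(analysis, keys=("loc", "comments", "blanks", "ccn", "num_funs")):
    """Sum the selected metrics over all analyzed files."""
    return {key: sum(entry.get(key, 0) for entry in analysis) for key in keys}


summary = totals(ANALYSIS)
print(summary)

# The average cyclomatic complexity per function follows from the totals.
first = ANALYSIS[0]
print(round(first["ccn"] / first["num_funs"], 2))
```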

Installation and use of Graal

Graal is developed and tested mainly on GNU/Linux platforms. It is therefore very likely to work out of the box on any Linux-like (or Unix-like) platform, provided the right version of Python is available.

The listing below shows how to install and uninstall Graal on your system.


# To install, run:
$ git clone https://github.com/valeriocos/graal
$ cd graal
$ python3 setup.py build
$ python3 setup.py install
# To uninstall, run:
$ pip3 uninstall graal

Currently, the only way of installing Graal consists of cloning the GitHub repository hosting the tool and using the setup script, while uninstalling the tool can be easily achieved by relying on the pip management system.

Stand-alone Program

Graal can be used as a stand-alone program, which requires only some basic knowledge of GNU/Linux shell commands. The listing below shows how easy it is to fetch code complexity information from a Git repository.

As can be seen, the CoCom backend requires the URL where the repository is located (https://github.com/chaoss/grimoirelab-perceval) and the local path where to mirror the repository (/tmp/graal-cocom). Then, the JSON documents produced are redirected to the file graal-cocom.test.


$ graal cocom https://github.com/chaoss/grimoirelab-perceval \
    --git-path /tmp/graal-cocom > /graal-cocom.test
[2018-05-30 18:22:35,643] - Starting the quest for the Graal.
[2018-05-30 18:22:39,958] - Git worktree /tmp/... created!
[2018-05-30 18:22:39,959] - Fetching commits: ...
[2018-05-31 04:51:56,111] - Git worktree /tmp/... deleted!
[2018-05-31 04:51:56,112] - Fetch process completed: 1187 commits inspected
[2018-05-31 04:51:56,112] - Quest completed.

The remaining messages in the listing are printed during the execution. Interesting optional arguments are from-date, which is inherited from Perceval and fetches commits from a given date onwards, worktree-path, which sets the path of the working tree, and details, which enables fine-grained analysis by returning complexity information for methods/functions.

Python Library

Graal’s functionality can be embedded in Python scripts. Again, the effort of using Graal is minimal: the user only needs some knowledge of Python scripting.

The listing below shows how to use Graal in a script.


#!/usr/bin/env python3
from graal.backends.core.cocom import CoCom

# URL for the git repo to analyze
repo_uri = 'http://github.com/chaoss/grimoirelab-perceval'
# directory where to mirror the repo
repo_dir = '/tmp/graal-cocom'

# CoCom object initialization
cc = CoCom(uri=repo_uri, gitpath=repo_dir)
# fetch all commits
commits = [commit for commit in cc.fetch()]

The graal.backends.core.cocom module is imported at the beginning of the file, then the repo_uri and repo_dir variables are set to the URI of the Git repository and the local path where it is mirrored. These variables are used to initialize a CoCom object. In the last line of the script, the commits inflated with the result of the analysis are retrieved using the fetch method. The fetch method inherits its arguments from Perceval, so it optionally accepts two datetime objects to gather only those commits after and before a given date, a list of branches to focus on specific development activities, and a flag to collect only the commits available after the last execution.

Exploitation

kibana-graal

The JSON documents obtained by Graal and persisted to Elasticsearch can be visualized with Kibana dashboards or exploited using common data analytics tools such as Pandas and R, which adds data visualization and analysis support to the approach. Furthermore, the Graal documents can easily be combined with the Perceval ones (by storing both in the same database), thus enabling cross-cutting analysis of software project data.
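Before loading the documents into a data analytics library, it usually helps to flatten them into one row per commit and file. The sketch below does this with the standard csv module only; the function flatten and the chosen columns are assumptions, and the sample document reuses values from the CoCom excerpt.

```python
import csv
import io


def flatten(documents):
    """Yield one flat row per (commit, analyzed file) pair."""
    for document in documents:
        data = document["data"]
        for entry in data.get("analysis", []):
            yield {
                "commit": data["commit"],
                "commit_date": data["CommitDate"],
                "file_path": entry["file_path"],
                "loc": entry["loc"],
                "ccn": entry["ccn"],
            }


# Trimmed document based on the CoCom excerpt.
DOCUMENTS = [{
    "data": {
        "commit": "dc78c254e464ff334892e0448a23e4cfbfc637a3",
        "CommitDate": "Tue May 29 11:21:23 2018 +0200",
        "analysis": [{"file_path": "perceval/backend.py", "loc": 341, "ccn": 80}],
    },
}]

rows = list(flatten(DOCUMENTS))

# Write the rows as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The resulting rows are plain dictionaries, so they can be handed to other tabular tools as they are.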

The figure above shows a dashboard composed of two charts and some widgets that use the code complexity data, such as LOC, comment lines, number of functions and cyclomatic complexity, obtained via the CoCom backend.

The top chart shows the evolution of LOC, comment lines, number of functions and cyclomatic complexity across the commit history of the Perceval repository. The LOC in Perceval spiked around the first year of the project due to the addition of new backends. A similar spike occurred in comment lines and cyclomatic complexity, but to a lesser extent. Conversely, the number of functions has increased smoothly over time. The widgets in the middle of the dashboard summarize the current values of code complexity in Perceval. The latest snapshot of Perceval (i.e., May 16th, 2018) contains 22,650 Python LOC, 6,095 comment lines and 1,580 functions, with a total cyclomatic complexity of 2,952.

Finally, the bottom chart presents the evolution of the average code complexity per file. Except for an initial fast growth, LOC, comment lines, cyclomatic complexity and number of functions have grown slowly. The shape of the LOC evolution is particularly interesting: the high and low peaks correspond to the addition of new features (e.g., new backends or enhancements of existing ones) and to refactorings, which shows that the project has been constantly active. In the latest snapshot, every Python file has on average 310 LOC, 83 comment lines, 22 functions and a cyclomatic complexity of 40.

Graal’s future?

Over the coming weeks, this side project will be integrated into Bitergia’s set of repositories, and we’ll look into its potential integration into the Bitergia Analytics platform.

And last, but not least, Graal will be presented at the 18th IEEE International Working Conference on Source Code Analysis and Manipulation in September.

Graal’s future will mainly be decided by the development community interested in integrating new source code analysis tools, extending current functionality, or exploring further integration with existing tools in CHAOSS or the TODO Group. What do you think?

Feedback and comments are more than welcome!
