
X-COBOL: A Dataset of COBOL Repositories


Dataset Description

X-COBOL is a curated dataset containing structured metadata about the development cycle of 182 COBOL projects mined from GitHub. The dataset includes metadata on the commits, issues, pull requests, and releases of the mined repositories, along with the COBOL source files present in them. We also provide metadata on the extracted COBOL files. We expect the research community to use the dataset to conduct empirical studies on code quality, COBOL software development practices, error analysis, security, and related topics. The dataset can also aid research studies and tools that support the maintenance and migration of COBOL projects. Finally, researchers can run static code analysis on the extracted COBOL source files to develop tools that support the development of COBOL projects.

The dataset is available at https://bit.ly/3IIa7Dy

Dataset Schema

[Figure: overall schema of the eight CSV files]

The dataset contains eight CSV files, each capturing different properties of the mined repositories, and a directory named COBOL_Files containing the extracted COBOL files. The figure above shows the overall schema and the type of data present in the eight CSV files.

  • repository_data.csv: The repository_data.csv describes the overall activity of the repositories using metrics such as commit frequency, merge frequency, commits, and others. It also contains repository metadata such as forks, stars, and size, along with metrics summarizing the COBOL files present in the repository, such as total COBOL files, total blank lines, and total code lines.

    Fields = [id, owner, name, url, stars, forks, description, created_at, languages, size, commit frequency, committer frequency, integrator frequency, integration frequency, merge frequency, commits, releases, last_commit, open pull requests, closed pull requests, open issues, closed issues, number of cobol files, total comment lines, total code lines]



  • commits_data.csv: The commits_data.csv contains data on all the commits made to the selected repositories, including the commit date, commit message, and commit author.

    Fields = [repo id, Commit hash, Parents hash, Author_Name, Author_Email, Author_Date, Committer_Name, Committer_Email, Committer_Date, Commit Message]



  • release_data.csv: The release_data.csv contains data on all the releases made in the selected repositories.

    Fields = [repo_id, id, title, tag_name, target_commitish, is_draft, pre_release, created_at, author, published_at, url, upload_url, html_url, tar_url, zip_url]



  • issue_data.csv: The issue_data.csv contains metadata on all the issues created in the selected repositories.

    Fields = [repo_id, id, body, closed_at, closed_by, comments, comments_url, created_at, labels, labels_url, milestone, number, pull_request, repository, state, title, updated_at, url, user, locked, assignee, assignees]



  • issue_comments_data.csv: The issue_comments_data.csv contains metadata on all the trailing comments of issues created in the selected repositories.

    Fields = [repo_id, id, body, created_at, issue_url, updated_at, url, html_url, user]



  • pull_request_data.csv: The pull_request_data.csv contains information on all the pull requests made in the selected repositories.

    Fields = [repo_id, repo_name, id, additions, assignee, assignees, base, body, closed_at, comments, comments_url, created_at, commits, commits_url, deletions, head, merge_commit_sha, mergeable, mergeable_state, merged, merged_at, merged_by, html_url, labels, milestone, number, state, title, updated_at, url, user]



  • pull_request_comments_data.csv: The pull_request_comments_data.csv contains information on the comments made by users and collaborators on the pull requests.

    Fields = [repo id, repo name, id, body, created_at, commit_id, position, path, original_position, original_commit_id, in_reply_to_id, diff_hunk, pull_request_url, updated_at, url, html_url, user]



  • cobol_files_data.csv: The cobol_files_data.csv contains metadata on the COBOL files extracted from the selected repositories.

    Fields = [filename, repository path, dataset path, repo id, repo name, blank lines, comment lines, code lines, commits]



  • COBOL_Files: The COBOL_Files directory contains the COBOL source files extracted from the mined repositories. The COBOL files of a repository are stored in a directory named AuthorName_RepositoryName in the COBOL_Files directory.
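As a rough illustration of how the CSV files relate to each other, the sketch below sums the code lines of the COBOL files per repository by matching the repo id column of cobol_files_data.csv against the id column of repository_data.csv. It assumes that the CSV headers match the field lists above literally and that repo id refers to the repository id; neither is confirmed here. The helper names and the tiny in-memory samples are hypothetical stand-ins for the real files.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath


def code_lines_per_repo(repo_csv, files_csv):
    """Sum the 'code lines' column of the COBOL files for each repository."""
    # Map repository id -> "owner/name" (assumes these header names).
    repos = {
        row["id"]: f'{row["owner"]}/{row["name"]}'
        for row in csv.DictReader(repo_csv)
    }
    totals = defaultdict(int)
    for row in csv.DictReader(files_csv):
        repo = repos.get(row["repo id"])  # assumed to reference repository 'id'
        if repo is not None:
            totals[repo] += int(row["code lines"])
    return dict(totals)


def repo_directory(owner, name):
    """Directory holding a repository's COBOL files (AuthorName_RepositoryName)."""
    # Assumption: AuthorName corresponds to the 'owner' field.
    return PurePosixPath("COBOL_Files") / f"{owner}_{name}"


# Hypothetical in-memory samples; replace with open("repository_data.csv") etc.
sample_repos = io.StringIO("id,owner,name\n1,alice,payroll\n2,bob,ledger\n")
sample_files = io.StringIO(
    "filename,repo id,code lines\n"
    "main.cbl,1,120\n"
    "util.cbl,1,30\n"
    "book.cob,2,45\n"
)

result = code_lines_per_repo(sample_repos, sample_files)
print(result)
print(repo_directory("alice", "payroll"))
```

Only the standard library is used so that the sketch runs without extra dependencies; the same join could be written with a dataframe library.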


Dataset Collection Methodology

[Figure: dataset collection methodology]

The dataset was constructed using the GitHub API and CLOC, and the entire process was implemented in Python. First, metadata of preliminary repositories with a star count greater than one was extracted using the GitHub API. Then, repositories whose description contained words such as 'course', 'resource', and others were filtered out. Repositories with no COBOL source files were removed. Finally, the remaining preliminary repositories were manually validated to construct the resultant set of COBOL repositories. The metadata of the selected repositories was collected using the GitHub API and Python. The COBOL source files present in the repositories were extracted using CLOC and Python.
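The automated filtering steps described above can be sketched roughly as follows. This is not the authors' script: the function name, the dictionary keys, and the sample records are hypothetical, and the excluded-word list contains only the two words named above because the full list is not given. The manual validation step is not represented.

```python
# The text names only these two words; the complete list is not given.
EXCLUDED_WORDS = ("course", "resource")


def is_candidate(repo):
    """Apply the three automated filters to one repository record.

    `repo` is a hypothetical dict with 'stars', 'description', and
    'cobol_files' (number of COBOL source files found) keys.
    """
    # 1. Keep only repositories with a star count greater than one.
    if repo["stars"] <= 1:
        return False
    # 2. Drop repositories whose description contains an excluded word.
    description = (repo.get("description") or "").lower()
    if any(word in description for word in EXCLUDED_WORDS):
        return False
    # 3. Drop repositories with no COBOL source files.
    return repo["cobol_files"] > 0


# Hypothetical records for illustration only.
sample = [
    {"name": "payroll", "stars": 12, "description": "Batch payroll system", "cobol_files": 40},
    {"name": "cobol-101", "stars": 50, "description": "COBOL course material", "cobol_files": 8},
    {"name": "tiny", "stars": 1, "description": "Small demo", "cobol_files": 3},
    {"name": "docs-only", "stars": 5, "description": None, "cobol_files": 0},
]

selected = [repo["name"] for repo in sample if is_candidate(repo)]
print(selected)
```

Repositories that pass these filters would still need the manual validation described above before being included.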

Dataset Usage

  • The usability of standard COBOL constructs can be analyzed to better understand the language's design. We believe that such a study can be performed using the COBOL source files present in the X-COBOL dataset. Its results would show the usage patterns of different COBOL verbs, which can aid optimizations in compilation and migration systems. Further, inexperienced COBOL developers can benefit from such a study by adopting the commonly used patterns of popular COBOL verbs.
  • Metadata of open-source COBOL projects, such as issue reports, commit messages, and pull requests present in X-COBOL, can be analyzed to understand the causes and attributes of bugs present in the projects. Furthermore, techniques used to locate and rectify the identified bugs can be recognized. This analysis can benefit COBOL developers in debugging and maintaining COBOL projects.
  • The X-COBOL dataset could be used to analyze the reliability of open-source COBOL software in the current age, considering that COBOL is a legacy programming language. Also, following from the analysis of Bogart et al. (Using Productive Collaboration Bursts to Analyze Open Source Collaboration Effectiveness), it is essential to analyze developer collaboration to measure the productivity and evolution of software. Researchers can analyze these patterns in open-source COBOL software using the X-COBOL dataset. Further, this dataset can also be used to measure open-source community support, interest, and growth concerning COBOL as a legacy programming language.
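As a starting point for the verb-usage study mentioned in the first item, the sketch below counts a handful of COBOL verbs in one source file. It is a simplification under stated assumptions: it assumes fixed-format source (comment indicator in column 7, code in columns 8 to 72), tracks only a small hand-picked set of verbs, and handles string literals with a simple pattern that does not cover continued literals. The function name and the sample program are hypothetical.

```python
import re
from collections import Counter

# A small, hand-picked subset of COBOL verbs; extend as needed.
VERBS = {
    "ACCEPT", "ADD", "CALL", "CLOSE", "COMPUTE", "DISPLAY", "EVALUATE",
    "IF", "MOVE", "OPEN", "PERFORM", "READ", "STOP", "WRITE",
}

LITERAL = re.compile(r'"[^"]*"|\'[^\']*\'')  # single-line string literals
TOKEN = re.compile(r"[A-Za-z0-9-]+")         # COBOL words may contain hyphens


def count_verbs(source):
    """Count occurrences of the tracked verbs in fixed-format COBOL source."""
    counts = Counter()
    for line in source.splitlines():
        # Assumption: '*' or '/' in column 7 marks a comment line.
        if len(line) > 6 and line[6] in "*/":
            continue
        code = LITERAL.sub(" ", line[7:72])  # columns 8-72, literals removed
        for token in TOKEN.findall(code):
            word = token.upper()
            if word in VERBS:
                counts[word] += 1
    return counts


# Hypothetical sample program for illustration only.
sample = "\n".join([
    "      * MOVE in a comment is ignored",
    "       PROCEDURE DIVISION.",
    "           MOVE 1 TO WS-COUNT.",
    '           DISPLAY "MOVE DONE".',
    "           PERFORM SHOW-TOTAL.",
    "           MOVE WS-COUNT TO WS-MOVE-COPY.",
    "           STOP RUN.",
])

verb_counts = count_verbs(sample)
print(verb_counts.most_common())
```

Tokenizing on whole hyphenated words keeps data names such as WS-MOVE-COPY from being counted as verbs. Running the function over every file under COBOL_Files and summing the counters would give dataset-wide usage frequencies.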

Contributors

Mir Sameed Ali, Nikhil M, Sridhar Chimalakonda

Preprint