X-COBOL is a curated dataset containing structured metadata about the development cycle of 182 COBOL projects mined from GitHub. The dataset includes metadata on the commits, issues, pull requests, and releases of the mined repositories, along with the COBOL source files present in them. Additionally, we provide metadata for the extracted COBOL files. We expect the research community to use the dataset to conduct empirical studies on COBOL projects concerning code quality, software development practices, error analysis, security, and so on. The dataset can also aid research studies and tools supporting the maintenance and migration of COBOL projects. Finally, the extracted COBOL source files can be used by researchers to perform static code analysis and to develop tools that support the development of COBOL projects.
The dataset is available at https://bit.ly/3IIa7Dy
The dataset contains eight CSV files, each capturing a different aspect of the mined repositories, and a directory named COBOL_Files containing the extracted COBOL files. The overall schema describing the type of data present in the eight CSV files is shown in the figure above.
Repository metadata. Fields = [id, owner, name, url, stars, forks, description, created_at, languages, size, commit frequency, committer frequency, integrator frequency, integration frequency, merge frequency, commits, releases, last_commit, open pull requests, closed pull requests, open issues, closed issues, number of cobol files, total comment lines, total code lines]

Commit metadata. Fields = [repo id, Commit hash, Parents hash, Author_Name, Author_Email, Author_Date, Committer_Name, Committer_Email, Committer_Date, Commit Message]

Release metadata. Fields = [repo_id, id, title, tag_name, target_commitish, is_draft, pre_release, created_at, author, published_at, url, upload_url, html_url, tar_url, zip_url]

Issue metadata. Fields = [repo_id, id, body, closed_at, closed_by, comments, comments_url, created_at, labels, labels_url, milestone, number, pull_request, repository, state, title, updated_at, url, user, locked, assignee, assignees]

Issue comment metadata. Fields = [repo_id, id, body, created_at, issue_url, updated_at, url, html_url, user]

Pull request metadata. Fields = [repo_id, repo_name, id, additions, assignee, assignees, base, body, closed_at, comments, comments_url, created_at, commits, commits_url, deletions, head, merge_commit_sha, mergeable, mergeable_state, merged, merged_at, merged_by, html_url, labels, milestone, number, state, title, updated_at, url, user]

Pull request comment metadata. Fields = [repo id, repo name, id, body, created_at, commit_id, position, path, original_position, original_commit_id, in_reply_to_id, diff_hunk, pull_request_url, updated_at, url, html_url, user]

COBOL file metadata. Fields = [filename, repository path, dataset path, repo id, repo name, blank lines, comment lines, code lines, commits]
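Since all CSV files share a repository identifier, they can be joined for analysis. The sketch below, using pandas, shows how per-file rows could be combined with repository metadata; the inline sample rows (and any file names you would substitute for them) are illustrative assumptions, not part of the dataset's documentation.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for two of the dataset's CSV files;
# in practice, read the real files with pd.read_csv("<path to csv>").
repos_csv = io.StringIO(
    "id,name,stars\n"
    "1,sample-cobol-repo,12\n"
)
files_csv = io.StringIO(
    "filename,repo id,comment lines,code lines\n"
    "MAIN.CBL,1,40,200\n"
    "UTIL.CBL,1,10,50\n"
)

repos = pd.read_csv(repos_csv)
files = pd.read_csv(files_csv)

# Join per-file rows to repository metadata via the repository identifier.
merged = files.merge(repos, left_on="repo id", right_on="id")

# Example analysis: comment density of each COBOL file.
merged["comment density"] = merged["comment lines"] / merged["code lines"]
print(merged[["filename", "comment density"]])
```

The repo-id column acts as a foreign key from every other CSV back to the repository metadata file, so the same pattern extends to commits, issues, pull requests, and releases.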
The dataset was constructed using the GitHub API and CLOC, with the entire process implemented in Python. First, metadata of preliminary repositories with a star count greater than one was extracted using the GitHub API. Then, repositories whose descriptions contained words such as 'course' and 'resource' were filtered out, and repositories with no COBOL source files were removed. Finally, the preliminary repositories were manually validated to construct the resultant set of COBOL repositories. The metadata of the selected repositories was collected using the GitHub API and Python, and the COBOL source files present in the repositories were extracted using CLOC and Python.
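The description-based filtering step can be sketched as follows. The repository records and the exact keyword list are illustrative assumptions: the text only names 'course' and 'resource' among the filtered words, and the real pipeline draws its records from the GitHub API rather than a hard-coded list.

```python
# Hypothetical repository records; in the actual pipeline these come
# from GitHub API search results.
repos = [
    {"name": "legacy-banking", "description": "COBOL batch processing system"},
    {"name": "cobol-course", "description": "Course materials for learning COBOL"},
    {"name": "awesome-cobol", "description": "A curated resource list for COBOL"},
]

# Keywords indicating non-software repositories (tutorials, link lists).
# Only 'course' and 'resource' are named in the text; other words
# used by the authors would be added here similarly.
FILTER_WORDS = ("course", "resource")

def is_software_repo(repo):
    """Keep a repository only if its description mentions no filter word."""
    description = (repo.get("description") or "").lower()
    return not any(word in description for word in FILTER_WORDS)

filtered = [r for r in repos if is_software_repo(r)]
print([r["name"] for r in filtered])  # only 'legacy-banking' survives
```

A case-insensitive substring check like this deliberately over-filters (e.g. 'resources' also matches 'resource'), which is why the text describes a final manual validation pass over the surviving repositories.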