BotHunter

An Approach to Detect Software Bots in GitHub

Conference Paper (2022)

Author(s)

Ahmad Abdellatif (Concordia University)

Mairieli Wessel (TU Delft - Software Engineering)

Igor Steinmacher (Universidade Tecnológica Federal Do Paraná (UTFPR))

Marco Aurélio Gerosa (Northern Arizona University)

Emad Shihab (Concordia University)

Research Group

Software Engineering

DOI related publication

https://doi.org/10.1145/3524842.3527959

Empirical Software Engineering, Software Bots

To reference this document use:

https://resolver.tudelft.nl/uuid:a1c4fc01-5e10-46d9-b531-b718070ed63e

Publication Year

2022

Language

English

Pages (from-to)

6-17

ISBN (electronic)

9781450393034

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Bots have become popular in software projects because they play critical roles, from running tests to fixing bugs and vulnerabilities. However, the large number of software bots makes it harder for practitioners and researchers to distinguish human accounts from bot accounts, a distinction needed to avoid bias in data-driven studies. Researchers have developed several approaches to identify bots at specific activity levels (issue/pull request or commit), but these consider a single repository and disregard features that have proven effective in other domains. To address this gap, we propose a machine learning-based approach that identifies bot accounts regardless of their activity level. We selected and extracted 19 features related to the account's profile information, activities, and comment similarity. Then, we evaluated the performance of five machine learning classifiers on a dataset of more than 5,000 GitHub accounts. Our results show that the Random Forest classifier performs best, with an F1-score of 92.4% and an AUC of 98.7%. Furthermore, the account profile information (e.g., account login) contains the most relevant features for identifying the account type. Finally, we compare our Random Forest classifier with state-of-the-art approaches, and our results show that our model outperforms them in identifying the account type regardless of activity level.
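
To make the feature families named in the abstract more concrete, the following is a minimal Python sketch of two illustrative features: one derived from the account login and one measuring comment similarity. It is not the authors' implementation. The function names, the substring check on the login, and the choice of Jaccard similarity over word sets are all assumptions made for illustration; this record does not state the paper's actual definitions of its 19 features.

```python
from itertools import combinations
from typing import Dict, List
import re


def login_mentions_bot(login: str) -> bool:
    """Hypothetical profile feature: does the login contain the substring 'bot'?"""
    return "bot" in login.lower()


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two comments (assumed metric)."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty comments are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_comment_similarity(comments: List[str]) -> float:
    """Average pairwise similarity; 0.0 when there are fewer than two comments."""
    pairs = list(combinations(comments, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def extract_features(login: str, comments: List[str]) -> Dict[str, float]:
    """Build a small, illustrative feature vector for one account."""
    return {
        "login_mentions_bot": float(login_mentions_bot(login)),
        "mean_comment_similarity": mean_comment_similarity(comments),
    }
```

A feature dictionary like this could be fed to any classifier, such as a Random Forest. The substring check is deliberately crude (a human login such as "abbot" would also match), which is one reason a classifier would combine it with other signals.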

Files

3524842.3527959.pdf

(pdf | 0.751 Mb)

- Embargo expired on 01-06-2023

License info not available