Uncovering the Secrets of the Maven Repository

Analysis of Library Sizes in Maven Central

Bachelor Thesis (2023)
Author(s)

N.H.C. Tomassen (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Mehdi Keshani – Mentor (TU Delft - Software Engineering)

S. Proksch – Graduation committee member (TU Delft - Software Engineering)

Soham Chakraborty – Coach (TU Delft - Programming Languages)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Niels Tomassen
Publication Year
2023
Language
English
Graduation Date
08-08-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This research explores how artifact sizes vary in Maven Central, a repository containing a large collection of Java artifacts. The analysis sheds light on coding habits and dependency management within the Maven Central ecosystem and underlines the importance of managing artifact sizes effectively. It also offers insights to library maintainers and to clients who download libraries. For example, it lets us estimate the average amount of space required to download 100 libraries.
The analysis selects a single version of each artifact in Maven Central and extracts metadata from the corresponding files.
The results show that the average artifact size is 1447 KB, although this average is heavily influenced by a few exceptionally large artifacts. Approximately 86% of the artifacts are smaller than 400 KB, indicating that the majority of artifacts are relatively lightweight.
The large artifacts identified in the analysis fall predominantly into two categories. The first contains extensive projects with a substantial number of files, while the second consists of machine learning or big data projects that bundle massive data files.
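
The arithmetic behind these figures can be illustrated with a minimal sketch. Only the 1447 KB mean and the 400 KB threshold come from the abstract; the list of sizes and the helper names `expected_download_kb` and `share_below` are hypothetical, invented here to show how a single very large artifact can pull the mean far above the typical size. It is not the thesis's data or method.

```python
from statistics import mean, median

# Mean artifact size reported in the abstract, in KB.
REPORTED_MEAN_KB = 1447


def expected_download_kb(n_libraries, mean_kb=REPORTED_MEAN_KB):
    """Estimate total download size, assuming each library has the mean size."""
    return n_libraries * mean_kb


def share_below(sizes_kb, threshold_kb):
    """Fraction of artifacts strictly smaller than the threshold."""
    return sum(1 for s in sizes_kb if s < threshold_kb) / len(sizes_kb)


# HYPOTHETICAL sizes in KB (not thesis data): nine small artifacts, one huge one.
hypothetical_sizes_kb = [20, 35, 50, 80, 120, 150, 200, 300, 390, 13000]

if __name__ == "__main__":
    print("100 libraries (KB):", expected_download_kb(100))
    print("mean (KB):", mean(hypothetical_sizes_kb))
    print("median (KB):", median(hypothetical_sizes_kb))
    print("share below 400 KB:", share_below(hypothetical_sizes_kb, 400))
```

With the reported mean, 100 libraries come to 144,700 KB, or roughly 145 MB if 1 MB is taken as 1000 KB. In the hypothetical sample, the median stays far below the mean, which is the kind of skew the abstract describes.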

Files

Tomassen_Niels_thesis.pdf
(pdf | 0.581 Mb)
License info not available