Per-Node π‘˜-Hop Label Homophily Predicts GNN Accuracy in Multi-Label Node Classification

Bachelor Thesis (2026)
Author(s)

V. Guzun (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

M. Khosla – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

E. Congeduti – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2026
Language
English
Graduation Date
19-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Faculty
Electrical Engineering, Mathematics and Computer Science
Downloads counter
3
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Multi-label node classification asks a graph neural network to assign each node a set of labels. For single-label classification the success of these networks is usually attributed to label homophily, the tendency of connected nodes to share labels; for the multi-label case Zhao et al. introduced a homophily metric and reported that it tracks performance across datasets. That metric measures label similarity only between directly connected nodes, even though graph neural networks aggre-gate information from wider neighbourhoods. It is therefore unknown whether label similarity at larger distances also predicts how accurately a network classifies a node, and at which distance that signal is strongest. We generalise the metric to a per-node label-similarity score at any chosen distance. We then correlate this score node-by-node with trained-model accuracy on three benchmark datasets, repeat the analysis across model depths to separate a data-driven explanation from an architectural one, and test the relationship causally with two synthetic graph generators that plant label structure at controlled distances. The predictive signal is consistently strongest at distance two, not at directly adjacent nodes, across two standard architectures and up to several thousand test nodes per dataset; the most predictive distance does not shift with model depth, so it is a property of the graphs rather than of the model; and controlled synthetic experiments confirm a causal effect of two-hop homophily on accuracy in graphs built to isolate it, with weaker and partially confounded support at three hops. Direct measurements further show that measuring similarity at an exact distance, rather than over a cumulative neighbourhood, is the right unit for identifying that scale. All code, data splits, and figures are released for reproducibility.

Files

RESEARCH_PAPER-1.pdf
(pdf | 1.21 Mb)
License info not available