AA
A. Andrei
info
Please Note
<p>This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.</p>
1 records found
1
Many real-world tasks involve data that is naturally structured as a graph, such as proteins linked by interactions, papers linked by citations, or people connected in a social network. A common goal is to predict properties of each entity based on how it is connected to the rest. The models used for this, called graph neural networks, work by allowing each node to aggregate information from its neighbours, then from its neighbours’ neighbours, and so on.
However, looking too far can blur distant nodes together. A common remedy is to train the model on a random subset of the links and discard the rest, in the hope of preventing it from over-relying on any single part of the graph. This idea was proposed and tested only on the simpler problem in which each entity carries exactly one label. Many real-world problems are not like this. A single protein may participate in multiple biological processes simultaneously, so an effective predictor must assign several labels at once. Whether dropping links still helps in this more realistic multi-label setting has not been studied.
This work addresses that question using synthetic graphs with precisely controlled structure, ranging from strongly clustered to nearly random, as well as three real biological and bibliographic datasets spanning the same range. In almost every case, dropping links harms performance rather than improving it, and the damage increases with the fraction of removed links, with a single weak exception on the most strongly multi-label, lowest-homophily real graph.
The cause is not a flaw in the technique but a property of the multi-label task itself. When multiple labels must be predicted from the same node representation, each label receives only a fraction of the learning signal it would obtain in a single-label setting. Discarding links reduces this already limited signal even further. ...
However, looking too far can blur distant nodes together. A common remedy is to train the model on a random subset of the links and discard the rest, in the hope of preventing it from over-relying on any single part of the graph. This idea was proposed and tested only on the simpler problem in which each entity carries exactly one label. Many real-world problems are not like this. A single protein may participate in multiple biological processes simultaneously, so an effective predictor must assign several labels at once. Whether dropping links still helps in this more realistic multi-label setting has not been studied.
This work addresses that question using synthetic graphs with precisely controlled structure, ranging from strongly clustered to nearly random, as well as three real biological and bibliographic datasets spanning the same range. In almost every case, dropping links harms performance rather than improving it, and the damage increases with the fraction of removed links, with a single weak exception on the most strongly multi-label, lowest-homophily real graph.
The cause is not a flaw in the technique but a property of the multi-label task itself. When multiple labels must be predicted from the same node representation, each label receives only a fraction of the learning signal it would obtain in a single-label setting. Discarding links reduces this already limited signal even further. ...
Many real-world tasks involve data that is naturally structured as a graph, such as proteins linked by interactions, papers linked by citations, or people connected in a social network. A common goal is to predict properties of each entity based on how it is connected to the rest. The models used for this, called graph neural networks, work by allowing each node to aggregate information from its neighbours, then from its neighbours’ neighbours, and so on.
However, looking too far can blur distant nodes together. A common remedy is to train the model on a random subset of the links and discard the rest, in the hope of preventing it from over-relying on any single part of the graph. This idea was proposed and tested only on the simpler problem in which each entity carries exactly one label. Many real-world problems are not like this. A single protein may participate in multiple biological processes simultaneously, so an effective predictor must assign several labels at once. Whether dropping links still helps in this more realistic multi-label setting has not been studied.
This work addresses that question using synthetic graphs with precisely controlled structure, ranging from strongly clustered to nearly random, as well as three real biological and bibliographic datasets spanning the same range. In almost every case, dropping links harms performance rather than improving it, and the damage increases with the fraction of removed links, with a single weak exception on the most strongly multi-label, lowest-homophily real graph.
The cause is not a flaw in the technique but a property of the multi-label task itself. When multiple labels must be predicted from the same node representation, each label receives only a fraction of the learning signal it would obtain in a single-label setting. Discarding links reduces this already limited signal even further.
However, looking too far can blur distant nodes together. A common remedy is to train the model on a random subset of the links and discard the rest, in the hope of preventing it from over-relying on any single part of the graph. This idea was proposed and tested only on the simpler problem in which each entity carries exactly one label. Many real-world problems are not like this. A single protein may participate in multiple biological processes simultaneously, so an effective predictor must assign several labels at once. Whether dropping links still helps in this more realistic multi-label setting has not been studied.
This work addresses that question using synthetic graphs with precisely controlled structure, ranging from strongly clustered to nearly random, as well as three real biological and bibliographic datasets spanning the same range. In almost every case, dropping links harms performance rather than improving it, and the damage increases with the fraction of removed links, with a single weak exception on the most strongly multi-label, lowest-homophily real graph.
The cause is not a flaw in the technique but a property of the multi-label task itself. When multiple labels must be predicted from the same node representation, each label receives only a fraction of the learning signal it would obtain in a single-label setting. Discarding links reduces this already limited signal even further.