X. Zhan
15 records found
Multiple network embedding algorithms have been proposed to perform the prediction of missing or future links in complex networks. However, we lack the understanding of how network topology affects their performance, or which algorithms are more likely to perform better given the topological properties of the network. In this paper, we investigate how the clustering coefficient of a network, i.e., the probability that the neighbours of a node are also connected, affects network embedding algorithms’ performance in link prediction, in terms of the AUC (area under the ROC curve). We evaluate classic embedding algorithms, i.e., Matrix Factorisation, Laplacian Eigenmaps and node2vec, in both synthetic networks and (rewired) real-world networks with variable clustering coefficient. Specifically, a rewiring algorithm is applied to each real-world network to change the clustering coefficient while keeping key network properties. We find that a higher clustering coefficient tends to lead to a higher AUC in link prediction, except for Matrix Factorisation, which is not sensitive to the change of clustering coefficient. To understand such influence of the clustering coefficient, we (1) explore the relation between the link rating (probability that a node pair is the missing link) derived from the aforementioned algorithms and the number of common neighbours of the node pair, and (2) evaluate these embedding algorithms’ ability to reconstruct the original training (sub)network. All the network embedding algorithms that we tested tend to assign higher likelihood of connection to node pairs that share an intermediate or high number of common neighbours, independently of the clustering coefficient of the training network. Then, the predicted networks will have more triangles, thus a higher clustering coefficient. As the clustering coefficient increases, all the algorithms but Matrix Factorisation could also better reconstruct the training network. These two observations may partially explain why increasing the clustering coefficient improves the prediction performance.
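The two quantities this abstract relates, the clustering coefficient and the number of common neighbours of a node pair, can be illustrated with a minimal sketch. It assumes an undirected, unweighted network and uses the average local clustering coefficient as one common reading of the definition given above; the helper names and the toy edge list are illustrative and not taken from the paper.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to close
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * linked / (k * (k - 1))


def average_clustering(adj):
    """Average of the local clustering coefficient over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def common_neighbours(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


# Toy network (hypothetical): a triangle 0-1-2 with a pendant node 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = build_adjacency(edges)
avg_c = average_clustering(adj)
cn_0_3 = common_neighbours(adj, 0, 3)
```

In this toy network the unlinked pair (0, 3) shares one common neighbour (node 2), so adding that link would close a triangle and raise the clustering coefficient.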
Identifying important nodes in networks is essential to analysing their structure and understanding their dynamical processes. In addition, myriad real systems are time-varying and can be represented as temporal networks. Motivated by classic gravity in physics, we propose a temporal gravity model to identify important nodes in temporal networks. In gravity, the attraction between two objects depends on their masses and distance. For the temporal network, we treat basic node properties (e.g., static and temporal properties) as the mass and temporal characteristics (i.e., fastest arrival distance and temporal shortest distance) as the distance. Experimental results on 10 real datasets show that the temporal gravity model outperforms baseline methods in quantifying the structural influence of nodes. When using the temporal shortest distance as the distance between two nodes, the proposed model is more robust and more accurately determines the node spreading influence than baseline methods. Furthermore, when using the temporal information to quantify the mass of each node, we found that a novel robust metric can be used to accurately determine the node influence regarding both network structure and information spreading.
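The gravity analogy can be sketched as follows. The abstract does not state the exact formula, so this sketch assumes the classic inverse-square form: the score of node i is the sum over other nodes j of mass[i] * mass[j] / dist(i, j)^2. The masses and the precomputed temporal distances below are hypothetical inputs; unreachable pairs are simply skipped.

```python
import math


def gravity_scores(mass, dist):
    """Score each node by summed 'attraction' to every other reachable node.

    mass: {node: positive number}, e.g. a degree-like node property (assumed).
    dist: {(i, j): temporal distance from i to j}; may be asymmetric, and a
          missing pair is treated as unreachable.
    """
    scores = {}
    for i in mass:
        total = 0.0
        for j in mass:
            if i == j:
                continue
            d = dist.get((i, j), math.inf)
            if d > 0 and math.isfinite(d):
                # Assumed inverse-square law, as in classic gravity.
                total += mass[i] * mass[j] / d ** 2
        scores[i] = total
    return scores


# Hypothetical masses and temporal distances; "c" cannot reach "a".
mass = {"a": 2.0, "b": 1.0, "c": 1.0}
dist = {
    ("a", "b"): 1.0, ("a", "c"): 2.0,
    ("b", "a"): 1.0, ("b", "c"): 1.0,
    ("c", "b"): 2.0,
}
scores = gravity_scores(mass, dist)
```

Because temporal distances respect the time ordering of contacts, dist(i, j) and dist(j, i) can differ, which is why the sketch keys distances by ordered pairs.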
The study of citation networks is of interest to the scientific community. However, the underlying mechanism driving individual citation behavior remains imperfectly understood, despite the recent proliferation of quantitative research methods. Traditional network models normally use graph theory to consider articles as nodes and citations as pairwise relationships between them. In this paper, we propose an alternative evolutionary model based on hypergraph theory in which one hyperedge can have an arbitrary number of nodes, combined with an aging effect to reflect the temporal dynamics of scientific citation behavior. Both an approximate theoretical solution and a simulation analysis of the model are developed and validated using two benchmark datasets from different disciplines, i.e., publications of the American Physical Society (APS) and the Digital Bibliography & Library Project (DBLP). Further analysis indicates that the attraction of early publications decays exponentially. Moreover, the experimental results show that the aging effect indeed has a significant influence on the description of collective citation patterns. Shedding light on the complex dynamics driving these mechanisms facilitates the understanding of the laws governing scientific evolution and the quantitative evaluation of scientific outputs.
Link prediction can be used to extract missing information, identify spurious interactions as well as forecast network evolution. Network embedding is a methodology to assign coordinates to nodes in a low-dimensional vector space. By embedding nodes into vectors, the link prediction problem can be converted into a similarity comparison task. Nodes with similar embedding vectors are more likely to be connected. Classic network embedding algorithms are random-walk-based. They sample trajectory paths via random walks and generate node pairs from the trajectory paths. The node pair set is further used as the input for a Skip-Gram model, a representative language model that embeds nodes (which are regarded as words) into vectors. In the present study, we propose to replace random walk processes by a spreading process, namely the susceptible-infected (SI) model, to sample paths. Specifically, we propose two susceptible-infected-spreading-based algorithms, i.e., Susceptible-Infected Network Embedding (SINE) on static networks and Temporal Susceptible-Infected Network Embedding (TSINE) on temporal networks. The performance of our algorithms is evaluated by the missing link prediction task in comparison with state-of-the-art static and temporal network embedding algorithms. Results show that SINE and TSINE outperform the baselines across all six empirical datasets. We further find that the performance of SINE is mostly better than TSINE, suggesting that temporal information does not necessarily improve the embedding for missing link prediction. Moreover, we study the effect of the sampling size, quantified as the total length of the trajectory paths, on the performance of the embedding algorithms. The better performance of SINE and TSINE requires a smaller sampling size in comparison with the baseline algorithms. Hence, SI-spreading-based embedding tends to be more applicable to large-scale networks.
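The idea of sampling paths with an SI spreading process instead of a random walk can be sketched as below. This is not the SINE algorithm itself, whose details the abstract does not give; it assumes a discrete-time SI process with infection probability `beta`, reads trajectory paths off the infection tree as root-to-leaf paths, and generates Skip-Gram style node pairs with a fixed window. All function and parameter names are illustrative.

```python
import random


def si_sample_paths(adj, seed, beta, max_steps, rng):
    """Run SI spreading from `seed` and return root-to-leaf infection paths."""
    parent = {seed: None}  # infected node -> node that infected it
    for _ in range(max_steps):
        newly = {}
        for u in list(parent):
            for v in adj[u]:
                if v not in parent and v not in newly and rng.random() < beta:
                    newly[v] = u
        parent.update(newly)
    has_child = {p for p in parent.values() if p is not None}
    paths = []
    for node in parent:
        if node not in has_child:  # leaf of the infection tree
            path, cur = [], node
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            paths.append(path[::-1])
    return paths


def context_pairs(path, window):
    """Node pairs within `window` hops along a path (Skip-Gram style input)."""
    pairs = []
    for i, u in enumerate(path):
        for j in range(max(0, i - window), min(len(path), i + window + 1)):
            if j != i:
                pairs.append((u, path[j]))
    return pairs


# Toy chain 0-1-2-3 (hypothetical); beta = 1.0 makes the spread deterministic.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
paths = si_sample_paths(adj, seed=0, beta=1.0, max_steps=3, rng=random.Random(0))
pairs = context_pairs([0, 1, 2], window=1)
```

With `beta` below 1 the infection tree, and hence the sampled paths, vary from run to run; the resulting pairs would then be passed to a Skip-Gram implementation of choice.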
In this paper, we explore how to effectively suppress the diffusion of (mis)information via blocking/removing the temporal contacts between selected node pairs. Information diffusion can be modelled as, e.g., an SI (Susceptible-Infected) spreading process, on a temporal social network: an infected (information possessing) node spreads the information to a susceptible node whenever a contact happens between the two nodes. Specifically, the link (node pair) blocking intervention is introduced for a given period and for a given number of links, limited by the intervention cost. We address the question: which links should be blocked in order to minimize the average prevalence over time? We propose a class of link properties (centrality metrics) based on the information diffusion backbone [19], which characterizes the contacts that actually appear in diffusion trajectories. Centrality metrics of the integrated static network have also been considered. For each centrality metric, links with the highest values are blocked for the given period. Empirical results on eight temporal network datasets show that the diffusion-backbone-based centrality methods outperform the other metrics, whereas the betweenness of the static network performs reasonably well, especially when the prevalence grows slowly over time.
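The quantity being minimized, the average prevalence of an SI process on a temporal network under link blocking, can be sketched as follows. The sketch assumes a contact list of (time, u, v) tuples, deterministic infection at every contact as described above, blocking for the whole observation window (the paper blocks for a given period), and averaging over the time steps at which contacts occur. The function name and toy data are illustrative.

```python
from itertools import groupby


def si_average_prevalence(contacts, seed, num_nodes, blocked=frozenset()):
    """Average fraction of infected nodes over the time steps with contacts.

    contacts: iterable of (t, u, v) undirected contacts.
    blocked:  set of frozenset({u, v}) links whose contacts are removed.
    """
    infected = {seed}
    prevalence = []
    for _, group in groupby(sorted(contacts), key=lambda c: c[0]):
        newly = set()
        for _, u, v in group:
            if frozenset((u, v)) in blocked:
                continue  # intervention: this contact never happens
            if u in infected and v not in infected:
                newly.add(v)
            elif v in infected and u not in infected:
                newly.add(u)
        # Nodes infected at time t can only spread from the next time step.
        infected |= newly
        prevalence.append(len(infected) / num_nodes)
    if not prevalence:
        return len(infected) / num_nodes
    return sum(prevalence) / len(prevalence)


# Toy temporal network (hypothetical) with 4 nodes, seeded at node 0.
contacts = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (3, 0, 3)]
baseline = si_average_prevalence(contacts, seed=0, num_nodes=4)
with_block = si_average_prevalence(
    contacts, seed=0, num_nodes=4, blocked={frozenset((0, 1))}
)
```

Comparing `baseline` with `with_block` for each candidate link gives a brute-force reference against which a centrality-based choice of links could be checked on small examples.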
Network embedding aims at learning node representations by preserving the network topology. Previous embedding methods do not scale to large real-world networks, which usually contain millions of nodes. They generally adopt a one-size-fits-all strategy to collect information, resulting in a large amount of redundancy. In this paper, we propose DiaRW, a scalable network embedding method based on a degree-biased random walk with variable length to sample context information for learning. Our walk strategy adapts well to the scale-free feature of real-world networks and extracts information from them with much less redundancy. In addition, our method greatly reduces the size of the context information, which makes it efficient for large-scale network embedding. Empirical experiments on node classification and link prediction demonstrate both the effectiveness and the efficiency of DiaRW on a variety of real-world networks. Our algorithm is able to learn the representations of networks with millions of nodes and edges in hours on a single machine, which is tenfold faster than previous methods.
Many real-world systems are dynamic and time-varying. Discovering the vital nodes in temporal networks is more challenging than in static networks. In this study, we propose a temporal information gathering (TIG) process for temporal networks. The TIG-process, as a node importance metric, can be used to rank nodes. As a framework, the TIG-process can be applied to explore the impact of temporal information on the significance of the nodes. The key point of the TIG-process is that a node's importance relies on the importance of its neighborhood. There are four variables: temporal information gathering depth n, temporal distance matrix D, initial information c, and weighting function f. We observe that the TIG-process can degenerate to classic metrics under a proper combination of these four variables. Furthermore, the fastest-arrival-distance-based TIG-process (fad-tig) performs optimally in quantifying nodes' efficiency and nodes' spreading influence. Moreover, for the fad-tig process, we can find an optimal gathering depth n that makes the TIG-process perform optimally when n is small.
The rapid development of the World Wide Web accelerates information spreading in various ways. Thanks to the emergence of multiple social platforms, events that would have attracted little attention in the past can now become social hot spots. In this paper, we study the information diffusion process of the “IP MAN3 box office fraud”, which was widely diffused on the largest Chinese microblogging system, Sina Weibo, in March 2016. Based on the temporal metric we have proposed, we succeed in identifying the sources of the information and constructing a panorama of the diffusion process. In addition, a portion of the nodes that promote the diffusion are identified using node importance algorithms. Finally, users who behave abnormally during the development of the event are identified.
Research on the interplay between the dynamics on the network and the dynamics of the network has attracted much attention in recent years. In this work, we propose an information-driven adaptive model, where disease and disease information can evolve simultaneously. In the information-driven adaptive process, susceptible (infected) individuals who are able to recognize the disease break the links to their infected (susceptible) neighbors to prevent the epidemic from spreading further. Simulation results and numerical analyses based on the pairwise approach indicate that the information-driven adaptive process not only slows down epidemic spreading, but also significantly diminishes the epidemic prevalence at the final state. In addition, the disease spreading and information diffusion patterns on a lattice as well as on a real-world network give visual representations of how the disease is trapped in an isolated field by the information-driven adaptive process. Furthermore, we perform a local bifurcation analysis on four types of dynamical regions, namely healthy, continuous dynamic behavior, bistable and endemic, to understand the evolution of the observed dynamical behaviors. This work may shed some light on how information affects human activities in response to epidemic spreading.
The interaction between disease and disease information on complex networks has fostered an interdisciplinary research area. When a disease begins to spread in the population, the corresponding information is also transmitted among individuals, which in turn influences the spreading pattern of the disease. In this paper, firstly, we analyze the propagation of two representative diseases (H7N9 and Dengue fever) in the real-world population and of their corresponding information on the Internet, and find a high correlation between the two types of dynamical processes. Secondly, inspired by the empirical analyses, we propose a nonlinear model based on the SIS (Susceptible-Infected-Susceptible) model to further interpret the coupling effect. Both simulation results and theoretical analysis show that a high prevalence of epidemic will lead to a slow information decay, consequently resulting in a high infected level, which shall in turn prevent the epidemic spreading. Finally, further theoretical analysis demonstrates that a multi-outbreak phenomenon emerges via the effect of the coupling dynamics, which is in good agreement with empirical results. This work may shed light on the in-depth understanding of the interplay between the dynamics of epidemic spreading and information diffusion.
Purpose - Information carriers (including mass media and We-Media) play important roles in information diffusion on social networks. The purpose of this paper is to investigate changes in the dissemination of information, combined with data analysis. Design/methodology/approach - This work analyzed nearly 200 years of coverage of different information carriers during different periods of human society, from the period of only word-of-mouth communication to the period of modern society. Information diffusion models are built to illustrate how the information dynamics change with time, and are combined with box office data of several movies to predict the process of information diffusion. In addition, a metric is defined to identify which information would become news in the future. Findings - Results show that with the development of information carriers, information now spreads faster and more widely. The correctness of the proposed metric has been validated. Research limitations/implications - The structure of social networks influences the dissemination of information. There are an enormous number of factors that influence the formation of hotspots. Practical implications - The results and conclusions of this work will help in predicting the evolution of information carriers. The proposed metric will aid in searching for hot news in the future. Originality/value - This work may shed some light on a better understanding of information diffusion, spreading not only on social networks but also on the carriers used for the information spreading.
The ongoing rapid expansion of the World Wide Web (WWW) greatly increases the effective transmission of information from heterogeneous individuals to various systems. Extensive research on information diffusion has been conducted by a broad range of communities including social and computer scientists, physicists, and interdisciplinary researchers. Despite substantial theoretical and empirical studies, unification and comparison of different theories and approaches are lacking, which impedes further advances. In this article, we review recent developments in information diffusion and discuss the major challenges. We compare and evaluate available models and algorithms to investigate their physical roles and optimization designs, respectively. Potential impacts and future directions are discussed. We emphasize that information diffusion has great scientific depth and combines diverse research fields, which makes it interesting for physicists as well as interdisciplinary researchers.