Content Propagation in Online Social Networks

Abstract

This thesis presents methods and techniques for analyzing content propagation within online social networks (OSNs) using a graph-theoretical approach. It describes the important factors and techniques for analyzing and describing content propagation, starting from the smallest entity in a network, the user account, up to complete friendship graphs and traces of content. Individuals and their attributes form the basic elements for statistical analysis of user behavior and interests. Identifying the opinion of the population of a country, for example, requires either a random sample or data from everyone within the population. This task is not trivial, because activity patterns differ and individuals may either not provide information about themselves or obscure their data by supplying bogus information. This thesis shows that obtaining a random sample of the population of the Netherlands is possible with respect to certain parameters, such as the location, family names and first names of users. Such a sample is unlikely to be “random” with respect to the age of inhabitants, so the use of the gathered data to predict the outcome of elections may be questioned. The representation of an individual's view of an OSN is called an ego-centric network. It is a sub-graph containing all friends of an ego and the relations between those friends. Within such graphs, the influence between friends can be estimated, which improves the usability of recommendation systems but also raises concerns about the privacy of users. This thesis describes ways to reconstruct private information of a user when only a few friends of that individual share their data publicly, because most friendships are formed between persons with similar interests. The current way of dealing with privacy concerns, enabling users to protect their own data, is therefore not sufficient.
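As an illustration of the ego-centric network defined above, the following minimal sketch extracts such a sub-graph from an undirected friendship graph. It is not taken from the thesis: the function names `build_graph` and `ego_network`, the edge-list input format, and the choice to include the ego itself among the nodes are all assumptions made for this example.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (user, friend) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def ego_network(adjacency, ego):
    """Return (nodes, edges) of the sub-graph around `ego`.

    Assumption: the ego is kept as a node, so the result holds the ego,
    all of its friends, and every friendship among those nodes.
    """
    friends = set(adjacency.get(ego, ()))
    nodes = friends | {ego}
    edges = set()
    for u in nodes:
        for v in adjacency.get(u, ()):
            if v in nodes:
                # frozenset makes (u, v) and (v, u) the same undirected edge
                edges.add(frozenset((u, v)))
    return nodes, edges


# Hypothetical toy friendship graph, for illustration only
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
graph = build_graph(toy_edges)
nodes, edges = ego_network(graph, "a")
print(sorted(nodes))  # ['a', 'b', 'c']
print(len(edges))     # 3
```

In this toy graph, users `d` and `e` fall outside the ego-centric network of `a` because they are not direct friends of `a`, even though `d` is a friend of `c`.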
The structure of ego-centric networks also reveals the ability of egos to spread and control the spread of information, as a person completely embedded in a group has less control over disseminating content than a person connecting multiple groups. A snapshot of the whole network of an OSN includes all user accounts (nodes) and friendships (links) at a certain point in time. Because OSNs may contain millions of nodes, data obtained by crawling is likely to be skewed, depending on the method used and the duration of the crawl. Therefore, a new way of traversing the graph, called “Mutual Friend Crawling”, is proposed, in which certain network metrics converge faster to their final values by also detecting communities of users during the traversal. An analysis of the diffusion of content in multiple OSNs shows that only a limited fraction of the neighbors of a user (i.e. friends) are “useful” in terms of spreading content to their peers. Commonly used network metrics that reflect the centrality of a node are shown to have no correlation with the ability to repeatedly succeed in passing messages to a high number of users. The reason is that the whole network of friends contains inactive or abandoned user accounts and that a critical dependency on the time at which a message was sent exists. This means that the friends of a user who forward a message have to be available, or online, at the time they are “needed” to forward content. On the other hand, influential groups might exist whose members act together to spread content with each other's help. These groups might organize themselves via external communication channels, as shown by the example of a well-known group, the “Digg Patriots”, whose members cannot be found through purely topological measures. A similar time dependency exists in the evolution of OSNs, because users can only forward information or befriend others when they are online.
The interactivity durations of these actions are shown to follow a log-normal-like distribution rather than the exponential or power-law distribution assumed in multiple previous publications. The argument for a log-normal-like distribution is that power-law and exponential distributions would imply that most interactivity durations are very short, whereas individuals always need some time to complete tasks. However, it is shown that the time scale of observations is crucial, because log-normal distributions and power-law distributions with a small exponent might look the same in a log-log plot if the chosen bin size is too large. Another process involved in the structural evolution of a friendship network is driven by markets that sell friendship relations in OSNs. These markets account for a considerable number of friendship relations, although their use usually has a negative connotation. In terms of content propagation, however, they might be beneficial because, for example, politicians “buying” followers are able to reach users who would otherwise not connect to them. The term viral spreading is often used in connection with content propagation within the network of an OSN. Therefore, certain parameters from epidemiology are compared to “viral spreading” on Twitter. It was found that most messages had a low basic reproductive ratio