Follow the “Mastodon”: Structure and Evolution of a Decentralized Online Social Network
| Key | Value |
|---|---|
| Author | Zignani et al. |
| Year | 2018 |
| Link | |
- Provides a dataset of more than 400k users across about 1.7k instances, plus their social graph
- Analyses network properties and compares with Twitter
- Simple analysis of connections across instances
# Extracted Notes
a) the information about the status of the instances is available at registration time and is easily accessible on the home page; b) most of the Mastodon servers do not ban requests to the “follower” and “following” pages of the instance members; and c) the decentralized architecture of Mastodon allows us to reach a higher page rate, while also respecting the politeness factor, since the request load is distributed among the instances [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
However, these algorithms have significant effects on the structure of the underlying social graph. For instance, the “Who To Follow” algorithm abruptly changed the following graph of Twitter [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
So being a recommendation-free social network makes Mastodon a data source for understanding the growth mechanisms that drive online social networks, without the influence of external factors. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
meta-data associated to each instance and enrich them by providing the geographical position (country) and topics, [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
We analyze the network structure of Mastodon and compare its main properties against Twitter, i.e. its most widespread centralized counterpart. We show that in Mastodon bidirectional relationships are more likely, i.e. the links are less “weak” than in Twitter, and that, as in Twitter, there are hub-users who attract many other users. However the presence of spambots is marginal, unlike Twitter, where social bots have been an issue. Finally, we find that an instance-based organization highly impacts on the tightly clustered structure of the social network [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
three major instances are well interconnected, but weakly overlapped. So, despite the decentralized and fragmented architecture, Mastodon users keep connected to the core of the network and are able to search for friendships in other instances, [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
All these studies have focused on centralized online social platforms; to the best of our knowledge this is the first dataset on a decentralized social network which includes the network structure and its evolution, and the meta-data about its elements. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 543]
“follow” relationships and its growth in terms of new connections and users, [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
usage statistics and meta-data (geographical location and allowed topics) about the servers [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
analyzes the overall structure of the Mastodon social network, focusing on its diversity w.r.t. other commercial microblogging platforms such as Twitter. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
instance-like paradigm influences the connections among the users. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
The released dataset paves the way to a number of research activities, which range from classic social network analysis to the modeling of social network dynamics and platform adoption in the early stage of the service. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
original goals and changed their business model as they faced problems like data monetization and data privacy. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
utilization of user data for advertisement purposes is at the core of the public debate and impacts consumer trust in these online services. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
Mastodon since it is the newest and fastest-growing microblogging platform. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
users have a more detailed control of the visibility of their posts; in fact, each post can alternatively be public and visible on local (instance) and global (federated) timelines, public but not visible on the timelines, or totally private. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
data collection campaign on Mastodon may overcome the following issues researchers are facing when dealing with well established and company-based online social networks. First, data confidentiality policies of the major social network platforms severely limited the access to information exploitable for the purpose of reconstructing the structure and evolution of the social network. To protect this kind of data the major platforms continuously act on the API and on the spider policies (robots.txt) removing any entry-point. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 541]
Information tracking the temporal dynamics of these groups is essential in understanding how communities grow and overlap; and Mastodon provides this kind of data: we can track how connections evolve and we can gather the communities, eventually along with the topics they are focused on or the behaviors they prohibit. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
The initial snapshot of the network contains more than 400K users and 5.5 million links among the members of the platform located in about 1,700 instances around the world [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 542]
With respect to the above data sources, the Mastodon dataset captures the evolution of a different social platform paradigm, the decentralized one; it has a day-granularity; it is enriched by different kinds of meta-data which allow a more detailed analysis of the mechanisms driving its microscopic evolution; and its evolution is not biased by any friend recommendation algorithm. Finally, the dataset is publicly available. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 543]
The basic aim is to return control of the content distribution channels to the people by avoiding the insertion of sponsored users or posts in the feeds. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 543]
both to manage the communications (black links) among the servers - instances - comprising the federation and to offer a client-to-server interface which enables interactions (blue links) among the users having their accounts on the instances. In the server-to-server layer the instances are connected as nodes in a network, and each of them administrates its own rules, account privileges, and whether or not to share messages coming to and from other instances. Unlike a centralized social media, anyone can run a server of Mastodon. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 543]
The local timeline shows messages from users hosted on a singular instance, while the federated timeline aggregates the messages across all participating Mastodon instances. The non-commercial purposes of the social network only allow a chronological ordering in both timelines, thus avoiding any ranking mechanism based on advertisement or other recommendation algorithms. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 543]
In fact, each message has a privacy option, and users can choose whether the post is public or private. Public messages can be displayed on the federated timeline, and private messages are only shared on the timelines of the user’s followers. Messages can also be marked as unlisted from timelines or direct between users. Finally, users can also mark their accounts as completely private, so their posts never appear on any timeline. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 544]
To this aim, each instance declares in its description the topics their users should be interested in and the maximum number of users the instance can handle. So prior to registration, a user is encouraged to choose the instance more suited to her/his own tastes. If the instance has no room for her/him, the user may choose another similar instance or run a new server supporting specific contents or applying different moderation policies. In fact, the community-orientation strongly impacts the moderation procedures since each instance can limit specific contents. For instance, the flagship instance Mastodon.social bans contents that are illegal in Germany or France, including Nazi symbolism and Holocaust denial. In the mind of Mastodon’s founder, small and close communities would defeat unwanted behavior more effectively than a centralized solution based on an operation team screening harming contents. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 544]
Finally, the organization in decentralized communities is harder for government to censor, since the migration of the community on a server placed in a safer country is an easy task for the instance administrator. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 544]
The information about the status of the instances is already available when a user has to register; also, it is easily accessible through a specific web resource. This way, we are able to collect a set of meta-data related to the setting of the instances. The meta-data are useful in studies on the formation of groups and communities (Backstrom et al. 2006; Kairam, Wang, and Leskovec 2012) and in the validation of online social network models where the structure of the social network is strictly related to the interests of its members (Coletto et al. 2016; Zhang et al. 2017). [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 544]
The spider policies of most of the Mastodon instances are soft and allow requests towards the resources which return the ’follower’ and ’following’ relationships for the instance members. This way we can build the Mastodon social network. The network structure, combined with the instances’ meta-data, paves the way to a number of research activities, ranging from classical social network analysis to the overlapping community formation and detection (Xie, Kelley, and Szymanski 2013) since people have connections to several instances expressing diverse interests and tastes. Moreover, the fact that each community corresponds to a physical server raises new questions with regards to the understanding of socio-technological systems since there is an explicit interplay between the physical network architecture and the overlaid social network (Schneider et al. 2009). [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 544]
a) a direct access to the data from the system (Jiang et al. 2013; Zhao et al. 2012); b) a passive measurement, i.e. a tracking of the communication between users and the platform through click-stream or monitoring applications installed on the users’ devices (Schneider et al. 2009; Benevenuto et al. 2009); and c) an active measurement by actively querying the OSN platform through the API provided by the system or by parsing web pages. In this study we adopt the third methodology since the first and second options impose strong limitations in terms of number of users willing to install an application and release their private data, in terms of data representativeness, and in terms of number of instance administrators who should be contacted and willing to share the logs of their servers and users’ data. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 544]
We gather the list of all the instances by querying the resource instances.social/api/1.0/instances/list, after receiving an access token which prevents an improper usage of the API. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 545]
Figure 2 shows the trend of the number of instances, users, connections and posts (statuses) along a six-month period, namely from July 19, 2017 to January 23, 2018 and highlights the actual status of the platform. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 545]
In fact, pawoo.net and mstdn.jp users are very active (25K and 50K post-per-day on average), while the third most used instance (mastodon.social) is much less active (1K post-per-day). [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 545]
We also enriched the instance meta-data by adding the geographical location of the instances at a country-granularity. To this aim, we exploited the geo-lookup service provided by freegeoip.net for assigning to each server the country it is in. The service relies on a database of IP addresses associated to cities and countries along with other information we overlooked. Thus, we introduce in the released meta-data the field Country which contains the ISO 3166-1 alpha-3 code of the country hosting the server. By the geo-lookup service we found the geographical position of 93% (1588) of the instances. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 545]
We also note that in many countries there are no instances, e.g. there are none in any of the African countries and throughout most of Central and South America and Russia. Thus, so far the Mastodon platform has not reached a worldwide diffusion, despite its decentralized nature and the ease of setting up new instances. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 545]
In fact, a first check of the “Full description” and “Topic” fields has shown that most instance administrators do not apply the good practice of describing their instance and indicating the topics. Practically, we found that only 18% of instances have a description and 8% also have a not-empty topic list. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 545]
instances are not focused on something particular, they are generalist; b) programming and technology, and in general science, are the most common among the specific topics; and c) there are also instances dedicated to the arts, creativity and gaming. Despite the above results, the information on topics should be combined with a text analysis on the published statuses in order to fill the missing data. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 546]
Nevertheless, the development of the tools to gather this kind of data requires some choices which depend on the features of the platform: 1. how to access the information on the connections among the users; 2. which users the graph visit should start from; 3. which connections to follow and which policy to implement during the graph visit. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 546]
To overcome these limitations, we developed a web spider targeted to the web pages of the platform. From each profile page we extract the URLs which return both the followers and the followees. Then, by scraping the web pages linked to the above URLs we gather the in-going and outgoing relationships of a user. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 546]
We also highlight that the information in following/follower web pages is also available to visitors who are not logged on. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 546]
Once we have identified how to access the data, we have to define the seed set, i.e. the set of users the crawler starts to visit. To build a seed set as large as possible we exploit both the global and the local timelines, since they report all the statuses with public visibility (see the previous section) in chronological order. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 546]
From each list we extract the users who posted at least one status and put them into the seed set. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 546]
Finally, in our crawler we implement a breadth-first search (BFS) strategy which traverses both out-going and in-going links; where the latter are traversed in the opposite direction, i.e. from destination to source. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
In the crawler we also add a filter which discards links towards profiles hosted in other social platforms supporting the open protocols ActivityPub or OStatus. Indeed, these protocols allow users to interact with users on other platforms, forming what is named “fediverse”. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
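The crawl strategy in the two notes above (BFS over both link directions, dropping profiles hosted outside Mastodon) can be sketched as follows. This is a minimal illustration, not the authors' crawler: the dictionaries `followees` and `followers` stand in for the scraped following/follower pages, and `mastodon_hosts`, `host_of`, and the `user@host` handle format are assumptions introduced here.

```python
from collections import deque


def host_of(handle):
    """Return the host part of a 'user@host' handle (format assumed here)."""
    return handle.rsplit("@", 1)[1]


def crawl(seeds, followees, followers, mastodon_hosts):
    """Breadth-first visit of the follow graph in both directions.

    followees / followers: dicts mapping a handle to a list of handles,
    hypothetical stand-ins for the scraped following/follower pages.
    mastodon_hosts: set of hosts known to be Mastodon instances; links to
    any other host are discarded.
    Returns the visited users and the directed (follower, followee) links.
    """
    visited = set()
    queue = deque()
    for seed in seeds:
        if host_of(seed) in mastodon_hosts and seed not in visited:
            visited.add(seed)
            queue.append(seed)

    links = set()
    while queue:
        user = queue.popleft()
        # Out-going links: user -> followee.
        for other in followees.get(user, []):
            if host_of(other) not in mastodon_hosts:
                continue
            links.add((user, other))
            if other not in visited:
                visited.add(other)
                queue.append(other)
        # In-going links, traversed from destination to source:
        # the stored link is still follower -> user.
        for other in followers.get(user, []):
            if host_of(other) not in mastodon_hosts:
                continue
            links.add((other, user))
            if other not in visited:
                visited.add(other)
                queue.append(other)
    return visited, links


toy_followees = {"a@one.example": ["b@two.example", "x@other.example"]}
toy_followers = {
    "a@one.example": ["c@one.example"],
    "b@two.example": ["a@one.example"],
}
users, links = crawl(
    ["a@one.example"], toy_followees, toy_followers,
    {"one.example", "two.example"},
)
```

In the toy data, `x@other.example` is dropped by the host filter, while `c@one.example` is reached only through an in-going link.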
After the end of the crawling process we obtained a network made up of 479,425 nodes and 5,649,762 directed links. Based on the number of users we get from the instance meta-data at the time of the crawl ending, our network covers 46% of users in Mastodon. Specifically, in Table 1 we observe that coverage of the crawl is greater than 50% in most instances and that the three biggest instances are covered on average to an extent of 52% [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
Every day we run a monitoring tool which extracts the new followers and followees of each user in the crawled network and, whenever it detects users not yet in the network, it runs the network crawler again starting from these new profiles. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
Moreover, the tool is less request-demanding since it retrieves the last connections, due to a chronological ordering in the followee and follower web pages. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
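One way to read the "less request-demanding" remark: if a follower or followee page lists connections newest first, the monitor can stop reading as soon as it meets a connection it already knows. The sketch below assumes exactly that ordering and stopping rule; both are assumptions made here, and `new_connections` is a hypothetical helper, not part of the authors' tool.

```python
def new_connections(page_newest_first, known):
    """Return the connections that appear before the first already-known one.

    page_newest_first: handles as listed on the page, newest first (assumed).
    known: set of handles already stored for this user.
    """
    fresh = []
    for handle in page_newest_first:
        if handle in known:
            break  # everything after this point was seen on a previous day
        fresh.append(handle)
    return fresh


fresh = new_connections(["e", "d", "c", "b"], {"a", "b", "c"})
```

A rule like this would miss changes caused by unfollows, so it is only an illustration of why chronological ordering reduces the number of requests.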
In 9 days we have gathered more than 370K new links, with a peak during the week-end (20-21 January, 2018). Finally, in the released network we add to each link a timestamp, indicating the day on which we retrieve it. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
to which extent does the decentralized and instance-based nature of Mastodon influence its overall structure? More specifically, how does it differ from the most famous microblogging service, Twitter? [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
In addition, we consider also the mutual network - which is built by reciprocated edges (Mastodon reciprocity is 0.35), whose extremes are users following one another - in order to enable comparison with Twitter features as reported in literature (Myers et al. 2014). [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
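As a small illustration of the two notions in this note, the sketch below computes reciprocity as the fraction of directed links whose reverse link also exists, and builds the mutual network from the reciprocated pairs. The paper does not spell out its formula in the quoted text, so this common definition is an assumption; the function names are introduced here.

```python
def reciprocity(directed_links):
    """Fraction of directed links (u, v) for which (v, u) also exists."""
    links = {(u, v) for u, v in directed_links if u != v}  # ignore self-loops
    if not links:
        return 0.0
    reciprocated = sum(1 for u, v in links if (v, u) in links)
    return reciprocated / len(links)


def mutual_network(directed_links):
    """Undirected edges between users who follow one another."""
    links = {(u, v) for u, v in directed_links if u != v}
    return {frozenset((u, v)) for u, v in links if (v, u) in links}


toy_links = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
r = reciprocity(toy_links)
mutual = mutual_network(toy_links)
```

In the toy data, two of the four links are reciprocated, and the mutual network keeps only the single edge between `a` and `b`.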
the in-degree and out-degree distributions are very similar to the extent of exhibiting the same median value of 16; this is opposite to Twitter, where the median value for the out-degree distribution is higher than for the in-degree. This highlights the fact that while the typical Twitter user follows more people than follow him/her, in Mastodon users have a more balanced behavior. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
In fact, more than 95% of users have a difference in the interval (−250, 250). Thus, in both platforms we can find a small population of celebrity users and the presence of social bots which in Twitter was estimated to be around 15% (Varol et al. 2017); by contrast, Mastodon profiles with spambot traits are more marginal, less than 5%. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 547]
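The degree statistics in the two notes above can be reproduced on any edge list with a few lines. The sketch assumes that the "difference" is in-degree minus out-degree and that the interval is open, as written in the quote; `degree_summary` and its `bound` parameter are names introduced here.

```python
from collections import Counter
from statistics import median


def degree_summary(directed_links, bound=250):
    """Return (median in-degree, median out-degree, share of balanced users).

    A user counts as balanced when in-degree minus out-degree lies in the
    open interval (-bound, bound).
    """
    in_deg, out_deg = Counter(), Counter()
    users = set()
    for src, dst in directed_links:
        out_deg[src] += 1
        in_deg[dst] += 1
        users.update((src, dst))
    if not users:
        raise ValueError("empty edge list")
    diffs = [in_deg[u] - out_deg[u] for u in users]
    balanced = sum(1 for d in diffs if -bound < d < bound) / len(diffs)
    return (
        median(in_deg[u] for u in users),
        median(out_deg[u] for u in users),
        balanced,
    )


toy_links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
summary = degree_summary(toy_links)
```

Users far outside the interval are the candidates for the hub and spambot profiles mentioned in the note.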
The clustering coefficient in social networks measures the fraction of users whose friends are friends among one another. As in the Twitter analysis, we focus on the local clustering coefficient (cc) of nodes in the Mastodon mutual network. In Figure 4e we show the average local clustering coefficient as a function of the mutual degree. As in most social networks, the local clustering coefficient decreases while the degree increases. In the comparison of this metric with two of the most widespread online social networks, we find that it lies in the middle between Facebook and Twitter. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 548]
First, mstdn.jp, the second largest instance, has a higher average clustering coefficient (0.35) compared to the other instances (pawoo.net - 0.26, mastodon.social - 0.13 and mastodon.xyz - 0.08), and it is also higher than the clustering coefficient of the entire network. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 548]
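A minimal sketch of the quantity plotted against mutual degree: for each node, the local clustering coefficient is the number of links among its neighbours divided by the number of neighbour pairs, and the values are then averaged per degree. The adjacency-dictionary representation and both function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(neighbors, node):
    """Share of neighbour pairs of `node` that are themselves connected."""
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: undefined cases count as zero
    closed = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return closed / (k * (k - 1) / 2)


def average_cc_by_degree(neighbors):
    """Average local clustering coefficient for each degree value."""
    buckets = defaultdict(list)
    for node, nbrs in neighbors.items():
        buckets[len(nbrs)].append(local_clustering(neighbors, node))
    return {k: sum(values) / len(values) for k, values in buckets.items()}


# Toy mutual network: a triangle a-b-c plus a pendant node d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cc_by_degree = average_cc_by_degree(toy)
```

Even in this toy graph the higher-degree node `a` has a lower coefficient than its degree-2 neighbours, the same direction of trend the note describes.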
Assortativity measures preference for a node to be linked to others that are similar (or dissimilar) w.r.t. a specific property. Typically, we distinguish between nominal and numerical attributes, since the metrics adopted are different. In the first case, we compute the modularity of the network w.r.t. a given category, while in the second case we use the Pearson coefficient to compute how correlated a numerical property of nodes connected by a link is. Here we focus on the degree assortativity and on the nominal assortativity measured on the instance hosting the nodes. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 549]
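Both measures named in this note can be sketched in plain Python. Degree assortativity is computed here as the Pearson correlation between the degrees at the two ends of each edge of an undirected (mutual) network; the instance-based measure uses the standard directed modularity formula, with instances as the groups. The exact variants used in the paper are not given in the quoted text, so both formulas and all names below are assumptions.

```python
from collections import Counter


def degree_assortativity(undirected_edges):
    """Pearson correlation of end-point degrees; each edge is listed once."""
    degree = Counter()
    for u, v in undirected_edges:
        degree[u] += 1
        degree[v] += 1
    xs, ys = [], []
    for u, v in undirected_edges:
        # Count every edge in both directions so the measure is symmetric.
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    if not xs:
        return float("nan")
    n = len(xs)
    mean = sum(xs) / n  # xs and ys share the same mean by construction
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else float("nan")


def instance_modularity(directed_links, instance_of):
    """Directed modularity of the partition induced by hosting instance."""
    m = len(directed_links)
    if m == 0:
        return float("nan")
    inside, out_total, in_total = Counter(), Counter(), Counter()
    for src, dst in directed_links:
        a, b = instance_of[src], instance_of[dst]
        out_total[a] += 1
        in_total[b] += 1
        if a == b:
            inside[a] += 1
    groups = set(out_total) | set(in_total)
    return sum(
        inside[g] / m - (out_total[g] / m) * (in_total[g] / m) for g in groups
    )


path = [("a", "b"), ("b", "c"), ("c", "d")]
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
links = [("a1", "a2"), ("a2", "a1"), ("b1", "b2"), ("b2", "b1"), ("a1", "b1")]
hosts = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
q = instance_modularity(links, hosts)
```

Negative degree assortativity, as in the star graph, means that high-degree users tend to be linked to low-degree ones.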
Generally, we find that Mastodon shows degree assortativities that are not consistent with the other online social networks. This is also confirmed by a disassortative trait (-0.13) in the mutual network. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 549]
Finally, we investigate how the decentralized architecture of Mastodon impacts the relationships among users sited in different instances. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 549]
result in well-separated and scarcely interconnected groups. To this aim we analyze the directed network of the instances, as shown in Figure 5, where we draw a weighted link from the instance i1 to instance i2 if there is at least one link from i1’s member to i2’s member and the weight is proportional to the number of links connecting i1 and i2’s members. By visually inspecting the structure of the network we observe that the three major instances are well interconnected, and that there are many instances surrounding them. So, despite the decentralized and fragmented architecture, Mastodon users keep connected to the core of the network and are able to search for friendships in other instances, even if a friendship suggestion mechanism is still lacking. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 549]
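The instance-level network described in this note can be derived from the user-level edge list by counting links per ordered pair of instances. The sketch below uses the raw count as the weight and leaves out links inside a single instance; both choices, the `user@host` handle format, and the function name are assumptions made here.

```python
from collections import Counter


def instance_network(directed_links):
    """Aggregate user links into weighted instance-to-instance links.

    Returns (weights, cross_share): weights maps (i1, i2) to the number of
    links from members of i1 to members of i2 (i1 != i2); cross_share is
    the fraction of all links whose end points sit on different instances.
    """
    weights = Counter()
    total = 0
    for src, dst in directed_links:
        total += 1
        i1 = src.rsplit("@", 1)[1]
        i2 = dst.rsplit("@", 1)[1]
        if i1 != i2:
            weights[(i1, i2)] += 1
    cross_share = sum(weights.values()) / total if total else 0.0
    return weights, cross_share


toy_links = [
    ("a@one.example", "b@two.example"),
    ("c@one.example", "b@two.example"),
    ("b@two.example", "a@one.example"),
    ("a@one.example", "c@one.example"),
]
weights, cross_share = instance_network(toy_links)
```

The second return value corresponds to the cross-instance share that the paper reports in a footnote for the real dataset.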
It is a platform that aims to overturn the centralized and invasive model of the most popular social networks by proposing a decentralized approach, free of sponsored contents and recommendation systems. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 549]
The results of an initial network analysis reported in our paper already underline its particular features. The absence of interference by the platform will also allow us to understand the intrinsic mechanisms of growth and evolution of a social network when not mediated. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 549]
Lastly, the wealth of the dataset - which includes not only the structure of the network but also temporal annotation and meta-data about topics and geographical information - makes it suitable for a plethora of different studies from the validation of triadic closure models to a deeper understanding of network formation and evolution models based on topics or interests. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, pp. 549-550]
Footnote 11: The percentage of directed links whose extremes are in different instances is 62%. [@PVUX7JDF#Zignani_Gaito_Rossi_2018, p. 549]