Tight Sampling in Unbounded Networks
Introduces snowball-type sampling scheme to prioritize cohesive communities around given seed nodes.
# Introduction
# Paragraph 1
- Social networks valuable source
- Obtaining full-sized networks is impossible
- limited access
- volume
- Sampling is a fundamental challenge
# Paragaph 2
- Sampling schemes
- Attribute-based sampling (e.g. demographics or tweet content)
- Connectivity-based sampling (e.g. seed-set expansion)
- Connectivity-based sampling quickly covers distant parts of graph on small-world networks
- Unwanted if one wants to sample from cohesive groups (as in local clustering with seeds)
# Paragaph 3
- Proposal of sampling scheme that prioritizes sampling within cohesive subgroups
- Similar to seeded community detection
- Determine locally dense subgraph containing a set of seed nodes
- Difference: graph is only partially known
- Common clustering objectives (e.g. conductance, modularity) are difficult to optimize
- rely on the whole graph (page rank, random walk)
# Paragraph 4
- Generalization of maximum adjacencey search
- prominently used to find minimum graph cuts
- Replace maximum-adjacency criterion with a weighted interaction-based priority
- Iteratively include profiles that show maximum priority with profiles inside
- Priority function is iteratively calibrated with empirical data
- Repeated until stopping criterion is met
# Paragraph 5
- Contributions:
- Snowball-sampling mechanism
- Behavior of sampling scheme vs. other connectivity-based on synthetic data
- Generalize the adjacency criterion
- Provide Twitter Interaction dataset with local clusters
# Related Work
# Node-based sampling methods
- Simple: Random vertex sampling
- samples nodes with equal probability
- constructs induced subgraph
- good to estimate i.e. degree distribution but not connectivity
- Sophisticated: Page-Rank weighted sampling
# Traversal-based sampling methods
- Fix seed nodes and expand on them
- Expand by
- Traditonal Graph traversal:
- BFS: biased towards high-degree nodes
- DFS
- Snowball sampling
- based on BFS
- peripheral nodes have many missing neighbors
- Random Walks
- modeled by markov chains
- biased towards high-degree nodes, which many try to solve:
- Metropolis-Hastings
- visit low-degree nodes
- Liu et al.
- hybrid jump mechanism
- avoids repetititve sampling within small connected components
- Forest Fire
- hyrbird combination of RW/SS sampling
- “burns down” fraction of outgoing links
- Maiya et al.
- uses concepts from expander graphs
- greedily construct with maximal expansion
- preserves community structure
- Zhang et al.
- based on edge connectivity
- requires large parts of the graph surrounding the seeded nodes
- Traditonal Graph traversal:
# Methodology
# Terminologies
- Seed users
- Given set of influencers
- Civil society category (DISMISS dataset)
- Denoted by $V_s$
- Interactions
- interactions X consists of like, retweet, reply, quote
- interaction $I_t(i, j) = x \in X$
- Insiders
- same as frontier
- Denotes as $V_i \supseteq V_s$
- Outsiders
- set of nodes that interact with insiders
- Denoted as $V_o$ disjoint $V_i$
# Priority scheme
# Data
- DISMISS dataset
- 11.580 invidiuals from India
- use civil society category
- tweets from July 2022
- 50,379 tweets sampled
- used only 90% bottom tweets to remove outlier
- 379.514 interactors
- 1 to 534 tweets per author
# Sampling strategy
- 4 sampling strategies:
- uniform (U) or weighted(W) outsider selection
- direct (D) or staged(S - choose insider then outsider) outsider selection
- Combinations: “DU”, “DW”, “SU”, “SW”
# Synthetic dataset
- For assessment
- Stochastic Block Model: 5000 nodes with 3 communities (1.5,, 1.5k, 2k), 58,799 edges
- Sampling
- no interactions therefore based on maximum adjacency search
- MAS: select outsider with highest adjecency to insider set
- RI_MAS (analogous SW): select ousider with highest adjencency to random insider
- RO (similar DU): select random ousider
- RI_RO (similar SU): select random outsider incident to random insider
# Experimentation
- Louvain community detection algorithm