Abstract
Real social network datasets with community structures are critical for evaluating various algorithms in Online Social Networks (OSNs). However, obtaining such community data from OSNs has become increasingly challenging due to privacy issues and government regulations. In this paper, we make a first attempt to address two important factors, i.e., user willingness and the existence of community structure, to obtain more complete OSN data. We formulate a new research problem, namely Community-aware Data Acquisition with Maximum Willingness in Online Social Networks (CrawlSN), to identify a group of users from an OSN such that the group is a socially tight community and the users’ willingness to contribute data is maximized. We prove that CrawlSN is NP-hard and inapproximable within any factor unless P = NP, and propose an effective algorithm, named Community-aware Group Identification with Maximum Willingness (CIW), with various processing strategies. We conduct an evaluation study with 1093 volunteers to validate our problem formulation and demonstrate that CrawlSN outperforms the other alternatives. We also perform extensive experiments on 7 real datasets and show that the proposed CIW outperforms the other baselines in both solution quality and efficiency.
Notes
Many other factors, such as the relationship types of users, are also important. Here, we discuss the two fundamental factors for crawling community data for further analysis and leave the other important factors to future work.
We show undirected edges here for the clarity of presentation. Directed relations can be easily incorporated in our problem formulation.
We have implemented a lightweight crawler in Python 3.6, which is able to obtain the publicly accessible user data in OSNs.
We have built a simple machine learning model with SVM that predicts users’ willingness from their publicly accessible information on Facebook.
We can also consider directed influences in our problem formulation with a slight modification of the algorithm.
If \(\tau _{v}=0\), we define the value of the second term of Eq. 1 as 0. That is, if \(\tau _{v}=0\), \(\frac{\sum _{u \in N_S(v)} \delta _{u,\emptyset } \cdot w_{u,v}}{\tau _v}=0\).
In some extreme cases, for a user who is very unwilling to provide her data (i.e., with a small individual willingness), the influenced willingness may raise the value of Eq. 1 up to 1. To tackle this issue, an additional parameter \(\beta \in [0,1]\) can be added to the second term (i.e., influenced willingness) of Eq. 1 as follows. By setting a smaller \(\beta \), i.e., close to 0, the user’s individual willingness becomes more important in the computation of the average willingness.
The source code is available online at http://www.cs.nthu.edu.tw/~chihya/CIW_download/.
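As a minimal sketch of the convention stated in the note on \(\tau _{v}=0\) above, the following Python function evaluates only the second term of Eq. 1, i.e., \(\frac{\sum _{u \in N_S(v)} \delta _{u,\emptyset } \cdot w_{u,v}}{\tau _v}\), and returns 0 when \(\tau _{v}=0\). The function name, the dictionary-based data layout, and the argument names are assumptions made for illustration; they are not taken from the released CIW code, and the first term of Eq. 1 and the optional \(\beta \) parameter are deliberately left out.

```python
from typing import Dict, Hashable, Iterable, Tuple


def influenced_term(
    v: Hashable,
    neighbors_in_group: Iterable[Hashable],
    delta: Dict[Hashable, float],
    weight: Dict[Tuple[Hashable, Hashable], float],
    tau: Dict[Hashable, float],
) -> float:
    """Second term of Eq. 1 for user v (hypothetical helper).

    neighbors_in_group plays the role of N_S(v), delta[u] holds the
    value delta_{u,emptyset}, weight[(u, v)] holds w_{u,v}, and tau[v]
    holds tau_v. All containers are assumed layouts, not the paper's.
    """
    # Convention from the note: the term is defined as 0 when tau_v = 0,
    # which also avoids a division by zero.
    if tau[v] == 0:
        return 0.0
    total = sum(delta[u] * weight[(u, v)] for u in neighbors_in_group)
    return total / tau[v]
```

Checking \(\tau _{v}=0\) before the summation keeps the special case explicit and means the neighbor weights are never read for such a user.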
Acknowledgements
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grants MOST 109-2636-E-007-019 and MOST 108-2218-E-468-002.
Additional information
Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hsu, BY., Tu, CL., Chang, MY. et al. CrawlSN: community-aware data acquisition with maximum willingness in online social networks. Data Min Knowl Disc 34, 1589–1620 (2020). https://doi.org/10.1007/s10618-020-00709-5
DOI: https://doi.org/10.1007/s10618-020-00709-5