Abstract
With the advent of social networks and micro-blogging systems, the way people communicate and spread information has changed substantially. People of different backgrounds, ages and education levels exchange information and opinions across many domains and topics, and can now interact directly with popular users and authoritative information sources that were usually unreachable before these environments existed. As a result, the mechanisms of information propagation have changed deeply, and studying them is indispensable for understanding the evolution of information networks. To this end, we propose a novel model for examining how information spreads over a social network and how user relationships change with the domain of discussion. Taking Twitter as a case study, we analyze the multiple paths that information follows over the network, with the goal of understanding how the dynamics of information contagion vary with the topic of discussion. We then provide a method for estimating the influence among users by evaluating the nature of their relationship with respect to the topic of discussion they share. Using a vast sample of the Twitter network, we present various experiments that illustrate our proposal and show the efficacy of the proposed approach in modeling this information spread.

Notes
A popular-press article expressing the same high-level idea can be found at http://www.nytimes.com/external/readwriteweb/2010/03/19/19readwriteweb-the-million-follower-fallacy-audience-size-d-3203.html.
Please note that, as pointed out in the literature, the inverse document frequency factor cannot be usefully applied in our work. It diminishes the weight of terms that occur very frequently in the corpus and increases the weight of terms that occur rarely. We believe this weighting scheme is unsuitable in our case, because terms that appear in most of the documents in the corpus are likely to be highly relevant to the domain. Please also note that, to exclude common function words (such as conjunctions and articles), we removed stop-words with common techniques and considered only nouns in our computation.
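The effect described in this note can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the tokenizer, the normalization by the most frequent term and the example documents are all assumptions made for illustration, and the noun filter is omitted because it would require a part-of-speech tagger.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"the", "a", "and", "of", "in", "is", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stop-words."""
    return [t for t in text.lower().split() if t.isalpha() and t not in STOP_WORDS]

def tf_profile(docs):
    """Corpus-level term frequency, scaled by the most frequent term.

    No inverse-document-frequency factor is applied, so a term that occurs
    in every document keeps a high weight.
    """
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    peak = max(counts.values())
    return {term: count / peak for term, count in counts.items()}

def idf(term, docs):
    """Classic log(N / df) factor, shown only for comparison."""
    df = sum(1 for doc in docs if term in tokenize(doc))
    return math.log(len(docs) / df)

# Hypothetical three-document corpus about a single domain.
docs = [
    "football match tonight",
    "the football season opens",
    "football transfer rumours and gossip",
]
profile = tf_profile(docs)
print(profile["football"], idf("football", docs))
```

In this toy corpus the term that appears in every document receives the highest frequency-based weight, while its IDF factor is zero, so a TF-IDF product would discard exactly the term that characterizes the domain.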
An interesting observation is possible: the highest-ranked \(n\)-grams are mostly uni-grams and simply reflect the distribution of the letters of the alphabet in the language of the document. In other words, the most frequent \(n\)-grams are, most of the time, correlated with the language. Because of this, the most frequent \(n\)-grams of the considered topic profiles turned out to be very similar (they start to differ consistently only in the lower part of the ranked \(n\)-gram list), so we excluded uni-grams from our analysis.
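A ranked character \(n\)-gram profile with uni-grams excluded might be built along the following lines. This is a sketch under stated assumptions: the padding character, the range of \(n\), the profile length and the sample text are illustrative choices, and the rank-based "out-of-place" comparison is the measure commonly associated with Cavnar and Trenkle's \(n\)-gram categorization, which may differ from the comparison actually used in the paper.

```python
from collections import Counter

def ngram_profile(text, n_min=2, n_max=3, top_k=10):
    """Return the top_k character n-grams, most frequent first.

    n_min=2 excludes uni-grams, which mostly mirror letter frequencies
    of the language rather than the topic.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (assumed convention)
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(profile_a, profile_b):
    """Sum of rank differences; n-grams missing from profile_b get a fixed penalty."""
    rank_b = {gram: rank for rank, gram in enumerate(profile_b)}
    penalty = len(profile_b)
    return sum(
        abs(rank - rank_b[gram]) if gram in rank_b else penalty
        for rank, gram in enumerate(profile_a)
    )

# Hypothetical sample text.
profile = ngram_profile("the theory of the thing", top_k=5)
print(profile)
```

A smaller out-of-place value indicates more similar profiles; comparing a profile with itself gives zero.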
The sampling rate of the Twitter account we used is 10 % of an average of 200 million tweets per day. More information is available at http://apiwiki.Twitter.com.
Cite this article
Cataldi, M., Aufaure, MA. The 10 million follower fallacy: audience size does not prove domain-influence on Twitter. Knowl Inf Syst 44, 559–580 (2015). https://doi.org/10.1007/s10115-014-0773-8
DOI: https://doi.org/10.1007/s10115-014-0773-8