Social Media Mining With R Sample Chapter
Richard Heimann leads the Data Science Team at Data Tactics Corporation and is an EMC Certified Data Scientist specializing in spatial statistics, data mining, Big Data, and pattern discovery and recognition. Since 2005, Data Tactics has been a premier Big Data and analytics service provider based in Washington D.C., serving customers globally.
Richard is an adjunct faculty member at the University of Maryland, Baltimore County, where he teaches spatial analysis and statistical reasoning. Additionally, he is an instructor at George Mason University, teaching human terrain analysis, and is a selection committee member for the 2014-2015 AAAS Big Data and Analytics Fellowship Program. In addition to co-authoring Social Media Mining with R, Richard has recently reviewed Making Big Data Work for Your Business for Packt Publishing, and writes frequently on related topics for the Big Data Republic. He has recently assisted DARPA, DHS, the US Army, and the Pentagon with analytical support.

I'd like to thank my mother who has been supportive and still makes every effort to understand and contribute to my thinking.
Chapter 3, Mining Twitter with R, explains that an obvious prerequisite to gleaning insight from social media data is obtaining the data itself. Rather than presuming that readers have social media data at their disposal, this chapter demonstrates how to obtain and process such data. It specifically lays out a technical foundation for collecting Twitter data in order to perform social data mining and provides some foundational knowledge and intuition about visualization.

Chapter 4, Potentials and Pitfalls of Social Media Data, highlights that measurement and inference can be challenging when dealing with socially generated data, including social media data. This chapter makes readers aware of common measurement and inference mistakes and demonstrates how these failures can be avoided in applied research settings.

Chapter 5, Social Media Mining Fundamentals, aims to develop theory and intuition for the models presented in the final chapter. These theoretical insights are provided prior to the step-by-step model-building instructions so that researchers can be aware of the assumptions that underpin each model, and thus apply them appropriately.

Chapter 6, Social Media Mining Case Studies, brings everything together in an accessible and tangible concluding chapter. It demonstrates canonical lexicon-based and supervised sentiment analysis techniques, as well as laying out and executing a novel unsupervised sentiment analysis model. Each class of model is worked through in detail, including code, instructions, and best practices. This chapter rests heavily on the theoretical and social science information provided earlier in the book, but can be accessed right away by readers who already have the requisite understanding.

Appendix, Conclusions and Next Steps, wraps everything up with our final thoughts, the scope of the data mining field, and recommendations for further reading.
Traditional social science not only focuses on important questions, but also seeks to uncover relationships that are interesting and unexpected. The world around us is full of complex social behavior; though identifying mundane facts is sometimes helpful in the name of basic research, it does little to help us understand social behavior. We take to heart the mandate to find interesting relationships as we mine social media data, a particularly complex and rich source. At heart, however, social science is not a focus on the important or the interesting. It is science, which means that it is a set of methods and practices designed to generate and verify facts. The logic of science, regardless of whether it proceeds quantitatively or in a qualitative fashion, is fundamentally about knowledge discovery and accumulation. This logic helps mitigate several shortcomings in reasoning that frequently hinder our ability to make correct inferences. Some examples include illusory correlations (perceiving correlations that do not exist), selective observation (inadvertently cherry-picking data), illogical reasoning, and over- or under-generalizing (assuming that facts discovered in one domain apply to others as well). Generally, the scientific process helps avoid the discovery of false truths often arrived at through deduction, speculation, justification, and groupthink.
These data sources, despite their shortcomings in terms of coverage, are held in high regard due to their focused and authoritative nature. These sources are used due to the fear that bad input data will yield low-quality inferences, or "garbage in, garbage out," as the saying goes. But what is garbage data? Should we consider social media data garbage because of its unfocused nature? And under what circumstances might we be willing to use social media data?

Our view is that focused, purpose-collected data is the best option when it is available. This statement may come as a surprise in a book on social media mining, but the linchpin of the statement is the phrase when it is available. For the vast majority of emerging questions related to business, politics, and social life, purpose-collected social science data-sets simply do not exist. As such, we take the pragmatic position that social data, due to its broad coverage and large volume, makes a nice fallback to targeted data. Social data is bad in the sense that much of it will be inapplicable to any particular question; however, limited applicability is certainly better than the utter absence of data. The reality is that we live in an imperfect world, which will consequently yield imperfect data. Our job, as data analysts, is to work with data in responsible ways. This book does not cover how to handle poor, dirty, missing, or incorrect data in a comprehensive manner. However, we do wish to promote the use of social media data and its utility in cases where traditional social science data-sets do and do not exist, and where there are low and high barriers to targeted collection.

Traditional social science modeling techniques tend to require data-sets in which observations are independent of one another. However, data gleaned from social media outlets, such as Twitter, is almost certainly not independent. That is, the data is not randomly sampled from a larger population, and thus each observation is likely to be related to observations that are nearby in some sense. For example, tweets about a large public event arise around the same time and from the same area, and many may express similar views. This non-independence has implications for how you handle tweets given their degree of centrality, shared geography, and repetition through retweets. Although we do not often study sentiment polarity explicitly in terms of networks, doing so may prove useful for future researchers. We anticipate that research in that direction will produce better measures and predictions, localized lexicons, and other advantages.
Understanding sentiments
Social media mining can and should have broad interpretation. It is not the intent of the authors to confine social media mining to sentiments or opinions, but rather we suggest that a sentiment or opinion is a useful tool for many research pursuits. Until recently, sentiment was understood as a ubiquitous and constant part of the human experience, with variations in sentiments changing only slightly up or down. Klaus Scherer (2000) developed a working definition as follows: "Emotions (sentiments) are episodes of coordinated changes in several components in response to external and internal events of major significance to the organism." It is our intent to understand, measure, and interrelate these changes in sentiment. Scherer's typology of emotions is a useful grounding point for the understanding of sentiments, and a jumping-off point for a discussion of the difficulty in measuring sentiment-laden text.
Generally, when we try to measure a sentiment, we are talking about Scherer's emotions, though in some situations we might try to capture longer-term phenomena such as moods.
Anchoring the neo-social science approach with Twitter data, as opposed to other types of social media data, is important because not all data is equal. Twitter data differs from data derived from sites such as Yelp and Google Reviews due to the simple fact that Twitter does not have ratings or explicit targets. If we want to know the sentiment of a given source or topic, whether it is the iPhone 5S or something more sensitive such as social policy, we have to discover that signal in a corpus of other signals. Yelp and Google Reviews (just two examples of many), by contrast, have explicitly accounted for the source or topic by design and have ratings designed to measure sentiments. A tweet is what Twitter users send to each other and to the Twittersphere. A tweet is sometimes a sentence and sometimes not, but it is restricted to 140 characters, or approximately 11 words. Twitter therefore lends itself to sentence-level sentiment analysis, as opposed to reviews on Yelp or Google Reviews, which usually constitute entire documents.
To gather data, we generally collect content that contains a manually specified (set of) keyword(s). This is called the target. For example, the target for presidential approval would use the topic keyword obama. We may wish to add context to analyses done on particular keywords by adding additional opposing or specifying keywords. For example, in addition to obama, we could add romney to provide a counterpoint if we were studying the 2012 presidential election campaigns. Depending on the purpose of our analysis, we could jointly search for, say, obama and economy to target more specific subjects.

Topic models represent a second, more sophisticated, and potentially more thorough way of capturing bits of text that are relevant to a particular analysis. These models take very large sets of documents as their inputs and group them probabilistically into estimated topics. That is, each document is proclaimed to be a mixture of one or more topics that are themselves estimated from the data. This allows users to find texts that are related to a topic, though they may not explicitly use a particular keyword. The details of this class of statistical models are outside the scope of this text; however, in Appendix, Conclusions and Next Steps, we point readers to references on the theory and estimation of this exciting new class of tools.

Social data mining is the detection of attitudes, and the easiest way to understand it is through the following structure:

sentiment = {data source, source, target, sentiment, polarity}

The parameters are explained in detail as follows:

- Data source: This relates to understanding the source of the data; that is, is the source a sentence or an entire document? Twitter or a blog?
- Source or holder: This is the one who expresses the sentiment or opinion.
- Target or aspect: This is what or whom the sentiment is directed toward.
- Type of sentiment: This is the type(s) of emotion(s) expressed, that is, like, love, hate, value, desire, and so on.
- Polarity: These are juxtaposed sentiments on a dimension, that is, positive or negative.
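As a concrete illustration, here is a minimal sketch of collecting tweets for a joint keyword target and representing the components of this structure in R. It assumes the twitteR package introduced in Chapter 3 and an already-authenticated session; the user handle and field values are invented placeholders.

```r
# A minimal sketch, assuming the twitteR package from Chapter 3
# and an already-authenticated Twitter session.
library(twitteR)

# Collect tweets matching a target keyword, jointly narrowed by a
# specifying keyword.
tweets <- searchTwitter("obama economy", n = 500)

# One sentiment observation represented as a named list; the
# values here are invented placeholders, not extracted results.
sentiment_record <- list(
  data_source = "Twitter",      # sentence-level source, not a full document
  source      = "@some_user",   # the holder who expresses the opinion
  target      = "obama",        # what or whom the sentiment is directed toward
  type        = "approval",     # the kind of emotion expressed
  polarity    = "positive"      # position on the positive/negative dimension
)
```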
The following examples highlight these components and also some of the challenges involved in sentiment analysis. We have parts of two reviews: one about Steven Spielberg and another about John Carpenter. In both examples, the data source is the Internet Movie Database (IMDB), which considers itself the world's most popular and authoritative source for movie, TV, and celebrity content. The holder is the one who wrote the review, and the targets are Steven Spielberg and John Carpenter respectively. However, the target is complicated by mentions of various movies over time. Also complicating matters is the variety of sentiment types and polarities.

"Steven Spielberg's second epic film on World War II is an unquestioned masterpiece. Spielberg, ever the student of film, has managed to resurrect the war genre by producing one of its grittiest and most powerful entries. He also managed to cast this era's greatest answer to Jimmy Stewart, Tom Hanks, who delivers a performance that is nothing short of an astonishing miracle for about 160 out of its 170 minutes; Saving Private Ryan is flawless, literally!"

"There was a time when John Carpenter was a great horror director. Of course, his best film was 1978's masterpiece Halloween; however, he also made The Fog in the 1980s and 1987's underrated Prince of Darkness. Heck, he even made a good film, In the Mouth of Madness, in 1995. However, something terribly wrong happened to him in 1992 with the terrible comedy Memoirs of an Invisible Man."
Lexicon-based sentiment analysis often entails merely counting the opinion words in a subset of data from a particular source. This approach certainly has errors, as perhaps does all natural language processing; however, in aggregate, the lexicon-based approach has proven to be fairly robust, even when used only on subsets. Additionally, there are many possible ways to aggregate the sentiment scores of each word, but most commonly, they are simply summed up to form an overall score for a document.

Despite lexicon-based sentiment classification being considered basic here, it is still difficult. This is primarily because, when counting words with positive or negative valences, one must decide which words to count as each. Different dictionaries of positive and negative words can generate different sentiment scores for the same sentences. Some words with perceived sentiment are more neutral, while others with perceived neutrality are in fact more extreme. This challenge arises, in part, due to varied usages of words within and across contexts.

Preassembled lexicons are incredible resources and are applicable to a wide variety of problems; we use several in Chapter 6, Social Media Mining Case Studies. Despite subtle differences, they are all good starting points, but they are just that, starting points and not end points. Rather than utilizing a preassembled lexicon indiscriminately, researchers should often develop lexicons that are sensitive to the domain they are analyzing. For instance, a lexicon that is useful for economic atmospherics (where moderate and stable are positive) may prove useless for examining political leanings. Preassembled domain-specific lexicons exist as well, and two popular economic lexicons will be used later.

There are many approaches to extending both generic and domain-specific preassembled lexicons, and we will describe two rather intuitive ones, dictionary-based lexicons and corpus-based lexicons, in addition to preassembled lexicons. Both dictionary-based and corpus-based approaches augment preassembled lexicons in one of the following two ways:

- Using a dictionary (that is, synonyms and antonyms) to add keywords external to our corpus to enhance our preassembled lexicon(s)
- Using the corpus directly to add words already internal to our corpus that are keywords but are not accounted for by preassembled lexicons
Merging preassembled lexicons, dictionary-based lexicons, and corpus-based lexicons offers the best chance to successfully estimate sentiment.
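To make the corpus-based route concrete, the following base R sketch flags words that frequently co-occur, within the same tweets, with a few positive seed words from a preassembled lexicon; such words become candidates for addition. The seed words, toy tweets, and threshold are all invented for illustration.

```r
# A corpus-based augmentation sketch in base R; the seed words,
# toy tweets, and threshold are invented for illustration.
seed_positive <- c("good", "great", "excellent")

tweets <- c("great phone, solid battery",
            "excellent value, solid build",
            "terrible screen, poor battery")

# Tokenize each tweet into lowercase words.
tokens <- lapply(tolower(tweets), function(x) unlist(strsplit(x, "[^a-z]+")))

# Keep tweets containing at least one seed word, then count the
# other words that appear alongside the seeds.
has_seed <- vapply(tokens, function(w) any(w %in% seed_positive), logical(1))
co_words <- unlist(tokens[has_seed])
co_words <- co_words[!co_words %in% c(seed_positive, "")]
counts   <- sort(table(co_words), decreasing = TRUE)

# Words above the threshold become candidate lexicon additions;
# here, "solid" co-occurs with seed words twice.
names(counts[counts >= 2])
```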
The two approaches (dictionary and corpus) produce empirically constructed lexicons that seek to calibrate the underlying sentiment by adding to the preassembled lexicons. The intuition is that words appearing in the complete collection of lexicons (preassembled, dictionary, and corpus) within our set of documents returned from a narrow topic (the search set) are more likely to objectively describe sentiment information. In the lexicon approach, it is sufficient to simply count the frequency of words from our lexicons within the set of documents returned from our topic of interest and sum the results over time, by space, or by product. The next chapter outlines this process in detail, summing the result set over time where the target is the US economy.
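As a minimal sketch of this counting approach, consider the base R function below; the word lists are tiny placeholders for the full preassembled lexicons used in Chapter 6.

```r
# A minimal lexicon-scoring sketch; the word lists are tiny
# placeholders for the full preassembled lexicons used in Chapter 6.
positive_words <- c("strong", "gains", "stable")
negative_words <- c("weak", "poor", "losses")

score_sentiment <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Sum positive matches, subtract negative matches, and return
  # the overall score for the document.
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

docs <- c("strong gains in a stable economy",
          "weak earnings and poor outlook")
vapply(docs, score_sentiment, numeric(1))  # returns 3 and -2
```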
So, where did the naive part of the name come from, and why is this method useful in spite of its self-assumed simplicity? The goal of any classifier is to determine which class or type a new observation belongs to based on its characteristics and previous examples of both types that we have seen before (that is, from existing data). Some types of classifiers can account for the fact that the characteristics we use for this prediction may be correlated. For example, if we are trying to predict whether an e-mail is spam or not by looking at what words and phrases the e-mail contains, the words easy and money are likely to co-occur (and are likely to be indicative of spam messages). The Naive Bayes classifier does not try to account for correlations between characteristics. It just uses each characteristic separately to try to determine each new observation's class membership.

The naive assumption that all of the characteristics of an observation are unrelated is always wrong. In predicting whether or not to extend a loan to an individual, a bank may look at their credit score, whether or not they own a home, their income, and their current debt level. Obviously, all of these things are likely to be correlated. However, ignoring the correlations between predictive characteristics allows us to do two things that would otherwise be problematic. First, it allows us to include a huge number of characteristics, which becomes important because, in text analysis, individual words often have predictive power and documents often contain thousands of unique words; other models have a difficult time accommodating this number of predictors. Second, the Naive Bayes classifier is fast, thus allowing us to use large training sets to train a model and to generate results quickly.
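The sketch below shows this workflow in R with the e1071 package's naiveBayes function, treating each word's presence as a separate characteristic; the toy term matrix and labels are invented for illustration.

```r
# A toy Naive Bayes sketch using the e1071 package; the term
# matrix and labels are invented for illustration.
library(e1071)

# Binary term matrix: each row is a document; each column records
# whether a word appears. Naive Bayes uses each column separately.
train_x <- data.frame(
  easy    = factor(c(1, 1, 0, 0), levels = c(0, 1)),
  money   = factor(c(1, 1, 0, 0), levels = c(0, 1)),
  meeting = factor(c(0, 0, 1, 1), levels = c(0, 1))
)
train_y <- factor(c("spam", "spam", "ham", "ham"))

# Train the classifier; Laplace smoothing avoids zero probabilities
# for word/class combinations absent from the training set.
model <- naiveBayes(train_x, train_y, laplace = 1)

# Classify a new message containing "easy" and "money".
new_doc <- data.frame(easy    = factor(1, levels = c(0, 1)),
                      money   = factor(1, levels = c(0, 1)),
                      meeting = factor(0, levels = c(0, 1)))
predict(model, new_doc)  # expected: spam
```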
Unsupervised social media mining

Item Response Theory for text scaling
The techniques set out earlier for scaling or classifying sentiments in texts are fairly robust; that is, they tend to work well under a wide variety of conditions, such as heterogeneous text lengths and topic breadths. However, each of these methods requires substantial analyst input, such as labeling training data or creating a lexicon. Item Response Theory (IRT) is, strictly speaking, a theory, but in this text it will be used to refer to a class of statistical models that rely on that theory and provide a way to scale texts according to sentiment in the absence of labeled training data. That is, IRT models are unsupervised learning models.
IRT models were developed by psychologists for scoring complex tests and were then picked up by political scientists, who employ them for scaling legislators. We will briefly explain the legislative context, as that will help readers build intuition about how the models work when applied to scaling texts.

Consider a set of V voters, such as US Senators, who, over the course of a year, vote on B bills. For simplicity, assume each voter can only vote yes or no. We could then put all of the data into a matrix, where each row represents a voter and each column a bill. Each cell then represents a particular voter's decision on a particular bill, that is, yes (1) or no (0). Now, we need to make two related assumptions. The first is that all or most of these voters can be described as lying along a single underlying continuum. The second, more trivial assumption is that this position influences their votes, at least on some bills. With these assumptions in place, we can estimate a statistical model that describes the probability of each cell in our data matrix being a one or a zero. The model is a function of each bill's difficulty of being voted for (that is, how controversial it is), each voter's position on the underlying scale, and how strongly each bill is affected by voters' locations on the scale. Technically, we estimate a logistic regression as follows:

pr(y_vb = 1) = logit^-1(b1_b * x_v - b0_b)

Here, x_v is the scaled position of each voter (v), b0_b is the difficulty of voting yes on each bill (b), and b1_b is the degree to which each voter's position affects their proclivity to vote in favor of each bill (b). Positive values of b1 mean that voters to the right are more likely to vote in favor of a bill, and negative values of b1 mean that voters to the left are more likely to vote for a bill.

As you will see, we apply the previous assumptions to the analogous case of text scaling. To do so, we create a matrix with rows representing authors or documents (instead of voters) and columns representing words or phrases (instead of bills). Each cell represents whether or not a particular author used a particular word or phrase. We modify the previous assumptions: authors lie along a sentiment continuum, and their placement affects their pattern of word use. The first part of this assumption is limiting. We can only apply the method to sets of documents that are sufficiently narrow to be usefully described by a single underlying continuum, and that continuum must essentially be the sentiment we are trying to measure. The results of this analysis are a continuous scaled measure of author (or document) location (x) as well as estimates of the weights for each word or phrase (b1). This scaled measure of location (x) represents the author's sentiment towards the topic under study, provided the previous assumptions are met.
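As one possible way to estimate such a model in R, the sketch below uses the pscl package's rollcall and ideal functions on a tiny invented author-by-word matrix; pscl is our assumption here, not a package the book itself introduces, and real applications need far more authors and words.

```r
# A hedged IRT sketch using the pscl package (our assumption, not a
# package introduced in this book): authors play the role of voters
# and words play the role of bills.
library(pscl)

# Tiny invented binary matrix: rows are authors/documents; columns
# record whether each word was used (1) or not (0).
word_use <- matrix(c(1, 1, 0, 0,
                     1, 0, 1, 0,
                     0, 1, 0, 1,
                     0, 0, 1, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("author", 1:4),
                                   c("awful", "bad", "good", "great")))

# Wrap the matrix as a roll call object; notInLegis = 9 is an unused
# code so it does not clash with nay = 0.
rc <- rollcall(word_use, yea = 1, nay = 0, notInLegis = 9)

# Estimate a one-dimensional ideal-point (sentiment) model via MCMC.
fit <- ideal(rc, d = 1, maxiter = 10000, burnin = 5000, thin = 25)

# Scaled author positions (x) on the latent sentiment dimension.
summary(fit)
```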
The IRT-based method described here has mixed properties. It requires no training data and little subject-matter expertise to employ, is language agnostic (that is, it could function on any language), and generates a quantitative (instead of a merely binary) measure of sentiment. However, this model can only be applied to documents that are all about the same topic, can only estimate a single underlying dimension, can be slow to estimate, and is not guaranteed to converge.
Summary
In this chapter, we learned key concepts related to sentiment analysis. Sentiment was defined, and difficulties related to its mining were covered. We then walked through the theoretical underpinnings of three different models for sentiment analysis. We intentionally separated the details of implementation from theoretical concerns in hopes of giving readers an appreciation for the methods, including their strengths and weaknesses. The next chapter delves into the details of implementing the classes of models described earlier.
Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals, and most internet book retailers.
www.PacktPub.com