
Tag Archives: big data

The study of social media has great promise, but we always need to understand its limitations. This sounds rather basic, but it is often not reflexively thought about. Though social media is not as shiny as it was several years ago, the zeitgeist persists, and it often clouds our ability to frame what exactly we are doing with all the social data we have access to.[1] Specifically, if we use Twitter data, it is not enough to leave research at the level of frequency counts (top hashtags, top retweets, most engaged-with comments, etc.). David De Roure [2] warns that this type of analysis misses the social aspects of web technologies. Ultimately, social media spaces are sociotechnical systems, and the social that is (re)produced there – like face-to-face communication – is highly nuanced.

I think it is fundamentally important for researchers of social media data across the disciplines to think critically beyond the literal results of brute-force machine learning. Rather, this is an opportunity for us to ask large and important social questions. My point is epistemological: I think it is important for our results to contribute to our understanding of these social questions. This is not to say that quantitative methods such as natural language processing, n-grams (and other co-occurrence methods), and various descriptive statistics are not important to the study of social media. Rather, they are often the starting point or midpoint of a research project. In my work, Big Data analytical models provide a great way to get a bird's-eye view of social media data, but they cannot answer social questions as such. These methods are, however, valuable to, for example, grounded theory approaches, which can help produce valuable research questions or social insights.
Additionally, the mixing of methods this encourages is exciting, as it provides opportunities for us to innovate new research methods rather than trying to force-fit traditional research methods onto social media data (though those traditional methods remain valuable, of course).

[1] Ramesh Jain, in his talk at the NUS Web Science & Big Data Analytics workshop, put this as: data are everywhere, and we have access to billions of data streams.

[2] In his talk at the NUS Web Science & Big Data Analytics workshop (December 8th, 2014)

A recent and understandably controversial application of Big Data is the Facebook emotional contagion experiment, in which Facebook data scientists manipulated the feed content of selected users, reducing either the positive or the negative content they saw. I have previously written about this.

The Guardian’s exposé on the U.S. National Security Agency’s PRISM program – which collects data from large technology companies – clearly highlighted the footprint users leave behind when using the Internet. While that particular scenario represents a more extreme and, some would argue, unethical application of Big Data technologies, the Facebook experiment reminded many of us why we spoke out about data privacy and PRISM. While many Internet users are aware of the trace data created via online interactions, the power and potential of this information when collected, aggregated, and analyzed is enormous and easy to forget. The Facebook experiment speaks to the ease with which nongovernmental entities such as corporations can access information that was previously neither available nor analyzable. This type of information, paired with the right technology, can offer a unique glimpse into a person’s life and ultimately yield detailed insights into a person’s interests, hobbies, activities, work, and more. This can be a welcome development in some contexts (e.g. those who opt into health behavior change interventions to quit smoking or lose weight).

However, most of the time, online footprint data (derived from platforms such as Twitter and Facebook) are used to facilitate personalized and targeted advertising (Silberstein et al. 2011) at best and hyper-surveillance at worst. Some do not have a problem with this use of personal data (as a trade-off for ‘free’ services such as Facebook). Others see the Facebook experiment as yet one more reason to either minimize their use of the dominant social networking site or quit altogether.

References:

Silberstein, A., Machanavajjhala, A. and Ramakrishnan, R. (2011) ‘Feed following: the big data challenge in social applications’, Databases and Social Networks. ACM.

While I was writing my book about Twitter (Twitter: Social Communication in the Twitter Age), I took an interest in tracking the US Republican primary as it was being constructed within Twitter. Last year, I started collecting all geo-located tweets (tweets with location information turned on) for the 50 most populous American cities (according to U.S. Census statistics). Because of the geographical richness of this data set, I thought it would be a perfect source for studying Twitter activity surrounding the US Republican primary. Working with Alexander Gross and Stephanie Bond, I designed and developed a tool to visualize this specific geographically anchored landscape.

The 2012 US presidential election provided another opportunity to leverage this data, as Twitter has seen extremely active election-related discourse. Our Election 2012 Twitter Visualization Tool uses emergent big data research methodologies to visualize the election. The visualization tool has been optimized for the Safari browser (and is known to have some issues in other browsers).

The goal of our research is to explore urban American responses to the 2012 presidential candidates on Twitter. In order to create a representative sample of tweets from urban centers in the United States, we collected tweets from Twitter by location. We took the 50 most populous American cities according to the U.S. Census and instructed Twitter to send us tweets that were posted within 7–12 km of these cities’ centers.
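The city-radius filter described above can be sketched as follows. This is a minimal illustration, not our actual collection code: the city list here is a hypothetical three-city sample (the real study used all 50 Census cities), and the 10 km radius stands in for the 7–12 km range we used.

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical sample of city centers (lat, lon); the real list covered
# the 50 most populous U.S. cities per Census data.
CITY_CENTERS = {
    "New York": (40.7128, -74.0060),
    "Los Angeles": (34.0522, -118.2437),
    "Chicago": (41.8781, -87.6298),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

def city_for_tweet(lat, lon, radius_km=10):
    """Return the first city whose center is within radius_km of the
    tweet's coordinates, or None if the tweet falls outside every city."""
    for city, (clat, clon) in CITY_CENTERS.items():
        if haversine_km(lat, lon, clat, clon) <= radius_km:
            return city
    return None
```

In practice the filtering happened on Twitter's side (we requested tweets by location), but a local check like this is useful for validating that collected tweets actually fall inside the intended radii.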

Our software collects these geo-located tweets and uses the data to chart the relative buzz surrounding candidates in the 2012 presidential election. The tool charts the relative popularity of each candidate as measured by the number of tweets we have collected over the last 24 hours and identified with a particular candidate. For a tweet to be counted as referring to a particular candidate, the tweet must contain the candidate’s first and last name separated by a space (e.g. “Mitt Romney”) or the candidate’s official campaign Twitter account name (e.g. @mittromney). A single mention, as reported by the chart’s dynamic legend, is equivalent to one tweet containing one of the candidate names. Tweets that contain more than one candidate name are counted as mentions for both candidates. These stringent rules prevent unnecessary over-counting of tweets for a candidate. Though this keeps the tweet counts in our visualization low, the data collected are very robust: every tweet visualized does refer to Obama or Romney.
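The matching rules above can be expressed compactly. This is a sketch under the stated rules, not our production code: the `CANDIDATES` table below hard-codes the two general-election candidates with their full names and official campaign handles, and matching is done by simple case-insensitive substring search.

```python
from collections import Counter

# Each candidate matches on "First Last" (space-separated) or the
# official campaign Twitter handle, per the rules described above.
CANDIDATES = {
    "Obama": ("barack obama", "@barackobama"),
    "Romney": ("mitt romney", "@mittromney"),
}

def count_mentions(tweets):
    """Count one mention per candidate per tweet. A tweet naming both
    candidates counts once for each; tweets matching neither are ignored."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for candidate, patterns in CANDIDATES.items():
            if any(p in text for p in patterns):
                counts[candidate] += 1
    return counts
```

For example, `count_mentions(["Mitt Romney debates Barack Obama tonight", "go @mittromney!"])` yields two mentions for Romney and one for Obama, since the first tweet names both candidates.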

Please visit the tool’s webpage at my lab, the Social Network Innovation Lab, for more detailed information.