Expanding Data Science to Consider Social Science Questions – A Student Perspective

Mitchell Jones

Data Scientist at IBM

Published Jun 29, 2020

Data Science and Social Science: what do they have in common, besides the word science? One might not think they have much in common, but over the course of this data analytics project, I came to realize that Data Science is much more flexible and capable than I thought. It can be leveraged to begin answering meaningful Social Science questions by analyzing unstructured data such as tweets from Twitter users.

In this post, I’ll talk about the development process behind Visual Risk IQ's latest visualizations around hate speech and share some insight into how Data Science principles can be applied to answer broader questions than one might expect.

Project Purpose

Initially, our project’s goal was to use Data Science techniques to understand and explore questions that were being asked in reaction to President Trump’s March 18 press conference, where he repeatedly called COVID-19 the “Chinese Virus,” even when questioned by reporters about the violence that some Asian Americans were experiencing as a result of that language. Through discussion with Professor Laura Huang of Harvard Business School, our firm set out to answer questions that she was asking about how notable voices on Twitter spoke out about anti-Asian hate and violence.

Our initial research question became “How many tweets have included hateful language like #Wuhanvirus or #Chinesevirus and how has that changed over time?” We chose Twitter because of its popularity for social and political discourse. We also aimed to see whether trends in language usage were correlated with people in power using similar language.

Learning by Trial and Error

One tool that we aimed to use to explore the volume of tweets on a topic over time was a social media analysis platform called SocioViz. SocioViz was generous in providing a free premium membership for the project, but it ultimately became apparent that our research questions were too broad to use it for measuring all tweets about a topic. It was helpful to see the relationships between users, words, and emojis within relevant tweets, but given the limit of 50,000 tweets on our account, we chose to use other tools for data acquisition and analysis. With unlimited resources, it might be possible to track all of the tweets surrounding a topic, but we decided to revise the topic and scope due to resource constraints within this tool and our original approach. See examples of SocioViz in action via screen prints below.

Revising our Research Question

As a result of those limitations, we decided to examine how far various social media influencers’ voices carried when speaking out on hate-filled incidents such as the shootings at the Pittsburgh Tree of Life Synagogue or Emanuel African Methodist Episcopal Church in Charleston, SC. Sadly, we have revised the data to include more recent incidents, including the killing of Ahmaud Arbery in Brunswick, Georgia.

We feel this data is important and timely enough to share now, even though we have not yet updated it to include all of what is being said about the George Floyd killing in Minneapolis. That conversation is clearly ongoing, and we believe our research points toward what can be, and is being, said and done to stop racism.

Our new research aimed to examine the reach (i.e. number of “Likes” plus “Re-tweets”) of top social media influencers (as measured by number of Twitter followers) from different racial identity groups. We focused specifically on their tweets regarding incidents related to racial injustice, hate speech and/or violence in recent years that affected a variety of identity groups. See the “Incident Summary” below, which includes the date range and number of tweets from the influencers in our sample.

Additionally, we were curious to see if their reach varied based on whether an influencer was speaking about an injustice related to their own identity group or to other groups. Our new research question became “When influencers tweet about injustices, do their voices carry farther when they are speaking about their own identity group or on behalf of others?”
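To make that comparison concrete, here is a minimal sketch of how reach (likes plus retweets) could be averaged for tweets about an influencer's own identity group versus other groups. This is not the code used for the project; the records, field names, group labels, and numbers are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records; field names and numbers are illustrative only.
tweets = [
    {"influencer_group": "A", "incident_group": "A", "likes": 1200, "retweets": 300},
    {"influencer_group": "A", "incident_group": "B", "likes": 400, "retweets": 100},
    {"influencer_group": "B", "incident_group": "B", "likes": 900, "retweets": 600},
    {"influencer_group": "B", "incident_group": "A", "likes": 250, "retweets": 50},
]


def reach(tweet):
    """Reach as defined in this post: likes plus retweets."""
    return tweet["likes"] + tweet["retweets"]


def average_reach_by_relationship(records):
    """Average reach for tweets about the influencer's own group vs. other groups."""
    buckets = defaultdict(list)
    for t in records:
        same = t["influencer_group"] == t["incident_group"]
        buckets["own_group" if same else "other_group"].append(reach(t))
    return {key: mean(values) for key, values in buckets.items()}


summary = average_reach_by_relationship(tweets)
print(summary)
```

With a real sample, a simple average like this would only be a starting point, since follower counts and the timing of each incident would also affect reach.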

Our samples consisted of a random selection of the most influential users on Twitter from each demographic group. Certainly, had we selected more or different Twitter users, or different event dates and date ranges, we might have reached different conclusions.

Market Research

Many free services use Twitter’s API, which doesn’t go back in time to the dates that we were interested in. Additionally, each had similar limitations regarding the volume of ALL tweets that we might have been looking to analyze. This is where most of my work on this project was focused: creating our data source. Fortunately, I found a convenient Python package called “GetOldTweets,” but I still ran into some hurdles.

Major Hurdles

Multiple factors in the “GetOldTweets” package posed problems – by design, the package is written to pull a small number of tweets from one user in one set of dates. This would work if we were engaging in a small-scale project, but it posed challenges since we were looking to get a sample size that covered dozens of influencers, over multiple date ranges each.
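One way to work around the one-user, one-date-range design is to generate a job for every combination of user and date window and loop over them. The sketch below shows that idea only; the usernames, dates, and the fetch function are hypothetical stand-ins, and it does not call GetOldTweets or reproduce its API.

```python
from datetime import date
from itertools import product

# Hypothetical usernames and date windows, for illustration only.
usernames = ["influencer_one", "influencer_two"]
date_windows = [
    (date(2019, 1, 1), date(2019, 1, 8)),
    (date(2020, 1, 1), date(2020, 1, 8)),
]


def build_jobs(users, windows):
    """Return one (username, since, until) job per user and date window."""
    return [(user, since, until) for user, (since, until) in product(users, windows)]


def run_jobs(jobs, fetch):
    """Call fetch once per job and collect every returned tweet in one list.

    fetch is a placeholder for whatever scraper call is used; it is an
    assumption of this sketch, not part of any real package's API.
    """
    results = []
    for user, since, until in jobs:
        for tweet in fetch(user, since, until):
            results.append({"username": user, "since": since, "until": until, **tweet})
    return results


def fake_fetch(user, since, until):
    """Stand-in fetcher so the sketch runs without network access."""
    return [{"text": f"example tweet by {user}", "likes": 0, "retweets": 0}]


jobs = build_jobs(usernames, date_windows)
all_tweets = run_jobs(jobs, fake_fetch)
print(len(jobs), len(all_tweets))
```

Keeping the username and date window on every collected row makes it easier to tie each tweet back to the incident it relates to later in the analysis.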

Additionally, the Twitter server limits how many requests you can make at one time. Since the official API doesn’t let you go back in time more than 30 days to get historical data, GetOldTweets scrapes the HTML returned from flipping through pages of searches on Twitter. The problems posed were traditional Data Science problems with web scraping, but the flexibility of Data Science became apparent in how it can help provide a data source needed to answer Social Science questions.
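A common way to cope with request limits like these is to retry a failed request after a growing pause (exponential backoff). The following is a minimal sketch of that pattern, not the project's actual code; the RateLimited exception and the flaky_fetch function are hypothetical stand-ins for whatever error and scraper call a real tool would expose.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a scraper might raise when the server refuses a request."""


def fetch_with_backoff(fetch, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(*args), waiting longer after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(*args)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Stand-in fetcher that fails twice, then succeeds, so the sketch runs offline.
calls = {"count": 0}


def flaky_fetch(username):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited("too many requests")
    return [f"tweet from {username}"]


waits = []  # record the pauses instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, "influencer_one", sleep=waits.append)
print(result, waits)
```

Passing the sleep function in as a parameter is a design choice that lets the retry logic be tried out without real delays; in a real run the default time.sleep would be used.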

How Data Science Can Answer Non-Traditional Questions

Undoubtedly, the techniques that were used through the course of this project were Data Science oriented. To me, analysis of interactions and tweets on Twitter felt like something outside my expertise as a Data Scientist. However, I came to realize that even though I was using what are considered Data Science techniques, they could be broadly applied to answer a Social Science question.

Data science techniques for scraping can actually sit at the intersection of Data Science and Social Science. You can use related techniques to gather large bodies of tweets and use natural language processing (NLP) techniques to perform sentiment analysis, and data visualization to explore patterns in larger conversations. Even though these topics have a more Social Science focus, they require input data, which speaks to the broad application of Data Science as support for analytics in other sciences.

At the center of any kind of analysis lies data. Even though Data Science is so commonly associated with numbers and modeling, it’s clear that Data Science can step into the shoes of Social Science to answer questions that are not considered “traditional.”

Curious about the final product? You can find our visualization here, as well as a companion article by Dr. Laura Huang.

My Background

My name is Mitchell Jones. I’ve been working as a Data Analyst Intern for Visual Risk IQ for more than 2 years. I’m currently an undergraduate Business Analytics student at the University of North Carolina at Charlotte, with an interest in Data Science and Machine Learning.