Big Data Analysis for Social Scientists

This page describes a short course and associated R code being developed by Robert Ackland (Australian National University) and Timothy Graham (University of Queensland). The course is running as part of the ACSPRI Winter Program at the University of Queensland, Brisbane, 29 June - 3 July 2015.

This course introduces students to the collection and analysis of socially-generated 'big data' using the R statistical software and Gephi network visualisation software.

Big data involves data on: (1) people (the social web), e.g. online social networks such as Facebook and microblogs such as Twitter; (2) information (the WWW), e.g. web pages and clickstreams; (3) things (the sensor web), e.g. phones and temperature sensors; and (4) places (the geospatial web), e.g. geology and land use maps.

The focus of this course is on data from the social web and the WWW. Students will learn how to: (1) collect data from web pages, Twitter and Facebook; (2) construct, analyse and visualise networks of people and organisations (social networks) and terms (semantic networks); (3) extract and analyse text data; (4) conduct temporal analysis; (5) filter and sample from large datasets; and (6) identify and engage with advanced techniques for dealing with very large datasets.

We will focus on three sources of network and text data: Twitter, Facebook and the WWW. We will look at:

Who are the actors, and what actor attributes are available for them?
How can we find connections between actors, and how can we use social network analysis to understand the social scientific meaning of such connections?
What text can be attributed to these actors, and what does analysis of this text tell us about the actors and society as a whole?
How can we study behaviour over time, identifying significant events or trends?
What are some of the key methodological issues in using social media and WWW data to study individual and collective behaviour, and how can we address them?

The course will also provide an opportunity for students to learn about examples of 'best practice' social science big data research, and thus see how these data and techniques are already being used in social science.

Topics covered in the course

The following is an indicative list of topics covered in the course. The topics are listed as being either 'core' or 'advanced'. Core topics will be covered in detail, while the coverage of the advanced topics will depend on student interest. Prior to the course running we will ascertain student interest in particular topics and focus the course accordingly.

1. Data collection (core)

Collection of data from Twitter, Facebook and the WWW. We will also provide datasets (e.g. Twitter #auspol dataset) that will be used in the course.

2. Creation of networks (core)

Creation of different types of networks, including:

unimodal networks (one type of actor): (1) social networks (where network nodes are Twitter users, FB users, organisational websites), (2) semantic networks (where network nodes are terms extracted from tweets, FB posts, web pages)
bimodal networks (two types of actors) – for example, Twitter users and hashtags extracted from tweets, FB users and posts they have commented on or liked.
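As a rough illustration of how the two network types relate, the following is a minimal sketch in base R that builds a bimodal user-hashtag network and projects it onto a unimodal user-user network. The users, hashtags and variable names are made up for illustration, and base R matrices are used only so the sketch runs without extra packages; the course lists igraph for network analysis.

```r
# Hypothetical edge list: which user used which hashtag (made-up data).
edges <- data.frame(
  user    = c("alice", "alice", "bob", "bob", "carol"),
  hashtag = c("#auspol", "#budget", "#auspol", "#climate", "#climate"),
  stringsAsFactors = FALSE
)

# Bimodal network as an incidence matrix: rows are users, columns are hashtags.
incidence <- unclass(table(edges$user, edges$hashtag))

# Unimodal projection: entry [i, j] counts the hashtags that users i and j
# have in common, so two users are tied if they share at least one hashtag.
user_net <- incidence %*% t(incidence)
diag(user_net) <- 0  # drop self-ties

print(incidence)
print(user_net)
```

In this toy example alice and bob are tied through a shared hashtag, as are bob and carol, while alice and carol are not directly connected.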

3. Network analysis (core)

Introduction to social network analysis, covering the main network-level and node-level metrics, and clustering (‘community detection’).
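To give a sense of the difference between node-level and network-level metrics, the sketch below computes degree (node-level) and density (network-level) for a small undirected network in base R. The four actors and their ties are hypothetical, and these two metrics are chosen only as simple examples; the metrics actually covered in the course may differ.

```r
# Hypothetical undirected network of four actors (made-up ties).
actors <- c("A", "B", "C", "D")
adj <- matrix(0, nrow = 4, ncol = 4, dimnames = list(actors, actors))
ties <- rbind(c("A", "B"), c("A", "C"), c("B", "C"), c("C", "D"))
for (k in seq_len(nrow(ties))) {
  adj[ties[k, 1], ties[k, 2]] <- 1
  adj[ties[k, 2], ties[k, 1]] <- 1  # undirected: mirror each tie
}

# Node-level metric: degree (number of ties per actor).
degree <- rowSums(adj)

# Network-level metric: density (observed ties / possible ties).
# sum(adj) counts each undirected tie twice, matching the n * (n - 1)
# ordered pairs in the denominator.
n <- nrow(adj)
density <- sum(adj) / (n * (n - 1))

print(degree)
print(density)
```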

4. Visualisation techniques (core)

How to create high-quality network visualisations, publish interactive networks on the web, and generate ‘word clouds’ and dendrograms from network data.

5. Text analysis (core)

Supervised machine learning (e.g. support vector machines), unsupervised machine learning (topic modelling), sentiment analysis, hierarchical clustering, and descriptive analytics.
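Of the techniques listed, hierarchical clustering is the simplest to sketch without extra packages. The example below builds a small document-term matrix by hand and clusters the documents with hclust from base R's stats package. The four documents are made up, the tokenisation (splitting on spaces) is deliberately naive, and the choice of average linkage and two clusters is an assumption for illustration; the course lists tm for text mining.

```r
# Hypothetical documents (made-up text, already lower-cased and cleaned).
docs <- c(
  d1 = "budget tax budget cuts",
  d2 = "tax cuts budget",
  d3 = "climate carbon emissions",
  d4 = "carbon climate policy"
)

# Naive tokenisation: split each document on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term.
dtm <- t(vapply(
  tokens,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Hierarchical clustering on Euclidean distances between documents.
hc <- hclust(dist(dtm), method = "average")
groups <- cutree(hc, k = 2)  # cut the dendrogram into two clusters

print(dtm)
print(groups)
```

Here the two budget-related documents fall into one cluster and the two climate-related documents into the other; plot(hc) would draw the corresponding dendrogram.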

6. Temporal analysis (advanced)

Analysing networks and text over time, identifying significant changes in behaviour of individual nodes, clusters or entire networks.
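A very simple form of temporal analysis is to count activity per day and flag unusually busy days. The base R sketch below does this for a made-up set of tweet dates. The rule used to flag a day (more than twice the median daily count) is an arbitrary assumption chosen for illustration, not a method prescribed by the course.

```r
# Hypothetical tweet dates (made-up data): one entry per tweet.
tweet_dates <- as.Date(c(
  rep("2015-06-29", 3), rep("2015-06-30", 2), rep("2015-07-01", 12),
  rep("2015-07-02", 3), rep("2015-07-03", 2)
))

# Count tweets per day.
daily <- table(format(tweet_dates, "%Y-%m-%d"))
counts <- setNames(as.integer(daily), names(daily))

# Flag days whose count exceeds twice the median daily count
# (an arbitrary illustrative threshold).
spikes <- names(counts)[counts > 2 * median(counts)]

print(counts)
print(spikes)
```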

7. Filtering and sampling (advanced)

Techniques for targeting data collection and analysis on a particular set of actors. Reducing the scale of datasets via sampling.
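The two ideas can be sketched in a few lines of base R: first restrict a dataset to a chosen set of actors, then draw a random sample to reduce its size. The data frame, the list of target users and the sample size below are all hypothetical.

```r
# Hypothetical tweet dataset (made-up data).
tweets <- data.frame(
  user = c("alice", "bob", "carol", "alice", "dave", "bob", "erin", "alice"),
  text = c("t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"),
  stringsAsFactors = FALSE
)

# Filtering: keep only tweets by a particular set of actors.
target_users <- c("alice", "bob")
filtered <- tweets[tweets$user %in% target_users, ]

# Sampling: draw a simple random sample of rows without replacement.
set.seed(42)  # fix the seed so the sample is reproducible
sampled <- filtered[sample(nrow(filtered), size = 3), ]

print(filtered)
print(sampled)
```

Simple random sampling is only one option; other sampling designs may be more appropriate depending on the research question.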

8. Scaling up to very large datasets (advanced)

The data used in the course can be analysed on a desktop or laptop computer. But what if your dataset is too large for your computer, or you have a project in mind that involves massive amounts of data?

Tools used in the course

R and Gephi will be used. We will be using a selection of existing R libraries, but will aim to make use of 'wrapper functions' or possibly a custom R library so as to reduce the amount of coding that students need to undertake during the week (hence maximising the amount of content we can cover). All R source code will be available to students and will be released as open source code. The following is an indicative list of the R libraries used in the course:

igraph (network analysis and visualisation)
twitteR (for collecting Twitter data)
tm (text mining)
RTextTools (machine learning package for automatic text classification)
RCurl (collecting WWW data)
XML (reading and creating XML documents)
R.utils (programming utilities)
wordcloud (text word clouds)
ape and dendextend (dendrograms, hierarchical clustering)
FactoMineR and homals (multiple correspondence analysis)
plyr and stringr (text sentiment analysis)

This course is expected to take place in a computer lab. Course participants are welcome to bring a laptop with R installed if they prefer.

Prerequisites

Participants are advised to have taken at least one of the following ACSPRI courses or have had some equivalent exposure to social network analysis and quantitative text analysis:

Social Media Analysis
Introduction to Social Network Research and Network Analysis
Advanced Network Analysis for Social Research

Participants must also have some experience with the R programming language. You do not need to be an R expert but must have some familiarity with how to program in R (or other similar languages), for example via the following ACSPRI courses:

Learning R: Open Source (Free) Stats Package
Using R for Practical Research and Data Visualisation
Data Analysis, Graphics and Visualisation Using R

If the course is run with participants using their own laptops, participants will need to have installed: (1) the R statistical software plus a set of libraries which will be specified before the course starts; and (2) Gephi. Both R and Gephi run on Mac, Windows and Linux.