Workshop in Biostatistics

DATE: December 1, 2016
TIME: 1:30 - 2:50 pm
LOCATION: Medical School Office Building, Rm x303
TITLE: Mining Big Data to Extract Patterns and Predict Real-Life Outcomes

Michal Kosinski
Assistant Professor of Organizational Behavior, Stanford Graduate School of Business


This hands-on tutorial aims to introduce the participants to essential tools that can be used to obtain insights and build predictive models using large data sets. Recent user proliferation in the digital environment has led to the emergence of large samples containing a wealth of traces of human behaviors, communication, and social interactions. Such samples offer the opportunity to greatly improve our understanding of individuals, groups, and societies, but their analysis presents unique methodological challenges. In this tutorial, we discuss potential sources of such data and explain how to efficiently store them. Then, we introduce two methods that are often employed to extract patterns and reduce the dimensionality of large data sets: singular value decomposition and latent Dirichlet allocation. Finally, we demonstrate how to use dimensions or clusters extracted from data to build predictive models in a cross-validated way. We will provide a sample data set, allowing the participants to practice the methods discussed here.


1. Basic knowledge of R (If you have not previously used R, we recommend that you start by reading the official introduction to this powerful language for statistical programming.)

2. Bring your laptop

3. Install R and RStudio

4. Download this file: