Topics in Quantitative Sociology

Fall 2020 ENSAE

Big Data, Prediction & Explanation

The contemporary data deluge offers rich opportunities for sociologists to deploy old and new tools to study patterns of social life. We focus in particular on whether the most recent developments in AI and Large Language Models, since the notable advent of ChatGPT, help improve prediction of life outcomes.

Case-studies for reading and commentary

Note: All four papers are short, draw on the same data, and take part in the same prediction challenge (see the two Salganik pieces under the background readings section above; you can also listen to a dedicated podcast). For your commentary, choose one of the four papers.
In this article, the authors discuss and analyze their approach to the Fragile Families Challenge. The data consisted of more than 12,000 features (covariates) about the children and their parents, schools, and overall environments from birth to age 9. The authors’ modular and collaborative approach parallelized prediction tasks and relied primarily on existing data science techniques, including (1) data preprocessing: elimination of low variance features, imputation of missing data, and construction of composite features; (2) feature selection through univariate mutual information and extraction of nonzero least absolute shrinkage and selection operator coefficients; (3) three machine learning models: random forest, elastic net, and gradient-boosted trees; and finally (4) prediction aggregation according to performance. The top-performing submissions produced winning out-of-sample predictions for three outcomes: grade point average, grit, and layoff. However, predictions were at most 20 percent better than a baseline that predicted the mean value of the training data for each outcome.
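The comparison with a mean baseline in the abstract above can be made concrete. The following is a minimal sketch, assuming performance is scored as the relative reduction in holdout squared error against a predictor that always returns the training mean. The function names and toy numbers are illustrative and are not taken from the paper.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def improvement_over_mean_baseline(y_train, y_holdout, y_pred):
    """Relative reduction in holdout MSE versus predicting the training mean.

    0.0 means no better than the baseline, 0.20 means 20 percent better,
    and negative values mean worse than the baseline.
    """
    baseline = [mean(y_train)] * len(y_holdout)
    return 1.0 - mse(y_holdout, y_pred) / mse(y_holdout, baseline)


# Toy data (hypothetical GPA-like values, for illustration only).
y_train = [2.5, 3.0, 3.5, 3.0]
y_holdout = [2.0, 3.0, 4.0]
y_pred = [2.8, 3.0, 3.2]

score = improvement_over_mean_baseline(y_train, y_holdout, y_pred)
print(f"improvement over baseline: {score:.2f}")
```

Under this reading, "at most 20 percent better" corresponds to a score of at most 0.20, which leaves most of the holdout variance unexplained.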
Sociological research typically involves exploring theoretical relationships, but the emergence of “big data” enables alternative approaches. This work shows the promise of data-driven machine-learning techniques involving feature engineering and predictive model optimization to address a sociological data challenge. The author’s group develops improved generalizable models to identify at-risk families. Principal-components analysis and decision tree modeling are used to predict six main dependent variables in the Fragile Families Challenge, successfully modeling one binary variable but no continuous dependent variables in the diagnostic data set. This indicates that some binary dependent variables are more predictable using a reduced set of uncorrelated independent variables, and continuous dependent variables demand more complexity.
Survey data sets are often wider than they are long. This high ratio of variables to observations raises concerns about overfitting during prediction, making informed variable selection important. Recent applications in computer science have sought to incorporate human knowledge into machine-learning methods to address these problems. The authors implement such a “human-in-the-loop” approach in the Fragile Families Challenge. The authors use surveys to elicit knowledge from experts and laypeople about the importance of different variables to different outcomes. This strategy offers the option to subset the data before prediction or to incorporate human knowledge as scores in prediction models, or both together. The authors find that human intervention is not obviously helpful. Human-informed subsetting reduces predictive performance, and considered alone, approaches incorporating scores perform marginally worse than approaches that do not. However, incorporating human knowledge may still improve predictive performance, and future research should consider new ways of doing so.
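The subsetting option described above can be sketched in a few lines. This is a minimal illustration and not the authors' procedure: the variable names, ratings, threshold, and the `mean_scores` and `subset_by_rating` helpers are all hypothetical.

```python
from statistics import mean

# Hypothetical survey responses: each variable receives importance ratings
# (say, on a 1-5 scale) from several experts or laypeople.
ratings = {
    "mother_education": [5, 4, 5],
    "household_income": [4, 4, 3],
    "interview_month": [1, 2, 1],
}


def mean_scores(ratings):
    """Average the elicited ratings for each variable."""
    return {name: mean(values) for name, values in ratings.items()}


def subset_by_rating(rows, scores, threshold):
    """Keep only the variables whose mean rating meets the threshold."""
    keep = [name for name, score in scores.items() if score >= threshold]
    return [{name: row[name] for name in keep} for row in rows]


# Two toy observations (hypothetical values).
rows = [
    {"mother_education": 12, "household_income": 30_000, "interview_month": 6},
    {"mother_education": 16, "household_income": 55_000, "interview_month": 2},
]

scores = mean_scores(ratings)
narrow_rows = subset_by_rating(rows, scores, threshold=3.0)
print(sorted(narrow_rows[0]))
```

The same `scores` dictionary could instead be handed to a prediction model as per-variable weights, which corresponds to the second option in the abstract. How best to do that is left open here.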
The Fragile Families Challenge provided an opportunity to empirically assess the applicability of black-box machine learning models to sociological questions and the extent to which interpretable explanations can be extracted from these models. In this article the author uses neural network models to predict high school grade point average and examines how variations of basic network parameters affect predictive performance. Using a recently proposed technique, the author identifies the most important predictive variables used by the best performing model, finding that they relate to parenting and the child’s cognitive and behavioral development, consistent with prior work. The author concludes by discussing the implications of these findings for the relationship between prediction and explanation in sociological analyses.

Case-studies for written reviews