Ivaylo D. Petev - Patterns & Big Data

Topics in Quantitative Sociology

Fall 2020 ENSAE

Big Data, Prediction & Explanation

The contemporary data deluge offers rich opportunities for sociologists to deploy old and new tools to study patterns of social life. We focus in particular on whether the most recent developments in AI and Large Language Models, since the notable advent of ChatGPT, help improve prediction of life outcomes.

Background readings

Bail, 2023, arXiv, "Can Generative AI Improve Social Science?"
McFarland & al., 2015, AmSoc, "Sociology in the era of big data: The ascent of forensic social science"
Molina & Garip, 2019, ARS, "Machine learning for sociology"
Salganik & al., 2019, Socius, "Introduction to the special collection on the Fragile Families Challenge"
Salganik & al., 2020, PNAS, "Measuring the predictability of life outcomes with a scientific mass collaboration"
Ziems & al., 2023, arXiv, "Can large language models transform computational social science?"

Optional readings

Bail, 2014, TS, "The cultural environment: Measuring culture with big data"
Boelaert & Ollion, 2018, RFS, "The great regression. Machine learning, econometrics, and the future of quantitative social sciences"
Blei & al., 2003, JMLR, "Latent Dirichlet Allocation"
Colbaugh & al., 2012, arXiv, "Leveraging sociological models for predictive analysis"
Evans & Aceves, 2016, ARS, "Machine translation: Mining text for social theory"
Garip, 2020, PNAS, "What failure to predict life outcomes can teach us"
Grimmer & Stewart, 2013, PA, "Text as data: The promise and pitfalls of automatic content analysis methods for political texts"
Kitchin, 2014, BDS, "Big data, new epistemologies and paradigm shifts"
McFarland & al., 2013, Poet, "Differentiating language usage through topic models"
Mohr & Bogdanov, 2013, Poet, "Introduction - Topic models: What they are and why they matter"
Varian, 2014, JEP, "Big data: New tricks for econometrics"

Case-studies for reading and commentary

Note: All four papers are short, use the same data, and take part in the same prediction challenge (see the two Salganik pieces under the background readings above; a dedicated podcast is also available). For your commentary, choose one of the four papers.

Rigobon & al., 2019, Socius, "Winning Models for Grade Point Average, Grit, and Layoff in the Fragile Families Challenge"

In this article, the authors discuss and analyze their approach to the Fragile Families Challenge. The data consisted of more than 12,000 features (covariates) about the children and their parents, schools, and overall environments from birth to age 9. The authors’ modular and collaborative approach parallelized prediction tasks and relied primarily on existing data science techniques, including (1) data preprocessing: elimination of low variance features, imputation of missing data, and construction of composite features; (2) feature selection through univariate mutual information and extraction of nonzero least absolute shrinkage and selection operator coefficients; (3) three machine learning models: random forest, elastic net, and gradient-boosted trees; and finally (4) prediction aggregation according to performance. The top-performing submissions produced winning out-of-sample predictions for three outcomes: grade point average, grit, and layoff. However, predictions were at most 20 percent better than a baseline that predicted the mean value of the training data for each outcome.
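The closing comparison in this abstract can be made concrete with a small calculation. What follows is a minimal sketch, not the challenge's official scoring code: it assumes error is measured as mean squared error on held-out cases, and the function names and toy numbers are invented for illustration. A score of 0.20 would correspond to a model whose error is 20 percent lower than that of a baseline predicting the training mean for everyone.

```python
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def improvement_over_mean_baseline(y_train, y_holdout, y_pred):
    """Share of the baseline's holdout error that the model removes.

    The baseline predicts the training-set mean for every holdout case.
    0.0 means no better than the baseline; 1.0 means perfect prediction.
    (Hypothetical helper; the error metric is an assumption.)
    """
    baseline_pred = [mean(y_train)] * len(y_holdout)
    baseline_mse = mean_squared_error(y_holdout, baseline_pred)
    model_mse = mean_squared_error(y_holdout, y_pred)
    return 1 - model_mse / baseline_mse


# Toy numbers, invented for illustration only (not Fragile Families data).
y_train = [2.0, 3.0, 4.0]   # training outcomes, mean = 3.0
y_holdout = [2.0, 4.0]      # observed holdout outcomes
y_pred = [2.5, 3.5]         # hypothetical model predictions

score = improvement_over_mean_baseline(y_train, y_holdout, y_pred)
print(f"Improvement over mean baseline: {score:.2f}")
```

Reading the result as a share of baseline error removed makes it easy to compare outcomes measured on different scales, such as grade point average and grit.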

Compton, 2019, Socius, "A Data-Driven Approach to the Fragile Families Challenge: Prediction through Principal-Components Analysis and Random Forests"

Sociological research typically involves exploring theoretical relationships, but the emergence of “big data” enables alternative approaches. This work shows the promise of data-driven machine-learning techniques involving feature engineering and predictive model optimization to address a sociological data challenge. The author’s group develops improved generalizable models to identify at-risk families. Principal-components analysis and decision tree modeling are used to predict six main dependent variables in the Fragile Families Challenge, successfully modeling one binary variable but no continuous dependent variables in the diagnostic data set. This indicates that some binary dependent variables are more predictable using a reduced set of uncorrelated independent variables, and continuous dependent variables demand more complexity.

Filippova & al., 2019, Socius, "Humans in the Loop: Incorporating Expert and Crowd-Sourced Knowledge for Predictions Using Survey Data"

Survey data sets are often wider than they are long. This high ratio of variables to observations raises concerns about overfitting during prediction, making informed variable selection important. Recent applications in computer science have sought to incorporate human knowledge into machine-learning methods to address these problems. The authors implement such a “human-in-the-loop” approach in the Fragile Families Challenge. The authors use surveys to elicit knowledge from experts and laypeople about the importance of different variables to different outcomes. This strategy offers the option to subset the data before prediction or to incorporate human knowledge as scores in prediction models, or both together. The authors find that human intervention is not obviously helpful. Human-informed subsetting reduces predictive performance, and considered alone, approaches incorporating scores perform marginally worse than approaches that do not. However, incorporating human knowledge may still improve predictive performance, and future research should consider new ways of doing so.

Davidson, 2019, Socius, "Black-Box Models and Sociological Explanations: Predicting High School Grade Point Average Using Neural Networks"

The Fragile Families Challenge provided an opportunity to empirically assess the applicability of black-box machine learning models to sociological questions and the extent to which interpretable explanations can be extracted from these models. In this article the author uses neural network models to predict high school grade point average and examines how variations of basic network parameters affect predictive performance. Using a recently proposed technique, the author identifies the most important predictive variables used by the best performing model, finding that they relate to parenting and the child’s cognitive and behavioral development, consistent with prior work. The author concludes by discussing the implications of these findings for the relationship between prediction and explanation in sociological analyses.

Case-studies for written reviews

Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. 2016. Canon/Archive: Large-Scale Dynamics in the Literary Field. Stanford, California: Literary Lab.
Anderson, Gordon, Maria Grazia Pittau, and Roberto Zelli. 2016. “Assessing the Convergence and Mobility of Nations without Artificially Specified Class Boundaries.” Journal of Economic Growth 21(3):283–304. doi: 10.1007/s10887-016-9128-5.
Argyle, Lisa P., Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. “Out of One, Many: Using Language Models to Simulate Human Samples.” Political Analysis 31(3):337–51. doi: 10.1017/pan.2023.2.
Bail, Christopher A. 2012. “The Fringe Effect: Civil Society Organizations and the Evolution of Media Discourse about Islam since the September 11th Attacks.” American Sociological Review 77(6):855–79. doi: 10.1177/0003122412465743.
Bail, Christopher A., Brian Guay, Emily Maloney, Aidan Combs, D. Sunshine Hillygus, Friedolin Merhout, Deen Freelon, and Alexander Volfovsky. 2020. “Assessing the Russian Internet Research Agency’s Impact on the Political Attitudes and Behaviors of American Twitter Users in Late 2017.” Proceedings of the National Academy of Sciences 117(1):243–50. doi: 10.1073/pnas.1906420116.
Bolukbasi T., Saligrama V., Chang K.-W., Zou J., and Kalai A. 2016. “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.” In 30th Annual Conference on Neural Information Processing Systems.
Cardon, Dominique, Guilhem Fouetillou, and Camille Roth. 2021. “Two Paths of Glory — Structural Positions and Trajectories of Websites within Their Topical Territory.” Proceedings of the International AAAI Conference on Web and Social Media 5(1):58–65. doi: 10.1609/icwsm.v5i1.14101.
Centola, Damon. 2011. “An Experimental Study of Homophily in the Adoption of Health Behavior.” Science 334(6060):1269–72. doi: 10.1126/science.1207055.
Chu, Johan S. G., and James A. Evans. 2021. “Slowed Canonical Progress in Large Fields of Science.” Proceedings of the National Academy of Sciences 118(41):1–5.
DiMaggio, Paul, Manish Nag, and David Blei. 2013. “Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding.” Poetics 41(6):570–606. doi: 10.1016/j.poetic.2013.08.004.
Evans, James A. 2010. “Industry Induces Academic Science to Know Less about More.” American Journal of Sociology 116(2):389–452.
Gross, Neil, and Marcus Mann. 2017. “Is There a ‘Ferguson Effect?’ Google Searches, Concern about Police Violence, and Crime in U.S. Cities, 2014–2016.” Socius: Sociological Research for a Dynamic World 3:237802311770312. doi: 10.1177/2378023117703122.
Hofstra, Bas, Rense Corten, Frank Van Tubergen, and Nicole B. Ellison. 2017. “Sources of Segregation in Social Networks: A Novel Approach Using Facebook.” American Sociological Review 82(3):625–56. doi: 10.1177/0003122417705656.
Hunzaker, M. B. Fallin, and Lauren Valentino. 2019. “Mapping Cultural Schemas: From Theory to Method.” American Sociological Review 84(5):950–81. doi: 10.1177/0003122419875638.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84(5):905–49. doi: 10.1177/0003122419877135.
Latour, Bruno, Pablo Jensen, Tommaso Venturini, Sébastian Grauwin, and Dominique Boullier. 2012. “‘The Whole Is Always Smaller than Its Parts’ – a Digital Test of Gabriel Tardes’ Monads.” The British Journal of Sociology 63(4):590–615. doi: 10.1111/j.1468-4446.2012.01428.x.
Levy, Brian L., Nolan E. Phillips, and Robert J. Sampson. 2020. “Triple Disadvantage: Neighborhood Networks of Everyday Urban Mobility and Violence in U.S. Cities.” American Sociological Review 85(6):925–56. doi: 10.1177/0003122420972323.
Liu, David M., and Matthew J. Salganik. 2019. “Successes and Struggles with Computational Reproducibility: Lessons from the Fragile Families Challenge.” Socius: Sociological Research for a Dynamic World 5:1–21.
Mazieres, Antoine, Telmo Menezes, and Camille Roth. 2021. “Computational Appraisal of Gender Representativeness in Popular Movies.”
McKay, Stephen. 2019. “When 4 ≈ 10,000: The Power of Social Science Knowledge in Predictive Performance.” Socius: Sociological Research for a Dynamic World 5:237802311881177. doi: 10.1177/2378023118811774.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331(6014):176–82. doi: 10.1126/science.1199644.
Miller, Ian Matthew. 2013. “Rebellion, Crime and Violence in Qing China, 1722–1911: A Topic Modeling Approach.” Poetics 41(6):626–49. doi: 10.1016/j.poetic.2013.06.005.
Nowak, Adam, and Patrick Smith. 2017. “Textual Analysis in Real Estate.” Journal of Applied Econometrics 32(4):896–918. doi: 10.1002/jae.2550.
Roth, Camille, Antoine Mazières, and Telmo Menezes. 2020. “Tubes and Bubbles Topological Confinement of YouTube Recommendations” edited by T. P. Peixoto. PLOS ONE 15(4):e0231703. doi: 10.1371/journal.pone.0231703.
Salganik, Matthew J., Peter Sheridan Dodds, and Duncan J. Watts. 2006. “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market.” Science 311(5762):854–56. doi: 10.1126/science.1121066.
Salganik, Matthew J., and Duncan J. Watts. 2009. “Web‐Based Experiments for the Study of Collective Social Dynamics in Cultural Markets.” Topics in Cognitive Science 1(3):439–68. doi: 10.1111/j.1756-8765.2009.01030.x.
Spiro, Emma S., Zack W. Almquist, and Carter T. Butts. 2016. “The Persistence of Division: Geography, Institutions, and Online Friendship Ties.” Socius: Sociological Research for a Dynamic World 2:237802311663434. doi: 10.1177/2378023116634340.
Stanescu, Diana, Erik Wang, and Soichiro Yamauchi. 2019. “Using LASSO to Assist Imputation and Predict Child Well-Being.” Socius: Sociological Research for a Dynamic World 5:237802311881462. doi: 10.1177/2378023118814623.