Big Data and the Well Being of Women and Girls

Transcript

1 Big Data and the Well-Being of Women and Girls: Applications on the Social Scientific Frontier. April 2017

2 This report was written by Bapu Vaitla (overall); Claudio Bosco, Victor Alegana, Tom Bird, Carla Pezzulo, Graeme Hornby, Alessandro Sorichetta, Jessica Steele, Cori Ruktanonchai, Nick Ruktanonchai, Erik Wetter, Linus Bengtsson, and Andrew J Tatem (Section II); Riccardo Di Clemente, Miguel Luengo-Oroz, and Marta C. González (Section III); United Nations Global Pulse and University of Leiden, lead authors René Nielsen, Thomas Baar, and Felicia Vacarelu (Section IV, part 1); Munmun de Choudhury, Sanket Sharma, Tomaz Logar, Wouter Eekhout, and René Nielsen (Section IV, part 2). We thank Anoush Tatevossian and Robert Kirkpatrick for their guidance and access to key datasets. Editing and design by Julia Van Horn. We are grateful to Rebecca Furst-Nichols, Mayra Buvinic, Emily Courey Pryor, and Alba Bautista for helpful comments and guidance. This work was initiated by Data2X, a collaborative technical and advocacy platform dedicated to improving the quality, availability, and use of gender data in order to make a practical difference in the lives of women and girls worldwide. Data2X works with UN agencies, governments, civil society, academics, and the private sector to close gender data gaps, promote expanded and unbiased gender data collection, and use gender data to improve policies, strategies, and decision-making. Hosted at the United Nations Foundation, Data2X receives funding from the William and Flora Hewlett Foundation and the Bill & Melinda Gates Foundation. Cover photo © Elizabeth Whelan

3 Contents
Executive Summary
I. Introduction
II. Geospatial Data: High-resolution Mapping of Sex-Disaggregated Indicators
III. Digital Exhaust: Analyzing Economic Activity with Credit Card and Cell Phone Information
IV. Internet Activity: Sex-Disaggregation of Social Media Posts; Social Media Expression as Signaling Mental Health States
V. Conclusion: Reimagining the Revolution
Notes
References
Photo Credits

4 Executive Summary

Conventional forms of data (household surveys, national economic accounts, institutional records, and so on) struggle to capture detailed information on the lives of women and girls. The many forms of big data, from geospatial information to digital transaction logs to records of internet activity, can help close the global gender data gap. This report profiles several big data projects that quantify the economic, social, and health status of women and girls.

The first project, described in Section II (“Geospatial Data”), uses satellite imagery to greatly improve the spatial resolution of existing data on girls’ stunting, women’s literacy, and access to modern contraception in Bangladesh, Haiti, Kenya, Nigeria, and Tanzania. This project develops modeling techniques that use publicly available high-resolution geospatial data to infer similarly high-resolution patterns of social and health phenomena across entire countries. The approach takes advantage of the fact that many types of social and health data are correlated with geospatial phenomena. These relationships can predict social and health outcomes in areas where surveys have not been performed but correlated geospatial data is available. This project generated a series of highly detailed maps that clearly illustrate landscapes of gender inequality (see Figure A).

Figure A. Differences in stunting between girls and boys, Nigeria, 2013. Red areas are where girls’ stunting is higher than boys’ stunting, green where girls’ stunting is lower than boys’ stunting.

The second project, profiled in Section III (“Digital Exhaust”), utilizes anonymized credit card and cell phone data to describe patterns of women’s expenditure and mobility in a major Latin American metropolis. The credit card data includes 10 weeks of transactions from 150,000 users, with associated age, sex, and location information; for a subset of these credit card users, cell phone data is also available. The two types of information together create portraits of economic lifestyles: patterns of behavior that illustrate the needs and priorities

5 of women (see Figure B). Over a longer timeframe, such data could also reveal signals about how women are coping with a wide range of environmental and economic shocks and stressors.

Figure B. Frequency of women’s transactions in different expenditure categories, as assessed by credit card data. [Horizontal bar chart; x-axis: frequency, 0.00 to 0.12. Categories shown: grocery stores, supermarkets; eating places and restaurants; bridge and road fees, tolls; computer network/information services; miscellaneous food stores; service stations; insurance sales, underwriting and premiums; department stores; telecommunication services; manual cash disbursements; taxicabs and limousines; cable, satellite, and radio services; fast food restaurants; drug stores and pharmacies; direct marketing; computer software stores; motion picture theaters; women’s ready to wear stores; wholesale clubs; miscellaneous general merchandise stores.]

The third and fourth projects, profiled in Section IV (“Internet Activity”), concentrate on the expression of ideas and emotions on the social media platform Twitter. The third project develops and prototypes a tool for automatically identifying the sex of Twitter users, and then uses this method to quantify the concerns of women on a wide range of global development issues. The algorithm created in this project automates the process of looking up users’ names and pictures from Twitter profiles. Using open source software, the tool checks users’ names against a built-in database that contains sex information. If the name alone is insufficient to infer sex, the tool analyzes profile photos using face recognition software. The tool was tested on more than 50 million Twitter accounts across the world to understand the differing priorities of women and men on topics related to sustainable development (see Figure C).

The final project locates signals of depression in a large database of publicly available tweets from women and girls in India, South Africa, the United Kingdom, and the United States. The project uses machine learning techniques to identify genuine self-disclosures of mental illness from nearly 1.5 million social media posts and half a million Twitter users. The method accurately identifies mental illness in 96% of cases. The project also compares modes of linguistic expression and topical content across female and male users. Overall, the

6 findings reveal significant differences in how different sexes express mental health concerns on Twitter. The work suggests two major applications for monitoring and treatment. At the individual level, signals of mental illness could provoke response, either from the user’s community or through automated means from the social media platform itself (for example, offering counseling resources). At the population level, mental health trends can be monitored in near real-time, which may be especially useful following recessions, natural disasters, and other shocks.

This report illustrates the potential of big data in filling the global gender data gap. The rise of big data, however, does not mean that traditional sources of data will become less important. On the contrary, the successful implementation of big data approaches requires investment in proven methods of social scientific research, especially for validation and bias correction of big datasets. More broadly, the invisibility of women and girls in national and international data systems is a political, not solely a technical, problem. In the best case, the current “data revolution” will be reimagined as a step towards better “data governance”: a process through which novel types of information catalyze the creation of new partnerships to advocate for scientific, policy, and political reforms that include women and girls in all spheres of social and economic life.

Figure C. Trending topics among Twitter users in Nepal, May 2012-July 2015, disaggregated by sex. The chart compares how likely men (m) and women (f) are to tweet about each of the 16 topics.
A good education: m 11%, f 11%
Access to clean water and sanitation: m 3%, f 3%
Action taken on climate change: m 8%, f 6%
Affordable and nutritious food: m 3%, f 4%
An honest and responsive government: m 9%, f 7%
Better healthcare: m 2%, f 3%
Better job opportunities: m 6%, f 5%
Better transport and roads: m 5%, f 4%
Equality between men and women: m 9%, f 13%
Freedom from discrimination and persecution: m 8%, f 10%
Phone and internet access: m 4%, f 4%
Political freedoms: m 8%, f 6%
Protecting forests, rivers, and oceans: m 12%, f 10%
Protection against crime and violence: m 6%, f 9%
Reliable energy at home: m 4%, f 3%
Support for people who can’t work: m 3%, f 3%

7 I. Introduction

The term “big data” encompasses diverse types of information, from satellite imagery to cell phone records to internet activity. These forms of data differ in many ways, but all have digital origins, record observations at high frequency, and are massive in size. Such characteristics are invaluable in studying human well-being as it changes over time. Traditional data systems struggle to quantify trajectories of physical and mental health among a population, especially during and following economic recessions, natural disasters, and other unpredictable shocks. The problem is exacerbated, and present even during periods of relative economic stability, with respect to women and girls, who often work in the informal sector or at home, suffer social constraints on their mobility, and are marginalized in both private and public decision-making. Household surveys, national economic accounts, institutional records, and so on often do not successfully capture the lives of women and girls, especially at the kind of frequency needed to assess changes in economic and health status.

This report profiles groundbreaking approaches to using various kinds of big data to fill the global gender data gap. For each of three major big data categories (geospatial data, digital exhaust, and records of internet activity) we present exemplary research initiatives[a,b] conducted over the past two years:

• In Section II (“Geospatial Data”), researchers at the Flowminder Foundation and WorldPop project use satellite imagery to improve the spatial resolution of existing data on women and girls obtained from Demographic and Health Surveys (DHS) in Bangladesh, Haiti, Kenya, Nigeria, and Tanzania;
• In Section III (“Digital Exhaust”), researchers at the Massachusetts Institute of Technology (MIT), working with a colleague at United Nations Global Pulse (UNGP), utilize credit card and cell phone data to discern patterns of women’s expenditure and mobility in a major Latin American metropolis;
• In Section IV (“Internet Activity”), we look at two projects concentrating on the expression of ideas and emotions on the social media platform Twitter. In the first, researchers at UNGP and the University of Leiden create an algorithm for automatically identifying the sex of Twitter users, and then use this method to quantify the concerns of women across a wide range of global development issues. In the second project, researchers at Georgia Tech University, supported by colleagues at the University of Leiden and UNGP, locate signals of depression

a More detailed reports on each of these projects are available at http://data2x.org/resources
b Note that the first-person plural “we” is used throughout this report to refer in different sections to different groups of researchers. The relevant researchers for each section are listed on the inside front cover.

8 in a large database of publicly available tweets from women and girls in India, South Africa, the United Kingdom, and the United States.

In all cases, the projects yielded important new insights into the lives of women and girls. The sections that follow describe each in detail.

[Photo: Elizabeth Whelan]

9 II. Geospatial Data

High-resolution Mapping of Sex-Disaggregated Indicators

The big data conversation usually centers on novel forms of data, ignoring a valuable source of information that has been available in the public domain for decades: geospatial data. In recent years, the amount of freely accessible geospatial data, especially satellite imagery, has greatly expanded, spurred by increased investment from government agencies and private businesses. This data is increasingly fine-grained in both time and space: satellite imagery, for example, is now able to record rapid changes in both biophysical phenomena (for example, vegetation, soil cover, and water flows) and human infrastructure (for example, settlements, roads, and light intensity).

Equally high-resolution data on social and health indicators is critically needed, but still lacking. Human well-being varies considerably within countries, and development indicators assessed at national scales conceal these inequalities. Importantly, the status of women and girls in economically marginalized or geographically isolated communities is often unknown. Although four out of every five countries in the world regularly produce sex-disaggregated statistics at national or provincial scale, this data is not spatially refined enough to support local policymaking or program targeting.

To address this problem, we developed modeling techniques that use high-resolution geospatial data to infer similarly high-resolution patterns of social and health phenomena. This approach takes advantage of the fact that many types of social and health data (for example, child stunting, literacy, and access to modern contraception, the indicators we focus on in this case study) are correlated with geospatial phenomena that can be mapped in great detail across entire countries using satellite imagery. These relationships are then used to predict social and health outcomes in areas where surveys have not been performed.[1] The results are maps that provide entire landscapes of information on indicators of interest. The workflow is illustrated in Figure 1, and the methods are more fully explained in the following pages. We focus especially on outcomes related to girls and women, that is, girls’ stunting, women’s literacy, and contraceptive access; results for boys’ stunting and men’s literacy are presented in the accompanying technical report by Bosco et al. (2016).

Methods

The DHS program has been a leader in collecting and disseminating survey data on key development indicators in low- and middle-income countries. Large-sample household

10 data collection of this type, however, is costly, and so surveys are normally designed to be representative at the national or the largest subnational administrative level (typically called states or provinces). These areas often contain millions of people, and statistical assessments at such scales obscure substantial lower-level heterogeneity in social, economic, and health status. However, recent DHS surveys (and, increasingly, other household surveys) provide GPS coordinates for observations or clusters of observations, which enables us to utilize our geospatial modeling approach to improve the spatial resolution of DHS indicators.[2]

Figure 1. Workflow, geospatial modeling of well-being outcomes.
DHS Data: DHS surveys provide information on well-being outcomes (stunting, literacy, modern contraceptive use) in distinct locations.
Geospatial Data: Whole landscape maps of a broad set of geospatial variables (e.g. accessibility) are generated.
Predictive Modeling: The correlation of well-being outcomes with geospatial variables is assessed. The geospatial models best able to predict well-being in the survey locations are retained.
High-Res Well-being Data: Using the geospatial models, well-being is predicted across the whole landscape.

In this study, we focus on three countries in Sub-Saharan Africa (Kenya, Nigeria, and Tanzania), one country in South Asia (Bangladesh), and one country from the Western Hemisphere (Haiti); all have a low or medium human development index.[3] We use DHS data from the last several years on child stunting, literacy, and the use of modern contraception (hereafter collectively referred to as “well-being outcomes”), the first two of which are disaggregated by sex; only girls’ and women’s results are presented in this report.[c,4]

We chose geospatial variables, summarized in Table 1, by combing existing publicly available libraries for those variables that had previously shown correlation with the outcomes.[d,5] We then analyzed the relationship of these variables with stunting, literacy, and contraceptive access at each recorded survey location. The final step used these observed relationships to infer, using high-resolution landscape maps of each geospatial variable, outcomes in all non-survey locations. A continuous landscape of girls’ stunting, women’s literacy, and access to contraception was thus generated for each country.

c Children under age five whose height is more than two standard deviations below the median of the World Health Organization’s reference population are considered stunted. People ages 15-49 who attended at least secondary school or could read part of a sentence during the DHS interview are defined as literate. The current use of any modern method of contraception is asked of all women ages 15-49, but in Bangladesh only of ever-married women.
d Selecting the optimal subset of geospatial variables is critical for maximizing the ultimate predictive accuracy of a model: too few informative variables and the model will not explain much; too many and the resulting model may explain the observed data extremely well but perform badly when applied to new datasets.
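The three modeling steps described above (relate covariates to outcomes at survey locations, retain the covariates that predict best, then predict everywhere) can be sketched in Python. This is only an illustration under stated assumptions: ordinary least squares and greedy forward selection on cross-validated explained variance stand in for the models actually used, which are documented in Bosco et al. (2016) rather than here; all data are synthetic; and the names `fit_ols`, `predict_ols`, `cv_r2`, and `forward_select` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Return least-squares coefficients; an intercept column is prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    """Apply fitted coefficients to a covariate matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def cv_r2(X, y, k=5, seed=0):
    """Proportion of variance explained on held-out folds (k-fold CV)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        pred[fold] = predict_ols(fit_ols(X[train], y[train]), X[fold])
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y):
    """Greedily add covariates while cross-validated R^2 keeps improving."""
    chosen, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cv_r2(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # extra covariates no longer help out of sample
        chosen.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return chosen, best

# Synthetic stand-ins (assumptions): 300 survey clusters, 1,000 grid cells,
# and an outcome that truly depends on the first three covariates only.
rng = np.random.default_rng(42)
names = ["accessibility", "elevation", "nightlights", "noise_a", "noise_b"]
n_clusters, n_cells = 300, 1000
X_survey = rng.normal(size=(n_clusters, len(names)))
y_survey = (0.30 + 0.08 * X_survey[:, 0] - 0.05 * X_survey[:, 1]
            - 0.04 * X_survey[:, 2] + rng.normal(scale=0.03, size=n_clusters))

# Steps 1-2: assess relationships at survey locations, keep the best subset.
chosen, r2 = forward_select(X_survey, y_survey)
selected_names = [names[j] for j in chosen]
beta = fit_ols(X_survey[:, chosen], y_survey)

# Step 3: predict the outcome in every grid cell, clipped to a valid rate.
X_grid = rng.normal(size=(n_cells, len(names)))
surface = np.clip(predict_ols(beta, X_grid[:, chosen]), 0.0, 1.0)
print(selected_names, round(r2, 2), surface.shape)
```

Scoring each candidate subset on held-out folds rather than on the fitting data is one simple way to guard against the overfitting risk that footnote d describes; a real application would also need to account for spatial autocorrelation between nearby clusters, which this sketch ignores.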

11 Results

We first present results of the geospatial variable selection exercise, summarizing overall model performance and then listing the most strongly correlated set of geospatial variables for each indicator in each country. For selected indicators, we show maps comparing DHS survey results with the landscape of values generated by geospatial variables.

Table 1. Geospatial variables used in this study. See Bosco et al. (2016) for extended descriptions and sources.
Geospatial covariate: Description
Accessibility: Likely travel times between two points, a function of distance and infrastructure
Aridity, evapotranspiration: Weather station-based interpolation of moisture availability
Births: WorldPop-derived number of live births
Crop suitability: Rainfed crop suitability given crop/technology mix
Distance to conflicts: Nigeria only, between years 2010-13
Distance to health facility: Calculated from Open Street Map datasets
Distance to roads: Calculated from Open Street Map datasets
Distance to schools: Calculated from Open Street Map datasets
Economic productivity: Gross domestic product, calculated with economic data and geospatial correlates of economic activity
Elevation: Elevation above sea level
Ethnicity: Estimated distribution of ethnic groups
Land surface: Land biophysical properties estimated by reflectance
Livestock density: Modeled spatial distribution of livestock
Nightlights: Light intensity, denoting population density and electrification
Population density: Density inferred from settlement and land use patterns
Pregnancies: WorldPop-derived number of pregnancies
Protected areas: Geospatial conservation databases
Temperature/rainfall: Global climate layers
Urban/rural settlements: Estimated distance to settlements, country-specific datasets
Vegetation/land cover: Plant cover estimated by surface reflectance

First, we note that model performance varied greatly across indicators and countries (Figure 2). Models for girls’ stunting, for example, were inadequate for all countries except Nigeria. Geospatial variables were generally informative in building models for women’s literacy. For modern contraceptive use, models performed strongly in Tanzania and Nigeria. The results suggest that geospatial modeling requires careful investigation of a broad set of variables (even broader than the set explored here) and that some outcomes in some countries

12 may not be correlated well to any set of geospatial variables. In the present work, literacy appears to have strong geospatial correlates almost universally, while the performance of girls’ stunting and contraceptive use models depends on context.

Figure 2. Explanatory power of geospatial models, by country and indicator. Country/indicator models with no information shown (stunting in Tanzania and Haiti, literacy in Haiti, contraception in Kenya and Bangladesh) were not modeled, due to lack of sufficient survey indicator data. Boys’ stunting and male literacy are not shown. [Bar chart; y-axis: proportion of explained variance, 0 to 0.8; groups: women’s literacy, girls’ stunting, modern contraceptive use; series: Haiti, Kenya, Nigeria, Tanzania, Bangladesh.]

Figure 3. Geospatial correlates of girls’ and women’s well-being outcomes in the six best-performing models. Shaded box indicates that the variable was included in the final model. [Grid; columns: the six best-performing models, covering female literacy, modern contraceptive use, and girls’ stunting in Kenya, Nigeria, and Tanzania; rows: accessibility; aridity, evapotranspiration; births; crop suitability; distance to conflicts; distance to health facility; distance to roads; distance to schools; distance to waterways; economic productivity; elevation; ethnicity; land surface; livestock density; nightlights; population density; pregnancies; protected areas; temperature/rainfall; urban/rural settlements; vegetation/land cover.]

We also see that the optimum subset of geospatial variables differs by country and indicator, as shown in Figure 3 for the six best-performing models. Accessibility, a general indicator of transport infrastructure quality, is the only geospatial variable that is statistically significant in all six models; elevation and land surface are usually important, and aridity/evapotranspiration, distance to roads, temperature/rainfall, and the distance to urban and rural settlements are significant in most cases. The key message, however, is that the set of optimum geospatial variables depends on context; even the same indicator will have

13 different geospatial correlates in different locations.

We now turn to the core results, presented in a series of maps. Figure 4 makes clear the overall value of the approach. The top panel shows stunting rates for girls in the original DHS survey locations from the Nigeria 2013 dataset. The data appears as a scatter of points distributed unevenly throughout the country; between the survey locations are large areas of space in which stunting prevalence is not known. The bottom panel then shows the girls’ stunting landscape in 2013 as generated by the best-performing geospatial model (which includes those variables shaded in the “Girls’ stunting/Nigeria” column of Figure 3). We see a continuous gradient over the entire expanse of the country; not only broad geographic patterns but also differences within sub-regions of Nigeria become more evident.

Figure 4. DHS survey data for girls’ stunting in Nigeria in 2013 (top panel) and geospatially predicted landscape of girls’ stunting in the same year (bottom panel).

14 Geospatial modeling also unveils inequalities between girls and boys across the landscape. Figure 5 shows differences in 2013 stunting rates across sexes in Nigeria; positive values (colored in orange/red) indicate areas where girls have higher rates of stunting, and negative values (colored in green) where boys have higher rates (separate results for absolute levels of boys’ stunting are available in Bosco et al. 2016). Notably, the areas with higher absolute levels of girls’ stunting, as shown in the bottom panel of Figure 4, are not necessarily the areas of greatest inequality; the northeast, central, and southern urban regions of Nigeria appear to exhibit the largest disadvantage for girls. Overall, the map provides a fine-grained picture of inequality across the entire landscape.

Figure 5. Differences in stunting rate between girls and boys, Nigeria, 2013.

Figure 6 and Figure 7 show similar results for women’s literacy in Kenya and modern contraceptive use in Tanzania. Once again, we see that geospatial modeling can transform a limited number of survey data points distributed unevenly across the country into a continuous landscape of information.

This approach does face challenges. Some of the geospatial models we attempted were unable to accurately predict well-being outcomes, as shown earlier by Figure 2. It is possible that a wider set of geospatial covariates would have improved modeling performance. For many variables and locations an exploratory approach is necessary, as little theoretical guidance linking geospatial phenomena with well-being outcomes is available. In addition, the exact nature of the relationships between geospatial phenomena and well-being outcomes (linearity vs. non-linearity, for example) is also unclear.

Overall, however, geospatial modeling of women’s and girls’ social and health status shows great promise. Some of the maps produced in the present study, especially the maps of all three indicators in Nigeria and the map of women’s literacy in Kenya, have sufficiently

15 Figure 6. Women’s literacy in Kenya.

Figure 7. Modern contraceptive use in Tanzania.

low uncertainty to be utilized by policymakers seeking to target interventions at local administrative levels. We have in this section presented results on three indicators from five countries, but the modeling architecture can be extended to other indicators and countries for which DHS has information, as well as to other household surveys containing georeferenced data. The recent expansion of publicly available high-resolution satellite imagery offers a rich bounty of data for exploring geospatial correlations in the many countries where traditional data systems are insufficient to capture the status of women and girls.
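The sex-difference map in Figure 5 amounts to subtracting the predicted boys’ surface from the predicted girls’ surface, cell by cell. The minimal sketch below uses an invented 2 x 2 grid (the values are not from the report) and a hypothetical helper `sex_gap` to show why the cell with the highest girls’ stunting need not be the cell with the largest disadvantage for girls.

```python
import numpy as np

def sex_gap(girls, boys):
    """Girls' minus boys' predicted stunting prevalence, per grid cell."""
    girls = np.asarray(girls, dtype=float)
    boys = np.asarray(boys, dtype=float)
    if girls.shape != boys.shape:
        raise ValueError("both surfaces must cover the same grid")
    return girls - boys

# Invented prevalence surfaces (proportions), for illustration only.
girls = np.array([[0.40, 0.25],
                  [0.30, 0.35]])
boys = np.array([[0.35, 0.30],
                 [0.30, 0.20]])

gap = sex_gap(girls, boys)
girls_worse = gap > 0  # cells that would be colored orange/red in Figure 5

highest_level = np.unravel_index(np.argmax(girls), girls.shape)
largest_gap = np.unravel_index(np.argmax(gap), gap.shape)
print(highest_level, largest_gap)
```

In this toy grid the highest absolute level of girls’ stunting sits in one cell while the largest gap sits in another, mirroring the observation made about Nigeria.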

16 III. Digital Exhaust

Analyzing Economic Activity with Credit Card and Cell Phone Information

Digital technologies are ubiquitous, and their use leaves traces: records of the goods and services we consume, the places we go, and the people with whom we interact.[6] If information on the sex of the technology user is available, these types of “data exhaust” can offer insight in near-real time about the lives of women and girls.

The following pages describe a project that uses credit card and cell phone data to analyze patterns of economic activity among tens of thousands of women living in one of the most populated cities of Latin America.[e] We use credit card records (CCRs) to examine the expenditure priorities and patterns of mobility of different sexes, income levels, and ages.[7] Call detail records (CDRs), meanwhile, store information about the time, duration, and location of mobile phone calls, as well as the anonymized IDs of the people receiving calls. Past research has used CDRs to analyze social interactions, the laws of human mobility, and the economic welfare of users.[8]

For this project, we obtained over 10 weeks of anonymized individual credit card transactions from 150,000 users, with associated age, sex, and location information. The CCRs include data on the broad types of goods and services purchased, expenditure amounts, and the chronological sequence of transactions. For 10% of these credit card users, call detail record (CDR) data is also available. We used CCRs and CDRs together to describe economic lifestyles: patterns of behavior that illustrate the needs and priorities of individuals. A detailed analysis is done specifically for women in the sample.

Methods

Detecting differences in economic lifestyles is complicated by the fact that only a few categories of purchases dominate spending: most people, regardless of sex, wealth level, or age, spend most of their income on food, transport, and communication, as the “Results” section below discusses. This project thus delves deeper, looking not only at transaction type but also the order in which individuals made purchases, as well as their patterns of geographical mobility. Certain sequences of transactions may be repeated in an individual’s purchase history; for example, in panel A of Figure 8, sequence W1 captures grocery store expenditures (the shopping cart icon) followed by department store purchases (the gift box icon), while sequence W2 represents restaurant expenditures (the plate and silverware icon) followed by fuel purchases (the gas pump icon). Both W1 and W2 are repeated twice in this

e For contractual and privacy reasons, we cannot make details of the dataset publicly available. Upon request, the authors can provide anonymized data used for a subset of the analyses described below; the code to replicate methods is available upon request.

17 short transaction history.[f] Such patterns, more than ten thousand of which were detected in the CCR dataset, are the basis for inferring economic lifestyles. As each user’s sequences are analyzed, and data on mobility from geocoded transactions added, similarities begin to emerge; people cluster together into distinct economic lifestyles.

Figure 8. Examples of repeated sequences within a transaction history.

Mobile phone data further enriches our understanding of patterns of economic and social behavior, helping to delineate economic lifestyles even more distinctly. In this project, we focus on three types of information about individuals obtained from CDRs: mobility diversity, social diversity, and the radius of gyration (Figure 9).[9] Mobility diversity (how evenly an individual splits travel time across the various locations he/she visits) can be constructed using location information from CDRs, gathered by the towers through which cell phone signals pass. Social diversity quantifies how evenly an individual splits airtime across all people in his/her calling network. Finally, the radius of gyration defines the physical area where the user is most likely to be found.

Figure 9. Features of individual behavior obtained from a combination of call detail records and credit card records.

Results

We find that expenditures on food are the most important transactions for women, with over a quarter of transactions in grocery stores/supermarkets, eating places/restaurants, and miscellaneous food stores (Figure 10). A closer look shows expenditure patterns across sexes, ages, and income levels (Figure 11); notice that expenditure across sexes shows strong differences in some categories, while differences across income levels are minor. Women have more transactions than men with respect to grocery stores/supermarkets, insurance-related expenses, and department stores, while the opposite is true for restaurants and transport-related

f Sequences can be as short as two transactions or much longer. Longer sequences will occur less frequently in the transaction history.

18 expenses. In general, women report less total expenditure per capita than men, indicating that they have less access to economic resources in general, use credit cards less frequently, or both. Such patterns are likely to vary based on the nature of the economy and the prevailing economic circumstances; this analysis will be especially relevant in the wake of economic and environmental shocks, when little real-time, sex-disaggregated data is available.

Figure 10. Frequency of women’s transactions in each expenditure category. For example, 12% of all transactions in the CCR dataset were from grocery stores and supermarkets. [Horizontal bar chart; x-axis: frequency, 0.00 to 0.12. Categories shown: grocery stores, supermarkets; eating places and restaurants; bridge and road fees, tolls; computer network/information services; miscellaneous food stores; service stations; insurance sales, underwriting and premiums; department stores; telecommunication services; manual cash disbursements; taxicabs and limousines; cable, satellite, and radio services; fast food restaurants; drug stores and pharmacies; direct marketing; computer software stores; motion picture theaters; women’s ready to wear stores; wholesale clubs; miscellaneous general merchandise stores.]

Figure 11. Frequency of transactions by income, age, and gender in selected categories. [Three panels, each with x-axis: frequency, 0.02 to 0.14. Income: high, medium, low. Age: 20-34, 35-49, 50-64. Gender: female, male.]

Using the combination of credit card and cell phone data, we identified seven economic lifestyle clusters among women in the dataset (Figure 12). One of the

19 clusters, however, does not exhibit a strong pattern of sequences, and is best left unlabeled (cluster 5 in the figure). The transaction sequences within each of the other clusters is dominated by a single type of expenditure, and we use this type to help label the clusters as commuters, homemakers, youth tech-users, and diners (of which there are two types, as discussed further below). Commuters ’ primary transaction is toll fees, and their mobility metrics suggest that they is grocery stores; they travel long distances frequently. The core transaction of homemakers are less mobile, have less social network diversity, and spend less using credit cards. Women are overrepresented in this group, suggesting that women in this urban area perform tra- have taxis as their primary expenditure and live close to the ditional domestic roles. Youth city center. Tech-users are of similar age as youths, but computer and information services are their most important transaction and they have greater diversity in their social contacts Figure 12. Economic lifestyles of women in the dataset. Arrows indicate frequent transaction sequences; color bar indicates how common that sequence is among the people in the category. For example, commuters are likely to pay tolls followed by expenditure on more tolls, restaurants, groceries, fuel, computer and information services, and telecommunication services. 16 xhaust iii. Digital e

20 diners group has high and mobility networks, as well as higher spending overall. The first mobility diversity, high expenditures, and restaurants as the primary transaction; the second diners group has lower mobility diversity and lower expenditure, and miscellaneous food stores are the core expenditure. The identification of economic lifestyle clusters is a vital input for policy formulation. Subgroups within a population have distinct social and economic needs. For example, commuters may be hit hardest by fuel price increases, and the creation of inexpensive and efficient public transport systems may be an important investment in urban areas, especially those where low-income residential areas and job opportunities are not in proximity. The analysis above shows also that groups like homemakers and youth have distinct patterns of expenditure from other segments of the population. Information of this kind facilitates Figure 13. Characteristics of women users in each group (shaded polygons) vs. the entire sample of men (red line). Commuters Homemakers Youth Tech-users Unlabeled Diners 1 Diners 2 17 Big Data and the Well-Being of Women and Girls

21 an analysis of the relative costs and benefits of policies to improve (say) access to afford- able food, information services, or transport. The segregation of the diners illustrates that subgroups do not access food in the same way, and food policies—nutrition subsidies for at-risk groups, programs to incentivize low-cost grocery stores in “food desert” areas, and so on—must be tailored to the needs specific to each economic lifestyle. We also found that women’s economic lifestyle clusters differed in important ways from men’s. The diagrams in Figure 13 show median scores for social diversity, mobility diversity, age, distance from the city center, and total expenditure. The scores for women in each economic lifestyle group are represented by the shaded polygons; the average scores for all men in the sample are represented by the red line. There are clear differences between women in each group and men in the overall sample. Men have much greater mobility diversity than women commuters, and tend to live much closer to the city center, indicating that men may have better access to economic op- portunities. Women homemakers tend to be much less social and have reduced mobility compared to men, again with implications on market and non-market activity. Female youth, tech-users (especially), and diners also have much less social diversity than men; in this urban area, men appear to have greater numbers of social connections, while women have smaller networks. Young women also have a much smaller radius of movement, again pointing to a constrained economic and social world. Female tech-users are the only lifestyle group for whom expenditure is significantly greater than that of men in general, indicating that tech may be a sector in which women are finding more remunerative job opportunities. 
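As a rough illustration of the three CDR-derived features compared in Figure 13, the sketch below computes an evenness score (usable for both mobility diversity and social diversity) and a radius of gyration for a single user. The report does not give formulas, so two assumptions are made here: "evenness" is taken to be Shannon entropy divided by its maximum possible value, and radius of gyration is taken to be the root-mean-square distance of visits from their center of mass, with locations given as planar coordinates in kilometers. The function names and the sample values are hypothetical.

```python
import math

def normalized_entropy(weights):
    """Evenness of a split: 0 = everything in one place/contact, 1 = perfectly even.

    `weights` could be minutes spent near each tower (mobility diversity)
    or airtime with each contact (social diversity). Assumed formula.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

def radius_of_gyration(points):
    """RMS distance of visited points from their center of mass (assumed formula)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

# Hypothetical user: time split evenly across two towers located 2 km apart.
print(normalized_entropy([10, 10]))
print(radius_of_gyration([(0.0, 0.0), (2.0, 0.0)]))
```

Under these assumptions, a user can have high mobility diversity with a small radius of gyration (many places visited evenly within a small area), which is the combination the report describes for the first diners group.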
Two distinct types of female diners are present: one group with relatively high mobility diversity and expenditure but low radius of gyration (e.g., many different restaurants in a localized area; Diners 1), and another with low mobility diversity and expenditure but higher radius of gyration (e.g., people needing to travel relatively long distances to find cheaper food sources; Diners 2).

The current project does have some limitations. Most notable is a bias towards inclusion of relatively better-off individuals with access to credit cards and cell phones, although penetration of the latter into even isolated rural areas of the developing world is increasing. We did examine whether CCR users were representative of the general population, especially given that less than a quarter of the population in the research area use credit cards. The monthly expenditure of the CCR users was high relative to wages in most of the neighborhoods included in the project, confirming that (within neighborhoods) the sample is biased towards wealthier individuals, although, evaluating the sample as a whole, users from all income levels are well-represented.

Despite the potential for bias, the project does demonstrate that a combination of credit card and cell phone data can provide detailed insights into women’s economic behavior. This research could also serve as a basis for further work applying the proposed methodologies to other developing countries where mobile money, in addition to credit cards, is commonly used. The results above are drawn from a ten-week period in which economic conditions were relatively stable. Over a longer timeframe, our approach could also reveal signals about how women are coping with a wide range of shocks and stressors: environmental disasters, recessions, macroeconomic policy shifts, and so on. For example, reduced mobility among a low-expenditure group could signal that poorer economic classes are unable to afford the transport costs necessary for commuting and accessing markets and government services. Such early warning information would be valuable in designing and managing effective social protection systems.
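The repeated transaction sequences that underlie the lifestyle clusters in this section can be illustrated with a simple count of contiguous runs of merchant categories in one user's ordered history. This is only a sketch: the report does not specify how sequences were detected, so counting contiguous subsequences of two to four transactions is an assumption, as are the function name and the sample history.

```python
from collections import Counter

def repeated_sequences(history, min_len=2, max_len=4, min_count=2):
    """Return contiguous category sequences that occur at least `min_count` times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(history) - n + 1):
            counts[tuple(history[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}

# Hypothetical ordered transaction history for one card holder.
history = ["toll", "fuel", "grocery", "toll", "fuel", "restaurant", "toll", "fuel"]
print(repeated_sequences(history))
```

In this toy history only the two-step sequence toll, then fuel, repeats; every longer run occurs once, consistent with the report's remark that longer sequences occur less frequently.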

IV. Internet Activity

Sex-Disaggregation of Social Media Posts

Social media can help monitor public perceptions and measure global development priorities and impact. It can also provide insights into the differences and inequalities between people of different income, sex, age, race, ethnicity, migratory status, disability status, geographic location, and other characteristics. Sex disaggregation, in particular, can play an important role in providing information about the disparities between women and men. However, data from open social media channels such as Twitter may not indicate a person’s sex.

In this project, United Nations Global Pulse (UNGP) collaborated with the University of Leiden to develop and test a tool to infer the sex of Twitter users. The tool automates the process of looking up public information from Twitter profiles, especially the user’s name and profile picture. Using open source software, the tool checks users’ names against a built-in database of predefined names, built from sources such as official statistics that contain sex information. For cases in which name alone may not be enough to discern sex, the tool analyzes profile photos using face recognition software. The tool was applied to more than 50 million Twitter accounts from around the world to understand the different concerns and priorities of women and men on topics related to sustainable development.

Methods

A key objective in the development of this tool was to ensure that the approach could be applicable at a global scale and across different languages. The tool disaggregates social media posts based on several automated classification methods in a “waterfall” approach: starting with the classifiers with the highest overall success rate and, when results are unknown or indecisive, moving on to another classifier.
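The waterfall idea can be sketched in a few lines. The code below is not the project's script; it is a minimal illustration in which each classifier returns a label or `None` when it cannot decide, and the next classifier is tried only in the latter case. The toy name dictionary, the `by_name` and `by_image` functions, and the `image_label` field are hypothetical stand-ins for a name-dictionary lookup and a face-analysis service.

```python
from typing import Callable, Iterable, Optional

# A classifier returns "female", "male", or None when undecided.
Classifier = Callable[[dict], Optional[str]]

def waterfall(profile: dict, classifiers: Iterable[Classifier]) -> str:
    """Try classifiers in order and stop at the first decisive answer."""
    for classify in classifiers:
        label = classify(profile)
        if label is not None:
            return label
    return "unknown"

NAME_DICTIONARY = {"maria": "female", "john": "male"}  # toy dictionary, hypothetical

def by_name(profile: dict) -> Optional[str]:
    parts = profile.get("name", "").split()
    first = parts[0].lower() if parts else ""
    return NAME_DICTIONARY.get(first)

def by_image(profile: dict) -> Optional[str]:
    # Placeholder: a real tool would send the profile picture to a
    # face-analysis service here; this sketch just reads a precomputed label.
    return profile.get("image_label")

print(waterfall({"name": "Maria Lopez"}, [by_name, by_image]))
print(waterfall({"name": "xX_q_Xx", "image_label": "male"}, [by_name, by_image]))
print(waterfall({"name": ""}, [by_name, by_image]))
```

Ordering the list by expected success rate means the cheaper, more reliable classifier handles most accounts, and accounts that no classifier can resolve fall through to "unknown" rather than being forced into a category.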
The name classifier performed best in this study: a user’s name was compared to a pre-existing “name dictionary” showing whether the name was more likely to indicate a woman or a man. These results could be further improved by using country-specific name dictionaries, although this would require a more complex process of first determining the home country of a specific user. Since exact location of the user is often omitted in tweets, the approach would adopt a separate script for classifying location, after which a country-specific dictionary could be used. [g] If a country-specific dictionary were absent or results were indecisive, disaggregation would take place by relating a user’s name either to language-specific dictionaries or an aggregated set of several dictionaries. [h]

[g] See the technical report pertaining to the subsequent section (“Social Media Expression as Signaling Mental Health States”) for another approach to country inference.

[h] The name classification process and script used in this tool built upon the code of “Gender Computer” developed by TU Eindhoven. The code of Gender Computer can be accessed here: https://github.com/tue-mdse/genderComputer. For the overall script, we updated several of the dictionaries with new names and included additional country-specific dictionaries.

Figure 14. Global Pulse post-2015 tweets dashboard, conversations about equality between men and women.
[Dashboard topics: a good education; access to clean water and sanitation; action taken on climate change; affordable and nutritious food; an honest and responsive government; better healthcare; better job opportunities; better transport and roads; equality between men and women; freedom from discrimination; phone and internet access; political freedoms; protecting forests, rivers, and oceans; protection against crime and violence; reliable energy at home; support for people who can’t work.]

In addition to name classification, image recognition of a user’s profile picture demonstrated good results for sex disaggregation. For this project, the script classified a user’s profile picture with the free-to-use tool Face++. However, if multiple persons are identified in the same photo, the results can be inconclusive. For the purposes of this prototype tool, the algorithm chose the face for which sex is most reliably identifiable.

The script for sex disaggregation of social media accounts is open-source and readily available. [i] For illustrative purposes, a demo version of the tool itself has been made available online for inferring the sex of a person based on their Twitter user name, first name, or an image URL.

Results

To test the accuracy of the waterfall method (first deploying the name classifier, followed by image recognition), a public website was created.
The website allows users to manually determine whether a certain Twitter account is male or female. The accuracy of the hybrid classification approach (the waterfall combination of name and image recognition) was compared with this crowdsourced verification mechanism, the results of which were assumed to be correct. The automated classification approach accurately determined sex in 74% of cases, a reasonable result for a tool in its initial stages of development. [j] As described above, future work using more context-specific dictionaries should improve accuracy.

[i] GitHub repository: https://github.com/LU-C4i/gender_classifier

Figure 15. Trending topics of Twitter users in Nepal, May 2012-July 2015, disaggregated by sex. Compare how likely men and women are to tweet about each of the 16 topics.

Topic: men, women
A good education: 11%, 11%
Access to clean water and sanitation: 3%, 3%
Action taken on climate change: 8%, 6%
Affordable and nutritious food: 3%, 4%
An honest and responsive government: 9%, 7%
Better healthcare: 2%, 3%
Better job opportunities: 6%, 5%
Better transport and roads: 5%, 4%
Equality between men and women: 9%, 13%
Freedom from discrimination and persecution: 8%, 10%
Phone and internet access: 4%, 4%
Political freedoms: 8%, 6%
Protecting forests, rivers, and oceans: 12%, 10%
Protection against crime and violence: 6%, 9%
Reliable energy at home: 4%, 3%
Support for people who can’t work: 3%, 3%

The tool could be used for any study of tweets and other types of social media expression wherein the name and/or profile picture of users are available. For example, Global Pulse used the sex-disaggregation tool to improve an existing real-time online dashboard showing the volume of tweets around priority topics related to sustainable development, including gender equality (Figure 14). By filtering through 500 million daily tweets from over 50 million accounts for 25,000 keywords relevant to global development topics, this interactive dashboard showed which countries tweeted most about which topics between May 2012 and July 2015. [k]

To further refine the dashboard, the gender classification script was run over the entire dataset.
Once disaggregated by sex, the dashboard revealed new insights, highlighting the different concerns and priorities of women and men. For example, in Nepal, the sex-disaggregated data showed that women tweeted most on “equality between men and women” (Figure 15). In comparison, men tweeted most about “protecting forests, rivers and oceans.” In the second quarter of 2015, prompted by the earthquake that hit Nepal on April 25th, discussions were dominated by “support for people who cannot work” (a topic rarely mentioned previously) and “an honest and responsive government.” These two topics were widely mentioned by both men and women.

[j] With respect to privacy, the methodology uses publicly available data from Twitter profiles. Moreover, only publicly revealed gender markers such as the name and profile pictures of users were applied to building the tool. Users for whom the name and profile picture were insufficient to allow the classifier to detect sex were categorized as unknown.

[k] The project was initially developed by Global Pulse in collaboration with the UN Millennium Campaign and DataSift: http://post2015.unglobalpulse.net/

This project faced several obstacles. Because of the anonymity standards of Twitter, identifying authentic user data is not always possible. Context-specific name databases and other tools (for example, profile and linguistic style choices) can help improve prediction of sex; improvement on the current accuracy rate of 74% is almost certainly possible. Overall, however, the approach developed here advances sex-disaggregated analysis of social media, and by doing so provides a window into large databases of ideas and opinions.
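To make the disaggregation step concrete, the sketch below shows one way topic shares per sex, like those in Figure 15, could be computed once each tweet carries an inferred sex label and a topic. It is an illustration only: the input format, the function name, and the handful of sample tweets are hypothetical, and accounts classified as "unknown" are simply left out of the denominators, which is an assumption about how such cases were handled.

```python
from collections import Counter, defaultdict

def topic_shares_by_sex(tweets):
    """Return {sex: {topic: share of that sex's classified tweets}}."""
    counts = defaultdict(Counter)
    for sex, topic in tweets:
        if sex in ("female", "male"):  # skip accounts labeled "unknown"
            counts[sex][topic] += 1
    return {
        sex: {topic: n / sum(c.values()) for topic, n in c.items()}
        for sex, c in counts.items()
    }

# Hypothetical (sex, topic) pairs; not data from the dashboard.
tweets = [
    ("female", "equality"), ("female", "equality"), ("female", "education"),
    ("male", "forests"), ("unknown", "education"),
]
shares = topic_shares_by_sex(tweets)
print(shares)
```

Because shares are computed within each sex, the two columns can be compared directly even when one sex accounts for far more tweets than the other.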

Social Media Expression as Signaling Mental Health States

Several unique social and psychological characteristics are implicated in the mental health challenges experienced by women and girls. [10] In addition, poverty, inequality, and cultural expectations may heighten the risk of mental illness among women and girls. [11] Most of the publicly available data on mental health burden, however, comes from massive and infrequent exercises that rarely include sex-disaggregated information, especially in the developing world. [12] Methods of mental health assessment are also inconsistent across countries. [13] Overlooking sex-based differences can have drastic consequences, including misdiagnosis, inappropriate treatment, and constrained help-seeking. [14] More high-frequency data on mental illness and better understanding of the ways in which women and girls express their mental health concerns are thus needed. [15]

Research in recent years has proposed that social media data can help understand patterns of mental health in complement to more traditional assessments. [16] Here we present a gender-based, cross-cultural quantitative examination of mental health content shared on the social media platform Twitter. Using a dataset of half a million Twitter users and nearly 1.5 million posts from four countries (India, South Africa, the United Kingdom, and the United States), we employed machine learning techniques to identify genuine self-disclosures of mental illness from public social media posts. Comparison of these posts with content shared on online mental health support communities, as well as consultation with mental health professionals, suggests that the method accurately identifies genuine disclosures of mental illness in nearly all cases. We also compare modes of linguistic expression and topical content across female and male users.
Overall, the findings reveal significant differences in how different sexes and cultures express mental health concerns on Twitter, and suggest that unobtrusively gathered social media data can serve as an important source of mental health information.

Methods

The various steps of the methodology are described in the paragraphs that follow. First, we filtered English-language Twitter posts from March 2015 to create a sample of mental illness disclosures (“MID users”) containing any of the key phrases listed in Table 2. These phrases, denoting current experience with mental illness, were collated through reference to prior work as well as consultation with a practicing psychiatrist. [17] A control data sample of posts over the same period (“CTL users”) was also created; none of these posts contained the key phrases in Table 2. Sex and country information were then inferred for each post using an automated method. [l]

Table 2. Key phrases to filter for signals of mental health concerns.
  I want to end my life
  I want to suicide
  I want to die
  I thought of suicide
  I am depressed
  I [*] diagnosed [*] depression
  I attempted suicide
  I [have/had] depression
  Killing myself
  I [*] thinking of suicide
  I [*] diagnosed [*] mental illness
  I tried to suicide
  Ending my life
  I [have/had] mental illness

[l] Because Twitter does not allow individuals to self-report their sex and location information is often inaccurate, sex and country inference is necessary. For sex inference, we matched the self-reported name string in Twitter profile names with name databases from government and other sources. For country inference, we corrected location names using standard techniques and matched locations to various large geographic databases.

The key phrases in Table 2, however, may not indicate genuine disclosure of mental illness; for example, “when I have to wake up at 6am, I feel like killing myself” does not indicate suicidal intent. To eliminate such misleading posts, a machine learning method was used to compare the language of each Twitter user with the language of posts made by people who self-identify as suffering from mental illness on the Reddit sub-communities r/depression, r/mentalhealth, and r/SuicideWatch. [18] A similar process was used to validate the control dataset, but evaluated dissimilarity to the Reddit sample instead. A final qualitative validation exercise was also carried out: a licensed psychiatrist and two researchers experienced in mental health/social media research evaluated a subsample of 100 mental health disclosures, each from a different user. Overall, the machine learning approach in this project was 96% accurate in identifying genuine disclosures of mental health concerns.

Deeper analysis of the social media content followed. We developed linguistic measures (how users express themselves) and a topic model (what users are talking about) to quantify the differences between how female and male users disclosed their mental health concerns. Linguistic measures were divided into three categories (affective attributes, cognitive attributes, and linguistic style) and subtypes within these categories, drawn from previous psycholinguistic work (Table 3). [19]

With respect to affective attributes, the project considered positive affect (PA), negative affect (NA), and four other more specific measures of emotional expression: anger, anxiety, sadness, and swearing. Cognitive measures were divided into cognition and perception, which together evaluate cognitive complexity and emotional stability. [20] Finally, four measures of linguistic style were considered: lexical density, temporal measures, social/personal concerns, and interpersonal awareness/focus. These measures of linguistic style indicate one’s underlying psychological processes (lexical density), personality (temporal references), social support and connectivity (social/personal concerns), and awareness of one’s surroundings and environment (interpersonal focus). Prior work suggests that these cues are valuable in understanding mental health, including in social media expression. [21]

Table 3. Types of linguistic measures used.
(Attribute category, then subtype: description/example)

Affective attributes
  Positive affect: Expressions denoting positive moods (e.g. joy, energy, alertness)
  Negative affect: Expressions denoting negative moods (e.g. sadness, fear, lethargy)
  Anger: Expressions of anger
  Anxiety: Expressions of anxiety
  Sadness: Expressions of sadness
  Swearing: Use of swear words, denoting frustration, intensity of reaction
Cognitive attributes
  Cognition: Expression that reflects thought, possibly independent of external stimuli
  Perception: Expression that reflects sensory input (e.g. information received by seeing, hearing, feeling)
Linguistic style
  Lexical density: Nouns, adjectives, adverbs, and verbs as a proportion of all words
  Temporal measures: Use of past, present, and future tenses
  Social/personal concerns: Words pertaining to social engagement or self-engagement (e.g. words about family, friends, social work, health, etc.)
  Interpersonal awareness/focus: Use of 1st person singular, 1st person plural, 2nd person, and 3rd person pronouns

Results

As noted above, the machine learning approach accurately identified genuine mental health disclosures in nearly all social media posts we examined. We also observed considerable differences in the linguistic content and topical focus of Twitter posts of female and male users, as well as across cultures. [m]

First, when affective and cognitive attributes are aggregated into single categories, we see that females generally show higher scores in all linguistic measures (Figure 16).
This suggests a generally higher level of psycholinguistic expressiveness on social media by women and girls, a promising result for the objective of using such platforms to identify trends in mental health. Second, we see that the differences are even more pronounced in the MID (mental illness disclosure) user sample than in the CTL (control) sample.

[m] We focus on sex differences here in this summary. Please refer to De Choudhury et al. (2016) for a discussion of cultural differences.
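The first filtering step in the Methods, matching posts against the Table 2 key phrases, can be sketched with regular expressions. The report does not define the table's notation precisely, so this sketch assumes that `[*]` stands for zero to three intervening words and that `[have/had]` means either alternative; the function names are hypothetical and only a subset of the phrases is included.

```python
import re

# A subset of the Table 2 key phrases, in the table's own notation.
KEY_PHRASES = [
    "I want to die",
    "I am depressed",
    "Killing myself",
    "I [have/had] depression",
    "I [*] diagnosed [*] depression",
]

def phrase_to_regex(phrase):
    """Translate Table 2 notation into a case-insensitive regular expression."""
    tokens = phrase.split()
    pattern = ""
    for i, token in enumerate(tokens):
        if token == "[*]":
            # Assumption: the wildcard stands for zero to three filler words.
            pattern += r"(?:\S+\s+){0,3}?"
            continue
        if token.startswith("[") and token.endswith("]"):
            options = token[1:-1].split("/")
            pattern += "(?:" + "|".join(re.escape(o) for o in options) + ")"
        else:
            pattern += re.escape(token)
        if i < len(tokens) - 1:
            pattern += r"\s+"
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)

COMPILED = [phrase_to_regex(p) for p in KEY_PHRASES]

def is_candidate(post):
    """True if the post contains any key phrase; a candidate, not a confirmed disclosure."""
    return any(rx.search(post) for rx in COMPILED)

print(is_candidate("I was diagnosed with depression last year"))
print(is_candidate("when I have to wake up at 6am, I feel like killing myself"))
```

The second example is the report's own illustration of a misleading match: the phrase filter flags it, which is why the subsequent machine learning comparison against the Reddit sample is needed to separate genuine disclosures from figures of speech.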

Figure 16. Differences in linguistic measures between female and male users, disaggregated by mental illness disclosure (MID) and control sample (CTL). Positive values indicate higher scores for female users.
[Bar chart, CTL users vs. MID users; y-axis: absolute difference between female and male users (%), 0 to 9; categories: affective attributes, cognitive attributes, lexical density, temporal references, social/personal concerns, interpersonal awareness and focus.]

We can further break down the attribute categories. Female users in the MID sample show 15.4% higher sadness and 10.7% higher anxiety; prior literature indicates that expression of these emotions is associated with depression, mental instability, and feelings of helplessness, loneliness, and restlessness. However, female users also tend to use 7.1% more positive affect in their content, perhaps to demonstrate a positive outlook publicly despite the mental health challenges they are facing. Male users, on the other hand, express 2.6% more negative affect overall, including 5.3% higher anger and 9.5% more expressions with swearing.

“#depression has invaded my peace and #anxiety has exhausted my thoughts. Pain isn’t always physical” (female user)

“why am I even here... No one needs or wants me... I’m useless” (female user)

Females express fewer cognitive attributes on social media than do males. Lower usage of words that denote certainty, for example, may demonstrate heightened emotional instability. These differences in cognitive expression are not pronounced in the control sample, however, suggesting that experience of mental illness, not intrinsic differences between the sexes, is responsible for the observed gap.

We turn now to social/personal concerns and interpersonal focus, both subtypes within the linguistic style category. Male MID users display an 8.1% lower sense of achievement than women and girls, a known signal of reduced self-esteem. [22]
Female MID users, meanwhile, express 6.0% greater concern about their health and 2.7% greater concern about their body, which may indicate a greater self-awareness about their health or, alternatively, more fixation with social perceptions about their appearance. Another interesting finding is that male MID users exhibit lower use of words having to do with social concerns, friends, or family. Their female peers may be using such language more frequently in their Twitter posts to explicitly seek help from their social networks. The interpersonal focus metrics also reflect these patterns. Male MID users use first person pronouns 10.2% more than female MID users, but 3.0% less second person and 3.4% less third person pronouns, indicating that males tend to be less interactive. Once again, these differences are much less pronounced in the CTL sample. Mental illness appears to amplify differences between female and male expression on social media.

“Over the past 2 years I have been hit with physical and mental pain. The pain is real. It is still there.” (female user)

“Hard to really feel sick with this support group. #family” (female user)

“I miss having someone, a friend to talk to all night” (female user)

Our analysis of topics (what users were talking about) confirmed the patterns observed in the linguistic measures. We found that two topics were more likely to appear in male MID posts than in female MID posts. The first topic related to negative thoughts and hopelessness, and the second to detachment from the social realm and a hesitation to seek help. Female MID users expressed a positive outlook in coping with mental health challenges, as well as a desire for disclosure and help-seeking, to a much greater degree than their male peers. Women were also much more likely to share personal experiences around mental illness and engage in self-assessments.

“Sometimes I wonder if anyone still looks out for me. I am a mess that nobody wants to clean up. I’m a wreck” (male user)

“If I were going to kill myself, I wouldn’t tell anyone. If I’m already invisible, why see me to favor your own self-righteousness?” (male user)

This work provides some of the first detailed insights on patterns of mental health among girls and women using public social media posts. We found that female users expressed higher sadness and anxiety, but lower anger and negative affect than male counterparts. These observations align with prior work in social psychology. [23] Female mental health disclosers in our dataset also expressed greater social and familial concerns than did males. The literature indicates that women tend to rely more on the social network of family and community, whereas men exhibit a relative orientation towards public stoicism. [24] The topic analysis confirmed this pattern. Although much work remains to link these differences to specific mental health conditions and severity of illness, this data suggests that such research would indeed be fruitful.

“you’re afraid to tell people how you feel because you fear rejection, so you bury it deep inside yourself where it only destroys you more” (female user)

“I used to hurt myself because it was the only pain I could control.” (female user)
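The interpersonal focus comparisons reported above (for example, first person pronoun use being some percentage higher in one group than another) can be illustrated with a small sketch. It assumes, since the report does not say, that each measure is the share of all words falling in a pronoun word list and that the reported percentages are relative differences between group rates. The word lists are short hypothetical examples, not the lexicon used in the study.

```python
import re

# Short illustrative word lists (hypothetical; not the study's lexicon).
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"},
    "second": {"you", "your", "yours", "yourself"},
    "third": {"he", "him", "his", "she", "her", "hers", "they", "them", "their"},
}

def pronoun_rates(posts):
    """Share of all word tokens in `posts` that are 1st, 2nd, or 3rd person pronouns."""
    tokens = [t for post in posts for t in re.findall(r"[a-z]+", post.lower())]
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        person: sum(t in words for t in tokens) / total
        for person, words in PRONOUNS.items()
    }

def percent_difference(a, b):
    """How much larger `a` is than `b`, as a percentage of `b`."""
    return 100.0 * (a - b) / b

rates = pronoun_rates(["I miss my friend", "you are kind"])
print(rates)
print(percent_difference(0.11, 0.10))
```

Rates are normalized by the total number of words so that groups that post more, or write longer posts, are not mistaken for groups that use a given pronoun type more heavily.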

The present analysis has important limitations. The phrases used to filter for mental health concerns are not an exhaustive list of possible signals of depression, anxiety, or other states. In addition, our sample is not representative of the general population; it captures Twitter users, who are likely to be more affluent, more technologically skilled, and more willing to express themselves publicly about mental health issues than the population at large. Inferring overall mental health disorder prevalence rates from social media will clearly require validation surveys that precisely quantify bias.

Overall, however, we conclude that machine learning methods can filter through the immense amounts of data available to identify signals of illness with a high level of accuracy. This suggests two major applications for monitoring and treatment. At the individual level, signals of mental illness could provoke response, either from the user’s community or through automated means from the social media platform itself (for example, offering counseling resources). At the population level, given adjustments for biases in Internet use and other factors, mental health trends can be monitored in near real-time, which may be especially useful following acute events of social stress such as recessions, political crises, and natural disasters. Social media monitoring will not replace more formal approaches to mental health surveillance, but it can complement these other tools.
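One common way to make the "adjustments for biases" mentioned above is post-stratification: group-level rates observed in the biased sample are reweighted by each group's share of the target population, which would come from a census or validation survey. The report does not prescribe this method, so the sketch below is an assumption about how such an adjustment might look; the function name, the group labels, and every number are invented for illustration.

```python
def poststratified_rate(sample_rates, population_shares):
    """Weight each group's observed rate by its share of the target population."""
    missing = set(population_shares) - set(sample_rates)
    if missing:
        raise ValueError(f"no sample data for groups: {sorted(missing)}")
    total = sum(population_shares.values())
    return sum(
        sample_rates[group] * share / total
        for group, share in population_shares.items()
    )

# Invented numbers: the rate of a signal among sampled users in each group,
# and each group's share of the real population from a hypothetical survey.
sample_rates = {"urban": 0.10, "rural": 0.20}
population_shares = {"urban": 0.3, "rural": 0.7}
print(poststratified_rate(sample_rates, population_shares))
```

The adjustment only corrects for imbalance across the groups that are modeled; differences between social media users and non-users within a group still require the validation surveys the text calls for.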

V. Conclusion: Reimagining the Revolution

Big data is a valuable resource in the fine-grained measurement of women’s and girls’ well-being. Flowminder’s geospatial modeling work, based on satellite imagery, provides a high-resolution look at social and health outcomes as they vary over space; the same method could be used to create data systems that capture variation in welfare over time. Expenditure patterns inferred from credit card and cell phone records, as in the MIT-led work, provide a detailed look at economic activity across different social groups and over time. The social media-based projects achieve the same objective of high-spatial resolution, high-frequency measurement, but with a focus on emotions, thoughts, and ideas. Overall, this report illustrates the potential of big data in filling the global gender data gap.

In closing, however, we note that the rise of big data does not signify that traditional sources of data will become less important. On the contrary, the successful implementation of big data requires investment in proven methods of social scientific research, not least for the validation and bias correction of big datasets. For example, Flowminder’s work requires DHS or other types of survey data as a starting point, as well as field biophysical data to calibrate the interpretation of satellite imagery. Inferring women’s economic behavior from cell phone and credit card records demands ground-truthing work to assess how strongly, within a given culture and economy, these records reflect overall social and consumer behavior. Twitter users are a biased sample of society at large, and determining the magnitude and direction of that bias through surveys is critical if this information is to be useful in assessing population-level patterns.

More broadly, big data is not a panacea for all the challenges of development planning and research.
The invisibility of women and girls in international and national data discourse is a political, not solely a technical, problem. New methods can indeed illuminate previously ignored aspects of the lives of women and girls, but they can also create a sense that technical advancements alone will compel investments in gender-sensitive data systems by national statistical agencies, civil society organizations, and international donors. They will not. In the worst-case scenario, they will have the opposite effect: the data deluge may shift policy focus towards the groups and regions for which the most information is available, not the people and places in greatest need. Even big data illuminates only small parts of the entire field of human experience.

In the best case, however, the current “data revolution” will be reimagined as a step towards building “data governance”: a process through which novel types of data do not bring about instant, perfect knowledge about global development processes, but rather catalyze the creation of new partnerships for the sharing and interpretation of information. In the best case, projects that use big datasets will be informed by thoughtful hypotheses advanced by women and girls themselves, take a pragmatic but tireless approach to data policy reform in a decision-making world still dominated by men, and work in concert with advocates for the inclusion of women and girls in all spheres of social and economic life. These are the kinds of projects we have profiled in this report, and they hold great promise for the future of social science and policymaking.

Notes

1 See Alegana et al. 2015 and Sedda et al. 2015 for similar past work
2 ICF International 2012
3 HDRO (UNDP) 2015
4 KNBS 2010; NBS 2011; NPC 2014; NIPORT 2013; Cayemittes et al. 2013
5 Alegana et al. 2015; Gething et al. 2015
6 Lazer et al. 2009
7 Yoshimura et al. 2009; Krumme et al. 2013; Giles 2012
8 Toole et al. 2015; Gonzalez, Hidalgo, and Barabasi 2008; Jiang et al. 2016; Song et al. 2010; Blumenstock, Cadamuro, and On 2015; Lenormand et al. 2015; Louail et al. 2014; Çolak et al. 2016
9 Eagle, Macy, and Claxton 2010; Gonzalez, Hidalgo, and Barabasi 2008; Pappalardo et al. 2015
10 Cauce et al. 2002; Taylor and Brown 1988
11 Wang et al. 2000
12 WHO 2001
13 Spector 2002
14 Taylor and Brown 1998; Spector 2002
15 Ormel et al. 1994; Patel et al. 1999
16 Coppersmith, Dredze, and Harman 2014; Coppersmith, Harman, and Dredze 2014; Culotta 2014; De Choudhury, Counts, and Horvitz 2013; De Choudhury et al. 2014; De Choudhury et al. 2013; Eichstaedt et al. 2015; Homan et al. 2014
17 Coppersmith, Dredze, and Harman 2014; Coppersmith, Harman, and Dredze 2014
18 Zhu, Ghahramani, and Lafferty 2003; De Choudhury et al. 2016
19 Pennebaker, Francis, and Booth 2001; Chung and Pennebaker 2007; De Choudhury and De 2014
20 Gross and Muñoz 1995
21 Ramirez-Esparza 2008
22 Chancellor 2016
23 Lieberman and Goldstein 2006
24 Guillemin, Bombardier, and Beaton 1993

Photo Credits

Cover, pages 5, 19, and 30: © Elizabeth Whelan. All rights reserved. ElizabethWhelanPhotography.com
Page 28: Adam Cohn, “Indian Woman with Smartphone.” Some rights reserved, CC BY-NC-ND 2.0 (Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic) license. Cropped. AdamCohn.com

References

Alegana, Victor A., Peter M. Atkinson, Carla Pezzulo, Alessandro Sorichetta, D. Weiss, T. Bird, E. Erbach-Schoenberg, and Andrew J. Tatem. 2015. “Fine resolution mapping of population age-structures for health and development applications.” Journal of the Royal Society Interface 12, no. 105: 20150073.
Blumenstock, Joshua, Gabriel Cadamuro, and Robert On. 2015. “Predicting poverty and wealth from mobile phone metadata.” Science 350, no. 6264: 1073-1076.
Cauce, Ana Mari, Melanie Domenech-Rodriguez, Matthew Paradise, Bryan N Cochran, Jennifer Munyi Shea, Debra Srebnik, and Nazli Baydar. 2002. “Cultural and contextual influences in mental health help seeking: a focus on ethnic minority youth.” Journal of Consulting and Clinical Psychology 70, no. 1: 44.
Cayemittes, Michel, Michelle Fatuma Busangu, Jean de Dieu Bizimana, Bernard Barrère, Blaise Sévère, Viviane Cayemittes, and Emmanuel Charles. 2013. Enquête Mortalité, Morbidité et Utilisation des Services, Haïti, 2012. Calverton, Maryland, USA: MSPP, IHE, and ICF International.
Chancellor, Stevie, Zhiyuan (Jerry) Lin, Erica Goodman, Stephanie Zerwas, and Munmun De Choudhury. 2016. “Quantifying and predicting mental illness severity in online pro-eating disorder communities.” In Proceedings of the 19th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 1171-1184. ACM.
Chung, Cindy, and James W Pennebaker. 2007. “The psychological functions of function words.” Social Communication: 343-359.
Çolak, Serdar, Antonio Lima, and Marta C. González. 2016. “Understanding congested travel in urban areas.” Nature Communications 7: 10793. doi:10.1038/ncomms10793.
Coppersmith, Glen, Craig Harman, and Mark Dredze. 2014. “Measuring post traumatic stress disorder in Twitter.” In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM), pp. 579-582.
Coppersmith, Glen, Mark Dredze, and Craig Harman. 2014. “Quantifying mental health signals in Twitter.” In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: from Linguistic Signal to Clinical Reality, pp. 51-60. Baltimore, Maryland: Association for Computational Linguistics.
Culotta, Aron. 2014. “Estimating county health statistics with Twitter.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1335-1344. ACM.
De Choudhury, Munmun, and Sushovan De. 2014. “Mental health discourse on Reddit: Self-disclosure, social support, and anonymity.” In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM).
De Choudhury, Munmun, Emre Kiciman, Mark Dredze, Glen Coppersmith, and Mrinal Kumar. 2016. “Discovering shifts to suicidal ideation from mental health content in social media.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2098-2110. ACM.

De Choudhury, Munmun, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. “Predicting depression via social media.” In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM).
De Choudhury, Munmun, Scott Counts, and Eric Horvitz. 2013. “Social media as a measurement tool of depression in populations.” In Proceedings of the 5th Annual ACM Web Science Conference, pp. 47-56. ACM.
De Choudhury, Munmun, Scott Counts, Eric Horvitz, and Aaron Hoff. 2014. “Characterizing and predicting postpartum depression from Facebook data.” In Proceedings of the ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM.
Eagle, Nathan, Michael Macy, and Rob Claxton. 2010. “Network diversity and economic development.” Science 328, no. 5981: 1029-1031.
Eichstaedt, Johannes C., Hansen Andrew Schwartz, Margaret L Kern, Gregory Park, Darwin R Labarthe, Raina M Merchant, Sneha Jha, Megha Agrawal, Lukasz A Dziurzynski, Maarten Sap, Christopher Weeg, Emily E. Larson, Lyle H. Ungar, and Martin E.P. Seligman. 2015. “Psychological language on Twitter predicts county-level heart disease mortality.” Psychological Science 26, no. 2: 159-169. doi:10.1177/0956797614557867.
Gething, Peter, Andy Tatem, Tom Bird, and Clara R. Burgert-Brucker. 2015. “Creating Spatial Interpolation Surfaces with DHS Data.” DHS Spatial Analysis Reports No. 11. Rockville, Maryland, USA: ICF International.
Giles, Jim. 2012. “Making the links.” Nature 488, no. 7412: 448-450.
Gonzalez, Marta C., Cesar A Hidalgo, and Albert-Laszlo Barabasi. 2008. “Understanding individual human mobility patterns.” Nature 453, no. 7196: 779-782.
Gross, James J., and Ricardo F Muñoz. 1995. “Emotion regulation and mental health.” Clinical Psychology: Science and Practice 2, no. 2: 151-164.
Guillemin, Francis, Claire Bombardier, and Dorcas Beaton. 1993. “Cross-cultural adaptation of health related quality of life measures: literature review and proposed guidelines.” Journal of Clinical Epidemiology 46, no. 12: 1417-1432.
Homan, Christopher M., Naiji Lu, Xin Tu, Megan C Lytle, and Vincent Silenzio. 2014. “Social structure and depression in TrevorSpace.” In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 615-625. ACM.
Human Development Report Office (HDRO), United Nations Development Programme (UNDP). 2015. Human Development Report 2015: Work for Human Development. New York: United Nations Development Programme.
ICF International. 2012. Demographic and Health Survey Sampling and Household Listing Manual. MEASURE DHS, Calverton, Maryland, U.S.A.: ICF International.

Jiang, Shan, Yingxiang Yang, Siddharth Gupta, Daniele Veneziano, Shounak Athavale, and Marta C. González. 2016. “The TimeGeo modeling framework for urban mobility without travel surveys.” Proceedings of the National Academy of Sciences: 201524261. doi:10.1073/pnas.1524261113.
Kenya National Bureau of Statistics (KNBS) and ICF Macro. 2010. Kenya Demographic and Health Survey 2008-09. Calverton, Maryland: KNBS and ICF Macro.
Krumme, Coco, Alejandro Llorente, Manuel Cebrian, and Esteban Moro. 2013. “The predictability of consumer visitation patterns.” Scientific Reports 3: 1645.
Lazer, David, Alex Sandy Pentland, Lada Adamic, Sinan Aral, Albert Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Guttman, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Life in the network: the coming age of computational social science.” Science 323, no. 5915: 721-723.
Lenormand, Maxime, Thomas Louail, Oliva G. Cantú-Ros, Miguel Picornell, Ricardo Herranz, Juan Murillo Arias, Marc Barthelemy, Maxi San Miguel, and José J. Ramasco. 2015. “Influence of sociodemographic characteristics on human mobility.” Scientific Reports 5: 10075. doi:10.1038/srep10075.
Lieberman, Morton A., and Benjamin A Goldstein. 2006. “Not all negative emotions are equal: The role of emotional expression in online support groups for women with breast cancer.” Psycho-Oncology 15, no. 2: 160-168.
Louail, Thomas, Maxime Lenormand, Oliva G. Cantu Ros, Miguel Picornell, Ricardo Herranz, Enrique Frias-Martinez, José J. Ramasco, and Marc Barthelemy. 2014. “From mobile phone data to the spatial structure of cities.” Scientific Reports 4: 5276. doi:10.1038/srep05276.
National Bureau of Statistics (NBS) [Tanzania] and ICF Macro. 2011. Tanzania Demographic and Health Survey 2010. Dar es Salaam, Tanzania: NBS and ICF Macro.
National Institute of Population Research and Training (NIPORT), Mitra and Associates, and ICF International. 2013. Bangladesh Demographic and Health Survey 2011. Dhaka, Bangladesh and Calverton, Maryland, USA: NIPORT, Mitra and Associates, and ICF International.
National Population Commission (NPC) [Nigeria] and ICF International. 2014. Nigeria Demographic and Health Survey 2013. Abuja, Nigeria, and Rockville, Maryland, USA: NPC and ICF International.
Ormel, Johan, Michael VonKorff, T Bedirhan Ustun, Stefano Pini, Ailsa Korten, and Tineke Oldehinkel. 1994. “Common mental disorders and disability across cultures: results from the WHO collaborative study on psychological problems in general health care.” Journal of the American Medical Association 272, no. 22: 1741-1748.
Pappalardo, Luca, Dino Pedreschi, Zbigniew Smoreda, and Fosca Giannotti. 2015. “Using big data to study the link between human mobility and socio-economic development.” In Proceedings of the 2015 IEEE International Conference on Big Data (Big Data), pp. 871-878. IEEE Computer Society. doi:10.1109/BigData.2015.7363835.

Patel, Vikram, Ricardo Araya, Mauricio de Lima, Ana Ludermir, and Charles Todd. 1999. “Women, poverty and common mental disorders in four restructuring societies.” Social Science & Medicine 49, no. 11: 1461-1471.
Pennebaker, James W., Martha E Francis, and Roger J Booth. 2001. “Linguistic inquiry and word count: LIWC 2001.” Mahwah: Lawrence Erlbaum Associates, 71:2001.
Ramirez-Esparza, Nairan, Cindy K Chung, Ewa Kacewicz, and James W Pennebaker. 2008. “The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches.” In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM).
Sedda, Luigi, Andrew J. Tatem, David W. Morley, Peter M. Atkinson, Nicola A. Wardrop, Carla Pezzulo, Alessandro Sorichetta, Joanna Kuleszo, and David J. Rogers. 2015. “Poverty, health and satellite-derived vegetation indices: their inter-spatial relationship in West Africa.” International Health 7, no. 2: 99-106.
Song, Chaoming, Zehui Qu, Nicholas Blumm, and Albert-László Barabási. 2010. “Limits of predictability in human mobility.” Science 327, no. 5968: 1018-1021.
Spector, Rachel E. 2002. “Cultural diversity in health and illness.” Journal of Transcultural Nursing 13, no. 3: 197-199.
Taylor, Shelley E., and Jonathon D Brown. 1988. “Illusion and well-being: a social psychological perspective on mental health.” Psychological Bulletin 103, no. 2: 193.
Toole, Jameson L., Carlos Herrera-Yagüe, Christian M. Schneider, and Marta C. González. 2015. “Coupling human mobility and social ties.” Journal of the Royal Society Interface 12, no. 105: 20141128. doi:10.1098/rsif.2014.1128.
Wang, Xiangdong, Lan Gao, Naotaka Shinfuku, Huabiao Zhang, Chengzhi Zhao, and Yucun Shen. 2000. “Longitudinal study of earthquake-related PTSD in a randomly selected community sample in North China.” American Journal of Psychiatry 157, no. 8: 1260-1266. doi:10.1176/appi.ajp.157.8.1260.
World Health Organization. 2001. The World Health Report 2001: Mental health: new understanding, new hope. World Health Organization.
Yoshimura, Yuji, Stanislav Sobolevsky, Juan N. Bautista Hobin, Carlo Ratti, and Josep Blat. 2016. “Urban association rules: uncovering linked trips for shopping behavior.” Environment and Planning B: Planning and Design: 0265813516676487.
Zhu, Xiaojin, Zoubin Ghahramani, and John Lafferty. 2003. “Semi-supervised learning using Gaussian fields and harmonic functions.” In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pp. 912-919.


@Data2X www.data2x.org
