Secondary Data Analysis: An Introduction for Psychologists
Edited by
Kali H. Trzesniewski, M. Brent Donnellan,
and Richard E. Lucas
Copyright © 2011 by the American Psychological Association. All rights reserved. Except as
permitted under the United States Copyright Act of 1976, no part of this publication may
be reproduced or distributed in any form or by any means, including, but not limited to, the
process of scanning and digitization, or stored in a database or retrieval system, without the
prior written permission of the publisher.
Published by:
American Psychological Association
750 First Street, NE
Washington, DC 20002
www.apa.org

To order:
APA Order Department
P.O. Box 92984
Washington, DC 20090-2984
Tel: (800) 374-2721; Direct: (202) 336-5510
Fax: (202) 336-5502; TDD/TTY: (202) 336-6123
Online: www.apa.org/pubs/books/
E-mail: [email protected]

In the U.K., Europe, Africa, and the Middle East, copies may be ordered from:
American Psychological Association
3 Henrietta Street
Covent Garden, London
WC2E 8LU England
The opinions and statements published are the responsibility of the authors, and such opinions and statements do not necessarily represent the policies of the American Psychological Association.
CONTENTS
Contributors
Introduction
    M. Brent Donnellan, Kali H. Trzesniewski, and Richard E. Lucas
INTRODUCTION
M. BRENT DONNELLAN, KALI H. TRZESNIEWSKI,
AND RICHARD E. LUCAS
We use the term secondary data analysis largely to refer to existing national studies like
Add Health, the BHPS, and MTF. However, we also use the term more
loosely to refer to archives of existing data that are available to researchers
who were not involved in the original data acquisition (see, e.g., Chapter 13,
this volume). Specifically, we hope this volume provides readers with an
introduction to the research possibilities that can be realized through the
analysis of existing data and provides psychologists with a set of accessible
methodological primers that can help them begin their own secondary data
analyses. This book is designed to serve as a springboard for the further devel-
opment of methodological skills and is not presented as an exhaustive com-
pendium that covers all of the issues (Bulmer, Sturgis, & Allum, 2009). To
facilitate further learning, the authors present a list of recommended readings
at the end of each chapter.
school seniors in the United States). Many of these resources are free or avail-
able at a very low cost to qualified researchers. Thus, learning how to analyze
secondary data sets can provide individual researchers with the raw material
to make important contributions to the scientific literature using data sets with
impressive levels of external validity. This advantage is the primary reason we
advocate secondary data analysis.
A second advantage is that the analysis of secondary data sets encourages
an “open source” approach to science. Researchers using the same resource
should be able to replicate findings using similar analyses. This fact encourages
careful reporting of analysis and a reasoned justification for all analytic deci-
sions (e.g., explaining why certain covariates are included in a given model).
Moreover, researchers who might wish to test alternative explanations can use
the same resources to evaluate competing models. Thus, we believe that the
analysis of secondary data sets encourages transparency, which in turn helps
facilitate scientific progress. It should certainly help researchers develop good
habits such as fastidious record keeping and careful attention to detail.
Despite these notable advantages, there are also several disadvantages
associated with secondary data analysis. The first, and primary, disadvantage
is the flip side of the major advantage—the data have already been collected.
The user of an existing resource may not have all of the information about
data collection procedures or important details about problems that occurred
during data collection. And more important, there is no guarantee that any
existing data set will be useful for addressing the particular research question
of primary interest to a given researcher. We want to stress the perspective
that high-quality research should proceed first and foremost from an impor-
tant (and, ideally, interesting) question that is informed by theoretical con-
cerns or glaring gaps in knowledge. Good research is motivated by great
questions, and this holds for all forms of research. The purpose of analyzing
data is to refine the scientific understanding of the world and to develop the-
ories by testing empirical hypotheses.
The second disadvantage is that there is a significant investment of time
and energy required to learn any particular existing data set. These start-up
costs become a central concern because of the risk that someone else will be
pursuing answers to the same research questions. There is a real possibility
that a new researcher might be “scooped” by someone using the same data set
that the researcher has chosen to analyze.
A third disadvantage is one that is probably of most concern to psychol-
ogists. That is, the measures in these existing resources are often abbreviated
because the projects themselves were designed to serve multiple purposes and
to support a multidisciplinary team. Many of these data sets have impressive
breadth (i.e., many constructs are measured) but with an associated cost in
terms of the depth of measurement (i.e., constructs may be measured by only
a few survey items). For example, at least two of us are very interested in
global self-esteem, and it is exceedingly rare to find existing studies that
have administered all 10 items of the widely used Rosenberg (1965) Self-
Esteem Scale. Rather, it is quite common to find studies that administered
a subset of the Rosenberg items (see, e.g., Trzesniewski & Donnellan, 2010;
Trzesniewski, Donnellan, & Robins, 2003) or studies that administered alter-
native measures of self-esteem that are unique to that project (e.g., the Add
Health data set; see Russell, Crockett, Shen, & Lee, 2008). This fact can cre-
ate concern among psychologists who are used to evaluating studies based on
the entire set of items associated with conventional measures.
Measurement concerns are therefore a major issue in secondary data
analysis, and these issues frequently require some amount of defending in the
peer-review process. For example, short forms tend to have lower levels of
internal consistency than parent scales, given how the alpha coefficient is calculated (see Chapter 3, this volume). On the one hand, this reduction does not always
mean that the predictive validity of a given scale will be dramatically impaired.
On the other hand, short forms may have such reduced content that they end
up assessing a much narrower construct than the original scale. Thus, basic
issues of reliability and validity are of paramount concern when conducting
secondary data analyses. Researchers who do not have a good grounding in
psychometrics might be frustrated because they are not able to carefully eval-
uate trade-offs in reliability and validity (for a good summary of psychometric
issues, see Clark & Watson, 1995). However, for those researchers with even
basic quantitative training (or for those willing to acquire this training), the
analysis of existing data is a useful skill to have in their tool kit.
A fourth disadvantage has to do with the potential for “fishing” exercises (e.g.,
searching for significant correlations instead of testing hypotheses) that can be
undertaken with existing resources. Different researchers will find this concern
more or less compelling, given their views on statistical significance testing. We
list this as a disadvantage because colleagues, tenure-review committees, and
reviewers might be among those researchers who are extremely concerned about
the potential for data exploitation. Our view is that all findings need to be repli-
cated and that the actual risks associated with fishing are often overstated (see
Rosenthal, 1994, p. 130). That is, there seems to be little reason to dismiss find-
ings from secondary analyses simply because of concerns of increased Type I
errors. The data have an important story to tell regardless of how many articles
have been published from a project. The fact that resources are often publicly
available also helps to counterbalance fishing concerns because others are able
to verify the robustness of the results across different ways of operationalizing the
variables and different ways of formulating statistical models.
The fifth disadvantage is a practical one that can have real conse-
quences for developing academic careers. There is a real possibility that orig-
inal data collection might be more highly regarded in some fields and some
departments than others (see also McCall & Appelbaum, 1991). This can be
a significant disadvantage for new professors looking to build a strong case for
tenure or promotion. Local norms vary widely across departments, and the
best advice is to be aware of the local culture. Our view is that academic
departments should value scholarly contributions regardless of whether the
underlying data were newly collected or whether they were extracted from a
publicly available archive. We hope that this perspective will become more
widespread as more and more psychologists engage in secondary data analy-
sis. On a related point, most existing data sets are correlational in nature, and
this places considerable limits on the causal inferences that can be drawn
from any analyses. This fact may affect the extent to which secondary data
analysis is valued within subdisciplines within psychology.
On balance, however, we believe that the advantages of secondary data
analysis often outweigh the disadvantages. Accordingly, we are convinced that
this is a valuable approach to research for psychologists to learn. This book is
intended to make this argument by showcasing important research findings
that have emerged from secondary data analyses and to provide an introduc-
tion to the key methodological issues. We endorse the perspective of strategic
replication, or the idea that scientific advances occur when researchers submit
hypotheses to a number of different tests using diverse forms of data. To the
extent that findings replicate across different forms of data, researchers can
have increased confidence in their verisimilitude. Likewise, failures to repli-
cate findings using large and diverse samples often afford the opportunity to
learn about the boundary conditions for particular propositions. In short, we
believe that the analysis of secondary data sets has a definite place in psycho-
logical research. It is one important way of testing empirical hypotheses and
thus can contribute to the development and refinement of theories.
The book is divided into two major sections. The first section is a practi-
cal guide to the analysis of secondary data. These chapters cover basic method-
ological issues related to getting started (Chapters 1 and 2), measurement issues
(Chapter 3), sample weighting (Chapter 4), and handling missing data
(Chapter 5). The final chapter in this section (Chapter 6) discusses innovative
methodological techniques that are typically not associated with secondary data
but that are becoming increasingly available. Although these chapters cover
methodological issues that are particularly relevant for secondary data analysis,
we believe that many of these chapters will be of general methodological
interest. Indeed, the discussions of measurement and missing data analysis
offer insights that transcend secondary data analysis and apply to many areas of
psychological research. We should also caution readers that our coverage
of some topics is sparse, especially in terms of survey response artifacts and
sampling schemes. Accordingly, we refer readers to good introductions to
survey methodology by Dillman, Smyth, and Christian (2009) and de Leeuw,
Hox, and Dillman (2008).
The second section contains illustrations of secondary data analyses
from leading researchers who have used these analyses to make substantive
contributions to topical areas in psychology. This list includes behavioral
genetics (Chapter 8), clinical psychology (Chapter 9), life-span developmen-
tal psychology (Chapters 7 and 10), cross-cultural psychology (Chapter 11),
political psychology (Chapter 12), and intellectual development (Chapter 13).
These chapters describe key substantive findings and therefore provide con-
crete examples of the kinds of contributions that secondary data analysis can
make to psychological science. All of these chapters conclude by identifying
key data sets in respective content areas and offer recommended readings.
These example data sets will provide readers of similar interests with useful
starting points for their own work.
Readers interested in nuts and bolts issues may want to read all of the
chapters in the first section and then selectively read chapters in the second
section that are most relevant to their substantive areas. However, those who
want an introduction to the promise of secondary analysis techniques may
want to begin by reading chapters in the second section to gain exposure to
the broad range of questions that can be answered using secondary data. These
chapters offer vivid testaments to our proposition that secondary data analy-
sis can make crucial contributions to psychological science. Moreover, the sub-
stantive findings described in these chapters are quite interesting, regardless of
the techniques that were used. After gaining insight into the possibilities that
can be achieved with secondary analysis, readers can approach the chapters in
the first section for basic instruction about how to approach secondary data
analysis. In short, we hope that the chapters in this book provide strong motivation and the necessary practical guidance that will enable readers to use this important class of techniques in their own work.
Brooks-Gunn, J., Berlin, L. J., Leventhal, T., & Fuligni, A. S. (2000). Depending on
the kindness of strangers: Current national data initiatives and developmental
research. Child Development, 71, 257–268.
The authors provide an overview of several data sets that are useful for developmental research and reflect on how these resources can be used to make substantive contributions in both scientific and public policy contexts.
De Leeuw, E. D., Hox, J. J., & Dillman, D. A. (Eds.). (2008). International handbook
of survey methodology. New York, NY: Taylor & Francis.
This edited volume provides an introduction to the myriad issues associated with
survey research.
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009). Internet, mail, and mixed-
mode surveys: A tailored design method (3rd ed.). Hoboken, NJ: Wiley.
This classic textbook can help researchers design their own surveys and better
understand research based on surveys. Chapters on sampling and the psychol-
ogy of asking questions are particularly helpful.
Elder, G. H., Jr., Pavalko, E. K., & Clipp, E. C. (1993). Working with secondary data: Studying lives. Quantitative applications in the social sciences (Sage University Paper Series No. 07-088). Beverly Hills, CA: Sage.
This short monograph provides clear guidance regarding the steps necessary to
take advantage of existing archives with particular attention to conceptual
issues related to measurement. The lead author is well known for his work as an
architect of life course theory.
Hofferth, S. L. (2005). Secondary data analysis in family research. Journal of Marriage
and the Family, 67, 891–907.
This piece provides a clear introduction to the major issues in secondary data
analysis that have wide applicability beyond family research.
Kiecolt, K. J., & Nathan, L. E. (1985). Secondary analysis of survey data. Quantitative
applications in the social sciences (Sage University Paper Series No. 07-053).
Beverly Hills, CA: Sage.
Although it is over 20 years old, this book provides important guidance about
making use of existing resources.
McCall, R. B., & Appelbaum, M. I. (1991). Some issues of conducting secondary
analyses. Developmental Psychology, 27, 911–917.
The authors provide a straightforward discussion of many basic methodological
issues related to secondary data analysis.
The following web links are useful starting points for finding secondary
data sets:
• University of Michigan’s Inter-university Consortium for Political and Social Research (ICPSR): http://www.icpsr.umich.edu/. The ICPSR’s Data Use Tutorial is particularly helpful: http://www.icpsr.umich.edu/ICPSR/help/newuser.html.
• Howard W. Odum Institute for Research in Social Science at the University of North Carolina at Chapel Hill: http://www.irss.unc.edu/odum/jsp/home.jsp
I
METHODOLOGIES FOR SECONDARY DATA USE
1
GETTING STARTED: WORKING WITH SECONDARY DATA
AMY M. PIENTA, JOANNE MCFARLAND O’ROURKE,
AND MELISSA M. FRANKS
Secondary data are those data that have been made available for use by
people other than the original investigators. These data are typically preserved
and disseminated by an organization that has a stated mission to preserve the
data for the long term or in perpetuity. Most data that have been archived are
quantitative (e.g., survey data), but increasingly, qualitative data (e.g., inter-
view transcripts, open-ended question responses) and other nonquantitative
data (e.g., audio, video) are being archived for secondary use. This chapter
focuses mainly on the use of existing data that have a quantitative component,
including data from studies that involved mixed data collection methods.
Secondary data analyses have been important for a number of years to
scholarship in various disciplines, such as sociology, economics, and political
science. Additionally, use of secondary data has been expanding in other disci-
plines that have not traditionally used such data, including psychology, family
science, and various health science disciplines. Nonetheless, the value of sec-
ondary data remains an important topic of debate in the social sciences. Such
debate initially was spurred by a series of National Research Council reports and
more recently by the 2003 publication of the National Institutes of Health’s
(NIH; 2003) Statement on Sharing Research Data. This document from NIH and
a similar one from the National Science Foundation Directorate for Social,
Murray’s holdings are extensive and include the Block and Block Longitudinal
Study. These data can also be used at no charge.
A second type of data archive has more narrowly focused collections
around a particular substantive theme. Data in these thematic archives may
be unique, or they may overlap with other archives, thus making the data more
broadly available than they might otherwise be. For instance, the Association
of Religion Data Archives is a specialty archive focusing on American and
international religion data. The Cultural Policy and Arts National Archive at
Princeton University is another example of a specialty data archive. It has data
on the arts and cultural policy available for research and statistical analysis,
including data about artists, arts and cultural organizations, audiences, and
funding for the arts and culture. And finally, WebUse at the University of
Maryland provides data to researchers interested in how technology generally,
and the Internet specifically, affects human behavior. The web addresses for
these data archives are provided in the section Recommended Data Sets.
Another type of archive is designed solely to support the scientific notion
of replication. Journal-based systems of sharing data have become popular in
economics and other fields as a way of encouraging replication of results
(Anderson et al., 2008; Gleditsch et al., 2003). The American Economic
Review, for example, requires authors to make data underlying their published
work available on its website prior to publication in the journal. These col-
lections can sometimes be shorter lived than the formal archives, particularly
if the sustainability of their archival model relies on a single funding source.
Some examples of still less formal approaches to data sharing include authors
who acknowledge they will make their data available on request or who dis-
tribute information or data through a website. Researchers often keep these
sites current with information about findings from the study, and publication
lists, in addition to data files and metadata.
Many researchers, especially those who collect their own data, may find
using secondary data to be daunting. A researcher who collects his or her own
data understands the nuances of the data collection methodology, the docu-
mentation and its completeness, and the data and data structure itself. Gaining
familiarity with an existing data source can be challenging, but it can allow one to use a broader set of variables and a larger, more representative sample than one would have access to in one’s own data collection. This section provides a summary of various components of secondary data collections and
exploration tools that the secondary analyst should consider when evaluating
an existing data collection.
Study-Level Metadata
Documentation
The online catalogs of the major archives also point to other important
elements of an existing data collection that can be useful when evaluating
the use of secondary data. The documentation provides more detailed infor-
mation than the study description and can be used to further determine
whether a data collection suits one’s research questions. Detailed informa-
tion about the design of the data collection and the data are found in the
documentation.
Codebooks are among the most important pieces of documentation for
secondary data. Thus, the codebook is the next step for exploring an existing
data collection. Codebooks typically provide a summary of the methodology
underlying the data collection and important information about the prove-
nance (ownership) of the data collection. In addition, the codebook provides
a listing of all the variables in a data set(s) and should include the exact word-
ing of the question, universe information (i.e., who was actually asked the
question), unweighted frequency distributions, missing data codes, imputation
and editing information, details about constructed variables, and labels that
describe response codes. The codebook also typically includes information on
administrative variables used to link data files within a collection and how to properly apply survey weights.
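To make the weighting point concrete, here is a minimal SPSS sketch; the file name and the weight variable sampwt are hypothetical placeholders standing in for whatever the codebook actually documents.

* Apply a codebook-documented survey weight for a descriptive run,
* then turn it off so later analyses are not accidentally weighted.
GET FILE='collection.sav'.
WEIGHT BY sampwt.
FREQUENCIES VARIABLES=depress.
WEIGHT OFF.

Because WEIGHT affects every subsequent procedure, switching it off explicitly is a useful habit.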
Data collection instruments are often distributed with secondary data as
well. Instruments can help one ascertain question flow and stems, skip patterns,
and other crucial details for fully understanding the data collection. Review of
the instruments and seeing the survey instrument as a whole helps provide the
context in which a particular question was asked. This is difficult for automated
files are created after each collection point. Like the other examples, these data
files must be merged for comprehensive data analysis.
The complex nature of many data sets has spurred some data collectors to
develop data extraction tools that seamlessly allow users to access all under-
lying data files and create analytic subsets of data without needing to have the
skills required to merge the data files in SAS, SPSS, or Stata. For example, the
website for the Panel Study of Income Dynamics (http://psidonline.isr.umich.edu) offers such a data extraction tool.
Online data analysis enables users to explore data sets using certain sta-
tistical procedures without downloading data files and without being familiar
with any statistical packages. As an example, ICPSR currently offers online
analysis for selected data collections using the Survey Documentation and Analysis (SDA) tool, a software product developed and maintained
by the University of California, Berkeley (see http://sda.berkeley.edu). Data at
other archives use a variety of online analysis packages with similar capabili-
ties. SDA allows users to select variables for analysis, perform statistical
analyses, view the data graphically, recode and compute variables, and create
customized subsets of variables and/or cases for download. Thus, online analy-
sis allows a potential data user to (a) understand the types of variables available
in a data collection, (b) explore the basic descriptive characteristics of the sam-
ple, and (c) easily examine bivariate and multivariate relationships within the
data collection.
CONCLUSIONS
are often not, nor are they necessarily intended to be, very thoroughly ana-
lyzed prior to their public release. The data are subsequently used for research,
policy, and teaching purposes in the years after their release, and even decades
later for comparative analysis.
Using secondary data has long been common among sociologists, demographers, and political scientists. Psychologists have begun more recently to turn to secondary data in their work. The increasing availability of high-quality, longitudinal information on a wide range of topics of interest to psychologists has facilitated increasing amounts of scholarship built on these rich data sources.
And, although there are significant barriers to consider when evaluating a sec-
ondary data source, the potential benefits, such as access to long-term represen-
tative samples at little to no economic cost, make secondary data an attractive
option. Finding, evaluating, and accessing secondary data have become relatively easy, given the availability of data and metadata that are searchable on the websites of the various U.S.-based archives and through search engines such as Google. In exchange for a loss of control over the types of questions asked of study participants, researchers using secondary data may gain the ability to understand human behavior over time (longitudinal information) and diversity across population subgroups (e.g., socioeconomic status, age, region, race/ethnicity). Moreover, in an increasingly multidisciplinary arena, psychologists have come to play a central role in the collection of large-scale social science data and have influenced the types of information that are being collected.
Thus, secondary data represent an increasingly important resource for future
scholarship among psychologists and in the social sciences more generally.
REFERENCES
Anderson, R. G., Greene, W. H., McCullough, B. D., & Vinod, H. D. (2008). The role of data/code archives in the future of economic research. Journal of Economic Methodology, 15, 99–119.
Bailar, J. C., III. (2003, October). The role of data access in scientific replication. Paper pre-
sented at the Access to Research Data: Risks and Opportunities: Committee on
National Statistics, National Academy of Sciences conference, Washington, DC.
Fienberg, S. E. (1994). Sharing statistical data in the biomedical and health sciences:
Ethical, institutional, legal, and professional dimensions. Annual Review of Public
Health, 15, 1–18. doi:10.1146/annurev.pu.15.050194.000245
Freese, J. (2007). Replication standards for quantitative social science: Why not soci-
ology? Sociological Methods & Research, 36, 153–172. doi:10.1177/004912410
7306659
Gleditsch, N. P., Metelits, C., & Strand, H. (2003). Posting your data: Will you be scooped or will you be famous? International Studies Perspectives, 4, 89–95.
King, G. (2006). Publication, publication. Political Science & Politics, 39, 119–125.
King, G., Herrnson, P. S., Meier, K. J., Peterson, M. J., Stone, W. J., Sniderman, P. M., et al. (1995). Verification/replication. Political Science & Politics, 28, 443–499.
Kuhn, T. (1970). The structure of scientific revolutions. Chicago, IL: University of
Chicago Press.
Louis, K. S., Jones, L. M., & Campbell, E. G. (2002). Sharing in science. American
Scientist, 90, 304–307.
National Institutes of Health. (2003, February 26). Final statement on sharing research
data. Retrieved from http://grants.nih.gov/grants/policy/data_sharing/
National Science Foundation Directorate for Social, Behavioral, and Economic Sciences. (n.d.). Data archiving policy. Retrieved from http://www.nsf.gov/sbe/ses/common
O’Rourke, J. M. (2003). Disclosure analysis at ICPSR. ICPSR Bulletin, 24, 3–9.
O’Rourke, J. M., Roehrig, S., Heeringa, S. G., Reed, B. G., Birdsall, W. C.,
Overcashier, M., & Zidar, K. (2006). Solving problems of disclosure risk while
retaining key analytic uses of publicly released microdata. Journal of Empirical
Research on Human Research Ethics, 1, 63–84.
Sobal, J. (1981). Teaching with secondary data. Teaching Sociology, 8, 149–170.
doi:10.2307/1316942
2
MANAGING AND USING SECONDARY DATA SETS WITH MULTIDISCIPLINARY RESEARCH TEAMS
J. DOUGLAS WILLMS
The use of secondary data can be rather daunting for the beginning
researcher. Quite often, new investigators have used only small “textbook”
data sets during their graduate work and have not encountered very large data
sets that have a multilevel structure, variables with differing amounts of missing
data, and a complex weighting scheme. Usually, secondary data sets arrive as
a text file with one or more sets of syntax files for reading the data and creating
a system data file that can be used with particular software such as SAS, SPSS,
or Stata. Even achieving this first step can be frustrating for the beginning
researcher.
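As an illustration of that first step, the following is a minimal sketch of the kind of syntax file that typically accompanies such a data set, written here in SPSS; the file name, variable names, and column positions are hypothetical rather than drawn from any actual study.

* Read a fixed-format text file and save it as an SPSS system file.
DATA LIST FILE='study_cycle1.txt' FIXED
  /hhid 1-7 childid 8-9 sex 10 income 11-16.
VARIABLE LABELS hhid 'Household identifier'
  childid 'Child identifier within household'
  sex 'Child sex' income 'Household income'.
SAVE OUTFILE='study_cycle1_base.sav'.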
During the past 15 years, I have worked with researchers at the Canadian
Research Institute for Social Policy at the University of New Brunswick
(UNB-CRISP) to develop strategies for managing and using large-scale,
complex data sets with multidisciplinary teams. This work has included
the analysis of data from several national and international studies such
as the National Longitudinal Survey for Children and Youth (NLSCY;
Statistics Canada, 2005), the Programme for International Student Assessment
(PISA; Organisation for Economic Cooperation and Development, 2001), the
Progress in International Reading Literacy Study (Mullis, Martin, Gonzalez,
& Kennedy, 2003), and Tell Them From Me (Willms & Flanagan, 2007).
Each of these studies has its own peculiarities, but the studies share many of the
same features, such as design weights, missing data, and a multilevel structure.
Moreover, the analytic work involves the management and analysis of secondary data on a scale that could not reasonably be handled by one person; it requires teams of analysts working with a consistent approach to data management.
The aim of this chapter is to describe some of the management techniques
that may be useful to the beginning researcher when preparing an unfamiliar
data set for analysis. I also discuss some of the common pitfalls that we typically
encounter in working with secondary data. Throughout the chapter, I use the
NLSCY as an example. The NLSCY is a nationally representative longitudinal
study of Canadian children and youth that was launched by the Canadian
government in 1994 with a sample of more than 22,000 children in more than
13,000 families. The design included surveys administered to parents, teachers,
and school principals; direct assessments of the children after age 4; and a
separate self-report questionnaire for youth from age 10 onward. Children
and their families have been followed longitudinally with data collected every
2 years. The U.S. National Longitudinal Survey of Youth (NLSY) and
Australia’s Longitudinal Survey of Australian Youth are comparable surveys.
These studies provide extraordinary coverage of the characteristics, life
experiences, and the healthy development of children as they grow from
infancy to early adulthood.
Human Resources and Skills Development Canada has funded two
multidisciplinary team projects to further the research on the NLSCY. The
first of these, called “Vulnerable Children,” led to the publication of an edited
volume (Willms, 2002) and the development of a pan-Canadian network
called the “New Investigators Network.” The network published nearly
100 articles over a 4-year period. Recent examples of research based on the
longitudinal data from the NLSCY include works published by Arim, Shapka,
Dahinten, and Willms (2007); Dahinten, Shapka, and Willms (2007); and
Dupéré et al. (2007). The second team project, “Successful Transitions,” has
aimed to exploit the longitudinal structure of the NLSCY data.
I believe there are three elements critical to the success of a multi-
disciplinary team. One is that the members need to be chosen for both their
knowledge and expertise in a substantive area and their predilection for
quantitative research. Also, I prefer to have a balance of researchers who are
at varying stages in their careers, as this affords the opportunity for senior
researchers to mentor junior colleagues. The second element is training. In
both projects, we held several 3-day training sessions on statistical approaches
relevant to the project. These included, for example, training in the handling
of missing data, the use of design weights, item response theory (IRT), and
hierarchical linear modeling (HLM). Although team members varied in their
The most difficult task in working with secondary data is taking the raw
data set provided and building a data set that can be used for analysis. During
the early stages of my career, I developed very long syntax files that could read
the secondary data, do the necessary coding and scaling of variables, and save
a data set with 10 or 15 variables that I would then use for analysis. This
worked reasonably well for small projects, but the syntax files were unwieldy
for large-scale longitudinal studies like the NLSCY. Also, these syntax files
were never adequately documented. Quite often, a paper would come back
from review, 6 months to 1 year later, requesting additional analyses. Because
the syntax files and the resulting variables were not adequately documented,
it would usually take me several hours to reconstruct what I had done. Also,
without adequate documentation, the knowledge transfer to new researchers
was difficult at best, and often unsuccessful. Therefore, when the Successful
Transitions team embarked on its work, the analysis group set in place an
approach that included the development of a master usable data set in three
incremental steps. The approach also involved the development of an accom-
panying “measures document,” which included documentation that enables
a multidisciplinary team to use the data across several projects.
The first step in the process of moving from raw text data to a usable
data set is to create a data set that has all of the secondary data in the base
program that the analyst intends to use, such as SPSS, SAS, or Stata. Although
some data sets come in one or more of these formats, more often the analyst
must perform this step. If so, the secondary data set usually comes with a
syntax file for reading the raw, text-format data. When using this syntax file,
I sometimes make small changes to the syntax. For example, I like to avoid
the use of string variables as much as possible, especially for variables that will
be used as identification (ID) variables because some programs will not read
string variables for the ID. At the end of the syntax file, I like to sort the data
on the basis of the structure of the data set. For example, with the NLSCY
data, we sort cases by Cycle, Household-ID, and Child-ID. Although we may
need to sort the data in other ways for particular projects, this structure is used
for managing the tall skinny files. This database is referred to as the base data set.
The data set is prepared without recoding of variables, such that it faithfully
represents the raw data.
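A minimal SPSS sketch of the small adjustments just described, with hypothetical file and variable names: converting a string ID to numeric and sorting by the structure of the data set.

* Convert a string household ID to numeric (some programs will not
* read string IDs), then sort by Cycle, Household-ID, and Child-ID.
GET FILE='nlscy_raw.sav'.
COMPUTE hhid = NUMBER(hhid_str, F7.0).
FORMATS hhid (F7.0).
SORT CASES BY cycle hhid childid.
SAVE OUTFILE='nlscy_base.sav' /DROP=hhid_str.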
The next step is building tall skinny data files for each of the variables
to be used in analysis. A tall skinny file is simply a data file that includes a small
set of focal variables and one or more relevant ID variables. The syntax file
to create a tall skinny file simply reads in the data from the base data set; does
some manipulation on the data for a particular variable; sets the variable name,
variable label, and missing values codes as desired; and saves a tall skinny
file—a data set with the ID variables plus the new variable or set of variables.
Building the tall skinny files is the real work of the analyst: In our case,
the first step—reading the secondary data into SPSS—required about 2 hr,
whereas the step of building the tall skinny files took us 6 months. The use
of tall skinny files is crucial to our work as it enables teams of researchers to
work on separate variables; it allows us to identify errors when they occur; and
it allows us to add new data to our system when they are collected in succes-
sive cycles. (We use the term cycle to refer to data collected on successive
occasions; some studies such as the NLSY use the term rounds; others use
the term waves.)
As a simple example, consider the syntax for creating the tall skinny file
for the child’s sex. Our syntax file reads in the base data; recodes the data for
the variable “sex” to a new variable called “female,” which is coded “0” for males
and “1” for females, with missing data codes set to “9”; assigns value labels;
and saves an SPSS file that includes the three ID variables and the new
variable “FEMALE.” Although this may seem trivial, the data for sex are
coded “1” for males and “2” for females in some cycles of the NLSCY and
“M” and “F” in other cycles. Also, we prefer the “0”–“1” coding rather than
“1”–“2,” as the variable can then be used directly as a dummy variable in
regression analyses. We call the variable “FEMALE” rather than “sex” or
“gender” so we can easily recall which children were coded “1” and which
were coded “0.”
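In SPSS, the tall skinny file for FEMALE might look something like the following sketch; the file and variable names are hypothetical stand-ins, and the actual NLSCY syntax is more involved.

* Recode sex (1 = male, 2 = female) into a 0/1 dummy with 9 as the
* missing-data code, then save the IDs plus the new variable.
GET FILE='nlscy_base.sav'.
RECODE sex (1=0) (2=1) (ELSE=9) INTO female.
* Cycles that store sex as 'M'/'F' would first need a string recode,
* e.g., RECODE sexstr ('M'=0) ('F'=1) (ELSE=9) INTO female.
VALUE LABELS female 0 'Male' 1 'Female' 9 'Missing'.
MISSING VALUES female (9).
SAVE OUTFILE='ts_female.sav' /KEEP=cycle hhid childid female.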
It seems that there is no consistent naming convention for variables in
secondary data sets. We like to include a variable name that is meaningful, such
as “mumedyrs” for mothers’ education coded in years. We include a prefix, which currently ranges from “A” to “F,” to denote the cycle (i.e., “A” for 1994/1995, “B” for 1996/1997, etc.). Also, when data are imputed for a variable, we add a suffix “_i” to denote that missing data were imputed for that variable. Thus, the variable “Cmumedyrs_i” contains information on mother’s education for children studied in the third data collection cycle (1998/1999).
For a slightly more complicated example of tall skinny file construction,
there are many variables that are either nominal or ordinal but come as a
single variable in the base data set. In this case, our syntax for the tall skinny
file creates a small set of dichotomous variables with each variable name
denoting the category. For example, for mother’s or father’s level of education,
one might have separate variables denoting “did not finish secondary school,”
“secondary school,” “some college or university,” “completed college or trade
program,” and “completed university.” We prefer to include the full set of
dummy variables, rather than leaving out one dummy variable to be used as
the base category in a regression analysis, as this allows the analyst to decide
later which variable to use as the base category.
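A sketch of how the full set of education dummies might be built in SPSS, assuming a single ordinal variable mumed coded 1 through 5; the category codes and variable names are hypothetical.

* Expand one ordinal education variable into five 0/1 dummies,
* carrying missing data forward as code 9.
GET FILE='nlscy_base.sav'.
DO REPEAT d = ed_lths ed_hs ed_somepse ed_college ed_univ
         /c = 1 2 3 4 5.
  COMPUTE d = (mumed = c).
  IF MISSING(mumed) d = 9.
END REPEAT.
MISSING VALUES ed_lths TO ed_univ (9).
SAVE OUTFILE='ts_mumed.sav'
  /KEEP=cycle hhid childid ed_lths ed_hs ed_somepse ed_college ed_univ.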
There is also a tall skinny file for childhood overweight and obesity. This
variable is slightly more complicated. In this case, we read in the base data set
and then merge the data from the tall skinny files for age and sex. We then
examine the data for each child’s weight and height, and do the necessary
checking to ensure that there are not any extreme outliers. Usually, extreme
outliers stem from coding errors, but in this case there are some children with
data coded in the imperial system (inches and pounds) rather than the metric
system (meters and kilograms). After making necessary repairs, we then
construct a measure of body mass index (BMI). This is followed with a long
syntax that codes levels of BMI into markers of overweight and obesity
according to each child’s age and sex. The cutpoints for BMI that determine
create a dummy variable denoting whether the data were missing and a
new variable that has the missing data replaced with an imputed value.
Thus for family income, we have four variables: “INCOME,” “INCOME94,”
“INCOME94_m,” and “INCOME94_i,” for the base variable, the constant
dollar variable, the missing data indicator, and the variable with missing data
imputed, respectively. Techniques for imputing missing data are described in
Chapter 5.
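A sketch of this naming scheme in SPSS, with hypothetical file names; the constant-dollar conversion is omitted, and simple mean imputation stands in here for the fuller missing-data methods of Chapter 5.

* Flag missing income, then create an imputed version alongside the
* original (mean imputation is only a placeholder technique).
GET FILE='ts_income94_base.sav'.
COMPUTE income94_m = MISSING(income94).
COMPUTE one = 1.
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=one
  /inc94_mean = MEAN(income94).
COMPUTE income94_i = income94.
IF MISSING(income94) income94_i = inc94_mean.
SAVE OUTFILE='ts_income94.sav'
  /KEEP=cycle hhid childid income94 income94_m income94_i.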
A device we have found invaluable in our work is the cohort and cell
diagram, which is shown in Figure 2.1. The NLSCY comprises 11 cohorts of
children sampled in 2-year age groups.
We created a syntax file that assigns the cohort number (1–11) and the
cell number (1–48) for the data for each child at each cycle. As each research
paper in the Successful Transitions project uses a different subsample from
the data set, the researcher can quickly select the subsample needed and
communicate to others what data were used in the analysis.
The final step is to merge the several tall skinny files into a data set ready
for analysis. In most cases, this is easily done, as the tall skinny files include
the relevant matching variables and are sorted by these variables.
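In SPSS, this lateral merge might look like the following sketch (hypothetical file names), assuming each tall skinny file is already sorted by the key variables.

* Merge tall skinny files one-to-one by the ID variables.
MATCH FILES
  /FILE='ts_female.sav'
  /FILE='ts_mumed.sav'
  /FILE='ts_income94.sav'
  /BY cycle hhid childid.
SAVE OUTFILE='paper1_analysis.sav'.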
However, this last step can be a little more complicated for longitudinal
data sets. For the NLSCY, the construction of tall skinny files is done separately
for each cycle, in our case yielding six tall skinny files. The tall skinny files are
then stacked (e.g., using “add files” in SPSS) and sorted by the child ID to
create a very tall skinny file. The new, stacked variable then does not require
the prefix denoting cycle. We prefer stacking the skinny files rather than
matching them laterally, as many of the programs we use to analyze longitudinal
data require this format. Also, this approach reduces the number of variables,
as there are no longer six separate variables for each measure.
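A sketch of the stacking step in SPSS, with hypothetical file and variable names; it assumes each per-cycle file carries a cycle variable, and the cycle prefix is dropped via renaming as the files are stacked.

* Stack per-cycle tall skinny files into one long file, renaming the
* prefixed variable to a common name, then sort by child and cycle.
ADD FILES
  /FILE='ts_dep_a.sav' /RENAME=(adep = depress)
  /FILE='ts_dep_b.sav' /RENAME=(bdep = depress)
  /FILE='ts_dep_c.sav' /RENAME=(cdep = depress).
SORT CASES BY childid cycle.
SAVE OUTFILE='ts_depress_long.sav'.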
Keeping good notes is the sine qua non of the trade. One can keep infor-
mal notes in the syntax files, and the time required for this is seldom wasted.
Those who do advanced programming in complex languages learn this early
in their careers.
We also built a more formal document called the measures document,
which describes each variable in the tall skinny files. The codebooks provided
with secondary data sets never include all of the relevant information that
analysts require. I use the word never rather than seldom because analysts need
more information on variables and data structure than reasonably can be
[Figure 2.1 appears here: a grid of numbered cells (1–48) arrayed by cohort and cycle, with rows labeled by 2-year age groups (12–13, 14–15, 16–17, 18–19, and 20–21 among them).]
Figure 2.1. Longitudinal cohorts for the NLSCY, Cycles 1–6. The first six of these
cohorts include children born between 1983 and 1994 who were followed longitudinally
from 1994 to 1995; these are shown with thicker arrows. At each cycle, a new cohort
of children aged birth to age 1 was sampled, and these children were followed
longitudinally for two additional cycles; these are shown with thinner arrows. From
The NLSCY Measures Document, p. 3, by J. D. Willms, L. Tramonte, and N. Chin,
2008, Fredericton, Canada: Canadian Research Institute for Social Policy. Copyright
2008 by Canadian Research Institute for Social Policy. Reprinted with permission.
and changes to old ones. Although building a measures document may seem
time consuming and somewhat pedantic, it saves time in the long run.
The secondary data set for the NLSCY comprises a set of complex data
sets that include data collected from children and youth, and their parents,
teachers, and school principals. The survey began in 1994/1995, with a
nationally representative sample of Canadian youth ages birth to 11. The
majority of these children are part of a longitudinal sample that is being
followed with biennial data collection. In addition, at each cycle, a new sample of children ages birth to 24 months is drawn, and in recent cycles these children have been followed longitudinally to age 5. Consequently, the codebooks for the various NLSCY data sets are massive, with several thousand pages requiring about 2 m of shelf space.
The NLSCY is not only large and complex—there are other challenges
that make its use difficult. For example, the sample design changed as the
study evolved such that not all children are followed longitudinally; many of
the measures were altered from cycle to cycle, so that without some careful
scaling work, they cannot be used longitudinally; and new survey content was
added as children got older, whereas old content was either discarded or
scheduled for collection at every other cycle, instead of every cycle.
Our measures document for the NLSCY (Willms, Tramonte, & Chin,
2008) was developed to help deal with these issues. In the short term, it was
designed for describing a set of variables that could be used in a consistent
fashion across the 15 papers for Successful Transitions. It has allowed researchers
across the country to quickly begin their own studies as they can build on what
has been done by the UNB-CRISP team. In the longer term, the measures
document is an ongoing project that describes a set of measures used by analysts
interested in the NLSCY. It has also proven useful as a teaching tool.
The NLSCY measures document begins with a brief description of the
data set and the variables to be used in Successful Transitions. This is followed
with individual summaries for each variable that include (a) a description of
the variable, (b) the age range covered by the variable, (c) the cycles in which
the variable was measured, (d) the variable type (e.g., dichotomous, ordinal,
continuous), (e) the unweighted number of children with data on the variable,
(f) basic descriptive statistics (e.g., frequencies, mean, standard deviation) for
the variables across cycles, and (g) notes regarding its development.
getting too excited, assume it was a mistake. If you cannot find a mistake, try to
do some auxiliary analyses that approach the analysis from different directions.
The most common errors I encounter are described in the following
listing.
• Miscoded data. Quite often, questionnaire data are reverse coded. I have seen data sets in which males are coded “1” and females “2” in one cycle, and then females, “1,” and males, “2,” in the next cycle. The best safeguard is careful, simple preliminary analyses as suggested earlier.
• Mismatched data. Strange results occur when data are mismatched across cycles or across levels, such as the pupil and school levels. This can happen if one forgets to include the matching key when combining data.
• Missing data codes. If the values established for missing data have not been set correctly, one will typically get unusual results. The convention seems to be to use “9” for missing data for one-digit variables, “99” for two-digit variables, and so on. Many data sets include several missing data codes, such as “7” for “not answered,” “8” for “not applicable,” and “9” for “missing.” Of course, a careful check of the syntax files is the first step. However, one can also use bigger values, such as “999” instead of “9,” for missing data; then, if there is a problem, it is easily detected. My advice is to never use “0” for the missing data value. (A short syntax sketch follows this list.)
• Filter questions. In many studies, respondents are asked a question and, depending on their answer, are asked to continue to the next question or to skip ahead to a different set of questions. This creates a special type of missing data ("not applicable"), and one needs to be careful in how these data are handled. Quite often, one can estimate a percentage or mean for the variable, but the statistic refers to the percentage or mean only for those who were filtered to that question.
• Age of respondent. The single most important variable in longitudinal surveys is the age of the respondent. However, one needs to distinguish among the age of the respondent when the survey was conducted; the age of the respondent when the sample was selected; and the age of the respondent when some event or intervention occurred, such as when a child started school. Most often, researchers will want the age of the respondent when the survey was conducted, and this usually requires one to compute it using the age of the respondent when the sample was selected and the date when the respondent was actually tested or surveyed.
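Here is the syntax sketch promised in the missing-data-codes item above, with hypothetical variable names and codes; declaring the codes and then inspecting the output is a quick safeguard against unusual results.

* Declare the documented missing-data codes, then verify that no
* stray codes remain in the valid range.
MISSING VALUES satisf (7, 8, 9).
MISSING VALUES income (999999).
FREQUENCIES VARIABLES=satisf income
  /FORMAT=NOTABLE /STATISTICS=MINIMUM MAXIMUM.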
CONCLUDING REMARKS
We do not view the approach we used for creating tall skinny files as
the best way or only way to establish a usable data set. For example, with the
NLSCY, we coded our measure of adolescent depression using a particular
variant of IRT. Another researcher may prefer to use a different approach to
scaling. The important point is that another researcher can develop his or her
own strategy and examine how the results compare with ours without having
to start from the beginning.
Our approach to data analysis also provides a logical structure for man-
aging files. We like to keep the raw secondary data sets in a separate direc-
tory. Our principle is to never alter these, as sometimes we have to go back
and reconstruct what was done from first principles. We also have separate
directories for each cycle. Each of these directories include the base SPSS file
drawn from the secondary raw data with the first step; a subdirectory for the
syntax files; and a subdirectory for the tall skinny files constructed for that
cycle. Finally, we have a directory for the file that includes the syntax for stack-
ing the tall skinny files across cycles and the resulting full database.
The use of secondary data for the novice researcher can be exceedingly
frustrating, but it can also be rewarding. Many data sets require the use of
complex design weights, a strategy for handling missing data, and an analytical
approach that takes account of a multilevel structure. Longitudinal survey
data can be even more trying, as quite often the data structure, the measures, and even the coding framework change from cycle to cycle. This chapter recommends writing syntax files that create tall skinny files that can be used to form a usable data set. This approach has several advantages. First, it is very efficient in that the researchers on a team do not all have to start from the beginning and
do all of the basic groundwork. Second, it enables fairly novice researchers to
take advantage of sophisticated work, such as IRT scaling, done by seasoned
researchers. Third, it facilitates work in multidisciplinary teams. The approach
enables one to move from secondary data to usable data, and from frustrating to
rewarding, in small steps with safeguards that help avoid the common pitfalls.
REFERENCES
Arim, R. G., Shapka, J. D., Dahinten, V. S., & Willms, J. D. (2007). Patterns and
correlates of pubertal development in Canadian youth. Canadian Journal of
Public Health, 98, 91–96.
Cole, T. J., Bellizzi, M. C., Flegal, K. M., & Dietz, W. H. (2000). Establishing a standard
definition for child overweight and obesity worldwide: International survey.
British Medical Journal, 320, 1240–1243. doi:10.1136/bmj.320.7244.1240
Dahinten, S., Shapka, J. D., & Willms, J. D. (2007). Adolescent children of adolescent
mothers: The impact of family functioning on trajectories of development.
Journal of Youth and Adolescence, 36, 195–212. doi:10.1007/s10964-006-9140-8
Dupéré, V., Lacourse, E., Willms, J. D., Vitaro, F., & Tremblay, R. E. (2007). Affiliation
to youth gangs during adolescence: The interaction between childhood psycho-
pathic tendencies and neighborhood disadvantage. Journal of Abnormal Child
Psychology, 35, 1035–1045. doi:10.1007/s10802-007-9153-0
Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., & Kennedy, A. M. (2003). PIRLS 2001
international report: IEA’s study of reading literacy achievement in primary schools.
Chestnut Hill, MA: Boston College.
Organisation for Economic Cooperation and Development. (2001). Knowledge and skills
for life: First results from the OECD programme for international student assessment
(PISA) 2000. Paris, France: Author.
Statistics Canada. (2005). National Longitudinal Survey of Children and Youth microdata
user guide—Cycle 5. Ottawa, Canada: Author.
Willms, J. D. (Ed.). (2002). Vulnerable children: Findings from Canada’s National
Longitudinal Survey of Children and Youth. Edmonton, Canada: University of
Alberta Press.
Willms, J. D., & Flanagan, P. (2007). Canadian students “Tell Them From Me.”
Education Canada, 3, 46–50.
Willms, J. D., Tramonte, L., & Chin, N. (2008). The NLSCY measures document.
Fredericton, Canada: Canadian Research Institute for Social Policy.
3
ON CREATING AND USING SHORT FORMS OF SCALES IN SECONDARY RESEARCH
KEITH F. WIDAMAN, TODD D. LITTLE, KRISTOPHER J. PREACHER, AND GITA M. SAWALANI
FUNDAMENTAL ISSUES IN PSYCHOLOGICAL MEASUREMENT: RELIABILITY
Internal Consistency
Many different kinds of reliability coefficients have been developed, such as split-half, internal consistency, parallel forms, and test–retest reliability
(see McDonald, 1999). The most commonly used reliability coefficients are
internal consistency and test–retest coefficients. Internal consistency indexes
estimate reliability on the basis of associations among scale items. Coefficient
alpha (Cronbach, 1951) is the most often reported internal consistency index.
However, many researchers are unaware that alpha is based on the assumptions
that a single factor underlies the scale and that all items are equally good
indicators of the latent variable being assessed (Schmitt, 1996). That is, if an
item factor analysis were performed, all loadings of the p items on the single
factor underlying the scale would be equal (i.e., tau equivalent). If the factor
loadings are not equal for all items, coefficient omega is a more appropriate
estimator of reliability (see below), and coefficient omega is always greater
than or equal to coefficient alpha for a scale that is unidimensional (i.e., that
is a one-factor scale; see McDonald, 1970, 1999).
Coefficient alpha can be calculated in many ways, but perhaps the easiest way is

\[
\alpha = \left( \frac{p}{p - 1} \right) \left( \frac{s_X^2 - \sum s_j^2}{s_X^2} \right), \tag{1}
\]

where $p$ is the number of items, $s_X^2$ is the variance of total scores on scale X, and $\sum s_j^2$ refers to the summation of item variances for the $p$ items ($j = 1, \ldots, p$).
In the left section of Table 3.1, descriptive statistics are presented for a
6-item short form of the 10-item Rosenberg Self-Esteem Scale (Rosenberg,
1965; note that Items 5 and 6 are reverse scored), which are based on a
sample of 6,753 participants in the 2005 Monitoring the Future survey.
These data are freely available to researchers (more details are available at
http://monitoringthefuture.org/).
The correlations among items, shown below the main diagonal, are moderate to large, ranging from .256 to .696, with a mean correlation of .468. Item variances are shown on the diagonal, and covariances among items are above the diagonal.
TABLE 3.1
Six Rosenberg Self-Esteem Scale Items From the 2005 Monitoring the Future Study: Descriptive Statistics and One-Factor and Two-Factor Solutions

        Item descriptive statistics                      One-factor         Two-factor
Item     1      2      3      4      5r     6r           λ1      θ²j        λ1       λ2       θ²j
1       1.094   .660   .495   .783   .470   .553         .853    .367       .810b    .058a    .377
2        .613  1.061   .569   .661   .440   .456         .776    .458       .825b   −.042a    .420
3        .516   .602   .843   .525   .297   .300         .604    .478       .708b   −.121a    .432
4        .696   .596   .532  1.158   .526   .597         .886    .372       .806b    .105a    .393
5r       .381   .362   .274   .415  1.390   .878         .605   1.024       .094b    .750a    .734
6r       .415   .347   .256   .435   .584  1.626         .670   1.177      −.094b   1.149a    .427
M       3.978  4.072  4.171  4.045  3.944  3.753
SD      1.046  1.030   .918  1.076  1.179  1.275

Note. N = 6,753. In the item descriptive statistics section, values are correlations among items (below the diagonal), item variances (on the diagonal), and covariances among items (above the diagonal), with the mean and standard deviation for each item. Items 5 and 6 were reverse scored (resulting in Items 5r and 6r, respectively) prior to calculating item statistics. In the one-factor and two-factor sections, tabled values are estimates from factor analyzing covariances among items. Symbols λ1 and λ2 refer to Factors 1 and 2, respectively, and θ²j to unique factor variance. Factor variances were fixed at unity to identify and scale the estimates. Factors correlated .608 in the two-factor solution.
a,bParameters constrained to sum to zero (the loadings of Items 1–4 on Factor 2 and the loadings of Items 5r and 6r on Factor 1, respectively; see text).
To use Equation 1 to calculate coefficient alpha, we need two quantities: (a) the sum of all elements of the item variance–covariance matrix, which equals s_X^2 = 23.592, and (b) the sum of the item variances on the diagonal, which is 7.172. Given these values and p = 6 items, coefficient alpha is estimated as

\alpha = \left(\frac{6}{6-1}\right)\left(\frac{23.592 - 7.172}{23.592}\right) = .835

for this six-item short form.
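As an illustration, Equation 1 can be verified in a few lines of Python; the following is a minimal sketch using the item covariance matrix from Table 3.1:

import numpy as np

# Item variance-covariance matrix for the six self-esteem items (Table 3.1):
# variances on the diagonal, covariances off the diagonal.
S = np.array([
    [1.094, 0.660, 0.495, 0.783, 0.470, 0.553],
    [0.660, 1.061, 0.569, 0.661, 0.440, 0.456],
    [0.495, 0.569, 0.843, 0.525, 0.297, 0.300],
    [0.783, 0.661, 0.525, 1.158, 0.526, 0.597],
    [0.470, 0.440, 0.297, 0.526, 1.390, 0.878],
    [0.553, 0.456, 0.300, 0.597, 0.878, 1.626],
])

p = S.shape[0]
scale_var = S.sum()          # s_X^2, variance of total scale scores (23.592)
sum_item_vars = np.trace(S)  # sum of the item variances (7.172)
alpha = (p / (p - 1)) * (scale_var - sum_item_vars) / scale_var
print(round(alpha, 3))       # .835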
An interesting alternative to coefficient alpha is coefficient omega
(McDonald, 1970, 1999). Coefficient omega is more appropriate for most
research applications because it is unrealistic to assume that all items on a
measure are equally good at tapping true score variance. In our experience,
equal factor loadings for all scale items rarely, if ever, occur, so coefficient
omega will, in general, be preferable to coefficient alpha. Assuming that a
linear model underlies responses on each item on scale X, the linear model
for item x_ji takes the form x_ji = τ_j + λ_j F_i + ε_ji, where x_ji is the score of person i on item j, τ_j is the intercept (i.e., mean) of item j, λ_j is the raw score (or covariance metric) common factor loading for item j, F_i is the score on the common factor for person i, and ε_ji is the score of person i on the unique factor for item j. On the basis of factor analysis of the item variance–covariance matrix, coefficient omega can be estimated as

\omega = \frac{\left(\sum \lambda_j\right)^2}{\left(\sum \lambda_j\right)^2 + \sum \theta_j^2} = 1 - \frac{\sum \theta_j^2}{s_X^2}, \qquad (2)
where all summations run from 1 to p (the p items), θ²_j is the estimated unique variance of item j (i.e., the variance of ε_ji), and other symbols are as defined previously. In the first expression in Equation 2, one first sums the p factor loadings and squares this sum; the square of the summed loadings estimates the common, or reliable, variance of the scale. The denominator, which adds the sum of the unique item variances to this value, estimates total scale variance. The ratio of these two quantities gives the proportion of variance in the scale that is reliable variance (i.e., coefficient omega). When a single factor underlies the scale and all factor loadings (λ_j) are identical, coefficients omega and alpha are identical.
The second expression in Equation 2 is simply 1.0 minus the ratio of the sum of unique factor variances, Σθ²_j, to total scale variance, s²_X. If a scale is truly a single-factor instrument (i.e., is unidimensional), both expressions for coefficient omega in Equation 2 will provide identical results. But recent work by Zinbarg, Revelle, and associates (e.g., Zinbarg, Revelle, & Yovel, 2007; Zinbarg, Yovel, Revelle, & McDonald, 2006) showed that the second expression is a more appropriate estimator of coefficient omega as developed by McDonald (1970) if a scale is "lumpy" (cf. Cronbach, 1951), that is, if it consists of two or more highly correlated group factors reflecting excess overlap in item content or stylistic variance that contributes to multidimensionality.
To estimate coefficient omega, we used maximum likelihood estimation
in Mplus (Muthén & Muthén, 1998–2007) to obtain a one-factor solution
for the self-esteem items. To identify this model, we fixed the factor variance
to 1.0 and estimated all remaining parameters. As seen in Table 3.1, the
positively worded items had higher factor loadings and much lower unique
variances than did the negatively worded items. Factor loadings varied
considerably (range = .604–.886) and thus were not tau equivalent, imply-
ing that coefficient alpha is inappropriate for this data set. The one-factor
solution had marginal levels of fit to the data, with a comparative fit index
(CFI) of .881 and standardized root-mean-square residual (SRMR) of .069,
so computation of coefficient omega may be suspect. For illustration, we used
the first expression for coefficient omega shown in Equation 2 to compute
a reliability estimate (the sum of factor loadings is 4.394, the square of this
sum is 19.307, and the sum of unique variances is 3.876). This estimate of
coefficient omega was .833, which is marginally lower than coefficient alpha
reported above. Using the second expression in Equation 2, coefficient
omega was .836, marginally higher than coefficient alpha. That the first
of these two estimates of coefficient omega is lower than coefficient alpha
is inconsistent with the claim that omega is greater than or equal to alpha
(McDonald, 1999), an inequality that holds only if a scale is unidimensional.
Thus, the first estimate of coefficient omega suggests that the one-factor
solution for this data set is inappropriate.
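Both expressions in Equation 2 are simple to compute once loadings and unique variances are in hand; a minimal Python sketch using the one-factor estimates from Table 3.1:

import numpy as np

# One-factor loadings and unique variances for the six items (Table 3.1)
loadings = np.array([0.853, 0.776, 0.604, 0.886, 0.605, 0.670])
uniques = np.array([0.367, 0.458, 0.478, 0.372, 1.024, 1.177])
scale_var = 23.592  # s_X^2, the total scale variance

omega_first = loadings.sum() ** 2 / (loadings.sum() ** 2 + uniques.sum())
omega_second = 1 - uniques.sum() / scale_var
print(round(omega_first, 3), round(omega_second, 3))  # .833 and .836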
To investigate this, we fit a freely rotatable, exploratory two-factor model
to the data, again using maximum likelihood estimation and the Mplus program.
To identify this model, we fixed factor variances to 1.0, allowed the factors to
correlate, and constrained hyperplanar loadings on each factor to sum to zero.
That is, we constrained the loadings of Items 5r and 6r on the first factor to
sum to zero, and we constrained the loadings of Items 1 through 4 on the second
factor to sum to zero. The results of this analysis are shown in the last three
data columns of Table 3.1. The two-factor solution had quite acceptable levels
of fit, with CFI of .980 and SRMR of .018. As seen in Table 3.1, the four
positively worded items loaded highly on the first factor, the two negatively
worded items loaded highly on the second factor, and the two factors were
relatively highly correlated (.608). Notably, the unique factor variances
for the two negatively worded items were greatly reduced relative to the
one-factor model, and the sum of unique variances was now 2.783. Using the second expression in Equation 2, omega was .882, considerably higher than coefficient alpha.
Coefficient alpha is often touted as a lower bound estimator of scale reliability, and this sounds at first to be desirable, as one generally would not want to overstate the precision of a measure.
Test–Retest Reliability
The second most commonly reported index of reliability is test–retest
reliability. Here, one administers the same scale at two points in time and calculates
the Pearson product–moment correlation between the two administrations.
In secondary data sets that contain longitudinal measurements, correlations
between scores on a given scale across measurement occasions are test–retest
correlations, and even single-item measures can be evaluated using test–retest
reliability.
The strength of test–retest correlations depends on the time lag between
assessments, with longer lags tending to yield lower correlations. Therefore,
such estimates are not optimal as indices of measurement precision; instead,
they reflect stability over time. Additionally, internal consistency and test–retest
reliabilities may diverge as a function of the type of construct. A scale assessing
a trait construct would ideally have high internal consistency reliability and
should also exhibit high test–retest reliability. In contrast, if a scale assesses a
state or mood construct that varies across time, the scale would ideally have
high internal consistency reliability, whereas its test–retest reliability should
be quite low, even near zero. If the test–retest reliability of a state or mood
scale is quite high, the contention that the scale assessed a state construct is
open to question.
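Computationally, a test–retest coefficient is simply a Pearson correlation between two administrations; a minimal Python sketch with simulated (hypothetical) scores:

import numpy as np

rng = np.random.default_rng(1)
time1 = rng.normal(size=200)                           # scale scores at Time 1
time2 = 0.7 * time1 + rng.normal(scale=0.7, size=200)  # a fairly stable construct

retest_r = np.corrcoef(time1, time2)[0, 1]
print(round(retest_r, 2))  # test-retest correlation, roughly .7 here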
Because the items of a short form are a subset of the items in the original scale, a short form will usually have lower reliability than the original, longer form. This reduced reliability is a crucial consideration: If a short form is too short, its reliability may be so compromised that it has unacceptable levels of measurement precision, and its use in research becomes a risky endeavor. Providing universally acceptable bounds for reliability is fraught with problems. In general, short-form reliabilities of .80 or above are quite acceptable, values between .70 and .80 are adequate for research purposes, and reliabilities between .60 and .70 are at the low end for general use (cf. Nunnally & Bernstein, 1994, pp. 264–265). However, occasionally a scale with a reliability of .45 to .50
has surprisingly high correlations with outside variables, so the proof of its usefulness lies in the empirical relations it has with other variables (i.e., its validity, as discussed later).
The second scale characteristic—the magnitude of the MIC—reflects
the amount of variance that is shared among items. Here, the higher the MIC,
the smaller the number of items needed to achieve an acceptable level of
reliability. Conversely, scales with a low MIC will require more items to
achieve a comparable level of reliability. Clark and Watson (1995) argued
that the MIC for scales generally should fall somewhere between .15 and .50.
When measuring broad constructs like extroversion, lower MICs (in the
range from .15 to .35) are expected, as one should attempt to cast a broad net
and assess many, somewhat disparate, aspects of a content domain. When
measuring a narrow construct like test anxiety, higher MICs (ranging from
.30 to .50) should occur because narrow domains of content have items of
greater similarity.
The MIC is directly related to the standardized factor loadings one
would obtain if item correlations were factor analyzed. If correlations among
items were fairly homogeneous and the MIC were .16, then standardized
factor loadings for items would be around .40, and MICs of .25, .36, and .50
would translate into factor loadings of about .50, .60, and .70, respectively.
Thus, the guideline by Clark and Watson (1995) that the MIC should fall
between .15 and .50 means that standardized item factor loadings should vary
between about .40 and .70, a useful benchmark when evaluating factor analyses
of items comprising a short form from a secondary data set.
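The arithmetic behind this benchmark is simple: with roughly equal interitem correlations, a one-factor model implies that each correlation is the product of two equal standardized loadings, so the loading is approximately the square root of the MIC. A quick check in Python:

import numpy as np

mics = np.array([0.16, 0.25, 0.36, 0.50])
print(np.round(np.sqrt(mics), 2))  # [0.4, 0.5, 0.6, 0.71], matching the text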
With regard to the third scale characteristic, differences in item variance
can affect both reliability and the nature of the construct assessed by the scale.
Other things being equal, items with larger variance contribute proportionally
more variance to the scale than do items with smaller variance. When this
occurs, individual differences on items with larger variance contribute dis-
proportionately to individual differences on the total scale, which is shifted in
the direction of these items and away from items with relatively small variance.
When items with small variance assess crucial aspects of a construct, failure
TABLE 3.2
Predicted Reliabilities for Short Forms as a Function of Properties of the Long Form

              Long form              Length of short form as a function of long form
                                     75% of items        50% of items        25% of items
No. of items   MIC    r_xx           No. of items  r_xx  No. of items  r_xx  No. of items  r_xx
40             .50    .98            30            .97   20            .95   10            .91
               .40    .96                          .95                 .93                 .87
               .30    .94                          .93                 .90                 .81
               .20    .91                          .88                 .83                 .71
30             .50    .97            23            .96   15            .94    8            .89
               .40    .95                          .94                 .91                 .84
               .30    .93                          .91                 .87                 .77
               .20    .88                          .85                 .79                 .67
20             .50    .95            15            .94   10            .91    5            .83
               .40    .93                          .91                 .87                 .77
               .30    .90                          .87                 .81                 .68
               .20    .83                          .79                 .71                 .56
10             .50    .91             8            .89    5            .83    3            .75
               .40    .87                          .84                 .77                 .67
               .30    .81                          .77                 .68                 .56
               .20    .71                          .67                 .56                 .43

Note. MIC = mean interitem correlation. Tabled values are theoretical estimates derived from the MIC and the number of items using the Spearman–Brown prophecy formula. A Microsoft Excel spreadsheet with other combinations of conditions is available at http://www.Quant.KU.edu/resources/published.html.
The remaining columns of Table 3.2 contain three pairs of columns: one pair for a scale 75% as long as the original form (i.e., discarding one fourth of the items), a second pair for a scale 50% as long (i.e., discarding half of the items), and a final pair for a scale 25% as long (i.e., discarding three fourths of the items). Thus, a 40-item scale with an MIC of .3 would have a reliability of .94, and a 20-item scale with an MIC of .2 would have a reliability of .83.
Values in Table 3.2 appear to present a rather positive picture, with
relatively high levels of reliability in many parts of the table. But one should
remember that most original (or long-form) scales used in psychology have
between 10 and 20 items per dimension. Suppose a researcher wanted to keep
reliability above .80 and knew that the reliability of a 20-item scale was .90.
This level of reliability would arise from an MIC of around .3, and the
researcher would expect a short form containing half of the items from this
scale (10 items) to have a reliability of .81. If this level of reliability were deemed
too low, then keeping more items would be advisable. Or, if the original form
of a 10-item scale had a reliability of .87 (i.e., MIC of .4), deleting more than
about three items from the scale would lead to reliability below .80, which
may be unacceptable.
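The values in Table 3.2 and in this example can be reproduced with the Spearman–Brown prophecy formula; a minimal Python sketch (the function names are ours):

def sb_reliability(n_items, mic):
    # Spearman-Brown: reliability of an n_items scale with mean interitem correlation mic
    return n_items * mic / (1 + (n_items - 1) * mic)

def implied_mic(n_items, rxx):
    # Invert the formula to recover the MIC implied by a given reliability
    return rxx / (n_items - rxx * (n_items - 1))

mic = implied_mic(20, 0.90)               # about .31 for a 20-item scale with rxx = .90
print(round(sb_reliability(10, mic), 2))  # about .82; Table 3.2 gives .81 for MIC = .30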
FUNDAMENTAL ISSUES IN PSYCHOLOGICAL MEASUREMENT: VALIDITY
The construct validity of a scale is established by the pattern of results obtained across all studies using the scale. This pattern should satisfy several criteria: (a) the scale should correlate highly with other, well-established measures of the same construct; (b) the scale should correlate much lower with measures of quite different constructs; and (c) scale scores should vary as a function of relevant contexts or conditions. To reiterate, the correlation of a scale with measures of other constructs need not always involve strong positive correlations but rather can involve a well-reasoned set of hurdles that includes zero, small, medium, and large correlations in both positive and negative directions, all of which should be consistent with the theory underlying the measure and its expected levels of correlation with other measures. In their description of a hypothetical measure of social intelligence,
Westen and Rosenthal (2003) argued that it should correlate moderately
positively with verbal IQ (r = .50), at a low positive level with Extraversion
(r = .10), and moderately negatively with hostile attribution bias (r = −.40).
Even using the elegant Westen and Rosenthal (2003) approach, construct
validity of a scale cannot be captured with a single set of correlations but is
summarized by examining all evidence that has accrued using the scale. A
scale may have good construct validity for certain inferences but much poorer
validity for others. Thus, construct validity is not an all-or-nothing affair but
requires an understanding and summary of available research on the scale and
considerations of the uses to which the scale will be put.
Much psychological research suffers an unfortunate confirmation bias
(Widaman, 2008), whereby researchers hypothesize that relatively high,
significant correlations will hold between certain measures and pay less atten-
tion to variables with which the focal measure should correlate at low levels.
When correlations are evaluated, correlations hypothesized to differ significantly
from zero are treated as theoretically important even if they are relatively
small in magnitude, and correlations hypothesized to be negligible are treated
as essentially equal to zero, even if they barely miss being deemed statistically
significant. To combat this unfortunate bias, researchers should develop
hypotheses regarding both the convergent and discriminant validity of their
measures, as Westen and Rosenthal (2003) argued. Convergent validity refers
to the degree to which a set of measures converges on the construct of interest.
Convergent validity is supported if measures of the same purported construct
exhibit high intercorrelations. Discriminant validity describes the degree of
meaningful separation, or lack of substantial correlation, between indicators
of putatively distinct constructs. Discriminant validity is supported if relations
between different constructs are approximately zero in magnitude or at least
much smaller than convergent correlations for the measures. The classic article
by Campbell and Fiske (1959) should be consulted regarding convergent and
discriminant validation, as should more recent work on structural equation
modeling (SEM) of such data (e.g., Eid & Diener, 2006; Widaman, 1985).
When creating a short form of a scale, the ultimate issue should be the
validity of the short form, rather than its reliability (John & Soto, 2007).
That said, reliability should not be disregarded; indeed, because reliability is
a prerequisite for validity, reliability should be the first psychometric index to
be evaluated. However, any short form will have fewer, often many fewer, items
than the original scale, so reliability of a short form is likely to be appreciably
lower than that for the full scale.
If a researcher were interested in creating a short form that has as high
a level of reliability as possible, he or she might select the subset of items that
have the highest MIC, because higher reliability arises from higher MIC values.
But if selection of only items with high MIC leads to a biased selection of
items (i.e., failure to preserve the breadth of the domain across the items in the
short form), then the validity of the short form may be severely compromised,
even as the reliability of the short form is maximized. Stated differently, the
optimal set of indicators for a short form measure of a given construct may not
be the indicators that have the highest internal consistency from among the
possible items of the full scale; in fact, maximizing internal consistency can lead
to suboptimal outcomes. Indeed, Loevinger (1954) described the attenuation paradox, in which increasing reliability leads to increasing validity up to a point, beyond which further increases in homogeneity decrease validity. Or, selecting items that correlate most highly can lead to selection of
items with extreme levels of item content overlap, leading to bloated specific
factors that represent pairs of redundant items (Cattell & Tsujioka, 1964).
Researchers must take care when developing short forms from longer original
measures because common approaches for developing short forms are poten-
tially problematic.
The most common methods for constructing short form measures are
(a) selecting a subset of items with the highest MIC (described earlier), to
maximize reliability of the short form; (b) selecting items with the highest
loadings on the common factor underlying the items, to obtain items most
closely aligned with the factor; (c) selecting items with the highest correlation
with the total scale score (preferably the highest correlation with a composite
of the remaining items on the scale); (d) selecting items with the highest
face validity, or items that are the most obvious indicators of the construct;
or (e) selecting items randomly from the original scale. Each of the preced-
ing methods has flaws, and most methods have several. Methods (a), (b), and
(c) use empirical methods, basing decisions on patterns of results from a
particular set of data. Because the subset of items that appears to be optimal
might vary across different sets of empirical data, basing item selection on a
single data set is problematic and capitalizes on chance results in a single
sample. Further, Methods (a) through (d) may result in a narrowing of item content, improperly restricting the breadth of the item content in the full
scale. Method (d) is based on subjective judgments by researchers, and care
must be taken lest the predilections of one researcher bias the item selection
in idiosyncratic ways. Finally, Method (e) appears to be an unbiased approach
to item selection, but researchers usually want to select the best items for a short
form, not a random sample of items. Only if all items of the larger scale were
equally good would a random selection of a subset of items be a reasonable
approach. In practice, items are rarely equally good, so this approach probably
would not lead to an optimal short form.
An underused approach that has potential merit is to identify a subset
of items that maintains the factorial integrity of the construct. By factorial
integrity, we mean that the construct maintains its levels of association with
a select set of criteria or other constructs and that the estimated mean and
variance of the construct are minimally changed. This focus on factorial
integrity of the construct is a focus on validity—ensuring that the construct
embodied in the short form maintains the same position in the nomological
network of relations among constructs (cf. Cronbach & Meehl, 1955) as did
the full, longer scale. SEM can be used iteratively to identify the subset of items
that maintains factorial integrity of the short form. Here, one fits a model
using the items from the full scale to represent the construct and includes a
carefully chosen set of additional criteria. In a second analysis, all aspects of
the model in the first analysis are identical except that one selects a subset of
items to represent the focal construct. The mean, variance, and associations
of the construct based on the full scale are compared with the mean, variance,
and associations of the construct based on the selected subset of items. This
model would be iteratively fit until an optimal subset of items is identified that
maintains the factorial integrity of the construct.
When using secondary data, using the preceding methods to construct
short form measures of constructs may not be possible. Secondary data are
what they are—existing data that can be used for new purposes. As a result,
if short forms were used when the data were collected, then existing short
forms have already been constructed, and the user must live with those existing
short forms. However, many secondary data sets have large selections of items
that were never assigned to a priori scales. Instead, the questions probe various
domains of content, and individual questions may have been used in prior
research to answer particular questions. Nothing should stop the enterprising
researcher from using these items to create scales to represent constructs of
interest, but care must be taken when doing so. Of course, creating new scales
from collections of items in an existing data set will involve new scales, not
short forms of established scales. Still, the resulting new scales will likely
consist of a fairly small number of items, so all principles and concerns related
to analysis and evaluation of short forms still apply.
Additional steps can be pursued to explore the use of short forms
constructed through the preceding steps. One step would be to perform a factor
analysis to determine whether factors aligned with newly constructed short
forms can be confirmed in the secondary data. Researchers should ensure that common factor techniques are used, because the use of principal-components analysis can yield biased estimates of model parameters (cf. Widaman, 1993, 2007).
Among the tried-and-true tools for short forms is the classic correction for attenuation (Spearman, 1910):

r_{XY_c} = \frac{r_{XY}}{\sqrt{r_{XX}\,r_{YY}}}, \qquad (3)

where r_XY is the observed correlation between scales X and Y, r_XX and r_YY are the reliabilities of the two scales, and r_XYc is the estimated correlation corrected for unreliability.
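In code, the correction is a one-liner; a minimal Python sketch with hypothetical values (the function name is ours):

def correct_for_attenuation(r_xy, r_xx, r_yy):
    # Spearman's (1910) estimate of the correlation between true scores
    return r_xy / (r_xx * r_yy) ** 0.5

# e.g., an observed r of .30 between short forms with reliabilities .75 and .80
print(round(correct_for_attenuation(0.30, 0.75, 0.80), 2))  # .39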
If the existing data set has similar constructs or criteria, the short form should
show patterns of association with these variables that are sufficiently similar
to encourage further consideration. Admittedly, differences between studies
can result in changes in correlations among variables, but similar patterns of
correlations among constructs should tend to hold across studies.
Our fifth recommendation, which pertains specifically to longitudinal
studies, is that researchers take care to ensure that variables are on the same
metric across times of measurement. In longitudinal data sets, items in a short
form may change from one occasion to the next. In such situations, researchers
must ensure that measurements are on the same underlying scale if growth or
change is the object of study. Simple approaches—such as the computation
of average item scores or of proportion scores—rest on problematic assumptions
that the items from the different forms function in precisely the same fashion
in assessing the construct. Thus, these approaches are too simplistic and problematic for current scientific work.
Linking the metric of latent variables across time in the presence of
changes in the sets of items on a short form can be accomplished using either
SEM or item response theory (IRT) approaches. Many different scenarios for the
migration of items off of or onto short forms can be envisioned. For example,
under one scenario, Items 1 through 12 (a short form of a 30-item instrument)
might be used to assess a construct for three times of measurement during early
adolescence; as participants move into later adolescence and are assessed three
additional times, the first six items are dropped and Items 13 through 18 are
substituted for them. Thus, 12 items are used at each measurement occasion, and
one subset of six items (Items 7–12) is used at all measurement occasions. Under
a second scenario, Items 1 through 12 are included at the first three times of
measurement, Items 1 through 12 are supplemented with Items 13 through 24
at the fourth time of measurement, and all remaining times of measurement
involve only Items 13 through 24. Under this scenario, no core set of items is
administered across all measurement occasions, but all items that appear at any
time of measurement are used at the fourth time of measurement. Clearly, many
additional scenarios could be posed as likely to occur in existing data.
Under the first scenario, an SEM approach might use Items 7 through 12
to define two parcels, the same items would be assigned to these two parcels
at each time of measurement, and these two parcels would appear at all times
of measurement. Items 1 through 6 could be summed to form a third parcel
for measurement occasions 1 through 3, and Items 13 through 18 could be
summed for a third parcel at the last three measurement occasions. If the
factor loadings, intercepts, and unique variances for the two common parcels
were constrained to invariance across all times of measurement, the resulting
latent variables would be on a comparable scale across all times of measurement.
An appropriate IRT approach would have a similar rationale, requiring the
presence of all 18 items in one analysis, and resulting theta scores (which
are estimates of participant level on the construct) would be on a comparable
scale across time.
The second scenario is, in some ways, simpler than the first, with all
items that are used at any time of measurement appearing at a single occasion
of measurement (the fourth time of measurement). The key analyses would
be the linking of scores across the two forms—Items 1 through 12 and Items 13
through 24—in analyses using data from the fourth measurement occasion.
Then, whether using SEM or IRT approaches, invoking invariance of param-
eter estimates from the fourth occasion of measurement on corresponding
estimates at other occasions of measurement would lead to latent-variable
scores on the same metric. Details of these methods are beyond the scope of
the present chapter. The Embretson and Reise (2000) text offers a very
good introduction to IRT procedures in general, and recent work (e.g., Cho,
Boeninger, Masyn, Conger, & Widaman, 2010; Curran et al., 2008) provides
relevant details and comparisons between SEM and IRT approaches.
CONCLUSIONS
Many secondary data sets contain only short forms of scales, yet these data can be used to answer questions of current theoretical interest, and new data with long
forms of scales would take years or decades to gather anew. Thus, rather than
ruing the absence of long form instruments, we recommend that researchers
concentrate instead on the most appropriate and state-of-the-art ways to
analyze the existing data, warts and all.
Our strongest recommendations range from tried-and-true to innovative
approaches to data analysis that can and should be used with short-form
instruments. The tried-and-true methods include estimation of reliability in
the sample of data at hand and use of the correction for attenuation when
estimating relations among variables. Similarly, SEM methods, particularly
multiple-indicator SEMs, accomplish a great deal in terms of correcting for
poorer measurement properties of short forms (Little et al., 1999), and these
methods are not generally novel any more. However, the ways in which SEM
or IRT can be used to ensure that the scale of a latent variable remains the same
across measurement occasions in the face of changes in the composition of a
short form are innovative. Current research is being done to illustrate how
this can and should be done and to establish optimal procedures for meeting
these analytic goals.
We have considered likely outcomes when short forms are used, offered
basic ideas about how to gauge the psychometric properties of short form data,
provided some guidelines about constructing new scales from older collections
of items in existing data, and recommended analytic strategies for evaluating
short form data and including them in models. Valuable secondary data are out
there, often containing short form instruments but waiting to be used as the
unique basis for answering interesting, crucial, state-of-the-science questions. We
encourage researchers to exploit such resources, using analytic approaches that
are appropriate for the data and provide optimal tests of their conjectures.
REFERENCES
Rammstedt, B., & John, O. P. (2007). Measuring personality in one minute or less:
A 10-item short version of the Big Five Inventory in English and German.
Journal of Research in Personality, 41, 203–212. doi:10.1016/j.jrp.2006.02.001
Reise, S. P., Waller, N. G., & Comrey, A. L. (2000). Factor analysis and scale revision.
Psychological Assessment, 12, 287–297. doi:10.1037/1040-3590.12.3.287
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton
University Press.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8,
350–353. doi:10.1037/1040-3590.8.4.350
Spearman, C. (1910). Correlation calculated with faulty data. The British Journal of
Psychology, 3, 271–295.
Westen, D., & Rosenthal, R. (2003). Quantifying construct validity: Two simple
measures. Journal of Personality and Social Psychology, 84, 608–618. doi:10.1037/
0022-3514.84.3.608
Widaman, K. F. (1985). Hierarchically nested covariance structure models for
multitrait–multimethod data. Applied Psychological Measurement, 9, 1–26.
doi:10.1177/014662168500900101
Widaman, K. F. (1993). Common factor analysis versus principal component analysis:
Differential bias in representing model parameters? Multivariate Behavioral
Research, 28, 263–311. doi:10.1207/s15327906mbr2803_1
Widaman, K. F. (2007). Common factors versus components: Principals and principles,
errors and misconceptions. In R. Cudeck & R. C. MacCallum (Eds.), Factor
analysis at 100: Historical developments and future directions (pp. 177–203). Mahwah,
NJ: Erlbaum.
Widaman, K. F. (2008). Integrative perspectives on cognitive aging: Measurement and
modeling with mixtures of psychological and biological variables. In S. M. Hofer
& D. F. Alwin (Eds.), The handbook of cognitive aging: Interdisciplinary perspectives
(pp. 50–68). Thousand Oaks, CA: Sage.
Zinbarg, R. E., Revelle, W., & Yovel, I. (2007). Estimating ωh for structures containing
two group factors: Perils and prospects. Applied Psychological Measurement, 31,
135–157. doi:10.1177/0146621606291558
Zinbarg, R. E., Yovel, I., Revelle, W., & McDonald, R. P. (2006). Estimating the
generalizability to a latent variable common to all of a scale’s indicators: A
comparison of estimators of ωh. Applied Psychological Measurement, 30, 121–144.
doi:10.1177/0146621605278814
4
ANALYZING SURVEY DATA
WITH COMPLEX SAMPLING DESIGNS
PATRICK E. SHROUT AND JAIME L. NAPIER
1Statisticians do not like this informal term, as the sample does not ever represent the population
precisely. However, methodologists often use this term to state that some effort was taken to design
a sample that can be used to produce unbiased estimates of population values.
1999; Skinner, Holt, & Smith, 1989; StataCorp, 2007; for a more technical
reference, see Sarndal, Swensson, & Wretman, 1992).
A NUMERICAL EXAMPLE
2A simple random sample would require that all persons were enumerated and that selection was obtained
by using some random rule to select 35 persons directly from the sampling frame.
3Bias from a statistical perspective is the difference in the expected value of a statistic from the true
population value.
TABLE 4.1
Data for Amnesty Survey Item for 36 Fictional Respondents

Subject ID    Household ID    Household size    Survey response
 1             14             1                 3
 2             33             1                 2
 3            131             1                 3
 4            221             1                 3
 5            249             1                 2
 6            405             1                 3
 7            453             1                 3
 8            474             1                 2
 9            487             1                 3
10            489             1                 4
11             97             2                 4
12            108             2                 5
13            134             2                 5
14            161             2                 5
15            247             2                 5
16            287             2                 3
17            291             2                 2
18            343             2                 3
19            369             2                 4
20            396             2                 3
21            286             3                 3
22            325             3                 3
23            337             3                 6
24            348             3                 5
25            356             3                 3
26            375             3                 4
27            383             3                 4
28            407             3                 2
29            418             3                 3
30             67             4                 5
31            157             4                 5
32            169             4                 5
33            268             4                 6
34            340             4                 7
35            417             4                 3
36             27             5                 5

Note. ID = identification number.
the biased sample mean of 3.78 in Table 4.1.4 In the next section, we discuss
how sampling weights can be used to eliminate the bias, but these weights
make inference more complicated. This is one reason that special software is
needed for analysis of complex samples.
4The sample mean in this small example is not statistically different from the known population mean, but our simulation study allows us to calculate the expected value of the biased estimator. It is 3.80, which is quite close to the sample mean in this case and, it is important to note, smaller than the population mean.
TABLE 4.2
Data for Amnesty Survey Item for 26 Additional Fictional Respondents From Households With Two or More People

Subject ID    Household ID    Household size    Survey response
37             97             2                 3
38            108             2                 3
39            134             2                 4
40            161             2                 5
41            247             2                 4
42            287             2                 6
43            291             2                 4
44            343             2                 4
45            369             2                 4
46            396             2                 4
47            286             3                 4
48            325             3                 4
49            337             3                 5
50            348             3                 4
51            356             3                 5
52            375             3                 5
53            383             3                 3
54            407             3                 5
55            418             3                 2
56             67             5                 5
57            157             4                 3
58            169             4                 5
59            268             4                 4
60            340             4                 4
61            417             4                 3
62             27             5                 5

Note. ID = identification number.
TABLE 4.3
Means of Simulated Data in Tables 4.1 and 4.2, Ignoring Sample Design and Considering Sample Design

Estimator                                        M       SE
Unweighted mean of 36 persons                    3.78    0.2150
Unweighted mean of 62 persons in 36 clusters     3.92    0.1460
Weighted mean of 36 persons                      4.12    0.2155
Weighted mean of 62 persons in 36 clusters       4.05    0.1471
small. Clustered data usually provide less information than independent data, and therefore the standard error estimate needs to be adjusted with special software.
Table 4.3 shows a comparison of the population mean and four estimates of the mean. The first is a simple mean and standard error estimate of the data in
Table 4.1, with no adjustment for sample selection probability. The second
is a simple mean and standard error estimate of data in Tables 4.1 and 4.2,
with no adjustment for either selection or clustering. The third estimate is an
estimate of the Table 4.1 mean, taking into account sampling weights, and
the last is an estimate of data in Tables 4.1 and 4.2, with adjustments for both
weights and clustering. If these were real data, the different methods used
would lead political commentators to give different spins on the outcomes of
the survey. The biased mean, which is less than the neutral point of 4.0, suggests
that on average the population is modestly against amnesty, whereas the
unbiased mean suggests that this population is on average undecided on the
issue of amnesty.
To set the stage for understanding alternate approaches for the analysis
of data from complex survey designs, we briefly consider five technical sample
survey issues. These issues are (a) a possible correction for finite populations,
(b) random versus systematic samples, (c) stratified samples, (d) sample weights,
and (e) cluster samples. For a full discussion of these topics, see Levy and
Lemeshow (1999).
5If N is the number of persons in the population and n is the number of persons in the sample, then
the finite population correction for the standard error of the mean is SQRT[(N − n)/N] for a simple
random sample. Similar adjustments are made for estimates from more complex sampling designs.
outcome of interest. See Levy and Lemeshow (1999, pp. 81–120) for some
issues that can arise with systematic sampling. These are generally small issues,
and we do not distinguish systematic samples from other random samples.
Stratified Samples
Sample Weights
With sampling weights w_i for the n respondents, the weighted mean of an outcome Y is

\bar{Y}_W = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i}. \qquad (1)
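Equation 1 is straightforward to apply; a minimal Python sketch reproduces the unweighted and weighted means in Table 4.3 for the 36 respondents in Table 4.1, assuming (consistent with those results) that each respondent's weight equals his or her household size:

import numpy as np

# Survey responses from Table 4.1, grouped by household size;
# assumed weights: proportional to household size (reproduces Table 4.3)
y = np.array([3, 2, 3, 3, 2, 3, 3, 2, 3, 4,   # household size 1
              4, 5, 5, 5, 5, 3, 2, 3, 4, 3,   # household size 2
              3, 3, 6, 5, 3, 4, 4, 2, 3,      # household size 3
              5, 5, 5, 6, 7, 3,               # household size 4
              5])                             # household size 5
w = np.array([1] * 10 + [2] * 10 + [3] * 9 + [4] * 6 + [5])

print(round(y.mean(), 2))                  # 3.78, the unweighted (biased) mean
print(round(np.average(y, weights=w), 2))  # 4.12, the weighted mean of Equation 1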
Cluster Samples
6Correct in this case means that important variables related to the outcome are included and that the
functional forms of the relationships (e.g., a linear or nonlinear relation) are properly specified.
made contact with the household. This principle can be generalized to pro-
duce a cluster sample design. The survey organization can enumerate primary
sampling units (PSUs), such as census tracts, zip codes, or counties, and obtain
random samples of these geographic units. Next, the households within the
sampled units are enumerated and a sample obtained of these nested units.
Within the households, the sampling plan might call for obtaining multiple
participants from among those eligible. The primary benefit of a cluster
sample design is cost saving in the fieldwork. The disadvantage is that the
additional surveys that are obtained do not add as much information to the
parameter estimates as independent observations.
The loss of information of additional observations within clusters is
sometimes called the design effect (Levy & Lemeshow, 1999, p. 302). The design
effect is a function of both the proportion of observations that are clustered
together and the empirical similarity of observations within the cluster relative
to observations across clusters. Depending on how similar respondents are
within clusters, the effective sample size might vary from the total sample n
(a design effect of 1.0) to the (much smaller) number of PSUs (a design effect of n/[number of PSUs]). What we mean by similarity of observations depends on the out-
come variable being analyzed. A cluster sample might be very efficient for one
outcome but not very efficient for another. Design effects are largest for outcomes
that are influenced by neighborhood and family factors. Because cluster sample
designs produce sequences of observations that cannot be assumed to be inde-
pendent, it is especially important to use survey sample software that takes
the dependency of the data into account when computing standard errors of
estimates.
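For intuition, a small Python sketch (illustrative values only) shows how a design effect translates into an effective sample size:

def effective_n(n, deff):
    # A clustered sample of n behaves roughly like a simple random sample of n / deff
    return n / deff

# Example: with an average sqrt(design effect) of 1.098 (the ANES 2000 value
# reported later in this chapter), a sample of 1,807 carries about as much
# information as a simple random sample of roughly 1,500.
sqrt_deff = 1.098
print(round(effective_n(1807, sqrt_deff ** 2)))  # 1499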
SOFTWARE APPROACHES
standard errors. Readers are encouraged to consult Levy and Lemeshow (1999),
who provide both Stata and SUDAAN syntax examples for a wide variety of
analyses of complex sample data. On our website (http://www.psych.nyu.edu/
couples/SurveySyntax.rtf) we provide illustrations of SPSS, SAS, and Stata
syntax for the estimation of descriptive statistics and a simple regression
analysis of a secondary data set.
7http://www.electionstudies.org/studypages/2000prepost/2000prepost.htm
with certainty (e.g., New York City, Los Angeles, Chicago, Dallas/Fort Worth).
Of the 20 next largest PSUs, 10 were selected (e.g., Houston, Seattle/Tacoma,
Cleveland, Denver) using a stratified design that randomly selected one unit
from pairs of cities that were formed to be similar geographically. In addition
to these 18 locations, 26 PSUs were sampled from the 80 remaining units with
the probability of selection proportionate to size (on the basis of 1990 census
information). For the second stage of selection, PSUs were divided into “area
segments” on the basis of 1990 census information, and six to 12 segments
were selected within PSUs with a selection probability proportionate to size.
A total of 279 such segments were selected. For the third stage of selection,
2,269 housing units were enumerated within the segments and housing units
were selected with equal probability. Of the 1,639 occupied selected housing
units, 1,564 contained eligible persons. To be eligible, a person needed to be a U.S. citizen and at least 18 years of age on or before November 7, 2000.
For the final stage of selection, one resident per housing unit was selected at
random using a procedure described by Kish (1949). The response rate for the
FTF survey was calculated to be .64.
Analysis Considerations
ANES 2000 provides a composite sampling weight for each respondent.
It incorporates sampling probability, nonresponse adjustments based on Census
region, and poststratification adjustments based on age group and education
level. The weight is scaled so that its sum equals the sample size, 1,807. This scaling makes it difficult to apply finite population corrections, but such corrections are rarely a concern given the way psychologists typically provide evidence. Moreover, the U.S. population is so large relative to the sample that these adjustments would have little impact. Scaling the weights to the obtained sample size allows them to be used in standard statistical programs to obtain standard error estimates that are in the ballpark of the correct values. In the codebook of the ANES 2000, the survey researchers
report that the standard errors from standard programs are likely to be too small
by a factor of 1.098 on average.8
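The rescaling itself is trivial; a Python sketch with hypothetical raw weights:

import numpy as np

w_raw = np.array([0.8, 1.3, 2.1, 0.5, 1.0, 3.2])  # hypothetical raw weights
w = w_raw * len(w_raw) / w_raw.sum()              # rescaled to sum to the sample size
print(round(w.sum(), 6))                          # 6.0, the number of cases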
To obtain the best measures of estimation precision, one needs to
explicitly take the sampling design into account in the analysis. Although the
ANES 2000 sample is defined as a composite of two different designs, one for
the FTF mode and one for the TEL mode, the public access data set provides
combined survey design variables that incorporate the needed information
for an appropriate analysis. A variable called sampling error stratum code (SESC)
is provided that links each person to a sample design PSU and a sample area
segment. To ensure confidentiality of the responses, the specific identities of
sample area segments within PSUs are not provided in the public access version
of the data. In combination with interview mode (FTF vs. TEL), the combined sampling weight and the SESC variable are all that is needed to use Taylor series methods to take the complex sampling design into account in the analysis.9 Insofar as the responses within SESC groupings are more homo-
geneous than responses across the groupings, the standard errors of parameter
estimates need to be larger than those based on observations that are completely
independent.
In the analyses that are presented in the next section, we specified the
interview mode as a major stratification variable, the SESC groups as the key
clustering variable, and the composite weight as the adjustment for differing
selection probabilities and nonresponse. We treated the sample as if it were
sampled with replacement, which implies that the population is substantially
larger than the sample. This latter treatment removes any adjustments for
finite populations.
In 2000, the ANES added items to the survey that assessed the respondents’
cognitive style, namely the “need for cognition” (Cacioppo & Petty, 1982).
8The survey researchers examined a variety of outcomes and calculated an average square root of the design
effect to be 1.098 for the combined sample, 1.076 for the FTF sample, and 1.049 for the TEL sample.
Multiplying the standard errors from the usual statistical packages by these values will give a better
approximation to the correct standard error.
9The information that is needed for this analysis is provided in the public access codebook, but it is
embedded in several pages of technical description. Analysts need to patiently review all the technical
material provided on the public access files to find the necessary information.
Descriptive Statistics
In Table 4.4, we show six different estimates of means and standard errors
for the need for cognition and political orientation variables. The first two use
standard (nonsurvey) software procedures, whereas the last four use complex
survey software (see http://www.psych.nyu.edu/couples/SurveySyntax.rtf for
SPSS, Stata, and SAS syntax).10 As shown in the first two rows of the table, the
application of sample weights in the standard software has a small effect on
the mean for need for cognition but scarcely any effect for political orientation.
The estimates of the standard errors, which ignore the sample clusters, are not
much affected by the application of the weights. This is not surprising because
the weights have been scaled to sum to the sample size. When complex
sample software is used with weights, we obtain the same estimates of means as
provided by the standard software (when weights were used), but the estimates
of standard errors are slightly larger, even without taking strata and cluster
information into account. When clustering is considered, the standard error
estimates for both variables increase further. This implies that individuals
10For these results, as well as the regression results presented next, the three software systems produced exactly the same estimates and standard errors for the different conditions.
TABLE 4.4
Descriptive Statistics for Need for Cognition and Political Orientation
Need for cognition Political orientation
within each cluster were more likely to respond in a similar way than individ-
uals across clusters. Thus, not considering the nonindependence of observations
would lead to a biased estimate of the standard errors. The ratio of the stan-
dard error in the last row of Table 4.4 to that of row 3 provides an estimate of
the square root of the design effect. For need for cognition, the estimate is
0.047/0.033 = 1.42, and for political orientation it is 0.046/0.043 = 1.07.
One is larger and one is smaller than the average values provided in the ANES codebook, illustrating the fact that design effects vary with the outcome.
Regression Analysis
Next, we examined the relation of political orientation to need for
cognition while adjusting for age, sex, marital status, and region. Ages ranged
from 18 to 97, with a mean of 47.21. Sex was dummy coded so that “0” = males
and “1” = females. Marital status was entered as dummy codes with those who
are widowed, divorced or separated, single, and partnered compared with
those who are married. Region was entered as dummy codes with individuals
living in the Northeast, North Central, and Western United States compared
with those living in the South.
We estimated three regression models: a model without weights, one with
sample weights only, and a model that accounted for clusters and strata using
survey software (see http://www.psych.nyu.edu/couples/SurveySyntax.rtf for
the syntax to run this latter model in SPSS, SAS, and Stata). As shown in
Table 4.5, these three models could lead to different conclusions. For instance,
in the unweighted model, one might conclude that people who are widowed
do not differ significantly from married people on political orientation. When
the sample weight was taken into account (either with or without using
complex samples survey software) as shown in the second and third columns
TABLE 4.5
Predicting Political Orientation With Age, Marital Status, Region, and Need for Cognition

                                                                With complex samples software
                      Unweighted           Weighted             Weights only          Weights, strata, and clusters
Independent variable  b      SE     p      b      SE     p      b      SE     p       b      SE     p
Need for cognition    −.057  .030   .060   −.073  .030   .014   −.073  .031   .018    −.073  .032   .023
Age (decades)          .055  .029   .058    .051  .028   .069    .051  .030   .092     .051  .030   .098
Female                −.244  .083   .003   −.238  .081   .004   −.238  .087   .006    −.238  .083   .005
Widowed               −.255  .161   .113   −.309  .175   .078   −.309  .167   .065    −.309  .150   .041
Divorced              −.471  .115   .000   −.452  .125   .000   −.452  .122   .000    −.452  .116   .000
Single                −.584  .114   .000   −.653  .113   .000   −.653  .121   .000    −.653  .111   .000
Partnered             −.963  .262   .000   −.932  .247   .000   −.932  .272   .001    −.932  .281   .001
Northeast             −.320  .118   .007   −.331  .112   .003   −.331  .124   .007    −.331  .122   .007
North Central         −.107  .106   .312   −.220  .104   .035   −.220  .107   .040    −.220  .114   .056
West                  −.294  .109   .007   −.352  .109   .001   −.352  .117   .003    −.352  .116   .003
of Table 4.5, we found that the estimate for being widowed on political ori-
entation increased and that the relationship between being widowed and
political orientation is marginally significant. When the strata and clusters
are considered, the standard errors are further refined, as shown in the third
set of columns of Table 4.5. In this model, there is evidence that widows
are significantly more liberal than married persons. In this case, the standard
error decreased, which means that the design effect was less than one. This
is unusual in cluster samples, and it suggests that the widow effect on polit-
ical orientation may be somewhat more pronounced across clusters than
within clusters.
In addition, the unweighted model shows that need for cognition is only
marginally related to political orientation. When the sample weights are applied,
the estimate increases and the association between political orientation and need for cognition is significant at the .05 level. In this case, when the strata
and clusters are considered, the standard error increases slightly, but the effect
remains statistically significant.
CONCLUSION
REFERENCES
Bizer, G. Y., Krosnick, J. A., Petty, R. E., Rucker, D. D., & Wheeler, S. C. (2000).
Need for cognition and need to evaluate in the 1998 National Election Survey
pilot study. ANES Pilot Study Report, No. nes008997. Retrieved from ftp://ftp.
electionstudies.org/ftp/nes/bibliography/documents/nes008997.pdf
Brogan, D. J. (1998). Pitfalls of using standard statistical software packages for sample
survey data. In P. Armitage & T. Colton (Eds.), Encyclopedia of biostatistics
(Vol. 5; pp. 4167–4174). New York, NY: Wiley.
Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and
Social Psychology, 42, 116–131. doi:10.1037/0022-3514.42.1.116
Cochran, W. G. (1977). Sampling techniques. New York, NY: Wiley.
Duan, N., Alegria, M., Canino, G., McGuire, T. G., & Takeuchi, D. (2007). Survey
conditioning in self-reported mental health service use: Randomized compari-
son of alternative instrument formats. Health Services Research, 42, 890–907.
doi:10.1111/j.1475-6773.2006.00618.x
DuMouchel, W. H., & Duncan, G. J. (1983). Using sample survey weights in multi-
ple regression analyses of stratified samples. Journal of the American Statistical
Association, 78, 535–543. doi:10.2307/2288115
Jost, J. T., Napier, J. L., Thorisdottir, H., Gosling, S. D., Palfai, T. P., & Ostafin, B.
(2007). Are needs to manage uncertainty and threat associated with political
conservatism or ideological extremity? Personality and Social Psychology Bulletin, 33,
989–1007. doi:10.1177/0146167207301028
Kish, L. (1949). Procedure for objective respondent selection within the household.
Journal of the American Statistical Association, 44, 380–387. doi:10.2307/2280236
Kruglanski, A. W., Pierro, A., Mannetti, L., & De Grada, E. (2006). Groups as
epistemic providers: Need for closure and the unfolding of group-centrism.
Psychological Review, 113, 84–100. doi:10.1037/0033-295X.113.1.84
Lee, E. S., & Forthofer, R. N. (2006). Analyzing complex survey data. Thousand Oaks,
CA: Sage.
Levy, P. S., & Lemeshow, S. (1999). Sampling of populations: Methods and applications.
New York, NY: Wiley.
Meng, X.-L., Alegria, M., Chen, C., & Liu, J. (2004). A nonlinear hierarchical model
for estimating prevalence rates with small samples. In T. Zheng (Chair), ASA
Proceedings of the Joint Statistical Meetings (pp. 110–120). Alexandria, VA: American
Statistical Association.
National Election Studies. (n.d.). The 2000 National Election Study [data set]. Ann
Arbor, MI: University of Michigan, Center for Political Studies [producer and
distributor]. Retrieved from http://electionstudies.org/studypages/download/
datacenter_all.htm
Research Triangle Institute. (1989). SUDAAN: Professional software for survey data
analysis. Research Triangle Park, NC: RTI.
Sarndal, C., Swensson, B., & Wretman, J. (1992). Model-assisted survey sampling.
New York, NY: Springer-Verlag.
Schuman, H., & Bobo, L. (1988). Survey-based experiments on White racial attitudes
toward residential integration. American Journal of Sociology, 94, 273–299.
doi:10.1086/228992
Skinner, C. J., Holt, D., & Smith, T. M. F. (Eds.). (1989). Analysis of complex surveys. New York, NY: Wiley.
Sniderman, P. M., & Piazza, T. L. (1993). The scar of race. Cambridge, MA: Belknap
Press.
StataCorp. (2007). Stata statistical software: Release 10. College Station, TX:
StataCorp, LP.
Westat. (2007). WesVar 4.3 user's guide. Rockville, MD: Author.
5
MISSING DATA IN SECONDARY
DATA ANALYSIS
PATRICK E. MCKNIGHT AND KATHERINE M. MCKNIGHT
1The inferential process we refer to here pertains primarily to frequentist statistical procedures. Bayesians
typically do not hold to this dictum because Bayesian procedures allow for updating from previous studies
or from within studies.
2A discussion of statistical power goes beyond the scope of this chapter. For a clear definition and
TABLE 5.1
A Generic Data Set to Illustrate Missing Data Mechanisms
ID IV DV MCAR.DV MAR.DV MNAR.DV
With MNAR data, nothing from the other variables would allow one to predict that the lowest values for the DV are missing. Therefore, as researchers, we cannot know that the available data for the DV are actually biased, with the lowest end of the distribution missing. If we had only three variables in our data set—ID, DV, and MNAR.DV—then we would be unable to predict the missing scores from the available variables. What makes this missing data situation difficult is that the mechanism
is not available to the data analyst, and the mechanism cannot be ignored
statistically because of the aforementioned bias in the available data.
These three terms—MCAR, MAR, and MNAR—form the basis of the
missing data statistical language. The three mechanisms play an important
role in determining how to diagnose and treat missing data. Additional factors
play important roles, and we address those when discussing diagnostics.
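The three mechanisms are easy to mimic by simulation, which mirrors the structure of Table 5.1; a minimal Python sketch (variable names follow the table; all values are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
n = 300
iv = rng.normal(size=n)
dv = 0.5 * iv + rng.normal(size=n)

mcar_dv = dv.copy()
mcar_dv[rng.random(n) < 0.2] = np.nan        # missing completely at random

mar_dv = dv.copy()
mar_dv[iv > np.quantile(iv, 0.8)] = np.nan   # missingness depends on the observed IV

mnar_dv = dv.copy()
mnar_dv[dv < np.quantile(dv, 0.2)] = np.nan  # lowest DV values themselves are missing

for name, col in [("MCAR", mcar_dv), ("MAR", mar_dv), ("MNAR", mnar_dv)]:
    print(name, round(float(np.nanmean(col)), 2))  # note the upward bias under MNAR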
3Software packages differ in their default handling of impossible transformations. Some packages report
an error, whereas others complete the computation without any hint that the transformation was illogical
for some cases. It is always best practice to scan the raw values and transformed values graphically before
proceeding to subsequent data analyses.
The first diagnostic is the amount of missing data. When only a relatively small number of cells (say, 1 cell for every 1,000) are missing, the analyst may need to do nothing. If, however, many cases have missing values, something ought to be done with those missing values, or the researcher risks losing a substantial portion of the sample (and thereby decreasing statistical power and biasing results).
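As a starting point for this first diagnostic, a short sketch of how one might tabulate the amount of missing data with pandas (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical secondary data file

overall = df.isna().mean().mean()        # proportion of missing cells overall
per_variable = df.isna().mean()          # missingness rate for each variable
per_case = df.isna().mean(axis=1)        # missingness rate for each case

print(f"{overall:.1%} of all cells are missing")
print(per_variable.sort_values(ascending=False).head())
```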
The amount of missing data fails to communicate the full extent of the
missing data problem because it does not address the scientific relevance of
the missing values. A supplemental diagnosis focuses on the level of missing
data. Level refers to measurement and generally falls under the five categories
of item (individual questions), scale (combinations of items), construct
(all relevant measures of a construct), person (an individual respondent), or
group (natural or artificial collections of persons). These levels convey the scientific relevance of missing data from a sample and often indicate the extent of the influence missing data might have on results. Missing data at
the item level tend to be far less damaging than missing data at the group
level. In a way, level communicates the severity of the problem.
Another method of diagnosing severity is the pattern of missing data.
Missing data may be patterned in any manner, but some patterns present fewer
problems than others. For illustrative purposes (see Figure 5.1), imagine the
data were recoded as “1” or “0,” indicating present or missing, respectively,
and the raw data matrix were plotted as a rectangle with the rows representing
the cases (N = 300) and the columns representing the variables (N = 30).
As the figure makes evident, the patterns themselves carry information. The most disorganized pattern suggests that no general cause underlies the missing values, whereas the most organized pattern suggests that a systematic or causal process accounts for them. Patterns of missing data can point toward the mechanism but are not direct tests of mechanisms.
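A sketch of the recode-and-plot idea just described, using simulated data (the 300 × 30 dimensions mirror the illustration; all names are hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 30)))      # 300 cases, 30 variables
df = df.mask(rng.random(df.shape) < 0.05)          # make ~5% of cells missing

present = df.notna().astype(int)                   # 1 = present, 0 = missing
plt.imshow(present, aspect="auto", cmap="gray")    # black cells are missing
plt.xlabel("Variables")
plt.ylabel("Cases")
plt.show()
```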
The final diagnostic pertains directly to the three missing data mecha-
nisms (MCAR, MAR, and MNAR) discussed previously. Because patterns
do not directly identify which mechanism produced missing values, one ought
to investigate the underlying structure of the missing data. As mentioned
previously, the mechanism determines what information is available from the data and what needs to be done to remedy the situation. However, no direct
test for all three mechanisms exists. The only test proposed to date is for
ruling out MCAR. Little (1988) developed a chi-square test for assessing the
extent to which available data could account for missing values. If the test is
nonsignificant, then the researcher failed to rule out MCAR. It is important
to note that the test does not indicate whether the missing data mechanism
is MCAR; rather, the test rules out, or fails to rule out, the possibility that
the mechanism is MCAR. Therefore, if MCAR can be ruled out, that leaves the two other mechanisms: MAR and MNAR. Differentiating between these two is far more difficult, because the information needed to distinguish them is itself missing.
Figure 5.1. Patterns of missing data. The black cells represent the missing values,
and the white cells represent the observed values. This figure shows four possible
patterns plotted for a relatively large data set arranged from the most disorganized
pattern (A) to the most organized pattern (D). The four patterns illustrated are a
completely random pattern (i.e., randomly missing by case and variable) (A), a random
pattern by variable but fixed by case (B), a random pattern by case but fixed by
variable (C), and a monotonic pattern (fixed by case and variable) (D).
Taken together, these diagnostic tools help the researcher to understand better the extent, severity, and difficulty of the missing data problem.
Deletion Methods
The easiest and most obvious method for treating missing data is by
deleting cases, variables, or cells with missing values. That is, the data con-
tained in the case, variable, or cell are deleted from the analysis. Deletion
methods are the easiest to implement because they require little from the
analyst. Moreover, these methods are the easiest to understand. Data that are
missing are essentially ignored.
Listwise deletion (complete case), pairwise deletion (available case),
and available item are three procedures that fall under the deletion category.
Listwise deletion involves deleting all cases with any missing data—in some
cases, even if the case is missing data for a variable that is not used in the
statistical modeling. Listwise deletion will always result in the same sample
size, regardless of the analysis, because it leads to a reduced data set with
complete data for all observations. In contrast to listwise deletion, pairwise
deletion omits cases on the basis of the bivariate, or paired (e.g., correlational),
analysis. Computations in which one or both of the variables are missing for
a given case use only those cases for which data for both variables are present.
Pairwise deletion, thus, results in different sample sizes for each bivariate
analysis.
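A brief pandas sketch of the two case-deletion procedures (the small data frame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": [2.0, np.nan, 6.0, 8.0],
                   "z": [1.0, 1.0, 2.0, np.nan]})

# Listwise deletion: drop every case with any missing value, leaving one
# fixed sample size for all subsequent analyses.
complete_cases = df.dropna()

# Pairwise deletion: for each pair of variables, pandas uses only the cases
# observed on both, so every correlation can rest on a different n.
pairwise_r = df.corr()
```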
Both listwise and pairwise deletion pertain to cases. The third and final
procedure—available item—is the only deletion procedure that pertains
directly to variables. Data reduction procedures such as summing or averaging
variables require complete data on all the variables (items) for the calculation.
Depending on the statistical software, when variables used to calculate a total score have missing values, the summary score will either be missing or will be computed from only the variables with observed data (the “available items”). In the latter case, the score is based on differing numbers of variables across cases with missing data, which is also problematic.
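A short sketch contrasting available-item scoring with a rule requiring complete items (item names are hypothetical):

```python
import numpy as np
import pandas as pd

items = pd.DataFrame({"item1": [4.0, 5.0, np.nan],
                      "item2": [3.0, np.nan, np.nan],
                      "item3": [5.0, 4.0, 2.0]})

# Available-item mean: each respondent's score averages whatever items he or
# she answered, so scores rest on differing numbers of items across cases.
available_item = items.mean(axis=1)

# Stricter alternative: the score is missing unless every item is observed.
complete_only = items.mean(axis=1).where(items.notna().all(axis=1))
```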
Deletion procedures are the easiest to use, and they tend to be the default
procedure in many statistical packages. These methods require little (if any)
planning or effort on the part of the researcher. Additionally, these methods
often result in the least disturbance to the data integrity (i.e., they do not
manipulate the raw data). Despite these advantages, in many situations the use
of deletion procedures ought to be considered with great caution. Listwise
deletion results in the complete elimination of cases because of the presence of
missing data, even if the missing data are irrelevant to the analysis of choice.
If a sufficiently large number of cases contain missing values, they will
be omitted from analyses, thus reducing statistical power. Furthermore, some
deletion procedures may create new problems that are unrelated to missing
data. Pairwise deletion, in particular, creates unbalanced correlation or
covariance matrices that often cannot be used for more complicated models
(e.g., factor analysis or structural equation models) because the matrices fail
to provide mathematically sound structures.4 Although deletion procedures are efficient, they require careful attention to the specifics of the situation (i.e., the amount, pattern, level, and mechanism of missing data). Listwise deletion is an efficient
and useful method only when data are MCAR and when a negligible amount
is missing. Pairwise deletion is an efficient method for bivariate analyses when
the data are MCAR and the results of the analyses will undergo no further
procedures (e.g., correlations are the end result). Finally, available item scoring is defensible only when the data are MCAR and enough items remain to represent each scale adequately.
4The sound structures we refer to are balanced, invertible matrices that contain no singularities or linear
dependencies. These topics fall outside the scope of the chapter, and interested readers are encouraged
to consult more technical books on the mathematics underlying covariance modeling procedures.
Weighting Methods
Adjustment Methods
One problem with missing data is that missing values tend to produce
estimates that do not reflect accurate population parameters, particularly
when missing data are MAR or MNAR. A way to counter the inaccuracy is
to adjust the estimates so they are more accurate or, perhaps, not as misleading.
Adjustments tend to be based on a model or distribution, such as the normal distribution, and thus allow for flexibility in application. The estimation approach used most frequently under the normal model is maximum likelihood. Several variants of maximum likelihood, such as restricted maximum likelihood, full-information maximum likelihood, marginal maximum likelihood, and expectation maximization (EM), have been proposed. Each variant slightly modifies the basic normal-distribution method, adjusting parameter estimates to fit the expected distribution better and thus providing potentially better population estimates. The actual mechanics of these procedures are
complex and fall outside the scope of the current chapter. The important
point is that these methods do not replace missing values, but instead they
are part of the parameter estimation procedure, and their purpose is to adjust
parameter estimates to reflect better the expected population values, given
the hypothesized distribution of scores (e.g., multivariate normal). Adjustment
methods are regarded as both useful and efficient when missing data are MCAR
or MAR (e.g., Allison, 2001; Schafer & Graham, 2002). These methods can
be used to treat missing data across many different data analytic models; unlike
deletion methods, however, they require some statistical sophistication to
implement, troubleshoot, and interpret.
Most adjustment methods rely on the assumption that an underlying
distribution is inherently normal (i.e., multivariate normal for more complex
multivariate analyses). In many instances, these normal distribution procedures
suit the data and, therefore, perform as expected. Data that are nonnormal and cannot be transformed to normality are not well suited for these procedures.
Additionally, the missing data must be ignorable: The mechanism must be either MAR or MCAR, and the amount of missing data must be sufficiently small to avoid over- or underadjustment. Another limitation of adjustment procedures is that there is no direct way to verify whether adequate adjustment has been made to counter the missing data. Without this information, researchers cannot gauge the impact of missing data on statistical
results. We discuss this general limitation further when we address MI.
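To make the adjustment idea concrete, here is a minimal sketch of one such procedure—expectation maximization for a bivariate normal model in which one variable is MAR—using simulated data (an illustration of the general approach, not the authors' own example; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y depends on x, and y is MAR because the probability of
# missingness depends only on the fully observed x.
n = 2000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)
missing = rng.random(n) < 1 / (1 + np.exp(-2 * x))  # higher x, more missing
y_obs = np.where(missing, np.nan, y)
observed = ~missing

# EM alternates between expected sufficient statistics for y (E-step) and
# updated parameter estimates (M-step); no missing value is ever "filled in."
mu_x, s_xx = x.mean(), x.var()
mu_y, s_yy = np.nanmean(y_obs), np.nanvar(y_obs)
s_xy = 0.0
for _ in range(200):
    beta = s_xy / s_xx
    ey = np.where(observed, y_obs, mu_y + beta * (x - mu_x))   # E[y | x]
    ey2 = np.where(observed, y_obs ** 2, ey ** 2 + s_yy - beta * s_xy)
    mu_y = ey.mean()
    s_xy = (x * ey).mean() - mu_x * mu_y
    s_yy = ey2.mean() - mu_y ** 2

print(f"adjusted mean of y: {mu_y:.3f}; naive mean: {np.nanmean(y_obs):.3f}")
```

Because the naive mean ignores the fact that high values of y are disproportionately missing, it is biased; the EM-adjusted estimate recovers a value close to the true population mean.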
Imputation Methods
Single Imputation
As the name suggests, single imputation replaces each missing value once
with a particular value. The imputed value may be a constant (e.g., zero),
a between-subjects value (e.g., group-level mean), a within-subject value
(e.g., mean of other items completed by the respondent or the last observed
value carried forward), or a random value from either the current data (i.e., hot
deck) or other similar data (i.e., cold deck). Single imputation methods impute
a single value into each missing data cell and, therefore, result in a complete
data set. The aim of all imputation strategies is to create a complete data
set prior to statistical analysis. Selecting which value to impute presents a
dilemma to the analyst. Some values may come from the data (hot deck or
last observation carried forward), but those values may have inherent biases.
Other values come from the data indirectly (e.g., mean imputation), by random
process (e.g., hot deck), from other data (e.g., cold deck), or by principle
(e.g., zero imputation). Choosing among these myriad sources can be as perplexing as fully appreciating the implications and limitations of each.
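A brief sketch of three of these single-imputation choices (the scores variable is hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
scores = pd.Series([3.0, np.nan, 5.0, 4.0, np.nan])

zero_imputed = scores.fillna(0)              # imputation by principle (zero)
mean_imputed = scores.fillna(scores.mean())  # between-subjects mean imputation

# Hot-deck style: each missing value is replaced by a random draw from the
# observed values in the current data.
hot_deck = scores.copy()
hot_deck[hot_deck.isna()] = rng.choice(scores.dropna().to_numpy(),
                                       size=hot_deck.isna().sum())
```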
Multiple Imputation
The primary disadvantage of most missing data treatment methods is that there is no process for determining the extent to which missing values affect study results. MI is the exception. The MI approach to missing data provides an estimate of the impact missing data had on statistical results. This facility alone makes MI the preferred treatment method among expert data analysts. The following briefly summarizes the steps of conducting MI.
The MI process begins by selecting an appropriate imputation method,
given the type of data for which values are missing—continuous normal,
discrete, or mixed. Next, the analyst imputes values to replace missing ones
using the selected method. In contrast to the single imputation methods, the
imputation process runs several times (usually three to five), and each run
results in a complete data set. Next, the analyst conducts the desired statistical
procedure on each of the complete data sets. Thus, if five fully observed data
sets are generated, the statistical model (e.g., multiple regression) is carried
out five times, once for each data set, resulting in five sets of statistical results.
These results then serve as data to be aggregated and analyzed, for instance,
to assess measures of central tendency and variance. The results that serve as
data for analysis may be parameter estimates (e.g., regression coefficients),
p values, statistical indices (e.g., t test values), or any other outcome the
analyst deems appropriate. Results of these analyses provide evidence regarding
the extent to which missing data influenced the results from the multiple runs
of the desired statistical procedure. If parameter estimates varied widely from
each of five multiple regressions carried out on the five imputed data sets, then
the missing data had a considerable impact on the statistical results and may
call into question their accuracy.
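A compact sketch of these steps—impute several times, analyze each completed data set, and examine the spread of estimates—using scikit-learn's iterative imputer as one possible imputation engine (an illustration under stated assumptions, not the specific software the chapter describes):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

def mi_regression(X, n_imputations=5):
    """X: numeric array with np.nan for missing cells; last column is the DV."""
    coefs = []
    for seed in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        completed = imputer.fit_transform(X)      # one fully observed data set
        fit = LinearRegression().fit(completed[:, :-1], completed[:, -1])
        coefs.append(fit.coef_)                   # results from this one run
    coefs = np.array(coefs)
    # Wide between-run variation signals that missing data influenced results.
    return coefs.mean(axis=0), coefs.std(axis=0)
```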
The above description of MI is brief because of space constraints and cannot provide sufficient detail for carrying out MI. We describe these procedures for applied researchers in greater detail elsewhere (McKnight
et al., 2007), and a large statistical literature describes the use and advantages
of MI for handling missing data. Rubin, Schafer, and colleagues (e.g., Rubin,
1987; Schafer, 1997; Schafer & Graham, 2002) have spent considerable effort
in describing MI and making the general procedure more accessible to most
data analysts. The implementation of MI requires some skill, and it is not yet
the standard procedure in even the most popular statistical software. Currently,
many data analysts conduct MI procedures and generate multiple, fully observed data sets using software separate from the software they use to fit the desired statistical models. However, because of rapid technological
improvements, this situation is likely to change, and MI soon will become a
standard procedure available in most statistical packages.
Currently, experts regard MI as the premier method for handling
missing data in the more difficult situations in which the simplest methods
(e.g., deletion, single imputation) are not appropriate (Schafer & Graham,
2002). MI is a preferred method because it borrows the advantages of the
random single imputation procedures but supplements those procedures with
an iterative process. The combination results in a method that allows the
researcher to understand better the nature and influence of missing data on
results. Difficulties lie in implementing MI because of the technical exper-
tise required to troubleshoot when problems arise, as they often do when
using these methods.
Reporting is the final step in the treatment and handling of missing data.
Science relies on the ability to communicate results effectively so that other
researchers may review, evaluate, replicate, and extend the findings. Missing
data present special circumstances that require additional reporting. Many
times, resistance to the problem of missing data occurs at the level of reporting.
Reviewers, editors, and readers often do not understand or appreciate the
relevance or the importance of reporting missing data; in many cases, neither
do the investigators themselves. A researcher’s job is to document all com-
ponents of a study that might have an impact on the results and conclusions.
We provide some guidelines about reporting missing data to assist researchers
in determining the most relevant details. As a brief checklist, we suggest
researchers gauge the extent to which another researcher could replicate each
of the points listed in Exhibit 5.1 using the existing secondary data. Researchers
who can check off each of these points can be assured that the methods used to prevent, diagnose, and treat missing data were reported in sufficient detail. The majority of these points ought to be addressed in the Method and Results sections of a research article; however, they are relevant when
discussing the limitations of the study as well. It is important to bear in mind
EXHIBIT 5.1
A Brief Checklist for Reporting Results With Missing Data
Description
• Clearly specify the source of your data, including Internet location and manuscript
references if applicable.
• Identify variables selected for your analyses and the reasons they were selected.
• Describe variable transformation and computation steps.
• Provide details about how you prevented missing values in the transformation
and computation of new variables.
• Briefly describe the nature and extent of missing data; that is, missing data
diagnostics.
• Provide the necessary detail for handling the missing data; that is, how diagnostics
informed choice of treatment.
• Describe any results of the treatment methods.
• Explain the impact of missing data on the results.
• Describe the limitations of the results based upon the missing data.
that the list we provide is not exhaustive. A researcher may deem other points
to be relevant to include, given the nature and the purpose of the study. The
best reporting covers all relevant details as succinctly as possible.
6
INNOVATIVE METHODS WITHIN
THE CONTEXT OF SECONDARY DATA:
EXAMPLES FROM HOUSEHOLD
PANEL SURVEYS
THOMAS SIEDLER, JÜRGEN SCHUPP, AND GERT G. WAGNER
The PSID is the oldest household panel study in the world. The first
wave started in 1968 as a representative sample of private households in
the United States, with about 5,000 households. Since then, the PSID has
attempted to follow all individuals from these original households regardless
of whether they have continued residing in the same household or with
the same persons. Similarly, children born to original household members
become part of the survey and are interviewed on their own when they have
reached adulthood. The PSID collects data on economic and social behavior;
demography; health; and the neighborhoods, housing, and labor markets in
which respondents live. Most of the data might not be of interest to psy-
chologists. However, the PSID also contains a child development supplement
with a focus on children and caregivers that might be of special interest to
psychologists collecting information on health, education, cognitive and
behavioral development, and time use. (For further information on the PSID,
see http://psidonline.isr.umich.edu/.)
The German Socio-Economic Panel (GSOEP) started in 1984 with two subsamples: Sample A, the main sample, covers the population
of private households in the former West Germany; Sample B, the guest
worker sample, is an oversample of immigrants from South European countries
(foreign-born residents and their children, mainly recruited abroad during the
economic booms of the 1960s and 1970s, with a Turkish, Spanish, Italian,
Greek, or [ex-]Yugoslavian head of household). The initial sample size in 1984
comprised around 6,000 households. Since 1990, the GSOEP has included a
sample of households in the former East Germany, and in the following years
more samples were included into the survey (e.g., in 1998, a refreshment sam-
ple; in 2000, an innovation sample; and in 2002, a high-income sample). As a
result, the sample size has increased considerably, and in 2008—with the study’s
25th wave—the GSOEP consists of about 11,000 households with more than
20,000 adult respondents. In each of these samples, original sample respondents
are followed, and they (and coresident adults) were interviewed at approximately
1-year intervals. Children in original sample households are also interviewed in
their own right when they reach adulthood (the year they turn 17), and former
partners of original sample members are also followed over time.
Whereas the PSID is very much centered on income dynamics, the
GSOEP supports research not only in economics but also in sociology and
psychology. For psychologists, one of the most important variables in the
GSOEP is the one on life satisfaction, which has been part of the GSOEP
since the very beginning (for a recent application, see Easterlin, 2008). All
the other panels discussed later in this chapter contain similar subjective data
that are of interest to psychological research. (For further information on the
GSOEP, see http://www.diw.de/english/sop/index.html.)
needs of social scientists and biomedical science, (d) linkage with administrative
records and geocodes, (e) a collection of biomarkers, and (f) an innovation
panel for methodological research. (For further information about the UKHLS
and the BHPS, see http://www.iser.essex.ac.uk.)
1Note that Internet access was not a prerequisite for participation in the study; households without Internet access were provided with the equipment needed to participate.
The panel itself can be used to provide data on later-life outcomes. Children and
even grandchildren of original household members become panel respondents
in their own right at around the age of 16 or 17 and are then followed over
time. By design, these are children and grandchildren who have lived with
a parent who was (and may well remain) a panel member too. Thus, with
household panel surveys, researchers can not only match partners to each other
(Ermisch et al., 2006) but also match parents to children, as well as grandparents
to grandchildren, and analyze various aspects of intergenerational mobility
(Solon, 1992).
Tracking Rules
Household panel studies share certain rules for tracking respondents
over time. First, to maintain the ongoing cross-sectional representativeness
of the (nonimmigrant) population, household surveys define the adults and
children in the households of the first-wave representative sample as the
original sample members (OSMs). Note that the children living at home in
wave one are not necessarily all the children the parents have, because some
may have left home previously or died. Similarly, the adults present in wave
one may not include both birth parents of a given child because of, for
example, divorce or the death of a parent before the survey. In subsequent
waves, interviews are attempted with all adult members of all households
containing either an OSM or an individual born to an OSM, regardless of
whether that individual was a member of the original sample or whether the
individual lives in the same household or residence as at the previous interview.
This rule allows researchers to analyze the influence of shared environment
over a certain period (e.g., during cohabitation, marriage, or both) and later-life
outcomes of individuals who separated or divorced (Schimmack & Lucas, 2007),
and underlies the design of virtually all household panels. However, differences
exist with respect to treatment of new panel members who later move out of
the OSM household. In most household panel surveys, including the BHPS
and the PSID, these people are not interviewed again (unless they retain an
important relationship with a sample member, such as parent). By contrast,
the GSOEP has, since Wave 7, followed and interviewed all panel members,
regardless of their relationship to the OSM (Spiess et al., 2008).
Attrition
The other dimension of nonresponse that is of particular importance for
household panel surveys is selective sample dropout (i.e., attrition). Attrition
is a problem that potentially increases in severity the longer the panel lasts
and, hence, is a feature that conflicts with the distinct advantages of longer
panels. Attrition reduces sample size and also introduces potential non-
representativeness if respondents drop out in a nonrandom way. The latter
case occurs when individuals with particular characteristics are systematically
more likely to drop out of the panel than others. For example, if respondents
with lower levels of education are more likely to drop out of the panel, estimates
of the returns to education may be biased. Nonrandom panel attrition
and nonrepresentativeness are issues that are often discussed but not always
addressed in empirical research.
INNOVATIVE TOPICS
Trust and trustworthiness are key components of social capital, and there
is a growing literature on how best to measure trust (Ermisch et al., 2009;
Glaeser et al., 2000; Sapienza et al., 2007). Both the BHPS and the GSOEP
collect information about respondents’ levels of trust and fairness. In 2003,
several trust and fairness questions were incorporated into the GSOEP
questionnaire. Overall, six different questions on trust, generosity, and fairness
were included. The BHPS also repeatedly collected attitudinal trust measures
in the years 1998, 2000, and 2003, asking respondents the general trust question
“In general, would you say that most people can be trusted, or that you can’t
be too careful these days?”
One criticism of attitudinal survey questions like the ones above concerns
the lack of behavioral underpinnings and the absence of meaningful survey
questions that get at respondents’ trustworthiness (Ermisch et al., 2009;
Glaeser et al., 2000). Combining attitudinal survey questions that inquire into respondents’ trust with behavioral experiments that involve monetary stakes is one way to address this criticism.
In 2005, both the BHPS and the GSOEP incorporated new questions
to elicit respondents’ personality traits through the Big Five Inventory (BFI).
The BFI is a psychological inventory used to measure personality on the basis
of the assumption that differences in personality can be summarized through
five personality traits: Neuroticism, Extraversion, Openness to Experience,
Agreeableness, and Conscientiousness (John & Srivastava, 1999). The study
by Gosling et al. (2003) indicated that the Big Five personality traits can be
reliably measured with a small number of items. Both the BHPS and GSOEP
use a 15-item version, with three items per personality trait. For further infor-
mation about data collection and internal validity of the short BFI version
in GSOEP (BFI-S), see Gerlitz and Schupp (2005) and Dehne and Schupp
(2007). The incorporation of the BFI into the GSOEP and the BHPS will
likely be of great value to psychologists and will allow researchers to study
relationships between personality traits and various behavioral outcomes
(for a first application, see Rammstedt, 2007). Using the Big Five personality
factors from the GSOEP, Winkelmann and Winkelmann (2008) reported
that certain personality clusters are more dominant in some occupations than
others and that a positive relationship exists between personal and occupa-
tional profiles and life satisfaction. The study by Rammstedt and Schupp (2008)
aimed to investigate personality congruence between spouses and to examine
which dimensions show a high degree of congruence. It also investigated
the extent to which the congruence between spouses is moderated by the
marriage duration. Results reveal strong differences across the Big Five dimensions in spouses’ congruence: Whereas congruence for Extraversion is close to zero, correlations averaging .30 are found for Agreeableness, Conscientiousness, and Openness.
One approach is to analyze survey answers—for example, from the World Values Survey or the General Social Survey (GSS)—to the standard trust question “Generally speaking, would you say that most people can be trusted, or that you can’t be too careful in dealing with people?”
Versions of this attitudinal trust question are widely used in the literature to
measure social capital or generalized trust, that is, respondents’ expectation
about the trustworthiness of other people in the population (Alesina &
La Ferrara, 2002). A second approach is to measure trust and trustworthiness
through behavioral experiments with monetary rewards (for a review of this
literature, see Camerer, 2003).
Both approaches have their advantages and disadvantages. Glaeser
et al. (2000) questioned the validity of the GSS trust measure to capture social
capital and argued that attitudinal survey questions are “vague, abstract, and
hard to interpret” (p. 812). The authors found no empirical evidence that the GSS attitudinal trust question predicts trusting behavior in their experiment, which used the standard trust game first introduced by Berg et al. (1995) with a sample of Harvard undergraduate economics students. However, Glaeser et al. found a significant positive correlation between survey measures of trust and revealed trustworthiness (i.e., the second mover’s behavior in the experiment). In
the original two-person, one-shot trust game, Player 1 (the truster) is allocated $10, which she or he can keep or invest. If the truster decides to invest (i.e., transfers a positive amount to Player 2), the amount invested is tripled by the experimenter and transferred to Player 2 (the trustee). The trustee can then decide how much of the amount received to return to the truster and how much to keep. The amount transferred by the first player measures trust, and the amount transferred back measures trustworthiness. For example, a truster who invests $4 keeps $6, and the trustee receives $12; whatever portion of that $12 the trustee sends back indexes his or her trustworthiness. Ermisch et al. (2009)
pointed out that answers to attitudinal questions might be too generic and
relatively uninformative about the reference group or the stakes respondents
have in mind. Another potential limitation when using survey questions is
measurement error and the issue of whether respondents’ answers to survey
questions are behaviorally relevant (Fehr et al., 2002).
Laboratory experiments have the advantage that researchers can control the environment in which individuals make their financial decisions, and they allow causal inferences by exogenously varying one parameter while keeping all others unchanged. However, a major limitation of most experiments is that they are administered to students, who usually self-select into the study and are therefore not representative of the entire adult population. In fact, because of self-selection, experimental studies with student subjects might not even be representative of the entire student population. In addition, most laboratory experiments are conducted on very homogeneous samples
(typically students studying the same subject at the same university), and
often information on potentially important socioeconomic background
characteristics is missing or lacks sufficient variation. Another shortcoming
risk part of it,’ ‘I like to gamble’” (p. 753). They found that revealed trust in
the experiment is more likely if people are older, if they are homeowners,
if their financial situation is “comfortable,” or if they are divorced or separated.
Trustworthiness is lower if a person perceives his or her financial situation as
difficult or as “just getting by” compared with those who perceive their own
financial situation as “comfortable.”
Taken together, these studies demonstrate that combining experimental studies with representative surveys can yield enormous academic benefits. First, experiments based on representative samples help to assess potential biases in studies based on student subjects who self-select into the sample. This advances our knowledge of whether and to what extent experimental findings from student samples can be generalized. Second, research measuring both revealed preferences and stated preferences allows researchers
to validate their measures. For example, Fehr et al. (2002), Ermisch et al. (2009),
and Naef and Schupp (2009) reported that answers to attitudinal questions
on trust toward strangers do predict real trusting behavior in the experiment.
Cognitive Tests
In 2006, the GSOEP included cognitive tests in the survey for the first time.
We briefly describe three tests here. The aim of the first test (word fluency test)
is to measure fluency and declarative knowledge, whereas the second (symbol
correspondence test) is aimed at measuring individuals’ speed of perception.
Both tests last 90 s and are conducted using computer-assisted personal interviewing (CAPI) techniques. The rationale for including these two tests in the
GSOEP is the perception among psychologists that intellectual skills can
be described by two main components. The first component constitutes the
cognitive functioning of the brain, and the second describes the pragmatic
part of the intellect (Lindenberger, 2002).
In the first test, participants had to name as many animals as possible.
Time was measured automatically by the computer, and interviewers entered
the number of animals named by respondents into the laptop. The task of the
interviewer was to exclude animals that were named more than once and any
words that were not clearly identifiable as animal names. If respondents could
not name an animal after a considerable time, interviewers were allowed to
terminate the test.
The task of the second test—the “symbol correspondence test”—is to
assign as many symbols to digits as possible. This test is a revised version of
the Symbol Digit Modalities Test developed by Smith (1973). A number of
modifications were introduced to ensure that the test could be successfully
conducted without requiring special training for interviewers and to minimize
sources of error when using the CAPI method. In contrast to the “animal
naming” test, participants take this test alone on a laptop. They are informed that the aim of the test is to assign as many symbols to digits as possible. During the test, a correspondence table showing the mapping between symbols and digits is permanently visible: The first row displays nine different symbols (e.g., +), and the second row displays the digits 1 to 9, so that each symbol maps to one particular number. Prior to the test, participants are shown an example on the screen with the correct answers.
Because this test involves only numbers and geometric figures, it is rela-
tively culture free and results should be independent of language skills. For
further information about reliability and validity of these cognitive tests,
see Lang et al. (2007).
The first studies that investigated relationships between these cognitive
ability measures and labor market outcomes were Anger and Heineck (in press)
and Heineck and Anger (2008). Anger and Heineck used both the word
fluency test and the symbol correspondence test to study the relationship
between cognitive abilities and labor earnings in Germany. They used data
from the 2005 pretest of the GSOEP and found a positive association between
symbol correspondence ability scores and earnings but no statistically significant
relationship between word fluency test scores and earnings. In a follow-up
study, Heineck and Anger focused on the links among personality traits,
cognitive ability, and labor earnings. Using data from the main GSOEP survey,
they reported a positive significant relationship between cognitive ability and
labor earnings for men, but not for women. Their measure of cognitive ability
is derived from the symbol correspondence test.
A third measure of cognitive abilities was introduced in 2006 in a
questionnaire distributed to all 17-year-old GSOEP participants. Because
fluid intelligence is known to be stable from the beginning of adulthood on,
cognitive abilities measured at this time point can be used as predictors of
later developments in a person’s life. Although there are already a large number
of established and carefully validated psychological tests of fluid intelligence
in adults, none of these is adequate for the survey interviews used for data
collection in the GSOEP. Thus, one of the existing tests, the intelligence
structure test (Intelligenz-Struktur Test [I-S-T] 2000; Amthauer et al., 2001),
was modified to be used in the context of individual panel survey interviews.
The test is widely used in Germany and was carefully validated by its authors.
The modifications of the I-S-T 2000 in the GSOEP are described in detail
by Solga et al. (2005). The most important modifications were as follows: The test is not described as an intelligence test but instead bears the title “Wanna DJ?”, where DJ stands for “Denksport und Jugend [Brainteasers and Youth].” The questionnaires
were given a more colorful design. The titles of task groups that sounded quite
technical in the original test were replaced by more casual, attention-getting
titles, like “Just the right word . . . ” or “A good sign . . . ” Because of time
restrictions, only three subscales of the I-S-T 2000 R were used: (a) Analogies,
as a measure of verbal intelligence; (b) Arithmetic Operations, as a measure of
numerical intelligence; and (c) Matrices, as a measure of figural intelligence.
The total score of all three subscales (IST total) reflects general reasoning
abilities as opposed to domain-specific knowledge (Amthauer et al., 2001).
Biomarkers
In household panel studies and most other types of secondary data, health
data have usually been collected through self-reported health variables. The
three household panels GSOEP, UKHLS, and MESS are currently planning an innovation in this area: the collection of various physical health measures known as biomarkers. Biomarkers include measured height, weight, waist circumference, blood pressure, saliva samples, heart rate variability, peak-flow tests, grip strength, timed walks, balance tests, and puff tests; they are considered to provide objective and reliable information about people’s physical condition.
In 2006, after a successful pretest in 2005, the GSOEP started collecting a noninvasive health measure: hand grip strength (i.e., the maximum force a person can exert when squeezing with the hand). In 2008, a second wave of
grip strength was collected. The results of several studies suggest that it is
feasible to measure grip strength among survey respondents (Giampaoli et al.,
1999), that the participation rate is very high (Hank et al., 2009), and that it
is reliable even among physically weak participants.
Grip strength, measured with a mechanical dynamometer, is a noninvasive indicator of upper-body muscular strength and physical functioning. It is considered an objec-
tive measure because it is less susceptible to response bias than self-reported
health variables. In addition, self-reported health measures do not allow
researchers to identify health differences among respondents who report no
health problems. Several studies have found that grip strength is a significant
predictor of future physical disability, morbidity, and mortality among older
people (Giampaoli et al., 1999; Metter et al., 2002). If respondents have no
limiting health conditions, the grip strength test is performed twice on each
hand. Prior to the test, interviewers inform respondents that the grip strength
test is not dangerous or harmful and can be conducted at any age, except if
respondents have certain clinical conditions such as swelling, inflammation,
pain, or if they have had an operation or injury in the last 6 months. If one
arm is affected by one of these conditions, grip strength is measured on the
healthy arm only. Interviewers are provided with a very detailed description
of the test procedure, including several photos showing the correct arm and
body positioning when conducting the test. Moreover, interviewers are asked
to demonstrate the grip strength test and explain it in detail before respondents
are asked to participate in the test themselves. It is crucial for the study that
interviewers are well trained in conducting the test accurately and in persuading
respondents to participate.
Finally, we point out that household panels can serve as useful reference
points for researchers who collect their own data. A recent example is the
study by Geyer et al. (2008), who examined whether individuals ages 17 to
45 with operated congenital heart disease have adverse employment chances
compared with people without heart problems. Geyer et al. compared their
sample of patients (N = 314) with a sample drawn from the GSOEP, which
served as a comparison group.
The study by Ermisch et al. (2009) also exemplifies how a panel survey can help in assessing the extent to which a particular sample is representative of the general population. Ermisch et al. integrated a new experimental trust
design into a former sample of the British population and compared their trust
sample with a sample from the BHPS. By using a questionnaire similar to the
BHPS, they were able to determine that their trust sample overrepresents
women and people who are retired, older, divorced, or separated. A recent
article by Siedler et al. (2009) discusses how household panels can serve as
reference data for researchers collecting data sets that do not represent the
full universe of the population of interest.
The existing data can be used almost free of charge by independent and
reputable researchers worldwide. Further information about the various house-
hold panel studies can be obtained at the following websites.
• Panel Study of Income Dynamics. PSID data can be downloaded at http://psidonline.isr.umich.edu/data/.
• German Socioeconomic Panel. GSOEP data cannot be downloaded from the Internet because of German data protection regulations. For further information on data distribution, see http://www.diw.de/english/faq/.
• British Household Panel Survey and United Kingdom Household Longitudinal Study. Data from the BHPS have been deposited in the UK Data Archive.
REFERENCES
Alesina, A., & La Ferrara, E. (2002). Who trusts others? Journal of Public Economics,
85, 207–234. doi:10.1016/S0047-2727(01)00084-6
Amthauer, R., Brocke, B., Liepmann, D., & Beauducel, A. (2001). Intelligenz-Struktur-
Test 2000 R (I-S-T 2000 R). Göttingen, Germany: Hogrefe.
Anger, S., & Heineck, G. (in press). Cognitive abilities and earnings—First evidence
for Germany. Applied Economics Letters.
Bellemare, C., & Kröger, S. (2007). On representative social capital. European Economic
Review, 51, 183–202. doi:10.1016/j.euroecorev.2006.03.006
Berg, J., Dickhaut, J., & McCabe, K. (1995). Trust, reciprocity, and social history. Games
and Economic Behavior, 10, 122–142. doi:10.1006/game.1995.1027
Camerer, C. (2003). Behavioral game theory: Experiments in strategic interaction. Princeton,
NJ: Princeton University Press.
Dehne, M., & Schupp, J. (2007). Persönlichkeitsmerkmale im Sozio-oekonomischen
Panel (SOEP): Konzept, Umsetzung und empirische Eigenschaften [Personality
characteristics in the Socio-Economic Panel (SOEP): Concept, implementation
and empirical properties]. DIW Research Notes: Vol. 26. Berlin, Germany: DIW Berlin.
Easterlin, R. A. (2008). Lost in transition: Life satisfaction on the road to capitalism.
IZA Discussion Paper No. 3409. Bonn, Germany: IZA.
Ermisch, J., Francesconi, M., & Siedler, T. (2006). Intergenerational mobility and
marital sorting. The Economic Journal, 116, 659–679. doi:10.1111/j.1468-0297.
2006.01105.x
Ermisch, J., Gambetta, D., Laurie, H., Siedler, T., & Uhrig, S. C. N. (2009). Measuring
people’s trust. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172, 749–769.
Fehr, E., Fischbacher, U., von Rosenbladt, B., Schupp, J., & Wagner, G. G. (2002).
A nation-wide laboratory: Examining trust and trustworthiness by integrating
behavioral experiments into representative surveys. Schmollers Jahrbuch, 122,
1–24.
Francesconi, M., Jenkins, S. P., & Siedler, T. (in press). Childhood family structure
and schooling outcomes: Evidence for Germany. Journal of Population Economics.
Gerlitz, J.-Y., & Schupp, J. (2005). Zur Erhebung der Big-Five-basierten Persönlichkeits-
merkmale im SOEP [The survey of the Big Five personality traits based on SOEP].
DIW Research Notes 2005: Vol. 4. Berlin, Germany: DIW Berlin.
Geyer, S., Norozi, K., Buchhorn, R., & Wessel, A. (2008). Chances of employment
in a population of women and men after surgery of congenital heart disease:
Gender-specific comparisons between patients and the general population.
SOEP Papers on Multidisciplinary Panel Data Research: Vol. 91. Berlin, Germany:
DIW Berlin.
Giampaoli, S., Ferrucci, L., Cecchi, F., Noce, C. L., Poce, A., Dima, F., et al. (1999).
Hand-grip strength predicts incident disability in non-disabled older men. Age
and Ageing, 28, 283–288. doi:10.1093/ageing/28.3.283
Glaeser, E. L., Laibson, D. I., Scheinkman, J. A., & Soutter, C. L. (2000). Measuring trust.
The Quarterly Journal of Economics, 115, 811–846. doi:10.1162/003355300554926
Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A very brief measure of
the Big Five personality domains. Journal of Research in Personality, 37, 504–528.
doi:10.1016/S0092-6566(03)00046-1
Hank, K., Jürges, H., Schupp, J., & Wagner, G. G. (2009). Isometrische Greifkraft
und sozialgerontologische Forschung—Ergebnisse und Analysepotentiale des
SHARE und SOEP [Isometric grip strength and social gerontology—Research
results and analysis potentials of SHARE and SOEP]. Zeitschrift für Gerontologie
und Geriatrie, 42, 117–126.
Heineck, G., & Anger, S. (2008). The returns to cognitive abilities and personality
traits in Germany. SOEP Papers on Multidisciplinary Panel Data Research: Vol. 124.
Berlin, Germany: DIW Berlin.
John, O. P., & Srivastava, S. (1999). The Big Five trait taxonomy: History, measurement,
and theoretical perspectives. In O. P. John & L. A. Pervin (Eds.), Handbook of
personality: Theory and research (pp. 102–138). New York, NY: Guilford Press.
Lang, F. R., Weiss, D., Stocker, A., & von Rosenbladt, B. (2007). Assessing cognitive
capacities in computer-assisted survey research: Two ultra-short tests of intellec-
tual ability in the German Socio-Economic Panel (SOEP). Schmollers Jahrbuch,
127, 183–192.
Levitt, S. D., & List, J. A. (2008, January 15). Homo economicus evolves. Science, 319,
909–910. doi:10.1126/science.1153640
Lindenberger, U. (2002). Erwachsenenalter und Alter [Adulthood and age]. In
R. Oerter & L. Montada (Eds.), Entwicklungspsychologie [Developmental
psychology] (5th ed., pp. 350–391). Weinheim, Germany: Beltz PVU.
Metter, E. J., Talbot, L. A., Schrager, M., & Conwit, R. (2002). Skeletal muscle strength
as a predictor of all-cause mortality in healthy men. Journals of Gerontology:
Biological Sciences, 57A, 359–365.
Naef, M., & Schupp, J. (2009). Measuring trust: Experiments and surveys in contrast
and combination. SOEP Papers on Multidisciplinary Panel Data Research, 167.
Berlin, Germany: DIW Berlin.
Rammstedt, B. (2007). Who worries and who is happy? Explaining individual differ-
ences in worries and satisfaction by personality. Personality and Individual Differences,
43, 1626–1634. doi:10.1016/j.paid.2007.04.031
Rammstedt, B., & Schupp, J. (2008). Personality similarities in couples—Only the
congruent survive. Personality and Individual Differences, 45, 533–535. doi:10.1016/
j.paid.2008.06.007
Rantanen, T., Guralnik, J. M., Foley, D., Masaki, K., Leveille, S., Curb, J. D., &
White, L. (1999). Midlife hand grip strength as a predictor of old age disability.
JAMA, 281, 558–560. doi:10.1001/jama.281.6.558
Sapienza, P., Toldra, A., & Zingales, L. (2007). Understanding trust (NBER Working Paper No. 13387). Cambridge, MA: National Bureau of Economic Research.
Schimmack, U., & Lucas, R. (2007). Marriage matters: Spousal similarity in life
satisfaction. Schmollers Jahrbuch, 127, 105–111.
Scott, J., & Alwin, D. (1998). Retrospective versus prospective measurement of life
histories in longitudinal research. In J. Z. Giele & G. H. Elder, Jr. (Eds.), Methods
of life course research (pp. 98–127). Thousand Oaks, CA: Sage.
Siedler, T., Schupp, J., Spiess, C. K., & Wagner, G. G. (2009). The German
Socio-Economic Panel as reference data set. Schmollers Jahrbuch, 129, 367–374.
doi:10.3790/schm.129.2.367
Smith, A. (1973). Symbol Digit Modalities Test. Los Angeles, CA: Western Psychological
Services.
Solga, H., Stern, E., von Rosenbladt, B., Schupp, J., & Wagner, G. G. (2005). The
measurement and importance of general reasoning potentials in schools and labor
markets: Pre-test report (Research note 10). Berlin, Germany: DIW Berlin.
Solon, G. R. (1992). Intergenerational income mobility in the United States. The
American Economic Review, 82, 393–408.
Spiess, M., Kroh, M., Pischner, R., & Wagner, G. G. (2008). On the treatment of
non-original sample members in the German Household Panel Study (SOEP)—
Tracing, weighting, and frequencies. SOEP Papers on Multidisciplinary Panel Data
Research: Vol. 98. Berlin, Germany: DIW Berlin.
Winkelmann, L., & Winkelmann, R. (2008). Personality, work, and satisfaction:
Evidence from the German Socio-Economic Panel. The Journal of Positive
Psychology, 3, 266–275. doi:10.1080/17439760802399232
II
USING SECONDARY DATA
IN PSYCHOLOGICAL
RESEARCH
7
THE USE OF SECONDARY DATA
IN ADULT DEVELOPMENT
AND AGING RESEARCH
DANIEL K. MROCZEK, LINDSAY PITZER, LAURA MILLER,
NICK TURIANO, AND KAREN FINGERMAN
These studies addressed two broad types of questions: (a) questions regarding stability and change and (b) questions regarding the prediction of physical health outcomes and mortality.
The data in each of the studies cited here contained multiple measure-
ment occasions per person. More important, none of the lead authors on any of
the above articles had a key role in collecting any of their early data. In some
cases, the initial measurement occasions (or even all the occasions) were col-
lected so long ago that those original researchers were no longer involved in the
study. For example, in the Veterans Affairs (VA) Normative Aging Study (e.g.,
Mroczek & Spiro, 2003), the original scientists who recruited the panel in the
late 1960s were either retired or deceased by the time the more recent analyses
were conducted and published. In the next section, we discuss the Terman sam-
ple, which has yielded many interesting and important findings in the areas of
both child and adult development. Yet, the study founder, Lewis Terman (born
in 1877), did not live to read many of the well-known articles that were even-
tually published on his data. He died in 1956, and some of the most-cited arti-
cles using his sample were not published until the 1990s. The maintenance of
long-term longitudinal studies is an enterprise that usually requires multiple
generations of researchers. The most senior researchers often do not see the ulti-
mate fruits of their labors.
It is not possible to conduct a study of mortality using a university’s human subject pool or any other
kind of short-term convenience sample. Yet, the scientific yield from mortal-
ity studies is of high value. The recent burgeoning literature on behavioral and
social predictors of mortality has made many researchers think more deeply
about the connections between psychosocial and biomedical variables.
Behavioral and social variables, such as positive outlook, personality traits,
social support, and cognitive functioning, have emerged as key predictors of
mortality, with effect sizes that are comparable to those of many biomedical
variables such as total blood cholesterol (Deary & Der, 2005; Ferraro & Kelley-
Moore, 2001; Friedman et al., 1993; Levy, Slade, Kunkel, & Kasl, 2002; Maier
& Smith, 1999; Mroczek & Spiro, 2007; Small & Backman, 1997; Small, Fratiglioni, von Strauss, & Backman, 2003; Sugisawa, Liang, & Liu, 1994).
Perhaps the best example of this is Friedman et al.’s (1993) analyses of the
Terman data. In 1921, Terman, at Stanford, recruited a sample of gifted chil-
dren and teenagers from grammar schools and high schools across the state of
California. These children were assessed on a number of variables, including
intelligence and (teacher-rated) personality. They were followed into their
adult years, and eventually, as they began to die, cause of death and age at death
were obtained from death certificates. Many people from the 1920s to the 1990s
had a role in maintaining the Terman data set and archiving previously col-
lected data. By the early 1990s, the vast majority of the panel had died. It was
at this time that Friedman et al. used the teacher-rated childhood personality
traits obtained in 1921 to predict survival using proportional hazards (Cox,
1972) models. Despite the 7-decade lag, the childhood personality rating of
conscientiousness predicted survival into middle age and older adulthood. As
noted above, Terman died in 1956 and missed Friedman’s 1993 publication by
nearly 40 years, reinforcing our point that the ultimate fruits of long-term lon-
gitudinal research are often never seen by the scientists who begin the studies.
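For readers who want to see what such an analysis looks like in code, a minimal sketch of a proportional hazards model using the lifelines library (the data and column names are hypothetical, not the Terman archive's):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical survival records: follow-up time in years, an event indicator
# (1 = died, 0 = censored), and a standardized childhood conscientiousness
# rating.
df = pd.DataFrame({
    "years_followed":    [52, 61, 70, 49, 66, 74, 58, 80],
    "died":              [1, 1, 0, 1, 1, 0, 1, 0],
    "conscientiousness": [-0.8, 0.3, 1.2, -1.1, 0.5, 1.6, -0.2, 0.9],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_followed", event_col="died")
cph.print_summary()  # a hazard ratio below 1 indicates a protective effect
```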
CONCLUSION
Archiving and sharing data can enhance the status of particular data sets. The researcher who shares data
often enjoys a gain in much the same way a small privately held company
can increase its value greatly by letting in outside investors. The founder of
such a company has to give up sole ownership, turning over a considerable percentage to outsiders. Yet, the new infusions of money and resources
provided by those outsiders can enlarge the company far beyond what the
founder could have done on his or her own. It becomes a win–win situation
in which both the original owner and outside investors reap the benefits of
an enlarged pie. Similarly, researchers who archive and share data with “out-
sider” secondary data analysts can themselves gain additional accolades and
citations that would not have accrued if the data set had been kept in pri-
vate hands.
There is one other argument for placing data in the public domain. Most
high-quality data sets are collected with at least some form of taxpayer funding.
We as researchers often benefit from such taxpayer beneficence (our employers
usually reward us with raises, promotions, and resources for bringing in such
funds), and it is fitting that we give back by making data public for others to use.
We should share data because it is generally the right thing to do if we have
received significant funding from our fellow citizens.
We conclude with three other points. First, as research questions become
clarified over time, statistical techniques are developed to better answer these
key questions. The past 4 decades have seen the development of structural
equation models, proportional hazards models, item response theory, and multi-
level (mixed) models. Each has been important in social science research. Each
also was, in part, a response to problems that had arisen in the course of con-
ducting research. Once these statistical models were fully developed, in many
instances they were applied to existing data. This led to insights that had not
been possible at the time of original data collection. A good example is Lucas’s
work using the German Socioeconomic Panel Survey (GSOEP), which con-
tains multiple waves of data spanning nearly 2 decades (Lucas, Clark, Georgellis,
& Diener, 2003, 2004). Lucas et al. (2003, 2004) used multilevel modeling
to examine how changes in marital and employment status influence well-
being. When the GSOEP was founded in the early 1980s, the development of
multilevel models had only recently begun. By the time the GSOEP had
over a dozen waves, multilevel models were well developed and widely avail-
able in software packages, permitting Lucas et al. (2003, 2004) to capitalize on
this large existing data set. In the early 1980s, the founders of the GSOEP could
not have foreseen what new statistical techniques would yield important and
interesting results from their data 2 decades later, yet by archiving and making
their data publicly available, they set the stage for future studies. Often, sta-
tistical techniques need time to catch up to available data. This is a persuasive
argument for archiving and for using archives.
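As an illustration of the kind of multilevel model described above, a minimal sketch using statsmodels (the person-period data and variable names are hypothetical, not actual GSOEP variables):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical person-period records: repeated life-satisfaction reports
# nested within persons.
long_df = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "wave":      [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "married":   [0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    "life_sat":  [6, 8, 7, 5, 5, 7, 8, 8, 7, 6, 5, 6],
})

# Random-intercept multilevel model: life satisfaction as a function of
# marital status and wave, with persons as the grouping factor.
model = smf.mixedlm("life_sat ~ married + wave", data=long_df,
                    groups=long_df["person_id"])
print(model.fit().summary())
```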
Second, the use of archived data (in aging or other areas) represents
value-added on grant dollars already spent. In this time of tight federal research
money, grantors are often favorably disposed toward funding secondary data
analysis projects. Such projects are typically much less expensive than those
that propose new, and often pricey, data collection. For a small fraction of the
cost, a grantor can often expect outstanding publications. Sometimes, these
secondary analysis articles are more influential than work from the original
rounds of data collection. There is simply more “bang for the buck” in second-
ary analysis.
However, we strongly recommend that researchers applying for funds to
do secondary analysis should overtly state the concept of value-added in the
grant application itself. The reason is that some reviewers are not favorably
disposed to secondary analysis. Although funding agencies might recognize
the value of such work, many reviewers may not appreciate it unless the point
about value-added is made explicitly.
Third, we also must emphasize that although we are strong advocates of
the use of secondary data, doing so presents challenges that are not encoun-
tered when collecting one’s own data. Often, the exact variables you want
may not be available. If you are interested in psychological well-being, you
may need to settle for a single item that asks “how happy you are” in general.
Indeed, many constructs may be assessed with single items, as opposed to the
multi-item, high–Cronbach’s alpha scales that are preferred in the psycholog-
ical literature. Additionally, some existing data sets may suffer from a lack of
good maintenance through the years, or worse, from neglect. Quite a bit of
data cleaning and sprucing up may be necessary before any kind of analysis is
possible. Some data are in such dire shape that private companies and non-
profits have been created that do nothing other than put such data into usable
form. Medicare and Social Security data are notorious for being particularly
unfriendly to those (mostly economists and demographers) who wish to ana-
lyze them. Yet, the obstacles presented by secondary data are usually out-
weighed by the benefits.
Cronbach (1957) pointed out that by the mid-1950s a rift had developed between
psychologists who relied mostly on experimentation and those who ventured
into the area of nonexperimental methods, including surveys and longitudinal
designs. The experimentalists were the vast majority then, and to this day
many psychologists are still trained to believe that experimental methods are
superior.
Secondary data are almost never experimental (but see Chapter 6, this
volume, for recent advances using experimental designs within archived survey
studies), and there are many social scientists, especially among psychologists,
who fundamentally mistrust any nonexperimental data. If an independent
variable is not directly manipulated, many of these scholars view the results
with suspicion. This lingering doubt presents a challenge for those of us
who advocate the use of archives. However, many questions simply cannot
be answered with experimentation. As we argued earlier, longitudinal data are
perhaps the most valuable type of data in adult development and aging research.
Experimental evidence is more often supplementary than primary in answering
important questions about change in some of the key aspects of aging, such as
cognition, personality, or social behavior. This is not without parallel in other
areas of inquiry, such as astronomy or evolutionary theory, where experimen-
tation is very difficult or impossible. Astronomers, for example, are often unable
to do experiments because the manipulation of certain key variables is not pos-
sible. If astronomers are interested in how the gravitational pull of stars influ-
ences the speed of planetary orbits, they cannot manipulate gravitational pull
and randomly assign planets to different gravitation groups. They must measure
the actual, natural world and draw conclusions as best they can. So it is with
many questions in adult development and aging, and indeed, in many areas of
the social sciences. It is not possible to assign people to groups that receive vary-
ing levels of education and other cognitive stimulation to see who develops
early dementia. Similarly, it is neither possible nor ethical to assign people to
groups that smoke or do not smoke, to determine who gets lung cancer or
emphysema. Only long-term data, in which the earliest waves are by definition
archival by the time the later waves are collected, can answer these questions.
Books such as the present volume will, we hope, begin to assuage the
concerns of those who doubt the usefulness of secondary analysis. We hope
that, eventually, the use of secondary data will attain equal status with other
techniques within the social scientist’s toolkit.
RECOMMENDED READINGS

Charles, S. T., Reynolds, C. A., & Gatz, M. (2001). Age-related differences and change in positive and negative affect over 23 years. Journal of Personality and Social Psychology, 80, 136–151.
Friedman, H. S., Tucker, J. S., Tomlinson-Keasey, C., Schwartz, J. E., Wingard, D. L., & Criqui, M. H. (1993). Does childhood personality predict longevity? Journal of Personality and Social Psychology, 65, 176–185.
Lucas, R. E., Clark, A. E., Georgellis, Y., & Diener, E. (2004). Unemployment alters
the set point for life satisfaction. Psychological Science, 15, 8–13.
Mroczek, D. K., & Spiro, A. (2007). Personality change influences mortality in older
men. Psychological Science, 18, 371–376.
Small, B. J., Hertzog, C., Hultsch, D. F., & Dixon, R. A. (2003). Stability and change
in adult personality over 6 years: Findings from the Victoria Longitudinal Study.
Journal of Gerontology: Psychological Sciences, 58B, 166–176.
Trzesniewski, K. H., Donnellan, M. B., & Robins, R. W. (2003). Stability of self-esteem
across the life span. Journal of Personality and Social Psychology, 84, 205–220.
REFERENCES
Charles, S. T., Reynolds, C. A., & Gatz, M. (2001). Age-related differences and
change in positive and negative affect over 23 years. Journal of Personality and
Social Psychology, 80, 136–151. doi:10.1037/0022-3514.80.1.136
Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society: Series B, 34, 187–220.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American
Psychologist, 12, 671–684. doi:10.1037/h0043943
Deary, I. J., & Der, G. (2005). Reaction time explains IQ’s association with death.
Psychological Science, 16, 64–69. doi:10.1111/j.0956-7976.2005.00781.x
Ferraro, K. F., & Kelley-Moore, J. (2001). Self-rated health and mortality among
Black and White adults: Examining the dynamic evaluation thesis. Journal of
Gerontology: Social Sciences, 56B, 195–205.
Friedman, H. S., Tucker, J. S., Tomlinson-Keasey, C., Schwartz, J. E., Wingard, D. L.,
& Criqui, M. H. (1993). Does childhood personality predict longevity? Journal of
Personality and Social Psychology, 65, 176–185. doi:10.1037/0022-3514.65.1.176
Griffin, P., Mroczek, D. K., & Spiro, A. (2006). Variability in affective change among
aging men: Findings from the VA Normative Aging Study. Journal of Research in
Personality, 40, 942–965. doi:10.1016/j.jrp.2005.09.011
Levy, B. R., Slade, M., Kunkel, S., & Kasl, S. (2002). Longevity increased by positive
self-perceptions of aging. Journal of Personality and Social Psychology, 83, 261–270.
doi:10.1037/0022-3514.83.2.261
Lucas, R. E., Clark, A. E., Georgellis, Y., & Diener, E. (2003). Reexamining adaptation
and the set point model of happiness: Reactions to changes in marital status.
Journal of Personality and Social Psychology, 84, 527–539. doi:10.1037/0022-3514.
84.3.527
Lucas, R. E., Clark, A. E., Georgellis, Y., & Diener, E. (2004). Unemployment alters the
set point for life satisfaction. Psychological Science, 15, 8–13. doi:10.1111/j.0963-
7214.2004.01501002.x
Maier, H., & Smith, J. (1999). Psychological predictors of mortality in old age. Journal
of Gerontology: Social Sciences, 54B, 44–54.
Mroczek, D. K., & Spiro, A., III. (2003). Modeling intraindividual change in person-
ality traits: Findings from the Normative Aging Study. Journal of Gerontology:
Psychological Sciences, 58B, 153–165.
Mroczek, D. K., & Spiro, A. (2005). Change in life satisfaction during adulthood:
Findings from the Veterans Affairs Normative Aging Study. Journal of Personality
and Social Psychology, 88, 189–202. doi:10.1037/0022-3514.88.1.189
Mroczek, D. K., & Spiro, A. (2007). Personality change influences mortality in older
men. Psychological Science, 18, 371–376. doi:10.1111/j.1467-9280.2007.01907.x
Schwarz, N. (2006). Measurement: Aging and the psychology of self-report. In L. L. Carstensen & C. R. Hartel (Eds.), When I’m 64: Committee on Aging Frontiers in Social Psychology, Personality, and Adult Developmental Psychology. Washington, DC: The National Academies Press.
8
USING SECONDARY DATA TO TEST
QUESTIONS ABOUT THE GENETIC
BASIS OF BEHAVIOR
MICHELLE B. NEISS, CONSTANTINE SEDIKIDES,
AND JIM STEVENSON
Behavioral genetic studies over the past several decades have shown
that most human behavior is genetically influenced (Turkheimer, 2000). In
general, however, research on genetic factors that influence human behavior
becomes more fruitful when investigators move beyond the issue of whether
heredity plays a role. Our own work uses behavioral genetic methods to iden-
tify the genetically influenced mediators between self-esteem and social
behavior. Innate, heritable influences are important in explaining the ori-
gins of self-esteem, accounting for approximately 40% of the variance in self-
esteem (Neiss, Sedikides, & Stevenson, 2002). Nonetheless, there is probably
no “self-esteem gene.” Rather, the pathway from DNA to self-esteem involves
multiple genes whose expression relates to multiple processes, which in turn
are related to multiple behaviors. For example, self-esteem is an affective eval-
uation of the self and thus may overlap with affective style in general. So it
might be the case that the genetic influence on self-esteem reflects positive
or negative affective style rather than genetic influences on self-esteem per se.
Existing studies often include a wide range of constructs and thus provide an
excellent opportunity to investigate genetic links among multiple behaviors.
As such, secondary data sets are a useful tool for behavioral genetic research.
Perhaps even more pertinently, secondary data sets provide an excellent way to apply behavioral genetic methods without having to collect such data firsthand.
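A back-of-the-envelope way to see where such estimates come from is Falconer’s formula from the classical twin design, a2 = 2(rMZ − rDZ), which doubles the gap between the identical- and fraternal-twin correlations. With purely illustrative (hypothetical) correlations of rMZ = .48 and rDZ = .28, a2 = 2(.48 − .28) = .40, or roughly the 40% figure cited above; actual estimates, including those reported in this chapter, come from formal model fitting rather than this shortcut.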
ILLUSTRATIVE STUDY
TWIN STUDY
Method
We used the twin sample from the MIDUS survey (N = 1,914 individ-
uals). The design allowed multiple twin pairs from the same family to partic-
ipate; we limited our sample to only one pair per family. Our selection process
yielded 878 twin pairs: 344 identical, or monozygotic (MZ), twin pairs (160
female pairs, 184 male pairs), and 534 fraternal, or dizygotic (DZ), twin pairs
(189 female pairs, 115 male pairs, 230 mixed-sex pairs). More detail on the
sample and methods can be found elsewhere (Neiss et al., 2005).
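As an illustration of the selection step just described, a few lines of hypothetical Python (pandas) suffice; the column names family_id, pair_id, and zygosity are made up for the sketch and are not MIDUS variable names.

import pandas as pd

# Hypothetical file with one row per twin pair.
pairs = pd.read_csv("midus_twin_pairs.csv")

# Retain a single pair per family so that pairs are statistically independent.
one_per_family = (
    pairs.sort_values("pair_id")
         .drop_duplicates(subset="family_id", keep="first")
)

print(one_per_family.groupby("zygosity").size())  # MZ and DZ pair counts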
Results
[Figure: path diagram of the multivariate twin model, showing additive genetic (a1–a3), shared environmental (c1–c3), and nonshared environmental (e1–e3) paths to self-esteem (SE), executive self (Exec), and negative affectivity (NA).]
Nonshared environmental influences stemmed primarily from the third, specific factor. In other words, nonshared
environmental effects were primarily unique to each variable. Any modest
overlap stemmed from the common factor underlying all three. These estimates
include measurement error.
The multivariate analyses yielded modest links between just executive
self and negative affectivity. Therefore, we tested one final model in which
we dropped all shared environment paths (as described above) and the
remaining direct genetic and nonshared environmental paths between exec-
utive self and negative affect (a2 and e2 paths to NA). This reduced model
fit well, χ2(32, N = 572) = 32.52, p = .44 (AIC = −31.48; RMSEA = .02). Of
note, this model suggests that executive self does not display any genetic or
environmental link with negative affect over and above those effects shared
with self-esteem.
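(A small check that is useful when reading archived model output: under the common χ2-based convention AIC = χ2 − 2df, the values reported above are internally consistent, 32.52 − 2 × 32 = −31.48.)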
CONCLUSION
Our aim was to investigate the overlap between aspects of the self sys-
tem (executive self and self-esteem) and negative affectivity. Using a second-
ary data set allowed us to compare phenotypic analyses and behavioral
genetic analyses involving large samples and complicated study design (twin
methodology). Capitalizing on both sets of results, we concluded that self-
esteem explained much of the relation between executive self and negative
affectivity. The behavioral genetic analyses added the information that the
overlap stemmed primarily from common genetic influences. Nonetheless,
the behavioral genetic methodology also allowed us to specify distinctions
between the self system and negative affectivity, as illustrated by specific
genetic and nonshared environmental influences.
The use of secondary data sets permits researchers to use behavioral
genetic methods without undergoing the arduous process of actually having
to collect genetically informative data. Although behavior genetic method-
ology can be used to answer theoretically driven questions about psycho-
logical phenomena, relatively few psychologists include this method in their
toolbox. One obstacle is the difficulty in collecting relevant data—a difficulty
that can be overcome by turning to secondary data sets.
Still, it is relatively rare to find genetically informative data that are readily available
to other researchers. We note that many twin registries do in fact allow
researchers to propose secondary data analyses, collaborate with project direc-
tors or principal investigators, or pay for data collection. These are all valuable
ways to access genetically informed data sets without setting up independent
registries. We encourage researchers to pursue these routes as well. In keep-
ing with the spirit of this book, however, we describe here several archived
data sets that are available to researchers. This availability is especially laud-
able, as the large time and monetary investment in obtaining genetically
informative data often encourages proprietary proclivities.
■ National Survey of Midlife Development in the United States
(MIDUS). Our own research drew from the MIDUS data set,
available from the Interuniversity Consortium for Political and
Social Research (ICPSR; http://www.icpsr.umich.edu). The
MIDUS represents an interdisciplinary collaboration to exam-
ine the patterns, predictors, and consequences of midlife devel-
opment in the areas of physical health, psychological well-being,
and social responsibility. Respondents provided extensive infor-
mation on their physical and mental health. Participants also
answered questions about their work histories and work-related
demands. In addition, they provided information about child-
hood experiences, such as presence or absence of parents, famil-
ial environments, and quality of relationships with siblings and
parents. Psychological well-being measures included feelings of
accomplishment, desire to learn, sense of control over one’s
life, broad interests, and hopes for the future. The data include
respondents ages 25 to 74 recruited from the general popula-
tion in a random-digit dialing procedure (N = 4,244), siblings
of the general population respondents (N = 950), and a twin
sample (N = 1,914). The first data wave was collected in 1995
to 1996 (Brim et al., 2007), and the second in 2004 to 2006
(Ryff et al., 2006).
■ Swedish Adoption/Twin Study on Aging (SATSA). Also available
from ICPSR are data from SATSA (Pedersen, 1993). SATSA
was designed to study the environmental and genetic factors
contributing to individual differences in aging. SATSA includes
four data waves (sample sizes vary by questionnaire and year,
with N = 1,736 in 1984). The sample includes twins who were
separated at an early age and raised apart as well as a control
sample of twins raised together. Respondents answered ques-
tions about their personality, attitudes, health status, and the way they were reared.
REFERENCES
Moffitt, T. E., Caspi, A., & Rutter, M. (2005). Strategy for investigating interactions
between measured genes and measured environments. Archives of General
Psychiatry, 62, 473–481. doi:10.1001/archpsyc.62.5.473
Mroczek, D. K., & Kolarz, C. M. (1998). The effect of age on positive and negative
affect: A developmental perspective on happiness. Journal of Personality and
Social Psychology, 75, 1333–1349. doi:10.1037/0022-3514.75.5.1333
Neiss, M. B., Sedikides, C., & Stevenson, J. (2002). Self-esteem: A behavioural genetic
perspective. European Journal of Personality, 16, 351–367. doi:10.1002/per.456
Neiss, M. B., Stevenson, J., Legrand, L. N., Iacono, W. G., & Sedikides, C. (in press).
Self-esteem, negative emotionality, and depression as a common temperamen-
tal core: A study of mid-adolescent twin girls. Journal of Personality.
Neiss, M. B., Stevenson, J., Sedikides, C., Kumashiro, M., Finkel, E. J., & Rusbult, C. E.
(2005). Executive self, self-esteem, and negative affectivity: Relations at the
phenotypic and genotypic level. Journal of Personality and Social Psychology, 89,
593–606. doi:10.1037/0022-3514.89.4.593
Pedersen, N. L. (1993). Swedish Adoption/Twin Study on Aging (SATSA), 1984,
1987, 1990, and 1993 [computer file]. ICPSR version. Stockholm, Sweden:
Karolinska Institutet [producer], 1993. Ann Arbor, MI: Interuniversity Consortium
for Political and Social Research [distributor], 2004.
Plomin, R., DeFries, J. C., McClearn, G. E., & McGuffin, P. (2001). Behavioral genetics.
New York, NY: Worth.
Plomin, R., Fulker, D. W., Corley, R., & DeFries, J. C. (1997). Nature, nurture, and
cognitive development from 1–16 years: A parent–offspring adoption study.
Psychological Science, 8, 442–447. doi:10.1111/j.1467-9280.1997.tb00458.x
Pyszczynski, T., Greenberg, J., Solomon, S., Arndt, J., & Schimel, J. (2004). Why do
people need self-esteem? A theoretical and empirical review. Psychological Bulletin,
130, 435–468. doi:10.1037/0033-2909.130.3.435
Rutter, M., Moffitt, T. E., & Caspi, A. (2006). Gene–environment interplay: Multiple
varieties but real effects. Journal of Child Psychology and Psychiatry, and Allied
Disciplines, 47, 226–261. doi:10.1111/j.1469-7610.2005.01557.x
Ryff, C. D., Almeida, D. M., Ayanian, J. S., Carr, D. S., Cleary, P. D., Coe, C., et al.
(2006). Midlife Development in the United States (MIDUS2), 2004–2006
[computer file]. ICPSR04652-v1. Madison, WI: University of Wisconsin, Survey
Center [producer], 2006. Ann Arbor, MI: Interuniversity Consortium for Political
and Social Research [distributor], 2007.
Sedikides, C., & Gregg, A. P. (2003). Portraits of the self. In M. A. Hogg & J. Cooper
(Eds.), Sage handbook of social psychology (pp. 110–138). London, England: Sage.
Turkheimer, E. (2000). Three laws of behavior genetics and what they mean. Current
Directions in Psychological Science, 9, 160–164. doi:10.1111/1467-8721.00084
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of
brief measures of positive and negative affect. Journal of Personality and Social
Psychology, 54, 1063–1070. doi:10.1037/0022-3514.54.6.1063
9
SECONDARY DATA ANALYSIS IN
PSYCHOPATHOLOGY RESEARCH
NICHOLAS R. EATON AND ROBERT F. KRUEGER
Mental disorders pose huge costs for individuals and societies alike. In
addition to the significant economic expenses that arise from caring for affected
individuals, the high levels of psychopathology-related impairment and
disability are themselves staggering (Lopez & Murray, 1998). Psychological
researchers strive to understand the nature of mental disorders, but it is an unfor-
tunate fact that many empirical questions in this area are not very tractable
because of limitations on funding, time, availability of study participants, and
so on. The presence of large, publicly available data sets provides a means to
investigate topics of fundamental importance by leveraging existing resources.
Although straightforward logistical issues highlight the benefits of sec-
ondary research, they are by no means the only reason an investigator might
decide to use large, public data sets. Sample size can be an issue of major
importance to certain research endeavors. Particular topics of study (e.g., low
base-rate phenomena) may require large data sets to ensure that an adequate
number of cases are observed. Similarly, some multivariate statistical models
(e.g., cluster analysis, multidimensional scaling, latent class analysis) are cen-
trally relevant to key questions about mental disorders, but they often require
considerable sample sizes. In our own research, which we touch on briefly
throughout this chapter, such a scenario has frequently been the case. Indeed,
some analyses have required such substantial samples that we have combined
across multiple large public data sets.
Throughout this chapter, we address the benefits of using secondary data
for psychopathology research. We also discuss the general limitations present
in the analysis of existing data sets. In addition, existing studies of mental dis-
orders face their own unique challenges, which are a major focus of our con-
sideration. Finally, resources such as data sets and readings are provided for
researchers interested in pursuing the topic further.
latent variables that underlie these disorders and account for these relations.
Indeed, our research into this topic has identified two major latent factors when
examining these sorts of disorders: an internalizing factor, comprising depression-
and anxiety-related disorders, and an externalizing factor, involving substance
use, oppositional behaviors, and conduct problems (see Krueger, 1999). The
identification of these latent factors required, among other things, large sam-
ples. The confirmatory factor analyses involved in uncovering these underlying structures of common mental disorders required considerable variation in the symptom count data, which tend to be highly positively skewed because certain symptoms are endorsed only infrequently. Secondary analysis of archival data sets (and sometimes the merging of several large data sets; e.g., Krueger & Markon, 2006) supplied both a sufficient number of individuals with psychopathological features and adequate across-disorder symptom covariance.
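A minimal sketch of such a two-factor confirmatory model, written in Python with the semopy package and lavaan-style model syntax, appears below. The disorder variable names are hypothetical placeholders, and a serious analysis of skewed, ordinal symptom counts would require estimators beyond the default maximum likelihood shown here.

import pandas as pd
from semopy import Model

# Hypothetical data set: one symptom-count column per disorder.
data = pd.read_csv("symptom_counts.csv")

spec = """
internalizing =~ depression + gad + panic + social_phobia
externalizing =~ alcohol_dep + drug_dep + conduct
internalizing ~~ externalizing
"""

model = Model(spec)
model.fit(data)
print(model.inspect())  # loadings and the internalizing-externalizing correlation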
One clear consideration in the use of secondary data sets is whether they
permit publicly open access. Although we are limiting the present discussion
to publicly available data sets, it is important to note that many outstanding
proprietary archived data sets exist. One example, from a study on which we collaborate, is the data from the Minnesota Center for Twin and
Family Research (MCTFR), which encompasses sizeable samples of twins,
adoptees, and families followed longitudinally for years (Iacono, McGue, &
Krueger, 2006). Proprietary studies often have specific aims, which facilitate
certain types of analyses (in the MCTFR, for example, twin and adoption
study designs allow researchers to disentangle genetic and environmental
contributions to overt behavior and individual differences). When studies are
not publicly available, collaborations with principal investigators are a possi-
bility, and researchers would be remiss in not pursuing them. A researcher
need not reinvent the proverbial wheel if adequate data to answer his or her
empirical questions are already available in a proprietary data set that is acces-
sible through a collaborative effort or data sharing agreement.
Even when data sets are publicly available, other difficulties must be
addressed. One major concern for investigators is their lack of control over the data collection. This concern is understandable: researchers who use existing data sets are at the mercy of the original investigators. There can be no absolute
assurance that the procedures of the study were undertaken as they are
described. As noted earlier, however, a survey of the large-scale studies that
have gone on to become publicly available data sets reveals the competency
of the investigators. Indeed, the amount of funding alone required to conduct
such large-scale studies indicates the researchers showed a high degree of pro-
ficiency (at least in the eyes of the granting agency). Thus, although investi-
gators may be apprehensive about publishing from data collected beyond
their purview, this should not prevent them from using the extraordinary
resources available in existing data sets. It should also be noted that other
fields have embraced the benefits of collaboration by aggregating large data
sets collected by multiple research teams, so this issue is not unique to psy-
chopathology research. The study of human genetics is a prime example. Key
questions in genetics will require multiple data sets and large-scale coopera-
tion. Collaborative work in this area has yielded promising results (The
Wellcome Trust Case Control Consortium, 2007).
Concerns about the integrity of the original data are common to all sec-
ondary research projects, regardless of the field, because of the importance
of accuracy in scientific inquiry. However, psychopathology research pres-
ents unique considerations for prospective users of publicly archived data.
Specifically, the investigation of mental disorders can be a difficult endeavor
even when all aspects of a study are under the researcher’s control; when other investigators collected the data, additional complications arise. One such complication is that diagnostic criteria change across editions of the DSM, so archived data may reflect outdated definitions. One solution to this problem is to locate another data set that has the desired information. Although this
will not always be possible, several data sets exist using the most recently
revised version of the taxonomic system, especially for the most common men-
tal disorders. Researchers whose interests lie in less commonly studied forms
of psychopathology (e.g., dissociative identity disorder) may have a very lim-
ited group of existing data set options to begin with, and they may have to set-
tle for outdated diagnostic criteria in exchange for the advantages of using
existing data.
A second solution to the problem of outdated diagnostic criteria is for
the researcher to use these diagnostic criteria anyway. This will certainly not
be an adequate solution in all cases, but many psychopathology-related ques-
tions do not require data about a specific DSM diagnosis. In our own research,
for instance, we often use symptom data based on DSM–III–R and sometimes
even the 1980 DSM–III. When investigating the structure of common mental
disorders, a researcher is concerned with estimating the latent constructs
underlying and interconnecting disorders by virtue of symptom counts or
covariances between disorders’ symptom counts. As long as the symptoms
are related to the latent factor(s) under investigation, it is not particularly
important whether they appear in the most recent, or any, version of the DSM.
The results of structural studies of this nature, in fact, indicate that broad latent factors underlie psychopathology and produce significant comorbidity across disorders (Krueger, 1999; Krueger, Caspi, Moffitt, & Silva, 1998). The implication of such findings is that focusing on the specific diagnoses found in a particular version of the DSM is not only overly confining; investigators who treat DSM diagnostic criteria verbatim may also miss important connections between the latent and manifest forms of psychopathology. It seems that positive results can come from using somewhat out-
dated criteria in data sets; even data sets compiled almost 30 years ago using
DSM–III criteria likely contain valuable information for thoughtful and cau-
tious researchers.
Even if investigators are willing to be somewhat flexible with regard
to how psychopathology was defined at the time of the original study, archived
data sets still may not include information about the disorders of greatest inter-
est. For numerous practical reasons, most large-scale studies were unable to
probe deeply for the presence of every mental disorder. These holes in the data
represent the necessity of selective assessment of disorders by the original
investigators. For example, the National Epidemiologic Survey on Alcohol
and Related Conditions (NESARC) data set, described in more detail in the
Recommended Data Sets section, is an excellent resource for studies of per-
sonality pathology because these disorders are not always addressed in major
studies that become archived data sets. However, even in NESARC, only a subset of the personality disorders was assessed. A related issue concerns the quality of the assessment instruments themselves; the increasing application of modern test theory (e.g., item response theory)
has led to the creation of more psychometrically sound measures in recent
years. On a case-by-case basis, researchers must evaluate the assessment instru-
ments. A reasonable degree of flexibility and psychometric competence allows
investigators to take greater advantage of existing resources. We suggest, how-
ever, that when researchers have qualms about the assessment tools used in a
given data set, they include these reservations in resulting publications.
These points highlight the importance of adequate documentation in sec-
ondary data sets. It is imperative that psychopathology-related data sets provide
thorough documentation to allow future researchers to use them responsibly.
Although the level of documentation varies from study to study, the major
existing data sets used for mental disorder research typically do contain detailed
information related to data collection methodology, assessment batteries used,
treatment of data, explanations of variables included in the data set, and so on.
If this information is missing or incomplete, users of these data sets may be able
to contact the original researchers or locate answers to their questions in arti-
cles or chapters that have resulted from the archived data. However, such
avenues may not be possible, and it is conceivable that researchers would be
forced to proceed with some amount of uncertainty. Clearly, this is another case in which publications that result from use of the existing data should acknowledge the authors’ uncertainty about any unknown information.
The National Survey of Midlife Development in the United States
(MIDUS) studies, discussed in the Recommended Data Sets section, are excel-
lent examples of how readily available study documentation facilitates ease
(and confidence) of use by future investigators (Brim, Ryff, & Kessler, 2004).
MIDUS includes, along with the data, information about the sampling proce-
dures, recruitment, interviewing, and participant compensation. In addition,
all measures used are downloadable, so future researchers can see exactly
what questions were asked and in what order. When new variables were
coded from other data, algorithms and explanations of this coding are pro-
vided. For instance, depression symptoms were assessed individually. The
number of symptoms present in each individual was summed to create a con-
tinuous variable of depression symptomatology, which was then split into a
binary variable whose values indicated whether the number of depression
symptoms present met the DSM diagnostic threshold. In this way, researchers
have several variables available for analysis, some of which arose from other
variables. Clear explanations by the original researchers provided in the
MIDUS documentation allow new investigators to be certain of how these
new variables were created and what they mean. Finally, MIDUS provides ref-
erences in addition to information about parallel forms used in the original and
follow-up assessments.
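In code, that coding logic amounts to a sum and a threshold. The sketch below uses hypothetical column names and an illustrative cutoff of five symptoms; it is not the actual MIDUS algorithm.

import pandas as pd

df = pd.read_csv("midus_extract.csv")

# Hypothetical item-level indicators: 1 = symptom present, 0 = absent.
symptom_items = ["dep_sym1", "dep_sym2", "dep_sym3", "dep_sym4",
                 "dep_sym5", "dep_sym6", "dep_sym7"]

# Continuous symptom count, then a binary variable marking whether the
# count meets an (illustrative) diagnostic threshold.
df["dep_count"] = df[symptom_items].sum(axis=1)
df["dep_diagnosis"] = (df["dep_count"] >= 5).astype(int)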
Researchers who use such data sets must understand what criteria were used for each screen and how those screens determined study inclusion–exclusion rules.
A related issue is the general inclusion and exclusion criteria that studies incorporate into their designs. For instance, there is a notable
difference if one study admitted participants age 18 versus age 21, although this
issue is likely immaterial to many empirical questions. Issues relating to psy-
chopathology itself may be more problematic, however. One common exclusion criterion is to screen out individuals who show mental retardation (MR), pervasive developmental disorders (PDDs), or psychosis (although, clearly, the inclusion and exclusion criteria used depend on the question of interest for a
given study). If one of the studies in an aggregated data set created by a
researcher was representative of a major metropolitan district X, minus indi-
viduals who were screened out because of MR, PDDs, and psychosis, this sam-
ple does not overlap completely with another data set whose sample was
representative of all individuals who live in district X. Although such over-
lap may not be of great importance to the research inquiries of the investiga-
tor using the aggregated data set, such incongruencies in inclusion–exclusion
criteria should still be noted in resulting publications.
A final concern that commonly arises with large psychopathology data
sets involves the way in which mental disorders are coded. As mentioned pre-
viously, one can code a particular disorder either on the basis of the number of symptoms an individual endorses or by an algorithm that outputs a binary variable: a yes–no indication of whether a given disorder is present. Suppose one study (“Study A”) provides symptom counts and another (“Study B”) provides binary diagnoses. There are at least two
approaches that researchers can adopt when faced with differentially coded
psychopathology variables in aggregated data sets. The first is simple: One can
dichotomize the symptom-count variable in Study A using the algorithm to cre-
ate the binary variable in Study B. However, this algorithm may not always be
available to the researcher. A second approach is to use statistical methods that treat binary variables as coarse indicators of underlying continuous dimensions (e.g., latent symptom liabilities), such as the tetrachoric correlation. Space limitations prevent in-depth
discussion of such methods, but the user of aggregated data sets who faces this
variable coding problem may find it helpful to investigate possible statistical
means of addressing the differences in variable coding.
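Because the tetrachoric correlation is not a one-line call in common Python statistics libraries, the sketch below estimates it from its definition: each binary variable is treated as a dichotomized standard normal liability, thresholds are set from the marginal endorsement rates, and the latent correlation is chosen to reproduce the observed both-present cell. This is one simple estimator under those assumptions, with hypothetical inputs, not a polished routine.

import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def tetrachoric(x, y):
    """Estimate the tetrachoric correlation between two 0/1 variables."""
    x, y = np.asarray(x), np.asarray(y)
    # Thresholds on the latent normal scales implied by the marginal rates.
    tau_x = norm.ppf(1 - x.mean())
    tau_y = norm.ppf(1 - y.mean())
    p11 = np.mean((x == 1) & (y == 1))  # both "present"

    def loss(rho):
        # P(X* > tau_x, Y* > tau_y) for a bivariate normal with correlation rho
        joint = multivariate_normal.cdf(
            [tau_x, tau_y], mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]]
        )
        implied_p11 = 1.0 - norm.cdf(tau_x) - norm.cdf(tau_y) + joint
        return (implied_p11 - p11) ** 2

    return minimize_scalar(loss, bounds=(-0.99, 0.99), method="bounded").x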
RECOMMENDED DATA SETS
The following are brief descriptions of some major existing data sets fre-
quently used in psychopathology research. This list is by no means exhaustive,
but these studies represent excellent resources for individuals interested in
studying mental disorders through the secondary analysis of existing resources.
■ Epidemiologic Catchment Area (ECA). The ECA study emerged
in the early 1980s as a major epidemiological study of psycho-
pathology. Conducted in several sites across the United States,
the ECA consisted of over 20,000 participants assessed over
two waves of data collection. A broad array of disorders was
screened for in the ECA study, including mood, anxiety, psy-
chosis, and substance use. For ECA study information and data,
visit the Interuniversity Consortium for Political and Social
Research (ICPSR) at http://www.icpsr.umich.edu/.
■ Midlife in the United States (MIDUS). The MIDUS study, which
began around 1994, concerned midlife development in the
United States. A large national probability sample of 3,485 indi-
viduals included oversamples of metropolitan areas, a twin sam-
ple, and a sibling sample. Approximately a decade later, a second
wave of data collection, MIDUS-II, recontacted many of the
original participants, making this an excellent longitudinal
resource. In addition to a wide set of psychopathological variables, the sample is notably diverse and includes older adults. Overall, the MIDUS documentation is excellent, making
it an ideal place for researchers to begin using secondary data sets.
For more information, see http://midus.wisc.edu. The MIDUS
data are also available through the ICPSR.
■ National Comorbidity Survey (NCS). The NCS study, conducted
in the early 1990s, was a nationally representative study of men-
tal health in the United States. A second wave of data collection,
NCS-2, followed up the original participants approximately a
decade later. Another sample of roughly 10,000 individuals, the NCS Replication (NCS-R), was assessed in the early 2000s.
The interested researcher might find it helpful to read a few studies that
addressed psychopathological questions through the analysis of existing data
sets. The articles below are only a few of the possible choices of readings; many
of the websites listed in Recommended Data Sets have links to studies emerg-
ing from their data sets, and a literature search for a particular data set should
yield additional studies.
Krueger, R. F., Chentsova-Dutton, Y. E., Markon, K. E., Goldberg, D., & Ormel, J.
(2003). A cross-cultural study of the structure of comorbidity among common
psychopathological syndromes in the general health care setting. Journal of
Abnormal Psychology, 112, 437–447. doi:10.1037/0021-843X.112.3.437
Krueger et al. discuss how a large World Health Organization data set was used
to confirm previous findings about a two-factor model of psychopathology, this
time in 14 countries around the globe. This data set was accessible to Goldberg
and Ormel because of their involvement with the data collection.
Kessler, R. C., DuPont, R. L., Berglund, P., & Wittchen, H. (1999). Impairment
in pure and comorbid generalized anxiety disorder and major depression at
12 months in two national surveys. The American Journal of Psychiatry, 156,
1915–1923.
Kessler et al. used data from the NCS and MIDUS archives to determine
whether generalized anxiety disorder was due to depression (or other comorbid
disorders) and to explore the level of impairment seen in independently occur-
ring cases of generalized anxiety disorder.
Krueger, R. F. (1999). The structure of common mental disorders. Archives of General
Psychiatry, 56, 921–926. doi:10.1001/archpsyc.56.10.921
Krueger used the NCS data set (publicly available on the web) to explore how
10 mental disorders fit into two- and three-factor models of psychopathology.
Krueger, R. F., & Markon, K. E. (2006). Reinterpreting comorbidity: A model-based
approach to understanding and classifying psychopathology. Annual Review of
Clinical Psychology, 2, 111–133. doi:10.1146/annurev.clinpsy.2.022305.095213
Krueger and Markon conducted a meta-analysis of studies published on major
data sets (e.g., NCS, NCS-R, Virginia Twin Registry). The data for this analy-
sis (correlation matrices) were taken from published reports.
10
USING SECONDARY DATA
TO STUDY ADOLESCENCE AND
ADOLESCENT DEVELOPMENT
STEPHEN T. RUSSELL AND EVA MATTHEWS
ADVANTAGES
Population-representative data are particularly significant for policy research, for which a typical goal is to provide findings
specifically applicable to a target policy audience. For example, analyses of the
California Healthy Kids Survey, a representative survey of more than 230,000
California public school students in Grades 7, 9, and 11 in the 2001–2002
school year, were used to advocate for school safety legislation in the state of
California. The Safe Place to Learn report (O’Shaughnessy, Russell, Heck,
Calhoun, & Laub, 2004) highlighted the deleterious consequences of harass-
ment based on sexual orientation and gender identity, and led to the enact-
ment of California Assembly Bill 394, the Safe Place to Learn Act of 2007, a
law that provides clarification and guidance to school districts to ensure that
school safety standards are implemented. Central to the success of the legisla-
tive advocacy was the availability of state-level representative data to document the prevalence of, and the negative outcomes associated with, harassment of adolescents in California schools.
An additional advantage is that many secondary data sources include
information from multiple reporters. The importance of multiple reporters in
developmental research is clear: Multiple perspectives allow for estimates of
reliability and consistency in reporting. The challenges of reporter bias may
be particularly relevant during the adolescent years, when the influence of
social desirability is strong; this is especially true for the reporting of sensitive
information from adolescents (e.g., emotional health, risk behavior, delin-
quency, sexual behavior). In answer to these challenges, large-scale surveys
have included parents, siblings, peers, and teachers as respondents in surveys
designed to assess adolescent development. For example, recent research on
aggression in adolescence highlights the importance of independent reports
from adolescents, their siblings, and their parents. Using longitudinal survey
data from the Iowa Youth and Families Project, Williams, Conger, and Blozis
(2007) showed that adolescents’ interpersonal aggression can be predicted by
independent reports by their siblings of the siblings’ own aggression.
Access to data about multiple contexts of adolescents’ lives is an addi-
tional advantage of many existing data sources and is increasingly important
in developmental research that seeks to situate the study of adolescence within
the broader contexts that guide and shape development (Schulenberg, 2006).
Adolescent survey data may include intrapersonal questions about the ado-
lescent: physical development; physical and emotional health; risk-taking
and health-promoting behaviors; and attitudes, beliefs, values, and goals.
Surveys may also include questions about interpersonal relationships (rela-
tionships between adolescents and their parents, siblings, and peers), the
school environment (school characteristics, and adolescents’ attitudes and
beliefs about their school), peers and friendships, and religion. The field of
adolescence studies has been strongly influenced by notions of developmen-
tal contextualism (Lerner, 1986), which emphasizes the changing and interrelated contexts in which development unfolds.
CHALLENGES
CONCLUSION
Several key advantages of existing survey data sets have been discussed
in this chapter: large sample sizes, population representative data, longitudi-
nal data, data from multiple reporters, insights about multiple contexts of
development, and the ability to conduct cross-historical or cross-national
comparisons. Although these advantages may not be unique to scholars inter-
ested in issues of adolescence, it is critical to note that scholars who seek to
study sensitive issues such as adolescent sexuality, substance use, and peer vic-
timization face unique logistical challenges in collecting primary data. Thus,
the availability of secondary data that assess such topics has created oppor-
tunities for scientists to study these topics, and for public health programs to
meet the needs of this often vulnerable population.
REFERENCES
Aceves, M., & Cookston, J. (2007). Violent victimization, aggression, and parent–
adolescent relations: Quality parenting as a buffer for violently victimized
youth. Journal of Youth and Adolescence, 36, 635–647. doi:10.1007/s10964-
006-9131-9
Akers, R. L., Massey, J., & Clarke, W. (1983). Are self-reports of adolescent deviance
valid? Biochemical measures, randomized response, and the bogus pipeline in
smoking behavior. Social Forces, 62, 234–251. doi:10.2307/2578357
Bearman, P. S., & Moody, J. (2004). Suicide and friendships among American ado-
lescents. American Journal of Public Health, 94, 89–95. doi:10.2105/AJPH.94.1.89
Berkner, L. (2000). Using National Educational Longitudinal Study data to examine
the transition to college. New Directions for Institutional Research, 2000, 103–107.
doi:10.1002/ir.10707
Centers for Disease Control and Prevention. (2004). National Youth Tobacco Survey
methodology report. Retrieved from http://www.cdc.gov/tobacco/NYTS/nyts
2004.htm
Crockett, L. J., Brown, J., Russell, S. T., & Shen, Y.-L. (2007). The meaning of good
parent–child relationships for Mexican American adolescents. Journal of Research
on Adolescence, 17, 639–668.
Crockett, L. J., Randall, B. A., Shen, Y., Russell, S. T., & Driscoll, A. K. (2005).
Measurement equivalence of the Center for Epidemiological Studies Depression
Scale for Latino and Anglo adolescents: A national study. Journal of Consulting
and Clinical Psychology, 73, 47–58. doi:10.1037/0022-006X.73.1.47
Elder, G. H., Jr. (1974). Children of the Great Depression. Chicago, IL: University of
Chicago Press.
Grimm, K. J. (2007). Multivariate longitudinal methods for studying developmental
relationships between depression and academic achievement. International Journal
of Behavioral Development, 31, 328–339. doi:10.1177/0165025407077754
Haynie, D. L., & Piquero, A. R. (2006). Pubertal development and physical victim-
ization in adolescence. Journal of Research in Crime and Delinquency, 43, 3–35.
doi:10.1177/0022427805280069
Kann, L. (2001). The Youth Risk Behavior Surveillance System: Measuring health-
risk behaviors. American Journal of Health Behavior, 25, 272–277.
Lerner, R. M. (1986). Concepts and theories of human development (2nd ed.). New
York, NY: Random House.
Lynam, D. R., Caspi, A., Moffitt, T. E., Loeber, R., & Stouthamer-Loeber, M. (2007).
Longitudinal evidence that psychopathy scores in early adolescence predict
adult psychopathy. Journal of Abnormal Psychology, 116, 155–165. doi:10.1037/
0021-843X.116.1.155
Miller, B. C., Fan, X., Christensen, M., Grotevant, H. D., & van Dulmen, M. (2000).
Comparisons of adopted and non-adopted adolescents in a large, nationally rep-
resentative sample. Child Development, 71, 1458–1473. doi:10.1111/1467-
8624.00239
Modell, J. (1991). Into one’s own: From youth to adulthood in the United States, 1920–1975.
Berkeley, CA: University of California Press.
National Institutes of Health. (2003). NIH data sharing policy and implementation
guidance. Retrieved from the National Institutes of Health, Office of Extramural Research website: http://grants.nih.gov/grants/policy/data_sharing/
11
USING SECONDARY DATA TO
ADVANCE CROSS-CULTURAL
PSYCHOLOGY
EVERT VAN DE VLIERT
Archives are echoes of culture. If this statement calls for clarification, I would
like to add the part played by language: No language, no culture, no archives.
Oral and written languages, such as Arabic, Chinese, and Dutch, are tools to
create, send, and receive cultural values, beliefs, and behaviors. Accordingly,
cross-cultural psychologists often compare cultures on the basis of responses to
interviews and questionnaires, and observations of behaviors, not infrequently
turning their findings into data sets for further use. For laypeople and scholars
alike, languages are also tools to save and store information about cultural
values, beliefs, and behaviors in archives. Thus, archival data are rich sources for
the researcher looking for how culture is revealed or displayed (manifestations);
how culture comes about (antecedents); and how culture influences people’s
comings, doings, and goings (consequences), as the following examples illustrate.
MANIFESTATIONS
ANTECEDENTS
CONSEQUENCES
Vandello and Cohen (1999) used secondary data to rank the 50 states
of the United States on consequences of culture, ranging from most collec-
tivistic (Hawaii, Louisiana, and South Carolina) to most individualistic
(Nebraska, Oregon, and Montana). The more individualistic the state’s
citizens, the more likely they were to live alone or at least in households
with no grandchildren, be divorced, have no religious affiliation, vote for the
Libertarian party, and be self-employed. For the same reason, carpooling to work rather than driving alone was more common in the Deep South than in the Mountain West and Great Plains. It is not surprising that this 50-state ranking has since been widely used in cross-cultural research.
Advantages
Disadvantages
The second mistake was that climate–culture researchers, including the
younger me, overlooked the complicating role of money. A valid analysis
of cultural adaptation to climate should take into account how much cash
(ready money) and capital (unready money) a society has available to cope
with bitter winters, scorching summers, or both. We need to search for climato-
economic origins of culture.
Theory
Of course, a society may adapt its cultural values, beliefs, and practices
to its climatic environment (e.g., House, Hanges, Javidan, Dorfman, & Gupta,
2004), its economic environment (e.g., Inglehart & Baker, 2000), both in a
parallel fashion (e.g., Nolan & Lenski, 1999), or both in a sequential fashion
(e.g., Hofstede, 2001). But all of these perspectives neglect the equally obvious possibility that the climatic and economic impacts on culture may influ-
ence each other. A more accurate understanding of culture may unfold when
one thinks of the environment as an integrated climato-economic habitat
requiring integrated cultural responses. Hence, my emphasis on the hypothesis
that the interaction of climatic demands and collective income matters most
to culture. Demanding colder-than-temperate and hotter-than-temperate cli-
mates make income resources more useful. Income resources make harsher
climates less threatening and often more challenging.
Psychologists will have little trouble adopting the general idea that
resources can make up for demands; this line of reasoning is familiar to them
(e.g., Bandura, 1997; Karasek, 1979; Lazarus & Folkman, 1984). Greater
demands mismatched by unavailable or inadequate resources to meet the
demands impair psychological functioning, as the actors cannot control the
threatening and stressful situation. By contrast, greater demands matched by
personal or societal resources to meet the demands improve psychological
functioning, as the actors can control the situation, can turn threats into
opportunities, and can experience relief and pleasure instead of disappoint-
ment and pain. If the demands are negligible, resources do not serve a useful
purpose, with the consequence that no joint impact of demands and resources
on psychological functioning surfaces.
Using secondary data, recent research demonstrated that these demands–
resources explanations of human functioning are generalizable to (mis)matches
of climatic demands and income resources as origins of aspects of culture.
Greater climatic demands mismatched by collective poverty produce more life
stress and mortality salience, and increase one’s quest for certainty and one’s
inclination to show favor to “us” above “them.” These threatening climato-
economic niches led to more encouragement of selfishness in children,
more rule-and-role taking in organizations, and more autocratic leadership ideals (Van de Vliert, 2009).
Level of Analysis
Figure 11.1. Effect of more demanding cold or hot climates on autocratic versus
democratic leadership ideals, broken down for poorer and richer countries.
Figure 11.1 plots countries by demanding cold or hot climate, low versus high collective income, and autocratic versus democratic leadership culture (because of space restrictions, the positions of Australians, Dutch, German-speaking Swiss, and White South Africans could not be labeled).
In support of the hypothesis, Figure 11.1 shows a downward sloping
regression line for the poorer countries, r(30) = −.26, p < .08; an upward slop-
ing regression line for the richer countries, r(30) = .56, p < .001; negligible
differences between poorer and richer countries in temperate climates at the
left, r(30) = .01, ns; and significant differences between poorer and richer
countries in harsher climates at the right, r(30) = .65, p < .001. The down-
ward sloping line confirms the proposition that more threatening mismatches
of climatic demands and income resources impair psychological functioning
by generating more autocratic leadership ideals. The upward sloping line
confirms the proposition that more challenging matches of climatic demands
and income resources improve psychological functioning by generating more
democratic leadership ideals.
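In essence this is a moderated regression, which the following Python (statsmodels) sketch illustrates; the file and variable names (climato_economic.csv, autocratic_ideals, climatic_demands, income) are hypothetical stand-ins for the leadership and income measures, not the author's actual analysis script.

import pandas as pd
import statsmodels.formula.api as smf

countries = pd.read_csv("climato_economic.csv")  # one row per country

# The '*' expands to both main effects plus their product, so the product
# coefficient tests whether income moderates the effect of climatic demands.
fit = smf.ols("autocratic_ideals ~ climatic_demands * income",
              data=countries).fit()
print(fit.summary())

# Simple slopes of climate within poorer and richer halves of the sample.
richer = countries["income"] > countries["income"].median()
for label, grp in countries.groupby(richer):
    slope = smf.ols("autocratic_ideals ~ climatic_demands", data=grp).fit()
    print("richer" if label else "poorer", slope.params["climatic_demands"])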
This pattern of results survived many attacks by rival predictors. Notably,
main and interactive effects of climatic precipitation could not account for,
or qualify, the joint impact of thermal climate and income per head. Likewise,
main and interactive effects of income inequality could not destroy the
picture either. As expected, only when I controlled for the above-mentioned World Values Surveys' dimension of survival culture versus self-expression culture (Inglehart et al., 2004) did the climato-economic regression coefficients fail to reach significance. Variation in survival versus self-
expression culture, accounting for 41% of the variation in autocratic versus
democratic leadership culture, appeared once again to function as an overar-
ching cultural umbrella of values, beliefs, and practices. Climato-economic
niches of survival culture, including autocratic leadership ideals, contrast with
climato-economic niches of self-expression culture, including democratic
leadership ideals.
Together with previous findings in this strand of research (Van de
Vliert, 2009), the present pattern of results reflects consistency, parsimony,
and accuracy, and inspires confidence in the following interpretation of
Figure 11.1.
■ More autocratic and relatively selfish leaders thrive in survival
cultures that evolve in poorer countries with harsher climates,
as represented by China, Kazakhstan, and Russia in the lower
right corner.
■ Leaders who embrace neither autocratic nor democratic
approaches thrive in countries with temperate climates irre-
spective of the inhabitants’ income per head (e.g., Malaysia and
Zambia, in the middle at the left).
■ More democratic and relatively cooperative leaders thrive in
self-expression cultures that evolve in richer countries with
harsher climates, as represented by Austria, Denmark, and
Finland in the upper right corner.
This set of conclusions supports the impression that leaders are products
rather than producers of culture. Indeed, rather than shaping climato-economic
niches, the results indicate that leadership cultures are shaped by their envi-
ronment, just as world citizens adapt their values, beliefs, and practices to the
climate of their residential area using the money they have and hold avail-
able to cope with that climate. Everyone, every day, everywhere has to satisfy
climate-based needs for thermal comfort, nutrition, and health with the help
of money resources. Corresponding values, beliefs, and practices have been
learned with the greatest of ease, without awareness of their age-long evolu-
tion, and with next to no recollection of survival as their ultimate objective.
As a consequence, archived cultural remnants of the climato-economic past
are silently waiting to be discovered as companions of the climato-economic
present.
RECOMMENDED READINGS

Gupta, V., Sully de Luque, M., & House, R. J. (2004). Multisource construct validity
of GLOBE scales. In R. J. House, P. J. Hanges, M. Javidan, P. W. Dorfman, &
V. Gupta (Eds.), Culture, leadership, and organizations: The GLOBE study of
62 societies (pp. 152–177). Thousand Oaks, CA: Sage.
Van de Vliert, E. (2009). Climate, affluence, and culture. New York, NY: Cambridge
University Press.
REFERENCES
Bandura, A. (1997). Self-efficacy: The exercise of control. New York, NY: Freeman
Press.
Buss, D.M. (2004). Evolutionary psychology: The new science of the mind (2nd ed.).
Boston, MA: Allyn & Bacon.
Dawkins, R. (1989). The selfish gene (2nd ed.). New York, NY: Oxford University
Press.
Diamond, J. (2005). Collapse: How societies choose to fail or survive. New York, NY:
Penguin.
Gupta, V., Sully de Luque, M., & House, R.J. (2004). Multisource construct validity
of GLOBE scales. In R. J. House, P. J. Hanges, M. Javidan, P. W. Dorfman, &
V. Gupta (Eds.), Culture, leadership, and organizations: The GLOBE study of
62 societies (pp. 152–177). Thousand Oaks, CA: Sage.
Hanges, P. J., & Dickson, M. W. (2004). The development and validation of the
GLOBE culture and leadership scales. In R. J. House, P. J. Hanges, M. Javidan,
P. W. Dorfman, & V. Gupta (Eds.), Culture, leadership, and organizations: The
GLOBE study of 62 societies (pp. 122–151). Thousand Oaks, CA: Sage.
Hofstede, G. (2001). Culture’s consequences: Comparing values, behaviors, institutions,
and organizations across nations. London, England: Sage.
House, R. J., & Hanges, P. J. (2004). Research design. In R. J. House, P. J. Hanges,
M. Javidan, P. W. Dorfman, & V. Gupta (Eds.), Culture, leadership, and orga-
nizations: The GLOBE study of 62 societies (pp. 95–101). Thousand Oaks, CA:
Sage.
House, R. J., Hanges, P. J., Javidan, M., Dorfman, P. W., & Gupta, V. (Eds.). (2004).
Culture, leadership, and organizations: The GLOBE study of 62 societies. Thousand
Oaks, CA: Sage.
Inglehart, R., & Baker, W.E. (2000). Modernization, cultural change, and the persis-
tence of traditional values. American Sociological Review, 65, 19–51. doi:10.2307/
2657288
Inglehart, R., Basáñez, M., Díez-Medrano, J., Halman, L., & Luijkx, R. (Eds.). (2004).
Human beliefs and values: A cross-cultural sourcebook based on the 1999–2002
values surveys. Mexico City, Mexico: Siglo XXI Editores. Also available from
http://www.worldvaluessurvey.org
Kanazawa, S. (2006). Where do cultures come from? Cross-Cultural Research, 40,
152–176.
Karasek, R. A. (1979). Job demands, job decision latitude, and mental strain:
Implications for job redesign. Administrative Science Quarterly, 24, 285–308.
Kemmelmeier, M., Jambor, E. J., & Letner, J. (2006). Individualism and collectivism:
Cultural variation in giving and volunteering across the United States. Journal
of Cross-Cultural Psychology, 37, 327–344.
Knight, C. (1991). Blood relations: Menstruation and the origins of culture. New Haven,
CT: Yale University Press.
Lazarus, R. S., & Folkman, S. (1984). Stress, appraisal, and coping. New York, NY:
Springer.
McClelland, D. C. (1961). The achieving society. Princeton, NJ: Van Nostrand.
Nolan, P., & Lenski, G. (1999). Human societies: An introduction to macrosociology
(8th ed.). New York, NY: McGraw-Hill.
Parker, P. M. (1997). National cultures of the world: A statistical reference. Westport,
CT: Greenwood Press.
Ross, M. H. (1993). The culture of conflict. New Haven, CT: Yale University Press.
12
USING THE AMERICAN NATIONAL
ELECTION STUDY SURVEYS TO TEST
SOCIAL PSYCHOLOGICAL
HYPOTHESES
DANIEL SCHNEIDER, MATTHEW DEBELL, AND JON A. KROSNICK
Since 1948, the American National Election Study (ANES) has been
collecting huge data sets, allowing social scientists to study the psychology of
voting behavior, political attitudes and beliefs, stereotyping, political social-
ization, the effects of social networks, the impact of the news media on the
political process, and much more. Every 2 years, representative samples of
more than 1,000 Americans have been interviewed in-depth after the national
elections. In presidential election years, these respondents have been inter-
viewed in-depth before the election as well. Panel studies have been con-
ducted to track changes in people’s attitudes, beliefs, and behavior, and pilot
studies have been conducted to develop new measurements to be used later
in the interviews.
The ANES was conceived at the University of Michigan’s Institute for
Social Research by Angus Campbell, Philip Converse, Warren Miller, and
Donald Stokes. Some of the most widely cited books on the psychology of
voting were written by these scholars using ANES data (e.g., Campbell,
Converse, Miller, & Stokes, 1960; Converse, 1964; Miller, 1974), and the
data have been used by numerous other researchers to test hypotheses and
produce thousands of books and articles. In 1977, the National Science
Foundation began funding the ANES as an ongoing national research resource.
Phenomena that have been studied experimentally in the laboratory can also be
examined through survey data (Kinder & Palfrey, 1993). Surveys
allow tracking changes in attitudes, beliefs, and behavior over time, either in
individual respondents or in the aggregate of the general population. Such
data allow for sophisticated statistical analyses to bolster confidence in the
directions of causal relations (Finkel, 1995), and researchers can embed exper-
iments in surveys to simultaneously enhance internal and external validity
(Piazza & Sniderman, 1998; Visser, Krosnick, & Lavrakas, 2000). Doing so
with general population samples allows researchers to explore the impact of
a wide array of possible individual difference moderators of effects, whereas such
investigation is more difficult in more homogeneous conventional lab stud-
ies with college student participants (Sears, 1986).
SURVEY METHODOLOGY
In face-to-face and telephone surveys, interviewers typically use computers
that guide them through a questionnaire and implement experimental
manipulations in a survey when desired.
Self-administered paper-and-pencil questionnaires, often mailed to
respondents, have been used for large-scale survey operations but yield low
response rates in general population surveys unless particular methods are
used to enhance response rates (Dillman, 2000). Paper-and-pencil question-
naires are not well suited to complex skip patterns, whereby answers to early
questions determine which later questions a person should answer. Some stud-
ies have shown that telephone interviewing can produce more accurate
results than self-administered paper-and-pencil questionnaires (Silver &
Krosnick, 2001), but there have also been cases in which mail surveys provided
excellent results (Visser, Krosnick, Marquette, & Kurtin, 1996).
Self-administered questionnaires can also be completed through com-
puters and the Internet, and this methodology is now increasingly popular.
Internet surveys combine many of the positive features of other survey modes:
No interviewers are needed, which saves money; complex filtering and exper-
imental manipulations can be implemented; visual presentation of response
options is routine, perhaps reducing respondent burden; and audio and video
material can be practically presented. However, Internet access is not univer-
sal among the general population (DeBell & Chapman, 2006), which pres-
ents challenges in the use of web-based surveys. Some commercial firms in
the United States and other countries have recruited representative samples
of adults and given computer equipment and Internet access to households
without it, thus yielding accurate data through this mode (Chang & Krosnick,
2001a, 2001b).
Regardless of the mode selected, respondent recruitment procedures
should be designed to minimize nonresponse bias and maximize the response
rate. Nonresponse bias occurs when a sample is systematically different from
the population. Response rates—defined as the percentage of eligible sample
members who complete a survey (American Association for Public Opinion
Research, 2006)—are of interest to survey researchers because they indicate
the degree of risk that a survey sample might be unrepresentative. If nearly all
sampled individuals complete a survey, and if the survey is designed and
implemented optimally, then the potential for nonresponse bias is low.
Conversely, if the response rate is low, there is increased potential for non-
response bias to affect estimates. However, low response rates per se are not
evidence of nonresponse bias; they merely indicate the possibility of bias, and
an accumulating body of studies indicates that if a probability sample is drawn
from a population and serious efforts are made to collect data from as many
sampled individuals as possible, results appear to be minimally affected by
response rates (Curtin, Presser, & Singer, 2002; Holbrook, Krosnick, & Pfent,
2008; Keeter et al., 2000).
Panel Surveys
1Although the preelection and postelection waves of the face-to-face ANES studies constitute two-wave
panels, few questions have been asked identically in both interviews, thus limiting the study’s ability to
track changes over time in individuals.
Some ANES panel studies have tracked respondents over longer time
periods spanning several elections (1956–1958–1960, 1972–1974–1976,
2000–2002–2004). Other ANES panels have tracked changes over 1-year
periods, such as from 1990 to 1991 to 1992, or changes during a single elec-
tion campaign season, such as in 1980, when data were collected in January,
June, September, and November–December.
A study by Bannon, Krosnick, and Brannon (2006) used the specific
structure of the ANES Panel Study of 1990–1992–1994 to address the occur-
rence of media priming in election campaigns. It also tested an alternative
explanation for the priming effect of media coverage on how issues are used
by citizens in their overall evaluations of political candidates. Media priming
theories postulate that citizens form evaluations at the level of individual issues
and then combine those into an overall candidate evaluation. However, past
research has shown that respondents sometimes engage in rationalization strate-
gies (Rahn, Krosnick, & Breuning, 1994), that is, forming an overall evalua-
tion and then inferring the lower level issue evaluations from that overall
evaluation. During the 1990–1992–1994 election cycle, the researchers used
the shift in public attention from the Gulf War to the state of the econ-
omy to examine how individuals in the panel study changed their evaluations
of the president's economic policies and how those evaluations translated
into changes in their overall evaluations. Using structural equation modeling
of the panel data, the researchers isolated the direction of causality. They
found that although both rationalization and traditional media priming do
occur, priming has a much stronger effect. In fact, previous studies might have
underestimated the media priming effect because it was counterbalanced by
the presence of unmodeled rationalization effects.
Respondents high in need for cognition and need to evaluate were more interested in politics and more
engaged in the campaign. Those high in need to evaluate had more extreme
attitudes, and those high in need for cognition were less likely to be dissatis-
fied with the experience of taking the survey. Although need for cognition
has usually been measured with 18 questionnaire items (Cacioppo, Petty, &
Kao, 1984) and need to evaluate has been measured with 16 (Jarvis & Petty,
1996), the ANES implemented two optimally formatted questions measur-
ing need for cognition and three optimally formatted questions measuring
need to evaluate, which yielded remarkably effective measurements.
The 2006 pilot study addressed many topics of interest to psychologists:
character judgments, defensive confidence, need for closure, belief in a just
world, self-monitoring, interpersonal trust, basic values, optimism–pessimism,
social networks, tolerance, and many others. More than 20 initial reports on
many of these topics can be seen on the ANES website (http://www.electionstudies.org).
This is more than for any prior pilot study, but much work remains
to be done with the 2006 data set.
All ANES data sets are available through the Internet at no cost from
the ANES website. Each data file is accompanied by documentation of the
study’s design. Many ANES data files provide information about the inter-
viewers and their assessments of the respondents and the interview situations,
which can be used to investigate methodological issues. The website also pro-
vides a great deal of other information about the ANES and maintains a bib-
liography that documents many of the thousands of articles, books, and papers
that have used its data.
Of course, the ANES is beneficial to a researcher only if the questions
he or she is interested in were actually asked in the surveys. There is
a bit of a chicken-and-egg problem here: A researcher does not know whether
the ANES offers measures suitable to address his or her hypotheses until he or
she learns what measures are available, but the task of learning about all avail-
able measures administered during many hours of interviewing over many
years seems daunting if he or she does not know in advance that suitable mea-
sures will be found there. An efficient solution to this problem is for researchers
to get familiar with one of the traditional preelection–postelection survey
questionnaire pairs. Many of the same sorts of items are asked across many
years, so becoming familiar with one questionnaire is a good first step toward
becoming familiar with many.
All ANES measures are documented in the codebooks and question-
naires available for each data set (go to http://www.electionstudies.org). The
questionnaires list all the questions asked, often also documenting instruc-
tions given to the interviewers about how to ask the questions. The code-
books list every variable in each data set, showing the responses to the
questions and also many additional variables about the interview process, the
interviewer, and the respondent. The documentation also includes descrip-
tions of sample designs, experiments embedded in the surveys, and informa-
tion on the data collection modes. Researchers should bear in mind that
coding approaches may differ across variables, waves, or years of data collec-
tion, so it is important to carefully read the codebook description of every
variable one works with.
2ANES measures party identification with a branching question that begins, “Generally speaking, do
you think of yourself as a Republican, a Democrat, an independent, or what?” If the respondent answers
“Republican” or “Democrat,” a follow-up question asks, “Would you call yourself a strong [Republican/
Democrat] or a not very strong [Republican/Democrat]?” If the respondent does not answer “Republican”
or “Democrat,” a follow-up question asks, “Do you think of yourself as closer to the Republican Party or
to the Democratic Party?” Responses to these questions can be combined to yield a 7-point summary
scale: strong Democrat, not very strong Democrat, independent closer to the Democratic Party, independent,
independent closer to the Republican Party, not very strong Republican, strong Republican. However, some
research indicates that this 7-point scale is not monotonically related to other variables, so analysts
should check for monotonicity before computing statistics with it.
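For analysts who want to construct the 7-point summary scale themselves, the branching logic described in this footnote is easy to implement in code. The following Python sketch is a minimal illustration; the response labels are illustrative stand-ins, because the actual ANES variables use numeric codes documented in each codebook.

    def party_id_7pt(initial, strength=None, leaner=None):
        # Collapse the ANES branching party identification items into the
        # 7-point summary scale (1 = strong Democrat ... 7 = strong Republican).
        # String labels here are illustrative; real data files use numeric codes.
        if initial == 'Democrat':
            return 1 if strength == 'strong' else 2
        if initial == 'Republican':
            return 7 if strength == 'strong' else 6
        if leaner == 'Democrat':
            return 3
        if leaner == 'Republican':
            return 5
        return 4  # independent with no partisan leaning

    # Example: an initial "independent" who leans Republican scores 5.
    print(party_id_7pt('independent', leaner='Republican'))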
The ANES does not use simple random samples, so statistical proce-
dures that assume simple random sampling are generally inappropriate for the
analysis of ANES data. The extent to which the study design differs from a
simple random sample and the extent to which documentation and data files
support design-consistent statistical procedures have varied over the decades.
In general, specialized statistical steps should be taken when analyzing ANES
data. These steps are unfamiliar to most researchers who have not been trained
in survey methodology, but they are fairly easy to implement correctly. There
are two steps: weight the data and compute design-consistent estimates of
variance (including standard errors).
Weights
If an ANES data set includes an analysis weight variable, researchers
should use it if they wish to project their results to the population of American
adult citizens. It is important to use the weights, because weights adjust the
data for unequal probabilities of selection and correct for nonresponse bias,
making estimates such as percentages, means, and regression coefficients
more accurate as parameter estimates for the entire population. In statistical
software such as Stata, SAS, and SPSS, researchers can implement simple
instructions to tell the software to use a weight. For example, in SPSS, once
the 2006 Pilot Study data set has been opened, implementing the syntax com-
mand “weight by v06p002” tells SPSS to run subsequent analyses using the
pilot study’s weight variable, V06P002. The name of the weight variable(s)
for each study can be found in the study’s codebook.
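The same weighting logic can be applied in any environment that supports weighted computations. As a minimal sketch, the following Python code computes a weighted mean and a weighted percentage with numpy; the data values are invented for illustration, and in practice the weight column would be read from the ANES data file.

    import numpy as np

    # Invented toy data: y is an attitude rating, approve is a 0/1 indicator,
    # and w is the analysis weight (e.g., V06P002 in the 2006 Pilot Study).
    y = np.array([1.0, 4.0, 7.0, 3.0])
    approve = np.array([1, 0, 1, 1])
    w = np.array([0.8, 1.2, 1.0, 1.5])

    weighted_mean = np.sum(w * y) / np.sum(w)
    weighted_pct = 100 * np.sum(w * approve) / np.sum(w)
    print(weighted_mean, weighted_pct)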
Design-Consistent Estimates
Running analyses with the weights is sufficient to obtain correct point
estimates such as percentages and regression coefficients for the population.
However, by default, most statistical software will calculate sampling errors
and statistical significance using procedures designed for simple random sam-
ples (SRS). The complex sample designs used in most ANES studies differ in
important ways from simple random sampling, so standard errors, confidence
intervals, and levels of statistical significance reported using SRS assumptions
for ANES data are incorrect. Normally, the use of SRS significance statistics
will lead to Type I errors (i.e., false rejection of the null, or making differences
look significant when they are not).
To avoid these errors, data analysts should always use design-consistent sta-
tistical procedures when the data support them. Recent ANES data sets support
Taylor series methods (Kish, 1965; Lee & Forthofer, 2006) to estimate standard
errors, and many statistical programs, including Stata (http://www.stata.com),
implement these methods once the sampling design variables are specified.
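To make the Taylor series logic concrete, the following Python sketch computes a linearized standard error for a weighted mean under a stratified, clustered design, using the common with-replacement approximation. The stratum and primary sampling unit (PSU) variable names are placeholders; the actual design variables for each ANES study are identified in its documentation.

    import numpy as np

    def linearized_se_weighted_mean(y, w, stratum, psu):
        # Taylor series (linearization) variance for a weighted mean:
        # total the linearized values within each PSU, then pool the
        # between-PSU variation within strata across all strata.
        y, w = np.asarray(y, float), np.asarray(w, float)
        stratum, psu = np.asarray(stratum), np.asarray(psu)
        wsum = w.sum()
        ybar = (w * y).sum() / wsum
        z = w * (y - ybar) / wsum              # linearized values
        var = 0.0
        for h in np.unique(stratum):
            in_h = stratum == h
            totals = np.array([z[in_h & (psu == c)].sum()
                               for c in np.unique(psu[in_h])])
            n_h = len(totals)
            if n_h > 1:                        # need 2+ PSUs per stratum
                var += n_h / (n_h - 1) * ((totals - totals.mean()) ** 2).sum()
        return ybar, np.sqrt(var)

    # Toy example with two strata and two PSUs per stratum.
    y = [5, 3, 6, 2, 7, 4]
    w = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0]
    stratum = [1, 1, 1, 1, 2, 2]
    psu = [1, 1, 2, 2, 3, 4]
    print(linearized_se_weighted_mean(y, w, stratum, psu))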
CODA
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (Eds.). (1991). Measures of per-
sonality and social psychological attitudes. San Diego, CA: Academic Press.
Contains the exact wording of numerous sets of questions used to measure
dimensions of personality and other psychological constructs.
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (Eds.). (1999). Measures of polit-
ical attitudes. San Diego, CA: Academic Press.
Contains the exact wording of numerous sets of questions used to measure polit-
ical attitudes, as well as essays on conceptualization and measurement issues in
political survey research.
REFERENCES
Abelson, R. P., Kinder, D. R., Peters, M. D., & Fiske, S. T. (1982). Affective and
semantic components in political person perception. Journal of Personality and
Social Psychology, 42, 619–630. doi:10.1037/0022-3514.42.4.619
American Association for Public Opinion Research. (2006). Standard definitions:
Final dispositions of case codes and outcome rates for surveys (4th ed.). Lenexa, KS:
Author.
Bannon, B., Krosnick, J. A., & Brannon, L. (2006, August/September). News media
priming: Derivation or rationalization? Paper presented at the annual meeting of
the American Political Science Association, Philadelphia, PA.
Bizer, G. Y., Krosnick, J. A., Holbrook, A. L., Wheeler, S. C., Rucker, D. D., & Petty,
R. E. (2004). The impact of personality on cognitive, behavioral, and affective
political processes: The effects of need to evaluate. Journal of Personality, 72,
995–1027. doi:10.1111/j.0022-3506.2004.00288.x
Bizer, G. Y., Krosnick, J. A., Petty, R. E., Rucker, D. D., & Wheeler, S. C. (2000). Need
for cognition and need to evaluate in the 1998 National Election Survey Pilot Study.
Jarvis, W. B. G., & Petty, R. E. (1996). The need to evaluate. Journal of Personality
and Social Psychology, 70, 172–194. doi:10.1037/0022-3514.70.1.172
Keeter, S., Miller, C., Kohut, A., Groves, R. M., & Presser, S. (2000). Consequences
of reducing nonresponse in a national telephone survey. Public Opinion Quarterly,
64, 125–148. doi:10.1086/317759
Kinder, D. R., & Palfrey, T. R. (1993). On behalf of an experimental political sci-
ence. In D. R. Kinder & T. R. Palfrey (Eds.), Experimental foundations of political
science (pp. 1–39). Ann Arbor, MI: The University of Michigan Press.
Kish, L. (1965). Survey sampling. New York, NY: Wiley.
Krosnick, J. A., & Kinder, D. R. (1990). Altering popular support for the president
through priming: The Iran-Contra affair. The American Political Science Review,
84, 497–512. doi:10.2307/1963531
Lee, E. S., & Forthofer, R. N. (2006). Analyzing complex survey data. Thousand Oaks,
CA: Sage.
Miller, A. H. (1974). Political issues and trust in government: 1964–1970. The
American Political Science Review, 68, 951–972. doi:10.2307/1959140
Piazza, T., & Sniderman, P. M. (1998). Incorporating experiments into computer
assisted surveys. In M. P. Couper, R. P. Baker, J. Bethlehem, C. Z. F. Clark,
J. Martin, W. L. Nicholls, & J. M. O’Reilly (Eds.), Computer assisted survey
information collection (pp. 167–184). New York, NY: Wiley.
Rahn, W. M., Krosnick, J. A., & Breuning, M. (1994). Rationalization and deriva-
tion processes in survey studies of political candidate evaluation. American
Journal of Political Science, 38, 582–600. doi:10.2307/2111598
Sears, D. O. (1986). College sophomores in the laboratory: Influences of a narrow
data base on social psychology’s view of human nature. Journal of Personality and
Social Psychology, 51, 515–530. doi:10.1037/0022-3514.51.3.515
Shanks, J. M., Sanchez, M., & Morton, B. (1983). Alternative approaches to survey data
collection for the National Election Studies (ANES Technical Report No. nes010120).
Ann Arbor, MI: University of Michigan.
Silver, M. D., & Krosnick, J. A. (2001, May). An experimental comparison of the qual-
ity of data obtained in telephone and self-administered mailed surveys with a listed sam-
ple. Paper presented at the annual meeting of the American Association for
Public Opinion Research, Montreal, Canada.
Visser, P. S., Krosnick, J. A., & Lavrakas, P. J. (2000). Survey research. In H. T. Reis
& C. M. Judd (Eds.), Handbook of research methods in social and personality psy-
chology (pp. 223–252). Cambridge, England: Cambridge University Press.
Visser, P. S., Krosnick, J. A., Marquette, J., & Kurtin, M. (1996). Mail surveys for
election forecasting? An evaluation of the Columbus Dispatch Poll. Public
Opinion Quarterly, 60, 181–227. doi:10.1086/297748
13
FAMILY-LEVEL VARIANCE IN VERBAL
ABILITY CHANGE IN THE
INTERGENERATIONAL STUDIES
KEVIN J. GRIMM, JOHN J. MCARDLE, AND KEITH F. WIDAMAN
1Data from the IGS are currently not publicly available. However, efforts are currently being made to
make IGS data available to the research community. Interested researchers should contact the first
author.
More recently, Finkel, Reynolds, McArdle, and Pedersen (2005) fit lon-
gitudinal growth models to twin data from the Swedish Adoption Twin Study
of Aging. A quadratic growth model was fit for verbal ability, and the genetic
contributions to the intercept, linear slope, and quadratic slope were exam-
ined. Finkel et al. reported a large (79%) genetic contribution to the intercept
(centered at age 65), a small (13%) and nonsignificant genetic contribution
to the linear slope, and no genetic contribution to the quadratic slope. In their
study of four cognitive abilities (verbal, spatial, memory, and speed), they
found that the only significant genetic contribution to the linear slope
was found for the speed factor (32%); however, larger genetic contributions
were found for the quadratic slopes for certain abilities. Additionally, non-
significant amounts of common environment variation were found in the
intercepts and in the linear and quadratic slopes for all cognitive variables, sug-
gesting that the development of these abilities arises from a combination of
genetic and unshared environmental contributions.
Measures in large-scale studies are often modified or shortened when adminis-
tered, leading to less reliable measures and making it difficult to compare scores
(and parameter estimates) obtained from the secondary data set with scores
obtained from other studies that used more complete or comprehensive mea-
surements. For example, in the NLSY–Children and Young Adults, a measure
of children’s behavior—the Behavior Problems Index (Zill & Peterson,
1986)—has much in common with the Child Behavior Checklist (Achenbach
& Edelbrock, 1983), a popular measure of children’s behavior problems.
However, direct comparison of these scores is not appropriate. A second
example comes from the ECLS–K, in which the academic achievement mea-
sures (reading, mathematics, and general knowledge) were obtained using
adaptive testing formats. That is, each participant was administered a set of
items targeted toward his or her ability level. Even though participants were
asked different sets of questions, their scores were comparable because the
items have been linked using item response theory techniques. However,
the ECLS–K cognitive ability scores are not comparable with commonly used
cognitive ability measures (e.g., Woodcock–Johnson Tests of Achievement;
Woodcock & Johnson, 1989) collected in other studies.
The foregoing considerations regarding the lack of comparability of scores
across studies underscore the importance of the presence of item-level data.
Item responses represent the most basic unit of data, and with item-level data,
comparable scores can be created using linking techniques from item response
theory if some items are shared by the measures used in the secondary and ancil-
lary data sets, as long as differential item functioning is not present. Item-level
data have been necessary to examine change in the IGS, as different measures
of the same construct (e.g., verbal ability) have been administered throughout
the course of the study.
METHOD
Intergenerational Studies
Berkeley Growth Study

The Berkeley Growth Study (BGS) was initiated in 1928 by Nancy Bayley;
participants were enrolled shortly after birth. The infants were assessed
every month from 1 to 15 months, every
3 months from 18 through 36 months, and annually from 4 to 18 years of
age. In adulthood, BGS participants were assessed at ages 21, 26, 36, 52,
66, and 72.2
Data from the BGS participants include measures of maternity and pre-
natal health, cognition, motor skills, personality, social behavior, anthropo-
metric characteristics, psychological health, military service, marriage, alcohol use,
smoking, and physical examinations. Cognitive data come from the California
First-Year Mental Scale (Bayley, 1933), California Preschool Mental Scale
(Jaffa, 1934), Stanford–Binet (Terman, 1916), Wechsler–Bellevue, and WAIS.
Many children of the BGS participants (N = 149) were also repeatedly
measured through childhood and adolescence with many of the same mental
tests (e.g., California Preschool Mental Scale, Stanford–Binet, WAIS) admin-
istered to the original participants. The children of the BGS participants were
assessed up to 17 times during childhood, adolescence, and early adulthood.
Additionally, personality and mental test performance data are available
from the parents (in 1968) and spouses of BGS participants (in 1980, 1994,
and 2000).
Guidance Study
The GS was initiated in 1928 by Jean Macfarlane (1939). The 248 par-
ticipants were drawn from a survey of every third birth in Berkeley, California,
for the 18-month period starting January 1, 1928. Home visits began when
infants were 3 months of age and continued through 18 months. Infants and
parents were interviewed and tested every 6 months from 2 through 4 years and
then annually from 5 to 18 years of age. In adulthood, the GS participants were
assessed at 40 and 52 years of age. Measurements in the GS included medical
examinations, health histories, anthropometric assessments, intelligence tests,
socioeconomic and home variables, and personality characteristics.
As in the BGS, data from the children of the GS participants (N = 424)
are available and include cognitive and personality assessments. These mea-
surements were conducted in 1960 and 1970. Personality measures and mental
tests were also administered to the parents (Generation 1) in 1970, and spouses
of GS participants were assessed in 1970 and 1980.
2McArdle et al. completed interviews with BGS participants and spouses in 2008 (age 80).
The IGS data are archived at the Institute of Human Development at the
University of California, Berkeley. Original documents of all tests and measures,
as well as interesting notes and documented interactions between study mem-
bers and project leaders, are stored. Electronic data files for many of the major
tests and measures were created in the 1970s. In the early 1980s, Jack McArdle
and Mark Welge coded the Wechsler intelligence tests (Wechsler–Bellevue,
WAIS, and WAIS–R) administered to the original participants in adulthood at
the item level. Over the past few years, the first two authors of this chapter made
several visits to UC Berkeley to copy and code all Stanford–Binet (1916
Stanford–Binet [Terman, 1916], Forms L and M [Terman & Merrill, 1937], and
Form LM [Terman & Merrill, 1960]) and remaining Wechsler intelligence tests at the item level.3
The IGS provides a wealth of data across the life span, but extracting lon-
gitudinal change information across this extended time period is not always
straightforward. For example, in the work by Block (1971) examining life-span
development of personality, Q-sort data had to be constructed because Q-sorts
were not administered in adolescence and early adulthood. Q-sort ratings were
made by three clinically trained professionals and were based on case material
from the relevant life stage (see Huffine & Aerts, 1998).

3The IGS data archive is now located in the Department of Psychology at the
University of California, Davis.
As previously mentioned, McArdle et al. (2009) investigated life-span
development of verbal ability and short-term memory with data from the IGS.
In this project, McArdle et al. (2009) attempted to separate changes in partic-
ipants’ ability over the life span from changes in the intelligence tests, collat-
ing information from the 16 different intelligence batteries that had been
administered over the course of the study. To separate change in ability from
change in testing protocols, an outside data set—the Bradway–McArdle
Longitudinal Study (Bradway, 1944; McArdle, Hamagami, Meredith, &
Bradway, 2000)—was included because these participants took multiple intel-
ligence tests (i.e., WAIS and Stanford–Binet) within a short period of time.
Item-level data were analyzed because the revisions of the Stanford–Binet
(e.g., 1916 Stanford–Binet Forms L, M, and LM) and Wechsler scales (e.g.,
WAIS and WAIS–R) shared items even though their scale or raw scores were
not comparable, given that the tests had different numbers of items, different
standardization samples, and/or did not contain all of the same items. McArdle
et al. (2009) fit a combined item response and nonlinear growth model to
examine the growth and decline of verbal ability and short-term memory
across the life span.
Item-level data are available for a select set of intelligence tests that rep-
resent the most comprehensive and often administered scales in the IGS.
These include the Stanford–Binet (1916 Stanford–Binet, 1937 Revisions,
and 1960 Revision) and Wechsler intelligence scales (Wechsler–Bellevue,
Wechsler Intelligence Scale for Children, WAIS, and WAIS–R). Table 13.1
contains information on the number of participants in Generations 2 and 3
for which intelligence test data are available. The item-level data from the
vocabulary scales, as well as item data from the Information Scale of the
Wechsler tests, were combined into a single data file as if the items composed
a single test. This longitudinal file contained multiple records per participant to
represent the repeated nature of the data. The file contained a total of 231 items
and 4,563 records from 1,373 individual participants from 501 families.
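Assembling such a long file from occasion-specific scoring files is mostly a matter of concatenation. The following pandas sketch illustrates the idea; the file and column names are hypothetical, not the actual IGS archive names.

    import pandas as pd

    # Hypothetical per-test files, each holding one row per person per testing
    # occasion: person and family identifiers, age at testing, and item responses.
    files = ['sb1916_items.csv', 'sb_formL_items.csv', 'wais_items.csv']
    frames = [pd.read_csv(f) for f in files]

    # Stack the occasions into a single long file; items a person was never
    # administered simply remain missing in the combined item columns.
    long_file = pd.concat(frames, ignore_index=True, sort=False)
    long_file = long_file.sort_values(['family_id', 'person_id', 'age'])
    print(long_file.shape)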
TABLE 13.1
Number of Subjects and Total Data Points Available for Each Intelligence
Test Broken Down by Study and Generation

                          BGS    BGS        GS     GS         OGS    OGS
                                 children          children          children
1916 Stanford–Binet
  Subjects                 58     —         211     —         202     —
  Data points             107     —         409     —         203     —
Stanford–Binet Form L
  Subjects                 62    139        212    303         —     251
  Data points             235    621        634    387         —     309
Stanford–Binet Form M
  Subjects                 61    142        199     63        152     —

Analytic Techniques

The longitudinal item-level data were stacked such that repeated observations were
contained as multiple records (i.e., a long file). The partial credit model (PCM;
Masters, 1982) was fit to the item-level data, without accounting for the
dependency due to the repeated observations, and person–ability estimates
were derived from the model for each person at each occasion. The PCM can
be written as
\ln\left( \frac{P_{X[t]in}}{1 - P_{X[t]in}} \right) = \theta[t]_n - \beta_i ,    (1)

where \theta[t]_n is the ability of person n at occasion t and \beta_i is the
difficulty of item i. The resulting ability estimates were then modeled with a
dual exponential family growth curve of the general form

\theta[t]_{nj} = I_{nj} + S_{nj}\left( e^{-a_1 t} - e^{-a_2 t} \right) + e[t]_{nj} ,    (2)
where θ[t]nj is the estimated verbal ability score from the PCM for participant
n in family j at age t, Inj is the intercept score for participant n in family j, Snj
is the slope score for participant n in family j representing the individual’s rate
of change, a1 is the rate of decline (as participants reach older adulthood), a2
is the rate of growth (as participants increase in age from early childhood
through adolescence and into adulthood), and e[t]nj is a time-specific residual
score. The individual intercept and slope can be decomposed into family-
level scores and individual deviations from the family-level scores, such that

I_{nj} = \beta_{0j} + u_{0nj}
S_{nj} = \beta_{1j} + u_{1nj} ,    (3)
where β0j and β1j are the family-level intercept and slope scores for family j and
u0nj and u1nj are individual deviations from the family-level scores for participant
n in family j. Finally, the family-level intercept and slope scores are composed
of sample-level means (\gamma_{00}, \gamma_{10}) and family-level deviations (s_{00j}, s_{10j}), such that

\beta_{0j} = \gamma_{00} + s_{00j}
\beta_{1j} = \gamma_{10} + s_{10j} .    (4)
The variability of u0nj compared with s00j and u1nj compared with s10j indicates
the amount of within-family versus between-families variance in the intercept
and slope, and the ratio of between-families variance to total variance for the
intercept and slope are estimates of familial resemblance. The family growth
curve models were fit using Mplus 4.13 (Muthén & Muthén, 1998-2007).
Input and output scripts can be downloaded from http://psychology.ucdavis.
edu/labs/Grimm/personal/downloads.html.
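The familial-resemblance ratio implied by Equations 3 and 4 can be illustrated by simulation. The Python sketch below draws family-level and individual-level intercept deviations with known variances and recovers the between-families share of the total intercept variance; the variance components are invented for illustration and are not IGS estimates.

    import numpy as np

    rng = np.random.default_rng(2009)
    n_families, n_members = 500, 3
    var_between, var_within = 0.55, 0.45   # invented variance components

    # Family deviations s00j and individual deviations u0nj (Equations 3 and 4)
    s00 = rng.normal(0.0, np.sqrt(var_between), n_families)
    u0 = rng.normal(0.0, np.sqrt(var_within), (n_families, n_members))
    intercepts = s00[:, None] + u0         # I_nj expressed as deviations

    # Between-families share of total intercept variance (familial resemblance)
    print(s00.var() / intercepts.var())    # approximately 0.55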
RESULTS
The model-based reliability under a partial credit model was .97; how-
ever, this estimated reliability is likely positively biased because of the
dependencies among repeated observations that were not modeled with the PCM. On
average, individuals were administered 38.6 items (17.9% of the items in
which there was variability in the response pattern). The estimated ability
scores (i.e., θ[t]n) are a function of the items administered and the person’s
pattern of correct and incorrect responses. The scale of the ability estimates,
as well as of the item difficulties, is in the logit metric, which is an arbitrary
scaling of the scores (i.e., as is the scale of factor scores). The logit metric was
scaled such that the mean item difficulty was zero and item discrimination
was 1. In this scale, the mean and standard deviation of the person ability was
estimated to be −1.38 and 3.72, respectively.
The verbal ability estimates for each person are plotted against the per-
sons’ ages at testing in Panels A and B of Figure 13.1 for the original subjects
(Generation 2) and the children of the original subjects (Generation 3),
respectively.

Figure 13.1. Longitudinal plots of verbal ability for (A) Generation 2 and
(B) Generation 3 of the Intergenerational Studies.
Parameter estimates and their respective standard errors from the dual
exponential family growth curve are contained in Table 13.2. The intercept
was centered at age 10 years, making estimates associated with the intercept
(i.e., the intercept mean and the family- and individual-level intercept deviations)
indicative of parameters at age 10. The growth and decline rates control the
overall shape of the curve and are important parameters for describing changes
in verbal ability across the life span.

TABLE 13.2
Parameter Estimates and Standard Errors for the Dual Exponential Family
Growth Curve for Verbal Ability

The growth rate was .126, the decline rate
was .005, and both were significantly different from zero, suggesting a rapid
increase in verbal ability through childhood and a very slight decline in verbal
ability into older adulthood. Previous research with IGS and additional life-
span data (e.g., McArdle et al., 2009; McArdle, Ferrer-Caja, Hamagami, &
Woodcock, 2002) has found similar significant, but relatively small, estimates
of decline in verbal ability through older adulthood.
The mean intercept was −1.94, and the mean slope was 6.23. Therefore,
expected performance at age 10 was about .17 standard deviations below the average
for the sample. The mean intercept represents the expected family score for verbal
ability at age 10, and the mean slope represents how rapidly verbal ability was
expected to grow for the average family. Significant between-families and
within-family variances were obtained for the intercept as well as the slope,
indicating that families differed in their verbal ability at age 10 years and in
how quickly they change across the life span. Additionally, these estimates
captured how family members differed from one another in their level of ver-
bal ability at age 10 and how quickly they changed across the life span. Fifty-
five percent of the variation in the intercept was at the family level, whereas
22% of the slope variance was at the family level. These percentages repre-
sent how families differ from one another and reveal the extent of familial
resemblance in verbal ability and verbal ability change. The mean predicted
trajectory of verbal ability is shown in Figure 13.2 surrounded by two times
the expected family-level and then two times the expected individual-level
standard deviations. This plot gives a visual representation of the relation-
ship of the expected between-families variance compared with the within-
family variance at each age across the life span.
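Readers can trace the expected mean trajectory themselves from the reported estimates. The Python sketch below evaluates the dual exponential mean curve using the mean intercept, mean slope, and growth and decline rates reported above; centering the time basis at age 10 is an assumption consistent with the description of the intercept, not a detail taken from the original scripts.

    import numpy as np

    I, S = -1.94, 6.23        # reported mean intercept (age 10) and mean slope
    a1, a2 = 0.005, 0.126     # reported decline and growth rates

    def expected_verbal_ability(age):
        # Dual exponential mean curve; age centered at 10 is an assumption.
        t = np.asarray(age, dtype=float) - 10.0
        return I + S * (np.exp(-a1 * t) - np.exp(-a2 * t))

    ages = np.arange(4, 81)
    curve = expected_verbal_ability(ages)
    print(expected_verbal_ability(np.array([10.0])))  # equals the intercept, -1.94
    print(ages[np.argmax(curve)])  # growth peaks in adulthood, then slow decline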
DISCUSSION
Figure 13.2. Expected life-span trajectory for verbal ability based on the longitudinal
family models. The solid line is the expected mean trajectory; the heavy dashed lines are a
two-standard-deviation confidence boundary for the expected between-families
variance; the light dashed lines add a further two-standard-deviation confidence
boundary for the expected within-family variance.
The estimated familial component for the intercept was nearly identical to the median value
(.53) reported by DeFries et al. (1979) for regressions of midchild (i.e., average
of children’s scores) on midparent cognitive scores. The size of the between-
families variance for the slope was slightly higher than the heritability estimate
for the linear slope reported by Finkel et al. (2005) but was lower than estimates
for quadratic slope. These discrepancies in the amount of familial resemblance
in rates of change were likely to be due to differences across studies in the ages
at which participants were assessed, the particular measured variables analyzed,
the use of alternative (e.g., twin vs. family) study designs, and the homogene-
ity of participants in the IGS, among other factors. But differences can also arise
because two sources of interindividual differences in change were evaluated in
the quadratic growth model fit by Finkel et al. (2005), whereas only one source
of interindividual differences in change was included in the exponential model
fit in this project. The estimate of familial resemblance was in line with the her-
itability estimate obtained by McArdle, Prescott, Hamagami, and Horn (1998)
for vocabulary ability change (33% genetic; 0% shared environmental), even
though the age ranges were distinct across studies.
CONCLUDING REMARKS
REFERENCES
Achenbach, T. M., & Edelbrock, C. (1983). Manual for the Child Behavior Checklist
and Revised Child Behavior Profile. Burlington, VT: University of Vermont
Department of Psychiatry.
Bartels, M., Rietveld, M. J. H., Van Baal, G. C. M., & Boomsma, D. I. (2002).
Genetic and environmental influences on the development of intelligence.
Behavior Genetics, 32, 237–249. doi:10.1023/A:1019772628912
Bayley, N. (1932). A study of the crying of infants during mental and physical tests.
The Journal of Genetic Psychology, 40, 306–329.
Bayley, N. (1933). The California First-Year Mental Scale. Berkeley, CA: University
of California Press.
Bayley, N. (1943). Skeletal maturing in adolescence as a basis for determining per-
centage of completed growth. Child Development, 14, 1–46. doi:10.2307/1125612
Bayley, N. (1949). Consistency and variability in the growth of intelligence from
birth to eighteen years. The Journal of Genetic Psychology, 75, 165–196.
Bayley, N. (1955). On the growth of intelligence. American Psychologist, 10, 805–818.
doi:10.1037/h0043803
Bayley, N. (1957). Data on the growth of intelligence between 16 and 21 years as mea-
sured by the Wechsler–Bellevue Scale. The Journal of Genetic Psychology, 90, 3–15.
Bayley, N. (1964). Consistency of maternal and child behaviors in the Berkeley
Growth Study. Vita Humana, 7, 73–95.
Bayley, N. (1968). Behavioral correlates of mental growth: Birth to thirty-six years.
American Psychologist, 23, 1–17. doi:10.1037/h0037690
Bayley, N., & Jones, H. E. (1937). Environmental correlates of mental and motor
development: A cumulative study from infancy to six years. Child Development,
8, 329–341.
Block, J. (1971). Lives through time. Berkeley, CA: Bancroft Books.
Bradway, K. P. (1944). IQ constancy on the Revised Stanford–Binet from the pre-
school to the junior high school level. The Journal of Genetic Psychology, 65,
197–217.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and
data analysis methods. Newbury Park, CA: Sage.
Cleveland, H. H. (2003). Disadvantaged neighborhoods and adolescent aggression:
Behavioral genetic evidence of contextual effects. Journal of Research on
Adolescence, 13, 211–238.
Curran, P. J., & Hussong, A. M. (2009). Integrative data analysis: The simultaneous
analysis of multiple data sets. Psychological Methods, 14, 81–100. doi:10.1037/
a0015914
DeFries, J. C., Johnson, R. C., Kuse, A. R., McClearn, G. E., Polovina, J., Vandenberg,
S., & Wilson, J. R. (1979). Familial resemblance for specific cognitive abilities.
Behavior Genetics, 9, 23–43.
Finkel, D., Reynolds, C. A., McArdle, J. J., & Pedersen, N. L. (2005). The longitu-
dinal relationship between processing speed and cognitive ability: Genetic and
environmental influences. Behavior Genetics, 35, 535–549. doi:10.1007/s10519-
005-3281-5
Harris, K. M., Halpern, C. T., Smolen, A., & Haberstick, B. C. (2006). The National
Longitudinal Study of Adolescent Health (Add Health) twin data. Twin Research
and Human Genetics, 9, 988–997. doi:10.1375/twin.9.6.988
Hofer, S. M., & Piccinin, A. M. (2009). Integrative data analysis through coordination
of measurement and analysis protocol across independent longitudinal studies.
Psychological Methods, 14, 150–164. doi:10.1037/a0015566
Huffine, C. L., & Aerts, E. (1998). The Intergenerational Studies at the Institute of Human
Development, University of California, Berkeley: Longitudinal studies of children
and families, 1928–present: A guide to the data archives. Available from http://ihd.
berkeley.edu/igsguide2.pdf
Jaffa, A. S. (1934). The California Preschool Mental Scale, Form A. Berkeley, CA:
University of California Press.
Jones, H. E. (1938). The California Adolescent Growth Study. The Journal of Educational
Research, 31, 561–567.
Jones, H. E. (1939a). The Adolescent Growth Study: Principles and methods. Journal
of Consulting Psychology, 3, 157–159. doi:10.1037/h0050181
Jones, H. E. (1939b). The Adolescent Growth Study: Procedures. Journal of Consulting
Psychology, 3, 177–180. doi:10.1037/h0060864
Macfarlane, J. W. (1939). The Guidance Study. Sociometry, 2, 1–23. doi:10.2307/
2785296
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47,
149–174. doi:10.1007/BF02296272
McArdle, J. J. (1986). Latent variable growth within behavior genetic models.
Behavior Genetics, 16, 163–200. doi:10.1007/BF01065485
McArdle, J. J. (1988). Dynamic but structural equation modeling of repeated
measures data. In J. R. Nesselroade & R. B. Cattell (Eds.), Handbook of multi-
variate experimental psychology (2nd ed., pp. 561–614). New York, NY: Plenum
Press.
McArdle, J. J., Ferrer-Caja, E., Hamagami, F., & Woodcock, R. W. (2002). Compa-
rative longitudinal structural analyses of the growth and decline of multiple intel-
lectual abilities over the life span. Developmental Psychology, 38, 115–142. doi:10.
1037/0012-1649.38.1.115
McArdle, J. J., & Goldsmith, H. H. (1990). Alternative common factor models for
multivariate biometric analyses. Behavior Genetics, 20, 569–608. doi:10.1007/
BF01065873
McArdle, J. J., Grimm, K. J., Hamagami, F., Bowles, R. P., & Meredith, W. (2009).
Modeling life span growth curves of cognition using longitudinal data with
multiple samples and changing scales of measurement. Psychological Methods,
14, 126–149. doi:10.1037/a0015857
McArdle, J. J., Hamagami, F., Meredith, W., & Bradway, K. P. (2000). Modeling the
dynamic hypotheses of Gf-Gc theory using longitudinal life-span data. Learning
and Individual Differences, 12, 53–79. doi:10.1016/S1041-6080(00)00036-4
McArdle, J. J., & Horn, J. L. (2002, October). The benefits and limitations of mega-
analysis with illustrations for the WAIS. Paper presented at the 18th International
Conference of Committee on Data for Science and Technology, Montreal,
Quebec, Canada.
McArdle, J. J., Prescott, C. A., Hamagami, F., & Horn, J. L. (1998). A contemporary
method for developmental–genetic analyses of age changes in intellectual abil-
ities. Developmental Neuropsychology, 14, 69–114. doi:10.1080/8756564980954
0701
Meredith, W., & Tisak, J. (1990). Latent curve analysis. Psychometrika, 55, 107–122.
doi:10.1007/BF02294746
Muthén, L. K., & Muthén, B. O. (1998–2007). Mplus user’s guide (4th ed.). Los
Angeles, CA: Authors.
Nagoshi, C. T., & Johnson, R. C. (1993). Familial transmission of cognitive abilities
in offspring tested in adolescence and adulthood: A longitudinal study. Behavior
Genetics, 23, 279–285. doi:10.1007/BF01082467
National Center for Education Statistics. (2001). User’s manual for the ECLS–K
public-use data files and electronic codebook. Washington, DC: U.S. Department
of Education.
Rietveld, M. J. H., Dolan, C. V., Van Baal, G. C. M., & Boomsma, D. I. (2003). A
twin study of differentiation of cognitive abilities in childhood. Behavior Genetics,
33, 367–381. doi:10.1023/A:1025388908177
Rodgers, J. L., Rowe, D. C., & Li, C. (1994). Beyond nature versus nurture: DF analysis
of nonshared influences on problem behaviors. Developmental Psychology, 30,
374–384. doi:10.1037/0012-1649.30.3.374
Rogosa, D. R., & Willett, J. B. (1985). Understanding correlates of change by model-
ing individual differences in growth. Psychometrika, 50, 203–228. doi:10.1007/
BF02294247
Sands, L. P., Terry, H., & Meredith, W. (1989). Change and stability in adult intel-
lectual functioning assessed by Wechsler item responses. Psychology and Aging,
4, 79–87. doi:10.1037/0882-7974.4.1.79
Spuhler, K. P., & Vandenberg, S. G. (1980). Comparison of parent–offspring resem-
blance for specific cognitive abilities. Behavior Genetics, 10, 413–418. doi:10.
1007/BF01065603
Terman, L. M. (1916). The measurement of intelligence. Boston, MA: Houghton Mifflin.
Terman, L. M., & Merrill, M. A. (1937). Measuring intelligence. Boston, MA: Houghton
Mifflin.
Terman, L. M., & Merrill, M. A. (1960). Measuring intelligence. Cambridge, MA:
Houghton Mifflin.
Vogler, G. P., & DeFries, J. C. (1985). Bivariate path analysis of familial resemblance
for reading ability and symbol processing speed. Behavior Genetics, 15, 111–121.
doi:10.1007/BF01065892
Wechsler, D. (1946). The Wechsler–Bellevue Intelligence Scale. New York, NY: Psycho-
logical Corporation.
Wechsler, D. (1949). Wechsler Intelligence Scale for Children. New York, NY: Psycho-
logical Corporation.
Wechsler, D. (1955). Manual for the Wechsler Adult Intelligence Scale. New York, NY:
Psychological Corporation.
Wechsler, D. (1981). WAIS–R manual. New York, NY: Psychological Corporation.
Woodcock, R. W., & Johnson, M. B. (1989). Woodcock–Johnson Psycho-Educational
Battery—Revised. Allen, TX: DLM Teaching Resources.
Zill, N., & Peterson, J. L. (1986). Behavior Problems Index. Washington, DC: Child
Trends.