What is a codebook?

A codebook describes the contents, structure, and layout of a data collection. A well-documented codebook "contains information intended to be complete and self-explanatory for each variable in a data file." [1]

Codebooks begin with basic front matter, including the study title, name of the principal investigator(s), table of contents, and an introduction describing the purpose and format of the codebook. Some codebooks also include methodological details, such as how weights were computed, and the data collection instruments. Others, especially those for larger or more complex data collections, leave those details for a separate user guide and/or data collection instrument.

The main body of a codebook contains unambiguous variable-level details. As shown in the example below from the National Longitudinal Survey of Youth, 1979 [2], these include the following:


[Figure: example codebook entry, "Assessment of R's General Health"]

  • Variable name: The name or number assigned to each variable in the data collection. Some researchers prefer to use mnemonic abbreviations (e.g., EMPLOY1), while others use alphanumeric patterns (e.g., VAR001). For survey data, try to name variables after the question numbers (e.g., Q1, Q2b). [In the example above, H40-SF12-2]
  • Variable label: A brief description to identify the variable for the user. Where possible, use the exact question or research wording. ["SF12 - ASSESSMENT OF R'S GENERAL HEALTH"]
  • Question text: Where applicable, the exact wording from survey questions. ["In general, would you say your health is . . ."]
  • Values: The actual coded values in the data for this variable. [1, 2, 3, 4, 5]
  • Value labels: The textual descriptions of the codes. [Excellent, Very Good, Good, Fair, Poor]
  • Summary statistics: Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, frequency counts and the percentage of cases that each value represents are appropriate. For continuous variables, minimum, maximum, and median values are relevant.
  • Missing data: Where applicable, the values and labels of missing data. Missing data can bias an analysis, so it is important to document. Remember to describe all missing codes, including "system missing" and blank. [e.g., Refusal (-1)]
  • Universe and skip patterns: Where applicable, information about the population to which the variable refers, as well as the preceding and following variables. [e.g., Default Next Question: H00035.00]
  • Notes: Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.
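The fields listed above can be pictured as a small record per variable. The following Python sketch is only an illustration, not a standard codebook format: the class name `CodebookEntry`, its field names, and the helper functions are hypothetical, and the response values are made up for the demonstration rather than taken from the NLSY79 data. It also shows one way to produce the unweighted summary statistics described above.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import median
from typing import Dict


@dataclass
class CodebookEntry:
    """One variable-level entry; field names are illustrative assumptions."""
    name: str
    label: str
    question_text: str = ""
    value_labels: Dict[int, str] = field(default_factory=dict)
    missing_codes: Dict[int, str] = field(default_factory=dict)
    universe: str = ""
    notes: str = ""


def frequencies(entry, responses):
    """Unweighted count and percentage per coded value, missing codes included."""
    counts = Counter(responses)
    total = len(responses)
    labels = {**entry.value_labels, **entry.missing_codes}
    return [
        (code, labels.get(code, "UNDOCUMENTED"), counts[code],
         round(100 * counts[code] / total, 1))
        for code in sorted(counts)
    ]


def continuous_summary(values):
    """Minimum, maximum, and median for a continuous variable."""
    return {"min": min(values), "max": max(values), "median": median(values)}


# Name, label, question text, and codes are taken from the example in the text.
general_health = CodebookEntry(
    name="H40-SF12-2",
    label="SF12 - ASSESSMENT OF R'S GENERAL HEALTH",
    question_text="In general, would you say your health is . . .",
    value_labels={1: "Excellent", 2: "Very Good", 3: "Good", 4: "Fair", 5: "Poor"},
    missing_codes={-1: "Refusal"},
)

# Made-up responses for illustration only; not real survey data.
responses = [1, 2, 2, 3, 3, 3, 4, 5, -1, 2]

for code, label, n, pct in frequencies(general_health, responses):
    print(f"{code:>3}  {label:<12} {n:>3}  {pct:>5.1f}%")
```

Any code that appears in the data but is missing from both dictionaries is reported as "UNDOCUMENTED", which is a quick way to spot values the codebook has not yet described.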

For variables that are compiled, created, or constructed, such as the examples below from the Aging of Veterans of the Union Army: Military, Pension, and Medical Records, 1820-1940 [3] study and the Welfare, Children, and Families: A Three-City Study [4], fewer details are needed: the variable name and label, plus a description of how the data were compiled or created.


[Figures: example entries for the constructed variables "Siblings" and "Illegal Activities"]

The order of variable descriptions in the codebook usually matches the order of the data. To enhance usability on complex or larger data collections, researchers sometimes add appendices listing variable names and labels alphabetically, by sample characteristic, or according to the substantive groups to which they belong (e.g., Demographic Variables, Health Status Variables). These appendices help users locate variables of interest.
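As a rough sketch of how such appendices could be generated, the Python below builds an alphabetical index and a grouped index from a list of variables kept in data-file order. The variable labels, the group assignments, and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical (name, label, group) triples, listed in data-file order.
variables = [
    ("Q1", "Age of respondent", "Demographic Variables"),
    ("Q2b", "Self-rated general health", "Health Status Variables"),
    ("EMPLOY1", "Current employment status", "Demographic Variables"),
]


def alphabetical_index(variables):
    """Variables sorted by name, ignoring case."""
    return sorted(variables, key=lambda v: v[0].lower())


def grouped_index(variables):
    """Map each substantive group to its (name, label) pairs, sorted by name."""
    groups = defaultdict(list)
    for name, label, group in variables:
        groups[group].append((name, label))
    return {group: sorted(items) for group, items in sorted(groups.items())}


for group, items in grouped_index(variables).items():
    print(group)
    for name, label in items:
        print(f"  {name:<8} {label}")
```

The main body of the codebook keeps the data-file order; only the appendices are re-sorted.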

Codebooks come in a variety of shapes and formats. As long as the content is complete and self-explanatory, the stylistic touches can match the needs of the research project.


Additional Examples

Below are additional examples of variable-level details from a wide variety of research codebooks.

American National Election Study, 2008-2009 Panel Study [5]

[Figure: example codebook entry, "Does R like or dislike Joe Biden"]

National Longitudinal Study of Adolescent Health (Add Health), 1994-1995 [6]

General Social Surveys, 1972-2008 [7]

National Survey on Drug Use and Health, 2009 [8]

Capital Punishment in the United States, 1973-2008 [9]

Resources

UK Data Archive, "Documenting Your Data/Data Level/Structured Tabular Data"

http://www.data-archive.ac.uk/create-manage/document/data-level?index=1

Institute for Health and Care Research Quality Handbook

http://www.emgo.nl/kc/codebook/

Princeton University Data and Statistical Services, "How to Use a Codebook"

http://dss.princeton.edu/online_help/analysis/codebook.htm

UCLA Social Science Data Archive, "Codebooks"

http://dataarchives.ss.ucla.edu/tutor/tutcode.htm



References


[1] Guide to the NLSY97 Data. Retrieved August 1, 2011, from http://www.nlsinfo.org/nlsy97/97guide/chap3.htm#threethree

[2] Ohio State University. Center for Human Resource Research. National Longitudinal Survey of Youth, 1979 [Computer file]. ICPSR04683-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2007-09-17. doi:10.3886/ICPSR04683

[3] Fogel, Robert W., et al. Aging of Veterans of the Union Army: Military, Pension, and Medical Records, 1820-1940 [Computer file]. ICPSR06837-v6. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2006-06-05. doi:10.3886/ICPSR06837

[4] Angel, Ronald, Linda Burton, P. Lindsay Chase-Lansdale, Andrew Cherlin, and Robert Moffitt. Welfare, Children, and Families: A Three-City Study [Computer file]. ICPSR04701-v7. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2009-02-10. doi:10.3886/ICPSR04701

[5] American National Election Study, 2008-2009 Panel Study Frequency codebook, version 20090903. Retrieved August 1, 2011, from http://electionstudies.org/studypages/2008_2009panel/anes2008_2009panel_fcodebook.txt

[6] National Longitudinal Study of Adolescent Health (Add Health), Wave I School Administrator Codebook. Retrieved August 1, 2011, from http://www.cpc.unc.edu/projects/addhealth/codebooks/wave1/index.html

[7] Davis, James A., Tom W. Smith, and Peter V. Marsden. General Social Surveys, 1972-2008 [Cumulative File] [Computer file]. ICPSR25962-v2. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut/Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2010-02-08. doi:10.3886/ICPSR25962

[8] United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Office of Applied Studies. National Survey on Drug Use and Health, 2009 [Computer file]. ICPSR29621-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-11-16. doi:10.3886/ICPSR29621

[9] United States Department of Justice. Office of Justice Programs. Bureau of Justice Statistics. Capital Punishment in the United States, 1973-2008 [Computer file]. ICPSR27982-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-09-07. doi:10.3886/ICPSR27982

Research Connections is supported by grant #90YE0104 from the Office of Planning, Research and Evaluation, Administration for Children and Families, U.S. Department of Health and Human Services. The contents are solely the responsibility of the National Center for Children in Poverty and the Inter-university Consortium for Political and Social Research and do not necessarily represent the official views of the Office of Planning, Research and Evaluation, the Administration for Children and Families, or the U.S. Department of Health and Human Services.
