Best Practices Throughout the Research Lifecycle

Project plans should involve decisions on the following data and documentation topics, many of which are related to the core data management plan. Documentation should be as much a part of project planning as data-related considerations, such as data collection, questionnaire construction, or analysis plans.

Initial Questions to Consider

Data and file structure: What is the data file going to look like and how will it be organized? What is the unit of analysis?

Naming conventions: How will files and variables be named? What naming conventions will be used to achieve consistency?

Data integrity: How will data be input or captured? Will the variable formats be numeric or character? What checks will be used to find invalid values, inconsistent responses, incomplete records, etc.? What checks will be used to manage the data versions as the files move through data entry, cleaning, and analysis?
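The kinds of checks asked about above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the variable names (ID, SEX, EMPLOYED, HOURS), the allowed values, and the consistency rule are all hypothetical and would come from a project's own codebook.

```python
from typing import Any

# Hypothetical codebook: allowed values per variable (assumed for illustration).
VALID_VALUES = {
    "SEX": {1, 2},
    "EMPLOYED": {0, 1},
}
# Hypothetical list of fields every record must contain.
REQUIRED = ("ID", "SEX", "EMPLOYED", "HOURS")


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record."""
    problems = []
    # Incomplete record: a required field is absent or empty.
    for name in REQUIRED:
        if record.get(name) is None:
            problems.append(f"missing {name}")
    # Invalid value: outside the codebook's allowed set.
    for name, allowed in VALID_VALUES.items():
        value = record.get(name)
        if value is not None and value not in allowed:
            problems.append(f"invalid {name}={value}")
    # Inconsistent response: not employed, yet reports working hours.
    if record.get("EMPLOYED") == 0 and (record.get("HOURS") or 0) > 0:
        problems.append("inconsistent EMPLOYED/HOURS")
    return problems


records = [
    {"ID": 1, "SEX": 1, "EMPLOYED": 1, "HOURS": 40},
    {"ID": 2, "SEX": 3, "EMPLOYED": 0, "HOURS": 20},
    {"ID": 3, "SEX": 2, "EMPLOYED": 1, "HOURS": None},
]
report = {r["ID"]: check_record(r) for r in records}
```

Running the same checks after data entry, after cleaning, and before analysis gives a simple way to confirm that each version of the file still passes.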

Preparing dataset documentation: What will the dataset documentation or metadata look like and how will it be produced? How much is necessary for future retrieval and archival processing? What documentation standard will be used?

Variable construction: What variables will be constructed following the collection of the original data? How will these be named and documented?

Project documentation: What steps will be taken to document decisions that are made as the project unfolds? How will information be recorded on field procedures, coding decisions, variable construction, and the like? Research project Web sites and various Intranet options are increasingly used for capturing this kind of information, and Research Connections is prepared to include Web-based information in deposits.

Variable Names

It is important to remember that the variable name is the referent that analysts will use most often when working with the data. At a minimum, it should convey correct information, and ideally it should be unambiguous in terms of content.

Question numbers: Variable names may also correspond to question numbers, e.g., Q1, Q2a, Q2b, ..., Qn. This approach ties variable names directly to the original questionnaire, but, like purely sequential ("one-up") names, such names are not easily remembered. Further, a single question often yields several distinct variables whose names need added letters or numbers (e.g., Q12a, Q12a1) that may not appear on the questionnaire.

Mnemonic names: Short variable names that represent the substantive meaning of variables have some advantages, in that they are recognizable and memorable.

Prefix, root, suffix systems: A more systematic approach involves constructing variable names containing a root, a prefix, and possibly a suffix. For example, all variables having to do with education might have the root ED. Mother's education might then be MOED, father's education FAED, and so on.
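A prefix, root, suffix scheme is easy to apply consistently if names are generated rather than typed by hand. The sketch below assumes small lookup tables; only ED, MOED, and FAED come from the example above, and the suffix "2" (say, for a second wave) is a hypothetical extension.

```python
from typing import Optional

# Component tables. ED, MO, and FA follow the example in the text;
# any further entries would be project-specific assumptions.
ROOTS = {"education": "ED"}
PREFIXES = {"mother": "MO", "father": "FA"}


def make_name(topic: str, person: Optional[str] = None, suffix: str = "") -> str:
    """Assemble a variable name as prefix + root + suffix."""
    prefix = PREFIXES[person] if person is not None else ""
    return f"{prefix}{ROOTS[topic]}{suffix}"


names = [make_name("education", person) for person in ("mother", "father")]
# Hypothetical suffix marking a second wave of data collection.
wave2 = make_name("education", "mother", suffix="2")
```

Because an unknown topic or person raises a KeyError, a name cannot be built from components that are missing from the documented tables.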

Variable Labels

Most statistical programs permit the user to link extended labels for each variable to the variable name. Variable labels are extremely important. They should provide at least three pieces of information:

  1. the item or question number in the original data collection instrument (unless the item number is part of the variable name),
  2. a clear indication of the variable's content, and
  3. an indication of whether the variable is constructed from other items.

If the number of characters available for labels is limited, one should develop a set of standard abbreviations in advance and present it as part of the documentation for the dataset.
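As an illustration of the three pieces of information and of a standard abbreviation list, the following sketch builds labels under an assumed 40-character limit. The limit, the abbreviations, and the item numbers are hypothetical; actual limits depend on the statistical package in use.

```python
# Hypothetical abbreviation list and length limit, assumed for illustration.
ABBREVIATIONS = {"education": "educ", "constructed": "constr"}
MAX_LABEL = 40


def make_label(item: str, content: str, constructed: bool = False) -> str:
    """Build a label from item number, content, and a constructed flag."""
    label = f"{item}: {content}"
    if constructed:
        label += " (constructed)"
    # Apply the standard abbreviations only when the label is too long.
    if len(label) > MAX_LABEL:
        for word, short in ABBREVIATIONS.items():
            label = label.replace(word, short)
    if len(label) > MAX_LABEL:
        raise ValueError(f"label still exceeds {MAX_LABEL} characters: {label}")
    return label


direct = make_label("Q12a", "Mother's highest level of education")
derived = make_label("Q12a+b", "Parents' education", constructed=True)
```

Keeping the abbreviation table in one place means the same table can be printed in the dataset documentation.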

Codes and Coding

Common coding conventions (a) ensure that all statistical software packages will be able to handle the data, and (b) promote greater measurement comparability.

Guidelines to keep in mind while coding:

Identification variables: Provide fields at the beginning of each record to accommodate all identification variables. Identification variables often include a unique study number and a respondent number to represent each case.
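One basic check on identification variables is that each case is identified exactly once. A minimal sketch, assuming hypothetical field names STUDY and RESPID for the study number and respondent number:

```python
from collections import Counter


def duplicate_ids(records: list[dict]) -> list[tuple]:
    """Return (study, respondent) pairs that identify more than one record."""
    counts = Counter((r["STUDY"], r["RESPID"]) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


# STUDY and RESPID are assumed names, not a required layout.
cases = [
    {"STUDY": 1001, "RESPID": 1},
    {"STUDY": 1001, "RESPID": 2},
    {"STUDY": 1001, "RESPID": 2},
]
duplicates = duplicate_ids(cases)
```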

Code categories: Code categories should be mutually exclusive, exhaustive, and precisely defined.
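For categories defined over a numeric range, such as age brackets, both properties can be checked mechanically. The sketch below assumes inclusive integer ranges and a known overall range; the codes and bracket boundaries are made up for illustration.

```python
def check_brackets(brackets: dict, low: int, high: int) -> list[str]:
    """Check that inclusive integer ranges neither overlap nor leave gaps.

    brackets maps a code to a (lowest, highest) pair; low and high bound
    the values the variable can take.
    """
    problems = []
    expected = low  # next value that still needs a category
    for code, (lo, hi) in sorted(brackets.items(), key=lambda kv: kv[1][0]):
        if lo < expected:
            # Not mutually exclusive: this range reuses an earlier value.
            problems.append(f"code {code} overlaps the previous category")
        elif lo > expected:
            # Not exhaustive: some values fall between categories.
            problems.append(f"gap before code {code}: {expected}-{lo - 1} uncoded")
        expected = max(expected, hi + 1)
    if expected <= high:
        problems.append(f"values {expected}-{high} uncoded")
    return problems


good = {1: (18, 29), 2: (30, 44), 3: (45, 64), 4: (65, 120)}
bad = {1: (18, 30), 2: (30, 44), 3: (50, 64)}
good_report = check_brackets(good, 18, 120)
bad_report = check_brackets(bad, 18, 120)
```

A check like this covers only the first two properties; whether a category is precisely defined remains a matter for the written documentation.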

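Preserving original information: Code as much detail as possible. Recording original values, such as exact age or income, rather than collapsed brackets lets later analysts construct whatever categories they need.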

Closed-ended questions: Responses to survey questions that are precoded in the questionnaire should retain this coding scheme in the machine-readable data to avoid errors and confusion.

Open-ended questions: For open-ended items, investigators can either use a predetermined coding scheme or review the initial survey responses to construct a coding scheme based on major categories that emerge.

Check-coding: It is a good idea to verify or check-code some cases during the coding process, that is, to have an independent coder repeat the process on those cases.
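Once a second coder has recoded a sample of cases, the two sets of codes can be compared case by case. This sketch reports simple percent agreement and the disagreeing cases; the case IDs and codes are invented, and a project might prefer a chance-corrected agreement measure instead.

```python
def compare_codings(first: dict, second: dict):
    """Compare two coders' codes for the cases both of them coded.

    Returns the share of cases with identical codes and a list of
    (case_id, first_code, second_code) for each disagreement.
    """
    shared = sorted(first.keys() & second.keys())
    disagreements = [
        (case_id, first[case_id], second[case_id])
        for case_id in shared
        if first[case_id] != second[case_id]
    ]
    if not shared:
        return float("nan"), disagreements
    return 1 - len(disagreements) / len(shared), disagreements


# Hypothetical codes assigned to four open-ended responses.
coder_a = {101: 3, 102: 1, 103: 2, 104: 2}
coder_b = {101: 3, 102: 1, 103: 4, 104: 2}
agreement, to_review = compare_codings(coder_a, coder_b)
```

The list of disagreements is the useful output in practice, since each one points to a case or a category definition that needs review.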