Protecting Respondent Confidentiality

Two kinds of variables often found in social science datasets present problems that could endanger the confidentiality of research subjects: direct and indirect identifiers.

Direct identifiers: these are variables that point explicitly to particular individuals or units. For instance, Social Security numbers uniquely identify individuals who are registered with the Social Security Administration. Any variable that functions as an explicit name can be a direct identifier -- for example, a license number, phone number, or mailing address. Data depositors should carefully consider the analytic role that such variables fulfill and should remove any identifiers not necessary for analysis.

Indirect identifiers: data depositors should also carefully consider a second class of problematic variables -- indirect identifiers. Such variables make unique cases visible. For instance, a United States ZIP code field may not be troublesome on its own, but when combined with other attributes like race and annual income, a ZIP code may identify unique individuals (e.g., extremely wealthy or poor) within that ZIP code, which means that answers the respondent thought would be private are no longer private.
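
As an illustration of the ZIP code example, the following minimal Python sketch counts how many records share each combination of indirect identifiers and flags the combinations that occur only once. The function name find_unique_cases, the field names (zip, race, income_band), and the records themselves are hypothetical and invented for this example. It is one simple way to screen for unique cases, not a procedure prescribed by the archive.

```python
from collections import Counter


def find_unique_cases(records, identifier_fields):
    """Return the records whose combination of values in
    identifier_fields appears exactly once in the dataset."""
    def key(record):
        return tuple(record[name] for name in identifier_fields)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] == 1]


# Invented toy data: every respondent lives in the same ZIP code.
records = [
    {"zip": "12345", "race": "A", "income_band": "50-75k"},
    {"zip": "12345", "race": "A", "income_band": "50-75k"},
    {"zip": "12345", "race": "B", "income_band": "50-75k"},
    {"zip": "12345", "race": "B", "income_band": "50-75k"},
    {"zip": "12345", "race": "A", "income_band": "500k+"},
]

# ZIP code alone singles out no one.
unique_by_zip = find_unique_cases(records, ["zip"])

# ZIP code combined with race and income isolates one respondent.
unique_by_combo = find_unique_cases(records, ["zip", "race", "income_band"])

print(len(unique_by_zip))    # 0
print(len(unique_by_combo))  # 1
```

In this toy dataset the ZIP code by itself is shared by all five records, but adding race and income band leaves the single high-income respondent with a combination no one else has. A count of this kind only flags candidates for review. Whether a flagged variable actually needs treatment remains a judgment for the principal investigator.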

Treating indirect identifiers: if, in the judgment of the principal investigator, a variable might act as an indirect identifier, the investigator should treat that variable in a special manner when preparing a public-use dataset. Commonly used types of treatment are as follows:

Data producers can consult with Research Connections staff to design public-use datasets that maintain the confidentiality of respondents and are of maximum utility for all users. The staff will also perform an independent confidentiality review of datasets submitted to the archive and will work with the investigators to resolve any remaining problems of confidentiality.