A Naïve Methodology for Imputing Missing Survey Information due to Survey Skip Conditions
Restricted (Penn State Only)
- Author:
- Hopkins, Grant
- Area of Honors:
- Statistics
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Le Bao, Thesis Supervisor
- Matthew D Beckman, Thesis Honors Advisor
- Keywords:
- Population-Based HIV Impact Assessments
- Survey
- Deduction-Complete Dataset
- Skip Condition
- Missing Data
- Sequentially Asked Questions
- Abstract:
- Surveying a sample to make inferences about a population is a fundamental role of statistics. In the simplest cases, a survey is administered to respondents who are selected via simple random sampling, all respondents answer all questions with no missing information, and the survey gives meaningful insight into a population of interest. In reality, however, it is often necessary to employ complex sampling designs in order to reach representative respondents and to collect large amounts of information without introducing survey fatigue. Moreover, respondents may refuse to answer certain questions, may not know the answer to certain questions, or may even provide inaccurate answers. For these reasons, it is neither feasible nor desirable to ask respondents questions that they have already answered, cannot answer, would likely decline to answer, or to which they would likely not know the answer. Such is the premise of the Population-Based HIV Impact Assessment (PHIA) survey, conducted across multiple countries in Sub-Saharan Africa to understand the status of the HIV epidemic in those countries. A particular challenge of analyzing the PHIA survey is that information about a respondent is obscured by survey skip conditions, which prevents an analyst from understanding why a respondent does not have an answer to a particular question. Perhaps the respondent’s answer can be deduced from an earlier question; perhaps an answer is impossible due to a logical inconsistency; perhaps the respondent’s answer is genuinely missing and may be predicted. This thesis proposes a naïve methodology that researchers can use to probabilistically predict missing information in an indicator variable in the context of large surveys that utilize skip conditions. First, I propose a variable selection method based upon marginal association with the indicator variable and the proportion of non-skipped values. Second, I discuss the need to impute skipped values among the predictor variables in order to have a fully specified predictor matrix upon which the response is modelled. Third, I implement the LASSO procedure to subset to a sparse set of predictor variables. Finally, I train a logistic regression model on respondents with non-missing indicator values, assess the model's performance, and apply the model to respondents with missing indicator values. In addition to researchers who wish to model data from surveys with skip conditions, designers of such surveys may take interest in the discussion surrounding data encoding. Surveys with skip conditions have great potential to discover niche behavioral patterns and risk factors by targeting questions based upon preceding responses. Improved data encodings will shed light on the subpopulations for which a particular pattern holds, and will also clarify the reasons for missing information throughout the survey. Ignoring this missing information may bias sample estimates of population parameters.
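The first step of the methodology, screening candidate predictors by their proportion of non-skipped values and their marginal association with the indicator, might be sketched roughly as follows. This is an illustration only, not the thesis's implementation: the abstract does not state which association measure or cutoffs are used, so the Pearson correlation on binary-coded answers, the `None` encoding for skipped values, the thresholds `min_observed` and `min_assoc`, and all variable names are assumptions made for this example.

```python
from math import sqrt

SKIPPED = None  # assumption: skipped or missing answers are encoded as None


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def screen_predictors(columns, indicator, min_observed=0.5, min_assoc=0.3):
    """Keep predictors that are mostly non-skipped and marginally associated
    with the indicator. Both thresholds are illustrative assumptions."""
    kept = {}
    for name, values in columns.items():
        # Share of respondents who actually answered this question.
        observed = sum(v is not SKIPPED for v in values) / len(values)
        # Use only respondents for whom predictor and indicator are both present.
        pairs = [(v, y) for v, y in zip(values, indicator)
                 if v is not SKIPPED and y is not SKIPPED]
        assoc = abs(pearson([p[0] for p in pairs], [p[1] for p in pairs]))
        if observed >= min_observed and assoc >= min_assoc:
            kept[name] = (round(observed, 2), round(assoc, 2))
    return kept


# Toy data with hypothetical variable names (not taken from the PHIA survey).
indicator = [1, 0, 1, 0, 1, 0, None, 1]
columns = {
    "tested_before": [1, 0, 1, 0, 1, 1, 0, 1],
    "mostly_skipped": [1, None, None, None, None, None, None, 0],
    "unrelated": [1, 1, 0, 0, 1, 1, 0, 0],
}
selected = screen_predictors(columns, indicator)
print(selected)
```

On this toy data only `tested_before` survives: `mostly_skipped` is answered by too few respondents, and `unrelated` shows only a weak association with the indicator. The retained variables would then go on to the later steps described in the abstract (imputing skipped predictor values, LASSO selection, and logistic regression), which this sketch does not attempt to reproduce.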