Preprocess Data to Handle Missing Variables — process

Process data to account for missingness in preparation for TMLE

process_missing(
  data,
  node_list,
  complete_nodes = c("A", "Y"),
  impute_nodes = NULL,
  max_p_missing = 0.5
)

Arguments

data,	`data.table`, the dataset, which may contain variables with missing values
node_list,	`list`, the variables that comprise each node
complete_nodes,	`character vector`, nodes that must be fully observed; rows with missing values in these nodes are dropped
impute_nodes,	`character vector`, nodes whose missing values will be imputed
max_p_missing,	`numeric`, the maximum tolerable proportion of missing values; a variable with more missingness than this is dropped from the analysis

Value

A list containing the following elements:

data, the updated dataset
node_list, the updated list of nodes
n_dropped, the number of observations dropped
dropped_cols, the variables dropped due to excessive missingness
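
As a rough illustration of how the returned list might be inspected, the sketch below builds a small toy dataset and calls the function only if it can be found. The column names, the W/A/Y node names, and the assumption that process_missing() is exported by the tmle3 package are illustrative and not taken from this page.

```r
# Toy inputs; column names and the W/A/Y node layout are illustrative assumptions.
toy <- data.frame(
  W1 = c(1.2, NA, 0.7, 2.1, 1.5, 0.9),
  W2 = c(0, 1, 1, NA, 0, 1),
  A  = c(1, 0, 1, 0, NA, 1),
  Y  = c(3.4, 2.2, NA, 4.0, 3.1, 2.8)
)
node_list <- list(W = c("W1", "W2"), A = "A", Y = "Y")

# Assumption: process_missing() is exported by the tmle3 package.
# The call is skipped when tmle3 or data.table is not installed.
if (requireNamespace("tmle3", quietly = TRUE) &&
    requireNamespace("data.table", quietly = TRUE)) {
  processed <- tmle3::process_missing(
    data.table::as.data.table(toy),
    node_list
  )
  str(processed$data)            # the updated dataset
  print(processed$node_list)     # the updated list of nodes
  print(processed$n_dropped)     # number of observations dropped
  print(processed$dropped_cols)  # variables dropped for excessive missingness
}
```

The updated node_list is returned alongside the data so that downstream steps can refer to whichever columns remain after processing.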

Details

Rows with a missing value in any of the complete_nodes are dropped. Missing values in the impute_nodes variables are then median-imputed, and indicator variables of missingness are generated for these nodes.
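
The row-dropping step can be sketched in base R as follows. This is a minimal illustration of the described behaviour, not the package's implementation; the toy data and object names such as `kept` are hypothetical.

```r
# Toy data: A is missing in row 5 and Y is missing in row 3.
toy <- data.frame(
  W1 = c(1.2, NA, 0.7, 2.1, 1.5, 0.9),
  A  = c(1, 0, 1, 0, NA, 1),
  Y  = c(3.4, 2.2, NA, 4.0, 3.1, 2.8)
)
node_list <- list(W = "W1", A = "A", Y = "Y")
complete_nodes <- c("A", "Y")

# Columns belonging to the nodes that must be fully observed.
complete_vars <- unlist(node_list[complete_nodes], use.names = FALSE)

# Keep only rows with no missing value in any of those columns.
keep <- stats::complete.cases(toy[, complete_vars, drop = FALSE])
kept <- toy[keep, , drop = FALSE]

# Count of observations removed, analogous to the n_dropped element.
n_dropped <- sum(!keep)
```

Missing values in W1 do not cause a row to be dropped here, because W is not among the complete nodes.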

Then covariates will be processed as follows:

any covariate whose proportion of missing values exceeds max_p_missing will be dropped
indicators of missingness will be generated
missing values will be median-imputed
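
The three covariate steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the `delta_` prefix, the coding of the indicator (1 = observed, 0 = missing), and the choice to add indicators only for covariates that have missing values are all hypothetical.

```r
# Toy covariates: W1 is 25% missing, W2 is 75% missing, W3 is complete.
covars <- data.frame(
  W1 = c(1.2, NA, 2.1, 0.9),
  W2 = c(NA, NA, NA, 1),
  W3 = c(5, 6, 7, 8)
)
max_p_missing <- 0.5

# Step 1: drop covariates whose share of missing values exceeds the threshold.
p_missing <- colMeans(is.na(covars))
dropped_cols <- names(p_missing)[p_missing > max_p_missing]
covars <- covars[, setdiff(names(covars), dropped_cols), drop = FALSE]

# Steps 2 and 3: for each remaining covariate with missing values,
# add an indicator column, then fill the gaps with the observed median.
for (v in names(covars)) {
  miss <- is.na(covars[[v]])
  if (any(miss)) {
    # Hypothetical naming and coding: 1 = observed, 0 = missing.
    covars[[paste0("delta_", v)]] <- as.integer(!miss)
    covars[[v]][miss] <- stats::median(covars[[v]], na.rm = TRUE)
  }
}
```

Creating the indicator before imputing matters: once the gaps are filled, the information about which values were originally missing can no longer be recovered from the covariate itself.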