Process data to account for missingness in preparation for TMLE

process_missing(data, node_list, complete_nodes = c("A", "Y"),
  impute_nodes = NULL, max_p_missing = 0.5)

Arguments

data

data.table, the dataset containing the variables with missing values

node_list

list, mapping each node name to the variables that comprise it

complete_nodes

character vector, nodes that must be fully observed; rows with missing values in these nodes are dropped

impute_nodes

character vector, nodes whose missing values will be median-imputed

max_p_missing

numeric, the maximum tolerable proportion of missing values in a covariate. Covariates with a higher proportion are dropped from the analysis

Value

list containing the following elements:

  • data, the updated dataset

  • node_list, the updated list of nodes

  • n_dropped, the number of observations dropped

  • dropped_cols, the variables dropped due to excessive missingness

Details

Rows with missing values in any of the complete_nodes are dropped. Then, missing values in the impute_nodes variables are median-imputed, and missingness indicator variables are generated for these nodes.
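The row-dropping step can be illustrated with a minimal base R sketch. This is not the package's implementation; the toy data and the names complete_vars, keep, and df_complete are assumptions made for illustration only.

```r
# Toy data and node list (hypothetical, for illustration only)
node_list <- list(W = "W1", A = "A", Y = "Y")
df <- data.frame(
  W1 = c(1, NA, 3, 4),
  A  = c(1, 0, NA, 1),
  Y  = c(0.2, 0.5, 0.1, NA)
)
complete_nodes <- c("A", "Y")

# Variables belonging to the nodes that must be fully observed
complete_vars <- unlist(node_list[complete_nodes], use.names = FALSE)

# Keep only rows with no missing values in those variables
keep <- complete.cases(df[complete_vars])
n_dropped <- sum(!keep)
df_complete <- df[keep, , drop = FALSE]
```

In this sketch, missing values in W1 are left untouched, since W1 is not part of a complete node.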

Then covariates will be processed as follows:

  1. any covariate whose proportion of missing values exceeds max_p_missing will be dropped

  2. indicators of missingness will be generated

  3. missing values will be median-imputed
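The three covariate steps can be sketched in base R as follows. This is an illustrative approximation, not the package's code: the helper sketch_process_covariates is hypothetical, the delta_ prefix and the coding of the indicator (1 = observed, 0 = missing) are assumptions, and all covariates are assumed to be numeric.

```r
# Hypothetical helper, not part of the package: mimics the three steps above
sketch_process_covariates <- function(df, covariates, max_p_missing = 0.5) {
  # Proportion of missing values per covariate
  p_missing <- vapply(df[covariates], function(x) mean(is.na(x)), numeric(1))

  # Step 1: drop covariates with too much missingness
  dropped <- covariates[p_missing > max_p_missing]
  kept <- setdiff(covariates, dropped)
  df <- df[setdiff(names(df), dropped)]

  for (v in kept[p_missing[kept] > 0]) {
    # Step 2: missingness indicator (assumed coding: 1 = observed, 0 = missing)
    df[[paste0("delta_", v)]] <- as.numeric(!is.na(df[[v]]))
    # Step 3: median-impute the missing values
    df[[v]][is.na(df[[v]])] <- median(df[[v]], na.rm = TRUE)
  }

  list(data = df, dropped_cols = dropped)
}

toy <- data.frame(
  W1 = c(1, NA, 3, 5),   # 25% missing: kept, flagged, imputed
  W2 = c(NA, NA, NA, 2), # 75% missing: dropped at the default threshold
  W3 = c(1, 2, 3, 4)     # fully observed: left unchanged
)
res <- sketch_process_covariates(toy, c("W1", "W2", "W3"))
```

The indicator is created before imputation so that the information about which values were originally missing is not lost.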