Process data to account for missingness in preparation for TMLE

process_missing(data, node_list, complete_nodes = c("A", "Y"),
  impute_nodes = NULL, max_p_missing = 0.5)



data: data.table containing the variables with missing values


node_list: list specifying which variables comprise each node


complete_nodes: character vector of nodes that must be fully observed


impute_nodes: character vector of nodes whose missing values will be imputed


max_p_missing: numeric, the maximum tolerable proportion of missing values; a covariate exceeding it is dropped from the analysis


list containing the following elements:

  • data, the updated dataset

  • node_list, the updated list of nodes

  • n_dropped, the number of observations dropped

  • dropped_cols, the variables dropped due to excessive missingness


Rows with a missing value in any of the complete_nodes are dropped. Missing values in the impute_nodes variables are then median-imputed, and an indicator variable of missingness is generated for each of these nodes.
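
The row-dropping step can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data, the one-column-per-node node_list, and the use of a plain data.frame rather than a data.table are all assumptions made for illustration.

```r
# Toy data (hypothetical): W is a covariate, A the treatment, Y the outcome.
toy <- data.frame(
  W = c(1.2, NA, 0.7, 3.1, 2.2),
  A = c(1, 0, NA, 1, 0),
  Y = c(0.5, 1.1, 0.9, NA, 2.0)
)

# Hypothetical node_list: each node maps to one or more column names.
node_list <- list(W = "W", A = "A", Y = "Y")
complete_nodes <- c("A", "Y")

# Columns that must be fully observed.
complete_vars <- unlist(node_list[complete_nodes], use.names = FALSE)

# Keep only rows with no missing value in any of those columns.
keep <- complete.cases(toy[, complete_vars, drop = FALSE])
n_dropped <- sum(!keep)
toy_complete <- toy[keep, , drop = FALSE]
```

Rows with a missing A or Y are removed, while a row that is missing only the covariate W is retained for later imputation.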

Then covariates will be processed as follows:

  1. any covariate whose proportion of missing values exceeds max_p_missing will be dropped

  2. indicators of missingness will be generated

  3. missing values will be median-imputed
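
The three covariate steps above can be sketched in base R as follows. This is a hedged illustration rather than the function's actual code: the toy data, the missing_ prefix for indicator columns, the coding of the indicator (1 = missing), and the choice to create indicators only for covariates that have missing values are all assumptions.

```r
# Toy covariates (hypothetical data).
toy <- data.frame(
  W1 = c(1, NA, 3, 4, 5, 6),      # 1/6 missing: kept and imputed
  W2 = c(NA, NA, NA, NA, 2, 1),   # 4/6 missing: exceeds the threshold
  W3 = c(10, 20, 30, 40, 50, 60)  # fully observed
)
max_p_missing <- 0.5

# Step 1: drop covariates whose missing proportion exceeds max_p_missing.
p_missing <- colMeans(is.na(toy))
dropped_cols <- names(p_missing)[p_missing > max_p_missing]
processed <- toy[, p_missing <= max_p_missing, drop = FALSE]

# Steps 2 and 3: for each remaining covariate with missing values,
# add an indicator column, then median-impute the missing entries.
for (v in names(processed)) {
  miss <- is.na(processed[[v]])
  if (any(miss)) {
    # Assumed naming and coding: 1 marks an originally missing value.
    processed[[paste0("missing_", v)]] <- as.numeric(miss)
    processed[[v]][miss] <- median(processed[[v]], na.rm = TRUE)
  }
}
```

The indicator is computed before imputation so that it records which values were originally missing.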