Chapter 3 Datasets

3.1 International Stroke Trial Example Dataset

The International Stroke Trial database contains individual patient data from the International Stroke Trial (IST), a multi-national randomized trial conducted between 1991 and 1996 (pilot phase between 1991 and 1993) that aimed to assess whether early administration of aspirin, heparin, both, or neither influenced the clinical course of acute ischaemic stroke (Sandercock et al. 1997). The IST dataset includes data on 19,435 patients with acute stroke, with 99% complete follow-up. De-identified data are available for download at https://datashare.is.ed.ac.uk/handle/10283/128. The study is described in more detail in the block quote at the end of this section (Sandercock, Niewada, and Członkowska 2011). In the example data for this workshop, we consider a sample of 5,000 patients and the binary outcome of recurrent ischemic stroke within 14 days after randomization. We have also ensured that the example data include subjects with a missing outcome. The data dictionary, ist_variables.pdf, is available in the data folder.

library(tidyverse)

# read in data
ist <- read_csv("https://raw.githubusercontent.com/tlverse/deming2019-workshop/master/data/ist_sample.csv")
ist

# A tibble: 5,000 x 26
   RDELAY RCONSC SEX     AGE RSLEEP RATRIAL RCT   RVISINF RHEP24 RASP3  RSBP
    <dbl> <chr>  <chr> <dbl> <chr>  <chr>   <chr> <chr>   <chr>  <chr> <dbl>
 1     46 F      F        85 N      N       N     N       Y      N       150
 2     33 F      M        71 Y      Y       Y     Y       N      Y       180
 3      6 D      M        88 N      Y       N     N       N      N       140
 4      8 F      F        68 Y      N       Y     Y       N      N       118
 5     13 F      M        60 N      N       Y     N       N      N       140
 6     16 F      F        71 Y      N       Y     N       N      N       160
 7      6 F      M        71 Y      N       N     N       N      Y       130
 8     15 F      M        84 N      N       Y     N       Y      N       160
 9      9 D      F        81 N      N       N     N       N      Y       138
10     20 F      F        70 Y      N       N     N       N      N       170
# … with 4,990 more rows, and 15 more variables: RDEF1 <chr>, RDEF2 <chr>,
#   RDEF3 <chr>, RDEF4 <chr>, RDEF5 <chr>, RDEF6 <chr>, RDEF7 <chr>,
#   RDEF8 <chr>, STYPE <chr>, RXHEP <chr>, REGION <chr>,
#   MISSING_RATRIAL_RASP3 <dbl>, MISSING_RHEP24 <dbl>, RXASP <dbl>,
#   DRSISC <dbl>

For the purposes of this workshop, we start by treating the data as independent and identically distributed (i.i.d.) random draws from a very large target population. We could account for the clustering of the data (within sampled geographic regions), but for simplicity we avoid these details in the workshop presentations. Modifications of our methodology for biased samples, repeated measures, and similar settings are available.

We have 26 variables measured, of which 1 variable is set to be the outcome of interest. This outcome, \(Y\), indicates recurrent ischemic stroke within 14 days after randomization (DRSISC in ist); the treatment of interest, \(A\), is the randomized aspirin vs. no aspirin treatment allocation (RXASP in ist); and the adjustment set, \(W\), consists simply of the other variables measured at baseline. In these data the outcome is occasionally missing, but we do not need to create a variable indicating this missingness (such as \(\Delta\)) for analyses in the tlverse. If we let \(\Delta\) denote the indicator that the outcome is observed, such that \(\Delta = 1\) when the outcome is observed and \(\Delta = 0\) when it is missing, then we can denote our observed data structure as \(n\) i.i.d. copies of \(O_i = (W_i, A_i, \Delta_i, \Delta_i Y_i)\), for \(i = 1, \ldots, n\).
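To make the notation concrete, the following is a minimal base R sketch of how \(\Delta\) and \(\Delta Y\) could be built by hand. The data frame toy and its values are hypothetical (only the column names AGE, RXASP, and DRSISC come from ist), and the columns Delta and Delta_Y are names introduced here for illustration. As noted above, the tlverse does not require this step.

```r
# Toy data: hypothetical values, NOT rows from the IST sample.
toy <- data.frame(
  AGE    = c(85, 71, 88, 68, 60),  # one baseline covariate (part of W)
  RXASP  = c(1, 0, 0, 1, 0),       # treatment A
  DRSISC = c(0, NA, 1, 0, NA)      # outcome Y, missing for two subjects
)

# Delta = 1 when the outcome is observed, 0 when it is missing.
toy$Delta <- as.integer(!is.na(toy$DRSISC))

# Delta * Y: equals Y when observed; set to 0 by convention when Y is missing,
# so the product is always defined.
toy$Delta_Y <- ifelse(toy$Delta == 1, toy$DRSISC, 0)

toy
```

Each row of toy now holds one copy of \((W_i, A_i, \Delta_i, \Delta_i Y_i)\), with a single covariate standing in for \(W_i\).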

Using the skimr package, we can quickly summarize the variables in our data:

library(skimr)
skim(ist)

(#tab:skim_ist_data)Data summary
Name	ist
Number of rows	5000
Number of columns	26
_______________________
Column type frequency:
character	19
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
RCONSC	1	1	1	3
SEX	1	1	1	2
RSLEEP	1	1	1	2
RATRIAL	1	1	1	3
RCT	1	1	1	2
RVISINF	1	1	1	2
RHEP24	1	1	1	3
RASP3	1	1	1	3
RDEF1	1	1	1	3
RDEF2	1	1	1	3
RDEF3	1	1	1	3
RDEF4	1	1	1	3
RDEF5	1	1	1	3
RDEF6	1	1	1	3
RDEF7	1	1	1	3
RDEF8	1	1	1	3
STYPE	1	3	4	5
RXHEP	1	1	1	4
REGION	1	10	26	7

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
RDELAY	0	1	20.14	12.43	1	9	19	29	48	▇▆▆▃▂
AGE	0	1	71.93	11.65	16	65	74	81	99	▁▁▃▇▂
RSBP	0	1	160.62	27.84	71	140	160	180	290	▁▇▇▁▁
MISSING_RATRIAL_RASP3	0	1	0.05	0.22	0	0	0	0	1	▇▁▁▁▁
MISSING_RHEP24	0	1	0.02	0.13	0	0	0	0	1	▇▁▁▁▁
RXASP	0	1	0.50	0.50	0	0	0	1	1	▇▁▁▁▇
DRSISC	10	1	0.02	0.15	0	0	0	0	1	▇▁▁▁▁

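The tables above give a convenient summary of the relevant variables. Among the numeric variables, only the outcome, DRSISC, has missing values (10 of the 5,000 observations).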

"The study had a prospective, randomised, open treatment, blinded outcome (PROBE) design. The inclusion criteria were: clinical diagnosis of acute ischaemic stroke, with onset within the previous 48 hours and no clear indication for, or clear contraindication to, treatment with aspirin or subcutaneous heparin. Unlike many stroke trials of that era (and subsequently), the study did not set an upper age limit. Patients were to have a CT brain scan to confirm the diagnosis of stroke, and this was to be done before randomisation if at all possible. To enter a patient in the study, the clinician telephoned a central randomisation service (at the Clinical Trial Service Unit, Oxford) during this telephone call, the baseline variables were entered and checked, and once validated, the computer allocated the treatment and the telephonist then informed the clinician. The patients and treating clinicians were not blinded to the treatment given. Early outcome data were collected by the treating physician who completed a follow-up form at 14 days, death or hospital discharge (whichever occurred first). This form recorded data on events in hospital within 14 days, and the doctor’s opinion on the final diagnosis of the initial event that led to randomisation. These unblinded data, may therefore be subject to some degree of bias. The primary outcome was the proportion of patients who were either dead or dependent on other people for activities of daily living at six months after randomisation. This outcome was collected by postal questionnaire mailed directly to the patient, or (in Italy) by telephone interview of the patient by a trained researcher, blinded to treatment allocation. The primary outcome was therefore assessed - as far as practicable - blind to treatment allocation and hence should be free from bias. 
We re-checked the data set for inaccuracies and inconsistencies and extracted data on the variables assessed at randomisation, and at the two outcome assessment points: at 14-days after randomisation, death or prior hospital discharge (whichever occurred first) and at 6-months."

— Sandercock, Niewada, and Członkowska (2011)

3.2 WASH Benefits Example Dataset

The data come from the WASH Benefits Bangladesh study, a cluster-randomised controlled trial of the effect of water quality, sanitation, handwashing, and nutritional interventions on child development in rural Bangladesh (Luby et al. 2018). The study enrolled pregnant women in their first or second trimester from the rural villages of Gazipur, Kishoreganj, Mymensingh, and Tangail districts of central Bangladesh, with an average of eight women per cluster. Groups of eight geographically adjacent clusters were block-randomised, using a random number generator, into six intervention groups (all of which received weekly visits from a community health promoter for the first 6 months and every 2 weeks for the next 18 months) and a double-sized control group (no intervention or health promoter visit). The six intervention groups were:

chlorinated drinking water;
improved sanitation;
handwashing with soap;
combined water, sanitation, and handwashing;
improved nutrition through counseling and provision of lipid-based nutrient supplements; and
combined water, sanitation, handwashing, and nutrition.
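
The block randomisation described above can be sketched in a few lines of base R. This is only an illustration of the design, not the trial's actual procedure: the arm labels are hypothetical shorthand for the six interventions, the number of blocks (n_blocks) is arbitrary, and sample() stands in for whatever random number generator the investigators used. Each block of eight clusters receives each intervention once and the control twice, which yields the double-sized control group.

```r
set.seed(2018)  # arbitrary seed, only for reproducibility of this sketch

# Hypothetical shorthand labels for the six intervention arms.
intervention_arms <- c("Water", "Sanitation", "Handwashing",
                       "WSH", "Nutrition", "Nutrition + WSH")

# Eight slots per block: six interventions plus two control slots.
block_arms <- c(intervention_arms, "Control", "Control")

n_blocks <- 5  # hypothetical number of blocks

# Randomly permute the eight arm labels within each block of eight clusters.
allocation <- do.call(rbind, lapply(seq_len(n_blocks), function(b) {
  data.frame(block = b,
             cluster = seq_along(block_arms),
             arm = sample(block_arms))
}))

# Every intervention arm appears once per block; control appears twice.
arm_counts <- table(allocation$arm)
arm_counts
```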

In the workshop, we concentrate on child growth (size for age) as the outcome of interest. For reference, this trial was registered with ClinicalTrials.gov as NCT01590095.

library(tidyverse)

# read in data
dat <- read_csv("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv")
dat

# A tibble: 4,695 x 28
     whz tr    fracode month  aged sex   momage momedu momheight hfiacat Nlt18
   <dbl> <chr> <chr>   <dbl> <dbl> <chr>  <dbl> <chr>      <dbl> <chr>   <dbl>
 1  0    Cont… N05265      9   268 male      30 Prima…      146. Food S…     3
 2 -1.16 Cont… N05265      9   286 male      25 Prima…      149. Modera…     2
 3 -1.05 Cont… N08002      9   264 male      25 Prima…      152. Food S…     1
 4 -1.26 Cont… N08002      9   252 fema…     28 Prima…      140. Food S…     3
 5 -0.59 Cont… N06531      9   336 fema…     19 Secon…      151. Food S…     2
 6 -0.51 Cont… N06531      9   304 male      20 Secon…      154. Severe…     0
 7 -2.46 Cont… N08002      9   336 fema…     19 Prima…      151. Food S…     2
 8 -0.6  Cont… N06528      9   312 fema…     25 No ed…      142. Food S…     2
 9 -0.23 Cont… N06528      9   322 male      30 Secon…      153. Food S…     1
10 -0.14 Cont… N06453      9   376 male      30 No ed…      156. Modera…     2
# … with 4,685 more rows, and 17 more variables: Ncomp <dbl>, watmin <dbl>,
#   elec <dbl>, floor <dbl>, walls <dbl>, roof <dbl>, asset_wardrobe <dbl>,
#   asset_table <dbl>, asset_chair <dbl>, asset_khat <dbl>, asset_chouki <dbl>,
#   asset_tv <dbl>, asset_refrig <dbl>, asset_bike <dbl>, asset_moto <dbl>,
#   asset_sewmach <dbl>, asset_mobile <dbl>

We have 28 variables measured, of which 1 variable is set to be the outcome of interest. This outcome, \(Y\), is the weight-for-height Z-score (whz in dat); the treatment of interest, \(A\), is the randomized treatment group (tr in dat); and the adjustment set, \(W\), consists simply of everything else. This results in our observed data structure being \(n\) i.i.d. copies of \(O_i = (W_i, A_i, Y_i)\), for \(i = 1, \ldots, n\).
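
One way to record these roles in code is a named list that maps \(W\), \(A\), and \(Y\) to column names. The sketch below uses a small hypothetical data frame (toy) carrying a handful of the column names from dat; the values of tr are placeholder labels, and the name node_list is introduced here for illustration.

```r
# Toy stand-in for dat: a few of its columns, with illustrative values.
toy <- data.frame(
  whz    = c(0, -1.16, -1.05),          # outcome Y
  tr     = c("arm_a", "arm_b", "arm_a"), # treatment A (placeholder labels)
  momage = c(30, 25, 25),               # baseline covariate
  Nlt18  = c(3, 2, 1)                   # baseline covariate
)

# W is "everything else": all columns that are neither the outcome
# nor the treatment.
node_list <- list(
  W = setdiff(names(toy), c("whz", "tr")),
  A = "tr",
  Y = "whz"
)

node_list
```

Applied to the full data, the same setdiff() call would place the 26 remaining columns in \(W\).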

As before, we can summarize the variables measured in the WASH Benefits data set with skimr:

skim(dat)

(#tab:skim_washb_data)Data summary
Name	dat
Number of rows	4695
Number of columns	28
_______________________
Column type frequency:
character	5
numeric	23
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
tr	1	3	15	7
fracode	1	2	6	20
sex	1	4	6	2
momedu	1	12	15	3
hfiacat	1	11	24	4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
whz	0	1.00	-0.59	1.03	-4.67	-1.28	-0.6	0.08	4.97	▁▆▇▁▁
month	0	1.00	6.45	3.33	1.00	4.00	6.0	9.00	12.00	▇▇▅▇▇
aged	0	1.00	266.32	52.17	42.00	230.00	266.0	303.00	460.00	▁▂▇▅▁
momage	18	1.00	23.91	5.24	14.00	20.00	23.0	27.00	60.00	▇▇▁▁▁
momheight	31	0.99	150.50	5.23	120.65	147.05	150.6	154.06	168.00	▁▁▆▇▁
Nlt18	0	1.00	1.60	1.25	0.00	1.00	1.0	2.00	10.00	▇▂▁▁▁
Ncomp	0	1.00	11.04	6.35	2.00	6.00	10.0	14.00	52.00	▇▃▁▁▁
watmin	0	1.00	0.95	9.48	0.00	0.00	0.0	1.00	600.00	▇▁▁▁▁
elec	0	1.00	0.60	0.49	0.00	0.00	1.0	1.00	1.00	▆▁▁▁▇
floor	0	1.00	0.11	0.31	0.00	0.00	0.0	0.00	1.00	▇▁▁▁▁
walls	0	1.00	0.72	0.45	0.00	0.00	1.0	1.00	1.00	▃▁▁▁▇
roof	0	1.00	0.99	0.12	0.00	1.00	1.0	1.00	1.00	▁▁▁▁▇
asset_wardrobe	0	1.00	0.17	0.37	0.00	0.00	0.0	0.00	1.00	▇▁▁▁▂
asset_table	0	1.00	0.73	0.44	0.00	0.00	1.0	1.00	1.00	▃▁▁▁▇
asset_chair	0	1.00	0.73	0.44	0.00	0.00	1.0	1.00	1.00	▃▁▁▁▇
asset_khat	0	1.00	0.61	0.49	0.00	0.00	1.0	1.00	1.00	▅▁▁▁▇
asset_chouki	0	1.00	0.78	0.41	0.00	1.00	1.0	1.00	1.00	▂▁▁▁▇
asset_tv	0	1.00	0.30	0.46	0.00	0.00	0.0	1.00	1.00	▇▁▁▁▃
asset_refrig	0	1.00	0.08	0.27	0.00	0.00	0.0	0.00	1.00	▇▁▁▁▁
asset_bike	0	1.00	0.32	0.47	0.00	0.00	0.0	1.00	1.00	▇▁▁▁▃
asset_moto	0	1.00	0.07	0.25	0.00	0.00	0.0	0.00	1.00	▇▁▁▁▁
asset_sewmach	0	1.00	0.06	0.25	0.00	0.00	0.0	0.00	1.00	▇▁▁▁▁
asset_mobile	0	1.00	0.86	0.35	0.00	1.00	1.0	1.00	1.00	▁▁▁▁▇

Note that the asset variables reflect socio-economic status of the study participants.

3.3 Veterans’ Administration Lung Cancer Trial Dataset

These data correspond to a study conducted by the US Veterans Administration. Male patients with advanced inoperable lung cancer were given either the standard therapy or a test chemotherapy. The primary goal of the study was to assess whether the test chemotherapy improved survival. The data set was published in Kalbfleisch and Prentice (2011) and is available in the MASS and survival R packages. Time to death was recorded for 128 patients, and 9 patients left the study before death. Various covariates were also documented for each patient.

library(tidyverse)

# read in data
vet <- read_csv("https://raw.githubusercontent.com/tlverse/deming2019-workshop/master/data/veteran.csv")
vet

# A tibble: 137 x 9
      X1   trt celltype  time status karno diagtime   age prior
   <dbl> <dbl> <chr>    <dbl>  <dbl> <dbl>    <dbl> <dbl> <dbl>
 1     1     1 squamous    72      1    60        7    69     0
 2     2     1 squamous   411      1    70        5    64    10
 3     3     1 squamous   228      1    60        3    38     0
 4     4     1 squamous   126      1    60        9    63    10
 5     5     1 squamous   118      1    70       11    65    10
 6     6     1 squamous    10      1    20        5    49     0
 7     7     1 squamous    82      1    40       10    69    10
 8     8     1 squamous   110      1    80       29    68     0
 9     9     1 squamous   314      1    50       18    43     0
10    10     1 squamous   100      0    70        6    70     0
# … with 127 more rows

A snapshot of the data set is shown below:

skim(vet)

(#tab:skim_vet_data)Data summary
Name	vet
Number of rows	137
Number of columns	9
_______________________
Column type frequency:
character	1
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
celltype	0	1	5	9	0	4	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
X1	1	69.00	39.69	1	35	69	103	137	▇▇▇▇▇
trt	1	1.50	0.50	1	1	1	2	2	▇▁▁▁▇
time	1	121.63	157.82	1	25	80	144	999	▇▁▁▁▁
status	1	0.93	0.25	0	1	1	1	1	▁▁▁▁▇
karno	1	58.57	20.04	10	40	60	75	99	▁▅▇▇▂
diagtime	1	8.77	10.61	1	3	5	11	87	▇▁▁▁▁
age	1	58.31	10.54	34	51	62	66	81	▃▂▅▇▁
prior	1	2.92	4.56	0	0	0	10	10	▇▁▁▁▃
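
The split between observed deaths and patients who left the study early can be read off the status column. The sketch below uses only the time and status values from the ten rows printed earlier in this section, and it assumes that status = 1 marks an observed death and status = 0 a censored time; that coding is an assumption here, although it is consistent with the mean of status (0.93, roughly 128/137) in the summary above. The object names are hypothetical.

```r
# time and status for the ten rows of vet printed earlier in this section.
first_ten <- data.frame(
  time   = c(72, 411, 228, 126, 118, 10, 82, 110, 314, 100),
  status = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
)

# Assumed coding: 1 = death observed, 0 = censored (left the study alive).
n_events   <- sum(first_ten$status == 1)
n_censored <- sum(first_ten$status == 0)

# Conventional display of survival times: censored times get a trailing "+".
time_label <- paste0(first_ten$time, ifelse(first_ten$status == 0, "+", ""))

c(events = n_events, censored = n_censored)
time_label
```

Running the same two sums on the full vet data should reproduce the 128 deaths and 9 censored patients mentioned in the description, provided the assumed coding holds.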

References

Kalbfleisch, John D, and Ross L Prentice. 2011. The Statistical Analysis of Failure Time Data. Vol. 360. John Wiley & Sons.

Luby, Stephen P, Mahbubur Rahman, Benjamin F Arnold, Leanne Unicomb, Sania Ashraf, Peter J Winch, Christine P Stewart, et al. 2018. “Effects of Water Quality, Sanitation, Handwashing, and Nutritional Interventions on Diarrhoea and Child Growth in Rural Bangladesh: A Cluster Randomised Controlled Trial.” The Lancet Global Health 6 (3). Elsevier: e302–e315.

Sandercock, P, R Collins, C Counsell, B Farrell, R Peto, J Slattery, and C Warlow, for the International Stroke Trial Collaborative Group. 1997. “The International Stroke Trial (IST): A Randomized Trial of Aspirin, Subcutaneous Heparin, Both, or Neither Among 19,435 Patients with Acute Ischemic Stroke.” Lancet 349 (9065): 1569–81.

Sandercock, Peter AG, Maciej Niewada, and Anna Członkowska. 2011. “The International Stroke Trial Database.” Trials 12 (1). BioMed Central: 101.