Chapter 3 Datasets
3.1 International Stroke Trial Example Dataset
The International Stroke Trial database contains individual patient data from the International Stroke Trial (IST), a multi-national randomized trial conducted between 1991 and 1996 (pilot phase between 1991 and 1993) that aimed to assess whether early administration of aspirin, heparin, both, or neither influenced the clinical course of acute ischaemic stroke (Sandercock et al. 1997). The IST dataset includes data on 19,435 patients with acute stroke, with 99% complete follow-up. De-identified data are available for download at https://datashare.is.ed.ac.uk/handle/10283/128. The study design is described in more detail in the block quote at the end of this section, taken from Sandercock, Niewada, and Członkowska (2011). In the example data for this workshop, we consider a sample of 5,000 patients and the binary outcome of recurrent ischemic stroke within 14 days after randomization. The example data also deliberately include some subjects whose outcome is missing. The data dictionary, ist_variables.pdf, is available in the data folder.
library(tidyverse)
# read in data
ist <- read_csv("https://raw.githubusercontent.com/tlverse/deming2019-workshop/master/data/ist_sample.csv")
ist
# A tibble: 5,000 x 26
RDELAY RCONSC SEX AGE RSLEEP RATRIAL RCT RVISINF RHEP24 RASP3 RSBP
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 46 F F 85 N N N N Y N 150
2 33 F M 71 Y Y Y Y N Y 180
3 6 D M 88 N Y N N N N 140
4 8 F F 68 Y N Y Y N N 118
5 13 F M 60 N N Y N N N 140
6 16 F F 71 Y N Y N N N 160
7 6 F M 71 Y N N N N Y 130
8 15 F M 84 N N Y N Y N 160
9 9 D F 81 N N N N N Y 138
10 20 F F 70 Y N N N N N 170
# … with 4,990 more rows, and 15 more variables: RDEF1 <chr>, RDEF2 <chr>,
# RDEF3 <chr>, RDEF4 <chr>, RDEF5 <chr>, RDEF6 <chr>, RDEF7 <chr>,
# RDEF8 <chr>, STYPE <chr>, RXHEP <chr>, REGION <chr>,
# MISSING_RATRIAL_RASP3 <dbl>, MISSING_RHEP24 <dbl>, RXASP <dbl>,
# DRSISC <dbl>
For the purposes of this workshop, we start by treating the data as independent and identically distributed (i.i.d.) random draws from a very large target population. We could, with available options, account for the clustering of the data (within sampled geographic regions), but for simplicity we avoid these details in the workshop presentations. Modifications of our methodology for biased samples, repeated measures, and similar settings are available.
We have 26 variables measured, of which 1 variable is set to be the outcome of
interest. This outcome, $Y$, indicates recurrent ischemic stroke within 14 days
after randomization (DRSISC in ist); the treatment of interest, $A$, is the
randomized aspirin vs. no aspirin treatment allocation (RXASP in ist); and the
adjustment set, $W$, consists simply of the other variables measured at
baseline. In these data the outcome is occasionally missing, but we do not need
to create a variable indicating this missingness (such as $\Delta$) for
analyses in the tlverse. If we let $\Delta$ denote the indicator that the
outcome is observed, such that $\Delta = 1$ when the outcome is observed and
$\Delta = 0$ when it is not, then we can denote our observed data structure as
$n$ i.i.d. copies of $O_i = (W_i, A_i, \Delta_i, \Delta_i Y_i)$, for
$i = 1, \ldots, n$.
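To make the notation concrete, the following is a minimal sketch of how the indicator $\Delta$ and the product $\Delta Y$ could be derived from an outcome column containing missing values. The toy data frame and the column names Delta and DeltaY are hypothetical and used only for illustration; as noted above, the tlverse does not require this step.

```r
# Hypothetical toy data: a treatment column and an outcome with one missing value.
# The values are made up and are not taken from the IST data.
toy <- data.frame(
  RXASP  = c(1, 0, 1, 0),
  DRSISC = c(0, NA, 1, 0)
)

# Delta = 1 when the outcome is observed, 0 when it is missing
toy$Delta <- as.integer(!is.na(toy$DRSISC))

# Delta * Y: the outcome where observed, and 0 where it is missing
toy$DeltaY <- ifelse(toy$Delta == 1, toy$DRSISC, 0)

print(toy)
```

Each row of the resulting data frame then holds one realization of $(A_i, \Delta_i, \Delta_i Y_i)$; the baseline covariates $W_i$ are omitted here to keep the sketch short.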
Using the skimr package, we can quickly summarize the variables in our data:
Name | ist |
Number of rows | 5000 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 19 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
RCONSC | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
SEX | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
RSLEEP | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
RATRIAL | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RCT | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
RVISINF | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
RHEP24 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RASP3 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF1 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF2 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF3 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF4 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF5 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF6 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF7 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
RDEF8 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
STYPE | 0 | 1 | 3 | 4 | 0 | 5 | 0 |
RXHEP | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
REGION | 0 | 1 | 10 | 26 | 0 | 7 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
RDELAY | 0 | 1 | 20.14 | 12.43 | 1 | 9 | 19 | 29 | 48 | ▇▆▆▃▂ |
AGE | 0 | 1 | 71.93 | 11.65 | 16 | 65 | 74 | 81 | 99 | ▁▁▃▇▂ |
RSBP | 0 | 1 | 160.62 | 27.84 | 71 | 140 | 160 | 180 | 290 | ▁▇▇▁▁ |
MISSING_RATRIAL_RASP3 | 0 | 1 | 0.05 | 0.22 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
MISSING_RHEP24 | 0 | 1 | 0.02 | 0.13 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
RXASP | 0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▇ |
DRSISC | 10 | 1 | 0.02 | 0.15 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
A convenient summary of the relevant variables is given just above.
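The n_missing and complete_rate columns in the summary can be understood as simple per-column counts. The sketch below computes the same two quantities with base R on a hypothetical toy data frame (the values are made up, not taken from ist); it is an illustration of what the columns mean, not the skimr implementation.

```r
# Hypothetical toy data with one missing outcome value
toy <- data.frame(
  AGE    = c(85, 71, 88, 68, 60),
  RXASP  = c(0, 1, 0, 0, 1),
  DRSISC = c(0, NA, 0, 1, 0)
)

# Number of missing values per column (compare: n_missing)
n_missing <- colSums(is.na(toy))

# Share of non-missing values per column (compare: complete_rate)
complete_rate <- colMeans(!is.na(toy))

summary_tbl <- data.frame(n_missing, complete_rate)
print(summary_tbl)
```

Applied to the table above, DRSISC has 10 missing values out of 5,000 rows, which corresponds to a complete rate of 4990/5000 = 0.998; the displayed value of 1 is presumably the result of rounding.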
"The study had a prospective, randomised, open treatment, blinded outcome (PROBE) design. The inclusion criteria were: clinical diagnosis of acute ischaemic stroke, with onset within the previous 48 hours and no clear indication for, or clear contraindication to, treatment with aspirin or subcutaneous heparin. Unlike many stroke trials of that era (and subsequently), the study did not set an upper age limit. Patients were to have a CT brain scan to confirm the diagnosis of stroke, and this was to be done before randomisation if at all possible. To enter a patient in the study, the clinician telephoned a central randomisation service (at the Clinical Trial Service Unit, Oxford) during this telephone call, the baseline variables were entered and checked, and once validated, the computer allocated the treatment and the telephonist then informed the clinician. The patients and treating clinicians were not blinded to the treatment given. Early outcome data were collected by the treating physician who completed a follow-up form at 14 days, death or hospital discharge (whichever occurred first). This form recorded data on events in hospital within 14 days, and the doctor’s opinion on the final diagnosis of the initial event that led to randomisation. These unblinded data, may therefore be subject to some degree of bias. The primary outcome was the proportion of patients who were either dead or dependent on other people for activities of daily living at six months after randomisation. This outcome was collected by postal questionnaire mailed directly to the patient, or (in Italy) by telephone interview of the patient by a trained researcher, blinded to treatment allocation. The primary outcome was therefore assessed - as far as practicable - blind to treatment allocation and hence should be free from bias. 
We re-checked the data set for inaccuracies and inconsistencies and extracted data on the variables assessed at randomisation, and at the two outcome assessment points: at 14-days after randomisation, death or prior hospital discharge (whichever occurred first) and at 6-months."
— Sandercock, Niewada, and Członkowska (2011)
3.2 WASH Benefits Example Dataset
The data come from WASH Benefits Bangladesh, a cluster-randomised controlled trial of the effect of water quality, sanitation, hand washing, and nutritional interventions on child development in rural Bangladesh (Luby et al. 2018). The study enrolled pregnant women in their first or second trimester from the rural villages of Gazipur, Kishoreganj, Mymensingh, and Tangail districts of central Bangladesh, with an average of eight women per cluster. Groups of eight geographically adjacent clusters were block-randomised, using a random number generator, into six intervention groups (all of which received weekly visits from a community health promoter for the first 6 months and every 2 weeks for the next 18 months) and a double-sized control group (no intervention or health promoter visit). The six intervention groups were:
- chlorinated drinking water;
- improved sanitation;
- hand-washing with soap;
- combined water, sanitation, and hand washing;
- improved nutrition through counseling and provision of lipid-based nutrient supplements; and
- combined water, sanitation, handwashing, and nutrition.
In the workshop, we concentrate on child growth, measured by the weight-for-height Z-score, as the outcome of interest. For reference, this trial was registered with ClinicalTrials.gov as NCT01590095.
library(tidyverse)
# read in data
dat <- read_csv("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv")
dat
# A tibble: 4,695 x 28
whz tr fracode month aged sex momage momedu momheight hfiacat Nlt18
<dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 0 Cont… N05265 9 268 male 30 Prima… 146. Food S… 3
2 -1.16 Cont… N05265 9 286 male 25 Prima… 149. Modera… 2
3 -1.05 Cont… N08002 9 264 male 25 Prima… 152. Food S… 1
4 -1.26 Cont… N08002 9 252 fema… 28 Prima… 140. Food S… 3
5 -0.59 Cont… N06531 9 336 fema… 19 Secon… 151. Food S… 2
6 -0.51 Cont… N06531 9 304 male 20 Secon… 154. Severe… 0
7 -2.46 Cont… N08002 9 336 fema… 19 Prima… 151. Food S… 2
8 -0.6 Cont… N06528 9 312 fema… 25 No ed… 142. Food S… 2
9 -0.23 Cont… N06528 9 322 male 30 Secon… 153. Food S… 1
10 -0.14 Cont… N06453 9 376 male 30 No ed… 156. Modera… 2
# … with 4,685 more rows, and 17 more variables: Ncomp <dbl>, watmin <dbl>,
# elec <dbl>, floor <dbl>, walls <dbl>, roof <dbl>, asset_wardrobe <dbl>,
# asset_table <dbl>, asset_chair <dbl>, asset_khat <dbl>, asset_chouki <dbl>,
# asset_tv <dbl>, asset_refrig <dbl>, asset_bike <dbl>, asset_moto <dbl>,
# asset_sewmach <dbl>, asset_mobile <dbl>
We have 28 variables measured, of which 1 variable is set to be the outcome of
interest. This outcome, $Y$, is the weight-for-height Z-score (whz in dat);
the treatment of interest, $A$, is the randomized treatment group (tr in dat);
and the adjustment set, $W$, consists simply of everything else. This results
in our observed data structure being $n$ i.i.d. copies of
$O_i = (W_i, A_i, Y_i)$, for $i = 1, \ldots, n$.
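The split of the columns into $W$, $A$, and $Y$ can be written down as a small named list of column names. The sketch below does this on a hypothetical toy data frame whose column names mirror a few of those in dat; the values, the arm labels, and the name node_list are assumptions made for illustration only.

```r
# Hypothetical toy data with a few of the column names found in dat;
# the values and treatment labels are made up.
toy <- data.frame(
  whz    = c(0.00, -1.16, -1.05),
  tr     = c("arm_a", "arm_b", "arm_a"),
  momage = c(30, 25, 25),
  sex    = c("male", "male", "female"),
  stringsAsFactors = FALSE
)

# Y is the outcome, A the treatment, and W is "everything else"
node_list <- list(
  W = setdiff(names(toy), c("tr", "whz")),
  A = "tr",
  Y = "whz"
)

print(node_list)
```

Using setdiff() keeps the definition of $W$ in step with the data: any column that is neither the treatment nor the outcome is treated as an adjustment variable.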
Like before, we can summarize the variables measured in the WASH Benefits data
set with skimr:
Name | dat |
Number of rows | 4695 |
Number of columns | 28 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 23 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
tr | 0 | 1 | 3 | 15 | 0 | 7 | 0 |
fracode | 0 | 1 | 2 | 6 | 0 | 20 | 0 |
sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
momedu | 0 | 1 | 12 | 15 | 0 | 3 | 0 |
hfiacat | 0 | 1 | 11 | 24 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
whz | 0 | 1.00 | -0.59 | 1.03 | -4.67 | -1.28 | -0.6 | 0.08 | 4.97 | ▁▆▇▁▁ |
month | 0 | 1.00 | 6.45 | 3.33 | 1.00 | 4.00 | 6.0 | 9.00 | 12.00 | ▇▇▅▇▇ |
aged | 0 | 1.00 | 266.32 | 52.17 | 42.00 | 230.00 | 266.0 | 303.00 | 460.00 | ▁▂▇▅▁ |
momage | 18 | 1.00 | 23.91 | 5.24 | 14.00 | 20.00 | 23.0 | 27.00 | 60.00 | ▇▇▁▁▁ |
momheight | 31 | 0.99 | 150.50 | 5.23 | 120.65 | 147.05 | 150.6 | 154.06 | 168.00 | ▁▁▆▇▁ |
Nlt18 | 0 | 1.00 | 1.60 | 1.25 | 0.00 | 1.00 | 1.0 | 2.00 | 10.00 | ▇▂▁▁▁ |
Ncomp | 0 | 1.00 | 11.04 | 6.35 | 2.00 | 6.00 | 10.0 | 14.00 | 52.00 | ▇▃▁▁▁ |
watmin | 0 | 1.00 | 0.95 | 9.48 | 0.00 | 0.00 | 0.0 | 1.00 | 600.00 | ▇▁▁▁▁ |
elec | 0 | 1.00 | 0.60 | 0.49 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▆▁▁▁▇ |
floor | 0 | 1.00 | 0.11 | 0.31 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
walls | 0 | 1.00 | 0.72 | 0.45 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▃▁▁▁▇ |
roof | 0 | 1.00 | 0.99 | 0.12 | 0.00 | 1.00 | 1.0 | 1.00 | 1.00 | ▁▁▁▁▇ |
asset_wardrobe | 0 | 1.00 | 0.17 | 0.37 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▂ |
asset_table | 0 | 1.00 | 0.73 | 0.44 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▃▁▁▁▇ |
asset_chair | 0 | 1.00 | 0.73 | 0.44 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▃▁▁▁▇ |
asset_khat | 0 | 1.00 | 0.61 | 0.49 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▅▁▁▁▇ |
asset_chouki | 0 | 1.00 | 0.78 | 0.41 | 0.00 | 1.00 | 1.0 | 1.00 | 1.00 | ▂▁▁▁▇ |
asset_tv | 0 | 1.00 | 0.30 | 0.46 | 0.00 | 0.00 | 0.0 | 1.00 | 1.00 | ▇▁▁▁▃ |
asset_refrig | 0 | 1.00 | 0.08 | 0.27 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
asset_bike | 0 | 1.00 | 0.32 | 0.47 | 0.00 | 0.00 | 0.0 | 1.00 | 1.00 | ▇▁▁▁▃ |
asset_moto | 0 | 1.00 | 0.07 | 0.25 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
asset_sewmach | 0 | 1.00 | 0.06 | 0.25 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
asset_mobile | 0 | 1.00 | 0.86 | 0.35 | 0.00 | 1.00 | 1.0 | 1.00 | 1.00 | ▁▁▁▁▇ |
Note that the asset variables reflect socio-economic status of the study participants.
3.3 Veterans’ Administration Lung Cancer Trial Dataset
These data come from a study conducted by the US Veterans Administration.
Male patients with advanced inoperable lung cancer were given either the
standard therapy or a test chemotherapy. The primary goal of the study was to
assess whether the test chemotherapy improved survival. This data set has been
published in Kalbfleisch and Prentice (2011) and is available in the MASS and
survival R packages. Time to death was recorded for 128 patients, and 9
patients left the study before death. Various covariates were also documented
for each patient.
library(tidyverse)
# read in data
vet <- read_csv("https://raw.githubusercontent.com/tlverse/deming2019-workshop/master/data/veteran.csv")
vet
# A tibble: 137 x 9
X1 trt celltype time status karno diagtime age prior
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 squamous 72 1 60 7 69 0
2 2 1 squamous 411 1 70 5 64 10
3 3 1 squamous 228 1 60 3 38 0
4 4 1 squamous 126 1 60 9 63 10
5 5 1 squamous 118 1 70 11 65 10
6 6 1 squamous 10 1 20 5 49 0
7 7 1 squamous 82 1 40 10 69 10
8 8 1 squamous 110 1 80 29 68 0
9 9 1 squamous 314 1 50 18 43 0
10 10 1 squamous 100 0 70 6 70 0
# … with 127 more rows
A snapshot of the data set is shown below:
Name | vet |
Number of rows | 137 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
celltype | 0 | 1 | 5 | 9 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
X1 | 0 | 1 | 69.00 | 39.69 | 1 | 35 | 69 | 103 | 137 | ▇▇▇▇▇ |
trt | 0 | 1 | 1.50 | 0.50 | 1 | 1 | 1 | 2 | 2 | ▇▁▁▁▇ |
time | 0 | 1 | 121.63 | 157.82 | 1 | 25 | 80 | 144 | 999 | ▇▁▁▁▁ |
status | 0 | 1 | 0.93 | 0.25 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
karno | 0 | 1 | 58.57 | 20.04 | 10 | 40 | 60 | 75 | 99 | ▁▅▇▇▂ |
diagtime | 0 | 1 | 8.77 | 10.61 | 1 | 3 | 5 | 11 | 87 | ▇▁▁▁▁ |
age | 0 | 1 | 58.31 | 10.54 | 34 | 51 | 62 | 66 | 81 | ▃▂▅▇▁ |
prior | 0 | 1 | 2.92 | 4.56 | 0 | 0 | 0 | 10 | 10 | ▇▁▁▁▃ |
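The event and censoring counts quoted in the description of this trial can be checked directly from the status column. The sketch below uses the copy of the data distributed with the survival package rather than the workshop CSV, and it assumes that status is coded 1 for an observed death and 0 for a censored observation, as the summary above suggests.

```r
# Assumes the survival package is installed; it is one of the packages
# recommended for standard R installations.
vet_surv <- survival::veteran

# Total number of patients
n_patients <- nrow(vet_surv)

# status == 1: death observed; status == 0: left the study before death
n_deaths   <- sum(vet_surv$status == 1)
n_censored <- sum(vet_surv$status == 0)

print(c(patients = n_patients, deaths = n_deaths, censored = n_censored))
```

If the packaged copy matches the workshop file, the counts should agree with the 128 deaths and 9 censored patients reported for this study.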
References
Kalbfleisch, John D, and Ross L Prentice. 2011. The Statistical Analysis of Failure Time Data. Vol. 360. John Wiley & Sons.
Luby, Stephen P, Mahbubur Rahman, Benjamin F Arnold, Leanne Unicomb, Sania Ashraf, Peter J Winch, Christine P Stewart, et al. 2018. “Effects of Water Quality, Sanitation, Handwashing, and Nutritional Interventions on Diarrhoea and Child Growth in Rural Bangladesh: A Cluster Randomised Controlled Trial.” The Lancet Global Health 6 (3). Elsevier: e302–e315.
Sandercock, P, R Collins, C Counsell, B Farrell, R Peto, J Slattery, and C Warlow, for the International Stroke Trial Collaborative Group. 1997. “The International Stroke Trial (IST): A Randomized Trial of Aspirin, Subcutaneous Heparin, Both, or Neither Among 19,435 Patients with Acute Ischemic Stroke.” Lancet 349 (9065): 1569–81.
Sandercock, Peter AG, Maciej Niewada, and Anna Członkowska. 2011. “The International Stroke Trial Database.” Trials 12 (1). BioMed Central: 101.