This is an annotated copy of the .Rmd file that generated my slides for the talk at the Clinical Trials Conference at the University of Pennsylvania, April 17 2023. It is intended to be useful for anyone who wants to replicate the analyses done there.
The colon cancer data set is found in the survival package. Each subject has 2 rows, one for time to recurrence, the other for time to death. Split this into two data sets, and then create the progression free survival time.
Make the multi-state data set. Each subject is described by a set of rows with variable of subject id, time1, time2, state, and covariates. The primary rule for such a data set is that each the rows for each subject describe a physically possible path through the states. - no overlapping intervals (can’t be two places at once) - no gaps in the time course (have to be somewhere) - if you enter a state, the time in state is >0 (no teleporting)
The tmerge routine is a tool to help create such data sets, but it is not necessary. Other than the state or “status” variable, the data has the same form as timed-dependent covariate data sets; many have their own process for creating those. The survcheck routine checks above conditions, and gives a table of the observed transitions.
Call:
survcheck(formula = Surv(time1, time2, state) ~ 1, data = cdata,
id = id)
Unique identifiers Observations Transitions
929 1390 915
Transitions table:
to
from recur death (censored)
(s0) 463 43 423
recur 0 409 52
death 0 0 0
Number of subjects with 0, 1, ... transitions to each state:
count
state 0 1 2
recur 466 463 0
death 477 452 0
(any) 423 97 409
The first table tells the observed transtions in the data, each row of the second how many subjects made 0, 1, … transitions into that state.
The statefig routine draws simple state space figures. Those who want prettier ones can employ one of the many routines for DAGs. (The annotation below is about as fancy as you can get with this routine.)
Fit the multistate model, and display the coefficients as a graph. (Fitting is easy, all the work was setting up the data set.) For those who know ggplot: feel free to make a much nicer figure.
Call:
coxph(formula = Surv(time1, time2, state) ~ lev5 + age10 + extent +
node4, data = cdata, id = id)
1:2 coef exp(coef) se(coef) robust se z p
lev5 -0.52113 0.59385 0.10739 0.10935 -4.766 1.88e-06
age10 -0.04278 0.95813 0.03915 0.04199 -1.019 0.308
extent 0.53143 1.70137 0.11399 0.12263 4.334 1.47e-05
node4 0.82599 2.28413 0.09671 0.10028 8.237 < 2e-16
1:3 coef exp(coef) se(coef) robust se z p
lev5 0.08992 1.09409 0.31145 0.31975 0.281 0.7785
age10 0.95268 2.59264 0.18940 0.18190 5.237 1.63e-07
extent 0.47976 1.61568 0.36679 0.46724 1.027 0.3045
node4 0.80634 2.23969 0.33753 0.35185 2.292 0.0219
2:3 coef exp(coef) se(coef) robust se z p
lev5 0.23120 1.26011 0.11413 0.12111 1.909 0.0563
age10 0.09618 1.10096 0.03989 0.04391 2.190 0.0285
extent 0.09959 1.10472 0.10598 0.11439 0.871 0.3839
node4 0.40838 1.50438 0.10455 0.10256 3.982 6.83e-05
States: 1= (s0), 2= recur, 3= death
Likelihood ratio test=188.9 on 12 df, p=< 2.2e-16
n= 1390, number of events= 915
The set of probability in state curves for a fitted multistate PH model can be generated for any target set of covariates. (Predictions are always for a hypothetical someone). Below, I dropped age to make things a bit simpler. Then there is a trick: if we exclude any observation time after recurrence (lines with td.recur==1), then the model is a competing risks model and the probability in state for recurrence will be “ever recurred”. It is an easy way to get the probability of ever recurring.
The objects csurv2 and csurv will each have the curves for 16 subjects, the number of rows in data set dummy; each curve has information on 3 states of (s0), recurrence and death. (If you don’t provide an explicit name for the initial state, and I didn’t, it uses (s0) as a generic label.) The intermediate printing below was not shown on the slides.
data states
16 3
n nevent rmean
1, (s0) 1390 0 6.5680067
2, (s0) 1390 0 7.0445011
3, (s0) 1390 0 5.7484689
4, (s0) 1390 0 6.4688788
5, (s0) 1390 0 4.6183596
6, (s0) 1390 0 5.6179260
7, (s0) 1390 0 3.2591642
8, (s0) 1390 0 4.4647544
9, (s0) 1390 0 5.1645521
10, (s0) 1390 0 6.0429096
11, (s0) 1390 0 3.8857981
12, (s0) 1390 0 5.0234830
13, (s0) 1390 0 2.5047500
14, (s0) 1390 0 3.7340445
15, (s0) 1390 0 1.3681868
16, (s0) 1390 0 2.3763783
1, recur 1390 463 0.4864469
2, recur 1390 463 0.2532152
3, recur 1390 463 0.6983250
4, recur 1390 463 0.3716350
5, recur 1390 463 0.9418232
6, recur 1390 463 0.5227841
7, recur 1390 463 1.1579902
8, recur 1390 463 0.6891639
9, recur 1390 463 0.6936849
10, recur 1390 463 0.3729134
11, recur 1390 463 0.8824530
12, recur 1390 463 0.5039921
13, recur 1390 463 1.0037567
14, recur 1390 463 0.6281135
15, recur 1390 463 0.9977164
16, recur 1390 463 0.6989545
1, death 1390 452 0.9455464
2, death 1390 452 0.7022837
3, death 1390 452 1.5532061
4, death 1390 452 1.1594862
5, death 1390 452 2.4398172
6, death 1390 452 1.8592899
7, death 1390 452 3.5828457
8, death 1390 452 2.8460817
9, death 1390 452 2.1417630
10, death 1390 452 1.5841771
11, death 1390 452 3.2317489
12, death 1390 452 2.4725250
13, death 1390 452 4.4914933
14, death 1390 452 3.6378420
15, death 1390 452 5.6340969
16, death 1390 452 4.9246672
[1] 0.02260495 0.11410118 0.81700753 0.04628633
Now draw the barplot without collapsing over extent of disease.
Make a matrix out of the sojourn times, and print it out.
Sojourn(initial) Sojourn(recur) P(recur) recur duration
Trt, nodes <=4 5.69 0.51 0.33 1.55
Control, nodes <=4 4.73 0.91 0.49 1.88
Control, nodes >4 2.67 0.98 0.76 1.29
Trt, nodes >4 3.87 0.61 0.58 1.05
For the curves below I got lazy. Over 80% of the subjects had extent=3, so I draw the curves for extent =3 rather than margining it out. The yates() routine is designed do the busywork for marginal estimates, but currently doesn’t work with mutltistate fits.
The person below with stem cell transplant on the same day as CR has to be modified, to prevent a “0 days in the CR state” error. Making CR 1 day earlier solves this, but is completely arbitrary. The decision not to count CR after SCT is a more nuanced one. One of the main discussions in the talk was CR and duration of CR, compraring arms A and B. If someone has had stem cell transplant, however, should that CR be credited to the earlier treatment, or is it entirely due to SCT? Which way to go is much more a medical question than a statistical one. This data set is also discussed in the main survival vigette as an example of Aalen-Johansen curves, and in that section I kept these 18 CR.
Call:
survcheck(formula = Surv(time1, time2, event) ~ 1, data = mdata,
id = id)
Unique identifiers Observations Transitions
646 1689 1353
Transitions table:
to
from CR SCT relapse death (censored)
(s0) 443 106 13 55 29
CR 0 159 164 15 105
SCT 0 0 49 151 163
relapse 0 99 0 99 28
death 0 0 0 0 0
Number of subjects with 0, 1, ... transitions to each state:
count
state 0 1 2 3 4
CR 203 443 0 0 0
SCT 282 364 0 0 0
relapse 420 226 0 0 0
death 326 320 0 0 0
(any) 29 206 173 151 87
The variable crstat is used for the analysis that treats CR and death as competing risks, txstat for treating transplant and death as competing risks, and tx2 for the second transplant analysis. One should run survcheck on each endpoint to verify it.
to
from CR SCT relapse death (censored)
Entry 443 106 13 55 29
CR 0 159 164 15 105
SCT 0 0 49 151 163
relapse 0 99 0 99 28
death 0 0 0 0 0
The layout() command is an old way to create a montage of plots. I use it here to allow the state space figures to be on the same page as the survival plots.