Stat Bandit

Musings on statistics, computation and data research

An enhanced Kaplan-Meier plot

We often see, in publications, a Kaplan-Meier survival plot, with a table of the number of subjects at risk at different time points aligned below the figure. I needed this type of plot (or really, matrices of such plots) for an upcoming publication. Of course, my preferred toolbox was R and the ggplot2 package.

There were other attempts to do this type of plot in ggplot2, mainly by Gary Collins and an anonymous author as seen on the ggplot2 mailing list. There was also an invaluable blog on Learn-R on plotting a table using geom_text.

I decided to use Gary’s code as my starting point. I wanted to add a log-rank test p-value for stratified KM curves on the plot, and I wanted to code it so that the text in the table remained aligned to the x-axis even when my strata names changed with different data sets. The former was relatively easy, but the latter was challenging (and seemed to be unsolved by Gary, who had posed it as a question). Both were instructive, and took me back a layer to the grid graphics engine (on which both ggplot2 and lattice are based) and the gridExtra package.

The placement of the p-value in the plot is a case of a single character string being placed at a certain point on the graph, which I had done with the text command in base R. However, the geom_text layer seems to be intended for plotting text at all points in the data and not for inserting a single text string. However, I learned of the grid.text function from the grid package, which would enable me to do this for grid-based graphics.

The second part is to plot a table aligned to the x-axis of the plot. The plotting was achieved using Gary’s code using the method outlined on the Learn-R website. The problem was alignment. A few experiments showed that alignment depended on the size of the strata labels used in the plot. I started playing with the plot.margin option in ggplot2 and then figured out that I could maintain the alignment (more or less) if the left margin was computed based on the length of the labels. Some experimentation got me to a reasonable formula for this computation.

Now I had to put the 3 parts together. Both the plot and the table were ggplot objects, so I knew I could stitch them together with grid.arrange from the gridExtra package. However, the p-value was being written using a separate grid.text command. I had to learn more about grid. I decided to use basic grid commands to convert the KM plot to a grid object using ggplotGrob and add the p-value text string to it using addGrob and textGrob. I could now stitch this together with the plotted table using grid.arrange, using the same layout specs.

However, this still wasn’t ideal since the plots were overlapping a bit. I decided to put a buffer blank image in between. In base graphics, I could achieve this with plot(..., type="n"). In ggplot, I learned about geom_blank() and theme_blank() to create a blank image. This I then put in a thin row between the plot and the table.

I incorporated this process into the function ggkm which is given below. An example plot generated by this code:

data(colon)
fit <- survfit(Surv(time,status)~rx, data=colon)
ggkm(fit, timeby=500)

is shown here:
rstudio-V79384

The ggkm function is given below:
Update:The following code is modified thanks to Baptiste Auguie’s comments, which were very helpful. I also make the p-value text optional for the plot, with the default settings omitting the p-values.

Update (3/26/2011) I’ve further updated the code based on comments which allow it to work on a naive install (albeit with the required packages), as well as replacing “df” with “dframe” to get around the fact that “df” is a R function.

Update (9/1/2011) Both Gil Thomas and Mark Cowley have made improvements to the code that have made it more stable and universally run-able (is that a word?). The following reflects Mark’s last edits to the code.

#’ Create a Kaplan-Meier plot using ggplot2
#’
#’ @param sfit a \code{\link[survival]{survfit}} object
#’ @param table logical: Create a table graphic below the K-M plot, indicating at-risk numbers?
#’ @param returns logical: if \code{TRUE}, return an arrangeGrob object
#’ @param xlabs x-axis label
#’ @param ylabs y-axis label
#’ @param ystratalabs The strata labels. \code{Default = levels(summary(sfit)$strata)}
#’ @param ystrataname The legend name. Default = “Strata”
#’ @param timeby numeric: control the granularity along the time-axis
#’ @param main plot title
#’ @param pval logical: add the pvalue to the plot?
#’ @return a ggplot is made. if return=TRUE, then an arrangeGlob object
#’ is returned
#’ @author Abhijit Dasgupta with contributions by Gil Tomas
#’ \url{http://statbandit.wordpress.com/2011/03/08/an-enhanced-kaplan-meier-plot/}
#’ @export
#’ @examples
#’ \dontrun{
#’ data(colon)
#’	fit <- survfit(Surv(time,status)~rx, data=colon)
#'	ggkm(fit, timeby=500)
#' }
ggkm <- function(sfit, table = TRUE, returns = FALSE,
xlabs = "Time", ylabs = "survival probability",
ystratalabs = NULL, ystrataname = NULL,
timeby = 100, main = "Kaplan-Meier Plot",
pval = TRUE, ...) {
require(ggplot2)
require(survival)
require(gridExtra)
if(is.null(ystratalabs)) {
   ystratalabs <- as.character(levels(summary(sfit)$strata))
}
m <- max(nchar(ystratalabs))
if(is.null(ystrataname)) ystrataname <- "Strata"
times <- seq(0, max(sfit$time), by = timeby)
.df <- data.frame(time = sfit$time, n.risk = sfit$n.risk,
    n.event = sfit$n.event, surv = sfit$surv, strata = summary(sfit, censored = T)$strata,
    upper = sfit$upper, lower = sfit$lower)
levels(.df$strata) <- ystratalabs
zeros <- data.frame(time = 0, surv = 1, strata = factor(ystratalabs, levels=levels(.df$strata)),
    upper = 1, lower = 1)
.df <- rbind.fill(zeros, .df)
d <- length(levels(.df$strata))
p <- ggplot(.df, aes(time, surv, groups = strata)) +
    geom_step(aes(linetype = strata), size = 0.7) +
    theme_bw() +
    opts(axis.title.x = theme_text(vjust = 0.5)) +
    scale_x_continuous(xlabs, breaks = times, limits = c(0, max(sfit$time))) +
    scale_y_continuous(ylabs, limits = c(0, 1)) +
    opts(panel.grid.minor = theme_blank()) +
    opts(legend.position = c(ifelse(m < 10, .28, .35), ifelse(d < 4, .25, .35))) +
    opts(legend.key = theme_rect(colour = NA)) +
    labs(linetype = ystrataname) +
    opts(plot.margin = unit(c(0, 1, .5, ifelse(m < 10, 1.5, 2.5)), "lines")) +
    opts(title = main)

## Create a blank plot for place-holding
## .df <- data.frame()
blank.pic <- ggplot(.df, aes(time, surv)) +
    geom_blank() +
    theme_bw() +
    opts(axis.text.x = theme_blank(), axis.text.y = theme_blank(),
        axis.title.x = theme_blank(), axis.title.y = theme_blank(),
        axis.ticks = theme_blank(), panel.grid.major = theme_blank(),
        panel.border = theme_blank())
if(pval) {
    sdiff <- survdiff(eval(sfit$call$formula), data = eval(sfit$call$data))
    pval <- pchisq(sdiff$chisq, length(sdiff$n) – 1, lower.tail = FALSE)
    pvaltxt <- ifelse(pval < 0.0001, "p < 0.0001", paste("p =", signif(pval, 3)))
    p <- p + annotate("text", x = 0.6 * max(sfit$time), y = 0.1, label = pvaltxt)
}
if(table) {
    ## Create table graphic to include at-risk numbers
    risk.data <- data.frame(strata = summary(sfit, times = times, extend = TRUE)$strata,
        time = summary(sfit, times = times, extend = TRUE)$time,
        n.risk = summary(sfit, times = times, extend = TRUE)$n.risk)
    data.table <- ggplot(risk.data, aes(x = time, y = strata, label = format(n.risk, nsmall = 0))) +
        #, color = strata)) +
        geom_text(size = 3.5) +
        theme_bw() +
        scale_y_discrete(breaks = as.character(levels(risk.data$strata)), labels = ystratalabs) +
        # scale_y_discrete(#format1ter = abbreviate,
        # breaks = 1:3,
        # labels = ystratalabs) +
        scale_x_continuous("Numbers at risk", limits = c(0, max(sfit$time))) +
        opts(axis.title.x = theme_text(size = 10, vjust = 1), panel.grid.major = theme_blank(),
        panel.grid.minor = theme_blank(), panel.border = theme_blank(),
        axis.text.x = theme_blank(), axis.ticks = theme_blank(),
        axis.text.y = theme_text(face = "bold", hjust = 1))
    data.table <- data.table + opts(legend.position = "none") +
        xlab(NULL) + ylab(NULL)
    data.table <- data.table +
        opts(plot.margin = unit(c(-1.5, 1, 0.1, ifelse(m < 10, 2.5, 3.5) – 0.28 * m), "lines"))
## Plotting the graphs
## p <- ggplotGrob(p)
## p <- addGrob(p, textGrob(x = unit(.8, "npc"), y = unit(.25, "npc"), label = pvaltxt,
    ## gp = gpar(fontsize = 12)))
    grid.arrange(p, blank.pic, data.table,
        clip = FALSE, nrow = 3, ncol = 1,
        heights = unit(c(2, .1, .25),c("null", "null", "null")))
    if(returns) {
        a <- arrangeGrob(p, blank.pic, data.table, clip = FALSE,
            nrow = 3, ncol = 1, heights = unit(c(2, .1, .25),c("null", "null", "null")))
        return(a)
    }
}
else {
    ## p <- ggplotGrob(p)
    ## p <- addGrob(p, textGrob(x = unit(0.5, "npc"), y = unit(0.23, "npc"),
    ## label = pvaltxt, gp = gpar(fontsize = 12)))
    print(p)
    if(returns) return(p)
    }
}
About these ads

29 responses to “An enhanced Kaplan-Meier plot

  1. baptiste March 8, 2011 at 3:57 AM

    Re: the p-value annotation, you could simply use annotate() in ggplot,

    qplot(1, 1) + annotate(“text”, x=0.8, y=0.8, label=”testing”)

    in which case you don’t need to convert the ggplot to a grob before passing it to grid.arrange().

    Also, note that grid.draw(p) would be a better choice than grid.arrange(p) when there’s only one grob to place.

    Finally, you could perhaps use grid.table() to draw the table rather than ggplot, unless the alignment of the x axes is really important (at first sight i didn’t see why it should be).

    • Abhijit March 8, 2011 at 7:56 AM

      Baptiste,

      I was hoping for comments from experienced hands. I never discovered annotate, so thanks.

      The alignment is actually present in almost all the examples I’ve seen, so yes it is important to my collaborator. I agree that it is not necessary.

      Thank you for taking the time to comment.

      Abhijit

      • baptiste March 8, 2011 at 3:14 PM

        After I posted I realised that the alignment could be useful, indeed. I don’t see why you use blank_pic however, rather than just setting the plot margin for instance, or even an empty rectGrob(). I couldn’t run the example above for some reason (presumably because df is not defined for blank_pic, and also fit is not a data.frame).

        I have a comment regarding the general structure of the function with if() else blocks. I’d suggest the following,

        ## create the main plot
        p = ggplot(…) …

        ## add the optional p value
        if(pvalue)
        p = p + annotate(…)

        ## store the plot in a list
        l = list(p)

        ## optional: create the table and append it to the list

        if(table) {
        gt = ggplot(…)
        l = c(list(gt), l)
        }

        ## finally, plot and/or return

        plots_and_options = c(l, list(clip=F, nrow=3, ncol=1,
        heights = unit(c(2, .1, .25), c(‘null’,’null’,’null’))))

        all = do.call(arrangeGrob, plots_and_options)

        if(plot)
        grid.draw(l)

        if(return)
        return(invisible(l))

        (all this untested)

        HTH,

        baptiste

      • Abhijit March 8, 2011 at 3:33 PM

        Baptiste,

        I appreciate your comments and effort in this.

        I understand your suggested structure, which has the distinct advantage of being more modular. Modifications to my program are forthcoming:)

        Didn’t know about rectGrob(). Thanks

        fit is a survival object, the result of survival:::survfit. If you run the 3-line example in the blog (just above the code) it should work.

        Thanks again.

  2. baptiste March 8, 2011 at 5:37 PM

    You probably don’t need rectGrob(),

    qplot(1,1) + opts(plot.margin = unit(c(1,1,10,1), “lines”))

    I haven’t been able to run the code above; perhaps you should try it in a fresh vanilla session.

    HTH

  3. Kelvin March 10, 2011 at 10:24 AM

    Thank you for the good work Abhijit. I tried to run the codes above. When I submitted >ggkm(fit,timeby=500) I’ve got an error message “Error: ggplot2 doesn’t know how to deal with data of class function”. Can you explain why? That will certainly help me understand ggplot a bit more. Thanks!!!

    • Gil Tomas March 21, 2011 at 8:22 AM

      Kelvin, that’s because df happens to be a function of the stats package in R (the density function of the F distribution). See:
      ?df
      I also had to tweak the code a bit so it would produce the intended result in my machine:
      ggkm <- function (sfit,
      table = TRUE,
      returns = FALSE,
      xlabs = "Time",
      ylabs = "survival probability",
      ystratalabs = NULL,
      ystrataname = NULL,
      timeby = 100,
      main = "Kaplan-Meier Plot",
      pval = FALSE,
      …) {

      require (ggplot2)
      require (survival)
      require (gridExtra)

      ## Create Kaplan Meier plot
      if (is.null (ystratalabs)) {
      ystratalabs <- as.character (levels (summary (sfit)$strata))
      }
      (m <- max (nchar (ystratalabs)))
      if (is.null (ystrataname)) ystrataname <- "Strata"
      times <- seq (0, max (sfit$time), by = timeby)
      .df <- data.frame (time = sfit$time,
      n.risk = sfit$n.risk,
      n.event = sfit$n.event,
      surv = sfit$surv,
      strata = summary (sfit, censored = T)$strata,
      upper = sfit$upper,
      lower = sfit$lower)
      levels (.df$strata) <- ystratalabs
      zeros <- data.frame (time = 0,
      surv = 1,
      strata = ystratalabs,
      upper = 1,
      lower = 1)
      .df <- rbind.fill (zeros, .df)
      d <- length (levels (.df$strata))
      p <- ggplot (.df, aes (time, surv, groups = strata)) +
      geom_step (aes (linetype = strata),
      size = 0.7) +
      theme_bw () +
      opts (axis.title.x = theme_text (vjust = 0.5)) +
      scale_x_continuous (xlabs,
      breaks = times,
      limits = c (0, max (sfit$time))) +
      scale_y_continuous (ylabs, limits = c (0, 1)) +
      opts (panel.grid.minor = theme_blank ()) +
      opts (legend.position = c (ifelse (m < 10, .28, .35),
      ifelse (d < 4, .25, .35))) +
      opts (legend.key = theme_rect (colour = NA)) +
      labs (linetype = ystrataname) +
      opts (plot.margin = unit (c (0, 1, .5, ifelse (m < 10, 1.5, 2.5)),
      "lines")) +
      opts (title = main)

      ## Create a blank plot for place-holding
      ## .df <- data.frame ()
      blank.pic <- ggplot (.df, aes (time, surv)) +
      geom_blank () +
      theme_bw () +
      opts (axis.text.x = theme_blank (),
      axis.text.y = theme_blank (),
      axis.title.x = theme_blank (),
      axis.title.y = theme_blank (),
      axis.ticks = theme_blank (),
      panel.grid.major = theme_blank (),
      panel.border = theme_blank ())

      if (pval) {
      sdiff <- survdiff (eval (sfit$call$formula),
      data = eval (sfit$call$data))
      pval <- pchisq (sdiff$chisq,
      length (sdiff$n) – 1,
      lower.tail = FALSE)
      pvaltxt <- ifelse (pval < 0.0001, "p < 0.0001",
      paste ("p =", signif (pval, 3)))
      p <- p +
      annotate ("text",
      x = 0.6 * max (sfit$time),
      y = 0.1,
      label = pvaltxt)
      }

      if (table) {
      ## Create table graphic to include at-risk numbers
      risk.data <- data.frame (strata = summary (sfit,
      times = times,
      extend = TRUE)$strata,
      time = summary (sfit,
      times = times,
      extend = TRUE)$time,
      n.risk = summary (sfit,
      times = times,
      extend = TRUE)$n.risk)
      data.table <- ggplot (risk.data,
      aes (x = time,
      y = strata,
      label = format (n.risk, nsmall = 0))) + #, color = strata)) +
      geom_text (size = 3.5) +
      theme_bw () +
      scale_y_discrete (breaks = as.character (levels (risk.data$strata)),
      labels = ystratalabs) +
      # scale_y_discrete (#format1ter = abbreviate,
      # breaks = 1:3,
      # labels = ystratalabs) +
      scale_x_continuous ("Numbers at risk",
      limits = c (0, max (sfit$time))) +
      opts (axis.title.x = theme_text (size = 10, vjust = 1),
      panel.grid.major = theme_blank (),
      panel.grid.minor = theme_blank (),
      panel.border = theme_blank (),
      axis.text.x = theme_blank (),
      axis.ticks = theme_blank (),
      axis.text.y = theme_text (face = "bold",
      hjust = 1))

      data.table <- data.table +
      opts (legend.position = "none") +
      xlab (NULL) +
      ylab (NULL)

      data.table <- data.table +
      opts (plot.margin = unit (c (-1.5, 1, 0.1, ifelse (m < 10, 2.5, 3.5) – 0.28 * m), "lines"))

      ## Plotting the graphs
      ## p <- ggplotGrob (p)
      ## p <- addGrob (p, textGrob (x = unit (.8, "npc"), y = unit (.25, "npc"), label = pvaltxt,
      ## gp = gpar (fontsize = 12)))
      grid.arrange (p,
      blank.pic,
      data.table,
      clip = FALSE,
      nrow = 3,
      ncol = 1,
      heights = unit (c (2, .1, .25),
      c ("null", "null", "null")))
      if (returns) {
      a <- arrangeGrob (p,
      blank.pic,
      data.table,
      clip = FALSE,
      nrow = 3,
      ncol = 1,
      heights = unit (c (2, .1, .25),
      c ("null", "null", "null")))
      return (a)
      }
      }
      else {
      ## p <- ggplotGrob (p)
      ## p <- addGrob (p, textGrob (x = unit (0.5, "npc"), y = unit (0.23, "npc"),
      ## label = pvaltxt, gp = gpar (fontsize = 12)))
      print (p)
      if (returns) return (p)
      }
      }

    • Gil Tomas March 21, 2011 at 8:22 AM

      Kelvin, that’s because “df” *happens* to be a function of the stats package in R (the density function of the F distribution). See:
      ?df
      I also had to tweak the code a bit so it would produce the intended result in my machine:
      ggkm <- function (sfit,
      table = TRUE,
      returns = FALSE,
      xlabs = "Time",
      ylabs = "survival probability",
      ystratalabs = NULL,
      ystrataname = NULL,
      timeby = 100,
      main = "Kaplan-Meier Plot",
      pval = FALSE,
      …) {

      require (ggplot2)
      require (survival)
      require (gridExtra)

      ## Create Kaplan Meier plot
      if (is.null (ystratalabs)) {
      ystratalabs <- as.character (levels (summary (sfit)$strata))
      }
      (m <- max (nchar (ystratalabs)))
      if (is.null (ystrataname)) ystrataname <- "Strata"
      times <- seq (0, max (sfit$time), by = timeby)
      .df <- data.frame (time = sfit$time,
      n.risk = sfit$n.risk,
      n.event = sfit$n.event,
      surv = sfit$surv,
      strata = summary (sfit, censored = T)$strata,
      upper = sfit$upper,
      lower = sfit$lower)
      levels (.df$strata) <- ystratalabs
      zeros <- data.frame (time = 0,
      surv = 1,
      strata = ystratalabs,
      upper = 1,
      lower = 1)
      .df <- rbind.fill (zeros, .df)
      d <- length (levels (.df$strata))
      p <- ggplot (.df, aes (time, surv, groups = strata)) +
      geom_step (aes (linetype = strata),
      size = 0.7) +
      theme_bw () +
      opts (axis.title.x = theme_text (vjust = 0.5)) +
      scale_x_continuous (xlabs,
      breaks = times,
      limits = c (0, max (sfit$time))) +
      scale_y_continuous (ylabs, limits = c (0, 1)) +
      opts (panel.grid.minor = theme_blank ()) +
      opts (legend.position = c (ifelse (m < 10, .28, .35),
      ifelse (d < 4, .25, .35))) +
      opts (legend.key = theme_rect (colour = NA)) +
      labs (linetype = ystrataname) +
      opts (plot.margin = unit (c (0, 1, .5, ifelse (m < 10, 1.5, 2.5)),
      "lines")) +
      opts (title = main)

      ## Create a blank plot for place-holding
      ## .df <- data.frame ()
      blank.pic <- ggplot (.df, aes (time, surv)) +
      geom_blank () +
      theme_bw () +
      opts (axis.text.x = theme_blank (),
      axis.text.y = theme_blank (),
      axis.title.x = theme_blank (),
      axis.title.y = theme_blank (),
      axis.ticks = theme_blank (),
      panel.grid.major = theme_blank (),
      panel.border = theme_blank ())

      if (pval) {
      sdiff <- survdiff (eval (sfit$call$formula),
      data = eval (sfit$call$data))
      pval <- pchisq (sdiff$chisq,
      length (sdiff$n) – 1,
      lower.tail = FALSE)
      pvaltxt <- ifelse (pval < 0.0001, "p < 0.0001",
      paste ("p =", signif (pval, 3)))
      p <- p +
      annotate ("text",
      x = 0.6 * max (sfit$time),
      y = 0.1,
      label = pvaltxt)
      }

      if (table) {
      ## Create table graphic to include at-risk numbers
      risk.data <- data.frame (strata = summary (sfit,
      times = times,
      extend = TRUE)$strata,
      time = summary (sfit,
      times = times,
      extend = TRUE)$time,
      n.risk = summary (sfit,
      times = times,
      extend = TRUE)$n.risk)
      data.table <- ggplot (risk.data,
      aes (x = time,
      y = strata,
      label = format (n.risk, nsmall = 0))) + #, color = strata)) +
      geom_text (size = 3.5) +
      theme_bw () +
      scale_y_discrete (breaks = as.character (levels (risk.data$strata)),
      labels = ystratalabs) +
      # scale_y_discrete (#format1ter = abbreviate,
      # breaks = 1:3,
      # labels = ystratalabs) +
      scale_x_continuous ("Numbers at risk",
      limits = c (0, max (sfit$time))) +
      opts (axis.title.x = theme_text (size = 10, vjust = 1),
      panel.grid.major = theme_blank (),
      panel.grid.minor = theme_blank (),
      panel.border = theme_blank (),
      axis.text.x = theme_blank (),
      axis.ticks = theme_blank (),
      axis.text.y = theme_text (face = "bold",
      hjust = 1))

      data.table <- data.table +
      opts (legend.position = "none") +
      xlab (NULL) +
      ylab (NULL)

      data.table <- data.table +
      opts (plot.margin = unit (c (-1.5, 1, 0.1, ifelse (m < 10, 2.5, 3.5) – 0.28 * m), "lines"))

      ## Plotting the graphs
      ## p <- ggplotGrob (p)
      ## p <- addGrob (p, textGrob (x = unit (.8, "npc"), y = unit (.25, "npc"), label = pvaltxt,
      ## gp = gpar (fontsize = 12)))
      grid.arrange (p,
      blank.pic,
      data.table,
      clip = FALSE,
      nrow = 3,
      ncol = 1,
      heights = unit (c (2, .1, .25),
      c ("null", "null", "null")))
      if (returns) {
      a <- arrangeGrob (p,
      blank.pic,
      data.table,
      clip = FALSE,
      nrow = 3,
      ncol = 1,
      heights = unit (c (2, .1, .25),
      c ("null", "null", "null")))
      return (a)
      }
      }
      else {
      ## p <- ggplotGrob (p)
      ## p <- addGrob (p, textGrob (x = unit (0.5, "npc"), y = unit (0.23, "npc"),
      ## label = pvaltxt, gp = gpar (fontsize = 12)))
      print (p)
      if (returns) return (p)
      }
      }

  4. Johannes April 25, 2011 at 3:40 PM

    Thank you so much for this plot function, you just saved me hours of work! Super helpful.

  5. Rafik Margaryan June 20, 2011 at 3:04 AM

    Hi
    Nice piece of code.
    On my plot does not appear the p value, some suggestions?
    Thanks

  6. Mark Cowley September 1, 2011 at 9:03 AM

    nice code, thanks a lot.
    @Rafik, did you set pval=TRUE
    Based on Gil Thomas’ code, using the example given, the line legend had 6 strata: 1, 2, 3, rx=Lev, rx=Lev+5FU, rx=Obs
    I traced this down to a bug in the creation of zeros data.frame, fixed in the below + added roxygen comments.

    #’ Create a Kaplan-Meier plot using ggplots2
    #’
    #’ @param sfit a \code{\link[survival]{survfit}} object
    #’ @param table logical: Create a table graphic below the K-M plot, indicating at-risk numbers?
    #’ @param returns logical: if \code{TRUE}, return an arrangeGrob object
    #’ @param xlabs x-axis label
    #’ @param ylabs y-axis label
    #’ @param ystratalabs The strata labels. \code{Default = levels(summary(sfit)$strata)}
    #’ @param ystrataname The legend name. Default = “Strata”
    #’ @param timeby numeric: control the granularity along the time-axis
    #’ @param main plot title
    #’ @param pval logical: add the pvalue to the plot?
    #’ @return a ggplot is made. if return=TRUE, then an arrangeGlob object
    #’ is returned
    #’ @author Abhijit Dasgupta with contributions by Gil Tomas
    #’ \url{http://statbandit.wordpress.com/2011/03/08/an-enhanced-kaplan-meier-plot/#comment-55}
    #’ @export
    #’ @examples
    #’ \dontrun{
    #’ data(colon)
    #’ fit <- survfit(Surv(time,status)~rx, data=colon)
    #' ggkm(fit, timeby=500)
    #' }
    ggkm <- function(sfit,
    table = TRUE,
    returns = FALSE,
    xlabs = "Time",
    ylabs = "survival probability",
    ystratalabs = NULL,
    ystrataname = NULL,
    timeby = 100,
    main = "Kaplan-Meier Plot",
    pval = TRUE,
    …) {

    require(ggplot2)
    require(survival)
    require(gridExtra)

    if(is.null(ystratalabs)) {
    ystratalabs <- as.character(levels(summary(sfit)$strata))
    }
    (m <- max(nchar(ystratalabs)))
    if(is.null(ystrataname)) ystrataname <- "Strata"
    times <- seq(0, max(sfit$time), by = timeby)
    .df <- data.frame(
    time = sfit$time,
    n.risk = sfit$n.risk,
    n.event = sfit$n.event,
    surv = sfit$surv,
    strata = summary(sfit, censored = T)$strata,
    upper = sfit$upper,
    lower = sfit$lower
    )
    levels(.df$strata) <- ystratalabs
    zeros <- data.frame(time = 0,
    surv = 1,
    strata = factor(ystratalabs, levels=levels(.df$strata)),
    upper = 1,
    lower = 1
    )
    .df <- rbind.fill(zeros, .df)
    d <- length(levels(.df$strata))
    p <- ggplot(
    .df,
    aes(time, surv, groups = strata)
    ) +
    geom_step(aes(linetype = strata), size = 0.7) +
    theme_bw() +
    opts(axis.title.x = theme_text(vjust = 0.5)) +
    scale_x_continuous(
    xlabs,
    breaks = times,
    limits = c(0, max(sfit$time))
    ) +
    scale_y_continuous(
    ylabs,
    limits = c(0, 1)
    ) +
    opts(panel.grid.minor = theme_blank()) +
    opts(
    legend.position = c(
    ifelse(m < 10, .28, .35),
    ifelse(d < 4, .25, .35)
    )
    ) +
    opts(legend.key = theme_rect(colour = NA)) +
    labs(linetype = ystrataname) +
    opts(
    plot.margin = unit(
    c(0, 1, .5,
    ifelse(m < 10, 1.5, 2.5)
    ),
    "lines"
    )
    ) +
    opts(title = main)

    ## Create a blank plot for place-holding
    ## .df <- data.frame()
    blank.pic <- ggplot(.df, aes(time, surv)) +
    geom_blank() +
    theme_bw() +
    opts(
    axis.text.x = theme_blank(),
    axis.text.y = theme_blank(),
    axis.title.x = theme_blank(),
    axis.title.y = theme_blank(),
    axis.ticks = theme_blank(),
    panel.grid.major = theme_blank(),
    panel.border = theme_blank()
    )

    if(pval) {
    sdiff <- survdiff(eval(sfit$call$formula),
    data = eval(sfit$call$data))
    pval <- pchisq(
    sdiff$chisq,
    length(sdiff$n) – 1,
    lower.tail = FALSE
    )
    pvaltxt <- ifelse(
    pval < 0.0001,
    "p < 0.0001",
    paste("p =", signif(pval, 3))
    )
    p <- p + annotate("text",
    x = 0.6 * max(sfit$time),
    y = 0.1,
    label = pvaltxt
    )
    }

    if(table) {
    ## Create table graphic to include at-risk numbers
    risk.data <- data.frame(
    strata = summary(
    sfit,
    times = times,
    extend = TRUE
    )$strata,
    time = summary(
    sfit,
    times = times,
    extend = TRUE
    )$time,
    n.risk = summary(
    sfit,
    times = times,
    extend = TRUE
    )$n.risk
    )
    data.table <- ggplot(
    risk.data,
    aes(x = time, y = strata, label = format(n.risk, nsmall = 0))
    ) + #, color = strata)) +
    geom_text(size = 3.5) +
    theme_bw() +
    scale_y_discrete(
    breaks = as.character(levels(risk.data$strata)),
    labels = ystratalabs
    ) +
    # scale_y_discrete(#format1ter = abbreviate,
    # breaks = 1:3,
    # labels = ystratalabs) +
    scale_x_continuous(
    "Numbers at risk", limits = c(0, max(sfit$time))
    ) +
    opts(
    axis.title.x = theme_text(size = 10, vjust = 1),
    panel.grid.major = theme_blank(),
    panel.grid.minor = theme_blank(),
    panel.border = theme_blank(),
    axis.text.x = theme_blank(),
    axis.ticks = theme_blank(),
    axis.text.y = theme_text(
    face = "bold",
    hjust = 1
    )
    )

    data.table <- data.table +
    opts(legend.position = "none") +
    xlab(NULL) +
    ylab(NULL)

    data.table <- data.table +
    opts(
    plot.margin = unit(c(-1.5, 1, 0.1, ifelse(m < 10, 2.5, 3.5) – 0.28 * m), "lines")
    )

    ## Plotting the graphs
    ## p <- ggplotGrob(p)
    ## p <- addGrob(p, textGrob(x = unit(.8, "npc"), y = unit(.25, "npc"), label = pvaltxt,
    ## gp = gpar(fontsize = 12)))
    grid.arrange(
    p,
    blank.pic,
    data.table,
    clip = FALSE,
    nrow = 3,
    ncol = 1,
    heights = unit(
    c(2, .1, .25),
    c("null", "null", "null")
    )
    )
    if(returns) {
    a <- arrangeGrob(
    p,
    blank.pic,
    data.table,
    clip = FALSE,
    nrow = 3,
    ncol = 1,
    heights = unit(c(2, .1, .25),
    c("null", "null", "null"))
    )
    return(a)
    }
    }
    else {
    ## p <- ggplotGrob(p)
    ## p <- addGrob(p, textGrob(x = unit(0.5, "npc"), y = unit(0.23, "npc"),
    ## label = pvaltxt, gp = gpar(fontsize = 12)))
    print(p)
    if(returns) return(p)
    }
    }

    • Abhijit September 1, 2011 at 10:13 AM

      @Mark Crowley: Thank you for polishing this up. I’m quite happy that this code (and the modifications by Gil and yourself) have been very useful to the R community. I’ll modify the main post later tonight to reflect the changes. Thanks specially for the roxygen preamble.

  7. Kel September 2, 2011 at 9:19 AM

    Thank you so much for posting (and all the hard work)!

    Kelvin

  8. danielsbrewer October 27, 2011 at 5:55 AM

    Great function, and very useful thanks. Just two questions:
    1) Any ideas how to put censoring marks on the plot? I am sure geom_point plays a role but I can’t work out how to get the censoring marks data out of the survfit object
    2) I have found that if the labels are too long that this pushes the table out of alignment e.g.

    KM table alignment

  9. danielsbrewer October 27, 2011 at 6:21 AM

    Adding:

    geom_point(shape=3,aes(size=(!is.na(n.event) & n.event==0), linetype = strata)) +
    scale_size_manual(values = c(NA,2), legend = F) +

    to the ggplot function line adds the censoring points. This is messy and ends up introducing warnings for all data that isn’t censored e.g.
    “Removed 844 rows containing missing values (geom_point)”

    I couldn’t work out anyway to only plot the censored points

  10. danielsbrewer October 27, 2011 at 6:27 AM

    Even better just add:
    geom_point(shape=3,data=.df[!is.na(.df$n.event) & .df$n.event==0,]) +

  11. Jean-Louis abitbol November 4, 2011 at 7:34 AM

    When I run the script I have an error:
    + if(pval) {
    + sdiff <- survdiff(eval(sfit$call$formula), data = eval(sfit$call$data))
    + pval <- pchisq(sdiff$chisq, length(sdiff$n) – 1, lower.tail = FALSE )
    Erreur : entrée inattendu(e) dans :
    " sdiff <- survdiff(eval(sfit$call$formula), data = eval(sfit$call$data))
    pval pvaltxt <- ifelse(pval < 0.0001, "p < 0.0001", paste("p =", signif(pval, 3)))
    Erreur dans ifelse(pval < 1e-04, "p p }
    Erreur : ‘}’ inattendu(e) dans “}”
    and cannot see where it comes from.

    Any help appreciated. Thanks JL (abitbol{at}sent.com)

  12. Matt November 5, 2011 at 9:18 PM

    Hey mate, thought you might be interested in this http://mcfromnz.wordpress.com/2011/11/06/kaplan-meier-survival-plot-with-at-risk-table/

    Thanks for all your hard work on the code.

  13. t.koyama December 17, 2011 at 10:35 PM

    Hey mate, thought you might be interested in this http://biostat.mc.vanderbilt.edu/wiki/Main/TatsukiRcode#kmplot

    My preferred tool is R but not ggplot2…

    • Abhijit December 19, 2011 at 8:44 AM

      Very nice. R does have several approaches for doing the same thing, so the more the merrier. Like some of your other code too :)

  14. Rafik Margaryan (@raffdoc) February 27, 2012 at 9:22 AM

    Hi guys,
    Cool stuff happening here! I have a question!
    If in one of sruvivil fit groups the event is not happening , your code returnes with error !
    Here is the error
    Error in `levels<-.factor`(`*tmp*`, value = "age2group=< 16 years") :
    number of levels differs
    How can I fix this.
    Thanks

  15. Mark B April 5, 2012 at 12:29 PM

    Hi Abhijit

    Nice work, however for some reason the code in the box above has an error. On line 046 the aesthetics of ggplot are presently … ggplot(.df, aes( time, surv, groups = strata)) + … this should read …ggplot(.df, aes( time, surv, group = strata)) +…

    With that it works really nicely – thanks

    Mark

  16. Gicu April 19, 2012 at 5:35 PM

    Great work! nevertheless, if the data has no strata, your function gives an error. for instance:

    data(colon)

    fit <- survfit(Surv(time,status)~1, data=colon)

    ggkm(fit, timeby=500)

    would give you :

    Error in data.frame(time = sfit$time[subs2], n.risk = sfit$n.risk[subs2], :
    arguments imply differing number of rows: 1, 0
    In addition: Warning message:
    In max(nchar(ystratalabs)) :
    no non-missing arguments to max; returning -Inf

    I believe an if statement would be the yeoman solution for this.

  17. Petra May 3, 2012 at 10:13 AM

    Hey,

    After using the ggkm function, I get this error:
    Error in `[.data.frame`(data, “group”) : undefined columns selected

    My data consists of two groups, and the plot() function in R looks good.

    How can I fix this error?

    Thanks in advance.
    Petra

  18. nzcoops May 5, 2012 at 12:03 AM

    Working code is available here http://mcfromnz.wordpress.com/2012/05/05/kaplan-meier-survival-plot-with-at-risk-table-by-sub-groups/. The issue was just the argument ‘groups = strata’ in the ggplot(…) line. Remove this and the code runs fine.

  19. Pingback: Kaplan-Meier plots using ggplots2 (updated) | Stat Bandit

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 327 other followers

%d bloggers like this: