Kaplan-Meier plots using ggplots2 (updated)

About 3 years ago I published some code on this blog to draw a Kaplan-Meier plot using ggplot2. Since then, ggplot2 has been updated (from 0.8.9 to and has changed syntactically. Since that post, I have also become comfortable with Git and Github. I have updated the code, edited it for a small error, and published it in a Gist. This gist has two functions, ggkm (basic Kaplan-Meier plot) and ggkmTable (enhanced Kaplan-Meier plot with table showing numbers at risk at various times).

This gist is published here. If you find errors or want to enhance these functions, please fork, update and send me a link to your fork in the comments. I’ll pull and merge them. Unfortunately Github doesn’t allow pull requests directly for gists (see here for the StackOverflow answer I’m basing this on).

If you want to go back to the original post, you can read it here.

An enhanced Kaplan-Meier plot

We often see, in publications, a Kaplan-Meier survival plot, with a table of the number of subjects at risk at different time points aligned below the figure. I needed this type of plot (or really, matrices of such plots) for an upcoming publication. Of course, my preferred toolbox was R and the ggplot2 package.

There were other attempts to do this type of plot in ggplot2, mainly by Gary Collins and an anonymous author as seen on the ggplot2 mailing list. There was also an invaluable blog on Learn-R on plotting a table using geom_text.

I decided to use Gary’s code as my starting point. I wanted to add a log-rank test p-value for stratified KM curves on the plot, and I wanted to code it so that the text in the table remained aligned to the x-axis even when my strata names changed with different data sets. The former was relatively easy, but the latter was challenging (and seemed to be unsolved by Gary, who had posed it as a question). Both were instructive, and took me back a layer to the grid graphics engine (on which both ggplot2 and lattice are based) and the gridExtra package.

The placement of the p-value in the plot is a case of a single character string being placed at a certain point on the graph, which I had done with the text command in base R. However, the geom_text layer seems to be intended for plotting text at all points in the data and not for inserting a single text string. However, I learned of the grid.text function from the grid package, which would enable me to do this for grid-based graphics.

The second part is to plot a table aligned to the x-axis of the plot. The plotting was achieved using Gary’s code using the method outlined on the Learn-R website. The problem was alignment. A few experiments showed that alignment depended on the size of the strata labels used in the plot. I started playing with the plot.margin option in ggplot2 and then figured out that I could maintain the alignment (more or less) if the left margin was computed based on the length of the labels. Some experimentation got me to a reasonable formula for this computation.

Now I had to put the 3 parts together. Both the plot and the table were ggplot objects, so I knew I could stitch them together with grid.arrange from the gridExtra package. However, the p-value was being written using a separate grid.text command. I had to learn more about grid. I decided to use basic grid commands to convert the KM plot to a grid object using ggplotGrob and add the p-value text string to it using addGrob and textGrob. I could now stitch this together with the plotted table using grid.arrange, using the same layout specs.

However, this still wasn’t ideal since the plots were overlapping a bit. I decided to put a buffer blank image in between. In base graphics, I could achieve this with plot(..., type="n"). In ggplot, I learned about geom_blank() and theme_blank() to create a blank image. This I then put in a thin row between the plot and the table.

I incorporated this process into the function ggkm which is given below. An example plot generated by this code:

fit <- survfit(Surv(time,status)~rx, data=colon)
ggkm(fit, timeby=500)

is shown here:

The ggkm function is given below:
Update:The following code is modified thanks to Baptiste Auguie’s comments, which were very helpful. I also make the p-value text optional for the plot, with the default settings omitting the p-values.

Update (3/26/2011) I’ve further updated the code based on comments which allow it to work on a naive install (albeit with the required packages), as well as replacing “df” with “dframe” to get around the fact that “df” is a R function.

Update (9/1/2011) Both Gil Thomas and Mark Cowley have made improvements to the code that have made it more stable and universally run-able (is that a word?). The following reflects Mark’s last edits to the code.

#’ Create a Kaplan-Meier plot using ggplot2
#’ @param sfit a \code{\link[survival]{survfit}} object
#’ @param table logical: Create a table graphic below the K-M plot, indicating at-risk numbers?
#’ @param returns logical: if \code{TRUE}, return an arrangeGrob object
#’ @param xlabs x-axis label
#’ @param ylabs y-axis label
#’ @param ystratalabs The strata labels. \code{Default = levels(summary(sfit)$strata)}
#’ @param ystrataname The legend name. Default = “Strata”
#’ @param timeby numeric: control the granularity along the time-axis
#’ @param main plot title
#’ @param pval logical: add the pvalue to the plot?
#’ @return a ggplot is made. if return=TRUE, then an arrangeGlob object
#’ is returned
#’ @author Abhijit Dasgupta with contributions by Gil Tomas
#’ \url{}
#’ @export
#’ @examples
#’ \dontrun{
#’ data(colon)
#’	fit <- survfit(Surv(time,status)~rx, data=colon)
#'	ggkm(fit, timeby=500)
#' }
ggkm <- function(sfit, table = TRUE, returns = FALSE,
xlabs = "Time", ylabs = "survival probability",
ystratalabs = NULL, ystrataname = NULL,
timeby = 100, main = "Kaplan-Meier Plot",
pval = TRUE, ...) {
if(is.null(ystratalabs)) {
   ystratalabs <- as.character(levels(summary(sfit)$strata))
m <- max(nchar(ystratalabs))
if(is.null(ystrataname)) ystrataname <- "Strata"
times <- seq(0, max(sfit$time), by = timeby)
.df <- data.frame(time = sfit$time, n.risk = sfit$n.risk,
    n.event = sfit$n.event, surv = sfit$surv, strata = summary(sfit, censored = T)$strata,
    upper = sfit$upper, lower = sfit$lower)
levels(.df$strata) <- ystratalabs
zeros <- data.frame(time = 0, surv = 1, strata = factor(ystratalabs, levels=levels(.df$strata)),
    upper = 1, lower = 1)
.df <- rbind.fill(zeros, .df)
d <- length(levels(.df$strata))
p <- ggplot(.df, aes(time, surv, groups = strata)) +
    geom_step(aes(linetype = strata), size = 0.7) +
    theme_bw() +
    opts(axis.title.x = theme_text(vjust = 0.5)) +
    scale_x_continuous(xlabs, breaks = times, limits = c(0, max(sfit$time))) +
    scale_y_continuous(ylabs, limits = c(0, 1)) +
    opts(panel.grid.minor = theme_blank()) +
    opts(legend.position = c(ifelse(m < 10, .28, .35), ifelse(d < 4, .25, .35))) +
    opts(legend.key = theme_rect(colour = NA)) +
    labs(linetype = ystrataname) +
    opts(plot.margin = unit(c(0, 1, .5, ifelse(m < 10, 1.5, 2.5)), "lines")) +
    opts(title = main)

## Create a blank plot for place-holding
## .df <- data.frame()
blank.pic <- ggplot(.df, aes(time, surv)) +
    geom_blank() +
    theme_bw() +
    opts(axis.text.x = theme_blank(), axis.text.y = theme_blank(),
        axis.title.x = theme_blank(), axis.title.y = theme_blank(),
        axis.ticks = theme_blank(), panel.grid.major = theme_blank(),
        panel.border = theme_blank())
if(pval) {
    sdiff <- survdiff(eval(sfit$call$formula), data = eval(sfit$call$data))
    pval <- pchisq(sdiff$chisq, length(sdiff$n) – 1, lower.tail = FALSE)
    pvaltxt <- ifelse(pval < 0.0001, "p < 0.0001", paste("p =", signif(pval, 3)))
    p <- p + annotate("text", x = 0.6 * max(sfit$time), y = 0.1, label = pvaltxt)
if(table) {
    ## Create table graphic to include at-risk numbers <- data.frame(strata = summary(sfit, times = times, extend = TRUE)$strata,
        time = summary(sfit, times = times, extend = TRUE)$time,
        n.risk = summary(sfit, times = times, extend = TRUE)$n.risk)
    data.table <- ggplot(, aes(x = time, y = strata, label = format(n.risk, nsmall = 0))) +
        #, color = strata)) +
        geom_text(size = 3.5) +
        theme_bw() +
        scale_y_discrete(breaks = as.character(levels($strata)), labels = ystratalabs) +
        # scale_y_discrete(#format1ter = abbreviate,
        # breaks = 1:3,
        # labels = ystratalabs) +
        scale_x_continuous("Numbers at risk", limits = c(0, max(sfit$time))) +
        opts(axis.title.x = theme_text(size = 10, vjust = 1), panel.grid.major = theme_blank(),
        panel.grid.minor = theme_blank(), panel.border = theme_blank(),
        axis.text.x = theme_blank(), axis.ticks = theme_blank(),
        axis.text.y = theme_text(face = "bold", hjust = 1))
    data.table <- data.table + opts(legend.position = "none") +
        xlab(NULL) + ylab(NULL)
    data.table <- data.table +
        opts(plot.margin = unit(c(-1.5, 1, 0.1, ifelse(m < 10, 2.5, 3.5) – 0.28 * m), "lines"))
## Plotting the graphs
## p <- ggplotGrob(p)
## p <- addGrob(p, textGrob(x = unit(.8, "npc"), y = unit(.25, "npc"), label = pvaltxt,
    ## gp = gpar(fontsize = 12)))
    grid.arrange(p, blank.pic, data.table,
        clip = FALSE, nrow = 3, ncol = 1,
        heights = unit(c(2, .1, .25),c("null", "null", "null")))
    if(returns) {
        a <- arrangeGrob(p, blank.pic, data.table, clip = FALSE,
            nrow = 3, ncol = 1, heights = unit(c(2, .1, .25),c("null", "null", "null")))
else {
    ## p <- ggplotGrob(p)
    ## p <- addGrob(p, textGrob(x = unit(0.5, "npc"), y = unit(0.23, "npc"),
    ## label = pvaltxt, gp = gpar(fontsize = 12)))
    if(returns) return(p)

ggplot2 joy

I’ve been working on a long-term (25+yr) longitudinal study of rheumatoid arthritis with my boss. He just walked in and asked if I could create a plot showing the trajectory of pain scores over time for each subject, separated by educational level (4 groups). Having now worked with ggplot2 for a while, and learning more at the last two DC useR meetups, I realized that I could formulate this in ggplot very easily and in short order. Hooray!!! Basically, all I needed to do was:

ggplot(data, aes(time, pain,, color=education.level))+geom_line()

I actually spent more time figuring out how to change the legend title 🙂 (fyi, it is + labs(colour='Education'), with the British spelling being necessary).

I’m actually pretty thrilled that I could use ggplot2 on short order to make this plot.

On another note, my friend Brian Danielak gave a brilliant presentation at last night’s DC R Users meetup on some ggplot2-based development he’s doing for graphical ANOVA. A link to his talk should be on the site in short order, so please do check it out.

Forest plots using R and ggplot2

Forest plots are most commonly used in reporting meta-analyses, but can be profitably used to summarise the results of a fitted model. They essentially display the estimates for model parameters and their corresponding confidence intervals.

Matt Shotwell just posted a message to the R-help mailing list with his lattice-based solution to the problem of creating forest plots in R. I just figured out how to create a forest plot for a consulting report using ggplot2. The availability of the geom_pointrange layer makes this process very easy!!

Update January 26, 2016: ggplot2 has changed a bit in the last five years. I’ve created a gist that will be easier to maintain. The link is here. <- function(d){
 # d is a data frame with 4 columns
 # d$x gives variable names
 # d$y gives center point
 # d$ylo gives lower limits
 # d$yhi gives upper limits
 p <- ggplot(d, aes(x=x, y=y, ymin=ylo, ymax=yhi))+
 geom_hline(yintercept = 0, linetype=2)+

If we start with some dummy data, like

d <- data.frame(x = toupper(letters[1:10]),
                y = rnorm(10, 0, 0.1))
d <- transform(d, ylo = y-1/10, yhi=y+1/10)

we can get the following graph: