Stat Bandit

Musings on statistics, computation and data research

The need for documenting functions

My current work usually requires me to work on a project until we can submit a research paper, and then move on to a new project. However, 3-6 months down the road, when the reviews for the paper return, it is quite common to have to do some new analyses or re-analyses of the data. At that time, I have to re-visit my code!

One of the common problems I (and I’m sure many of us) have is that we tend to hack code and functions with the end in mind, just getting the job done. However, when we have to re-visit code when it’s no longer fresh in our memory, it takes significant time to decipher what some code snippet or function is doing, or why we did it in the first place. Then, when our paper gets published and a reader wants our code to try, it’s a bear getting it into any kind of shareable form. Recently I’ve had both issues, and decided, enough was enough!!

R has a fantastic package roxygen2 that makes documenting functions quite easy. The documentation sits just above the function code, so it is there front and center. Taking 2-5 minutes to write even bare-bones documentation that includes

  • what the function does
  • what the inputs are (in English) and their required R classes
  • what the output is and its R class
  • maybe one example

makes the grief of re-discovering the function and trying to decipher it go away. What does this look like? Here’s a recent example from my files:

#' Find column name corresponding to a particular functional
#' The original data set contains very long column headers. This function
#' does a keyword search over the headers to find those column headers that
#' match a particular keyword, e.g., mean, median, etc.
#' @param x The data we are querying (data.frame)
#' @param v The keyword we are searching for (character)
#' @param ignorecase Should case be ignored (logical)
#' @return A vector of column names matching the keyword
#' @export
findvar <- function(x, v, ignorecase = TRUE) {
  if(!is.character(v)) stop('v must be character')
  if(!is.data.frame(x)) stop('x must be a data.frame')
  v <- grep(v, names(x), value = TRUE, ignore.case = ignorecase)
  if(length(v) == 0) v <- NA
  return(v)
}

My code above might not meet best practices, but it achieves two things for me. It reminds me of why I wrote this function, and tells me what I need to run it. This particular snippet is not part of any R package (though I could, with my new directory structure for projects, easily create a project-specific package if I need to). Of course this type of documentation is required if you are indeed writing packages.  
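As a quick illustration of how I end up calling it (the data frame and its column names here are made up purely for the example):

```r
dat <- data.frame(
  mean_systolic_bp  = c(120, 118, 131),
  median_heart_rate = c(72, 80, 68),
  subject_id        = 1:3
)
findvar(dat, "mean")    # returns "mean_systolic_bp"
findvar(dat, "Median")  # case is ignored by default, so "median_heart_rate"
```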

Update: As some of you have pointed out, the way I’m using this is as a fancy form of commenting, regardless of future utility in packaging. 100% true, but it’s actually one less thing for me to think about. I have a template, fill it out, and I’m done, with all the essential elements included. Essentially this creates a “minimal viable comment” for a function, and I only need to look in one place later to see what’s going on. I still comment my code, but this still gives me value for not very much overhead.


There are several resources for learning about roxygen2. First and foremost is the chapter Documenting functions from Hadley Wickham’s online book. roxygen2 also has its own tag on StackOverflow.

On the software side, RStudio supports roxygen2; see here. Emacs/ESS also has extensive roxygen2 support. The Rtools package for Sublime Text provides a template for roxygen2 documentation. So getting started in the editor of your choice is not a problem.

Newer dplyr!!

Last week Statistical Programming DC had a great meetup with my partner-in-crime Marck Vaisman talking about data.table and dplyr as powerful, fast tools for data manipulation in R. Today Hadley Wickham announced the release of dplyr v0.2, which is packed with new features and incorporates the “piping” syntax from Stefan Holst Bache's magrittr package. I suspect that these developments will change the semantics of working in R, especially during the data munging phase. I think the jury is still out on whether the “piping” model of function chaining will lead to better (and not just more jumbled) coding practice, but for some of my use cases, especially with the previous version of dplyr, it made me happier than before.



Quick notes on file management in Python

This is primarily for my recollection

To expand ~ in a path name:
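The snippet here seems to have gone missing; the standard-library approach is `os.path.expanduser` (a minimal sketch, with a made-up path):

```python
import os

# open() and friends do not expand "~" themselves;
# expanduser() substitutes the current user's home directory
path = os.path.expanduser('~/data/results.csv')
```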


To get the size of a directory:

import os
def getsize(start_path = '.'):
    totalsize = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            totalsize += os.path.getsize(fp)
    return totalsize
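For comparison, the same walk can be written with `pathlib` (available from Python 3.4 onward); this is an alternative sketch, not a drop-in from the original note:

```python
from pathlib import Path

def getsize_pathlib(start_path='.'):
    # Sum the sizes of all regular files below start_path, recursively
    return sum(p.stat().st_size for p in Path(start_path).rglob('*') if p.is_file())
```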

IPython notebooks: the new glue?

IPython notebooks have become a de facto standard for presenting Python-based analyses and talks, as evidenced by recent Pycon and PyData events. As anyone who has used them knows, they are great for “reproducible research”, presentations, and sharing via the nbviewer. There are extensions connecting IPython to R, Octave, Matlab, Mathematica, and SQL, among others.

However, the brilliance of the design of IPython is in the modularity of the underlying engine (3 cheers to Fernando Perez and his team). About a year ago, a Julia engine was written, allowing Julia to be run using the IPython notebook platform (named, appropriately, IJulia). More recently, an R engine has been developed to enable R to run natively on the IPython notebook platform. Though the engines cannot be run interchangeably during the same session of the notebook server, it shows that a common user-facing interface now exists for running the three most powerful open-source scientific and data-centric software systems.

Another recent advancement in the path of IPython notebooks as the common medium for reporting data analyses is my friend Ramnath‘s proof-of-concept work in translating R Markdown documents to IPython notebooks.

I encourage you, especially my colleagues using R and/or Python in the data space, to explore IPython notebooks as well as the R and Julia extensions.

A new data-centric incubator project in DC

District Data Labs is a new endeavor by members of the local data community (myself included) to increase educational outreach about data-related topics through workshops and other media to the local data community.

We want District Data Labs to be an efficient learning resource for people who want to enhance and expand their analytical and technical skill sets. Whether you are a statistician who wants to learn more about programming and creating useful data products, or a software engineer who wants to learn how to properly analyze data and use statistical methods to improve the basic analyses you’re doing, we want to equip you with the right skills to better yourself and advance your career.

DDL has recently run several PyData workshops, and one on using Python for creating Data Apps is forthcoming.

DDL just announced a new initiative to bring the data community closer: a Data Science Project Incubator where like-minded people can collaborate and develop data-centric projects under the umbrella of DDL. You can find out more details about this new initiative here.

Kaplan-Meier plots using ggplot2 (updated)

About 3 years ago I published some code on this blog to draw a Kaplan-Meier plot using ggplot2. Since then, ggplot2 has been updated well past version 0.8.9 and has changed syntactically. Since that post, I have also become comfortable with Git and GitHub. I have updated the code, fixed a small error, and published it as a Gist. This gist has two functions: ggkm (basic Kaplan-Meier plot) and ggkmTable (enhanced Kaplan-Meier plot with a table showing numbers at risk at various times).

This gist is published here. If you find errors or want to enhance these functions, please fork, update, and send me a link to your fork in the comments. I’ll pull and merge the changes. Unfortunately GitHub doesn’t allow pull requests directly on gists (see here for the StackOverflow answer I’m basing this on).

If you want to go back to the original post, you can read it here.

Slidify: Data driven presentations

Publisher’s note: This post originally appeared on August 1, 2013 on the Data Community DC blog.

Presentations are the stock-in-trade of consultants, managers, teachers, public speakers, and, probably, you. We all have to present our work at some level, to someone we report to, to our peers, or to introduce newcomers to our work. Of course, presentations are passé, so why blog about them? There’s already PowerPoint, and maybe Keynote. What more need we talk about?

Well, technology has changed, and vibrant, dynamic presentations are here today for everyone to see. No, I mean literally everybody, if I so choose; all anyone needs to see them is a web browser. Graphs can be interactive, flow can be nonlinear, and presentations can be fun and memorable again!

But PowerPoint is so easy! You click, paste, type, add a bit of glitz, and you’re done, right? Well, as most of us can attest to, not really. It takes a bit more effort and putzing around to really get things in reasonable shape, let alone great shape.

And there are powerful alternatives, which are simple and easy, and do a pretty great job on their own. Oh, and by the way, if you have data and analysis results to present, they are super slick: a one-stop shop from analysis to presentation. Really!! Actually there are a few out there, but I’m going to talk about just one. My favorite: Slidify.

Slidify is a fantastic R package that takes a document written in RMarkdown (Markdown, possibly interspersed with R code that produces tables, figures, or interactive graphics), weaves in the results of that code, and then formats it into a beautiful web presentation using HTML5. You can choose a format template (it comes with quite a few) or brew your own. You can make your presentation look and behave the way you want, even like a Prezi (using ImpressJS). You can also make interactive questionnaires and even put in windows to code interactively within your presentation!!
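To make this concrete, here is a minimal, hypothetical slide source; the slide separator and chunk options vary by Slidify template, so treat this as a sketch rather than a canonical example:

````markdown
## A data-driven slide

The figure below is generated when the deck is built, so it updates with the data:

```{r scatter, echo = FALSE}
plot(mpg ~ wt, data = mtcars)
```
````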

Slidify is obviously feature-rich and infinitely customizable, but that’s not really what attracted me to it. It was the ability to write presentations in Markdown, which is super easy and lets me put down content quickly without worrying about appearance (between you and me, I’m writing this post in Markdown, on a Nexus 7). It also lets me weave in the results of my analyses easily, keeping the code in one place within my document. So when my data changes, I can create an updated presentation literally with the press of a button. Markdown is geared to creating HTML documents. Pandoc and MultiMarkdown let you create HTML presentations from Markdown, but not living, data-driven presentations the way Slidify does. I get to put my presentations up on GitHub or on RPubs, or even in Dropbox, directly using Slidify, share the link, and I’m good to go.

Dr. Ramnath Vaidyanathan created Slidify to help him teach more effectively at McGill University, where he is on the faculty of the School of Business. But for me, it is now the go-to tool for creating presentations, even when I don’t need to incorporate data. If you’re an analyst and live in the R ecosystem, I highly recommend Slidify. If you don’t and use other tools, Slidify is a great reason to come and see what R can do for you, even if it’s just to create great presentations. There are plenty of great examples of what’s possible.

Input data interactively into R

To input data interactively into R, use the function readline:

x <- readline("What is your answer? ")
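One wrinkle worth remembering: readline always returns a character string, so numeric input needs an explicit conversion. A minimal sketch:

```r
# readline() returns character; convert before doing arithmetic
n <- as.numeric(readline("How many simulations? "))
if (is.na(n)) stop("Please enter a number")
```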

The many faces of statistics/data science: Can’t we all just get along and learn from each other?

Two blog posts in the last 24 hours caught my attention. First was this post by Jeff Leek noting that there are many fields which are applied statistics by another name (and I’d add operations research to his list). The second is an excellent post on Cloudera’s blog on constructing case-control studies. It is generally excellent, but has this rather unfortunate (in my view) statement:

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

First of all, this ignores what biostatisticians have been doing in collaboration with epidemiologists for decades. The design of a study, as any statistician understands, is just as important as the analysis, if not more so, and statisticians have been at the forefront of pushing good study design. Second, it shows a fundamental lack of understanding of the breadth of what statistics as a discipline encompasses. Third, it reiterates Jeff’s point: many fields that are considered distinct are essentially applied statistics. There seems to be a strong push to claim a new field as different and sexier than what has come before (an issue of branding and worth, perhaps?) without understanding what is already out there.

Statistics as a field has been guilty of this as well. The most obvious and wasteful consequence of this is “re-inventing the wheel”, rather than leveraging the power of other discoveries. Ownership of an idea is a powerful concept, but there must be the recognition that while translating a concept for a new audience is useful and extremely necessary, merely claiming ownership while willfully ignoring the developments by colleagues in another field is wasteful and disingenuous.

A recent discussion with a colleague further reiterated this point even within statistics. Some of the newer developments in a relatively new methodologic space are along the same lines of theoretical development in an older methodologic space. The new guys are coming up against the same brick walls as the earlier researchers, and there seems to be a lack of understanding among the new researchers of the path already travelled (since the keywords are different and not necessarily directly related, Google Scholar fails).

The bottom line here is the strong need for more cross-talk between disciplines, more collaboration among researchers, having greater understanding for the knowledge already out there, and more breadth in our own training and knowledge.

Pocketbook costs of software

I have always been provided SAS as part of my job, so I never really realized how much it cost. I’ve bought Stata before, and of course R :). I recently found out how much a reasonable bundle of SAS modules along with base SAS costs per year per seat, at least under the GSA. I tried finding out how much a comparable IBM SPSS bundle costs, but their web page was “not available”. Stata costs in the ballpark of $1700 for a permanent license of Stata/SE, or $845 for an annual license. SAS costs over 5 times that per seat for similar functionality (ouch!!). R, with its quirks but with similar if not enhanced functionality in a lot of areas, is of course freely downloadable.

Matlab is another piece of software I’ve bought as part of my job. For a reasonable bundle, in an academic setting, it is close to $3000. Of course, here it’s a bit easier to pick and choose, since I don’t need most of the modules, which are of more interest to engineers.

