Cleaning up tables

This post is re-published from my blog. Please see it for the latest updates from May 16, 2018.


One of things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) that the name of each subgroup be repeated down one column of the table.

One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things. But, of course, this being a R blog, I wanted to create a function that would fix this. I’ve created hack-y functions for this before, but a neat trick pointed out here gave me an idea for a more elegant solution. It also meant I had to use the tidyeval paradigm a bit, which I figured I should at least become familiar with.

Here’s what I want to do

Take a table like this:

Location Gender values
Rural Male 32.74
Rural Female 25.18
Urban Male 40.48
Urban Female 25.28

to something like this:

Location Gender values
Rural Male 32.74
Female 25.18
Urban Male 40.48
Female 25.28

The point is that the first column has repeating values, and I just want the first row of the cluster of rows corresponding to Rural and Urban to have text, the rest being blank. I find this a cleaner look and more typical of tables I see in papers.

This is purely for presentation purposes.I would never do this for data frames I’ll still analyze, since the blank cells screw up things. Of course this could be fixed easily using last value carried forward imputation on the column.

A solution

I created this simple function to do this for a single column within a magrittr pipeline:

clean_col = function(x, colvar){
colv = enquo(colvar)
x %>% group_by(!!colv) %>%
      mutate(rown = row_number()) %>%
      ungroup() %>%
      mutate_at(vars(!!colv), funs(ifelse(rown > 1, '', .))) %>%
      select (-rown)

The first thing to note here is that I’m using quosures and quasiquotation to allow the pipeline to work with the function’s inputs, specifically the column name, which is provided as an unquoted name. Admittedly this was done without much understanding, following examples on Edwin Thoen’s excellent blog.

The second thing was the use of the dummy rown column to identify the first row of each cluster of rows defined by the variable colvar. This was inspired by this blog I read through R-Bloggers yesterday. This trick allowed me to easily “blank out” the appropriate cells in the colvar column.

Desirable updates

There are two directions I want to take this, but I don’t understand tidyeval or functions with variable numbers of arguments well enough yet to do it. The simpler extension is to do the same process using two or more columns, rather than one column. For example, taking

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
Rural Male 55-59 18.1
Rural Male 60-64 26.9
Rural Male 65-69 41.0
Rural Male 70-74 66.0
Rural Female 50-54 8.7
Rural Female 55-59 11.7
Rural Female 60-64 20.3
Rural Female 65-69 30.9
Rural Female 70-74 54.3
Urban Male 50-54 15.4
Urban Male 55-59 24.3
Urban Male 60-64 37.0
Urban Male 65-69 54.6
Urban Male 70-74 71.1
Urban Female 50-54 8.4
Urban Female 55-59 13.6
Urban Female 60-64 19.3
Urban Female 65-69 35.1
Urban Female 70-74 50.0


Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
55-59 18.1
60-64 26.9
65-69 41.0
70-74 66.0
Rural Female 50-54 8.7
55-59 11.7
60-64 20.3
65-69 30.9
70-74 54.3
Urban Male 50-54 15.4
55-59 24.3
60-64 37.0
65-69 54.6
70-74 71.1
Urban Female 50-54 8.4
55-59 13.6
60-64 19.3
65-69 35.1
70-74 50.0

The second extension would be to create truly nested row labels, like this:

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
55-59 18.1
60-64 26.9
65-69 41.0
70-74 66.0
Female 50-54 8.7
55-59 11.7
60-64 20.3
65-69 30.9
70-74 54.3
Urban Male 50-54 15.4
55-59 24.3
60-64 37.0
65-69 54.6
70-74 71.1
Female 50-54 8.4
55-59 13.6
60-64 19.3
65-69 35.1
70-74 50.0

I can create these on a case-by-case basis, but I’m not sure how to do this in a function, yet. Looking forward to comments.


I can do the first example with the following code (based on Edwin Thoen’s blog, again):

clean_col = function(x, ...){
  colvs = quos(...)
  x %>% group_by(!!!colvs) %>%
    mutate(rown = row_number()) %>%
    ungroup() %>%
    mutate_at(vars(!!!colvs), funs(ifelse(rown > 1, '', .))) %>%
    select (-rown)

Update 2

The second example can be solved by extracting elements of quosures, which are essentially a list:

clean_cols = function(x, ...){
  colvs = quos(...)
  for(i in 1:length(colvs)){
    rowvar =  rlang::sym(paste0('rown',i)) # Create dummy
    x = x %>% group_by(!!!colvs[1:i]) %>%
      mutate(!!rowvar := row_number()) %>%
  for(i in 1:length(colvs)){
    rowvar = rlang::sym(paste0('rown',i))
    x = x %>% mutate_at(vars(!!colvs[[i]]),
      funs(ifelse(!!rowvar > 1, '', .)))
  x = x %>% select(-starts_with('rown')) # remove the dummies

Tidying messy Excel data (tidyxl)

Reposted from Abhijit’s blog. Some <- have been replaced by = due to idiosyncracies of the WordPress platform.

Well, here’s what I was dealing with:

Exemplar Excel file from collaborator

Exemplar Excel file from collaborator

Notice that we have 3 header rows, first with patient IDs, second with spine region, and third with variable names (A and B, to protect the innocent).


A dataset that, for each patient and each angle gives us corresponding values of A and B. So this would be a four-column data set with ID, angle, A and B.

Attempt 1 (readxl)

d1 <- readxl::read_excel('spreadsheet1.xlsx')

## # A tibble: 6 x 26
## X__1 patient `44` `44__1` `10` `10__1` `3` `3__1` `53` `53__1`
## 1 IDS T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6
## 2 angles A B A B A B A B
## 3 60 31.83… 1 31.52… 1 32.9… 0 31.8… 0
## 4 65 31.66… 1 31.33… 1 32.2… 0 32.3… 0
## 5 70 31.45… 1 31.09… 0.20200… 31.7… 0 32.5… 0
## 6 75 31.08… 1 30.96… 0.44831… 31.2… 8.641… 32.3… 1
## # … with 16 more variables: `2` , `2__1` `8` ,
## # `8__1` , `6` , `6__1` , `43` , `43__1` ,
## # `48` , `48__1` , `46` , `46__1` , `4` ,
## # `4__1` , `9` , `9__1`

This strategy gives us funky column names, and pushes two of the headers into data rows. Since the headers are in rows, they’re a little harder to extract and work with. More worrisome is the fact that since the headers leaked into the data rows, the columns are all of type character rather than type numeric, which would now require further careful conversion after cleaning. So I don’t think readxl is the way to go here, if there’s a better solution.

Attempt 2 (tidyxl)

d2 <- tidyxl::xlsx_cells('spreadsheet1.xlsx')

## # A tibble: 6 x 21
## sheet address row col is_blank data_type error logical numeric
## 1 T5T6 B1 1 2 FALSE character NA NA
## 2 T5T6 C1 1 3 FALSE numeric NA 44.
## 3 T5T6 D1 1 4 FALSE numeric NA 44.
## 4 T5T6 E1 1 5 FALSE numeric NA 10.
## 5 T5T6 F1 1 6 FALSE numeric NA 10.
## 6 T5T6 G1 1 7 FALSE numeric NA 3.
## # … with 12 more variables: date , character ,
## # character_formatted , formula , is_array ,
## # formula_ref , formula_group , comment , height ,
## # width , style_format , local_format_id

The xlsx_cells captures the data in a tidy fashion, explicitly calling out rows and columns and other metadata within each cell. We can clean up this data using tidyverse functions:

cleanData1 = function(d) {
  angle = d %>% filter(row >= 4, col == 1) %>% pull(numeric)
  name = d %>% filter(row %in% c(1,3), col >= 3) %>%
    mutate(character = ifelse(is.na(character),
                              character)) %>%
    select(row, col, character) %>%
    filter(!is.na(character)) %>%
    spread(row, character) %>%
    unite(ID, `1`:`3`, sep = '_') %>%
  data = d  %>% filter(row >= 4, col >= 3) %>%
    filter(!is.na(numeric)) %>%
    select(row, col, numeric) %>%
    spread(col, numeric) %>%
    select(-row) %>%
    set_names(name) %>%
    cbind(angle) %>%
    gather(variable, value, -angle) %>%
    separate(variable, c('ID','Measure'), sep = '_') %>%
    spread(Measure, value) %>%
    select(ID, angle, A, B) %>%
    arrange(ID, angle)

##   ID angle        A        B
## 1 10    60 31.52867 1.000000
## 2 10    65 31.33477 1.000000
## 3 10    70 31.09272 0.202002
## 4 10    75 30.96078 0.448317
## 5 10    80 30.79397 0.670876
## 6 10    85 30.52185 0.461406

This is a lot of data munging, and though dplyr is powerful, it took a lot of trial and error to get the final pipeline done.Nonetheless, I was really psyched about tidyxl, since it automated a job that would have taken manual manipulation (I had 12 spreadsheets like this to process). I was going to write a blog post on this cool package that made my life dealing with messy Excel file a piece of cake. But wait, there’s more…

Attempt 3 (tidyxl + unpivotr)

I didn’t know about unpivotr until this post:

So maybe all that complicated munging can be simplfied.

# devtools::install_github('nacnudus/unpivotr')

cleanData2 = function(d){
  bl = d %>% select(row, col, data_type, numeric, character) %>%
    behead('N', ID) %>%
    behead('N', spine) %>%
    behead('N', variable)
  # Extract the angles column
  bl1 = bl %>% filter(variable == 'angles') %>% spatter(variable) %>%
    select(row, angles)
  # Extract the rest of the columns
  bl2 = bl %>% filter(variable %in% c('A','B')) %>% select(-spine, -col) %>%
    spatter(ID) %>% # Spread to columns
    select(-character) %>% # All my variables are numeric
    gather(ID, value, -row, -variable) %>%
    spread(variable, value)
  final = bl1 %>% left_join(bl2) %>% # put things back together
    arrange(ID, angles) %>%
    select(ID, everything(),-row) # re-arrange columns

## # A tibble: 588 x 4
##    ID    angles     A     B
##  1 10       60.  31.5 1.00
##  2 10       65.  31.3 1.00
##  3 10       70.  31.1 0.202
##  4 10       75.  31.0 0.448
##  5 10       80.  30.8 0.671
##  6 10       85.  30.5 0.461
##  7 10       90.  30.3 0.245
##  8 10       95.  30.0 0.159
##  9 10      100.  29.7 0.170
## 10 10      105.  29.2 0.421
## # ... with 578 more rows

In this example, I’m using the behead function (available in the development version of unpivotr on GitHub) to extract out the three rows of headers. Then I’m extracting out the angles column separately and merging it with the rest of the columns.

In case you’re wondering about the “N” in the behead code, unpivotr has a geographic options system as to where the headers are with respect to the main code. This vignette explains this nomenclature.

Attempt 4 (tidyxl + unpivotr)

After re-reading the unpivotr documentation, I realized that the angles column could be treated as a row header in the unpivotr code. So I further modified the function:

cleanData3 = function(d) {
  final = d %>%
    select(row, col, data_type, numeric, character) %>%
    behead('N', ID) %>%  # Extract column headers
    behead('N', spine) %>%
    behead('N', variable) %>%
    behead('W', angles) %>% # angles as row header
    select(numeric, ID:angles, data_type, -spine) %>% # all vars are numeric
    filter(variable %in% c'A','B')) %>% # Kills off some extra columns
    spatter(variable) # Spreads, using data_type, numeric

## A tibble: 588 x 4
##   ID    angles     A     B
## 1 10       60.  31.5 1.00
## 2 10       65.  31.3 1.00
## 3 10       70.  31.1 0.202
## 4 10       75.  31.0 0.448
## 5 10       80.  30.8 0.671
## 6 10       85.  30.5 0.461
## 7 10       90.  30.3 0.245
## 8 10       95.  30.0 0.159
## 9 10      100.  29.7 0.170
##10 10      105.  29.2 0.421
## ... with 578 more rows

I get to the same output, but with much cleaner code. This is cool!!I’m going to go deeper into the unpivotr documentation and see what else can be in my regular pipeline. A big thank you to the tool-makers that create these tools that make everyday activies easier and make us stay saner.

Tidying messy Excel data (Introduction)

[Re-posted from Abhijit’s blog]

Personal expressiveness, or how data is stored in a spreadsheet

When you get data from a broad research community, the variability in how that data is formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace!

But for smaller data, or data collected in the lab, the possibilities are truly endless! You can get every possiblle color-coding of rows, columns and cells, merged cells, hidden columns and rows, and inopportune blank spaces that convert numbers to characters, and notes where there should be numbers. That’s just from the more organized spreadsheets. Then you get multiple tables in the same spreadsheet, ad hoc computations in some cells, cells copied by hand (with error), and sundry other variations on this theme. In other words, it can be a nightmare scenario for the analyst. To wit,

In thinking about this post, I went back and looked at the documentation of the readxl package, which has made reading Excel files into R a lot easier than before. This package is quite powerful, so as long as data are in a relatively clean tabular form, this tool can pull it into R; see these vignettes to get a real sense of how to process decently behaved Excel files with R.

On a side note, how many ways of importing Excel files into R can you name, or have you used?

The readxl package has been my friend for a while, but then I received some well-intentioned spreadsheets that even readxl wouldn’t touch, despite much coaxing. Now, I had two options: ask the collaborator to re-format the spreadsheet (manually, of course 😄), which in my experience is a disaster waiting to happen; or just take the spreadsheet as is and figure out how to import it into R. I almost always take the latter route, and tidyxl is my bosom friend in this effort. In the next part of this series, I’ll describe my experiences with tidyxl and why, every time I use it, I say a small blessing for the team that created it.


  1. tidyeval meets Spreadsheet Hell
  2. Carpentries lesson on spreadsheet practices
  3. readxl workflows

Quirks about running Rcpp on Windows through RStudio

Quirks about running Rcpp on Windows through RStudio

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.

I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.h. First, devtools::find_rtools was giving a positive result, so (a) was not true. Second, I noticed that the wrong C++ compiler was being called. Even more frustrating was the fact that everything was working if I worked on a native R console rather than RStudio. So there was nothing inherently wrong with the code or setup, but rather the environment RStudio was creating.

After some searching the interwebs and StackOverflow, the following solution worked for me. I added the following lines to my global .Rprofile file:

Sys.setenv(PATH = paste(Sys.getenv("PATH"), "C:/RBuildTools/3.4/bin/",
            "C:/RBuildTools/3.4/mingw_64/bin", sep = ";"))
Sys.setenv(BINPREF = "C:/RBuildTools/3.4/mingw_64/bin/")

Note that C:/RBuildTools is the default location suggested when I installed RTools.

This solution is indicated here, but I have the reverse issue of the default setup working in R and not in the latest RStudio. However, the solution still works!!

Note that instead of putting it in the global .Rprofile, you could put it in a project-specific .Rprofile, or even in your R code as long as it is run before loading the Rcpp or derivative packages. Note also that if you use binary packages that use Rcpp, there is no problem. Only when you’re compiling C++ code either for your own code or for building a package from source is this an issue. And, as far as I can tell, only on Windows.

Hope this prevents someone else from 3 hours of heartburn trying to make Rcpp work on a Windows box. And, if this has already been fixed in RStudio, please comment and I’ll be happy to update this post.


pandas “transform” using the tidyverse

Chris Moffit has a nice blog on how to use the transform function in pandas. He provides some (fake) data on sales and asks the question of what fraction of each order is from each SKU.

Being a R nut and a tidyverse fan, I thought to compare and contrast the code for the pandas version with an implementation using the tidyverse.

First the pandas code:

import pandas as pd
dat = pd.read_excel('sales_transactions.xlsx')
dat['Percent_of_Order'] = dat['ext price']/dat.groupby('order')['ext price'].transform('sum')

A similar implementation using the tidyverse:

dat <- read_excel('sales_transactions.xlsx')
dat <- dat %>%
group_by(order) %>%
mutate(Percent_of_Order = `ext price`/sum(`ext price`))

A quick exploration of the ReporteRs package

The package ReporteRs has been getting some play on the interwebs this week, though it’s actually been around for a while. The nice thing about this package is that it allows writing Word and PowerPoint documents in an OS-independent fashion unlike some earlier packages. It also allows the editing of documents by using bookmarks within the documents.

This quick note is just to remind me that the structure of ReporteRs works beautifully with the piping conventions of magrittr. For example, a report I wrote today maintained my flow while writing R code to create the report.


mydoc <- docx() %>%
  addParagraph(value = 'Correlation matrix', style='Titre2') %>%
  addParagraph(value='Estimates') %>%
  addFlexTable(FlexTable(cormat)) %>%
  addParagraph(value = 'P-values') %>%
  addFlexTable(FlexTable(corpval)) %>%
  addParagraph(value = "Boxplots", style='Titre2') %>%
  addPlot(fun=print, x = plt, height=3, width=5) %>%
  writeDoc(file = 'Report.docx)

Note that plt is a ggplot object and so we actually have to print it rather than just put the object in the addPlot command.

This was my first experience in a while using ReporteRs, and it seemed pretty good to me.

Creating new data with max values for each subject

We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier for the subject) contributes the observation where the variable X takes it’s maximum value for that subject.

R solutions

Hadleyverse solutions

Using the excellent R package dplyr, we can do this using windowing functions included in dplyr. The following solution is available on StackOverflow, by junkka, and gets around the real possibility that multiple observations might have the same maximum value of X by choosing one of them.

dat_max <- dat %>% group_by(ID) %>% filter(row_number(X)==n())

To be a bit more explicit, row_number is a wrapper around rank, using the option ties.method=“first”. So you can use the rank function explicitly here as well.

A solution using the plyr package might be

dat_max <- ddply(dat, .(ID), function(u) u[rank(u$X, ties.method='first')==nrow(u),])

A data.table solution

The data.table package also has great munging capabilities like dplyr. The same StackOverflow thread linked above also provides this solution using data.table, provided by Arun:

dt <- as.data.table(dat)
dt[dt[, .I[which.max(X)], by=ID]$V1]

Using SQL statements via sqldf

The sqldf package allows an easy implementation of using SQL statements on data frame objects. As Ryan mentions in the comments, the possibilities of solving our problem using sqldf is straightforward:

bl <- sqldf("select ID, max(X) from dat group by ID")
names(bl)[2] <- 'X'
sqldf('select * from bl LEFT JOIN dat using(ID, X)')

Python solutions

Using pandas

In Python, the excellent pandas package allows us to do similar operations. The following example is from this thread on StackOverflow.

import pandas as pd
df = DataFrame({’Sp’:[‘a’,’b‘,’c’,’d‘,’e’,’f’], ‘Mt’:[‘s1’, ‘s1’, ‘s2’,’s2‘,’s2’,’s3’], ‘Value’:[1,2,3,4,5,6], ‘count’:[3,2,5,10,10,6]})
df.iloc[df.groupby([‘Mt’]).apply(lambda x: x[‘count’].idxmax())]

You could also do (from the same thread)

df.sort(‘count’, ascending=False).groupby(‘Mt’, as_index=False).first()

but it is about 50% slower.

Using pandasql

The package pandasql is a port of sqldf to Python developed by
yhat. Using this package, we can mimic the sqldf solution given above:

from pandasql import sqldf
sqldf('select Sp, Mt, Value, max(count) from df group by Mt', globals())

“LaF”-ing about fixed width formats

If you have ever worked with US government data or other large datasets, it is likely you have faced fixed-width format data. This format has no delimiters in it; the data look like strings of characters. A separate format file defines which columns of data represent which variables. It seems as if the format is from the punch-card era, but it is quite an efficient format to store large data in (see this StackOverflow discussion). However, it requires quite a bit of coding, in terms of defining which columns represent which variables.

A recent project that our group at NIH is working on involves the National Inpatient Sample (NIS) from the Healthcare Cost and Utilization Project (HCUP). The current task is to load the Core data (8.2M records for 2008), filter it for a particular set of DRG codes, and see how many discharges with these DRG codes are present in each hospital in the set of hospitals included in the sample. Furthermore, we need to categorize each hospital as low, medium and high volume hospitals based on the number of discharges.

Note that to protect our project, I will not use the DRG codes that we’re actually using. I actually don’t know what DRG=500 is in terms of what the patient undergoes.

Translating available code

HCUP provides load files for their data sets in SAS, SPSS and Stata. Note that no open-source data software made the cut, possibly for legacy reasons. This post will focus on a R based solution. A follow up post will present a Python based solution.

First of all, I admit that I am lazy. I don’t want to sit with the data dictionary and write out all the format statements in R. However, if HCUP is providing load files, why not use them? Looking at them, I felt that the Stata load file was cleanest in terms of formatting. I wrote the following R script to extract variable names, column widths and formats (string, double, or integer) from the load file.

parseStataCode <- function(statafile){
 statacode <- readLines(statafile)
 indStart <- min(grep('infix', statacode))
 indEnd <- min(grep('using', statacode))-1
 codelines <- statacode[indStart:indEnd]
 codelines[1] <- gsub('infix','', codelines[1])
 codelines = str_trim(codelines)
 types <- str_extract(codelines, '^\\w+')
 translate <- c('int'='integer', 'byte'='integer', 'str'='string', 'double'='double', 'long'='double')
 types2 <- translate[types]
 varnames <- str_extract(codelines, '[A-Z]\\w+')
 startcols <- as.integer(str_extract(codelines,' \\d+'))
 maxcols = unlist(str_extract_all(codelines,'\\d+'))
 startcols = c(startcols, as.integer(maxcols[length(maxcols)]))
 widths = diff(startcols)
 return(data.frame(varnames = varnames, widths=widths, classes=types2))

You can of course add more code to extract missing value definitions and the like, but this sufficed for my current purpose.

My friend Anthony Damico wrote a great little package in R called SAScii, for the purpose of translating SAS load files provided by different US Government agencies for publicly available data (see his site ASDFree for several nice examples with large government surveys). However, the HCUP SAS load files contain custom formatting statements that breaks Anthony’s code. Hence the need to create my own translation code.

Ingesting data

Now on to the actual job of reading the data. R provides a function read.fwf, but that is painfully slow, so I immediately went to the Gods of Google. A quick internet search found the package LaF, which uses C++ underneath to make accessing large ASCII files faster. LaF also allows some column selection and filtering as data is loaded into memory in R. The workhorse for this is the laf_open_fwf function, which requires specification of the column widths and types.

wt <- parseStataCode('StataLoad_NIS_2008_Core.Do')
d <- laf_open_fwf('NIS_2008_Core.ASC', column_widths=wt$widths,
 column_types=wt$classes, column_names = wt$varnames)

The code above does not actually load the data into R, but essentially creates a pointer to the data on disk. Now I decided to use the (awesome) power of dplyr and magrittr to extract the frequencies I need. This strategy provides very clear and fast code that I’m happy with right now.

d1 <- d[,c("DRG","HOSPID")] %>%                           # Select only columns I need
 dplyr::filter(DRG %in% c(500)) %>%                       # Keep DRGs I want
 group_by(HOSPID) %>%
 summarize(n=n()) %>%                                     # Tabulate volumes by hospital
 mutate(volumeCat = cut(n, c(0,100,400,max(n)+1),
     labels=c('Low','Medium','High'))) %>%                # Categorize hospitals
 group_by(volumeCat) %>%
 summarize(n=n())                                         # Tabulate hospital categories

Why am I doing the filtering using dplyr rather than LaF? At least for my initial runs, it seemed that the dplyr strategy is faster. You could, in fact, use LaF as follows: d[d[,"DRG"] %in% c(500), c("DRG","HOSPID")]. See the LaF user manual for details about the syntax for row filtering, since it is a bit idiosyncratic.

So how does this actually perform? We’re working with 8.2 million records here. I ran my code (after wrapping it in a function) on my desktop (Intel Xeon @2.67 Ghz, 8GB RAM running Windows 7 64-bit Enterprise Edition, running R 3.1.1).

system.time(freq <- zip2freq('NIS_2008_Core.ASC', drg=c(500)))
user system elapsed
3.19 1.57 45.56

On my MacBook Air (2014) with 8GB memory and a 1.4 GHz Intel Core i5 and a Solid State Disk, this took about 16 seconds (elapsed time).

For contrast, I used the HCUP-supplied SAS load file (on SAS 9.3), adding the lines

if DRG=500;

to the end of the provided DATA statement to filter the data and keep only the two variables I need. I’m not good enough in SAS to do the remaining data munging efficiently, but the timings were interesting to say the least.

 NOTE: 8158381 records were read from the infile 'H:\tmp\NIS_2008_Core.ASC'. The minimum record length was 516. The maximum record length was 516. NOTE: The data set WORK.NIS_2008_CORE has 1068 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 2:26.37 cpu time 2:11.15 

Practical Data Science Cookbook

Practical Data Science Cookbook

Book cover

My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.

Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications. We chose the languages R and Python to demonstrate our recipes since they are by far the most popular open-source languages for data science, with large ecosystems of mature data-related packages and active contributors. We selected several domains to demonstrate what we feel are interesting use cases that will have broad interest.

You can buy the book either in print form or as a DRM-free e-book from Packt, the publisher, or O’Reilly. Print and e-book versions are also available from Amazon, Barnes & Noble. We hope you will find this book useful for your data science work and understanding.

Nov 10, 2014: Until 2AM tonight EST, the book is O’Reilly’s E-book Deal of the Day. Use the discount code DEAL to get 50% off the regular price.

The need for documenting functions

My current work usually requires me to work on a project until we can submit a research paper, and then move on to a new project. However, 3-6 months down the road, when the reviews for the paper return, it is quite common to have to do some new analyses or re-analyses of the data. At that time, I have to re-visit my code!

One of the common problems I (and I’m sure many of us) have is that we tend to hack code and functions with the end in mind, just getting the job done. However, when we have to re-visit code when it’s no longer fresh in our memory, it takes significant time to decipher what some code snippet or function is doing, or why we did it in the first place. Then, when our paper gets published and a reader wants our code to try, it’s a bear getting it into any kind of shareable form. Recently I’ve had both issues, and decided, enough was enough!!

R has a fantastic package roxygen2 that makes documenting functions quite easy. The documentation sits just above the function code, so it is there front and center. Taking 2-5 minutes to write even a bare-bones documentation, that includes

  • what the function does
  • what are the inputs (in English) and their required R class
  • what is the output and its R class
  • maybe one example

makes the grief of re-discovering the function and trying to decipher it go away. What does this look like? Here’s a recent example from my files:

#' Find column name corresponding to a particular functional
#' The original data set contains very long column headers. This function
#' does a keyword search over the headers to find those column headers that
#' match a particular keyword, e.g., mean, median, etc.
#' @param x The data we are querying (data.frame)
#' @param v The keyword we are searching for (character)
#' @param ignorecase Should case be ignored (logical)
#' @return A vector of column names matching the keyword
#' @export
findvar <- function(x,v, ignorecase=TRUE) {
  if(!is.character(v)) stop('name must be character')
  if(!is.data.frame(x)) stop('x must be a data.frame')
  v <- grep(v,names(x),value=T, ignore.case=ignorecase)
  if(length(v)==0) v <- NA

My code above might not meet best practices, but it achieves two things for me. It reminds me of why I wrote this function, and tells me what I need to run it. This particular snippet is not part of any R package (though I could, with my new directory structure for projects, easily create a project-specific package if I need to). Of course this type of documentation is required if you are indeed writing packages.  

Update: As some of you have pointed out, the way I’m using this is as a fancy form of commenting, regardless of future utility in packaging. 100% true, but it’s actually one less thing for me to think about. I have a template, fill it out, and I’m done, with all the essential elements included. Essentially this creates a “minimal viable comment” for a function, and I only need to look in one place later to see what’s going on. I still comment my code, but this still gives me value for not very much overhead.


There are several resources for learning about roxygen2. First and foremost is the chapter Documenting functions from Hadley Wickham’s online book. roxygen2 also has its own tag on StackOverflow.

On the software side, RStudio supports roxygen2; see here. Emacs/ESS also has extensive roxygen2 support. The Rtools package for Sublime Text provides a template for roxygen2 documentation. So getting started in the editor of your choice is not a problem.