
pandas “transform” using the tidyverse

Chris Moffitt has a nice blog post on how to use the transform function in pandas. He provides some (fake) data on sales and asks what fraction of each order comes from each SKU.

Being an R nut and a tidyverse fan, I thought I would compare and contrast the pandas code with an implementation using the tidyverse.

First the pandas code:

import pandas as pd
dat = pd.read_excel('sales_transactions.xlsx')
dat['Percent_of_Order'] = dat['ext price']/dat.groupby('order')['ext price'].transform('sum')

A similar implementation using the tidyverse:

library(tidyverse)
library(readxl)
dat <- read_excel('sales_transactions.xlsx')
dat <- dat %>%
  group_by(order) %>%
  mutate(Percent_of_Order = `ext price`/sum(`ext price`))

Changing names in the tidyverse: An example for many regressions

A collaborator posed an interesting R question to me today. She wanted to run
several regressions using different outcomes, with models fit on different
strata defined by a combination of experimental design variables. She then wanted to extract the p-values for the slopes from each of the models, and
filter the strata based on p-value levels.

This seems straightforward, right? Let’s set up a toy example:

library(tidyverse)

dat <- as_tibble(expand.grid(letters[1:4], 1:5))
d <- vector('list', nrow(dat))
set.seed(102)
for(i in 1:nrow(dat)){
  x <- rnorm(100)
  d[[i]] <- tibble(x = x, y1 = 3 - 2*x + rnorm(100), y2 = -4 + 5*x + rnorm(100))
}
dat <- as_tibble(bind_cols(dat, tibble(dat=d))) %>% unnest()
knitr::kable(head(dat), format='html')
Var1 Var2 x y1 y2
a 1 0.1805229 4.2598245 -3.004535
a 1 0.7847340 0.0023338 -2.104949
a 1 -1.3531646 3.1711898 -9.156758
a 1 1.9832982 -0.7140910 5.966377
a 1 1.2384717 0.3523034 2.131004
a 1 1.2006174 0.6267716 1.752106

Now we’re going to perform two regressions, one using y1 and one using y2 as the dependent variables, for each stratum defined by Var1 and Var2.

out <- dat %>%
  nest(-Var1, -Var2) %>%
  mutate(model1 = map(data, ~lm(y1~x, data=.)),
         model2 = map(data, ~lm(y2~x, data=.)))

Now conceptually, all we do is tidy up the model output using the broom package, filter on the rows containing the slope information, and extract the p-values, right? Not quite….

library(broom)
out_problem <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.))) %>%
  select(-data, -model1, -model2) %>%
  unnest()
names(out_problem)

[1] "Var1"      "Var2"      "term"      "estimate"  "std.error"
[6] "statistic" "p.value"   "term"      "estimate"  "std.error"
[11] "statistic" "p.value"

We’ve got two sets of output, but with the same column names! This is a problem. An easy solution would be to prefix the column names with the name of the corresponding response variable. I struggled with this today until I discovered the secret: the setNames function.

out_nice <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.)),
         output1 = map(output1, ~setNames(., paste('y1', names(.), sep='_'))),
         output2 = map(output2, ~setNames(., paste('y2', names(.), sep='_')))) %>%
  select(-data, -model1, -model2) %>%
  unnest()

This is a compact representation of the results of both regressions by strata, and we can extract the information we would like very easily. For example, to extract the stratum-specific slope estimates:

out_nice %>%
  filter(y1_term=='x') %>%
  select(Var1, Var2, ends_with('estimate')) %>%
  knitr::kable(digits=3, format='html')
Var1 Var2 y1_estimate y2_estimate
a 1 -1.897 5.036
b 1 -2.000 5.022
c 1 -1.988 4.888
d 1 -2.089 5.089
a 2 -2.052 5.015
b 2 -1.922 5.004
c 2 -1.936 4.969
d 2 -1.961 4.959
a 3 -2.043 5.017
b 3 -2.045 4.860
c 3 -1.996 5.009
d 3 -1.922 4.894
a 4 -2.000 4.942
b 4 -2.000 4.932
c 4 -2.033 5.042
d 4 -2.165 5.049
a 5 -2.094 5.010
b 5 -1.961 5.122
c 5 -2.106 5.153
d 5 -1.974 5.009


Copying tables from R to Outlook

I work in an ecosystem that uses Outlook for e-mail. When I have to communicate results with collaborators, one of the most frequent tasks I face is taking tabular output in R (a summary table or some model output) and sending it to collaborators in Outlook. One method is certainly to export the table to Excel and then copy it from there into Outlook. However, I prefer another method that works a bit quicker for me.

I’ve been writing full reports using R Markdown for a while now, and it’s my preferred report-generation method. Usually I use knitr::kable to generate a Markdown version of a table in R. I can then copy the generated Markdown table into a Markdown editor (I use Minimalist Markdown Editor) and copy the HTML-rendered table from the preview pane into Outlook. This seems to work pretty well for me.
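
As a rough sketch of that workflow (using mtcars as a stand-in for a real results table, so the names here are purely illustrative), something like this generates the Markdown I paste into the editor:

library(knitr)
tab <- head(mtcars[, 1:4])        # stand-in for an actual summary table
kable(tab, format = 'markdown')   # paste the output into a Markdown editor, then copy the rendered table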

A (much belated) update to plotting Kaplan-Meier curves in the tidyverse

One of the most popular posts on this blog has been my attempt to create Kaplan-Meier plots with an aligned table of persons-at-risk below it under the ggplot paradigm. That post was last updated 3 years ago. In the interim, Chris Dardis has built upon these attempts to create a much more stable and feature-rich version of this function in his package survMisc; the function is called autoplot.
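
As a pointer only (a hedged sketch; the survMisc interface has changed over versions, so check the package documentation for the current API), the usage looked roughly like this on a standard survival dataset:

library(survival)
library(survMisc)
fit <- survfit(Surv(time, status) ~ x, data = aml)  # aml is a standard example dataset from survival
autoplot(fit)                                       # K-M curves with an at-risk table, per survMisc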

A quick exploration of the ReporteRs package

The package ReporteRs has been getting some play on the interwebs this week, though it’s actually been around for a while. The nice thing about this package is that it allows writing Word and PowerPoint documents in an OS-independent fashion unlike some earlier packages. It also allows the editing of documents by using bookmarks within the documents.

This quick note is just a reminder to myself that the structure of ReporteRs works beautifully with the piping conventions of magrittr. For example, writing the R code for a report I put together today didn’t break my flow at all.

library(ReporteRs)
library(magrittr)

mydoc <- docx() %>%
  addParagraph(value = 'Correlation matrix', style='Titre2') %>%
  addParagraph(value='Estimates') %>%
  addFlexTable(FlexTable(cormat)) %>%
  addParagraph(value = 'P-values') %>%
  addFlexTable(FlexTable(corpval)) %>%
  addParagraph(value = "Boxplots", style='Titre2') %>%
  addPlot(fun=print, x = plt, height=3, width=5) %>%
  writeDoc(file = 'Report.docx')

Note that plt is a ggplot object and so we actually have to print it rather than just put the object in the addPlot command.

This was my first experience in a while using ReporteRs, and it seemed pretty good to me.

Annotated Facets with ggplot2

I was recently asked to do a panel of grouped boxplots of a continuous variable,
with each panel representing a categorical grouping variable. This seems easy enough with ggplot2 and the facet_wrap function, but then my collaborator wanted p-values on the graphs! This post is my approach to the problem.

First of all, one caveat. I’m a huge fan of Hadley Wickham’s tidyverse and so most of my code will reflect this ethos, including packages and pipes. There are, of course, other approaches to doing this.

library(broom)
library(ggplot2)
library(tidyverse)

I will use the warpbreaks dataset as an example. This dataset looks at the number of breaks in yarn, for two types of yarn (A and B) and three loom tensions (L, M, H).

data(warpbreaks)
glimpse(warpbreaks)
## Observations: 54
## Variables: 3
## $ breaks  <dbl> 26, 30, 54, 25, 70, 52, 51, 26, 67, 18, 21, 29, 17, 12...
## $ wool    <fctr> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,...
## $ tension <fctr> L, L, L, L, L, L, L, L, L, M, M, M, M, M, M, M, M, M,...

A single panel

Let’s first start with creating a grouped boxplot of breaks by wool type. This is pretty straightforward in ggplot2.

plt1 <- ggplot(warpbreaks, aes(x=wool, y=breaks))+
  geom_boxplot()
plt1

[Plot: boxplot of breaks by wool type]

Adding the p-value to this is also rather simple. We will perform a two-sample t-test and report the p-value on the plot.

tst <- t.test(breaks ~ wool, data=warpbreaks)
pval <- tidy(tst)$p.value # from broom
plt1 + geom_text(aes(x=2, y=60, # where to put the annotation
                     label=paste('P-value:', format.pval(pval, digits=2))))

[Plot: boxplot of breaks by wool type, annotated with the t-test p-value]

Multiple panels

I’ll now show how to do multiple panels. First, we have to transform the data into long format so that ggplot2 can leverage it.

warp2 <- gather(warpbreaks, variable, value, -breaks)
head(warp2)
##   breaks variable value
## 1     26     wool     A
## 2     30     wool     A
## 3     54     wool     A
## 4     25     wool     A
## 5     70     wool     A
## 6     52     wool     A

Now we can plot this using ggplot2:

ggplot(warp2, aes(x = value, y = breaks))+geom_boxplot()+facet_wrap(~variable, nrow=1)

[Plot: faceted boxplots showing all levels of both variables in each panel]

Oops! When we used gather to melt the data frame, we collapsed the two categorical variables into a single variable column, with all of their levels pooled in the value column. So each panel shows all the possible categories. We can easily fix this by giving each facet its own x-axis scale.

plt2 <- ggplot(warp2, aes(x=value, y=breaks))+geom_boxplot()+facet_wrap(~variable, nrow=1, scales='free_x')

Next, we want to put a p-value on each plot. The trick is to create a new data frame for the p-values, with one row per level of the facetting variable. In our example, wool has two levels while tension has three, so to create one p-value per panel I’ll run ANOVA models and report the p-value for the overall model effect, i.e., whether there is any difference in the average number of breaks across the levels of each predictor. I’ll use the template Wickham suggests about running multiple models here.

pvalues <- warp2 %>%
  nest(-variable) %>%
  mutate(mods = map(data, ~anova(lm(breaks ~ value, data = .))),
         pval = map_dbl(mods, ~tidy(.)$p.value[1])) %>%
  select(variable, pval)
pvalues
## # A tibble: 2 × 2
##   variable        pval
##      <chr>       <dbl>
## 1     wool 0.108397092
## 2  tension 0.001752817

I’ll also include in this data frame (or tibble, as the tidyverse calls it) the location of the annotation on each graph, using the same variable names as in the graph:

pvalues <- pvalues %>%
  mutate(value = c(1.5, 2.5), breaks = 60)
pvalues
## # A tibble: 2 × 4
##   variable        pval value breaks
##      <chr>       <dbl> <dbl>  <dbl>
## 1     wool 0.108397092   1.5     60
## 2  tension 0.001752817   2.5     60

Now, let’s plot these onto the facetted plot:

plt2_annot <- plt2 +
  geom_text(data=pvalues, aes(x=value, y=breaks,
                              label = paste('P-value:',
                                            format.pval(pval, digits=1))))
plt2_annot

[Plot: faceted boxplots with per-panel p-value annotations]

Obviously, if we wanted to make this graph publication quality, we’d need to modify the labels and perhaps the background, but this example gets you to the point where you can put p-values onto facetted plots. In fact, there’s nothing special about p-values here; you could add any facet-specific annotation onto your facetted plots using the same template.
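
As a hypothetical sketch of that kind of polish (the labels and theme below are illustrative choices, not part of the original analysis):

plt2_annot +
  labs(x = '', y = 'Number of breaks') +  # cleaner axis labels
  theme_bw()                              # plain background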

Reading fixed width formats in the Hadleyverse

This is an update to a previous post on reading fixed width formats in R.

A new addition to the Hadleyverse is the package readr, which includes a function read_fwf to read fixed width format files. I’ll compare the LaF approach to the readr approach using the same dataset as before. The variable wt is generated from parsing the Stata load file as before.

I want to read all the data in two columns: DRG and HOSPID. The LaF solution for this is:

library(LaF)
d <- laf_open_fwf('NIS_2008_Core.ASC', column_widths=wt$widths,
                  column_types=wt$classes, column_names = wt$varnames)
d1 <- d[,c('DRG','HOSPID')]

For readr, we use the following code:

library(readr)
x <- as.data.frame(fwf_widths(wt$widths, wt$varnames))
# rows 15 and 62 of the column dictionary correspond to DRG and HOSPID
fwfpos <- fwf_positions(x[c(15,62),1], x[c(15,62),2], col_names=c(x[c(15,62),3]))
d1 <- read_fwf('NIS_2008_Core.ASC', fwfpos)

As a comparison of times on my MacBook Air (2014) with 8GB memory and a 1.4 GHz Intel Core i5 and a Solid State Disk, the LaF approach took 20.73s (system time 2.52s), while the readr solution takes 40.23s (system time 10.55s). While readr has improved things a lot compared to read.fwf and other base R solutions, the LaF approach is still more efficient.
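
For reference, the comparison was of the flavor sketched below; this reuses wt and fwfpos from the snippets above, and the timings quoted come from actual runs rather than this sketch.

library(LaF)
library(readr)
time_laf <- system.time({
  d <- laf_open_fwf('NIS_2008_Core.ASC', column_widths = wt$widths,
                    column_types = wt$classes, column_names = wt$varnames)
  d1 <- d[, c('DRG', 'HOSPID')]
})
time_readr <- system.time(d2 <- read_fwf('NIS_2008_Core.ASC', fwfpos))
time_laf
time_readr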

Creating new data with max values for each subject

We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier) contributes the observation where the variable X takes its maximum value for that subject.
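
To make the solutions below concrete, here is a hypothetical toy version of dat (not part of the original problem; any data frame with an ID column and a numeric X will do):

set.seed(42)
dat <- data.frame(ID = rep(1:3, each = 4),   # 3 subjects, 4 observations each
                  X  = rnorm(12),
                  Y  = runif(12))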

R solutions

Hadleyverse solutions

Using the excellent R package dplyr, we can do this with its windowing functions. The following solution, available on StackOverflow from junkka, gets around the real possibility that multiple observations share the maximum value of X by choosing just one of them.

library(dplyr)
dat_max <- dat %>% group_by(ID) %>% filter(row_number(X)==n())

To be a bit more explicit, row_number is a wrapper around rank, using the option ties.method="first". So you can use the rank function explicitly here as well.
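
For example, the explicit rank version (a sketch reusing dat and dplyr from above) would be:

dat_max <- dat %>%
  group_by(ID) %>%
  filter(rank(X, ties.method = 'first') == n())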

A solution using the plyr package might be

library(plyr)
dat_max <- ddply(dat, .(ID), function(u) u[rank(u$X, ties.method='first')==nrow(u),])

A data.table solution

The data.table package also has great munging capabilities like dplyr. The same StackOverflow thread linked above also provides this solution using data.table, provided by Arun:

library(data.table)
dt <- as.data.table(dat)
dt[dt[, .I[which.max(X)], by=ID]$V1]

Using SQL statements via sqldf

The sqldf package allows easy use of SQL statements on data frame objects. As Ryan mentions in the comments, solving our problem using sqldf is straightforward:

library(sqldf)
bl <- sqldf("select ID, max(X) from dat group by ID")
names(bl)[2] <- 'X'
sqldf('select * from bl LEFT JOIN dat using(ID, X)')

Python solutions

Using pandas

In Python, the excellent pandas package allows us to do similar operations. The following example is from this thread on StackOverflow.

import pandas as pd
df = pd.DataFrame({'Sp':['a','b','c','d','e','f'],
                   'Mt':['s1','s1','s2','s2','s2','s3'],
                   'Value':[1,2,3,4,5,6],
                   'count':[3,2,5,10,10,6]})
df.iloc[df.groupby(['Mt']).apply(lambda x: x['count'].idxmax())]

You could also do (from the same thread)

df.sort('count', ascending=False).groupby('Mt', as_index=False).first()

but it is about 50% slower.

Using pandasql

The package pandasql is a port of sqldf to Python developed by
yhat. Using this package, we can mimic the sqldf solution given above:

from pandasql import sqldf
sqldf('select Sp, Mt, Value, max(count) from df group by Mt', globals())

“LaF”-ing about fixed width formats

If you have ever worked with US government data or other large datasets, it is likely you have faced fixed-width format data. This format has no delimiters in it; the data look like strings of characters. A separate format file defines which columns of data represent which variables. It seems as if the format is from the punch-card era, but it is quite an efficient format to store large data in (see this StackOverflow discussion). However, it requires quite a bit of coding, in terms of defining which columns represent which variables.
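
As a tiny made-up illustration (none of this is HCUP data; the field names and widths are invented), base R can read such a file once the column positions are known:

txt <- c("047M20852",
         "063F30210")
read.fwf(textConnection(txt), widths = c(3, 1, 5),
         col.names = c('AGE', 'SEX', 'ZIP'))   # columns 1-3, 4, and 5-9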

A recent project that our group at NIH is working on involves the National Inpatient Sample (NIS) from the Healthcare Cost and Utilization Project (HCUP). The current task is to load the Core data (8.2M records for 2008), filter it for a particular set of DRG codes, and see how many discharges with these DRG codes are present at each hospital included in the sample. Furthermore, we need to categorize each hospital as a low, medium, or high volume hospital based on the number of discharges.

Note that to protect our project, I will not use the DRG codes that we’re actually using. I actually don’t know what DRG=500 is in terms of what the patient undergoes.

Translating available code

HCUP provides load files for their data sets in SAS, SPSS and Stata. Note that no open-source data software made the cut, possibly for legacy reasons. This post will focus on an R-based solution. A follow-up post will present a Python-based solution.

First of all, I admit that I am lazy. I don’t want to sit with the data dictionary and write out all the format statements in R. However, if HCUP is providing load files, why not use them? Looking at them, I felt that the Stata load file was cleanest in terms of formatting. I wrote the following R script to extract variable names, column widths and formats (string, double, or integer) from the load file.

parseStataCode <- function(statafile){
  require(stringr)
  options(stringsAsFactors = FALSE)
  statacode <- readLines(statafile)
  # The column definitions sit between the "infix" line and the "using" line
  indStart <- min(grep('infix', statacode))
  indEnd <- min(grep('using', statacode)) - 1
  codelines <- statacode[indStart:indEnd]
  codelines[1] <- gsub('infix', '', codelines[1])
  codelines <- str_trim(codelines)
  # First word on each line is the Stata storage type; map it to a LaF column type
  types <- str_extract(codelines, '^\\w+')
  translate <- c('int'='integer', 'byte'='integer', 'str'='string',
                 'double'='double', 'long'='double')
  types2 <- translate[types]
  # Variable names are in upper case; the starting column positions follow them
  varnames <- str_extract(codelines, '[A-Z]\\w+')
  startcols <- as.integer(str_extract(codelines, ' \\d+'))
  maxcols <- unlist(str_extract_all(codelines, '\\d+'))
  startcols <- c(startcols, as.integer(maxcols[length(maxcols)]))
  widths <- diff(startcols)
  return(data.frame(varnames = varnames, widths = widths, classes = types2))
}

You can of course add more code to extract missing value definitions and the like, but this sufficed for my current purpose.

My friend Anthony Damico wrote a great little R package called SAScii for translating the SAS load files provided by different US government agencies for publicly available data (see his site ASDFree for several nice examples with large government surveys). However, the HCUP SAS load files contain custom formatting statements that break Anthony’s code. Hence the need to create my own translation code.

Ingesting data

Now on to the actual job of reading the data. R provides a function read.fwf, but that is painfully slow, so I immediately went to the Gods of Google. A quick internet search found the package LaF, which uses C++ underneath to make accessing large ASCII files faster. LaF also allows some column selection and filtering as data is loaded into memory in R. The workhorse for this is the laf_open_fwf function, which requires specification of the column widths and types.

library(LaF)
wt <- parseStataCode('StataLoad_NIS_2008_Core.Do')
d <- laf_open_fwf('NIS_2008_Core.ASC', column_widths=wt$widths,
                  column_types=wt$classes, column_names = wt$varnames)

The code above does not actually load the data into R, but essentially creates a pointer to the data on disk. Now I decided to use the (awesome) power of dplyr and magrittr to extract the frequencies I need. This strategy provides very clear and fast code that I’m happy with right now.

d1 <- d[,c("DRG","HOSPID")] %>%                           # Select only columns I need
 dplyr::filter(DRG %in% c(500)) %>%                       # Keep DRGs I want
 group_by(HOSPID) %>%
 summarize(n=n()) %>%                                     # Tabulate volumes by hospital
 mutate(volumeCat = cut(n, c(0,100,400,max(n)+1),
     labels=c('Low','Medium','High'))) %>%                # Categorize hospitals
 group_by(volumeCat) %>%
 summarize(n=n())                                         # Tabulate hospital categories

Why am I doing the filtering using dplyr rather than LaF? At least for my initial runs, it seemed that the dplyr strategy is faster. You could, in fact, use LaF as follows: d[d[,"DRG"] %in% c(500), c("DRG","HOSPID")]. See the LaF user manual for details about the syntax for row filtering, since it is a bit idiosyncratic.

So how does this actually perform? We’re working with 8.2 million records here. I ran my code (after wrapping it in a function) on my desktop (Intel Xeon @ 2.67 GHz, 8GB RAM, running Windows 7 64-bit Enterprise Edition and R 3.1.1).

system.time(freq <- zip2freq('NIS_2008_Core.ASC', drg=c(500)))

   user  system elapsed
   3.19    1.57   45.56

On my MacBook Air (2014) with 8GB memory and a 1.4 GHz Intel Core i5 and a Solid State Disk, this took about 16 seconds (elapsed time).

For contrast, I used the HCUP-supplied SAS load file (on SAS 9.3), adding the lines

if DRG=500;
keep DRG HOSPID;

to the end of the provided DATA statement to filter the data and keep only the two variables I need. I’m not good enough in SAS to do the remaining data munging efficiently, but the timings were interesting to say the least.

NOTE: 8158381 records were read from the infile 'H:\tmp\NIS_2008_Core.ASC'.
      The minimum record length was 516.
      The maximum record length was 516.
NOTE: The data set WORK.NIS_2008_CORE has 1068 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           2:26.37
      cpu time            2:11.15

Practical Data Science Cookbook


My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.

Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications. We chose the languages R and Python to demonstrate our recipes since they are by far the most popular open-source languages for data science, with large ecosystems of mature data-related packages and active contributors. We selected several domains to demonstrate what we feel are interesting use cases that will have broad interest.

You can buy the book either in print form or as a DRM-free e-book from Packt, the publisher, or from O’Reilly. Print and e-book versions are also available from Amazon and Barnes & Noble. We hope you will find this book useful in your data science work.

Nov 10, 2014: Until 2AM tonight EST, the book is O’Reilly’s E-book Deal of the Day. Use the discount code DEAL to get 50% off the regular price.