Author: Abhijit

Surprising result when exploring Rcpp gallery

I’m starting to incorporate more Rcpp in my R work, and so decided to spend some time exploring the Rcpp Gallery. One example by John Merrill caught my eye. He provides a C++ solution for transforming a list of lists into a data frame, and shows impressive speed savings compared to as.data.frame.

This got me thinking about how I currently do this operation. I tend to rely on the do.call method. To mimic the data in the Rcpp Gallery example:

a <- replicate(250, 1:100, simplify=FALSE)
b <- do.call(cbind, a)

For fairness, I should end up with a data frame rather than a matrix, so for my comparisons I convert b into a data frame. I follow the original code from the example, adding my method above into the mix. Comparing times:

library(rbenchmark)
# CheapDataFrameBuilder() is the C++ function from the Rcpp Gallery example
res <- benchmark(as.data.frame(a),
                 CheapDataFrameBuilder(a),
                 as.data.frame(do.call(cbind, a)),
                 order="relative", replications=500)
res[,1:4]

The results were quite interesting to me 🙂

                              test replications elapsed relative
3 as.data.frame(do.call(cbind, a))          500    0.36    1.000
2         CheapDataFrameBuilder(a)          500    0.52    1.444
1                 as.data.frame(a)          500    7.28   20.222

I think part of what’s happening here is that as.data.frame.list expends overhead checking different aspects of making a legitimate data frame, including naming conventions. The comparison with CheapDataFrameBuilder should really be against my barebones strategy. Having said that, the example does provide great value in showing what can be done using Rcpp.
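As an aside, a quick (hypothetical) way to see some of that naming overhead is to compare the column names the two routes produce; the do.call route settles for generic names, while as.data.frame constructs names from the list contents:

# Rough illustration of the naming work, using the list `a` from above;
# the comments describe what I'd expect, not verified output.
head(names(as.data.frame(do.call(cbind, a))), 3)  # generic names: "V1" "V2" "V3"
head(names(as.data.frame(a)), 3)                  # unique names built from the list contents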

Quirks about running Rcpp on Windows through RStudio

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.

I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.h. First, devtools::find_rtools was giving a positive result, so (a) was not true. Second, I noticed that the wrong C++ compiler was being called. Even more frustrating, everything worked fine if I used a native R console rather than RStudio. So there was nothing inherently wrong with the code or setup; the problem was the environment RStudio was creating.

After some searching of the interwebs and StackOverflow, the following solution worked for me. I added these lines to my global .Rprofile file:

Sys.setenv(PATH = paste(Sys.getenv("PATH"), "C:/RBuildTools/3.4/bin/",
            "C:/RBuildTools/3.4/mingw_64/bin", sep = ";"))
Sys.setenv(BINPREF = "C:/RBuildTools/3.4/mingw_64/bin/")

Note that C:/RBuildTools is the default location suggested when I installed RTools.

This solution is indicated here, though my issue was the reverse: the default setup worked in R but not in the latest RStudio. However, the solution still works!!

Note that instead of putting these lines in the global .Rprofile, you could put them in a project-specific .Rprofile, or even in your R code, as long as they run before you load Rcpp or any package that depends on it. Note also that if you only use binary packages that use Rcpp, there is no problem; the issue arises only when compiling C++ code, either your own or when building a package from source. And, as far as I can tell, only on Windows.
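Once these are set, a quick sanity check (a minimal sketch, assuming Rcpp is installed and RTools sits in the default location) is to confirm the session can see the compiler and then compile a trivial expression:

# Minimal sanity check of the toolchain, run in a fresh session
Sys.getenv("BINPREF")    # should point at .../mingw_64/bin/
Sys.which("g++")         # should resolve to the RBuildTools compiler
Rcpp::evalCpp("2 + 2")   # compiles a one-liner; returns 4 if everything works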

Hope this saves someone else 3 hours of heartburn trying to make Rcpp work on a Windows box. And, if this has already been fixed in RStudio, please comment and I’ll be happy to update this post.

 

Finding my Dropbox in R

I’ll often keep non-sensitive data on Dropbox so that I can access it on all my machines without gumming up git. I just wrote a small script to find the Dropbox location on each of my computers automatically. The crucial information is available here, from Dropbox.

My small snippet of code is the following:

# On macOS, Dropbox keeps its configuration in ~/.dropbox/info.json
if (Sys.info()['sysname'] == 'Darwin') {
  info <- RJSONIO::fromJSON(
    file.path(path.expand("~"), '.dropbox', 'info.json'))
}
# On Windows, info.json lives under %APPDATA% or %LOCALAPPDATA%
if (Sys.info()['sysname'] == 'Windows') {
  json_file <- file.path(Sys.getenv('APPDATA'), 'Dropbox', 'info.json')
  if (!file.exists(json_file)) {
    json_file <- file.path(Sys.getenv('LOCALAPPDATA'), 'Dropbox', 'info.json')
  }
  info <- RJSONIO::fromJSON(json_file)
}

dropbox_base <- info$personal$path

I haven’t included the Linux option since I don’t really use a Linux box, but the Dropbox link above will show you where the info.json file lies in Linux. Also, if you have a business Dropbox account, you’ll probably need info$business$path.
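For completeness, here is a sketch of what a Linux branch might look like, assuming the ~/.dropbox/info.json location described at the Dropbox link above:

# Hypothetical Linux branch; assumes info.json sits at ~/.dropbox/info.json
# as documented by Dropbox (I haven't tested this on a Linux box).
if (Sys.info()['sysname'] == 'Linux') {
  info <- RJSONIO::fromJSON(
    file.path(path.expand("~"), '.dropbox', 'info.json'))
}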

Hope this helps!!!

Some thoughts on the downsides of current Data Science practice

Bert Huang has a nice blog post about the poor results of ML/AI algorithms on “wild” data, which echoes some of my experience and thoughts. His conclusions are worth thinking about, IMO.

1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected. But when our databases are significant fractions of the world population, we are much farther away from the controlled samples of good laboratory science. This means…

2. Data science as it’s currently practiced is essentially bad science. When we take a biased, dependent population of samples and try to generalize a conclusion from it, we need to be fully aware of how flawed our study is. That doesn’t mean things we discover using data analytics aren’t useful, but they need to be understood through the lens of the bias and complex dependencies present in the training data.

3. Computational methods should be aware of, and take advantage of, known dependencies. Some subfields of data mining and machine learning address this, like structured output learning, graph mining, relational learning, and more. But there is a lot of research progress needed. The data we’re mostly interested in nowadays comes from complex phenomena, which means we have to pay for accurate modeling with a little computational and cognitive complexity. How we manage that is a big open problem.

Especially point 3 is one I’ve been thinking about a lot recently. Our current frameworks are quite limited in dealing with dependencies and complexity. We’ve been happy using decades-old methods since they work pretty well on the predictive side as a reasonable approximation to the truth. However, having machines understand complexity and incorporate it into predictions or understanding is a second-level challenge that could use significant research effort.

pandas “transform” using the tidyverse

Chris Moffit has a nice blog post on how to use the transform function in pandas. He provides some (fake) data on sales and asks what fraction of each order comes from each SKU.

Being an R nut and a tidyverse fan, I thought I’d compare and contrast the pandas code with an implementation using the tidyverse.

First the pandas code:

import pandas as pd
dat = pd.read_excel('sales_transactions.xlsx')
dat['Percent_of_Order'] = dat['ext price']/dat.groupby('order')['ext price'].transform('sum')

A similar implementation using the tidyverse:

library(tidyverse)
library(readxl)
dat <- read_excel('sales_transactions.xlsx')
dat <- dat %>%
  group_by(order) %>%
  mutate(Percent_of_Order = `ext price`/sum(`ext price`))
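As a quick, hypothetical sanity check on the tidyverse version: the data are still grouped by order after the mutate, so the shares within each order should sum to one.

# Hypothetical check: each order's percentages should sum to 1
dat %>% summarise(total = sum(Percent_of_Order))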

Changing names in the tidyverse: An example for many regressions

A collaborator posed an interesting R question to me today. She wanted to run several regressions using different outcomes, with models computed on different strata defined by a combination of experimental design variables. She then wanted to extract the p-values for the slopes of each model, and filter the strata based on p-value levels.

This seems straightforward, right? Let’s set up a toy example:

library(tidyverse)

dat <- as_tibble(expand.grid(letters[1:4], 1:5))
d <- vector('list', nrow(dat))
set.seed(102)
for(i in 1:nrow(dat)){
  x <- rnorm(100)
  d[[i]] <- tibble(x = x, y1 = 3 - 2*x + rnorm(100), y2 = -4+5*x+rnorm(100))
}
dat <- as_tibble(bind_cols(dat, tibble(dat=d))) %>% unnest()
knitr::kable(head(dat), format='html')
Var1  Var2           x          y1         y2
a        1   0.1805229   4.2598245  -3.004535
a        1   0.7847340   0.0023338  -2.104949
a        1  -1.3531646   3.1711898  -9.156758
a        1   1.9832982  -0.7140910   5.966377
a        1   1.2384717   0.3523034   2.131004
a        1   1.2006174   0.6267716   1.752106

Now we’re going to perform two regressions, one using y1 and one using y2 as the dependent variables, for each stratum defined by Var1 and Var2.

out <- dat %>%
  nest(-Var1, -Var2) %>%
  mutate(model1 = map(data, ~lm(y1~x, data=.)),
         model2 = map(data, ~lm(y2~x, data=.)))

Now conceptually, all we do is tidy up the output of the models using the broom package, filter on the rows containing the slope information, and extract the p-values, right? Not quite….

library(broom)
out_problem <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.))) %>%
  select(-data, -model1, -model2) %>%
  unnest()
names(out_problem)

 [1] "Var1"      "Var2"      "term"      "estimate"  "std.error"
 [6] "statistic" "p.value"   "term"      "estimate"  "std.error"
[11] "statistic" "p.value"

We’ve got two sets of output, but with the same column names!!! This is a problem! An easy solution would be to preface the column names with the name of the response variable. I struggled with this today until I discovered the trick: renaming each tidied data frame with setNames inside map.

out_nice <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.)),
         output1 = map(output1, ~setNames(., paste('y1', names(.), sep='_'))),
         output2 = map(output2, ~setNames(., paste('y2', names(.), sep='_')))) %>%
  select(-data, -model1, -model2) %>%
  unnest()

This is a compact representation of the results of both regressions by strata, and we can extract the information we would like very easily. For example, to extract the stratum-specific slope estimates:

out_nice %>% filter(y1_term=='x') %>%
  select(Var1, Var2, ends_with('estimate')) %>%
  knitr::kable(digits=3, format='html')
Var1  Var2  y1_estimate  y2_estimate
a        1       -1.897        5.036
b        1       -2.000        5.022
c        1       -1.988        4.888
d        1       -2.089        5.089
a        2       -2.052        5.015
b        2       -1.922        5.004
c        2       -1.936        4.969
d        2       -1.961        4.959
a        3       -2.043        5.017
b        3       -2.045        4.860
c        3       -1.996        5.009
d        3       -1.922        4.894
a        4       -2.000        4.942
b        4       -2.000        4.932
c        4       -2.033        5.042
d        4       -2.165        5.049
a        5       -2.094        5.010
b        5       -1.961        5.122
c        5       -2.106        5.153
d        5       -1.974        5.009

 

Copying tables from R to Outlook

I work in an ecosystem that uses Outlook for e-mail. When I have to communicate results with collaborators, one of the most frequent tasks I face is taking tabular output in R (a summary table or some other tabular result) and sending it to collaborators in Outlook. One method is certainly to export the table to Excel and then copy the table from there into Outlook. However, I think I prefer another method which works a bit quicker for me.

I’ve been writing full reports using Rmarkdown for a while now, and it’s my preferred report-generation method. Usually I use knitr::kable to generate a Markdown version of a table in R. I can then copy the generated Markdown version of the table into a Markdown editor (I use Minimalist Markdown Editor), and then just copy the HTML-rendered table from the preview pane into Outlook. This seems to work pretty well for me.
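As a minimal sketch of the first step (mtcars here is just a stand-in for whatever table needs to be sent):

# Generate a Markdown table; paste the output into a Markdown editor and
# copy the HTML-rendered preview into Outlook.
library(knitr)
kable(head(mtcars[, 1:4]), format = 'markdown')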

A (much belated) update to plotting Kaplan-Meier curves in the tidyverse

One of the most popular posts on this blog has been my attempt to create a Kaplan-Meier plot with an aligned table of persons at risk below it under the ggplot paradigm. That post was last updated 3 years ago. In the interim, Chris Dardis has built upon these attempts to create a much more stable and feature-rich version of this function in his package survMisc; the function is called autoplot.

A quick exploration of the ReporteRs package

The package ReporteRs has been getting some play on the interwebs this week, though it’s actually been around for a while. The nice thing about this package is that it allows writing Word and PowerPoint documents in an OS-independent fashion, unlike some earlier packages. It also allows editing documents by using bookmarks placed within them.

This quick note is just to remind me that the structure of ReporteRs works beautifully with the piping conventions of magrittr. For example, a report I wrote today let me maintain my flow while writing the R code that creates it.

library(ReporteRs)
library(magrittr)

# cormat and corpval are a correlation matrix and a matrix of p-values
# computed earlier in the analysis (not shown); plt is a ggplot object
mydoc <- docx() %>%
  addParagraph(value = 'Correlation matrix', style='Titre2') %>%
  addParagraph(value='Estimates') %>%
  addFlexTable(FlexTable(cormat)) %>%
  addParagraph(value = 'P-values') %>%
  addFlexTable(FlexTable(corpval)) %>%
  addParagraph(value = "Boxplots", style='Titre2') %>%
  addPlot(fun=print, x = plt, height=3, width=5) %>%
  writeDoc(file = 'Report.docx')

Note that plt is a ggplot object and so we actually have to print it rather than just put the object in the addPlot command.

This was my first experience in a while using ReporteRs, and it seemed pretty good to me.

Annotated Facets with ggplot2

I was recently asked to do a panel of grouped boxplots of a continuous variable,
with each panel representing a categorical grouping variable. This seems easy enough with ggplot2 and the facet_wrap function, but then my collaborator wanted p-values on the graphs! This post is my approach to the problem.

First of all, one caveat. I’m a huge fan of Hadley Wickham’s tidyverse and so most of my code will reflect this ethos, including packages and pipes. There are, of course, other approaches to doing this.

library(broom)
library(ggplot2)
library(tidyverse)

I will use the warpbreaks dataset as an example. This data looks at the number of breaks in yarn, for two types of wool (A and B) and three loom tensions (L, M, H).

data(warpbreaks)
glimpse(warpbreaks)
## Observations: 54
## Variables: 3
## $ breaks  <dbl> 26, 30, 54, 25, 70, 52, 51, 26, 67, 18, 21, 29, 17, 12...
## $ wool    <fctr> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,...
## $ tension <fctr> L, L, L, L, L, L, L, L, L, M, M, M, M, M, M, M, M, M,...

A single panel

Let’s first start with creating a grouped boxplot of breaks by wool type. This is pretty straightforward in ggplot2.

plt1 <- ggplot(warpbreaks, aes(x=wool, y=breaks))+
  geom_boxplot()
plt1

(Figure: grouped boxplot of breaks by wool type)

Adding the p-value to this is also rather simple. We will perform a two-sample t-test and report the p-value on the plot.

tst = t.test(breaks ~ wool, data=warpbreaks)
pval = tidy(tst)$p.value # from broom
plt1 + geom_text(aes(x=2, y = 60, # Where do we put the annotation
                     label=paste('P-value:', format.pval(pval, digits=2))))

(Figure: boxplot of breaks by wool type with p-value annotation)

Multiple panels

I’ll now show how to do multiple panels. First, we have to transform the data into long format so that ggplot2 can leverage it.

warp2 <- gather(warpbreaks, variable, value, -breaks)
head(warp2)
##   breaks variable value
## 1     26     wool     A
## 2     30     wool     A
## 3     54     wool     A
## 4     25     wool     A
## 5     70     wool     A
## 6     52     wool     A

Now we can plot this using ggplot2:

ggplot(warp2, aes(x = value, y = breaks))+geom_boxplot()+facet_wrap(~variable, nrow=1)

(Figure: faceted boxplots, with all category levels appearing in each panel)

Oops!! When we used gather to melt the data frame, the levels of both categorical variables ended up sharing the single value column, so each panel shows all the possible categories. We can easily fix this by giving each facet its own x-axis scale.

plt2 <- ggplot(warp2, aes(x=value, y=breaks))+geom_boxplot()+facet_wrap(~variable, nrow=1, scales='free_x')

Next, we want to put a p-value on each plot. The trick is to create a new data frame for the p-values, with one row per level of the facetting variable. In our example, wool has two levels while tension has three, so to create one p-value per plot I’ll run ANOVA models and provide the p-value of the model effect, i.e., is there any difference in the average number of breaks by levels of each predictor. I’ll use the template Wickham suggests about running multiple models here.

pvalues <- warp2 %>%
  nest(-variable) %>%
  mutate(mods = map(data, ~anova(lm(breaks ~ value, data = .))),
         pval = map_dbl(mods, ~tidy(.)$p.value[1])) %>%
  select(variable, pval)
pvalues
## # A tibble: 2 × 2
##   variable        pval
##      <chr>       <dbl>
## 1     wool 0.108397092
## 2  tension 0.001752817

I’ll also include in this data frame (or tibble, as the tidyverse calls it) the location of the annotation on each graph, using the same variable names as the graph.

pvalues <- pvalues %>%
  mutate(value = c(1.5, 2.5), breaks = 60)
pvalues
## # A tibble: 2 × 4
##   variable        pval value breaks
##      <chr>       <dbl> <dbl>  <dbl>
## 1     wool 0.108397092   1.5     60
## 2  tension 0.001752817   2.5     60

Now, let’s plot these onto the facetted plot:

plt2_annot <- plt2 +
  geom_text(data=pvalues, aes(x=value, y=breaks,
                              label = paste('P-value:',
                                            format.pval(pval, digits=1))))
plt2_annot

(Figure: faceted boxplots with per-panel p-value annotations)

Obviously, if we wanted to make this graph publication quality, we’d need to modify the labels and perhaps the background, but this example gets you to the point where you can put p-values onto facetted plots. In fact, there’s nothing special about p-values here; you could add any facet-specific annotation onto your facetted plots using the same template.