Data Science

Quirks about running Rcpp on Windows through RStudio

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.

I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.h. First, devtools::find_rtools was giving a positive result, so (a) was not true. Second, I noticed that the wrong C++ compiler was being called. Even more frustrating, everything worked if I used a native R console rather than RStudio. So there was nothing inherently wrong with the code or setup; the problem was the environment RStudio was creating.
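
To see what is going on in a case like this, a few quick checks from the RStudio console are helpful (a minimal sketch; the exact paths will depend on your RTools installation):

Sys.which("g++")                  # which C++ compiler is actually being found
Sys.getenv(c("PATH", "BINPREF"))  # the environment RStudio has set up
devtools::find_rtools()           # should return TRUE if RTools is detected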

After some searching of the interwebs and StackOverflow, the following solution worked for me. I added the following lines to my global .Rprofile file:

Sys.setenv(PATH = paste(Sys.getenv("PATH"),
                        "C:/RBuildTools/3.4/bin/",
                        "C:/RBuildTools/3.4/mingw_64/bin",
                        sep = ";"))
Sys.setenv(BINPREF = "C:/RBuildTools/3.4/mingw_64/bin/")

Note that C:/RBuildTools is the default location suggested when I installed RTools.

This solution is indicated here, though I had the reverse issue: the default setup worked in R but not in the latest RStudio. However, the solution still works!!

Note that instead of putting these lines in the global .Rprofile, you could put them in a project-specific .Rprofile, or even in your R code, as long as they are run before loading Rcpp or any package that depends on it. Note also that if you only use binary packages built on Rcpp, there is no problem. This is an issue only when you’re compiling C++ code, either your own or when building a package from source. And, as far as I can tell, only on Windows.
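
A quick way to check that the fix has taken effect is to compile a trivial Rcpp function from the RStudio console (a minimal sketch; the function itself is just a placeholder):

library(Rcpp)
# If the PATH/BINPREF fix worked, this compiles without complaining about Rcpp.h
cppFunction("int addTwo(int x) { return x + 2; }")
addTwo(40)  # should return 42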

Hope this saves someone else three hours of heartburn trying to make Rcpp work on a Windows box. And, if this has already been fixed in RStudio, please comment and I’ll be happy to update this post.

Some thoughts on the downsides of current Data Science practice

Bert Huang has a nice blog post discussing the poor results of ML/AI algorithms on “wild” data, which echoes some of my own experience and thoughts. His conclusions are worth thinking about, IMO.

1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected. But when our databases are significant fractions of the world population, we are much farther away from the controlled samples of good laboratory science. This means…

2. Data science as it’s currently practiced is essentially bad science. When we take a biased, dependent population of samples and try to generalize a conclusion from it, we need to be fully aware of how flawed our study is. That doesn’t mean things we discover using data analytics aren’t useful, but they need to be understood through the lens of the bias and complex dependencies present in the training data.

3. Computational methods should be aware of, and take advantage of, known dependencies. Some subfields of data mining and machine learning address this, like structured output learning, graph mining, relational learning, and more. But there is a lot of research progress needed. The data we’re mostly interested in nowadays comes from complex phenomena, which means we have to pay for accurate modeling with a little computational and cognitive complexity. How we manage that is a big open problem.

Especially point 3 is one I’ve been thinking about a lot recently. Our current frameworks are quite limited in dealing with dependencies and complexity. We’ve been happy using decades-old methods since they work pretty well on the predictive side as a reasonable approximation to the truth. However, having machines understand complexity and incorporate it into predictions or understanding is a second-level challenge that could use significant research effort.

Annotated Facets with ggplot2

I was recently asked to do a panel of grouped boxplots of a continuous variable, with each panel representing a categorical grouping variable. This seems easy enough with ggplot2 and the facet_wrap function, but then my collaborator wanted p-values on the graphs! This post is my approach to the problem.

First of all, one caveat. I’m a huge fan of Hadley Wickham’s tidyverse and so most of my code will reflect this ethos, including packages and pipes. There are, of course, other approaches to doing this.

library(broom)
library(ggplot2)
library(tidyverse)

I will use the warpbreaks dataset as an example. This data gives the number of breaks in yarn for two types of wool (A and B) and three loom tensions (L, M, H).

data(warpbreaks)
glimpse(warpbreaks)
## Observations: 54
## Variables: 3
## $ breaks  <dbl> 26, 30, 54, 25, 70, 52, 51, 26, 67, 18, 21, 29, 17, 12...
## $ wool    <fctr> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,...
## $ tension <fctr> L, L, L, L, L, L, L, L, L, M, M, M, M, M, M, M, M, M,...

A single panel

Let’s start by creating a grouped boxplot of breaks by wool type. This is pretty straightforward in ggplot2.

plt1 <- ggplot(warpbreaks, aes(x = wool, y = breaks)) +
  geom_boxplot()
plt1

[Figure: grouped boxplot of breaks by wool type]

Adding the p-value to this is also rather simple. We will perform a two-sample t-test and report the p-value on the plot.

tst <- t.test(breaks ~ wool, data = warpbreaks)
pval <- tidy(tst)$p.value  # from broom
plt1 + geom_text(aes(x = 2, y = 60,  # where to put the annotation
                     label = paste('P-value:', format.pval(pval, digits = 2))))

[Figure: boxplot of breaks by wool type with p-value annotation]
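
As an aside, since the p-value is a single label rather than a data-driven layer, annotate() could be used instead of geom_text; it draws the label once rather than once per row of the data. A minimal alternative sketch:

plt1 + annotate('text', x = 2, y = 60,
                label = paste('P-value:', format.pval(pval, digits = 2)))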

Multiple panels

I’ll now show how to do multiple panels. First, we have to transform the data into long format so that ggplot2 can leverage it.

warp2 <- gather(warpbreaks, variable, value, -breaks)
head(warp2)
##   breaks variable value
## 1     26     wool     A
## 2     30     wool     A
## 3     54     wool     A
## 4     25     wool     A
## 5     70     wool     A
## 6     52     wool     A

Now we can plot this using ggplot2

ggplot(warp2, aes(x = value, y = breaks)) + geom_boxplot() + facet_wrap(~variable, nrow = 1)

[Figure: faceted boxplots showing all levels in each panel]

Oops!! When we used gather to melt the data frame, the levels of both wool and tension ended up sharing the single value column, so each facet shows all of the possible categories on its x-axis. We can easily fix this by giving each facet its own x-axis scale.

plt2 <- ggplot(warp2, aes(x = value, y = breaks)) +
  geom_boxplot() +
  facet_wrap(~variable, nrow = 1, scales = 'free_x')

Next, we want to put a p-value on each panel. The trick is to create a new data frame for the p-values, with one row per level of the faceting variable. In our example, wool has two levels while tension has three, so to create one p-value per panel I’ll run ANOVA models and report the p-value for the overall model effect, i.e., whether there is any difference in the average number of breaks across the levels of each predictor. I’ll use the template Wickham suggests for running multiple models here.

pvalues <- warp2 %>%
  nest(-variable) %>%
  mutate(mods = map(data, ~anova(lm(breaks ~ value, data = .))),
         pval = map_dbl(mods, ~tidy(.)$p.value[1])) %>%
  select(variable, pval)
pvalues
## # A tibble: 2 × 2
##   variable        pval
##      <chr>       <dbl>
## 1     wool 0.108397092
## 2  tension 0.001752817

I’ll also include in this data frame (or tibble, as the tidyverse calls it) the location of the annotation on each graph, using the same variable names as in the graph.

pvalues <- pvalues %>%
  mutate(value = c(1.5, 2.5), breaks = 60)
pvalues
## # A tibble: 2 × 4
##   variable        pval value breaks
##      <chr>       <dbl> <dbl>  <dbl>
## 1     wool 0.108397092   1.5     60
## 2  tension 0.001752817   2.5     60

Now, let’s add these annotations to the faceted plot.

plt2_annot <- plt2 +
  geom_text(data = pvalues,
            aes(x = value, y = breaks,
                label = paste('P-value:', format.pval(pval, digits = 1))))
plt2_annot

[Figure: faceted boxplots with per-panel p-value annotations]

Obviously, if we wanted to make this graph publication quality, we’d need to modify the labels and perhaps the background, but this example gets you to the point where you can put p-values onto faceted plots. In fact, there’s nothing special about p-values here; you could add any facet-specific annotation onto your faceted plots using the same template, as in the sketch below.
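
For example, here is a minimal sketch (the object names are just for illustration) that annotates each panel with its sample size instead of a p-value, using the same one-row-per-facet data frame idea:

ns <- warp2 %>%
  count(variable) %>%
  mutate(value = ifelse(variable == 'wool', 1.5, 2.5), breaks = 70)

plt2 + geom_text(data = ns, aes(x = value, y = breaks, label = paste('n =', n)))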

A follow-up to Crowdsourcing Research

Last month I published some thoughts on crowdsourcing research, inspired by Anthony Goldbloom’s talk at Statistical Programming DC on the Kaggle experience. Today, I found a rather similar discussion of crowdsourcing research (in the online version of the magazine Good) as a potential way to increase the accuracy of scientific research and reduce bias. I think academia, funding agencies, journals, and consumers of scientific and technological research all need to give more consideration to breaking silos, to making progress accurate and reproducible, and to finding new ways of preserving the profit imperative in technological progress while still allowing for the sharing and crowdsourcing of knowledge and research progress.

Crowdsourcing research

Last evening, Anthony Goldbloom, the founder of Kaggle.com, gave a very nice talk at a joint Statistical Programming DC/Data Science DC event about the Kaggle experience and what can be learned from the results of their competitions. One of the takeaway messages was that crowdsourcing data problems to a diligent and motivated group of entrepreneurial data scientists can get you to the threshold of extracting signal and patterns from data far more quickly than if a closed and siloed group of analysts worked on the problem. There were several examples where analysts, some with expertise far removed from the problem domain, were able, in rather short order, to improve upon the best analyses that the academic community could produce after years of effort. It is evidence that many varied approaches from many intelligent minds can be very efficient, and can break the “box” that domain experts and academics often limit themselves to. This phenomenon should be eye-opening to researchers.

Another feature of the competitions was that once one team had a breakthrough, others quickly followed to match and beat it. Anthony called this the “Roger Bannister effect”: once competitors see that a barrier can be broken, a psychological wall falls and the new threshold is rapidly matched by others. This is an interesting observation, in that it seems to indicate that we limit ourselves and are satisfied with what we believe to be the best until someone proves us wrong, and then we are quick to figure out how to achieve the new limit. It’s also interesting that some individuals or teams believe that the current limit is artificial and can be broken. Eventually a true limit is reached (all the signal is extracted from the data) and subsequent efforts produce minuscule incremental improvements.

I’ve been thinking recently about ways of crowdsourcing research. Crowdsourced data collection is indeed possible, and infrastructures like Apple’s ResearchKit and other IT solutions (like one we’re developing at Zansors) can help. However, it is essential that the analyses and interpretation of studies use as many eyes and hands as possible to get innovative and optimal solutions. At the individual analyst level, that may mean trying many, many models and looking at ensemble models. Teams can look at different approaches to a problem, and the community can add even more diversity of approaches, provided the data is shared. Unfortunately, the current incentives in academic science do not promote this; rather, they promote isolationist and siloed research. Data sharing is limited at best, due to the (not unfounded) fear of being scooped by competitors; this is a result of the current incentive system in science. Team or community science is not really promoted, because then particular investigators cannot get the credit they need for tenure or promotions. Platforms like Kaggle can help, but they also need cultural acceptance within the scientific community. It is noteworthy that most of the successful competitions on Kaggle have been sponsored by companies with financial stakes in the outcome, or by government agencies like NIH in association with companies.

A third point that Anthony demonstrated is that Kaggle competitions can make a difference. There are tough problems in image analysis applied to disease diagnosis or prevention that are crowdsourced on Kaggle, with the hope that a good solution will emerge and change paradigms for clinical and public health practice. There are similarities to how the US government (and other governments) promote research by funding academic groups to address problems that agencies deem important. That paradigm is a crowdsourcing solution to some extent, since many groups are engaged towards a common goal by virtue of particular research programs and grant portfolios that agencies like NIH run, but the fundamental paradigm in grant funding is competition rather than cooperation. It’s not really crowdsourcing in the spirit of Kaggle or the Netflix Prize, where hundreds of teams focus on a common goal, and competitors coalesce, collaborate, and cooperate to form bigger, more capable teams to get to a better solution. There is also a problem translating this to academia or real research, since almost all the teams in these competitions are volunteer efforts done outside of the participants’ usual jobs, and are typically unfunded. Research, and sustained research, requires regular funding to succeed. How that would work within a crowdsourcing paradigm is unclear to me.

My main thoughts after Anthony’s presentation were that there is a lot of promise in collaborative crowdsourcing for particular well-defined problems, and such avenues should be more widely used. For more nascent or amorphous problems, we would be well served to increase discussion and collaboration and to share negative findings, so that unpromising directions can be eliminated without too much wasted effort and progress can happen more efficiently and quickly.

Reading fixed width formats in the Hadleyverse

This is an update to a previous post on reading fixed width formats in R.

A new addition to the Hadleyverse is the package readr, which includes a function read_fwf to read fixed width format files. I’ll compare the LaF approach to the readr approach using the same dataset as before. The variable wt is generated from parsing the Stata load file as before.

I want to read all the data in two columns: DRG and HOSPID. The LaF solution for this is:

library(LaF)
d <- laf_open_fwf('NIS_2008_Core.ASC', column_widths=wt$widths,
                  column_types=wt$classes, column_names = wt$varnames)
d1 <- d[,c('DRG','HOSPID')]

For readr, we use the following code:

library(readr)
x <- as.data.frame(fwf_widths(wt$widths, wt$varnames))
# rows 15 and 62 of the layout correspond to the DRG and HOSPID columns
fwfpos <- fwf_positions(x[c(15, 62), 1], x[c(15, 62), 2], col_names = x[c(15, 62), 3])
d1 <- read_fwf('NIS_2008_Core.ASC', fwfpos)

As a comparison of times on my MacBook Air (2014) with 8GB memory and a 1.4 GHz Intel Core i5 and a Solid State Disk, the LaF approach took 20.73s (system time 2.52s), while the readr solution takes 40.23s (system time 10.55s). While readr has improved things a lot compared to read.fwf and other base R solutions, the LaF approach is still more efficient.
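
For completeness, timings like these can be collected with system.time, using the objects defined above. A minimal sketch of the comparison (not the exact benchmarking code I ran) would be:

system.time({
  d <- laf_open_fwf('NIS_2008_Core.ASC', column_widths = wt$widths,
                    column_types = wt$classes, column_names = wt$varnames)
  d1 <- d[, c('DRG', 'HOSPID')]
})

system.time({
  d2 <- read_fwf('NIS_2008_Core.ASC', fwfpos)
})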

Practical Data Science Cookbook

[Book cover]

My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.

Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications. We chose the languages R and Python to demonstrate our recipes since they are by far the most popular open-source languages for data science, with large ecosystems of mature data-related packages and active contributors. We selected several domains to demonstrate what we feel are interesting use cases that will have broad interest.

You can buy the book either in print form or as a DRM-free e-book from Packt, the publisher, or from O’Reilly. Print and e-book versions are also available from Amazon and Barnes & Noble. We hope you will find this book useful for your data science work and understanding.

Nov 10, 2014: Until 2AM tonight EST, the book is O’Reilly’s E-book Deal of the Day. Use the discount code DEAL to get 50% off the regular price.