Some thoughts on the downsides of current Data Science practice

Bert Huang has a nice blog post on the poor results of ML/AI algorithms on “wild” data, which echoes some of my own experience and thoughts. His conclusions are worth thinking about, IMO.

1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected. But when our databases are significant fractions of the world population, we are much farther away from the controlled samples of good laboratory science. This means…

2. Data science as it’s currently practiced is essentially bad science. When we take a biased, dependent population of samples and try to generalize a conclusion from it, we need to be fully aware of how flawed our study is. That doesn’t mean things we discover using data analytics aren’t useful, but they need to be understood through the lens of the bias and complex dependencies present in the training data.

3. Computational methods should be aware of, and take advantage of, known dependencies. Some subfields of data mining and machine learning address this, like structured output learning, graph mining, relational learning, and more. But there is a lot of research progress needed. The data we’re mostly interested in nowadays comes from complex phenomena, which means we have to pay for accurate modeling with a little computational and cognitive complexity. How we manage that is a big open problem.

Point 3 especially is one I’ve been thinking about a lot recently. Our current frameworks are quite limited in dealing with dependencies and complexity. We’ve been happy using decades-old methods since they work pretty well on the predictive side as a reasonable approximation to the truth. However, having machines understand complexity and incorporate it into their predictions is a second-level challenge that could use significant research effort.


The need for documenting functions

My current work usually requires me to work on a project until we can submit a research paper, and then move on to a new project. However, 3-6 months down the road, when the reviews for the paper return, it is quite common to have to do some new analyses or re-analyses of the data. At that time, I have to re-visit my code!

One of the common problems I (and I’m sure many of us) have is that we tend to hack code and functions with the end in mind, just getting the job done. However, when we have to re-visit code when it’s no longer fresh in our memory, it takes significant time to decipher what some code snippet or function is doing, or why we did it in the first place. Then, when our paper gets published and a reader wants our code to try, it’s a bear getting it into any kind of shareable form. Recently I’ve had both issues, and decided, enough was enough!!

R has a fantastic package, roxygen2, that makes documenting functions quite easy. The documentation sits just above the function code, so it is there front and center. Taking 2-5 minutes to write even bare-bones documentation that includes

  • what the function does
  • what are the inputs (in English) and their required R class
  • what is the output and its R class
  • maybe one example

makes the grief of re-discovering the function and trying to decipher it go away. What does this look like? Here’s a recent example from my files:

#' Find column name corresponding to a particular functional
#'
#' The original data set contains very long column headers. This function
#' does a keyword search over the headers to find those column headers that
#' match a particular keyword, e.g., mean, median, etc.
#' @param x The data we are querying (data.frame)
#' @param v The keyword we are searching for (character)
#' @param ignorecase Should case be ignored (logical)
#' @return A vector of column names matching the keyword
#' @export
findvar <- function(x,v, ignorecase=TRUE) {
  if(!is.character(v)) stop('name must be character')
  if(!is.data.frame(x)) stop('x must be a data.frame')
  v <- grep(v,names(x),value=TRUE, ignore.case=ignorecase)
  if(length(v)==0) v <- NA
  return(v)
}

My code above might not meet best practices, but it achieves two things for me. It reminds me of why I wrote this function, and tells me what I need to run it. This particular snippet is not part of any R package (though I could, with my new directory structure for projects, easily create a project-specific package if I need to). Of course this type of documentation is required if you are indeed writing packages.  

Update: As some of you have pointed out, the way I’m using this is as a fancy form of commenting, regardless of future utility in packaging. 100% true, but it’s actually one less thing for me to think about. I have a template, fill it out, and I’m done, with all the essential elements included. Essentially this creates a “minimal viable comment” for a function, and I only need to look in one place later to see what’s going on. I still comment my code, but this still gives me value for not very much overhead.
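The same four-item template carries over to Python, where a docstring plays the role of the roxygen2 block. Here is a rough Python counterpart of findvar, as a sketch only: it takes a plain list of column names rather than a data frame, and the example headers are made up for illustration.

```python
import re


def findvar(columns, keyword, ignorecase=True):
    """Find the column names that match a keyword.

    What it does: keyword search over long column headers.
    Inputs:
        columns    -- the headers to search (list of str)
        keyword    -- the keyword or regular expression to look for (str)
        ignorecase -- should case be ignored? (bool)
    Output: the matching headers, in their original order (list of str).
    Example: findvar(["Mean age", "Median age"], "mean") gives ["Mean age"].
    """
    if not isinstance(keyword, str):
        raise TypeError("keyword must be a string")
    flags = re.IGNORECASE if ignorecase else 0
    pattern = re.compile(keyword, flags)
    return [name for name in columns if pattern.search(name)]


# Hypothetical headers, just to exercise the function
headers = ["Mean of systolic BP", "Median age", "mean_weight"]
print(findvar(headers, "mean"))
print(findvar(headers, "mean", ignorecase=False))
```

Unlike the R version, this returns an empty list rather than NA when nothing matches, which is the more usual Python convention.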


There are several resources for learning about roxygen2. First and foremost is the chapter Documenting functions from Hadley Wickham’s online book. roxygen2 also has its own tag on StackOverflow.

On the software side, RStudio supports roxygen2; see here. Emacs/ESS also has extensive roxygen2 support. The Rtools package for Sublime Text provides a template for roxygen2 documentation. So getting started in the editor of your choice is not a problem.

Converting images in Python

I had a recent request to convert an entire folder of JPEG images into EPS or similar vector graphics formats. The client was on a Mac and didn’t have ImageMagick. I found the Python Imaging Library (PIL) enormously useful for this; it allowed me to implement the conversion in around 10 lines of Python code!!!

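from glob import glob
from PIL import Image  # Pillow; the original PIL used a plain "import Image"

jpgfiles = glob('*.jpg')
for u in jpgfiles:
    out = u.replace('.jpg', '.eps')
    print("Converting %s to %s" % (u, out))
    img = Image.open(u)
    img.thumbnail((800, 800))  # Changing the size (in place, keeps aspect ratio)
    img.save(out)  # output format is inferred from the .eps extension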

What an elegant solution from Python: “batteries included”.

To be sure, using ImageMagick is more powerful, and Python wrappers (PyMagick), albeit old, do exist.

SAS, R and categorical variables

One of my disappointments with SAS (which I need because PROC MIXED is required for some analyses) is how awkward it is to recode categorical variables to have a particular reference category. In R, my usual tool, this is rather easy both to set and to modify using the relevel command available in base R (in the stats package). My understanding is that this is actually easy in SAS for GLM, PHREG and some others, but not in PROC MIXED. (Once again I face my pet peeve about the inconsistencies within a leading commercial product and market “leader” like SAS.) The easiest way to deal with this, I believe, is to actually create the dummy variables by hand using if/else statements and use them in the model rather than the categorical variables themselves. If most of the covariates are not categorical, this isn’t too burdensome.

I’m sure some SAS guru will comment on the elegant or “right” solution to this problem.
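The by-hand coding amounts to one 0/1 indicator per non-reference level; the reference category is simply the level that gets no column. The sketch below shows the logic in Python rather than SAS, purely as an illustration. The function name dummy_code and the treatment levels are made up.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per level other than `reference`."""
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError("reference level %r not found in the data" % reference)
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical categorical covariate with "placebo" as the reference category
dose = ["placebo", "low", "high", "low"]
indicators = dummy_code(dose, reference="placebo")
for level, column in indicators.items():
    print(level, column)
```

Rows in the reference category are zero in every indicator column, so each coefficient in the model is a contrast against that category. Changing the reference only means calling the function with a different level.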

Forest plots using R and ggplot2

Forest plots are most commonly used in reporting meta-analyses, but can be profitably used to summarise the results of a fitted model. They essentially display the estimates for model parameters and their corresponding confidence intervals.

Matt Shotwell just posted a message to the R-help mailing list with his lattice-based solution to the problem of creating forest plots in R. I just figured out how to create a forest plot for a consulting report using ggplot2. The availability of the geom_pointrange layer makes this process very easy!!

Update January 26, 2016: ggplot2 has changed a bit in the last five years. I’ve created a gist that will be easier to maintain. The link is here.

library(ggplot2)
# "forestplot_gg" is a placeholder name for this function
forestplot_gg <- function(d){
 # d is a data frame with 4 columns
 # d$x gives variable names
 # d$y gives center point
 # d$ylo gives lower limits
 # d$yhi gives upper limits
 p <- ggplot(d, aes(x=x, y=y, ymin=ylo, ymax=yhi))+
 geom_pointrange()+
 geom_hline(yintercept = 0, linetype=2)+
 coord_flip() # assumed completion: variables run down the vertical axis
 return(p)
}

If we start with some dummy data, like

d <- data.frame(x = toupper(letters[1:10]),
                y = rnorm(10, 0, 0.1))
d <- transform(d, ylo = y-1/10, yhi=y+1/10)

we can get the following graph:


A small customization of ESS

JD Long (at Cerebral Mastication) posted a question on Twitter about an artifact in ESS, where typing “_” gets you “<-”. This is because in the early days of S+, “_” was an allowed assignment operator, and ESS was developed in that era. Later, it was disallowed in favor of “<-” and “=”, so ESS was modified to map “_” to “<-”. Now I like the typing convenience of this map, and I don’t use underscores in my variable names, so I was fine. JD probably was using underscores in his variable names, so this was rather frustrating. There are, I discovered, three ways around this:

  1. Type “_” twice, which puts in the underscore
  2. Use “C-q _”, i.e. Ctrl-q then underscore
  3. Put (setq ess-S-assign "_") in your .emacs file

The last fix obviously customizes ESS permanently for your emacs setup, while the first two allow you to get to underscore using the default ESS setup.

Update: Seth Falcon posted his .emacs on Twitter, which allows C-= to map the assignment operator, and leaves _ alone 🙂

(setq ess-S-assign-key (kbd "C-="))
(ess-toggle-S-assign-key t) ; enable above key definition
;; leave my underscore key alone!
(ess-toggle-underscore nil)

Nice, Seth!!

FYI, ESS is Emacs Speaks Statistics, an emacs addon developed by Tony Rossini and others to enable intelligent editing of statistical scripts in S+, R, SAS and Stata, as well as scripts for the Gibbs Sampling programs BUGS and JAGS, and can be found here

Quick and dirty parallel processing in R

R has some powerful tools for parallel processing, which I discovered while searching for ways to fully utilize my 8-core computer at work. What surprised me is how easy it is…about 6 lines of code, if that. Given that I wasn’t allowed to install heavy duty parallel-processing systems like MPICH on the computer, I found that the library SNOW fit the bill nicely through its use of sockets. I also discovered the libraries foreach and iterators, which were released to the community by the development team at Revolution R. Using these 3 libraries, I could easily parallelize a transformation of my dataset where the transformations happened within each unique ID. The following code did the trick:

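cl <- makeCluster(6, type="SOCK") # using 6 nodes
registerDoSNOW(cl) # needs library(doSNOW), which also loads snow, foreach and iterators
uID <- unique(ID)
foreach(i=icount(length(uID))) %dopar% {
  # ... transformation of the rows with ID == uID[i] ...
}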

Note that this is for a multiprocessor single computer. Doing this on a cluster may be more complicated, but this serves my purposes quite nicely. There are other choices for this, including the multicore library and others described in the CRAN Task View

Update: I found that this strategy did not work for R 2.11 Windows versions, since snow is not properly spawning processes. However, there is a library doSMP provided by Revolution Analytics which gets around this problem. So replacing doSNOW with doSMP should do the trick. 

Update (7/25/2011): It appears that SNOW does work again in R 2.13.0, the current version, on Windows. I’ve been using the snowfall package recently on my multi-core WinXP64 computer, and it works beautifully.


Floating point pitfalls

John D. Cook over at The Endeavour has a series of articles on floating-point arithmetic and how it can burn us when we compute statistics like the standard deviation, correlation and regression coefficients using the textbook formulae. Especially enlightening for me was the trick of using the Taylor series expansion of log(1+x) for small values of x, since the error is actually quite small. Fantastic points, John!!

A good summary of his points can be found here
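Two of these pitfalls are easy to reproduce. The sketch below is my own illustration in Python, not code from John's articles. It compares log(1 + x) computed directly against a three-term Taylor series for a tiny x, and the textbook one-pass variance formula against Welford's update. Welford's method is one common stable alternative, chosen here as an assumption about what a fix might look like. The data values are picked only so that the true sample variance is exactly 30.

```python
import math


def naive_log1p(x):
    # 1.0 + x throws away most of the digits of a tiny x before the log is taken
    return math.log(1.0 + x)


def taylor_log1p(x):
    # First three terms of log(1 + x) = x - x^2/2 + x^3/3 - ...
    return x - x * x / 2.0 + x ** 3 / 3.0


def textbook_variance(xs):
    # (sum of squares - n * mean^2) / (n - 1): subtracts two huge, nearly equal numbers
    n = len(xs)
    sum_x = sum(xs)
    sum_sq = sum(x * x for x in xs)
    return (sum_sq - sum_x * sum_x / n) / (n - 1)


def welford_variance(xs):
    # Welford's update: keeps a running mean and a running sum of squared deviations
    mean = 0.0
    m2 = 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


x = 1e-10
reference = math.log1p(x)  # library routine designed for small x
print("naive relative error: ", abs(naive_log1p(x) - reference) / reference)
print("taylor relative error:", abs(taylor_log1p(x) - reference) / reference)

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]  # true sample variance is 30
print("textbook variance:", textbook_variance(data))
print("welford variance: ", welford_variance(data))
```

In practice math.log1p is the thing to call; the Taylor version is there only to show why the trick works.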

Workflow with Python and R

I seem to be doing more and more with Python for work, over and above using it as a generic scripting language. R has been my workhorse for analysis for a long time (15+ years in various incarnations of S+ and R), but it still has some deficiencies. I’m finding Python easier and faster to work with for large data sets. I’m also a bit happier with Python’s graphical capabilities via matplotlib, which allows dynamic updating of graphs a la Matlab, something R lacks despite its great graphical capabilities.

Where am I finding Python useful? Mainly in reading, cleaning and transforming data sets, and a bit of analysis using scipy. Python seems more efficient in reading and working through large data sets than R ever was. Data cleaning and exploration using the string utilities and the re module also seem pretty easy. I’ll probably have to write a few utilities, or just pass that stuff into R. I’m more comfortable doing the analysis within R, so I’m using rpy2 quite a bit. Gautier has done a nice upgrade of the old rpy, which I used quite a bit.

One thing that Python doesn’t yet do well is literate programming. Sweave is one of the strengths of R (and StatWeave looks interesting as an interface with other software like SAS, Stata, etc.), and I use it almost daily for report writing. pyreport 0.3 seems promising, and does allow for the report to be written in LaTeX, but I need to play with it some more before I can make a call on it. pyreport does allow the simplicity of reStructuredText for documentation, which I wish Sweave was capable of. I figure this can be easily remedied in R by modifying the RweaveHTML driver written by my old friend Greg Snow. [Addendum, 3/22/09: I recently found a Python package for LaTeX (python.tex), which allows one to embed Python code in a LaTeX document, then run latex using the --shell-escape flag. This then runs the Python code and embeds the results into the LaTeX document. Sort of the opposite of Sweave, but I figure it will be quite useful as well. It should even work within Sweave documents, since the Sweave parser will take out the R/S parts, then running latex will take care of the Python parts.]

Speaking of report writing, this is another place I use Python a lot in my workflow, to convert file formats. I use a Python API to transform formats, both for Writer documents and for spreadsheets. I’ve written small Python scripts in my ~/bin so that I can, on the fly, convert HTML to odt or doc. This is proving quite useful and seems to preserve formats reasonably well. So my reporting workflow is to use Sweave to create a LaTeX document, which I then convert to PDF and HTML, and then transform the HTML to doc using Python. I also create all my graphics in PDF, EPS and SVG formats for subsequent editing by clients. These formats produce the least loss on transformation (the vector formats EPS and SVG have no loss), which is great for the large, multicolored heatmaps I produce. I will also create PNG graphics if I have to provide a Word document for the client.

Easy (?) way to tack Fortran onto Python

A recent post on  Walking Randomly gave a nice example of using the Python ctypes module to load Fortran functions that have been compiled into a shared library (*.so) or DLL (*.dll). This seems an easier option than using f2py or pyfort, which have not been working well for me.
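For a flavor of the ctypes mechanics, here is a minimal sketch of my own, not the code from that post. Since it cannot assume a compiled Fortran library is lying around, it loads the system C math library as a stand-in on a POSIX system and calls frexp, which returns part of its result through a pointer argument. That matters because Fortran routines typically receive all their arguments by reference, so every argument would go through byref the way the exponent does here. With gfortran the exported symbol also usually carries a trailing underscore, so a routine named foo would be looked up as foo_. The library file name in the comment is hypothetical.

```python
import ctypes
import ctypes.util

# Stand-in: the C math library. For Fortran you would load your own compiled
# file instead, e.g. ctypes.CDLL("./libmyfortran.so")  (hypothetical name).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double frexp(double x, int *exponent)
libm.frexp.restype = ctypes.c_double
libm.frexp.argtypes = [ctypes.c_double, ctypes.POINTER(ctypes.c_int)]

exponent = ctypes.c_int(0)
# byref passes the address of `exponent`, so the library can write into it
mantissa = libm.frexp(8.0, ctypes.byref(exponent))

print(mantissa, exponent.value)  # 8.0 == 0.5 * 2**4
```

Declaring restype and argtypes up front is worth the two extra lines: without them ctypes assumes int return values and will silently mangle doubles.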