Creating new data with max values for each subject

We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier for the subject) contributes the observation where the variable X takes it’s maximum value for that subject.

R solutions

Hadleyverse solutions

Using the excellent R package dplyr, we can do this using windowing functions included in dplyr. The following solution is available on StackOverflow, by junkka, and gets around the real possibility that multiple observations might have the same maximum value of X by choosing one of them.

dat_max <- dat %>% group_by(ID) %>% filter(row_number(X)==n())

To be a bit more explicit, row_number is a wrapper around rank, using the option ties.method=“first”. So you can use the rank function explicitly here as well.

A solution using the plyr package might be

dat_max <- ddply(dat, .(ID), function(u) u[rank(u$X, ties.method='first')==nrow(u),])

A data.table solution

The data.table package also has great munging capabilities like dplyr. The same StackOverflow thread linked above also provides this solution using data.table, provided by Arun:

dt <-
dt[dt[, .I[which.max(X)], by=ID]$V1]

Using SQL statements via sqldf

The sqldf package allows an easy implementation of using SQL statements on data frame objects. As Ryan mentions in the comments, the possibilities of solving our problem using sqldf is straightforward:

bl <- sqldf("select ID, max(X) from dat group by ID")
names(bl)[2] <- 'X'
sqldf('select * from bl LEFT JOIN dat using(ID, X)')

Python solutions

Using pandas

In Python, the excellent pandas package allows us to do similar operations. The following example is from this thread on StackOverflow.

import pandas as pd
df = DataFrame({’Sp’:[‘a’,’b‘,’c’,’d‘,’e’,’f’], ‘Mt’:[‘s1’, ‘s1’, ‘s2’,’s2‘,’s2’,’s3’], ‘Value’:[1,2,3,4,5,6], ‘count’:[3,2,5,10,10,6]})
df.iloc[df.groupby([‘Mt’]).apply(lambda x: x[‘count’].idxmax())]

You could also do (from the same thread)

df.sort(‘count’, ascending=False).groupby(‘Mt’, as_index=False).first()

but it is about 50% slower.

Using pandasql

The package pandasql is a port of sqldf to Python developed by
yhat. Using this package, we can mimic the sqldf solution given above:

from pandasql import sqldf
sqldf('select Sp, Mt, Value, max(count) from df group by Mt', globals())

Practical Data Science Cookbook

Practical Data Science Cookbook

Book cover

My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.

Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications. We chose the languages R and Python to demonstrate our recipes since they are by far the most popular open-source languages for data science, with large ecosystems of mature data-related packages and active contributors. We selected several domains to demonstrate what we feel are interesting use cases that will have broad interest.

You can buy the book either in print form or as a DRM-free e-book from Packt, the publisher, or O’Reilly. Print and e-book versions are also available from Amazon, Barnes & Noble. We hope you will find this book useful for your data science work and understanding.

Nov 10, 2014: Until 2AM tonight EST, the book is O’Reilly’s E-book Deal of the Day. Use the discount code DEAL to get 50% off the regular price.

Converting images in Python

I had a recent request to convert an entire folder of JPEG images into EPS or similar vector graphics formats. The client was on a Mac, and didn’t have ImageMagick. I discovered the Python Image Library  to be enormously useful in this, and allowed me to implement the conversion in around 10 lines of Python code!!!

import Image
from glob import glob

jpgfiles = glob('*.jpg')
for u in jpgfiles:
    out = u.replace('jpg','eps')
    print "Converting %s to %s" % (u, out)
    img.thumbnails((800,800)) # Changing the size

What an elegant solution from Python —- “batteries included”

To be sure, using ImageMagick is more powerful, and Python wrappers (PyMagick), albeit old, do exist.

Python and Excel

Excel is unfortunately the lingua franca of data delivery (at least in small amounts) from my collaborators. Often I have to merge several disparate bits of information from several Excel files together.  I used to do this using R, since that’s what I’ve known for many years.

Now, of course, I’ve discovered Python!!! I fortunately discovered the excellent xlrd and xlwt packages by John Machin, and the subsequent addition of the xlutils package. You can find links for these at the newly created  I found it was quite easy to read data from Excel files using xlrd, and then create Python dicts using the common ID as the keys. Merging was then merely appending, or rather extending, the appropriate lists based on the common ID using a loop or list comprehension. The new list of lists could then be easily written out to a new Excel file using xlwt. I suppose I could also convert the list of lists to a Numpy array and then directly save it into R, but that would actually require advanced thinking, now, wouldn’t it.

My other joyful experience today using Python was in a similar vein, but using the powerful and excellent re module. I’ve found that the re module is easier and appears more powerful than R”s grep tools, though both are based on regular expressions. I had a grid which had scores for a biomarker, which mimiced the actual grid the pathologist was reading from. Of course this is not the kind of data structure R likes, per se, so I tried to get the IDs and scores out using Python. It turned out that 10 lines of code utilizing powerful grouping tools in regular expressions enabled me to parse this grid-like data quickly into a dict with IDs as keys and scores as data. This was then easily merged to other clinical data on the patients to create a spreadsheet. I then used xlwt to write this out to a fresh Excel file for archiving as well as for R to read. The development of this code took maybe 3o minutes!!! Again I’m impressed by Python’s power.

Now if only I can find an easy and reliable way to convert LaTeX tables and figures into OpenOffice or Word while preserving the integrity of the document.

Genz-Bretz multivariate normal in Python

I’ve been fighting for some time to try and get Genz-Bretz’s method for calculating orthant probabilities in multivariate normal distributions imported into Python. I downloaded the fortran code from Alan Genz’s site and was unsuccessful in using f2py to link it with Python. However, I discovered the usefulness of the Python ctypes module in linking with shared libraries (see here). So, I compiled the fortran code using

gfortran mvtdstpack.f -shared -o

and then, within Python, did

from ctypes import *
libmvn = cdll.LoadLibrary('./')
pmvn = libmvn.mvtdst_  # the underscore is added while compiling

This allows us access to the function.  The inputs have to be formatted properly, with the use of c_int, c_double and numpy.ctypeslib.ndpointer, and placed in pmvn.argtypes to prototype the function. The variables can then be initialized and passed through the function or subroutine.

For my purposes this took a bit of a learning curve, but I found a nice example online to make the formatting easier, and the results are really quite fast.  I may actually create a larger Fortran library for this project to speed things up, once I profile my program.

Workflow with Python and R

I seem to be doing more and more with Python for work over and above using it as a generic scripting language. R has been my workhorse for analysis for a long time (15+ years in various incarnations of S+ and R), but it still has some deficiencies. I’m finding Python easier and faster to work with for large data sets. I’m also a bit happier with Python’s graphical capabilities via matplotlib, which allows dynamic updating of graphs a la Matlab, another drawback that R has despite great graphical capabilities.

Where am I finding Python useful? Mainly in reading, cleaning and transforming data sets, and a bit of analysis using scipy. Python seems more efficient in reading and working through large data sets than R ever was.  Data cleaning using the string utilities and the re module and exploration also seems pretty easy. I’ll probably have to right a few utilities, or just pass that stuff into R. I’m more comfortable doing the analysis within R, so I’m using rpy2 quite a bit. Gautier has done a nice upgrade of the old rpy which I used quite a bit.

One thing that Python doesn’t have well yet is a literate programming interface. Sweave is one of the strengths of R (and StatWeave looks interesting as an interface with other software like SAS, Stata, etc) which I use almost on a daily basis for report writing. pyreport 0.3 seems promising, and does allow for the report to be written in LaTeX, but I need to play with it some more before I can make a call on it. pyreport does allow the simplicity of reStructured Text for documentation, which I wish Sweave was capable of. I figure this can be easily remedied in R by modifying the RweaveHTML driver written by my old friend Greg Snow. [Addendum, 3/22/09: I recently found a python package for LaTeX (python.tex), which allows one to embed python code in a LaTeX document, then run latex using the –shell-escape flag. This then runs the python code and embeds the results into the LaTeX document. Sort of the opposite of Sweave, but I figure it will be quite useful as well. It should even work within Sweave documents, since the Sweave parser will take out the R/S parts, then running latex will take care of the python parts.]

Speaking of report writing, this in another place I use Python a lot in my workflow to convert file formats. I use the Python API for to transform formats, both for Writer documents and for spreadsheets. I’ve written small Python scripts in my ~/bin so that I can, on the fly, convert HTML to odt or doc. This is proving quite useful and seems to preserve formats reasonably well. So my reporting workflow is to use Sweave to create a LaTeX document, which I then convert to PDF and HTML, and then transform the HTML to doc using Python. I also create all my graphics as PDF, EPS and SVG formats for subsequent editing by clients. These formats produce the least loss on transformation (the vector formats EPS and SVG have no loss), which is great for large, multicolored heatmaps I produce. I will also create PNG graphics if I have to provide a Word document for the client.

Easy (?) way to tack Fortran onto Python

A recent post on  Walking Randomly gave a nice example of using the Python ctypes module to load Fortran functions that have been compiled into a shared library (*.so) or DLL (*.dll). This seems an easier option than using f2py or pyfort, which have not been working well for me.