Month: February 2011

The split-apply-combine paradigm in R

Last night at the DC R Users meetup, which was our largest meetup to date, I gave an introductory presentation on data munging, and spent a bit of time on the split-apply-combine paradigm that I use almost daily in my work. I talked mainly about the packages plyr and doBy, which I use a lot now. David Smith posted a link on the Revolution blog to this article by Steve Miller, talking about the virtues of the data.table package for doing “by-group processing”. It got me thinking about changing my workflow yet again and engaging this package in my computational workflow. I also noticed that Hadley Wickham tweeted that he wants to make plyr faster as well in the near future, which will of course be a very welcome development.

ggplot2 joy

I’ve been working on a long-term (25+yr) longitudinal study of rheumatoid arthritis with my boss. He just walked in and asked if I could create a plot showing the trajectory of pain scores over time for each subject, separated by educational level (4 groups). Having now worked with ggplot2 for a while, and learning more at the last two DC useR meetups, I realized that I could formulate this in ggplot very easily and in short order. Hooray!!! Basically, all I needed to do was:

ggplot(data, aes(time, pain, groups=patient.id, color=education.level))+geom_line()

I actually spent more time figuring out how to change the legend title 🙂 (fyi, it is + labs(colour='Education'), with the British spelling being necessary).

I’m actually pretty thrilled that I could use ggplot2 on short order to make this plot.

On another note, my friend Brian Danielak gave a brilliant presentation at last night’s DC R Users meetup on some ggplot2-based development he’s doing for graphical ANOVA. A link to his talk should be on the meetup.com site in short order, so please do check it out.