Stat Bandit

Musings on statistics, computation and data research

The many faces of statistics/data science: Can’t we all just get along and learn from each other?

Posted by Abhijit on April 12, 2012
Posted in: Uncategorized. Tagged: Data Science, Statistics. 3 comments

Two blog posts in the last 24 hours caught my attention. The first was this post by Jeff Leek noting that many fields are applied statistics by another name (and I’d add operations research to his list). The second is a post on Cloudera’s blog on constructing case-control studies. It is generally excellent, but makes this rather unfortunate (in my view) statement:

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

First of all, this ignores what biostatisticians have been doing in collaboration with epidemiologists for decades. The design of a study, as any statistician understands, is at least as important as the analysis, and statisticians have been at the forefront of pushing good study design. Second, it shows a fundamental lack of understanding of the breadth of what statistics as a discipline encompasses. Third, it all but confirms Jeff’s point about fields that are considered different but are essentially “applied statistics”. There seems to be a strong push to claim a new field as different and sexier than what has come before (an issue of branding and worth, perhaps?) without understanding what is already out there.

Statistics as a field has been guilty of this as well. The most obvious and wasteful consequence of this is “re-inventing the wheel”, rather than leveraging the power of other discoveries. Ownership of an idea is a powerful concept, but there must be the recognition that while translating a concept for a new audience is useful and extremely necessary, merely claiming ownership while willfully ignoring the developments by colleagues in another field is wasteful and disingenuous.

A recent discussion with a colleague reinforced this point even within statistics. Some of the newer developments in a relatively new methodological space run along the same lines as theoretical developments in an older one. The new researchers are coming up against the same brick walls as the earlier ones, and seem unaware of the path already travelled (since the keywords are different and not necessarily directly related, a Google Scholar search fails to connect them).

The bottom line is the strong need for more cross-talk between disciplines, more collaboration among researchers, greater understanding of the knowledge already out there, and more breadth in our own training and knowledge.

Pocketbook costs of software

Posted by Abhijit on February 23, 2012
Posted in: R. Tagged: R, SAS. 6 comments

I have always been provided SAS as part of my job, so I never really realized how much it cost. I’ve bought Stata before, and of course R :) . I recently found out how much a reasonable bundle of SAS modules along with base SAS costs per year per seat, at least under the GSA. I tried finding out how much IBM SPSS is for a comparable bundle, but their web page was “not available”. Stata costs in the ballpark of $1700 (for a permanent license of Stata/SE) or $845 for an annual license. SAS costs over 5 times that per seat for similar functionality (Ouch!!). R, with its quirks but with similar if not enhanced functionality in a lot of areas, is of course, freely downloadable. 
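
The arithmetic behind these figures can be sketched quickly. This is a rough illustration only: it uses the two Stata prices quoted above, ignores upgrades and maintenance fees, and, because “5 times that” could refer to either Stata price, computes the implied SAS lower bound both ways. The function names are made up for this sketch.

```python
import math

# Figures quoted in the post (US dollars); the perpetual price is approximate.
STATA_PERPETUAL = 1700
STATA_ANNUAL = 845


def breakeven_years(perpetual, annual):
    """Whole years of annual licensing needed to reach or exceed the perpetual price."""
    return math.ceil(perpetual / annual)


def at_least(base, multiple=5):
    """Lower bound implied by 'over `multiple` times' a base price."""
    return base * multiple


years = breakeven_years(STATA_PERPETUAL, STATA_ANNUAL)
print("Annual licensing overtakes the perpetual price in year", years)
print("SAS per seat, if 5x the annual Stata price: more than $%d" % at_least(STATA_ANNUAL))
print("SAS per seat, if 5x the perpetual Stata price: more than $%d" % at_least(STATA_PERPETUAL))
```

Two annual Stata licenses come to $1690, just under the perpetual price, so the annual route becomes the more expensive one in the third year.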

Matlab is another piece of software I’ve bought as part of my job. For a reasonable bundle, in an academic setting, it is close to $3000. Of course, here it’s a bit easier to pick and choose, since I don’t need most of the modules which are of more interest to engineers.

Converting images in Python

Posted by Abhijit on September 29, 2011
Posted in: Computation. Tagged: PIL, Python. 3 comments

I had a recent request to convert an entire folder of JPEG images into EPS or similar vector graphics formats. The client was on a Mac, and didn’t have ImageMagick. I found the Python Imaging Library (PIL) enormously useful for this; it let me implement the conversion in around 10 lines of Python code!

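from glob import glob
import os

from PIL import Image  # PIL is a third-party package (installed today as Pillow)

jpgfiles = glob('*.jpg')
for u in jpgfiles:
    out = os.path.splitext(u)[0] + '.eps'  # swap only the file extension
    print("Converting %s to %s" % (u, out))
    img = Image.open(u)
    img.thumbnail((800, 800))  # shrink in place to fit within 800x800, keeping the aspect ratio
    img.save(out)  # output format is inferred from the .eps extension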

What an elegant solution from Python: “batteries included”.

To be sure, using ImageMagick is more powerful, and Python wrappers (PyMagick), albeit old, do exist.
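
One step worth isolating is how the output file name is derived. Plain string replacement of 'jpg' with 'eps' is fragile when “jpg” also appears elsewhere in the name, or when the extension is upper-case or spelled “.jpeg”. Below is a minimal sketch of a sturdier approach using only the standard library; the helper names `eps_name` and `find_jpegs` are made up for illustration and are not part of PIL.

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def eps_name(jpeg_path):
    """Return the output path for a JPEG: same location and stem, with an .eps suffix."""
    return str(Path(jpeg_path).with_suffix(".eps"))


def find_jpegs(folder="."):
    """List files in `folder` whose extension is .jpg or .jpeg, ignoring case."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in JPEG_SUFFIXES
    )


# Hypothetical file names, chosen to show the cases that trip up str.replace
for name in ["my_jpg_photo.jpg", "scan.JPG", "photo.jpeg"]:
    print(name, "->", eps_name(name))
```

Each path returned by `find_jpegs` could then be opened, resized and saved under `eps_name(path)` in the conversion loop.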

An enhanced Kaplan-Meier plot, updated

Posted by Abhijit on September 1, 2011
Posted in: R. Tagged: ggplot2, Kaplan-Meier, presentation, R, survival. Leave a Comment

I’ve updated the R code for the enhanced K-M plot to include additions and improvements by Gil Thomas and Mark Cowley. Thanks, fellows, for the feedback and updates.

http://statbandit.wordpress.com/2011/03/08/an-enhanced-kaplan-meier-plot/

Another application of R getting press

Posted by Abhijit on August 18, 2011
Posted in: R. 1 comment

Prof. Atul Butte of Stanford University and colleagues just published two articles in Science Translational Medicine which got a fair amount of press. In fact, I heard about the work on the radio on my commute to work. The research involves developing a computational method which uses the NCBI GEO repository to look at drug-disease interactions and discover potentially new uses for approved drugs. On reading the paper, I realized that their main computational tool is R, in particular the Bioconductor tools as well as pvclust and qvalue. You can read the article here.

RStudio 0.94.92 visited

Posted by Abhijit on July 30, 2011
Posted in: R. Tagged: R, RStudio. 2 comments

I just updated my RStudio version to the latest, v.0.94.92 (will this asymptotically approach 1, or actually get to 1?). It was nice to see the number of improvements the development team has implemented, based I’m sure on community feedback. The team has, in my experience, been extraordinarily responsive to user feedback, and I’m sure this played a large part in the development path taken by the team. 

First and foremost, I was happy to see most of my wants met in this version:

  • There is now a keyboard shortcut for <- that is easy and intuitive (Alt+- / Option+-)
  • The File window now allows sorting by modification date  in addition to name, which was becoming an issue for one of my projects
  • Plots can be saved as BMP, TIFF, JPEG and Postscript in addition to PNG and PDF
  • Bracket completion and matching, much like the R Mac GUI, and actually better than Emacs/ESS, especially when deleting.
  • An easy shortcut to repeat blocks of text or transpose two lines of text (though this appears mistakenly overloaded with another shortcut on Windows/Linux)
  • Keyboard shortcuts are reasonably consistent with OS-specific shortcuts, though the Ctrl key is used in Mac more than generally seen in the OS. It is however convenient for those of us migrating from Emacs/ESS, who use the Ctrl key often. 

My wishlist for RStudio is pretty much fulfilled with respect to R development. However, a few improvements need to be made in the TeX/Sweave interface to allow for autocompletion, templates, and fuller functionality in line with Emacs/AUCTeX and TextMate. Currently, writing LaTeX and Sweave feels like writing in WordPad, albeit with R-specific word completion and R functionality. This could be a bit more polished. Of course, TeX and Sweave are still used by a minority of R users, so the fact that this functionality hasn’t developed is no surprise.

All in all, the current version of RStudio feels like a very usable IDE for R, and certain features and similarities make migrating from Emacs pretty easy (provided you don’t miss Emacs’ overall power and flexibility too much).

A ggplot trick to plot different plot types in facets

Posted by Abhijit on July 29, 2011
Posted in: R. Leave a Comment

At the DC useR meetup last week, Marck Vaisman (@wahalulu) showed me a neat trick he’d learned to allow different facets in a faceted ggplot graph to have different plot types. The basis for this trick is this blog post in the Learn-R blog. Marck was trying to plot different statistics on our Meetup group’s membership on a faceted plot. Some of the variables were amenable to a step plot while others were more amenable to plotting using vertical lines.

The interesting trick in this example is to use the subset argument within each geom, so that each layer is drawn in only one facet. The source code is given below:

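library(ggplot2)
library(plyr)      # provides .() used in the subset arguments below
library(reshape2)  # provides melt()

# Meetup dates, drawn later as vertical reference lines
meetup <- read.csv('MeetupDates.csv', as.is=TRUE)
names(meetup) <- 'Dates'
meetup$Dates <- as.Date(meetup$Dates, format='%m/%d/%y')

# Read each membership statistics file and parse its Date column
files <- dir(pattern='DC_useR')
bl <- list()
for(f in files){
  bl[[f]] <- read.csv(f, as.is=TRUE)
  bl[[f]]$Date <- as.Date(bl[[f]]$Date, format='%m/%d/%y')
}
dat <- Reduce(function(x,y) merge(x,y), bl) # Merge the data frames by Date
dat2 <- melt(dat, id=1)                     # Long format: Date, variable, value

# Here comes the trick !!
# Each geom is restricted to one variable, so it is drawn in one facet only.
# Note: newer ggplot2 releases no longer accept subset=; there, pass
# data=subset(dat2, variable=='Total.Members') (and so on) to each geom instead.
f1 <- ggplot(dat2, aes(x=Date,y=value,ymin=0,ymax=value))+facet_grid(variable~., scales='free')
f2 <- f1+geom_step(subset=.(variable=='Total.Members'))
f3 <- f2+geom_step(subset=.(variable=='Active.Members'))
f4 <- f3+geom_linerange(subset=.(variable=='Member.Joins'))
f5 <- f4+geom_linerange(subset=.(variable=='RSVPs'))
f5+geom_vline(xintercept=meetup$Dates, color='red',alpha=.3)+ylab('')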

This produces the following plot:

[Figure: A faceted ggplot object with different plot types]
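
The data handling behind the trick is independent of ggplot: the wide table is melted to long format, and each layer then sees only the rows for one variable. Here is a minimal Python sketch of that idea using only the standard library. The sample rows are made-up numbers, `melt` and `layers` are hypothetical helpers written for this illustration, and no plotting is done.

```python
from collections import defaultdict

# Plot type chosen for each facet, mirroring the geoms used in the R code
PLOT_TYPE = {
    "Total.Members": "step",
    "Active.Members": "step",
    "Member.Joins": "linerange",
    "RSVPs": "linerange",
}


def melt(rows, id_key):
    """Wide to long: one {id, variable, value} record per non-id column."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name != id_key:
                long_rows.append({id_key: row[id_key], "variable": name, "value": value})
    return long_rows


def layers(long_rows):
    """Split long rows into one layer per variable, tagged with its plot type."""
    grouped = defaultdict(list)
    for record in long_rows:
        grouped[record["variable"]].append(record)
    return [(var, PLOT_TYPE[var], recs) for var, recs in grouped.items()]


# Made-up sample data standing in for the merged meetup statistics
wide = [
    {"Date": "2011-07-01", "Total.Members": 100, "RSVPs": 12},
    {"Date": "2011-07-08", "Total.Members": 104, "RSVPs": 0},
]

for variable, plot_type, records in layers(melt(wide, "Date")):
    print(variable, plot_type, [r["value"] for r in records])
```

Each tuple corresponds to one geom call in the R code: the variable name picks the facet, and the plot type decides whether it is drawn as steps or as vertical lines.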
