Data Science

Practical Data Science Cookbook

Practical Data Science Cookbook

Book cover

My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.

Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications. We chose the languages R and Python to demonstrate our recipes since they are by far the most popular open-source languages for data science, with large ecosystems of mature data-related packages and active contributors. We selected several domains to demonstrate what we feel are interesting use cases that will have broad interest.

You can buy the book either in print form or as a DRM-free e-book from Packt, the publisher, or O’Reilly. Print and e-book versions are also available from Amazon, Barnes & Noble. We hope you will find this book useful for your data science work and understanding.

Nov 10, 2014: Until 2AM tonight EST, the book is O’Reilly’s E-book Deal of the Day. Use the discount code DEAL to get 50% off the regular price.


The many faces of statistics/data science: Can’t we all just get along and learn from each other?

Two blog posts in the last 24 hours caught my attention. First was this post by Jeff Leek noting that there are many fields which are applied statistics by another name (and I’d add operations research to his list). The second is an excellent post on Cloudera’s blog on constructing case-control studies. It is generally excellent, but has this rather unfortunate (in my view) statement:

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

First of all, this ignores what biostatisticians have been doing in collaboration with epidemiologists for decades. The design of a study, as any statistician understands, is just as, if not more, important than the analysis, and statisticians have been at the forefront of pushing good study design. Second, it shows a fundamental lack of understanding of the breadth of what statistics as a discipline encompasses. Third, this almost reiterates Jeff’s point about the different fields, considered different but essentially “applied statistics”. There seems to be a strong push to claim a new field as different and sexier than what has come before (an issue of branding and worth, perhaps?) without understanding what is already out there.

Statistics as a field has been guilty of this as well. The most obvious and wasteful consequence of this is “re-inventing the wheel”, rather than leveraging the power of other discoveries. Ownership of an idea is a powerful concept, but there must be the recognition that while translating a concept for a new audience is useful and extremely necessary, merely claiming ownership while willfully ignoring the developments by colleagues in another field is wasteful and disingenuous.

A recent discussion with a colleague further reiterated this point even within statistics. Some of the newer developments in a relatively new methodologic space are along the same lines of theoretical development in an older methodologic space. The new guys are coming up against the same brick walls as the earlier researchers, and there seems to be a lack of understanding among the new researchers of the path already travelled (since the keywords are different and not necessarily directly related, Google Scholar fails).

The bottom line here is the strong need for more cross-talk between disciplines, more collaboration among researchers, having greater understanding for the knowledge already out there, and more breadth in our own training and knowledge.