Month: April 2014

Quick notes on file management in Python

This is primarily for my own recollection.

To expand ~ in a path name:
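The standard library handles this with os.path.expanduser, which replaces a leading ~ (or ~user) with the user's home directory and leaves other paths untouched (file names here are just illustrative):

```python
import os

# A leading "~" is replaced with the current user's home directory
print(os.path.expanduser("~/projects/data.csv"))

# Paths without a leading "~" are returned unchanged
print(os.path.expanduser("/tmp/data.csv"))
```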


To get the size of a directory:

import os

def getsize(start_path='.'):
    totalsize = 0
    # walk the tree rooted at start_path, summing the size of every file
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            totalsize += os.path.getsize(fp)
    return totalsize
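The same traversal can also be written as a single generator expression, which reads well for a quick one-off (a sketch; it assumes no permission errors or broken symlinks along the way):

```python
import os

def getsize(start_path='.'):
    # Sum the size of every file under start_path in one expression
    return sum(
        os.path.getsize(os.path.join(dirpath, f))
        for dirpath, _, filenames in os.walk(start_path)
        for f in filenames
    )

print(getsize('.'))
```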

IPython notebooks: the new glue?

IPython notebooks have become a de facto standard for presenting Python-based analyses and talks, as evidenced by recent PyCon and PyData events. As anyone who has used them knows, they are great for “reproducible research”, presentations, and sharing via nbviewer. There are extensions connecting IPython to R, Octave, Matlab, Mathematica, and SQL, among others.

However, the brilliance of IPython's design lies in the modularity of the underlying engine (three cheers to Fernando Perez and his team). About a year ago, a Julia engine was written, allowing Julia to be run on the IPython notebook platform (named, appropriately, IJulia). More recently, an R engine has been developed to enable R to run natively on the IPython notebook platform. Though the engines cannot be run interchangeably during the same session of the notebook server, this shows that a common user-facing interface now exists for running the three most powerful open-source scientific and data-centric software systems.

Another recent advancement in the path of IPython notebooks as the common medium for reporting data analyses is my friend Ramnath's proof-of-concept work in translating R Markdown documents to IPython notebooks.

I encourage you, especially my colleagues using R and/or Python in the data space, to explore IPython notebooks, as well as the extensions to R and Julia.

A new data-centric incubator project in DC

District Data Labs is a new endeavor by members of the local data community (myself included) to increase educational outreach about data-related topics through workshops and other media to the local data community.

We want District Data Labs to be an efficient learning resource for people who want to enhance and expand their analytical and technical skill sets. Whether you are a statistician who wants to learn more about programming and creating useful data products, or a software engineer who wants to learn how to properly analyze data and use statistical methods to improve your basic analyses, we want to equip you with the right skills to better yourself and advance your career.

DDL has recently run several PyData workshops, and one on using Python for creating Data Apps is forthcoming.

DDL just announced a new initiative to bring the data community closer: a Data Science Project Incubator where like-minded people can collaborate and develop data-centric projects under the umbrella of DDL. You can find out more details about this new initiative here.

Kaplan-Meier plots using ggplot2 (updated)

About 3 years ago I published some code on this blog to draw a Kaplan-Meier plot using ggplot2. Since then, ggplot2 has been updated (from version 0.8.9) and its syntax has changed. Since that post, I have also become comfortable with Git and GitHub. I have updated the code, fixed a small error, and published it as a Gist. This gist has two functions: ggkm (a basic Kaplan-Meier plot) and ggkmTable (an enhanced Kaplan-Meier plot with a table showing the numbers at risk at various times).

This gist is published here. If you find errors or want to enhance these functions, please fork, update, and send me a link to your fork in the comments. I'll pull and merge them. Unfortunately, GitHub doesn't allow pull requests directly for gists (see here for the StackOverflow answer I'm basing this on).

If you want to go back to the original post, you can read it here.