Month: March 2009

Genz-Bretz multivariate normal in Python

I’ve been fighting for some time to get the Genz-Bretz method for calculating orthant probabilities of the multivariate normal distribution working in Python. I downloaded the Fortran code from Alan Genz’s site and was unsuccessful in using f2py to link it with Python. However, I discovered the usefulness of the Python ctypes module for linking with shared libraries (see here). So I compiled the Fortran code using

gfortran -shared -fPIC mvtdstpack.f -o libmvt.so

and then, within Python, did

from ctypes import *
libmvn = cdll.LoadLibrary('./libmvt.so')
pmvn = libmvn.mvtdst_  # the underscore is added while compiling

This gives us access to the function. The inputs have to be formatted properly, using c_int, c_double and numpy.ctypeslib.ndpointer, and placed in pmvn.argtypes to prototype the function. The variables can then be initialized and passed to the subroutine.
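To make that concrete, here is a minimal sketch of the prototyping and a call. The argument list follows the MVTDST subroutine in mvtdstpack.f (N, NU, LOWER, UPPER, INFIN, CORREL, DELTA, MAXPTS, ABSEPS, RELEPS, ERROR, VALUE, INFORM), but check it against your copy of the source; Fortran passes everything by reference, and the tolerances and example values below are purely illustrative.

import numpy as np
from ctypes import POINTER, c_int, c_double, byref
from numpy.ctypeslib import ndpointer

# Prototype the subroutine: scalars are passed by reference (POINTER/byref),
# arrays as numpy ndpointers; a Fortran subroutine returns nothing.
pmvn.restype = None
pmvn.argtypes = [
    POINTER(c_int),               # N: number of variables
    POINTER(c_int),               # NU: degrees of freedom (< 1 for the MVN case)
    ndpointer(dtype=np.float64),  # LOWER: lower integration limits
    ndpointer(dtype=np.float64),  # UPPER: upper integration limits
    ndpointer(dtype=np.int32),    # INFIN: limit-type flags
    ndpointer(dtype=np.float64),  # CORREL: lower-triangle correlations
    ndpointer(dtype=np.float64),  # DELTA: non-centrality parameters
    POINTER(c_int),               # MAXPTS: max integrand evaluations
    POINTER(c_double),            # ABSEPS: absolute error tolerance
    POINTER(c_double),            # RELEPS: relative error tolerance
    POINTER(c_double),            # ERROR: estimated error (output)
    POINTER(c_double),            # VALUE: estimated probability (output)
    POINTER(c_int),               # INFORM: completion code (output)
]

# Example: P(Z1 < 0, Z2 < 0) for a bivariate normal with correlation 0.5
n, nu = c_int(2), c_int(0)
lower = np.zeros(2)                     # ignored when the INFIN flag is 0
upper = np.zeros(2)
infin = np.zeros(2, dtype=np.int32)     # 0: integrate from -infinity to UPPER
correl = np.array([0.5])                # lower triangle, length n*(n-1)/2
delta = np.zeros(2)
maxpts, abseps, releps = c_int(25000), c_double(1e-6), c_double(1e-6)
error, value, inform = c_double(0), c_double(0), c_int(0)

pmvn(byref(n), byref(nu), lower, upper, infin, correl, delta,
     byref(maxpts), byref(abseps), byref(releps),
     byref(error), byref(value), byref(inform))
print(value.value)

The exact answer here is 1/4 + arcsin(0.5)/(2*pi) = 1/3, which makes a handy sanity check.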

For my purposes this involved a bit of a learning curve, but I found a nice example online that made the formatting easier, and the results are really quite fast. I may actually create a larger Fortran library for this project to speed things up, once I profile my program.

Workflow with Python and R

I seem to be doing more and more with Python for work, over and above using it as a generic scripting language. R has been my workhorse for analysis for a long time (15+ years in various incarnations of S+ and R), but it still has some deficiencies. I’m finding Python easier and faster to work with for large data sets. I’m also a bit happier with Python’s graphical capabilities via matplotlib, which allows dynamic updating of graphs a la Matlab; the lack of this is another drawback of R, despite its otherwise great graphical capabilities.

Where am I finding Python useful? Mainly in reading, cleaning and transforming data sets, and in a bit of analysis using scipy. Python seems more efficient at reading and working through large data sets than R ever was. Data cleaning and exploration using the string utilities and the re module also seem pretty easy. I’ll probably have to write a few utilities, or just pass that stuff into R. I’m more comfortable doing the analysis within R, so I’m using rpy2 quite a bit. Laurent Gautier has done a nice upgrade of the old rpy, which I used quite a bit.
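As an example of that handoff, here’s a minimal sketch using the rpy2 robjects interface (the names follow the rpy2 2.x API, and the data are made up):

import rpy2.robjects as robjects

values = [1.2, 3.4, 2.2, 5.1, 4.3]      # data cleaned up on the Python side
rvec = robjects.FloatVector(values)     # copy into an R numeric vector
print(robjects.r['summary'](rvec))      # run R's summary() on it
ttest = robjects.r['t.test'](rvec)      # one-sample t-test in R (against mu = 0)
print(ttest)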

One thing that Python doesn’t have well yet is a literate programming interface. Sweave is one of the strengths of R (and StatWeave looks interesting as an interface to other software like SAS, Stata, etc.), and I use it almost daily for report writing. pyreport 0.3 seems promising, and does allow the report to be written in LaTeX, but I need to play with it some more before I can make a call on it. pyreport does allow the simplicity of reStructuredText for documentation, which I wish Sweave were capable of. I figure this can be easily remedied in R by modifying the RweaveHTML driver written by my old friend Greg Snow. [Addendum, 3/22/09: I recently found a Python package for LaTeX (python.tex), which allows one to embed Python code in a LaTeX document and then run latex with the --shell-escape flag. This runs the Python code and embeds the results into the LaTeX document. Sort of the opposite of Sweave, but I figure it will be quite useful as well. It should even work within Sweave documents, since the Sweave parser will take out the R/S parts, and then running latex will take care of the Python parts.]
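For what it’s worth, usage looks roughly like this; a sketch assuming the package provides a python environment (as my copy does), compiled with latex --shell-escape doc.tex:

\documentclass{article}
\usepackage{python}
\begin{document}
Some narrative, then a computed result:
% the environment body is piped through python and
% whatever it prints is typeset here
\begin{python}
print(2**10)
\end{python}
\end{document}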

Speaking of report writing, this is another place I use Python a lot in my workflow: converting file formats. I use the Python API for OpenOffice.org to transform formats, both for Writer documents and for spreadsheets. I’ve written small Python scripts in my ~/bin so that I can convert HTML to odt or doc on the fly. This is proving quite useful and seems to preserve formatting reasonably well. So my reporting workflow is to use Sweave to create a LaTeX document, which I then convert to PDF and HTML, and then transform the HTML to doc using Python. I also create all my graphics in PDF, EPS and SVG formats for subsequent editing by clients. These formats lose the least on transformation (the vector formats EPS and SVG lose nothing), which is great for the large, multicolored heatmaps I produce. I will also create PNG graphics if I have to provide a Word document for the client.
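Those conversion scripts are all variations on one PyUNO pattern: connect to a running OpenOffice.org instance, load the document hidden, and store it back out with a different filter. A rough sketch, assuming the office was started listening on a socket (soffice -headless -accept="socket,host=localhost,port=2002;urp;") and that the filter name matches your version; the make_prop helper is just mine:

import os, sys
import uno, unohelper
from com.sun.star.beans import PropertyValue

def make_prop(name, value):   # small helper to build UNO property structs
    p = PropertyValue()
    p.Name, p.Value = name, value
    return p

# connect to the listening office instance
local = uno.getComponentContext()
resolver = local.ServiceManager.createInstanceWithContext(
    "com.sun.star.bridge.UnoUrlResolver", local)
ctx = resolver.resolve(
    "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
desktop = ctx.ServiceManager.createInstanceWithContext(
    "com.sun.star.frame.Desktop", ctx)

# load the input invisibly, then save it out with the Word filter
infile, outfile = sys.argv[1], sys.argv[2]
doc = desktop.loadComponentFromURL(
    unohelper.systemPathToFileUrl(os.path.abspath(infile)),
    "_blank", 0, (make_prop("Hidden", True),))
doc.storeToURL(unohelper.systemPathToFileUrl(os.path.abspath(outfile)),
               (make_prop("FilterName", "MS Word 97"),))
doc.close(False)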

Easy (?) way to tack Fortran onto Python

A recent post on Walking Randomly gave a nice example of using the Python ctypes module to load Fortran functions that have been compiled into a shared library (*.so) or DLL (*.dll). This seems an easier option than f2py or pyfort, neither of which has been working well for me.
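The basic pattern is small enough to show end to end. A minimal sketch, assuming a toy Fortran file dbl.f90 (reproduced in the comments) compiled into libdbl.so:

# Suppose dbl.f90 contains:
#     subroutine dbl(x, n)
#       integer n
#       double precision x(n)
#       x = 2d0 * x
#     end subroutine
# compiled with: gfortran -shared -fPIC dbl.f90 -o libdbl.so
import numpy as np
from ctypes import CDLL, POINTER, c_int, byref
from numpy.ctypeslib import ndpointer

lib = CDLL('./libdbl.so')
dbl = lib.dbl_                  # gfortran appends a trailing underscore
dbl.restype = None              # Fortran subroutines return nothing
dbl.argtypes = [ndpointer(dtype=np.float64), POINTER(c_int)]

x = np.arange(5, dtype=np.float64)
n = c_int(len(x))
dbl(x, byref(n))                # the array is doubled in place
print(x)                        # [ 0.  2.  4.  6.  8.]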