Month: July 2009

Benford’s law

Benford’s law has a prominent place in fraud detection analysis, as once again evidenced by the analyses of the Iranian election. Christian Robert, in his wonderful blog, has an entry describing a fascinating property of Benford’s law. In some ways it calls into question the usefulness of Benford’s law, since it can be satisfied by the product of two unrelated random variables. Actually it calls into question the specificity of Benford’s law for fraud detection — departures from Benford’s law might be indicative of fraud (sensitivity), but lack of departure from Benford’s law might not mean lack of fraud. Need to think about this more :).


Python and Excel

Excel is unfortunately the lingua franca of data delivery (at least in small amounts) from my collaborators. Often I have to merge several disparate bits of information from several Excel files together.  I used to do this using R, since that’s what I’ve known for many years.

Now, of course, I’ve discovered Python!!! I fortunately discovered the excellent xlrd and xlwt packages by John Machin, and the subsequent addition of the xlutils package. You can find links for these at the newly created  I found it was quite easy to read data from Excel files using xlrd, and then create Python dicts using the common ID as the keys. Merging was then merely appending, or rather extending, the appropriate lists based on the common ID using a loop or list comprehension. The new list of lists could then be easily written out to a new Excel file using xlwt. I suppose I could also convert the list of lists to a Numpy array and then directly save it into R, but that would actually require advanced thinking, now, wouldn’t it.

My other joyful experience today using Python was in a similar vein, but using the powerful and excellent re module. I’ve found that the re module is easier and appears more powerful than R”s grep tools, though both are based on regular expressions. I had a grid which had scores for a biomarker, which mimiced the actual grid the pathologist was reading from. Of course this is not the kind of data structure R likes, per se, so I tried to get the IDs and scores out using Python. It turned out that 10 lines of code utilizing powerful grouping tools in regular expressions enabled me to parse this grid-like data quickly into a dict with IDs as keys and scores as data. This was then easily merged to other clinical data on the patients to create a spreadsheet. I then used xlwt to write this out to a fresh Excel file for archiving as well as for R to read. The development of this code took maybe 3o minutes!!! Again I’m impressed by Python’s power.

Now if only I can find an easy and reliable way to convert LaTeX tables and figures into OpenOffice or Word while preserving the integrity of the document.