“LaF”-ing about fixed width formats

If you have ever worked with US government data or other large datasets, it is likely you have faced fixed-width format data. This format has no delimiters; the data look like unbroken strings of characters, and a separate format file defines which columns hold which variables. It seems like a relic of the punch-card era, but it is quite an efficient way to store large data (see this StackOverflow discussion). The downside is that it requires quite a bit of up-front coding to spell out those column definitions.
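
To make the format concrete, here is a purely made-up example. Each line below is one record with no delimiters; only the accompanying format file tells you that columns 1-5 hold a record ID, columns 6-8 a DRG code, and columns 9-13 a hospital ID:

0000150012345
0000212307890
0000350012345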

A recent project our group at NIH is working on involves the National Inpatient Sample (NIS) from the Healthcare Cost and Utilization Project (HCUP). The current task is to load the Core data (8.2M records for 2008), filter it for a particular set of DRG codes, and count how many discharges with these DRG codes are present at each hospital included in the sample. We then need to categorize each hospital as a low-, medium-, or high-volume hospital based on its number of discharges.

Note that, to protect our project, I will not use the DRG codes we are actually studying; in fact, I don't know what DRG=500 corresponds to in terms of what the patient undergoes.

Translating available code

HCUP provides load files for their data sets in SAS, SPSS, and Stata. Note that no open-source software made the cut, possibly for legacy reasons. This post will focus on an R-based solution; a follow-up post will present a Python-based solution.

First of all, I admit that I am lazy. I don't want to sit with the data dictionary and write out all the format statements in R. But if HCUP provides load files, why not use them? Looking at them, I felt the Stata load file was the cleanest in terms of formatting, so I wrote the following R script to extract variable names, column widths, and formats (string, double, or integer) from it.

parseStataCode <- function(statafile){
 require(stringr)
 options(stringsAsFactors = FALSE)
 statacode <- readLines(statafile)
 # The column definitions live between the 'infix' and 'using' parts of the Stata infix command
 indStart <- min(grep('infix', statacode))
 indEnd <- min(grep('using', statacode)) - 1
 codelines <- statacode[indStart:indEnd]
 codelines[1] <- gsub('infix', '', codelines[1])
 codelines <- str_trim(codelines)
 # The first word on each line is the Stata storage type; map it to an R class
 types <- str_extract(codelines, '^\\w+')
 translate <- c('int' = 'integer', 'byte' = 'integer', 'str' = 'string',
                'double' = 'double', 'long' = 'double')
 types2 <- translate[types]
 # Variable names are upper case in the HCUP load files
 varnames <- str_extract(codelines, '[A-Z]\\w+')
 # Starting column of each field; appending the final end column lets diff() give the widths
 startcols <- as.integer(str_extract(codelines, ' \\d+'))
 maxcols <- unlist(str_extract_all(codelines, '\\d+'))
 startcols <- c(startcols, as.integer(maxcols[length(maxcols)]))
 widths <- diff(startcols)
 return(data.frame(varnames = varnames, widths = widths, classes = types2))
}

You can of course add more code to extract missing value definitions and the like, but this sufficed for my current purpose.
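
For example, running the parser on the 2008 Core load file returns a plain data frame of names, widths, and classes. The rows below are only illustrative; I am not reproducing the actual NIS layout here:

head(parseStataCode('StataLoad_NIS_2008_Core.Do'), 3)
#   varnames widths classes
# 1      AGE      3 integer
# 2   AGEDAY      3 integer
# 3   AMONTH      2 integer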

My friend Anthony Damico wrote a great little R package called SAScii for translating the SAS load files that various US government agencies provide with their publicly available data (see his site ASDFree for several nice examples with large government surveys). However, the HCUP SAS load files contain custom formatting statements that break Anthony's code, hence the need to write my own translation code.

Ingesting data

Now on to the actual job of reading the data. R provides a function read.fwf, but that is painfully slow, so I immediately went to the Gods of Google. A quick internet search found the package LaF, which uses C++ underneath to make accessing large ASCII files faster. LaF also allows some column selection and filtering as data is loaded into memory in R. The workhorse for this is the laf_open_fwf function, which requires specification of the column widths and types.

library(LaF)

wt <- parseStataCode('StataLoad_NIS_2008_Core.Do')
d <- laf_open_fwf('NIS_2008_Core.ASC', column_widths = wt$widths,
 column_types = wt$classes, column_names = wt$varnames)

The code above does not actually load the data into R, but essentially creates a pointer to the data on disk. Now I decided to use the (awesome) power of dplyr and magrittr to extract the frequencies I need. This strategy provides very clear and fast code that I’m happy with right now.

library(dplyr)
library(magrittr)

d1 <- d[, c("DRG","HOSPID")] %>%                           # Select only the columns I need (this pulls them into memory)
 dplyr::filter(DRG %in% c(500)) %>%                        # Keep only the DRGs I want
 group_by(HOSPID) %>%
 summarize(n = n()) %>%                                    # Tabulate discharge volumes by hospital
 mutate(volumeCat = cut(n, c(0, 100, 400, max(n) + 1),
     labels = c('Low','Medium','High'))) %>%               # Categorize hospitals
 group_by(volumeCat) %>%
 summarize(n = n())                                        # Tabulate hospital categories

Why am I doing the filtering with dplyr rather than LaF? At least in my initial runs, the dplyr strategy seemed faster. You could, in fact, use LaF directly: d[d[,"DRG"] %in% c(500), c("DRG","HOSPID")]. See the LaF user manual for the row-filtering syntax, which is a bit idiosyncratic.
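
For completeness, here is a sketch of the same tabulation driven by LaF's row filtering instead of dplyr's (untimed; the selection line is the one quoted above):

d2 <- d[d[,"DRG"] %in% c(500), c("DRG","HOSPID")]          # LaF does the row and column selection
d2 %>%
 group_by(HOSPID) %>%
 summarize(n = n()) %>%
 mutate(volumeCat = cut(n, c(0, 100, 400, max(n) + 1),
     labels = c('Low','Medium','High'))) %>%
 group_by(volumeCat) %>%
 summarize(n = n())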

So how does this actually perform? We're working with 8.2 million records here. I ran my code (after wrapping it in a function) on my desktop (Intel Xeon @ 2.67 GHz, 8 GB RAM, Windows 7 Enterprise 64-bit, R 3.1.1).
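
I haven't shown the wrapper itself; here is a minimal sketch of what zip2freq amounts to, assuming it simply strings together the steps from the previous sections (the Stata load file name is hard-coded for brevity):

zip2freq <- function(ascfile, drg = c(500)){
 # Parse the column layout from the Stata load file, open the fixed-width file,
 # then run the same dplyr pipeline as above
 wt <- parseStataCode('StataLoad_NIS_2008_Core.Do')
 d <- laf_open_fwf(ascfile, column_widths = wt$widths,
     column_types = wt$classes, column_names = wt$varnames)
 d[, c("DRG","HOSPID")] %>%
  dplyr::filter(DRG %in% drg) %>%
  group_by(HOSPID) %>%
  summarize(n = n()) %>%
  mutate(volumeCat = cut(n, c(0, 100, 400, max(n) + 1),
      labels = c('Low','Medium','High'))) %>%
  group_by(volumeCat) %>%
  summarize(n = n())
}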

system.time(freq <- zip2freq('NIS_2008_Core.ASC', drg=c(500)))
   user  system elapsed
   3.19    1.57   45.56

On my 2014 MacBook Air (1.4 GHz Intel Core i5, 8 GB of memory, solid-state disk), this took about 16 seconds (elapsed time).

For contrast, I used the HCUP-supplied SAS load file (on SAS 9.3), adding the lines

if DRG=500;
keep DRG HOSPID;

to the end of the provided DATA step to filter the data and keep only the two variables I need. I'm not good enough in SAS to do the remaining data munging efficiently, but the timings were interesting, to say the least.

NOTE: 8158381 records were read from the infile 'H:\tmp\NIS_2008_Core.ASC'.
      The minimum record length was 516.
      The maximum record length was 516.
NOTE: The data set WORK.NIS_2008_CORE has 1068 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           2:26.37
      cpu time            2:11.15

10 comments

  1. Nice post. Always good to see that LaF is useful. One comment, though. Your conclusion that LaF performs some compilation the first time it is run is not correct. It is not completely clear on which platforms the last run and the SAS run were performed, but I assume that is the Windows system. So, in that case, the time dropped from 45 seconds to 4.4 seconds. Am I correct? That is a big drop that I cannot explain. Often the first time a file is accessed the run time is longer, because the second and later times the file has been (partially) cached by the operating system. However, I have never seen so large a difference. What storage does your Windows machine use? Perhaps you can use the microbenchmark package to run a number of benchmarks.
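
    For example, something along these lines (times kept small since each call takes tens of seconds; zip2freq is the wrapper from the post):

    library(microbenchmark)
    microbenchmark(
      zip2freq('NIS_2008_Core.ASC', drg = c(500)),
      times = 5
    )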

  2. Your SAS code is reading in all +/- 100 fields in this file, then checking to see if the DRG on the record is the target (DRG=500), and then dropping the +/- 98 fields of non-interest. A lot of wasted effort there.

    Try running the SAS code below to do the data extract; it cuts out all of the extraneous stuff you don't need for this particular run. It took less than 2 minutes to figure out what the SAS code was doing and chop out the excess. (It's straightforward code.) I'd guess it took longer than that to write the parseStataCode function.

    However, writing the rest of the data summarization almost certainly takes longer to code in SAS than it did with R … magrittr combined with dplyr is an awesomely clear and concise way of laying out a sequence of munging tasks. Very nice.

    I gave the summarization part a shot anyway.

    (Note: code not tested, I don’t have this data.)

    ********************************************************************************** ;
    * SASload_NIS_2008_Core.SAS * ;
    * This program will load the 2008 NIS ASCII Core File into SAS. * ;
    ********************************************************************************* ;
    * Stripped down version that only extracts the two variables of * ;
    * current interest (and adds a counter var to make the added * ;
    * summary steps easier.) * ;
    ********************************************************************************* ;

    PROC FORMAT;
    INVALUE N3PF
    '-99' = .
    '-88' = .A
    '-66' = .C
    OTHER = (|3.|)
    ;
    run;

    *** Data Step to load the file ***;
    DATA NIS_2008_Core_SUBSET;
    INFILE 'NIS_2008_Core.ASC' LRECL = 516;
    *** Define data element attributes ***;
    ATTRIB
    DRG LENGTH=3
    LABEL="DRG in effect on discharge date"
    HOSPID LENGTH=4 FORMAT=Z5.
    LABEL="HCUP hospital identification number"
    ;
    *** Read data elements from the ASCII file ***;
    INPUT
    @40 DRG N3PF.
    @233 HOSPID 5.
    ;
    if DRG=500 ;
    retain Discharges 1 ;
    RUN;

    *------------------------------------------------------------* ;
    * summarize the data. * ;
    *------------------------------------------------------------* ;
    proc format ;
    value vol_cat
    0-100 = 'Low'
    101-400 = 'Medium'
    401-high = 'High'
    ;
    run;

    proc summary data=NIS_2008_Core_SUBSET missing nway;
    class HOSPID ;
    id DRG ;
    var Discharges ;
    output out=SUMMARY_BY_HOSPID (drop=_:)
    sum= ;
    run;

    data SUMMARY_BY_HOSPID ;
    set SUMMARY_BY_HOSPID ;
    length Vol_Cat $6 ;
    Vol_Cat=put(Discharges,vol_cat.);
    run;

    proc summary data=SUMMARY_BY_HOSPID missing nway ;
    id DRG ;
    class Vol_Cat ;
    var Discharges ;
    output out=SUMMARY_BY_VOL_CAT
    (drop=_TYPE_ rename=(_FREQ_=Count_of_Hosps) )
    sum= ;
    run;

    1. Doug, thanks for the code. I knew there was a better way. I did try to extract the SAS code later, but I don’t use SAS regularly in my work, so it might have taken me longer.

      I actually did this exercise to work around the fact that my colleague didn't have a working SAS install at the time. He does now. I wanted to see what the R/Python solution would be for someone who doesn't have any of the commercial packages installed (say, on a home rather than a work computer, or a new freelancer).

      Love dplyr and magrittr. Those two save me a lot of time.

      Update: I just ran your code (just the extraction part, not the munging) and it’s about 40s. Thanks for the update.

  3. Faced with the same problem with the same dataset, we simply decided to load the data into SPSS using the provided load program, combine the data we wanted into one merged file, and use R to read the SPSS data file. I also did NOT have the patience to fit the columns to the dataset. I will have to give your original approach a try – thanks for posting.
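
    (For the last step, something like the line below should do; the .sav file name is just a placeholder for whatever SPSS produces:)

    library(foreign)
    nis <- read.spss('NIS_2008_Core.sav', to.data.frame = TRUE)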

    1. Hi Raghavan,

      The issue we faced was that my colleague, who was just starting on the project, didn't have a working SAS or SPSS installation on his machine yet. He ended up using SAS, but it got me asking how I could ingest the data without needing access to SAS, SPSS, or Stata. Hence this post. Thanks for reading.

  4. Any trick for getting the computer to translate one program/language’s “read structured data” statements into the syntax used by another program/language is always a good thing !!!

  5. I'm trying to use LaF to read a government data file (TANF admin data) and it seems that there are no line breaks; the file is formatted purely by field length. Is there a way to make this work? The SAS input file has info on the length of the line, which seems to be needed. (I don't have access to the commercial software, and the data is too large to read on my computer with read.fwf.)

    Thanks
