Cleaning up tables

This post is re-published from my blog. Please see it for the latest updates from May 16, 2018.


Context

One of things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) that the name of each subgroup be repeated down one column of the table.

One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things. But, of course, this being a R blog, I wanted to create a function that would fix this. I’ve created hack-y functions for this before, but a neat trick pointed out here gave me an idea for a more elegant solution. It also meant I had to use the tidyeval paradigm a bit, which I figured I should at least become familiar with.

Here’s what I want to do

Take a table like this:

Location Gender values
Rural Male 32.74
Rural Female 25.18
Urban Male 40.48
Urban Female 25.28

to something like this:

Location Gender values
Rural Male 32.74
Female 25.18
Urban Male 40.48
Female 25.28

The point is that the first column has repeating values, and I just want the first row of the cluster of rows corresponding to Rural and Urban to have text, the rest being blank. I find this a cleaner look and more typical of tables I see in papers.

This is purely for presentation purposes.I would never do this for data frames I’ll still analyze, since the blank cells screw up things. Of course this could be fixed easily using last value carried forward imputation on the column.

A solution

I created this simple function to do this for a single column within a magrittr pipeline:

clean_col = function(x, colvar){
require(dplyr)
colv = enquo(colvar)
x %>% group_by(!!colv) %>%
      mutate(rown = row_number()) %>%
      ungroup() %>%
      mutate_at(vars(!!colv), funs(ifelse(rown > 1, '', .))) %>%
      select (-rown)
}

The first thing to note here is that I’m using quosures and quasiquotation to allow the pipeline to work with the function’s inputs, specifically the column name, which is provided as an unquoted name. Admittedly this was done without much understanding, following examples on Edwin Thoen’s excellent blog.

The second thing was the use of the dummy rown column to identify the first row of each cluster of rows defined by the variable colvar. This was inspired by this blog I read through R-Bloggers yesterday. This trick allowed me to easily “blank out” the appropriate cells in the colvar column.

Desirable updates

There are two directions I want to take this, but I don’t understand tidyeval or functions with variable numbers of arguments well enough yet to do it. The simpler extension is to do the same process using two or more columns, rather than one column. For example, taking

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
Rural Male 55-59 18.1
Rural Male 60-64 26.9
Rural Male 65-69 41.0
Rural Male 70-74 66.0
Rural Female 50-54 8.7
Rural Female 55-59 11.7
Rural Female 60-64 20.3
Rural Female 65-69 30.9
Rural Female 70-74 54.3
Urban Male 50-54 15.4
Urban Male 55-59 24.3
Urban Male 60-64 37.0
Urban Male 65-69 54.6
Urban Male 70-74 71.1
Urban Female 50-54 8.4
Urban Female 55-59 13.6
Urban Female 60-64 19.3
Urban Female 65-69 35.1
Urban Female 70-74 50.0

to

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
55-59 18.1
60-64 26.9
65-69 41.0
70-74 66.0
Rural Female 50-54 8.7
55-59 11.7
60-64 20.3
65-69 30.9
70-74 54.3
Urban Male 50-54 15.4
55-59 24.3
60-64 37.0
65-69 54.6
70-74 71.1
Urban Female 50-54 8.4
55-59 13.6
60-64 19.3
65-69 35.1
70-74 50.0

The second extension would be to create truly nested row labels, like this:

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
55-59 18.1
60-64 26.9
65-69 41.0
70-74 66.0
Female 50-54 8.7
55-59 11.7
60-64 20.3
65-69 30.9
70-74 54.3
Urban Male 50-54 15.4
55-59 24.3
60-64 37.0
65-69 54.6
70-74 71.1
Female 50-54 8.4
55-59 13.6
60-64 19.3
65-69 35.1
70-74 50.0

I can create these on a case-by-case basis, but I’m not sure how to do this in a function, yet. Looking forward to comments.

Update

I can do the first example with the following code (based on Edwin Thoen’s blog, again):

clean_col = function(x, ...){
  require(dplyr)
  colvs = quos(...)
  x %>% group_by(!!!colvs) %>%
    mutate(rown = row_number()) %>%
    ungroup() %>%
    mutate_at(vars(!!!colvs), funs(ifelse(rown > 1, '', .))) %>%
    select (-rown)
}

Update 2

The second example can be solved by extracting elements of quosures, which are essentially a list:

clean_cols = function(x, ...){
  colvs = quos(...)
  for(i in 1:length(colvs)){
    rowvar =  rlang::sym(paste0('rown',i)) # Create dummy
    x = x %>% group_by(!!!colvs[1:i]) %>%
      mutate(!!rowvar := row_number()) %>%
      ungroup()
  }
  for(i in 1:length(colvs)){
    rowvar = rlang::sym(paste0('rown',i))
    x = x %>% mutate_at(vars(!!colvs[[i]]),
      funs(ifelse(!!rowvar > 1, '', .)))
  }
  x = x %>% select(-starts_with('rown')) # remove the dummies
  return(x)
}
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s