# Cleaning up tables

This post is re-published from my blog. Please see it for the latest updates from May 16, 2018.

## Context

One of things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) that the name of each subgroup be repeated down one column of the table.

One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things. But, of course, this being a R blog, I wanted to create a function that would fix this. I’ve created hack-y functions for this before, but a neat trick pointed out here gave me an idea for a more elegant solution. It also meant I had to use the tidyeval paradigm a bit, which I figured I should at least become familiar with.

## Here’s what I want to do

Take a table like this:

Location Gender values
Rural Male 32.74
Rural Female 25.18
Urban Male 40.48
Urban Female 25.28

to something like this:

Location Gender values
Rural Male 32.74
Female 25.18
Urban Male 40.48
Female 25.28

The point is that the first column has repeating values, and I just want the first row of the cluster of rows corresponding to Rural and Urban to have text, the rest being blank. I find this a cleaner look and more typical of tables I see in papers.

This is purely for presentation purposes.I would never do this for data frames I’ll still analyze, since the blank cells screw up things. Of course this could be fixed easily using last value carried forward imputation on the column.

## A solution

I created this simple function to do this for a single column within a `magrittr` pipeline:

```clean_col = function(x, colvar){
require(dplyr)
colv = enquo(colvar)
x %>% group_by(!!colv) %>%
mutate(rown = row_number()) %>%
ungroup() %>%
mutate_at(vars(!!colv), funs(ifelse(rown > 1, '', .))) %>%
select (-rown)
}
```

The first thing to note here is that I’m using quosures and quasiquotation to allow the pipeline to work with the function’s inputs, specifically the column name, which is provided as an unquoted name. Admittedly this was done without much understanding, following examples on Edwin Thoen’s excellent blog.

The second thing was the use of the dummy `rown` column to identify the first row of each cluster of rows defined by the variable `colvar`. This was inspired by this blog I read through R-Bloggers yesterday. This trick allowed me to easily “blank out” the appropriate cells in the `colvar` column.

There are two directions I want to take this, but I don’t understand `tidyeval` or functions with variable numbers of arguments well enough yet to do it. The simpler extension is to do the same process using two or more columns, rather than one column. For example, taking

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
Rural Male 55-59 18.1
Rural Male 60-64 26.9
Rural Male 65-69 41.0
Rural Male 70-74 66.0
Rural Female 50-54 8.7
Rural Female 55-59 11.7
Rural Female 60-64 20.3
Rural Female 65-69 30.9
Rural Female 70-74 54.3
Urban Male 50-54 15.4
Urban Male 55-59 24.3
Urban Male 60-64 37.0
Urban Male 65-69 54.6
Urban Male 70-74 71.1
Urban Female 50-54 8.4
Urban Female 55-59 13.6
Urban Female 60-64 19.3
Urban Female 65-69 35.1
Urban Female 70-74 50.0

to

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
55-59 18.1
60-64 26.9
65-69 41.0
70-74 66.0
Rural Female 50-54 8.7
55-59 11.7
60-64 20.3
65-69 30.9
70-74 54.3
Urban Male 50-54 15.4
55-59 24.3
60-64 37.0
65-69 54.6
70-74 71.1
Urban Female 50-54 8.4
55-59 13.6
60-64 19.3
65-69 35.1
70-74 50.0

The second extension would be to create truly nested row labels, like this:

Location Gender AgeGrp DeathRate
Rural Male 50-54 11.7
55-59 18.1
60-64 26.9
65-69 41.0
70-74 66.0
Female 50-54 8.7
55-59 11.7
60-64 20.3
65-69 30.9
70-74 54.3
Urban Male 50-54 15.4
55-59 24.3
60-64 37.0
65-69 54.6
70-74 71.1
Female 50-54 8.4
55-59 13.6
60-64 19.3
65-69 35.1
70-74 50.0

I can create these on a case-by-case basis, but I’m not sure how to do this in a function, yet. Looking forward to comments.

### Update

I can do the first example with the following code (based on Edwin Thoen’s blog, again):

```clean_col = function(x, ...){
require(dplyr)
colvs = quos(...)
x %>% group_by(!!!colvs) %>%
mutate(rown = row_number()) %>%
ungroup() %>%
mutate_at(vars(!!!colvs), funs(ifelse(rown > 1, '', .))) %>%
select (-rown)
}
```

### Update 2

The second example can be solved by extracting elements of quosures, which are essentially a list:

```clean_cols = function(x, ...){
colvs = quos(...)
for(i in 1:length(colvs)){
rowvar =  rlang::sym(paste0('rown',i)) # Create dummy
x = x %>% group_by(!!!colvs[1:i]) %>%
mutate(!!rowvar := row_number()) %>%
ungroup()
}
for(i in 1:length(colvs)){
rowvar = rlang::sym(paste0('rown',i))
x = x %>% mutate_at(vars(!!colvs[[i]]),
funs(ifelse(!!rowvar > 1, '', .)))
}
x = x %>% select(-starts_with('rown')) # remove the dummies
return(x)
}
```

# Tidying messy Excel data (tidyxl)

Reposted from Abhijit’s blog. Some `<-` have been replaced by `=` due to idiosyncracies of the WordPress platform.

Well, here’s what I was dealing with:

Notice that we have 3 header rows, first with patient IDs, second with spine region, and third with variable names (A and B, to protect the innocent).

#### Goal

A dataset that, for each patient and each angle gives us corresponding values of A and B. So this would be a four-column data set with ID, angle, A and B.

```d1 <- readxl::read_excel('spreadsheet1.xlsx')
```

## # A tibble: 6 x 26
## X__1 patient `44` `44__1` `10` `10__1` `3` `3__1` `53` `53__1`
##
## 1 IDS T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6 T5/T6
## 2 angles A B A B A B A B
## 3 60 31.83… 1 31.52… 1 32.9… 0 31.8… 0
## 4 65 31.66… 1 31.33… 1 32.2… 0 32.3… 0
## 5 70 31.45… 1 31.09… 0.20200… 31.7… 0 32.5… 0
## 6 75 31.08… 1 30.96… 0.44831… 31.2… 8.641… 32.3… 1
## # … with 16 more variables: `2` , `2__1` `8` ,
## # `8__1` , `6` , `6__1` , `43` , `43__1` ,
## # `48` , `48__1` , `46` , `46__1` , `4` ,
## # `4__1` , `9` , `9__1`

This strategy gives us funky column names, and pushes two of the headers into data rows. Since the headers are in rows, they’re a little harder to extract and work with. More worrisome is the fact that since the headers leaked into the data rows, the columns are all of type `character` rather than type `numeric`, which would now require further careful conversion after cleaning. So I don’t think `readxl` is the way to go here, if there’s a better solution.

## Attempt 2 (tidyxl)

```d2 <- tidyxl::xlsx_cells('spreadsheet1.xlsx')
```

## # A tibble: 6 x 21
## sheet address row col is_blank data_type error logical numeric
##
## 1 T5T6 B1 1 2 FALSE character NA NA
## 2 T5T6 C1 1 3 FALSE numeric NA 44.
## 3 T5T6 D1 1 4 FALSE numeric NA 44.
## 4 T5T6 E1 1 5 FALSE numeric NA 10.
## 5 T5T6 F1 1 6 FALSE numeric NA 10.
## 6 T5T6 G1 1 7 FALSE numeric NA 3.
## # … with 12 more variables: date , character ,
## # character_formatted , formula , is_array ,
## # formula_ref , formula_group , comment , height ,
## # width , style_format , local_format_id

The `xlsx_cells` captures the data in a tidy fashion, explicitly calling out rows and columns and other metadata within each cell. We can clean up this data using tidyverse functions:

```library(tidyverse)
cleanData1 = function(d) {
angle = d %>% filter(row >= 4, col == 1) %>% pull(numeric)
name = d %>% filter(row %in% c(1,3), col >= 3) %>%
mutate(character = ifelse(is.na(character),
as.character(numeric),
character)) %>%
select(row, col, character) %>%
filter(!is.na(character)) %>%
unite(ID, `1`:`3`, sep = '_') %>%
pull(ID)
data = d  %>% filter(row >= 4, col >= 3) %>%
filter(!is.na(numeric)) %>%
select(row, col, numeric) %>%
select(-row) %>%
set_names(name) %>%
cbind(angle) %>%
gather(variable, value, -angle) %>%
separate(variable, c('ID','Measure'), sep = '_') %>%
select(ID, angle, A, B) %>%
arrange(ID, angle)
return(data)
}

```
```##   ID angle        A        B
## 1 10    60 31.52867 1.000000
## 2 10    65 31.33477 1.000000
## 3 10    70 31.09272 0.202002
## 4 10    75 30.96078 0.448317
## 5 10    80 30.79397 0.670876
## 6 10    85 30.52185 0.461406
```

This is a lot of data munging, and though `dplyr` is powerful, it took a lot of trial and error to get the final pipeline done.Nonetheless, I was really psyched about `tidyxl`, since it automated a job that would have taken manual manipulation (I had 12 spreadsheets like this to process). I was going to write a blog post on this cool package that made my life dealing with messy Excel file a piece of cake. But wait, there’s more…

## Attempt 3 (tidyxl + unpivotr)

I didn’t know about `unpivotr` until this post:

So maybe all that complicated munging can be simplfied.

```# devtools::install_github('nacnudus/unpivotr')
library(unpivotr)

cleanData2 = function(d){
bl = d %>% select(row, col, data_type, numeric, character) %>%
# Extract the angles column
bl1 = bl %>% filter(variable == 'angles') %>% spatter(variable) %>%
select(row, angles)
# Extract the rest of the columns
bl2 = bl %>% filter(variable %in% c('A','B')) %>% select(-spine, -col) %>%
spatter(ID) %>% # Spread to columns
select(-character) %>% # All my variables are numeric
gather(ID, value, -row, -variable) %>%
final = bl1 %>% left_join(bl2) %>% # put things back together
arrange(ID, angles) %>%
select(ID, everything(),-row) # re-arrange columns
return(final)
}

cleanData2(d2)
```
```## # A tibble: 588 x 4
##    ID    angles     A     B
##
##  1 10       60.  31.5 1.00
##  2 10       65.  31.3 1.00
##  3 10       70.  31.1 0.202
##  4 10       75.  31.0 0.448
##  5 10       80.  30.8 0.671
##  6 10       85.  30.5 0.461
##  7 10       90.  30.3 0.245
##  8 10       95.  30.0 0.159
##  9 10      100.  29.7 0.170
## 10 10      105.  29.2 0.421
## # ... with 578 more rows
```

In this example, I’m using the `behead` function (available in the development version of `unpivotr` on GitHub) to extract out the three rows of headers. Then I’m extracting out the `angles` column separately and merging it with the rest of the columns.

In case you’re wondering about the “N” in the `behead` code, `unpivotr` has a geographic options system as to where the headers are with respect to the main code. This vignette explains this nomenclature.

## Attempt 4 (tidyxl + unpivotr)

After re-reading the `unpivotr` documentation, I realized that the `angles` column could be treated as a row header in the `unpivotr` code. So I further modified the function:

```cleanData3 = function(d) {
final = d %>%
select(row, col, data_type, numeric, character) %>%
select(numeric, ID:angles, data_type, -spine) %>% # all vars are numeric
filter(variable %in% c'A','B')) %>% # Kills off some extra columns
spatter(variable) # Spreads, using data_type, numeric
return(final)
}

cleanData3(d2)
```
```## A tibble: 588 x 4
##   ID    angles     A     B
##
## 1 10       60.  31.5 1.00
## 2 10       65.  31.3 1.00
## 3 10       70.  31.1 0.202
## 4 10       75.  31.0 0.448
## 5 10       80.  30.8 0.671
## 6 10       85.  30.5 0.461
## 7 10       90.  30.3 0.245
## 8 10       95.  30.0 0.159
## 9 10      100.  29.7 0.170
##10 10      105.  29.2 0.421
## ... with 578 more rows
```

I get to the same output, but with much cleaner code. This is cool!!I’m going to go deeper into the `unpivotr` documentation and see what else can be in my regular pipeline. A big thank you to the tool-makers that create these tools that make everyday activies easier and make us stay saner.

# Tidying messy Excel data (Introduction)

### Personal expressiveness, or how data is stored in a spreadsheet

When you get data from a broad research community, the variability in how that data is formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace!

But for smaller data, or data collected in the lab, the possibilities are truly endless! You can get every possiblle color-coding of rows, columns and cells, merged cells, hidden columns and rows, and inopportune blank spaces that convert numbers to characters, and notes where there should be numbers. That’s just from the more organized spreadsheets. Then you get multiple tables in the same spreadsheet, ad hoc computations in some cells, cells copied by hand (with error), and sundry other variations on this theme. In other words, it can be a nightmare scenario for the analyst. To wit,

In thinking about this post, I went back and looked at the documentation of the `readxl` package, which has made reading Excel files into R a lot easier than before. This package is quite powerful, so as long as data are in a relatively clean tabular form, this tool can pull it into R; see these vignettes to get a real sense of how to process decently behaved Excel files with R.

On a side note, how many ways of importing Excel files into R can you name, or have you used?

The `readxl` package has been my friend for a while, but then I received some well-intentioned spreadsheets that even `readxl` wouldn’t touch, despite much coaxing. Now, I had two options: ask the collaborator to re-format the spreadsheet (manually, of course 😄), which in my experience is a disaster waiting to happen; or just take the spreadsheet as is and figure out how to import it into R. I almost always take the latter route, and `tidyxl` is my bosom friend in this effort. In the next part of this series, I’ll describe my experiences with `tidyxl` and why, every time I use it, I say a small blessing for the team that created it.

# Moving to blogdown

I’ve been in the process of transferring my blog (along with creating a personal website) to blogdown, which is hosted on Github Pages and Netlify. The new blog, or rather, the continuation of this blog, will be at webbedfeet.netlify.io/posts, and it went live today.

I’ll be cross-posting here for a while, at least until Tal gets my new blog address included in R-Bloggers. I’m enjoying the RMarkdown blogging experience now, which is quite nice, and any code or analyses I want to include isn’t “lost in translation” when on WP. Since I live in R most of my days, it is also allowing a rather free flow of ideas onto the virtual page.

Hope you’ll come visit 🙂

# Surprising result when exploring Rcpp gallery

I’m starting to incorporate more Rcpp in my R work, and so decided to spend some time exploring the Rcpp Gallery. One example by John Merrill caught my eye. He provides a C++ solution to transforming an list of lists into a data frame, and shows impressive speed savings compared to as.data.frame.

This got me thinking about how I do this operation currently. I tend to rely on the `do.call` method. To mimic the example in the Rcpp example:

```a <- replicate(250, 1:100, simplify=FALSE)
b <- do.call(cbind, a)
```

For fairness, I should get a data frame rather than a matrix, so for my comparisons, I do convert `b` into a data frame. I follow the original coding in the example, adding my method above into the mix. Comparing times:

```res <- benchmark(as.data.frame(a),
CheapDataFrameBuilder(a),
as.data.frame(do.call(cbind, a)),
order="relative", replications=500)
res[,1:4]
```

The results were quite interesting to me 🙂

```                              test replications elapsed relative
3 as.data.frame(do.call(cbind, a))          500    0.36    1.000
2         CheapDataFrameBuilder(a)          500    0.52    1.444
1                 as.data.frame(a)          500    7.28   20.222```

I think part of what’s happening here is that as.data.frame.list expends overhead checking for different aspects of making a legit data frame, including naming conventions. The comparison to `CheapDataFrameBuilder` should really be with my barebones strategy. Having said that, the example does provide great value in showing what can be done using Rcpp.

# Quirks about running Rcpp on Windows through RStudio

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.

I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find `Rcpp.h`. First, `devtools::find_rtools` was giving a positive result, so (a) was not true. Second, I noticed that the wrong C++ compiler was being called. Even more frustrating was the fact that everything was working if I worked on a native R console rather than RStudio. So there was nothing inherently wrong with the code or setup, but rather the environment RStudio was creating.

After some searching the interwebs and StackOverflow, the following solution worked for me. I added the following lines to my global .Rprofile file:

```Sys.setenv(PATH = paste(Sys.getenv("PATH"), "C:/RBuildTools/3.4/bin/",
"C:/RBuildTools/3.4/mingw_64/bin", sep = ";"))
Sys.setenv(BINPREF = "C:/RBuildTools/3.4/mingw_64/bin/")
```

Note that `C:/RBuildTools` is the default location suggested when I installed RTools.

This solution is indicated here, but I have the reverse issue of the default setup working in R and not in the latest RStudio. However, the solution still works!!

Note that instead of putting it in the global .Rprofile, you could put it in a project-specific .Rprofile, or even in your R code as long as it is run before loading the `Rcpp` or derivative packages. Note also that if you use binary packages that use `Rcpp`, there is no problem. Only when you’re compiling C++ code either for your own code or for building a package from source is this an issue. And, as far as I can tell, only on Windows.

Hope this prevents someone else from 3 hours of heartburn trying to make Rcpp work on a Windows box. And, if this has already been fixed in RStudio, please comment and I’ll be happy to update this post.

# Finding my Dropbox in R

I’ll often keep non-sensitive data on Dropbox so that I can access it on all my machines without gumming up git. I just wrote a small script to find the Dropbox location on each of my computers automatically. The crucial information is available here, from Dropbox.

My small snippet of code is the following:

```if (Sys.info()['sysname'] == 'Darwin') {
info <- RJSONIO::fromJSON(
file.path(path.expand("~"),'.dropbox','info.json'))
}
if (Sys.info()['sysname'] == 'Windows') {
info <- RJSONIO::fromJSON(
if (file.exists(file.path(Sys.getenv('APPDATA'), 'Dropbox','info.json'))) {
file.path(Sys.getenv('APPDATA'), 'Dropbox', 'info.json')
} else {
file.path(Sys.getenv('LOCALAPPDATA'),'Dropbox','info.json')
}
)
}

dropbox_base <- info\$personal\$path
```

I haven’t included the Linux option since I don’t really use a Linux box, but the Dropbox link above will show you where the info.json file lies in Linux. Also, if you have a business Dropbox account, you’ll probably need `info\$business\$path`.

Hope this helps!!!

# Some thoughts on the downsides of current Data Science practice

Bert Huang has a nice blog talking about poor results of ML/AI algorithms in “wild” data, which echos some of my experience and thoughts. His conclusions are worth thinking about, IMO.

1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected. But when our databases are significant fractions of the world population, we are much farther away from the controlled samples of good laboratory science. This means…

2. Data science as it’s currently practiced is essentially bad science. When we take a biased, dependent population of samples and try to generalize a conclusion from it, we need to be fully aware of how flawed our study is. That doesn’t mean things we discover using data analytics aren’t useful, but they need to be understood through the lens of the bias and complex dependencies present in the training data.

3. Computational methods should be aware of, and take advantage of, known dependencies. Some subfields of data mining and machine learning address this, like structured output learning, graph mining, relational learning, and more. But there is a lot of research progress needed. The data we’re mostly interested in nowadays comes from complex phenomena, which means we have to pay for accurate modeling with a little computational and cognitive complexity. How we manage that is a big open problem.

Specially point 3 is one I’ve been thinking about a lot recently. Our current frameworks are quite limited in dealing with dependencies and complexity. We’ve been happy using decades-old methods since they work pretty well on the predictive side as a reasonable approximation to the truth. However, having machines understanding complexity and incorporating it in predictions or understanding is a second-level challenge that can use significant research effort.

# pandas “transform” using the tidyverse

Chris Moffit has a nice blog on how to use the `transform` function in `pandas`. He provides some (fake) data on sales and asks the question of what fraction of each order is from each SKU.

Being a R nut and a `tidyverse` fan, I thought to compare and contrast the code for the `pandas` version with an implementation using the tidyverse.

First the `pandas` code:

```import pandas as pd
dat['Percent_of_Order'] = dat['ext price']/dat.groupby('order')['ext price'].transform('sum')
```

A similar implementation using the tidyverse:

```library(tidyverse)
dat <- dat %>%
group_by(order) %>%
mutate(Percent_of_Order = `ext price`/sum(`ext price`))
```

# Changing names in the tidyverse: An example for many regressions

A collaborator posed an interesting R question to me today. She wanted to do
several regressions using different outcomes, with models being computed on
different strata defined by a combination of experimental design variables. She then just wanted to extract the p-values for the slopes for each of the models, and then
filter the strata based on p-value levels.

This seems straighforward, right? Let’s set up a toy example:

```library(tidyverse)

dat <- as_tibble(expand.grid(letters[1:4], 1:5))
d <- vector('list', nrow(dat))
set.seed(102)
for(i in 1:nrow(dat)){
x <- rnorm(100)
d[[i]] <- tibble(x = x, y1 = 3 - 2*x + rnorm(100), y2 = -4+5*x+rnorm(100))
}
dat <- as_tibble(bind_cols(dat, tibble(dat=d))) %>% unnest()
```
Var1 Var2 x y1 y2
a 1 0.1805229 4.2598245 -3.004535
a 1 0.7847340 0.0023338 -2.104949
a 1 -1.3531646 3.1711898 -9.156758
a 1 1.9832982 -0.7140910 5.966377
a 1 1.2384717 0.3523034 2.131004
a 1 1.2006174 0.6267716 1.752106

Now we’re going to perform two regressions, one using `y1` and one using `y2` as the dependent variables, for each stratum defined by `Var1` and `Var2`.

```out <- dat %>%
nest(-Var1, -Var2) %>%
mutate(model1 = map(data, ~lm(y1~x, data=.)),
model2 = map(data, ~lm(y2~x, data=.)))
```

Now conceptually, all we do is tidy up the output for the models using the `broom` package, filter on the rows containg the slope information, and extract the p-values, right? Not quite….

```library(broom)
out_problem <- out %>% mutate(output1 = map(model1, ~tidy(.)),
output2 = map(model2, ~tidy(.))) %>%
select(-data, -model1, -model2) %>%
unnest()
names(out_problem)
```

[1] “Var1” “Var2” “term” “estimate” “std.error”
[6] “statistic” “p.value” “term” “estimate” “std.error”
[11] “statistic” “p.value”

We’ve got two sets of output, but with the same column names!!! This is a problem! An easy solution would be to preface the column names with the name of the response variable. I struggled with this today until I discovered the secret function.

```out_nice <- out %>% mutate(output1 = map(model1, ~tidy(.)),
output2 = map(model2, ~tidy(.)),
output1 = map(output1, ~setNames(., paste('y1', names(.), sep='_'))),
output2 = map(output2, ~setNames(., paste('y2', names(.), sep='_')))) %>%
select(-data, -model1, -model2) %>%
unnest()
```

This is a compact representation of the results of both regressions by strata, and we can extract the information we would like very easily. For example, to extract the stratum-specific slope estimates:

```out_nice %>% filter(y1_term=='x') %>%
select(Var1, Var2, ends_with('estimate')) %>%
knitr::kable(digits=3, format='html')
```
Var1 Var2 y1_estimate y2_estimate
a 1 -1.897 5.036
b 1 -2.000 5.022
c 1 -1.988 4.888
d 1 -2.089 5.089
a 2 -2.052 5.015
b 2 -1.922 5.004
c 2 -1.936 4.969
d 2 -1.961 4.959
a 3 -2.043 5.017
b 3 -2.045 4.860
c 3 -1.996 5.009
d 3 -1.922 4.894
a 4 -2.000 4.942
b 4 -2.000 4.932
c 4 -2.033 5.042
d 4 -2.165 5.049
a 5 -2.094 5.010
b 5 -1.961 5.122
c 5 -2.106 5.153
d 5 -1.974 5.009