We have a data set *dat* with multiple observations per subject. We want to create a subset of this data such that each subject (with **ID** giving the unique identifier for the subject) contributes the observation where the variable **X** takes it’s maximum value for that subject.

## R solutions

### Hadleyverse solutions

Using the excellent R package *dplyr*, we can do this using windowing functions included in *dplyr*. The following solution is available on StackOverflow, by junkka, and gets around the real possibility that multiple observations might have the same maximum value of **X** by choosing one of them.

library(dplyr) dat_max <- dat %>% group_by(ID) %>% filter(row_number(X)==n())

To be a bit more explicit, `row_number`

is a wrapper around `rank`

, using the option `ties.method=“first”`

. So you can use the `rank`

function explicitly here as well.

A solution using the plyr package might be

library(plyr) dat_max <- ddply(dat, .(ID), function(u) u[rank(u$X, ties.method='first')==nrow(u),])

### A data.table solution

The data.table package also has great munging capabilities like dplyr. The same StackOverflow thread linked above also provides this solution using `data.table`

, provided by Arun:

dt <- as.data.table(dat) dt[dt[, .I[which.max(X)], by=ID]$V1]

### Using SQL statements via sqldf

The sqldf package allows an easy implementation of using SQL statements on data frame objects. As Ryan mentions in the comments, the possibilities of solving our problem using *sqldf* is straightforward:

library(sqldf) bl <- sqldf("select ID, max(X) from dat group by ID") names(bl)[2] <- 'X' sqldf('select * from bl LEFT JOIN dat using(ID, X)')

## Python solutions

### Using pandas

In Python, the excellent pandas package allows us to do similar operations. The following example is from this thread on StackOverflow.

import pandas as pd df = DataFrame({’Sp’:[‘a’,’b‘,’c’,’d‘,’e’,’f’], ‘Mt’:[‘s1’, ‘s1’, ‘s2’,’s2‘,’s2’,’s3’], ‘Value’:[1,2,3,4,5,6], ‘count’:[3,2,5,10,10,6]}) df.iloc[df.groupby([‘Mt’]).apply(lambda x: x[‘count’].idxmax())]

You could also do (from the same thread)

df.sort(‘count’, ascending=False).groupby(‘Mt’, as_index=False).first()

but it is about 50% slower.

### Using pandasql

The package pandasql is a port of `sqldf`

to Python developed by

yhat. Using this package, we can mimic the `sqldf`

solution given above:

from pandasql import sqldf sqldf('select Sp, Mt, Value, max(count) from df group by Mt', globals())