Creating new data with max values for each subject

We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier for the subject) contributes the observation where the variable X takes it’s maximum value for that subject.

R solutions

Hadleyverse solutions

Using the excellent R package dplyr, we can do this using windowing functions included in dplyr. The following solution is available on StackOverflow, by junkka, and gets around the real possibility that multiple observations might have the same maximum value of X by choosing one of them.

dat_max <- dat %>% group_by(ID) %>% filter(row_number(X)==n())

To be a bit more explicit, row_number is a wrapper around rank, using the option ties.method=“first”. So you can use the rank function explicitly here as well.

A solution using the plyr package might be

dat_max <- ddply(dat, .(ID), function(u) u[rank(u$X, ties.method='first')==nrow(u),])

A data.table solution

The data.table package also has great munging capabilities like dplyr. The same StackOverflow thread linked above also provides this solution using data.table, provided by Arun:

dt <-
dt[dt[, .I[which.max(X)], by=ID]$V1]

Using SQL statements via sqldf

The sqldf package allows an easy implementation of using SQL statements on data frame objects. As Ryan mentions in the comments, the possibilities of solving our problem using sqldf is straightforward:

bl <- sqldf("select ID, max(X) from dat group by ID")
names(bl)[2] <- 'X'
sqldf('select * from bl LEFT JOIN dat using(ID, X)')

Python solutions

Using pandas

In Python, the excellent pandas package allows us to do similar operations. The following example is from this thread on StackOverflow.

import pandas as pd
df = DataFrame({’Sp’:[‘a’,’b‘,’c’,’d‘,’e’,’f’], ‘Mt’:[‘s1’, ‘s1’, ‘s2’,’s2‘,’s2’,’s3’], ‘Value’:[1,2,3,4,5,6], ‘count’:[3,2,5,10,10,6]})
df.iloc[df.groupby([‘Mt’]).apply(lambda x: x[‘count’].idxmax())]

You could also do (from the same thread)

df.sort(‘count’, ascending=False).groupby(‘Mt’, as_index=False).first()

but it is about 50% slower.

Using pandasql

The package pandasql is a port of sqldf to Python developed by
yhat. Using this package, we can mimic the sqldf solution given above:

from pandasql import sqldf
sqldf('select Sp, Mt, Value, max(count) from df group by Mt', globals())


  1. If your data is in a relational database, you can easily do this aggregation before pulling the data in, and the solution is much, much simpler. Or, if not, you can use the `sqldf` package in R or the `pandasql` module for Python/Pandas to write SQL statements against your DataFrame objects.

    If your data is in a table or DF called `dat` and subject id is in column labeled `id` and the variable you’d like to aggregate is in column labeled `x`, the solution in SQL would look something like:

    `SELECT id, max(x) AS max_x FROM dat GROUP BY id`

  2. It is easy in data.table in R. dt[, maxCol := max(X), by= ID] where dt is the dataset you are working with, maxCol is a new col with the max of X (repeated for each row of each unique ID). This is all done by reference, so that it works very fast on large datasets.

    1. Thanks for your solution. However, it only creates the max values by ID. It doesn’t subset so that the resultant data.table has only one row per subject, where the X value is max for that subject.

  3. I learn most of my R programming by reading posts like this, thanks! I have been using dplyr for a while, but wasn’t using any of these windowing functions!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s