Month: April 2017

Some thoughts on the downsides of current Data Science practice

Bert Huang has a nice blog post about the poor results of ML/AI algorithms on “wild” data, which echoes some of my own experience and thoughts. His conclusions are worth thinking about, IMO.

1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected. But when our databases are significant fractions of the world population, we are much farther away from the controlled samples of good laboratory science. This means…

2. Data science as it’s currently practiced is essentially bad science. When we take a biased, dependent population of samples and try to generalize a conclusion from it, we need to be fully aware of how flawed our study is. That doesn’t mean things we discover using data analytics aren’t useful, but they need to be understood through the lens of the bias and complex dependencies present in the training data.

3. Computational methods should be aware of, and take advantage of, known dependencies. Some subfields of data mining and machine learning address this, like structured output learning, graph mining, relational learning, and more. But there is a lot of research progress needed. The data we’re mostly interested in nowadays comes from complex phenomena, which means we have to pay for accurate modeling with a little computational and cognitive complexity. How we manage that is a big open problem.

Point 3 especially is one I’ve been thinking about a lot recently. Our current frameworks are quite limited in dealing with dependencies and complexity. We’ve been happy using decades-old methods because, on the predictive side, they work pretty well as a reasonable approximation to the truth. However, getting machines to understand complexity and incorporate it into predictions or explanations is a second-level challenge that deserves significant research effort.

pandas “transform” using the tidyverse

Chris Moffitt has a nice blog post on how to use the transform function in pandas. He provides some (fake) sales data and asks what fraction of each order comes from each SKU.

Being an R nut and a tidyverse fan, I thought I’d compare and contrast the pandas code with an implementation using the tidyverse.

First the pandas code:

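import pandas as pd
dat = pd.read_excel('sales_transactions.xlsx')
# transform('sum') returns the order total repeated on every row of that order,
# so the division lines up row by row
dat['Percent_of_Order'] = dat['ext price']/dat.groupby('order')['ext price'].transform('sum')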
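To see what transform is doing without the Excel file, here is a minimal sketch on a tiny made-up data frame. The column names `order` and `ext price` mirror the ones used above; the `sku` column and all the values are my own invention for illustration, not the data from the original post.

```python
import pandas as pd

# Hypothetical toy data: two orders, values made up for illustration.
dat = pd.DataFrame({
    "order": [1, 1, 2, 2, 2],
    "sku": ["A", "B", "A", "C", "D"],
    "ext price": [30.0, 70.0, 10.0, 20.0, 20.0],
})

# transform('sum') returns a Series with one entry per original row:
# each row gets the total of the order it belongs to.
order_total = dat.groupby("order")["ext price"].transform("sum")

# Because order_total is aligned with dat, a plain division gives
# each row's share of its order.
dat["Percent_of_Order"] = dat["ext price"] / order_total

print(dat)
```

The key point is that the result of transform has the same length and index as the original data, unlike a plain grouped sum, which would return one row per order and need a merge back.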

A similar implementation using the tidyverse:

library(tidyverse)
library(readxl)
dat <- read_excel('sales_transactions.xlsx')
dat <- dat %>%
  group_by(order) %>%
  mutate(Percent_of_Order = `ext price` / sum(`ext price`))