The many faces of statistics/data science: Can’t we all just get along and learn from each other?

Two blog posts in the last 24 hours caught my attention. The first was this post by Jeff Leek, noting that many fields are applied statistics by another name (and I’d add operations research to his list). The second is a post on Cloudera’s blog on constructing case-control studies. It is generally excellent, but it includes this rather unfortunate (in my view) statement:

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

First of all, this ignores what biostatisticians have been doing in collaboration with epidemiologists for decades. The design of a study, as any statistician understands, is at least as important as the analysis, and statisticians have been at the forefront of pushing good study design. Second, it shows a fundamental lack of understanding of the breadth of what statistics as a discipline encompasses. Third, it all but illustrates Jeff’s point about fields that are considered different but are essentially “applied statistics”. There seems to be a strong push to claim a new field as different and sexier than what has come before (an issue of branding and worth, perhaps?) without understanding what is already out there.

Statistics as a field has been guilty of this as well. The most obvious and wasteful consequence of this is “re-inventing the wheel”, rather than leveraging the power of other discoveries. Ownership of an idea is a powerful concept, but there must be the recognition that while translating a concept for a new audience is useful and extremely necessary, merely claiming ownership while willfully ignoring the developments by colleagues in another field is wasteful and disingenuous.

A recent discussion with a colleague reinforced this point even within statistics. Some of the newer developments in a relatively new methodologic space follow the same lines of theoretical development as an older methodologic space. The new guys are coming up against the same brick walls as the earlier researchers, and the new researchers seem unaware of the path already travelled (since the keywords are different and not necessarily directly related, Google Scholar fails).

The bottom line here is the strong need for more cross-talk between disciplines, more collaboration among researchers, greater understanding of the knowledge already out there, and more breadth in our own training and knowledge.


  1. Hey Abhijit,

    Thanks for your comment on my blog post. Although I don’t agree with everything you said, I see where you were coming from, and how my somewhat glib quote could be misinterpreted as a slam against statisticians. I modified the original blog post to indicate that the design and analysis of case-control studies is a problem in statistics, whereas the construction is a problem in data science. My intent was to say that when your case-control study takes days to construct, you’re probably solving the problem in a bad way, and knowing some hard-core computer science/operations research becomes really useful.

    1. Hi Josh,

      Thanks for commenting. I think that one comment rubbed me the wrong way, since I’ve worked on case-control studies in a previous life, and I’m happy with your modification. I didn’t mean to slam you, since I think the article is excellent and the method you suggest is something I didn’t know about and learned a lot from. I think what you say here is something I can totally agree with: the construction of the matched pairs can be dealt with much more efficiently using smart CS/OR algorithms than with the possibly cruder methods that are often used. Reading your article made the point to me that I need to be broader in my own reading and learning, since the cross-pollination of disciplines can result in elegant and efficient solutions, which really is the bottom line of my thesis here. I will modify my post later today to reflect your changes. Only fair. No hard feelings, I hope.

      Is there a similar auction-based solution to getting many-to-one matches? Case-control studies can be designed as 2 controls per case or something similar instead of 1-1 matching. Once again, here, you have the same constraints as what you state, but you need to select the two strongest links rather than only 1.

  2. No hard feelings at all, Abhijit.

    The simple way to solve the many-to-one matching problem is to create clones of the control nodes in the graph and add dummy case nodes (with a weight of zero) for the clones to match on in case we run out of matches. You can do that in the system now, although you have to do it manually. I think I will tweak things a bit in the coming weeks to make that case easier to deploy once I have a better feel for it. There are also some optimizations to the algorithm you can do in the many-to-one case that are a bit harder to implement, but may well be worthwhile as the problem gets even larger.
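
The cloning trick in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions, not the system discussed in the thread: the function names (`expand`, `best_assignment`, `many_to_one`) are hypothetical, the weight matrix is made up, and an exhaustive search stands in for the auction algorithm so that the example stays self-contained. Rows are the nodes being cloned, columns are the nodes on the other side, and zero-weight dummies pad whichever side is short; which of cases or controls goes on the rows depends on how the graph is laid out, which the comment does not spell out.

```python
from itertools import permutations


def expand(weights, k):
    """Clone every row k times, then pad with zero-weight dummies until square.

    Clone c of original row i ends up at index i * k + c.
    """
    n_cols = len(weights[0])
    rows = [list(row) for row in weights for _ in range(k)]
    size = max(len(rows), n_cols)
    # Dummy columns: somewhere for clones to land when real matches run out.
    rows = [row + [0.0] * (size - n_cols) for row in rows]
    # Dummy rows: absorb any columns that are left over.
    rows += [[0.0] * size for _ in range(size - len(rows))]
    return rows


def best_assignment(square):
    """Maximum-weight one-to-one assignment by exhaustive search.

    Only feasible for tiny inputs; a real system would use an auction or
    Hungarian-style solver here (assumption: any exact solver can be swapped in).
    """
    n = len(square)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(square[i][perm[i]] for i in range(n)),
    )


def many_to_one(weights, k):
    """Give each original row up to k distinct columns, maximising total weight."""
    n_rows, n_cols = len(weights), len(weights[0])
    perm = best_assignment(expand(weights, k))
    matches = {i: [] for i in range(n_rows)}
    for row, col in enumerate(perm):
        # Ignore pairs that involve a dummy row or a dummy column.
        if row < n_rows * k and col < n_cols:
            matches[row // k].append(col)
    return {i: sorted(cols) for i, cols in matches.items()}


# Made-up match strengths: 2 row nodes, 5 column nodes, 2 partners per row node.
weights = [
    [9, 8, 1, 2, 3],
    [1, 2, 7, 6, 3],
]
print(many_to_one(weights, 2))
```

Because the expanded problem is an ordinary one-to-one assignment, the existing solver can be reused unchanged; the cost is that the graph grows by a factor of k, which is presumably where the optimizations mentioned in the comment would come in.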
