Month: January 2016

Crowdsourcing research

Last evening, Anthony Goldbloom, the founder of Kaggle, gave a very nice talk at a joint Statistical Programming DC/Data Science DC event about the Kaggle experience and what can be learned from the results of their competitions. One of the takeaway messages was that crowdsourcing data problems to a diligent and motivated group of entrepreneurial data scientists can get you to the threshold of extracting signal and patterns from data far more quickly than if a closed and siloed group of analysts worked on the problem. There were several examples where analysts, some with very different domain expertise from the problem domain, were able to, in rather short order, improve upon the best analyses that the academic community could produce after years of effort. It is evidence that many varied approaches from many intelligent minds can be very efficient, and can break the “box” that domain experts and academics often limit themselves to. This phenomenon should be eye-opening to researchers.

Another feature of the competitions was that once one team had a breakthrough, others quickly followed to match and beat it. Anthony called this the “Roger Bannister effect”, i.e., once other competitors see that a barrier is breakable, a psychological wall is broken and the new threshold is rapidly matched by others. This is an interesting observation, in that it seems to indicate that we limit ourselves and are satisfied with what we believe to be the best until someone proves us wrong; then we are quick to figure out how to achieve the new limit. It’s also interesting that some individuals or teams believe that the current limit is artificial and can be broken. Eventually a true limit is reached (all the signal is extracted from the data) and the subsequent efforts produce minuscule incremental improvements.

I’ve been thinking recently about ways of crowdsourcing research. Crowdsourced data collection is indeed possible, and infrastructures like Apple’s ResearchKit and other IT solutions (like one we’re developing at Zansors) can help. However, it is essential that the analyses and interpretation of studies use as many eyes and hands as possible to get innovative and optimal solutions. At an individual analyst level, it may involve trying many, many models and looking at ensemble models. Teams can look at different approaches to a problem, and the community can add even more diversity to approaches, provided the data is shared. Unfortunately, the current incentives in academic science will not promote this; rather, they promote isolationist and siloed research. Data sharing is limited at best, due to the (not unfounded) fear of being scooped by competitors; this is a result of the current incentive system in science. Team or community science is not really promoted, because then particular investigators cannot get the credit they need for tenure or promotions. Platforms like Kaggle can help, but crowdsourced analysis also needs cultural acceptance within the scientific community. It is noteworthy that most of the successful competitions on Kaggle have been sponsored by companies with financial stakes in the outcome, or by government agencies like NIH in association with companies.

A third point that Anthony demonstrated is that Kaggle competitions can make a difference. There are tough problems in image analysis applied to disease diagnosis or prevention that are crowdsourced on Kaggle, with the hope that a good solution will emerge that will change paradigms for clinical and public health practice. There are similarities to how the US government (and other governments) promote research by funding academic groups to address problems that agencies deem important. This paradigm is a crowdsourcing solution to some extent, since many groups are engaged towards a common goal by virtue of particular research programs and grant portfolios that agencies like NIH run, but the fundamental paradigm in grant funding is competition rather than cooperation. It’s not really crowdsourcing in the spirit of Kaggle or the Netflix Prize, where there are hundreds of teams focused on a common goal, and competitors coalesced, collaborated, and cooperated to form bigger, more capable teams to get to a better solution. There is a bit of a problem translating this to academia or real research, since almost all the teams in these competitions are volunteer efforts outside of their usual jobs, and are typically unfunded. Research, and sustained research in particular, requires regular funding to succeed. How that would work within a crowdsourcing paradigm is unclear to me.

My main thoughts after Anthony’s presentation were that there is a lot of promise in collaborative crowdsourcing for particular well-defined problems, and such avenues should be more widely used to solve them. For more nascent or amorphous problems, we would be well served to increase discussion and collaboration and to share negative findings, so that unpromising directions can be eliminated without too much wasted effort and progress can happen more efficiently and quickly.