Thursday, August 30, 2018

What is Data Science?

The Simons Institute at Berkeley has two semester-long programs this fall, Lower Bounds on Computational Complexity and Foundations of Data Science. The beginning of each program features a "boot camp" to get people up to speed in the field: complexity last week and data science this week. Check out the links for great videos on the current state of the art.

Data Science is one of those terms you see everywhere but that is not well understood. Is it the same as machine learning? Data analytics? Those pieces play only a part of the field.

Emmanuel Candès, a Stanford statistician, gave a great description during his keynote talk at the recent STOC theoryfest. I'll try to paraphrase.

The basic scientific method works as follows: You make a hypothesis consistent with the world as you know it. Design an experiment that would distinguish your hypothesis from the current models we have. Run the experiment and accept, reject, or refine your hypothesis as appropriate. Repeat.
The discovery of the Higgs boson is a recent example of this model.

Technological advances have given us a different paradigm.
  1. Our ability to generate data has greatly increased whether it be from sensors, DNA, telescopes, computer simulations, social media and oh so many other sources.
  2. Our ability to store, communicate and compress this data saves us from having to throw most of it away.
  3. Our ability to analyze data through machine learning, streaming and other analysis tools has greatly increased with new algorithms, faster computers and specialized hardware.
All this data does not lend itself well to manually creating hypotheses to test. So we use automated analysis tools, like machine learning, to create models from the data and use other, held-out data to test those models. Data science is this process writ large.
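
To make that loop concrete, here is a minimal sketch of fitting a model on one slice of data and checking it on a held-out slice. The dataset, the choice of scikit-learn's LogisticRegression, and all the numbers are my own illustrative assumptions, not anything from Candès's talk.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Hypothetical data: 1,000 samples, 20 features, labels driven by 3 features.
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] + 0.5 * X[:, 1] - X[:, 2]
         + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    # Hold out a quarter of the data; the model never sees it while fitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Accuracy on the held-out slice is the honest check of the learned model.
    print("train accuracy:", model.score(X_train, y_train))
    print("test accuracy: ", model.score(X_test, y_test))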

We are in the very early stages of data science and face many challenges. Candès talked about one of them: how to prevent false claims that arise from the data, a problem not unrelated to the current reproducibility crisis in science.
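
For a small illustration of how false claims sneak in (a toy example of my own, not from the talk): test a thousand hypotheses on noisy data and a naive p < 0.05 threshold flags dozens of pure-noise "discoveries," while a standard correction like Benjamini-Hochberg, one common way to control the false discovery rate, keeps most of them out.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # 1,000 hypotheses, 50 of which are real effects; the rest are pure noise.
    n_tests, n_samples = 1000, 50
    true_effect = np.zeros(n_tests)
    true_effect[:50] = 0.8

    # One two-sided t-test per hypothesis: is this feature's mean nonzero?
    data = rng.normal(loc=true_effect[:, None], size=(n_tests, n_samples))
    pvals = stats.ttest_1samp(data, 0.0, axis=1).pvalue

    naive = pvals < 0.05
    print("naive discoveries:", naive.sum(), " false:", naive[50:].sum())

    # Benjamini-Hochberg at level q: reject the k smallest p-values, where k is
    # the largest rank with p_(k) <= (k / m) * q.
    q, m = 0.05, n_tests
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    bh = np.zeros(m, dtype=bool)
    bh[order[:k]] = True
    print("BH discoveries:   ", bh.sum(), " false:", bh[50:].sum())

On this synthetic data the naive threshold typically reports around fifty false positives on top of the real effects, while the corrected procedure usually lets through only a handful.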

We have other scientific issues. How can we vouch for the data itself, and what about errors in the data? Many of the tools remain ad hoc; how can we get theoretical guarantees? Not to mention the various ethical, legal, security, privacy and fairness issues that vary across disciplines and nations.

We sit at a time of exciting change in the very nature of research itself, but how can we get it right when we still don't know all the ways we get it wrong?
