Biology is increasingly entering the fourth paradigm of science: tera- to exabyte-scale data generation with no single hypothesis in mind. These gigantic datasets are then searched for patterns that elucidate the biological processes that generated the measured data. The tools currently available to biologists, such as R and Python libraries, are not designed for datasets and algorithms that run on ten-thousand-node cloud clusters. Moreover, these libraries cannot be naively rewritten atop a distributed computing framework like Spark, because these rich, high-dimensional datasets do not map well to the existing abstractions. In this talk, I’ll describe the kinds of questions biologists with massive datasets would like to ask, as well as some of the tools my team is building to enable statistical genetics on datasets in the tens of terabytes.
Dan is a software engineer at the Broad Institute, where he builds tools to enable terabyte-scale biology with the aim of understanding disease. Before that, he worked at a clinical-trial software startup, and before that he dropped out of a PhD in programming languages.