Open and reproducible data-centric workflows

Truly open science demands transparent, reproducible workflows that version code and data together. Unfortunately, research workflows are often a mess, especially when working across large, diverse teams. We built a simple project template with preconfigured integration hooks to make it easy to do things the right way. If you work primarily in Python and are interested in versioning code & data, this may be of interest to you.

Specifically, we use Git to version code, Quilt to version data, Prefect to run workflows, Dask for distributed computation, and GitHub for CI, auto-building docs, and pushing packages to PyPI. We stitch these tools together in a customized Cookiecutter template that makes them easy to integrate and use. The template leverages our simple Datastep framework to build project workflows that are fully importable Python libraries and also come with a friendly command-line interface for versioning data and pushing it to the cloud.

  • Here is our project template.
  • actk is an example of a workflow built on our template that processes terabytes of imaging data as input to our deep learning workflows.

Cell states beyond transcriptomics

Quantitative co-analysis of RNA abundance and sarcomere organization in single cells and an integrated framework to predict subcellular organization states from gene expression. We used human induced pluripotent stem cell (hiPSC)-derived cardiomyocytes expressing mEGFP-tagged alpha-actinin-2 to develop quantitative image analysis tools for systematic and automated classification of subcellular organization. This captured a wide range of sarcomeric organization states within cell populations that were previously difficult to quantify. We performed RNA FISH targeting genes identified by single cell RNA sequencing to simultaneously assess the relationship between transcript abundance and structural states in single cells. Co-analysis of gene expression and sarcomeric patterns in the same cells revealed biologically meaningful correlations that could be used to predict organizational states. We establish a framework for multi-dimensional analysis of single cells to study the relationships between gene expression and subcellular organization and to develop a more nuanced description of cell states.

Integrated Cell Modeling

At the Allen Institute for Cell Science, I've been working in collaboration with Greg Johnson to build integrated models of single cells. We use conditional generative adversarial networks (GANs) and variational autoencoders (VAEs) to fuse data from multiple fluorescence microscopy experiments into a coherent model of sub-cellular structure localization in single cells.

  • Our first pre-print addresses the 2D case
  • Our work on 3D integrated cells is now out.
  • A much more stable and interpretable β-VAE version of both the 2D and 3D models is available here while a manuscript is in progress.

Sparse Time-Series Models

The idea here is to integrate data across time points to build sparse regression models for time-series data, such that the sparse regressors at neighboring time points vary smoothly. This would be useful for, e.g., RNA-seq experiments with multiple time points, if you wanted to predict the set of genes driving a phenotype and see how that set changes over time.

  • The Julia code for this method is currently in development.
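As a rough illustration of the idea (not the Julia implementation mentioned above), here is a minimal Python sketch. It assumes one particular formulation: a least-squares fit at each time point, an L1 penalty for sparsity, and a squared-difference penalty that ties coefficients at neighboring time points together, solved with proximal gradient descent. The function names and the parameters `lam_sparse` and `lam_smooth` are hypothetical, chosen only for this example.

```python
import numpy as np


def soft_threshold(z, thresh):
    """Proximal operator of the L1 norm: shrink entries toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)


def fit_smooth_sparse(X, Y, lam_sparse=0.1, lam_smooth=1.0, n_iter=2000):
    """Fit one sparse coefficient vector per time point.

    Minimizes, over B with rows b_t:
        sum_t (1/n) * ||X[t] @ b_t - Y[t]||^2
        + lam_smooth * sum_t ||b_{t+1} - b_t||^2
        + lam_sparse * sum_t ||b_t||_1

    X has shape (T, n, p), Y has shape (T, n); returns B with shape (T, p).
    """
    T, n, p = X.shape
    B = np.zeros((T, p))

    # Step size from an upper bound on the Lipschitz constant of the
    # gradient of the smooth (non-L1) part of the objective.
    max_sq_sv = max(np.linalg.norm(X[t], 2) ** 2 for t in range(T))
    step = 1.0 / (2.0 * max_sq_sv / n + 8.0 * lam_smooth)

    for _ in range(n_iter):
        # Gradient of the least-squares term at every time point.
        resid = np.einsum("tnp,tp->tn", X, B) - Y
        grad = 2.0 * np.einsum("tnp,tn->tp", X, resid) / n

        # Gradient of the smoothness term; diff[t] = B[t + 1] - B[t].
        diff = np.diff(B, axis=0)
        grad[:-1] -= 2.0 * lam_smooth * diff
        grad[1:] += 2.0 * lam_smooth * diff

        # Gradient step followed by soft-thresholding (ISTA update).
        B = soft_threshold(B - step * grad, step * lam_sparse)
    return B
```

In this formulation, raising `lam_smooth` pulls the coefficient vectors at adjacent time points closer together, while raising `lam_sparse` drives more coefficients to exactly zero.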

Integrated Workflow for Transcription Factor Binding Site Prediction

In an effort to find the best candidate transcription factors to input to the Price Lab's transcriptional regulatory network inference tools, I constructed a machine learning pipeline that integrates an array of genome-scale data and predictive tools and outputs a single high-confidence prediction of transcriptional activity at arbitrary sites across the genome.

  • A preprint describing this work is available here.
  • A public database of our predictions is available as an R package here.

Weighted Ensemble Systems Biology

It can be really difficult to run stochastic models of biological processes enough times to accurately sample their output. We applied the WESTPA implementation of weighted ensemble to these kinds of models and achieved orders-of-magnitude speed-ups in sampling. The essential trick here is that most observables of interest in complex models are rare events (e.g. state transitions), and weighted ensemble can efficiently sample rare events in stochastic systems.

  • Our first paper dealt with non-spatial models
  • Our second paper addresses models of spatially resolved cellular processes.
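To give a feel for how weighted ensemble samples rare events, here is a toy Python sketch. It does not use WESTPA or its API, and it replaces the usual split/merge bookkeeping with a simpler per-bin resampling that conserves each bin's total weight. The dynamics (`ou_step`, a one-dimensional random walk pulled toward zero), the bin edges, and all parameter names are assumptions made only for this example.

```python
import numpy as np


def ou_step(positions, rng, dt=0.01):
    """Toy dynamics: a noisy walker pulled back toward x = 0."""
    noise = rng.normal(size=positions.shape)
    return positions - positions * dt + np.sqrt(2.0 * dt) * noise


def resample_bin(positions, weights, n_target, rng):
    """Resample one bin to n_target walkers, conserving the bin's weight."""
    total = weights.sum()
    idx = rng.choice(len(positions), size=n_target, p=weights / total)
    return positions[idx], np.full(n_target, total / n_target)


def weighted_ensemble(step, x0, bin_edges, n_per_bin, n_iter, rng):
    """Run a simplified weighted ensemble simulation along one coordinate.

    Every occupied bin is kept at n_per_bin walkers, so sparsely visited
    regions are sampled by many low-weight walkers instead of being missed.
    """
    positions = np.full(n_per_bin, x0, dtype=float)
    weights = np.full(n_per_bin, 1.0 / n_per_bin)
    for _ in range(n_iter):
        positions = step(positions, rng)          # propagate dynamics
        bins = np.digitize(positions, bin_edges)  # assign walkers to bins
        new_pos, new_w = [], []
        for b in np.unique(bins):
            mask = bins == b
            p, w = resample_bin(positions[mask], weights[mask], n_per_bin, rng)
            new_pos.append(p)
            new_w.append(w)
        positions = np.concatenate(new_pos)
        weights = np.concatenate(new_w)
    return positions, weights
```

The probability of a rare state can then be estimated by summing the weights of the walkers in that state, for example `weights[positions > 3.0].sum()` with the toy dynamics above.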

Graphical Models for Free Energy Estimation

The main approaches to computationally estimating how strongly two biomolecules bind together are often either overly time consuming (e.g. molecular dynamics) or overly empirical (e.g. docking). Alternatively, using graphical models of proteins to compute the Bethe free energy of binding can be both fast and accurate. We are working to improve this approach and rigorously quantify the error involved in the approximation (work in progress).

  • This work is summarized in a JCTC paper, available here.
  • Here is a web app to visualize the graphs induced on PDB structures.

Quantitative Evolution and the ATP Synthase

Using simple state-based models of proton transport and free energy transduction, we probe the optimality of the curiously engineered rotary mechanism of the ATP synthase. We take a very high-level approach here; for instance, our models are entirely agnostic to structure, and we allow each potential transporter mechanism to optimize itself over all unknown parameters that are thermodynamically permissible.

  • Our PNAS paper, "Biophysical comparison of ATP synthesis mechanisms shows a kinetic advantage for the rotary process," is available here.

Network Inference with Graphical Models of Heterogeneous Genomic and Clinical Data

We use graphical models to learn interaction networks between genes, clinical factors, and disease diagnoses. Our models accommodate data that is both continuous and discrete, and are aggressively filtered for false-positive edges via collider detection algorithms. Graphical models learned from biomedical data can be used for classification and biomarker selection (with performance comparable to currently available univariate tools), while revealing the underlying causal network structure and thus allowing for arbitrary likelihood queries over the data.

Stochastic Models of Cellular Heterogeneity

Simple stochastic models can recapitulate the population-level heterogeneity of protein abundance found in, for example, colonies of E. coli. A non-spatial model of gene expression, stochastically simulated with a modified Gillespie algorithm that takes cell division into account, is able to reproduce experimental data quite well.
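One way to fold cell division into a Gillespie simulation is sketched below in Python. The specific choices here are assumptions for illustration and may differ from the modification used in this work: a single protein species with constant production and first-order degradation, division at fixed intervals, and binomial partitioning of molecules between daughter cells. All function and parameter names are hypothetical.

```python
import numpy as np


def simulate_lineage(k_prod, k_deg, t_div, t_end, rng, n0=0):
    """Follow one cell lineage and record protein counts just before division.

    Reactions: production at rate k_prod, degradation at rate k_deg * n.
    Every t_div time units the cell divides and each molecule is kept by the
    tracked daughter with probability 1/2.
    """
    t, n = 0.0, n0
    next_div = t_div
    pre_division_counts = []
    while t < t_end:
        rates = np.array([k_prod, k_deg * n])
        total = rates.sum()
        dt = rng.exponential(1.0 / total)  # waiting time to the next reaction

        if t + dt >= next_div:
            # Division happens before the next reaction would fire. Jump to
            # the division time and partition molecules. Discarding the drawn
            # waiting time is valid because exponential waits are memoryless.
            t = next_div
            pre_division_counts.append(n)
            n = rng.binomial(n, 0.5)
            next_div += t_div
            continue

        t += dt
        # Choose which reaction fires, in proportion to its rate.
        if rng.random() < rates[0] / total:
            n += 1
        else:
            n -= 1
    return np.array(pre_division_counts)
```

Running many independent lineages, each with its own random draws, gives a distribution of protein counts across the population that can be compared against measured abundance distributions.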


I live on Vashon Island. I work at the Allen Institute for Cell Science, designing efficient learning algorithms for extracting knowledge from high dimensional experimental data, and integrating machine learning approaches with mechanistic biophysical models to create multiscale models of cellular behavior.

Before that, I was a postdoc at the Institute for Systems Biology with the Hood-Price Lab.

I went to grad school at CMU & Pitt, where I worked with Dan Zuckerman and collaborated with Jim Faeder, Chris Langmead, Markus Dittrich, Bob Murphy, and Takis Benos.