I'm presenting joint work with Frank Wood and Nicholas Bartlett on learning simple models for discrete sequence prediction. We describe a novel Bayesian framework for learning probabilistic deterministic finite automata (PDFA), a class of simple generative models for sequences over a discrete alphabet. We first define a prior over PDFA with a fixed number of states, then show that as the number of states grows unbounded the prior has a well-defined limit, a model we call a Probabilistic Deterministic Infinite Automaton (PDIA). Inference is tractable with MCMC, and we show results from experiments with synthetic grammars, DNA, and natural language. In particular, we find that on complex data, averaging predictions over many MCMC samples improves performance, and that the learned models predict as well as 3rd-order Markov models while using about 1/10th as many states. For the curious, a write-up of my work can be found here.
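To give a concrete feel for the model class, here is a minimal sketch of a PDFA: each state has an emission distribution over symbols, and the next state is fully determined by the current state and the emitted symbol. The two states, transition table, and probabilities below are invented for illustration and are not values from the paper.

```python
import random

# Toy PDFA over the alphabet {'a', 'b'}. States, probabilities, and
# transitions are illustrative inventions, not values from the paper.
EMISSION = {
    0: {'a': 0.9, 'b': 0.1},  # state 0 mostly emits 'a'
    1: {'a': 0.2, 'b': 0.8},  # state 1 mostly emits 'b'
}
# Deterministic transition function: (state, symbol) -> next state.
DELTA = {
    (0, 'a'): 0, (0, 'b'): 1,
    (1, 'a'): 0, (1, 'b'): 1,
}

def sample_sequence(length, start_state=0, rng=random):
    """Generate a sequence by emitting from the current state's
    distribution, then following the deterministic transition."""
    state, out = start_state, []
    for _ in range(length):
        symbols, probs = zip(*EMISSION[state].items())
        sym = rng.choices(symbols, weights=probs)[0]
        out.append(sym)
        state = DELTA[(state, sym)]  # next state is fully determined
    return ''.join(out)

def sequence_prob(seq, start_state=0):
    """Probability of a sequence: the product of per-symbol emission
    probabilities along the (deterministic) state trajectory."""
    state, p = start_state, 1.0
    for sym in seq:
        p *= EMISSION[state][sym]
        state = DELTA[(state, sym)]
    return p
```

Because the transitions are deterministic, evaluating a sequence's probability requires no summing over hidden paths, which is what keeps these models simple relative to HMMs.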
Also, following the talk I'm going to give a brief tutorial on git, a free version control system widely used in the software community for maintaining large collaborative code bases. I'd like to set up a git repository for the Paninski group so we can avoid unnecessary code duplication and build on each other's work, and I promise it's actually pretty easy once you learn the basics.