Tuesday, December 18, 2012

Michael Sobel: Dec. 19

The statistical literature on causal inference is based on notation that expresses the idea that a causal relationship sustains a counterfactual conditional (e.g., to say that taking the pill caused John to get better means John took the pill and got better and that had he not taken the pill, he would not have gotten better). Using this notation, causal estimands are defined and methods used to estimate these are evaluated for bias.

This talk will introduce this notation and literature and point to some issues, such as mediation and interference, that have been addressed (at least somewhat) in the literature and that may be of interest and relevance to neuroscience.
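As a rough illustration of the counterfactual notation, the sketch below simulates both potential outcomes for every unit (something never available in real data) and compares the true average treatment effect with the difference in observed group means under randomized assignment. The effect size, sample size, and function name are assumptions made for this example only.

```python
import numpy as np

def simulate_ate(n=100_000, seed=0):
    """Simulate potential outcomes and a randomized treatment (illustrative values)."""
    rng = np.random.default_rng(seed)
    y0 = rng.normal(0.0, 1.0, size=n)             # outcome if the pill is NOT taken
    y1 = y0 + 2.0 + rng.normal(0.0, 1.0, size=n)  # outcome if the pill IS taken
    t = rng.integers(0, 2, size=n)                # coin-flip treatment assignment
    y_obs = np.where(t == 1, y1, y0)              # only one potential outcome is observed
    true_ate = np.mean(y1 - y0)                   # estimand: average of unit-level effects
    diff_in_means = y_obs[t == 1].mean() - y_obs[t == 0].mean()
    return true_ate, diff_in_means

true_ate, diff_in_means = simulate_ate()
print(f"true ATE: {true_ate:.3f}, difference in means: {diff_in_means:.3f}")
```

Because assignment here is independent of the potential outcomes, the difference in means is an unbiased estimate of the average effect; with non-random assignment that need not hold.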

Tuesday, December 4, 2012

Eftychios Pnevmatikakis: Dec 5

Tomorrow at 1PM I'm going to present an overview of recent work on approximate message passing (AMP) algorithms with applications to compressed sensing (CS).
I'm going to start with a brief overview of message passing algorithms [1] and then show how they were used in [2] to derive an AMP algorithm for the standard CS setup (basis pursuit, lasso).
Time permitting, I'm going to briefly present some extensions of this methodology to more general graphical models [3].

Material will be drawn from the following sources:

[1] Kschischang, Frank R., Brendan J. Frey, and H-A. Loeliger. "Factor graphs and the sum-product algorithm." Information Theory, IEEE Transactions on 47.2 (2001): 498-519.
[2] Donoho, David L., Arian Maleki, and Andrea Montanari. "Message-passing algorithms for compressed sensing." Proceedings of the National Academy of Sciences 106.45 (2009): 18914-18919.
[3] Rangan, Sundeep, et al. "Hybrid approximate message passing with applications to structured sparsity." arXiv preprint arXiv:1111.2581 (2011).
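
For readers who want a concrete picture before the talk, here is a minimal sketch of an AMP iteration with a soft-thresholding denoiser, in the spirit of [2]. The threshold rule (a multiple `alpha` of the estimated residual level), the iteration count, and the function names are assumptions of this sketch, and the sensing matrix is assumed to have i.i.d. Gaussian entries with variance 1/m.

```python
import numpy as np

def soft_threshold(v, theta):
    """Componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def amp(A, y, n_iter=30, alpha=1.5):
    """Sketch of AMP for sparse recovery from y = A x (alpha is a hypothetical tuning choice)."""
    m, n = A.shape
    x = np.zeros(n)
    z = y.copy()                                          # current residual
    for _ in range(n_iter):
        theta = alpha * np.linalg.norm(z) / np.sqrt(m)    # threshold from residual level
        x_new = soft_threshold(x + A.T @ z, theta)
        # Onsager correction: previous residual times (number of nonzeros) / m.
        onsager = z * (np.count_nonzero(x_new) / m)
        z = y - A @ x_new + onsager
        x = x_new
    return x
```

The only difference from plain iterative soft thresholding is the Onsager term added to the residual.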

Wednesday, November 28, 2012

Josh Merel: Nov 28th

Tensor decompositions for learning latent variable models

Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, Matus Telgarsky

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
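
A minimal sketch of the tensor power iteration with deflation, for the idealized case of an exactly orthogonally decomposable symmetric third-order tensor with positive weights. This is not the robust variant analyzed in the paper; the restart and iteration counts and the function names are assumptions of this example.

```python
import numpy as np

def tensor_apply(T, u):
    """Contract a symmetric 3-tensor with u in two modes: T(I, u, u)."""
    return np.einsum('ijk,j,k->i', T, u, u)

def tensor_power_method(T, n_components, n_restarts=10, n_iter=50, seed=0):
    """Recover (weight, vector) pairs of T = sum_r w_r v_r (x) v_r (x) v_r, v_r orthonormal."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    d = T.shape[0]
    weights, vectors = [], []
    for _ in range(n_components):
        best_u, best_w = None, -np.inf
        for _ in range(n_restarts):
            u = rng.normal(size=d)
            u /= np.linalg.norm(u)
            for _ in range(n_iter):
                u = tensor_apply(T, u)        # power update
                u /= np.linalg.norm(u)
            w = np.einsum('ijk,i,j,k->', T, u, u, u)
            if w > best_w:
                best_u, best_w = u, w
        weights.append(best_w)
        vectors.append(best_u)
        # Deflate: subtract the rank-one component just found.
        T = T - best_w * np.einsum('i,j,k->ijk', best_u, best_u, best_u)
    return np.array(weights), np.array(vectors)
```

The update plays the role that matrix-vector multiplication plays in the ordinary power method for eigenvectors.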

Thursday, November 8, 2012

Emanuel Ben-David: Nov 7th

High dimensional Bayesian inference for Gaussian directed acyclic graph models

Recent methodological work by Letac & Massam (2007) and others has introduced classes
of flexible multi-parameter Wishart distributions for high dimensional Bayesian inference
for undirected graphical models. A parallel analysis that universally extends these results
to the class of DAGs or Bayesian networks, arguably one of the most widely used classes of
graphical models, is however not available. The parameter space of interest for Gaussian
undirected graphical models is the space of sparse inverse covariance matrices with fixed zeros corresponding to the missing entries of an undirected graph, whereas for Gaussian DAG models it is the space of sparse lower triangular matrices corresponding to the Cholesky parameterization of inverse covariance matrices. Working with the latter space, though very useful, does not allow a comprehensive treatment of undirected and directed graphical models simultaneously. Moreover, this traditional approach does not lead to well-defined posterior covariance and inverse covariance Bayes estimates which respect the conditional independences encoded by a DAG, since these quantities lie on a curved manifold. In this
paper we first extend the traditional priors that have been proposed in the literature for Gaussian DAGs to allow multiple shape parameters. We then use a novel approach and proceed to define new spaces that retain only the functionally independent elements of covariance and inverse covariance matrices corresponding to DAG models. These spaces can be considered as projections of the parameter space of interest of DAG models on lower dimensions. We demonstrate that this parameter reduction bears several dividends for high dimensional Bayesian posterior analysis. By introducing new families of DAG Wishart and inverse DAG Wishart distributions on these projected spaces we succeed in

a) deriving closed form analytic expressions for posterior quantities that would normally only be available through intractable numerical simulations,
b) simultaneously providing a unifying treatment of undirected and directed Gaussian graphical model priors and comparisons thereof,
c) obtaining posterior covariance and inverse covariance Bayes estimates which actually correspond to DAG models.

Tuesday, October 23, 2012

Giovanni Motta: Oct. 24th

Fitting evolutionary factor models to multivariate EEG data

Current approaches for fitting stationary (dynamic) factor models to multivariate time series are based on principal components analysis of the covariance (spectral) matrix. These approaches are based on the assumption that the underlying process is temporally stationary which appears to be restrictive because, over long time periods, the parameters are highly unlikely to remain constant. Our alternative approach is to model the time-varying covariances (auto-covariances) via nonparametric estimation, which imposes very little structure on the moments of the underlying process. Because of identification issues, only parts of the model parameters are allowed to be time-varying. More precisely, we consider two specifications: First, the latent factors are stationary while the loadings are time-varying. Second, the latent factors admit a dynamic representation with time-varying autoregressive coefficients while the loadings are constant over time. Estimation of the model parameters is accomplished by application of evolutionary principal components and local polynomials. We illustrate our approach through applications to multichannel EEG data.

Ari Pakman: Oct. 10th

Ari will present the paper "The Horseshoe Estimator for Sparse Signals"

This paper proposes a new approach to sparse-signal detection called the horseshoe estimator. We show that the horseshoe is a close cousin of the lasso in that it arises from the same class of multivariate scale mixtures of normals, but that it is almost universally superior to the double-exponential prior at handling sparsity. A theoretical framework is proposed for understanding why the horseshoe is a better default “sparsity” estimator than those that arise from powered-exponential priors. Comprehensive numerical evidence is presented to show that the difference in performance can often be large. Most importantly, we show that the horseshoe estimator corresponds quite closely to the answers one would get if one pursued a full Bayesian model-averaging approach using a “two-groups” model: a point mass at zero for noise, and a continuous density for signals. Surprisingly, this correspondence holds both for the estimator itself and for the classification rule induced by a simple threshold applied to the estimator. We show how the resulting thresholded horseshoe can also be viewed as a novel Bayes multiple-testing procedure.

Alex Ramirez: Oct. 3rd

Alex will give a brief introduction to the method of generalized cross validation.

I'll give an introductory review of a method for model selection known as generalized cross-validation (GCV) (see linked paper). When dealing with linear predictor models, GCV provides a computationally convenient approximation to "leave-one-out" cross-validation. I'll discuss this connection between cross-validation and GCV in more detail. I'll then discuss attempts in the literature to extend the idea behind GCV to more general models in the exponential family.
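
To make the connection concrete, here is a small sketch for ridge regression (chosen for this example as a simple linear predictor; no intercept, fixed penalty). It computes leave-one-out error by brute force, by the hat-matrix shortcut, and the GCV score, which replaces each leverage by the average leverage. Function names are hypothetical.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """Hat matrix H with fitted values y_hat = H y for ridge penalty lam."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def loocv_brute_force(X, y, lam):
    """Refit n times, each time predicting the held-out point."""
    n, p = X.shape
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p), X[keep].T @ y[keep])
        errs[i] = y[i] - X[i] @ beta
    return np.mean(errs ** 2)

def loocv_shortcut(X, y, lam):
    """Same quantity from a single fit: residual_i / (1 - H_ii)."""
    H = ridge_hat_matrix(X, lam)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def gcv(X, y, lam):
    """GCV: replace each leverage H_ii by the average leverage trace(H) / n."""
    n = X.shape[0]
    H = ridge_hat_matrix(X, lam)
    resid = y - H @ y
    return np.mean(resid ** 2) / (1.0 - np.trace(H) / n) ** 2
```

In practice one would evaluate `gcv` over a grid of penalties and pick the minimizer; only the trace of the hat matrix is needed, not its individual diagonal entries.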

Josh Merel: Sep. 26th

Josh will present two recent papers on unsupervised learning:

Wednesday, September 19, 2012

Richard Naud: Aug. 14

Roy Fox: July 31st

"Residual Component Analysis: Generalising PCA for more flexible inference in linear-Gaussian models"

Probabilistic principal component analysis (PPCA) seeks a low dimensional representation of a data set in the presence of independent spherical Gaussian noise, Σ = σ^2I. The maximum likelihood solution for the model is an eigenvalue problem on the sample covariance matrix. In this paper we consider the situation where the data variance is already partially explained by other factors, e.g. conditional dependencies between the covariates, or temporal correlations leaving some residual variance. We decompose the residual variance into its components through a generalised eigenvalue problem, which we call residual component analysis (RCA). We explore a range of new algorithms that arise from the framework, including one that factorises the covariance of a Gaussian density into a low-rank and a sparse-inverse component. We illustrate the ideas on the recovery of a protein-signaling network, a gene expression time-series data set and the recovery of the human skeleton from motion capture 3-D cloud data.
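
A rough sketch of the generalized eigenvalue step, under the assumption that the model covariance is W W^T + Σ with Σ known. The loading formula used here comes from whitening by Σ and applying the standard PPCA solution with unit noise; it is this example's derivation rather than a quotation from the paper, and the function name is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def rca_loadings(S, Sigma, q):
    """Low-rank loadings W such that W W^T + Sigma approximates the covariance S."""
    # Generalized symmetric eigenproblem S v = lambda * Sigma v.
    # scipy normalizes the eigenvectors so that V^T Sigma V = I.
    evals, evecs = eigh(S, Sigma)
    top = np.argsort(evals)[::-1][:q]                    # q largest generalized eigenvalues
    scale = np.sqrt(np.maximum(evals[top] - 1.0, 0.0))   # variance in excess of Sigma
    return (Sigma @ evecs[:, top]) * scale
```

With Σ = σ^2 I this reduces to the usual PPCA eigenvalue problem on the sample covariance.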

Eftychios P.: July 24th

Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach

This paper develops theoretical results regarding noisy 1-bit compressed sensing and sparse binomial regression. We show that a single convex program gives an accurate estimate of the signal, or coefficient vector, for both of these models. We demonstrate that an s-sparse signal in R^n can be accurately estimated from m = O(s log(n/s)) single-bit measurements using a simple convex program. This remains true even if each measurement bit is flipped with probability nearly 1/2. Worst-case (adversarial) noise can also be accounted for, and uniform results that hold for all sparse inputs are derived as well. In the terminology of sparse logistic regression, we show that O(s log(n/s)) Bernoulli trials are sufficient to estimate a coefficient vector in R^n which is approximately s-sparse. Moreover, the same convex program works for virtually all generalized linear models, in which the link function may be unknown. To our knowledge, these are the first results that tie together the theory of sparse logistic regression to 1-bit compressed sensing. Our results apply to general signal structures aside from sparsity; one only needs to know the size of the set K where signals reside. The size is given by the mean width of K, a computable quantity whose square serves as a robust extension of the dimension.

Tuesday, July 17, 2012

Johannes Bill: July 16th

Probabilistic inference and autonomous learning in recurrent networks of spiking neurons 

Numerous findings from cognitive science and neuroscience indicate that mammals learn and maintain an internal model of their environment, and that they employ this model during perception and decision making in a statistically optimal fashion. Indeed, recent experimental studies suggest that the required computational machinery for probabilistic inference and learning can be traced down to the level of individual spiking neurons in recurrent networks. 

At the Institute for Theoretical Computer Science in Graz, we examine (analytically and through computer simulations) how recurrent neural networks can represent complex joint probability distributions in their transient spike pattern, how external input can be integrated by networks to a Bayesian posterior distribution, and how local synaptic learning rules enable spiking neural networks to autonomously optimize their internal model of the observed input statistics. 

In the talk, I aim to discuss how recurrent spiking networks can sample from graphical models by means of their internal dynamics, and how spike-timing dependent plasticity rules can implement maximum likelihood learning of generative models.

Tim Machado: July 9th

The firing patterns of motor neurons represent the product of neural 
computation in the motor system. EMG recordings are often used as a 
proxy for this activity, given the direct relationship between motor 
neuron firing rate and muscle contraction. However, there are a 
variety of motor neuron subtypes with varied synaptic inputs and 
intrinsic properties, suggesting that this relationship is complex. 
Indeed, studies have shown that different compartments of individual 
muscles are activated asynchronously during some motor tasks—implying 
heterogeneity in firing across single motor pools. To measure the 
activity of many identified motor neurons simultaneously, we have 
combined population calcium imaging at cellular resolution with the 
use of a deconvolution algorithm that infers underlying spiking 
patterns from Ca++ transients. Using this approach we set out to 
examine the firing properties of neurons within an individual pool of 
motor neurons, and in particular, to compare the activity of 
individual neurons belonging to synergist (e.g. flexor-flexor) and 
antagonist (flexor-extensor) pools. 

We imaged motor neurons in the spinal cord of neonatal mice that were 
either loaded with synthetic calcium indicator or expressed GCaMP3. To 
identify the muscle targets of the loaded motor neurons we injected 
two fluorophore conjugated variants of the retrograde tracer cholera 
toxin B into specific antagonist or synergist muscles. To examine the 
correlated firing of motor neurons during network activity in our in 
vitro preparation, a current pulse train was delivered to a sacral 
dorsal root in order to evoke a locomotor-like state. The onset and 
evolution of this rhythmic state were measured with suction electrode 
recordings from multiple ventral roots. To calibrate optical 
measurements, and to determine the upper limit of correlated firing, 
motor neurons were antidromically activated via ventral root 
stimulation. The optical responses to the antidromic train were used 
to directly fit a model to our data that related the fluorescence 
measurements to an approximate spike train. Preliminary observations 
from datasets containing hundreds of identified motor neurons suggest 
heterogeneity in neuronal firing within individual pools, as well as 
alternation in the firing between antagonist pools. In the future, 
this approach will be used to examine the activity patterns of 
molecularly-defined interneuron populations as a function of firing of 
identified motor neurons.

Kamiar Rad: June 26th

High Dimensional Efficient Population Estimation

Alexandro Ramirez: June 5th

Title: Fast neural encoding model estimation via expected log-likelihoods 

Receptive fields are traditionally measured using the spike-triggered average (STA). Recent work has shown that the STA is a special case of a family of estimators derived from the “expected log-likelihood” of a Poisson model. We generalize these results to the broad class of neuronal response models known as generalized linear models (GLMs). First, we show that expected log-likelihoods can speed up computations involving the GLM log-likelihood (e.g., parameter estimation and marginal likelihood calculations) by orders of magnitude, under some simple conditions on the priors and likelihoods involved. Second, we perform a risk analysis, using both analytic and numerical methods, and show that the “expected log-likelihood” estimators come with a small cost in accuracy compared to standard MAP estimates. When MAP accuracy is desired, we show that running a few pre-conditioned conjugate gradient iterations on the GLM log-likelihood initialized at the “expected log-likelihood” estimate can lead to an estimator that is as accurate as the MAP. We use multi-unit primate retinal responses to stimuli with naturalistic correlation to validate our findings.
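
As background for the STA, the sketch below simulates a Poisson neuron with an exponential nonlinearity driven by white Gaussian stimuli and computes the spike-triggered average. It illustrates only the plain STA, not the expected log-likelihood estimators of the talk; the simulated filter, baseline rate, and function names are assumptions of this example.

```python
import numpy as np

def simulate_lnp(k, n_bins, seed=0):
    """White Gaussian stimuli and Poisson spike counts with rate exp(k . x - 1)."""
    rng = np.random.default_rng(seed)
    stimuli = rng.normal(size=(n_bins, k.size))
    rate = np.exp(stimuli @ k - 1.0)
    spike_counts = rng.poisson(rate)
    return stimuli, spike_counts

def spike_triggered_average(stimuli, spike_counts):
    """Average stimulus weighted by the number of spikes it evoked."""
    return spike_counts @ stimuli / spike_counts.sum()

k_true = np.array([0.0, 0.2, 0.6, 0.7, 0.3, 0.0, -0.1, 0.0])
k_true /= np.linalg.norm(k_true)
stimuli, spike_counts = simulate_lnp(k_true, n_bins=20_000)
sta = spike_triggered_average(stimuli, spike_counts)
```

The STA costs one pass over the data, which is part of why estimators built on the same idea can be much cheaper than iterating on the full GLM log-likelihood.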

Tuesday, May 29, 2012

Jeremy Freeman: May 24th

Jeremy Freeman talked about his recent work on subunit identification.

Roy Fox: May 22nd and 29th

Title: Information Theory in Reinforcement Learning

Abstract: In reinforcement learning, a Partially Observable Markov Decision Process (POMDP) is a model of an agent interacting with its environment through observations and actions. The agent has to choose actions which maximize an external reward it gets at each step. The hardness of this problem in general is, in one aspect, due to the large size of the sufficient statistic of the observable history for the world state.

By framing the problem in an information-theoretic setting, we gain a number of benefits: a description of "typical" agents, and in particular understanding of how evolution has solved the problem; insight into the information metabolism of an intelligent agent as a solution to a sequential information-bottleneck problem; and the ability to apply information-theoretic methods to the problem, which provide new and, in some cases, more efficient solutions.

In this talk I will give some background on the general POMDP setting and challenge, extend it to the information-theoretic setting, and show an example of information-theoretic methods applied to reinforcement learning.

Jonathan Pillow: May 8th

Jonathan Pillow talked about some recent work from his lab on active learning of neural response functions with Gaussian processes.

Monday, April 30, 2012

Jonathan Huggins: May 1st

Jonathan Huggins will present his joint work with Frank Wood. Here is an abstract:

We develop a class of non-parametric Bayesian models we call infinite structured explicit duration hidden Markov models (ISEDHMMs). ISEDHMMs are HMMs that possess an unbounded number of states, encode state dwell-time distributions explicitly, and have constraints on what state transitions are allowed. The ISEDHMM framework generalizes explicit duration finite HMMs, infinite HMMs, left-to-right HMMs, and more (all are recoverable by specific choices of ISEDHMM parameters).  This suggests that ISEDHMMs should be applicable to data-analysis problems in a variety of settings.

David Pfau: April 24th

David will be presenting "A Spectral Algorithm for Learning Hidden Markov Models" by Hsu, Kakade and Zhang.  The article can be found here.  And the abstract:

Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations—it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observation is sometimes the words in a language. The algorithm is also simple: it employs only a singular value decomposition and matrix multiplications.

Tuesday, April 10, 2012

Yashar Ahmadian: April 10th & 17th

Learning unbelievable marginal probabilities

Loopy belief propagation performs approximate inference on graphical models with loops. One might hope to compensate for the approximation by adjusting model parameters. Learning algorithms for this purpose have been explored previously, and the claim has been made that every set of locally consistent marginals can arise from belief propagation run on a graphical model. On the contrary, here we show that many probability distributions have marginals that cannot be reached by belief propagation using any set of model parameters or any learning algorithm. We call such marginals 'unbelievable.' This problem occurs whenever the Hessian of the Bethe free energy is not positive-definite at the target marginals. All learning algorithms for belief propagation necessarily fail in these cases, producing beliefs or sets of beliefs that may even be worse than the pre-learning approximation. We then show that averaging inaccurate beliefs, each obtained from belief propagation using model parameters perturbed about some learned mean values, can achieve the unbelievable marginals.

Tuesday, March 27, 2012

David Pfau: March 27th (at 5PM)

I'll be presenting on work in progress in collaboration with Bijan Pesaran's group (Yan Wong, Mariana Vigeral, David Putrino) and Josh Merel on building a high degree-of-freedom brain-machine interface.  I'll focus on the Bayesian paradigm for decoding, and two practical problems for pushing that paradigm beyond the commonly-used Kalman filtering approach: building better likelihoods, and building better priors.  The first amounts to fitting tuning curves for various neurons.  Other groups have shown a nonlinear dependence of firing rate on hand position in 3D space; here I will show some preliminary results on fitting tuning curves for large numbers of joint angles.  The second amounts to building better generative models of reach and grasp motions.  As a first step in that direction, I've looked at PCA and ICA for reducing the dimension of reach-and-grasp signals.

Monday, March 19, 2012

Gustavo Lacerda: March 20th

Title: spatial regularization

Consider modeling each neuron as a 2-parameter logistic model (spiking probability as a function of stimulus intensity), and suppose we perform independent experiments on each neuron. Now imagine that the data isn't very informative, so we need to regularize our estimates. We can do spatial regularization by adding a quadratic penalty on the difference of estimates for nearby neurons. Now, suppose that there are *two* types of neurons, and that you only want to shrink together neurons of the same type. We don't want our estimate to be influenced by "false neighbors", i.e. neurons that are spatially close but of a different type. We discuss how to optimize this model. Finally, we explore the idea of Fused Group Lasso.
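
A minimal sketch of the quadratic-penalty version, under simplifying assumptions made for this example: neurons lie on a line, "nearby" means adjacent, the type labels are known, and plain gradient descent is used. The Fused Group Lasso variant is not shown, and all names here are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_spatial_logistic(stim, spikes, same_type, lam, n_steps=3000):
    """Fit per-neuron (offset, slope) with a penalty tying same-type neighbors.

    stim, spikes: arrays of shape (n_neurons, n_trials).
    same_type:    boolean array of shape (n_neurons - 1,), True where
                  neurons i and i + 1 share a type.
    Objective: mean negative log-likelihood per neuron
               + (lam / 2) * sum_i same_type[i] * ||theta_i - theta_{i+1}||^2
    """
    n_neurons = stim.shape[0]
    theta = np.zeros((n_neurons, 2))
    w = same_type.astype(float)
    # Step size from a crude upper bound on the curvature of the objective.
    step = 1.0 / (0.25 * (1.0 + np.max(stim ** 2)) + 4.0 * lam)
    for _ in range(n_steps):
        p = sigmoid(theta[:, [0]] + theta[:, [1]] * stim)
        err = p - spikes
        grad = np.stack([err.mean(axis=1), (err * stim).mean(axis=1)], axis=1)
        diff = w[:, None] * (theta[:-1] - theta[1:])  # zero across "false neighbors"
        grad[:-1] += lam * diff
        grad[1:] -= lam * diff
        theta -= step * grad
    return theta
```

Setting `same_type` to all True recovers ordinary spatial smoothing; masking the cross-type pairs is what keeps "false neighbors" from pulling on each other.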

Tuesday, February 21, 2012

Kamiar Rahnama Rad: Feb. 21

The following two questions will be discussed:  1. How does embedding low dimensional structures in high dimensional spaces decrease the learning complexity significantly? I will consider the simplest model, that is, a linear transformation with additive noise.  2. Modern datasets are accumulated (and in some cases even stored) in a distributed or decentralized manner. Can distributed algorithms be designed to fit a global model over such datasets while retaining the performance of centralized estimators?
The talk will be based on the following two papers: 

Monday, February 13, 2012

Bryan Conroy: Feb. 14th

Bryan Conroy will talk about a fast method for solving many related l2-regularized logistic regression problems, and about possible extensions to other GLMs and l1-regularizers.

Friday, February 3, 2012

Eftychios P.: Jan. 31st and Feb. 7th

I am planning to lead a very informal discussion on some neat techniques for convex and semidefinite relaxation that can be used to transform intractable optimization problems into approximate but convex ones. I'll also discuss a few applications to statistical neuroscience that we are currently pursuing.

Some background material (although I'm not planning to go over any of these in detail) includes: