Hi, I'm Bjorn van Zwol and I study a new branch of neural networks called predictive coding networks. Feel free to ask me questions!

Artificial neural networks (ANNs), the building blocks of modern deep learning, were originally inspired by the workings of our brain. Recent advances in deep learning, however, have come not from biological mimicry but from increases in data and compute. Although these advances have undeniably been impressive, our brains still vastly outperform machine learning in several fundamental ways, including flexibility, adaptability and energy efficiency.

Predictive coding networks (PCNs, Whittington & Bogacz 2017) are a novel type of neural network inspired by a newer generation of neuroscientific understanding. These networks are promising in theory, but it remains unclear whether they provide practical advantages over current methods.

The two innovations that PCNs offer are

- a local learning rule, which enables the training of arbitrary graph topologies – including new architectures that could otherwise not be trained;
- feedback connections, which speculatively could lend PCNs increased expressive power.

Our group is exploring whether these promises can be realized in practice.

Fig. 1 shows a very simple example of an ANN, with layers of activity neurons $a_i^\ell$. Given a labelled datapoint $(x_n,y_n)$, training happens in two steps: (1) a feedforward pass, which produces a predicted output at $a^2$ from the input data clamped at $a^0=x_n$; and (2) a learning step, in which weights are adjusted by backpropagating errors computed from a loss function at the output. This process only works for networks with a strict hierarchical structure.
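To make the two steps concrete, here is a minimal sketch of one backpropagation update. It is an illustration under several assumptions that are not part of the text above: each layer holds a single neuron, the layer rule is $a^\ell = w^\ell f(a^{\ell-1})$ with $f=\tanh$, the loss is the squared error at the output, and the names `forward`, `backprop_step` and the learning rate are hypothetical.

```python
import math


def f(x):
    """Activation function (tanh is an assumption made for this sketch)."""
    return math.tanh(x)


def df(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2


def forward(x, w1, w2):
    """Step (1): feedforward pass through a chain a0 -> a1 -> a2."""
    a0 = x
    a1 = w1 * f(a0)
    a2 = w2 * f(a1)
    return a0, a1, a2


def loss(x, y, w1, w2):
    """Squared error between the predicted output a2 and the label y."""
    return (forward(x, w1, w2)[2] - y) ** 2


def backprop_step(x, y, w1, w2, lr=0.1):
    """Step (2): propagate the output error backwards and update the weights."""
    a0, a1, a2 = forward(x, w1, w2)
    delta2 = 2.0 * (a2 - y)        # dL/da2, computed at the output
    delta1 = delta2 * w2 * df(a1)  # dL/da1, needs the error from the layer above
    grad_w2 = delta2 * f(a1)
    grad_w1 = delta1 * f(a0)
    return w1 - lr * grad_w1, w2 - lr * grad_w2


x, y = 0.8, 0.3
w1, w2 = 0.5, 0.5
loss_before = loss(x, y, w1, w2)
w1_new, w2_new = backprop_step(x, y, w1, w2)
loss_after = loss(x, y, w1_new, w2_new)
print(loss_before, loss_after)
```

Note that `delta1` can only be computed once `delta2` is known: the error signal travels strictly backwards through the layers, which is why this scheme relies on a hierarchical structure.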

In addition to activity neurons $a_i$, PCNs include a set of error neurons $\epsilon_i$, as illustrated in Fig. 2. These are defined as $\epsilon_i=a_i-\mu_i$, where $\mu_i=\sum_j w_{ij}f(a_j)$ is a ‘prediction’ of the activity $a_i$ and $f$ is an activation function. A PCN’s objective is to minimize the total squared error, expressed as an ‘energy’ function $E = \sum_i(\epsilon_i)^2$.
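As a small worked illustration of these definitions, the sketch below computes the predictions $\mu_i$, the errors $\epsilon_i$ and the energy $E$ for a handful of neurons. The choice $f=\tanh$, the function names and the example numbers are assumptions made for illustration only.

```python
import math


def predictions(a, w, f=math.tanh):
    """mu_i = sum_j w[i][j] * f(a[j]) for every neuron i."""
    return [sum(w_ij * f(a_j) for w_ij, a_j in zip(row, a)) for row in w]


def errors(a, w, f=math.tanh):
    """eps_i = a_i - mu_i."""
    return [a_i - mu_i for a_i, mu_i in zip(a, predictions(a, w, f))]


def energy(a, w, f=math.tanh):
    """E = sum_i eps_i ** 2."""
    return sum(e * e for e in errors(a, w, f))


# Three neurons; w[i][j] is the weight from neuron j to neuron i.
# Nothing in the definitions requires the connectivity to be layered.
a = [0.5, -0.2, 0.1]
w = [
    [0.0, 0.3, 0.0],
    [0.4, 0.0, -0.1],
    [0.0, 0.7, 0.0],
]
print(errors(a, w))
print(energy(a, w))
```

A neuron with no incoming weights has $\mu_i=0$, so its error is simply its own activity.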

PCN training (also called inference learning) uses a different two-step process (a form of Expectation Maximization). Given a labelled datapoint $(x_n,y_n)$, clamp a subset of the activity neurons to $x_n$ and $y_n$, and then:

- Update other activity neurons to minimize $E$ (inference; E-step)
- Once converged, update weights to further minimize $E$ (learning; M-step)
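The two steps above can be sketched as plain gradient descent on $E$. This is a minimal illustration, not a reference implementation, and it rests on assumptions that go beyond the text: $f=\tanh$, a three-neuron chain 0 → 1 → 2 with neuron 0 clamped to $x_n$ and neuron 2 clamped to $y_n$, a fixed number of inference steps in place of a convergence check, and hypothetical names (`inference_step`, `learning_step`, `mask`) and step sizes.

```python
import math


def f(x):
    """Activation function (tanh is an assumption made for this sketch)."""
    return math.tanh(x)


def df(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2


def errors(a, w):
    """eps_i = a_i - sum_j w[i][j] * f(a[j])."""
    n = len(a)
    return [a[i] - sum(w[i][j] * f(a[j]) for j in range(n)) for i in range(n)]


def energy(a, w):
    """E = sum_i eps_i ** 2."""
    return sum(e * e for e in errors(a, w))


def inference_step(a, w, clamped, lr):
    """E-step: one gradient step on E for every neuron that is not clamped."""
    eps = errors(a, w)
    n = len(a)
    new_a = list(a)
    for k in range(n):
        if k in clamped:
            continue
        # dE/da_k uses only eps_k and the errors of neurons that k projects to.
        feedback = sum(eps[i] * w[i][k] for i in range(n))
        grad = 2.0 * eps[k] - 2.0 * df(a[k]) * feedback
        new_a[k] = a[k] - lr * grad
    return new_a


def learning_step(a, w, mask, lr):
    """M-step: one gradient step on E for every existing connection."""
    eps = errors(a, w)
    n = len(a)
    new_w = [row[:] for row in w]
    for i in range(n):
        for j in range(n):
            if mask[i][j]:
                # dE/dw_ij = -2 * eps_i * f(a_j): only pre- and post-synaptic terms.
                new_w[i][j] = w[i][j] + lr * 2.0 * eps[i] * f(a[j])
    return new_w


# Chain 0 -> 1 -> 2; mask[i][j] marks a connection from neuron j to neuron i.
mask = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
w = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
x_n, y_n = 0.8, 0.3
a = [x_n, 0.0, y_n]   # hidden neuron 1 starts at zero
clamped = {0, 2}      # neurons held fixed at the data

e_start = energy(a, w)
for _ in range(50):   # fixed step count stands in for "until converged"
    a = inference_step(a, w, clamped, lr=0.1)
e_after_inference = energy(a, w)
w = learning_step(a, w, mask, lr=0.05)
e_after_learning = energy(a, w)
print(e_start, e_after_inference, e_after_learning)
```

Both updates read only quantities attached to a neuron and its direct neighbours, and the inference update for neuron `k` includes the errors of the neurons it projects to, which is where the feedback enters.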

Critically, there is feedback between all nodes – and both activity and weight updates only require local information.

Feel free to ask questions about the research via Slido.
