Michael logo

Playground

Neural networks as universal approximators

Neural networks with can approximate any continuous function to arbitrary precision in just one layer and enough neurons. This sounds like a superpower but it comes with a catch: the Universal Approximation Theorem says nothing about how efficiently a network learns, or whether a single layer is sufficient for every problem. The question is not whether a network can learn something; it is how deep it needs to be to do so tractably.

Depth (more layers) adds expressive power that width (more neurons) alone cannot replicate cheaply, if at all. A shallow network may need an exponentially large number of neurons to represent what a deeper network can express with far fewer parameters. The best way to see this concretely is with the simplest possible classification problem: logic gates.

Logic gates and linear separability

A logic gate takes two binary inputs and produces a binary output. Most gates — AND, OR, NAND, NOR — are linearly separable: you can draw a single straight line through the input space that correctly divides the 0-outputs from the 1-outputs. A single-layer perceptron is exactly a learned linear boundary, so it can solve these gates directly.

XOR is different; no single line can separate the 1-outputs from the 0-outputs. A single-layer perceptron network will not be able to converge to a loss near 0 as one point will always be misclassified. You need at least one hidden layer to compose the nonlinear boundary XOR requires. The below plots show the decision boundaries of a single-layer perceptron trained on each gate. What could a non-linear boundary look like for XOR?

0101ABAND
0101ABOR
0101ABnot linearly separableXOR
output = 1output = 0separating boundary

Try it yourself

Configure the network architecture below and hit Train. Since the network weights are initialized randomly, you may need to hit Train a few times to see the loss fail to converge. Try different combinations of layers and neurons and watch how the loss curve changes!

With at least one hidden layers, the network can compose two linear boundaries into the nonlinear decision region XOR requires. Watch the loss drop.
Loss curve
0.250
plateauingconverging
Predictions
ABTargetPredicted
0000.500
0110.500
1010.500
1100.500
green = correctyellow = uncertainred = wrong

Architecture: [2, 3, 1]  ·  sigmoid activation  ·  lr = 0.5  ·  mini-batch gradient descent (batch = 16)