Playground

Neural networks as universal approximators

Neural networks with can approximate any continuous function to arbitrary precision in just one layer and enough neurons. This sounds like a superpower but it comes with a catch: the Universal Approximation Theorem says nothing about how efficiently a network learns, or whether a single layer is sufficient for every problem. The question is not whether a network can learn something; it is how deep it needs to be to do so tractably.

Depth (more layers) adds expressive power that width (more neurons) alone cannot replicate cheaply, if at all. A shallow network may need an exponentially large number of neurons to represent what a deeper network can express with far fewer parameters. The best way to see this concretely is with the simplest possible classification problem: logic gates.

Logic gates and linear separability

A logic gate takes two binary inputs and produces a binary output. Most gates — AND, OR, NAND, NOR — are linearly separable: you can draw a single straight line through the input space that correctly divides the 0-outputs from the 1-outputs. A single-layer perceptron is exactly a learned linear boundary, so it can solve these gates directly.

XOR is different; no single line can separate the 1-outputs from the 0-outputs. A single-layer perceptron network will not be able to converge to a loss near 0 as one point will always be misclassified. You need at least one hidden layer to compose the nonlinear boundary XOR requires. The below plots show the decision boundaries of a single-layer perceptron trained on each gate. What could a non-linear boundary look like for XOR?

AND

XOR

output = 1output = 0separating boundary

Try it yourself

Configure the network architecture below and hit Train. Since the network weights are initialized randomly, you may need to hit Train a few times to see the loss fail to converge. Try different combinations of layers and neurons and watch how the loss curve changes!

Logic Gate

Hidden Layers: 1

Nodes / Layer: 3

With at least one hidden layers, the network can compose two linear boundaries into the nonlinear decision region XOR requires. Watch the loss drop.

Loss curve

plateauingconverging

Predictions

A	B	Target	Predicted
0	0	0	0.500
0	1	1	0.500
1	0	1	0.500
1	1	0	0.500

green = correctyellow = uncertainred = wrong

Architecture: [2, 3, 1] · sigmoid activation · lr = 0.5 · mini-batch gradient descent (batch = 16)