Hi. Im going to briefly talk about what

automatic differentiation is and how to do it in Ruby. Automatic differentiation is a clever trick

that makes it easy to do differentiation on a computer. Its a beautiful fusion of maths

and computer programming, and I want to tell you about it so you can enjoy how satisfying

it is.

So, what is differentiation? Its a mathematical

word for something we do to functions, so first we need to talk about what a function

is. A function really just means a relationship

between two quantities for example, the relationship between the current time and

the distance travelled by a car. One relationship between time and distance

might be cruising along the road at a constant speed, like the red car. A different relationship

is pulling away from stationary, like the green car it starts slowly and speeds

up.

Each car has its own function that relates the distance its travelled to the current

time. If we plot time horizontally and distance

vertically, the function of the red cars distance over time looks like this a straight

line. And the green cars function looks like this a curve. If we periodically measure the red cars

distance we can get a sense of how it relates to the current time, and in this case we can

see that the distance is actually equal to how many seconds its been travelling for.

After three seconds its travelled three metres, for example.

If we measure the green cars distance,

it turns out its the time multiplied by itself. After three seconds its travelled nine

metres. Differentiation is the process of finding

out how fast a functions result is changing, and to keep this video short, youll just

have to believe me that this is a useful thing to be able to do on a computer. So if we take a distance function and differentiate

it, we get another function that tells us how fast the distance is changing, which in

physics we call speed.

So we know how far the green car has travelled

at each point in time, but differentiation can tell us how fast its going at each

point in time, which is otherwise not obvious. We can see visually how fast a function is

changing by graphing it and looking at the slope of the line. The red cars distance

function makes a straight line. Lets say the slope is one metre per second: for every

one second you go across, it goes one metre up.

If the car travels twice as fast, the line

is steeper. We can call that slope two metres per second. And if it travels half as fast, we get a less

steep line, say half a metre per second. So steeper means faster, shallower means slower.

The speed of the green car changes over time.

Initially its speed is zero, but then the curve gets steeper, and then even steeper

as its speed increases. So if we know these distance functions, how

do we get a computer to calculate the speeds? There are two traditional ways of doing it.

The first is called symbolic differentiation. The idea is to start with the expression of

the function you want to differentiate for example, time multiplied by time and work

out an expression for the differentiated function. If you did any calculus at school, youll

know that mathematicians have a load of rules for how to do these manipulations.

Theres

a whole Wikipedia page full of them! Theres the constant factor rule, the sum

rule, the subtraction rule, the product rule, the chain rule, the inverse function rule,

the elementary power rule, and so on. To do symbolic differentiation, we represent

the expression as a data structure inside the computer as a tree of symbols and

get the computer to manipulate that structure according to all those differentiation rules

and the rules of basic algebra. Firstly, we teach the computer that time

multiplied by time can be written as time to the power of two, or time squared.

And then one of the differentiation rules, the elementary power rule, says: something

to the power of n differentiates to n multiplied by that something to the power of n minus

one. So time to the power of two differentiates

to two multiplied by time to the power of one, which is just two multiplied by time.

Now we know that speed is two multiplied by

time, we can compute the speeds. So the big picture is: from the expression

for distance, the computer uses some rules to work out the expression for speed, and

then evaluates it for different inputs. That gives accurate answers but its difficult

to do. Also, in general, the size of the differentiated expression can be much larger than the size

of the original expression, so it takes longer to evaluate.

The other traditional strategy is called numerical

differentiation. To do that, instead of representing the functions

expression as a data structure, we implement it directly as a program. Heres the green cars distance function

in Ruby. The idea of numerical differentiation is that

we can guess the rate of change at a particular value by picking a nearby value and looking

at the slope of the line that connects them.

The slope of that line doesnt quite match

the curve, so its only an approximation. On the other hand, if we choose an even closer

value the approximation gets more accurate. Itll never be perfect, but if we pick a

small enough distance between the two values, the line between them matches the slope of

the curve quite closely. So heres how we implement that.

To get

the rough speed at a particular time, we pick a really short amount of time say, 0.01

Seconds then calculate how much further the car travels in that time, and divide that

distance by the amount of time elapsed. We dont need to reimplement this for every

function we want to differentiate; we can make a higher order function to do it, or

do it with metaprogramming or whatever. Heres the results we get. Theyre extremely

close to the exact answers we got from doing it symbolically.

With this technique we dont have a simple

expression that tells us the speed directly, but we do have an algorithm for approximating

it. This is easier and faster than the symbolic

method, but its only approximate, and dividing by tiny quantities makes it susceptible to

rounding errors. So weve seen the traditional strategies:

symbolic and numerical differentiation. Automatic differentiation is a less well-known technique.

Its as accurate as symbolic differentiation but as fast and easy as numerical.

The big idea is to calculate a functions

rate of change and its value all at once. So rather than just putting a time into our

distance function and getting a distance out, we should put in a time and a rate of change

for example, we specify that time is changing at one second per second and get out a

distance and a rate of change for example, we get the result that the distance is changing

at six metres per second. One way of doing this is to invent a new kind

of number that can represent a value and its rate of change all at once, and then make

our implementations of functions work with these new numbers. So we need to design and

then implement a sort of two-dimensional number that can store both a value and its rate of

change and operate on them together.

One kind of two-dimensional number you might

already know about is the complex numbers, which were invented to solve a similar problem.

They have two components, a real part and an imaginary part, but they can be added and

multiplied and otherwise used like normal numbers. The imaginary part of a complex number is

a multiple of this invented number i its just a symbol, it doesnt represent a real

quantity. Our two-dimensional numbers will use a similar

idea, but theyre not complex numbers, theyre called dual numbers. They have a real

part and a dual part, and instead of i we use an invented number called epsilon.

Its just a symbol again, but you can think of epsilon as representing a very very tiny

number, almost zero but not zero its called an infinitesimal.

Well use this dual part to represent the

rate at which a number is changing. Epsilon is so tiny that it doesnt contribute to

the real value of the number at all, but we can say a number with ten epsilons is changing

twice as fast as a number with five epsilons. So lets just make a boring class for dual

numbers, with accessors for the real and dual parts, and a string representation. And add a convenience method to Kernel for

converting normal numbers into dual numbers, like the Complex and Rational classes have.

If we dont specify a rate of change then it defaults to zero that gives us a dual

number that represents a constant, not changing at all.

So for the red car, where the distance is

just equal to the time, we already have something that works. We make a value three with rate

of change one, and when we call the distance function with it, we get out three with rate

of change one that means three metres, one metre per second. And thats the right answer for this very

simple function where distance is the same as time. But what about when the function does some

operations on the input value? If youve worked with complex numbers you

might know that we can define how to add and multiply them, and so on, just by doing normal

algebra with their parts.

But to do that we need to know how i behaves, and by design

the rule is that i times i equals minus one. That choice makes complex numbers behave

in a certain way; theyre good for representing rotations and oscillations and things like

that in physics and electronics. We can add together two complex numbers by

just adding all their parts separately and grouping the multiples of i together. And when we multiply two complex numbers we

multiply out the brackets and group the multiples of i; were multiplying by i squared

at the end here, and that just turns into multiplying by minus one because thats

what i squared is, so thats just a subtraction; and then we group the multiples again.

We can do exactly the same thing for dual

numbers too, but this time we need to know how epsilon behaves. The rule for epsilon

is that its so small that when you multiply it by itself you get zero. And then all the

algebra follows in the same way. Adding dual numbers is literally the same:

we add all the parts and group the multiples of epsilon.

Multiplying is almost the same: we multiply

out the brackets and group the multiples, but now were multiplying by epsilon squared

at the end, and epsilon squared is zero, so that last part just goes away. We can use those results to implement plus

and multiply on our DualNumber class. This is where the mathematics meets the programming. Now if I have two dual numbers, I can add

them and multiply them.

So for the green car, where the distance is

time multiplied by time, we can take that three seconds changing at one second per second,

ask for the distance, and get the result: nine metres changing at six metres per second. And thats the answer we wanted. It just

works, because we built the idea of the rate of change into our numbers, and defined operations

that preserved and respected the rate of change. Here I fed different time values into the

green cars distance function and visualised the dual numbers it generated.

The real values

are plotted in white, and the slope of the red line is the dual part the rate of

change for the value where the dot is. So you can see that matches the slope of the

curve nicely. We have to do a bit of work to make dual numbers

compatible with Rubys built-in numbers. Right now if we try to add or multiply by

a Fixnum we get a NoMethodError, because our add and multiply operations are trying to

get the real and dual parts of a Fixnum.

We can fix that by just converting the argument

to a dual number inside the add and multiply methods. This is where that DualNumber conversion

method comes in handy. And now we can add and multiply by normal

numbers. They get automatically converted into dual number constants.

Of course we have the opposite problem, too,

if we have a Fixnum on the left and try to add or multiply by a dual number on the right.

This time instead of a NoMethodError we get a TypeError, because Fixnums add and multiply

operations dont know what to do with a dual number. Fortunately Ruby has a built-in way of coping

with this. If we call a numeric operation of a built-in number with an argument whose

type it doesnt recognise, the number will send a coerce message to that unrecognised

argument, and expect to get back a pair of objects. Then it will retry the original operation

with those objects, and whatever result it gets back, it will return that.

This coercion

protocol means built-in numbers can interoperate with user-defined ones. So if we implement this coerce method on the

DualNumber class, we can make built-in numbers automatically upgrade themselves to a dual

number whenever we try to add or multiply by a dual number. And now addition and multiplication work both

ways around. Of course, you can do more with numbers than

just add and multiply them.

With a bit more algebra its possible to work out definitions

for all sorts of other operations on dual numbers too. For example, these are the formulas

for taking the sine, cosine, natural exponential, natural logarithm and square root of a dual

number. Those formulas are easy to monkey patch into

the Math module. Here Im overriding sin and cos in the Math module to use the appropriate

formulas if theyre called with a dual number, otherwise they call super to use the original

implementation.

And now we can take the sine and cosine of

dual numbers, and combine them into larger expressions. So just to finish, if I define my distance

function to be the sine of the time, then heres the graph of the real and dual values

that my Ruby code produces. Hopefully youre convinced that the slope of the line is correct. And we can compose as many operations as we

want.

Heres a distance function that does sine and cosine and multiplies and adds and

divides, and here are the values and rates of change produced by that code. Heres a final example: time multiplied by the

sine of time squared plus one looks like this. So thats it. A clever little trick that

works by inventing a new kind of number, defining operations on it that preserve its meaning,

and passing it into existing functions to see what they do to it.

The Ruby dual number

implementation I built for this video is up on GitHub, or you can install it as a gem

if you want to play around with it. Thanks very much!.