Explicit Gradient Differentiation

Figure: Explicit Gradient (../_images/explicit-gradient.png)

The idea of the explicit gradient is to treat each gradient step as a differentiable function and backpropagate through the unrolled optimization path. Namely, given

\[\boldsymbol{\theta}^{\prime} (\boldsymbol{\phi}) \triangleq \boldsymbol{\theta}_0 - \alpha \sum_{i=0}^{K-1} \nabla_{\boldsymbol{\theta}_i} \mathcal{L}^{\text{in}} (\boldsymbol{\phi},\boldsymbol{\theta}_i),\]

we would like to compute the gradient \(\nabla_{\boldsymbol{\phi}} \boldsymbol{\theta}^{\prime} (\boldsymbol{\phi})\). This is usually done by applying automatic differentiation (AutoDiff) through the unrolled iterates of the inner optimization.
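To make this concrete, here is a minimal self-contained sketch in plain PyTorch that unrolls K gradient steps and backpropagates through them; the quadratic inner loss is a hypothetical stand-in for \(\mathcal{L}^{\text{in}}\).

import torch

# Hypothetical quadratic inner loss: L_in(phi, theta) = ||theta - phi||^2
phi = torch.tensor([1.0, 2.0], requires_grad=True)  # meta-parameter phi
theta = torch.zeros(2, requires_grad=True)          # inner parameter theta_0
alpha, K = 0.1, 5

for _ in range(K):
    inner_loss = ((theta - phi) ** 2).sum()
    # `create_graph=True` keeps the gradient itself on the graph,
    # so the update below remains differentiable w.r.t. phi
    (grad,) = torch.autograd.grad(inner_loss, theta, create_graph=True)
    theta = theta - alpha * grad  # non-inplace update

# Differentiate an outer objective through the unrolled path
outer_loss = theta.sum()
(meta_grad,) = torch.autograd.grad(outer_loss, phi)
print(meta_grad)  # d(outer_loss)/d(phi) through all K inner steps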

Differentiable Functional Optimizers

By passing inplace=False to the update functions, we can make the optimization step differentiable. Here is an example that makes torchopt.adam() differentiable.

import torch
import torchopt
from functorch import make_functional

# Define the functional optimizer
opt = torchopt.adam()
# Define meta and inner parameters
meta_params = ...
model = ...
fmodel, params = make_functional(model)
# Initialize optimizer state
state = opt.init(params)

for i in range(iter_times):
    loss = inner_loss(fmodel, params, meta_params)
    # `create_graph=True` keeps the inner gradients on the graph,
    # so the parameter update below stays differentiable
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Apply a non-inplace parameter update
    updates, state = opt.update(grads, state, inplace=False)
    params = torchopt.apply_updates(params, updates, inplace=False)

loss = outer_loss(fmodel, params, meta_params)
meta_grads = torch.autograd.grad(loss, meta_params)

Differentiable OOP Meta-Optimizers

For a PyTorch-like API (e.g., step()), we design the base class torchopt.MetaOptimizer, which wraps our functional optimizers into differentiable OOP meta-optimizers.

torchopt.MetaOptimizer(module, impl)
    The base class for high-level differentiable optimizers.
torchopt.MetaAdaDelta(module[, lr, rho, ...])
    The differentiable AdaDelta optimizer.
torchopt.MetaAdadelta
    Alias of torchopt.MetaAdaDelta.
torchopt.MetaAdaGrad(module[, lr, lr_decay, ...])
    The differentiable AdaGrad optimizer.
torchopt.MetaAdagrad
    Alias of torchopt.MetaAdaGrad.
torchopt.MetaAdam(module[, lr, betas, eps, ...])
    The differentiable Adam optimizer.
torchopt.MetaAdamW(module[, lr, betas, eps, ...])
    The differentiable AdamW optimizer.
torchopt.MetaAdaMax(module[, lr, betas, ...])
    The differentiable AdaMax optimizer.
torchopt.MetaAdamax
    Alias of torchopt.MetaAdaMax.
torchopt.MetaRAdam(module[, lr, betas, eps, ...])
    The differentiable RAdam optimizer.
torchopt.MetaRMSProp(module[, lr, alpha, ...])
    The differentiable RMSProp optimizer.
torchopt.MetaSGD(module, lr[, momentum, ...])
    The differentiable Stochastic Gradient Descent optimizer.

The high-level API is equivalent to wrapping the corresponding functional optimizer with the low-level torchopt.MetaOptimizer:

# Low-level API
optim = torchopt.MetaOptimizer(net, torchopt.sgd(lr=1.0))

# High-level API
optim = torchopt.MetaSGD(net, lr=1.0)

Here is an example of using the OOP API torchopt.MetaAdam to compute meta-gradients.

# Define meta and inner parameters
meta_params = ...
model = ...
# Define the differentiable optimizer
opt = torchopt.MetaAdam(model)

for i in range(iter_times):
    # Perform the inner update: step() differentiates the loss
    # w.r.t. the model parameters and applies a differentiable update
    loss = inner_loss(model, meta_params)
    opt.step(loss)

loss = outer_loss(model, meta_params)
loss.backward()  # meta-gradients accumulate in meta_params.grad
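After loss.backward(), the meta-gradients live in the .grad attributes of the meta-parameters. A sketch of the outer update (assuming meta_params is a single tensor; substitute your own outer optimizer) might then be:

# Hypothetical outer step: update the meta-parameters with an
# ordinary (non-differentiable) PyTorch optimizer
meta_opt = torch.optim.Adam([meta_params], lr=1e-3)
meta_opt.step()       # consumes meta_params.grad
meta_opt.zero_grad()  # clear for the next outer iteration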

CPU/GPU Accelerated Optimizer

TorchOpt performs symbolic reduction by manually writing the forward and backward functions in C++ with OpenMP (CPU) and CUDA (GPU), which greatly improves the efficiency of meta-gradient computation. Users can enable the accelerated optimizer by setting use_accelerated_op=True. TorchOpt will automatically detect the device and dispatch to the corresponding accelerated implementation.

# Check whether the accelerated_op is available on a given device:
torchopt.accelerated_op_available(torch.device('cpu'))
torchopt.accelerated_op_available(torch.device('cuda'))

net = Net(1).cuda()
optim = torchopt.Adam(net.parameters(), lr=1.0, use_accelerated_op=True)
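The same flag also applies to the differentiable meta-optimizers, for example:

# Use the accelerated op inside a differentiable meta-optimizer
optim = torchopt.MetaAdam(net, lr=1.0, use_accelerated_op=True)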

General Utilities

We provide the torchopt.extract_state_dict() and torchopt.recover_state_dict() functions to extract and restore the state of the network and optimizer. By default, the extracted state dictionary holds references to the original tensors (this design supports accumulating gradients across a batch of tasks, as in MAML). You can also set by='copy' to extract a copy of the state dictionary, or by='deepcopy' to obtain a detached copy.

torchopt.extract_state_dict(target, *[, by, ...])
    Extract the target state.
torchopt.recover_state_dict(target, state)
    Recover the target state from a previously extracted state.
torchopt.stop_gradient(target)
    Stop the gradient for the input object.

Here is a usage example.

import torch
import torch.nn as nn
import torchopt

# A minimal module with a single learnable parameter `a`
# (assumed here for illustration; any nn.Module works)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.a * x

net = Net()
x = nn.Parameter(torch.tensor(2.0), requires_grad=True)

optim = torchopt.MetaAdam(net, lr=1.0)

# Get references to the state dictionaries
init_net_state = torchopt.extract_state_dict(net, by='reference')
init_optim_state = torchopt.extract_state_dict(optim, by='reference')
# With `detach_buffers=True`, parameters are still extracted by
# reference while buffers become detached copies
init_net_state = torchopt.extract_state_dict(net, by='reference', detach_buffers=True)

# Set `by='copy'` to extract a copy of the state dictionary
init_net_state_copy = torchopt.extract_state_dict(net, by='copy')
init_optim_state_copy = torchopt.extract_state_dict(optim, by='copy')

# Set `by='deepcopy'` to get a detached copy of the state dictionary
init_net_state_deepcopy = torchopt.extract_state_dict(net, by='deepcopy')
init_optim_state_deepcopy = torchopt.extract_state_dict(optim, by='deepcopy')

# Conduct two inner-loop optimization steps
for i in range(2):
    inner_loss = net(x)
    optim.step(inner_loss)

print(f'a = {net.a!r}')

# Restore the initial states and rerun the two inner-loop steps
torchopt.recover_state_dict(net, init_net_state)
torchopt.recover_state_dict(optim, init_optim_state)

for i in range(2):
    inner_loss = net(x)
    optim.step(inner_loss)

print(f'a = {net.a!r}')  # the same result
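The reference-based extraction above is what enables MAML-style multi-task training: adapt to each task, accumulate that task's meta-gradient, then rewind before the next task. Below is a minimal sketch of that pattern; tasks, support_loss, and query_loss are hypothetical placeholders for your own data and objectives.

# Hypothetical MAML-style outer loop built on extract/recover_state_dict
net = Net()
inner_opt = torchopt.MetaAdam(net, lr=0.1)
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)  # outer optimizer

init_net_state = torchopt.extract_state_dict(net)
init_opt_state = torchopt.extract_state_dict(inner_opt)

for task in tasks:
    # Inner adaptation on the task's support set
    inner_opt.step(support_loss(net, task))
    # Backpropagate the query loss through the inner step;
    # meta-gradients accumulate in the initial (leaf) parameters
    query_loss(net, task).backward()
    # Rewind network and optimizer states before the next task
    torchopt.recover_state_dict(net, init_net_state)
    torchopt.recover_state_dict(inner_opt, init_opt_state)

meta_opt.step()
meta_opt.zero_grad()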

Notebook Tutorial

Check the notebook tutorials at Meta Optimizer and Stop Gradient.