Erwin Mayer
2 min read · May 23, 2021


GPT-3 would like to share with us the following possible next steps:

The best algorithm to implement this is simply a generalization of GPT-3’s own architecture, with one or more memory layers capable of storing context and linking it to the original data layer. This can be achieved with a simple modification of GPT-3’s training, i.e. training the memory layer in addition to the primary layer, by adding a new loss function that measures the memory layer’s capacity to remember everything the data layer can. This secondary layer would be able to store everything in the data layer, and could be used as an additional source of input to the primary layer whenever we want the model to learn something new.
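
To make this concrete, here is a minimal sketch of what such an architecture could look like (all names, shapes, and hyperparameters are illustrative assumptions, not GPT-3’s actual internals): a standard transformer layer plays the role of the “data layer”, and a learned matrix of memory slots plays the role of the “memory layer”, which the data layer reads through cross-attention as an additional source of input.

```python
import torch
import torch.nn as nn

class MemoryAugmentedBlock(nn.Module):
    def __init__(self, d_model=64, n_slots=128, n_heads=4):
        super().__init__()
        # Primary "data layer": an ordinary transformer encoder layer.
        self.data_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # Secondary "memory layer": a learned matrix of memory slots.
        self.memory = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        # The data layer reads from memory through cross-attention.
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.data_layer(x)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        read, _ = self.read(h, mem, mem)        # query memory with hidden states
        return h + read                         # memory as an extra input source

block = MemoryAugmentedBlock()
out = block(torch.randn(2, 10, 64))             # -> shape (2, 10, 64)
```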

This is done by training the memory layer to store everything the data layer learns, and then leveraging it to learn new associations by updating the parameters of the main data layer.

The main challenge is to find a way to efficiently modify the parameters of the data layer to make it learn whatever we want, given only the memory layer and a few bits of information about the expected outcome of the modifications.

In order to do that, we need to find an efficient way to measure how much the data layer can learn on its own, and how much the memory layer can help it learn more. We also need to be able to train the memory layer separately and independently of the data layer, while making sure that any modification performed to the data layer is reflected in the memory layer.
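
One hedged way to make those two measurements concrete, reusing the MemoryAugmentedBlock sketched above: score the data layer on a batch with its memory reads enabled and then disabled, and treat the gap as how much the memory layer helps.

```python
import torch
import torch.nn.functional as F

def loss_on_batch(block, x, target, use_memory=True):
    # Task loss for the data layer, optionally letting it read memory.
    h = block.data_layer(x)
    if use_memory:
        mem = block.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        read, _ = block.read(h, mem, mem)
        h = h + read
    return F.mse_loss(h, target)    # toy regression objective, an assumption

x, target = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
alone  = loss_on_batch(block, x, target, use_memory=False)  # data layer alone
helped = loss_on_batch(block, x, target, use_memory=True)   # with memory reads
memory_gain = alone - helped    # > 0 means the memory layer is helping
```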

We can do that by defining a loss function that measures how much the memory layer can help the data layer learn, and another one that measures how much the memory layer can learn on its own.
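
Here is one illustrative way those two losses could be written (a sketch under the assumptions above, not a reference implementation): the “help” loss is simply the task loss with memory reads enabled, while the “remember” loss asks the memory to reconstruct the data layer’s activations, i.e. to store what the data layer has learned.

```python
import torch
import torch.nn.functional as F

def help_loss(block, x, target):
    # How much does reading memory help the data layer on the task?
    return loss_on_batch(block, x, target, use_memory=True)

def remember_loss(block, x):
    # Can the memory reproduce what the data layer knows? Query the
    # memory with the raw inputs and compare against the data layer's
    # own activations (held fixed for this loss).
    with torch.no_grad():
        h = block.data_layer(x)
    mem = block.memory.unsqueeze(0).expand(x.size(0), -1, -1)
    recon, _ = block.read(x, mem, mem)
    return F.mse_loss(recon, h)
```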

Our model will take as input the memory layer (a matrix containing the memory of the data layer) and an input object to learn. We will then optimize the parameters of the data layer against the first loss function, so as to maximize how much the memory layer can help the data layer learn.

And we optimize the parameters of the memory layer against the second loss function, so as to maximize how well the memory layer can learn on its own.
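
Put together, the alternating updates might look like this (a sketch: the toy loader, learning rates, and strict alternation are all assumptions; note also that each objective is written as a loss to be minimized, which is equivalent to maximizing the corresponding measure):

```python
import torch

opt_data = torch.optim.Adam(block.data_layer.parameters(), lr=1e-4)
opt_mem  = torch.optim.Adam([block.memory, *block.read.parameters()], lr=1e-3)

# Toy stand-in for a real DataLoader of (input, target) pairs.
loader = [(torch.randn(2, 10, 64), torch.randn(2, 10, 64)) for _ in range(3)]

for x, target in loader:
    # Step 1: update the data layer so it best exploits the memory.
    opt_data.zero_grad()
    help_loss(block, x, target).backward()
    opt_data.step()

    # Step 2: update the memory so it keeps storing what the data layer knows.
    opt_mem.zero_grad()
    remember_loss(block, x).backward()
    opt_mem.step()
```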

The result will be a model that takes as input an object to learn and produces an output object consistent with that input, while the secondary layer helps the model learn more about the input and stores everything it learns.
