Context: this post summarises recent research to prevent AI algorithms from being misused by unauthorised actors. After discussing four recent case studies, common research flaws and possible solutions are mentioned.

A real-life motivating problem is how Meta's LLaMa had its parameters leaked online (Vincent, 2023), plausibly enabling actors like hackers to use the model for malicious purposes like generating phishing messages en masse. Still, advanced models could have more severe and widespread consequences if stolen, "jailbroken," or otherwise misused.

Summary

Currently, the most common solution to prevent misuse is managing access to AI models with secure APIs. This is desirable, but APIs have flaws:

APIs may be used as "bandaid" solutions to reactively add security after a model is trained. Ideally, security would be considered before training.

Some "real-time" use cases like self-driving cars are not suitable for API-based protection due to time delays in network requests.

Once a model's parameters are leaked, APIs no longer offer protection.

Thus, researchers aim to add new defences inseparably within AI models' weights by making models' accuracy depend on cryptographic keys. They change models' input data, optimiser functions, weights, activation functions, etc. to do this.

Different techniques have tradeoffs between extra model training, memory to store keys, time to generate predictions, and specialised hardware investments. Also, techniques are evaluated by model accuracy when encrypted vs. decrypted, as well as the difficulty of improving an encrypted model.

Current research mostly focuses on small image datasets and models. Also, interpretability techniques, adversarial training, or formal proofs are rare. This limits confidence in the reliability and scalability of current research. Future research should consider larger datasets, language models, and establishing confidence in model reliability.

Four Case Studies of Recent Research

Preprocessing Input with Secret Keys (Pyone et al., 2020)

This is the simplest technique of the four case studies. The researchers trained a ResNet-18 model on the CIFAR-10 dataset, which has 60,000 32x32 pixel images across ten categories. Before training, they preprocessed the input images by dividing them into square blocks and randomly rearranging the pixels in each block based on their secret key. This technique ensures that users can't properly preprocess the image (and get accurate outputs) without the secret key.

[Figure: an example of an image before and after preprocessing.]

For programmers interested in the mathematical details: the authors were unclear about how the key is generated and applied. That said, these are the general steps:

Divide the input image into square 'blocks' with side length $m$. Ex: You could divide a 32x32 pixel image into 16 blocks with side length $m = 8$.

Arrange the pixel values of each block into a pixel vector $\vec{p} = [p_1, \dots, p_N]$, where $N$ is the number of pixel values in the block. Ex: With $m = 8$, there are $N = 8 \times 8 \times 3 = 192$ pixel values for each block in a 3-channel (RGB) image.

For each block, generate an index vector with one index per element of the pixel vector. Ex: It would be $\vec{i} = [1, 2, 3, \dots, 192]$ with the example above. Randomly shuffle the indices of this vector using a secret key, though the researchers don't specify exactly how. Ex: $\vec{i} = [67, 3, 113, \dots, 44]$.

Finally, rearrange the pixel vector into a new vector using the shuffled indices: $\vec{p}'(k) = \vec{p}(\vec{i}(k))$, where $\vec{p}'$ is the rearranged pixel vector, $k = 1, \dots, 192$ with the example above, and the notation $\vec{p}(k)$ represents the $k$th element of $\vec{p}$.

Repeat this for all pixel vectors per image and all images in the training data. Then, train the model normally and preprocess images before classification.
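The steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: since the paper does not say how the key is generated or applied, the sketch assumes the key is an integer that seeds a pseudorandom generator, and that the same permutation is reused for every block. The function name `shuffle_blocks` and the nested-list image layout are hypothetical choices.

```python
import random

def shuffle_blocks(image, block_size, key):
    """Shuffle the pixel values inside every block_size x block_size block.

    `image` is a nested list indexed as image[row][col][channel].
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    if height % block_size or width % block_size:
        raise ValueError("image sides must be multiples of block_size")

    # Assumption: the secret key seeds a PRNG, and one shuffled index
    # vector is shared by all blocks. The paper does not specify this.
    n = block_size * block_size * channels
    perm = list(range(n))
    random.Random(key).shuffle(perm)

    out = [[[0] * channels for _ in range(width)] for _ in range(height)]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            # Coordinates of every value in this block, in a fixed order.
            coords = [(top + r, left + c, ch)
                      for r in range(block_size)
                      for c in range(block_size)
                      for ch in range(channels)]
            p = [image[r][c][ch] for r, c, ch in coords]  # pixel vector
            for k, (r, c, ch) in enumerate(coords):
                out[r][c][ch] = p[perm[k]]  # p'(k) = p(i(k))
    return out

# Example: a synthetic 32x32 RGB image whose values are all distinct.
image = [[[(r * 32 + c) * 3 + ch for ch in range(3)]
          for c in range(32)] for r in range(32)]
protected = shuffle_blocks(image, block_size=8, key=1234)
```

Authorised users would apply the same function with the same key to every image before classification. With a different key, the model sees values in positions it was never trained on.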

The researchers trained models to process images with various block sizes. They confirmed that a model trained on preprocessed input can perform similarly to one trained on normal input. Also, they showed that a model trained on preprocessed input has low performance for unauthorised users without the secret key.

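| Model | Accuracy (correct key) | Accuracy (incorrect key) | Accuracy (no preprocessing) |
| --- | --- | --- | --- |
| Baseline | NA | NA | 95.45% |
| Block size = 2 | 94.70% | 25.84% | 34.39% |
| Block size = 4 | 94.26% | 20.01% | 27.11% |
| Block size = 8 | 86.98% | 14.98% | 15.70% |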

This technique has benefits in that no extra parameters (or memory to store them) are needed. Also, authorised users do not need to "decrypt" the model to use it. Thus, the model's parameters always stay protected. Finally, the time to generate predictions doesn't increase since the model architecture doesn't change.

Still, there are some limitations to this technique. For one, the model must be retrained to change the key (after it is leaked for example). Additionally, authorised users must preprocess every input to the model, which adds extra computational cost. Finally, an unauthorised user could steal some of the researchers' dataset and fine-tune the AI algorithm to work with their own key. Even with just 2% of the original dataset, the model's accuracy can be improved by 20 percentage points.

Furthermore, there are limitations in the research method (beyond the lack of clarity about key generation and usage). For instance, the ResNet-18 model and CIFAR-10 dataset used are very small compared to more recent image models and datasets. The authors also don't test how much the model can resist adversarial inputs. Finally, they don't use interpretability techniques to check which model layers adapt to the authorised key (and whether these layers can be selectively fine-tuned).

DeepLock: Cryptographic Locks for Models' Parameters (Alam et al., 2020)

The next technique is more complicated than the last. It does not modify the training data or the training process of a model at all. Instead, it encrypts the parameters of a model after training. Thus, the parameters become useless until decrypted.

The challenge with approaches like these is choosing a secure key for encryption. If one key encrypts all parameter values, guessing or bypassing the key is more plausible. Yet if each parameter is encrypted with its own key, the model uses much more memory.

The researchers balance these extremes by using the AES key schedule: essentially an algorithm to generate variations of one master key for every parameter to encrypt. The video in the link above is excellent at visually explaining this algorithm.

For those familiar with AES, here are more details on the paper's technique:

Let $N$ be the number of parameters in a model and $w_i$ be the $i$th parameter, where $i = 1, \dots, N$. The AES key schedule algorithm uses a master AES key $K$ to generate a set of round keys, one for each parameter of the model: $\text{KeySchedule}(K) = \{k_1, \dots, k_N\}$.

To get each encrypted parameter $w'_i$, an XOR operation is applied to the binary representations of the parameter $w_i$ and its key $k_i$. The XOR is useful since it is reversible in later decryption. The output is then passed through the AES substitution box. In summary, $w'_i = S(w_i \oplus k_i)$, where $S(\cdot)$ represents the substitution box and $\oplus$ represents the XOR.

The model and master key $K$ can then be sent to an authorised user. To decrypt the model parameters, the authorised user again uses the AES key schedule to generate round keys for each parameter. Then, the parameters can be decrypted as $w_i = S^{-1}(w'_i) \oplus k_i$, where $S^{-1}(\cdot)$ represents the inverse substitution box.
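To make the XOR-then-substitute idea concrete, here is a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: parameters are stored as 32-bit floats, each gets a 4-byte key, and the per-parameter keys come from hashing the master key with the parameter index. That hash is a stand-in for the AES key schedule (which the sketch does not implement). The S-box itself is the standard AES one, built from its definition. The names `derive_key`, `encrypt_weights`, and `decrypt_weights` are hypothetical.

```python
import hashlib
import struct

def _gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using the AES reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def _build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if _gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = _build_sbox()
INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value

def derive_key(master_key, index):
    # Assumption: SHA-256 stands in for the AES key schedule here.
    digest = hashlib.sha256(master_key + index.to_bytes(8, "big")).digest()
    return digest[:4]

def encrypt_weights(weights, master_key):
    """w'_i = S(w_i XOR k_i), applied byte by byte to float32 parameters."""
    encrypted = []
    for i, w in enumerate(weights):
        raw = struct.pack("<f", w)
        key = derive_key(master_key, i)
        encrypted.append(bytes(SBOX[b ^ k] for b, k in zip(raw, key)))
    return encrypted

def decrypt_weights(encrypted, master_key):
    """w_i = S^-1(w'_i) XOR k_i, then reinterpret the bytes as float32."""
    weights = []
    for i, blob in enumerate(encrypted):
        key = derive_key(master_key, i)
        raw = bytes(INV_SBOX[b] ^ k for b, k in zip(blob, key))
        weights.append(struct.unpack("<f", raw)[0])
    return weights

weights = [0.5, -1.25, 3.0]
locked = encrypt_weights(weights, b"master-key")
unlocked = decrypt_weights(locked, b"master-key")
```

Decrypting with any other master key yields unrelated bit patterns, which is why the model's outputs look random to unauthorised users.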

The researchers tested this technique using small convolutional neural networks on the MNIST dataset of black and white images of numbers, the Fashion-MNIST dataset of black and white images of clothing items, and the CIFAR-10 dataset. They showed that the model guesses random outputs when an unauthorised user inputs the wrong key, but the time to generate predictions more than doubles due to decryption.

The largest advantage of this technique is that it can be applied to any model architecture without retraining. Similarly, if a key is compromised it can be replaced with negligible cost; also, a unique key can be issued to each user to minimise risk spreading between users. Moreover, memory usage is low due to the use of a key schedule instead of multiple master keys.

Again, key flaws of the research method are similar to the last paper (not reporting methods transparently, choosing small datasets and model architectures, not testing models to resist adversarial inputs). Separately, the authors report no progress when trying to fine-tune the encrypted model with 10% of the original data stolen, but their claims are also hard to verify since they did not describe their methodology.

AdvParams: Adversarially Modifying Model Parameters (Xue et al., 2021)

This approach is like a more targeted version of the above research paper. Again, it only encrypts parameters after training instead of adjusting the training process of a model. However, it doesn't modify every single parameter in the model; it selectively adjusts the most influential parameter values in a model to degrade performance.

In fact, the researchers only needed to adjust 23 to 48 parameters (out of hundreds of thousands to millions) in three convolutional neural networks they trained on the Fashion-MNIST, CIFAR-10, and German Traffic Sign Recognition Benchmark datasets. Simply adjusting a few dozen parameters led to a drop of over 80% in accuracy.

Thus, the encryption and decryption processes are very quick since only a few updates are needed. Furthermore, keys use little memory and the encryption process can be repeated so different users' keys are unique and replaceable. Also, the parameter value distributions remain similar before and after parameter modification. This makes it harder for unauthorised users to spot parameter updates to undo.

Still, how are the most influential parameters chosen and modified? Here are more details for those familiar with deep learning.

To identify influential parameters, the gradient of the loss with respect to each layer's parameters is computed. The largest component of the gradient vector shows the parameter in each layer with the most influence on the loss.

Mathematically, let $L(x, y)$ represent the loss function and $w_i^l$ represent the $i$th parameter of the $l$th layer. Note that $x, y$ represent an entire subset of training examples chosen for encryption. Then, the gradient vector for a layer with $n$ parameters is:

$$\nabla_{W^l} L(x, y) = \left[ \frac{\partial L}{\partial w_1^l}, \dots, \frac{\partial L}{\partial w_n^l} \right]$$

If the $i$th component of the gradient vector is the maximum, then $w_i^l$ is the most influential parameter in that layer.

Note that random layers are chosen for parameter modification at the start. Any one parameter can only be modified a certain number of times. Thus, the same influential parameter is not chosen for updates on each iteration.

Gradient descent updates parameters away from the gradient vector to decrease loss. Thus, an update is made towards the gradient component to increase loss:

$$w_i^l \leftarrow w_i^l + \frac{\partial L}{\partial w_i^l}$$

Note the addition instead of subtraction.

Still, the parameters may grow large with the maximum gradient component being used. This would make the modified parameter stand out from other ones. Thus, the authors add a hyperparameter $\theta \in \mathbb{R}$ to scale the update step.

Note, however, that different layers have different parameter value distributions. Thus, one hyperparameter across all layers is unsuitable. Instead, the researchers scale the update step with the range of parameters in each layer: $\text{Range}(W^l) = \max W^l - \min W^l$.

The above updates repeat until the model's loss rises above a chosen threshold. At each update, the selected parameters and the changes made to their original values are noted so that authorised users can undo ("decrypt") these changes.
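The loop described above can be sketched on a toy problem. This is a simplified illustration rather than the authors' algorithm: it uses a single linear "layer" with mean squared error so the gradient can be written by hand, and it assumes the update is the sign of the gradient component times $\theta \cdot \text{Range}(W^l)$, which is one plausible reading of the scaling described above. All function names and the default values of `theta`, `loss_threshold`, and `max_mods` are hypothetical.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of a linear model w . x over a batch."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Gradient of the loss with respect to each parameter."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(xs)
    return grad

def encrypt(w, xs, ys, theta=0.1, loss_threshold=1.0, max_mods=3):
    """Perturb the most influential parameters until the loss is high."""
    w = list(w)
    step = theta * (max(w) - min(w))  # theta * Range(W), assumed form
    counts = [0] * len(w)             # how often each parameter was changed
    key = []                          # recorded (index, change) pairs
    while mse_loss(w, xs, ys) <= loss_threshold:
        grad = mse_grad(w, xs, ys)
        candidates = [i for i in range(len(w))
                      if counts[i] < max_mods and grad[i] != 0.0]
        if not candidates:
            break
        i = max(candidates, key=lambda j: abs(grad[j]))
        delta = step if grad[i] > 0 else -step  # move towards the gradient
        w[i] += delta
        counts[i] += 1
        key.append((i, delta))
    return w, key

def decrypt(w_encrypted, key):
    """Undo the recorded changes in reverse order."""
    w = list(w_encrypted)
    for i, delta in reversed(key):
        w[i] -= delta
    return w

xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
true_w = [1.1, -1.9, 0.4]
ys = [sum(t * xi for t, xi in zip(true_w, x)) for x in xs]
trained_w = [1.0, -2.0, 0.5]
locked_w, key = encrypt(trained_w, xs, ys)
restored_w = decrypt(locked_w, key)
```

The `key` here is just the list of recorded changes, which matches the idea that authorised users undo the modifications instead of holding a conventional cryptographic key.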

Unfortunately, this approach is flawed. Compared to previous papers, the encryption is reversible if an unauthorised user fine-tunes the model with stolen data. This can be done more efficiently using the above process to find the most influential parameters and selectively updating their values. However, the parameter values would be updated away from the direction of gradient components to decrease the loss.

Note that the research methodology flaws from above papers still apply here.

Hardware-Accelerated Retraining and Prediction (Chakraborty et al., 2020)

This last technique focuses on commercial AI deployment with specialised hardware like GPUs and TPUs. Low latency, computational cost, and memory usage are required. This approach modifies an AI model's training process with a cryptographic key.

Specifically, the key to encrypt a model is a fixed hyperparameter during training. Each neuron in a neural network is associated with a bit (0 or 1). All neurons which have a 1 associated with them flip the signs of their weighted sums. Thus, the trained model needs the right key to flip the right neuron weighted sums in deployment.

Here are more technical details for those familiar with deep learning.

Let $n_i^l$ be the $i$th neuron in the $l$th layer of the model. It is associated with a key bit $k_i^l$.

If $k_i^l = 0$, the neuron's activation ($a_i^l$) is computed normally: $a_i^l = g(W_i^l \cdot a^{l-1})$, where $g(\cdot)$ is some activation function, $W_i^l$ represents a row of the parameter matrix of the $l$th layer, and $a^{l-1}$ is a vector containing activations from all neurons in layer $l-1$.

If $k_i^l = 1$, the sign of the weighted input is flipped before the activation function is applied: $a_i^l = g(-1 \times W_i^l \cdot a^{l-1})$.

The researchers chose this modification as the sign of a binary number can be flipped with a single XOR operation. This is what enables the algorithm's computational efficiency.

Specifically, the researchers rely on customised GPUs/TPUs to pass keys and weighted sums from a multiply-accumulate unit through an XOR gate. This means that the same computational cycle can compute a weighted sum, check a key value, and adjust the weighted sum's sign.

Since each multiply-accumulate unit is performing operations with a key, each of these hardware units is assigned its own key. Then, all neurons processed by that particular unit are associated with that key. Memory usage is thus low. One key bit is needed per multiply-accumulate unit (of which there may be under 1000), not per neuron (of which there may be millions).
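A software sketch can show what the key-dependent sign flip looks like at prediction time. It is only an emulation under stated assumptions: weighted sums are treated as IEEE 754 32-bit floats (where XOR-ing the top bit flips the sign), neurons are assigned to multiply-accumulate units round-robin, and ReLU is the activation function. None of these details are confirmed by the paper, and the function names are hypothetical.

```python
import struct

def flip_sign_if_keyed(value, key_bit):
    """XOR the key bit into the float32 sign bit of a weighted sum."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits ^= key_bit << 31  # key_bit = 1 flips the sign, 0 leaves it alone
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def relu(z):
    return max(0.0, z)

def locked_layer(weights, prev_activations, unit_key_bits):
    """Compute one layer where each neuron inherits its MAC unit's key bit.

    Assumption: neuron i runs on unit i % len(unit_key_bits).
    """
    activations = []
    for i, row in enumerate(weights):
        weighted_sum = sum(w * a for w, a in zip(row, prev_activations))
        key_bit = unit_key_bits[i % len(unit_key_bits)]
        activations.append(relu(flip_sign_if_keyed(weighted_sum, key_bit)))
    return activations

weights = [[1.0, 0.0], [0.0, 1.0]]
inputs = [2.0, 3.0]
with_training_key = locked_layer(weights, inputs, unit_key_bits=[0, 1])
with_other_key = locked_layer(weights, inputs, unit_key_bits=[0, 0])
```

Because the model is trained with one fixed key, running it with any other key flips (or fails to flip) some weighted sums, and the activations diverge from what later layers expect.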

Empirically, the authors tested this approach with a small convolutional neural network and ResNet-18 on the Fashion-MNIST, CIFAR-10, and Street View House Numbers datasets. Attempting to use the model with an unauthorised key caused a 70-80% drop in accuracy, whereas the correct key resulted in the encrypted model having the same accuracy as the original model (±0.5%).

Unfortunately, it was very easy to fine-tune the model to have high accuracy with just a small fraction of the original dataset. That said, these results may not be applicable to more complex datasets, especially since the authors reported better accuracy when training a model initialised with random weights than when fine-tuning a model with encrypted weights.

All the prior research method concerns still apply for this paper. Although the Street View House Numbers dataset has an order of magnitude more examples than the other datasets seen, the images are still only 32x32 pixels and the classification problem has only 10 classes. Thus larger and more challenging benchmarks are neglected.

Discussing Future Improvements

The variety of techniques available to tackle the problem of misuse shows that this research area is developing beyond its infancy. To help the area scale, it is crucial to test techniques in more realistic and commercial settings, especially since the threat of misuse will persist with the development of more advanced models. The current solutions may not scale to this more pressing use case if we do not thoughtfully improve them.

More specifically, some methodological improvements are obvious:

More transparent reporting of research methods is needed in general, especially regarding the process of generating and applying keys.

It would help adoption to create code repositories which show companies and other researchers precisely how to deploy these algorithms.

Encryption techniques should be tested with larger image models and datasets like deeper ResNets or the ImageNet Large Scale Visual Recognition Challenge. Problems beyond classification such as object detection and image segmentation would be useful to include.

Techniques should also be tested with language models, especially ones based on transformers to show commercial viability. This is more feasible for techniques which do not require retraining, like the DeepLock paper (Alam et al., 2020).

To demonstrate their reliability, the adversarial robustness of these encryption techniques should be tested, starting with simple attacks like the fast gradient sign method (Goodfellow et al., 2015) or projected gradient descent (Madry et al., 2019).

In addition, it may be possible to generate mathematical proofs regarding the reliability of individual encryption techniques. Chakraborty et al. provide an example demonstrating that a model's capacity to learn does not deteriorate with their encryption method (Chakraborty et al., 2020, p. 3).

Other improvements needed involve new research directions instead of adjusting the methodology of existing research. For instance, more research is needed on practical considerations like backup keys or revoking keys if one is stolen. Advances here could involve research around key hierarchies and asymmetric key encryption (Behera and Prathuri, 2020). The intention would be to reduce the impact of a disclosed key on a model's confidentiality.

More importantly, research is needed to scale these methods to increasingly complex models like those with deceptive behaviours (Pan et al., 2023), agentic goals (Carlsmith, 2022), or embedded trojans (Chen et al., 2017). For instance, a technique like the preprocessed input data (Pyone et al., 2020) seems more vulnerable to adversarial attacks or trojan attacks compared to the technique which relies on formal AES cryptography (Alam et al., 2020).

In addition, more fallback behaviours must be developed aside from simply generating incorrect predictions. For example, could the model parameters be permanently disabled if an unauthorised key is used? Could the model be taught to stop further actions and seek human feedback? These kinds of fallback behaviours might make these techniques useful for not only stopping misuse by humans, but also misaligned behaviour without humans in the loop.

Personally, I will be researching how to bridge these gaps in the coming months. If you have any questions about potential mechanisms I'm considering or any other details from this article, I'd be happy to explain my thoughts :-)

References

Alam, M., Saha, S., Mukhopadhyay, D., & Kundu, S. (2020). Deep-lock: secure authorization for deep neural networks. arXiv. http://arxiv.org/abs/2008.05966

Behera, S., & Prathuri, J. R. (2020). Application of homomorphic encryption in machine learning. 2020 2nd PhD Colloquium on Ethically Driven Innovation and Technology for Society (PhD EDITS), 1–2. https://doi.org/10.1109/PhDEDITS51180.2020.9315305

Carlsmith, J. (2022). Is power-seeking AI an existential risk? arXiv. http://arxiv.org/abs/2206.13353

Chakraborty, A., Mondal, A., & Srivastava, A. (2020). Hardware-assisted intellectual property protection of deep learning models. 2020 57th ACM/IEEE Design Automation Conference (DAC), 1–6. https://doi.org/10.1109/DAC18072.2020.9218651

Chen, X., Liu, C., Li, B., Lu, K., & Song, D. (2017). Targeted backdoor attacks on deep learning systems using data poisoning. arXiv. http://arxiv.org/abs/1712.05526

Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. arXiv. http://arxiv.org/abs/1412.6572

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2019). Towards deep learning models resistant to adversarial attacks. arXiv. http://arxiv.org/abs/1706.06083

Pan, A., Shern, C. J., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S., & Hendrycks, D. (2023). Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. arXiv. http://arxiv.org/abs/2304.03279

Pyone, A., Maung, M., & Kiya, H. (2020). Training DNN model with secret key for model protection. 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE), 818–821. https://doi.org/10.1109/GCCE50665.2020.9291813

Vincent, J. (2023, March 8). Meta’s powerful AI language model has leaked online—What happens now? The Verge. https://www.theverge.com/2023/3/8/23629362/meta-ai-language-model-llama-leak-online-misuse

Xue, M., Wu, Z., Wang, J., Zhang, Y., & Liu, W. (2021). Advparams: An active DNN intellectual property protection technique via adversarial perturbation based parameter encryption. arXiv. http://arxiv.org/abs/2105.13697
