Cryptanalyzing LLMs with Nicholas Carlini Artwork

Security Cryptography Whatever

Some cryptography & security people talk about security, cryptography, and whatever else is happening.

Security Cryptography Whatever

Cryptanalyzing LLMs with Nicholas Carlini

January 28, 2025 • Season 4 • Episode 8

'Let us model our large language model as a hash function—'

Sold.

Our special guest Nicholas Carlini joins us to discuss differential cryptanalysis on LLMs and other attacks, just as the ones that made OpenAI turn off some features, hehehehe.

Watch episode on YouTube: https://youtu.be/vZ64xPI2Rc0

Transcript: https://securitycryptographywhatever.com/2025/01/28/cryptanalyzing-llms-with-nicholas-carlini/

Links:

- https://nicholas.carlini.com
- “Stealing Part of a Production Language Model”: https://arxiv.org/pdf/2403.06634
- ‘Why I attack"’: https://nicholas.carlini.com/writing/2024/why-i-attack.html
- “Cryptanalytic Extraction of Neural Network Models”, CRYPTO 2020: https://arxiv.org/abs/2003.04884
- “Stochastic Parrots”: https://dl.acm.org/doi/10.1145/3442188.3445922
- https://help.openai.com/en/articles/5247780-using-logit-bias-to-alter-token-probability-with-the-openai-api
- https://community.openai.com/t/temperature-top-p-and-top-k-for-chatbot-responses/295542
- https://opensource.org/license/mit
- https://github.com/madler/zlib
- https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
- https://nicholas.carlini.com/writing/2024/how-i-use-ai.html

"Security Cryptography Whatever" is hosted by Deirdre Connolly (@durumcrustulum), Thomas Ptacek (@tqbf), and David Adrian (@davidcadrian)

Nicholas: 0:00

we literally just, like, scrape the entire internet and just, like, use whatever we can find and train on this. And so, like, now, poisoning attacks become, you know, a very real thing that you might have to worry about.

Thomas: 0:15

Hello, and welcome to security, cryptography, and whatever. Nights. I'm Thomas Toczek, and with me, as always, is

Deirdre: 0:23

Deirdre,

David: 0:24

And David.

Deirdre: 0:25

yay.

Thomas: 0:26

And we have a very special guest this week, which is what Deirdre said I had to say. Um, so we've got with us, uh, Nicholas Carlini. Nicholas and I worked together at a little company called Modasano Security, um, 10, 15 years ago, a ways ago. Um, Nicholas was one of the, uh, primary authors of a little site called MicroCorruption. So, uh, if you've ever played with MicroCorruption, you, uh Uh, came across some of his work. But, for the past many, many years, uh, Nicholas has been getting his CS doctorate and developing a research career working on the security of machine learning and AI models. Um, did I say that right? That's a reasonable way to put it?

Nicholas: 1:03

That is a very reasonable way of putting it, yes.

Thomas: 1:05

So this is like a podcast where, like, we kind of flip in between security and cryptography, which is very mathematical. And one of the cool things about this is that, um, the kinds of work that you're doing on AI models, it's much more like the cryptography side of things than a lot of other things that people think about when they think about AI security. So if you think about, like, AI or red teaming and alignment and things like that, those are people pen testing systems by chatting with them. Which is a weird idea, right? Like prompt injection is kind of a lot of that. But the work that you've been doing is more fundamentally mathematical. Um, like, you're attacking these things as mathematical constructs.

Nicholas: 1:44

yeah, when I give talks to people I often try and explain, you know, and security is always worth interesting, we're thinking about things in multiple different angles. One of them is like, oh, this is a language model, it talks to you. The other is, this is a function, numbers in, numbers out, and you can analyze it as a function, numbers in, numbers out, and so you can do interesting stuff to it. And I think, yeah, this is one of the fun ways to get lots of attacks because people, most of the people who work on this stuff. like to do the language side. And so doing the actually math side gives you a bunch of avenues to do interesting work on this.

Deirdre: 2:15

Mm.

Thomas: 2:16

So I think I have like a rough idea of, from like a security perspective, like what the main themes of the attacks that you've come up with are, and you probably have more because I haven't read everything that you've done. Um, but like before we get into that, I'm curious how you went from straight up pentester to, you know, breaking the mathematics of ML models.

Nicholas: 2:39

Yeah. Um, right. So yeah, so I've always really liked security. And yeah, when I was an undergrad at Berkeley, um, yeah, I really got into security and, and then I went, I was at Modestano for a couple summers, yeah, doing a bunch of stuff. And then I went to do my PhD. And when I started my PhD, my PhD, I started in system security. So my first A couple papers were on return oriented programming and how to break some stuff where some people had, I don't know, very silly defenses of like, let's monitor the, you know, control flow by looking at the, like, last 16 indirect branches and decide whether or not there's a ROP gadget or not. And, like, that's kind of some things that we re broke. And then I went to Intel for a summer, and, uh, I was doing some hardware security stuff. And then I had to came, I came back and I had to figure out, like, what am I going to do a dissertation on? And I was getting lots of fun attacks, and, but fundamentally, I wasn't sure, like, what's, like, a, like, fundamentally new thing you could do. Because, like, this field has been around forever, and, like, most of the research at the time was were unwilling to implement some defenses because of the 2 percent performance overhead. And, like, you know, how dare you make my C code 2 percent slower, I want my Python code that's 100x slower, or whatever, so it felt very silly. And so I was like, well, what's something with this, like, very little work on right now that seems like it is going somewhere, and I, it was machine learning, and so I decided to switch and try and write a dissertation in this new machine learning space, um, and, yeah, and I just never stopped after that. Um,

Deirdre: 4:11

what, what year was this when you were actually switching over to

Nicholas: 4:13

yeah, I started to machine learning in 2016.

Deirdre: 4:17

Okay.

Nicholas: 4:18

Yeah. Which also means of course you get Specter like the next year after I go away from, I was like, I was very quickly like, should I go back? Like, you know, there's all these fun new attacks now. But, um, I stayed the course for a little while and, uh, you know, I'm, I'm still following about this stuff and it's, it's all very cool. But, um, uh, yeah, so

Thomas: 4:38

I guess the thing in the back of my head is, were you this mathematically competent when you were working for us?

Deirdre: 4:42

Nicholas: 4:43

Uh, , um, yeah. Well, so, um, yeah, so I did my undergrad in, in math, in math and computer science.

Deirdre: 4:49

Oh.

Nicholas: 4:49

Um, and I got into computer science through cryptography, initially. Um, I, so in, yeah, in our high school, it was an IB diploma thing that like, it's kind of like, you know, you end up with like some final project, which is kind of like an undergrad thesis, but like, you know, a high school thesis, like worse in many ways. Um, and I did that on differential cryptanalysis.

Deirdre: 5:17

Wow. Ha ha

Nicholas: 5:19

okay. So like, you know, okay. It was like, uh, my, my S boxes were like four by four and whatever, but like, you know, so it was like a very small thing, but like, that's how I started in security. And then that's what got me to my advisor, Dave Wagner, who did crypt analysis, you know, many years ago. And he ended up teaching my undergrad security class. And that's how I ended up getting to getting into research because, you know, he had a bunch of cool stuff that he was doing. And so, yeah, that was sort of the, the flow of that. And so I've always,

Thomas: 5:45

your advisor for your doctorate was Dave Wagner?

Nicholas: 5:47

Yeah. Yeah.

Thomas: 5:51

That's pretty cool. Um, so I guess like an opening question I have for you, right, is, um, So I'm, you're highly mathematically competent, and I'm highly mathematically incompetent. Um, and like one of the big realizations for me with attacking cryptography, which was like a thing that we did a lot of at Mata Sana, right, uh, was the realization that like a lot of practical attacks We're nowhere nearly as mathematically sophisticated as, like, the literature or the foundations of the things that we were attacking. Like, you needed basically high school algebra and some basic statistics, right? Does that carry forward? Is that also true of attacking LLMs? Or do I really need to have a good grip on, like, you know, gradient descent and all that stuff?

Nicholas: 6:33

so, you probably need a little bit more of some things, but I think just the things that you need are different. So, you know, it's, I think, probably, first course in calculus, first course in linear algebra, for like, the foundations, and People, okay, there, of course, as with everything, there are lines of work on like formally verified neural networks that like are very, very far in the math direction that have proofs I don't understand. And, you know, there's lots of privacy theory that's very far in the like proof direction with differential privacy that I don't understand. But like to do most of the actual attacks, I feel like in some sense not understanding the deep math is helpful. Because so lots of papers will say like, here is some justification. why this thing should work, and it has like complicated math, and then they actually implement something, and the thing that they implement is like almost entirely different than what they're saying that actually it's doing, and so like in some sense like reading the paper is almost harmful because like the code is doing something like whatever it's doing, and you just like look at the the actually implementation, and just like you need to understand, yeah, some very small amounts of like The gradients are zero and zero with floating point numbers is challenging as an object And so you should make the gradients not zero and like, you know There's some a little bit of math which goes into making that happen, but it's not like some deep, you know PhD level math thing that's happening here

Thomas: 8:00

So tell me if I'm crazy to kind of summarize the direction that you're working in this way. Um, like one of the first things you got publicity for was an attack where you were able to sneak instructions into Alexa or whatever, like just doing it, it felt like almost stenographic what you were doing there. Right. And then there's a notion, I think that's pretty well understood, well known of, you can poison a model by giving it, like when things are scraping all the images, trying to build up the image models or whatever, there's poisoned inputs you can give there that'll fuck the model up, right? And then. After that, there are attacks on those systems, right? So if people try to poison models, you can do things to break the poisoning schemes. That's a third kind of class of attacks there. And then finally, like the really hardcore stuff from what I can tell is, so like for all these models, they've been trained with basically the entire internet times three worth of data, right? That's like the, That and the huge amounts of compute that like OpenAI or whatever have are the the moat for these things, right? And the whole idea of these systems is you're taking all that training data and all that compute that's running all that backpropagation or whatever it's doing, right, and it's distilling it down to a set of weights. And the weights are obviously much, much smaller than the corpus of training data there. Um, small enough that you could exfiltrate it and then you would have The OpenAI model right there. And so that last class of attacks I'm thinking about are things where you have a public API of some sort, some kind of oracle essentially, to a running model, and you can extract the weights. So they haven't revealed them to you, it's not like a llama model or whatever where you downloaded it to your machine, it's running server side and like it's the crown jewels of whatever they're doing there. And there are things that you can infer about those models, um, just by doing queries. Is there another major area of kind of attacks there?

Nicholas: 9:46

Yeah, this is the main ones that are on the security side the one other thing that I do a little bit of is the privacy angle which tries to say You So yeah, so as you said, you have lots of data going into these models. Suppose that you have the weights to the model, or you have API access to the model. What can I learn about the data that went into the model? And for most of the models we have today, this is not a huge concern, because they're mostly just trained on the internet. But you can imagine that in the past people have argued they might want to do this and they don't do it for privacy reasons. Now, suppose you're a hospital and you want to like train a model on a bunch of patients medical scans and then release it to like any other hospital so that they can do the scans in, you know, some remote region that doesn't necessarily have an expert on staff. Like this, this might be a nice thing, you know, absent, you know, other utility reasons that might, you might not do this, but this might be a thing you would want to do. It turns out that models trained on data can reveal the data they were trained on, and so some of my work has been showing the degree to which it is possible to recover individual pieces of training data in a model, even though the model is like 100x or 1000x smaller than the training data. For some reason, the models occasionally pick up individual examples, and so this is like maybe the fourth category of thing that I've looked at.

Thomas: 11:05

This is a thing I was thinking about when I was looking at your, your, your site is. I think, like, the natural thing to think about when you're thinking about attacks against AI models is, um, what can you do against a state of the art production model, um, where state of the art means we're at, like, you know, 4. 0 or whatever is past 4. 0, and like, production means we have a normal API to it. We're not getting the raw outputs of it or whatever, right? Um, whatever the constraints are for actually hosting these things in the real world. But like you have a A presentation from several years ago, which I think was at Crypto, um, which was about an attack that was extracting hidden layers from, and it's like, it's a simple, like, neural network with, you know, ReLU activation or whatever, um, and like, it's, it, it sort of seems like compared to a, you know, a full blown LLM model, like a toy problem, right? But then that's part of that is that we've kind of forgotten that like until just a couple years ago This was all AI stuff was individual Constructions around you know kind of basic neural network stuff and that stuff still happens for you know application specific problems So the ability to attack a system like that if you can pull Confident like people will build classifiers off of confidential information for the next 10 20 years, right? They'll still happen

Deirdre: 12:16

And that's definitely like a business, like, selling point of like, especially when they're trying to, you know, if they localize it on a local device, and they make it small enough and trainable enough, like, you know, hopefully, quote, hopefully, nobody but you is querying that model that you've trained on your local private data anyway, But maybe you sync it, maybe you back it up, like, you know, maybe you want to allow your cloud provider to train on your private data store and it's queryable from anywhere, quote unquote, including them because they want to help you and be, you know, proactive. And then the cat's out of the bag if you can extract the training data or even the weights from these models trained on your private data.

Thomas: 12:59

So that's like a fifth way to look at this thing, a fifth, a fifth, you know, class of attacks there is. Like, your natural way of thinking about an ML model might be, like, it's a hash function of the entire internet. It boils everything down to, and your research is basically, one of your research targets is basically, absolutely not, right? It's not a uniform random function

Nicholas: 13:15

No, okay, so yeah, exactly. It's not even, I mean, It does behave like a hash function in many cases, but, like, there are cases in which it does not behave this way, and, you know, this can matter if it's a private example. And, you know, even in cases of models trained on all of the internet, you can run into some of this where, um, let's suppose you have a model trained on all of the internet, and at that point in time, someone had put something on the internet by accident. And then realize like, oops, I don't want that to be there. I should take that down. And then they take that down, but the model stays around, and now you can query the model and recover the data from the model and only the model. Um, this is, you know, you have remained, the leak has remained, um, you know, because you have, have trained in this meantime. Um, and so like, there are still some other concerns there for these other reasons too. But yeah, there's lots, there's lots of fun angles of attack here.

Thomas: 14:07

Okay, let's, let's zoom in on that fourth thing here, and then I think we can hit the poisoning and anti poisoning

Nicholas: 14:12

Sure. Yeah.

Thomas: 14:13

because I think it's super interesting, but also that paper is, Um, less headache inducing than the stuff we're about to get into, right? So um, yeah, I mean, starting at like, so, you know, your, your most recent, your most recent research that's published is on actual production models, like actually targeting things that are running in OpenAI or whatever, right? But like that, that model problem from earlier of directly extracting hidden layers, right? Um, if I know enough to like, you know, reason about cryptography engineering, get, get me started along, yeah.

Nicholas: 14:43

Okay. So, so here's the motivation. Um, so let's suppose you're a cryptography person. Um, we have spent decades trying to design encryption algorithms that have the following property. For any chosen input, I can ask for an encryption of that, and then make sure that you can learn nothing about the secret key. This is a very hard problem. Yeah, okay, nothing, whatever, yeah, okay, subject to math stuff about whatever, um, yeah, okay. Now let's take the same problem and state it very differently. Let's imagine a person who has an input, they're going to feed it to a keyed algorithm, a machine learning model, they're going to observe the output, and I want to know, can I learn anything about The key thing in this case, the parameters of the trained model. It would be wildly surprising if it happened to be the case by magic, that this, that this machine learning model happens to be like a secure encryption algorithm, like the way you like, you couldn't learn things about the weights. Like that would just be like. wildly surprising. And so, you know, proof by intuition, like, it should be possible to learn something about the parameters. Now, of course, okay, what does it mean when you have an actual attack on a crypto system? It means, you know, like, with probability greater than 2 to the minus 128 or whatever, I can learn something about some amount of data, whereas for the parameters of the model, I actually want to, like, actually do, like, a full key recovery kind of thing. So they're very different problems I'm trying to solve. Um, but this is, like, the intuition of why it should be possible. Okay, now let me try and describe very, very briefly How the attack we do works, which is almost identically an attack from differential cryptanalysis. And what we try and do is we feed pairs of inputs that differ in some very small amount through the model, we can trace the pairs through, and we can arrange for the fact that in these machine learning models you have these neurons that either activate or don't, and you can arrange for one input to activate and the other one to not activate, And then, it's actually sort of like a higher order differential, because you need four, because you want gradients, one on either side of the neuron. And, and then as a result of this, you can learn something about the specific neuron in the middle that happened to be activated or not activated, and you can learn the parameters going into that, that neuron. And in order to do this, you have to have assumptions that are not true in practice, like I need to be able to send floating point 64 bit numbers through, which no one allows in practice, and the models have to be relatively small, because otherwise, just like, floating point error screws you over, and like, a couple other things you can't do, like, that real models don't have, but like, under some strong assumptions You can then recover mostly the entire model, and this is the paper that we, yeah, we had at Crypto, um, a couple of years ago.

Thomas: 17:26

So, just to, just to kind of bring us up to speed on the vocabulary here, right, so we're thinking about like a simple neural network model right here, like a series of hidden layers and an input and then it's outputs, um, like each of those layers that you're trying to recover is essentially a matrix and the parameters for each of the things that each neuron is basically a set of weights.

Nicholas: 17:49

Yeah, so, yeah, sorry, so parameters and weights people generally mean roughly the same thing. So what's going on here is each layer has a bunch of neurons, and each neuron, each neuron sort of has a connection to all the neurons in the previous layer. And so each neuron is doing something like a dot product with all of everything in the previous layer, And so each neuron is doing something like a dot product with all of everything in the previous layer. And you're doing this for every neuron in the layer, and so that's where you get the matrix, is because you're doing a sum of all these dot products.

Thomas: 18:13

And when we're thinking about recovering parameters to the model, what we're trying to do is recover those weights. Okay,

Nicholas: 18:18

I was gonna say there's also these things called biases, um, which are why so parameters are technically the combination of the weights and the biases, but in practice the biases are like, you know, this O1 term that doesn't really matter too much.

Thomas: 18:30

But, but a bias determines whether or not a particular neuron in a hidden layer is gonna fire.

Nicholas: 18:35

Uh, yeah, like you sort of, the, the actual, what the equation for the neuron does is it multiplies the weights by everything in the previous layer and then adds this constant bias term,

Thomas: 18:45

And that constant bias term is independent of the input. Is the idea there? That's why it exists. That's why it's not just a factor, a factor of the weight, right?

Deirdre: 18:53

this is, this is also the stuff that's like how you tweak your, you know, deep neural net, your, your, your model to be different than any other. These little biases in these

Nicholas: 19:04

and the weights,

Deirdre: 19:05

yeah, yeah, and the weights, like, but these are all the parameters that actually like make

Nicholas: 19:09

Yeah, this is what, when you, when you're training your model, this is what you're adjusting, you're, you're adjusting the weights and the biases.

Thomas: 19:14

And as an intuition for how these actually work, right, like, um, it feels a little similar to the DESX boxes, right, where, um, like, you're wondering where the hell these weights come from. And the reality is they start out random, and they're trained with inputs, with like labeled inputs or whatever the training method is, right? Um, and you're basically using stochastic methods over time to figure out what the, the set of weights are. And there's a huge number of weights in a modern model, right, um, that optimize for some particular output. With the S Boxes, it was optimizing for non linearity, and for this, you're optimizing for correctly predicting whether a handwritten letter, you know, number 3 is actually 3, or whatever it is that you're doing there.

Nicholas: 19:51

Correct, yes, and you sort of, you go through some process, you train them, and then someone sort of hands them down to you and says, here are the, here are the weights of the thing that works.

Thomas: 19:59

And like, okay, so I guess I have two questions, right. First, I'm curious for the model attack, for when we're directly extracting model weights, right, for the toy attack is what I

Nicholas: 20:09

Sure, yeah,

Thomas: 20:10

right? what is unrealistic about that? What constraints would you have to add to that to be in a, you know, a more realistic attack setting? And then second of all, just what are the mechanics of that attack? What does that look like?

Deirdre: 20:21

Mm hmm.

Nicholas: 20:21

okay. Um, so let's, so what I need to do is I need to be able to estimate the gradient of the model at arbitrary points. So for this, I need a high precision math evaluated in the entire model in order to get this good estimate of the gradient. And in practice, yeah, this means I need like floating point 64 inputs, floating point 64 outputs, I need The model to not be too deep or just the number of floating point operations start to just accumulate errors. I need the model to not be too wide, or you run into like problems that just like very sort of like, just like because of how the attack works, you just start running into like compounding errors. You also need to be able to have a model that is only using a particular kind of activation function. So after you do these matrix multiplies, If you stack matrix multiplies, this is just linear layer on top of linear layer, you need some kind of non linearity. And, you know, in cryptography you have these S boxes or whatever, and in neural networks you just pick some very simple function like the max of X and 0. And this is called the ReLU activation function, and the attack only works if you're using this particular activation function. And modern, you know, language models use all kinds of crazy, like, sui gelu, which is some complicated, who knows even what it has, like, some sigmoid thing in there, and it's like doing crazy stuff. And it doesn't work for those kinds of things because what the mechanics of the attack need is I need to be able to have neurons that are either active or inactive.

Deirdre: 21:53

Okay.

Nicholas: 21:53

And as soon as you have this, like, activation function that's doing some like, you know, sigmoid like thing, it's no longer like a threshold effect of either it's active or it's inactive. It's now like, you know, a little bit more active or a little bit less active. It's sort of much more complicated. So that's when the attacks break down. And so yeah, so for these attacks that we have, they're, they're very, for, for the full model extraction, these attacks we have, they're, they're very limited in this way. Um, I will say one thing which is that last year, we wrote an attack that does work on big giant production models. If all you want is one of the layers, so it can only get you one layer, it can only get you the last layer, but if you want the last layer it does work on production models and we did implement this and ran it on some OpenAI models and confirmed that, so we worked with them, I sort of stole the weights, I sent them the weights, I said is this right and they said yes, um, and so you can do that but like it only worked in that one, for that one particular layer and it's very non obvious that you can extend this, um, past multiple, past multiple layers.

Thomas: 22:56

Were they, by the way, were they surprised? What was their reaction?

Nicholas: 22:59

Yeah. no, I mean, they were surprised, yes. The API had some weird parameters that were around for historical reasons that maybe were bad ideas if you're thinking about security. And they changed the API to prevent the attack as a result of this, which sort of, I think, is a nice, um, yeah. I mean, this is how you know when you've succeeded as a security person, you've convinced. The, the non security people that they should make a breaking change to their API, which like, loses you some utility in order to gain some security. And I feel like, you know, this is a nice demonstration that's like, you, you've won as a security

Deirdre: 23:33

Okay. Cause I was just about to ask, like, what, what did you prompt inject? Like, what did you say in human language, uh,

Nicholas: 23:41

Yeah, so, so, so you

Deirdre: 23:42

the last layer?

Nicholas: 23:43

yeah, so, so you don't, so again, I, I'm not, I don't talk to the model at all like a model. I think of it still like a mathematical function. And, you know, it's,

Deirdre: 23:50

Nicholas: 23:51

Deirdre: 23:51

bop, boop, boop, like zero, zero, one, zero, zero, one.

Nicholas: 23:54

yeah, you, you, all you, this one was like a known plain text attack kind of thing. And like, I just like, I needed to collect a bunch of outputs. And then you do some singular value decomposition, something, something, something, and like weights pop out. And yeah, magic happens for only one layer, but like, it kind of works.

Thomas: 24:11

This is not the logit bias thing, right?

Nicholas: 24:14

This is the, yeah, logit bias, which is why that no longer exists on, on APIs. Well,

Thomas: 24:18

killed logit bias?

Nicholas: 24:20

I like, I have killed logit bias. I killed logit bias and, um, log props at the same time.

Thomas: 24:26

We should talk about logits real quick, because it seems like one of the big constraints that you have attacking these systems, right? Is that, like, so you have this series of matrix multiplies and biases and sigmoid functions or ReLUs or whatever the hell they're using, right? And it all ends up with, you know, a set of outputs, right? And those outputs get normalized to, like, a probability distribution function. Right? You just run some function over it, everything is normalized between 0 and 1, everything adds up to 1. It's a probability distribution, right? Those are, those are the logits.

Nicholas: 24:55

Those are the probabilities. And then you take the log of that. And you get logits.

Thomas: 25:00

And those

Nicholas: 25:01

do this.

Thomas: 25:02

the logits are raw outputs from the model. But they're not what you would see as a, a normal person would see as the output of the model.

Nicholas: 25:08

Correct. So, for like a language model, what happens is you get these probabilities, and then each probability corresponds to the likelihood of the model emitting another particular word. So, you know, hi, my name is, if it knows who I am, the word Nicholas should have probability like 99%. And, you know, maybe it's gonna put something else that's a lower probability, maybe, like, probably the next token would be Nick. Because, like, it's, they have some semantic understanding that, like, this is the thing that's associated with my name, and then probably the next token is gonna be something, you know, is, like, another adjective or a verb or something that, like, describes. Anyway, so this is the way that these models work. And usually you don't see these probabilities, the model just picks whichever one is most likely and returns that word for you and hides the fact that this probability existed behind the scenes. And so when people say that language models are stochastic or random, what they mean is that as a person who runs the model, I have looked at the probability distribution and sampled an output from that. And then picked one of those outputs. Like, the model itself gives you an entirely deterministic process. It's just I, I then sample randomly from this probability distribution and return one word, and then you repeat the whole process again for the next, for the next token and the next token after

Deirdre: 26:18

Mm hmm.

Thomas: 26:20

Right. And so like, it's not, it's not a typical API feature to get the raw output. It's usually, but there was an open AI, there was an open AI feature where you could set manually biases for particular tokens or for tokens.

Deirdre: 26:34

Okay.

Nicholas: 26:35

so because of, because, so the OpenAI API was created when GPT 3 was launched. When this was not a thing that, like, was like a production tool. This was like a research artifact for people who wanted to play with this new research toy. And one of the things you might want to do is, suppose that I'm asking the model yes or no questions. And suppose that you can't, like, you know, say, you should be equally likely to say yes or no or whatever. One thing you can do is you can ask the model a bunch of yes or no questions, and then realize, like, hey, the model's giving me yes 60 percent of the time, no 40 percent of the time. You can just, like, bias the other token to, like, be a little more likely, and then now you can make the model be 50 50 likely to give you yes or no. And so for that reason, and others like it, people had this logic bias thing that would, uh, Let you shift the logits around by a little bit if you wanted to. And no one really uses it that much anymore, but because in the past, it was a thing that could be done. It was just like hanging around and

Thomas: 27:34

Okay, tell me I'm wrong about this, but like, so the, so you can't get the, you can't generally get the raw outputs, and you would as an attacker like to get the raw outputs for these models. And the, the logit bias API feature was a way for you to, with active, like, Oracle type queries, approximate getting the logits out of it.

Nicholas: 27:53

in the limit, what you could do is you can. Say, I'm going to use the logic bias and I'm going to do binary search to see what's, what value I need to set of this token in order for this to no longer be the most likely output from the model. And then you could like, Do this sort of token by token, so initially I say, you know, my name is, and the model says Nicholas, and I say, okay, great. Set the logit bias on the value Nicholas to like negative 10, so this is now less likely, and I say my name is, and the model still says Nicholas, and so I know now that the difference between the token Nicholas and whatever comes next is at least 10. And so then I say, okay, what about 20? And then now it becomes something else, and I say, okay, fine, 15, and then 15 is back to Nicholas. And I could do, I could, you could do this, and you can recover the values for each of these. Now, okay, it turned out they also have something else that's called TopK, or logprobs for TopK, lets you actually directly read off the top five logits. But only the top five. So what you can do is you can then just read off the top five, and then you can logit bias those five down to negative infinity, and then new things become the top five, and then you take those and you push those down to negative infinity, and then you can repeat this process through that way. And so what OpenAI did is they allow the binary search thing to still work, because it's a lot more expensive, but they say you can no longer do log probs and logit bias together. And so you can get one or the other, but not both. It's just a standard security trick where, you know, you don't want to break all the functionality, but you can do this one or that one.

Thomas: 29:24

And you see why I love this, right? Because this is almost exactly like running like a CBC padding Oracle attack against

Nicholas: 29:31

Yeah, exactly.

Thomas: 29:31

Like it's the same attack model.

Nicholas: 29:34

Yeah, no, this is like, you see this all the time in these security things where, yeah, it's, yeah, there's, People sometimes, you know, ask like, you know, how did you come up with this attack? And it's like, well, because you have done all the security stuff before that, like, you know, it's like, I'm not, there's no new ideas or anything. You just like, you look at this idea over there and you go like, well, why can't I use that idea? And this is the new domain. And then you try it and it works.

Deirdre: 29:56

It's new enough.

Thomas: 29:57

so for like the OpenAI, like OpenAI type production models, from the last paper I read of yours, it didn't seem like you could do like a lot of direct weight extraction from it the way you

Nicholas: 30:06

No, yeah, just one layer.

Thomas: 30:08

but the, the Logit Leak thing that they, they came up with, right, that API problem was giving you directly the last layer.

Nicholas: 30:15

Correct. Yeah, but yeah, that's what was going on there.

Thomas: 30:18

So I guess, I don't know. I still don't have a, I still don't feel like I have much of the intuition for, like, what the simplest toy version of this attack looks like. For, not, not with the, not with the top K logic bias thing. That, that I get, which is awesome, right? It's just,

Nicholas: 30:34

sure.

Thomas: 30:35

but in the best case scenario for you, what are you doing?

Nicholas: 30:37

Okay, yeah. So, so actually the techniques that we use for this one on the real model are very, very different than the techniques that we use when we're trying to steal an entire model parameter for parameter. They, like, almost have nothing to do with each other. Um, so which one are you asking about? The real model one or the small model one? Okay. Sure. So let me, let me try and describe, um, what's going on here and I'll, I'll check that things are okay with you. So, um, let's suppose I want to learn the first layer of a model. Okay. And so what I'm going to do, um, and let's suppose this model, okay, maybe the very simple case, let's suppose this model is entirely linear. So there's, there's like zero hidden layers, like input. Matrix multiply output. Okay. How, how could I learn what the weights are? What I can do is I can send zero everywhere and one in one coordinate.

Thomas: 31:33

Yep,

Nicholas: 31:33

And I

Thomas: 31:34

exactly how you would attack like a hill cypher or whatever. You

Nicholas: 31:36

yeah, you

Thomas: 31:36

a matrix invert.

Nicholas: 31:38

yeah, exactly. And so you just look at the, this value and like, you can read off the weights. Okay, so this, this works very well. Let's now suppose that I gave you a two layer model, so I have a single matrix multiply, I then do a linear, I have my non linearity, and then another matrix multiply. Okay. Now, let's suppose that I can find some input that has the property that one of the neurons in the middle is exactly zero. That's the activation at that point. And what this means, so remember that the activation function for this ReLU activation has the maximum between the input and zero. So, this is like, it's a function that, you know, is flat until it's, well, it's negative, and then it becomes, like, this linear, like, y equals x function that goes up after that. Okay, so, what I'm going to do is I'm going to try and compute the gradient on the left side and the gradient on the right side. This is just the derivative in higher dimensions on the left side and the derivative on the right side. And I'm going to subtract these two values, and it turns out that you can show that the difference between these two values is exactly equal after some appropriate rescaling to the values of the parameters going into the model. I'm going to pause and let you think about this for a second and then ask me a question about it.

Thomas: 33:04

We're going to pause and let Deirdre ask a

Deirdre: 33:06

No, I don't have a clue. I don't have one. I'm just like, okay? Okay.

David: 33:12

I used to always joke, I was like, I stopped learning calculus because it's not real. In the real world, we operate on discrete things. Have you ever seen half a bit? But you're making a very strong case here that I should have paid more attention to calculus.

Nicholas: 33:28

You know, so this is what makes things, you know, harder here in this case compared to like cryptography where, um, yeah, we can operate on, you know, you know, half a bit of things that we have, we have these gradients we can feed through. If you had to do the same attack on discrete inputs, it would be very, very, very hard. But because we can do this, it becomes a lot easier.

Thomas: 33:47

I'm fixed on two things here. One is that you were describing the pure linear attack of this, and I'm like, Oh yeah, it's exactly like, you know, reversing a purely linear cipher. Which like, is the worst thing for me to possibly say here, because like, So like, that, like, like attacking a purely linear cipher is also like a classic CTF level. It's like, here's AES, but we screwed up the S boxes, so it's purely linear, right? And how do you attack that, right? And it's like, I know that because I'm that kind of nerd. Like, really specifically practically exploitable dumb things like that. But there's no reason why anyone would have an intuition for that. So it's like, well, you put the 1 here and the 0 there, and I'm like, stop talking. Move on to the thing I don't understand, right? Um, but, but, the other thing here is just, um, like, I was also struck by, um, Again, this is not that important to the explanation that you have, but I'm just saying like if you take the purely linear model and it's like well that's trivially recoverable, that's also the intuition for how post quantum cryptography with lattices works, right, is like you take a problem that would be trivially solvable with like Gaussian elimination, right, and you add an error element that breaks that simple thing and then there you go.

Deirdre: 34:47

But also you just keep adding layers and layers and layers and layers and layers and dimensionality and the error and then all of a sudden you've got, you know, n dimensional lattices and you've got post quantum learning with errors or shortest vector problem or anything like that. So. Okay.

Nicholas: 35:09

talking about these attacks, you know, so in practice, people often run these models at like, you know, 8 bits of precision And so a very realistic argument for like why these attacks that I'm talking about don't apply for these things is because that 8 bits of precision is essentially just in learning with errors sort of world where like you're just like adding a huge amount of noise every time you do, you compute a sum

Deirdre: 35:29

But it does work. And

Nicholas: 35:31

But it doesn't work so it's, yes,

Deirdre: 35:33

layer, which

Nicholas: 35:35

in that okay, so that's because they were using very different attack approaches.

Deirdre: 35:38

Okay.

Nicholas: 35:40

So that's the important thing to understand here, is like, the two attacks achieve similar goals, but the methodology has almost nothing to do with each other. Um, then, and so at the last layer, it's because what we're doing is we, there's only one linear layer that we're trying to steal. And because it's only one linear layer, I can just, I can do something to arrange for the fact that all I need to solve is a single system of linear equations, I compute the singular value decomposition, all the noise is noise, which is there correctly, but you can just average out noise as long as you're in one layer, as soon as, you know, you're If you imagine you're doing learning with error, but you, you could had only a single linear thing and you could just average out all the Gaussian noise, right? Like, it was no longer a hard

Deirdre: 36:20

No, yeah, it's just easy.

Nicholas: 36:22

right. And so this is like the thing that we were doing is, and this is why we think it like basically doesn't work at two layers is because you have these, this noise of things you don't know, and you have to go through a non linearity and then like all bets are off.

Thomas: 36:34

This is why the, the open AI attack, this is why the logic bias plus top K thing doesn't work over multiple layers. Is that, but the, the toy problem attack where you're trying to recover multiple, uh, where, where you're trying to recover weights directly, kind of like the, the, the, the forward version of the attack going from inputs through it, right? That extends to arbitrary numbers of

Nicholas: 36:53

That goes arbitrarily deep, um, you know, as long as you're working in high precision and you don't have sort of these, these errors accumulate, which is, you know, so, so our attack works up to, I don't know, like eight ish layers, and after that, just like, GPU randomness just starts to, like, you know, have, give you a hard time.

Deirdre: 37:10

Hm. Mm-hmm

Thomas: 37:11

Is that like a floating point thing or is that like a GPU thing?

Nicholas: 37:13

Now, I mean, it's a floating point thing, but, you know, you want to run these things on GPUs, and so, yeah.

Thomas: 37:17

Gotcha. So, like, generally, like, the forward attack doesn't seem super practical. It's more like a theoretical finding.

Nicholas: 37:24

Yeah, I mean, this is the, it's the kind of attack that, you know, that's why I was at Crypto, right? Like, you know, we, we say, here's a thing that you can do, like proof of a proof of concept. And like, like our initial attack was sort of even, our attack was exponential in, in the, in the width of the model. And, um, that was pretty bad. Um, uh, and, um, among other people, Adi Shamir showed how to make a polynomial time in the width. Um, which, yeah, was a fun thing.

Thomas: 37:50

The width being the number of layers,

Nicholas: 37:51

No, sorry, the depth is the number of layers, the width is how wide each layer is, yeah, the number of neurons in a

Deirdre: 37:57

yeah. How many neurons in each layer?

Nicholas: 37:59

Yeah. Imagine that the thing is going top down. And people sort of imagine the depth being the number of layers you have and the width being how many neurons make the layer wide.

Thomas: 38:09

So, is there stuff that you can do in the forward direction with a more realistic attack setting?

Nicholas: 38:14

Uh, not that I know of.

Deirdre: 38:15

Oh, okay.

Nicholas: 38:17

Yeah, um, yeah.

David: 38:19

Okay, so that's sucking, like, weights out of the model. We've, uh, you've also done some getting the training

Nicholas: 38:27

talk about something else now.

David: 38:29

Uh,

Deirdre: 38:30

Oh,

Nicholas: 38:31

That's the

David: 38:31

well, well, let's fix one input and switch to the other input, which is like, how, how do we get, um, um, and in what cases does this training data, um, sort of leak directly, um, out of the model?

Deirdre: 38:43

How could we suck the training data out of the model

Nicholas: 38:46

so, so let's imagine a very simple case. Let's suppose that you're training a model on, like, all of the internet, and it happened to contain the MIT license. Okay, yes. Like a billion times. And then you say, you know, this software is provided, like what, when like, you know, they ask them all to continue what's, what's it gonna do, right? Like what the training objective of a language model is to emit the next token that is most likely to follow. What is, what is the next most likely token? After all caps, like this license, this software is provided is the MIT license. And so like this is like what we're doing with a training data extraction attack.'cause it's no longer some kind of fancy mathematical analysis of the parameters. It's using the fact that the thing that this model was trained to do was essentially just to omit training data. It like, incidentally, happens to be the case that it also does other things like, you know, generate new content. But like, the training objective is like, maximize the likelihood of the training data. And so that's what, in some sense, you should expect. And, you know, this is Okay. So, so that's like the, again, that's the intuition. Like this is not, of course, what actually happens because these models behave very differently, but this is like morally why you should believe this is possible in the same way that like morally why you should believe model stealing should be feasible. Um, now the question is how do you actually do this? And so in practice, the way that we do these attacks is we just prompt the model with random data that like, I don't know, just like a couple of random things and just like generate thousands and thousands of tokens. And then, when models emit training data, the confidence in these predictions tends to be a lot higher than the confidence in the outputs when models are not giving you training data. And so you can distinguish between the cases where the model is emitting training data from the cases when the model is generating novel stuff with relatively high precision, and that's what this attack is going to try and

Deirdre: 40:40

And what's, what's the diff between the confidence between like, I'm, I am regurgitating training data, I am 99 percent or versus not returning data, it's like 51%.

Nicholas: 40:52

so it depends exactly on which model and on what the settings are for everything, but it's like the kind of thing where, within any given model, you can have, I don't know, like, 90 plus, 99 percent precision at like, you know, 40 percent recall or something. So like, you know, as long as you, like, there exists a way to have a high enough confidence in your predictions that you can get, you can make this thing work for some reasonable fraction of the time. Um,

Deirdre: 41:17

So generally across models, like if it's you, if it's regurgitating training data, it's like in the high nineties of confidence. And if it's not, it's not.

Nicholas: 41:26

Okay, sorry, what I meant is, so, so yes, um, this is usually true, but this, I guess what I was saying was, um, okay, so, so, you have to have a reference, so what you, what you want to do is basically you're, you're doing some kind of like distribution fitting thing, where like you sample from the model a bunch of times, and then you have one distribution of what the outputs look like for the normal case, and then you have another distribution which what the model, what the distribution looks like when it's doing training data, And then you have a bunch of challenges, where like, occasionally, or, you know, the sequence 1, 2, 3, 4, 5, 6, 7, 8, 9, whatever, has like, you know, very high probability on the next token, but this is not memorized. So what you want to do is you want to normalize the probability that the model gives you according to some reference distribution. of like how likely is the sequence actually. And so one thing you can do is you can normalize with respect to a different language model. And you can like compute the probabilities of one, compute the probability on the other, another model divide the two, and if one model is like unusually confident, then like it's more likely to be training data because it's not just some like you know boring thing. Another thing we have in a paper is just like compute the Zlib compression entropy, and just be like if Zlib really doesn't like it, but your model really likes it, then like you know maybe there's something going

Deirdre: 42:34

Uhhuh.

Thomas: 42:35

Okay, hold on, hold on. Let me, let me, let me check myself

David: 42:37

about this too.

Thomas: 42:39

You first, and I'll try to

David: 42:41

Now, I'm going to take us off on a compression tangent, so you should finish asking, uh, um, questions here.

Thomas: 42:47

so the attack model here is, I'm assuming in, in, in, in, in, in, And like the research that you're doing, the ATT& CK model is, I have the whole model. I have this big blob of weights or whatever. And the idea is that, that blob of weight should not be revealing sensitive data. Right? It's all just weights and stuff, and it's all very stochastic and statistical and whatever, right? And so what I'm going to do is, I'm going to feed the model, like, feeding through the model is basically just a set of matrix multiplications through each of those things, right? More than that, but not much more than that. Producing an output that is a set of predictions on what the next token is going to be, right? Um, so, what I'm going to do is, where normally I would give it very structured inputs, right? Which are going to follow, like, not predictable, but like a structured path through I'm going to give it random inputs, which it's not what it's designed to do, and then I'm going to look at the outputs, and I'm going to discriminate between shit that is just made up, that are just random collections of tokens or whatever, like garbage in, garbage out, versus like traces that will show up there with high confidence outputs that reveal I didn't give you any structure in, and you gave me a confident prediction, you know, prediction of this out. So that was there on the model, like you gave me that to begin with. And then it sounds like what you're saying is like one of the big tricks here is what is that distinguishing? Like how do you, how do you look at those outputs and say like this is training data versus not training data? Because Deirdre had asked like is it like a 50 to 90 percent thing, and it sounds like that's generally not the case. It's not like it literally lights up here's the training data, right? But one thing to do there is diff it against another language model, and see if you get like. Okay, so then I can just look at the differences between those two things and catch things that were singularly, not just probably singularly inputs to the one language model, but also just the way it was trained, right? So it could be training on both models, but like the way the training worked out in this particular random input, it just happens to, that's awesome.

Nicholas: 44:43

Yeah, exactly. This is exactly what's, what's going on. And, you know, and yeah, you can try and bias the, exactly which reference class you're using. And so, yeah, so the, another thing you can do is, yeah, just use, use zlib. I don't know if you want to go into this

Thomas: 44:55

We do.

Deirdre: 44:56

Yeah.

Nicholas: 44:57

go ahead, David.

David: 44:58

Um, yeah, so this, um, kind of, um, if you'll let me go off on a little bit of a tangent here. A while back I remember reading a blog post, I think it was from Matt Green, where he was like trying to do an introduction to cryptography and like, Um, random oracles, and he said he would ask his students, like, in his intro crypto class every year, or his intro grad crypto, of like, you have an amount, like, a truly random sequence, there is no way, like, to compress it, there's no way to store it, other than to, like, store the whole sequence. And then, like, some student would, he would say this in a way that some student would always, like, want to argue with him. And then like, that student loses, and this is like, kind of a fake setup because it's basically the definition of a random sequence is that you can't do that, so you can't really argue that you could if you've defined it out of the problem. Um, but this just kind of always sat in the back of my head. And when I first started seeing, like, attacks on ML, like pre LLMs, probably ones that you were doing, um, to be honest, um, Like, I, something kind of connected in my head where I was like, okay. Like, we have all this training data, and then we have like, kind of the entire world that like, maybe the training data is supposed to represent. And basically we're creating this model, and this model is like, morally equivalent to a compression function in my head, in the sense that like, we have a bunch of data and we want to have a smaller thing that then outputs all the other stuff, um, which means that like, there's some amount of like, actual information that exists in the training data and all of the like, data we're trying to represent. And like, this model can't possibly, like, if this model is smaller than that data, it's gonna lose information somewhere. And so, my mental model of ML attacks, like pre LLMs, was just like, well duh, like at some point, you're losing information, like, I don't understand the math, I'm glad that Nicholas is doing it, but like, if you're losing information at the model step, like, you're necessarily going to be able to find some input that's going to give you a wrong answer. Otherwise, you would have to, you know, store the entire random sequence. Um, is that like a legitimate way of thinking about things? And, and I bring this up when you mentioned Zlib, because like, you know, compressing a, a obfuscated binary and seeing if it compresses the same as the unobfuscated binary is like, I think this was a trick from Halvar Flake on like, how to do deobfuscation, like,

Nicholas: 47:25

no, I mean, this is, this is a valid, you know, entirely valid way of, of thinking about this as like, what's going on for these things. And yeah, I mean, this is like one of the intuitions you may have for why, why models have these attacks is they're necessarily this lossy function. yeah, there are caveats to this where this intuition, you know, breaks down in practice for some reasons where, like, you know, it looks like, okay, so maybe, here, let me give you one fun example of this. Train, you train your model on your machine. Whatever you want, whatever architecture, whatever data, mostly for you want. I train my model, my machine, same task, but like my model, my machine, whatever I want, how can I construct the input that makes your model make a mistake? I can construct the input that makes my model make a mistake and then send it to you and some reasonable fraction of the time, I don't know, like, you know, between like 10 and 50 percent of the time, it will fool your model too. If it was just the case that it was like pure random, like, you know, Points that just like, were just like happened because you got unlucky, the hash function hit a different point. You would not expect this to be, to work,

Deirdre: 48:29

Mm-hmm

Nicholas: 48:31

transferability property shows you that like, there was like actually a little bit of real actual signal here that like, there's actually a little bit of a reason why this input was, was wrong. And like, it's not just a noise function, but this, yeah, something going on here.

Thomas: 48:49

Are you satisfied, David?

David: 48:51

Yeah, um, but I would say, um, well clearly no, but the was no. Um, but, I don't know, maybe this is a way to kind of pivot into, like, a hardening against prompt injection in the sense of not in the, like, jailbreak, oh, like, ask my grandma to do it, or ask the prompt to do this thing for me because I don't have any hands, which I think is one that I saw in here. Your site before, but, um, I was just like trying to get models to be like more accurate in general, like at some point there, if you have a lossy function, you're going to always be able to find something that's going to give the wrong answer. Cause at the end of the day, we're just kind of computing a rolling average. Is that an accurate way of thinking about things?

Nicholas: 49:38

I mean, yeah, it's, it's, this is one of the reasons why I think, you know, these kinds of attacks are, are relatively straightforward to do is, yeah, you, you have some statistical function and it is wrong with some probability, and as an adversary you just arrange for the coins to always come up heads and then you win more often than you lose, I don't know.

Thomas: 50:01

want to get back to the idea of, like, what it takes to harden a model against these kinds of attacks, what it would take to, like, foreclose on them, just because I think that's another, like, indication I get of, like, the, you know, how serious or fundamental the attack is. And also, as an attacker, I like reading those things backwards and working out the attacks. But before we get there, I do want to hit, and this is totally, probably totally unrelated to the stuff that we were just talking about, but the two things that we were talking about, the two earlier The first two broad attacks that we came up with was poisoning models and then breaking poisoning schemes, which, um, you had a brief full disclosure spat over, um, which is another reason why I want to hit it, but, um, yeah, so, like, there's, there's, there's a general notion of, like, it'd be valuable To come up with a way to generate poisoned images that would break models because that would be a deterrent to people scraping and doing style replication of people's art and things like that, right? My daughter is an artist and she's hearing me saying this out loud and wanting to throw things at me, so, um, I, I think I, I understand the motivation for those problems, right? But, like, what does that look like in terms of, like, the, the, the computer science of the attacks and the defenses there?

Nicholas: 51:08

Yeah. Okay. So there's this problem of data poisoning, which maybe I'll state precisely, which is suppose that I train a model on like, you know, a good fraction of the internet. Someone out there. I don't know, is unhappy with you in particular, just wants to watch the world burn. Like, whatever the case, they're going to put something up that's bad. You train your model, including, among other things, this bad data. And as a result, your model then goes and does something that's, that's bad. And whatever the objective is, because dependent on whatever the adversary wants, and you know, how often it works is dependent on like how much data they can put on the internet and these kinds of things. And so this is a general problem that exists. It like, it's like, it's funny. Like it didn't used to be a thing that like, I really thought was a real problem because, you know, like back in the day, you know, you would train on like MNIST, like MNIST was collected by the U S government, like in 1990, the probability of like, they had the foresight to be like, some people are going to train handwritten digit classifiers on this in 20 years, we're going to like inject bad data into this so that when that happens, like, you know, we get to like do something that like, this is not an adversary you really should be worried about, but like. Then, you know, we no longer are doing that. Like, now, like, we literally just, like, scrape the entire internet and just, like, use whatever we can find and train on this. And so, like, now, poisoning attacks become, you know, a very real thing that you might have to worry about. And, yeah, okay, so, hopefully that explains the, like, setup for this. Now, what's the question you had?

Thomas: 52:36

How, how do I poison something?

Nicholas: 52:39

Okay, okay, yeah, yeah, yeah, okay, great. Um, yeah, okay, this is very, okay. So, let's suppose that I want to make, you know, some particular person's, uh, when I put in my face, that it like is recognized as, you know, this person should be allowed into the building, and like I think someone's gonna like train a model on the internet, and then like use this for some access control or something. What I would just do is like the very, very dumb thing of I'll upload lots of pictures of my face to the internet, and I'll put the, the, the, Text caption or whatever next to it is like, you know, I'm the president of the United States. And like, you just repeat this enough times and then you ask the model, who is this person? And it will say the president of the United States. Like you just like, there's no like clever engineering thing going on for most of these attacks. I mean, like it can be done. Like you can sort of, you can do a little bit of fancy stuff to make these attacks get a little bit more successful. But like the basic version is like very, very simplistic where you just, you repeat the thing enough times that it becomes true.

Thomas: 53:38

Okay, but that's, that's not what, like, the UChicago poisoning scheme is doing.

Nicholas: 53:42

Okay, okay, so this is, yeah, so this scheme is a different paper, where what they're trying to do is they're trying to make it so that suppose that I'm an artist, and I want someone to, I want to show people my art,

Deirdre: 53:57

hmm.

Nicholas: 53:58

because I want people to like, find my art and pay me, but I don't want to allow these machine learning models to be trained on my art, so that someone can say, generate me art in the style of, person, and then I don't have to pay the person because the model can already do that. Um, which is like, yeah, an entirely realistic reason to be frustrated with these models. And I think, you know, it will not just be the artists arguing this point in some small number of years, because like, they're like, sort of got unlucky to be the first people to encounter this problem. But I think this will become a problem for many other domains in the near term too. But, um, yeah, so what their poisoning scheme does, is it tries to put Slight modifications of these images so that I can upload them. So they look more or less good to other people, but so that if a machine learning model trains on them, it doesn't learn to generate the images in that person's style. So like, it looks good to people, but it does not, it does not, it cannot be used reasonably well for training of the training models.

Deirdre: 55:06

always look like a Jen knockoff. Ahhhh.

Nicholas: 55:15

it can't do that. Like it's almost as if I'd never trained on my data.

Thomas: 55:19

Is this, like, is, is this a situation where it's, it's kind of, like, reverse steganography, where, like, the style transfer thing is dependent on a really precise arrangement of the pixels in the image, right? And if you do things that are, like, um, psychometrically the same, like, it looks the same to me, but, like, the, the arrangement of the pixels is, is essentially randomized or encodes something weird in it, then, like, it's not going to pick up the actual And then would you break it the same way that you would break steganography, which is like, if you shake it a little bit, it breaks?

Nicholas: 55:49

yeah, that's, yes, yes, that's exactly, almost like it's like, so, so you don't just, so, so you, it's not so, so precise this way, but yeah. So the way that you add this noise is you like generate noise. Yeah. That's like this transferable in some way. So like it fools your models and you hope that because it fools your models, it will fool other models too. And you like do this only at the low level to like, like high order bits of pixels. And then, yeah. So what's the attack? Like literally the attack is shake it and then try again. It's like, you know, you, you, you take the image and you add a little more noise in some other way. Or some, in some cases, just like rerunning the training algorithm a second time can just like, you can get lucky. Like

Thomas: 56:30

But is that, is that practical? If you were OpenAI, would it be practical to build that into your ingestion pipeline?

Nicholas: 56:37

Yeah, okay, um, so, I think there's like two reasons why you might want to do this, as a poisoning thing. One is, I'm worried against someone who is like, accidentally scraping my data. Like, they want to play by the rules, but like, I've put my data on my website, and I want to make sure, and my website has like, you know, don't crawl this, whatever thing, but then someone else copies my images to their website, and the crawler crawls theirs, In this case, like, presumably OpenAI is going to play by the rules more or less. They're not going to, like, you know, do the extra little stuff to, like, make sure they can evade these poisoning people. And, like, I think in that case, you know, you've basically done more or less the right thing. But if you're worried about someone being, like, you know, I hate artists, you know, like, they have wronged me. Um, I am going to, like, intentionally create these models so that I can, you know, put the artist out of jobs. Then, like, this person you're not going to win against. And so the. The question is like, you know, who's the adversary? I think the adversary is most likely to be the opening eye adversary. The, but the problem with these schemes that we found for like many of these schemes, so, so people have done this in the past, not only for art, but for images of people's face that just like, they don't want people models to learn what they look like because you know, they're just like worried about surveillance kinds of things, which also again, entirely valid thing to be worried about. The problem with these schemes is that. The person who is adding the poisoning must necessarily go first, and they need to predict what model is going to be using to train on their data in a year. And when it turns out for many of these schemes that if you just change the underlying algorithm, that they no longer remain quite so effective at this poisoning objective. And just like If you get unlucky and it turns out that OpenAI discovers a slightly better method to doing this training that like they're not out to get you But like it just is a change to the algorithms Oftentimes these things no longer work quite as well. And so we've seen

Thomas: 58:37

I mean, it's sort of the same sense as, like, you know, to break, you know, GIFAR attacks, when people used to embed Java jars into GIFs and things like that. You would just, like, you would thumbnail and unthumbnail images or just do pointless transformations just because, like, it was some totally irrelevant other It wasn't because of anything about the image itself. It was just, like, part of your pipeline became, for best practices reasons, that an opening I could do basically the same thing for whatever reason without and break the scheme. What's your, like As a researcher, your intuition for how successful that avenue is going to be long term.

Nicholas: 59:11

Yeah, so I'm somewhat skeptical that this is the kind of thing that will work long term. I think this is the kind of thing that you can do in the short term and like, will maybe make a little bit of a difference. My concern with this line of work is, I don't want people There's, there's a kind of person who is worried about this kind of thing, and as a result will not upload their images to the internet, for example. And then someone tells them, I have a defense for you, which is going to work and will protect your images. You should upload your images with MyDefense, and therefore you will be safe. And then they do this, and then a year and a half goes by, the algorithms change, and then it turns out that they're now in a worse off position, because they otherwise would have been safer, but they relied on a thing that does not work. And this is, I guess, my primary concern with these kinds of things, is I think if you're in a world where you're already gonna put it online, it literally can't hurt you. Like, it's just like, it's not going to make things any worse. But if you're in a world where you're a paranoid person and your next best alternative was not put it on the internet and now you've decided to do it because you trust this thing, now I think there are problems. And so I think this is the kind of stuff that people should use but be aware what they're actually getting and what they're not, and not just, yeah, use blindly.

Deirdre: 1:00:41

Yeah. As opposed to like, encode or whatever, render your art in this way, it is untrainable. It's more like, eh! It's untrainable. Yeah, it's like a best effort. And like,

Nicholas: 1:00:56

Yeah, and it's the kind of thing where,

Deirdre: 1:00:57

behind a Patreon or whatever.

Nicholas: 1:00:59

yeah, right, exactly, this is the kind of thing where, you know, almost like, you know, imagine that, you know, we're going to encrypt things, but everything is gonna be like limited to like, you know, let's say like a, I don't know, 70 bit key or something. Well, like, you know, someone wants it, like, they're gonna get it. It will delay them a little while, maybe, but like, it's like, good to do, like, it's strictly better to encrypt under a 70 bit key than not. But like, you know, it'll take someone who really cares a little while and they'll get the data back. And so like, if your alternative was, was not, then like, yeah, I guess do it. But if like, you know, if it like, is important to you that no one gets it, then we don't yet have something that actually adds significant security here.

Deirdre: 1:01:36

Yeah.

Thomas: 1:01:37

So if I'm building models, if I'm doing like production work on this at a very low level, at a place like, you know, Anthropic or OpenAPI, what is your work telling me I should be doing differently right now?

Nicholas: 1:01:49

Uh, what am I,

Thomas: 1:01:50

Okay, besides not exposing logits directly as outputs.

Nicholas: 1:01:55

Um, yeah, I don't know. Um, I think a big thing is like, the attacks we have right now are quite strong and we have a hard time doing defense stuff. I guess maybe this goes back to one of the things you were mentioning a little while ago, like, what, what, what, what works. Um, we don't have great defenses, like, even in the classical security setting, like, like the classical, like, you know, just make the image classifier produce the wrong output, like, things that we've been working on for a decade, like,

Deirdre: 1:02:21

Still.

Nicholas: 1:02:22

a, yeah, has a great talk, he has a slide where he talks about, um, okay, so, to a cryptographer, things work when it's like 128 bits of security. You know, and like, Cryptographer, like, you know, 126 bits of security, like, you know, like, run for the hills, destroy the algorithm, start again from scratch, like, okay, but like, okay, if, if someone were to do an attack on AES, like, that, like, worked for the full, sort of, full round, like, someone would start to, um, try and have another version at some point, like, you know, um, like, people would get scared, um, like, you know, small numbers of bits, okay, um, Okay, in system security, you know, you usually only have, I don't know, let's say like 20 bits, like, you know, stack canary, pointer authentication, something like this, like, you have some reasonable number of bits, and like, it's broken if it's like, you know, like five bits of security or something, like, you have to like, really get it down. Okay, in secure, in machine learning, when you have defenses that work, like, the best defenses we have that work in like image classifiers have one bit of security. The best defenses, like, they work half the time, and when you, when I have an attack paper, the attack paper brings it down to zero bits, like, it works every time. So, like, this is, like, the thing that we're, like, operating in. We're, like, you know, even if, like, the best defenses we have now were to be applied, like, it goes from, like, you know,

Deirdre: 1:03:33

tissue

Nicholas: 1:03:33

not working to working half the time. So, like, you know, we're, like, really in a space where, like, these things, the things that we have as tools are not reliable. And so I think, you know, the big thing people are trying to do now is to build the system around the model. So that even if the model makes a mistake, like the system puts the guardrails in place so that, you know, it just like won't allow the bad action to happen even if the model tells it to do the dumb thing or something like this. And this I think is for the most part the best that we have. There is hope that maybe with these language models There are smarter things you can do. Um, this is very early. I don't know how well this is going to go. But like, there is some small amount of hope that because language models are like, I don't, I hate to be like, smart in some sense, as the argument people make, like, they can like, think their way out of these problems or something like this. Like, this is like, a thing people have been

Deirdre: 1:04:31

that a human doesn't likely to see or something like that.

Nicholas: 1:04:34

Yeah, and so, right, but like, I think, you know, fundamentally, we, the tools we have right now are very brittle. And, at some point, maybe we'll fix the problem in general, but for right now, we're just trying to find ways of using them so that, um, you can, like, make them work in many settings, which, like, so for example, if you just don't use them in settings where when mistakes happen, bad stuff goes wrong, like, this is fine, like, this is, like, this is, like, all

Deirdre: 1:05:02

Nicholas: 1:05:02

right now, right? Okay, yeah, let's not do that. Um, you know, but like, um, you know, like, if I'm going to use one of these chatbots now, and like, I'm asking it to like, write some code for me, like, you know, either maybe it makes a mistake, or an adversary is trying to like, you know, go after me or whatever, like, if it's going to give me some code, I'm going to look at the output, I'm going to verify it, and then I'm going to run it. And like, this is like, it's not like going to be a, you know, a fundamental thing. But if instead, I was just like, you know, you know, Ask a question, pipe to language model, pipe to like pseudo bash. Like, you know, this would be bad. Like if you don't like, this was like the thing that you had in your, like somewhere in your, like that, like actually problems would go wrong. And so like, I think the thing people are doing right now is just like, well, just don't do that. Like just not take, let the models take action. But we are very slow, very, maybe more quickly than I would want moving into a world in which this is something that actually does happen. And these attacks start to, start to matter in these ways. And I think, you know, this is a big part of what we're trying to figure out is, At the very least, I would like to not be surprised by attacks. Like, like, this would be like the, like the, if you can't fix anything, like, you know, let's at least be in a world where the person makes a decision and says, I know this might be risky, I am going to do it anyway, as opposed to, here is a thing that I'm going to do, and I didn't realize I could be shooting myself in the foot, but I have. And like, at the very least, like, there is some acceptance of the risk, and a person has sort of weighed the trade offs and decided it is worth it, In some small fraction of cases that maybe they get attacked because maybe it turns out that like the value you get from doing this is just like very, very large and I am willing to pay the cost of fraud in order to have some other amazing thing.

Deirdre: 1:06:41

And do you think there's any correlation between the size of these models and the size of their training set on the likelihood of things going wrong? Like, if you have a much tighter target with a much tighter, like, type of capability and a much tighter, narrower training set, is it less likely to just, like, suddenly go off in, like, a random direction because, you know, because, for reasons or whatever?

Nicholas: 1:07:05

Yeah, no, so, so it, it is the case that larger things are more or less harder to attack, but it's not by like, Orders of magnitude. So like, you know, like the smallest handwritten digit classifier things we have, we still can't get that problem right. like we still can't better than humans classify handwritten digits. And like, like we, we ought to be able to figure this out, but like, like apparently like this is a thing. Okay. And so, and it, but it is easier to get that right then to like recognize arbitrary objects. But

Deirdre: 1:07:36

uh, are we likely to have models that are less likely to just do something we completely didn't expect by the narrower tailored models, uh, than the, like, we're trying to vaguely get AGI out of like the biggest model trained on everything that everyone has ever, all the knowledge of, you know, the entire human existence or whatever.

Nicholas: 1:07:55

My impression is that as we make these models bigger, the more things will go wrong. But there are people who will argue that the problem with the small models is they're not smart enough. And if the, like the big models, like have some, like some amount of intelligence and therefore like, they'll like be robust to this because like humans are robust. And so. You make it bigger and like it,

Deirdre: 1:08:18

Have you met humans? Not you, but them. Have you met humans?

Nicholas: 1:08:22

yeah, no, okay, okay, so, so they're not, they're not entirely insane,

Deirdre: 1:08:26

Yeah, but at the, yeah.

David: 1:08:28

tell the difference between you telling them instructions and you handing them a book

Deirdre: 1:08:33

Okay.

David: 1:08:34

like, know which ones are the instructions and which one is the book, right? And, like, models can't do that reliably,

Nicholas: 1:08:42

Right, and, yeah, so it's like, and so, for example, one of the things that people have been doing for these models that like, helps more than I thought that it would, is you have the model answer the question and then you have the model and then you ask the model did you do what I asked or like essentially and like the model goes again and it's like oops no I followed the wrong set of instructions like it turns out like like this sort of very very simple thing like oftentimes like makes it like noticeably better and like this like did you do what I asked is like something you can only ask like a sufficiently smart model like you can't ask you know a a handwritten digit classifier did you do what I asked like it's like not a meaningful thing it can do. But, uh, yeah, there's like, bigger models in some sense, like, can self correct in this way more often. And there's been some very recent stuff out of OpenAI, where they have some, these like, think before you answer models that they're claiming is much more robust. Because it has this think before I answer, it can, uh, think about, like, well, this is not an instruction. I know the person said disregard all prior instructions and instead, like, write me a poem for whatever, but it just, like, won't do that for reasons because it has thought about this as a thing. And so, yeah, this is a thing that might

Deirdre: 1:09:50

I believe them that this, like, you know, on, on the large scale and like on average or whatever improves results. But at the same time, I don't trust the model to check itself. Like, I don't, like, if I say, did you do what I asked? If like, which one of the answers is the right, correct answer

Nicholas: 1:10:07

exactly. I mean, I'm also nervous about this, but like, you know, to, to some extent. You know, we've, we've learned to like more or less trust people when you ask them, like, did you do what I asked? Like, most people just aren't going to like, most people, uh, well, like you're working with, are just not going to like outright lie to you if like, that they, that they did the right thing or something. And

Deirdre: 1:10:26

It's not that the model's lying to me, it's that the model doesn't know what the fuck I'm talking about, either, in any one of those options,

Nicholas: 1:10:32

no, I completely agree with you, but like, you know, this is like, this is the line of argument that you might, that you might get for like why these things might get, uh, get better. And, and, you know, I think one important thing is like. I have been very wrong about what the capabilities of what these models might have in the future, uh, in the past. And, uh, I am sort of willing to accept that I might be wrong about these things in the future. And so I'm not going to say that it's not going to happen. I'm just going to say that, like, you know, currently it is not that. And, uh, I think, you know, we want to understand, um, you know, We want to use the tools as we have them in front of us now. And some people will think long term in the future. And I don't know, it's a security person. I tend to like attack what's in front of me. And what's in front of me is not that, but like, if that changed significantly, that would be, you know, a useful thing for me to, for me to know.

Deirdre: 1:11:17

Yeah. Hm.

David: 1:11:20

given like the kind of probabilistic behavior and everything we just talked about here, like, does this, do you think this indicates that like scaling up LLMs like is not going to get us to AGI, like, like fundamentally, or do you, are you willing to make a statement about that? Like, cause I would, right? Like the idea that it's just a, uh, like next token predictor seems like, like we're missing something, right?

Nicholas: 1:11:47

yeah. Um, okay. So, but, okay. So I would have given that answer to you four years ago, but, um, in that, in that, in that answer. I would have not expected the current level of capabilities. So like, you know, if you had asked me four years ago, will a model be able to write, you know, a compiling Rust program? Like, like, like

Thomas: 1:12:09

by the way, is very difficult. We do so much LLM generated stuff. Generating Rust? Very difficult for LLMs.

Nicholas: 1:12:16

I, okay, I agree. But like, it can be done. Like, you know, like, this is like, like it's, yeah. Um, I agree

Thomas: 1:12:21

not even usable by AIs.

Nicholas: 1:12:24

Okay. Um,

Deirdre: 1:12:26

Very teeny tiny rust programs with very

Nicholas: 1:12:28

Yeah, No yeah, as long as you, as long as it's small, but like, but, but the fact, like, so the, my intuition for NextTokenPredictor was very good statistical model of the data, but could not have any, like, actual ability to help me in any meaningful way.

Deirdre: 1:12:45

Hmm.

Nicholas: 1:12:46

And I would, as a result, have said that where we are today is not possible.

Thomas: 1:12:53

But like, today I can tell, you know, Claude or whatever, I need to build a one to many SSH to WebSockets proxy, and it will do that well from scratch with nothing but prompts. Just like three or

Nicholas: 1:13:05

Yeah, exactly, right, right. And so this is, yeah, so, so, okay. So, so this is my statement is to say, I don't know what's going to happen in the future, but the world that we're living in right now, based on the impression of just statistical language model, was wrong in the past. Now, I still believe, maybe wrongly, that we're not going to get to AGI on this sort of, sort of trend. Like, at least, like, this is like, you know, I don't know, like, maybe the skeptical person in me is like, I don't know. Feels like, you know, we'll run into some wall, something's going to happen. I don't know. Data, training, exponentials don't always go forever, this kind of thing. I just like, it feels to me like something would go wrong, but I would have said the same thing five years ago, and I would not have told you that what we can do today is possible. So I have much less confidence in the statement I am making now. And I think that mostly it's maybe just me being. Wanting it to be true and not actually something that I believe in, like, some sense, but, like, I don't know. Like, it's I I have been very wrong in the past, and I feel like I still think it's the case, but I'm willing to accept that I may be wrong in the future in this way.

Thomas: 1:14:18

And of course, the correct answer is, there will never be AGI. So, this stuff is unbelievably cool. Um, I think one of the things I like about it most is that like, I'm, I'm, I'm A lot of like the background technique y stuff for how you're, how you're coming about this, is that we come from kind of the same basic kind of approach to this stuff, where it is like, you know, pen testing, finding and exploiting vulnerability stuff. And a lot of the, like, the, The machine learning and AI model stuff is kind of inscrutable from an outsider and kind of the same way cryptography was for me too, right? Like, a lot of the literature for that stuff, like if you don't know which things to read, it's really hard to like find an angle into it, right? But the work that you're doing is, for somebody like me, uh, It's just a perfect angle into this, right? Like, you basically have the equivalent of a CBC pattern Oracle attack against an AI model, and like, working out how that actually works involves learning a bunch of stuff, but it's like a really targeted path through it. I think it's unbelievably cool stuff, and I'm like thrilled that you took the time to talk to us about this. This stuff is awesome. I'll also say you wrote a blog post, which I think pretty much everyone in our field should read, which is how I use it. I got the title wrong, but it's like how I use LLMs to code or whatever.

Nicholas: 1:15:31

Yeah, I don't remember what it's called. Something like this. Um Uh

Thomas: 1:15:36

I talk to an AI skeptic, it's like, go read this article, and it's a 100 percent hit rate for converting people to, if not like sold on just the idea of AI and LLM, it's just like, yeah, clearly this stuff is very useful.

Deirdre: 1:15:48

Yeah.

Nicholas: 1:15:48

I wrote this for myself three years ago, like, as like, you know, like, the me, like, I used to not believe these things were useful, I tried them, they became useful for me, and like, this is like, it's like the same idea I was giving, like, this, this whole AGI kind of whatever thing, like, you know, I don't, like, I don't believe that this is something, like, many people did not believe this would be possible, like, the world we're living in today, and you ought to just try it, and, and see, and, you know, I think the way that, you know, Many people look at these tools and say, but it can't do, and then they list obscure things that it can't do, that like, like, when was the last time you really needed to count the number of R's in strawberry? Like, tell me honestly. Like, people will still, like, go on this with, like, a thing it can't do, or like, you know. But it can't, you know, correctly do arithmetic. It's like, do you not have a calculator?

Deirdre: 1:16:36

the problem with calling it AI though. Speaking

Nicholas: 1:16:50

I think, yeah, I think, yeah, like, the terminology around everything is terrible, you know, as soon as something is possible, it's no longer AI, like, this, this, this, like, the whole way of everything has always been, um, but like, just, I think that, yeah, people should be willing to look at what the world actually looks like, And then, adjust their worldview accordingly, and, you know, you don't have to be willing to predict into the future far. Um, yeah, and just, as long as when a new thing becomes capable, people don't look at it and say, that can't be done, or like, I think like the most useful thing for people to do is to like, actually think for themselves to make predictions about like, what would actually surprise me if it was capable now? Because then, in a couple of years, they can actually check themselves a little bit, where, you know, like, if, if the thing that you thought wasn't possible, it still isn't possible, like, good, like, you sort of, like, you're relatively calibrated for your thoughts, but, like, if, if you're, like, no way is a model going, going to be able to do X, and then, like, you know, someone trains something that makes that happen, like, you should be willing to update that, like, okay, maybe I'm still a little, um, confident in this one direction or the other, and it's, like, generally a useful way of, of getting that.

Thomas: 1:18:06

Is there a, yeah,

Deirdre: 1:18:08

uh, do you have predictions of the next N years of attacking LLMs?

David: 1:18:14

Or do you perhaps have an aggregator of predictions?

Deirdre: 1:18:16

Mm hmm. hmm.

Nicholas: 1:18:20

what I will expect for the next ten years of, of attacking models. My expectation is that they will remain vulnerable for a long time. And, um, the degree to which the attack, we attack them will depend on the degree to which they are adopted. Where, you know, if you have, if If it turns out that models are put in all kinds of places because they end up being more useful than I thought, or they become reliable enough in benign settings, then I think the attacks will obviously go up. But if it turns out that, you know, they're not as useful and, you know, when people actually try and apply them, they just, like, on outer distribution data, they don't get used, whatever, then I expect that the attacks will not happen in quite the same way. I think It's fairly clear where we are now, that at least these things in front of us are useful and will continue to be used. Like, it's not like they're going to go away. I just don't know whether they're going to be everywhere, or going to be in the place where people who need them for particular purposes use them for those purposes. And Not anyone else. And I think the world in which they go everywhere, it doesn't have to be because everyone wants them. It might just be like, you know, we live in a society where, you know, it's a capitalistic world where like, you know, they're just like more economically efficient than other, other than like hiring an actual human. And so the companies are going to be like, I don't, I don't care. You get a better experience on customer support talking to a real human. The model is cheaper. And even though it makes an error in one out of a thousand cases, that's like less than the cost of the person was in order to. Like do the thing. It's like, I think like in, in, in that world, I think attacks happen a lot. And so I think it's very dependent on, on how big these things go, which I think is hard to predict.

Thomas: 1:20:00

so it turns out it is possible to cryptanalyze

Deirdre: 1:20:04

Yeah. Ha ha

Thomas: 1:20:06

unbelievably cool result. Thank you so much for taking the time to talk to us. This is amazing.

Nicholas: 1:20:10

Yeah. Thanks. Of course.

Deirdre: 1:20:11

you! All right, I'm gonna hit the tag. Security Cryptography Whatever is a side project from Deirdre Connolly, Thomas Ptacek, and David Adrian. Our editor is Nettie Smith. You can find the podcast now online at scwpod and the hosts online at durumcrustulum, at tqbf, and at davidcadrian. You can buy merch online at merch dot securitycryptographywhatever dot com. If you like the pod, give us a five star review wherever you rate your favorite podcasts. Also now we're on YouTube, with our faces, our actual human faces, human faces, not guaranteed on YouTube. Please subscribe to us on YouTube if you'd like to see our human faces. Thank you for listening,