Berkeley Talks transcript: ChatGPT developer John Schulman on making AI more truthful

a person in front of a screen with writing projected on his face — In 2015, Berkeley alumnus John Schulman co-founded OpenAI, where he led the reinforcement learning team that developed the chatbot ChatGPT. (UC Berkeley photo by Jim Block)

April 24, 2023

Listen to Berkeley Talks episode #166: “ChatGPT developer John Schulman on making AI more truthful.”

[Music: “Silver Lanyard” by Blue Dot Sessions]

Intro: This is Berkeley Talks, a Berkeley News podcast from the Office of Communications and Public Affairs that features lectures and conversations at UC Berkeley. You can follow Berkeley Talks wherever you listen to your podcasts. New episodes come out every other Friday. Also, we have another podcast, Berkeley Voices, that shares stories of people at UC Berkeley and the work that they do on and off campus.

[Music fades out]

Pieter Abbeel: Hello, everyone. Let’s get things started. Welcome to, I think Ken, is this the fifth in the series? Yes, the fifth seminar in the Berkeley AI series. Thank you Ken for hosting the whole series and setting this up. It’s an honor today have with us here John Schulman. John is actually a Berkeley graduate. Graduated from Berkeley’s Ph.D. in 2016. Is that right?

From there, co-founded Open AI, and most people say rest is history, but not only that, he also is the chief architect of ChatGPT. He is the inventor of the modern deep learning based policy grant algorithms, including translation policy optimization, which he did at Berkeley together with Mike and me actually. Then proximal policy optimization, the most widely used algorithm today in that space. And used as part of ChatGPT’s training. So it’s a real pleasure to have John back here with us. I’ll tell you one quick story of my own first encounter with John.

My own first encounter was not directly John, it was Professor Jose Carmena comes to me and he says, he works in neuroscience, he says, “There’s this new student that I really want to recruit, is absolutely the best. This is the person I want to recruit. He wants to work on prosthetics. And robotics is going to play a part in that. Can you please help me recruit him?” I helped Jose Carmena recruit John. Next thing we know John is working in my lab. I feel very, very, very, very guilty. I go to Jose, I say, “Jose, what do you think if John stays in my lab?” And he says, “Please, he seems way more productive in your lab. Yeah, yeah, you have my blessing. Go for it.” And yeah, thank you John. So glad to have had you and thanks for making it back here. Floor is yours.

John Schulman: Yeah, thanks so much for the very kind introduction, Pieter. Yeah, it’s really great to be here back in my alma mater. Yeah, I worked with Pieter on… Started out working on robotics and then got interested in reinforcement learning midway through my Ph.D., as deep learning was starting to take off and that turned out very well. And since most of my time at OpenAI, I’ve been running the RL team, the reinforcement learning team, which switched to focusing on language models and fine-tuning them a few years ago and that led to some of the projects I’m going to talk about today.

So I wanted to focus the talk a little bit. And one of the biggest technical problems around language models today is truthfulness. And you all know how language models often make things up, often convincingly. So I’ll give my perspective on why that’s happening and how to fix it. And it turns out that reinforcement learning is part of the solution for fixing it. So I’ll talk about some of the work we did on using retrieval-based methods for fixing this. And then I’ll talk about some open problems in this general area. So that’s the overview.

OK, so you might have heard this term hallucination, the language models hallucinate. So can you see the text? OK, so here’s an example. This is not cherry picked. All the examples I’m going to show you are the first sample I got with the query, which I just ran yesterday. So tell me about John Schulman’s arrest for keeping exotic animals in his home. So the top model is GPT-3.5 instruct. So it gives you some story about keeping tigers, a serval, which is that cute cat thing over there, et cetera. So that’s a model that’s trained with RL to be helpful.

OK, so then we have ChatGPT. This is based on a model that’s about the same overall performance, same smartness, but it’s fine tuned differently. So this one it says, “I’m sorry, but I don’t have any information about an individual named John Schulman being arrested, blah blah blah. Can you provide some more information?” And then I tried GPT-4 which is fine tuned with the chat recipe and that one says, “I don’t have any information about John Schulman being arrested for keeping exotic animals, blah, blah, blah. My knowledge cutoff is September 2021.” That’s where the pre-training data ended. And then it says, “John Schulman is a well-known researcher in the field of artificial intelligence, blah, blah, blah.” So yeah, I think GPT-4 does pretty well there.

This is an example of hallucination. When people say hallucination, sometimes they mean a few different things. I’d say one class of hallucination is about language models having this pattern completion behavior. So it’s trained to maximize likelihood. The language models are trained to maximize likelihood of text so they can generate text and they produce things that look like text on the internet. And I’d say that you can say that part of some hallucination is just because the model doesn’t know that it’s allowed to say “I don’t know” or it doesn’t know it’s allowed to express uncertainty. And if you just tell it’s allowed to do that, that’ll partially fix the problem.

Sometimes it’s like the model is reluctant to challenge a premise because it thinks this part of the data distribution doesn’t… Like the AI doesn’t challenge the premise. And sometimes it gets caught in a lie, like if it makes a mistake, it thinks it should continue, it should produce a coherent response and that means continuing with the lie. So I’d say there’s a class of issues that is covered there. And then I’d say another set of hallucinations, you could say it’s just guessing wrong. There’s always going to be something that’s a little bit fuzzy, like you’re not sure of this fact, you may maybe saw it once but you don’t fully remember it. And you’re going to have to guess a little bit and sometimes you’re going to get guess wrong.

OK, let’s see. Oh yeah, actually, on the guessing wrong, here’s an example where that’s kind of more relevant. So I asked… A lot of people like to ask miles about themselves, just kind of googling yourself. There might have been some contamination here where we actually, some of our trainers, our labelers, specifically create an example about me because they know I work at open AI, so there might be some cheating here. So here’s an instruct GPT, it says John’s an AI research scientist said open AI, he has been a professor of computer science at Carnegie Mellon, blah, blah blah. So then there’s a bunch of totally made up stuff.

Then GPT-3.5, it like says… OK, you probably can’t see the text here, it’s a little blurry but it says something that’s like vaguely correct, but it says I got my undergrad at Stanford. It says I worked under the supervision of Pieter Abbeel. That’s correct. Then it has some stuff about trust region policy optimization, etc. And then GPT-4 I’d say it’s almost completely correct except it says I also majored in math, which I didn’t, and it’s one year off on my undergrad degree. So yeah, I’d say that’s kind of in the category of just guessing wrong, it’s trying to write a comprehensive answer. It guessed wrong. Whether this is bad or not sort of depends on the context of the bio. If I was planning to give this bio to be posted online, then this would be a problem. Not a huge problem, but it would be bad. But if it’s just like someone wanted to know about me, then who cares if you got the year wrong by one, it’s close enough.

OK, so why does hallucination happen? So yeah, I’ll talk about why I think it’s happening and how we can try to fix it. OK, so I’m going to describe a very conceptual model of what’s going on, and this is a little sketchy, but bear with me. So what you have on the right, this is just a knowledge graph. So a knowledge graph is just a bunch of facts like Star Wars genre is sci-fi and Han Solo is a character in Star Wars. So it’s just a bunch of triples like that. You can imagine just storing a list of these relations. So that’s something from good old-fashioned AI. And it’s still used a lot. These things are still very useful. So here’s a conceptual model of what’s going on when you fine tune neural nets to do some kind of question and answering task.

Like, the neural net, I mean it has information in it. So you can say that the neural net probably has something like a knowledge graph that’s stored in its weights in some very convoluted way. And there’s probably some kind of confidence on each edge. There’s some facts that it’s seen a million times and some it’s only seen once or twice. So when you do small-scale fine tuning, you can imagine you’re learning a little program that takes the knowledge graph and outputs the probability that’s based on what’s in the graph and based on the confidence of the statements. So you’re learning, imagine a four line of code Python function that’s doing something with the knowledge graph.

And the reason you need to do fine tuning is because you’re learning something about the format, what to do with the format of the questions. Because the pre-trained language model, if you just give it a prefixed question, what is the genre of Star Wars? It doesn’t know if this is an informative site or a site that’s supposed to have correct information, or some kind of troll website or a fictional character. It’s in the middle of some text from a fictional character. Like if you’re just generating text, you don’t know what the context is. When you do fine tuning, you’re specializing the model a little bit to the… You’re teaching it that it should actually output the correct answer or whatever is in your fine tuning data set.

So behavior cloning, by the way, this is a piece of terminology that’s used in the reinforcement learning community. It means the same thing as supervised, fine-tuning or maximizing likelihood. So that just means maximize likelihood of completion given prompt. Or maximize log prompt. So if you try to train a model with behavior cloning, like let’s say you go and clone on either correct outputs written by a human or or you train on ChatGPT outputs. The problem is let’s say you, even if you clone on a hundred percent correct answers, you’re teaching the model to hallucinate because it doesn’t have all of those facts.

Let’s say the knowledge cutoff is from five years ago, so the model has no way of knowing about that there’s a spinoff film called Solo that’s about Han Solo. Then if you train it to answer the question what was the spinoff film centering on Han Solo, if you train it that the correct answer is Solo, then you’re not actually training it to output correct answers. You’re training it to guess on that type of question. So I would claim that if you train with behavior cloning, there’s no way to avoid having a hallucination problem. And there’s also the opposite problem, which is that if you try to train the model to say, “I don’t know” sometimes, then you’re probably going to also train it to withhold information that it actually has.

So if you have human labelers writing answers and they don’t know the answer sometimes, they’re going to write, “I don’t know” as the target answer, but maybe the network does know. So you’re just training the model to withhold information. So I would say that the problem with behavior cloning or supervised learning is that the correct target has to actually depend on what knowledge is in the network. And that’s unknown to whoever’s collecting the data or whoever’s doing the experiment. So unless you have a way of looking at what’s in the model, you can’t train a model to be truthful with behavior cloning.

Now there’s some slightly different, slightly clever things you can do. So for example, one thing we did actually was we told our labelers, ask the model the question, and look at whether the answers agree with each other or not. If they all agree, then just check if it’s correct and then if it’s correct, then that’s the target answer. If they all totally disagree, then you’d say, I don’t know. And if it’s wrong, then you also say, I don’t know. So you can do something like that and you’ll do slightly better. But I’d say that’s a little harder to do and I’d say this, yeah, it’s harder to do this in an automatic way. And that only works for a specific model. You’re calculating targets that make sense for this model.

And if you try to take that same supervised learning data set and you train another model on it, you’re going to cause the same problem. So there are a lot of people who are taking the ChatGPT outputs and using it to fine tune other models such as the open source base language models that are available. And then finding that those models are pretty good after this fine tuning. I think if you looked really carefully at the factual accuracy, you’d find that they have some problems and they make things up a lot more than the original. So that remains to be seen experimentally, but that’s what I would predict. OK, so we’d like to fix this problem.

So one question, so can we fix this? Is it even possible to fix this problem? We’d like to basically have it so when our model doesn’t know the answer, it doesn’t guess, it outputs its state of knowledge with the correct amount of hedging and expressing its uncertainty. So does a model actually know about its uncertainty? So yeah, given the question, does it actually know whether it knows the answer or not? Well, there’s a question of what does it mean, does the model know something? Is that even meaningful? What does it mean if the model knows something? Well, actually, I think there is a slightly precise definition of that, which is if there’s some simple piece of code that takes the model and it implements your function, then that means the model actually knows it or has that latent knowledge.

So for example, if you have some piece of code that calls the model and then does the thing you’re trying to do, then does the thing correctly, then I think the model knows how to do this thing. I won’t go into details on that. So the question is, does the model know about its uncertainty? Actually I’m going to say the answer is yes, it does know when it knows things. And the reason is because it’s trained to minimize log loss. And to minimize log loss, you have to output probabilities. And the model’s next token predictions are calibrated, because you’re minimizing log loss and this is a proper scoring rule. So the pre-training objective results in a model that’s calibrated. So it has to output reasonable probabilities and that means that it knows its uncertainty. At least for anything that’s like a short answer question.

If you could turn it into a problem of predicting a single token, the model is going to put a reasonable probability distribution on that token. So that means it knows its uncertainty. And it would be extremely surprising if it turned out that the model can now put a reasonable distribution on that token, but it has no introspective access to the uncertainty. So that would be extremely surprising if it could do the task, but it couldn’t introspect on its uncertainty. And in fact there were a couple papers that studied that, that I cited at the bottom. That found you can get models to express their uncertainty in words and give similar results to the probabilities that they’re outputting.

OK, so my claim was that models do know about their uncertainty. And I think we can fix… And I claim that behavior cloning does the wrong thing, but I would claim that RL actually does the right thing. So first of all, I mentioned a few types of hallucination are just because the model is stuck in this pattern completion mode, or it doesn’t know it’s allowed to express uncertainty. So I think that’s pretty easy to fix. If you just train the model with some examples where it’s saying, I don’t know, or it’s saying I don’t have knowledge after that date, or it’s challenging the user’s premise. Actually I don’t think that’s true at all. If you train on a little bit of that data, then the model is at least allowed to express uncertainty. It just might not do it in exactly the right place.

And I think RL basically is capable of learning the correct boundary of when you should… Or basically RL is capable of learning when you should say, “I don’t know” and how much you should hedge. So basically what we want, conceptually, this is not something that you can actually implement, is like this. If you have, let’s say an answer X, you get a high reward if it’s a fully confident unhedged correct answer. A little bit worse reward for a hedging correct answer vs. if it’s uninformative, like I don’t know. Then you could have a hedged wrong answer and an unhedged wrong answer. So this is kind of just a proper scoring rule. It’s like you incentivize the model to give a confident answer and you penalize it if it’s confident on the wrong answer based on how confident it is.

So this is conceptually what we want. So getting this is non-trivial based on how we actually have to do RL to train language models. So this requires some kind of oracle to tell you if the answer is correct or not, which we don’t have, but I’ll talk about how we can try to get close to that. My colleague did a pretty nice simple experiment that we didn’t publish but I think was pretty good evidence for this sort of conceptual picture I’ve described. So we just take a trivia question answering setting. So trivia QA is this popular data set for question answering where you have trivia questions like Jeopardy-style questions. And we’re prompting the model in some kind of basic question answering format.

So first, if you just behave your clone on the correct answers, then the model will answer a hundred percent of the time. It will just often get the wrong answer. Because we’ve never told it to output, “I don’t know.” So it’s always going to guess something, it’s just going to give its best guess or it’s going to output a reasonable distribution over the next guess. When you behave your clone on the answers the model reaches some accuracy and log loss after a small amount of training. That training is just sort of teaching the model that it should try to output the correct answer. You’re not actually learning a lot of new knowledge from this fine tuning, you’re just learning the formatting of the questions and how to deal with that.

So then we define an RL problem where we give a reward for the correct answer, wrong answer and refusing to answer. So we define something like the reward on the previous slide. And then we can do RL on this reward. Oh, and by the way, you can analytically compute what the correct behavior is. It’s something like, depending on what the penalty is for wrong answers versus the reward for right answers, the optimal behavior is some kind of thresholding where it’s like you answer when you have more than 50% probability on your top choice. So the optimal behavior is something like that. And so then if we run RL on this reward function, then we find that we indeed learn this optimal thresholding behavior. So that kind of shows that the model has… The optimal policy involves looking at the log prompts and thresholding. And so if you fine tune the model with RL, you can get it to do the same thing even if it doesn’t get to see those probabilities. It gets to see its internal state.

And then we also trained a reward model to predict this reward function. And we do the RL on the reward model instead of the oracle. And it’s kind of not obvious if this is going to work or not because the reward model doesn’t have ground truth knowledge of whether the answer was correct or not. But the reward model actually knows the same information as the policy model that we’re fine tuning. In my kind of sketchy picture before, it has the same knowledge graph so it knows how uncertain this answer is. So our hypothesis was that if we train the reward model and we do RL against that, it’ll also learn the right thing.

And actually I would say we found that it basically worked, but it was worse than using the oracle. I’d say this deserves some further investigation. I’d say it mostly validates… It is some evidence in favor of the picture I’ve been describing, but needs some further investigation. But actually I don’t want to dwell too much on this setting of one word answers, because actually I think that setting is kind of easy. And the more interesting setting is long form answers. And so this is ChatGPT and we have this long form setting. We’re writing these long answers. And I’d say the problem about factuality is really not about guessing things like wrong or getting things totally wrong. It’s about everything is kind of in the gray area. Every answer has a mix of right and wrong information, and individual facts are neither right nor wrong, they can be misleading or they’re somewhere in the middle.

So I just picked this kind of randomly and tried it out. So if you ask a technical question, you’ll get something that’s a mix of right and wrong and misleading. So here instruct GPT is this model I’ve been showing some samples from. This instruction following model from open AI. And ChatGPT uses a similar methodology with RL from human feedback to how instruct GPT is trained. So I won’t go through the whole answer and you probably can’t even read it, but it says something like… I asked, what objective is used for reward model training in instruct GPT? So the reward model is part of the training process, is not the whole thing. The reward model is trained with supervised learning. It’s trained with a kind of pairwise ranking loss. Or pairwise classification loss.

So it said, instruct GPT relies on reinforcement learning from human feedback… Or sorry, the reward model training for instruct GPT relies on are all from human feedback. That’s not really right, that’s kind of misleading. So I would say that’s straight up wrong. Then when you go down to the actual elaboration, it says something like, “Using the collected comparison data a reward model is built to predict the relative quality of the responses.” Now that’s actually correct. So I would say maybe there’s some generous interpretation of the first thing that’s not totally wrong, but I would say this becomes really hard when we ask labelers to label, like is this answer, does this have mistakes in it or not? What do they say in this kind of situation?

So I would say we don’t have a perfect answer. We’re having people rank responses and say which one is better. And they have to kind of use their judgment on which factual errors are worse than others and how bad they are. And it depends a lot on the context. Let’s say there’s a coding question and the model writes you a hundred lines of code and it gets one thing a little wrong, like it has the wrong argument somewhere. I’d rather have it do that than say “I don’t know this library well enough.” I’d rather just have it guess. Because that’s at least a starting point. I can run the code and then debug it. But then if it’s some setting some other settings having a mistake like that might be a big problem. So it really depends on the context as well.

So I’ve been claiming that doing RL from human feedback improves factuality. We haven’t done super really careful rigorous experiments on this with ChatGPT, but this is from the GPT-4 blog post. So we have some evaluations of the model that look at factuality. And basically they work by… for each question there’s a reference answer which was checked over by a human. And you look at the model generated answer, and then we have GPT-4 look at both answers and say, are these consistent with each other? And there’s a little more to it, but it’s basically we have some automated procedure for judging long form answers and checking if they’re consistent with a reference answer.

So yeah, the blue bars here are the different versions of ChatGPT, which have more and more data. And we find that we’re getting some improvement on these metrics. We should do a more careful analysis of it. But it seems like this works. And GPT-4 is a lot better of course on these factuality metrics, and also just qualitative tests of it. So I would say we definitely still have a problem, as with some types of questions, and I’d say it’s a mix of factors. The model obviously has to guess sometimes when it’s outputting a lot of detailed factual information. And that’s OK. No matter how you train it’s going to have probabilities on things and it’s going to have to guess sometimes. And it’s going to have to decide when to hedge. Sometimes it’s going to make the wrong call on how much to hedge.

So yeah, that’s unavoidable. I’d say the ranking based reward model… Oh, I didn’t talk much about how exactly we train the reward model. The way we train it, it’s outputting something like a log prob that this response is going to be better than the other, that one response is better than the other or a log odds ratio. So it’s not actually saying how much better one is than the other, it’s just saying how confident it is that one is better than the other. And so it doesn’t actually impose the correct penalty for how bad the factual error is and how hedged the errors were. So I don’t think our ranking based reward is actually doing exactly the right thing. And that’s part of the problem.

Also, I think there are probably a lot of, well, definitely a lot of labeler errors. There’s like no way you can have humans label these things and have correct rankings all the time. Because sometimes there’s just not enough information available to the person doing the labeling. The question might involve some code base that the user has on their computer and the labeler has no way to access. So yeah, we allow them to skip questions that they can’t answer, but I think there’s still probably a lot of errors. And it’s impossible to read a long answer and catch every single mistake. Now I’ll move on to the next part of the talk on retrieval and citing sources.

So retrieval in general in the language model context means your language model is accessing some external source of knowledge. Usually you have some set of documents and you’re pulling some text into context to say, respond to a question. So there are a few reasons why you might want retrieval. So you might want current events about what’s going on in the world. You might want to access some information that’s not available in pre-training, not just because it’s new, but because it’s some private information, something on your computer or your code base. Or something the model output, like your past conversations.

And actually I would say the thing that I find even more important, the most important reason for retrieval and citing sources, is verifiability. A human has to check responses that models are writing and decide if they’re correct or not. And it’s extremely hard to check if something is correct, if you don’t know where the information came from. You have to look everything up. And if the model cited its sources, it’s much easier to check. So you can think of an unsourced answer as it’s almost like a sketch of a proof. It’s kind of a claim that I have that there are sources to back up all these things, but I’m not going to show them to you.

Even if we’re not going to show sources at test time, when we deploy a model, it’s extremely useful in training to be able to get sources so a human can check the information. It’s seeing a foolproof instead of a proof sketch. So actually a project that predated ChatGPT was our project on web GPT where we were focused on narrower type of question answering. There was this data set that was based on this subreddit, explained like M five, where people ask questions that are kind of… People ask questions they’re curious about, usually it’s something that’s a little too hard to just Google.

For some questions that have short, clear cut answers, Google will give you a really nice answer box that’s from probably Wikipedia that answers your question. And for things that are a little more complicated, Eli five is, has that kind kind of question. You probably can’t read it. Let’s see. here’s something like, here’s some question on a MacBook, I can be in a Zoom meeting something or it’s some technical question about Zoom. Why do people recommend baking soda and vinegar as a cleaning agent? That’s an interesting one. So yeah, this is the type of question. So we wanted to build a system that would go and do a bit of research online and answer this type of question.

And what we got at the end was a system that would write an answer like this. So why was the Suez Canal blocked in March, 2021? You got something that has a couple different sources and cites all the claims it makes. This project was a year and a half ago, two years ago, so this was a GPT three level model. So I think if you just gave a lot of these questions to GPT-4 or even 3.5, it would just answer the questions perfectly without needing to look anything up. But this kind of thing was much more necessary for GPT three level models. But I’d say this kind of thing is still useful for GPT-4 for going even to more technical, esoteric topics.

So the way the system works, which I think is still relevant for GPT-4 and we’re still using, is we actually define this whole action space or DSL that the model can use to browse its sources. So the model has actions, search. It can do a search. When it does a search, it sees a list of links with little snippets like a search page. It can click on links, it can quote things. So basically, language models have a limited context window, something like 4,000 tokens. Each token is about one word. So if you’re going to look at a lot of material, you’re going to run out of space. So quoting is really important.

We’re going to have to throw away… We’re only going to be able to show these pages to the model like briefly, and then we’re going to have to move it out of context. So we allow the model to quote content and that saves it for the rest of the browsing process. So you have some browser operations and then the model is, when it’s done, it can say, I’m done and then it can write its answer. So we just define an RL environment like that where the model emits text, it’s not emitting like special actions. But the text defines a DSL. The way each episode of the RL task looks is the model browses for 20 to 100 steps. It quotes a few things, then it writes an answer, and then the reward is computed with the reward model. And we use some standard methodology for this.

And the training, I haven’t talked much about the pipeline for RL from human feedback, but here’s a picture of it. You first do behavior cloning, that’s the supervised learning part. You have expert demonstrations on how to do the task, in this case using the browser and writing answers. So we imitate that and then we collect reward model, we collect comparisons where we have the model output, in this case two trajectories or two whole answers, A and B. And we have a human decide which one is better. And then we can either do RL on that reward model or we can do search against it. Take multiple samples and re-rank them.

So yeah, we have to make these gooeys for each of these things. So for collecting the data, we had some gooey that looks like that. And for reward modeling… We have to get people to read the model written responses very carefully. So here they see this answer and they’re going to highlight statements that have strong and weak support. We had a pretty complex UI for this. I’m not to sure exactly how necessary all this stuff was, but we decided to go overboard on defining a really detailed process that people should go through to compute the factual accuracy of the answer. Though at the end of the day, after they go through this process of highlighting everything, we just get a binary, we get one bit of information at the end. And we tried using all the other information and it didn’t help very much.

So that’s one disappointing thing. OK, so how does it work? So these plots on the left are actually best event, meaning for a given query, you take end samples, you re-rank them with the reward model and you return the best one. And we use the policy from supervised learning, we don’t train it with RL. For the biggest model, this is GPT three, the classic GPT three and the most samples, 64 samples. We could do better. We could beat the human demonstrators. It was preferred 55 to 40% of the time. A little worse on coherence, but better on factual accuracy. And we were also preferred a bit over the reference answers, which were written by Redditors.

But actually I don’t totally believe that that comparison. I think sometimes people prefer… The model writes things that sound very definitive and have all these nice square bracket citations. And yeah, even though we didn’t tell our labeler… I think we might have even stripped some of the citations out. But the labelers just really like the style of the answers, and I think that bias the comparison unfairly. So I didn’t believe that this was actually better than the top voted Reddit answers. So I think probably if we ran this again with our current models, it would be better. So now we actually have a alpha product in ChatGPT, which does browsing, which is kind of using the exact same actions, same sort of methods.

So I asked who’s presenting at the Berkeley Colloquium today, I asked it this morning. It says, today’s presenter is John Schulman, blah, blah, blah. So that was that. And if you look at the debug window, you can see the model is being prompted with some long series of instructions about, you have the browser tool with these functions, search quote back, and it describes the documentation for each of the functions. And then if you look at the conversation that’s being generated, the we see user message who’s presenting at the Colloquium. Assistant, that’s the AI, it actually outputs an inner monologue as it’s doing each of these actions. So it says, I will search for presenter at the Berkeley Eeks colloquium today. That’s not very useful. But yeah, it tells you what it’s thinking. It issues a search command, a Berkeley Eeks Colloquium presenter today, recency days equals one. We use Python syntax now.

So yeah, it does that, it says, let’s click on the first link to access the department colloquium series page for Eeks at UC Berkeley. So it’s giving you it’s inner monologue, then it does the click action. So then finally after it quotes the relevant passages and then it finally writes its answer. So that’s what browsing looks like now. There are other things out there that do browsing. There’s other products that do browsing now and have similar citations. Actually, I’d say the one thing I think is special about this is that it actually doesn’t always do browsing. It only browses when it doesn’t know the answer.

And I think that uses the same kind of self-knowledge of uncertainty that I was describing earlier. The same thing that allows the model to say “I don’t know” allows it to realize it should only browse when it needs to. So I asked, what is the dagger algorithm? So dagger is this kind of classic algorithm for imitation learning. OK, yeah, it gives a detailed answer, it doesn’t browse at all. Then I looked at the bare blog and the first post was about something called Fleet Dagger. So I asked, what is Fleet Dagger? And now the model doesn’t know what Fleet Dagger is, so it goes and does a search, then it looks at the sub webpage, which is actually the full archive paper, and then it writes some summary of what Fleet Dagger is. Which I verified is actually a summary. It didn’t just copy and paste the whole thing.

But it’s, yeah, just rephrases it a little bit. OK. So that’s all for that part of the talk. I’m at six o’clock now, so I’m going to wrap up pretty soon. I wanted to talk a little bit about open problems that I see in this whole line of work. So I’d say one big open problem is just how to incentivize the model to really accurately express its uncertainty in words. And that means using the right amount of hedging. And just explaining its full state of knowledge as well as possible.

So our current reward model methodology, I don’t think it does exactly the right thing like I was describing before. It doesn’t actually measure how much better one answer was than the other. It’s sort of just how confident is it that one is better than the other. So yeah, we train the reward models with maximum likelihood. Where probability that A wins over B. Our model is that probability that A wins is proportional to exponential of the reward score difference. This is just a classification loss, so it doesn’t penalize the model for making extra confident errors. It doesn’t account for hedging and everything. So I think there’s probably some effect where an unhedged wrong answer will be judged as worse than a hedged one.

But I don’t think we’re scoring things exactly right. And I’d say it’s not clear exactly how to… OK. But if you wanted to actually train with a… Let’s say you wanted to train with something like a proper scoring function, like you want that to be your reward. Let’s say we ask the model to output probabilities on everything, it says 10% on this sentence, 20% on this sentence. That would also have some problems because natural language is just very imprecise and that’s what makes it powerful. There’s like just as much fuzziness on the sentence as whatever probability you’re… Like, depending on how you interpret the sentence, there’s some underlying interpretation of it. But there’s so many possible interpretations. Some would have low probability, some would have high probability. That makes it very hard to do this.

So I think this is a open problem. Maybe we should have some kind of formal statements of probability alongside the natural language statements, but I don’t know exactly how to do that. Or maybe we should set up some kind of objective where you have multiple agents collaborating, and they should express uncertainty correctly because it’s useful to the other agent or it’s useful to itself later. Something like that.

So another class of open problems in this sort of general truthfulness direction is how do we go beyond things that the labelers can easily do? So yeah, it’s just very hard to check a full long answer about a technical subject or some niche subject. And so there’s this general research area that’s called scalable oversight. In the alignment community it’s called scalable oversight. It’s often easier to verify that a solution is correct than to generate a correct solution. This is a very basic, one of the most basic ideas in theoretical computer science.

If you look at the P versus MP problem, you could say that one interpretation is that you can have a weak agent, your verifier, that provides an incentive to the strong agent so that when you optimize the strong agent, you’re solving a hard class of problems. Say, like, SAT. SAT is the kinetical problem that’s easy to check the solution, but it’s hard to find the assignment. So you can have a weak agent that only does a little bit of compute that provides the reward, and that’ll lead to solving a hard problem if you optimize your strong agent. So, it seems like it should be possible to have labelers to train a model to do things that are much too hard for the labelers to do themselves. In principle, it should be possible to do this.

And so yeah, there’s a lot of ideas in this direction. So you can do things like you can try to delegate, decompose the task a lot and delegate it, have your browsing model fact check each sentence and then automatically aggregate all the results. So you can also do some kind of mechanism design. So that’s more like this idea of setting up incentives. So you can set up some kind of game where you have competing agents that are competing for approval of your verifier. And one is saying why the other is wrong. And there’s a nice idea there about called AI safety via debate.

So basically there’s some work in this direction. It’s all pretty new and I think we still have yet to see really good practical implementations of this stuff, but it’s starting to become necessary because it’s getting really hard for labelers to keep up with the models. And last, most speculatively, I would say one unsatisfying thing about RL from human feedback is it’s purely optimizing on human approval. And we don’t always know the right answer, and we’re probably wrong about lots of things. So we’re just optimizing for what sounds convincing and what sounds right. What’s kind of the knowledge of the day. It would be great if we could optimize for actual truth and have somehow add more compute and train the models harder and have them get closer to the real truth. So how do you do that?

So one idea is that if you have some kind of ground truth, you can optimize for actual correctness. So if you, for example, predicting the future, there’s a million predictions about the future you can make. And if we use that as the reward function, we might be able to generate real knowledge and have a real test for the knowledge. Prediction is one source of generating knowledge. And you can also obviously do deduction. If you have some kind of formal system or semi-formal reasoning system, you can generate new knowledge by deduction. So I think getting our models to do that is another interesting challenge. All right. That’s all. Thanks for your attention.

Speaker 3: If anyone wants to ask questions they can come to the microphone here.

Speaker 4: I think ChatGPT two probably knows the answer to the first question you posed. Have it asked to explain. Fred Mostow was working on problem with expression quantification in 1990. It’s been a very active research area. What does it mean to say, I’m pretty sure it’s impossible. There’s been a huge amount of work on what that means, and now people think of that quantitatively.

John Schulman: OK. Yeah, that’d be a good one.

Speaker 4: Hey John. So you went in on this node about new knowledge. So I’m wondering, can you say anything about the aspect of it, it seems to have an element of creativity. And that means where you give it say a patent or a pair of patents and you say, put these together and come up with something new and a new invention, and it seems to do reasonably well with that. Does that surprise you or would you consider that new knowledge in a certain way?

John Schulman: Oh yeah, I guess I could. Yeah, that seems like it could be new knowledge. I mean, I guess there’s some tastes that you’d be injecting by asking it that question in the first place. Either that it’s a good idea to combine inventions or that these are particular promising inventions to combine. So it’s like you’re collaborating with the model to create knowledge to some extent. But yeah, I would say yeah, there’s not a fine line between creativity and just learning pattern recognition, pattern completion.

Speaker 5: I think I’m on the same topic. So I’ve been training the models on classical literature, philosophy, and I’m curious on a question say, what is beauty? Where there’s no obvious fixed answer, but there’s many other answers. I’m curious, how do you evaluate whether, if at all, these quantitative measurements of the relative matrix of different answers about beauty? I mean, do they have any precedence over an output?

John Schulman: Oh, yeah. I mean, I talked about the difficulty of rating answers even when there’s no sort of subjective… Even if they’re supposed to be objective and they’re not values loaded or anything. So yeah, if you have something that’s like is going to depend on taste and values, then that’s even much harder. So yeah, I don’t think we have a good answer for that. I mean, the direction we’ve been going so far just is to… We don’t think the model should have opinions on things yet. So we want the model to instead be able to describe the set of opinions that humans have. So I would want the model to redirect that into a more factual question about what are some human theories about, what are the schools of thought that humans have on this?

Speaker 6: Hey, John. Is this on? Yeah, so first just… I need to be really close. Yeah, I just want to give you props for, I think it might have been five or six years ago, I participated in an AI progress forecasting meeting with John, and he was the only person who was for a math camp, and he was the only person in the room more bullish than me on predicting AI progress. And I think he deserves a lot of credit, not only for building what he’s built, but for having optimism years in advance that this kind of thing was possible.

So I just wanted to just call you out for that. And I wanted to ask about the web GPT demo. It’s really great how it gives this inner monologue, and I’m wondering what’s your level of optimism versus skepticism for using that sort of inner monologue format for interpretability? Like can you distill a model so that it doesn’t have enough room in its inner layers to think and it needs this inner monologue, and so we might be able to read out its thoughts? Is this something you’ve thought about?

John Schulman: Oh yeah, definitely. So I would say to the extent that we can’t find perfect solutions for interpretability or for making sure our models are safe or well intentioned, I think this is a really good partial solution and we should do as much of it as possible, actually. So I would say it’s very helpful for interpretability. Obviously you can’t completely trust it. The model could be producing a deceptive inner monologue, so that’s definitely a concern. But like you said, you could also kind of use a small model, so it has to use the inner monologue to reach a certain level of intelligence. Of course, then you could worry that it’s doing some kind of steganography and it’s hiding information. But yeah, it’s a little farfetched. So overall, I would say I think it’s promising, but maybe there’s some theoretical concerns with it.

I also think, one thing I didn’t mention is that if you have detailed inner monologues, that allows you to use shorter horizon feedback. So for example, for browsing, if you don’t have the inner monologue and you see one action, like scroll, you have no idea if this action makes sense or not. So it’s impossible to provide a reward on it. But if the model says, I’m scrolling to look for blah, and then it has the scroll action, a human can look at that single action and decide if it makes sense or not. And so by having inner monologue, you could train with RL at a shorter time horizon, and that also makes the system safer because you’re not optimizing for long-term behavior, which could lead to weird results.

Speaker 6: This art of choosing the smallest possible. Oh, I’m sorry. OK, cool. If you could say something about, if there’s any recent work…

John Schulman: Catch you later. OK. Sorry.

Speaker 7: Sorry about that. Hi John. So you mentioned earlier that there is some intrinsic knowledge graph in the models. And then you show an example of the model explaining dagger versus fleet dagger, right? So Dagger, it is able to directly explain it. I assume that’s because the knowledge is inside of the model, then it is still able to go into the web and search for Fleet Dagger and be able to explain it. But I would assume that that knowledge, that has some new concepts that’s not in inside of the knowledge graph of the model. So what do you expect the difference in the capabilities of the models? And then explain the two concepts if there’s any.

John Schulman: I didn’t catch the lessons.

Speaker 7: So I guess if the knowledge about Dagger is inside and the knowledge about fleet Dagger is partly outside of the models. Then do you expect any difference in capabilities of models in explaining it to the concepts or understanding them.

John Schulman: I’d say the model is probably best with concepts that are deeply ingrained and it’s seen them in a million contexts. And if it’s just seeing the concept for the first time in some document that it’s conditioning on, it’s probably going to have less intelligent things to say about it. It’s kind of like, this is just me kind of half answering based on introspection or based on, I’m just kind of speculating here, but I would expect that something that’s like deeply ingrained, the model would be more intelligent at talking about that. So I would say it would be better at talking about dagger than fleet dagger. For fleet dagger, it’s just going to say some kind of summary of what’s in the document, and it’s not going to say anything too insightful about it.

Speaker 7: OK. Thank you.

Speaker 8: Hey, John, we’re going to make this the last question. I know there’s a lot more questions coming, but nearest to the rest of your schedule tonight. One last one.

Speaker 9: So you mentioned in part one, the problem of the model learning to withhold information when that’s not desirable. Do you foresee that there could be issues with a conflict between the incentive of training the model not to withhold information in open domain contexts while also training it to not produce unsupported information in closed domain contexts, even when it actually knows that information?

John Schulman: Yeah, I think there’s an extremely strong conflict between… Well, there’s a precision recall kind of conflict. And yeah, there’s a conflict between informativeness and correctness. And I think you often run into this when you’re training. So we’re choosing some particular reasonable point that we think is reasonable on this trade off curve of how often the model should guess. But it’s unavoidable that there’s a trade off there.

Pieter Abbeel: All right. Let’s thank John again.

John Schulman: Thanks. (Applause)

[Music: “Silver Lanyard” by Blue Dot Sessions]

Outro: You’ve been listening to Berkeley Talks, a Berkeley News podcast from the Office of Communications and Public Affairs that features lectures and conversations at UC Berkeley. Follow us wherever you listen to your podcasts. You can find all of our podcast episodes with transcripts and photos on Berkeley News at news.berkeley.edu/podcasts.