Berkeley Talks transcript: Jitendra Malik on the sensorimotor road to artificial intelligence

Jitendra Malik gives a lecture on March 20 called "The sensorimotor road to artificial intelligence." (Screenshot from video)

March 24, 2023

Listen to Berkeley Talks episode #164: “Jitendra Malik on the sensorimotor road to artificial intelligence.”

[Music: “Silver Lanyard” by Blue Dot Sessions]

Intro: This is Berkeley Talks, a Berkeley News podcast from the Office of Communications and Public Affairs that features lectures and conversations at UC Berkeley. You can follow Berkeley Talks wherever you listen to your podcasts. New episodes come out every other Friday. Also, we have another podcast, Berkeley Voices, that shares stories of people at UC Berkeley and the work that they do on and off campus.

[Music fades out]

Carol Christ: Good afternoon. I’m Carol Christ. I’m the chancellor at UC Berkeley. It’s my privilege and pleasure to welcome you to the second of this year’s 110th annual Martin Meyerson Faculty Research Lectures. For more than a century, Berkeley’s academic senate has singled out distinguished members of our faculty whose research has changed the trajectory of their disciplines and of our understanding. These lectures shine a light on an essential part of our mission, creating new knowledge. The curiosity and creativity that fuel the quest to learn and understand are at the heart of our commitment to making the world a better place through what we discover, what we teach, and the public service we provide. This year’s lectures represent the continuation of a treasured tradition that has recurred annually with one exception. In the wake of World War I and the influenza pandemic, the faculty research lectures were suspended in 1919.

In 2020, when virtual events were in vogue and Zoom kept us together, the lectures went on. Being selected to deliver a faculty research lecture is rightfully seen as a high honor. To stand out among peers who exemplify academic excellence is no small thing. For students, members of the campus community, and the public, this is a wonderful way to experience scholarly research of the highest caliber. Join me in welcoming the past recipients who are with us today. Professors, please stand when I read your name and let’s hold our applause till everyone is recognized. And if I don’t have your name but you came in a little bit after my little list was made, please stand, also. John Clarke, Marvin Cohen, Bill Dietrich, Catherine Gallagher, Martin Jay, Victoria Kahn, Thomas Laqueur, Anthony Long, David Raulet, Barbara Romanowicz, Nancy Scheper-Hughes, and our first lecturer this year, Francesca Rochberg.

So the two individuals chosen by our academic senate to give this year’s faculty research lectures are Francesca Rochberg, who spoke in February, and Jitendra Malik, who’s presenting this afternoon. Today’s lecture is also one of several taking place this spring that demonstrate the depth and impact of Berkeley’s faculty work on the broad and evolving topic of artificial intelligence, a field of study and practice that’s changing the world before our very eyes. For a complete schedule of these related talks and to view stories about our work in AI, you can visit berkeley.edu/ai. I should also note that today’s lecture is being livestreamed and will be available on YouTube from the Faculty Research Lectures website and from the Berkeley AI website. Now I’m pleased to introduce today’s speaker. Jitendra Malik is the Arthur J. Chick Professor in the Department of Electrical Engineering and Computer Sciences. He’s also a faculty member in Bioengineering and is a member of the Berkeley AI Research, Cognitive Science, and Vision Science groups.

Jitendra’s research group has worked on many different topics in computer vision, human visual perception, robotics, machine learning, and artificial intelligence. Several well-known concepts and algorithms arose in this research such as anisotropic diffusion, normalized cuts, high dynamic range imaging, shape contexts, and R-CNN. Over his 37 years at UC Berkeley, he’s mentored more than 70 Ph.D. students and postdoctoral fellows. Jitendra’s honors include the 2013 Distinguished Researcher in Computer Vision Award, the 2014 K.S. Fu Prize from the International Association for Pattern Recognition, the 2016 Allen Newell Award, the 2018 Award for Research Excellence in AI, and the 2019 Computer Society Computer Pioneer Award. He’s a member of the National Academy of Engineering, the National Academy of Sciences, as well as a fellow of the American Academy of Arts and Sciences. Please join me in welcoming Professor Jitendra Malik, who will speak to us about the sensorimotor road to artificial intelligence.

Jitendra Malik: Thank you chancellor for this very generous introduction and thank you to all my colleagues who have made time to come here. It’s my pleasure to talk on this very, very hot topic today. But I’m going to talk about natural intelligence first because we can’t talk about artificial intelligence without knowing something about the natural variety.

So we could talk about intelligence as having started about 550 million years ago in the Cambrian era when we had our first multicellular animals that could move about. So these were the first animals that could move and that gave them an advantage because they could find food in different places. But if you want to move and find food in different places, you need to perceive, you need to know where to go to, which means that you need to have some kind of a vision system or a perception system. And that’s why we have this slogan, which is from Gibson, “We see in order to move and we move in order to see.”

That sets off an evolutionary arms race because once some animals are chasing you, you are a prey, then you try to move faster, you try to develop camouflage. So the predator has to become extra efficient in vision or move faster and so on and so forth. This is true most of the history of evolution. Those have been the most important component of the brain of an animal, this ability to move and this ability to perceive. If we come closer to the modern era, so let’s say the hominid line, when we have separated from our other fellow primates. In the last 5 million years, you have this evolution of bipedalism, which frees the hand for tool making and tool use, the opposable thumb. And there’s this interesting quote from Anaxagoras the Greek philosopher which is, “It is because of his being armed with hand, that man is the most intelligent animal.”

It was really the development of the brain followed the development of the hand. I use that as a metaphor for manipulation, tool making, and all the rest. And then when we come to the most recent era, the last 50,000 years or so, is when we have modern humans coming out of Africa and language, abstract thinking, symbolic behavior, all that the common person in the street thinks of as intelligence. These are, if you want to think of the last 24 hours as the history of intelligence, then in the last three minutes, the last three minutes of that is essentially all this language, symbolic behavior, which we are so proud to call amongst us as a sign of intelligence. So now let’s turn to artificial intelligence. I cannot not mention the large language models, ChatGPT, GPT-3, -4, -5, and -17 to come in the future.

These are remarkable and we are going to have lectures in this series, I think on April 19th. We have one of the key creators of ChatGPT who happens to be a Berkeley CS Ph.D., John Schulman, who’s going to talk. So what I want to say is this result is a table of results on various standard tests. So for example, the Uniform Bar Exam, the current version of GPT does at the 90 percentile level. There are results for SAT Math, GREs, and so on and so forth. I’m sure our chancellor will be pleased to note that English language and composition still is at level two, so there is hope for our colleagues in literature and composition yet. Just a few things, right now we have, of course, that kind of technology in computing that has enabled us to train these really large models, the kind of data on the web like a trillion tokens, which were used.

But there are ideas which go back to the 1950s, the great Shannon, Claude Shannon, for people who are from electrical engineering and applied math backgrounds, and this linguist Firth who said, “You shall know a word by the company it keeps,” and that’s the core idea of how we train this. We delete a word and we try to predict it from its model. Now of course, Firth got forgotten and Chomsky became the ruler of the roost. And Chomsky was a nativist, but in my view, Chomsky was wrong and now we have proved that Chomsky is wrong. I don’t want to pick up a fight with my linguist friends just yet, maybe at the end. We see in these systems emergent syntax and linguistic competence, they’re like an associative memory of the web. Anyway, this is great success.
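As a rough illustration of the "delete a word and predict it" idea, here is a toy sketch that guesses a missing word from its two neighbors using simple counts. It is not how large language models are built; the corpus, the function names, and the choice of a one-word window on each side are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_context_counts(sentences):
    """Count how often each word appears between a given (left, right) pair."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words) - 1):
            context = (words[i - 1], words[i + 1])  # the "company" the word keeps
            counts[context][words[i]] += 1
    return counts

def predict_deleted_word(counts, left, right):
    """Return the most frequent word seen between `left` and `right`, or None."""
    candidates = counts.get((left.lower(), right.lower()))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical toy corpus.
corpus = [
    "the dog chased the cat",
    "the dog chased the ball",
    "the cat chased the mouse",
    "a dog chased a cat",
]
counts = train_context_counts(corpus)
guess = predict_deleted_word(counts, "dog", "the")  # fills in "the dog ___ the ..."
```

A real model replaces the count table with a neural network and a much longer context, but the training signal is the same: the surrounding text supervises the missing word.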

And contrast it with a great failure. Not yet a failure, but certainly not yet delivered success. We don’t have self-driving cars. I mean this is the thing that everybody in the street was expecting. This is an old idea. I mean this guy, Dickmanns, he had cars driving on the autobahns in the 1980s. In fact we had work at Berkeley in the ’90s with self-driving cars. Of course there’s a lot of hype and the great Elon Musk has told us in, I think 2019, he said, “By the middle of next year we will have over a million Tesla cars on the road with full self-driving hardware.” Well, that was 2019, then 2020, and we are 2022 and we still don’t have a million cars with self-driving cars on the road.

Now, this is something that like a high school kid of age 16, we give them 20 hours of training and we think they can learn to drive. And there is the Bar Exam, which we think is the result of years of training, and we are at the 90% level in AI. So there seems to be something fundamentally wrong here or puzzling here, and I can make it worse. Here’s a table I found of kitchen verbs, various activities like stirring and slicing onions with a knife and mixing things and so on. These are things that a 12-year-old can do and no robot today can. This is a very important thing for everybody to realize, that in artificial intelligence, we suffer from what’s known as Moravec’s paradox. Moravec’s paradox, and Moravec said it in the 1980s, it was sort of known earlier, “It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.” Steve Pinker slightly later said it very pithily, “The main lesson of 35 years of AI research is that the hard problems are easy and that the easy problems are hard.”

What the person in the street thinks of as easy is actually hard. What a child has managed to accomplish by age two. What, however, we think of as hard, the result of years of education, is actually not so hard or we have made progress on it. I won’t go on to Pinker’s quote where he sort of makes assertions about which jobs are in danger. But the good news, I’ll state, is he claims that the gardeners, receptionists, and cooks are secure in their jobs for decades to come. OK, so that’s good.

But anyway, my job is to try to challenge even their jobs. This comes down to, that’s why the title of my talk is Sensory Motor Intelligence, which is all that stuff, which in evolutionary time, was early. Perception, movement, things like that. How do we make progress on those problems, whereas obviously these later problems of advanced culture are not so hard. So there’s a question as to why, and this is debatable. I think Hans Moravec gave a reason, which I don’t think is fully right, but let me state his intuition, which is that we’ll have more difficulty in engineering to reverse-engineer skills that are the result of hundreds of millions of years of evolution. So perception and action was early in the process, and that’s hard. We don’t have anything which has the intelligence of a cat today. I think a better argument, at least from today when most of our advanced systems are based on machine learning, is that we lack the kind of digitized data on the web for training machine learning models.

So the text on the web, every book has been digitized, Wikipedia exists, there are all these blogs. So as a result, there are a trillion words on the web. That’s the kind of knowledge which these systems like ChatGPT and GPT-4 can exploit. This is not typically knowledge that a child acquires by age of five. You start reading books after you’ve gone to school. The challenges, the experiences that a child has had before age five, and those experiences are very personal and embodied and they do not exist on the web, at least not yet. I think that’s part of the challenge. It’s not that if they existed on the web we would know how to exploit them, but certainly they don’t exist on the web. So in the rest of this talk, I’m going to talk a little bit about our attempts on that challenge. The challenge of trying to build a two-year-old, not somebody who can pass the law exam at 90 percentile. This is going to be, I’ll show you work in robotics, mainly about legged locomotion and vision and things like that. And this is a robot. Can you hear it? Can you maybe raise the volume a little bit? Anyway, this is on the Berkeley Marina. This robot is struggling, but it manages to walk.

Here is another example. Notice the variety in terrain. The rocky terrain versus this leafy terrain, down slopes, et cetera, et cetera. And more. So this is the challenge. So the challenge is being able to do stuff, to walk around in this variety of terrain. And here’s, I think…

So this is the robustness of our motor control systems, that they can deal with this variety. I’m going to start by situating this work in some major intellectual traditions. So there’s an intellectual tradition of pattern recognition, which is what is behind most of the successes of machine learning. We have developed techniques by where we give lots and lots of examples and say these are examples of dogs. We cannot define the picture of a dog sort of axiomatically like a mathematician might, but we can give lots of examples of dogs or lots of examples of cats. And then we train these statistical machine learning systems which use those examples to induce what the pattern is. Generalization is the central problem. Not all dogs look the same, not all cats look the same, and we should be able to do that. In motor control, we have an additional challenge which is that we need robustness to disturbances. This is sort of a central aspect of control. You knock the system a little bit, it should still work correctly. And then there is adaptation. And I want to emphasize I’m using two different terms here. Robustness is typically to deal with noise. Adaptation means that there are very significantly different terrains and your system must work well under all of these conditions.

And this connects to… I will not have too many equations in case some colleagues get scared. The basic ideas here go back to the development of control theory from around the 1960s, both in the U.S. and in the Soviet Union. There were parallel developments and these were amazingly successful. These intellectual results from that era, they led, you know, John F. Kennedy made this statement, by the end of the decade we’ll have man on the moon. Well, the people who delivered were obviously the aerospace people, but also the control theory people because we needed to have these spacecraft move in orbits which were controlled and accurate. So that is an old tradition and that’s what we’ll draw on. The basic, if you want to know what that equation stands for, x is something like state, like what physicists would call positions and velocities and so forth, and ẋ is how the state changes. The additional thing here is u, which is about control, how we externally put inputs into the system.

So for a rocket it might be firing thrusters and so forth. And we choose to apply the right input u so as to have a desired result. I think, again, going back to that ’60s tradition is the concept of adaptive control. So this matrix A here, this refers to characteristics of the system. And those characteristics change. So in my example, that robot has to walk on different kinds of terrain and those create very different conditions for it to walk. The old examples were about aircraft, which in the course of a flight a lot of the fuel will get used up, so therefore the mass will be lower and you need to do the right things here. This is again a time honored problem, but I think it’s a problem which today with the tools of deep learning neural networks, we can revisit and we can make substantial progress on.
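The equation on the slide is not reproduced in the transcript. Assuming it was the standard linear state-space form ẋ = Ax + Bu, the sketch below shows why a changing A matters: the same input u produces a different result when the system's characteristics change. The matrix B, the drag model, and all numbers are assumptions chosen for illustration.

```python
import numpy as np

def simulate(A_of_t, B, u_of_t, x0, dt=0.01, steps=1000):
    """Integrate x_dot = A(t) x + B u(t) with forward Euler steps."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for k in range(steps):
        t = k * dt
        x_dot = A_of_t(t) @ x + B @ u_of_t(t)
        x = x + dt * x_dot
        trajectory.append(x.copy())
    return np.array(trajectory)

def drag_system(drag):
    """State is [position, velocity]; velocity_dot = -drag * velocity + u (unit mass)."""
    return lambda t: np.array([[0.0, 1.0], [0.0, -drag]])

B = np.array([[0.0], [1.0]])          # assumed input matrix: u pushes on velocity
constant_push = lambda t: np.array([1.0])

# Same input, two different "terrains" (low drag vs. high drag).
v_low_drag = simulate(drag_system(1.0), B, constant_push, [0.0, 0.0])[-1, 1]
v_high_drag = simulate(drag_system(4.0), B, constant_push, [0.0, 0.0])[-1, 1]
```

With the same push, the final velocity settles near 1/drag, so the two runs end at clearly different speeds. An adaptive controller has to notice that kind of change and adjust u accordingly.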

What we have done I think falls into that family and we will see the applications of that to robotics and I’ll show you a few in my talk. The key work that I’ll talk about in a bit more detail is what we call rapid motor adaptation for legged robots. OK how do we do this? Again, I want to give you a flavor of this line of work. It’s basically it’s reinforcement learning to train these control laws, which means that the robot has to learn how to move each of its joints. A traditional control theorist would write down the equations, write down the math, derive the equations, and that’s how they would do it. We will learn it. And the way you learn it is basically by trial and error. It’s called reinforcement learning, but that’s just a fancy name for trial and error.

And we need zillions of trials. And by the way, this is also true for human babies. Karen Adolph, who’s a researcher at NYU, she has done a lot of experiments showing that babies actually do a lot of trial and error and fall many times. It’s not that they just succeed in walking in like one go. So our robot has to do a lot of trial and error. Real hardware gets damaged, so you do it in simulation. What do we do? We basically don’t build in anything, we just have these objectives that we set. You must try to walk without falling, you try to have a desired velocity, you try to use minimum energy, something reasonable like that. So the jargon here is we have this physical simulator. Policy refers to what’s called the controller, the brain of the robot if you will. And it gets as an input the state.

The state here means, what are all my joint angles? And A here refers to actions, which means what action did it perform last time? So what was my previous action and what are all my joint angles? The policy gets that as input. It’s like keeping track of what’s your current state and your history. And then the action you command is, “OK, how should I change all my joint angles?” This is going to be learned by trial and error. Now it turns out that doing this is hard because of all the variability in these different environments that we might have. The policy that will work for walking on hard ground is not the same policy which will work for sand or slippery surfaces or upstairs or downstairs and so on. So what do we do? So what we can do is, in the simulator, we have prior knowledge of that. We know what the mass is, we know the friction, we can choose lots of variability in the conditions. So we must exercise the robot to walk in all these different conditions in simulation.

And then that gets encoded into this environmental factor encoder, which you can think of as another neural network. It captures that knowledge in the form of this variable Z. Z sub T, which is maybe eight dimensional. So this Z sub T is going to encode things like am I on flat ground? Am I on sloping ground? Am I on slippery ground? Things like that. And it’s all done implicitly. We are not choosing a meaning for those variables, it’s all done as part of this giant learning problem. You do this for lots and lots of trials and at some point your robot manages to learn how to walk. So these we call extrinsics. The reward is… What is important for walking? What’s important for walking is that you move but you don’t fall and you use minimal energy. If you think about a biological setting, animals use a lot of their energy in hunting for food or evading being hunted and so on and so forth. So this is important.
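The data flow described here can be sketched as follows: an encoder maps the simulator's privileged environment parameters to a small vector z, and the policy maps the state, the previous action, and z to the next action. This is a minimal sketch with random, untrained weights. Only the eight-dimensional z comes from the talk; the other dimensions, the network sizes, and the reward weights are assumptions, and the reinforcement learning loop that would actually train these networks is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Z_DIM = 8 follows the talk; the other sizes are assumed for illustration.
STATE_DIM, ACTION_DIM, ENV_DIM, Z_DIM = 30, 12, 4, 8

def mlp(sizes):
    """Create random (untrained) weights for a small fully connected network."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Apply the layers with tanh between them and a linear output."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.tanh(x)
    return x

encoder = mlp([ENV_DIM, 32, Z_DIM])                       # environmental factor encoder
policy = mlp([STATE_DIM + ACTION_DIM + Z_DIM, 64, ACTION_DIM])

def act(state, prev_action, env_params):
    """Simulation only: env_params (e.g. mass, friction) are known to the simulator."""
    z = forward(encoder, env_params)
    action = forward(policy, np.concatenate([state, prev_action, z]))
    return action, z

def reward(forward_velocity, target_velocity, joint_torques, fell):
    """Move at the desired speed, use little energy, and do not fall."""
    tracking = -abs(forward_velocity - target_velocity)
    energy = -0.001 * float(np.sum(np.square(joint_torques)))
    fall_penalty = -10.0 if fell else 0.0
    return tracking + energy + fall_penalty

action, z = act(np.zeros(STATE_DIM), np.zeros(ACTION_DIM),
                np.array([1.0, 0.5, 0.0, 0.0]))           # hypothetical env parameters
```

Nothing here assigns a meaning to the components of z; in the approach described, whatever z encodes is left to the learning process.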

So we can train this and in fact we trained it and this is our robot and it’s able to walk and it’s in simulation. It’s just that this robot, if we try to take it to the real world, I have a little problem. And the little problem is that these environmental factors are not known. So these are not available. So now I have a problem because I cheated. In my simulator, the simulator which you can think of as like a video game simulator, it’s capturing the physics. And in that I train in different conditions and I’ve got it right. Well, in the real world, how do I know which condition that I am a part of? So I’m stuck. So here is where the aha moment comes in. There is learning of the policy, how does the robot move? And then we can add a layer of meta learning. What is meta learning here? Meta learning here is observe your own behavior, and from that deduce something about the conditions you are in. So the intuition is something like this. If I’m walking on hard ground like this, I put my foot down, I lift it up, it works.

If I was doing the same set of movements on a sandy beach, what would happen is I would put my foot down, lift it up with the same force, but it wouldn’t come up. It’ll come up only partially. So the same actions that I apply in different conditions result in different outputs, and I can become aware of that. That is the signal to me that I’m in different conditions. So the same actions have different consequences and those can be my readout of what conditions I have. Note, by the way, that the robot that I’ve shown so far is actually blind. So all it has are its tactile senses, its proprioceptive sense, and so on. So that’s the key idea, this so-called adaptation module. I’m skipping details for obvious reasons. What it does is that it’s like that meta reasoning thing. It’s operating at one level above. It has access to the past history, so maybe the last 0.2 seconds, maybe the last one second. And in that one second, I commanded the actions, so I know what I did. Biologists called this the efference copy.

I have the states, which I know because I have sensors in my body which keep track of all the joint angles. From that history I can estimate this Z, which is this extrinsics, which you can think of as a proxy to the environment that I am in. And that’s it. And it turns out what matters is the discrepancy between the expected movement and the actual measured movement, and we continuously estimate these online. Here is a little… The next observation, which is that in fact this process itself can be trained. So we have two levels of learning, learning the basic policy and this meta policy. And the meta policy also gets trained in simulation because in simulation I can vary the conditions and then I can train this estimator which estimates which condition I’m in. Yeah, that’s this part. OK, so now hopefully not use too much of jargon here.
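A toy version of that estimator can make the idea concrete. In the sketch below, a single hidden "grip" factor scales how far each commanded effort actually moves the foot. In simulation the factor is known, so a regressor can be trained to recover it from the recent history; at deployment only the history is available. The one-dimensional grip factor, the fixed action sequence, and the linear least-squares estimator are all simplifying assumptions standing in for the learned adaptation module.

```python
import numpy as np

rng = np.random.default_rng(0)
HISTORY = 20
# Fixed sequence of commanded efforts (assumed), i.e. "the same actions" everywhere.
actions = np.sin(np.linspace(0.0, 2.0 * np.pi, HISTORY)) + 1.5

def simulate_history(grip, noise=0.01):
    """Observed displacements for the fixed actions under a hidden `grip` factor."""
    return grip * actions + rng.normal(0.0, noise, HISTORY)

# Simulation phase: conditions are known, so the estimator can be supervised.
train_grips = rng.uniform(0.2, 1.0, 500)
X = np.stack([simulate_history(g) for g in train_grips])
X = np.hstack([X, np.ones((len(X), 1))])            # bias column
weights, *_ = np.linalg.lstsq(X, train_grips, rcond=None)

def estimate_grip(displacements):
    """Deployment phase: infer the hidden factor from the observed history alone."""
    return float(np.append(displacements, 1.0) @ weights)

hard_ground = estimate_grip(simulate_history(0.9))  # foot moves almost as commanded
sand = estimate_grip(simulate_history(0.3))         # same commands, much less movement
```

The estimate is then what the walking policy would consume in place of the privileged z, which is why the same network can behave differently on concrete and on sand.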

Now we are ready to go and we have basically the same policy which runs in all of these conditions. So our slogan is “One policy to walk them all”. And why it works is because it is not the same policy to walk them all. That is one style of work. There’s what’s called robust. A robust policy might be one where you use exactly the same everywhere. No, what we want is something which is adaptive. It should do different things. I should walk differently on concrete versus walking on sand, and that I can do if I can estimate what conditions I’m in. And basically we have the machinery for that.

So these are just some kinds of results. OK, so let me give you another example to illustrate how this is happening by analysis of adaptation module. So, soon you are going to see Ashish Kumar, who is the student who was the lead author in this work. And this was work done in the COVID era, so I told them you can take the robot home. Basically he was sleeping with the robot for two years and great research resulted. And now let’s see what he does. So he’s going to pour some olive oil on this mattress. A waste of good olive oil, no doubt. And then you’ll see the robot soon. And if you notice he’s got little plastic socks on the robot, and now the robot is going to walk.

Let’s see this in slow-mo. So the robot has a default policy for walking on hard ground and it’s getting into trouble on the slippery surface where the friction is much too low. So what should happen? But it recovers. So what’s actually happening underneath the hood is that this meta thing, this adaptation module is taking over. The adaptation module is figuring out, oh, something must be wrong. So now we see this and what you see in the charts, the top rows correspond to the foot placement, and then the middle is the force applied, the torque applied at the knee. But these two curves correspond to two of the dimensions of this eight dimensional vector of state. So notice, so when it recovers, it has… What’s happened in the first phase is that it’s got the wrong estimates of the physical conditions, so it’s not doing the right thing. As soon as it’s got the right estimate of which environmental condition it is in, which is happening towards the right of that plot. So think of the two bottom curves. Then after that, it’s got the right Z, and then the policy is different, and then it works.

So here’s a different example of modification, which is that we throw a weight on this robot and this poor guy has to struggle. Now you’ll see it struggling, but adaptation and then it recovers. And it does it fairly quickly, like 0.2 seconds. So when we fall, we do not fall instantaneously. There’s a physical process and it takes like half a second, one second. And in that time, if you have estimated the conditions, you are good. Here are some examples where the top is our policy, the bottom shows what happened without that adaptation. So the Society for Prevention of Cruelty to Robots has prevented me from doing more experiments like this.

OK, now I’ll show you some more stuff. I think I’ve got through the meat, the core idea I’ve now done and now I’m going to give you many corollaries of this. So one is that this is the advantage of a learning framework, that many things emerge. I mean emergence is the big thing in learning. So here what we are going to do is we’ll just give the robot the assignment, go at 0.5 meters per second, go at 1 meter per second, go at 1.5 meters per second. Everything else is the same. It’s exactly the same machinery. And here’s what we discover. So what you see in these plots are the footfalls, so the solid parts. So it’s right front, left front, right rear, left rear. The solid part means that the foot is on the ground. So when it’s asked to walk at 0.375 meters per second, you see a gait which is like this slow walk. If you ask it to walk at 0.9 meters per second, then what you get is a trot. So for horses you have all these gaits, trot, gallop, canter, et cetera. And these were not programmed in, these just emerge. And then if you have set a really high speed, and now of course you notice that there are times when all four feet are above the ground.

So what’s an explanation for this? So this is what happens from learning, but as scientists, we want to know more. And it turns out that people had already thought of it in biomechanics. The explanation is energy. Energy matters and you want to be efficient energetically. So for example, for humans at low speeds, energetically, walking is efficient and at high speeds, running is efficient. For horses, you have multiple gaits. And these were studies done by taking horses and putting a bag around their nose and measuring oxygen and so on. There are particular phases in which this efficiency occurs. Exactly the same is true for our quadruped. Here are some fun examples of basically we now are just remotely telling it to go at different speeds and its gait changes smoothly. I think this is just meant to scare you.

Let me say, by the way, everything here was done blind. I’ve spent 35 years of my life studying vision, but our theory was blind people can walk, so certainly blind robots should be able to walk. Then why do you need vision? Well, you need vision in these kinds of conditions. You need to walk on stepping stones, on stairs, and so on. This is some work that I’ll show you. I’ll not get too technical. There’s a traditional technique for this, which is to build maps by combining information from multiple views. It turns out that view suffers from the effect of noise because every time you estimate the camera, it’s noisy, so the resulting map is way too noisy. And what we did was we developed direct policies. So direct here means that you have the visual data and you try to develop control.

So it’s like you try to make it into a reflex rather than a very conscious process of build a map, plan your footsteps and so on. I’m going fast, but the philosophy is very similar to what we did earlier. In simulation, you can train it with privileged information. So you can give the robot access to what the terrain is and it… Ignore all the symbols, please. It learns a policy for moving in this terrain. To deploy it, what we do is we put a camera on the head of the robot, and this camera gets you this kind of an image. With that image, what you try to do is to try to guess what the environmental parameters are. So it’s now going to be more about terrain geometry, whereas previously it was more about terrain friction, hardness, and the like. But roughly speaking, think of it as being the same philosophy. And then we can implement this. OK, blah blah.

Let’s see some demo. OK, so this is the robot. It has a camera on its head. It has no advanced knowledge of anything. It does not know the terrain in advance. OK, so different examples. So we have to improvise some obstacles. OK, so these are stairs. These are fairly tall stairs. Let me see if I can speed this up a bit. This you might recognize. This is the rose garden in Berkeley, and we were convinced that this would not work. Will it? Wow. OK, so these are stairs. OK, anyway, you get the idea. So this is a remarkable success because this is not something which any other system could do effectively. So this paper won the best prize at the robotics conference last December. I will see if this video works. So this is work with a colleague in mechanical engineering. Let me see if this works. OK, so this is a biped. Bipeds are harder, quadrupeds are more stable. OK, I think I’ll move on. Let’s see if I move it, I’ll speed the video a bit. So someday these robots will carry loads for you.

This general idea is applicable not just for locomotion but also for manipulation. So I’ll show you an example where we have a hand, and again, this hand is trying to twirl different objects. Interesting point here is that there are all these different objects of different sizes, shapes, weights, friction, and so on, and it succeeds. It turns out it’s the same basic philosophy, what we call rapid motor adaptation, which is you perform some action, you see the consequences of the action, and you use this to, at the meta level, figure out what conditions you are in. I will skip this.

I wanted to make a philosophical point here. How to think about intelligence? So I have the last five minutes, so I want to get to that. This is a beautiful quote from Alan Turing, who’s considered the founding figure of computer science.

He said this in this 1950 article which proposes the Turing test. Everybody reads that article, but never goes to section six, which is where this occurs. “Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate course of education, one would obtain the adult brain.” Turing was 1950, but since then, we know a lot more. Our colleagues in psychology and neuroscience and child development have told us a lot about how children learn. So children’s learning, my colleague Alison Gopnik has this phrase, “the scientist in the crib.” The child is doing these experiments, putting things into her mouth, has multiple senses: touch, vision, hearing, et cetera, and then they are cross coupled. And then that child eventually, in that first year, has become essentially a sensory motor genius.

And then at some point starts to acquire language, and then of course goes to school, and then of course then the kind of learning which is embodied in GPT-3 and GPT-4 takes off. But it takes off on a base of this more elementary sensory motor learning. And my belief is, and this is for now an ideological belief, that this is the pathway for artificial intelligence as well. That we need to copy that. This is an article from Linda Smith and Michael Gasser who are two psychologists, that these are aspects of how children learn. Multimodal sensing, touch, audition, vision, incremental, we build on our past knowledge. Physical, it’s not the brain in a vat, it’s not the mind in a vat, it is interacting with the world, embodied. Explore, social, we learn from others. And then of course we do use language. So I wish to fight against the tyranny of linguistic imperialism.

In the last two minutes I’ll talk about a project with Antonio Loquercio, who’s here, and Ashish and myself. And in this one, we tried to copy the idea of childhood learning, which is cross-modal. And we said, here’s a robot which has got a camera, but it’s going to learn in the real world how to use its vision system. So it starts out blind, so it’s struggling. And then with a vision system, of course you can climb stairs and you’ll see that.

But how do we train the vision system? We wanted it to learn in the wild. So here was our intuition. If you think of a robot on stairs, its proprioception, its senses, its joint angles can let it compute what’s the depth of its left leg and right leg and so on. It has that geometry from its joint angles, from its internal state. So can we use it for training? So the idea was the proprioception predicts the depth of every leg and the vision system gets an image. What we asked the vision system to do is to predict what the depth will be 1.5 seconds later.
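
The pairing of images with later proprioceptive depths can be sketched in a few lines of Python. This is a minimal illustration, not the project’s actual training code: the function name `make_shifted_pairs`, the list-based log format, and the choice to take the first reading at or after the target time are all assumptions. Only the 1.5-second horizon comes from the talk.

```python
from bisect import bisect_left


def make_shifted_pairs(timestamps, images, foot_depths, horizon_s=1.5):
    """Pair each image with the depth sensed by the legs horizon_s later.

    timestamps  -- sorted sample times in seconds (hypothetical log format)
    images      -- camera frame recorded at each timestamp
    foot_depths -- depth computed from joint angles at each timestamp
    Returns a list of (image, future_depth) training examples.
    """
    pairs = []
    for i, t in enumerate(timestamps):
        # First logged sample at or after t + horizon_s.
        j = bisect_left(timestamps, t + horizon_s)
        if j == len(timestamps):
            break  # no future reading yet, so no label for this frame
        pairs.append((images[i], foot_depths[j]))
    return pairs


# Toy log sampled every 0.5 s.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
frames = ["img0", "img1", "img2", "img3", "img4", "img5"]
depths = [0.0, 0.0, 0.1, 0.2, 0.3, 0.4]
training_pairs = make_shifted_pairs(times, frames, depths)
```

The last few frames of any log get no label, because the robot has not yet walked far enough to feel the terrain they show.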

That was the idea, that you just shift what signal it will know 1.5 seconds later and use that to do this advanced prediction. So we have this robot, which is learning day by day. So in the first day it’s clumsy, the second day it goes up further, and then on the right you see the success rate. And then finally on the third day you will see that it actually… Will it make it all the way? It makes it all the way. And we can do experiments like we can mess up its vision system, the ability that humans have that we are calibrating all the time.

So Antonio is going to rotate the camera and that’s going to make it so he’s messed it up. It’s like taking your eyes and moving them 30 degrees. But because it is always learning, it’s always adapting, it’s going to initially, it will struggle, I hope. I don’t hope, but it will. And then it recovers because it’s learning from its own experiences. Let me play this one. So this is before adaptation and then this is after adaptation, and adaptation is just a few minutes. So this is called the prism test in vision science and humans can do it in like 10 minutes. Well, so can a robot. I hope I’ve taken you on this little journey of some of what I think is potentially interesting work in sensory motor learning.

I want to conclude with acknowledging some people. Acknowledgements. First, I want to thank all my past and present Ph.D. students. This work is really due to them. I am merely a spokesperson on the stage. And of course all my past and present postdoctoral fellows. Thank you, thank you very much.

And then I want to say something about my colleagues, particularly in the EECS department with whom I’ve spent 37 wonderful years, starting with a lab. I started as a 25-year-old assistant professor, and I think Berkeley was the best place for a starting assistant professor. It probably still is. I’m very thankful for the opportunity that was given to me. I’m thankful to my colleagues in the rest of the university from whom I learned so much in vision science, cognitive science, neuroscience. I learned all about the brain, perception, and action from them. I want to thank my current colleagues in BAIR, with whom I have such a fun time, and the wonderful Angie who keeps the place running. And finally, of course my wife, Isha, and my son, Kabir, for so much patience and support. Thank you very much. Thank you.

And we have a robot which we are going to skip. Oh, outside. He’ll show it outside. So after the talk, people can check it out, but I’m happy to answer questions. Yes?

Audience 1: Yes. I’m wondering whether there are any limitations… Thank you for the great talk by the way. Are there any limitations between the time that you need to learn a task and the time that you actually want or need to perform it to be efficient? You see what I mean? Because there’s two time scales.

Jitendra Malik: So I distinguish between two kinds of learning. So we do this learning in simulation, which takes a very, very long time. But since it’s done in simulation, it’s totally safe and it can have a huge complexity. So there are no limits there, we can train forever. Then there is the learning which you do in the real world. That has to be very rapid. So the initial adaptation that I showed, that was on the order of 0.5 second or 0.3 second, that kind of thing. If it is longer than that, you’ll fall. But we are adapting all the time. If I sprain my ankle, I can walk, and that’s adaptation. So that’s the time constant that I need. It needs to be fast, but not instantaneous.

And then I talked about this recalibration that we have, which is of our entire sensory motor system. It happens when we wear glasses, it happens when we grow and things like that, that you have a bit more leeway. But in our system we showed it could be done in a few minutes, which is roughly comparable to what humans can do for this effect. It’s called the prism adaptation. So I don’t have a generic answer for all things, but I’m just showing you for these systems what it is. It’s important. The timing is important. Adaptation has to be quick enough, otherwise it won’t work. Question? Yeah, I think the…

Audience 2: Thank you for your talk. The title of your talk kind of implies that there are multiple roads to artificial intelligence. What would be the second road?

Jitendra Malik: Well, the obvious road that people are taking right now is the road from linguistic data, from all the words on the web. That’s how you train GPT-3 and ChatGPT. And I am saying that captures a certain kind of knowledge, but it is an incomplete story when we want to talk about intelligence as natural intelligence goes. As an engineer, you can totally use that incomplete system in some way. But I mean, I’m still striving after human level intelligence in all its facets, and for someone like me, that is not enough. I want that intelligence which is embodied and grounded in the real world, which must start from that experience. Yeah, I don’t know who’s… Yeah. OK. I think then there’s a question here in the front.

Audience 3: Yeah, thank you. Thank you very much for a great talk. You started your talk with autonomous vehicles and I think recently in the news was a situation where an autonomous vehicle in San Francisco was in an intersection and there was another car doing donuts in the intersection. And the problem was that the autonomous car hadn’t seen the situation before, so there were no data in the simulation and the car wasn’t able to deal with the situation. My question is how to deal with safety critical situations, because in such situations, for example, the robot has no rights to engage in the situation and the human can reason about the situation based on other knowledge?

Jitendra Malik: Yeah. So the question was really about situations which are not seen in training. So this comes down to the emergence issue. What abilities emerge? So humans certainly have, I mean, I’m driving on the road and there’s this truck and some junk falls from the back. If it’s newspapers, I will drive over it, if it’s rocks, I’ll switch to a different lane, right? Because we know something about the properties of rocks versus newspaper. Now that is connected to our vast, wider knowledge. So the article of ideological belief among AI researchers is that with enough knowledge, these abilities will emerge. It remains to be proved. Certainly the more narrowly trained systems that we have today don’t have it. But that’s one answer.

The second answer is that we may want to explicitly have what’s called System 1 and System 2. So this is terminology from Daniel Kahneman. System 1 is more reflexive, System 2 is this deliberative, logical conscious system. By the way, there is our dog walking. It can do fancier things than walk on the ground, but at least it can walk on the ground. But a colleague here had a question. I think he had a question over there.

Audience 4: Oh yes, thank you. So you were using the infant learning for this adaptability of your AI, but if you take, say, a very simple animal, perhaps it’s not a simple animal, like an ant. The ant can walk and do all that ants need to do immediately from hatching. So I’m wondering-

Jitendra Malik: Yeah, so his question is a fundamental and deep one, which is that there are certainly species which don’t need all this learning period. Alison Gopnik has an article on this. So this is the thing about precociality. And what you get is a very hardwired system, which is not so adaptive, but it can work on day zero. Human babies don’t have that, which gives us much greater adaptability. So the price you pay for that is that there’s a period of vulnerability.

For a period of 10 years, the human child cannot survive by himself or herself. That’s the price you pay. But as a result, you get a system which is much more adapted to the world. And somehow in evolution, different species have made these different trade-offs on this dimension. The training we do in simulation, I like to think of it as being somewhat akin to the process of evolutionary search. I mean, as a species we have certain knowledge which is encoded in our genome, and that has come from some 550-million-year search process. So I want to distinguish between the learning that occurs over that time versus the learning that occurs in the life of an individual. I mean, the world keeps changing, right? And that is what we get from this. Thank you. (Applause)

[Music: “Silver Lanyard” by Blue Dot Sessions]

Outro: You’ve been listening to Berkeley Talks, a Berkeley News podcast from the Office of Communications and Public Affairs that features lectures and conversations at UC Berkeley. Follow us wherever you listen to your podcasts. You can find all of our podcast episodes with transcripts and photos on Berkeley News at news.berkeley.edu/podcasts.