Sergey Levine on The Robot Brains Season 2 Episode 1
Transcript edited for clarity
Pieter Abbeel: Sergey, so great to have you here with us. Welcome to the show.
Sergey Levine: Thank you, Pieter. And thank you for the perhaps overly generous introduction. But I appreciate you having me here.
Pieter Abbeel: I think it's hard to be overly generous with everything you've accomplished and are continuing to do. Now, before we dive into the things you're working on today, I'd like to take a step back to your PhD. As I understand it, you were doing your PhD, and right in the middle of it, the ImageNet moment, a.k.a. the AlexNet moment, happened, where deep neural nets were proven to be the clear best approach to computer vision, and promising for everything else from there. A lot of people were thinking about it, of course. And I'm curious: how did that affect your PhD work?
Sergey Levine: I don't think I was actually sufficiently plugged into the machine learning community to be too aware of that. I mean, I heard about the paper and all that, and I saw Alex Krizhevsky talk. But I actually became much more aware of all that stuff when I took Andrew Ng's course. He taught a graduate seminar course, basically, on deep learning; I think this was in 2011. And I got really interested in that stuff, because up until then I had been working on nonlinear function approximation for character animation, but nonlinear function approximation using Gaussian processes. I had figured out pretty early on in my PhD that having hand-engineered features for control was very limiting, and I wanted to find some way around that. I picked Gaussian processes because they were cool and had lots of math, and it was kind of fun to mess around with them. So, conveniently enough for me, Andrew Ng taught his course right before the Krizhevsky paper; it was around the same time. And I got pretty excited about that. I actually tried to set up a bunch of systems for character animation using deep neural nets. They didn't really work, mostly because at the time the primary way to do that sort of generative modeling was restricted Boltzmann machines, which are, again, mathematically very elegant, but a little tricky to get to work; I had to spend a lot of time wrestling with MATLAB code. This would have been at the end of 2011, and it was published in mid-2012. It was basically the first version of guided policy search, which was essentially policy gradients with importance sampling, conceptually somewhat similar to what later became PPO. It could do animation with neural nets for humanoid characters.
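(For concreteness: the core technique Sergey describes here, a policy gradient with importance sampling, can be sketched in a few lines. This is a minimal illustrative sketch, not code from the original guided policy search work; all numbers and variable names are made up.)

```python
import numpy as np

# Hypothetical quantities from trajectories collected under an *old* policy:
# log-probabilities of each sampled trajectory under the old and new policies,
# and the total reward (return) each trajectory obtained.
logp_old = np.array([-12.0, -8.5, -21.0])   # log p_old(trajectory)
logp_new = np.array([-11.0, -9.0, -17.5])   # log p_new(trajectory)
returns  = np.array([  4.0,  1.5,  -0.5])   # empirical returns

# Importance weights correct for the mismatch between the policy that
# generated the data and the policy being improved.
weights = np.exp(logp_new - logp_old)       # p_new / p_old

# Surrogate objective: its gradient with respect to the new policy's
# parameters is the importance-sampled policy gradient. PPO later built
# on the same ratio, adding clipping to keep the update safe.
surrogate = np.mean(weights * returns)
print(surrogate)
```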
Pieter Abbeel: So in his course, Andrew was anticipating the substance of the breakthrough by a year. And that inspired you to get going a year early.
Sergey Levine: It probably inspired a lot of folks in that class to get going a year early.
Pieter Abbeel: Yeah, that's really interesting to hear. In Season One of the podcast, we talked a lot about deep learning, but mostly in the context of supervised learning. And what you're describing here, of course, is not exactly supervised learning anymore. It's deep learning for decision-making: essentially, reinforcement learning. Maybe explain the difference between the two, and why reinforcement learning can be an additional interesting thing to learn about, in addition to supervised learning.
Sergey Levine: If we think about a lot of the really exciting things that we've seen from large, powerful neural net models, like GPT-3 or things like that: these models seemingly make inferences about how the world works. You can ask them, okay, if I drink a cup of tea, what will happen? If I drink a cup of poison, what will happen? Answering 'what will happen' involves making a kind of guess as to the causal structure. But ultimately, what they're trained to do is prediction. A computer vision system is strictly doing prediction about, let's say, image labels. A language model is trying to make more sophisticated predictions about what will be said next, but it's still fundamentally a prediction problem. Reinforcement learning is a mechanism that we can use to train models to maximize utility. In a sense, there's a major component of the AI problem that is missing if all you're doing is prediction. But once you can train your system to maximize utility, to take actions to actually accomplish its desired goals, now you've got at least all of the moving parts that in principle you would need, and you just have to figure out how to actually put them together in the right way. So that's the really big difference. Now, of course, there's a lot more that goes into actually making reinforcement learning work. For example, you could have active reinforcement learning where you have trial-and-error learning, you could have offline reinforcement learning from data, you could have actor-critic algorithms, model-based algorithms, etc. So there are a lot of components. But the really big fundamental difference is between learning to predict versus learning to choose actions to intentionally accomplish your goals.
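(In equation form, the distinction Sergey draws is the textbook one, added here for concreteness: supervised learning fits a predictor to a dataset, while reinforcement learning optimizes the expected utility of the trajectories its own actions produce.)

$$\text{Prediction:}\quad \min_\theta\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f_\theta(x),\, y)\big]$$

$$\text{Reinforcement learning:}\quad \max_\theta\; \mathbb{E}_{\tau\sim p_{\pi_\theta}(\tau)}\Big[\textstyle\sum_t r(s_t, a_t)\Big]$$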
Pieter Abbeel: And now, what kind of new applications does that open up?
Sergey Levine: Yeah, so I think there's actually a fairly complex answer spanning what we want those applications to be, what they actually have been, and what they might be in the future. In terms of actual applications of reinforcement learning in the real world, there's a very good chance that anyone listening to this has interacted with reinforcement learning agents through things like recommender systems and advertising, and so on. It's actually a really big thing on the web for suggesting content to you, or suggesting advertisements. What I work on revolves around taking these reinforcement learning ideas and actually trying to situate them in the real world: can we make AI systems that learn and interact with reality in a way that's a little bit analogous to how we do it? So that's where robotics comes in. That's where other embodied systems, like autonomous vehicles, come in. Reinforcement learning is not yet a major presence in robotic systems that you actually use and buy, but I think it's right there on the cusp. In terms of scientific experiments, or R&D at companies that are actually working on applications, there's a lot of that starting to show up. For some of the work I've been involved with at the Alphabet company formerly known as Google X, they've already announced they have an entire group working on trying to apply reinforcement learning. So there's a lot happening, although it's right there on the cusp. It could be that once we develop powerful RL systems that are sufficiently scalable, that can utilize data and also the right amount of human expertise, we could apply them even more broadly. In many of the settings where we're currently approaching problems as prediction problems, we might start approaching them as RL problems: for instance, using RL to build a system that recommends a course of treatment to a doctor for a patient, or perhaps regulating the interest rate or the tax rate based on a reinforcement learning algorithm. So conceivably, many things we currently use AI for as prediction might in the future be approached with reinforcement learning, but that's more speculative.
Pieter Abbeel: Now, when you were working on your PhD, you were working on reinforcement learning in the context of graphics. And then you made a very deliberate decision to expand, and possibly even to emphasize robotics more than graphics from there. How did you make that decision?
Sergey Levine: Well, I made the deliberate decision to come work in your lab. Until that point, I had worked entirely on computer graphics. In fact, you were the one who told me that you wanted less animation and more robotics, so I think I'm going to blame that one on you. The truth is that there's something much more interesting, and perhaps much less limiting, about real-world embodied systems than virtual ones. This is a somewhat high-level statement, but as AI researchers, we often like to think that the hardest part of the problem is making the computer brain: how do you get the computer to reason and then make decisions? But there is also the rest of the world, and creating the rest of the world is hard, too. If you're going to study AI systems that are embodied the way that we are, but in a simulator, then at some point you need to basically create the universe. And we can't create simulated worlds with the kind of richness and complexity of the real world. I think at this point in the development of AI, we can make progress in simulation, but at some point we're going to hit a wall. We're going to get to a point where actually making the world rich and complex enough is holding us back. It might not be holding us back technologically; it may be more a matter of effort. People don't appreciate the difference between simulated and real environments. There's the old adage that if your system is not situated in an environment that demands intelligence, then it will not have intelligence: regardless of which algorithm you equip it with, it won't do something smart, creative, or emergent if that's not required by the setting you put it in. So I think for that reason the real world actually affords a lot of opportunities for making meaningful progress in AI. A lot of open-ended learning can be done in real-world settings, and studying diversity and variability becomes easier in the real world than in simulation. If I want to, for example, train my robot to grasp various objects, then in simulation I have to put a lot of effort into content creation, whereas in the real world I can just drive down to Costco, purchase a bunch of junk, and throw it in front of the robot. So I think there's actually a lot to be said for learning in the real world.
Pieter Abbeel: Now, I think it's a really interesting distinction you're making here, Sergey. You're saying it's specifically learning in the real world. In the past, I might have said learning with a real robot, but you've really shifted that conversation: it's not just about the real robot. It's about the real world being so much richer, because the robot in the lab might still be in a pretty boring environment. But once you take the robot into the real world, it's the real-world part that matters. And one of the projects that really stands out here, that you're working on right now, is a mobile robot that is actually roaming in the real world, collecting data on its own. Is that right? Can you say a bit more about that project?
Sergey Levine: Yeah, for sure. So we actually just presented the first phase of that work last week. We took a really simple, really cheap mobile manipulator, basically the LoCoBot developed at Facebook: a low-cost arm on a little TurtleBot base. And our aim was not to see if we could do something particularly complicated, but to see if we could get it to do something a little closer to a lifelong learning experiment. Oftentimes, when we run robotic learning experiments in the real world, we're very strapped for time; there's a lot of manual effort that goes into it, so we worry a lot about sample efficiency. And I think when we do that, we end up making trade-offs that are perhaps not ideal for final performance. So in an experiment, we would say, okay, we can realistically only run experiments that are four to six hours long, because you just can't have a person in the lab overseeing this for much longer than that. They need to go to the bathroom, they need to eat lunch; you can't have these things go on forever. So then the problem really becomes: what kind of thing can you learn in six hours? We said, okay, what if we just put in the effort to make it as fully autonomous as we possibly can, to remove the need to have a person there, and just let it loose and see what it can do? And we picked a fairly simple task; we weren't trying to do anything complicated. We really just wanted to see: the more it runs, does it keep getting better and better? Because in principle, it really should. So we set it up with essentially a practicing strategy. The job of the robot was to basically clean up a room: putter around, pick things up, and put them away. And it had a practicing strategy where, if it had picked everything up, it would go and put things back down, and so on. And there was some engineering that went into making sure it didn't get stuck in corners and things like that. But once we had done all that, we could basically switch it on, and as long as the battery held out, it would just keep doing its thing. So we just kept running it and running it. And what we saw is that this kind of scaling does hold. You train it for 20 hours and it's kind of okay; it's on par with a good scripted strategy for cleaning up the room. But you run it for 40 hours, and it actually keeps going. When we plotted the curve of success rate versus hours, it's actually a straight line. So it's not even slowing down; it just keeps going and going and going. And by around 60 hours, its success rate is in the 90% range, or something like that, and the slope is seemingly still going up. At that point, we had to vacate the room where it was practicing, because this was during the lockdown and everyone started moving back in. So we don't know what happens after 60 hours. But to me, that was a pretty nice validation that if you actually get learning to be autonomous in the real world, the thing actually works: it actually keeps getting better. And maybe we should find a conference room that is not occupied for 100 or 120 hours, because, you know, it just keeps going up.
So for that reason, I'm actually focusing on enabling not just better RL algorithms, but more autonomous RL algorithms: ones that you can drop into the world, switch on, and kind of leave to their own devices. They will not only learn but also scaffold their learning appropriately by setting themselves up to practice, resetting, all that stuff. Because I think that's actually a really powerful technology, if we can make it work, because it enables the thing to just keep getting better and better.
Pieter Abbeel: Now, when you say the robot's moving around the conference room, cleaning it up: how about expanding it to more rooms? How about, you know, if you had one in your room at home? What if your students had one at home? Is that reasonable?
Sergey Levine: Yeah, so that's actually what we're trying to do now. It's definitely reasonable. It's difficult. When we start moving in that direction, it does introduce additional learning challenges, but also additional engineering challenges, fairly mundane ones, ones that companies like iRobot have already somewhat solved with their platforms. It's tough, but it's definitely doable; we just have to put in a lot of legwork to get it done. And it's got me thinking about safety and robustness issues. They're important. They're also a little bit of a bootstrapping problem, in the sense that you kind of have to solve them once, and then you can let the thing loose. And that gets me thinking that maybe, in the long run, a lot of that robustness will actually come not from us engineering the system very carefully; it might actually come from the system having a notion of common sense developed from its own prior experience. We can't do that right now, because the robot has very limited prior experience. But perhaps it is, in effect, a one-time cost: once it gets enough experience to know well enough not to fall down the stairs, then we won't have to put lots of effort into engineering those things manually anymore. But the first time around, we have to put in something.
Pieter Abbeel: Now, one of the things that always intrigued me was that before you started as a professor, you took a gap year, and you spent it at Google. And you essentially decided to tackle the problem of grasping: a robot reliably picking up an object, any kind of object. A very hard problem, of course. But you decided to tackle it in a way that nobody else was tackling it at the time. At the time, it was tackled with analytic approaches, careful geometric calculations, and robustness measures; your approach was completely different. Can you say a bit about that? And what made you think that this would actually be possible?
Sergey Levine: Yeah, good question. But for full disclosure, I do want to clarify that there were at least two other people who were thinking about the problem the same way. I think I need to credit this partly to Ilya Sutskever, because he was the one who pushed me to think hard about this problem: what is the thing that deep learning could do in robotics that would register as something as grandiose as the ImageNet moment? The point was to find something where the problem was contained and well-scoped enough, yet relevant enough, and ripe to be significantly improved with deep learning. And when he asked me that, I was thinking, well, okay, this manipulation stuff that I'm doing is cool, and it's probably pretty hard. I believe in it a lot. But realistically, it's a ways off. On the other hand, robotic grasping is something that we can scope very cleanly. You have one job and one job only, which is actually picking up the thing. It's broad enough that everybody wants it, and at the same time, there are many ways to reduce it to a reinforcement learning problem. There are even ways to reduce it to just a supervised learning problem. So it seemed like one of those things where it's not too bad to collect a very large dataset: broad enough that everyone wanted it, but at the same time narrow enough that we could scope it very cleanly. So that seemed like a really good problem statement that we could actually address at scale, to really try to understand whether the scalability that we saw from deep learning methods in computer vision and NLP and so on could in fact extend to robotics. Because, again, circa 2015, that kind of broad generalization had not been demonstrated at that scale. So there was the technological challenge of coding up the algorithms, but there was also a major data collection challenge. There was kind of a transitional period at Google at the time, where they had purchased these different robotics companies, integrated them into a single unit, and were transitioning them over to a different format, basically trying to figure out what to do next in the robotics world. And one of the things that happened is that there were quite a few robotics platforms they had built that were essentially sitting around unused in a warehouse somewhere in Redwood City.
So we got one or two of these robotic arms in the lab at Google Brain research. And I asked my colleagues Mrinal Kalakrishnan and Peter Pastor, 'Hey guys, these things we've set up are pretty cool; they're pretty nice arms. Do we have any more of them?' They poked around and said, yeah, we've got like 30 more of them, and nobody needs them. So I was like, okay, you've got 30 of these nice robotic arms sitting around. Can we just grab a conference room (you'll notice there's a pattern to all these projects: vacate a conference room), can we grab a big conference room, set them all up, and essentially have a data generation factory, in effect? Because if we could set that up, it would be way easier to do this grasping thing, since we could just keep running them for days and days and days. And there are like 30 of them, so even if we mess up and break a couple, it seems like nobody needs them, right? So we can just go and do this. Though it wasn't quite that easy, actually, because it turns out the cost is not actually the arms; it's the people who put them there and do all the work. And I couldn't do that all by myself. So Peter and Mrinal helped a ton with that. But then we had to get a lot more support with things like getting an electrician to come in and wire the thing. So it was a little more of an undertaking. Jeff Dean was very supportive; he put in a bit of effort to get the organizational stuff in place. Vincent Vanhoucke as well. We were fortunate enough to get support from the higher-ups, and essentially got lucky and had this big pile of robots drop into our lap. And we were actually able to collect this huge dataset; it took us months to do. But in the end, we did get very effective generalization from that. And beyond that, one of the things that I experimented with a lot in that project was not just generalization in the sense that you can throw a new object in front of the robot and it will pick it up, but also how it reacts to different kinds of objects. Because, as you said, prior to that, grasping was typically thought of as a kind of geometry problem: you figure out the shape of the thing, and then figure out where you need to put the fingers to pick it up. Whereas this was experiential; it was learned from experience, from trials. And then we could ask: what does the system learn differently than a more geometric system? And there were a few hints, some of them good, some of them bad. One of the things that really frustrated me is that we had a test set of unseen objects; we actually bought new ones to make sure there wasn't anything that had been dropped in front of the robot by accident previously. And one of the objects was a stapler. And, okay, staplers are fine; they're these rectangular, blocky things. But this one was pink. And it would never pick up the pink stapler correctly. It kept trying to shove the fingers into the stapler. And we think the reason is that during training, we had quite a few clothing items. But the clothing items had to be small, because they needed to fit in the bin in front of the robot. So they were baby clothes.
And baby clothes tend to be these soft pastel colors, kind of pink, kind of light blue, that sort of thing. So the robot probably concluded that this pink stapler is like those other pink things it's seen before and thought 'baby clothes.' So it kept trying to shove the finger into the stapler. We got inspired by that, so we started messing around with different objects that look roughly similar but have different material properties and things like that. And, except for these false color correlations, most of the time it would actually figure out that softer things are ones you should pinch, kind of shoving the fingers into the middle, whereas rigid things are ones it should pick up on both sides. And in the video, there's a demonstration where I intentionally pick out a bunch of blue objects that are roughly similar in shape and size but have different materials, and show that it adopts different strategies, except in the confusing case of the stapler.
Pieter Abbeel: Now, it's very interesting that you bring this up, because it's all in the data, in some sense, right? The data was driving the robot to believe that anything pink could be handled with a fixed strategy of pinching, whereas it turns out there are other pink things it had never seen in its data. And once you present it with enough of that new type of pink object, it'll learn, of course. Another thing that really stood out to me from that work, Sergey, and you're still working on this today, right, because it's a hard problem to truly reliably pick objects: the current incarnations actually have strategies, in some sense. Meaning, when I watch a video, I see the robot not just deciding where it's going to go; it's also deciding exactly what it's going to do. The robot might decide to push something and corner it, knowing that it's easier to pick it up that way, or pull something out of the way from other objects to have an easier way to pick it. And it seems like those are the kinds of things that are truly behaviors. I mean, that's not a one-time decision; it's a full behavior for generating grasping capabilities. And that, to me, is really interesting. That's really reinforcement learning and what it can give us, whereas a supervised learning approach might not lend itself to discovering these behaviors.
Sergey Levine: Yeah, that's a very good point. And I think, more broadly, as AI researchers, and this is probably something that holds true not just for me but for many of us, one of the most exciting things is when we see the machine figure out something that we didn't expect, hopefully something desirable. That's really exciting, because it sort of gives this glimmer of hope that it's not just about getting something that does the same thing that a person does; it's actually about getting something that has a degree of creativity. Like, one of the most exciting things from the AlphaGo match was that one move where the experts looked at it and said, oh, this is not a move that we learned about when we learned to play Go; this is something new. With these robotic systems, somehow for me it was always easier to notice those things when I'm actually there watching it do stuff; it's a little hard to communicate in a paper. But there's definitely some of that that happens. I remember when I was working on a project on kind of a continual version of this grasping system with Ryan Julian and Karol Hausman. Ryan actually tried an experiment where he intentionally set up some more difficult objects; he set bottles in front of the robot. And what it did is it went and knocked down every bottle so that it was on its side, and then it picked the one that was easiest to pick out of the bunch. So it would do these kinds of funny things. Another one we had, this was actually a simulated experiment but a similar principle, where the task was to take a bunch of blocks and insert them carefully into fitted slots. And what the robot would do is, instead of actually picking up the blocks, it would just sort of sweep the arm across the bin to shove them all, so they would land somewhere close to the fixture, and then it would go in and just insert them locally. So it would figure out these strategies. And you really need the right level of scale for that to start emerging. Because if you have very small-scale experiments, or things are very tightly set up so there's really only one way to do the task quickly, then you see a lot less of that. But if you have a larger-scale experiment, where you have lots of different objects and lots of variability, then there's a lot more room for that kind of emergent stuff to happen.
Pieter Abbeel: Yeah, and in some sense, when I think about it, I feel like what we're doing at Covariant is inspired by your work. Our conclusion was that if we really want to keep collecting data, more interesting data, one approach is to find more and more spare robots and keep growing your data collection setup. Our thinking, in some sense, was closer to some of the projects you're pushing today: if we can build something that usefully grasps in the real world, people will want it, people will buy it. And so we can deploy as many robots as we want, because people will actually pay to get these robots, and then they'll keep getting better and better at helping out more wherever they are.
Sergey Levine: Yeah, I think that's a really good point. And I do suspect that, to some extent, robotic learning is one of those technologies where it won't look the same at a small scale as at a large scale. There's a significant bootstrapping issue, and once that bootstrapping issue is overcome, I think the capabilities can be a lot broader. Just from a science standpoint, there are domains where we can study that already. Certainly, if we're talking about mobile robots, there's plenty of driving data around, and studying data-driven methods in that setting is something that could be done now. But even with manipulation systems, yeah, you have to put some work into getting them set up and getting that extra processing, but it's not that bad. It's doable; if you consider how much work we put into making a transformer run efficiently on a GPU, it's not like the engineering effort is prohibitive; it's kind of in that same ballpark. So I think that's doable. But there are also certain scientific questions on the algorithm side that we worry about a lot less when we're not doing the real-data stuff. For example, I mentioned this driving instance: there's lots of driving data, so why isn't everybody doing reinforcement learning for driving? There are some people doing it, but it's certainly not what most reinforcement learning researchers are doing. Well, it turns out there are actually major technical challenges that arise when you start thinking about combining reinforcement learning with previously collected data. And it's something that my group has been working on a lot, as have many others. Sometimes it goes under the name of offline reinforcement learning, but it's really less about it being offline and more about being able to consume previously collected data. And that's not something that classical reinforcement learning researchers worry very much about, because classically, reinforcement learning is approached as an online, interactive system. And if it can be online and interactive, that's great. But you can also load it up with large amounts of previous data.
In the case of Covariant, or in the case of a car learning from all human drivers, or in the case of my robot learning from all the previous experiments that my students have done: if you can load it up with that data, then you can get much more powerful generalization. In a sense, the self-supervised learning transformation that we're seeing in NLP and vision is based on that principle in supervised learning land. So can we have something like that in RL? It turns out there are a lot of interesting algorithmic challenges there. And for anyone listening to this: if you are a researcher working on RL, that seems like a great topic to tackle right now.
Pieter Abbeel: You've actually made a ton of progress in offline reinforcement learning in the past couple of years. And maybe just to tee this up a bit more specifically for our audience, Sergey, can you clarify what makes offline reinforcement learning different from the more canonical online reinforcement learning? What makes it hard, but also, why does it offer so many opportunities?
Sergey Levine: Yeah, so I really started to realize how important this is when I got to talking to more people who actually want to use RL. I talked to some folks who were working on a microscopy application. They were like, well, we need to steer our microscopes so that we properly reconstruct the shape of this thing, and we've got a bunch of data from a grad student going in and doing this manually. Can we use RL to automate the microscope? I asked, well, how long can you run the microscope? Well, it needs liquid nitrogen to be hooked up, and so on, so we get a couple of pictures every day or something. Okay, well, that's not going to happen with online trial and error. But they have all this prior data from, like, the decade of research they've done on this thing. Or I talked to a company doing HVAC stuff: well, we can't experiment with different thermostat settings, because people in the building are going to rebel, but we've got all the historical data from the past however many years. So there are all these settings where this kind of thing comes up, where the problem looks like a decision-making problem, but the active online interaction that the classic RL framework prescribes is very difficult or infeasible. The way that everyone explains reinforcement learning is like training a dog: you give it a treat if it does something good, you punish it if it does something bad. But the way that everyone would like to use it is to say: I've got my data, I've got my baseline system, it's been running for a decade. Can I just take all this data and train a better policy from it? That's basically what offline reinforcement learning does. So to me, it's exciting because we can, hypothetically, take lots of robot data and use it to bootstrap the robot. But I think to many other people in the world, it's exciting because they can take data they've already got and derive a better policy from it. So that's the big promise of the idea.
Now, what's the challenge? Well, the challenge is that if we think about the active process, the one where it's like you train a dog: the agent has some idea about an action that might be good, but it doesn't know for sure whether it's actually good, so it goes and tries that action, experiences the outcome, and adjusts its understanding accordingly. But an offline RL system looks at all this data and considers an action that you didn't take. Like, you never tried setting the thermostat to 150 degrees. Maybe that's a good idea? No, probably not a good idea. But there's nothing in the data where you see 150 degrees, so the system doesn't know that people are going to freak out and everyone's going to leave the building as soon as you do that. You've just never seen that, so you don't know it. So it might look good to you, but only because you're making a false inference based on incomplete data. We don't usually worry about that in online RL, but we worry about it a lot in offline RL. That's the biggest challenge in offline RL: for a particular conclusion you've drawn from the data about a counterfactual query (counterfactual meaning: what would happen if I did this other thing?), how do you determine whether that inference is actually accurate, or simply a delusion caused by incomplete data? Detecting and correcting these delusions is basically the primary challenge. It's a very similar challenge to what we face when we're trying to do causal inference, or when we try to determine treatment effects for drugs: using existing data to determine the effect of an action. And it's a big challenge. But as you said, the community collectively has made a lot of progress on this over just the last 18 months or so. I would say these methods have gone from a state where they were basically not usable to one where we can actually start using them for things. We're collaborating right now with a team at CMU on trying to apply this to power grid regulation. We're working on these things for dialogue. It seems like there's actually headway to be made now with the algorithms that have come along in just the last 18 months.
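(One concrete way to 'detect and correct these delusions,' drawn from this line of research, is to train value functions pessimistically on actions the dataset does not support; conservative Q-learning is a well-known example. The toy tabular sketch below is illustrative only: the environment, data, and constants are made up.)

```python
import numpy as np

# Toy offline RL setup: a fixed dataset of (state, action, reward, next_state)
# transitions; no further interaction with the environment is allowed.
n_states, n_actions = 5, 3
gamma, lr, alpha = 0.9, 0.1, 0.5
Q = np.zeros((n_states, n_actions))
dataset = [(0, 1, 1.0, 2), (2, 0, 0.0, 3), (3, 1, 2.0, 4), (4, 2, 0.0, 0)]

for _ in range(1000):
    for s, a, r, s2 in dataset:
        # Standard Bellman backup toward r + gamma * max_a' Q(s', a').
        target = r + gamma * np.max(Q[s2])
        Q[s, a] += lr * (target - Q[s, a])

        # Conservative (CQL-style) regularizer: a gradient step on
        #   alpha * (logsumexp_a Q(s, a) - Q(s, a_data)).
        # It pushes down Q-values on actions the data never shows
        # (the "thermostat at 150 degrees" actions) and pushes the
        # observed action's value back up, so unsupported actions
        # cannot look spuriously attractive.
        soft = np.exp(Q[s] - Q[s].max())
        soft /= soft.sum()
        Q[s] -= lr * alpha * soft
        Q[s, a] += lr * alpha

# For states that appear in the data, the greedy policy now prefers
# actions the dataset actually supports.
print(np.argmax(Q, axis=1))
```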
Pieter Abbeel: Now, that's another part that's so interesting here, I think: the same algorithm is actually applicable across so many application domains. You're talking about robotics, dialogue, HVAC control, the electrical grid, all in the same breath, pretty much, because the same approach you're developing can be applied across them, as long as you feed it data from that domain, of course. Zooming out a little bit on that front: when you think about applications, what's on the list of ones you're personally excited about?
Sergey Levine: The things that I find really exciting are generally things that have the potential to lead to some sort of emergent behavior. The big power of generalization is not just in doing more of the same, but in the inventiveness of coming up with new solutions. That's why I actually think a lot of the real-world stuff is the most exciting, certainly the robotics stuff. I think applications that involve some sort of interaction with people are also really exciting, because interactions with people are very complex, and it's also a place where there's room for potentially really interesting emergent behavior. Imagine a bot talking to a person: can it come up with the right prompts to ask, so that it's more efficient for the person to get the answer to their question? There are all these subtleties of interaction that people are pretty good at, but that we're very bad at articulating how to do. It's kind of the essence of expertise: something you really know how to do, that you're very proficient at when asked to do it, but that you can't really explain to somebody as direct rules. And that's something that perhaps an RL system could actually figure out from seeing lots of, in the case of question answering, lots of humans doing that task. So the thing all these have in common is the potential for emergent behavior, and the messiness of some really complex real-world system, whether it's a physical system in the case of a robot, or a person, or something else complex enough that if we just code up a scripted strategy, it won't exhibit that same kind of expertise.
Pieter Abbeel: Now, talking about new things emerging: one of your recent works is on using machine learning to automate some design processes. That's really intriguing, right?
Sergey Levine: So the work you're referring to, the most recent one, was done with the computer architecture team at Google for chip design. But we've also worked on these problems for other applications, like applications in biology, for instance. The basic principle is actually very similar to offline reinforcement learning. We refer to this as offline model-based optimization. The idea is the following: if you have some kind of design problem, let's say you want to design a drug, and you have experiments that you've conducted where you've tested different kinds of designs, can you take this dataset of experiments, examine it, and come up with a better design? In a sense, that's a lot of what human scientists do, right? We look at our results, examine them, think pretty hard about what's going on, and then come up with a better thing to go and test. Sometimes we need to conduct more experiments, and sometimes we look at the data and say, okay, now I know the answer; this is the thing you should be doing. So this problem actually looks a lot like a reinforcement learning problem, if you say that your design is your action and all these experiments you've conducted are your offline data. It's basically a version of an offline reinforcement learning problem; it just lacks the temporal structure. There's no notion of time steps, no notion of actions and their consequences; you just select the design and observe the efficacy of that design. Doing all this from data is hard for all the same reasons that the offline RL problem is hard: you want to make sure that any conclusion you're drawing is actually supported by your data, and not just a delusion because you're extrapolating too much. In fact, if we get into the nitty-gritty technical details: the way one might think to do this conventionally is to take a deep neural net model and train it to map from the design to its efficacy. So some representation of a drug, with a graph neural net or something, mapping to the efficacy of that drug, and then you would optimize with respect to the input. That's a very sensible idea. But it's also exactly the way that we produce adversarial examples; the procedure for producing an adversarial example is identical to this. So naively trying this kind of procedure really doesn't work: it'll create things that you can feed into your network that will fool it into thinking they're great, but that are actually terrible. So we needed to develop better algorithms for that, and that's a lot of what we've been doing. But once we had those algorithms, we could actually try out a lot of these interesting problems. The particular one we were studying in collaboration with the Google computer architecture group was designing hardware accelerators. This is a little circular: it's basically a neural net trying to design a chip that will train a neural net faster. What would be exciting in the future is to actually close the loop: fabricate that chip and use it to design a new chip even faster. But that's not there yet. So there's this kind of transformer model that looks at the chip architecture parameters and predicts its efficacy from data.
And the cool thing about that is you can actually condition this model on the kind of workload that you want to run. So if you're trying to train a different kind of neural net, you can condition on something that describes the thing you want to train, and it will actually try to optimize a different accelerator for different kinds of workloads. And one of the things we were able to do there is show that this model, trained on a variety of different workloads, could then be used zero-shot to optimize the design for a new neural net. And that's pretty cool, because it suggests that not only can you use prior experimental data, you can actually use data from a variety of different workloads, and then use it to create a design for a new workload without ever having tested any designs for that workload before. And that seems to work decently well so far.
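(The failure mode of the 'naive' approach Sergey mentions, optimizing a design against a learned model's input, is easy to reproduce; it is the same mechanism that manufactures adversarial examples. The sketch below is purely illustrative: a one-dimensional 'design,' a linear stand-in for the learned model, and made-up constants, nothing from the actual accelerator-design system.)

```python
import numpy as np

rng = np.random.default_rng(0)

# True efficacy of a 1-D "design" x peaks at x = 1, but our offline
# experiments only ever tested designs in [-1, 1].
def true_score(x):
    return -(x - 1.0) ** 2

X = rng.uniform(-1.0, 1.0, size=200)
y = true_score(X) + 0.01 * rng.normal(size=200)

# Learned design -> efficacy model: a simple linear fit. Within the data's
# support the score increases with x, so the model learns "bigger x is
# better" and extrapolates that trend forever.
slope, intercept = np.polyfit(X, y, deg=1)

# Naive optimization: gradient ascent on the model's *input*. The optimizer
# promptly leaves the region the data supports and chases predictions the
# data cannot justify.
x = 0.0
for _ in range(100):
    x += 0.5 * slope            # d(model)/dx is just the slope

print(f"proposed design:  x = {x:.1f}")                  # far outside [-1, 1]
print(f"model predicts:   {slope * x + intercept:.1f}")  # looks great
print(f"true efficacy:    {true_score(x):.1f}")          # actually terrible
```

(Offline model-based optimization methods exist precisely to keep the proposed design inside, or near, the region the data actually supports.)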
Pieter Abbeel: Now, we're hearing about a neural net that is designing chips, effectively so that it can itself run faster. It's hard not to think 'singularity,' right? It must have been on your mind at least once as you were working on this. Maybe first say: what is this concept of the singularity? And put it in the context of what you're doing here.
Sergey Levine: Okay. Yeah, so the singularity usually refers to the idea that if you have a self-improving machine, that self-improving machine keeps improving itself, which allows it to improve itself even faster, and that allows it to improve itself even more, and so on, and it gets into this spiral of exponential growth; before we know it, it has improved itself to a degree that we are incapable of understanding. I think a lot of people find this notion very scary. I feel like this is often the disconnect between people who work on this every day and people who don't: to me, I'm just sitting here thinking, man, if this thing can improve itself and get itself to be faster, my life will be a lot easier.
Pieter Abbeel: If it can just do it once, that would be great. Yeah.
Sergey Levine: My perspective on this right now is that we have a lot more to fear from machines that are not smart enough than from machines that are too smart, today in 2021. Because AI systems will be used for real-world things; they're already used for real-world things, including in contexts where they can cause real harm, real damage to people, physically, emotionally, and in other ways. And I think having the systems not be smart enough to figure out how to behave appropriately is a much bigger danger right now than having the systems be too smart. So I would be a lot more concerned about my autonomous car not having a fast enough microchip, or not having a well enough trained neural net to recognize a pedestrian, and maybe risk hitting them, than I would be about the car being so smart that it creates a better car that drives exponentially faster or something. So while we do have to take these things seriously, we have to also recognize that machines that are not smart enough pose a real danger.
Pieter Abbeel: But now, of course, your work actually goes in this direction also. I mean, your work covers a lot of ground, obviously, but one of your lines of work goes in this direction, right, the offline RL. Is it fair to interpret part of it as the machine, the neural net, having to understand what it doesn't understand yet, what it doesn't have support in the data for? And then, in principle, you could ask it to be cautious in those situations, right?
Sergey Levine: Yeah, precisely. And I think this goes much deeper even than that. At some level, real-world learning and real-world deployment of robotic systems hinge in a really critical way on the system being able to understand its actual area of competence. Offline RL makes that very crisp, because in offline RL, if you mess up that part, you'll just crash and burn. But even in regular RL it's an issue; it's just an issue that we dodge when we work in very constrained simulation environments or very constrained laboratory environments. In any real-world deployment of an RL system, it will have to deal with unexpected things happening, and it has to react to those things in a way that is sensible given what it's doing. It could react with curiosity, it could react by going and examining things a little bit more, or it could react cautiously. But either way, it needs to react in some way that doesn't damage something or break something. Your grasping robot probably shouldn't react to seeing a cat by trying to grasp the cat; that's just not a good idea. So being aware of uncertainty, estimating uncertainty, and having some degree of conservatism in your behavior is, I think, actually an indispensable part of these things. And certainly, I get the feeling that the mainstream RL research community has maybe been made more keenly aware of this with all the offline RL stuff, but the issue is always there, whether it's offline or online. It's really important.
Pieter Abbeel: Now, Sergey, it's a drastic understatement to say that you are a very productive AI researcher and leader; you get so much done. In just one year, you can make so much progress on so many problems. I've asked some questions about that…but what drives you? Because this must take a substantial amount of your time. What motivates you here?
Sergey Levine: Yeah. Well, okay, so I think that probably any scientist who spends a lot of time doing science, at some level, has to be driven by some innate, hard-to-articulate notion that this stuff is really cool. So if I'm being really honest about it, it's that: the stuff is really cool. And the particular stuff I think is the most cool is actually the possibility that you get a machine that does something smart and capable, but not what you expected. Those are the really exciting moments. And part of why that kind of thing really drives me is that it feels like the potential energy in it is enormous. There's this analogy people sometimes make where research is like digging for treasure: you can imagine the cutaway figure where there's a little guy digging a tunnel, and there's some kind of gemstone on the other side of that tunnel. And I think with AI, the exciting thing is that we don't know if it's a little gemstone or an enormous thing the size of a planet. We could be sitting on top of something whose capabilities are huge, or it could be smaller; we kind of don't know. And there's something really cool about that. If you build this thing and it really does have some kind of emergent capability, then you can let it loose and it will actually get better and better. That can be enormously powerful. And in all scientific pursuits, we do have to be realistic with ourselves to a degree; the probability of success is never that high. But there's just something really exhilarating about the notion that maybe we're sitting on the mother lode, that if we just dig a few more feet we'll unlock this enormous capability. I think that's really exciting, and that really is the thing that gets me out of bed for this job every morning.
Pieter Abbeel: Now, you say powerful, but as part of powerful, I feel like I hear a certain optimism that it's powerful in a good way. Say a bit more about that.
Sergey Levine: Yeah, yeah. Well, I think there are certainly things that we have to be careful about with any new technology. But at the same time, we have to keep in mind that the potential of technologies that save a degree of human labor is not what one might pejoratively call 'first-world problems.' It's not just, oh, I want a robot butler because that's cool, because then I don't have to do my laundry. That's not really all it is. There are lots of people who do jobs that are not pleasant, that they don't necessarily want to be doing. And I think robotics that reduces the reliance on humans in settings where humans really should not be, settings that are better off handled by machines, is a positive thing. We as a society have to handle that in an appropriate way; we have to make sure that everyone comes out better from it, not just the people who happen to be in the right place at the right time. But if we can deal with the social aspect of it, I think the human quality-of-life effect of effective AI technologies, both in robotics and elsewhere, should be extremely positive. And this is one of those things where, just by our nature, I think we tend to be less concerned about how bad the status quo is and more concerned about how bad things might be if they change, because the status quo is what we're used to. But the potential improvement in human quality of life, across the board, from machines that can do a lot more of these things, both the jobs that people don't necessarily want to do and also accelerating the jobs that people do want to do, can be enormous: from accelerating scientific progress to giving the elderly and the disabled a better quality of life. In all these areas, there's huge potential.
Pieter Abbeel: The field of AI is moving very fast. How do you continue to keep a sense of purpose and excitement when thousands of other people are also working in the space? What makes you personally excited to keep pushing?
Sergey Levine: Yeah, I think that's a really good question. And again, I think the answer will probably be different for different people, as to what works for whom. But for me, something that has always worked really well is to just really keep in mind what it is that I really want to see: what is the capability, or the outcome, or the question, or whatever, that I really want figured out. And I think if you have the right mindset in that regard, like, here's the thing that I want to see, then it actually becomes really good that there's all this stuff going on. Some graduate students sort of dread looking at arXiv in the morning, like, oh man, maybe somebody did the thing that I'm working on. But if you have the mindset of, I really want to get this thing figured out, I really want to get this question answered, then every time you look at what new research has come out, it's like Christmas morning: what new present has arrived that will help me do the thing I've been trying to do? And if someone figured out a piece of the puzzle, awesome; they're just helping you, making you faster and further along. So I think it's actually really important to have that mindset. But to have that mindset, you really need to know what it is that you're trying to accomplish, what question you're trying to answer, what capability you're trying to unlock. And if you've got that in mind, then all these things that are happening look like Christmas presents; they look like all these people around the world helping you accomplish the thing you want to accomplish. And that's actually really good.
Pieter Abbeel: That's a really beautiful mindset. I try to strive for that also, but I'm not claiming I'm always succeeding.
Sergey Levine: Me either, by the way, yeah.
Pieter Abbeel: Yeah. Now, when you think about a high school student today who might be very excited about AI and think of it as a career they might want to pursue, can you give some pointers on how to get going and start learning the right things to get engaged in this?
Sergey Levine: Yeah. I mean, okay, so there's the obvious stuff; I'm sure most people listening will know there are resources they can find online, classes and things like that. But something I would say here that is maybe a little less obvious is that, for getting started in science, yes, you need the basics; you need to know what people are doing now. But at the same time, you have to remember that to be a successful scientist, you have to do the things that people haven't done yet. Which means that, in addition to learning about the really hot, cutting-edge thing today, you know, the Coursera course on machine learning, the deep RL course, whatever, you also have to get the right foundations, so that you actually have some hope of discovering the things that people haven't done yet. In the short term, people will make good progress by getting the latest, greatest deep learning whatever. But in the long term, success will come from really investing in those foundations and getting them right. Now, that's easier said than done, because, of course, all of machine learning, statistics, optimization, etc., is a very broad thing. So you have to find the right materials. If you're working with a mentor, if you have a research supervisor, part of their job is to help direct you in finding those things. And it's not easy, but it's a really important part. So, generally, a really solid curriculum, certainly a solid college curriculum, will put people on the right path toward that. But it's important not to neglect it, and to really remember that that's actually where a lot of the mileage will come from.
Pieter Abbeel: Well, that's some wonderful advice. Sergey, this has been an absolutely wonderful conversation. Thank you for joining us.
Sergey Levine: Thank you for having me here.