
Gustav Söderström on The Robot Brains Season 2 Episode 19

 

Pieter Abbeel 

Music is universal. It transcends language. Melodies have the power to trigger strong emotions. In fact, our brains release dopamine, the feel-good hormone, when we listen to music. Today's guest has been helping to bring the magic of music, personalized playlists and, more recently, also podcasts to people around the world. Gustav Söderström is Spotify's Chief R&D Officer and has led the platform to personalize individual content experiences with the help of artificial intelligence. Gustav, welcome to the show. So great to have you here with us. 

 

Gustav Söderström

Thank you. I'm very excited to be here, Pieter. 

 

Pieter Abbeel 

Well, so excited to get to chat with you. Now, before we dive into what you're doing today at Spotify: you've actually been at Spotify for a very long time. You've been part of the journey from almost the very beginning. Can you say a bit about how you got to know Spotify, long before many people knew about it? What got you convinced to join? And what was it like at that time? 

 

Gustav Söderström

Yeah, that's right. I wasn't with Spotify quite from the start. Spotify was actually founded as a company in 2006, and it launched a desktop product in beta in Sweden in 2008. I think I signed in October or so of 2008, and I joined in January 2009. So the product existed as a beta on the desktop. And, you know, I tried this product and I was completely blown away by it, the desktop product. And so I figured I would try to figure out what I was going to do with my life. I had come back from being in Silicon Valley for a while. I worked at Yahoo, which I had sold a startup to previously. So I was back in Sweden, figuring out what to do, and through a mutual friend I met with Daniel, and he asked me if I wanted to head up what Spotify's mobile experience should be, because this was in 2008. The iPhone was out, but the App Store wasn't out yet. But the writing was kind of on the wall that this hardware-based iPod people used to carry around as a separate piece of hardware was going to follow Marc Andreessen's prediction and turn into software. And that opened up the opportunity for Spotify to also be a piece of software on this more generic computing platform. So I joined Spotify, and it was about 30-something employees back then. Very much a startup. It was a well-funded startup, because going into music is incredibly expensive, but it was still very much a startup. And the core reason I joined, of course there were amazing people, but the core reason I joined was the product. It just blew me away compared to anything that existed at the time. At the time, unfortunately, not many people were actually paying for music on iTunes. Most of them were downloading from pirate networks. And you had this experience of going through hard drives and finding music that you thought you wanted to hear, and then you waited minutes for it to download and then you listened to it. And the experience of Spotify, to put it in one sentence, was kind of the experience of having downloaded all of Napster and Pirate Bay to a local hard drive. Everything started instantly. So it was a much better user experience. And somehow, magically, for most users, it was also legal, unlike the pirate networks. So it was really one of these too-good-to-be-true product propositions. 

 

Pieter Abbeel 

And it's very interesting because indeed, at the time, a lot of people were downloading music illegally. And to me, at least, it was very surprising that Spotify managed to get the rights to all this music, because in this one place, all of a sudden, you could get all the music legally, no problem. How did they secure that? What was the magic there? 

 

Gustav Söderström

Well, it was both environment and timing, as always, and then individuals, I would say. I wouldn't discount Daniel, the founder, his patience and perseverance in going and knocking on the doors of these labels, being completely unknown back then and just persevering year after year, and also, frankly, putting up a lot of money upfront to buy away the risk from the labels. It is impossible to know, but there's a fair chance that the music industry actually didn't think this was going to work, but they thought, why not? Like, it's guaranteed money. And you know, lots of startups have done this before: they guarantee money and then they just fail. So part of it was perseverance and money. But then a lot of it was actually timing. I don't think Spotify would have happened anywhere else in the world, for two reasons. One is that the Swedish government had actually invested in broadband. There was literally a program called Broadband for Everyone, and then PCs for Everyone. So pretty much everyone had high-speed broadband, government funded. So piracy took off, because the infrastructure was just ahead of the rest of the world. And there was a political climate where there was this notion that information should be free, right? And music was just bits. It's just information. It should be free. So people didn't have any qualms about pirating. In fact, there was a political party called the Pirate Party, and the prime minister literally said, we can't criminalize a generation, this can't be illegal. So it was very much accepted. And this meant that the music industry was wiped out. There's a quote by a UK label person who looked at the Swedish market and a Swedish label and said, that's not a market, that's a hobby. That was the size of the Swedish music market back then. So this is the long-winded way of saying the infrastructure was there, and the labels had nothing to lose and it was guaranteed money. That's what made it possible. Whereas in the UK or the US, there was way too much to lose, so it would never have happened. 

 

Pieter Abbeel 

And when you say it was guaranteed money, what do you mean by that? 

 

Gustav Söderström

Something called MGs, minimum guarantees, where you as the startup basically say, here's all the money upfront on January 1st, and then we recoup from that. So you take all the risk. It's a very expensive way to do startups. 

 

Pieter Abbeel 

Sounds pretty risky, because you're saying Spotify would just say, we're paying upfront for at least this many listens to the music of your entire label, I guess. But if people don't listen that much, you're not making the money back. 

 

Gustav Söderström

Exactly. You take all the risk. So, you know, you really have to believe in your idea to do that, but it is a way of making things happen that could not otherwise have happened. 

 

Pieter Abbeel 

That's so interesting. Now back in 2008, 2009, machine learning was starting to play a role in some things online, for sure. Recommendations of ads and movies and music, I imagine, too. But it wasn't as good as it is today, so I'm curious, was there already machine learning at the core of Spotify back when you started? 

 

Gustav Söderström

So the interesting answer is yes. But we, including myself, I was in senior management then, didn't understand or recognize it. It was there among the engineers, but the company itself didn't really value it. I'll be the first to admit I didn't really see how big machine learning would be back then. But fortunately for us, many of our developers did. Already when I joined, there was one person named Erik Bernhardsson who was very interested in what back then wasn't really called ML or AI so much as just recommender systems. And he was fascinated by some of the recommendation challenges that, you know, Netflix and others put out around collaborative filtering. And he realized that the way Spotify worked back then was, you had a big search box where you typed in whatever song you wanted, and then you playlisted it. You put that in a playlist, and you can almost think of the playlist as a sort of meta programming language, where you build your own music experiences. And pretty early on, we started to get millions of playlists, and tens of millions and hundreds of millions. Today we have many billions of playlists, but already back then there were maybe tens to hundreds of millions. And so Erik Bernhardsson realized that whereas Netflix, for example, was doing collaborative filtering between users and movies, there was this opportunity to do collaborative filtering between playlists and songs, right? The theory for that wasn't necessarily that hard, because it had been around for a long time. But to do it in practice, at the 10-million-user scale we had back then, was incredibly hard. And, you know, there wasn't any AWS or GCP or anything, so you had to buy your own servers. So he implemented this using what I think was back then Europe's largest Hadoop MapReduce cluster. That was a lot of the challenge, and he got it working. And when you do this filtering between songs and playlists, we produced a vector space that was 40-dimensional back then. The simple idea was obviously that people playlist along some dimensions, some latent dimensions that they know but that they may not express in the title of the playlist. But somehow these songs belong together. And this 40-dimensional vector was supposed to capture some of those hidden dimensions. And I mean, I didn't understand very much about this back then, again, but fortunately for me, other people did. So we produced this vector space and then we did some pretty straightforward things. We got a vector for every song, with these 40 dimensions, based on how it had co-occurred with other songs. Back then it was straightforward collaborative filtering on playlists; today it's more sophisticated and takes into account some of the sequencing as well. And then once you have that for every song, you can take a user, take their play history, just average the vectors for all the songs, and then you have a vector per user. And then you're off to the races. You can start doing song-to-song similarity, or you can do user-to-song recommendations. Actually, those things appeared in the client very early on, but they were very hidden. If you went to an artist page, you could see similar artists, but it was very much a hidden feature. Recommendation wasn't the focus; it was all about curation back then. So that's kind of how it started. 
I think one thing that was interesting was that we noticed pretty quickly that people who call themselves music aficionados or had very specific tastes, they tended to love these recommendations. I mean, I can't tell you all the emails I got about this. It's amazing. How could you find the song? And I was like, I have no idea. But the interesting thing was that all of my sort of mainstream friends, they were kind of like, I don't know, like, this is really esoteric and strange recommendations. And so it turns out that it is the music aficionados that playlist a lot. So there was this bias in the data, which was the training data was very much for music aficionados. So it was a bit unexpected for me because I would have expected that it would have been easier to solve the mainstream recommendation problem first and then the aficionado problem. But for us, it turned out to be the opposite, actually because of the training data. 
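A minimal sketch (in Python, with made-up 40-dimensional vectors) of the recipe described above: song embeddings learned from playlist co-occurrence, a user vector formed by averaging the songs they've played, and similarity in that space for recommendations. The data and function names are stand-ins, not Spotify's actual system.

```python
# Sketch: song vectors from collaborative filtering, user vector as the
# average of played songs, cosine similarity for recommendations.
# The random vectors here are stand-ins for the real learned embeddings.
import numpy as np

rng = np.random.default_rng(0)
song_vectors = {f"song_{i}": rng.normal(size=40) for i in range(1000)}  # 40-dim, as described

def user_vector(play_history):
    """Average the vectors of every song the user has played."""
    return np.mean([song_vectors[s] for s in play_history], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(play_history, k=5):
    """Rank unheard songs by cosine similarity to the user vector."""
    u = user_vector(play_history)
    candidates = (s for s in song_vectors if s not in set(play_history))
    return sorted(candidates, key=lambda s: cosine(u, song_vectors[s]), reverse=True)[:k]

print(recommend(["song_1", "song_42", "song_7"]))
```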

 

Pieter Abbeel 

Now, of course, Spotify's goal is to deliver the right content to the right listener at the right time. So how did you go from where you were at the time, where you were able to do recommendations but were seemingly limited by the data, to really achieving that goal? 

 

Gustav Söderström

So first of all, as I said, Spotify was about curation. And I would say all of the internet was about curation. You went to Twitter or Facebook and curated a friend graph. You went to Spotify and curated playlists. You went to Pinterest and it was all about curation. And slowly, the world has moved from curation to recommendation, where the service recommends things for you to react to. So I think for a recommendation service, it's fundamentally about understanding as much as possible about both sides of the equation. You want to understand as much as possible about the artist and the music and the track. What is this artist about? What type of artist is it? What is this track about? And then, equally, you want to understand as much as possible about the user's preferences: what kind of music, or podcasts nowadays, do they like, and how do they listen? So it is about understanding as much as possible about those two sides, and then you start out with some algorithms. As I said, it was actually real machine learning from the start in terms of collaborative filtering, truly learning the features. But there were also, as was the case in, for example, computer vision back then, tons and tons of handcrafted heuristics, like artist affinity and these kinds of things that we handcrafted. So it was basically a lot of systems that together tried to describe the user's preferences, tried to describe the music and the artists, and then match the two. And what I described about the vector space, it was early on, but the truth is that some version of that has lasted for a very long time. It's more sophisticated now. We have more powerful representations than just the 40-dimensional vectors, we use more neural networks, we use slightly different methods. But this notion of trying to create some sort of embedding in a vector space based on consumption, at least in music it's based on consumption. In podcasts, you can obviously base the vectors on content as well, by sort of machine-listening to them, but in music it's based on consumption, mostly. That is still, almost to this day, the very core of Spotify recommendations. Now, the problem we had with only using this system was that while these vectors captured some 40 dimensions, which no one knows exactly what they are, and those dimensions are very powerful, they captured some really quirky dimensions, including cultural dimensions that we could never have come up with ourselves, they also missed some obvious dimensions, like: is this Christmas music? Is this a baby carol? Is this heavy metal? So the problem we had was, if you tried to create a playlist by just taking tracks that were close to each other in vector space, it was often the case that maybe 70 to 80 percent of that playlist was amazing and then 20 to 30 percent was super strange. Like, in the middle of summer you got Christmas carols, or in the middle of your summer jam you got some heavy metal or something, because it didn't necessarily capture what a human would have thought was easy to spot. Sonically, any human would say, well, those are completely different, they make no sense in the same set. And yet their vectors could be quite similar, because somehow they got playlisted together a lot for various reasons, right? So part of our journey was, we had this workhorse in the vectors for a long time that was really powerful, but we didn't have any sort of knowledge graph or anything on top of it. 
So in 2014, I think it was, we acquired this company called the Echo Nest, out of Boston and MIT, and the Echo Nest was interesting in many ways. First of all, they were one of the best companies in the music recommendation space. They were actually powering almost all of our competitors at the time, because very few of the streaming startups had their own recommendation engines. But that's not why we acquired them, and we actually let competitors keep using the product. The reason we acquired them was because they did recommendations in a completely different way than we did. We based all our recommendations on listening data, co-occurrence in playlists, with basically no analysis of the actual genres or sonic analysis or anything like that. And the Echo Nest had no play data at all, no listening data, because none of these companies wanted to give their listening data to a third party. And you know, today with GDPR, it's not even legal to give your listening data to a third party. So they based their recommendations on, first of all, handcrafting a large taxonomy of how music works, with genres and subgenres, Swedish pop and so forth, and Christmas music and carols and all of these human concepts. And then they used combinations of NLP, going through Wikipedia and so forth, to populate this knowledge graph. And on top of that, they used sonic analysis of the music: BPMs, tempos, energy. So they also created a recommendation system, but based on a completely different paradigm, no listening data, just analyzing the music and the artist and the conversations around them. So the idea was, what if we combine these two? And what happened was, when we took our vectors and created a candidate list of tracks that were close to each other in the space, we would have these, what we called WTF moments, where things really didn't fit at all. But if you then filtered it through this knowledge graph, you would realize, that's actually a Christmas carol, it makes no sense in the middle of summer. So when you filtered that list by this other knowledge, you finally got really good results, and we could start producing musical sets that were genre-wise coherent and sonically coherent. And it was around this time that Spotify launched sort of our first real music product called Discover Weekly, which is this weekly playlist that you get, and other products like Daily Mix and Time Capsule, lots of these things. So that's part of the story. It was the combination of these two things that looked at different data, that was the solution to the problem for us. 
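A rough sketch of the combination described here, under the assumption that the embedding space supplies candidates and a knowledge-graph-style tag lookup filters out the "WTF" tracks; the `track_tags` data and tag names are invented for illustration, not the Echo Nest's actual taxonomy.

```python
# Sketch: candidates from embedding-space similarity, then a tag lookup
# (standing in for a knowledge graph) removes tracks that clash with the
# playlist's intent, e.g. Christmas music in a summer mix.
import numpy as np

def nearest_candidates(seed_vec, catalog_vecs, n=200):
    """Candidate generation: the n tracks closest to the seed in vector space."""
    sims = {tid: float(seed_vec @ v) / (np.linalg.norm(seed_vec) * np.linalg.norm(v))
            for tid, v in catalog_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:n]

# Hypothetical knowledge-graph facts: genre / seasonal tags per track.
track_tags = {
    "track_a": {"pop", "summer"},
    "track_b": {"christmas", "carol"},
    "track_c": {"heavy metal"},
}

def filter_candidates(candidates, banned_tags):
    """Drop candidates whose tags clash with the target context."""
    return [t for t in candidates if not (track_tags.get(t, set()) & banned_tags)]

# e.g. for a summer pop playlist one might ban seasonal and clashing-genre tags:
# playlist = filter_candidates(nearest_candidates(seed, vecs), {"christmas", "heavy metal"})
```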

 

Pieter Abbeel 

Now, as I understand it, Gustav, you actually worked with neural networks in computer vision long before you were at Spotify, back when they were not so great at getting things done compared to today. But in the last five to ten years, neural nets have become the go-to methodology for most machine learning systems. And I'm curious, as you look at Spotify's systems today, what's the role of neural networks in everything you just described that provides these recommendations and discovery processes for your listeners? 

 

Gustav Söderström

That's absolutely right. And it's a bit different between music and podcasts. So in music, these vectors based on listening data, on playlisting, are still very useful and powerful. But obviously you can use neural networks to create more powerful representations. You can take the vectors and try to add different types of features. We've tried, for example, to play around with the fact that there are many tracks that don't have any listening data yet. Could you use a neural net to map such a track into this vector space and sort of predict where in the vector space it would end up, so that you get a good first guess before you have any listening data? So there are those kinds of things, and more advanced rankers, and we basically use neural nets all over the place. But it's still interesting that in music, much of the core is still just this representational space. In podcasts, it's a little bit different. When we entered podcasts, it was a similar journey to music, where we didn't have any listening data initially. And so we had to build this knowledge graph, largely hand-crafted initially, of categories and hosts and shows, to get started. And then we started building up some listening data and we could start doing sort of the collaborative filtering and matrix factorization techniques. But in podcasts, right around then, transformers really exploded. And then you have this other option where you can just sort of machine-read the transcripts and create embeddings using transformers instead. And they turned out to, one, be very powerful, and two, not need any listening data. So you can do it on an episode that has no listening yet and still get an embedding for it. So I would say, not surprisingly, because these are called language models to a large extent, anything that has to do with language has benefited fantastically from those networks, specifically transformers, even if they do more than language these days. 
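A minimal sketch of the transcript-embedding idea, using the open-source sentence-transformers library as a stand-in encoder; Spotify's actual models and pipeline aren't described here, so treat this purely as an illustration of content-based embeddings that need no listening data.

```python
# Sketch: encode episode transcripts with a pre-trained transformer and
# compare them by cosine similarity. The model and data are stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained text encoder works

transcripts = {
    "episode_1": "In this episode we discuss reinforcement learning for recommendations...",
    "episode_2": "Today we talk about training for a marathon and nutrition...",
}

# One embedding per episode, no listening data required.
embeddings = {eid: model.encode(text) for eid, text in transcripts.items()}

score = util.cos_sim(embeddings["episode_1"], embeddings["episode_2"])
print(f"similarity: {float(score):.3f}")
```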

 

Pieter Abbeel 

Now, to zoom out for a moment here. When you say these transformer neural networks, of course, they've been very popular in any language task. It's a new neural net architecture that is somehow better at capturing the semantics of what's in text than the previous neural network architectures, right? And it's used in Google search: you type something in, and it finds the underlying meaning and finds text that's semantically related, even though the text might not directly match up. And so am I correct in imagining here that you take these transformer networks that have been trained on the entire internet, and then you run them on a podcast episode and it generates an embedding vector summarizing that podcast? 

 

Gustav Söderström

Exactly. That's roughly it. You run them on these podcast transcripts and that creates an embedding. You can use it for other things as well, obviously, for example automatic summarization, show notes and stuff. But specifically, you can create these embeddings that let you have a vector space, even though you don't have listening data yet. 

 

Pieter Abbeel 

Now, as a podcaster myself, I'm particularly curious, of course, how Spotify thinks about podcasting as part of the ecosystem. Because Spotify clearly started with music, but now podcasting is another big component. How did that come about? And what's the future there? 

 

Gustav Söderström

It's an interesting story. So Spotify started as a music company, and then, as with many companies, you know, you need to sort of redefine yourself every now and then. And Spotify has had various fits and starts with, you know, should we try to expand somewhere? But then what happened, as is often the case, was that we have these things called hack days internally, or hack weeks actually, where the entire staff gets a week off to do hacks. And it's important that it's the entire staff, because if it were only a few people, they couldn't collaborate with each other. And the thing we saw again and again, back to the developers actually knowing what is happening and management figuring it out later, was that so many of our developers kept building the same thing. Because the RSS protocol was open, they kept building RSS podcast players into their Spotify. They just wanted to have their podcasts together with their music, and we saw it again and again. And eventually, you know, when you see things happening from smart people, a lot of the time you should take notice. So that's actually how it got started inside Spotify, I would say, because we saw this appearing on its own during hack weeks. And then we said, what if this is not just our few hundred developers back then? What if this is a good sample of the rest of the world? What if this is a need? So we started being a bit more structured around it and said, let's do some user research. What do people say? Do they actually want their podcasts, the rest of their audio, together with their music? And we found that largely they did. There were certainly some people who said they didn't, but largely they did. And so then we looked at the space and asked, is this something that we want to get into? And there was another consideration that is important, which is, what is the space? We were all huge podcast listening fans, and we felt that, especially where the world was going right then, everything was getting incredibly short-form very quickly, super high-intensity, short-form messages, maybe without so much depth. And we felt that podcasts were this long-form format where people got to speak in full sentences. And we just felt it can't be bad for the world that there is more deep discussion. So it felt like a space that was similar to music; music is a very positive space, and for us podcasts felt, and still feel, like a very positive space. So we looked at the space. We liked it. We had this hypothesis that people wanted to have their podcasts with their music. And if that was true, then we had a good chance, because we had a lot of music listeners, so we could just give it to them. If it had instead been true that our listeners liked podcasts but not in the same application, then I don't think we would have had a chance, because there are so many great podcast applications out there in the App Store, but they don't get a lot of usage. So we both asked ourselves if we wanted to be in this space, and then, do we have sort of an ability or a right to be in this space? And so we decided to join podcasts and music into the same thing, which was, I think, a bit of a contrarian idea at the time. Certainly, from a design point of view, it would have been much easier to build one great music app and then a separate great podcast app. 
Not to intermingle the two, because you get all of these design challenges, you know, the skip button needs to turn into a scrub button. But we took on that challenge and it worked. So that's sort of the long story of how we got into it. 

 

Pieter Abbeel 

And now, of course, from a world news point of view, I think Spotify really going into podcasts was made very, very clear when you essentially acquired the Joe Rogan podcast to be solely featured on Spotify, right? I mean, that was a pretty big commitment right there. 

 

Gustav Söderström

Yeah, there was a whole host of big announcements that happened at the same time. It was Joe Rogan. It was Gimlet, the podcast studio. It was Anchor, which was already one of the biggest ways for independent creators to create podcasts on mobile and upload them. We had been building for a year and we felt that this was working. And then we said, now we're going to go all in on this space, and it all happened to happen at the same time. So to the rest of the world, it looked very much like a moment in time, and it kind of did to us as well, I would say. 

 

Pieter Abbeel 

Now, one of the things is that historically, when starting out, Spotify was really focused on the listeners, right? I mean, the music was created in recording studios, by big labels. But over time, you've moved towards also catering to creators, the podcasters, but I imagine maybe also music creators, catering to independent music creators and so forth. Can you say a bit about that? How do you foster this creator community? 

 

Gustav Söderström

Yeah, you're completely right. Spotify started out licensing all the music from the labels. We didn't really have any direct connections with artists, performers, songwriters or anyone. We focused entirely on the consumer and on building a really good consumer experience, and that kind of worked out. But the world has changed. And so, first, we have started building many tools for musicians. There's something called Spotify for Artists, where you as a musician can see your catalog, see where your listeners are coming from, upload and change your cover art, your Canvas, change the artist page, all of these things. But then when we went into podcasts, it became even clearer, because podcasting doesn't have the labels. Most podcasters sort of publish directly to consumers. So we really went from dealing with a few labels, like four or five and a few indies, to dealing with literally tens of thousands of creators directly. And so we are trying to change the organization now to map to that. Spotify used to be sort of a licensing organization and then just a huge consumer organization. If you think of it as a technology stack, it consisted of a technology platform, a recommendations team and then an experience team that wrote all the applications. That was all of Spotify. So what we started doing over the last few years is adding, on top of this consumer-focused stack of recommendation and experience, what we call vertical teams. So there's one team for each creator type now. There is a music team that writes software for musicians. It's kind of its own stack, so to speak. The musicians have their own tools, they have their own revenue model with royalties, they have their own promotional models and so forth. 

 

Pieter Abbeel 

Now, Gustav, during one of your interviews, you cited listener happiness as a very important metric. How does Spotify measure and monitor this? 

 

Gustav Söderström

Yeah. So there are a few different ways to do this. One thing that is interesting with Spotify, and we can talk more about it when we come to different types of machine learning, is that Spotify has both a free tier that is advertising financed and a paid tier where the consumer pays something like $10 per month. And what is interesting with a paid tier is that the user sort of votes every month: is this really worth another $10? So you can think of that as the user voting on the service with their own wallet, which is a very powerful mechanism. So one way to measure happiness is to focus a lot on retention. And you usually see this: businesses that are only advertising financed focus a lot on engagement in the moment, because that is what drives advertising. Subscription-based companies tend to focus on retention, because the user staying and wanting to pay is what drives the value. It's correlated with engagement, but it's not the same as engagement. They may engage less, but they still stay and pay. So because Spotify has a very large number of paying users, and that has actually been the majority of our revenue, we tend to focus a lot on retention as a proxy for user happiness. And I think that is very interesting, because you can look at near-term retention next month, but you can also look at retention three months down the line, or six. And that starts bringing us into this territory where machine learning hasn't been that good, which is looking at long-term optimization instead of just immediate-term optimization, which is something that I'm quite passionate about and that we're investing in. But we can talk more about that later. I think long-term retention is at least a better happiness proxy than immediate engagement. Then we also do some other things. We do user research where we survey people, but in a quantitative way, to try to understand sort of subjective happiness. So one thing is how you use the product; the other is what you say about it. Do they use it in spite of not liking it, or because they like it? So you get a good sense of that as well. So we do a lot of that too when we talk about happiness. 
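For illustration, a tiny sketch of retention measured at several horizons from a signup cohort, the kind of "next month versus three or six months down the line" proxy mentioned above; the column names and numbers are made up.

```python
# Sketch: fraction of a signup cohort still active (or still paying) at
# 1, 3 and 6 months. Data is invented for illustration.
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "signup_month": ["2021-01"] * 4,
    "months_active": [1, 3, 7, 12],   # how long each user stayed (or has stayed so far)
})

for horizon in (1, 3, 6):
    retained = (users["months_active"] >= horizon).mean()
    print(f"{horizon}-month retention: {retained:.0%}")
```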

 

Pieter Abbeel 

I think one of the things that really characterizes internet companies versus regular stores is the ability to do a lot of AB testing, where for every individual consumer you can run a slightly different test, and you can learn a lot from those tests. And so I'm curious, how do you think about the testing process and learning from that? 

 

Gustav Söderström

Yeah. So I can kind of walk through a little bit of this. This has been a journey, obviously, and we've had to invent some of it ourselves, but these days there are lots of good tools to do this. If you look at the machine learning space specifically, there are a few pretty distinct steps. Obviously, one thing that a company needs, not just for machine learning but for any product changes, is a really good AB test platform. And that may sound easy, but it's actually quite hard when you have the scale of many hundreds of millions of users and thousands of developers running tests simultaneously. You need a platform that can actually make sure that a user is not part of more than one test at the same time, that you get what is called power in your tests, statistical power, and that tests run for long enough. So one thing you need to build or buy is a sophisticated AB testing platform where a developer says, I want to test this, and they just get a group of users and can be sure that those users are sort of clean, so to speak. And then we have three levels of sophistication. The first, not unexpectedly, is sort of correlation with past AB test performance. You can look at the user data and say, Pieter got shown these five tracks in the search results and he clicked the third track. So here's a new algorithm, and if it produces that third track as the first suggestion, maybe it's a better algorithm for Pieter, right? So you can just use correlation against past data. But that is a bit sensitive, because in reality Spotify is a huge collection of different systems that produce, for example, a start page for you. And just because you have one algorithm over here doesn't mean that when you put it into the full system, it's going to produce the same thing. So the second thing we've built, which most companies have, is some sort of production simulation, where you actually simulate the entire production system. You have your suggestion for an algorithm, you put it in some sort of simulated or parallel production, and it goes through all the motions that the production system does, but just before it would show you that start page, it stops. So you know exactly the start page that Pieter would have seen, and then you can compare: on the start page Pieter actually saw, maybe he clicked on the third result, and this hidden start page we generated would have shown that result first. So it's more realistic. We call this offline testing, but it's a production simulation, and it's very, very useful. And then there is a third step, which is kind of new for us but very exciting. If you can simulate your production, yes, you can simulate perfectly what the user would see given your hypothesis, but you don't know what they would do, right? You can only guess what they would do. So the really interesting thing is, could you actually simulate the user behavior as well and build a true simulator of the musical world and what a user would do? That's something we've actually invested quite a lot in, and there are some papers on this, where we've built what might be the world's first true music simulator. 
So it's literally using RNNs: a simulator where, based on all the past listening data we have, you can put in a sequence of tracks, say 10 or 12 tracks, and you can get a prediction for how far Pieter, with Pieter's specific listening history and taste graph and so forth, would get. Which songs would he listen to? Which would he skip? And then you can say, well, what if I change my algorithm and re-rank these songs a bit, would he get further, right? And so we've done some simple initial tests with this where you just use the simulator itself to do some forward-in-time, sort of brute-force search to find a policy. And it turns out that this simulator actually is very predictive of real user behavior. And what is interesting with that, and I know you're a reinforcement learning expert here, and I'm certainly not, is that once you have a simulator of the world, you can potentially put an agent in there and start doing real reinforcement learning, because you can move faster than real time. So that's the state of the art for us. 
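A toy sketch of the simulator-driven search described above: a learned sequence model (here a placeholder `skip_probability` function, not the real RNN) predicts skips given what came before, and a brute-force search over candidate orderings picks the one the simulated listener is predicted to get furthest into.

```python
# Sketch: score candidate track orderings by the expected number of tracks
# listened to before the first skip, using a stand-in skip model.
from itertools import permutations

def skip_probability(history, track):
    """Placeholder for the learned model: P(skip | listening history, track)."""
    return 0.9 if track in history else 0.2  # toy rule: users tend to skip repeats

def expected_listens(ordering):
    """Expected number of tracks listened to before the first skip."""
    expected, p_still_listening, history = 0.0, 1.0, []
    for track in ordering:
        p_listen = 1.0 - skip_probability(history, track)
        expected += p_still_listening * p_listen
        p_still_listening *= p_listen
        history.append(track)
    return expected

def best_ordering(tracks):
    """Brute-force search over all orderings (fine for a handful of tracks)."""
    return max(permutations(tracks), key=expected_listens)

print(best_ordering(["a", "b", "c", "a"]))
```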

 

Pieter Abbeel 

So that's kind of interesting to me, because earlier you described the test being run as a developer thinking about an idea, changing something based on it, and testing whether that change has the desired effect or not. But here what you're describing is, once you have a simulator for the whole system, including the listener, you're able to essentially try different things in the simulator and find the one that, in reinforcement learning language, would achieve the highest reward, and then go with that one. And so I'm curious, as you do this, you have to define the reward, which I think is one challenge, right? What does it mean to have a high reward? And another challenge, I imagine, is that the simulator is going to be imperfect. So how do you account for that? 

 

Gustav Söderström

Yes, you're completely right. And I mean, the thing that kind of inspired us, and it was many years ago now, was when DeepMind started investing in reinforcement learning and you saw these incredible results, first DQN on Atari and then Go and so forth. This idea pops up pretty quickly: what if you think of music not as a combative game like Go, where one player has to win and the other loses, but as a collaborative game? You're playing the game of music together, and you think of, for example, the track space as the state space. Like, is that possible? And unfortunately, there wasn't a simulator like the Atari simulator or the Go simulator with perfect rules. So that was the big challenge: can you create a simulator based on past data? And it turns out that that works. And then the question is, what is the reward? In this simulator, the reward is simply getting as far as possible, listening to as many tracks as possible and not skipping. We found it useful to try to cast the entire recommendation problem as a reinforcement learning problem. You basically think of an environment; you have an agent, which is basically Spotify, that performs some action, it recommends you a track; the world updates, Spotify observes that, and it gets some sort of reward for it. So while in the simulator the reward is literally how far you get down this track list, on the sort of meta level, as I said, I think long-term retention is a good reward to look at. And while it's not textbook machine learning, if you squint a little bit, we're also using these so-called survival models, where you have these cohorts of user behavior over time, right? You take a snapshot of all the users at a moment in time. And ideally, you would like to do reinforcement learning on all the possible sequences of choices you've seen and see who retains the longest. That is not computationally very efficient, especially not to do every day for every user. But we've done some things where you kind of squint and do a supervised method instead, a maximum likelihood estimator, where you squish it together so you actually ignore the sequencing, and you just say, for this bunch of tracks with this type of play history, the retention is this long; for others, it's shorter. So you get something that, if you squint, kind of looks like an estimate from some sort of Q-learning or something. So we're trying to apply that concept of optimizing for long-term behavior. And what is interesting is that you really do see what you would hope to see. What most companies have done so far is optimize for immediate engagement, which is, did the user click or not? Or, for us, did they listen to a 30-second stream? And then the question is, does optimizing for the long term actually change anything, or does it just produce the same results? And the interesting thing is, it doesn't just produce the same results. It turns out that sometimes, even if you optimize for lower engagement in the moment, you get longer-term retention wins. And from one point of view, that's not unexpected. If you simplify, you might expect that for a simple algorithm, at any given moment, playing your top favorite Beyoncé track again and again will be the most likely click. But you also have the sense that at some point Pieter is going to get bored with that, so at some point I should go somewhere else. 
And you know, there are methods like explore-exploit to do that. But you can also look at long-term retention and understand that maybe there are things that look slightly worse initially, but you find something new that interests you. So that's something we are looking at and investing in quite a lot. And like I said, it fits our specific model, because we have this voting of, do you actually want to pay for this? I think that, one, keeps us honest, the users still vote, and two, you can avoid getting trapped in an immediate short-term optimum, which is something that we care deeply about and something that I think we have a lot of responsibility to think deeply about. 
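A compact sketch of the "squint and it looks like a value estimate" idea described above: ignore the exact sequencing, summarize a play history into features, and fit a supervised model that predicts months of retention. The features and numbers are invented; this illustrates the shape of the approach, not Spotify's model.

```python
# Sketch: a supervised regressor from play-history summaries to observed
# retention, used as a rough long-term "value" estimate.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row summarises one user's listening: [familiar plays, discovery plays, skips]
X = np.array([
    [120, 5, 30],
    [80, 40, 20],
    [60, 70, 15],
    [150, 2, 50],
])
months_retained = np.array([3, 9, 14, 2])  # observed retention per user

value_model = LinearRegression().fit(X, months_retained)

# Estimated "value" of a discovery-heavy listening pattern vs. a repeat-heavy one:
print(value_model.predict([[70, 60, 18], [140, 3, 45]]))
```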

 

Pieter Abbeel 

I'm curious about the explore-exploit you mentioned, because reinforcement learning is known to be very different from regular supervised learning. In supervised learning, the data is kind of a given, in some sense. There is a process that generates it and you need to internalize the pattern; you don't necessarily go change that process. You might collect more data on one thing than another because you have, you know, something missing, but in reinforcement learning, that's part of what the agent is supposed to do itself. It's not the designer who decides what data to collect. It's the agent that has to decide, let me try something a little different, and it's all part of the AI system to try things it has never seen before and learn from there. And so I'm curious, how does that play out in this context? I'm especially curious because you have so many listeners that, I mean, it seems like you could explore with one listener, and what you learned there then alleviates the need for exploration with another listener, because, you know, they are similar in some way. 

 

Gustav Söderström

If you look at the music simulator case, where we have a sort of true simulator, it is quite standard reinforcement learning. What we did initially, as I said, was more of a brute-force search to find a policy. The problem with that policy is that once you put it online, it doesn't keep updating with real data. So what you actually want to do is something like DQN, where you bootstrap an agent and it can do full explore-exploit inside that space and converge, and then you can put it online and it keeps learning, but then only in real time, and it already starts quite close to a working policy. So in that sense, the simulator is the thing that allows you to do real explore-exploit in a safe way. Obviously, doing a lot of explore-exploit live has two challenges. One is that you can disappoint the user, obviously, depending on what percentage of users you explore on. And the second, which has been very important for us, is that when you do these things live, it's very important that you understand what it is you're doing before you start exploiting it. And so the benefit of a simulator is you can do exploration, you can find a policy, and then you can try to understand and look at that policy in a more qualitative way and say, well, what is it that this thing is actually doing? So we've actually been quite careful with doing very aggressive explore-exploit on our content, specifically if we don't feel that we understand what it is doing. I think this is a trade-off with these algorithms. They're very exciting and they can be very effective, but I also think there's a lot of responsibility in trying to understand as much as possible about what they do before you leverage them. 
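A simplest-possible sketch of explore-exploit inside a simulator: an epsilon-greedy agent tries actions against a stand-in simulated user, keeps running value estimates, and converges on the best action before anything would be considered for a careful live rollout. The simulator, action names and numbers are invented.

```python
# Sketch: epsilon-greedy bandit learning against a stand-in user simulator.
import random

ACTIONS = ["more_of_same", "adjacent_genre", "wild_card"]
value = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

def simulated_reward(action):
    """Stand-in for the learned user simulator: tracks listened before skipping."""
    return {"more_of_same": 4.0, "adjacent_genre": 5.5, "wild_card": 2.0}[action] + random.gauss(0, 1)

epsilon = 0.1
for step in range(10_000):
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    action = random.choice(ACTIONS) if random.random() < epsilon else max(value, key=value.get)
    r = simulated_reward(action)
    counts[action] += 1
    value[action] += (r - value[action]) / counts[action]  # incremental mean update

print(value)  # converges toward the action with the highest simulated reward
```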

 

Pieter Abbeel 

So, in the language I'm used to, it feels like your approach is largely a model-based RL approach, where you do a lot of learning in a learned simulator, effectively. Rather than directly learning in the real world, you learn the simulator and then you do the learning for decision making inside that learned simulator. And then, you know, gradually you can change a little bit of what you deploy in the real world. Now I'm curious, as the agent, or your systems overall, shift strategy in some way, is there a gradual rollout? Is there a notion that you just roll out everything to everyone at the same time? How does that work? 

 

Gustav Söderström

So I would say, first of all, that much of our machine learning is still more traditional supervised machine learning. The reinforcement learning is what we're investing in right now, and something I'm very excited about, but it's still a pretty small part. It's just what I happen to be passionate about. So I just want to state that the entire system is not there at all yet. When we roll out, whether it's something we've tested in a model or we've just used supervised data, what we do is a monitored rollout. To be specific, sometimes you do a thing called a canary, where you just do a small test to understand if you're completely wrong, if metrics go haywire, right? And once you've done that canary, you simply start rolling out to a small percentage of users, and then you have all of these so-called guardrail metrics where you try to understand what is happening. And this is actually a big discussion: what does it mean to be good? It's not as easy as it sounds. Like I said, is it immediate engagement that should go up or down? Or is it long-term retention? And as I said, we tend to focus more on long-term retention than just the immediate metrics. Obviously, if usage goes down, you would hesitate; you would at least keep it there until you see whether it has positive retention effects. So there's a bunch of guardrail metrics you look at. And then you obviously also look at qualitative feedback, and you have all of these stability metrics, and there's this whole chain of tests, and then you tend to roll out. First you keep it to a small percentage, then you gradually roll out faster and faster. And then ideally, you might actually want to keep it a bit below 100 percent, just to keep understanding what the benefit of this feature is long term. You know, to get some sort of, I guess the word is counterfactual. It's like an ongoing AB test. So that's roughly how these rollouts work. Some things need to be rolled out at the same time to the entire population, especially if they have network effects between users, but most features don't; they can be rolled out to a percentage group at a time. 
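A sketch of how a gradual rollout with a holdback group might be wired up, assuming deterministic hash-based bucketing; the percentages, feature names and the permanent holdback slice are illustrative, not Spotify's actual system.

```python
# Sketch: users land in stable buckets by hashing their id; the rollout
# percentage is widened in stages after guardrail checks, and a small slice
# never gets the feature so it can serve as an ongoing counterfactual.
import hashlib

HOLDBACK_PERCENT = 1  # never exposed, kept as the counterfactual group

def bucket(user_id: str, feature: str) -> int:
    """Deterministic bucket in [0, 100) per user and feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    b = bucket(user_id, feature)
    if b < HOLDBACK_PERCENT:          # holdback slice: always off
        return False
    return b < HOLDBACK_PERCENT + rollout_percent

# Canary at 1%, then widen in stages as guardrail metrics stay healthy.
for stage in (1, 5, 25, 99):
    enabled = sum(is_enabled(f"user_{i}", "new_ranker", stage) for i in range(100_000))
    print(f"rollout {stage}%: {enabled / 1000:.1f}% of users enabled")
```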

 

Pieter Abbeel 

I think it's really interesting that you keep things from being activated at 100 percent so you have the counterfactuals, to make sure that also in the future this feature still matters and isn't superseded maybe by another feature that is more effective at covering the same information, in some sense. 

 

Gustav Söderström

Exactly. It's also this assumption that the world is stationary. Maybe it isn't. Maybe the underlying usage changes and what you thought was a great benefit isn't anymore, which you would see if you had kept a percentage out. The problem with that is you don't want to keep too many of these holdout percentages for too many users, obviously. 

 

Pieter Abbeel 

So the world being non-stationary, that really intrigues me, because it's definitely the case. I mean, people always change. People get bored and they have different ideas of what they want to do and so forth. So I'm curious, as you look at the trends in both the listener and the creator space, what are some big trends that you're seeing happening in the Spotify world? 

 

Gustav Söderström

So there are so many levels to it. One is, if you look at music, for example, you can see that the recommendation problem itself is not stationary. People change tastes. One way you could have modeled it is to think that taste is innate, there's something in your DNA, and you have to figure out Pieter, but once you've figured out Pieter, you just keep inferring from that. And people actually thought that was true for a very long time. I haven't really proven this, I don't know how to prove it, but here's an amateur theory: in the sort of old download world, where you either had to pay a dollar per new track or you had to go to a pirate network and spend minutes downloading, there was this marginal cost to exploring. So it was quite expensive to develop your taste. Now, if you're young, first of all, music is a big part of your identity, you tend to have much more time, you would sit on these pirate networks and explore, and you're very influenced by friends. But then maybe when you turn 25, 26, 27, you get a job, you get kids, and most people just sat and listened to their CD collections. Doing exploration in the world of CDs, or even paid tracks, was very expensive. So people kept listening to the same music when they were 50 as they did when they were 25. And so it's not strange that people then inferred that maybe music taste is stationary. What was really interesting when Spotify came along, with this access model where there's no incremental cost to one more track, whether you're in the free tier or the paid tier you don't pay per track, was that we saw very clearly in the data this user behavior for anyone who was born around my year, 1976. The most popular first search by far for anyone that age was Metallica, which happened to be the most popular when I was in my formative teenage years. So it seemed to support the hypothesis that anyone who was then thirty-five, thirty-two or whatever it might have been, they still like Metallica. But then what happened was that after they had been on the service for a year, their tastes started drifting and evolving again. So it's almost like you got stuck in taste and time because of the old music model, but once you're on a flat-fee model, you start evolving again. So I'm absolutely convinced that music taste is in no way stationary at all. And now, you know, we say it doesn't matter if you're 65, you may be listening to the latest hip hop anyway. It was mostly a question of friction. So that's one example of, I think, non-stationarity, which means our recommendation algorithms can't just model you once and then assume that holds. They have to keep moving. And then on the creator side, a lot of things are happening. Specifically in podcasts, you see some creators starting to charge for their content and building businesses. So the entire creator ecosystem is also changing; it doesn't look the same. I would say one of the biggest changes right now in the entire media industry is the shift from everything being about consumers, with creators belonging to these big organizations, to creators being independent, and it's more and more about creators. This is what we're trying to invest in as well: trying to allow new business models for creators, trying to allow interaction with fans and so forth. You know, not being a gatekeeper, but an enabler. 

 

Pieter Abbeel 

Well, Gustav, I think that is really beautiful. And as a creator, I appreciate all the effort you and your team are putting into making sure that we can reach our audiences. And then that way have them learn about this conversation here, for example. So initially, Spotify was in the music business but then also started streaming podcasts. And one of the big things, of course, in machine learning is that when you just start out, you don't have a whole lot of data. And so I'm curious, how could you get started on making the right recommendations for these new podcasts? 

 

Gustav Söderström

Yeah, we got a little bit lucky there. As I mentioned, we built this knowledge graph, sort of handcrafted, to map out the world, because we didn't have that for podcasts, and there were a lot of manual heuristics initially until we got listening data. But we did find something really interesting. I don't know who tried it first, but once we had a little bit of listening data, we tried taking your music vectors, these, you know, 40-dimensional vectors, they were probably more than 40 by then, and we tried putting them into neural nets to see if we could predict which podcasts you would like. So basically a neural net that takes your music vector as input, and the output is sort of multi-label: you would probably like these five podcast shows. And it turned out to work amazingly well, which was quite surprising to me. It wasn't at all clear that your music taste would be predictive of your podcast taste, but it turned out to be. 
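A small sketch of the cross-modal idea described above, assuming a toy multi-label network that maps a music-taste vector to per-show like probabilities; the dimensions, data and training loop are made up for illustration and are not Spotify's actual model.

```python
# Sketch: a small network from a 40-dim music-taste vector to independent
# "would like this show" probabilities, trained with a multi-label loss.
import torch
import torch.nn as nn

N_USERS, MUSIC_DIM, N_SHOWS = 512, 40, 100

model = nn.Sequential(
    nn.Linear(MUSIC_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, N_SHOWS),  # one logit per podcast show
)
loss_fn = nn.BCEWithLogitsLoss()          # multi-label: each show is an independent yes/no
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

music_vectors = torch.randn(N_USERS, MUSIC_DIM)              # stand-in taste embeddings
liked_shows = (torch.rand(N_USERS, N_SHOWS) < 0.05).float()  # stand-in listening labels

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(music_vectors), liked_shows)
    loss.backward()
    optimizer.step()

# For a user with no podcast history yet: rank shows from their music vector alone.
probs = torch.sigmoid(model(music_vectors[:1]))
print(probs.topk(5).indices)
```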

 

Pieter Abbeel 

Wow. That's definitely surprising to me. That's so interesting. 

 

Gustav Söderström

Yeah. So that helped us a little bit, specifically with the bootstrapping problem: even if we had a knowledge graph of podcasts, what are the first things you show to a music user when you introduce podcasts? And it's quite important, because if you show something very irrelevant, they're probably going to dismiss the whole thing. So it was quite helpful to be able to have some relevance from day one. 

 

Pieter Abbeel 

Well, I mean, generally in machine learning there is a lot of transfer happening from one task to another. But it's rare when it crosses modalities, and it seems like podcasts and music are very different modalities. And so, yeah, that transfer is surprising. 

 

Gustav Söderström

Yeah. I'm not sure what it means. And you know, there could be discussions around why that is. What is it that is shared between those two? I'm not sure, but it's certainly interesting that it seems to work at scale. 

 

Pieter Abbeel 

Thanks, Gustav. 
