Originally Broadcast: July 14, 2023
Ali El Rhermoul and Jon Radoff's team won the grand prize at the a16z Tech Week AI/Virtual Worlds Hackathon. Their entry is a multiplayer online game featuring a persistent world, generative AI for storytelling (Anthropic Claude) and immersive environments (Blockade Labs, Scenario), and a client/server architecture built on Beamable + Unity. It introduces a new technique called "semantic programming," which uses an XML-based format to provide consistency between generative and game components.
This episode covers what Jon & Ali learned:
00:00 Introduction
01:30 Modding & UGC
04:45 Hackathon Summary
08:00 Anthropic Claude
10:15 Vector Embeddings & Databases
16:00 Multimodal Experience
21:05 Multiplayer Prompting
24:00 Semantic Programming
29:35 AI Impact on Engineering
37:56 Procedural Content
40:00 Narrative
44:20 AI Dungeon Masters
Jon Radoff: You would start playing the game and it would be an interactive text game, essentially, where you're talking to it and you could say whatever you want to do. So you could try this out in something like ChatGPT and kind of simulate it, but in our example, we brought the environments to life.
Ali El Rhermoul: We used Blockade Labs' technology for displaying rooms.

Jon Radoff: Welcome to a special episode of Building the Metaverse. It is special because almost every one of these is with someone that I don't actually work with every day. But today I've got Ali El Rhermoul, who's the CTO of Beamable. I work with Ali over at Beamable. And today we're going to be talking about the hackathon that we just won out in L.A. for a16z Tech Week. There was a virtual worlds hackathon involving artificial intelligence, and we built an RPG game in one day. I wanted to bring Ali onto the program so that we could talk about all the technologies we used, all the artificial intelligence technology we used and the non-AI technologies that accelerated our development. So to kick things off, first, welcome Ali to the program. Thank you. So let's actually start with maybe the big picture here. You started your life in game development and technology as a modder, right? It feels like modding is almost the gateway to a lot of this, and we're almost reconverging with modding again. What do you think of that notion? Like, what did you learn in your journey doing that stuff?
Ali El Rhermoul: Yeah, sure. So just a little bit of background, you know, I started off doing a lot of modding on RPG games, especially multiplayer RPG games, games like Vampire: The Masquerade, but also real-time strategy games like StarCraft and Dawn of War, as well as Neverwinter Nights, which was sort of the culmination of that, since it allows you to create a fully persisted world that has its own scripting language and so forth and so on. And yeah, I mean, look, it's frequently the case that the modded assets are even better than the original assets from the game. And people have gone to really great lengths building awesome tools, so you could procedurally generate environments and create really large-scale things. So that was a natural foray into game development, and I'm very excited by what's going on in UGC today and the transition from games as a service to games as a platform.
Jon Radoff: My version of modding growing up was basically Dungeons and Dragons, which I loved playing, and then just figuring out how to bring Dungeons and Dragons to life in various online and computerized formats. So it feels like it's just on some kind of continuum there, but that was really the inspiration that we had for this hackathon. Like, we didn't write any code before coming to this, but we did talk about what we wanted to build, and we had this idea of, could we capture Dungeons and Dragons, essentially? So what were your initial thoughts around that? I know we talked about Roll20 and you were interested in creating a role-playing environment.
Ali El Rhermoul: Yeah, absolutely. So, you know, actually, that Neverwinter Nights experience was largely with a group of role players, and I didn't have a group of people around me physically who would play tabletop RPGs. So this was the only medium that I had. Later, Roll20 became a really popular way to recreate the, you know, authentic tabletop RPG experience, but remotely, with the same group of friends that I had made in those modding days. And so what I wanted to explore was: what could we do?
Ali El Rhermoul: What kind of game could we build with AI that was sort of only really possible with AI?
Ali El Rhermoul: Right? And what makes Dungeons and Dragons and tabletop RPGs special is that they're inherently open-ended and you can do anything, right? Like, the DM has the ultimate say, but you can do anything. One moment you can be, you know, walking around a city in Faerûn; the next, you're in Sigil, you know, on a different plane. So that level of open-endedness, both game-mechanically as well as environmentally, hasn't really been possible, because with games you kind of have to commit to a set of game mechanics and a set of assets. And then you're basically stuck, right? You might provide updates, but it's not going to be generated on the fly. So generative AI opens up sort of the possibility to, hey, what if you could do what Roll20 does, but inside of a full game, and have the LLM basically be your game server, your storyteller?
Jon Radoff: I'm going to show the video of what we created. And this is a time where, if you're listening to this as a podcast, I do recommend you drop by the video and actually see this, because it'll give you a real flavor for what we did. So, inspired by the Dungeons and Dragons experience, we get you started playing, and we came up with a unique character creation process. We didn't want to have to go through the typical character creation. I had played games like Ultima years and years ago where they had this interesting question-and-answer, morality-quiz format. So I started with that, and we ended up coming up with this idea of a unique tarot deck, essentially, that was based on Dungeons and Dragons. And based on three selections there, it would figure out your character class, your race, and also a nemesis character, a character that would be opposing you in the world that we're creating. And we'd actually go to the language model for that. We'd use the language model to interpret what these cards even meant. So that was an interesting thing in and of itself. And then you would start playing the game and it would be an interactive text game, essentially, where you're talking to it and you could say whatever you want to do. So you could try this out in something like ChatGPT and kind of simulate it. But in our example, we brought the environments to life. We used Blockade Labs' technology for displaying rooms. So this maybe leads to a quick synopsis of what I think we're going to cover today that everyone will probably want to learn about. So first of all, how do we modularize those things? What was the language model technology we used? We actually didn't use ChatGPT, we used Anthropic. So we can talk a little bit about that. We can talk about the other graphical systems we used: we used Blockade Labs and Scenario. And we can talk about some of the unique methodology that we established to actually create the code and the language prompt elements and bring those together when we built this over the day. So let's start with the structure of the system. The language model was something that we had to choose first. So why did we end up choosing Anthropic's Claude instead of ChatGPT, which of course we have access to as well?
Ali El Rhermoul: Yeah, so let me start by saying first, Jon was an awesome prompt engineer on this project. He's a bit of an AI whisperer. He was one of the early jailbreakers of ChatGPT, trying to get it to do things that the OpenAI folks explicitly wanted it not to do, which was super cool to watch. Check out his blog for some of the techniques that he used. But that inspired me, and I thought, hey, look, this would be awesome: let's see how far we could push the LLM.
Ali El Rhermoul: And in my case, I had done some experiments with OpenAI, quite successfully, basically with our company, to try to get it to answer questions about our documentation. And that was super cool, and it was able to do so competently, but I kept running up against this token limit, which at the time was 4,096 tokens with GPT-3.5 Turbo. So I got beta access to Anthropic's Claude, which has a 100,000-token limit, which amounts to about 75,000 words, I believe, which is a lot of content. And this is especially interesting for something like a Dungeons & Dragons campaign, because you want to keep track of everything that's happened, as well as context about the campaign, character sheets, how those character sheets evolve over time, and maybe even extract data from the encounters and what's happening in the game to persist that data in a database, to try to give it continuity. So Claude seemed like a really good fit. We started exploring it; it was very competent, very fast as well. And that 100,000-token limit was absolutely vital.
Jon Radoff: Yeah, so it had a much greater limit. And in the documentation experimentation that you did, you were using LangChain and vector databases, and essentially consuming the whole document and then allowing people to do this chat-with-a-book kind of interaction with it, which, without the very large context limit, you couldn't really do effectively.
Ali El Rhermoul: Yeah, and just so folks out there understand, fundamentally what the chatbot in ChatGPT is doing to remember your answers is it's taking everything you've said so far, concatenating it to your new message, just adding it and resending it all. So the way that these APIs work is that they're inherently stateless; they have no memory. So in order for it to have a memory, you need to include everything that was said before, and then it will be able to provide you contextual clues and answers based off of the previous history. So that's what we inherently had to do. Everything that any person says, or any prompt to Claude, which is Anthropic's AI, we needed to do that work to concatenate. This gets challenging when you have token limits, because guess what, as that history gets longer, the API is going to yell at you and say, hey, you've exhausted your token limit. So the solution to that is either to increase the size of the token limit to something like 100,000, which is what Claude does, or another way to solve that problem is to do a vector search with embeddings. And what is that? That's basically the ability to only include the meaningful subset of history that is useful to answer the problem. So let's say, for example, you have a dictionary or some large database of documentation, as was the case in my experiment. You don't necessarily need the entire Beamable SDK to answer the question, well, how do I update my inventory? You just need the inventory API information. So how do you solve that? Well, when I ingested the entirety of our SDK codebase, the first thing you do is you feed it to the embeddings model, which turns it into a vector array. It's just a list of numbers. You take those numbers and you save them in the database alongside the content. Then when a user asks a question like, how do I update my inventory? You take that prompt, that question, and you vectorize it as well, again feeding it to the OpenAI embeddings API. And then you compare the vector array of the prompt to the vectors that are stored in the database, right? That's essentially what you're doing. And you're using some pretty common off-the-shelf algorithms to do this, like k-nearest neighbors and what's called cosine similarity. So that will give you back a list of data with a score that indicates how close it is to the prompt. And then from there you can be smart about it and say, well, I'm only going to select stuff that's above a certain score, I'm only going to select stuff up to a certain token limit, and include that as part of the call.
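To make the retrieval flow concrete, here is a minimal sketch of the approach Ali describes, assuming the pre-1.0 `openai` Python SDK; the documentation chunks, score threshold, and character budget are placeholder values, and a real system would use a proper vector database rather than an in-memory list.

```python
# Minimal sketch of embedding-based retrieval (assumes the pre-1.0 openai SDK).
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def embed(text: str) -> np.ndarray:
    """Turn a piece of text into a vector via OpenAI's embeddings endpoint."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp["data"][0]["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: how close the question is to a stored chunk."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Ingest: embed each documentation chunk and store vector + content together.
chunks = [
    "Inventory API: how to add, remove, and update items for a player.",
    "Leaderboards API: how to post scores and fetch rankings.",
]  # placeholder content standing in for the whole SDK docs
index = [(embed(c), c) for c in chunks]  # stands in for a vector database

# 2. Query: embed the question, rank chunks, keep only what fits the budget.
def retrieve(question: str, min_score: float = 0.75, max_chars: int = 4000) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    selected, used = [], 0
    for vec, content in ranked:
        if cosine(q, vec) < min_score or used + len(content) > max_chars:
            break
        selected.append(content)
        used += len(content)
    return selected

context = retrieve("How do I update my inventory?")
# `context` is then prepended to the chat prompt instead of the entire SDK docs.
```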
Jon Radoff: Yeah, so there's a whole bunch of things to unpack here. First of all, there's a lot of open source available to try this yourself. So if you want to give it a try, you actually don't have to be a super machine learning engineer or anything. There are a lot of open-source Jupyter notebooks and things that you can just pull together: pull in LangChain, do the vector embeddings, and you can try this chat-with-a-book stuff on your own, probably in a couple of hours, even for someone who's only lightly familiar with Python. There's a guy named Greg Kamradt who I think does a really good demo of how to do this. So I will actually provide a link in the show notes for anyone who wants to see a walkthrough of setting these things up. But yeah, the token limit has a big impact on how smart the responses from this are going to be, and that's a big part of what we discovered. You brought up another thing, though, which is the way chatbots and chat systems actually work. And it is a limitation of the design of how we approached this and maybe points to some other areas that we'll want to expand or rethink in the future. So I want to revisit what you were talking about. With a chat system, the way it effectively maintains context and consistency across the course of a conversation, and it may not be apparent to you when using a chat interface, is that it's actually concatenating all those chat messages together. And then at the very end, it's essentially a completion. And the completion is the next thing that the language model is going to add to the end in response to not just the most recent prompt you've entered, but that whole sequence of prompts, including what the LLM gave you along the way. So that whole dialogue has to be revisited. There's a huge downside of this in that the more you build up that session, these prompts in the background are getting kind of gigantic, and you're passing back a huge prompt, which of course is going to consume tons and tons of tokens. So while the demo is super cool, there are most likely some economic limitations to this, because once you get deep into a story, people are probably not going to want to pay for what those token exchanges actually cost. So we talked about a few solutions to this. One that occurred to me, and this was sort of a funny thing that happened while we were working on it, just out of curiosity while we were at lunch, was: has anyone tried to optimize this in the language model itself? And I was asking ChatGPT and Anthropic, like, hey, I know it's stateless, but has anyone ever tried to save state in the midst of a conversation so that you can kind of resume the language model from where it left off? That was sort of my initial thinking. And Anthropic took me through this whole wonderful litany of approaches that people had invented to do this. I think it called it "state preservation mode" within language models. Well, it turns out this is complete nonsense. This is completely hallucinated. I tracked down everything that it produced. And I will say Anthropic is very fanciful in the kinds of responses it can come up with. And if you call it on it, it'll be like, oh yeah, I just inferred the existence of that, sorry. So maybe that isn't an approach. What other things do you think we might be able to use to improve efficiency, kind of storing the memory and bringing you back to context within the experience of the game?
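As a rough illustration of the statelessness Jon and Ali describe, here is a hedged sketch of a chat loop in Python: every turn re-sends the entire transcript, so the prompt (and the token cost) grows with the session. The `complete()` function is a hypothetical stand-in for whatever LLM client you use.

```python
# Sketch: chat "memory" is just re-sending the whole history on every call.
history: list[dict] = [{"role": "system", "content": "You are the dungeon master."}]

def complete(messages: list[dict]) -> str:
    """Hypothetical stand-in: swap in a real call to Claude or GPT here."""
    return "A placeholder reply from the model."

def player_turn(player_input: str) -> str:
    history.append({"role": "user", "content": player_input})
    reply = complete(history)  # the FULL history goes out on every single turn
    history.append({"role": "assistant", "content": reply})
    # Rough estimate (~4 characters per token) shows why long sessions get expensive.
    approx_tokens = sum(len(m["content"]) for m in history) // 4
    print(f"Prompt size is now roughly {approx_tokens} tokens")
    return reply

player_turn("I enter the tavern and look around.")
player_turn("I ask the innkeeper about the ruins to the north.")
```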
Ali El Rhermoul: Yeah. Well, I guess another aspect of this which we should touch on is that this demo didn't have just text. It also had visual assets, right? So we were using Anthropic's LLM, Claude, for the adventure prompts and the character creation, right? To spit out a character sheet, but also to advance the story. We got it to respond with really interesting information, like: here's the music mood, here's the story so far, here are the characters in the room and the items in the room and so forth. There were other things: we used Scenario, which I think behind the scenes is a Stable Diffusion model, to generate character portraits. We used Blockade Labs' skyboxes to generate the environments and what's going on around you. We used OpenAI to generate the embeddings so that we could store all of that stuff. So it's hilarious how we used this whole concert of different AIs to create a cohesive experience. And what was interesting is these things take different amounts of time. The tags, the character sheet, and the event reports take seconds; they're virtually instantaneous. But generating a high-resolution skybox for the environment from Blockade can take anywhere from a few seconds to, in some cases I've seen, minutes, which can be very disruptive to the storytelling. Same thing with the portraits; sometimes it can take a long time. So how do you get around some of these things? Well, you can use a vector database: basically, any time anybody asks for an environment, you don't just generate it, you save it as well. And you save an embedding, a vector list representing the description of that asset, in the database. That way, for subsequent players or subsequent times you ask, hey, I enter the tavern, well, guess what? Not all the taverns need to look, you know, very different. So you might actually ask the database first and say, hey, is there a tavern that fits this description? And if the answer is yes, great, just grab that asset directly from the database. And if the answer is no, then go ahead and generate it. So that starts to lead into a much more cost-efficient and, in some cases, just a better, more fluid gameplay experience.
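A minimal sketch of the caching idea Ali describes, with hypothetical `embed()` and `generate_skybox()` helpers standing in for the embeddings API and the Blockade Labs call; the similarity threshold is a made-up tuning value.

```python
# Sketch: reuse a previously generated environment if its description is close enough.
import hashlib
import numpy as np

asset_db: list[tuple[np.ndarray, str]] = []  # (embedding of description, asset URL)

def embed(description: str) -> np.ndarray:
    """Hypothetical placeholder: a real system would call an embeddings API."""
    seed = int(hashlib.sha256(description.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).random(64)

def generate_skybox(description: str) -> str:
    """Hypothetical placeholder for the slow skybox-generation call."""
    return f"https://example.com/skybox/{abs(hash(description))}.png"

def get_environment(description: str, threshold: float = 0.9) -> str:
    vec = embed(description)
    for stored_vec, url in asset_db:
        score = float(vec @ stored_vec / (np.linalg.norm(vec) * np.linalg.norm(stored_vec)))
        if score >= threshold:
            return url                      # close enough: reuse the cached tavern
    url = generate_skybox(description)      # otherwise pay the generation cost once
    asset_db.append((vec, url))
    return url
```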
Jon Radoff: Yeah, another idea is maybe to have the language model itself create summarized versions of the adventure so far, so that we can store the things that are actually important to the storytelling in a more compact format, and play with prompts so that it could incorporate that history alongside the more immediate inputs that you're dealing with. We didn't have the time in one day to experiment with that, but that's going to be a really interesting thing I'd like to see us experiment with in future versions of this. Or maybe, frankly, since we're going to open source it, the community out there would love to just grab the code and experiment. I think this could be a platform where people can try out a lot of things. The thing that you got to, though, that I think is really interesting is that the vector embeddings aspect of storing some of the things that happened in a database is intriguing not only for the efficiency aspect, but for actually going beyond a linear narrative that you're constructing as a player. This is another area we dreamt about over the course of this day, yet another of these things that would be really awesome to add: really making it a multiplayer, persistent world so that everybody could come together inside the adventure. That really does require some way of taking all of the quote-unquote rooms that you travel to in the adventure, storing them, and then allowing users to inject their prompts of what they're saying they want their character to do collectively in these virtual spaces. And now we're talking about billions of tokens across all the players that are going to play in a virtual world. So you start to have to get into actually storing structured data, or some structured data, about the things that you're doing. Any thoughts around that?
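Here is a hedged sketch of the summarization idea Jon raises: once the running transcript gets too long, ask the model to compress the older portion into a synopsis and keep only the recent turns verbatim. `complete()` is again a hypothetical stand-in for the LLM call, and the budgets are arbitrary.

```python
# Sketch: compress old history into a summary once the transcript grows too large.
MAX_CHARS = 8000   # rough budget (~4 characters per token)
KEEP_RECENT = 6    # number of recent messages kept word-for-word

def compact_history(history: list[dict], complete) -> list[dict]:
    if sum(len(m["content"]) for m in history) < MAX_CHARS:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = complete([{
        "role": "user",
        "content": "Summarize the adventure so far, keeping every detail that "
                   "matters to the ongoing story:\n" + transcript,
    }])
    # The summary replaces the old turns; recent turns stay verbatim.
    return [{"role": "system", "content": "Story so far: " + summary}] + recent
```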
Ali El Rhermoul: Yeah, a lot of thoughts. So I was playing around with adding some of the multiplayer capabilities, and there are a lot of things that change right off the bat when you go from single player to multiplayer. First of all, addressing the player in the second person becomes kind of weird, because if Claude is saying "you" all the time, well, who are you talking to? Are you talking to me, player A, or player B, right? So that changes certain things; you have to recraft the prompt for that. To speak to what you said earlier about iterating with the LLM, another thing I was playing around with, prototyping in the chatbot interface, was this: I asked it, you know what, Claude, any time I say something in quotes, repeat it exactly how I said it. But if I just say, for example, "enter the tavern," you can summarize it, add some flavor text, and retell it. And so then what you get is a continuation of the story that feels very story-like, that becomes very injectable, instead of the sequence of: Jon said blah, blah, blah, and then Claude responded, and then Ali said blah, blah, blah, and then Claude responded. It becomes more like: Jon entered the tavern, Jon slammed his fist on the bar and asked for an ale, the bartender turned, and so on. So it starts to feel like a novel. And that's really, really interesting, right? Because that is searchable, that is vectorizable, and it's also a cohesive history that reads very easily. And it's also very shareable.
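As a sketch of the "novelization" idea Ali describes, the instructions below are illustrative, not the actual hackathon prompt: quoted speech passes through verbatim, other actions get retold in third person, and the result is a shared log that reads like a story.

```python
# Sketch: turn each player's raw input into third-person narration for a shared log.
NARRATION_RULES = """You are the storyteller for a shared fantasy world.
When a player submits an action, retell it in the third person with brief flavor text.
If the player puts words in quotes, repeat those words exactly as spoken dialogue.
Then continue the story for everyone present in the scene."""

def narrate(player_name: str, player_input: str, complete) -> str:
    """`complete` is a hypothetical LLM call taking a list of chat messages."""
    messages = [
        {"role": "system", "content": NARRATION_RULES},
        {"role": "user", "content": f"{player_name}: {player_input}"},
    ]
    return complete(messages)  # e.g. "Jon entered the tavern and called for an ale..."
```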
Jon Radoff: I'm just thinking out loud now, but it's almost like you could store each room. A room, in a sense, can be anything in this: it could be literal rooms, it could be, you know, you're in the tavern, or a building, or you're out in the mountains. If anyone out there has ever played a MUD, a multi-user dungeon, that's the kind of room I'm talking about: any scoped area where the activity can happen. It seems like there are a lot of interesting things we could do on a location-by-location basis. First of all, just the idea that people are prompting into existence this unfolding world, where it populates more and more of it. This is a mind-blowing idea to me. I really just want to open source some of this so that people can try it. Because it becomes not just a living world, but a whole world.
Ali El Rhermoul: It's like a fog of war that's constantly being pushed back. And it's just sort of at runtime creating what's behind the fog of war, right? It's really magical.
Jon Radoff: Exactly. Yeah, I love that analogy. And then saving it to a persistent world. So grabbing the descriptions of rooms, storing them in a database, maybe using something like vector embeddings so that we could do a k-NN type search and find, in some cases, that the room you mean is close enough that you should return to that old room, not always be generating new things. Allow players to reconverge and be in a place where they can share the story together and do basically multiplayer prompting. So there were a couple of technologies that we did get to try out in the hackathon that could lead to that pretty quickly, I think. Number one is we did use our own technology, Beamable. We had this microservices architecture, which I'm going to ask you to elaborate on. And that allowed us to build a server around the prompting that was going on. So it wasn't like the client was talking to the language model directly; we actually had game servers that were running the world and could be there to store characters, store rooms, and also act as a proxy so that the API keys are safe on the server side. So that was one. And the second is this approach to programming it that we ended up calling semantic programming, which is about not just treating a language model as an English-language system with text going back and forth, but actually adding metadata and tagging data in the context of it. So I want to cover both of those, but let's first talk about Beamable and the microservices. What role did that play in the hackathon? And also, how would we build upon that to create a truly multiplayer evolving game with fog of war and persistence and all that other cool stuff?
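A minimal sketch of the proxy idea in the abstract (not Beamable's actual API), assuming FastAPI as the server framework: the LLM key lives only on the server, and the shared world log is updated server-side so every player sees the same story.

```python
# Sketch of a server-side proxy; FastAPI stands in for the actual game-server stack.
import os
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
LLM_API_KEY = os.environ.get("LLM_API_KEY", "")  # stays on the server, never on clients

world_log: list[str] = []  # shared, server-authoritative story state

class Turn(BaseModel):
    player_id: str
    action: str

def call_llm(prompt: str, api_key: str) -> str:
    """Hypothetical wrapper around Claude or OpenAI; replace with a real client."""
    return "The dungeon master describes what happens next..."

@app.post("/turn")
def take_turn(turn: Turn) -> dict:
    # Build the prompt from the shared history so every player sees the same world.
    prompt = "\n".join(world_log) + f"\n{turn.player_id}: {turn.action}"
    reply = call_llm(prompt, LLM_API_KEY)
    world_log.append(f"{turn.player_id}: {turn.action}")
    world_log.append(reply)
    return {"story": reply}
```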
Ali El Rhermoul: Yeah, yeah. So there are two good reasons to have a server in the middle there. Number one is you might not want to expose your API keys or require users to have their own API keys. If it's a purely client-driven application that's talking directly to Claude or OpenAI, then your players are going to have access to your OpenAI key or your LLM key, which is problematic. The second reason is shared world state. You might want to broadcast state updates from a server-authoritative standpoint, and you might want everybody to see the same thing. Instead of, say, Jon sends a message to Claude and I send a message to Claude, we get subtly different responses, and that's potentially a problem, right? So that's what we did. We used our microservice infrastructure, which is the ability to write cloud code inside of Unity, which is where we were developing the hackathon project. And then we pulled in all the libraries, or wrote them ourselves in some cases, to communicate with the REST APIs, which are very simple; usually there are only one or two endpoints. And then we did all of the data formatting and extraction, which was also another important part of this. As mentioned earlier, Claude wasn't just spitting out a block of text, it was spitting out a formatted XML block, which says, hey, here's the story, and here's the room you're in, and here's a description of the characters, and so forth and so on. So instead of just dumping that on the clients, you take that, you format that response, and what you output back to the client is a cohesive description of the world, which then feeds into game mechanics and so forth. And that's a lot more compelling and a lot more interesting, and it enables the kind of multiplayer functionality that would be really, really fun.
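To illustrate the parsing step Ali mentions, here is a small Python sketch using the standard library's ElementTree; the tag names and sample payload are illustrative, not the actual schema from the hackathon prompt.

```python
# Sketch: extract structured fields from the model's XML-formatted reply.
# Tag names and sample content are illustrative, not the hackathon's actual schema.
import xml.etree.ElementTree as ET

def parse_turn(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    return {
        "story": root.findtext("story", default=""),
        "room": root.findtext("room", default=""),
        "music_mood": root.findtext("music_mood", default=""),
        "characters": [c.text for c in root.findall("characters/character")],
        "items": [i.text for i in root.findall("items/item")],
    }

sample = """<turn>
  <story>You push open the tavern door and step into the warmth.</story>
  <room>A smoky taproom lit by a low fire</room>
  <music_mood>warm, low fiddle</music_mood>
  <characters><character>Innkeeper</character></characters>
  <items><item>Iron key</item></items>
</turn>"""

print(parse_turn(sample))  # structured data the game client can actually act on
```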
Jon Radoff: So in the show notes for this, I'm going to link that prompt so that everyone can see it. It would be fun to play with. I think something pretty close to it would work in ChatGPT, even though we used Anthropic to do it. And the basic idea, as you look at it, is that we're XML-izing the content. Years ago, Tim Berners-Lee talked about the semantic web, and it never completely took off, but the idea was that you could make a lot of content on the World Wide Web machine-readable by embedding tags or using XML around it, and you could create a whole schema for the way different kinds of information ought to be represented. There are a number of reasons why it never fully took off, but I think that with language models, maybe it will again, because we could actually use the language models almost as a universal translator between pure language, totally unstructured content, and more structured or semi-structured content with this metadata around it. And it's also interesting that we came up with that within about 24 hours of something very similar, which is almost the reverse of it: OpenAI talked about the idea of embedding functions within chat sessions. In that, they're working with JSON data, and it would be used for things like, if you had a function that provided a mathematical expression or a database lookup and you wanted to do it very, very consistently, well, in the chat it could actually make use of those functions. We were the reverse of that: just let the language model do its own thing, let it essentially implement the functionality, and then expose the outputs of that as semi-structured data, and also allow it to interpret structured data as input so that it knew how to do that on a consistent basis. So there are a couple of things that I'd love you to talk about, Ali. One is just, first of all, how does that change the way people might be programming applications in the future? It just struck me while working with you on this that maybe some functionality of programs doesn't need to go directly to code. Maybe, at least in the rapid prototyping phase of a project, you could just have the language model essentially implement functionality, as long as there are inputs and outputs to it. And then, generally speaking, what do you think of these more data-driven approaches to language models? How does this fit relative to something like the OpenAI functions specification that they just came up with? Talk about it from a software engineering perspective.
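For contrast with the XML approach, here is roughly what an OpenAI-style function definition looks like: the function is described as a JSON schema, and the model replies with a structured call that your code executes. The `lookup_item` function itself is hypothetical.

```python
# Rough shape of an OpenAI-style function definition (lookup_item is hypothetical).
LOOKUP_ITEM = {
    "name": "lookup_item",
    "description": "Fetch an item's stats from the game database.",
    "parameters": {
        "type": "object",
        "properties": {
            "item_name": {"type": "string", "description": "Exact name of the item."},
        },
        "required": ["item_name"],
    },
}

# Passed alongside the chat messages (e.g. functions=[LOOKUP_ITEM]); when the model wants
# the data, it returns a structured call such as:
#   {"name": "lookup_item", "arguments": "{\"item_name\": \"Iron key\"}"}
# Your code runs the real lookup and feeds the result back as the next message, the
# mirror image of the semantic-programming approach where the model emits tagged data.
```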
Ali El Rhermoul: Yeah, so I've got a very personal experience there, which is: for those of you that don't know, Jon and I built games together, large licensed-IP games, Game of Thrones and so forth. And these games were RPGs, which meant they had characters, these characters had stats, and they would fight each other. And we had some really competent game designers on our team who had ideas on how these things should work. And it was my job, among other engineers', to actually implement this in code. So they would have a data description of what all of the content is, but then, oftentimes, the actual functions for calculating what the outcomes would be lived in code. Occasionally they would be in content as well, as just mathematical functions that would be evaluated. But ultimately, that just ends up being a really slow iteration process, where the game designer has a vision, they want to iterate rapidly on that vision, find the fun. You know, they're doing paper prototypes, they're trying to get the stuff to look good and feel good in game. And you, as the engineer, are constantly re-implementing that stuff. Well, imagine a world where that all goes away, and basically the game designer talks to the LLM directly and says: here are my rules, here's how I want you to add up stats, here's how I want you to determine outcomes, here's what you're allowed to do, here's what you're not allowed to do, here's what the player is allowed to do. And then basically the game becomes a framework for parsing that output and persisting that state in a way that can be recovered and continued. That's a super interesting approach and a massive departure. Now, to be clear, I don't think the current LLMs are quite there yet, but they could get there. And certainly I would love to see an LLM that's specifically trained to be really good at these kinds of game-mechanical, game-design functions.
Jon Radoff: How do you think software engineering generally is going to be affected by language models?

Ali El Rhermoul: I think that, fundamentally, you're going to get to a place where the iteration time and the direct-from-imagination (this is your term, direct from imagination) approach becomes the standard, where engineers own the framework and the data persistence, and ultimately the LLMs become the mechanical brain, the reasoning behind how things happen. And that's just going to be super, super powerful. And I think you've already demonstrated that. If you're interested in this, take a look at Jon's prompt and how explicit it is about the rules. It's amazing how these LLMs manage to follow those rules pretty darn consistently, even when they're not really all that specialized in it. Now, you can still do some funky things. Like, I was playing around with it, and I was like, OK, I would like to now teleport across the world to another place. Sometimes the LLM will be like, "you cast your spell, and you teleport to..." and they'll roll with it. And that's fun, and that's cool, but you might want a little bit more constraint and game-mechanical rigor. I had another really fun experience where I summoned a Balrog. I was like, OK, now summon a Balrog. And I summoned the Balrog, and the Balrog was basically ready to kill. And the LLM was smart enough to be like, the villagers were none the wiser as you combatted the Balrog outside. So that was kind of a cool spin; it's just really fun, open-ended type stuff. But I do think getting more of that game-mechanical rigor and allowing these models to follow rules in a stricter way is what the OpenAI functions will help with: providing a way to codify that rigor, in a way that allows you to remain within LLM land and let the game servers, or the clients themselves, basically just parse the output.
Jon Radoff: Yeah, something like a function would be really good to implement something like: does your character actually have the skill you want to express? Like, see if the character has that spell in their character sheet, and if so, then they can use it, and feed that back to the LLM so that it has the creative expression of that, if in fact it's going to be an allowed behavior. It's almost like a binary allowed-or-not kind of situation. Another approach that we started to play with was the idea that it doesn't have to be one mega-prompt that handles every single use case within the game, either. You can create different prompts and different contexts for different parts of the game. For example, character creation we had in one prompt that produced the XML output for a character sheet, and then it was a separate prompt that implemented adventure mode, where you had to provide a character sheet of that specification to actually play through it. We'll include both prompts and some examples of this so everyone can get a sense of how it works. But we could have, for example, created a combat mode separate from the storytelling mode, and I could imagine many prompts where we would say: OK, here is exactly how I want you to resolve combat, and it's going to have these steps; we need you to do it consistently; add some creative flair in terms of deciding what your NPCs do in the course of that and make it sound creative, but it's got to follow a consistent rule system. I'm pretty sure, particularly with Anthropic, given the large context limit, we could have done that. But again, we had many, many dreams for this hackathon. We wanted this to be rev zero, and it was pretty magical what you could do even with what we pulled off. But probably a day spent on combat, a day spent on multiplayer, a day spent on a memory system... you could imagine maybe we'll figure out a way to put a few more days into this over time, where we add some of these capabilities. But you can start to think in terms of componentizing the language systems: either allowing them to use tools (function calls are essentially a way to add tools to the logic of the language model), or the language system becoming a bit of a function itself. Now, this sort of leads to what we were talking about earlier, though, which is that this was really key to the way we interfaced with the non-text systems, the graphical systems. We used Blockade to create a skybox, to show what the room you were in looked like. And we used Scenario for 2D picture art, like the portraits of your own character and NPCs. Now, to make that work, we had to pipe the text of the prompts into them, but we had to figure out what prompt to use. So it was actually part of the XML schema that we told the game to generate in response to the story: always give me the character list of who is present in the room, always give me a shortened room description that I can pass to Blockade to get a room that looks like what we wanted. But I want to dream for a little bit, because it does start to make us think about, well, what would it mean to have truly multimodal language models in place? Now, OpenAI has talked about multimodality. So, just to define what the heck I'm talking about for everybody: this is the idea that there's some spatial awareness within the language model, that it could look at a scene, understand the composition of objects in the scene, and that you could actually ask it questions about that scene or the consequences of actions within that scene. We did nothing like that. But you could start to imagine what we could do with multimodal approaches to the language system. Do you want to elaborate on that? I know you've been thinking about this a little bit.
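A hedged sketch of the allowed-or-not gate Jon describes: check the character sheet before the model narrates a spell, and tell it the outcome so the creative expression stays within the rules. The character data and `complete()` call are hypothetical.

```python
# Sketch: gate player actions against the character sheet before the LLM narrates them.
character_sheet = {
    "name": "Aveline",
    "class": "Wizard",
    "spells": {"Fireball", "Mage Armor"},
}  # hypothetical example data

def attempt_spell(spell: str, player_input: str, complete) -> str:
    """`complete` is a hypothetical LLM call taking a list of chat messages."""
    if spell in character_sheet["spells"]:
        rule = f"(Rule: {character_sheet['name']} knows {spell}; resolve it normally.)"
    else:
        rule = f"(Rule: {character_sheet['name']} does not know {spell}; the attempt fails.)"
    # The binary allowed/denied decision is made in code; the model only narrates it.
    return complete([{"role": "user", "content": rule + " " + player_input}])
```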
Ali El Rhermoul: Yeah, totally. I was just talking to a colleague who's brilliant and has been doing stuff with shaders and graphics engineering for some time, called Jason Booth, and he's got a really popular asset on the Asset Store. We were talking about how you might generate terrains and environments. And procedural systems do a really good job at that already. If you want to just generate a realistic-looking desert or mountain or whatever, there are some really good procedural systems that will do that super fast and will generate heightmaps for you with a lot of detail and so forth and so on. But where these things start to struggle is when you start to layer story into a world. And what I mean by that is, it could be: does a house feel lived in or not? What are some details that might suggest something that happened here? You can do some amount of that stuff procedurally, but generative AI can do it a whole lot better. The problem is that a lot of the LLMs today don't really have support for spatial awareness. And so you can't really do things like, hey, generate me a heightmap with a mountain in the middle and a road snaking alongside the mountain. And I tried, by the way, we tried. We were working late at night, going back and forth chatting about this, and then I was like, well, let's just try it. So I asked Claude to generate me a two-dimensional array with a heightmap, and it gave me a really low-res heightmap. It was able to do it. And I said, well, maybe I can upscale that; maybe there's something we can do here to combine procedural with generative AI. So there is promise there. And this goes back to what I said earlier: there are certain things that are not quite there yet, but you can really see how we're really not that far off. And spatial awareness could be a massive boon. It's the difference between having a skybox that surrounds you, such as Blockade Labs provides, and having a full 3D environment generated for you, with trees and woods and all kinds of detail that make you feel totally immersed, and having a 3D playable character that can actually walk around this world and pick up items from this world and add to their inventory. That's sort of the next step, and multimodal AI will really sort of pave the way there.
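As a sketch of the experiment Ali mentions, here is one way to take a low-resolution 2D array (like the one Claude returned) and upscale it with simple bilinear-style interpolation before handing it to a procedural terrain system; the sample array is made up.

```python
# Sketch: upscale a low-res heightmap (e.g. one returned by an LLM) via interpolation.
import numpy as np

low_res = np.array([
    [0, 1, 2, 1],
    [1, 3, 5, 2],
    [1, 4, 6, 3],
    [0, 2, 3, 1],
], dtype=float)  # made-up 4x4 grid with a "mountain" toward the middle

def upscale(heightmap: np.ndarray, factor: int = 8) -> np.ndarray:
    """Bilinear-style upscaling: interpolate along rows, then along columns."""
    h, w = heightmap.shape
    xs = np.linspace(0, w - 1, w * factor)
    ys = np.linspace(0, h - 1, h * factor)
    rows = np.array([np.interp(xs, np.arange(w), row) for row in heightmap])
    cols = np.array([np.interp(ys, np.arange(h), rows[:, i]) for i in range(rows.shape[1])])
    return cols.T

terrain = upscale(low_res)  # 32x32 heightmap, ready to feed a terrain generator
print(terrain.shape)
```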
Jon Radoff: The piece that I really want to double down on that you were just describing is the idea of storytelling, but even broader than that, the idea of narrative. And sometimes people think narrative means character dialogue or plot points. But in a game, narrative is a lot more than that. It's everything; it's kind of in the bones of the whole world. And because games are typically, not always, but typically a very visual medium, so much of the storytelling is conveyed by what things look like, the way environments are built. A lot of the story, ideally, is told through your observations of the area. Some of my favorite games that do a wonderful job at that are The Last of Us series. I think they do a masterful job of showing you, when you explore an area, what has happened there, what the history was, and even what you could expect. I remember being in a room with a bunch of tomato plants. OK, well, obviously someone's living there; tomato plants don't just spring up in a post-apocalyptic world out of nothing. And guess what? You get attacked by somebody who's living there. So it's a good way to build the narrative from observations. They didn't need that; it could have still probably been good in that environment without those hints. But that piece, it seems like even with true multimodality, understanding visual systems, you could definitely get things out of a language model like: here's the history of this room, give me a plausible room description that matches that history, and then use that room description with a good multimodal scene generator. That, to me, is a little bit where it currently breaks down. If you prompt most of the image systems today, you'll get something roughly along the lines of what you thought. And they increasingly look great; you get a result that looks terrific whatever you give it. But it often doesn't capture the finer details of what you really wanted. And that, I think, is because the language systems being used to drive the various image models out there don't really have the depth of understanding that would come from a multimodal language model to bring that richness to it. But once we have that, it's going to be really, really interesting for game developers. You mentioned Jason Booth's work. We'll give him a quick shout-out for the work he did on Star Trek Timelines, where he actually built procedural planet-building systems, and we'll link to a talk he gave at Unite years ago that drills into that. Procedural systems can be really capable at creating plausible worlds. I know of another startup called Lovelace Studios; they're also using a combination of generative AI for certain pieces and procedural systems for world building, and they produce some really interesting results. In fact, I spoke to their CEO, Kayla, and I'll include a link to that for anyone who wants to see it. But getting really detailed in storytelling: I'm imagining whole, long-running Dungeons and Dragons campaigns where you're going to want not just the story callbacks in the text, but the visual callbacks, where the visual structure of the world relates back to events that were five sessions, five hours of gameplay ago. That's part of the magic of something like Dungeons and Dragons. I'm excited to see more of that happening.
Ali El Rhermoul: Totally. And in many ways, these worlds are reflections of the characters. So imagine walking about, and there's an item or a character or a creature that's your nemesis. Or you've got this special item, you've got your father's sword, and you find it somewhere. All of those kinds of details, the text-based models are actually quite good at. But being able to take that and spill it into the world that you're exploring takes it from being a really cool experience to being transcendent, really magical: those euphoric moments that we all know from Dungeons and Dragons, where you're like, oh man, and you keep telling your story for years to come about this campaign and what happened. That's what we need to bring into the game ecosystem. And I think games today do a really good job of that by creating linear narratives, but open-world games are actually notoriously not that great at it. The world can be huge, but it can feel sterile, because it doesn't have those customizations based off of your actions. Some games recently have gone a really long way toward improving that. But you can imagine how generative AI could do the next version of The Witcher, or one of those games that defined RPGs for a decade. Generative AI promises to provide something of that level that's as mind-blowing, if not more so.
Jon Radoff: I predict we're going to see that sooner than most people suspect. I mean, a year ago I was saying fully generative worlds within a decade, and now that seems way too conservative. We're going to have it much sooner than that. You can already see it in what we built at the hackathon: probably in a few days, a few more weeks definitely, of work on that, you'd start building a pretty interesting persistent virtual world with a lot of multiplayer functionality. Maybe not the advanced Witcher version you're talking about, with multimodality and very expressive three-dimensional graphics; that's still very hard to do, especially generatively. But if we compare where we were with this ten years ago, with, say, very basic procedural systems, to where we were a year ago, when generative AI still looked like these very surrealistic kind of weird things, and now things actually look good, they're meaningful, you start to imagine: OK, fast forward another year, or, geez, ten years into the future, what is that trajectory going to look like? I think it's going to be really amazing. And this is going to be a real boom time for individual developers, the people who, like you, started with UGC and modding and just want to express themselves, through to very small teams. I think teams of three, four, maybe up to ten people are going to build really incredible games over the coming years.
Ali El Rhermoul: And it's really fun to play with today, even with the blooper reel of hilarious things the LLMs do. Like, in one of the campaigns I was playing, I was playing a villain. And I wanted to poison an innkeeper who gave me, like, a nasty look when I approached. So I snuck into his bedroom and was looking for some item, and I found that he was in league with some enemies of mine in this game. And so then I was like, Claude, I want to go poison the innkeeper. And Claude was like, well, I can't really do that. And I responded by being like, Claude, this is a fictional world. And then Claude was like, OK, you can poison him. So there were just a lot of things like that that are really funny. And honestly, although they're not quite kosher behaviors, they still make it really hilarious. And I think today people can make some really fun experiences that play to the strengths of where this stuff is today, rather than complain about the weaknesses, because there are plenty of weaknesses, there's no doubt about that. But there are also some really, really fun strengths. And I think that's what we were trying to do with the hackathon, and it seems like that resonated.
Jon Radoff: Yeah, one of the things that Hilary Mason said, who I spoke to in the very first episode of this season on generative AI, is that the hallucinatory behaviors of language models, when you get outside of the world of science and medicine, where you actually need rigorous answers, and you're dealing with fantasy and storytelling, that's a feature, not a bug. The hallucination is this incredibly powerful thing, because you want it to be a little bit crazy and riff on things and see where it can go. So, Ali, this has been awesome. I had a lot of fun with you over the weekend working on the hackathon out in LA. I hope we get to do more hackathons. I hope we get to add more to the system we already created. And we'll share it as soon as we can; hopefully it'll already be shared by the time you view this, but if not, it's coming soon. I'd just love to see what the community does with it. Take our stuff and run with it and build your own as well. Don't feel constrained by the chatbot kind of behaviors that you're used to in a language model. Try to compose things together from the pieces, get them using data, and have a supervisory layer. For us, the supervisory layer was really using Unity as the game engine and Beamable as the server, which coordinated and orchestrated all the behaviors of these language systems and the other generative AI models. That, I think, is where a lot of the magic comes out of this. So again, thanks, Ali. This was fun to do with you.
Ali El Rhermoul: Yeah, my pleasure. And look, this stuff is getting cheaper and more accessible by the day. I mean, at the time of this recording, literally yesterday, OpenAI boosted GPT-3.5 from the 4,096-token limit to a 16,000-token limit, and slashed the price, right? So this stuff is rapidly entering the domain of being economical for building real games. And so, yeah, have fun with it. It's an exciting time.
Jon Radoff: Excellent. Well, thanks everybody for listening to this special episode of Building the Metaverse. And I hope that you soon build your own Metaverse with some of the technologies that we just talked about.
Ali El Rhermoul: Awesome. Yeah, thanks, Jon.