AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs

AI Explained
14 Mar 2024 · 19:21

Summary

TLDR: The video discusses recent AI developments, highlighting three systems: Devin, an AI software engineer system; Google DeepMind's SIMA, an agent that plays video games; and a humanoid robot powered by GPT-4 Vision. These systems demonstrate AI's growing ability to perform complex tasks, but they are still far from matching human performance. The script also touches on the upgrades these systems could receive when more advanced models like GPT-5 are released, and the implications for the job market and society as AI capabilities continue to evolve.

Takeaways

  • 🤖 Devin is an AI system likely based on GPT-4, equipped with a code editor, shell, and browser, designed to understand prompts and execute plans far more effectively than AutoGPT did.
  • 📈 Devin made significant progress on a software engineering benchmark, achieving an almost 14% success rate versus 1.7% for GPT-4, and could improve rapidly as the underlying models advance.
  • 🎮 Google DeepMind's SIMA project focuses on creating an instructable agent capable of performing tasks in simulated 3D environments, with potential applications beyond gaming.
  • 🕹ī¸ SEMA's performance in games shows positive transfer effects, outperforming specialized agents trained for single games, indicating a move towards more generalized AI capabilities.
  • 🤖 A humanoid robot with GPT-4 Vision demonstrates impressive real-time speed and dexterity, suggesting that future upgrades to GPT-5 could significantly enhance its understanding and interaction with the environment.
  • 🚀 The potential applications of AI systems like Devin and SIMA extend to various industries, including software engineering, gaming, and robotics, with the possibility of transforming job landscapes and labor markets.
  • 🌐 The rapid development of AI models suggests that we are moving closer to AGI (Artificial General Intelligence), with predictions of significant advancements in the next few years.
  • 💡 The cost and accessibility of AI systems like the humanoid robot are decreasing, which could lead to widespread adoption and automation of manual labor, though the timeline and societal impact remain uncertain.
  • 📊 The performance of AI models in real-world tasks, such as software engineering challenges and video games, is improving, indicating a shift from theoretical capabilities to practical applications.
  • 🔄 The transferability of skills across different tasks and environments in AI systems highlights the potential for AI to adapt and excel in a variety of scenarios, not limited to their initial training domains.
  • 🌍 The global impact of AI advancements is being recognized, with discussions on the future of jobs, economies, and the need for public awareness and preparation for the changes ahead.

Q & A

  • What is the significance of the developments in AI in the last 48 hours?

    -The developments show that AI models are advancing towards performing complex tasks beyond just processing language, indicating a shift towards AI that can 'walk the walk' and not just 'talk the talk'.

  • What does the AI system Devin do?

    -Devin is an AI system equipped with a code editor, shell, and browser, designed to understand prompts, look up documentation, and execute plans, significantly improving on AutoGPT's capabilities in software engineering tasks.

  • How did Devin perform on the software engineering benchmark?

    -Devin achieved an almost 14% success rate on the software engineering benchmark, outperforming Claude 2 and GPT-4 (which scored 1.7%). However, it was tested only on a subset of the benchmark, and those tasks cover only a small part of overall software engineering skill.

  • What is Google DeepMind's SIMA and its purpose?

    -SIMA is an AI agent developed by Google DeepMind that is trained to accomplish tasks in simulated 3D environments using a mouse and keyboard. Its goal is to become an instructable agent capable of doing anything a human can do within such environments.

  • How does SIMA perform on video games?

    -SIMA demonstrates positive transfer across different video games, outperforming environment-specialized agents and showing potential to generalize its skills, even approaching human-level performance in some cases.

  • What is the humanoid robot with GPT-4 Vision capable of?

    -The humanoid robot with GPT-4 Vision can recognize objects and move them appropriately in real-time, using an end-to-end neural network without human control. It shows potential for upgrading to future models like GPT-5 for deeper environmental understanding.

  • What concerns do people have about AI systems like Devin?

    -People are concerned about the implications for jobs, as AI systems like Devin could potentially automate tasks currently performed by humans, leading to an unpredictable job landscape and potential unemployment.

  • What is the potential future impact of AI systems on the job market?

    -The future impact of AI systems on the job market is uncertain, but it could lead to the automation of manual labor, making some jobs obsolete. However, there is also optimism for a human economy where AI assists in tasks, and new roles may emerge.

  • How do the developments in AI relate to the concept of Artificial General Intelligence (AGI)?

    -The advancements in AI models like Devin, SIMA, and humanoid robots with GPT-4 Vision bring us closer to AGI, as they demonstrate the ability to perform a wide range of tasks, understand complex environments, and learn from experience across different domains.

  • What is the timeline for the potential arrival of AGI?

    -While there is no definitive timeline, some experts predict that AGI could be achieved within the next 5 years, based on the rapid increase in compute power and improvements in AI capabilities.

  • How might the advancements in AI affect society in the long term?

    -The long-term societal impact of AI advancements could be significant, potentially transforming job markets, creating new industries, and changing the way humans interact with technology. It could also lead to ethical considerations and the need for regulatory frameworks to manage the use of AI.

Outlines

00:00

🤖 Advancements in AI: From Hype to Reality

This paragraph discusses recent developments in AI, highlighting three AI systems: Devin, Google DeepMind's SIMA, and a humanoid robot. It questions whether these advancements meet the hype and analyzes the associated papers and posts. Devin, an AI system with a code editor, shell, and browser, is designed to understand prompts, read documentation, and execute plans. The paragraph also delves into the software engineering benchmark, where Devin outperforms models like Claude 2 and GPT-4. However, it notes that the benchmark may not fully represent the complexity of software engineering tasks and that Devin's performance was measured on only a subset of them.

05:01

🎮 SIMA: The Future of Gaming and Beyond

The second paragraph focuses on Google DeepMind's SIMA, an AI designed to play video games and perform tasks in simulated 3D environments. It discusses the potential for SIMA to be instructed through natural language and the implications of its ability to generalize across different games. The paper suggests that training on a variety of games leads to positive transfer, allowing the AI to perform better on new games than specialized agents. The paragraph also touches on potential applications of SIMA's technology beyond gaming, such as video editing and phone apps, and the possibility of undetectable AI interactions on the internet.

10:02

🤖🌐 Humanoid Robots and the Future of Labor

This paragraph discusses a humanoid robot that uses GPT-4 Vision to recognize objects and perform tasks like doing the dishes. It highlights the robot's impressive speed and dexterity but emphasizes that the underlying intelligence comes from the GPT-4 Vision model. The CEO of the company behind the robot envisions a future where manual labor is automated, and the cost of labor decreases to the point of renting a robot. The discussion extends to the potential for robots to build new worlds on other planets, but also raises concerns about the control and ethical implications of such advanced AI technology.

15:03

🚀 Accelerating Towards AGI: Implications and Concerns

The final paragraph reflects on the rapid progress towards Artificial General Intelligence (AGI) and the lack of control over the technology's development. It mentions predictions from industry experts like Jeff Clune and Jensen Huang about the timeline for AGI and its potential impact on jobs and society. The paragraph also discusses the exponential increase in compute power and the potential for AI to revolutionize marketing and other industries. It concludes with a call for the public to pay attention to the fast-paced changes in AI and the need for broader discussions on its implications.

Keywords

💡AI models

AI models refer to the various algorithms and systems designed to perform tasks that typically require human intelligence, such as understanding language, recognizing images, and making decisions. In the context of the video, AI models are evolving to not only simulate human-like communication (talk the talk) but also execute complex tasks (walk the walk), indicating a significant advancement in their capabilities.

💡Devin

Devin is an AI system likely based on GPT-4, equipped with a code editor, shell, and browser, enabling it to understand prompts, look up documentation, and execute plans. It is designed to excel at software engineering tasks, such as reading through code, identifying and fixing bugs, and refining models autonomously. The video highlights Devin's performance on a software engineering benchmark, where it significantly outperformed models like GPT-4 and Claude 2.

💡Benchmark

A benchmark is a standard or point of reference against which things may be compared, in this case, the performance of AI models. In the video, the software engineering benchmark is a set of real-world professional problems and their solutions used to evaluate the capabilities of AI systems like Devin. The benchmark helps to contextualize the progress of AI in solving complex, real-world problems and serves as a measure of how close AI is to achieving human-like performance in specific domains.
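To make this concrete, here is a minimal sketch of how a SWE-bench-style score could be computed: each task is a real repository issue, and a candidate patch counts as resolved only if the repository's tests pass after the patch is applied. The `Task` records and the sample results below are hypothetical placeholders, not the benchmark's actual harness or data.

```python
# Illustrative sketch of a benchmark score: the resolved rate is simply
# the fraction of tasks whose candidate patch makes the tests pass.
from dataclasses import dataclass

@dataclass
class Task:
    issue_id: str      # identifier of the real repository issue
    patch: str         # candidate fix proposed by the model
    tests_pass: bool   # outcome of running the repo's test suite

def resolved_rate(tasks):
    """Fraction of tasks whose candidate patch makes the tests pass."""
    if not tasks:
        return 0.0
    return sum(t.tests_pass for t in tasks) / len(tasks)

# Hypothetical evaluation results on seven tasks.
results = [
    Task("django-101", "...", True),
    Task("sympy-202", "...", False),
    Task("flask-303", "...", False),
    Task("numpy-404", "...", True),
    Task("scipy-505", "...", False),
    Task("pandas-606", "...", False),
    Task("requests-707", "...", False),
]
print(f"resolved: {resolved_rate(results):.1%}")  # 2/7 → 28.6%
```

The headline numbers in the video (14% vs 1.7%) are this kind of ratio, computed over the benchmark's real tasks.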

💡GPT-4

GPT-4 is a version of the Generative Pre-trained Transformer model developed by OpenAI. It is a language model capable of understanding and generating human-like text based on the input it receives. In the context of the video, GPT-4 is the underlying model that powers systems like Devin, SIMA, and the humanoid robot, providing them with the ability to understand and process natural language, which is crucial for their task execution capabilities.

💡SIMA

SIMA (a scalable, instructable, multi-world agent) is an AI system developed by Google DeepMind, designed to accomplish tasks in any simulated 3D environment. It takes pixels as input, acts through a virtual mouse and keyboard, and is trained on a variety of games to learn how to perform tasks. The goal of SIMA is an agent that can adapt to new environments and perform a wide range of tasks, demonstrating positive transfer, where learning on one task improves performance on another.
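The interface the video describes (pixels in, keyboard/mouse actions out, steered by a language instruction) can be sketched as a simple perception-action loop. The environment and the keyword-matching policy below are toy stand-ins invented for illustration; SIMA's real policy is a trained neural network acting in commercial 3D games.

```python
# Toy perception-action loop: the agent observes "pixels", receives a
# natural-language instruction, and emits keyboard/mouse actions.

class ToyEnv:
    """Stand-in for a game: tracks a position and returns fake pixels."""
    def __init__(self):
        self.position = 0

    def observe(self):
        return [[0] * 4 for _ in range(4)]  # placeholder pixel grid

    def step(self, action):
        if action == ("key", "w"):          # 'w' moves the avatar forward
            self.position += 1

def toy_policy(pixels, instruction):
    """Hypothetical rule-based policy mapping an instruction to an action."""
    if "forward" in instruction:
        return ("key", "w")
    return ("mouse", 0, 0)                  # default: idle mouse movement

env = ToyEnv()
for _ in range(10):                         # roughly a short, 10-second task
    action = toy_policy(env.observe(), "go forward")
    env.step(action)
print(env.position)  # → 10
```

Because the action space is just keyboard and mouse, the same loop could in principle drive any on-screen software, which is why the video argues the approach reaches well beyond games.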

💡Humanoid robot

A humanoid robot is a robot with a form similar to a human, often designed to mimic human movements and perform tasks in a human-like manner. In the video, the humanoid robot is powered by GPT-4 Vision, which allows it to recognize objects and perform tasks like moving items on a table. The robot's intelligence and ability to interact with its environment are dependent on the underlying AI model, which could be upgraded to future versions like GPT-5 for enhanced capabilities.

💡Transfer learning

Transfer learning is a machine learning technique where a model developed for a specific task is reused as the starting point for a model on a second task. It is a method where the knowledge gained from solving one problem is applied to a different but related problem. In the context of the video, SIMA benefits from transfer learning by improving its performance on new games based on the skills it learned from playing other games.
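A minimal pure-Python sketch of the idea, using an invented pair of linear tasks (nothing from the SIMA paper): a model pretrained on y = 2x reaches a lower loss on the related task y = 2x + 1 after five fine-tuning steps than a model trained from scratch for the same five steps.

```python
# Transfer learning in miniature: pretrain on a source task, then reuse
# the learned weights as the starting point for a related target task.

XS = [-2.0, -1.0, 0.0, 1.0, 2.0]  # shared inputs for both tasks

def task_a(x):
    return 2 * x          # source task

def task_b(x):
    return 2 * x + 1      # related target task

def loss(w, b, target):
    """Mean squared error of the linear model y = w*x + b."""
    return sum((w * x + b - target(x)) ** 2 for x in XS) / len(XS)

def train(w, b, target, steps, lr=0.05):
    """Plain full-batch gradient descent on the MSE loss."""
    for _ in range(steps):
        dw = sum(2 * (w * x + b - target(x)) * x for x in XS) / len(XS)
        db = sum(2 * (w * x + b - target(x)) for x in XS) / len(XS)
        w, b = w - lr * dw, b - lr * db
    return w, b

w0, b0 = train(0.0, 0.0, task_a, steps=200)   # pretrain on the source task
wt, bt = train(w0, b0, task_b, steps=5)       # brief fine-tune on the target
ws, bs = train(0.0, 0.0, task_b, steps=5)     # same budget, from scratch

print(f"fine-tuned loss:   {loss(wt, bt, task_b):.3f}")  # ≈ 0.349
print(f"from-scratch loss: {loss(ws, bs, task_b):.3f}")  # ≈ 1.208
```

The pretrained model only needs to relearn the offset, so it converges with a much smaller budget, which is the same shape of advantage the video attributes to SIMA's cross-game training.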

💡AGI (Artificial General Intelligence)

Artificial General Intelligence (AGI) refers to the hypothetical intelligence of a machine that has the ability to understand, learn, and apply knowledge across a wide range of tasks, just as a human being can. It is a form of AI that is capable of performing any intellectual task that a human being can do. The video suggests that recent advancements in AI models are bringing us closer to achieving AGI, as they are increasingly able to perform complex tasks and exhibit human-like understanding.

💡Job automation

Job automation refers to the process of replacing human workers with technology, such as AI systems or robots, to perform tasks more efficiently. The video discusses the potential of AI systems like Devin and humanoid robots to automate various jobs, leading to concerns about the future job landscape and the impact on employment. It also touches on the idea that AI might create new types of jobs while eliminating others.

💡AI ethics and control

AI ethics and control refer to the moral and social implications of AI development, including issues like job displacement, privacy concerns, and the potential misuse of AI technology. The video acknowledges the unpredictability of the job landscape due to AI advancements and the need for companies to address public fears about automation. It also discusses the potential lack of control over how AI technology is used, especially in areas like military applications.

Highlights

AI models are advancing to a point where they can perform tasks, not just provide information.

Three AI developments in the last 48 hours show significant progress in AI capabilities.

Devin, an AI system, is equipped with a code editor, shell, and browser, allowing it to understand prompts and execute tasks.

Devin's performance on the software engineering benchmark was significantly higher than that of other models like Claude 2 and GPT-4.

The benchmark used real-world professional problems, requiring complex reasoning and understanding across multiple functions and files.

Devin was tested on a subset of the benchmark, and its tasks represent only a small part of the skills of software engineering.

The selection of pull requests for the benchmark might bias the data set towards problems that are easier to detect, report, and fix.
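That selection process can be sketched as a simple filter. The field names below are hypothetical; the point is that each criterion (merged, resolves an issue, adds tests) removes candidates, plausibly skewing the surviving tasks toward easier, easily-testable bugs.

```python
# Sketch of a pull-request selection filter: only merged PRs that resolve
# a linked issue and introduce new tests survive into the benchmark.

def keep(pr):
    return pr["merged"] and pr["resolves_issue"] and pr["adds_tests"]

# Hypothetical pull requests illustrating how each criterion filters.
prs = [
    {"id": 1, "merged": True,  "resolves_issue": True,  "adds_tests": True},
    {"id": 2, "merged": True,  "resolves_issue": True,  "adds_tests": False},
    {"id": 3, "merged": False, "resolves_issue": True,  "adds_tests": True},
]
selected = [pr["id"] for pr in prs if keep(pr)]
print(selected)  # → [1]
```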

Vision language models are expected to improve with more multimodal capabilities and larger context windows.

SIMA, a scalable, instructable, multi-world agent by Google DeepMind, can perform tasks in simulated 3D environments.

SIMA's training across multiple games showed positive transfer effects, allowing it to perform better on new games than specialized agents.

The humanoid robot with GPT-4 Vision demonstrates impressive real-time speed and dexterity, but its intelligence comes from the underlying model.

The humanoid robot's cost is estimated between $30,000 and $150,000, which is still too high for most companies and individuals.

The CEO of Figure envisions a future where AI completely automates manual labor, eliminating the need for unsafe and undesirable jobs.

There are concerns about the implications of AI models like Devin for the job landscape and the need for companies to address these fears.

As AI models improve, they are expected to take over tasks that are currently done by humans, including in software engineering and gaming.

The rapid advancement of AI models suggests that we are moving closer to AGI (Artificial General Intelligence).

The potential applications of AI models like SIMA and humanoid robots extend beyond their current tasks, indicating a future where AI can perform a wide range of activities.

The development and application of AI models are accelerating, with significant improvements expected with the release of GPT-5.

The future of AI integration in various industries, including software engineering, gaming, and robotics, is uncertain but holds the potential for significant changes.

Transcripts

00:00

three developments in the last 48 hours

00:02

show how we are moving into an era in

00:05

which AI models can walk the walk not

00:08

just talk the talk whether the

00:10

developments quite meet the hype

00:12

attached to them is another question

00:14

I've read and analyzed in full the three

00:16

relevant papers and Associated posts to

00:19

find out more we'll first explore Devin

00:21

the AI system your boss told you not to

00:23

worry about then Google DeepMind's SIMA

00:26

which spends most of its time playing

00:28

video games and then Figure 01 the

00:30

humanoid robot which likes to talk while

00:32

doing the dishes but the tldw is this

00:36

these three systems are each a long way

00:39

from Human Performance in their domains

00:41

but think of them more as containers or

00:44

shells for the vision language models

00:46

powering them so when the GPT-4 that's

00:49

behind most of them is swapped out for

00:52

GPT-5 or Gemini 2 all these systems are

00:55

going to see big and hard to predict

00:58

upgrades overnight and that's a point

01:00

that seems especially relevant on this

01:02

the one-year anniversary of the release

01:05

of GPT-4 but let's start of course with

01:08

Devin billed as the first AI software

01:11

engineer now Devin isn't a model it's a

01:14

system that's likely based on GPT-4 it's

01:18

equipped with a code editor shell and

01:20

browser so of course it cannot just

01:23

understand your prompt but look up and

01:25

read documentation a bit like AutoGPT

01:28

it's designed to come up with plans

01:31

first and then execute them but it does

01:33

so much better than AutoGPT did but

01:36

before we get to The Benchmark that

01:37

everyone's talking about let me show you

01:39

a 30-second demonstration of Devin in

01:42

action all I had to do was send this

01:43

blog post in a message to Devin from

01:46

there Devin actually does all the work

01:48

for me starting with reading this blog

01:49

post and figuring out how to run the

01:53

code in a couple minutes Devin's

01:55

actually made a lot of progress and if

01:58

we jump to the middle here

02:00

you can see that Devin's been able to

02:02

find and fix some edge cases and bugs

02:05

that the blog post did not cover for me

02:07

and if we jump to the end we can see

02:10

that Devin uh sends me the final result

02:12

which I love I also got two bonus images

02:16

uh here and here so uh let me know if

02:20

you guys see anything hidden in these it

02:22

can also fine-tune a model autonomously and

02:25

if you're not familiar think of that as

02:27

refining a model rather than training it

02:29

from scratch that makes me wonder about

02:31

a future where if a model can't succeed

02:34

at a task it fine-tunes another model or

02:37

itself until it can anyway this is The

02:40

Benchmark that everyone's talking

02:41

about SWE-bench the software engineering benchmark

02:44

Devin got almost 14% and in this chart

02:48

crushes Claude 2 and GPT-4 which got

02:50

1.7% they say Devin was unassisted

02:54

whereas all other models were assisted

02:56

meaning the model was told exactly which

02:58

files need to be edited before we

03:00

get too much further though what the

03:01

hell is this Benchmark well unlike many

03:03

benchmarks they drew from Real World

03:06

professional problems

03:09

2,294 software engineering problems that

03:11

people had and their corresponding

03:13

Solutions resolving these issues

03:15

requires understanding and coordinating

03:17

changes across multiple functions

03:19

classes and files simultaneously the

03:22

code involved might require the model to

03:24

process extremely long contexts and

03:27

perform they say complex reasoning these

03:29

aren't just fill-in the blank or

03:31

multiple choice questions the model has

03:33

to understand the issue read through the

03:35

relevant parts of the codebase remove

03:38

lines and add lines fixing a bug might

03:40

involve navigating a large repo

03:43

understanding the interplay between

03:44

functions in different files or spotting

03:46

a small error in convoluted code on

03:49

average a model might need to edit

03:50

almost two files three functions and

03:53

about 33 lines of code one point to make

03:56

clear is that Devin was only tested on a

03:58

subset of this Benchmark and the tasks

04:00

in The Benchmark were only a tiny subset

04:03

of GitHub issues and even all of those

04:05

issues represent just a subset of the

04:07

skills of software engineering so when

04:10

you see all caps videos saying this is

04:12

Agi you've got to put it in some context

04:14

here's just one example of what I mean

04:16

they selected only pull requests which

04:18

are like proposed solutions that are

04:21

merged or accepted that solve the issue

04:24

and that introduced new tests would that

04:26

not slightly bias the data set toward

04:28

problems that are easier to detect

04:30

report and fix in other words complex

04:32

issues might not be adequately

04:34

represented if they're less likely to

04:36

have straightforward Solutions and

04:38

narrowing down the proposed solutions to

04:40

only those that introduce new tests

04:42

could bias towards bugs or features that

04:44

are easier to write tests for that is to

04:47

say that highly complex issues where

04:49

writing a clear test is difficult may be

04:52

underrepresented now having said all of

04:54

that I might shock You by saying I think

04:56

that there will be rapid Improvement in

04:58

the performance on this Benchmark when

05:00

Devin is equipped with GPT-5 I could see

05:03

it easily exceeding 50% here are just a

05:06

few reasons why first some of these

05:08

problems contained images and therefore

05:10

the more multimodal these language

05:12

models get the better they'll get second

05:14

and more importantly a large context

05:16

window is particularly crucial for this

05:18

task when The Benchmark came out they

05:20

said models are simply ineffective at

05:22

localizing problematic code in a sea of

05:25

tokens they get distracted by additional

05:27

context I don't think that will be true

05:29

for much longer as we've already

05:31

seen with Gemini 1.5 third reason models

05:34

they say are often trained using

05:35

standard code files and likely rarely

05:38

see patch files I would bet that GPT-5

05:41

would have seen everything fourth

05:42

language models will be augmented they

05:44

predict with program analysis and

05:46

software engineering tools and it's

05:48

almost like they could see 6 months in

05:50

the future because they said to this end

05:52

we are particularly excited about

05:53

agent-based approaches like Devin for

05:56

identifying relevant context from a code

05:58

base I could go on but hopefully that

06:00

background on the Benchmark allows you

06:02

to put the rest of what I'm going to say

06:03

in a bit more context and yes of course

06:05

I saw how Devin was able to complete a

06:08

real job on upwork honestly I could see

06:10

these kind of tasks going the way of

06:12

copywriting tasks on upwork here's some

06:15

more context though we don't know the

06:16

actual cost of running Devin for so long

06:18

it actually takes quite a while for it

06:20

to execute on its task we're talking 15

06:23

20 30 minutes even 60 minutes sometimes

06:25

as Bindu Reddy points out it can get

06:27

even more expensive than a human

06:29

although costs are of course falling

06:31

Devin she says will not be replacing any

06:33

software engineer in the near term and

06:35

noted deep learning author François Chollet

06:37

predicted this there will be more

06:39

software Engineers the kind that write

06:40

code in 5 years than there are today and

06:43

newly unemployed Andrej Karpathy says that

06:46

software engineering is on track to

06:47

change substantially with humans more

06:50

supervising the automation pitching in

06:52

high level commands ideas or progression

06:54

strategies in English I would say with

06:56

the way things are going they could

06:58

pitch it in any language and the model

07:00

will understand frankly with vision

07:02

models the way they are you could

07:03

practically mime your code idea and it

07:05

would understand what to do and while

07:07

Devin likely relies on GPT-4 other

07:10

competitors are training their own

07:12

Frontier Scale Models indeed the startup

07:15

magic which aims to build a co-worker

07:18

not just a co-pilot for developers is

07:20

going a step further they're not even

07:22

using Transformers they say Transformers

07:24

aren't the final architecture we have

07:25

something with a multi-million token

07:27

context window super curious of course

07:29

of course how that performs on SWE-bench

07:32

but the thing I want to emphasize again

07:34

comes from Bloomberg Cognition AI admit

07:36

that Devin is very dependent on the

07:38

underlying models and use GPT-4 together

07:41

with reinforcement learning techniques

07:43

obviously that's pretty vague but

07:45

imagine when GPT-5 comes out with scale

07:47

you get so many things not just better

07:49

coding ability if you remember GPT-3

07:51

couldn't actually reflect effectively

07:53

whereas GPT-4 could if GPT-5 is twice or

07:56

10 times better at reflecting and

07:59

debugging that is going to dramatically

08:01

change the performance of the Devin

08:02

system overnight just delete the GPT-4

08:05

API and put in the GPT-5 API and wait

08:08

Jeff Clune who I was going to talk about

08:10

later in this video has just retweeted

08:13

one of my own videos I literally just

08:15

saw this 2 seconds ago when it came up

08:18

as a notification on my Twitter account

08:20

this was not at all supposed to be part

08:22

of this video but I am very much honored

08:24

by that and actually I'm going to be

08:25

talking about Jeff Clune later in this

08:27

video chances are he's going to see this

08:29

video so this is getting very

08:30

inception-like he was key to SIMA which

08:33

I'm going to talk about next the

08:34

simulation hypothesis just got 10% more

08:37

likely I'm going to recover from that

08:39

distraction and get back to this video

08:41

cuz there's one more thing to mention

08:43

about Devin the reaction to that model

08:45

has been unlike almost anything I've

08:47

seen people are genuinely in some

08:50

distress about the implications for jobs

08:52

and while I've given the context of what

08:54

the Benchmark does mean and doesn't mean

08:56

I can't deny that the job landscape is

08:59

incredibly unpredictable at the moment

09:01

indeed I can't see it ever not being

09:03

unpredictable I actually still have a

09:05

lot of optimism about there still being

09:07

a human economy in the future but maybe

09:09

that's a topic for another video I just

09:11

want to acknowledge that people are

09:13

scared and these companies should start

09:15

addressing those fears and I know many

09:17

of you are getting ready to comment that

09:19

we want all jobs to go but you might be

09:21

I guess disappointed by the fact that

09:24

cognition AI are asking for people to

09:27

apply to join them so obviously don't

09:29

anticipate Devin automating everything

09:31

just yet but it's time now to talk about

09:33

Google DeepMind's SIMA which is all about

09:36

scaling up agents that you can instruct

09:39

with natural language essentially a

09:41

scalable instructable commandable

09:44

multi-world agent the goal of SIMA being

09:46

to develop an instructible agent that

09:48

can accomplish anything a human can do

09:51

in any simulated 3D environment their

09:54

agent uses a mouse and keyboard and

09:57

takes pixels as input but if you think

09:59

about it that's almost everything you do

10:01

on a computer yes this paper is about

10:03

playing games but couldn't you apply

10:05

this technique to say video editing or

10:07

say anything you can do on your phone

10:09

now I know I haven't even told you what

10:10

the SIMA system is but I'm giving you an

10:12

idea of the kind of repercussions

10:14

implications if these systems work with

10:17

games there's so much else they might

10:18

soon work with this was a paper I didn't

10:20

get a chance to talk about that came out

10:22

about 6 weeks ago it showed that even

10:24

current generation models could handle

10:26

tasks on a phone like navigating on

10:28

Google Maps apps downloading apps on

10:30

Google Play or somewhat topically with

10:33

Tik Tok swiping a video about a pet cat

10:35

in Tik Tok and clicking a like for that

10:38

video no the success rates weren't

10:40

perfect but if you look at the averages

10:42

and this is for GPT-4 Vision they are

10:44

pretty high 91% 82% 82% these numbers in

10:47

the middle by the way on the left

10:49

reflect the number of steps that GPT-4

10:50

Vision took and on the right the number

10:52

of steps that a human took and that's

10:54

just GPT-4 Vision not a model optimized

10:57

for agency which we know that OpenAI is

11:00

working on so before we even get to

11:02

video games you can imagine an internet

11:04

where there are models that are

11:06

downloading liking commenting doing pull

11:09

requests and we wouldn't even know that

11:11

it's AI it would be as far as I can tell

11:13

undetectable anyway I'm getting

11:15

distracted back to the SIMA paper what

11:17

is SIMA in a nutshell they got a bunch

11:19

of games including commercial video

11:21

games like Valheim 12 million copies

11:23

sold at least and their own made-up games

11:26

that Google created they then paid a

11:28

bunch of humans to play those games and

11:31

gathered the data that's what you could

11:32

see on the screen the images and the

11:35

keyboard and mouse inputs that the

11:37

humans performed they gave all of that

11:39

training data to some pre-trained models

11:41

and at this point the paper gets quite

11:43

vague it doesn't mention parameters or

11:45

the exact composition of these

11:47

pre-trained models but from this we get

11:49

the SIMA agent which then plays these

11:51

games or more precisely tries 10-second

11:55

tasks within these games this gives you

11:57

an idea of the range of tasks everything

11:59

everything from taming and hunting to

12:01

destroying and headbutting but I don't

12:03

want to bury the lead the main takeaway

12:06

is this training on more games saw

12:08

positive transfer when SIMA played on a

12:11

new game and notice how SIMA in purple

12:14

across all of these games outperforms an

12:16

environment specialized agent that's one

12:19

trained for just one game and there is

12:21

another gem buried in this graph I'm

12:23

color blind but I'm pretty sure that's

12:25

teal or lighter blue that's zero shot

12:28

what that represents is when the model

12:30

was trained across all the other games

12:32

but the actual game it was about to be

12:34

tested in and so notice how in some

12:36

games like Goat Simulator 3 that

12:39

outperformed a model that was

12:41

specialized for just that one game the

12:44

transfer effect was so powerful it

12:46

outdid the specialized training indeed

12:49

SIMA's performance is approaching the

12:51

ballpark of human performance now I know

12:53

we've seen that already with StarCraft 2

12:55

and OpenAI beating Dota but this would

12:57

be a model generalizing to almost any

13:00

video game yes even Red Dead Redemption

13:02

2 which was covered in an entirely

13:04

separate paper out of Beijing that paper

13:06

they say was the first to enable

13:08

language models to follow the main story

13:11

line and finish real missions in complex

13:13

AAA games this time we're talking about

13:15

things like protecting a character

13:17

buying supplies equipping shotguns again

13:19

what was holding them back was the

13:21

underlying model GPT-4V as I've covered

13:23

Elsewhere on the channel it lacks in

13:26

spatial perception it's not super

13:27

accurate with moving the cursor for

13:29

example but visual understanding and

13:31

performance is getting better fast take

13:34

the challenging benchmark MMMU it's

13:37

about answering difficult questions that

13:39

have a visual component The Benchmark

13:41

only came out recently giving top

13:42

performance to GPT-4V at

13:45

56.8% but that's already been superseded

13:47

take Claude 3 Opus which gets

13:50

59.4% yes there is still a gap with

13:52

human expert performance but that Gap is

13:54

narrowing like we've seen across this

13:56

video just like Devon was solving real

13:58

world software engineering challenges

14:00

SIMA and other models are solving Real

14:03

World Games walking the walk not just

14:05

talking the talk and again we can expect

14:08

better and better results the more games

14:10

SIMA is trained on as the paper says in

14:13

every case SIMA significantly

14:14

outperforms the environment specialized

14:16

agent thus demonstrating positive

14:18

transfer across environments and this is

14:21

exactly what we see in robotics as well

14:23

the key take-home from that Google

14:25

DeepMind paper was that our results suggest

14:28

that co-training with data from other

14:30

platforms imbues RT-2-X in robotics with

14:34

additional skills that were not present

14:36

in the original data set enabling it to

14:37

perform novel tasks these were tasks and

14:40

skills developed by other robots that

14:42

were then transferred to RT-2 just like

14:45

SIMA getting better at one video game by

14:47

training on others but did you notice

14:50

there that smooth segue I did to

14:52

robotics It's the final container that I

14:55

want to quickly talk about why do I call

14:57

this humanoid robot a container because

15:00

it contains GPT-4 Vision yes of course

15:03

its real-time speed and dexterity are very

15:06

impressive but that intelligence of

15:08

recognizing what's on the table and

15:10

moving it appropriately comes from the

15:12

underlying model GPT-4 Vision so of course