Boston, MA
Tools for Your To Do List with Spot and Gemini Robotics | Boston Dynamics
For an industrial robot built for the rigors of factories and power plants, tidying up a living room may seem like a light day at the office for Spot. Yet a recent video of the robot picking up shoes and soda cans in a residential home represents the promise of AI models in robotics. In this case, Google’s vision-language model (VLM) Gemini Robotics-ER 1.5 empowered Spot with embodied reasoning.
This particular demo grew out of a 2025 hackathon at Boston Dynamics that built on prior projects using Large Language Models (LLMs) and Visual Foundation Models (VFMs) to enable Spot to contextualize its environment and engage in more complex autonomous actions than a typical Autowalk mission. Rather than write formal software logic or a “state machine” program that defines each step of a given task, we interacted with Gemini Robotics using conversational language. In turn, it communicated with Spot on our behalf.
A Robust SDK and Natural Language Prompts Save Time
Using Spot’s SDK, we developed a layer that facilitated interaction between Gemini Robotics and Spot’s application programming interface (API). The API normally gives developers access to the robot’s capabilities to create custom applications or behaviors. For example, researchers at Meta have used Spot to test how an AI system could locate and retrieve objects it had never seen before.
Our ability to engage Gemini Robotics using natural language prompts was a huge timesaver, compared to traditional programming. We told Gemini Robotics it had access to a mobile robot equipped with cameras and a robotic arm. It also had a finite set of tools it could use to control the robot. A tool is a lightweight script that performs some internal logic and translates inputs from Gemini Robotics to actual API calls. We limited the actions to navigating between locations, capturing images, identifying objects, grasping them, and placing them somewhere else.
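The tool layer described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the function names, the stubbed bodies, and the dispatch mechanism are all hypothetical stand-ins for the lightweight scripts that translate the model's tool calls into real Spot API calls.

```python
# Hypothetical sketch of the tool layer: each "tool" is a lightweight
# function that would, in the real system, issue Spot API calls.

def go_to(location: str) -> str:
    """Navigate the robot to a named location (stubbed here)."""
    # In practice this would call Spot's navigation services.
    return f"Arrived at {location}."

def take_picture(camera: str) -> str:
    """Capture an image from the specified camera (stubbed here)."""
    # In practice this would call Spot's image service.
    return f"Captured image from {camera} camera."

# The finite set of tools exposed to the model. The model can only act
# through this allowlist; it cannot invent new capabilities.
TOOLS = {
    "GoTo": go_to,
    "TakePicture": take_picture,
}

def dispatch(tool_name: str, **kwargs) -> str:
    """Translate a tool call requested by the model into a function call."""
    if tool_name not in TOOLS:
        return f"Unknown tool: {tool_name}"
    return TOOLS[tool_name](**kwargs)
```

The string each tool returns doubles as the feedback the model reads before deciding on its next action.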
The breadth of our SDK also means there are many existing examples developers could leverage to expose more of the API with minimal development effort.
Giving Gemini Robotics a Baseline
To start, we needed to explain to Gemini Robotics what we wanted it to do. We did experience a learning curve when writing these baseline prompts. Simple instructions like “put down an object” or “take a picture” weren’t detailed enough to produce the expected behavior. We had to add context to our descriptions as we refined each tool.
A good example is the detailed prompt for the “TakePicture” tool:
This command will cause the robot to take a picture with the specified camera. There is some nuance to choosing the correct camera. Once arriving at a location using GoTo, you should always start by taking a picture with the gripper camera, because it's the most informative.
If the robot has arrived at a location and is already holding an object, you can do one of two things:
1. Immediately call PutDown
2. Search the area with either of the front cameras. The front cameras are low to the ground, so if you're trying to put things on an elevated surface, they won't give you useful information.
In this example, we gave Gemini Robotics no detailed description of the robot’s chassis or arm. Instead, we simply explained that Spot’s front cameras would be too low to photograph objects on elevated surfaces. We were able to iterate rapidly, as small changes in wording produced noticeably better results. Once it had this set of basic tools through the API, Gemini Robotics could sequence Spot’s actions and follow the handwritten instructions on a whiteboard on the day of the demonstration.
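To make the idea concrete, a tool description like the one above is typically packaged as a declaration the model reads when choosing which tool to call. The schema below is an illustrative sketch following common function-calling conventions; the exact format used in the demo, and the camera names, are assumptions rather than details from the article.

```python
# Hypothetical tool declaration: the detailed natural-language description
# is the part the model reasons over when sequencing actions.
TAKE_PICTURE_DECLARATION = {
    "name": "TakePicture",
    "description": (
        "This command will cause the robot to take a picture with the "
        "specified camera. Once arriving at a location using GoTo, you "
        "should always start by taking a picture with the gripper camera, "
        "because it's the most informative. The front cameras are low to "
        "the ground, so they won't give you useful information about "
        "elevated surfaces."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "camera": {
                "type": "string",
                # Camera names are illustrative placeholders.
                "enum": ["gripper", "front_left", "front_right"],
                "description": "Which onboard camera to capture from.",
            }
        },
        "required": ["camera"],
    },
}
```

Because the behavior lives in the description string rather than in code, iterating meant rewording a sentence and rerunning, which is what made the rapid iteration described above possible.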
How Gemini Robotics and Spot Collaborate
Until the robot powers on, Gemini Robotics has no context for what specific tasks we might ask it to perform in a given demo. We only provided simple written instructions, such as, “Make sure all of the shoes at the front door are on the shoe rack.” Gemini Robotics evaluated images from Spot’s cameras and identified objects in the scene that matched the instructions. These objects became the reference points for Spot’s navigational and manipulation systems.
In many respects, Gemini Robotics behaved like an operator manually driving Spot with its tablet controller. For example, to pick up an object with Spot, an operator positions the robot near the object and then uses a grasp wizard to identify the target object. The operator provides high-level direction and Spot figures out the exact details. In this demonstration, Gemini Robotics functioned as both the operator and the tablet sending commands to the robot. This freed us up to act more like a team lead, providing a high-level to-do list and trusting Spot and Gemini Robotics to do the rest.
Call and Response
When Gemini Robotics engages a given tool, the tool responds with results and context, such as, “I picked up the object,” or “I can’t pick up something while my hand is full.” Gemini Robotics then makes adjustments on the fly based on this feedback from Spot. For example, to pick up shoes, Gemini Robotics requests an image, identifies the shoes in that image, and calls the “pickup” command. By creating fundamental tools that semantically flow in conversation, Gemini Robotics can manage the sequence of tasks required to clean up the room. Spot’s existing software stack manages the locomotion, navigation, and manipulation of the robot itself.
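The call-and-response pattern above can be sketched as a simple loop. Here the model is replaced by a scripted plan, and the tool bodies are stubs; the point is only to show how each tool's textual result (including errors like a full gripper) feeds back into the next decision.

```python
# Minimal sketch of the feedback loop, with hypothetical stubbed tools.

def pickup(state: dict, target: str) -> str:
    """Try to grasp an object; report failure if the gripper is occupied."""
    if state["holding"] is not None:
        return "I can't pick up something while my hand is full."
    state["holding"] = target
    return f"I picked up the {target}."

def put_down(state: dict) -> str:
    """Release whatever the gripper is holding."""
    if state["holding"] is None:
        return "I'm not holding anything."
    item, state["holding"] = state["holding"], None
    return f"I put down the {item}."

def run(plan, state):
    """Execute a sequence of tool calls, collecting feedback after each.
    In the real system, the model reads each result and adapts its plan."""
    feedback = []
    for tool, args in plan:
        feedback.append(tool(state, *args))
    return feedback

state = {"holding": None}
log = run([(pickup, ("shoe",)), (pickup, ("can",)), (put_down, ())], state)
# The second pickup fails with a readable message rather than crashing,
# giving the model the context it needs to put the shoe down first.
```

In the demo, Gemini Robotics itself generates the plan and revises it after each result; the scripted list here just makes the message flow visible.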
It’s important to note Gemini Robotics has strict boundaries in this scenario. It can’t invent new capabilities or control Spot beyond what is available through the API. This keeps Spot’s behavior predictable, while still allowing Gemini Robotics to adapt to different situations.
A Force Multiplier for Developers
For developers already working with Spot, this research has tremendous potential. Through Spot’s SDK, they have access to a robust toolkit of capabilities. Companies use these tools today to build applications for inspection, research, and industrial data analysis, among others.
An AI model like Gemini Robotics offers a way to expand those applications more rapidly. Rather than write extensive task logic on top of Spot’s APIs, developers can experiment with having AI systems interpret natural language instructions and dynamically choose to engage the robot. As a result, models like Gemini Robotics can act as force multipliers, amplifying the reliable toolkit and robust performance that is already delivering value for Boston Dynamics customers.
Our Next-Token Prediction for Spot and Gemini Robotics
Although this is still an experimental step and not a hardened application, it illustrates a compelling direction for robotics and physical AI. Robots like Spot are already extremely capable of navigating complex and changeable environments, collecting data and sensor readings, and manipulating objects. Rather than requiring developers to reinvent the wheel, AI foundation models offer a new way to extend these capabilities to new settings and new applications.
Physical AI is a rapidly evolving field, and our team is leading the way both in the lab and in real applications of AI-empowered robots. While we are early in our formal partnership with Google DeepMind, we’re excited for what the future holds with Atlas, and we’ve already rolled out practical enhancements for Spot and Orbit with AIVI-Learning, powered by Google Gemini Robotics ER 1.6. This next evolution of our AI Visual Inspection tool unlocks a new level of visual intelligence, bringing deeper contextual understanding to Spot and Orbit. Model improvements happen automatically behind the scenes, adding more capabilities to the same software and hardware.
Today, this demo points to a future where users can rely more on natural language to guide Spot’s actions, rather than complex code. The engineer’s role shifts toward setting goals and objectives: the multi-modal robot foundation model interprets the instructions to form complex, adaptive plans, and Spot executes the actions.
This article was contributed by Issac Ross and Nikhil Devraj, engineers on the Spot team.