Visual assistants that can guide humans through complex tasks in physical environments have significant potential, yet their development is hindered by the high cost of human-in-the-loop data collection. We present BASIS (Bootstrapping Assistant modeling with Situated Interaction Simulation), a novel framework that fundamentally rethinks how visual assistants are developed and evaluated. Rather than relying on expensive human data collection, BASIS leverages simulation to bootstrap capable assistants through three interconnected stages: (1) Situated Interaction Simulation generates high-quality synthetic data through interactions between oracle assistants and simulated users; (2) Autonomous Model Development trains and continuously evaluates assistant models using this synthetic data; and (3) Real-User Validation verifies effectiveness with human users. We implement BASIS in Alexa Arena and demonstrate that our best fine-tuned vision-language model, trained solely on synthetic data, achieves a 72.9% success rate with real humans, reaching 82.3% of the success rate (88.6%) of an oracle assistant with access to privileged information. Through detailed error analysis, we identify object identification as the primary bottleneck for current visual assistants. Our approach successfully bridges the gap between simulation and reality, establishing a scalable pipeline for developing assistants that can effectively guide users through complex tasks.
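To make the three-stage pipeline concrete, here is a minimal Python sketch of Stages 1 and 2. All interfaces (the environment, oracle assistant, simulated user, and model) are hypothetical stand-ins for illustration; the actual Alexa Arena and training APIs are not shown in the text.

```python
# Minimal sketch of BASIS Stages 1-2. Every interface here (env, oracle,
# sim_user, model) is a hypothetical stand-in, not the real Arena API.

def simulate_episode(env, oracle, sim_user, max_turns=30):
    """Stage 1: one interaction between the oracle assistant and a
    simulated user, recorded as synthetic training data."""
    dialogue = []
    obs, done = env.reset(), False
    while not done and len(dialogue) < max_turns:
        guidance = oracle.guide(obs)          # oracle sees privileged state
        action = sim_user.act(guidance, obs)  # user follows the guidance
        obs, done = env.step(action)
        dialogue.append((obs, guidance, action))
    return dialogue

def bootstrap_assistant(env, oracle, sim_user, model, n_episodes=10_000):
    """Stage 2: fine-tune the assistant on synthetic episodes and
    evaluate it automatically with the same simulated user."""
    episodes = [simulate_episode(env, oracle, sim_user)
                for _ in range(n_episodes)]
    model.finetune(episodes)              # e.g., a vision-language model
    return model.evaluate(env, sim_user)  # automated, human-free evaluation
```

Stage 3 (Real-User Validation) replaces the simulated user with human participants and is therefore not part of the automated loop above.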
Using an ORACLE ASSISTANT with a SIMULATED USER, you can train a reliable visual assistant!
Everything Synthetic ✅ Evaluation Automated ✅
The oracle assistant uses perfect perception with set-of-mark labels, while the trained model uses raw observations. The resulting performance gap is acceptable!
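As a hedged illustration of the two perception modes, the sketch below tags ground-truth objects with numeric set-of-mark labels for the oracle, while the trained model receives only raw pixels. The observation format and the PIL-based mark drawing are assumptions, not Arena's actual implementation.

```python
# Illustrative contrast between the oracle's marked view and the trained
# model's raw view; observation dicts and mark drawing are assumptions.
from PIL import ImageDraw

def oracle_observation(frame, objects):
    """Oracle: every ground-truth object gets a visible numeric mark,
    so referring to "object #3" requires no perception."""
    img = frame.copy()
    draw = ImageDraw.Draw(img)
    for i, (name, box) in enumerate(objects):  # box = (x0, y0, x1, y1)
        draw.rectangle(box, outline="red")
        draw.text((box[0], box[1]), str(i), fill="red")
    return {"image": img,
            "marks": {i: name for i, (name, _) in enumerate(objects)}}

def model_observation(frame):
    """Trained model: raw pixels only; it must identify objects itself,
    which the error analysis identifies as the main bottleneck."""
    return {"image": frame}
```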
A performance gap does exist, so what lies behind the success rate? We manually annotated the assistants' behaviors and compared their most frequent errors!