Simulately

2025-07-22

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Authors: Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu

Abstract

We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper

GR-3 Technical Report

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

Generalist Bimanual Manipulation via Foundation Video Diffusion Models

GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training

Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

2025-07-15

Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization

Multi-critic Learning for Whole-body End-effector Twist Tracking

Reinforcement Learning with Action Chunking

RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot

SimLauncher: Launching Sample-Efficient Real-world Robotic Reinforcement Learning via Simulation Pre-training

DexVLG: Dexterous Vision-Language-Grasp Model at Scale

2025-07-03

AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover

SAM4D: Segment Anything in Camera and LiDAR Streams

2025-06-26

DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy

FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Learning Accurate Whole-body Throwing with High-frequency Residual Policy and Pullback Tube Acceleration

Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation

Vision in Action: Learning Active Perception from Human Demonstrations

2025-06-18

ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes

Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation

GMT: General Motion Tracking for Humanoid Whole-Body Control

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control

From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction

Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins

Touch begins where vision ends: Generalizable policies for contact-rich manipulation

Construction of a Multiple-DOF Under-actuated Gripper with Force-Sensing via Deep Learning

2025-06-13

EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence
