arm-manipulation-behavior-cloning

Robotic Arm Manipulation with Behavior Cloning

Franka arm manipulation using human demonstrations in the Kitchen environment.

This project enhances robotic arm manipulation by integrating human demonstrations into a modified Soft Actor-Critic (SAC) method, enabling the robot to perform complex tasks, such as opening cabinets, more effectively.

Soft Actor-Critic: The Big Picture

SAC is a reinforcement learning algorithm that trains an agent to act optimally in continuous action spaces, such as controlling a robot arm or navigating a drone.


How SAC Works in This Code

SAC involves three key networks:

  1. Actor (Policy): Learns which actions to take in a given state to maximize reward.
  2. Critics (Q-value estimators): Evaluate how good a given action is in a particular state.
  3. Target Critic: Provides stable Q-value targets for training the critics.
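
For orientation, here is a rough PyTorch sketch of how such networks are typically structured; the layer sizes, class names, and the tanh-squashed Gaussian policy are illustrative assumptions, not the exact code in this repository:

  import torch
  import torch.nn as nn

  LOG_STD_MIN, LOG_STD_MAX = -20, 2

  class Actor(nn.Module):
      """Gaussian policy: outputs mean and log-std, samples a tanh-squashed action."""
      def __init__(self, obs_dim, act_dim, hidden=256):
          super().__init__()
          self.body = nn.Sequential(
              nn.Linear(obs_dim, hidden), nn.ReLU(),
              nn.Linear(hidden, hidden), nn.ReLU(),
          )
          self.mean = nn.Linear(hidden, act_dim)
          self.log_std = nn.Linear(hidden, act_dim)

      def forward(self, obs):
          h = self.body(obs)
          return self.mean(h), self.log_std(h).clamp(LOG_STD_MIN, LOG_STD_MAX)

      def sample(self, obs):
          mean, log_std = self(obs)
          dist = torch.distributions.Normal(mean, log_std.exp())
          x = dist.rsample()                     # reparameterized sample (keeps gradients)
          action = torch.tanh(x)                 # squash into [-1, 1]
          # Change-of-variables correction for the tanh squashing.
          log_prob = dist.log_prob(x) - torch.log(1 - action.pow(2) + 1e-6)
          return action, log_prob.sum(-1, keepdim=True)

  class Critic(nn.Module):
      """Q-network: maps a (state, action) pair to a scalar value estimate."""
      def __init__(self, obs_dim, act_dim, hidden=256):
          super().__init__()
          self.q = nn.Sequential(
              nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
              nn.Linear(hidden, hidden), nn.ReLU(),
              nn.Linear(hidden, 1),
          )

      def forward(self, obs, act):
          return self.q(torch.cat([obs, act], dim=-1))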

The overall flow can be broken into three phases.


Phase 1: Initialization

  1. Set Up Environment:
    • The environment is created (gym.make), and a wrapper processes observations for compatibility.
  2. Agent Initialization:
    • Actor:
      • Learns a policy represented as a probability distribution.
      • Outputs:
        • Mean and log standard deviation of action distributions.
        • Ensures exploration via stochastic sampling.
    • Critics:
      • Two independent networks (Q1 and Q2) estimate action values for stability (avoids overestimation bias).
    • Target Critic:
      • Initially copies the critics' weights and is updated slowly to ensure stable targets.
  3. Replay Buffer:
    • Stores past experiences (state, action, reward, next_state, done).
    • Enables efficient learning by reusing past experiences.
  4. Loading Expert Data:
    • In Phase 1, the agent leverages human demonstration data (human_memory.npz) to jumpstart training.
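
A minimal sketch of this initialization, continuing the network definitions above; the environment id, the ObservationWrapper and ReplayBuffer helpers, the optimizer settings, and the key names inside human_memory.npz are assumptions rather than the repository's exact code:

  import gym
  import numpy as np
  import torch

  env = gym.make("FrankaKitchen-v1")           # assumed environment id
  env = ObservationWrapper(env)                # hypothetical wrapper that flattens observations

  obs_dim = env.observation_space.shape[0]
  act_dim = env.action_space.shape[0]

  actor = Actor(obs_dim, act_dim)
  critic1, critic2 = Critic(obs_dim, act_dim), Critic(obs_dim, act_dim)
  target1, target2 = Critic(obs_dim, act_dim), Critic(obs_dim, act_dim)
  target1.load_state_dict(critic1.state_dict())    # targets start as exact copies
  target2.load_state_dict(critic2.state_dict())

  actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
  critic_optimizer = torch.optim.Adam(
      list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)

  replay_buffer = ReplayBuffer(capacity=1_000_000)     # hypothetical buffer class

  # Load expert transitions recorded from human demonstrations (assumed key names).
  expert = np.load("human_memory.npz")
  expert_buffer = ReplayBuffer(capacity=len(expert["states"]))
  for transition in zip(expert["states"], expert["actions"], expert["rewards"],
                        expert["next_states"], expert["dones"]):
      expert_buffer.add(*transition)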

Phase 2: Training Loop

Each training iteration runs through the six steps below; over the course of training, reliance on expert data decreases across three phases (detailed in Phase 3):

Step 1: Interaction with the Environment
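
As an illustration of how the agent collects experience within an episode (this sketch assumes the newer Gym/Gymnasium API, where step returns five values and reset returns an (obs, info) pair):

  obs, _ = env.reset()
  done = False
  while not done:
      # Act with the current stochastic policy.
      with torch.no_grad():
          action, _ = actor.sample(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
      action = action.squeeze(0).numpy()

      next_obs, reward, terminated, truncated, _ = env.step(action)
      done = terminated or truncated

      # Store the transition for later off-policy updates.
      replay_buffer.add(obs, action, reward, next_obs, done)
      obs = next_obs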

Step 2: Sampling from the Replay Buffer
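
One plausible way to draw mixed mini-batches at a given expert ratio; this is a sketch that assumes each buffer's sample() returns a tuple of equally ordered NumPy arrays (states, actions, rewards, next_states, dones):

  import numpy as np

  def sample_mixed_batch(replay_buffer, expert_buffer, batch_size, expert_ratio):
      """Mix self-collected and human-demonstration transitions in one batch."""
      n_expert = int(batch_size * expert_ratio)
      agent_batch = replay_buffer.sample(batch_size - n_expert)
      if n_expert == 0:
          return agent_batch
      expert_batch = expert_buffer.sample(n_expert)
      # Concatenate field-wise: (states, actions, rewards, next_states, dones).
      return tuple(np.concatenate([a, e]) for a, e in zip(agent_batch, expert_batch))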

Step 3: Critic Updates
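
In standard SAC, the critic target uses the minimum of the two target critics plus the entropy bonus. Sketched below, where states, actions, rewards, next_states, and dones are tensors unpacked from the Step 2 mini-batch, and the entropy temperature alpha is assumed fixed (the code may instead tune it automatically):

  gamma, alpha = 0.99, 0.2   # assumed discount factor and entropy temperature

  with torch.no_grad():
      next_action, next_log_prob = actor.sample(next_states)
      min_target_q = torch.min(target1(next_states, next_action),
                               target2(next_states, next_action))
      # Entropy-regularized Bellman target.
      y = rewards + gamma * (1 - dones) * (min_target_q - alpha * next_log_prob)

  q1_loss = torch.nn.functional.mse_loss(critic1(states, actions), y)
  q2_loss = torch.nn.functional.mse_loss(critic2(states, actions), y)

  critic_optimizer.zero_grad()
  (q1_loss + q2_loss).backward()
  critic_optimizer.step()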

Step 4: Actor Updates
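
The actor update then pushes the policy toward actions the critics rate highly while keeping entropy up; a sketch reusing the names from the critic step above:

  # Maximize Q plus entropy, i.e. minimize (alpha * log_prob - Q).
  new_action, log_prob = actor.sample(states)
  q_new = torch.min(critic1(states, new_action), critic2(states, new_action))
  actor_loss = (alpha * log_prob - q_new).mean()

  actor_optimizer.zero_grad()
  actor_loss.backward()
  actor_optimizer.step()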

Step 5: Target Critic Updates
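
Target critics are typically moved toward the online critics with Polyak averaging; a sketch with an assumed coefficient tau:

  tau = 0.005   # assumed soft-update coefficient

  @torch.no_grad()
  def soft_update(target, source, tau):
      """Polyak-average target parameters toward the online network."""
      for t_param, s_param in zip(target.parameters(), source.parameters()):
          t_param.mul_(1.0 - tau).add_(tau * s_param)

  soft_update(target1, critic1, tau)
  soft_update(target2, critic2, tau)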

Step 6: Logging and Checkpointing
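
Checkpointing can be as simple as periodically saving the network weights; the file path, checkpoint_every, and episode_return below are placeholders:

  if episode % checkpoint_every == 0:
      torch.save({"actor": actor.state_dict(),
                  "critic1": critic1.state_dict(),
                  "critic2": critic2.state_dict(),
                  "episode": episode},
                 f"checkpoints/sac_episode_{episode}.pt")
      print(f"episode={episode} return={episode_return:.2f}")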


Phase 3: Gradual Transition to Full Autonomy

The agent is trained in three phases:

  1. Phase 1: High reliance on expert data:
    • Expert data ratio = 50%.
    • Balances learning from the replay buffer and human-provided data.
  2. Phase 2: Reduced expert reliance:
    • Expert data ratio = 25%.
    • Encourages the agent to learn more from its own exploration.
  3. Phase 3: Full autonomy:
    • Expert data ratio = 0%.
    • The agent learns purely from its own experience.
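
One simple way to realize this schedule is a step function over training episodes; the phase boundaries below are placeholders, not the values used in the project:

  def expert_ratio_for(episode, phase1_end=200, phase2_end=400):
      """Fraction of each training batch drawn from the expert buffer."""
      if episode < phase1_end:
          return 0.50   # Phase 1: high reliance on demonstrations
      if episode < phase2_end:
          return 0.25   # Phase 2: reduced reliance
      return 0.0        # Phase 3: full autonomy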

Why Each Component Matters

  1. Actor (Policy):
    • Learns how to act optimally by maximizing rewards while maintaining exploration.
  2. Critics (Q-Values):
    • Evaluate the quality of actions taken by the policy.
    • Two critics reduce overestimation bias.
  3. Replay Buffer:
    • Ensures sample efficiency by reusing past experiences.
    • Decorrelation: Helps prevent learning from sequentially correlated data.
  4. Entropy Regularization:
    • Encourages exploration, preventing premature convergence to suboptimal strategies.
  5. Target Networks:
    • Provide stable targets for critic training, avoiding instability caused by rapidly changing Q-values.
  6. Expert Data:
    • Jumpstarts training by introducing good behaviors early on, especially useful in complex tasks like robotics.

Summary of Training Flow

  1. Initialize environment, agent, and replay buffer.
  2. Phase 1 (Exploration with Expert Data):
    • Train using a mix of expert and self-collected data.
  3. Phase 2 (Reduced Expert Reliance):
    • Gradually shift focus to agent-collected experiences.
  4. Phase 3 (Full Autonomy):
    • Train entirely on self-collected experiences.
  5. For Each Episode:
    • Interact with the environment.
    • Store experiences in the replay buffer.
    • Periodically sample experiences to:
      • Update Critics using target Q-values.
      • Update Actor using learned Q-values and entropy regularization.
  6. Log metrics and save model checkpoints.

Environment setup: