Robotic Arm Manipulation with Behavior Cloning
Franka Arm Manipulation Using Human Demonstrations in a Kitchen Environment
by me
This project enhances robotic arm manipulation by integrating human demonstrations into a modified Soft Actor-Critic (SAC) method, enabling a robot to perform complex tasks, such as opening a cabinet, more effectively.
Soft Actor-Critic: The Big Picture
SAC is a reinforcement learning algorithm that trains an agent to act optimally in continuous action spaces, such as controlling a robot arm or navigating a drone. In the code:
- The environment is FrankaKitchen-v1, where the agent completes tasks like opening a cabinet.
- The agent optimizes its policy using the Soft Actor-Critic (SAC) algorithm.
- The algorithm prioritizes reward maximization while encouraging exploration via entropy.
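For concreteness, here is a minimal sketch of creating this environment with gymnasium-robotics (assuming that package is installed). The `slide cabinet` task choice and the small observation wrapper are illustrative assumptions; the project's own wrapper may process observations differently.

```python
import gymnasium as gym
import gymnasium_robotics  # importing registers the FrankaKitchen environments

# Illustrative task choice; the environment accepts a list of kitchen tasks.
env = gym.make("FrankaKitchen-v1", tasks_to_complete=["slide cabinet"])

class FlattenKitchenObs(gym.ObservationWrapper):
    """Expose only the raw state vector from the dict observation."""
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = env.observation_space["observation"]

    def observation(self, obs):
        return obs["observation"]

env = FlattenKitchenObs(env)
obs, info = env.reset(seed=0)
```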
How SAC Works in This Code
SAC involves three key networks:
- Actor (Policy): Learns which actions to take in a given state to maximize reward.
- Critics (Q-value estimators): Evaluate how good a given action is in a particular state.
- Target Critic: Provides stable Q-value targets for training the critics.
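Below is a minimal PyTorch sketch of what these networks typically look like in a SAC agent; the layer sizes and architecture are assumptions, not necessarily the project's exact implementation. The Target Critic is simply a slowly-updated copy of each Critic.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class Actor(nn.Module):
    """Gaussian policy: outputs mean and log-std, samples actions stochastically."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim)

    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        std = log_std.clamp(-20, 2).exp()
        dist = torch.distributions.Normal(mean, std)
        raw = dist.rsample()              # reparameterized sample
        action = torch.tanh(raw)          # squash to [-1, 1]
        # log-probability with the tanh correction (used by the entropy term)
        log_prob = dist.log_prob(raw) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True)

class Critic(nn.Module):
    """Q-network: maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q = mlp(obs_dim + act_dim, 1)

    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1))
```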
The overall flow can be broken into three phases.
Phase 1: Initialization
- Set Up Environment:
- The environment is created (`gym.make`), and a wrapper processes observations for compatibility.
- Agent Initialization:
- Actor:
- Learns a policy represented as a probability distribution.
- Outputs:
- Mean and log standard deviation of action distributions.
- Ensures exploration via stochastic sampling.
- Critics:
- Two independent networks (Q1 and Q2) estimate action values for stability (avoids overestimation bias).
- Target Critic:
- Initially copies the weights of the Critic and updates slowly to ensure stable targets.
- Replay Buffer:
- Stores past experiences (`state`, `action`, `reward`, `next_state`, `done`).
- Enables efficient learning by reusing past experiences.
- Loading Expert Data:
- In Phase 1, the agent leverages human demonstration data (`human_memory.npz`) to jumpstart training (sketched below).
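Below is a minimal sketch of the replay buffer and the expert-data loading described above, continuing the environment sketch from earlier. The `human_memory.npz` keys (`state`, `action`, `reward`, `next_state`, `done`) are assumed for illustration, not a documented format.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size circular buffer of (state, action, reward, next_state, done)."""
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity, self.ptr, self.size = capacity, 0, 0
        self.state = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.action = np.zeros((capacity, act_dim), dtype=np.float32)
        self.reward = np.zeros((capacity, 1), dtype=np.float32)
        self.next_state = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done = np.zeros((capacity, 1), dtype=np.float32)

    def add(self, s, a, r, s2, d):
        i = self.ptr
        self.state[i], self.action[i], self.reward[i] = s, a, r
        self.next_state[i], self.done[i] = s2, d
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.size, size=batch_size)
        return (self.state[idx], self.action[idx], self.reward[idx],
                self.next_state[idx], self.done[idx])

# Dimensions taken from the wrapped environment in the earlier sketch.
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
buffer = ReplayBuffer(capacity=1_000_000, obs_dim=obs_dim, act_dim=act_dim)

# Hypothetical keys for the demonstration file; the real layout may differ.
expert_data = np.load("human_memory.npz")
expert_buffer = ReplayBuffer(len(expert_data["state"]), obs_dim, act_dim)
for s, a, r, s2, d in zip(expert_data["state"], expert_data["action"],
                          expert_data["reward"], expert_data["next_state"],
                          expert_data["done"]):
    expert_buffer.add(s, a, r, s2, d)
```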
Phase 2: Training Loop
The core training loop repeats the following steps, with reliance on expert data decreasing across the three phases:
Step 1: Interaction with the Environment
- The agent uses the Actor to:
- Sample an action based on the current policy.
- Observe the resulting next state, reward, and whether the episode ends.
- The transition (`state`, `action`, `reward`, `next_state`, `done`) is stored in the Replay Buffer.
Step 2: Sampling from the Replay Buffer
- The agent randomly samples a batch of transitions to train itself, ensuring diverse learning.
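Continuing the hypothetical objects from the earlier sketches (`env`, the `Actor` and `ReplayBuffer` classes, `buffer`), these two steps might look roughly like this:

```python
import torch

actor = Actor(obs_dim, act_dim)   # classes and dimensions from the earlier sketches

obs, _ = env.reset()
for step in range(1000):
    # Sample an action from the current stochastic policy.
    with torch.no_grad():
        action, _ = actor(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
    action = action.squeeze(0).numpy()

    # Step the environment and store the transition.
    next_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    buffer.add(obs, action, reward, next_obs, float(done))
    obs = env.reset()[0] if done else next_obs

    # Once enough experience is collected, sample a random batch for the updates below.
    if buffer.size >= 256:
        states, actions, rewards, next_states, dones = buffer.sample(batch_size=256)
```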
Step 3: Critic Updates
- The Critics learn to predict Q-values, which represent the expected reward for a state-action pair.
- Target Q-value computation:
- Uses the Target Critic to estimate future rewards for `next_state`.
- Incorporates the current reward and a discount factor (`gamma`) to compute the target:
$Q_{\text{target}} = r + \gamma \cdot (1 - \text{done}) \cdot \left(\min(Q_1', Q_2') - \alpha \cdot \text{log\_prob}\right)$
- The entropy term ($\alpha \cdot \text{log\_prob}$) encourages exploration by penalizing overly deterministic policies.
- Critic Loss:
- Compares the predicted Q-values ($Q_1, Q_2$) to the computed target Q-value using Mean Squared Error (see the sketch below).
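A sketch of this critic update in PyTorch, continuing the hypothetical classes and tensors from the earlier sketches; the network instances, optimizer, and the `gamma`/`alpha` values are assumptions rather than the project's exact settings.

```python
import copy
import torch
import torch.nn.functional as F

gamma, alpha = 0.99, 0.2   # assumed discount factor and entropy temperature

critic1, critic2 = Critic(obs_dim, act_dim), Critic(obs_dim, act_dim)
target_critic1, target_critic2 = copy.deepcopy(critic1), copy.deepcopy(critic2)
critic_optimizer = torch.optim.Adam(
    list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)

# Batch tensors from the sampled transitions.
s, a = torch.as_tensor(states), torch.as_tensor(actions)
r, s2, d = torch.as_tensor(rewards), torch.as_tensor(next_states), torch.as_tensor(dones)

with torch.no_grad():
    next_action, next_log_prob = actor(s2)
    q_next = torch.min(target_critic1(s2, next_action), target_critic2(s2, next_action))
    q_target = r + gamma * (1 - d) * (q_next - alpha * next_log_prob)

# Both critics regress toward the same target with Mean Squared Error.
critic_loss = F.mse_loss(critic1(s, a), q_target) + F.mse_loss(critic2(s, a), q_target)
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
```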
Step 4: Actor Updates
- The Actor improves its policy to maximize the Q-values predicted by the critics.
- Actor Loss:
- Encourages actions that:
- Maximize Q-values ($\min(Q_1, Q_2)$).
- Maintain high entropy (exploration).
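A matching sketch of the actor update, reusing the hypothetical tensors and networks from the critic-update sketch above; `actor_optimizer` and its learning rate are assumed.

```python
import torch

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

new_action, log_prob = actor(s)
q_min = torch.min(critic1(s, new_action), critic2(s, new_action))

# Minimize alpha*log_prob - Q, i.e. maximize Q while keeping entropy high.
actor_loss = (alpha * log_prob - q_min).mean()

actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()
```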
Step 5: Target Critic Updates
- The Target Critic's weights are moved slowly toward the Critic's weights (a soft/Polyak update), which keeps the Q-value targets stable.
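A minimal sketch of such a soft (Polyak) update, with an assumed coefficient `tau`:

```python
tau = 0.005  # assumed soft-update coefficient

for target, online in ((target_critic1, critic1), (target_critic2, critic2)):
    for p_t, p in zip(target.parameters(), online.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```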
Step 6: Logging and Checkpointing
- TensorBoard logs:
- Critic loss, Actor loss, and rewards.
- Saves checkpoints to allow resuming training later.
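A sketch of this step, assuming `torch.utils.tensorboard` is used for logging; the tags, file name, and the `step`/`episode_reward` variables are illustrative, not the project's exact ones.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/sac_kitchen")   # illustrative log directory

# Scalars logged during training; `step` and `episode_reward` come from the training loop.
writer.add_scalar("loss/critic", critic_loss.item(), step)
writer.add_scalar("loss/actor", actor_loss.item(), step)
writer.add_scalar("reward/episode", episode_reward, step)

# Save a checkpoint so training can be resumed later.
torch.save({"actor": actor.state_dict(),
            "critic1": critic1.state_dict(),
            "critic2": critic2.state_dict()}, "checkpoint.pt")
```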
Phase 3: Gradual Transition to Full Autonomy
The agent is trained in three phases:
- Phase 1: High reliance on expert data:
- Expert data ratio = 50%.
- Balances learning from the replay buffer and human-provided data.
- Phase 2: Reduced expert reliance:
- Expert data ratio = 25%.
- Encourages the agent to learn more from its own exploration.
- Phase 3: Full autonomy:
- Expert data ratio = 0%.
- The agent learns purely from its own experience.
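One way this schedule can be realized is to draw each training batch partly from the expert buffer and partly from the agent's own replay buffer. The helper below is a hypothetical sketch of that idea, not the project's exact implementation.

```python
import numpy as np

def sample_mixed_batch(agent_buffer, expert_buffer, batch_size, expert_ratio):
    """Draw a batch with `expert_ratio` of the samples taken from the expert buffer."""
    n_expert = int(batch_size * expert_ratio)
    agent_batch = agent_buffer.sample(batch_size - n_expert)
    if n_expert == 0:
        return agent_batch
    expert_batch = expert_buffer.sample(n_expert)
    # Concatenate the corresponding arrays (state, action, reward, next_state, done).
    return tuple(np.concatenate([a, e]) for a, e in zip(agent_batch, expert_batch))

# Phase schedule: 50% -> 25% -> 0% expert data.
for expert_ratio in (0.5, 0.25, 0.0):
    batch = sample_mixed_batch(buffer, expert_buffer, batch_size=256, expert_ratio=expert_ratio)
```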
Why Each Component Matters
- Actor (Policy):
- Learns how to act optimally by maximizing rewards while maintaining exploration.
- Critics (Q-Values):
- Evaluate the quality of actions taken by the policy.
- Two critics reduce overestimation bias.
- Replay Buffer:
- Ensures sample efficiency by reusing past experiences.
- Decorrelation: Helps prevent learning from sequentially correlated data.
- Entropy Regularization:
- Encourages exploration, preventing premature convergence to suboptimal strategies.
- Target Networks:
- Provide stable targets for critic training, avoiding instability caused by rapidly changing Q-values.
- Expert Data:
- Jumpstarts training by introducing good behaviors early on, especially useful in complex tasks like robotics.
Summary of Training Flow
- Initialize environment, agent, and replay buffer.
- Phase 1 (Exploration with Expert Data):
- Train using a mix of expert and self-collected data.
- Phase 2 (Reduced Expert Reliance):
- Gradually shift focus to agent-collected experiences.
- Phase 3 (Full Autonomy):
- Train entirely on self-collected experiences.
- For Each Episode:
- Interact with the environment.
- Store experiences in the replay buffer.
- Periodically sample experiences to:
- Update Critics using target Q-values.
- Update Actor using learned Q-values and entropy regularization.
- Log metrics and save model checkpoints.
Environment setup:
- macOS Sequoia 15.1.1
- Python 3.11.9
- Required packages are listed in `requirements.txt`.