Cartpole

A cart on a frictionless track with a free-rotating pole attached. Four task variants share the same body and dynamics, differing only in starting state (upright vs hanging) and reward style (dense vs sparse).

CartpoleBalance

Property | Value
Canonical ID | mjx/cartpole_balance-v0
Action space | Box(-1.0, 1.0, (1,), float32)
Observation space | Box(-inf, inf, (5,), float32)
Episode length | 1000
Config | {"ctrl_dt": 0.01, "sim_dt": 0.01, "naconmax": 0, "njmax": 2}

Description

The pole starts near upright with a small angular perturbation. The agent applies a horizontal force to the cart to keep the pole vertical and the cart centred on the track.

Rewards

Uses a dense reward built as the product of four normalised components, each on [0, 1]:

Python
upright        = (pole_angle_cos + 1) / 2
centered       = (1 + tolerance(cart_position, margin=2)) / 2
small_control  = (4 + tolerance(action, margin=1, sigmoid="quadratic")) / 5
small_velocity = (1 + tolerance(angular_vel, margin=5).min()) / 2
reward = upright * centered * small_control * small_velocity

Because the components are multiplied, a low value in any one scales the whole reward down. Only upright can reach zero and veto the reward outright; the other three are floored at 0.5, 0.8 and 0.5, so with the pole fully upright the reward never drops below 0.5 × 0.8 × 0.5 = 0.2:

  • upright: the cosine of the pole angle, shifted and halved into [0, 1]. 1.0 when vertical, 0.0 when hanging.
  • centered: tolerance on cart position rescaled into [0.5, 1.0], so it lightly modulates rather than dominates.
  • small_control: quadratic action penalty rescaled into [0.8, 1.0].
  • small_velocity: tolerance on pole angular velocity rescaled into [0.5, 1.0], so jittery balance is gently penalised.
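
To make the component ranges concrete, here is a minimal pure-Python sketch of the product. It is not the upstream implementation. The scalar `tolerance` helper is a simplified stand-in modelled on a dm_control-style function (Gaussian falloff by default, value 0.1 at the margin), and passing `value_at_margin=0.0` for the control term is an assumption, chosen because it reproduces the [0.8, 1.0] range stated above.

```python
import math


def tolerance(x, bounds=(0.0, 0.0), margin=0.0, sigmoid="gaussian",
              value_at_margin=0.1):
    """Simplified scalar stand-in for a dm_control-style tolerance().

    Returns 1.0 inside `bounds` and decays towards 0.0 outside them,
    reaching `value_at_margin` at a distance of `margin`.
    """
    lower, upper = bounds
    if lower <= x <= upper:
        return 1.0
    if margin == 0.0:
        return 0.0  # zero margin: a hard step indicator
    # Distance outside the bounds, measured in units of the margin.
    d = (lower - x if x < lower else x - upper) / margin
    if sigmoid == "gaussian":
        scale = math.sqrt(-2.0 * math.log(value_at_margin))
        return math.exp(-0.5 * (d * scale) ** 2)
    if sigmoid == "quadratic":
        scaled = d * math.sqrt(1.0 - value_at_margin)
        return 1.0 - scaled ** 2 if abs(scaled) < 1.0 else 0.0
    raise ValueError(f"unsupported sigmoid: {sigmoid}")


def dense_reward(pole_angle_cos, cart_position, action, angular_vel):
    """Product of the four components, all arguments scalar."""
    upright = (pole_angle_cos + 1) / 2
    centered = (1 + tolerance(cart_position, margin=2)) / 2
    # value_at_margin=0.0 is an assumption (see the paragraph above).
    small_control = (4 + tolerance(action, margin=1, sigmoid="quadratic",
                                   value_at_margin=0.0)) / 5
    small_velocity = (1 + tolerance(angular_vel, margin=5)) / 2
    return upright * centered * small_control * small_velocity
```

Under these assumptions, `dense_reward(1.0, 0.0, 0.0, 0.0)` is 1.0, any call with `pole_angle_cos=-1.0` is 0.0, and an upright pole with the other three terms at their worst comes out at about 0.2.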

Starting state

obs = [ 0.0158  0.9996 -0.0278 -0.0134 -0.0035]

(cart position, cos(pole_angle), sin(pole_angle), cart velocity, pole angular velocity — pole near upright with cos ≈ 1.)
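
The field order above can be captured in a small helper so that downstream code reads names rather than indices. This is a sketch only: `CartpoleObs`, `decode` and `pole_angle` are hypothetical names introduced for illustration, not part of envrax, and the layout is taken from the description above.

```python
import math
from typing import NamedTuple, Sequence


class CartpoleObs(NamedTuple):
    """Named view of the 5-element observation (hypothetical helper)."""
    cart_position: float
    pole_angle_cos: float
    pole_angle_sin: float
    cart_velocity: float
    pole_angular_velocity: float


def decode(obs: Sequence[float]) -> CartpoleObs:
    """Wrap a raw observation, checking only its length."""
    if len(obs) != 5:
        raise ValueError("expected a 5-element observation")
    return CartpoleObs(*(float(v) for v in obs))


def pole_angle(o: CartpoleObs) -> float:
    """Angle in radians recovered from the (cos, sin) pair; 0 means cos = 1."""
    return math.atan2(o.pole_angle_sin, o.pole_angle_cos)


# The example starting observation from this section.
start = decode([0.0158, 0.9996, -0.0278, -0.0134, -0.0035])
print(f"pole angle at start: {pole_angle(start):.4f} rad")
```

Encoding the angle as a (cos, sin) pair avoids the discontinuity a raw angle would have at ±π; `atan2` recovers a single angle when one is needed.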

Termination

Episode ends when step >= max_steps (default 1000). No early termination on falling.
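
As a rough sketch of what the step limit means in simulated time, the arithmetic below combines it with the config values from the table. `is_done` is a hypothetical helper, and deriving the number of physics substeps as `ctrl_dt / sim_dt` is an assumption about how the two timesteps relate, not something this page states.

```python
CONFIG = {"ctrl_dt": 0.01, "sim_dt": 0.01}
MAX_STEPS = 1000


def is_done(step: int, max_steps: int = MAX_STEPS) -> bool:
    """Time-limit check only: the pole falling never ends the episode."""
    return step >= max_steps


# Assumed relation: physics substeps taken per agent action.
substeps_per_action = round(CONFIG["ctrl_dt"] / CONFIG["sim_dt"])

# Simulated duration of one full episode, in seconds.
episode_seconds = MAX_STEPS * CONFIG["ctrl_dt"]

print(substeps_per_action, episode_seconds)
```

With these values each action covers one physics step and a full episode spans about 10 seconds of simulated time.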

Usage

Python
import envrax
env = envrax.make("mjx/cartpole_balance-v0")

Reference

Upstream: mujoco_playground/_src/dm_control_suite/cartpole.py.


CartpoleBalanceSparse

Property | Value
Canonical ID | mjx/cartpole_balance_sparse-v0
Action space | Box(-1.0, 1.0, (1,), float32)
Observation space | Box(-inf, inf, (5,), float32)
Episode length | 1000
Config | {"ctrl_dt": 0.01, "sim_dt": 0.01, "naconmax": 0, "njmax": 2}

Description

Same physics and starting state as CartpoleBalance: the pole begins near upright and the agent must keep the pole vertical and the cart centred. Instead of a graded reward, this variant pays out only while cart position and pole angle both sit inside narrow tolerance bands, so the agent gets no partial credit for being almost balanced.

Rewards

Uses a sparse reward that fires only when both the cart's position and the pole's angle sit inside their tolerance bands:

Python
cart_in_bounds  = tolerance(cart_position, CART_RANGE)
angle_in_bounds = tolerance(pole_angle_cos, ANGLE_COS_RANGE).prod()
reward = cart_in_bounds * angle_in_bounds

With the default zero margin, both tolerance calls collapse to step indicators. The product acts as a logical AND:

  • 1.0 when the cart sits inside CART_RANGE and the cosine of the pole angle is inside ANGLE_COS_RANGE.
  • 0.0 if either is outside.
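
The AND behaviour can be sketched in a few lines of plain Python. The numeric bounds below are assumptions for illustration; this page does not give the values of CART_RANGE or ANGLE_COS_RANGE, so check the upstream source before relying on them. `in_range` is a hypothetical stand-in for a zero-margin `tolerance` call.

```python
# Assumed bounds, for illustration only (not stated on this page).
CART_RANGE = (-0.25, 0.25)
ANGLE_COS_RANGE = (0.995, 1.0)


def in_range(x: float, bounds: tuple) -> float:
    """Zero-margin tolerance: 1.0 inside the closed interval, else 0.0."""
    lower, upper = bounds
    return 1.0 if lower <= x <= upper else 0.0


def sparse_reward(cart_position: float, pole_angle_cos: float) -> float:
    """Product of two indicators, i.e. a logical AND."""
    cart_in_bounds = in_range(cart_position, CART_RANGE)
    angle_in_bounds = in_range(pole_angle_cos, ANGLE_COS_RANGE)
    return cart_in_bounds * angle_in_bounds


# Starting observation of the balance variants: cart at 0.0158, cos at 0.9996.
print(sparse_reward(0.0158, 0.9996))
```

Under these assumed bounds the balance starting state already earns the reward, while drifting the cart out of range or tilting the pole past the band drops it to 0.0 with no gradient in between.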

Starting state

obs = [ 0.0158  0.9996 -0.0278 -0.0134 -0.0035]

Termination

Episode ends when step >= max_steps (default 1000). No early termination on falling.

Usage

Python
import envrax
env = envrax.make("mjx/cartpole_balance_sparse-v0")

Reference

Upstream: mujoco_playground/_src/dm_control_suite/cartpole.py.


CartpoleSwingup

Property | Value
Canonical ID | mjx/cartpole_swingup-v0
Action space | Box(-1.0, 1.0, (1,), float32)
Observation space | Box(-inf, inf, (5,), float32)
Episode length | 1000
Config | {"ctrl_dt": 0.01, "sim_dt": 0.01, "naconmax": 0, "njmax": 2}

Description

The pole starts hanging straight down. The agent has to swing it up through the underactuated dynamics — available cart force is too small to lift the pole in one push, so energy must build up over multiple swings before the pole crosses the top. Once balanced, the same upright-and-centred objective as CartpoleBalance applies.

Rewards

Uses the same four-component dense reward as CartpoleBalance:

Python
upright        = (pole_angle_cos + 1) / 2
centered       = (1 + tolerance(cart_position, margin=2)) / 2
small_control  = (4 + tolerance(action, margin=1, sigmoid="quadratic")) / 5
small_velocity = (1 + tolerance(angular_vel, margin=5).min()) / 2
reward = upright * centered * small_control * small_velocity

Same multiplicative structure as CartpoleBalance, just starting from a different pole angle. Because upright is 0.0 at the hanging start, the reward begins near zero and grows only as the pole rises:

  • upright: the cosine of the pole angle, shifted and halved into [0, 1]. 0.0 when hanging straight down (the starting posture), 1.0 when fully upright.
  • centered: tolerance on cart position rescaled into [0.5, 1.0], lightly modulating rather than dominating.
  • small_control: quadratic action penalty rescaled into [0.8, 1.0].
  • small_velocity: tolerance on pole angular velocity rescaled into [0.5, 1.0], gently penalising jittery balance.
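
Since the other three factors never exceed 1, the upright term is an upper bound on the swing-up reward. The short sketch below evaluates it at three postures to show how the bound rises during a swing-up; the `upright` function name and the convention that the angle is measured from vertical are assumptions for illustration.

```python
import math


def upright(pole_angle: float) -> float:
    """Upright term for an angle in radians, measured from vertical."""
    return (math.cos(pole_angle) + 1) / 2


# Hanging, horizontal and upright postures.
postures = [("hanging", math.pi), ("horizontal", math.pi / 2), ("upright", 0.0)]
for name, angle in postures:
    print(f"{name:>10}: upright = {upright(angle):.3f}")
```

The bound goes from 0 at the hanging start through 0.5 at horizontal to 1 at the top, so the agent receives a graded signal throughout the swing rather than only at the end.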

Starting state

obs = [-0.0134 -1.     -0.0068 -0.0134 -0.0035]

(cart position, cos(pole_angle), sin(pole_angle), cart velocity, pole angular velocity — cos(pole_angle) = -1 indicates the pole is hanging straight down.)

Termination

Episode ends when step >= max_steps (default 1000). No early termination.

Usage

Python
import envrax
env = envrax.make("mjx/cartpole_swingup-v0")

Reference

Upstream: mujoco_playground/_src/dm_control_suite/cartpole.py.


CartpoleSwingupSparse

Property | Value
Canonical ID | mjx/cartpole_swingup_sparse-v0
Action space | Box(-1.0, 1.0, (1,), float32)
Observation space | Box(-inf, inf, (5,), float32)
Episode length | 1000
Config | {"ctrl_dt": 0.01, "sim_dt": 0.01, "naconmax": 0, "njmax": 2}

Description

The hardest cartpole variant. The pole starts hanging down (as in CartpoleSwingup) and the agent must swing it up through the underactuated dynamics. This variant uses tighter tolerance bands on cart position and pole angle, so success requires both reaching upright and holding inside narrow bounds — random exploration almost never stumbles into the band from the bottom of the swing.

Rewards

Uses the same two-indicator sparse reward as CartpoleBalanceSparse:

Python
cart_in_bounds  = tolerance(cart_position, CART_RANGE)
angle_in_bounds = tolerance(pole_angle_cos, ANGLE_COS_RANGE).prod()
reward = cart_in_bounds * angle_in_bounds

With the default zero margin, both tolerance calls collapse to step indicators. The product acts as a logical AND:

  • 1.0 when the cart sits inside CART_RANGE and the cosine of the pole angle is inside ANGLE_COS_RANGE.
  • 0.0 if either is outside.
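
To get a feel for how narrow the angle band is when approached from the bottom of the swing, the sketch below converts a cosine band into an angular width. The value of ANGLE_COS_RANGE is an assumption for illustration (this page does not state it), and the result is purely geometric: it says nothing about how often the dynamics actually visit those angles.

```python
import math

# Assumed band, for illustration only (not stated on this page).
ANGLE_COS_RANGE = (0.995, 1.0)


def band_fraction(cos_range: tuple) -> float:
    """Fraction of the full circle whose cosine is at least cos_range[0].

    Assumes the upper bound is 1.0, so the band is centred on upright.
    """
    lower, _ = cos_range
    half_width = math.acos(lower)  # radians either side of upright
    return half_width / math.pi    # (2 * half_width) / (2 * pi)


half_width_deg = math.degrees(math.acos(ANGLE_COS_RANGE[0]))
fraction = band_fraction(ANGLE_COS_RANGE)
print(f"half-width: {half_width_deg:.2f} deg, share of circle: {fraction:.2%}")
```

Under this assumed band the pole must be within a few degrees of vertical, a small share of the circle, and the cart must be inside its own range at the same moment for the reward to fire.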

Starting state

obs = [-0.0134 -1.     -0.0068 -0.0134 -0.0035]

Termination

Episode ends when step >= max_steps (default 1000). No early termination.

Usage

Python
import envrax
env = envrax.make("mjx/cartpole_swingup_sparse-v0")

Reference

Upstream: mujoco_playground/_src/dm_control_suite/cartpole.py.