RepeaterBots
Here are notes on the simulations:
- A training session consists of up to 75,000 individual trials. Each trial is a simulation run with the robots beginning in their initial positions.
- The robots use the Q-learning technique to learn to solve the repeater placement problem.
- During a trial, the robots explore the environment and give themselves rewards based on a number of environmental factors, including which beacons and robots they can hear over their radios. A trial ends when the maze is solved or the maximum number of simulation steps is reached; training continues until the robots consistently solve the three trial mazes. The maximum number of steps starts at 1000 and decreases as Q-learning converges.
- The robots move in a lock-step fashion with Robot A moving before Robot B.
- Rewards are generated internally by the robots. Each robot is assigned a home beacon and knows the id of the other mobile robot. Rewards are given as follows (a sketch of this scheme appears after the list):
- A -1.0 reward is given when a robot attempts to walk into a barrier (the edge of the simulation grid or a "hill").
- Each robot is configured with a bias for exploring and for following its companion. The maximum exploration reward is given when a robot enters a previously unexplored square, and a smaller reward is given for exploring a square that another robot has already explored. "Following" rewards are given for exploring a square that the other robot has explored.
- A penalty is given if a robot does not hear its assigned beacon.
- A penalty is given for a move where the robot moves from hearing its beacon to not hearing it, e.g., it steps behind a barrier.
- A reward is given for moving to a location where the beacon is heard.
- The intent of the learning bias is to reward the robots for exploring and finding their beacons. The robot that initially does not hear its beacon is given incentive to explore, while the other robot is encouraged to follow its partner as long as it can still hear its beacon.
- The state information given to each robot before and after its move consists of a bit map describing the eight squares that surround the robot and information about which robots and beacons the robot can hear directly and via a repeater. The description of the surrounding squares indicates whether they contain barriers and whether they have been explored by the current robot or by another robot. The robots treat the input as an opaque bit map, i.e., they have no knowledge of what the state values mean (one possible encoding is sketched after this list).
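As an illustration of how such a state might be packed into a bit map, the sketch below encodes the eight surrounding squares plus the audibility flags into a single integer that can serve as a Q-table key. The two-bits-per-square layout, the field names, and the bit positions are assumptions made for this example; the notes do not specify the simulator's actual encoding.

```python
# Hypothetical packing of a robot's local state into an integer bit map.
# Assumed layout: two bits per neighboring square (bit 0 = barrier,
# bit 1 = explored by this or another robot), followed by four audibility
# bits for the home beacon and the companion robot, heard directly or
# via a repeater.

from dataclasses import dataclass
from typing import List

@dataclass
class Neighbor:
    barrier: bool
    explored: bool

def encode_state(neighbors: List[Neighbor],
                 hear_beacon_direct: bool,
                 hear_beacon_repeated: bool,
                 hear_robot_direct: bool,
                 hear_robot_repeated: bool) -> int:
    assert len(neighbors) == 8
    state = 0
    for i, sq in enumerate(neighbors):
        bits = (1 if sq.barrier else 0) | ((1 if sq.explored else 0) << 1)
        state |= bits << (2 * i)              # squares occupy bits 0..15
    audio = ((1 if hear_beacon_direct else 0) |
             ((1 if hear_beacon_repeated else 0) << 1) |
             ((1 if hear_robot_direct else 0) << 2) |
             ((1 if hear_robot_repeated else 0) << 3))
    state |= audio << 16                      # audibility occupies bits 16..19
    return state

# The learner would index its Q-table with this opaque integer and never
# interpret the individual bits, matching the note above.
```

The reward scheme could be expressed roughly as shown below. Only the -1.0 barrier penalty is stated in the notes; every other constant, and the way the exploring and following biases scale the terms, is a placeholder assumption for illustration.

```python
# Rough sketch of the per-step reward described in the notes. All constants
# except the -1.0 barrier penalty are assumed values, not the simulator's.

BARRIER_PENALTY = -1.0   # stated explicitly in the notes
EXPLORE_NEW     =  0.5   # assumed: entering a square nobody has visited
EXPLORE_SEEN    =  0.2   # assumed: entering a square another robot visited
FOLLOW          =  0.3   # assumed: stepping onto a square the partner explored
BEACON_LOST     = -0.4   # assumed: moved from hearing the beacon to not hearing it
BEACON_SILENT   = -0.2   # assumed: still cannot hear the assigned beacon
BEACON_REGAINED =  0.4   # assumed: moved to where the beacon is heard again

def step_reward(hit_barrier, square_new, square_seen_by_partner,
                heard_beacon_before, hears_beacon_now,
                explore_bias=1.0, follow_bias=1.0):
    """Combine the individual reward terms for one move."""
    if hit_barrier:
        return BARRIER_PENALTY              # attempted to walk into a barrier

    reward = 0.0
    if square_new:
        reward += explore_bias * EXPLORE_NEW
    elif square_seen_by_partner:
        reward += explore_bias * EXPLORE_SEEN
        reward += follow_bias * FOLLOW      # following the companion robot

    if heard_beacon_before and not hears_beacon_now:
        reward += BEACON_LOST               # e.g., stepped behind a barrier
    elif not hears_beacon_now:
        reward += BEACON_SILENT             # still out of range of its beacon
    elif not heard_beacon_before:
        reward += BEACON_REGAINED           # moved back into beacon range
    return reward
```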
This version of the repeater placement problem is nondeterministic and not an MDP. Because the robots are trained on three separate mazes, a robot can be in a grid square where a good move cannot be determined from the state information alone. For example, a certain state in maze 1 may call for a move up, while the same state in maze 2 requires a down-left move. If solutions exist for all the mazes in which these ambiguous states can be avoided, then Q-learning can still find them even though robot training is not an MDP.
There are two primary reasons for nondeterminacy in this problem. First, one robot does not know how the other robot will move. Second, the robots use "inertia" in that they continue to make the same move in successive steps in the same state when the Q-learning table has equal maximum values for several actions. See Tuning Q-Learning Parameters with a Genetic Algorithm for details.
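A minimal sketch of how the "inertia" tie-break might look in a tabular Q-learner is shown below: when several actions share the maximum Q-value, the robot repeats its previous action if that action is among the tied ones. The table layout, the eight-direction action set, and the function names are assumptions for illustration, not the simulator's actual code.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "up-right", "right", "down-right",
           "down", "down-left", "left", "up-left"]

# Q-table keyed by the opaque state integer; one value per action.
q_table = defaultdict(lambda: [0.0] * len(ACTIONS))

def select_action(state, previous_action, epsilon):
    """Epsilon-greedy selection with 'inertia' on ties."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)       # exploratory move

    values = q_table[state]
    best = max(values)
    tied = [a for a, v in zip(ACTIONS, values) if v == best]
    if previous_action in tied:
        return previous_action              # inertia: keep moving the same way
    return random.choice(tied)
```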
Several ad hoc modifications were made to Q-learning to improve convergence. First, the maximum number of iterations for each trial changes depending on the average number of steps needed to solve the mazes in the recent past. For example, if the average number of steps to solve the three mazes is 170, the algorithm will halt any trial at 200 steps to prevent long trials that do not solve the maze. This technique reduces simulation time and avoids giving positive rewards for poor solutions.
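The sketch below shows one way this adaptive trial cap could be computed, assuming the cap is the recent average solution length plus a fixed margin (the 170-step average mapping to a 200-step cap suggests a margin of roughly 30 steps, but the exact rule is not given in the notes). The window size and helper names are likewise assumptions.

```python
from collections import deque

MAX_STEPS_INITIAL = 1000   # starting cap, from the notes
STEP_MARGIN = 30           # assumed margin; 170-step average -> 200-step cap
RECENT_WINDOW = 50         # assumed number of recent solved trials to average over

recent_solution_lengths = deque(maxlen=RECENT_WINDOW)

def trial_step_limit():
    """Cap each trial near the recent average solution length."""
    if not recent_solution_lengths:
        return MAX_STEPS_INITIAL
    avg = sum(recent_solution_lengths) / len(recent_solution_lengths)
    # Never exceed the initial cap; tighten it as the robots improve.
    return min(MAX_STEPS_INITIAL, int(avg) + STEP_MARGIN)

def record_trial(steps_taken, solved):
    """After each trial, remember how long successful runs took."""
    if solved:
        recent_solution_lengths.append(steps_taken)
```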
Second, the value of epsilon, the probability that the robot will take an exploratory move, is reduced as the average number of steps to solve the mazes decreases. During initial exploration, exploratory moves are useful because they help the robots find solutions to the mazes. But as the robots converge to a solution, a single random move can cause divergence. The art of configuring Q-learning in this situation is to find a good function for setting epsilon that still allows the system to converge.
Both the maximum number of trial iterations and the value of epsilon can increase and decrease as needed, depending on the average of recent solution rates.
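One way such an epsilon schedule could be tied to recent performance is sketched below: epsilon falls as the average solution length shrinks and rises again if solutions get longer. The bounds, the scaling rule, and the 1000-step reference point are assumptions; the notes do not give the actual function.

```python
EPSILON_MAX = 0.2        # assumed upper bound during early exploration
EPSILON_MIN = 0.01       # assumed floor so some exploration always remains
STEPS_AT_MAX = 1000.0    # assumed: at the initial 1000-step cap, use EPSILON_MAX

def epsilon_for(avg_recent_steps=None):
    """Scale exploration with the recent average solution length.

    Longer average solutions -> more exploration; shorter -> less.
    Because the average can grow again, epsilon can rise as well as fall.
    """
    if avg_recent_steps is None:
        return EPSILON_MAX              # no history yet: explore freely
    fraction = min(1.0, avg_recent_steps / STEPS_AT_MAX)
    return max(EPSILON_MIN, EPSILON_MAX * fraction)
```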