Self learning robots

Most robots have a task pre-programmed in their program memory. However, sometimes it is difficult to think of a good algorithm. Wouldn't it be great to have a self-learning robot in those situations? In this article we show an example of a robot learning to crawl forward using Q-learning. In the beginning the robot makes some clumsy moves, but as the robot learns, it becomes more and more proficient in crawling forward.

Self-learning algorithm
A self-learning algorithm will typically interact with its environment in the following way: A robot can interact with its environment by taking specific actions. These actions result in a change of the state and a reward (or penalty). The reward is based on the desired behaviour of the robot. We can represent this schematically in the following way:

zelf-lerende robot


In case of the crawling robot, the reward will be based on its ability to move forward in certain time step. The goal is that for each state in which the robot resides, to learn what the best possible action is, and hence get the highest possible reward. Q-learning External link is an example of a self-learning algorithm External link that makes it possible to learn the best possible action plan in a good way. For this the algorithm uses a table, the Q-table, in which for each tuple of state and action (s, a) a value Q is stored. This value represents the expected reward. The Q-table gets updated for each time step by using the following formula:

formula Q-learning

.
In this formula Q(s,a) is the Q value that belongs with the state action tuple. The learning rate alpha is always between 0 and 1 and determines the amount of learning. When this value is 0, the robot will learn nothing, whereas if the value is 1 the robot will only be sensitive for the most recent information. It is best that you experiment with several values. On top of that the Q value also depends on the received reward r of a next time step t+1 and is the highest possible future reward taken into account. With the discount factor gamma one can adjust how important future rewards are for the robot. So the Q value does not only entails the current reward, but also takes future rewards into account. Hence, the robot can take actions that on the short term do not result into a reward, but in the long term are more beneficial. Every time step t an action is taken that belongs with the state in which the robot currently is and for which the Q value is the biggest. In order to prevent that the robot gets stuck in a sub optimal solution the algorithm is enhanced with the epsilon-greedy selection method. This means that the state with the highest Q value is chosen with a probability of 1-epsilon and that in the other case a random action is performed. Also in this case it is advisable to experiment with some epsilon values (typically smaller than 0.1).

The robot
The crawling robot determines its reward based on the fact if it moved forward or not. To know in which direction the robot crawls, the robot uses a rotary encoder. The rotary encoder consists of an infra-red LED, a double photo transistor and an encoder wheel. The encoder wheel rotates together with the wheel and will block the infra-red light from time to time from the double photo transistor. The double photo transistor has two outputs on which two identical but slightly shifted versions of a periodic signal are send. This is known as a quadrature and is represented as follows:

quadrature


Out of these two signals one can identify four possible states. This allows not only to determine the number of rotations, but also the direction of the rotation. The two outputs of the double photo transistor are connected with the RB6 and RB7 pins of the micro controller. These pins have the option of generating an interrupt when there is a change of the input, so-called interrupt-on-change functionality, what we will have to use to extract the travelled distance.

And ... ACTION!
After building the robot and implementing the Q-learning algorithm, we tested the robot. For the Q-table we consider the following four actions: servo 1 a state up, servo 1 a state down, servo 2 a state up and servo 2 a state down. In order to limit the size of the Q-table, we limit the number of states to 4 stands per servo and so in total 16 different states. The final result is shown in the following movie: