AlphaGo Beginner's Guide


Everyone knows that DeepMind's AlphaGo defeated 18-time world champion Lee Sedol at the ancient Chinese game of Go, in a five-game match that began on March 9, 2016. What's fascinating is that Go has more possible board positions than there are atoms in the observable universe.

This motivated us to find out more about AlphaGo, so we decided to dive deep into how it works. We thought we would share some of the details with you!

DeepMind is a British artificial intelligence (AI) company that was founded in September 2010 as DeepMind Technologies. It was acquired by Google in 2014. DeepMind's stated goal is to "solve intelligence". You can find out more on their website, https://deepmind.com.

Coming back to AlphaGo: its defeat of professional Go champions is considered huge for AI. Like, REALLY huge. It shocked scientists, including experts in the AI community, who thought something like this was at least another decade away. A machine that learns on its own is a huge leap for technology.

DeepMind started by feeding AlphaGo a hundred thousand games, downloaded from the internet, that strong amateurs had played. The first version was designed to mimic these players. The goal, however, was to make AlphaGo strong enough to compete with top professionals. So they took this version, which had already learnt to mimic human play, and made it play against itself 30 million times using Reinforcement Learning. This means the system is not preprogrammed with a strategy; it learns from experience, improving incrementally by avoiding the errors that cost it games. By the end of this, they had a new version that could beat the old one. The reinforcement learning is model-free, which means the learner does not build an explicit model of how its environment behaves; it improves directly from the outcomes of the games it plays.

The interesting part is that, after learning from a few games, it is able to transfer that knowledge across more games.

The first version

The first version of AlphaGo used two neural networks that cooperated to choose its moves. Both are Convolutional Neural Networks (CNNs) with about a dozen layers each. CNNs are commonly used for image classification: after being trained on a labeled image dataset, a CNN can take an image as input and output class probabilities. In other words, it learns the mapping between inputs and outputs. AlphaGo treats the 19x19 Go board as the image.
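To get a feel for what a single convolutional layer does with a board, here is a minimal sketch in Python with NumPy. The function name convolve2d_same and the hand-written "neighbour counter" filter are our own illustration; in a real CNN the filter values are learned during training and many filters are stacked in layers.

```python
import numpy as np

def convolve2d_same(board, kernel):
    """Slide `kernel` over a zero-padded `board` and return a same-sized feature map."""
    size = kernel.shape[0]
    padded = np.pad(board, size // 2)          # zero padding keeps the output 19x19
    out = np.zeros(board.shape, dtype=float)
    for r in range(board.shape[0]):
        for c in range(board.shape[1]):
            # Weighted sum of the small patch of the board under the filter.
            out[r, c] = np.sum(padded[r:r + size, c:c + size] * kernel)
    return out

board = np.zeros((19, 19))
board[9, 9] = 1.0                              # a single stone in the centre

# Hand-written filter for illustration only; a trained network learns its filters.
neighbour_counter = np.array([[0, 1, 0],
                              [1, 0, 1],
                              [0, 1, 0]], dtype=float)

feature_map = convolve2d_same(board, neighbour_counter)
```

Here feature_map lights up on the four points next to the stone, which is the kind of local pattern a convolutional layer picks up before deeper layers combine such patterns into larger shapes.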

Policy Network

The first network is called the Policy Network. Its job is to take a board position as input and decide the next best move to make. DeepMind trained the Policy Network on millions of example moves made by strong human players. The goal was to replicate the choices of those players. After training, it predicted the move a strong human Go player would make up to 57% of the time. To improve on this, they used Reinforcement Learning.
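The output of a network like this can be pictured as a probability for every point on the board. The sketch below is our own simplification, not DeepMind's code: the scores are random stand-ins for a trained network's output, and "legal" here only means "not already occupied" (real Go legality also involves rules such as ko).

```python
import numpy as np

BOARD_SIZE = 19

def move_probabilities(scores, occupied):
    """Turn raw per-point scores into a probability distribution over empty points.

    scores:   (19, 19) array of raw outputs (made-up values in this sketch).
    occupied: (19, 19) boolean array, True where a stone already sits.
    """
    masked = np.where(occupied, -np.inf, scores)  # occupied points can never be chosen
    shifted = masked - masked.max()               # subtract the max for numerical stability
    weights = np.exp(shifted)                     # exp(-inf) is 0, so occupied points get 0
    return weights / weights.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=(BOARD_SIZE, BOARD_SIZE))   # stand-in for a trained network
occupied = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=bool)
occupied[3, 3] = True                                # pretend one stone is on the board

probs = move_probabilities(scores, occupied)
best_move = np.unravel_index(np.argmax(probs), probs.shape)
```

The probabilities sum to 1, the occupied point gets probability 0, and best_move is the (row, column) the stand-in scores like most.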

The network could pick a good move for a single position quickly enough, but AlphaGo needs to check thousands of possible moves before making a decision. So they built a modified version that, instead of looking at the entire 19x19 board, looks only at a smaller window around the opponent's previous move and the new move it is considering. This let it compute a next move about a thousand times faster.
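The "smaller window" idea is easy to sketch. The helper below is a hypothetical illustration (the name local_window, the radius, and the 0/1/2 stone encoding are our assumptions): it cuts out the patch of the board around a given point, so whatever evaluates the move only has to look at a handful of points instead of all 361.

```python
import numpy as np

def local_window(board, center, radius=1):
    """Return the (2*radius+1)-square patch around `center`; off-board points become -1."""
    padded = np.pad(board, radius, mode="constant", constant_values=-1)
    row, col = center
    # After padding, board point (row, col) sits at (row + radius, col + radius),
    # so the patch starts at (row, col) in the padded array.
    return padded[row:row + 2 * radius + 1, col:col + 2 * radius + 1]

board = np.zeros((19, 19), dtype=int)   # assumed encoding: 0 empty, 1 black, 2 white
board[3, 3] = 1
board[0, 0] = 2

patch = local_window(board, (3, 3), radius=1)    # 3x3 patch around the last move
corner = local_window(board, (0, 0), radius=1)   # patch that hangs over the board edge
```

A 3x3 patch holds 9 values against 361 for the full board, which gives a rough intuition for why looking locally is so much cheaper.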

Value Network

The second network is called the Value Network. It answers a different question from "what move should I play next?". Instead of suggesting a move, it estimates each player's chance of winning the game from a given board position. This provides an overall positional judgment, which lets AlphaGo classify potential future positions as good or bad. If the Value Network says a particular variation looks bad, the AI can skip reading any more moves along that line of play.
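Skipping bad variations can be sketched in a few lines. Everything here is made up for illustration: the position names, the win chances, and the 0.3 cut-off are our assumptions, and a simple lookup table stands in for the Value Network.

```python
def promising_variations(variations, estimate_win_chance, threshold=0.3):
    """Keep only the variations whose estimated win chance is at least `threshold`."""
    kept = []
    for name, position in variations:
        chance = estimate_win_chance(position)
        if chance >= threshold:          # anything below the cut-off is not read further
            kept.append((name, chance))
    # Most promising variation first.
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Toy stand-in for a value network: a table of made-up win chances.
made_up_values = {"pos_a": 0.62, "pos_b": 0.11, "pos_c": 0.48}
variations = [("attack", "pos_a"), ("retreat", "pos_b"), ("invade", "pos_c")]

worth_reading = promising_variations(variations, made_up_values.get)
```

With these numbers the "retreat" line is dropped, and the search effort goes to "attack" and "invade" instead.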

In addition to the two networks mentioned above, AlphaGo uses an algorithm called Monte Carlo tree search to help it read sequences of future moves effectively. One way to search a tree of moves is depth-first: go all the way down to the end of one branch before backtracking to the next.
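Depth-first search on a tiny made-up tree looks like this. The tree and its labels are purely illustrative; each key maps a position to the follow-up moves available from it.

```python
def depth_first_order(tree, node):
    """Visit `node`, then fully explore each child before moving on to the next one."""
    order = [node]
    for child in tree.get(node, []):
        order.extend(depth_first_order(tree, child))
    return order

# A tiny made-up tree of move sequences.
toy_tree = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
}

visit_order = depth_first_order(toy_tree, "root")
# visit_order is ["root", "A", "A1", "A2", "B", "B1"]
```

Notice that everything under "A" is finished before "B" is even looked at. On a Go-sized tree that fixed order is the problem: a great move under "B" may never be reached in time.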

Another way is breadth-first, which examines every position at one level before going deeper, but it is memory intensive. Monte Carlo tree search instead scatters the order in which the tree is searched. This minimizes the chance that a very promising part of the tree, one we could have discovered early, stays hidden while we slog through the search in a prescribed order.
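Go is far too big for a short example, so here is a minimal sketch of Monte Carlo tree search on a toy game we picked for illustration: a pile of stones, each player takes 1, 2, or 3, and whoever takes the last stone wins. The class and function names, the exploration constant 1.4, and the iteration count are all our assumptions. This is plain MCTS with random playouts; AlphaGo additionally guides the search with its neural networks.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left; the player to move takes 1, 2, or 3
        self.parent = parent
        self.move = move          # the move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who made `move`

def ucb_child(node, c=1.4):
    """Pick the child balancing a high win rate against being rarely visited."""
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(stones, rng):
    """Play random moves to the end; True if the player to move at `stones` wins."""
    if stones == 0:
        return False              # the previous player took the last stone
    mover_is_start = True
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return mover_is_start
        mover_is_start = not mover_is_start

def best_move(stones, iterations=3000, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down through fully expanded nodes.
        while not node.untried and node.children:
            node = ucb_child(node)
        # 2. Expansion: add one new child for a move not tried yet.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: finish the game with random moves.
        mover_won = not rollout(node.stones, rng)
        # 4. Backpropagation: flip the winner's perspective at each level.
        while node is not None:
            node.visits += 1
            if mover_won:
                node.wins += 1
            mover_won = not mover_won
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

suggested = best_move(5)
```

In this toy game the winning idea is to leave your opponent a multiple of four stones, so from 5 stones the search should settle on taking 1. The point to notice is that the search spends most of its visits on the branches that keep winning, rather than plodding through the tree in a fixed order.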

The latest version

AlphaGo Zero still uses Monte Carlo tree search, but instead of using a separate Policy Network (to select the next move to play) and Value Network (to predict the winner of the game), it integrates both into a single neural network that evaluates positions. Unlike previous versions, which were trained on human games, Zero skips that step and learns by playing against itself, starting from completely random play.
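The "one network, two outputs" idea can be sketched as a shared body with two heads. This is only a toy with random, untrained weights and a single small hidden layer; the sizes, the +1/-1 stone encoding, and the function name evaluate are our assumptions, and the real network is far larger and convolutional.

```python
import numpy as np

rng = np.random.default_rng(0)
N_POINTS = 19 * 19
HIDDEN = 64   # arbitrary size chosen for this sketch

# Randomly initialised, untrained weights (illustrative only).
w_trunk = rng.normal(scale=0.1, size=(N_POINTS, HIDDEN))
w_policy = rng.normal(scale=0.1, size=(HIDDEN, N_POINTS + 1))  # +1 for a pass move
w_value = rng.normal(scale=0.1, size=(HIDDEN, 1))

def evaluate(board):
    """Return (move_probabilities, value) for a board, both from one shared body."""
    features = np.tanh(board.reshape(-1) @ w_trunk)   # shared representation
    logits = features @ w_policy
    exp = np.exp(logits - logits.max())               # stable softmax
    policy = exp / exp.sum()                          # policy head: distribution over moves
    value = float(np.tanh(features @ w_value)[0])     # value head: score between -1 and 1
    return policy, value

board = np.zeros((19, 19))
board[3, 3], board[15, 15] = 1.0, -1.0   # assumed encoding: +1 own stone, -1 opponent stone

policy, value = evaluate(board)
```

Both answers, "where to play" and "who is winning", come out of the same features, so whatever the shared body learns from self-play benefits both heads at once.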

And you know what? After three days of training, Zero beat the version of AlphaGo that had defeated 18-time world champion Lee Sedol by 100 games to 0, and after 40 days it outperformed the later version that had defeated the world number one, Ke Jie.

Many people ask, "Is this a cause for alarm?" I guess only the future can answer that.

The makers aim to apply the algorithms behind the software to healthcare and science, speeding up breakthroughs in those areas by helping human experts achieve more.

For the technical details behind the original approach, refer to https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf


