What is reinforcement learning?

2022-01-26 23:29:53 Huawei cloud developer community

Abstract: This article tries to explain reinforcement learning in an easy-to-understand way and contains no formulas.

This article is shared from the Huawei cloud community post 《On Reinforcement Learning》, author: yanghuaili.

Machine learning can be roughly divided into three research areas: supervised learning, unsupervised learning, and reinforcement learning (RL). Supervised learning is the best known of the three; common tasks such as image classification, face recognition, and regression all belong to it. In short, supervised learning builds a model from given input-label pairs in order to predict the labels of new inputs. Unsupervised learning, as the name suggests, needs no labels during the learning phase (often because labeling is expensive or the labeling criteria are fuzzy, so labels cannot be obtained); the most typical scenario is clustering. Reinforcement learning is quite different from both: an agent interacts with an environment and learns a policy such that acting on that policy maximizes the expected return. That one sentence probably still leaves reinforcement learning unclear, so the rest of this post tries to explain it in an easy-to-understand way, without a single formula.

Suppose you travel back to the Three Kingdoms era and become a general under Zhuge Wuhou (Zhuge Liang), prime minister of Shu. The prime minister gives you an army of 10,000 men and sets you a goal: march into the state of Wei, storm its cities and strongholds, take as many cities as possible, and ideally capture the capital of Wei. Then you think: I have never actually fought a war, so how can I win? Use supervised learning? Then you would need plenty of rich, real battle cases to learn from, plus the military classics. But the battlefield changes by the minute; how could any book cover everything? And if you meet an enemy who does not play by the usual rules, the books are useless. Use unsupervised learning, then? You smile bitterly at the thought; you might as well go read the military classics. So you frown and scratch your head, at a loss for what to do. Then it occurs to you: reinforcement learning is the only way! So you spread out a sheet of paper and start organizing your thoughts……

  • Environment: What you face on the battlefield, such as the surrounding terrain, the enemy's position, the size of the enemy army, the enemy commander, and your own position. This is the information on which you base your decisions. Your decisions in turn change the environment: if you decide to advance 1 kilometer, the enemy will respond by, for example, changing position; if you decide to burn down the trees blocking your view, the terrain information changes as well;

  • Agent: That is you;

  • Action: The decision you take in response to the environment, such as advancing or burning the trees that block your view, as described above;

  • Action space: The set of all actions you can take. It can be continuous, discrete, or a mix of both. Continuous actions include how far to move and in which direction. Discrete actions include attacking, camping, defending, or retreating; how many units to split the army into; and whether to make a frontal assault, envelop from both flanks, or lay an ambush. In short, the action space is every decision you could possibly make given the environment;

  • Policy: Given the state of the environment, the probability with which you take each possible action. This may be a little hard to grasp, so here is an example. Sima Yi's way of fighting is very different from Zhang Fei's. The prime minister's empty city stratagem succeeded because Sima Yi followed a cautious policy; had Zhang Fei faced the empty city, he might well have charged straight in and captured the prime minister alive. That is a difference in policy. As another example, if Zhang San studied Sun Tzu's Art of War and Li Si studied the Testament of Wu Mu, they will take different actions in the same battlefield environment. A policy is usually denoted by π;

  • State: The specific situation of the environment at a given moment or stage. Take the empty city stratagem as an example: the prime minister faced the enemy general Sima Yi bearing down on him with 150,000 troops, while he himself sat in a city with only 2,500 soldiers. It was in this state that the prime minister decided on the empty city stratagem. A state is usually denoted by S;

  • State transition probability: The probability that, after an action is taken in a given state, the current state transitions to each possible next state. In the state of being attacked by Sima Yi, the prime minister took the action of the empty city stratagem. How the environment responds (that is, which state it moves to) depends mainly on the enemy general Sima Yi, who in this setting is part of the environment. Sima Yi's possible next moves include attacking, sending scouts to investigate, besieging without attacking, and retreating. In the end he retreated, and the state became "Sima Yi has withdrawn". Sima Yi's cautious character made retreat highly probable, but that does not mean he could not have done something else; had he besieged the city without attacking, the prime minister would have faced a different state;

  • Reward: A number that quantifies the benefit of taking a given action in a given state. In the empty city stratagem, our side was heavily outnumbered, so the more of our men we preserved, the greater the reward. The prime minister's possible actions included closing the gates and defending, marching out to meet the enemy, and the empty city stratagem. Marching out would likely get the whole army wiped out, for a reward of zero. Closing the gates would hold for a while before the city eventually fell, for a slightly higher reward. The empty city stratagem had a high probability of preserving the whole army, so the prime minister chose it;

  • Sequential decision problem: A problem that cares about the final benefit of many rounds of decisions and plays down the size of any single reward. The empty city stratagem is a special case in which a single round of decision-making settles everything, but on a real battlefield decisions must be made continually in response to the enemy's moves in order to reach the ultimate goal of defeating the enemy. An example of giving up a single reward to maximize the long-term benefit is using part of the army as bait: that detachment is sacrificed in exchange for the final benefit of annihilating the enemy. The sixteen-character formula of the Chinese Workers' and Peasants' Red Army, "The enemy advances, we retreat; the enemy camps, we harass; the enemy tires, we attack; the enemy retreats, we pursue", is also a guide to sequential decision-making in war;
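To make the bait example a little more concrete, here is a minimal Python sketch. The two plans and their per-round rewards are invented purely for illustration (they do not come from any real battle or from the original post); the point is only that the plan with the best first-round reward is not necessarily the plan with the best total.

```python
# Hypothetical per-round rewards for two plans; the numbers are made up
# for illustration only.
PLANS = {
    "hold_every_round": [1.0, 1.0, 1.0],
    # Round 1 sacrifices the bait detachment, round 3 annihilates the enemy.
    "sacrifice_bait_then_ambush": [-3.0, 0.0, 10.0],
}


def total_return(rewards):
    """Sum of the rewards over all rounds of one plan."""
    return sum(rewards)


def best_first_step(plans):
    """Plan that looks best if only the first round's reward is considered."""
    return max(plans, key=lambda name: plans[name][0])


def best_overall(plans):
    """Plan that looks best when the rewards of all rounds are added up."""
    return max(plans, key=lambda name: total_return(plans[name]))


print(best_first_step(PLANS))  # short-sighted choice
print(best_overall(PLANS))     # choice that maximizes the long-term benefit
```

With these made-up numbers, the short-sighted choice is to hold, while the long-term choice is to sacrifice the bait, which is exactly the trade-off a sequential decision problem is about.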

Having summarized these reinforcement learning concepts, you feel that the problem of waging war really can be solved with reinforcement learning, and you are excited. But these are only concepts related to reinforcement learning. How is reinforcement learning actually done? This leads to two important concepts: the Q value and the V value.

  • V value: The expected sum of rewards that an agent in a given state will collect from that state until the final state. For example, in war both sides fight over strategic locations that are easy to attack from and easy to defend. Once we hold such a location, we are in a state that favors the whole campaign, and our cumulative reward through the end of the war is larger. In other words, the V value of the state in which we hold the strategic location is higher, and the V values of other states are relatively lower. How do both sides know that the expected total return is higher in this state? In a game we could replay the scene over and over: start from this state and repeat the trial a great many times, taking different actions with different probabilities each time, until the war ends, and then average the results to compute the V value. Real war allows no such experiments; both sides know because history has supplied so many cases that no further experiment is needed. Of course, the V value depends on the agent: starting from the same state, agents that follow different policies have different V values (just think: starting from the same state, you and the prime minister would surely end up with very different results);

  • Q value: The expected sum of rewards that an agent will collect until the final state after taking a particular action in a given state. In the empty city stratagem, for example, the prime minister faced the situation of the moment and chose the empty city stratagem; its Q value is the expected total return from the moment he adopted it until the end of the war.
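The "replay the scene many times in a game" idea for the V value can be sketched in Python as follows. This is only a toy under stated assumptions: the states (`pass`, `plain`, `end`), the actions, and every probability and reward below are hypothetical, and the averaging is plain Monte Carlo sampling rather than any particular library's method.

```python
import random

# Hypothetical policy: state -> {action: probability of taking it}.
POLICY = {
    "pass": {"hold": 0.7, "advance": 0.3},
    "plain": {"hold": 0.4, "advance": 0.6},
}

# Hypothetical environment:
# (state, action) -> list of (probability, next_state, reward).
TRANSITIONS = {
    ("pass", "hold"): [(1.0, "end", 5.0)],
    ("pass", "advance"): [(0.5, "plain", 0.0), (0.5, "end", 2.0)],
    ("plain", "hold"): [(1.0, "end", 1.0)],
    ("plain", "advance"): [(0.5, "end", 8.0), (0.5, "end", -4.0)],
}


def run_episode(start, rng):
    """Play one war from `start` until the final state; return the summed rewards."""
    state, total = start, 0.0
    while state != "end":
        # The agent picks an action according to its policy.
        actions = list(POLICY[state])
        weights = [POLICY[state][a] for a in actions]
        action = rng.choices(actions, weights=weights)[0]
        # The environment picks the next state according to the transition probabilities.
        outcomes = TRANSITIONS[(state, action)]
        _, state, reward = rng.choices(outcomes, weights=[p for p, _, _ in outcomes])[0]
        total += reward
    return total


def estimate_v(start, episodes=20000, seed=0):
    """Average the total reward over many replays starting from `start`."""
    rng = random.Random(seed)
    return sum(run_episode(start, rng) for _ in range(episodes)) / episodes


print(estimate_v("pass"))
```

For these made-up numbers the exact V value of `pass` works out to 4.04, so the printed estimate should land close to that. Changing the probabilities in `POLICY` changes the estimate, which mirrors the remark that the V value depends on the agent's policy.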

Q values and V values can be computed from each other. Suppose the V value of every state is known and we want the Q value of taking action a in state S. We then also need the state transition probabilities. Say that if the prime minister adopts the empty city stratagem, the possible next states and their probabilities are: (1) Sima Yi attacks, probability 0.1; (2) Sima Yi besieges without attacking, probability 0.2; (3) Sima Yi retreats, probability 0.7. Then the Q value of the empty city stratagem is the reward for the stratagem itself plus the probability-weighted sum of the V values of these three states. Conversely, suppose the Q value of every state-action pair is known and we want a V value. We then also need the probabilities with which the policy takes each action in that state. Say we want the V value of the state just before the empty city stratagem, and the prime minister can take three actions: (1) march out to meet the enemy, probability 0.1; (2) defend the city, probability 0.4; (3) the empty city stratagem, probability 0.5. Since the Q values are known, the V value is the probability-weighted sum of the Q values of these three actions.
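The two weighted sums described above can be written out as a short Python sketch. The probabilities (0.1, 0.2, 0.7 and 0.1, 0.4, 0.5) are the ones from the text; the immediate reward, the V values of the next states, and the Q values of the other two actions are hypothetical numbers chosen only so that the arithmetic can be followed.

```python
def q_from_v(reward, transition_probs, v_next):
    """Q value: immediate reward plus probability-weighted V values of the next states."""
    return reward + sum(p * v_next[s] for s, p in transition_probs.items())


def v_from_q(action_probs, q_values):
    """V value: probability-weighted Q values of the actions the policy may take."""
    return sum(p * q_values[a] for a, p in action_probs.items())


# Probabilities taken from the text.
EMPTY_CITY_TRANSITIONS = {"attacks": 0.1, "besieges": 0.2, "retreats": 0.7}
PRIME_MINISTER_POLICY = {"march_out": 0.1, "defend_city": 0.4, "empty_city": 0.5}

# Hypothetical numbers, assumed only for this example.
V_NEXT = {"attacks": -10.0, "besieges": -2.0, "retreats": 10.0}
REWARD_EMPTY_CITY = 0.0

# 0.0 + 0.1 * (-10) + 0.2 * (-2) + 0.7 * 10, roughly 5.6
q_empty_city = q_from_v(REWARD_EMPTY_CITY, EMPTY_CITY_TRANSITIONS, V_NEXT)

# Q values of the other two actions are also hypothetical.
Q_VALUES = {"march_out": -10.0, "defend_city": -4.0, "empty_city": q_empty_city}

# 0.1 * (-10) + 0.4 * (-4) + 0.5 * 5.6, roughly 0.2
v_before = v_from_q(PRIME_MINISTER_POLICY, Q_VALUES)

print(q_empty_city, v_before)
```

Because of floating-point rounding the printed values may differ from 5.6 and 0.2 in the last digits.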

This post has briefly introduced some concepts related to reinforcement learning. If you spot a mistake, corrections in the comments are welcome.

Copyright notice
Author: Huawei cloud developer community. Please include the original link when reprinting, thank you.