
[Reinforcement Learning] A summary of multi-armed bandit algorithms in the gym environment




Contents

Problem description:

Implementation steps:

1. Setting up the environment

2. The epsilon-greedy algorithm

3. Softmax exploration (the Boltzmann exploration algorithm)

4. The upper confidence bound (UCB) algorithm

5. The Thompson sampling algorithm

References:


Problem description:

        The multi-armed bandit (MAB) problem is a classic problem in reinforcement learning. A multi-armed bandit is essentially a slot machine, a gambling game played in casinos: you pull an arm (lever) and receive a payout (reward) drawn from a probability distribution that is unknown to the player.

        Our goal is to find out, over a sequence of pulls, which arm yields the largest total payoff, i.e., to maximize the cumulative reward.
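
        As a rough illustration (my own sketch, not part of the original article), the snippet below simulates a toy 3-armed bandit with made-up Gaussian reward means; always pulling the arm with the highest true mean maximizes the expected cumulative reward. The means, the seed, and the number of pulls are arbitrary assumptions.

import numpy as np

#hypothetical reward means for a toy 3-armed bandit (arbitrary assumption)
true_means = [0.2, 1.0, 0.5]
rng = np.random.default_rng(0)

def pull(arm):
    #the reward is drawn from a Gaussian centred on the arm's true mean
    return rng.normal(true_means[arm], 1.0)

#pulling the best arm every round gives the largest expected cumulative reward
cumulative_reward = sum(pull(np.argmax(true_means)) for _ in range(1000))
print(cumulative_reward)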

Implementation steps:

1. Setting up the environment

 pip3 install gym_bandits

import gym
import gym_bandits
import numpy as np
env = gym.make("BanditTenArmedGaussian-v0")

print(env.action_space.n)
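
        As a quick sanity check (my own addition, using the same Gym step interface as the code below), you can reset the environment and pull one randomly chosen arm:

#reset the environment before the first pull
env.reset()

#pull a random arm and inspect the reward
observation, reward, done, info = env.step(env.action_space.sample())
print(reward)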

2. The epsilon-greedy algorithm

        In the epsilon-greedy strategy, with probability epsilon we pick an arm at random (exploration); otherwise we pick the arm with the highest estimated reward so far (exploitation).


'''initialize all variables'''
#number of rounds
num_rounds = 20000
#count of number of times an arm was pulled
count = np.zeros(10)
#sum of rewards of each arm
sum_rewards = np.zeros(10)

#q value is the average reward
Q = np.zeros(10)

#define epsilon_greedy function
def epsilon_greedy(epsilon):
    rand = np.random.random()
    if rand < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q)

    return action

#start pulling arm
for i in range(num_rounds):
    #select the arm using epsilon greedy
    arm = epsilon_greedy(0.5)
    #get the reward
    observation,reward,done,info = env.step(arm)

    #update the count of that arm
    count[arm] += 1
    #sum the reward
    sum_rewards[arm] += reward

    #calculate Q value which is the average rewards of the arm
    Q[arm] = sum_rewards[arm]/count[arm]
print('the optimal arm is {}'.format(np.argmax(Q)))
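
        A fixed epsilon of 0.5 keeps exploring half of the time even after the estimates have settled. A common refinement, not used in the code above, is to decay epsilon as the rounds progress; a minimal sketch:

#decay epsilon from 1.0 towards a small floor as the round index i grows
def decayed_epsilon(i, floor=0.01):
    return max(floor, 1.0 / (i + 1))

#usage inside the loop above: arm = epsilon_greedy(decayed_epsilon(i))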

3. Softmax exploration (the Boltzmann exploration algorithm)

        In softmax exploration, we select each arm with a probability given by the Boltzmann distribution over the current Q values.
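
        Concretely, the probability of choosing arm a is the Boltzmann (softmax) distribution over the current Q values with temperature tau, which is exactly what the code below computes:

$$P(a) = \frac{\exp\left(Q(a)/\tau\right)}{\sum_{b}\exp\left(Q(b)/\tau\right)}$$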

import math
import random

''' in softmax exploration, we select an arm based on a probability from
the Boltzmann distribution'''
#define the softmax function
def softmax(tau):
    total = sum(math.exp(val/tau) for val in Q)
    probs = [math.exp(val/tau) /total for val in Q]
    threshold = random.random()
    cumulative_prob = 0.0
    for i in range(len(probs)):
        cumulative_prob += probs[i]
        if (cumulative_prob > threshold):
            return i
    return np.argmax(probs)

#re-initialize the statistics so this run does not reuse the epsilon-greedy results
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

#beginning
for i in range(num_rounds):
    #select the arm using softmax
    arm = softmax(0.5)

    #get the reward
    observation,reward,done,info = env.step(arm)

    #update the count of arm
    count[arm] += 1

    #sum the rewards
    sum_rewards[arm] += reward

    #calculate Q value 
    Q[arm] = sum_rewards[arm]/count[arm]

print("the optimal arm is {}".format(np.argmax(Q)))

4. The upper confidence bound (UCB) algorithm

        In this algorithm we add a confidence bonus to each arm's estimated value, so arms that looked poor early on (often simply because they were pulled only a few times) still get tried in later rounds. The upper confidence bound algorithm is also described as optimism in the face of uncertainty.
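
        Concretely, once every arm has been pulled at least once, the code below scores arm a with a UCB1-style bound, where N is the total number of pulls so far and N(a) is the number of pulls of arm a, and then picks the arm with the largest score:

$$\mathrm{UCB}(a) = Q(a) + \sqrt{\frac{2\ln N}{N(a)}}$$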

'''
1. Select the action (arm) that has a high sum of average reward and upper
confidence bound
2. Pull the arm and receive a reward
3. Update the arm's reward and confidence bound

'''
#define the upper confidence bound function
def UCB(iters):
    ucb = np.zeros(10)
    #explore all the arm
    #play each of the 10 arms once before applying the UCB rule
    if iters < 10:
        return iters
    else:
        for arm in range(10):
            #calculate upper bound
            upper_bound = math.sqrt((2*math.log(sum(count))) / count[arm])

            #add upper bound to the Q value
            ucb[arm] = Q[arm] + upper_bound
        #return the arm which has maximum value
        return (np.argmax(ucb))

#re-initialize the statistics for the UCB run
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

#beginning
for i in range(num_rounds):
    #select the arm using UCB
    arm = UCB(i)

    #get the reward
    observation,reward,done,info = env.step(arm)

    #update the count
    count[arm] += 1

    #sum the rewards
    sum_rewards[arm] += reward

    #calculate Q value
    Q[arm] = sum_rewards[arm] /count[arm]

print("the optimal arm is {}".format(np.argmax(Q)))

5. The Thompson sampling algorithm

        Thompson sampling is a probabilistic algorithm based on a prior distribution over each arm's reward: we sample from each arm's distribution, pull the arm whose sample is largest, and then use the observed reward to update that arm's distribution.
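
        In this implementation each arm keeps a Beta(alpha, beta) distribution. Because the Gaussian bandit returns continuous rewards, the code below counts a positive reward as a success and any other reward as a failure, so the posterior update after pulling arm a is:

$$\alpha_a \leftarrow \alpha_a + 1 \quad (\text{reward} > 0), \qquad \beta_a \leftarrow \beta_a + 1 \quad (\text{otherwise})$$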

'''
1. Sample a value from each of the k distributions and use this value as a prior
mean.
2. Select the arm that has the highest prior mean and observe the reward.
3. Use the observed reward to modify the prior distribution.

'''
#initialize alpha and beta value
alpha = np.ones(10)
beta  = np.ones(10)

#define the thompson_sampling function
def thompson_sampling(alpha,beta):
    samples = [np.random.beta(alpha[i] +1,beta[i] +1) for i in range(10)]

    return np.argmax(samples)


#re-initialize the statistics for the Thompson sampling run
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

#beginning
for i in range(num_rounds):
    #select the arm using Thompson sampling
    arm = thompson_sampling(alpha,beta)

    observation,reward,done,info = env.step(arm)

    count[arm] += 1

    sum_rewards[arm] += reward

    Q[arm] = sum_rewards[arm] /count[arm]

    #treat a positive reward as a success and any other reward as a failure,
    #so the Gaussian rewards can update the Beta distribution
    if reward > 0:
        alpha[arm] += 1
    else:
        beta[arm] += 1
print('the optimal arm is {}'.format(np.argmax(Q)))

References:

Hands-On Reinforcement Learning with Python: Master Reinforcement and Deep Reinforcement Learning Using OpenAI Gym and TensorFlow, Sudharsan Ravichandiran (Packt Publishing)

Source: https://blog.csdn.net/dannnnnnnnnnnn/article/details/122772611
