"""
This module defines several classes to select actions (ActionSelectors).

An ActionSelector takes a list of interests (e.g., Q-Values) and the time
step, and returns a single identifier, which is considered the selected action.
ActionSelectors address the exploration-exploitation dilemma.

We consider 2 selectors:

- the Epsilon-Greedy selector selects the action with the maximum interest
  with a `(1-ε)` probability, e.g., 95%. Otherwise, it selects a random action.
- the Boltzmann selector applies a Boltzmann distribution over the interests.
  Interests that are close to each other yield similar probabilities, and
  higher interests yield higher probabilities. The distribution is controlled
  by a Boltzmann temperature, such that low interests can still yield
  significant probabilities.
"""

from random import random, randrange, choices

import numpy as np

    def choose(self, interests, step):
        if random() < self.epsilon:
            # Exploration: pick a random unit
            action_idx = randrange(0, len(interests))
        else:
            # Exploitation: pick the unit with the maximal Q-Value
            action_idx = np.argmax(interests)
        return action_idx

class BoltzmannActionSelector(ActionSelector):
    """Implements the Boltzmann policy."""

    def choose(self, values, step):
        # Boltzmann decision process
        # First, compute tau (τ)
        if self.tau_decay:
            tau = self.initial_tau * (self.tau_decay_coeff ** step)
            # Clamp τ so that the division below never uses a value near zero
            tau = max(tau, 0.01)
        else:
            tau = self.initial_tau
        # Then, compute the weight for each value: exp(Q[s,a] / τ)
        indices = np.arange(len(values))
        weights = [np.exp(values[i] / tau) for i in indices]
        return choices(indices, weights=weights, k=1)[0]
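
The following is a minimal, standalone sketch of the Boltzmann weighting described in the module docstring, meant only to illustrate how the temperature shapes the probabilities. The helper names `boltzmann_probabilities` and `decayed_tau` are hypothetical and not part of this module. The sketch also subtracts the largest interest before exponentiating, which is an assumption on our side rather than what `BoltzmannActionSelector.choose` does: the subtraction does not change the resulting distribution, but it keeps every exponent at or below zero, so large interests combined with a small τ cannot overflow.

```python
import math


def boltzmann_probabilities(values, tau):
    """Return the Boltzmann (softmax) probabilities of `values` at temperature `tau`."""
    if tau <= 0:
        raise ValueError("tau must be strictly positive")
    # Shifting by the maximum leaves the distribution unchanged, but every
    # exponent becomes <= 0, so math.exp cannot overflow.
    highest = max(values)
    weights = [math.exp((v - highest) / tau) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def decayed_tau(initial_tau, decay_coeff, step, floor=0.01):
    """Exponentially decayed temperature, clamped at `floor` (0.01 as in the selector)."""
    return max(initial_tau * (decay_coeff ** step), floor)


interests = [1.0, 2.0, 3.0]
for tau in (10.0, 1.0, 0.1):
    probs = boltzmann_probabilities(interests, tau)
    # A high tau gives a nearly uniform distribution; a low tau
    # concentrates almost all of the probability on the highest interest.
    print(tau, [round(p, 3) for p in probs])
```

The returned probabilities can be passed as `weights` to `random.choices`, in the same way the selector passes its unnormalized weights.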