algorithms.qsom.qsom.QSOM¶
- class algorithms.qsom.qsom.QSOM(env: SmartGrid, hyper_parameters: dict | None = None)[source]¶
Bases: Model
The Q-SOM learning algorithm: based on Q-Learning + Self-Organizing Maps.
Two SOMs are used: a State-SOM that learns to handle the observation (state) space, i.e., to map continuous observations to discrete state identifiers; and an Action-SOM that learns to handle the action space, i.e., to map discrete action identifiers to continuous action parameters.
A Q-Table learns the interests of (discrete) actions in (discrete) states.
List of hyperparameters that this model expects (see the example dictionary after this list):
- initial_tau
Initial value of the Boltzmann temperature, which controls the exploration-exploitation trade-off.
- tau_decay
Whether to decrease (decay) the Boltzmann temperature over the time steps, so as to encourage exploitation rather than exploration in later time steps. See also tau_decay_coeff below.
- tau_decay_coeff
Coefficient of reduction of the Boltzmann temperature at each step, if the decay is enabled. Applied multiplicatively to the current tau each time step, i.e., tau = tau * tau_decay_coeff.
- noise
The noise parameter that controls the random distribution when perturbing actions. The higher the noise, the more the action will be perturbed (i.e., the farther it will be from its original, unperturbed version).
- sigma_state
Size of the neighborhood for the State-SOM.
- sigma_action
Size of the neighborhood for the Action-SOM.
- lr_state
Learning rate for the State-SOM.
- lr_action
Learning rate for the Action-SOM.
- q_learning_rate
Learning rate for the Q-Table.
- q_discount_factor
The discount factor (gamma) controls the horizon of expected rewards: the higher it is, the more the agent takes into account future states, and the rewards that can be expected from them, when determining its policy. If set to 0, the agent simply maximizes the current expected reward (greedy policy).
- update_all
Whether to update all Q-Values (Smith’s optimization) at each step. This speeds up the learning of interests.
- use_neighborhood
Whether to use the State- and Action-SOMs neighborhoods when updating the Q-Values.
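For illustration, a hyper_parameters dictionary using these keys might look like the sketch below; the values are hypothetical placeholders rather than recommended settings (the default_hyperparameters attribute listed further down presumably holds the model's own defaults).

    # Hypothetical hyper-parameter values, for illustration only.
    hyper_parameters = {
        "initial_tau": 0.5,         # initial Boltzmann temperature
        "tau_decay": True,          # decay the temperature over time steps
        "tau_decay_coeff": 0.999,   # tau = tau * tau_decay_coeff at each step
        "noise": 0.1,               # magnitude of the action perturbation
        "sigma_state": 1.0,         # neighborhood size of the State-SOM
        "sigma_action": 1.0,        # neighborhood size of the Action-SOM
        "lr_state": 0.8,            # learning rate of the State-SOM
        "lr_action": 0.7,           # learning rate of the Action-SOM
        "q_learning_rate": 0.7,     # learning rate of the Q-Table
        "q_discount_factor": 0.9,   # gamma, horizon of expected rewards
        "update_all": True,         # update all Q-Values (Smith's optimization)
        "use_neighborhood": True,   # use SOM neighborhoods in Q-Value updates
    }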
- __init__(env: SmartGrid, hyper_parameters: dict | None = None)[source]¶
Create a Model, i.e., an entrypoint for the learning algorithm.
- Parameters:
env – The environment that the learning algorithm will interact with. This is useful for, e.g., accessing the agents' observation and action spaces, knowing the number of agents, etc. Note that a Wrapper can also be used, such as a RewardAggregator.
hyper_parameters – An optional dictionary of hyper-parameters that control the creation of the learning agents, e.g., the learning rate to use. The hyper-parameters themselves are specific to the implemented Model.
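A minimal creation sketch, assuming a SmartGrid environment instance env has already been built elsewhere (its construction is out of scope here) and reusing the hyper_parameters dictionary sketched above:

    from algorithms.qsom.qsom import QSOM

    # `env` is assumed to be an existing SmartGrid instance, possibly wrapped
    # (e.g., in a RewardAggregator); its construction is not shown here.
    model = QSOM(env, hyper_parameters=hyper_parameters)

    # Omitting hyper_parameters presumably falls back to the model's defaults
    # (see the default_hyperparameters attribute below).
    model_with_defaults = QSOM(env)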
Methods
__init__(env[, hyper_parameters])
Create a Model, i.e., an entrypoint for the learning algorithm.
backward(new_obs_per_agent, reward_per_agent)
Make each agent learn, based on their rewards and observations.
forward(obs_per_agent)
Choose an action for each agent, based on their observations.
get_optimal_actions(obs_per_agent)
Return the actions that are considered optimal, for each agent.
Attributes
default_hyperparameters
- _assert_known_agents(required_agents_names: Iterable[str])[source]¶
Internal method checking that we can handle (at least) the required agents.
If the env sends observations (or rewards) about an unknown agent (i.e., no QsomAgent is registered for this name), we cannot handle it.
- Parameters:
required_agents_names – The agents’ names that are required, i.e., present in either the environment’s observations or rewards.
Note
We silently ignore agents that are known but no longer present in the env, to support (potential) future use cases, such as agent termination.
- backward(new_obs_per_agent, reward_per_agent)[source]¶
Make each agent learn, based on their rewards and observations.
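A minimal training-loop sketch chaining forward() and backward(); num_steps, get_observations() and get_rewards() are hypothetical placeholders standing in for however the SmartGrid environment actually exposes per-agent observations and rewards, and only the QSOM calls are taken from this page:

    obs_per_agent = get_observations(env)          # hypothetical helper
    for step in range(num_steps):
        # Choose an (exploratory) action for each agent from its observation.
        actions_per_agent = model.forward(obs_per_agent)
        # ... apply actions_per_agent in the environment (API not shown) ...
        new_obs_per_agent = get_observations(env)  # hypothetical helper
        reward_per_agent = get_rewards(env)        # hypothetical helper
        # Make each agent learn from its reward and new observation.
        model.backward(new_obs_per_agent, reward_per_agent)
        obs_per_agent = new_obs_per_agent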
- get_optimal_actions(obs_per_agent: Dict[AgentID, ObsType]) Dict[AgentID, ActionType] [source]¶
Return the actions that are considered optimal, for each agent.
In other terms, this method ensures exploitation, whereas the forward() method trades off exploration and exploitation. It can be useful after the training phase, for testing purposes.
- Parameters:
obs_per_agent – A dictionary mapping agents' names to their observations, exactly as in forward().
- Returns:
A dict mapping each agent to its action. Actions have the same structure as in forward(), but they are produced with exploitation as the only goal, i.e., by selecting the action that should yield the best reward.
Warning
By default, to ensure that all models have this method, it simply returns the same actions as forward(). Models that make a distinction between exploration and exploitation should override it.
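For instance, a post-training evaluation sketch could call get_optimal_actions() instead of forward() to obtain purely greedy actions; obs_per_agent is assumed to be the same kind of observations dictionary as the one passed to forward():

    # Exploitation-only action selection, e.g., during a testing phase.
    greedy_actions = model.get_optimal_actions(obs_per_agent)
    for agent_name, action in greedy_actions.items():
        # Each entry maps an agent's name to the action deemed best by the model.
        print(agent_name, action)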