smartgrid.wrappers.reward_aggregator.RewardAggregator

class smartgrid.wrappers.reward_aggregator.RewardAggregator(env: SmartGrid)[source]

Bases: ABC, SmartGrid

Wraps the multi-objective env into a single-objective by aggregating rewards.

The smartgrid.environment.SmartGrid environment supports multiple reward functions; its SmartGrid.step() method returns a dict of dictionaries, one dict for each agent, containing the rewards indexed by their reward function’s name. However, most Reinforcement Learning algorithms expect a scalar reward, or in this case, a dict of scalar rewards, one for each agent.

Classes that extend RewardAggregator bridge this gap by aggregating (scalarizing) the multiple rewards into a single one.
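
For illustration, a minimal sketch of a concrete aggregator and of how it might wrap an environment; the MeanAggregator name, the agent names, and the reward function names are hypothetical and serve only as an example:

    # Hypothetical sketch: a concrete aggregator that averages each agent's rewards.
    # MeanAggregator and the names below are illustrative, not part of the library.
    from smartgrid.wrappers.reward_aggregator import RewardAggregator

    class MeanAggregator(RewardAggregator):
        """Scalarize each agent's rewards by taking their mean."""

        def reward(self, rewards):
            return {
                agent_name: sum(agent_rewards.values()) / len(agent_rewards)
                for agent_name, agent_rewards in rewards.items()
            }

    # env = SmartGrid(...)            # constructed elsewhere
    # wrapped = MeanAggregator(env)
    # Unwrapped step():  rewards == {'agent_0': {'fct1': 0.8, 'fct2': 0.4}, ...}
    # Wrapped step():    rewards == {'agent_0': 0.6, ...}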

__init__(env: SmartGrid)[source]

Methods

__init__(env)

action_space(agent_name)

Return the action space of a specific agent, identified by its name.

close()

Closes the rendering window.

get_agent(agent_name)

Return an agent from its name (ID).

observation_space(agent_name)

Return the observation space of a specific agent, identified by its name.

render([mode])

Render the current state of the simulator to the screen.

reset([seed, options])

Reset the SmartGrid to its initial state.

reward(rewards)

Transform multi-objective rewards into single-objective rewards.

state()

Returns the state.

step(actions)

Advance the simulation to the next step.

Attributes

agents

The list of agents' names contained in the environment (world).

max_num_agents

metadata

num_agents

The number of agents currently living in the environment.

observation_shape

The shape, i.e., number of dimensions, of the observation space.

unwrapped

observation_manager

The observation manager, responsible for creating observations each step.

max_step

The maximum number of steps allowed in the environment (or None by default).

reward_calculator

The RewardCollection, responsible for determining agents' rewards each step.

world

The simulated world in which the SmartGrid exists.

possible_agents

observation_spaces

action_spaces

_get_info(rewards: Dict[AgentID, Dict[str, float]]) Dict[AgentID, Dict[str, Any]]

Return additional information on the world (for the current time step).

The information currently contains only the rewards, for each agent.

Parameters:

rewards – The dictionary of rewards, one for each agent. As multiple reward functions can be used, rewards are represented as dictionaries themselves, indexed by the reward function’s name.

Returns:

A dictionary of additional information, indexed by the agent’s name. Each element is itself a dictionary that currently contains only the agent’s reward, indexed by 'reward'.

_get_obs() Dict[AgentID, ObsType]

Determine the observations for all agents.

Returns:

A dictionary of observations for each agent, indexed by the agent’s name. Each observation is a dataclass containing all (global and local) metrics. Global and local observations can also be obtained through the get_global_observation() and get_local_observation() methods.

_get_reward() Dict[AgentID, Dict[str, float]]

Determine the reward for each agent.

Rewards describe the degree to which the agent’s action was appropriate, w.r.t. moral values. These moral values are encoded in the reward function(s); see smartgrid.rewards for more details on them.

Reward functions may comprise multiple objectives. In such cases, they can be aggregated so that the result is a single float (which is used by most of the decision algorithms). This behaviour (whether to aggregate, and how to aggregate) is controlled by an optional wrapper; see RewardAggregator for details.

Returns:

A dictionary of rewards, one element per agent. The element itself is a dict which contains at least one reward, indexed by the reward’s name.

_np_random: np.random.Generator

The pseudo-random number generator (PRNG), for reproducibility.

It should usually not be accessed by the user, and must be passed down to elements of the SmartGrid (e.g., World) that need it. The generator is set by the reset() method, optionally with a specific seed.

action_space(agent_name: AgentID) Space

Return the action space of a specific agent, identified by its name.

Parameters:

agent_name – The name of the desired Agent. It must correspond to an existing Agent in the current World, i.e., an agent in the agents list.

Returns:

An instance of gymnasium.spaces.Box indicating the number of dimensions of actions (parameters), as well as the low and high bounds for each dimension.

property agents: List[AgentID]

The list of agents’ names contained in the environment (world).

Warning

As per the PettingZoo API, and contrary to what the name suggests, this returns the agents’ names (IDs), not the agents themselves. Please see get_agent() to get an Agent from its name.
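
A small illustrative snippet, assuming env is an already-constructed SmartGrid (or wrapped) environment:

    for agent_name in env.agents:          # names (IDs), not Agent objects
        agent = env.get_agent(agent_name)  # retrieve the corresponding Agent instance
        print(agent_name, agent)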

close()

Closes the rendering window.

get_agent(agent_name: AgentID) Agent

Return an agent from its name (ID).

Parameters:

agent_name – The name of the requested agent.

max_step: int | None

The maximum number of steps allowed in the environment (or None by default).

As the environment is not episodic, it does not have a way to terminate (i.e., agents cannot “solve” their task nor “die”). The maximum number of steps is a way to limit the simulation and force the environment to terminate. In practice, it simply determines the truncated return value of step(). This return value, in turn, acts as a signal for the external interaction loop. By default, or when set to None, truncated will always be False, which means that the environment can be used forever.
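
As an illustration, a sketch of how truncated can signal the end of the simulation; how max_step is configured, and where actions come from, are assumptions here:

    # Assuming the environment was created with a finite max_step, and `actions`
    # was produced by the agents' policies.
    obs_n, reward_n, terminated_n, truncated_n, info_n = env.step(actions)
    if all(truncated_n.values()):
        # The maximum number of steps was reached; restart the simulation.
        obs_n, info_n = env.reset()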

property num_agents: int

The number of agents currently living in the environment.

observation_manager: ObservationManager

The observation manager, responsible for creating observations each step.

Can be configured (extended) to return different observations.

property observation_shape

The shape, i.e., number of dimensions, of the observation space.

observation_space(agent_name: AgentID) Space

Return the observation space of a specific agent, identified by its name.

Parameters:

agent_name – The name of the desired Agent. In practice, it does not impact the result, as all Agents use the same observation space.

Returns:

An instance of gymnasium.spaces.Box indicating the number of dimensions of an observation, as well as the low and high bounds for each dimension.

render(mode='text')

Render the current state of the simulator to the screen.

Note

No rendering has been configured for now. Metrics’ values can be observed directly through the object returned by step().

Parameters:

mode – Not used

Returns:

None

reset(seed: int | None = None, options: Dict | None = None) Tuple[Dict[AgentID, ObsType], Dict[AgentID, Dict[str, Any]]]

Reset the SmartGrid to its initial state.

This method will call the reset method on the internal objects, e.g., the World, the Agents, etc. Despite its name, it must be used first and foremost to get the initial observations.

Parameters:
  • seed – An optional seed (int) to configure the random generators and ensure reproducibility. Note: this does not change the global generators (Python random and NumPy np.random). SmartGrid components must rely on the _np_random attribute.

  • options – An optional dictionary of arguments to further configure the simulator. Currently unused.

Returns:

A tuple containing the observations and additional information for the first (initial) time step, in this order. There is no additional information in the current version, but an empty dict is still returned to be coherent with the base API. The observations are a dictionary indexed by agents’ names, containing the initial observation of each agent in the World.
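
A minimal usage sketch (the seed value is arbitrary):

    obs_n, info_n = env.reset(seed=42)
    # obs_n: {agent_name: observation, ...} for every agent in the World
    # info_n: empty dicts in the current version, kept for coherence with the base API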

abstract reward(rewards: Dict[AgentID, Dict[str, float]]) Dict[AgentID, float][source]

Transform multi-objective rewards into single-objective rewards.

Parameters:

rewards – A dict mapping each learning agent to its rewards. The rewards are themselves represented as dicts (a dict of dicts), containing one or several rewards, indexed by their reward function’s name, e.g., { 'fct1': 0.8, 'fct2': 0.4 }.

Returns:

A dict mapping each agent to its scalar reward. The rewards are scalarized from the agents’ dict of rewards.
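
A sketch of a possible concrete implementation using a weighted sum; the WeightedSumAggregator name, the weights, and the reward function names are assumptions for illustration:

    from smartgrid.wrappers.reward_aggregator import RewardAggregator

    class WeightedSumAggregator(RewardAggregator):
        """Scalarize each agent's rewards as a weighted sum."""

        def __init__(self, env, weights):
            super().__init__(env)
            self.weights = weights            # e.g., {'fct1': 0.7, 'fct2': 0.3}

        def reward(self, rewards):
            return {
                agent_name: sum(self.weights.get(name, 0.0) * value
                                for name, value in agent_rewards.items())
                for agent_name, agent_rewards in rewards.items()
            }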

reward_calculator: RewardCollection

The RewardCollection, responsible for determining agents’ rewards each step.

This environment has (partial) support for multi-objective use-cases, i.e., multiple reward functions can be used at the same time. The RewardCollection holds all these functions and computes the rewards for all functions, and for all agents, at each time step. The result contains multiple rewards for each agent, which can be scalarized to a single reward per agent by using a wrapper over this environment. See the reward_aggregator module for details.

state() ndarray

Return the global state of the environment.

The state is a global view of the environment, appropriate for centralized-training decentralized-execution methods such as QMIX.

step(actions: Dict[AgentID, ActionType]) Tuple[Dict[AgentID, ObsType], Dict[AgentID, float], Dict[AgentID, bool], Dict[AgentID, bool], Dict[AgentID, Dict[str, Any]]][source]

Advance the simulation to the next step.

This method takes the actions decided by the agents (learning algorithms) and sends them to the World so it can update itself based on these actions. Then, the method computes the new observations and rewards, and returns them so that agents can decide their next action.

Parameters:

actions – The dictionary of actions, indexed by the agent’s name, where each action is a vector of parameters that must be coherent with the agent’s action space.

Returns:

A tuple containing information about the next (new) state:

  • obs_n: A dict that contains the observations about the next state; please see _get_obs() for details about the dict contents.

  • reward_n: A dict containing the rewards for each agent; please see _get_reward() for details about its content.

  • terminated_n: A dict of boolean values indicating, for each agent, whether the agent is “terminated”, e.g., completed its task or failed. Currently, always set to False: agents cannot complete nor fail (this is not an episodic environment).

  • truncated_n: A dict of boolean values indicating, for each agent, whether the agent should stop acting, because, e.g., the environment has run out of time. See max_step for details.

  • info_n: A dict containing additional information about the next state, please see _get_info() for details about its content.
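
A sketch of one simulation step; actions are sampled at random purely for illustration:

    actions = {name: env.action_space(name).sample() for name in env.agents}
    obs_n, reward_n, terminated_n, truncated_n, info_n = env.step(actions)
    # With a RewardAggregator wrapper, reward_n maps each agent to a single float;
    # without it, each value is a dict of rewards indexed by reward function name.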

world: World

The simulated world in which the SmartGrid exists.

The world is responsible for handling all agents and “physical” interactions between the smart grid elements.