Introduction
A partially observable Markov decision process, or POMDP (Kaelbling, Littman, and Cassandra 1998), is a tuple \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \tau, \rho, \omega \rangle, where the elements \mathcal{S}, \mathcal{A}, \tau, \rho describe a classical MDP:
| Element | Description |
|---|---|
| \mathcal{S} | The set of all states of the environment. |
| \mathcal{A} | The set of all actions available to an agent (implicitly constrained by the state, which can be implemented in the agent’s policy). |
| \tau(s'; a, s) | The state transition function, giving the conditional probability of transitioning to state s' given previous state s and action a, i.e., \tau(s'; a, s) = Pr(s' \mid a, s). |
| \rho | The reward function. |
| \mathcal{O} | The set of observations that the agent can experience. |
| \omega(o; a, s') | The observation function, giving the probability of observing o after taking action a and ending up in state s': \omega(o; a, s') = Pr(o \mid a, s'). If O is conditionally independent of A given S', then \omega(o; s') = Pr(o \mid s'). |
Since states are only partially observed, the agent maintains a belief about which state the environment is in. A belief state – denoted bel – is a probability distribution over \mathcal{S}. The belief state is updated each time the agent takes an action a, transitions to a new state s', and receives an observation o:
bel(s' \mid o,a) = \frac{\omega(o; a, s')\sum_{s \in \mathcal{S}}\tau(s';a,s)bel(s)}{Pr(o|a,bel)}
The normalization term is a summation over the values of S': Pr(o|a,bel) = \sum_{s' \in \mathcal{S}} \omega(o; a, s')\sum_{s \in \mathcal{S}}\tau(s';a,s)bel(s)
It is common to define \eta = Pr(o|a,bel)^{-1} and write
bel(s'\mid o,a) = \eta\Big[{\omega(o; a, s')\sum_{s \in \mathcal{S}}\tau(s';a,s)bel(s)}\Big] \tag{1}
This updating procedure is also known in robotics as Bayes filtering (Thrun 2002). In this field, this procedure is broken down into two steps:
- Prediction, in which the robot uses its knowledge about the world to form a belief about the current state
- Correction, in which the robot accounts for what it is observing to arrive at a better estimate.
The prediction step may be written as \overline{bel}(s') = \sum_{s \in \mathcal{S}}\tau(s';a,s)bel(s). The correction step then becomes bel(s') = \eta \omega(o;s') \overline{bel}(s'). This division of belief updating into two computational steps will become useful when we consider various experimental settings.
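To make the two steps concrete, here is a minimal Python sketch of a tabular Bayes filter. The array layout (tau[s, a, s'] and omega[o, a, s']) and the use of NumPy are illustrative choices on my part rather than anything prescribed above.

```python
# A minimal sketch of discrete Bayes filtering, assuming tabular tau and omega.
import numpy as np

def predict(bel, tau, a):
    """Prediction step: bel_bar(s') = sum_s tau(s'; a, s) * bel(s)."""
    # tau has shape (|S|, |A|, |S'|); bel has shape (|S|,)
    return tau[:, a, :].T @ bel

def correct(bel_bar, omega, a, o):
    """Correction step: bel(s') = eta * omega(o; a, s') * bel_bar(s')."""
    # omega has shape (|O|, |A|, |S'|)
    unnormalized = omega[o, a, :] * bel_bar
    return unnormalized / unnormalized.sum()  # division by Pr(o | a, bel)
```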
The two steps of Bayes filtering map nicely onto a common description of Bayesian posterior inference in terms of a product between the prior and the likelihood. Here, the prior probability of some state s' is computed from the distribution over previous states and the state transition function. It tells us the probability of s' without considering any direct evidence for that state. The likelihood of s' is evaluated against the observation (also called measurement) function \omega. This term “fine-tunes” (i.e., corrects) the prior probability by weighing it against what is being observed.
As long as we specify \omega and \tau, we know how to describe belief updating (but not necessarily how to compute it). Specifications of these functions may reflect our assumptions about what the agent knows or should know about its environment (\tau) and what it can observe (\omega).
We shall assume that agents represent \tau as an internal world model approximating reality. Whereas the true state transitions are constrained solely by reality, the internal model predicts state transitions based on uncertain beliefs about states and knowledge of how various elements of the world state interact, i.e., “how the world works”. Abstractly, we can think of \tau as having a particular parametric form that constrains how s' gets assigned a probability measure based on a and s. We will symbolize the parameters of \tau with \theta. It is natural to also assume that agents may be more or less certain about the particular values of these parameters, and accordingly maintain a belief about them.
Urn task
In a single trial of the urn task (Van Lieshout et al. 2018), the subject is told that a random ball was drawn from an urn containing balls of two colors presented schematically. The schematic informs the subject of the proportion of colors in the urn (e.g., 15/20 or 75% red). The color of the drawn ball is hidden. The subject can request information in order to reveal its color.
In terms of POMDP, we can define the following sets and random variables:
| Set | Random variable |
|---|---|
| \mathcal{S} = {noball, red, blue} | S_t : \mathcal{S} \mapsto {0, 1, 2} |
| \mathcal{A} = {idle, seek, skip} | A_t : \mathcal{A} \mapsto {0, 1, 2} |
| \mathcal{O} = {none, red, blue} | O_t : \mathcal{O} \mapsto {0, 1, 2} |
Transitions between states are governed by \tau(s_{t+1}; a_t, s_t), which can be represented as a table. Note that the full table contains 21 rows1. For brevity, I only include transitions with non-zero probabilities:
| s_t | a_t | s_{t+1} | Pr(s_{t+1}) |
|---|---|---|---|
| noball | idle | red | \theta |
| noball | idle | blue | 1 - \theta |
| red | idle | red | 1 |
| blue | idle | blue | 1 |
| red | seek | red | 1 |
| blue | seek | blue | 1 |
| red | skip | red | 1 |
| blue | skip | blue | 1 |
Thus, if the agent is idle in the (starting) no-ball state at t, they expect to transition to a state in which either a red or a blue ball was drawn at t+1. After that, the state remains unchanged until the episode (the trial) terminates.
The observation function \omega(o_{t+1}; a_t, s_{t+1}) can also be represented as a table. Again, only non-zero observation probabilities are shown (the full table can be found in the footnotes2):
| s_{t+1} | a_t | o_{t+1} | Pr(o_{t+1}) |
|---|---|---|---|
| red | idle | none | 1 |
| blue | idle | none | 1 |
| red | seek | red | 1 |
| blue | seek | blue | 1 |
| red | skip | none | 1 |
| blue | skip | none | 1 |
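For concreteness, the two tables can be encoded as arrays. The sketch below assumes the index order S = {noball: 0, red: 1, blue: 2}, A = {idle: 0, seek: 1, skip: 2}, O = {none: 0, red: 1, blue: 2} implied by the tables above, and an example value of \theta (75% red, as in the schematic example).

```python
import numpy as np

theta = 0.75  # e.g., 15/20 red balls

# tau[s, a, s'] = Pr(s' | a, s)
tau = np.zeros((3, 3, 3))
tau[0, 0, 1] = theta       # noball --idle--> red
tau[0, 0, 2] = 1 - theta   # noball --idle--> blue
for a in (0, 1, 2):        # red/blue states persist under any action
    tau[1, a, 1] = 1.0
    tau[2, a, 2] = 1.0

# omega[o, a, s'] = Pr(o | a, s')
omega = np.zeros((3, 3, 3))
omega[0, 0, :] = 1.0       # 'idle' -> observe 'none' in any state
omega[0, 2, :] = 1.0       # 'skip' -> observe 'none' in any state
omega[1, 1, 1] = 1.0       # 'seek' in the red state -> observe 'red'
omega[2, 1, 2] = 1.0       # 'seek' in the blue state -> observe 'blue'
# ('seek' and 'skip' are not available in the 'noball' state, so those entries are never used)
```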
Belief updating
Before information
On a given trial, the environment always starts in the ‘noball’ state and then transitions automatically to a red- or a blue-ball state. Upon this transition, the agent receives no informative observation. They do, however, make a prediction:
\overline{bel}(s_{t+1} \mid a_t) = \sum_{s_t \in \mathcal{S}} \tau(s_{t+1}; a_t, s_t) bel(s_t)
which simplifies to
\overline{bel}(s_{t+1} \mid a_t) = \begin{cases} \theta & \mathrm{if} ~ s_{t+1} = \mathrm{red} \\ 1 - \theta & \mathrm{if} ~ s_{t+1} = \mathrm{blue} \\ \end{cases}
because s_0 is always ‘noball’ and \tau(s_{t+1}; \mathrm{idle}, \mathrm{noball}) is \theta for s_{t+1}=\mathrm{red} and 1-\theta for s_{t+1}=\mathrm{blue}. The full belief update is therefore:
bel(s_{t+1} \mid \mathrm{idle}, \mathrm{none}) = \frac{\omega(\mathrm{none}; \mathrm{idle}, s_{t+1}) \overline{bel}(s_{t+1} \mid a_t)}{\sum_{\tilde{s}_{t+1} \in \mathcal{S}}\omega(\mathrm{none}; \mathrm{idle}, \tilde{s}_{t+1}) \overline{bel}(\tilde{s}_{t+1} \mid a_t)}
which also simplifies substantially, due to \omega(\mathrm{none}; \mathrm{idle}, \tilde{s}_{t+1}) = 1 for any \tilde{s}_{t+1}. Furthermore, the normalizing constant \sum_{\tilde{s}_{t+1} \in \mathcal{S}} \overline{bel}(\tilde{s}_{t+1} \mid a_t) = \theta + (1 - \theta) = 1. Thus:
bel(s_{t+1} \mid a_t, \mathrm{none}) = \begin{cases} \theta & \mathrm{if} ~ s_{t+1} = \mathrm{red} \\ 1 - \theta & \mathrm{if} ~ s_{t+1} = \mathrm{blue} \\ \end{cases}
After information
After arriving at an uncertain state at t+1, the agent can consider any action a_{t+1} \in \mathcal{A}. These actions do not change the state (i.e., \forall a_{t+1}, \tau(S_{t+2}=s_{t+1}; a_{t+1}, s_{t+1}) = 1). However, they influence the observation that the agent receives.
The agent can resolve the uncertainty completely by seeking information and observing the color of the ball. The prediction step is computed as:
Note: [P] is the Iverson bracket, evaluating to 1 if the proposition inside the bracket is true and 0 otherwise. The transition function from t+1 to t+2 is equivalent to an Iverson bracket [s_{t+2} = s_{t+1}] because the state does not change in response to any agent action. In practice, this means that the prediction \overline{bel}(s_{t+2}|\cdot) = bel(s_{t+1}| \cdot, \cdot).
\begin{align*} \overline{bel}(s_{t+2} \mid \mathrm{seek}) & = \sum_{s_{t+1} \in \mathcal{S}} \tau(s_{t+2}; \mathrm{seek}, s_{t+1})bel(s_{t+1}) \\ \\ & = \sum_{s_{t+1} \in \mathcal{S}} [s_{t+2} = s_{t+1}]bel(s_{t+1}) \\ \\ & = bel(s_{t+2}) \end{align*}
and the correction step removes uncertainty:
\begin{align*} bel(s_{t+2} \mid \mathrm{seek}, o_{t+2}) & = \eta\omega(o_{t+2}; \mathrm{seek}, s_{t+2})\overline{bel}(s_{t+2} \mid \mathrm{seek}) \\ \\ & = \frac{[o_{t+2} = s_{t+2}]\overline{bel}(s_{t+2} \mid \mathrm{seek})}{\sum_{\tilde{s}_{t+2} \in \mathcal{S}}[o_{t+2} = \tilde{s}_{t+2}]\overline{bel}(\tilde{s}_{t+2} \mid \mathrm{seek})} \\ \\ & = \begin{cases} 1 & ~ \mathrm{if} ~ s_{t+2} = o_{t+2} \\ 0 & ~ \mathrm{otherwise} \end{cases} \end{align*}
where [o_{t+2} = s_{t+2}] is another Iverson bracket, corresponding to the observation function that always returns veridical information about the actual state.
The agent can also remain idle or explicitly forego information and remain uncertain. In this case, the belief remains unchanged (the prediction + correction simplify the same way as in the first belief update; shown here for ‘skip’, the ‘idle’ case is identical):
bel(s_{t+2} \mid \mathrm{skip}, \mathrm{none}) = \begin{cases} \theta & \mathrm{if} ~ s_{t+2} = \mathrm{red} \\ 1 - \theta & \mathrm{if} ~ s_{t+2} = \mathrm{blue} \\ \end{cases}
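The derivations above can be checked numerically. The sketch below assumes \theta = 0.75 and stores beliefs as vectors over (noball, red, blue); it reproduces the prediction, the belief after seeking (with a red observation), and the belief after skipping.

```python
import numpy as np

theta = 0.75

# Prediction after 'idle' in 'noball': bel_bar(red) = theta, bel_bar(blue) = 1 - theta
bel_bar = np.array([0.0, theta, 1 - theta])

# Correction with o = 'none' (omega = 1 for every state): belief unchanged
bel_t1 = bel_bar / bel_bar.sum()                 # [0.0, 0.75, 0.25]

# 'seek' at t+1 with o = 'red': the Iverson-bracket likelihood collapses the belief
likelihood_red = np.array([0.0, 1.0, 0.0])       # omega(red; seek, s')
bel_t2_seek = likelihood_red * bel_t1
bel_t2_seek /= bel_t2_seek.sum()                 # [0.0, 1.0, 0.0]

# 'skip' (or 'idle') at t+1 with o = 'none': uninformative, belief stays put
likelihood_none = np.array([1.0, 1.0, 1.0])      # omega(none; skip, s')
bel_t2_skip = likelihood_none * bel_t1
bel_t2_skip /= bel_t2_skip.sum()                 # [0.0, 0.75, 0.25]
```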
Curiosity
Van Lieshout et al. (2018) hypothesized curiosity to depend on information gain, E[IG] (see Gureckis 2021). Since seeking information resolves the uncertainty about the hidden state, the agent can expect to reduce as much uncertainty as there is in the initial, uncertain prediction about the hidden state:
H(S_{t+1}) = -\sum_{s_{t+1} \in \mathcal{S}} \overline{bel}(s_{t+1}) \log \overline{bel}(s_{t+1})
We can formalize the particular hypothesis about curiosity by defining the probabilistic policy to seek information as a logistic function of entropy:
\pi(A_t=\mathrm{seek} \mid \theta) = \frac{1}{1 + e^{-H(S_t \mid \theta)}}
That is, the probability to seek information depends (logistically) on the \theta-informed belief about the hidden state of the environment.
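As a quick illustration, here is a sketch of this policy for a couple of \theta values (the natural logarithm and the example values are my own choices):

```python
import numpy as np

def seek_probability(theta):
    p = np.array([theta, 1 - theta])           # belief over {red, blue}
    entropy = -np.sum(p * np.log(p))           # H(S | theta), in nats
    return 1.0 / (1.0 + np.exp(-entropy))      # logistic link

print(seek_probability(0.5))    # maximal uncertainty -> highest seek probability
print(seek_probability(0.95))   # near-certain outcome -> lower seek probability
```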
Note that since the agent is given the value of \theta at the beginning of the trial, all the uncertainty about the hidden state is aleatoric. That is, the uncertainty can only be reduced by observing the value of the state itself. Aleatoric uncertainty contrasts with epistemic uncertainty, which can be reduced by past observations. An intuitive example of aleatoric uncertainty is the uncertainty about the outcome of a coin flip – we can’t know it unless we flip the coin. Epistemic uncertainty, in turn, is the uncertainty about the fairness of the coin – we can reduce it by observing a series of coin flips. Some experimental paradigms go beyond inducing aleatoric uncertainty, which allows for testing hypotheses different from what Van Lieshout et al. (2018) considered. E.g., instead of proposing that information-seeking is driven by aleatoric uncertainty, one could propose that it is driven by epistemic uncertainty.
The definition of the policy \pi above assumes that the agent is already motivated by uncertainty. However, policies that drive decision processes are often learned through trial-and-error. While some theories of curiosity explain its mechanism in terms of “wanting” of information, others might explain how that drive arises in the first place. In reinforcement learning, the propensity to perform an action is modulated by that action’s reward history contextualized by some state. In other words, motivation to act a certain way in a certain state reflects how much that action (or similar actions) in that state tends to result in a reward. Thus, one way in which an agent can become an intrinsically motivated information-seeker is by being rewarded for seeking information in an uncertain state.
Looking task
In a landmark study, Kidd, Piantadosi, and Aslin (2012) introduced a task similar to the urn task described above, but one that involved both epistemic and aleatoric uncertainty. The authors also proposed an early cognitive model of curiosity based on both of these sources of uncertainty. Later on, F. Poli et al. (2020) used a similar paradigm while assessing additional mechanistic proposals of curiosity. Since both research groups studied infant curiosity measured via looking behavior, I refer to the task used in these studies as the looking task.
In a looking task, an agent (infant) observes multiple events sampled from the same underlying distribution. In experiment 1 from Kidd, Piantadosi, and Aslin (2012), an opaque screen repeatedly moved up and down, revealing and hiding a scene that did or did not contain a toy. The presence of the toy was a binary event drawn from a Bernoulli distribution with an unknown but learnable parameter \theta. In experiment 2 from Kidd, Piantadosi, and Aslin (2012; and also in F. Poli et al. 2020), the events were drawn from a categorical distribution with a vector parameter \boldsymbol{\theta}. For participants, this variable appeared as the location of a spontaneously appearing object on the scene.
Although the study by Kidd, Piantadosi, and Aslin (2012) preceded Van Lieshout et al. (2018), we may think of the former as extending the latter in two directions when viewed as a POMDP. First, the looking task hides not only the state (toy present vs toy absent), but also the parameter of the transition function. If we translated this extension back into the urn task, we would make the urn opaque, so that participants don’t even know the initial distribution of red and blue balls. To enable the agent to learn this hidden parameter, the looking task lets it observe multiple samples from the same distribution. Translated into the urn setting, this is equivalent to letting the agent draw (with replacement) several balls from the same urn and observe their colors. The second extension is the increased number of events (experiment 2 in Kidd, Piantadosi, and Aslin 2012; F. Poli et al. 2020). This is like increasing the number of colors in the urn.
We can also formalize the looking task in terms of POMDP. Let’s start with a simpler toy-no-toy setting (Kidd, Piantadosi, and Aslin 2012, experiment 1):
\mathrm{ITI} stands for inter-trial interval and corresponds to the state in which the opaque screen is down (hiding the presence/absence of the toy). In settings with more than one event, it corresponds to a neutral state from which the next state is uncertain.
| Set | Random variable |
|---|---|
| \mathcal{S} = {ITI, notoy, toy} | S_t : \mathcal{S} \mapsto {0, 1, 2} |
| \mathcal{A} = {look, lookaway} | A_t : \mathcal{A} \mapsto {0, 1} |
| \mathcal{O} = {screen, notoy, toy} | O_t : \mathcal{O} \mapsto {0, 1, 2} |
As before, we can represent the transition function as a table. Since there are only two actions and the ‘lookaway’ action terminates the episode, the table is small enough to inspect in full:
| s_t | a_t | s_{t+1} | Pr(s_{t+1}) |
|---|---|---|---|
| ITI | look | ITI | 0 |
| ITI | look | notoy | 1 - E[\Theta_t] |
| ITI | look | toy | E[\Theta_t] |
| notoy | look | ITI | 1 |
| notoy | look | notoy | 0 |
| notoy | look | toy | 0 |
| toy | look | ITI | 1 |
| toy | look | notoy | 0 |
| toy | look | toy | 0 |
This (subjective) transition function describes a process by which the ‘ITI’ state is always expected to transition to either a ‘toy’ or a ‘notoy’ state, which then goes back to the ‘ITI’ state, and so on. Note that here, the probability that the toy is present is no longer known with certainty, as it was in the urn task. Instead, this probability is estimated as the expected value of \Theta_t, indexed by time because it changes as observations accumulate.
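As a sketch, the ‘look’ slice of this transition function can be written as a small matrix-valued function of the current expectation E[\Theta_t] (state order assumed: ITI = 0, notoy = 1, toy = 2):

```python
import numpy as np

def tau_look(e_theta):
    """Rows index s_t, columns index s_{t+1}, for a_t = 'look'."""
    return np.array([
        [0.0, 1.0 - e_theta, e_theta],  # from ITI: toy appears with prob E[Theta_t]
        [1.0, 0.0, 0.0],                # from notoy: back to ITI
        [1.0, 0.0, 0.0],                # from toy: back to ITI
    ])

print(tau_look(0.5))
```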
The observation function looks as follows:
| s_{t+1} | a_t | o_{t+1} | Pr(o_{t+1}) |
|---|---|---|---|
| ITI | look | ITI | 1 |
| ITI | look | notoy | 0 |
| ITI | look | toy | 0 |
| notoy | look | ITI | 0 |
| notoy | look | notoy | 1 |
| notoy | look | toy | 0 |
| toy | look | ITI | 0 |
| toy | look | notoy | 0 |
| toy | look | toy | 1 |
The state is fully observable as long as the agent looks at it.
Belief updating
Let’s see how beliefs change as the agent keeps looking (A_t = \mathrm{look} = a). The episode always starts from ‘ITI’, so the first prediction that the agent makes does not involve a sum over initial states:
\begin{align*} \overline{bel}(s_1 \mid a) & = \tau(s_1; a, s_0) \\ & = \begin{cases} E[\Theta_0] & ~ \mathrm{if} ~ s_1=\mathrm{toy} \\ 1-E[\Theta_0] & ~ \mathrm{if} ~ s_1=\mathrm{notoy} \end{cases} \end{align*}
As assumed in a model proposed by Kidd, Piantadosi, and Aslin (2012), \Theta_t \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}_t), where \boldsymbol{\alpha}_t is a hyperparameter vector defining a Dirichlet distribution. In the simpler toy-no-toy setting, the distribution of \Theta_t becomes a beta distribution and \boldsymbol{\alpha}_t = [\alpha_t, \beta_t]^\top, i.e.:
\Theta_t \sim \mathrm{beta}(\alpha_t, \beta_t)
The presence of the toy, S_1, follows a Bernoulli distribution, which is conjugate with the beta distribution. This conjugacy makes the predictive distribution of S_1 very simple (see Arnold 2018):
\overline{bel}(s_1 \mid a) = \begin{cases} \frac{\alpha_0}{\alpha_0+\beta_0} & \mathrm{if} ~ s_1 = \mathrm{toy} \\ \frac{\beta_0}{\alpha_0+\beta_0} & \mathrm{if} ~ s_1 = \mathrm{notoy} \\ \end{cases}
The prediction step is followed by correction, which, after working through a derivation similar to the one in Section 2.1.2, can be written as:
\begin{align*} bel(s_1 \mid a, o_1) & = \frac{\omega(o_1; a, s_{1}) \overline{bel}(s_1 \mid a)}{\sum_{\tilde{s}_1 \in \mathcal{S}}\omega(o_1; a, \tilde{s}_1) \overline{bel}(\tilde{s}_1 \mid a)} \\ \\ & = [s_1 = o_1] \end{align*}
Upon reducing the uncertainty from \overline{bel}(s_1 \mid a) to bel(s_1 \mid a, o_1), the agent can update its belief about the hidden parameter \theta.
At this point, it is useful to introduce the notion of episodic memory (\mathcal{D}) which we will formally define as the set of observations. We assume that the agent starts with an empty memory (\mathcal{D}_t = \{\}) and augments it with observations at each time step:
\mathcal{D}_{t+1} = \mathcal{D}_{t} \cup \{o_t\}
Episodic memory is used to condition the distribution of \Theta_t to get a posterior belief. Again, thanks to the beta-Bernoulli conjugacy, computing the posterior expectation is straightforward. For example, after making the first observation o_1, the expected value of the hidden parameter on the next time step becomes:
\begin{align*} E[\Theta_2 | \mathcal{D}_2] & = \frac{\alpha + [o_1 = \mathrm{toy}]}{\alpha + [o_1 = \mathrm{toy}] + \beta + [o_1 = \mathrm{notoy}]} \\ \\ & = \frac{\alpha + [o_1 = \mathrm{toy}]}{\alpha + \beta + 1} \end{align*}
since [o_1 = \mathrm{toy}] + [o_1 = \mathrm{notoy}] = 1. For a general time step:
\begin{align*} E[\Theta_t | \mathcal{D}_t] & = \frac{\alpha + \sum_{o \in \mathcal{D}_t}[o = \mathrm{toy}]}{\alpha + \sum_{o \in \mathcal{D}_t}[o = \mathrm{toy}] + \beta + \sum_{o \in \mathcal{D}_t}[o = \mathrm{notoy}]} \\ \\ & = \frac{\alpha + c_{\mathrm{toy},t}}{\alpha + \beta + (t-1)} \end{align*}
if we let c_{s,t} = \sum_{o \in \mathcal{D}_t}[o = s]. Note that the (t-1) term results from summing over the (t-1) observations of either toy or no-toy state.
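Here is a small sketch of this count-based posterior expectation, assuming a uniform beta(1, 1) prior and a made-up episodic memory (illustrative values, not from the original studies):

```python
alpha, beta = 1.0, 1.0                    # prior hyperparameters
memory = ["toy", "notoy", "toy", "toy"]   # episodic memory D_t after four looks

c_toy = sum(o == "toy" for o in memory)
c_notoy = sum(o == "notoy" for o in memory)

# E[Theta_t | D_t] = (alpha + c_toy) / (alpha + beta + (t - 1))
e_theta = (alpha + c_toy) / (alpha + beta + c_toy + c_notoy)
print(e_theta)  # posterior expectation that the next event contains the toy
```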
Finally, we can also generalize to settings like experiment 2 from Kidd, Piantadosi, and Aslin (2012) or the study by F. Poli et al. (2020), where there are more than two events other than the ‘ITI’ state (e.g., a toy appears in one of three locations). The probability of a transition from s_{t-1} = \mathrm{ITI} to some non-‘ITI’ state s_t = s becomes a marginal expectation:
\begin{align*} E[S_{t}=s|\mathcal{D}_t] & = \frac{\alpha_s + \sum_{o \in \mathcal{D}_t}[o = s]}{\sum_{\tilde s} \left(\alpha_{\tilde s} + \sum_{o \in \mathcal{D}_t}[o = \tilde s]\right)} \\ \\ & = \frac{\alpha_s + c_{s,t}}{\sum_{\tilde s} \left(\alpha_{\tilde s} + c_{\tilde{s},t}\right)} \end{align*}
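The same computation for K > 2 events, again with hypothetical location names, prior values, and counts:

```python
from collections import Counter

alphas = {"left": 1.0, "middle": 1.0, "right": 1.0}   # hypothetical prior over locations
memory = ["left", "left", "right", "left", "middle"]  # episodic memory D_t

counts = Counter(memory)                              # c_{s,t} for each state s
total = sum(alphas.values()) + len(memory)            # sum_s (alpha_s + c_{s,t})
e_state = {s: (alphas[s] + counts[s]) / total for s in alphas}
print(e_state)  # marginal expectation for each non-'ITI' state
```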
Curiosity
In a setting with two or more events, the agent keeps track of a random vector \boldsymbol{\Theta}_t \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}_t), with \boldsymbol{\alpha}_t = [{\alpha}_{1,t}, ..., {\alpha}_{K,t}], where K is the number of non-‘ITI’ states. Kidd, Piantadosi, and Aslin (2012) defined the probability density function of this random vector as
Note that Kidd, Piantadosi, and Aslin (2012) wrote the exponent of {\theta}_{i,t} as {{\alpha}_0 + c_{i,t} - 1} (i.e., {\alpha}_{i,t} = {\alpha}_0 + c_{i,t}), which makes explicit the parameter {\alpha}_0 of the Dirichlet PDF.
p(\boldsymbol{\theta}_t | \mathcal{D}_t) = B(\boldsymbol{\alpha}_t)^{-1} \prod_{i=1}^K {\theta}_{i,t}^{{\alpha}_{i,t} - 1}
However, the quantity of interest was the marginal expectation of the observed non-‘ITI’ state s at time t:
E[S_{t}=s|\mathcal{D}_t] = \frac{\alpha_s + c_{s,t}}{\sum_{\tilde s} \left(\alpha_{\tilde s} + c_{\tilde{s},t}\right)}
Specifically, Kidd, Piantadosi, and Aslin (2012) defined the complexity of the observed state s as the Shannon information of that event:
I_{t,s} = I(S_t=s) = -\log E[S_{t}=s|\mathcal{D}_t]
and predicted the probability of looking away to be a non-monotonic function of I_{t,s}. Specifically, they defined a proportional hazards model of looking away as:
h(t | s) = h_0(t) \exp(b_1 I_{t,s} + b_2 I_{t,s}^2 + z_t)
where z_t is a linear combination of covariates at time t. Here, h(t | s) is the hazard of looking away at time t after observing state s. It relates to the probability of continuing to look through the cumulative hazard function \tilde h(t)=\int_0^t h(\tilde t) ~d \tilde t. Thus we can relate this model to our POMDP framework by defining the policy to keep looking as follows:
\pi(A_t=\mathrm{look}) = e^{-\tilde h(t)}
If looking is motivated by curiosity, then this formulation may be interpreted as a mechanism of curiosity proposed by Kidd, Piantadosi, and Aslin (2012).
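To see how the pieces fit together, here is a sketch that turns a sequence of surprisal values into a look-away hazard and a probability of continuing to look. The baseline hazard h_0, the coefficients b_1 and b_2, the omission of the covariates z_t, and the discrete-time approximation of the cumulative hazard are all illustrative assumptions, not values from Kidd, Piantadosi, and Aslin (2012).

```python
import numpy as np

def hazard(I, h0=0.05, b1=-0.5, b2=0.1):
    # proportional-hazards form: h = h0 * exp(b1 * I + b2 * I^2), covariates omitted
    return h0 * np.exp(b1 * I + b2 * I**2)

# surprisal I_{t,s} of the observed event at each time step (made-up values)
surprisals = np.array([0.4, 1.2, 2.5, 0.9])

# cumulative hazard approximated as a sum over unit-length time steps
cumulative_hazard = np.cumsum(hazard(surprisals))
p_keep_looking = np.exp(-cumulative_hazard)  # pi(A_t = look) at each step
print(p_keep_looking)
```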
F. Poli et al. (2020) considered other probabilistic quantities that might predict the look-away hazard. These included the following (a code sketch of both quantities follows the list):
- Predictability, defined as the negative Shannon entropy of the probability distribution over the uncertain state (i.e., before the correction step): -H(\overline{bel}(S_t|\mathrm{look})) = \sum_{s \in \mathcal{S} \setminus \{\mathrm{ITI}\}} E[S_{t}=s|\mathcal{D}_t] \log E[S_{t}=s|\mathcal{D}_t]
Kidd, Piantadosi, and Aslin (2012) also included Shannon’s entropy (without the minus sign) in their Cox regression as a covariate.
- Learning progress, defined as the Kullback-Leibler divergence between the state-prediction belief for the next non-‘ITI’ state (after correction) and the current state-prediction belief: \begin{align*} D_{KL} & (\overline{bel}(S_{t+1}|\mathrm{look}) || \overline{bel}(S_t|\mathrm{look})) = \\ & =\sum_{s \in \mathcal{S} \setminus \{\mathrm{ITI}\}} E[S_{t+1}=s|\mathcal{D}_{t+1}] \log \frac{E[S_{t+1}=s|\mathcal{D}_{t+1}]}{E[S_{t}=s|\mathcal{D}_{t}]} \end{align*}
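A sketch of both quantities, computed from the Dirichlet-categorical predictive distribution before and after one additional (hypothetical) observation:

```python
import numpy as np

def predictive(alphas, counts):
    """Marginal expectation E[S = s | D] for each non-'ITI' state."""
    posterior = alphas + counts
    return posterior / posterior.sum()

alphas = np.ones(3)                               # prior over three locations
counts_t = np.array([3.0, 1.0, 0.0])              # counts in D_t
counts_t1 = counts_t + np.array([1.0, 0.0, 0.0])  # one more 'location 1' event in D_{t+1}

p_t = predictive(alphas, counts_t)
p_t1 = predictive(alphas, counts_t1)

predictability = np.sum(p_t * np.log(p_t))              # negative Shannon entropy
learning_progress = np.sum(p_t1 * np.log(p_t1 / p_t))   # KL(p_{t+1} || p_t)
print(predictability, learning_progress)
```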
Up next
I am planning to review more curiosity tasks in the future:
- Missing letters task (Singh and Manjaly 2021)
- Noisy image task (Van Lieshout et al. 2018)
- Trivia task (Kang et al. 2009)
- Image reveal tasks (Hsiung et al. 2023; Käckenmester, Kroencke, and Wacker 2018)
- Active learning tasks (Ten et al. 2021; Francesco Poli et al. 2022)
References
Footnotes
Full transition table for the urn task POMDP:

| s_t | a_t | s_{t+1} | Pr(s_{t+1}) |
|---|---|---|---|
| noball | idle | noball | 0 |
| noball | idle | red | \theta |
| noball | idle | blue | 1 - \theta |
| red | idle | noball | 0 |
| red | idle | red | 1 |
| red | idle | blue | 0 |
| blue | idle | noball | 0 |
| blue | idle | red | 0 |
| blue | idle | blue | 1 |
| red | seek | noball | 0 |
| red | seek | red | 1 |
| red | seek | blue | 0 |
| blue | seek | noball | 0 |
| blue | seek | red | 0 |
| blue | seek | blue | 1 |
| red | skip | noball | 0 |
| red | skip | red | 1 |
| red | skip | blue | 0 |
| blue | skip | noball | 0 |
| blue | skip | red | 0 |
| blue | skip | blue | 1 |

Note that only the ‘idle’ action is available in the no-ball state (s_t=0), i.e., seeking or skipping information (a_t=1 or a_t=2) is not possible there, which is why the table contains 21, not 27, rows.↩︎
Full table of observation probabilities for the urn task POMDP:

| s_{t+1} | a_t | o_{t+1} | Pr(o_{t+1}) |
|---|---|---|---|
| red | idle | red | 0 |
| red | idle | blue | 0 |
| red | idle | none | 1 |
| blue | idle | red | 0 |
| blue | idle | blue | 0 |
| blue | idle | none | 1 |
| red | seek | red | 1 |
| red | seek | blue | 0 |
| red | seek | none | 0 |
| blue | seek | red | 0 |
| blue | seek | blue | 1 |
| blue | seek | none | 0 |
| red | skip | red | 0 |
| red | skip | blue | 0 |
| red | skip | none | 1 |
| blue | skip | red | 0 |
| blue | skip | blue | 0 |
| blue | skip | none | 1 |

↩︎