Reinforcement Learning in the Control of Attention

Luiz M G Gonçalves
Laboratory for Analysis and Architecture of Systems (State University of Campinas, near future)
www.laas.fr/~lmgarcia

Roderic A Grupen
Laboratory for Perceptual Robotics
University of Massachusetts (USA)
www-robotics.cs.umass.edu

Objective
- To develop a robotic system that performs tasks involving attention and pattern categorization, integrating multi-modal (haptic and visual) information in a behaviorally cooperative active system.

Motivation
- Towards a useful robotic system able to:
  - foveate (verge) the eyes onto a region of interest (ROI);
  - keep attention on the ROI if needed;
  - choose another ROI (shift the focus of attention).
- The result is a behaviorally cooperative active system that provides online feedback to environmental stimuli in the form of actions.

Method
- Use of (real-time) visual information from a stereo head and a simulator
- Selective attention (bottom-up salience maps)
- Multi-feature extraction (perceptual state)
- Associative memory (pattern address identification)
- Efficient topological mapping
- Learned policies to program the system

Task Specification (Objectives)
- Visual monitoring or environment inspection
- Construction of an attentional map
- Keep this map consistent with the current perception (update)
- Categorize all patterns

Markov Process
- A stochastic process whose past does not influence the future once its present is completely specified.
- Examples: checkers, chess.
- For t_{n-1} < t_n:
  P{ X(t_n) = x_n | X(t), t <= t_{n-1} } = P{ X(t_n) = x_n | X(t_{n-1}) }

Dynamic Programming
- Brute force: traverse all possible states, testing every possibility (execute all actions infinitely many times).
- Better solution (DP): reduce the complexity of a problem solvable in dimension D to two or more problems in lower dimensions.
- Example: stereo disparity, where one 3D problem (x, y, d) is reduced to two 2D problems, (x, d) and (y, d).

Pavlov
- The animal does the right thing, it gets food; the animal does the wrong thing, it gets punished.
- In theory, it is proven that only one of the two (reward or punishment) works: if it does the wrong thing, it simply gets no food.
- Thus: the robot does the right thing => reward.

Reinforcement Learning (Related Work)
- Watkins: Learning from Delayed Rewards (1989).
- Sutton and Barto: Reinforcement Learning: An Introduction (1998).
- Araujo: Learning a Control Composition in a Complex Environment (1996).
- Huber: A Feedback Control Structure for Online Learning Tasks (1997).
- Coelho: A Control Basis for Learning Multifingered Grasps (1997).

Modelling a problem with delayed reinforcement as an MDP
- A set of states S;
- a set of actions A;
- a reward function R: S x A -> R;
- a state transition function T: S x A -> Pi(S), which maps state transitions to probabilities.

Q-learning equation
  Q(s,a) ← Q(s,a) + α [ r + γ max_{a'∈A} Q(s',a') − Q(s,a) ]
- a = the action executed
- r = the reward received
- s' = the state that results from applying a
- A = the set of all actions a' that can be executed in s'
- α = learning rate (usually 0.1)
- γ = discount factor (usually 0.5)
- (A short code sketch of this update follows the Observations slide.)

Observations
- A transition in the state space can be completely characterized by the tuple (s, a, r, s').
- Assuming Q(s,a) is updated infinitely often for every pair (s, a), Q(s,a) converges with probability 1 to the best possible reward for that pair.
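The update equation above fits in a few lines of code. The following is a minimal tabular sketch, not the code of the actual system: integer state/action indices and the NumPy Q-table are assumptions for illustration, while α = 0.1 and γ = 0.5 are the defaults quoted on the slide.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.5):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]"""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]  # temporal-difference error
    Q[s, a] += alpha * td_error                          # move Q(s,a) toward the target
    return Q

# Hypothetical example: 4 states, 3 actions, one observed transition (s, a, r, s').
Q = np.zeros((4, 3))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```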
Exploration and exploitation
- Exploration: randomly choose an action.
- Exploitation: after some time the system starts to converge, so the actions known to be contributing to convergence are chosen.
- Balance exploration and exploitation.
- Temperature (reminiscent of Simulated Annealing): choose randomly as a function of the temperature (high at first, low later).
- In practice, even at the end, 10% of the choices are still random.

Q-learning algorithm (a code sketch of this loop follows the Conclusions slide)
1) Define the current state s by decoding the available sensory information;
2) Use a stochastic action selector to determine action a;
3) Perform action a, generating a new state s' and a reinforcement r;
4) Calculate the temporal-difference error r':
   r' = r + γ max_{a'∈A} Q(s',a') − Q(s,a)
5) Update the Q-value of the state/action pair (s,a):
   Q(s,a) ← Q(s,a) + α r'
6) Go to 1.

Eligibility trace
- Update not just one state-action pair at a time, but a whole sequence of pairs (after a series of actions has been executed).
- Gain in convergence.

In practice
- A table (Q-table): rows are the states (s), columns are the actions (a).
- Element Q(s,a) holds the Q-value, given by the function that evaluates the utility of taking action a when the state is s.

Roger-the-Crab
Stereo Head
Environment
Degrees of Freedom (Controllers)
System Control Architecture

Low-level Control: Defining a target
- Pre-attentional phase (stimuli + internal bias)
- Shifting attention (saccade generation)
- Fine saccade (using the target model)
- Verging the eyes onto a target (correlation)
- Movements are computed from errors to the image centers

Low-level Control: Identifying objects
- Selecting a region of interest
- Extracting features
- Associative memory match
- Mapping objects and/or updating memory

Pre-attentional maps
Automatic supervised learning

Behavioral Program
A straightforward control algorithm:
- Step 0: Initialize the associative memory and start the concurrent controllers of arms, neck, and eyes.
- Step 1: Re-direct attention; if a representation is activated, update the attentional maps and redo this step (1).
- Step 2: Try a visual improvement; if a representation is activated, update the attentional maps and return to step 1.
- Step 3: Try an arm improvement; if a representation is activated, update the attentional maps and return to step 1.
- Step 4: Activate the "supervised learning" module, update the attentional maps and return to step 1.

Finite state machine

Results
- Q-learning convergence
- Partial evaluation of strategies: attentional shifts
- Partial evaluation of strategies: visual/arm improvements
- Partial evaluation of strategies: objects identified
- Partial evaluation of strategies: new objects
- Global evaluation: mapped objects
- Task accomplishment: mapped objects

Times for each phase or process

Phase               Min (s)   Max (s)   Mean (s)
Computing retina     0.145     0.189     0.166
Transfer to host     0.017     0.059     0.020
Total acquiring      0.162     0.255     0.186
Pre-attention        0.139     0.205     0.149
Salience map         0.067     0.134     0.075
Total attention      0.324     0.395     0.334
Total saccade        0.466     0.903     0.485
Features for match   0.135     0.158     0.150
Memory match         0.012     0.028     0.019
Total matching       0.323     0.353     0.333

Conclusions
- The system can support other sensors.
- Attention and categorization act together: tasks must be formulated accordingly.
- The inspection task was successfully accomplished.
- The system currently supports a frame rate of 10-15 frames per second.
- The reinforcement learning approach worked well in simulation.
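Before the future work, here is the six-step loop from the Q-learning algorithm slide combined with a temperature-based stochastic action selector, as described under Exploration and exploitation. This is a sketch under stated assumptions: the environment interface (env.reset() and env.step(a) returning (s', r, done)), the linear cooling schedule, and the 10% residual randomness on top of the softmax are illustrative choices, not the interface or the exact selector of the actual stereo-head system.

```python
import numpy as np

def boltzmann_select(q_row, temperature, min_random=0.1):
    """Stochastic action selector: softmax over Q-values scaled by the temperature.
    High temperature -> near-uniform choice; low temperature -> near-greedy.
    min_random keeps about 10% random choices even after cooling (as on the slide)."""
    if np.random.rand() < min_random:
        return np.random.randint(len(q_row))
    prefs = q_row / max(temperature, 1e-6)
    prefs -= prefs.max()                        # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_row), p=probs)

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.5, t_start=5.0, t_end=0.1):
    """Tabular Q-learning loop (steps 1-6) with a cooling temperature."""
    Q = np.zeros((n_states, n_actions))         # Q-table: rows = states, columns = actions
    for ep in range(episodes):
        # Linear cooling (an assumption; the slide only says "high at first, low later").
        temperature = t_start + (t_end - t_start) * ep / max(episodes - 1, 1)
        s = env.reset()                         # 1) decode sensory information into a state
        done = False
        while not done:
            a = boltzmann_select(Q[s], temperature)              # 2) stochastic action selector
            s_next, r, done = env.step(a)                        # 3) act, observe s' and r
            td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]   # 4) temporal-difference error
            Q[s, a] += alpha * td_error                          # 5) update Q(s,a)
            s = s_next                                           # 6) go back to step 1
    return Q
```

Any small simulated environment exposing that reset/step interface can be plugged in as `env`; the returned Q-table then plays the role of the learned attention-control policy described above.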
Future work
- Consider focus for saccade generation and accommodation (vergence).
- Test with partially occluded objects.
- Derive policies (with Q-learning) for the control of top-down attention.
- Increase the state space and/or the set of actions.
- Define other hierarchical tasks (several policies, each appropriate for a given task).
- Test the learning architecture in a real environment.

Thanks
- Thanks to CNPq, CAPES, FAPERJ, NSF, and UMass (USA).
- To all of you for your patience.
- To Mimmo and Dr. Arcangelo Distante for hosting me :-).