Reinforcement Learning in the
Control of Attention
Luiz M G Gonçalves
Laboratory for Analysis and Architecture of Systems
(State University of Campinas in the near future)
www.laas.fr/~lmgarcia
Roderic A Grupen
Laboratory for Perceptual Robotics
University of Massachusetts (USA)
www-robotics.cs.umass.edu
Objective
To develop a robotic system to
perform tasks involving attention and
pattern categorization, integrating
multi-modal (haptic and visual)
information in a behaviorally
cooperative active system.
Motivation
Towards a useful robotic
system able to:
foveate (verge) the eyes onto a ROI;
keep attention on the ROI if needed;
choose another ROI (shift focus of
attention).
The result is a behaviorally cooperative
active system that provides online feedback
to environmental stimuli in the form of actions.
Method
Use of (real time) visual information from
a stereo head and a simulator
Selective Attention (bottom-up salience
maps)
Multi-feature extraction (perceptual state)
Associative memory (pattern address
identification)
Efficient topological mapping
Learn policies to program the system
Task Specification (Objectives)
Visual Monitoring or Environment Inspection
Construction
of an attentional map
Keep this map consistent with current
perception (update)
Categorize all patterns
Markov Process
A stochastic process whose past does not
influence the future if its present is
completely specified.
Ex: checkers, chess.
For $t_{n-1} < t_n$:
$$P\{X(t_n) \le x_n \mid X(t),\ t \le t_{n-1}\} = P\{X(t_n) \le x_n \mid X(t_{n-1})\}$$
Dynamic Programming
Naive approach: traverse all possible states,
testing every possibility
(execute all actions indefinitely).
Better solution (DP):
Reduce
the complexity of a problem
that can be solved in dimension D
into two or more problems in
lower dimensions.
Ex: stereo disparity:
1 problem in 3D (x,y,d) is reduced to
2 problems in 2D, (x,d) and (y,d).
Pavlov
Animal does the right thing, gets food.
Animal does the wrong thing, gets punished.
In theory, it is proven that only
one of the two (reward or punishment)
is enough: if it does the wrong thing,
it simply gets no food.
Thus, for the robot:
did the right thing => reward
Reinforcement Learning
(Related Work)
Watkins: Learning from Delayed Rewards
(1989).
Sutton/Barto: Reinforcement Learning: An
Introduction (1998).
Araujo: Learning a Control Composition in a
Complex Environment (1996).
Huber: A Feedback Control Structure for On-line Learning Tasks (1997).
Coelho: A Control Basis for Learning
Multifingered Grasps (1997).
Modelling a problem with delayed
reinforcement as an MDP:
a set of states S,
a set of actions A,
a reward function $R: S \times A \to \mathbb{R}$, and
a state transition function $T: S \times A \to \Pi(S)$,
which maps state-action pairs to probability
distributions over the states.
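As a rough illustration only (not the authors' implementation), such an MDP can be represented with plain Python structures; the two-state toy values below are invented for the example:

```python
# Illustrative sketch of a discrete MDP: states and actions are indices,
# R[(s, a)] is the immediate reward, T[(s, a)] maps to {next_state: probability}.
from dataclasses import dataclass

@dataclass
class MDP:
    states: list   # S
    actions: list  # A
    R: dict        # R[(s, a)] -> reward
    T: dict        # T[(s, a)] -> {s_next: probability}

# Toy two-state, two-action example (values invented for illustration)
mdp = MDP(
    states=[0, 1],
    actions=[0, 1],
    R={(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 0.0},
    T={(0, 0): {0: 1.0}, (0, 1): {1: 1.0},
       (1, 0): {0: 1.0}, (1, 1): {1: 1.0}},
)
```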
Q-learning equation
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[\, r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right]$$
a = the action executed
r = the reward received
s' = the state resulting from applying a
A = the set of all possible actions a' that can be
executed in s'
$\alpha$ = learning rate (typically 0.1)
$\gamma$ = discount factor (typically 0.5)
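A minimal sketch of this update as code, assuming the Q-table is stored as a NumPy array indexed by (state, action); the default alpha and gamma simply mirror the typical values listed above:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.5):
    """One Q-learning update on a (n_states x n_actions) table Q."""
    # Temporal-difference error: r + gamma * max_a' Q(s', a') - Q(s, a)
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error
```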
Observations
A transition in the state space can be
completely characterized by the
tuple (s, a, r, s').
Assuming that Q(s,a) is updated
infinitely often for every pair (s,a),
Q(s,a) converges with probability 1
to the best possible expected
reward for that pair.
Exploration and exploitation
Exploration: randomly choose an action.
Exploitation: after some time the system starts
to converge, so choose actions that are known to be
contributing to the convergence.
Balance exploration against exploitation.
Temperature (reminiscent of simulated annealing):
choose
randomly as a function of the
temperature (initially high, then lower).
In practice, even at the end, actions are still 10% random.
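A sketch of such a temperature-based (Boltzmann) action selector; the 10% residual randomness follows the remark above, while the function name and the epsilon parameter are illustrative assumptions:

```python
import numpy as np

def select_action(Q, s, temperature, epsilon=0.1, rng=None):
    """Boltzmann (softmax) exploration with ~10% residual random actions."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:                 # keep a fraction of purely random choices
        return int(rng.integers(n_actions))
    prefs = Q[s] / max(temperature, 1e-6)      # high T -> near uniform, low T -> near greedy
    prefs = prefs - prefs.max()                # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(n_actions, p=probs))
```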
Q-learning algorithm
1) Define the current state s by decoding the available sensory information;
2) Use a stochastic action selector to determine action a;
3) Perform action a, generating a new state s' and a reinforcement r;
4) Calculate the temporal difference error r':
$$r' = r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)$$
5) Update the Q-value of the state/action pair (s,a):
$$Q(s,a) \leftarrow Q(s,a) + \alpha\, r'$$
6) Go to 1.
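Putting steps 1-6 together, a tabular Q-learning loop might look like the sketch below; `env.observe_state()` and `env.step()` are hypothetical placeholders for the perception and control interfaces, and the action selector is the one sketched earlier:

```python
import numpy as np

def q_learning(env, n_states, n_actions, steps=10000,
               alpha=0.1, gamma=0.5, temperature=1.0, cooling=0.999):
    """Tabular Q-learning following steps 1-6 (a sketch, not the original code)."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(steps):
        s = env.observe_state()                        # 1) decode available sensory information
        a = select_action(Q, s, temperature, rng=rng)  # 2) stochastic action selector (see above)
        r, s_next = env.step(a)                        # 3) execute a, obtain reinforcement and s'
        td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]  # 4) temporal difference error r'
        Q[s, a] += alpha * td_error                    # 5) update Q(s, a)
        temperature *= cooling                         # anneal exploration toward exploitation
    return Q
```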
Eligibility trace
Update not just one state-action
pair at a time, but a sequence of
pairs (after the execution of a
series of actions).
Gain in convergence.
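A sketch of how an eligibility trace spreads one temporal-difference correction over the recently visited pairs (a Q(lambda)-style update; the lambda value and the omitted reset of traces on exploratory actions are simplifying assumptions):

```python
import numpy as np

def q_lambda_update(Q, E, s, a, r, s_next, alpha=0.1, gamma=0.5, lam=0.9):
    """Update a whole trace of recently visited (s, a) pairs in one step.

    Q and E are arrays of the same (n_states x n_actions) shape; E holds
    the eligibility of each pair and decays over time."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    E[s, a] += 1.0                 # mark the current pair as eligible
    Q += alpha * td_error * E      # every eligible pair shares the correction
    E *= gamma * lam               # older pairs receive exponentially less credit
```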
In practice
A table (the Q-table):
rows are the states (s);
columns are the actions (a);
element Q(s,a) holds the Q-value,
given by the function that
evaluates the utility of taking action
a when the state is s.
Roger-the-Crab
Stereo Head Environment
Degrees of Freedom (Controllers)
System Control Architecture
Low-level Control
Defining a target
Pre-attentional phase (stimuli + internal
bias)
Shifting attention (saccade generation)
Fine saccade (using target model)
Verging eyes onto a target (correlation)
Movements are computed from errors relative to
the image centers (see the sketch below)
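As an illustration of how a saccade target might be taken from a bottom-up salience map and expressed as an error relative to the image center, here is a generic sketch (not the authors' implementation; the bias map argument is an assumption):

```python
import numpy as np

def saccade_target(salience_map, bias=None):
    """Pick the most salient location, optionally weighted by an internal bias map."""
    combined = salience_map if bias is None else salience_map * bias
    y, x = np.unravel_index(np.argmax(combined), combined.shape)
    cy, cx = combined.shape[0] // 2, combined.shape[1] // 2
    return (x - cx, y - cy)        # error relative to the image center drives the saccade
```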
Low-level Control
Identifying Objects
Selecting a region of interest
Extracting features
Associative memory match
Mapping objects and/or updating
memory
Pre-attentional maps
Automatic supervised learning
Behavioral Program
A straightforward control algorithm (see the sketch after the finite state machine below)
Step 0: Initialize the associative memory and start the
concurrent controllers of arms, neck, and eyes.
Step 1: Re-direct attention; if a representation is activated,
update attentional maps and re-do this step (1).
Step 2: Try a visual improvement; if a representation is
activated, update attentional maps and return to step 1.
Step 3: Try an arm improvement; if a representation is
activated, update attentional maps and return to step 1.
Step 4: Activate the "supervised learning" module, update
attentional maps and return to step 1.
Finite state machine
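A minimal sketch of the behavioral program above written as a simple loop; every controller and memory interface named here is an assumed placeholder, not the system's actual API:

```python
def behavioral_program(memory, controllers, attention, vision, arm, supervisor):
    """Sketch of steps 0-4; all objects are hypothetical interfaces."""
    memory.initialize()                       # Step 0: associative memory
    controllers.start()                       # concurrent arm/neck/eye controllers
    while True:
        if attention.redirect():              # Step 1: re-direct attention
            attention.update_maps()
            continue
        if vision.improve():                  # Step 2: try a visual improvement
            attention.update_maps()
            continue
        if arm.improve():                     # Step 3: try an arm improvement
            attention.update_maps()
            continue
        supervisor.learn()                    # Step 4: supervised learning module
        attention.update_maps()
```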
Results
Q-learning convergence
Partial evaluation of strategies: attentional shifts
Partial evaluation of strategies: visual/arm improvements
Partial evaluation of strategies: objects identified
Partial evaluation of strategies: new objects
Global evaluation: mapped objects
Task accomplishment: mapped objects
Times for each phase or process
Phase                Min (sec)   Max (sec)   Mean (sec)
Computing retina     0.145       0.189       0.166
Transfer to host     0.017       0.059       0.020
Total acquiring      0.162       0.255       0.186
Pre-attention        0.139       0.205       0.149
Salience map         0.067       0.134       0.075
Total attention      0.324       0.395       0.334
Total saccade        0.466       0.903       0.485
Features for match   0.135       0.158       0.150
Memory match         0.012       0.028       0.019
Total matching       0.323       0.353       0.333
Conclusions
The system can support other sensors.
Attention and categorization act
together: tasks must be formulated accordingly.
The inspection task was successfully accomplished.
Currently supports a 10-15 Hz frame rate.
The reinforcement learning approach worked
well in simulation.
Future work
Consider focus for saccade generation and
accommodation (vergence)
Test with partially occluded objects
Derive policies (with Q-learning) for the control of top-down attention
Increase the state space and/or the set of actions
Define other hierarchical tasks (several policies, each
appropriate for a given task)
Test the learning architecture in a real environment
Thanks
Thanks to CNPq, CAPES, FAPERJ, NSF
and UMass (USA)
To all of you for your patience
To Mimmo and Dr. Arcangelo Distante for
hosting me :-)