Reinforcement learning of 2-joint virtual arm reaching in motor cortex simulation

Title: Reinforcement learning of 2-joint virtual arm reaching in motor cortex simulation
Publication Type: Conference Paper
Year of Publication: 2012
Authors: Neymotin, S. A., Chadderdon, G. L., Kerr, C. C., Francis, J. T., & Lytton, W. W.
Conference Name: Computational Neuroscience Meeting (CNS '12)
Keywords: 2012, BMC Neuroscience, CNS

Few attempts have been made to model learning of sensory-motor control using spiking neural units. We trained a 2-degree-of-freedom virtual arm to reach for a target using a spiking-neuron model of motor cortex that maps proprioceptive representations of limb position to motor commands and undergoes learning based on reinforcement mechanisms suggested by the dopaminergic reward system. A 2-layer model of layer 5 motor cortex (M1) passed motor commands to the virtual arm and received proprioceptive position information from it. The reinforcement algorithm trained synapses of M1 using reward (punishment) signals based on visual perception of decreasing (increasing) distance of the virtual hand from the target. Output M1 units were partially driven by noise, creating stochastic movements that were shaped to achieve desired outcomes.

The virtual arm consisted of a shoulder joint, upper arm, elbow joint, and forearm. The upper arm and forearm were each controlled by a pair of flexor/extensor muscles. These muscles received rotational commands from 192 output cells of the M1 model, while the M1 model received input from muscle-specific groups of sensory cells, each of which was tuned to fire over a range of muscle lengths. The M1 model had 384 excitatory and 192 inhibitory event-based integrate-and-fire neurons, with AMPA/NMDA and GABA synapses. Excitatory and inhibitory units were interconnected probabilistically. Plasticity was enabled in the feedforward connections between input and output excitatory units. Poisson noise was added to the output units to drive stochastic movements. The reinforcement learning (RL) algorithm used eligibility traces for synaptic credit/blame assignment, and a global signal (+1=rewar