I Introduction
The drive towards autonomous learning systems requires computing tasks locally or in-situ, which mitigates the rising energy costs caused by inefficiencies in modern computer architecture [1]. A variety of emerging nonvolatile memory devices, such as phase-change materials, filamentary resistive RAM, and magnetic memories (spin-transfer-torque RAM (STT-RAM) and spin-orbit-torque RAM (SOT-RAM)), may implement this vision. Critically, emerging devices can perform not only data storage but also complex physics-powered operations such as vector-matrix multiplies (VMMs) when densely wired [2].
The workhorse algorithm in AI workloads is backpropagation of error (BP). BP relies upon a teacher signal supplied to all layers and the storage of high-quality gradients on each layer during the parameter update phase [3]. In contrast, competitive learning or adaptive resonance methods provide labels sparsely, e.g. only to some parts of the system; the rest learn according to internally adaptive units and/or dynamics [4]. Competitive learning relies upon the winner-take-all (WTA) motif, a cascadable nonlinear operation that can be used to build deep systems, just as perceptrons can be used to build multilayer perceptrons (MLPs) [5, 6]. Original proposals for building WTA circuits relied upon a chain of inhibition transistors [7]. Analog and digital WTA or spike-feedback CMOS systems have been realized [8, 9], and conceptual proposals for WTA systems using emerging devices exist [10, 11]. However, these works either do not discuss scalable (local) learning rules that might lead to large-scale WTA systems, or do not adequately benchmark against state-of-the-art tasks in the machine learning field. In order to implement efficient WTA learning, we draw upon the spike-timing-dependent plasticity (STDP) rule, a primitive predictive/correlative engine [12]. As in [13], we implement STDP and WTA learning together with emerging memory; however, our chosen synapses are analog and, as in [14], we closely study neuronal behavior/interactions to implement optimal competitive learning with hidden units.
Our chosen analog memory is the three-terminal magnetic-tunnel-junction (3T-MTJ) device. These devices: 1) achieve high switching efficiency due to the SOT interaction at input/output terminals; 2) possess a nonvolatile state variable, a domain-wall interface (DWI) moving through a soft ferromagnetic track; 3) can be dually utilized as a synapse, holding an internal conductance state when the output terminal is long, or as a neuron, when the track is long. In the former case, domain-wall synapses notably possess a good energy footprint and operate advantageously on neural network tasks in comparison to other nanodevice synaptic options [15]. In the latter case, assuming tight spacing, lateral inhibition exists between neighboring DW-MTJ neuron tracks, and the physics-derived leak function can be used to implement rapid inference operations given pre-trained weights [16]. In this work, we describe an efficient combination of unsupervised (WTA+STDP) and supervised (label-driven) learning in an all-DW-MTJ device array that approaches BP-level performance and achieves remarkable energy efficiency on difficult tasks.
II Operation of Nanomagnetic WTA Primitive
Our system relies upon three operations: 1) Inference: vector-matrix multiplies on clustered weights generate post-synaptic outputs. 2) Domain-Wall Competition: a dynamic step whereby interacting neuron units evolve according to post-synaptic inputs (a vector of currents), as well as the behavior of nearby neighbor units, according to a physics-informed model. 3) Learning/Programming: an update step where weights are updated according to a simplified version of the spike-timing-dependent plasticity (STDP) rule; neurons implement different hidden statistical models of the input [17]. These stages are progressively implemented in the unsupervised (label-free) phase. Once unlabeled examples from the training set have been seen, weights are frozen and a least-mean-squares (LMS) filter is progressively built in a second weight matrix using labeled data points.
II-A Details of Lateral Inhibition Model
As in [16], the transverse (vertical) component of a wire's magnetic stray field impinges upon neighboring wires. This can be described by:
(1) 
based on [18]. Here, is the magnetic saturation field set at 1.6 T, , , and are width, thickness of the track and inter-wire spacing respectively. When is in the proper range, it can effectively reduce DW velocity . Instead of rigorously calculating in the neural simulator, we focus on an ensemble parameter that modifies naive, current-dominated DW motion :
(2) 
This ratio captures the predominance of current-driven vs. coupled (field-driven) DW behavior. At very low , field influences are negligible; at , coupling is intermediate, and current and field DW influences are mixed; as approaches , neighbor field effects outweigh the influence of input current. Physically, the spacing can vary between 10 nm and 150 nm in order to reflect a full spectrum of coupling strength. However, may not evolve linearly in this regime, as demonstrated in [19].
II-B Details of Analog Plasticity Model
As in [20], the number of weights given a domain wall length , track width , and length of output MTJ terminal (where the analog conductances are realized) is
(3) 
Given , 6 bits could be implemented given an output port length of . Analog weights can be implemented with the use of notches for precise control and nonlinearity [21], or can be obtained intrinsically via fine current-controlled pulses. Due to DWI momentum effects, notch-free systems will typically require greater output/synapse length.
During plasticity events, differences in currents between the synaptic input and output 3T-MTJ ports determine the motion of the DWI modulating . As in Fig. 1, the circuit potentiates the synapse (increases the conductance) when the two currents are coincident and depotentiates the synapse (decreases the conductance) when they are not. This implements an approximate version of Hebbian/anti-Hebbian learning, or approximate STDP (hereafter aSTDP). The teacher signal implementation relies upon DW-MTJ neurons being connected backward to the synaptic devices of that layer, as in the orange wires shown in Fig. 2(a). Further electrical details on the scheme are given in [22].
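The coincidence rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the circuit-level scheme of [22]: conductances are normalized to the range [0, 1], inputs and firing events are binary flags, only neurons that fired are updated, and the names `astdp_update`, `dg`, `g_min`, and `g_max` are hypothetical.

```python
def astdp_update(weights, inputs, fired, dg=0.05, g_min=0.0, g_max=1.0):
    """Approximate STDP step (illustrative assumption, not the device model).

    weights[j][i] is the normalized conductance from input i to neuron j.
    inputs is a list of 0/1 input activity flags.
    fired is a list of 0/1 flags marking which neurons fired this step.
    """
    new_weights = []
    for row, did_fire in zip(weights, fired):
        if not did_fire:
            # Assumption: synapses of silent neurons are left unchanged.
            new_weights.append(list(row))
            continue
        updated = []
        for g, x in zip(row, inputs):
            # Coincident input and output activity potentiates;
            # output activity without input depresses.
            step = dg if x else -dg
            # Clip to the available conductance window.
            updated.append(min(g_max, max(g_min, g + step)))
        new_weights.append(updated)
    return new_weights


if __name__ == "__main__":
    w = [[0.5, 0.5], [0.5, 0.5]]
    print(astdp_update(w, inputs=[1, 0], fired=[1, 0]))
```

The clipping step stands in for the finite track length of the synapse: once the domain wall reaches either end of the output terminal, further pulses in that direction no longer change the conductance.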
II-C Integration with Companion Supervised Learning System
A WTA primitive can be difficult to interface, leading to the desire to efficiently combine unsupervised and supervised subsystems [23]. In our case, the results from the competitively learning DW-MTJ system are forward-propagated to a supervised learning layer that is additionally constructed from DW-MTJ synapses and neurons, as shown in Fig. 2 and first suggested in [24]. This system contains total DW synapses to encode both positive and negative weights, where is the number of hidden nodes and is the label-applied terminal set of neurons. We have considered two possible strategies for the supervised learning policy. The first, a sign-based learning policy, can be implemented with great energy efficiency in neuromorphic hardware [25], and reduces to:
(4) 
where is the input from hidden neuron , is the output at the terminal neuron, is the target (correct) label, is the sign function and is the unit of conductance change per update. The second policy, softmax learning, requires an analog computation but can achieve superior results in machine learning contexts. Given the original post-synaptic update , the softmax function is computed subsequently. Weights are ultimately updated according to , given a learning rate , and following the cross-entropy formulation , where is the presynaptic activation values of that layer, as in [26].
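To make the two readout policies concrete, the sketch below shows one plausible reading of each. The symbols of the original equations are not reproduced here, so the exact forms are assumptions: the sign-based rule is taken to be a fixed conductance step `dg` multiplied by the sign of the hidden input and the sign of the output error, and the softmax rule is taken to be the standard cross-entropy gradient step with learning rate `lr`. All function and parameter names are hypothetical.

```python
import math


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_update(w, x, y, t, dg=0.01):
    """Assumed sign-based rule: w[i][j] += dg * sign(x[i]) * sign(t[j] - y[j])."""
    return [[w[i][j] + dg * sign(x[i]) * sign(t[j] - y[j])
             for j in range(len(t))]
            for i in range(len(x))]


def softmax(z):
    """Numerically stable softmax over a list of activations."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def softmax_update(w, x, t, lr=0.1):
    """Assumed cross-entropy rule: w[i][j] += lr * x[i] * (t[j] - p[j])."""
    z = [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(t))]
    p = softmax(z)
    return [[w[i][j] + lr * x[i] * (t[j] - p[j]) for j in range(len(t))]
            for i in range(len(x))]


if __name__ == "__main__":
    w0 = [[0.0, 0.0], [0.0, 0.0]]
    print(sign_update(w0, x=[1, 0], y=[0.2, 0.9], t=[1, 0]))
    print(softmax_update(w0, x=[1, 0], t=[1, 0]))
```

The contrast is visible in the update sizes: the sign-based rule always moves a weight by exactly one conductance unit, which suits pulse-programmed devices, while the softmax rule scales each step by the size of the error and therefore needs an analog computation.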
III Description of Data Science Tasks
We consider three tasks: 1) The Human Activity Recognition (HAR) set of phone sensor data (e.g. body acceleration, angular speed). There are 5 classes of activity (standing, walking, etc.), 21,000 training and 2,500 test examples of dimension [27]. 2) The MNIST database of handwritten digits, which includes 60,000 training and a separate 10,000 test examples, at [28]. 3) The fashion-MNIST (fMNIST) database, which is of the same dimensionality as 2), represents items of clothing (sneakers, t-shirt, etc.) and is notably less linearly separable than either of the previous tasks [29].
IV Performance on Tasks
IV-A Parameters for successful clustering
For correct clustering system operation, the most critical parameter tends to be the coupling parameter. As visible in Fig. 3, while the intermediate/low amount of stray field interaction (overfiring) and dominant stray field interaction (underfiring) both do poorly, the high-intermediate level of interaction, in which current matters but is outweighed by locally dominant neighbors, generalizes properly. Computationally, this suggests that an intermediate point between 'hard' WTA (in which one or close to one neuron fires) and 'soft' WTA (in which most neurons fire) best implements clustering and forces a useful hidden representation of the input dataset.
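The spectrum between 'hard' and 'soft' WTA can be illustrated with a simple k-winners selection. This is a computational stand-in only: in the physical system the number of firing neurons is set by stray-field coupling between tracks, not by an explicit count, and the function name `k_winners` and the parameter `k` are hypothetical.

```python
def k_winners(currents, k):
    """Return 0/1 firing flags for the k neurons with the largest input current."""
    k = max(0, min(k, len(currents)))
    # Rank neuron indices by input current, strongest first.
    order = sorted(range(len(currents)), key=lambda j: currents[j], reverse=True)
    winners = set(order[:k])
    return [1 if j in winners else 0 for j in range(len(currents))]


if __name__ == "__main__":
    currents = [0.2, 0.9, 0.5, 0.7]
    print(k_winners(currents, 1))  # hard WTA: a single winner
    print(k_winners(currents, 2))  # intermediate regime
    print(k_winners(currents, 4))  # soft limit: every neuron fires
```

Small `k` corresponds to the underfiring regime and large `k` to the overfiring regime described above; the intermediate setting is the one reported to generalize.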
[Table 1: top results per task for each learning style (Random, AnaBP; STDP, BinBP; STDP, AnaBP)]
Next, we evaluate how critical two common enhancements to standard WTA operation, homeostasis [30] and rank-order coding [31], are to strong performance in the hidden layer. Fig. 4 shows that these two operations are also important. In the case of homeostasis, we find that a small number of homeostatically inhibited time steps already provides this benefit, and a great deal of fine-tuning is not needed. A similar result is obtained for order-coded learning, where a sufficiently large exponent is needed to clip the updates to a reasonable number of total neurons firing. Note that when this parameter is very low, the hidden layer again tends to overfire and redundantly sample. Since correct values of the coupling parameter also naturally clip the total number that can fire, this suggests that the poor aSTDP results in Fig. 4(a) are unlikely.
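The homeostasis mechanism can be sketched as a per-neuron cooldown counter: a neuron that wins is barred from competing for a few subsequent steps, so other units get a chance to learn. This is a minimal sketch under assumptions; the single-winner selection, the counter scheme, and the names `compete_with_homeostasis`, `cooldown`, and `inhibit_steps` are hypothetical and not taken from [30].

```python
def compete_with_homeostasis(currents, cooldown, inhibit_steps=2):
    """Pick the strongest neuron that is not currently inhibited.

    currents: list of input currents, one per neuron.
    cooldown: list of remaining inhibited steps, one per neuron.
    Returns (winner_index_or_None, updated_cooldown).
    """
    eligible = [j for j in range(len(currents)) if cooldown[j] == 0]
    # Every inhibited neuron moves one step closer to competing again.
    new_cooldown = [max(0, c - 1) for c in cooldown]
    if not eligible:
        return None, new_cooldown
    winner = max(eligible, key=lambda j: currents[j])
    # The winner sits out the next inhibit_steps competitions.
    new_cooldown[winner] = inhibit_steps
    return winner, new_cooldown


if __name__ == "__main__":
    currents = [0.9, 0.8, 0.1]
    cooldown = [0, 0, 0]
    for _ in range(4):
        winner, cooldown = compete_with_homeostasis(currents, cooldown)
        print(winner, cooldown)
```

Even with a fixed input, the winner rotates among neurons instead of one unit dominating, which is the redundancy-reducing effect that homeostasis is meant to provide.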
IV-B Dimensional and learning set requirements
Fig. 5 illustrates performance on the MNIST task as a function of competing units and number of supervised training samples given a properly calibrated hidden layer. Ultimately, classification on the test set is achieved when using AnaBP in the second layer with only examples drawn from the training set (but with a fairly large ). Table 1 summarizes the top results for the other two tasks. For HAR, is reached given and ; fMNIST requires and . This suggests the current design is adequate on more separable tasks, while deeper networks may be required to prevent unacceptable system size blowup on very non-separable (difficult) ones. These are notably low numbers for the total number of labeled data points presented; a modern memristive MLP requires many multiples of the task set, e.g. 200-500k samples for MNIST or fMNIST [32, 26], and achieves 96% on MNIST and 81% on fMNIST. Thus, our present results are very slightly inferior to BP. However, as in Table 1, clustering outperforms the random weights system definitively, given the more robust learning procedure in the readout layer.
IV-C Resilience to Intrinsic Physics Effects in System
Several non-ideal issues may occur in the physical learning system: a) synapse-level coarseness, e.g. limited resolution of synapses; b) synapse-level process-induced variation at the output MTJ cell (which creates different states and TMR ratio); c) neuron-level stochastic effects due to natural fractal edge roughness in DW-MTJ nanotracks [33], which can cause a neuron, at a given clustering time step, to fail to compete/fire. For coarseness, Fig. 6(a) shows that aSTDP requires 4 bits per synapse to outperform random weights, regardless of second-layer policy; performance continues to increase with more resolution, leveling off at 7-8 bits. Meanwhile, the supervised layer is sensitive to synaptic depth when using the binary BP rule but insensitive to it when using the analog rule, regardless of first-layer weight style. Next, Fig. 7(a) shows that the clustering operation is almost unaffected by synapse-level variability. Finally, Fig. 7(b) shows that the effects of arbitrary domain wall pinning are significant and linear. If around of neurons do not fire at any given clustering step, accuracy is lost. However, the effect of random pinning is negligible when not in ultra-low current operation.
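Two of these non-idealities are straightforward to emulate in a simulator, and the following sketch shows one way it might be done. It assumes uniform quantization of a normalized conductance to 2^bits levels for synapse coarseness, and an independent per-neuron failure probability for pinning; the names `quantize`, `drop_firing`, and `p_fail` are hypothetical, and process-induced variation is not modeled here.

```python
import random


def quantize(g, bits, g_min=0.0, g_max=1.0):
    """Snap conductance g to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    g = min(g_max, max(g_min, g))
    step = (g_max - g_min) / levels
    return g_min + round((g - g_min) / step) * step


def drop_firing(fired, p_fail, rng):
    """Suppress each firing event independently with probability p_fail."""
    return [f if rng.random() >= p_fail else 0 for f in fired]


if __name__ == "__main__":
    rng = random.Random(0)
    print([quantize(0.37, b) for b in (1, 2, 4, 6)])
    print(drop_firing([1, 1, 0, 1, 1], p_fail=0.2, rng=rng))
```

Applying `quantize` after every weight update and `drop_firing` after every competition step gives a simple way to sweep bit depth and pinning probability in the manner of the resilience study described above.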
V Energy Footprint of Proposed Systems
Drawing on methodology in [26], [34], and [35], we estimate the energy overhead for the entire online learning procedure. On the device level, we have assumed that on average , DW velocity is , for SOT switching, , , and is chosen according to Equation (3). We assume the circuit operates in current mode during VMM operations and during the training/plasticity events, and no additional analog-to-digital conversion (ADC) is needed at the hidden layer due to the all-DW design. However, at the output layer, a ramp ADC, comparators, and a softmax subthreshold circuit are implemented to fully interface with digital labels. Based on our estimates, this peripheral circuitry dominates the overall energy footprint and leads to the following results at 6 bits of ADC accuracy for the three tasks using clustered weights and AnaBP in : 1.96 for HAR, 7.41 for MNIST, and 18.55 for fMNIST. Lastly, we parameterize hidden layer dimension and bits (Fig. 8). While energy scales linearly with the system size, it scales quadratically as a function of bits. Since 6 bits of weight precision is workable for BinBP and far less suffices for AnaBP, no blowup in energy is expected. Future energy efficiencies may be unlocked by further increasing domain wall velocities via material optimization [36], or increasing the efficiency of spin-orbit torque switching for more efficient current-mode inference operations.
VI Conclusion
In this work, we have designed and evaluated a learning system which closely draws upon the dynamics of DW-MTJ memory devices to learn efficiently. The major positive result of the work is that current-mode (all DW-MTJ) internal operation, low bit requirements, and a low number of required updates allow us to achieve learning with energy budget at very high speed. The major incomplete aspect of the work is that our accuracy results are still inferior to state-of-the-art deep networks using BP. Our immediate next steps are thus to examine deeper (cascaded) implementations of semi-supervised DW-MTJ systems that may be ML-competitive.
Acknowledgment
Sandia National Laboratories is a multimission laboratory managed and operated by NTESS, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
References
 [1] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014, pp. 10–14.
 [2] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola et al., “Neuromorphic computing using nonvolatile memory,” Advances in Physics: X, vol. 2, no. 1, pp. 89–124, 2017.
 [3] D. E. Rumelhart, R. Durbin, R. Golden, and Y. Chauvin, “Backpropagation: The basic theory,” Backpropagation: Theory, architectures and applications, pp. 1–34, 1995.
 [4] S. Grossberg, “Competitive learning: From interactive activation to adaptive resonance,” Cognitive science, vol. 11, no. 1, pp. 23–63, 1987.
 [5] W. Maass, “Neural computation with winner-take-all as the only nonlinear operation,” in Advances in neural information processing systems, 2000, pp. 293–299.
 [6] W. Maass, “On the computational power of winner-take-all,” Neural computation, vol. 12, no. 11, pp. 2519–2535, 2000.
 [7] C. A. Mead, J. Lazzaro, M. Mahowald, and S. Ryckebusch, “Winner-take-all circuits for neural computing systems,” Oct. 22 1991, US Patent 5,059,814.
 [8] S. Ramakrishnan and J. Hasler, “Vector-matrix multiply and winner-take-all as an analog classifier,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 2, pp. 353–361, 2013.
 [9] J. Park, J. Lee, and D. Jeon, “7.6 a 65nm 236.5 nJ/classification neuromorphic processor with 7.5% energy overhead on-chip learning using direct spike-only feedback,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2019, pp. 140–142.
 [10] S. N. Truong, K. Van Pham, W. Yang, K.-S. Min, Y. Abbas, C. J. Kang, S. Shin, and K. Pedrotti, “Ta2O5-memristor synaptic array with winner-take-all method for neuromorphic pattern matching,” Journal of the Korean Physical Society, vol. 69, no. 4, pp. 640–646, 2016.
 [11] A. Wu, Z. Zeng, and J. Chen, “Analysis and design of winner-take-all behavior based on a novel memristive neural network,” Neural Computing and Applications, vol. 24, no. 7-8, pp. 1595–1600, 2014.
 [12] R. P. Rao and T. J. Sejnowski, “Spike-timing-dependent Hebbian plasticity as temporal difference learning,” Neural computation, vol. 13, no. 10, pp. 2221–2237, 2001.
 [13] A. F. Vincent, J. Larroque, N. Locatelli, N. B. Romdhane, O. Bichler, C. Gamrat, W. S. Zhao, J.-O. Klein, S. Galdin-Retailleau, and D. Querlioz, “Spin-transfer torque magnetic memory as a stochastic memristive synapse for neuromorphic systems,” IEEE transactions on biomedical circuits and systems, vol. 9, no. 2, pp. 166–174, 2015.
 [14] D. Krotov and J. J. Hopfield, “Unsupervised learning by competing hidden units,” Proceedings of the National Academy of Sciences, vol. 116, no. 16, pp. 7723–7731, 2019.
 [15] D. Kaushik, U. Singh, U. Sahu, I. Sreedevi, and D. Bhowmik, “Comparing domain wall synapse with other non volatile memory devices for on-chip learning in analog hardware neural network,” arXiv preprint arXiv:1910.12919, 2019.
 [16] N. Hassan, X. Hu, L. Jiang-Wei, W. H. Brigner, O. G. Akinola, F. Garcia-Sanchez, M. Pasquale, C. H. Bennett, J. A. C. Incorvia, and J. S. Friedman, “Magnetic domain wall neuron with lateral inhibition,” Journal of Applied Physics, vol. 124, no. 15, p. 152127, 2018.
 [17] D. Kappel, B. Nessler, and W. Maass, “STDP installs in winner-take-all circuits an online approximation to hidden Markov model learning,” PLoS computational biology, vol. 10, no. 3, p. e1003511, 2014.
 [18] R. Engel-Herbert and T. Hesjedal, “Calculation of the magnetic stray field of a uniaxial magnetic domain,” Journal of Applied Physics, vol. 97, no. 7, p. 074504, 2005.
 [19] C. Cui, O. G. Akinola, N. Hassan, C. H. Bennett, M. J. Marinella, J. S. Friedman, and J. Incorvia, “Maximized lateral inhibition in paired magnetic domain wall racetracks for neuromorphic computing,” arXiv preprint arXiv:1912.04505, 2019.
 [20] J. A. Currivan, Y. Jang, M. D. Mascaro, M. A. Baldo, and C. A. Ross, “Low energy magnetic domain wall logic in short, narrow, ferromagnetic wires,” IEEE Magnetics Letters, vol. 3, pp. 3 000 104–3 000 104, 2012.
 [21] O. Akinola, X. Hu, C. H. Bennett, M. Marinella, J. S. Friedman, and J. A. C. Incorvia, “Three-terminal magnetic tunnel junction synapse circuits showing spike-timing-dependent plasticity,” Journal of Physics D: Applied Physics, vol. 52, no. 49, p. 49LT01, 2019.
 [22] A. Velasquez, C. Bennett, N. Hassan, W. Brigner, O. Akinola, J. A. Incorvia, M. Marinella, and J. Friedman, “Unsupervised competitive hardware learning rule for spintronic clustering architecture,” GOMAC 2020, Proceedings, 2020.
 [23] D. Querlioz, W. Zhao, P. Dollfus, J.-O. Klein, O. Bichler, and C. Gamrat, “Bioinspired networks with nanoscale memristive devices that combine the unsupervised and supervised learning approaches,” in 2012 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). IEEE, 2012, pp. 203–210.
 [24] C. H. Bennett, N. Hassan, X. Hu, J. A. C. Incorvia, J. S. Friedman, and M. J. Marinella, “Semi-supervised learning and inference in domain-wall magnetic tunnel junction (DW-MTJ) neural networks,” in Spintronics XII, vol. 11090. International Society for Optics and Photonics, 2019, p. 110903I.
 [25] C. S. Thakur, R. Wang, S. Afshar, G. Cohen, T. J. Hamilton, J. Tapson, and A. van Schaik, “An online learning algorithm for neuromorphic hardware implementation,” arXiv preprint arXiv:1505.02495, 2015.
 [26] C. H. Bennett, V. Parmar, L. E. Calvet, J.-O. Klein, M. Suri, M. J. Marinella, and D. Querlioz, “Contrasting advantages of learning with random weights and backpropagation in non-volatile memory neural networks,” IEEE Access, 2019.
 [27] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “A public domain dataset for human activity recognition using smartphones,” in ESANN, 2013.
 [28] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, p. 18, 2010.
 [29] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
 [30] D. Querlioz, O. Bichler, A. F. Vincent, and C. Gamrat, “Bioinspired programming of memory devices for implementing an inference engine,” Proceedings of the IEEE, vol. 103, no. 8, pp. 1398–1416, 2015.
 [31] B. S. Bhattacharya and S. B. Furber, “Biologically inspired means for rank-order encoding images: A quantitative analysis,” IEEE transactions on neural networks, vol. 21, no. 7, pp. 1087–1099, 2010.
 [32] I. Kataeva, F. Merrikh-Bayat, E. Zamanidoost, and D. Strukov, “Efficient training algorithms for neural networks based on memristive crossbar circuits,” in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1–8.
 [33] S. Dutta, S. A. Siddiqui, J. A. Currivan-Incorvia, C. A. Ross, and M. A. Baldo, “The spatial resolution limit for an individual domain wall in magnetic nanowires,” Nano letters, vol. 17, no. 9, pp. 5869–5874, 2017.
 [34] V. Parmar and M. Suri, “Design exploration of IoT centric neural inference accelerators,” in Proceedings of the 2018 on Great Lakes Symposium on VLSI. ACM, 2018, pp. 391–396.
 [35] M. J. Marinella, S. Agarwal, A. Hsia, I. Richter, R. Jacobs-Gedrim, J. Niroula, S. J. Plimpton, E. Ipek, and C. D. James, “Multiscale co-design analysis of energy, latency, area, and accuracy of a ReRAM analog neural training accelerator,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 1, pp. 86–101, 2018.
 [36] F. Ajejas, V. Křižáková, D. de Souza Chaves, J. Vogel, P. Perna, R. Guerrero, A. Gudin, J. Camarero, and S. Pizzini, “Tuning domain wall velocity with Dzyaloshinskii-Moriya interaction,” Applied Physics Letters, vol. 111, no. 20, p. 202402, 2017.