TY - GEN
T1 - Knowledge sharing for population based neural network training
AU - Oehmcke, Stefan
AU - Kramer, Oliver
PY - 2018/1/1
Y1 - 2018/1/1
N2 - Finding good hyper-parameter settings to train neural networks is challenging, as the optimal settings can change during the training phase and also depend on random factors such as weight initialization or random batch sampling. Most state-of-the-art methods for the adaptation of these settings are either static (e.g. learning rate scheduler) or dynamic (e.g. ADAM optimizer), but only change some of the hyper-parameters and do not deal with the initialization problem. In this paper, we extend the asynchronous evolutionary algorithm, population based training, which modifies all given hyper-parameters during training and inherits weights. We introduce a novel knowledge distilling scheme. Only the best individuals of the population are allowed to share part of their knowledge about the training data with the whole population. This embraces the idea of randomness between the models, rather than avoiding it, because the resulting diversity of models is important for the population’s evolution. Our experiments on MNIST, fashionMNIST, and EMNIST (MNIST split) with two classic model architectures show significant improvements to convergence and model accuracy compared to the original algorithm. In addition, we conduct experiments on EMNIST (balanced split) employing a ResNet and a WideResNet architecture to include complex architectures and data as well.
KW - Asynchronous evolutionary algorithms
KW - Hyper-parameter optimization
KW - Population based training
UR - http://www.scopus.com/inward/record.url?scp=85054508337&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-00111-7_22
DO - 10.1007/978-3-030-00111-7_22
M3 - Article in proceedings
AN - SCOPUS:85054508337
SN - 9783030001100
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 258
EP - 269
BT - KI 2018
A2 - Turhan, Anni-Yasmin
A2 - Trollmann, Frank
PB - Springer Verlag
T2 - 41st German Conference on Artificial Intelligence, KI 2018
Y2 - 24 September 2018 through 28 September 2018
ER -