One-shot Learning

Deep learning models keep getting more sophisticated and are very good at finding patterns and discriminative features in datasets. However, they usually need a large amount of data to converge, and that much data is not always available. Examples include signature verification, where a system only has one or two examples of a customer’s signature, and face recognition, where only one portrait of the person might be available.

One-shot learning is the classification task where a model has to predict the label of inputs without having trained on the class involved at all. For this task we give the model one or a few examples of each possible class, and it has to assign each input to one of those classes. Humans are very good at one-shot learning: someone who sees a giraffe for the first time will easily recognize the animal the next time they see one. Modeling this ability computationally, however, is trickier.

Another benefit of one-shot learning is that we do not have to retrain the model when the number of possible classes changes. To achieve this, some researchers tackle the problem with meta-learning: the model needs to learn to learn rather than learn to classify.

n-way k-shot classification is the task of classifying instances from n different classes when the classifier is given k examples of each class.

Meta-Learning: train like you test

Meta-learning, or learning-to-learn [1], trains a model to solve tasks drawn from a given task family.

To achieve the best performance during the test phase, one idea of meta-learning is to mimic the test procedure during the training phase. First of all, the train, validation and test sets need to have completely disjoint labels. Then we need to imitate the fact that the support set (the examples given during the one-shot phase) is small and only covers a small part of all the labels in the dataset, and that each input has to be classified into one of these classes.

To do so, during the training phase we first subsample a small subset L of the labels in the training dataset D.

Then we draw a small support set and a training batch from D, such that every element of both has a label in the subset L sampled in the previous step.

Finally, the optimization is done on the sampled batch as in classic supervised models, by forward propagation and backpropagation, with the predictions conditioned on the support set.

Each tuple of these three subsets (labels, support set, training batch) can be seen as one task among the many tasks the model is optimized on.
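The sampling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any of the papers discussed here: the function name sample_episode, its parameters and the toy dataset are all hypothetical, and keeping the support set and the batch disjoint is an assumption made for the example.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, k_shot, n_query, rng=random):
    """Sample one n-way k-shot episode from a list of (x, y) pairs."""
    by_label = defaultdict(list)
    for x, y in dataset:
        by_label[y].append(x)
    # Step 1: subsample a small subset L of the labels.
    labels = rng.sample(sorted(by_label), n_way)
    support, batch = [], []
    for y in labels:
        # Step 2: draw support and batch examples whose labels are in L.
        # Assumption: the two sets are kept disjoint.
        xs = rng.sample(by_label[y], k_shot + n_query)
        support += [(x, y) for x in xs[:k_shot]]
        batch += [(x, y) for x in xs[k_shot:]]
    return labels, support, batch

# Toy dataset: 10 classes with 20 samples each (placeholder string inputs).
toy = [(f"img_{c}_{i}", c) for c in range(10) for i in range(20)]
labels, support, batch = sample_episode(
    toy, n_way=5, k_shot=1, n_query=3, rng=random.Random(0)
)
```

Each call returns one task: a 5-way 1-shot support set plus the batch the model would be optimized on for that step.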

Omniglot Dataset

This dataset contains characters from 50 different alphabets, and each character has 20 samples drawn by hand. The performance of the models introduced in this post is measured on this dataset. For the one-shot task, the models train, validate and test on sets that have no alphabets in common, so that the results are not biased.

Omniglot
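As a small illustration of the alphabet-level split, the sketch below partitions 50 alphabet names into disjoint train, validation and test groups. The helper split_alphabets, the placeholder alphabet names and the 30/10/10 sizes are assumptions chosen for the example, not values stated in this post.

```python
import random

def split_alphabets(alphabets, n_train, n_val, seed=0):
    """Shuffle alphabet names and cut them into three disjoint groups."""
    shuffled = list(alphabets)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder names standing in for the 50 Omniglot alphabets.
alphabets = [f"alphabet_{i:02d}" for i in range(50)]
train, val, test = split_alphabets(alphabets, n_train=30, n_val=10)
```

Because the split is made on alphabets rather than on individual images, no character class seen in training can appear at validation or test time.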

Siamese Neural Networks

Architecture

The idea behind siamese neural nets is to analyze the similarity of two inputs rather than trying to classify an instance directly. To do so, Gregory Koch et al. [2] designed a twin network that processes the pair:

Siamese neural network

Each input goes through the same network architecture with the same weights, and the outputs of the last hidden layer are then passed to the distance layer, which computes the similarity of x_1 and x_2. The two networks need to have their weights tied so that two similar images are mapped to nearby points in the feature space.

Siamese convolutional network

In the paper, the authors use a convolutional network for the hidden layers and a weighted L1 distance for the distance layer.

Convolutional Siamese network
Convolutional Layer
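To make the distance layer concrete, here is a minimal sketch of a weighted L1 distance followed by a sigmoid, plus one-shot prediction by picking the most similar support example. The names siamese_forward and one_shot_predict, the bias term, and the identity function standing in for the convolutional embedding are all assumptions for illustration. In a real model the weights alpha are learned.

```python
import math

def siamese_forward(x1, x2, embed, alpha, bias=0.0):
    """Similarity in (0, 1) of two inputs passed through the same embedding."""
    h1, h2 = embed(x1), embed(x2)  # same function for both inputs: tied weights
    weighted_l1 = sum(a * abs(u - v) for a, u, v in zip(alpha, h1, h2))
    return 1.0 / (1.0 + math.exp(-(weighted_l1 + bias)))  # sigmoid

def one_shot_predict(query, support, embed, alpha, bias=0.0):
    """Return the label of the support example most similar to the query."""
    best = max(support, key=lambda item: siamese_forward(query, item[0], embed, alpha, bias))
    return best[1]

identity = lambda x: x       # stand-in for the convolutional network
alpha = [-1.0, -1.0, -1.0]   # negative weights: larger distance, lower similarity
support = [([0.2, 0.9, 0.4], "a"), ([0.9, 0.1, 0.8], "b")]
prediction = one_shot_predict([0.2, 0.8, 0.5], support, identity, alpha, bias=2.0)
```

With these toy values the query is closest to the first support example, so the prediction is its label.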

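Since deciding whether two inputs belong to the same class is a binary classification problem, the loss function is a regularized cross-entropy computed over all pairs in each mini-batch, where the vector y holds one label per pair (1 for a same-class pair, 0 otherwise).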
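A regularized binary cross-entropy of this kind can be sketched as follows. This is a simplified illustration: the function name regularized_bce is hypothetical, and using a single L2 coefficient lam for all weights and averaging over the mini-batch are assumptions, not necessarily the exact form used in the paper.

```python
import math

def regularized_bce(probs, targets, weights, lam=1e-3, eps=1e-12):
    """Mean binary cross-entropy over pairs plus an L2 penalty on the weights."""
    ce = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        ce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    ce /= len(probs)
    l2 = lam * sum(w * w for w in weights)
    return ce + l2

# Two pairs: one same-class (y=1) and one different-class (y=0).
good = regularized_bce([0.9, 0.1], [1, 0], weights=[0.5, -0.3])
bad = regularized_bce([0.1, 0.9], [1, 0], weights=[0.5, -0.3])
```

The loss is lower when the predicted similarities agree with the pair labels, and the penalty term grows with the magnitude of the weights.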

The accuracy of this network was measured on the task of one-shot learning on Omniglot and compared with existing algorithms:

Accuracy for 10-way one-shot classification on Omniglot

We can see that humans and Hierarchical Bayesian Program Learning are top of the class for this task, with over 95% accuracy. However, the Siamese convolutional network still manages to achieve 92% accuracy without any additional domain-specific information such as characters or strokes, which is crucial for Hierarchical Bayesian Program Learning.

Relation Network

Architecture

Relation Network (RN) (Sung et al., 2018) [3] is similar to the siamese network, with a few differences:

  1. Instead of a simple L1 distance, a convolutional neural network computes the relation score.
  2. The loss function is changed from cross-entropy to mean squared error, as computing a relation score is treated as a regression task rather than a classification task.

The image below shows the architecture of the Relation Network. F_phi is the embedding layer. The embeddings of the support set (the labeled examples) are each concatenated with the input’s embedding, and these concatenated embeddings are passed to the CNN relation layer g_phi, which outputs a relation score. The relation score for images belonging to the same class should be higher than for images from different classes.

Relation Network Architecture

The CNN they implemented for feature extraction uses a sequence of 3x3x3 3D convolutions, a common building block in deep learning models for computer vision. The relation layer applies a 1x1 2D convolution and a 3x3 2D convolution, followed by a fully connected layer, to get the relation scores of the concatenated vectors.

Embedding Layer
Relation layer
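The flow from embeddings to relation scores and to the mean squared error can be sketched as below. The embeddings are plain lists here, and toy_g is a hand-written stand-in for the learned CNN relation layer so that the example runs without a trained model. All names are hypothetical.

```python
def relation_scores(query_emb, support_embs, g):
    """Concatenate each support embedding with the query embedding and score it."""
    return [g(s + query_emb) for s in support_embs]  # '+' concatenates the lists

def mse_loss(scores, true_index):
    """Mean squared error against a target of 1 for the true class, 0 elsewhere."""
    targets = [1.0 if i == true_index else 0.0 for i in range(len(scores))]
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)

def toy_g(pair):
    """Stand-in relation module: scores the two halves of the concatenated vector."""
    half = len(pair) // 2
    a, b = pair[:half], pair[half:]
    return 1.0 / (1.0 + sum(abs(u - v) for u, v in zip(a, b)))

query = [0.2, 0.8]
supports = [[0.1, 0.9], [0.9, 0.2]]  # one embedding per class (2-way 1-shot)
scores = relation_scores(query, supports, toy_g)
predicted = scores.index(max(scores))
loss = mse_loss(scores, true_index=0)
```

During training, g would be a learned network and the loss would push the score of the matching class toward 1 and the others toward 0.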

Results

Accuracy for 5-way one-shot classification on Omniglot

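On Omniglot, the Relation Network reaches a higher accuracy than the Siamese network. This is plausible, since a learned CNN relation module is more expressive than a simple weighted L1 distance, although the two results are reported for different settings (5-way here versus 10-way above), so the comparison is only indicative.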

Conclusion

Research on one-shot learning is still ongoing, and existing models can still be improved. Even though the results shown on the Omniglot dataset are very good, some tasks are more complicated than classifying characters. For example, the Relation Network achieves 50.6% accuracy on Mini Imagenet for the 10-way one-shot classification task, whereas the Memory Matching Network designed by Qi Cai et al. [4] attains 53.37% accuracy on this task.

References

[1] Meta-Learning: Learning to Learn Fast, Lilian Weng: https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html#training-in-the-same-way-as-testing

[2] Siamese Neural Networks for One-shot Image Recognition, Gregory Koch et al.: http://www.cs.toronto.edu/~rsalakhu/papers/oneshot1.pdf

[3] Deep Relation Network for Hyperspectral Image Few-Shot Classification, Kuiliang Gao et al.: https://www.mdpi.com/2072-4292/12/6/923/htm

[4] Memory Matching Networks for One-Shot Image Recognition, Qi Cai et al.: https://openaccess.thecvf.com/content_cvpr_2018/papers/Cai_Memory_Matching_Networks_CVPR_2018_paper.pdf