I am currently researching continual learning, a machine learning paradigm. A continual learning benchmark consists of a sequence of tasks, where each task involves training and testing on a dataset. Many existing datasets can be used to construct such a sequence. This post explores datasets that are suitable for constructing continual learning tasks and summarizes the ones most commonly used in the field.
There are three ways to construct a sequence of tasks for continual learning (please refer to my continual learning beginner's guide for more details):
- Combine: each task uses a different dataset from different sources.
- Permute: permute the pixels of a dataset using different permutation seeds to create different tasks.
- Split: split a dataset by class into different tasks.
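The three constructions above can be sketched in a few lines of plain Python. This is only an illustrative sketch on toy data, not how any particular benchmark library implements them: the function names (`make_permutation`, `permute_image`, `split_by_class`, `combine`) are hypothetical, images are assumed to be flattened lists of pixel values, and classes are assumed to be assigned to split tasks in sorted label order.

```python
import random
from collections import defaultdict


def make_permutation(num_pixels, seed):
    """Return a fixed pixel ordering determined by `seed` (one per task)."""
    order = list(range(num_pixels))
    random.Random(seed).shuffle(order)
    return order


def permute_image(flat_image, permutation):
    """Reorder the pixels of one flattened image with a task's permutation."""
    return [flat_image[i] for i in permutation]


def split_by_class(samples, classes_per_task):
    """Group (image, label) pairs into tasks of `classes_per_task` classes.

    Assumption: classes go to tasks in sorted label order; shuffling the
    class order first is an equally common choice.
    """
    labels = sorted({label for _, label in samples})
    task_of = {label: i // classes_per_task for i, label in enumerate(labels)}
    tasks = defaultdict(list)
    for image, label in samples:
        tasks[task_of[label]].append((image, label))
    return [tasks[t] for t in sorted(tasks)]


def combine(*datasets):
    """Treat each whole dataset as one task, in the order given."""
    return [list(d) for d in datasets]


# Toy data: six flattened 2x2 "images", one per class 0..5.
toy = [([c, c + 1, c + 2, c + 3], c) for c in range(6)]

# Permute: every task keeps all classes but uses its own pixel order.
permuted_tasks = [
    [(permute_image(img, make_permutation(4, seed)), y) for img, y in toy]
    for seed in (0, 1, 2)
]

# Split: three tasks with two classes each.
split_tasks = split_by_class(toy, classes_per_task=2)

# Combine: two (toy) datasets become two tasks.
combined_tasks = combine(toy, toy[:2])
```

Note that Permute keeps the label space identical across tasks, whereas Split gives each task a disjoint set of classes, which is why the number of classes in a dataset limits how many split tasks it can provide.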
This field focuses on the paradigm itself rather than application scenarios, so we take the simplest scenario, supervised image classification, as the default playground. We are going to look at datasets that are designed for image classification, or that have class labels that can be used for image classification.
Dataset | Number of samples | Number of classes | Avg. samples per class | Image size | Permute | Split | Combine |
---|---|---|---|---|---|---|---|
MNIST | 70,000 | 10 | 7,000 | 1x28x28 | y | MLP, 98 | |
Fashion-MNIST | 70,000 | 10 | 7,000 | 1x28x28 | y | MLP, 86 | |
Kuzushiji-MNIST | 70,000 | 10 | 7,000 | 1x28x28 | y | MLP, 87 | |
EMNIST ByClass | 814,255 | 62 | ~13,000 | 1x28x28 | y | MLP, 84 | |
EMNIST ByMerge | 814,255 | 47 | ~17,300 | 1x28x28 | y | MLP, 87 | |
EMNIST Balanced | 131,600 | 47 | 2,800 | 1x28x28 | y | MLP, 82 | |
EMNIST Letters | 145,600 | 26 | 5,600 | 1x28x28 | y | MLP, 90 | |
EMNIST Digits | 280,000 | 10 | 28,000 | 1x28x28 | y | MLP, 98 | |
QMNIST | 120,000 | 10 | 12,000 | 1x28x28 | | | |
notMNIST | Small ~19,000, Large ~500,000 | 10 | ~1,900, ~50,000 | 1x28x28 | y | MLP, 95 | |
Sign Language MNIST | 34,627 | 24 | ~1,443 | 1x28x28 | y | MLP, 60 | |
Arabic Handwritten Digits | 70,000 | 10 | 7,000 | 1x28x28 | y | MLP, 97 | |
Kannada-MNIST | 70,000 | 10 | 7,000 | 1x28x28 | y | MLP, 93 | |
CIFAR-10 | 60,000 | 10 | 6,000 | 3x32x32 | y | ResNet18, 50 | |
CIFAR-100 | 60,000 | 100 | 600 | 3x32x32 | y | ResNet18, 47 | |
GTSRB | 51,839 | 43 | 1,205 | Coloured, not aligned | y | ResNet18, 80 | |
SVHN | 99,289 (without extra) | 10 | 9,929 | 3x32x32 | y | ResNet18, 90 | |
Linnaeus 5 | 8,000 | 5 | 1,600 | 3x256x256 / 3x128x128 / 3x64x64 / 3x32x32 | y | ResNet18, 54 | |
TinyImageNet | 120,000 | 200 | 600 | 3x64x64 | y | ResNet18, 70 (30 epochs) | |
MedMNIST2D PathMNIST | 107,180 | 9 | ~11,909 | 3x28x28 | | | |
MedMNIST2D ChestMNIST | 112,120 | 2 | 56,060 | 1x28x28 | | | |
MedMNIST2D DermaMNIST | 10,015 | 7 | 1,430 | 3x28x28 | | | |
MedMNIST2D OCTMNIST | 109,309 | 4 | 27,327 | 1x28x28 | | | |
MedMNIST2D PneumoniaMNIST | 5,856 | 2 | 2,928 | 1x28x28 | | | |
MedMNIST2D BreastMNIST | 780 | 2 | 390 | 1x28x28 | | | |
MedMNIST2D BloodMNIST | 17,092 | 8 | 2,136 | 3x28x28 | | | |
MedMNIST2D TissueMNIST | 236,386 | 8 | 29,548 | 1x28x28 | | | |
MedMNIST2D OrganAMNIST | 58,830 | 11 | 5,348 | 1x28x28 | | | |
MedMNIST2D OrganCMNIST | 23,583 | 11 | 2,144 | 1x28x28 | | | |
MedMNIST2D OrganSMNIST | 25,211 | 11 | 2,283 | 1x28x28 | | | |
Omniglot | 32,460 | 1,623 | 20 | 1x105x105 | | | |
FaceScrub | 106,863 | 530 | 202 | Coloured, not aligned | ? | | |