Configure CL Dataset (CL Main)
A continual learning (CL) dataset is a sequence of datasets, one per continual learning task, each with its own training and test data. If you are not familiar with continual learning datasets, see my continual learning beginners’ guide about CL datasets.
The CL dataset is a sub-config under the experiment index config (CL Main). To configure a custom CL dataset, you need to create a YAML file in the cl_dataset/ folder. Below is an example of the CL dataset config.
Example
configs
├── __init__.py
├── entrance.yaml
├── experiment
│ ├── example_clmain_train.yaml
│ └── ...
├── cl_dataset
│ └── permuted_mnist.yaml
...
configs/experiment/example_clmain_train.yaml
defaults:
...
- /cl_dataset: permuted_mnist.yaml
...
configs/cl_dataset/permuted_mnist.yaml
_target_: clarena.cl_datasets.PermutedMNIST
root: data/MNIST
num_tasks: 10
validation_percentage: 0.1
batch_size: 128
permutation_mode: first_channel_only
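The config above appears to follow the Hydra-style _target_ convention: the dotted path names a class, and the remaining fields become its constructor arguments. Below is a minimal sketch of that mechanism only, not CLArena's or Hydra's actual code. The helper instantiate_from_config is hypothetical, and types.SimpleNamespace is used as a stand-in target so the sketch runs without CLArena installed.

```python
import importlib


def instantiate_from_config(config: dict):
    """Import the class named by `_target_` and call it with the other fields.

    Hypothetical helper for illustration; real frameworks do more
    (nested configs, interpolation, error handling).
    """
    fields = dict(config)  # copy so the caller's config is not modified
    module_path, _, class_name = fields.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**fields)


# Same fields as permuted_mnist.yaml, but with a stand-in target (assumption):
# the real config points at clarena.cl_datasets.PermutedMNIST.
config = {
    "_target_": "types.SimpleNamespace",
    "root": "data/MNIST",
    "num_tasks": 10,
    "validation_percentage": 0.1,
    "batch_size": 128,
    "permutation_mode": "first_channel_only",
}

dataset = instantiate_from_config(config)
print(dataset.num_tasks, dataset.permutation_mode)
```

Every field other than _target_ is passed as a keyword argument, which is why the field names must match the class arguments.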
Supported CL Datasets & Required Config Fields
In CLArena, we have implemented many CL datasets as Python classes in the clarena.cl_datasets module that you can use for your experiments.
To choose a CL dataset, set the _target_ field to the class name of the CL dataset. For example, to use the Permuted MNIST dataset, set the _target_ field to clarena.cl_datasets.PermutedMNIST. Each CL dataset has its own hyperparameters and configurations, which means it has its own required fields. The required fields are the same as the arguments of the class specified by _target_. The arguments for each CL dataset class can be found in the API documentation.
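Because the required fields mirror the class arguments, a config can be checked against the class signature before running an experiment. The following is a rough sketch of such a check using the standard inspect module. ToyPermutedDataset and missing_required_fields are hypothetical names, and the toy arguments are illustrative rather than the real signature of any CLArena class.

```python
import inspect


class ToyPermutedDataset:
    """Hypothetical class; its arguments are NOT CLArena's real signature."""

    def __init__(self, root: str, num_tasks: int, batch_size: int = 1, seed: int = 0):
        self.root = root
        self.num_tasks = num_tasks
        self.batch_size = batch_size
        self.seed = seed


def missing_required_fields(cls, config: dict) -> list[str]:
    """List constructor arguments without a default that the config omits."""
    params = inspect.signature(cls).parameters  # `self` is already excluded
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return [
        name
        for name, param in params.items()
        if param.kind in named_kinds
        and param.default is inspect.Parameter.empty
        and name not in config
    ]


incomplete = {"_target_": "some.module.ToyPermutedDataset", "root": "data/MNIST"}
print(missing_required_fields(ToyPermutedDataset, incomplete))
```

Arguments that have defaults, such as batch_size in the toy class, are optional in the config under this reading.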
Below is the full list of supported CL datasets. We only support image classification datasets. CL datasets can be constructed from regular datasets in three main ways: permuting, splitting, and combining, so we divide them into three categories: Permuted, Split, and Combined. Please refer to my continual learning beginners’ guide about the three types of datasets. Note that the entries in the “Permuted CL Dataset”, “Split CL Dataset”, “Combined CL Dataset”, and “Other CL Dataset” columns are exactly the class names that the _target_ field is set to.
For more information about the original datasets that these CL datasets are constructed from, please refer to my article: A Summary of Vision Datasets for Image Classification.
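To make the three construction methods concrete, here is a toy sketch in plain Python. It only illustrates the general ideas (a fixed pixel permutation per task, a partition of class labels into tasks, and one whole dataset per task); it is not CLArena's implementation, and all function names and the task numbering are assumptions made for illustration.

```python
import random


def make_permutation(num_pixels: int, seed: int) -> list[int]:
    """Return a fixed, reproducible pixel order for one task."""
    order = list(range(num_pixels))
    random.Random(seed).shuffle(order)
    return order


def permute_image(pixels: list[int], permutation: list[int]) -> list[int]:
    """Permuted: every image in a task is reordered with the same permutation."""
    return [pixels[i] for i in permutation]


def split_classes(num_classes: int, classes_per_task: int) -> list[list[int]]:
    """Split: partition the class labels into consecutive groups, one per task."""
    return [
        list(range(start, min(start + classes_per_task, num_classes)))
        for start in range(0, num_classes, classes_per_task)
    ]


def combine(dataset_names: list[str]) -> dict[int, str]:
    """Combined: each task is a whole dataset (task IDs from 1 are an assumption)."""
    return dict(enumerate(dataset_names, start=1))


image = [0, 10, 20, 30]  # a toy 2x2 image flattened to 4 pixels
task_permutations = [make_permutation(len(image), seed=t) for t in range(3)]
permuted_tasks = [permute_image(image, p) for p in task_permutations]

print(permuted_tasks)
print(split_classes(10, 2))
print(combine(["MNIST", "SVHN"]))
```

In the permuted case all tasks share the same classes and only the input layout changes, whereas in the split case each task sees a different subset of classes.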
Permuted CL Datasets
Permuted CL Dataset | Description | Required Config Fields |
---|---|---|
PermutedArabicHandwrittenDigits | Permuted Arabic Handwritten Digits dataset. The Arabic Handwritten Digits Dataset (AHDD) is a collection of handwritten Arabic digits (0-9). It consists of 60,000 training and 10,000 test images of handwritten Arabic digits (10 classes), each 28x28 grayscale image (similar to MNIST). | Same as PermutedArabicHandwrittenDigits class arguments |
PermutedCaltech101 | Permuted Caltech 101 dataset. The Caltech 101 dataset is a collection of pictures of objects. It consists of 9,146 images of 101 classes, each color image. | Same as PermutedCaltech101 class arguments |
PermutedCaltech256 | Permuted Caltech 256 dataset. The Caltech 256 dataset is a collection of pictures of objects. It consists of 30,607 images of 256 classes, each color image. | Same as PermutedCaltech256 class arguments |
PermutedCelebA | Permuted CelebA dataset. The CelebFaces Attributes Dataset (CelebA) is a large-scale celebrity faces dataset. It consists of 202,599 face images of 10,177 celebrity identities (classes), each 178x218 color image. Note that the original CelebA dataset is not a classification dataset but an attributes dataset. We only use the identity of each face as the class label for classification. | Same as PermutedCelebA class arguments |
PermutedCIFAR10 | Permuted CIFAR-10 dataset. The CIFAR-10 dataset is a subset of the 80 million tiny images dataset. It consists of 50,000 training and 10,000 test images of 10 classes, each 32x32 color image. | Same as PermutedCIFAR10 class arguments |
PermutedCIFAR100 | Permuted CIFAR-100 dataset. The CIFAR-100 dataset is a subset of the 80 million tiny images dataset. It consists of 50,000 training and 10,000 test images of 100 classes, each 32x32 color image. | Same as PermutedCIFAR100 class arguments |
PermutedCountry211 | Permuted Country211 dataset. The Country211 dataset is a collection of geolocation pictures of different countries. It consists of 62,200 images of 211 countries (classes), each 256x256 color image. | Same as PermutedCountry211 class arguments |
PermutedCUB2002011 | Permuted CUB-200-2011 dataset. The CUB (Caltech-UCSD Birds)-200-2011 dataset is a bird image dataset. It consists of 11,788 images of 200 bird species (classes), each 64x64 color image. | Same as PermutedCUB2002011 class arguments |
PermutedDTD | Permuted DTD dataset. The Describable Textures Dataset (DTD) is a collection of describable texture pictures. It consists of 5,640 images of 47 kinds of textures (classes), each 300x300-640x640 color image. | Same as PermutedDTD class arguments |
PermutedEMNIST | Permuted EMNIST dataset. The EMNIST dataset is a collection of handwritten letters and digits (including A-Z, a-z, 0-9). It consists of 814,255 images in 62 classes, each 28x28 grayscale image. EMNIST has 6 different splits. | Same as PermutedEMNIST class arguments |
PermutedEuroSAT | Permuted EuroSAT dataset. The EuroSAT dataset is a collection of satellite images of lands. It consists of 27,000 images of 10 classes, each 64x64 color image. | Same as PermutedEuroSAT class arguments |
PermutedFaceScrub | Permuted FaceScrub dataset. The original FaceScrub dataset is a collection of human face images. It consists of 106,863 images of 530 people (classes), each high-resolution color image. To keep it simple, this version uses a subset of the official Megaface FaceScrub challenge, cropped and resized to 32x32. We have FaceScrub-10, FaceScrub-20, FaceScrub-50, and FaceScrub-100 datasets, where the number of classes is 10, 20, 50, and 100 respectively. | Same as PermutedFaceScrub class arguments |
PermutedFashionMNIST | Permuted Fashion-MNIST dataset. The Fashion-MNIST dataset is a collection of fashion images. It consists of 60,000 training and 10,000 test images of 10 types of clothing (classes), each 28x28 grayscale image (similar to MNIST). | Same as PermutedFashionMNIST class arguments |
PermutedFER2013 | Permuted FER2013 dataset. The FER2013 dataset is a collection of facial expression images. It consists of 35,887 images of 7 facial expressions (classes), each 48x48 grayscale image. | Same as PermutedFER2013 class arguments |
PermutedFGVCAircraft | Permuted FGVC-Aircraft dataset. The FGVC-Aircraft dataset is a collection of aircraft images. It consists of 10,200 images, each color image. FGVC-Aircraft has 3 different kinds of class labels: variant, family, and manufacturer, which have 102, 70, and 41 classes respectively. We support all of them in Permuted FGVC-Aircraft. | Same as PermutedFGVCAircraft class arguments |
PermutedFlowers102 | Permuted Oxford 102 Flower dataset. The Oxford 102 Flower dataset is a collection of flower pictures. It consists of 8,189 images of 102 kinds of flowers (classes), each color image. | Same as PermutedFlowers102 class arguments |
PermutedFood101 | Permuted Food-101 dataset. The Food-101 dataset is a collection of food images. It consists of 101,000 images of 101 classes, each color image. | Same as PermutedFood101 class arguments |
PermutedGTSRB | Permuted GTSRB dataset. The GTSRB dataset is a collection of traffic sign images. It consists of 51,839 images of 43 different traffic signs (classes), each color image. | Same as PermutedGTSRB class arguments |
PermutedImagenette | Permuted Imagenette dataset. The Imagenette dataset is a subset of 10 easily classified classes from ImageNet. It is provided in several image sizes. We support all of them in Permuted Imagenette. | Same as PermutedImagenette class arguments |
PermutedKannadaMNIST | Permuted Kannada-MNIST dataset. The Kannada-MNIST dataset is a collection of handwritten Kannada digits (0-9). It consists of 60,000 training and 10,000 test images of handwritten Kannada digits (10 classes), each 28x28 grayscale image (similar to MNIST). | Same as PermutedKannadaMNIST class arguments |
PermutedKMNIST | Permuted Kuzushiji-MNIST dataset. The Kuzushiji-MNIST dataset is a collection of Japanese Kuzushiji character images. It consists of 60,000 training and 10,000 test images of Japanese Kuzushiji characters (10 classes), each 28x28 grayscale image (similar to MNIST). | Same as PermutedKMNIST class arguments |
PermutedLinnaeus5 | Permuted Linnaeus 5 dataset. The Linnaeus 5 dataset is a collection of flower images. It consists of 8,000 images of 5 flower species (classes). It provides 256x256, 128x128, 64x64, and 32x32 color images. We support all of them in Permuted Linnaeus 5. | Same as PermutedLinnaeus5 class arguments |
PermutedMNIST | Permuted MNIST dataset. The MNIST dataset is a collection of handwritten digits. It consists of 60,000 training and 10,000 test images of handwritten digits (10 classes), each 28x28 grayscale image. | Same as PermutedMNIST class arguments |
PermutedNotMNIST | Permuted NotMNIST dataset. The NotMNIST dataset is a collection of letters (A-J). This version uses the smaller set, which consists of about 19,000 images of 10 classes, each 28x28 grayscale image. | Same as PermutedNotMNIST class arguments |
PermutedOxfordIIITPet | Permuted Oxford-IIIT Pet dataset. The Oxford-IIIT Pet dataset is a collection of cat and dog pictures. It consists of 7,349 images of 37 breeds (classes), each color image. It also provides a binary classification version with 2 classes (cat or dog). We support both versions in Permuted Oxford-IIIT Pet. | Same as PermutedOxfordIIITPet class arguments |
PermutedPCAM | Permuted PCAM dataset. The PCAM dataset is a collection of medical images of breast cancer. It consists of 327,680 images in 2 classes (benign and malignant), each 96x96 color image. | Same as PermutedPCAM class arguments |
PermutedRenderedSST2 | Permuted Rendered SST2 dataset. The Rendered SST2 dataset is a collection of optical character recognition images. It consists of 9,613 images in 2 classes (positive and negative sentiment), each 448x448 color image. | Same as PermutedRenderedSST2 class arguments |
PermutedSEMEION | Permuted SEMEION dataset. The SEMEION dataset is a collection of handwritten digits. It consists of 1,593 handwritten digit images (10 classes), each 16x16 grayscale image. | Same as PermutedSEMEION class arguments |
PermutedSignLanguageMNIST | Permuted Sign Language MNIST dataset. The Sign Language MNIST dataset is a collection of hand gesture images representing ASL letters (A-Y, excluding J). It consists of 34,627 images of 24 classes, each 28x28 grayscale image. | Same as PermutedSignLanguageMNIST class arguments |
PermutedStanfordCars (download link expired) | Permuted Stanford Cars dataset. The Stanford Cars dataset is a collection of car images. It consists of 16,185 images in 196 classes, each color image. | Same as PermutedStanfordCars class arguments |
PermutedSUN397 | Permuted SUN397 dataset. The SUN397 dataset is a collection of scene images. It consists of 108,754 images of 397 classes, each color image. | Same as PermutedSUN397 class arguments |
PermutedSVHN | Permuted SVHN dataset. The SVHN dataset is a collection of street view house number images. It consists of 73,257 training and 26,032 test images of 10 classes, each 32x32 color image. | Same as PermutedSVHN class arguments |
PermutedTinyImageNet | Permuted TinyImageNet dataset. The TinyImageNet dataset is a smaller, more manageable version of the ImageNet dataset. It consists of 100,000 training, 10,000 validation and 10,000 test images of 200 classes, each 64x64 color image. | Same as PermutedTinyImageNet class arguments |
PermutedUSPS | Permuted USPS dataset. The USPS dataset is a collection of handwritten digits. It consists of 9,298 handwritten digit images (10 classes), each 16x16 grayscale image. | Same as PermutedUSPS class arguments |
Split CL Datasets
Split CL Dataset | Description | Required Config Fields |
---|---|---|
SplitCIFAR10 | Split CIFAR-10 dataset. The CIFAR-10 dataset is a subset of the 80 million tiny images dataset. It consists of 50,000 training and 10,000 test images of 10 classes, each 32x32 color image. | Same as SplitCIFAR10 class arguments |
SplitCIFAR100 | Split CIFAR-100 dataset. The CIFAR-100 dataset is a subset of the 80 million tiny images dataset. It consists of 50,000 training and 10,000 test images of 100 classes, each 32x32 color image. | Same as SplitCIFAR100 class arguments |
SplitCUB2002011 | Split CUB-200-2011 dataset. The CUB (Caltech-UCSD Birds)-200-2011 dataset is a bird image dataset. It consists of 11,788 images of 200 bird species (classes), each 64x64 color image. | Same as SplitCUB2002011 class arguments |
SplitMNIST | Split MNIST dataset. The MNIST dataset is a collection of handwritten digits. It consists of 60,000 training and 10,000 test images of handwritten digits (10 classes), each 28x28 grayscale image. | Same as SplitMNIST class arguments |
SplitTinyImageNet | Split TinyImageNet dataset. The TinyImageNet dataset is a smaller, more manageable version of the ImageNet dataset. It consists of 100,000 training, 10,000 validation and 10,000 test images of 200 classes, each 64x64 color image. | Same as SplitTinyImageNet class arguments |
Combined CL Datasets
Combined CL Dataset | Description | Required Config Fields |
---|---|---|
Combined | Combined CL dataset. We currently support: CIFAR-10, CIFAR-100, MNIST, SVHN, Fashion-MNIST, TrafficSigns, FaceScrub, NotMNIST, EMNIST Digits, EMNIST Letters, Arabic Handwritten Digits, Kannada-MNIST, Sign Language MNIST, Kuzushiji-MNIST, Food-101, Linnaeus 5, Caltech 101, EuroSAT, DTD, Country 211 | Same as Combined class arguments |