1 Reviewer W3dN

1.1 Rebuttal

Thanks for your valuable comments! Here are our responses to the weaknesses and questions you raised.

Response to weakness 1:

We have already run experiments on a third dataset, Combined-20. Please find it in supplemental material section B. Combined-20 is a much more complex and realistic benchmark.

There are mainly three ways to construct continual learning datasets: permute the pixels, split by class, and combine different dataset sources. Combined-20 represents the third way by combining 20 distinct datasets commonly used in machine learning, each forming a separate task. The 20 datasets are: CIFAR-10, CIFAR-100, MNIST, SVHN, Fashion-MNIST, TrafficSigns, FaceScrub, NotMNIST, EMNIST Digits, EMNIST Letters, Arabic Handwritten Digits, Kannada-MNIST, Sign Language MNIST, Kuzushiji-MNIST, Food-101, Linnaeus 5, Caltech 101, EuroSAT, DTD, Country 211.

Response to weakness 2 and question 3:

We stated the training time of our experiments in supplemental material section F, part “Training Details”. The supplemental material also provides the experimental settings and detailed information about the environment in which we ran the code, including GPUs, Python package versions (PyTorch, Captum), and seeds, so the reported training times are practical and reproducible.

In our measurements, FG-AdaHAT trains in a reasonable number of hours with most of our importance measures, with one exception: Feature Ablation, which significantly exceeded the time budget under the same experimental conditions. However, Feature Ablation is a very basic way to calculate neuron importance, and many alternative measures are available.

We will organise these timings into more detailed profiling results in the camera-ready version.

Response to weakness 3 and question 2:

We included an analysis of the aggregation strategies in supplemental material section G (Hyperparameter Study). It is on page 9 of the supplemental material, just above the references.

Regarding sensitivity to the aggregation strategy, our analysis states that minimum consistently performs best across all settings, while the others perform similarly. The analysis also gives possible reasons for this.

We stated in supplemental material section F, part “Hyperparameters”, that we use the minimum aggregation strategy for all experiments.

Response to question 1:

Unfortunately, FG-AdaHAT is based on the HAT architecture, which is applicable to task-incremental learning (TIL) only. HAT requires the task ID of the input data in order to choose the corresponding mask, and only TIL provides this [39]. In fact, most similar architecture-based approaches require task information in advance, which makes them inapplicable to other scenarios as well.
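To make the dependence on the task ID concrete, here is a minimal NumPy sketch of HAT-style gating in the spirit of [39]: each task owns an embedding per layer, and the layer output is multiplied by a sigmoid gate of that embedding. The names `task_embeddings`, `hat_mask`, and `masked_forward`, the layer sizes, and the scale `s` are illustrative assumptions and not our actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_TASKS, IN_DIM, HIDDEN = 3, 8, 5    # hypothetical sizes
W = rng.normal(size=(IN_DIM, HIDDEN))  # layer weights shared by all tasks
# One embedding vector per task for this layer (learned in HAT; random here).
task_embeddings = rng.normal(size=(NUM_TASKS, HIDDEN))


def hat_mask(task_id: int, s: float = 400.0) -> np.ndarray:
    """Near-binary attention mask sigmoid(s * e_t), one value per neuron."""
    z = np.clip(s * task_embeddings[task_id], -50.0, 50.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))


def masked_forward(x: np.ndarray, task_id: int) -> np.ndarray:
    """Forward pass of one layer; the task ID decides which neurons are active."""
    h = np.maximum(x @ W, 0.0)      # ReLU activation
    return h * hat_mask(task_id)    # gate the neurons with the task's mask


x = rng.normal(size=(2, IN_DIM))
out_task0 = masked_forward(x, task_id=0)
out_task1 = masked_forward(x, task_id=1)
```

In this sketch `masked_forward` cannot be called without a `task_id`, which is the practical reason the method is tied to TIL.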

References:

[39] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pages 4548–4557. PMLR, 2018.

1.2 Reply from the Reviewer

Thanks for the response. After reading the rebuttal and reviewing the supplemental material they referenced, I believe my concerns have been partially addressed. I would like to suggest that the authors include the experiments on the third dataset in the main paper as well. While I understand that the proposed FG-AdaHAT is built upon the HAT architecture and thus inherits the need for a task ID during inference, I would like to raise a concern regarding the practicality of such a requirement. In real-world scenarios, task ID information is often unavailable during deployment. Have the authors considered extending their method to a task ID-free setting, such as class-incremental, where task boundaries are unknown? This could significantly broaden the applicability and impact of the proposed method. While the authors claim that architecture-based methods are mostly applicable to TIL, this assertion does not entirely hold. Recent works such as LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning (CVPR’25) propose architecture-based continual learning methods for class-incremental settings that do not require task identity. I would like to maintain my original rating.

1.3 Reply from Us

Thank you for your feedback!

We will move the third benchmark to the main paper in the camera-ready version.

Thank you also for your comment on architecture-based methods for class-incremental learning. We plan to extend our method to other continual learning settings in future work, so your reference is very helpful!

2 Reviewer wpHL

2.1 Rebuttal

Thanks for your valuable comments! Here are our responses to the weaknesses and questions you raised.

Response to weakness 1 and question 1:

Due to space limitations, we moved the full experimental results to the supplemental material. As a result, some vital results, such as those on the third dataset, Combined-20, are not included in the main paper.

We did this to present our methodology clearly, as it involves two key parts: the FG-AdaHAT framework (section 3) and fine-grained neuron importance (section 4), both of which need a thorough explanation.

In the camera-ready version, we will revise our writing to make room for the vital experimental results.

Response to weakness 2 and question 2:

We have already run experiments on a third dataset, Combined-20. Please find it in supplemental material section B. Combined-20 is a much more complex and larger benchmark.

The details of the model architecture are also included in the supplemental material; please refer to section F, part “Network Architecture”. In short, we use a simple multi-layer perceptron (MLP) for simple datasets like Permuted MNIST, and ResNet-18 for complex datasets like Split CIFAR-100 and Combined-20. These architectures are sufficient for these datasets, and the three datasets together represent the three common ways of constructing continual learning datasets (permuting pixels, splitting by class, and combining different dataset sources). Due to limited resources, we cannot afford larger settings.

Response to weakness 3:

Due to space limits, we moved the full experimental results and the detailed experimental settings to supplemental material sections C and F, respectively. Section F details our choices of network architecture, hyperparameters, optimizers, training epochs and batch sizes, GPUs, seeds, training times, and the Python packages we used.

2.2 Reply from the Reviewer

The authors’ rebuttal has partially resolved my questions and doubts, but I still hope that the authors will make the work more novel and solid. Since it has partially answered my doubts, I will increase my score.

2.3 Reply from Us

Thank you for your feedback!

3 Reviewer vi5k

3.1 Rebuttal

Thanks for your valuable comments! Here are our responses to the weaknesses and questions you raised.

Response to weakness 1:

We have already run experiments on a third dataset, Combined-20. Please find it in supplemental material section B. Combined-20 is a much more complex and larger benchmark.

Response to weakness 2:

Before the neuron importance measures are applied to guide our gradient adjustment, they are scaled to \([0,1]\) by min-max scaling, so that they are insensitive to the scale of the raw values. Please refer to section 4. For example, the raw neuron importance measure Output Gradients (OG) is scaled to \([0,1]\), and the scaled value is the importance score we use.
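As a minimal sketch of this scaling step, the snippet below rescales a vector of raw importance values to \([0,1]\) and shows that multiplying the raw values by a constant does not change the resulting scores. The function name `min_max_scale`, the `eps` guard, the example values, and the choice to scale one vector at a time are assumptions made for illustration.

```python
import numpy as np


def min_max_scale(raw: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Rescale raw importance values to [0, 1]; eps guards against a constant input."""
    lo, hi = raw.min(), raw.max()
    return (raw - lo) / (hi - lo + eps)


raw_importance = np.array([0.002, 0.010, 0.006, 0.004])   # made-up raw values
scores = min_max_scale(raw_importance)
scores_rescaled = min_max_scale(1000.0 * raw_importance)  # same values on a larger scale
```

Both `scores` and `scores_rescaled` come out as approximately `[0, 1, 0.5, 0.25]`, which is the sense in which the importance scores do not depend on the raw value scale.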

Beyond AdaHAT’s adjustment intensity \(\alpha\), our work has only two hyperparameters: the base value \(b_L\) and the choice of \(Agg()\). Both are discussed in supplemental material section G (Hyperparameter Study), which shows that neither plays a key role in our method and that both contribute little. Another potential hyperparameter you might be concerned about is the choice of fine-grained importance measure. We discuss this in both section 5.2 (Result Analysis) and supplemental material section C (Full Experimental Results): “overall performance is largely similar and consistent across all 9 different measures, with only minor differences”.

Response to weakness 3:

As we implied in section 2, architecture-based continual learning approaches fall roughly into two categories: (1) pre-allocating a fixed network for future tasks, and (2) dynamic incremental networks. The issue you raise in weakness 3 is a general problem of all architecture-based approaches in the former category. We argue that a pre-allocated fixed network is still preferable to dynamic incremental networks, as the latter often lead to linearly increasing memory cost.

Response to weakness 4:

Our method is based on the HAT architecture. In HAT, many neurons are masked by multiple tasks (Figure 2 in the original AdaHAT paper [49] illustrates this very clearly), so network parameters are shared across tasks. Our method is therefore not pure parameter isolation. The shared parameters encourage knowledge sharing and positive transfer.

Moreover, this sharing is highly adaptive because the mask itself is learned (it is gated from a learnable parameter called the task embedding; please refer to section 2.2 of the HAT paper [39]). FG-AdaHAT therefore learns both how much to share (the overlap ratio of the masks) and which parts of the network to share.
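To illustrate what we mean by masks overlapping, the sketch below takes two made-up binary neuron masks over the same layer, lists the neurons both tasks use, and computes one possible overlap ratio. The masks, the function name `overlap_ratio`, and the use of the Jaccard index as the ratio are assumptions for illustration; the paper does not define the ratio this way.

```python
import numpy as np

# Hypothetical binary neuron masks of two tasks over the same 8-neuron layer.
mask_task1 = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=bool)
mask_task2 = np.array([0, 1, 0, 1, 1, 0, 0, 0], dtype=bool)


def overlap_ratio(a: np.ndarray, b: np.ndarray) -> float:
    """Share of neurons used by either task that are used by both (Jaccard index)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


shared = np.flatnonzero(mask_task1 & mask_task2)  # indices of neurons both tasks rely on
ratio = overlap_ratio(mask_task1, mask_task2)
```

Here neurons 1 and 3 are shared, giving a ratio of 0.4. Parameters attached to such shared neurons are where knowledge transfer between the two tasks can happen.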

Response to weakness 5:

Our mask is a neuron mask rather than a parameter mask, meaning that each neuron, rather than each parameter, has a mask value.
The mask for each task is stored as binary values (0 or 1) after training on that task, which takes significantly less memory than float values.

These two facts mean that our memory overhead is much smaller than it may appear. In the example you raised, 20 tasks do correspond to 20 sets of masks, but these take up far less space than 20 backbones (which are stored at the parameter level rather than the neuron level). Compared with many architecture-based approaches such as Progressive Networks [36], DEN [53], and even the recent Winning Subnetworks (WSN) [18], which uses a parameter mask, FG-AdaHAT achieves decent performance without incurring such a large memory overhead.
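A back-of-the-envelope calculation illustrates the two points above. The layer widths below describe a hypothetical MLP rather than the architecture in our paper, biases are ignored, and we assume float32 weights and one bit per mask value; all variable names are illustrative.

```python
# Hypothetical MLP: layer widths are illustrative, not the paper's architecture.
layer_widths = [784, 256, 256, 10]
num_tasks = 20

# Weights between consecutive layers (biases ignored for simplicity).
num_params = sum(a * b for a, b in zip(layer_widths[:-1], layer_widths[1:]))
# One mask value per hidden neuron (input and output layers are not masked here).
num_masked_neurons = sum(layer_widths[1:-1])

backbone_bytes = num_params * 4                         # float32 weights
neuron_mask_bytes = num_tasks * num_masked_neurons / 8  # 1 bit per neuron per task
param_mask_bytes = num_tasks * num_params / 8           # 1 bit per weight per task
twenty_backbones_bytes = num_tasks * backbone_bytes
```

Under these assumptions the 20 neuron masks take 1,280 bytes, compared with 672,000 bytes for 20 binary parameter-level masks and 21,504,000 bytes for 20 float32 copies of the backbone.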

The gap between neuron-level and parameter-level memory overhead is large. Nevertheless, we will add a quantitative analysis of memory cost in the camera-ready version to make this formal.

Response to weakness 6:

We explained in section 3, part “Task-specific Importance Scheduling”, that the gradient adjustment mechanism inevitably updates parameters allocated to previous tasks, which leads to a performance drop on the first few tasks. We can only alleviate this early problem, not fully eliminate it. The importance scheduler is designed to alleviate it (as stated in the same part) and has proven effective; the last paragraph of section 5.2 (Result Analysis) is devoted to this. Essentially, FG-AdaHAT starts to outperform HAT earlier in the task sequence than AdaHAT does, and on Permuted MNIST FG-AdaHAT outperforms HAT throughout. That paragraph also explains why.

On the other hand, the original AdaHAT [49] and our FG-AdaHAT focus on long task sequences, which is what continual learning is mostly about. If the only goal were high performance on the first few tasks, one could give up continual learning and simply use multi-task learning. We argue that it is worth sacrificing a little early performance to improve performance in the long run.

Other comments:

To avoid any confusion: our paper focuses on task-incremental learning (TIL), not class-incremental learning (CIL). Thank you! :)

References:

[18] Haeyong Kang, Rusty John Lloyd Mina, Sultan Rizky Hikmawan Madjid, Jaehong Yoon, Mark Hasegawa-Johnson, Sung Ju Hwang, and Chang D Yoo. Forget-free continual learning with winning subnetworks. In International Conference on Machine Learning, pages 10734–10750. PMLR, 2022.

[36] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[49] Pengxiang Wang, Hongbo Bo, Jun Hong, Weiru Liu, and Kedian Mu. Adahat: Adaptive hard attention to the task in task-incremental learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 143–160. Springer, 2024.

[53] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.

3.2 Reply from the Reviewer

I thank the authors for the rebuttal. It partially addresses my concerns regarding the masking cost and setting details. However, other concerns, like the scaling to larger-scale datasets (where the authors seem to avoid discussing it by using MNIST-like toy datasets and variations) or larger backbones, the implementation in task-agnostic settings (like CIL), are not well-addressed. I will maintain my initial rating.

3.3 Reply from Us

Thank you for your feedback!

We plan to extend our method to other continual learning settings in future work. Due to limited resources, we could not afford datasets larger than Combined-20 and the corresponding backbones when we worked on this paper, but we will evaluate on larger datasets when resources permit.

4 Reviewer yjGf

4.1 Rebuttal

Thanks for your valuable comments! Here are our responses to the weaknesses and questions you raised.

Response to weakness 1:

We have already run experiments on a third dataset, Combined-20. Please find it in supplemental material section B. Combined-20 is a much more complex and realistic benchmark.

Response to weakness 2 and question 3:

We stated in both section 5.1 (Experimental Setup) and supplemental material section E (Definition of Evaluation Metrics) that forward transfer (FWT) is the metric we use to measure plasticity.

Our main goal is to achieve better average performance over all tasks, which we do by better balancing stability and plasticity. Our goal is therefore not to improve plasticity alone but to improve both. (Note that improving plasticity alone often leads to poor stability because of the stability-plasticity dilemma.)

Response to weakness 3 and weakness 4:

We will add more baselines for comparison in the camera-ready version.

Response to weakness 5:

We will add them later. By the way, the first citation you mentioned, “NISPA”, is already included in our related work section as an example of how architecture-based approaches address the stability-plasticity trade-off; see reference [16].

Response to question 2:

We stated in section 5.2 (Result Analysis) that “many FG-AdaHAT variants improve BWT over AdaHAT while sacrificing only a small amount of FWT”. As you observed, forward transfer (FWT) hardly improves with our method and in fact decreases. This is because stability (measured by BWT) and plasticity (measured by FWT) trade off against each other, as we mention many times in the paper. Methods often get stuck in the stability-plasticity dilemma, where improving one tends to degrade the other. The best case is to improve both, but this cannot be achieved in most cases. Our results, which improve stability substantially while sacrificing a little plasticity, are therefore still good. (We do have some cases that improve both, as stated in section 5.2.) What matters most in continual learning is a balance between stability and plasticity, rather than strictly improving both.

This trade-off has been discussed extensively in the continual learning research community. Please see the survey paper [48] that we cite if you are interested.

References:

[16] Mustafa Burak Gurbuz and Constantine Dovrolis. Nispa: Neuro-inspired stability-plasticity adaptation for continual learning in sparse networks. arXiv preprint arXiv:2206.09117, 2022.

[48] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

4.2 Reply from the Reviewer

I thank the authors for their response.

I would suggest including the results on the third benchmark in the main paper.

I think there is a difference between plasticity (the ability to learn new task) and forward transfer (the influence of old tasks on the performance of new task). I would recommend making this difference more clear.

I believe the paper would be strengthened by a direct comparison with the mentioned related works, rather than only discussing them in the related work.

As my concerns are partially addressed, I will keep my score.

4.3 Reply from Us

Thank you for your feedback! In the camera-ready version, we will move the third benchmark to the main paper and add more baselines for comparison.

Regarding your remaining question about plasticity and forward transfer, we would like to clarify that we follow the established interpretation from the survey [48] and the original AdaHAT paper [49]. Page 3 of [48] states that “learning plasticity can be evaluated by forward transfer (FWT)”, and the definition of FWT given in its equation (7) matches our formulation exactly. We evaluate the stability-plasticity trade-off in the same way as [49], where plasticity is also measured through FWT.
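For concreteness, the sketch below computes BWT and FWT from an accuracy matrix, assuming the commonly used definitions: BWT averages the change in accuracy on earlier tasks after the last task has been learned, and FWT averages the gap between the accuracy on each new task right after learning it and the accuracy of a reference model trained on that task alone. The accuracy values and the names `acc`, `ref_acc`, `backward_transfer`, and `forward_transfer` are made up for illustration; our exact formulas are those in supplemental material section E.

```python
import numpy as np

# acc[i, j]: accuracy on task j after training on tasks 0..i (made-up numbers).
acc = np.array([
    [0.90, 0.00, 0.00],
    [0.86, 0.80, 0.00],
    [0.84, 0.78, 0.70],
])
# Accuracy of a reference model trained on each task alone (made-up numbers).
ref_acc = np.array([0.90, 0.83, 0.74])


def backward_transfer(acc: np.ndarray) -> float:
    """Mean accuracy change on earlier tasks after learning the last task (stability)."""
    k = acc.shape[0]
    return float(np.mean(acc[k - 1, : k - 1] - np.diag(acc)[: k - 1]))


def forward_transfer(acc: np.ndarray, ref_acc: np.ndarray) -> float:
    """Mean gap between continual and reference accuracy on the second task onward."""
    return float(np.mean(np.diag(acc)[1:] - ref_acc[1:]))


bwt = backward_transfer(acc)
fwt = forward_transfer(acc, ref_acc)
```

With these made-up numbers BWT is -0.04 and FWT is -0.035. Under this definition FWT compares learning a task within the sequence against learning it in isolation, which is why it is read as a measure of plasticity.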

References:

[48] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

5 Advice That May Be Useful

Verbose and lacks focus. Make it concise. Allocate more space to the experiment section.
Add profiling results of different importance measures.
More advanced datasets and architectures. Follow the trends in deep learning.
More baselines.
Extend to CIL.
Sensitivity analysis?
Add memory cost analysis.
Move essential parts to main paper as much as possible. Supplemental material might not be published.