By sharing representations between related tasks, we can enable our model to generalize better on our original task.
If you find yourself optimizing more than one loss function, you are effectively doing multi-task learning.
“MTL improves generalization by leveraging the domain specific information contained in the training signals of related tasks”.
An inductive bias is provided by auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task.
In deep neural networks, MTL is typically done with either hard or soft parameter sharing of hidden layers.
Hard parameter sharing
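— shares the hidden layers between all tasks, while each task keeps its own output layers.
— reduces the risk of overfitting, because the shared representation has to fit every task at once.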
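To make hard parameter sharing concrete, here is a minimal NumPy sketch of the forward pass and the joint loss. Everything specific in it is an assumption made for illustration: the layer sizes, the two hypothetical tasks (`task_a` as 3-class classification, `task_b` as regression), and the choice to simply add the two losses without weighting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 input features, 16 shared hidden units.
n_in, n_hidden = 8, 16

# Shared hidden layer: one set of weights used by every task.
W_shared = rng.normal(scale=0.1, size=(n_in, n_hidden))
b_shared = np.zeros(n_hidden)

# Task-specific output heads (task_a: 3 class scores, task_b: 1 regression value).
heads = {
    "task_a": (rng.normal(scale=0.1, size=(n_hidden, 3)), np.zeros(3)),
    "task_b": (rng.normal(scale=0.1, size=(n_hidden, 1)), np.zeros(1)),
}

def forward(x):
    """Return the shared representation and one output per task."""
    h = np.tanh(x @ W_shared + b_shared)  # computed once, reused by all heads
    outputs = {name: h @ W + b for name, (W, b) in heads.items()}
    return h, outputs

def joint_loss(outputs, y_a, y_b):
    """Unweighted sum of softmax cross-entropy (task_a) and mean squared error (task_b)."""
    logits = outputs["task_a"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_a = -log_probs[np.arange(len(y_a)), y_a].mean()
    loss_b = ((outputs["task_b"][:, 0] - y_b) ** 2).mean()
    return loss_a + loss_b

# Toy batch of 4 examples with made-up labels.
x = rng.normal(size=(4, n_in))
y_a = np.array([0, 2, 1, 0])
y_b = rng.normal(size=4)

h, outputs = forward(x)
loss = joint_loss(outputs, y_a, y_b)
```

Because both losses are computed from the same `h`, gradients from both tasks would flow into `W_shared` during training, which is what ties the tasks together.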
Soft parameter sharing
— each task has its own model with its own parameters.
— the distance between the parameters of the models is then regularized to encourage the parameters to be similar.
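A rough sketch of the soft sharing regularizer, again in NumPy. It assumes two hypothetical models with identically shaped parameters and uses a squared L2 distance as the penalty. That is one common choice rather than the only one, and the `strength` value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_model(n_in=8, n_hidden=16):
    """Each task gets its own parameters (sizes are illustrative)."""
    return {
        "W": rng.normal(scale=0.1, size=(n_in, n_hidden)),
        "b": np.zeros(n_hidden),
    }

model_a = init_model()
model_b = init_model()

def soft_sharing_penalty(params_a, params_b):
    """Squared L2 distance between corresponding parameters of the two models."""
    return sum(np.sum((params_a[k] - params_b[k]) ** 2) for k in params_a)

def total_loss(loss_a, loss_b, params_a, params_b, strength=0.01):
    """Per-task losses plus the distance penalty, scaled by a tunable strength."""
    return loss_a + loss_b + strength * soft_sharing_penalty(params_a, params_b)

penalty_before = soft_sharing_penalty(model_a, model_b)

# One gradient step on the penalty alone: d/dA ||A - B||^2 = 2 (A - B),
# and the gradient with respect to B is the negative of that.
lr = 0.1
for k in model_a:
    diff = model_a[k] - model_b[k]
    model_a[k] = model_a[k] - lr * 2 * diff
    model_b[k] = model_b[k] + lr * 2 * diff

penalty_after = soft_sharing_penalty(model_a, model_b)
```

The step on the penalty alone pulls the two parameter sets toward each other. In real training this pull competes with each task's own loss, so the models end up similar but not identical.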
Why does MTL work?
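— implicitly increases the sample size used to train the model.
— averages the noise patterns over multiple tasks, since each task carries different noise.
— focuses attention on features that matter, because the other tasks provide additional evidence for whether a feature is relevant.
— allows the model to eavesdrop on another task, to better learn features that are hard to learn from this task alone.
— introduces a representation bias: the model prefers representations that other tasks also prefer.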
One of our interns is trying hard parameter sharing on an NLP task that I have been working on since the start of the year. We hope to submit our work to a conference later in the year.