Papernotes: Ruder - An overview of Multi-Task Learning in Deep Neural Networks.

The following are my notes from reading this paper by Sebastian Ruder.

By sharing representations between related tasks, we can enable our model to generalize better on our original task.

If you find yourself optimizing more than one loss function, you are effectively doing multi-task learning.
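To make the "more than one loss function" point concrete, here is a minimal sketch of a joint objective as a weighted sum of per-task losses. The function name `joint_loss`, the example loss values, and the weights are my own illustrative assumptions; they are not taken from the paper.

```python
def joint_loss(task_losses, weights=None):
    """Combine per-task losses into a single scalar objective."""
    if weights is None:
        # Assumption: weight every task equally when no weights are given.
        weights = [1.0] * len(task_losses)
    if len(weights) != len(task_losses):
        raise ValueError("need exactly one weight per task loss")
    return sum(w * loss for w, loss in zip(weights, task_losses))


# Hypothetical loss values for one training step.
main_loss = 0.8  # loss on the task we actually care about
aux_loss = 0.5   # loss on an auxiliary task
total = joint_loss([main_loss, aux_loss], weights=[1.0, 0.4])
print(total)
```

Minimizing `total` instead of `main_loss` alone is what turns ordinary training into multi-task training. How to choose the weights is a separate question that this sketch does not address.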

“MTL improves generalization by leveraging the domain specific information contained in the training signals of related tasks”.

An inductive bias is provided by auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task.

In deep neural networks, MTL is typically done with either hard or soft parameter sharing of hidden layers.

Hard parameter sharing

— sharing the hidden layers between all tasks, while keeping separate task-specific output layers.

— reduces the risk of overfitting.

Image Source: Sebastian Ruder
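As a rough illustration of hard parameter sharing, the sketch below runs a forward pass through one shared hidden layer feeding two task-specific output layers. It is a toy in plain Python, not the paper's code: the layer sizes, the task names `task_a` and `task_b`, and the helpers `linear`, `relu` and `init_layer` are all assumptions made for the example.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases for a layer with n_in inputs."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
shared = init_layer(4, 3, rng)        # hidden layer shared by all tasks
heads = {
    "task_a": init_layer(2, 4, rng),  # task-specific output layer
    "task_b": init_layer(1, 4, rng),  # task-specific output layer
}


def forward(x):
    """Compute the shared representation once, then apply every head."""
    hidden = relu(linear(x, *shared))
    return {name: linear(hidden, *params) for name, params in heads.items()}


outputs = forward([0.2, -1.0, 0.5])
print({name: len(values) for name, values in outputs.items()})
```

Because every task's loss backpropagates into `shared`, the shared weights have to work for all tasks at once. That is the intuition behind the reduced overfitting risk noted above.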

Soft parameter sharing

— each task has its own model with its own parameters.

— the distance between the parameters of the models is then regularized to encourage the parameters to be similar.
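A minimal sketch of the soft-sharing idea, assuming a squared L2 distance as the regularizer (the paper mentions other choices, such as the trace norm). The flat parameter lists, the function names, and the `strength` value are hypothetical and only for illustration.

```python
def squared_l2_distance(params_a, params_b):
    """Sum of squared differences between two equally sized parameter lists."""
    if len(params_a) != len(params_b):
        raise ValueError("parameter lists must have the same length")
    return sum((a - b) ** 2 for a, b in zip(params_a, params_b))


def soft_sharing_loss(loss_a, loss_b, params_a, params_b, strength=0.1):
    """Both task losses plus a penalty that pulls the two models together."""
    penalty = squared_l2_distance(params_a, params_b)
    return loss_a + loss_b + strength * penalty


# Hypothetical flattened parameters of two task-specific models.
params_a = [1.0, 2.0, 3.0]
params_b = [1.0, 2.5, 2.0]
penalty = squared_l2_distance(params_a, params_b)
total = soft_sharing_loss(0.5, 0.5, params_a, params_b, strength=0.1)
print(penalty, total)
```

With `strength` set to zero the two models train independently. Larger values push their parameters closer together, moving towards the behaviour of hard sharing.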

Why does MTL work?

— increases the effective sample size for training the model.

— averages the noise patterns over multiple tasks.

— focuses attention on features that matter (how?)

— allows the model to eavesdrop on another task, to better learn the features for this task.

— representation bias: the model is pushed towards representations that other tasks also prefer.


My updates:

One of our interns is trying hard parameter sharing for an NLP task that I have been working on since the start of the year. We hope to submit our work to a conference later in the year.
