Multi-modal ML with OpenAI's CLIP

Language models (LMs) cannot rely on language alone. That is the idea behind the “Experience Grounds Language” paper, which proposes a framework to measure LMs' current and future progress. A key idea is that, beyond a certain threshold, LMs need other forms of data, such as visual input [1] [2].

World Scopes (WS): as datasets become larger in scope and span multiple modalities, the capabilities of models trained on them increase.

This is a companion discussion topic for the original entry at