Unsupervised
"Language modeling is usually framed as unsupervised distribution estimation"
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (GPT-2 paper, "Language Models are Unsupervised Multitask Learners")
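For reference, the "unsupervised distribution estimation" being described is the standard chain-rule factorization of a sequence's joint probability into next-symbol conditionals, which requires no human labels:

p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})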
"Another notable family within unsupervised learning are autoregressive models, in which the data is split into a sequence of small pieces, each of which is predicted in turn. Such models can be used to generate data by successively guessing what will come next, feeding in a guess as input and guessing again. Language models, where each word is predicted from the words before it, are perhaps the best known example"
https://deepmind.com/blog/article/unsupervised-learning
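To make the "feeding in a guess as input and guessing again" loop concrete, here is a minimal sketch; `model_probs` is a hypothetical stand-in for a trained language model that returns a probability for each vocabulary item given the sequence so far.

```python
import random

def generate(model_probs, prompt, num_tokens):
    """Autoregressive generation: sample the next token from the model's
    prediction over everything produced so far, append it, and repeat."""
    sequence = list(prompt)
    for _ in range(num_tokens):
        probs = model_probs(sequence)              # dict: token -> probability
        tokens, weights = zip(*probs.items())
        next_token = random.choices(tokens, weights=weights)[0]
        sequence.append(next_token)                # the guess becomes input for the next step
    return sequence
```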
"Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1."
https://arxiv.org/pdf/1810.04805.pdf (BERT paper)
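The two tasks referred to are masked language modeling and next-sentence prediction. A toy sketch of how masked-LM training pairs can be manufactured from unannotated text (simplified: the actual BERT recipe masks roughly 15% of tokens and sometimes substitutes a random or unchanged token instead of [MASK]):

```python
import random

MASK_TOKEN = "[MASK]"
MASK_RATE = 0.15

def make_mlm_example(tokens):
    """Masked LM: hide a random subset of tokens; the hidden originals are the
    labels, so the 'annotation' comes entirely from the text itself."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < MASK_RATE:
            inputs.append(MASK_TOKEN)
            labels.append(tok)        # predict the original token here
        else:
            inputs.append(tok)
            labels.append(None)       # position not scored in the loss
    return inputs, labels
```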
Self-supervised
"This idea has been widely used in language modeling. The default task for a language model is to predict the next word given the past sequence. BERT adds two other auxiliary tasks and both rely on self-generated labels."
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html
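The "default task" mentioned here generates its own labels simply by shifting the text: every prefix is an input and the word that follows it is the target. A minimal illustration:

```python
def next_word_pairs(tokens):
    """Each (prefix, next word) pair is a labeled example extracted from raw text."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# next_word_pairs(["the", "cat", "sat"])
# -> [(['the'], 'cat'), (['the', 'cat'], 'sat')]
```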
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://arxiv.org/abs/1909.11942
"A robustly optimized method for pretraining natural language processing (NLP) systems that improves on Bidirectional Encoder Representations from Transformers, or BERT, the self-supervised method released by Google in 2018. BERT is a revolutionary technique that achieved state-of-the-art results on a range of NLP tasks while relying on unannotated text drawn from the web, as opposed to a language corpus that’s been labeled specifically for a given task."
https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/
Self-supervised is Unsupervised
Self-supervised learning is an elegant subset of unsupervised learning where you can generate output labels ‘intrinsically’ from data objects by exposing a relation between parts of the object, or different views of the object.
https://towardsdatascience.com/self-supervised-learning-78bdd989c88b
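As a concrete illustration of "exposing a relation between parts of the object": a sketch that builds binary labels by asking whether two text segments come from the same document, in the same spirit as BERT's next-sentence task. The split-in-half scheme and the assumption of at least two documents are simplifications for illustration, not any particular paper's recipe.

```python
import random

def make_pair_examples(documents):
    """Self-generated labels from a relation between parts of the data:
    two halves of the same document -> 1, halves of different documents -> 0.
    Assumes `documents` contains at least two items."""
    examples = []
    for doc in documents:
        first, second = doc[: len(doc) // 2], doc[len(doc) // 2 :]
        examples.append((first, second, 1))                      # related parts
        other = random.choice([d for d in documents if d is not doc])
        examples.append((first, other[len(other) // 2 :], 0))    # unrelated parts
    return examples
```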