Ilya Sutskever to John Carmack 30 papers

Updated 18 May 2026

Just some notes from when I read the 30 papers.

The Unreasonable Effectiveness of Recurrent Neural Networks
Attention is All You Need
Understanding LSTM Networks
Machine Super Intelligence
A Tutorial Introduction to the Minimum Description Length Principle
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Neural Machine Translation by Jointly Learning to Align and Translate
A Simple Neural Network Module for Relational Reasoning
Identity Mappings in Deep Residual Networks
Multi-Scale Context Aggregation by Dilated Convolutions
Deep Residual Learning for Image Recognition
Relational Recurrent Neural Networks
Order Matters: Sequence to Sequence for Sets
Recurrent Neural Network Regularization
Neural Turing Machines
Neural Message Passing for Quantum Chemistry
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Stanford’s CS231n Convolutional Neural Networks for Visual Recognition
Kolmogorov Complexity and Algorithmic Randomness
The First Law of Complexodynamics
Scaling Laws for Neural Language Models
ImageNet Classification with Deep Convolutional Neural Networks
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Variational Lossy Autoencoder
Pointer Networks
Quantifying the Rise and Fall of Complexity in Closed Systems: the Coffee Automaton

Pointer networks

Convex hulls, Delaunay triangulation, traveling salesman

RNN, LSTM

Content-based attention

Write up attention mechanism
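
Rough write-up of the content-based attention as I understand it: score each input position with u_j = v . tanh(W1 e_j + W2 d), then softmax over the inputs and use that distribution directly as the pointer. The toy weights and states below are made up for illustration, and `pointer_attention` is my own name, not the paper's code.

```python
import math

def pointer_attention(encoder_states, decoder_state, W1, W2, v):
    """Softmax distribution over input positions (the 'pointer')."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

    proj_d = matvec(W2, decoder_state)
    scores = []
    for e in encoder_states:
        proj_e = matvec(W1, e)
        hidden = [math.tanh(a + b) for a, b in zip(proj_e, proj_d)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Numerically stable softmax over the input positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Toy values, chosen only for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

probs = pointer_attention(encoder_states, decoder_state, W1, W2, v)
pointed_index = max(range(len(probs)), key=probs.__getitem__)
print(probs, pointed_index)
```

The output size follows the input length, which is why the same model can handle variable-size problems like convex hull.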

2.3

4 empirical results

Interesting that they apply to other problems

Variational lossy autoencoder

What is bits back coding?

Can control what representation the model learns: local vs. global features

GPipe

For scaling model size by splitting the model's layers across accelerators

Main contribution is using micro batches to reduce size of bubble.
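
Small sketch of why micro-batches help, assuming an idealized schedule where every partition takes the same time per micro-batch. With K partitions and M micro-batches the paper gives the bubble overhead as O((K - 1) / (M + K - 1)); `bubble_fraction` is my own helper name.

```python
def bubble_fraction(num_partitions, num_micro_batches):
    """Idle fraction of an idealized pipeline: (K - 1) / (M + K - 1)."""
    k, m = num_partitions, num_micro_batches
    return (k - 1) / (m + k - 1)

# More micro-batches per mini-batch shrink the idle bubble.
for m in (1, 4, 16, 32):
    print(m, round(bubble_fraction(4, m), 3))
```

With 4 partitions, a single batch leaves the pipeline idle 75% of the time, while 32 micro-batches bring that under 10%.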

ImageNet classification

AlexNet

37.5% top-1 error rate

1000 classes

Dropout regularization

Five convolutional layers, three fully connected

Used ReLU

Uses 2 GPUs

Local response normalization

Overlapping pooling
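
To remind myself what overlapping pooling means: the pooling window is wider than the stride, so neighbouring windows share inputs. AlexNet uses window 3 with stride 2; the 1-D version and the toy signal below are my own simplification.

```python
def max_pool_1d(xs, size, stride):
    """Max over sliding windows of width `size`, moved by `stride`."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, stride)]

signal = [1, 3, 2, 5, 4, 0, 6]
non_overlapping = max_pool_1d(signal, size=2, stride=2)  # windows do not share inputs
overlapping = max_pool_1d(signal, size=3, stride=2)      # adjacent windows share one input
print(non_overlapping, overlapping)
```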

Overfitting problems since 60 million parameters

Data augmentation by image transforms, PCA on RGB

Dropout. Weights initialized from a zero-mean Gaussian. Biases initialized to 1 in some layers, 0 in the rest

Color-agnostic kernels on GPU 1, color-specific kernels on GPU 2

Scaling laws for neural language models

Weak dependence on depth and width of the model architecture; what matters is the number of parameters, data, and compute. Increase model size 8x, increase data about 5x
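
Checking the 8x / 5x rule of thumb. As I recall, the paper's fit has data growing roughly as N^0.74 with parameter count N; treat the exponent below as my assumption rather than a quoted constant.

```python
DATA_EXPONENT = 0.74  # assumed: D grows roughly as N**0.74

def data_multiplier(model_multiplier, exponent=DATA_EXPONENT):
    """How much more data to add when the model grows by `model_multiplier`."""
    return model_multiplier ** exponent

growth = data_multiplier(8)
print(round(growth, 2))
```

An 8x larger model comes out at a bit under 5x the data, which matches the note above.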

What is byte pair encoding?
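
Partial answer to my own question: BPE repeatedly merges the most frequent adjacent pair of symbols into a new symbol. This is a minimal character-level sketch on a made-up word-count table; the byte-level variant works the same way but starts from bytes. All function names are mine.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Return the learned merges and the re-segmented vocabulary."""
    words = {tuple(w): c for w, c in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy word counts, for illustration only.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```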

Add links from high-karma Reddit posts to the dataset

Language modeling versus memorizing the documents?

Need to do early stopping

Given compute there is an optimal model size

There is a limit for transformer models when two lines cross

10^12 parameters

The First Law of Complexodynamics

Kolmogorov complexity of a string x is the length of the shortest computer program that outputs x
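
Kolmogorov complexity itself is uncomputable, but the size of a compressed string is a crude upper-bound proxy (up to the fixed cost of the decompressor). A sketch using zlib, which is my choice of compressor and not something from the post:

```python
import os
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

regular = b"ab" * 5000          # very short description: "repeat 'ab' 5000 times"
random_like = os.urandom(10000)  # almost certainly incompressible

print(compressed_length(regular), compressed_length(random_like))
```

The repetitive string shrinks to a tiny fraction of its length, while the random bytes barely compress at all.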

Deep Speech 2

Throw a deep NN at it. What is the CTC loss function?
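
Part of the answer on CTC: the network emits one symbol (or a blank) per frame, and many frame-level paths map to the same transcript by merging repeats and then dropping blanks. The loss sums the probability of all paths that collapse to the target. Only the collapse step is sketched here, with `_` as my stand-in for the blank symbol.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch

def ctc_collapse(path):
    """Merge consecutive repeats, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

print(ctc_collapse("hh_e_ll_lo_"))
print(ctc_collapse("hheelllo"))
```

The blank between the two runs of `l` is what lets the first path keep a double letter; without it the repeats merge into one.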

What is DNN-HMM?

Uses language model for beam search

Batchnorm

Viterbi alignment

Neural message passing for quantum chemistry

MPNN

Reformulates previous work as MPNNs.
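
The shared pattern, as I read it: each node sums messages from its neighbours, updates its state, and a readout pools all node states. A minimal sketch with scalar node states and hand-picked message and update functions; the real models use learned functions and vector states.

```python
def message_passing_step(states, edges, message_fn, update_fn):
    """One round: m_v = sum of messages from neighbours, h_v' = update(h_v, m_v)."""
    messages = {v: 0.0 for v in states}
    for (v, w), e in edges.items():
        # Undirected graph: send a message in both directions.
        messages[v] += message_fn(states[v], states[w], e)
        messages[w] += message_fn(states[w], states[v], e)
    return {v: update_fn(states[v], messages[v]) for v in states}

# Toy graph a - b - c with edge features, for illustration only.
states = {"a": 1.0, "b": 2.0, "c": 3.0}
edges = {("a", "b"): 1.0, ("b", "c"): 0.5}

new_states = message_passing_step(
    states, edges,
    message_fn=lambda h_v, h_w, e: e * h_w,  # hypothetical message function
    update_fn=lambda h, m: h + m,            # hypothetical update function
)
readout = sum(new_states.values())  # order-invariant readout over nodes
print(new_states, readout)
```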

Remember Gaussian 09

GG-NN model

Neural Turing machine

Memory accessed through an attention mechanism

Read and write heads

Hopfield networks

Content and location based addressing

Cosine similarity
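
Sketch of content-based addressing as I understand it: compare a key against every memory row with cosine similarity, sharpen with a strength beta, and softmax into read/write weights. The memory contents and beta value are made up.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def content_weights(memory, key, beta):
    """Softmax over beta * cosine(key, row) for each memory row."""
    scores = [beta * cosine_similarity(row, key) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy memory, 3 rows
weights = content_weights(memory, key=[1.0, 0.1], beta=10.0)
print(weights)
```

A larger beta concentrates the weight on the best-matching row; a small beta spreads it out.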

Circular convolution, memory wraps around
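
Location-based addressing sketch: the attention weights are convolved with a distribution over shifts, with indices taken modulo the memory size. Representing the shift distribution as an offset-to-probability dict is my own choice.

```python
def circular_shift(weights, shift_dist):
    """Circular convolution of weights with a distribution over integer offsets."""
    n = len(weights)
    out = [0.0] * n
    for j, w in enumerate(weights):
        for offset, p in shift_dist.items():
            out[(j + offset) % n] += w * p  # index wraps around the memory
    return out

focused_on_last = [0.0, 0.0, 0.0, 1.0]
shifted = circular_shift(focused_on_last, {1: 1.0})
print(shifted)
```

Shifting the last slot forward by one lands on slot 0, which is the wrap-around.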

Different tasks: copy, repeat copy, associative recall

NTM better than LSTM

Recurrent neural network regularization

Dropout doesn’t work well for RNNs and LSTMs

Only apply dropout to a subset of connections (the non-recurrent ones) to make it work
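
Toy sketch of the idea: mask the input going into the cell but leave the hidden-to-hidden path alone, so the memory carried across time steps is never corrupted. I use a plain tanh RNN with scalar elementwise weights to keep it short; the paper works with multi-layer LSTMs, so treat this as an illustration only.

```python
import math
import random

def dropout(vec, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate  # rate must be < 1
    return [x / keep if rng.random() < keep else 0.0 for x in vec]

def rnn_step(x, h_prev, w_in, w_rec, rate, rng):
    # Dropout touches only the input (non-recurrent) connection.
    x = dropout(x, rate, rng)
    # h_prev flows through the recurrent connection unmasked.
    return [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h_prev)]

rng = random.Random(0)
h = [0.0, 0.0, 0.0]
for x in ([1.0, 0.5, -1.0], [0.2, 0.2, 0.2]):
    h = rnn_step(x, h, w_in=0.8, w_rec=0.5, rate=0.5, rng=rng)
print(h)
```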

Also developed by other group.

What is perplexity?
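
Answer for later: perplexity is the exponential of the average negative log-likelihood per token, so it reads as the effective number of equally likely choices the model is hesitating between. The probabilities below are made up.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform_over_four = perplexity([0.25] * 8)   # model always guesses 1 in 4
confident = perplexity([0.9, 0.8, 0.95, 0.9])
print(uniform_over_four, confident)
```

A model that always assigns probability 1/4 to the right token has perplexity of about 4; lower is better.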

Order matters

What is seq2seq?

What is perplexity?

Relational recurrent neural networks

Relational core memory using multi head dot product attention

MHDPA
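
Single-head sketch of scaled dot-product attention, letting each memory row attend over all rows. I skip the learned query/key/value projections (assumed identity here) and the multiple heads, which would run several of these in parallel and concatenate the results.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(queries, keys, values):
    """For each query: softmax(q . k / sqrt(d_k)) weighted sum of values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy memory rows
updated = dot_product_attention(memory, memory, memory)
print(updated)
```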

Keeping neural networks simple by minimizing the description length of the weights

Add noise to weights to make network generalize better

Different ways to limit information in weights

MDL principle

Is this still used in practice?

Deep residual learning for image recognition

ResNet, skip connections
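
The skip connection in one line: the block outputs x + F(x), so if the learned residual F is zero the block is just the identity. Minimal sketch with plain lists; `residual_fn` stands in for the stacked conv layers.

```python
def residual_block(x, residual_fn):
    """y = x + F(x): the input is added back to the block's output."""
    return [xi + fi for xi, fi in zip(x, residual_fn(x))]

def zero_residual(x):
    return [0.0] * len(x)

def small_residual(x):
    return [0.1 * xi for xi in x]

identity_out = residual_block([1.0, 2.0], zero_residual)
adjusted_out = residual_block([1.0, 2.0], small_residual)
print(identity_out, adjusted_out)
```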

What’s VGG?

What’s 10 crop testing?

Multi scale context aggregation by dilated convolutions

What is a dilated convolution?
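
Short answer: a convolution whose kernel taps are spaced `dilation` samples apart, so the receptive field grows without adding weights. A 1-D sketch with no padding and no kernel flip (the usual deep learning convention); the signal and kernel are made up.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1-D convolution with kernel taps spaced `dilation` apart."""
    span = (len(kernel) - 1) * dilation  # distance covered by the kernel
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = dilated_conv1d(signal, kernel, dilation=1)   # taps cover 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # same 3 taps cover 5 samples
print(dense, dilated)
```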

Typo, stanfard under experiments

Neural machine translation by jointly learning to align and translate

Bottleneck: encoding the whole source into one fixed-length vector limits performance. Instead, search the source.

Encoder: bidirectional RNN. Decoder: searches through the source

RNNsearch model

Identity mappings in deep residual networks

ResNet

New residual unit in figure.

Identified more shortcuts aka skip connections

A simple neural network module for relational reasoning

Relation network (RN)
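
The module's form as I read it: apply a function g to every pair of objects, sum the results, then apply f. Because of the sum, the output does not depend on object order. Toy sketch with numbers as objects and hand-picked g and f in place of the learned MLPs.

```python
from itertools import product

def relation_network(objects, g, f):
    """RN(O) = f(sum over all ordered pairs (i, j) of g(o_i, o_j))."""
    return f(sum(g(a, b) for a, b in product(objects, repeat=2)))

def pair_distance(a, b):  # hypothetical stand-in for g
    return abs(a - b)

def identity(x):  # hypothetical stand-in for f
    return x

score = relation_network([1, 4, 6], pair_distance, identity)
score_permuted = relation_network([6, 1, 4], pair_distance, identity)
print(score, score_permuted)
```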

CLEVR tasks

A Tutorial Introduction to the Minimum Description Length Principle

Model selection using MDL

Kolmogorov complexity

Prefix code
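
Reminder of what makes a code a prefix code: no codeword is a prefix of another, which means a bit stream decodes unambiguously without separators. Any binary prefix code also satisfies the Kraft inequality, sum of 2^(-length) <= 1. The example codes are made up.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (checked on sorted neighbours)."""
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def kraft_sum(codewords):
    """Sum of 2**(-len(w)); at most 1 for any binary prefix code."""
    return sum(2.0 ** -len(w) for w in codewords)

good_code = ["0", "10", "110", "111"]
bad_code = ["0", "01", "11"]  # "0" is a prefix of "01"

print(is_prefix_free(good_code), kraft_sum(good_code))
print(is_prefix_free(bad_code), kraft_sum(bad_code))
```

Checking only neighbours in sorted order is enough, because every word that starts with a given codeword sorts directly after it.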