Ilya Sutskever to John Carmack 30 papers

Updated 18 May 2026

Just some notes from when I read the 30 papers.

Pointer networks

Convex hulls, Delaunay triangulation, traveling salesman

RNN, LSTM

Content based attention

Write up attention mechanism
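
Rough write-up of the attention-as-pointer idea as I understand it: score every encoder state against the current decoder state, softmax over input positions, and read the result as a distribution over which input element to output. The weights W1, W2, v and all values below are made-up toy numbers, not anything from the paper.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_attention(encoder_states, decoder_state, W1, W2, v):
    """u_j = v . tanh(W1 e_j + W2 d); returns softmax(u) over input positions."""
    proj_d = matvec(W2, decoder_state)
    scores = []
    for e in encoder_states:
        proj_e = matvec(W1, e)
        hidden = [math.tanh(a + b) for a, b in zip(proj_e, proj_d)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return softmax(scores)

# Hypothetical toy parameters and states (illustration only)
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
encoder_states = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
decoder_state = [0.5, 0.5]

probs = pointer_attention(encoder_states, decoder_state, W1, W2, v)
pointed_index = max(range(len(probs)), key=probs.__getitem__)
```

The output "vocabulary" is just the input positions, which is why the same model can handle inputs of varying length.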

2.3

4 empirical results

Interesting that they apply to other problems

Variational lossy autoencoder

What is bits back coding?

Can control what representation the model learns. Local vs global features

GPipe

For scaling model size by splitting the model into partitions across accelerators (pipeline parallelism)

Main contribution is using micro batches to reduce size of bubble.
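
Back-of-envelope sketch of why micro-batches help, assuming an idealized pipeline where every partition takes the same time per micro-batch. With K partitions and M micro-batches the idle ("bubble") share of the schedule is roughly (K - 1) / (M + K - 1). The function name is mine.

```python
def bubble_fraction(num_partitions, num_micro_batches):
    """Idle fraction of an idealized pipeline schedule: (K - 1) / (M + K - 1)."""
    if num_partitions < 1 or num_micro_batches < 1:
        raise ValueError("need at least one partition and one micro-batch")
    return (num_partitions - 1) / (num_micro_batches + num_partitions - 1)

# One big batch on 4 partitions vs the same batch split into 16 micro-batches
no_split = bubble_fraction(4, 1)
with_micro_batches = bubble_fraction(4, 16)
```

With 4 partitions, going from 1 to 16 micro-batches drops the idle share from 3/4 to 3/19.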

ImageNet classification

AlexNet

37.5% top-1 error rate

1000 classes

Dropout regularization

Five convolutional layers, three fully connected

Used ReLU

Uses 2 GPUs

Local response normalization

Overlapping pooling

Overfitting problems since 60 million parameters

Data augmentation by image transforms and PCA on RGB values

Dropout. Weights initialized from a zero-mean Gaussian. Biases initialized to 1 in some layers

Color-agnostic kernels on GPU 1, color-specific kernels on GPU 2

Scaling laws for neural language models

Weak dependence on model architecture (depth and width). Performance depends mainly on number of parameters, data, and compute. When increasing model size 8x, data only needs to increase about 5x
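
Quick check of the 8x / 5x note, assuming the data requirement grows sublinearly with parameter count with an exponent of roughly 0.74 (my recollection of the paper's fit; treat the constant as an assumption).

```python
# Assumed exponent in "data needed grows like N ** 0.74"
DATA_EXPONENT = 0.74

def data_multiplier(model_multiplier, exponent=DATA_EXPONENT):
    """How much more data is needed when the model grows by model_multiplier."""
    return model_multiplier ** exponent

growth = data_multiplier(8)  # a bit under 5
```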

What is byte pair encoding?
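
Partial answer to my own question: byte pair encoding builds a vocabulary by repeatedly merging the most frequent adjacent pair of symbols. A minimal character-level sketch (real tokenizers work on bytes and respect word boundaries, which is skipped here; function names are mine):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe(text, num_merges):
    """Start from single characters and apply num_merges greedy merges."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

text = "aaabdaaabac"
tokens, merges = learn_bpe(text, 3)
```

Frequent substrings end up as single tokens, rare ones stay split into smaller pieces, so nothing is ever out of vocabulary.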

Dataset includes pages linked from high-karma Reddit posts

Language modeling versus memorizing the documents?

Need to do early stopping

Given compute there is an optimal model size

There is a limit for transformer models where the two lines cross

10^12 parameters

The First Law of Complexodynamics

Kolmogorov complexity of a string x is the length of the shortest computer program that outputs x
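
Kolmogorov complexity itself is uncomputable, but the length of a compressed string is a computable upper bound (up to a constant for the decompressor). A small sketch using zlib as a stand-in, just to get a feel for it:

```python
import os
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes after zlib compression: a crude upper-bound proxy."""
    return len(zlib.compress(data, 9))

regular = b"ab" * 5000           # highly regular: a tiny program can print it
random_like = os.urandom(10000)  # random bytes: essentially incompressible

regular_len = compressed_length(regular)
random_len = compressed_length(random_like)
```

Both inputs are 10,000 bytes, but the regular one compresses to a small fraction of that while the random one does not shrink.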

Deep Speech 2

Throw a deep NN at it. What is the CTC loss function?
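
Partial answer on CTC: the network emits one symbol (or a blank) per frame, an alignment is collapsed by merging repeats and then dropping blanks, and the loss is the negative log of the total probability of all alignments that collapse to the target. Brute-force sketch for tiny inputs only; real implementations use dynamic programming. The blank symbol "_" and the toy probabilities are my choices.

```python
import itertools
import math

BLANK = "_"

def collapse(path):
    """Merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def ctc_loss_bruteforce(frame_probs, target):
    """-log of the summed probability of every alignment collapsing to target."""
    symbols = list(frame_probs[0].keys())
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            total += math.prod(frame_probs[t][s] for t, s in enumerate(path))
    return -math.log(total)

# Two frames, each putting probability 0.5 on "a" and 0.5 on blank
frame_probs = [{"a": 0.5, BLANK: 0.5}, {"a": 0.5, BLANK: 0.5}]
loss = ctc_loss_bruteforce(frame_probs, "a")
```

Here three alignments ("aa", "a_", "_a") collapse to "a", so the total probability is 0.75.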

What is DNN-HMM?

Uses language model for beam search

Batchnorm

Viterbi alignment

Neural message passing for quantum chemistry

MPNN

Reformulates previous work as MPNNs.

Remember Gaussian 09

GG-NN model

Neural Turing machine

Memory is accessed through an attention mechanism

Read and write heads

Hopfield networks

Content and location based addressing

Cosine similarity

Circular convolution, memory wraps around
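
Sketch of the two addressing pieces noted above, as I understand them: content addressing is a softmax over cosine similarities between a key and each memory row (sharpened by a strength beta), and location addressing shifts the weighting with a circular convolution so it wraps around the ends of memory. Toy memory and parameter values are made up.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)  # small epsilon guards against zero vectors

def content_weights(memory, key, beta):
    """Softmax over beta * cosine similarity between key and each memory row."""
    scores = [beta * cosine_similarity(row, key) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def circular_shift(weights, shift_dist):
    """Circular convolution; shift_dist maps integer offsets to probabilities."""
    n = len(weights)
    shifted = [0.0] * n
    for i in range(n):
        for offset, p in shift_dist.items():
            shifted[i] += weights[(i - offset) % n] * p
    return shifted

memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical 3-slot memory
w_content = content_weights(memory, key=[1.0, 0.0], beta=5.0)
w_wrapped = circular_shift([0.0, 0.0, 1.0], {1: 1.0})  # focus on last slot, shift +1
```

Shifting a weighting focused on the last slot by +1 moves the focus to slot 0, which is the wrap-around.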

Different tasks: copy, repeat copy, associative recall

NTM better than LSTM

Recurrent neural network regularization

Dropout doesn’t work well for RNNs and LSTMs

Only apply dropout to a subset of connections (the non-recurrent ones) to make it work

Also developed by other group.

What is perplexity?
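
Partial answer: perplexity is the exponential of the average negative log-likelihood per token, so lower is better and a model that guesses uniformly among k options has perplexity k. Minimal sketch (the probabilities are toy values):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability the model gave to each true token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every correct token
uniform_over_four = perplexity([0.25] * 8)
# A model that is always certain and always right
perfect = perplexity([1.0] * 8)
```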

Order matters

What is seq2seq?

What is perplexity?

Relational recurrent neural networks

Relational memory core using multi-head dot product attention

MHDPA
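
Sketch of a single head of scaled dot product attention, softmax(QK^T / sqrt(d)) V. The multi-head version runs several of these in parallel on learned linear projections and concatenates the results; the projections are left out here and the toy vectors are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy inputs: the query matches the first key almost exactly
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
attended = dot_product_attention([[10.0, 0.0]], keys, values)
```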

Keeping neural networks simple by minimizing the description length of the weights

Add noise to weights to make network generalize better

Different ways to limit information in weights

MDL principle

Is this still used in practice?

Deep residual learning for image recognition

ResNet, skip connections
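
The skip connection in one line: the block outputs F(x) + x, so the layers only have to learn the residual F. Minimal sketch with plain lists; the activation after the addition is left out and the example functions are made up.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the learned layers F."""
    fx = residual_fn(x)
    return [a + b for a, b in zip(fx, x)]

def zero_fn(x):
    """Stand-in for layers that have learned nothing: F(x) = 0."""
    return [0.0] * len(x)

def double_fn(x):
    """Arbitrary stand-in for learned layers: F(x) = 2x."""
    return [2.0 * v for v in x]

identity_out = residual_block([1.0, 2.0], zero_fn)
residual_out = residual_block([1.0, 2.0], double_fn)
```

If F collapses to zero the block is just the identity, which is why adding more blocks should not make things worse in principle.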

What’s VGG?

What’s 10-crop testing?

Multi-scale context aggregation by dilated convolutions

What is a dilated convolution?
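
Partial answer: a dilated convolution applies the same kernel but with gaps of size `dilation` between the taps, so the receptive field grows without adding parameters. 1D sketch with no padding (deep-learning style, i.e. no kernel flip):

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D convolution with kernel taps spaced `dilation` apart."""
    span = (len(kernel) - 1) * dilation  # distance covered by the kernel
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = dilated_conv1d(signal, kernel, dilation=1)   # each output sees 3 inputs in a row
dilated = dilated_conv1d(signal, kernel, dilation=2) # each output spans 5 inputs
```

Same three weights in both cases; only the spacing changes.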

Typo, stanfard under experiments

Neural machine translation by jointly learning to align and translate

Bottleneck of encoding the whole source into a single fixed-length vector limits performance. Search the source instead.

Encoder: bidirectional RNN. Decoder: searches through the source

RNNsearch model

Identity mappings in deep residual networks

ResNet

New residual unit in figure.

Identified more shortcuts aka skip connections

A simple neural network module for relational reasoning

Relational network

CLEVR tasks

A Tutorial Introduction to the Minimum Description Length Principle

Model selection using MDL

Kolmogorov complexity

Prefix code
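
Note to self on prefix codes: no codeword is a prefix of another, so a concatenated message decodes unambiguously, and the codeword lengths of any binary prefix code satisfy the Kraft inequality (sum of 2^-length is at most 1). Small sketch; the example codes are made up.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (duplicates count as a clash)."""
    words = sorted(codewords)
    # After sorting, any word that extends `a` sits directly after `a`
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def kraft_sum(codewords):
    """Sum of 2^-len(w) over binary codewords; at most 1 for a prefix code."""
    return sum(2.0 ** -len(w) for w in codewords)

good_code = ["0", "10", "110", "111"]
bad_code = ["0", "01", "11"]  # "0" is a prefix of "01"

good_is_prefix_free = is_prefix_free(good_code)
bad_is_prefix_free = is_prefix_free(bad_code)
good_kraft = kraft_sum(good_code)
```

The link to MDL is that a codeword of length L corresponds to a probability of 2^-L, so shorter descriptions map to more probable hypotheses.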
