Ilya Sutskever to John Carmack 30 papers
Updated 18 May 2026
Just some notes from when I read the 30 papers.
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Attention is All You Need
- Understanding LSTM Networks
- Machine Super Intelligence
- A Tutorial Introduction to the Minimum Description Length Principle
- Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
- Neural Machine Translation by Jointly Learning to Align and Translate
- A Simple Neural Network Module for Relational Reasoning
- Identity Mappings in Deep Residual Networks
- Multi-Scale Context Aggregation by Dilated Convolutions
- Deep Residual Learning for Image Recognition
- Relational Recurrent Neural Networks
- Order Matters: Sequence to Sequence for Sets
- Recurrent Neural Network Regularization
- Neural Turing Machines
- Neural Message Passing for Quantum Chemistry
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
- Stanford’s CS231n Convolutional Neural Networks for Visual Recognition
- Kolmogorov Complexity and Algorithmic Randomness
- The First Law of Complexodynamics
- Scaling Laws for Neural Language Models
- ImageNet Classification with Deep Convolutional Neural Networks
- GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
- Variational Lossy Autoencoder
- Pointer Networks
- Quantifying the Rise and Fall of Complexity in Closed Systems: the Coffee Automaton
Pointer networks
Convex hulls, Delaunay triangulation, traveling salesman
RNN, LSTM
Content based attention
Write up attention mechanism
Section 2.3
Section 4: empirical results
Interesting that they apply to other problems
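A minimal sketch of the content-based attention a pointer network uses, as I understand it: score each input position with u_j = v . tanh(W1 e_j + W2 d), then take the softmax over positions as the output distribution. The names `pointer_distribution`, `W1`, `W2`, `v` and the toy numbers are my own assumptions for illustration, not values from the paper.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Return softmax(u) with u_j = v . tanh(W1 e_j + W2 d)."""
    d_proj = matvec(W2, decoder_state)
    scores = []
    for e in encoder_states:
        e_proj = matvec(W1, e)
        hidden = [math.tanh(a + b) for a, b in zip(e_proj, d_proj)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))
    return softmax(scores)

# Toy example (hypothetical values): three input positions, 2-d states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = pointer_distribution(encoder_states, decoder_state, identity, identity, [1.0, 1.0])
pointed_index = max(range(len(probs)), key=probs.__getitem__)
```

The output is a distribution over input positions, so the "vocabulary" grows and shrinks with the input length. That is what lets the same model handle convex hull or TSP instances of different sizes.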
Variational lossy autoencoder
What is bits back coding?
Can control what representation the model learns. Local vs global features
GPipe
For scaling model size by splitting the model across accelerators
Main contribution is using micro-batches to reduce the size of the pipeline bubble.
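A back-of-the-envelope sketch of why micro-batches shrink the bubble. It assumes an idealized pipeline where every partition takes the same time per micro-batch, so a pass over M micro-batches through K partitions takes M + K - 1 time slots, of which K - 1 are idle on each device. The function name `bubble_fraction` is my own.

```python
def bubble_fraction(num_partitions, num_micro_batches):
    """Idle fraction of an idealized pipeline: (K - 1) / (M + K - 1)."""
    k, m = num_partitions, num_micro_batches
    return (k - 1) / (m + k - 1)

# With 4 partitions, more micro-batches means less idle time.
no_micro_batching = bubble_fraction(4, 1)
with_16_micro_batches = bubble_fraction(4, 16)
with_32_micro_batches = bubble_fraction(4, 32)
```

Under these assumptions a single batch leaves each device idle three quarters of the time, while 16 micro-batches bring that down to 3/19.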
ImageNet classification
AlexNet
37.5% top-1 error rate
1000 classes
Dropout regularization
Five convolutional layers, three fully connected
Used ReLU
Uses 2 GPUs
Local response normalization
Overlapping pooling
Overfitting problems since 60 million parameters
Data augmentation by image transforms. PCA on RGB
Dropout. Weights initialized with zero-mean Gaussian. Biases initialized to 1
Color-agnostic kernels on GPU 1. Color-specific kernels on GPU 2
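A small sketch of what overlapping pooling means, reduced to one dimension for readability. Pooling overlaps when the stride is smaller than the window size; the paper uses a window of 3 with stride 2. The helper `max_pool_1d` and the sample row are my own illustration.

```python
def max_pool_1d(values, size=3, stride=2):
    """Max over sliding windows; windows overlap whenever stride < size."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, stride)]

row = [1, 5, 2, 0, 3, 4, 1]
overlapping = max_pool_1d(row, size=3, stride=2)      # neighbouring windows share one element
non_overlapping = max_pool_1d(row, size=2, stride=2)  # traditional pooling
```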
Scaling laws for neural language models
Weak dependence on model architecture (depth and width). Performance depends mainly on number of parameters, data, and compute. Increase model size 8x, increase data about 5x
What is byte pair encoding?
Add high karma Reddit to dataset
Language modeling versus memorizing the documents?
Need to do early stopping
Given compute there is an optimal model size
There is a limit for transformer models where the two lines cross
10^12 parameters
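The 8x model / 5x data rule of thumb can be checked with a bit of arithmetic. This sketch assumes data should grow as a power of model size, D proportional to N^0.74. The exponent is what I recall from the paper, so treat it as an assumption; the function name is mine.

```python
def data_multiplier(model_multiplier, exponent=0.74):
    """How much more data is needed when the model grows, assuming D ~ N**exponent."""
    return model_multiplier ** exponent

needed_for_8x_model = data_multiplier(8)  # roughly 4.7, i.e. "about 5x"
```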
The First Law of Complexodynamics
Kolmogorov complexity of a string x is the length of the shortest computer program that outputs x
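Kolmogorov complexity itself is not computable, but any compressor gives an upper bound on it (up to the constant size of the decompressor). A rough sketch using `zlib` from the standard library as a stand-in. The helper name and test strings are my own.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """Length of the zlib-compressed data: a crude upper-bound proxy for K(x)."""
    return len(zlib.compress(data, 9))

regular = b"ab" * 1000                                   # highly regular, short description
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(2000))   # pseudo-random bytes

regular_len = compressed_length(regular)
noisy_len = compressed_length(noisy)
```

Both inputs are 2000 bytes long, yet the regular one compresses to a tiny fraction while the noisy one barely compresses at all. Note the noisy string actually has low Kolmogorov complexity too (a short program with seed 0 generates it), which shows how loose the compressor bound can be.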
Deep Speech 2
Throw a deep NN at it. What is the CTC loss function?
What is DNN-HMM?
Uses language model for beam search
Batchnorm
Viterbi alignment
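On the CTC question: the network emits one label per audio frame, including a special blank, and CTC maps a frame-level path to an output sequence by merging repeats and then dropping blanks. The loss is the negative log of the total probability of all paths that collapse to the target. Below is a sketch of just the collapsing step, which is also what greedy decoding uses. The name `ctc_collapse` and the `_` blank symbol are my own choices.

```python
def ctc_collapse(frame_labels, blank=0):
    """Merge consecutive repeats, then drop blanks."""
    output = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output

# A blank between two identical labels keeps them as separate outputs.
decoded = "".join(ctc_collapse(list("hh_e_ll_lo"), blank="_"))
```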
Neural message passing for quantum chemistry
MPNN
Reformulates previous work as MPNNs.
Remember Gaussian 09
GG-NN model
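A toy sketch of one message passing step in the MPNN framing: each node sums messages M(h_v, h_w, e_vw) from its neighbours, then updates its state with U(h_v, m_v). Node states are scalars here to keep it short, and the particular `message_fn` and `update_fn` are hypothetical stand-ins for the learned functions.

```python
def message_passing_step(node_states, edges, message_fn, update_fn):
    """edges is a list of (v, w, edge_feature); list both directions for undirected graphs."""
    messages = {v: 0.0 for v in node_states}
    for v, w, e_vw in edges:
        messages[v] += message_fn(node_states[v], node_states[w], e_vw)
    return {v: update_fn(h, messages[v]) for v, h in node_states.items()}

# Path graph 0 - 1 - 2 with scalar edge features.
states = {0: 1.0, 1: 2.0, 2: 3.0}
edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 0.5), (2, 1, 0.5)]
new_states = message_passing_step(
    states,
    edges,
    message_fn=lambda h_v, h_w, e: e * h_w,  # hypothetical message function
    update_fn=lambda h, m: h + m,            # hypothetical update function
)
readout = sum(new_states.values())           # simple sum readout over nodes
```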
Neural Turing machine
Memory is accessed through an attention mechanism
Read and write heads
Hopfield networks
Content and location based addressing
Cosine similarity
Circular convolution, memory wraps around
Different tasks: copy, repeat copy, associative recall
NTM better than LSTM
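A sketch of the two addressing pieces noted above: content addressing (softmax over cosine similarity to a key, sharpened by a strength beta) and location addressing by circular convolution with a shift distribution. Interpolation and sharpening steps are left out, and the function names and toy memory are my own.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)

def content_addressing(memory, key, beta):
    """Softmax over beta * cosine(memory row, key)."""
    scores = [beta * cosine_similarity(row, key) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def circular_shift(weights, shift_dist):
    """shift_dist maps integer offsets to probabilities; indices wrap modulo N."""
    n = len(weights)
    shifted = [0.0] * n
    for i in range(n):
        for offset, p in shift_dist.items():
            shifted[i] += weights[(i - offset) % n] * p
    return shifted

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
read_weights = content_addressing(memory, key=[0.0, 1.0], beta=10.0)
wrapped = circular_shift([0.0, 0.0, 1.0], {1: 1.0})  # focus on last slot, shift by +1
```

Shifting a head that sits on the last slot by +1 lands on slot 0, which is the "memory wraps around" behaviour.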
Recurrent neural network regularization
Dropout doesn’t work well for RNN and LSTM
Only apply dropout on subset of connections to make it work
Also developed independently by another group.
What is perplexity?
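To answer the perplexity question for myself: it is the exponential of the average negative log-likelihood per token, so a model that is always torn between k equally likely words has perplexity k. A minimal sketch, with `perplexity` as my own helper name:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log probability the model gave to each true token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform_over_four = perplexity([0.25] * 10)  # behaves like a 4-way guess at every step
confident = perplexity([0.9] * 10)
```

Lower is better. A perfect model that assigns probability 1 to every token has perplexity 1.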
Order matters
What is seq2seq?
What is perplexity?
Relational recurrent neural networks
Relational memory core using multi-head dot product attention
MHDPA
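A stripped-down sketch of multi-head dot product attention over memory rows, so each memory slot can look at every other slot. Big assumptions here: the learned query/key/value projections are replaced by identity (each head just takes a slice of the row), scores are scaled by the square root of the head size, and all names are mine.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_self_attention(memory, num_heads):
    """Each head attends over its own slice of every memory row; outputs are concatenated."""
    size = len(memory[0]) // num_heads
    per_head = []
    for h in range(num_heads):
        rows = [row[h * size:(h + 1) * size] for row in memory]
        per_head.append(attention(rows, rows, rows))
    return [sum((per_head[h][i] for h in range(num_heads)), [])
            for i in range(len(memory))]

memory = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_self_attention(memory, num_heads=2)
```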
Keeping neural networks simple by minimizing the description length of the weights
Add noise to weights to make network generalize better
Different ways to limit information in weights
MDL principle
Is this still used in practice?
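One way to see why noisy weights carry less information, as I understand the argument: if a weight is described by a Gaussian N(mu, sigma^2) and the prior is N(0, prior_sigma^2), the cost of communicating it is the KL divergence between the two. The sketch below computes that cost in nats. The function name and example numbers are mine, and the link to the paper's exact formulation is my reading, not a quote.

```python
import math

def weight_cost_nats(mu, sigma, prior_sigma):
    """KL( N(mu, sigma^2) || N(0, prior_sigma^2) ) in nats."""
    return (math.log(prior_sigma / sigma)
            + (sigma ** 2 + mu ** 2) / (2 * prior_sigma ** 2)
            - 0.5)

free = weight_cost_nats(0.0, 1.0, 1.0)      # matches the prior exactly
precise = weight_cost_nats(1.0, 0.01, 1.0)  # sharply specified weight
noisy = weight_cost_nats(1.0, 0.5, 1.0)     # same mean, more noise tolerated
```

A weight that tolerates more noise costs fewer nats, which is the sense in which adding noise limits the information in the weights.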
Deep residual learning for image recognition
ResNet, skip connections
What’s VGG?
What’s 10 crop testing?
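A minimal sketch of a skip connection: the block computes y = relu(F(x) + x), so the layers only have to learn the residual F. To keep it short this assumes dense layers instead of convolutions and leaves out batch normalization; `residual_block` and the weights are hypothetical.

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def linear(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def residual_block(x, W1, W2):
    """y = relu(F(x) + x) with F(x) = W2 relu(W1 x)."""
    fx = linear(W2, relu(linear(W1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
passthrough = residual_block(x, zeros, zeros)  # F(x) = 0, so only the shortcut contributes
```

With all-zero weights the block still passes the input through the shortcut, which hints at why very deep stacks of these blocks stay trainable.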
Multi scale context aggregation by dilated convolutions
What is a dilated convolution?
Typo, stanfard under experiments
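Answering the dilated convolution question with a 1-d sketch: the kernel taps are spaced `dilation` samples apart, so the receptive field grows without adding parameters. This uses the cross-correlation form (no kernel flip), as deep learning libraries usually do; the function name and sample signal are mine.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1-d convolution with kernel taps `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation
    return [sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
            for i in range(len(signal) - span)]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees a span of 5 samples
```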
Neural machine translation by jointly learning to align and translate
Bottleneck: encoding the source into a single fixed-length vector limits performance. Instead, search the source.
Encoder: bidirectional RNN. Decoder: searches through the source
RNNsearch model
Identity mappings in deep residual networks
ResNet
New residual unit in figure.
Identified more shortcuts aka skip connections
A simple neural network module for relational reasoning
Relation network (RN)
CLEVR tasks
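A sketch of the relation network form RN(O) = f(sum over pairs (i, j) of g(o_i, o_j)). In the paper f and g are MLPs; here they are tiny hand-written stand-ins, so everything below the function definition is hypothetical.

```python
def relation_network(objects, g, f):
    """Apply g to every ordered pair of objects, sum the results, then apply f."""
    total = None
    for o_i in objects:
        for o_j in objects:
            r = g(o_i, o_j)
            total = r if total is None else [a + b for a, b in zip(total, r)]
    return f(total)

# Hypothetical objects: [position, size].
objects = [[0, 1], [2, 1], [5, 2]]
pair_features = lambda a, b: [abs(a[0] - b[0]), a[1] * b[1]]  # stand-in for g
score = relation_network(objects, pair_features, sum)         # stand-in for f
shuffled_score = relation_network(objects[::-1], pair_features, sum)
```

Because the pair outputs are summed, the result does not depend on the order of the objects.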
A Tutorial Introduction to the Minimum Description Length Principle
Model selection using MDL
Kolmogorov complexity
Prefix code
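A small sketch for the prefix code note: a code is prefix-free when no codeword is a prefix of another, and any such binary code satisfies the Kraft inequality, sum of 2^(-length) <= 1. Helper names and the example codes are mine.

```python
def is_prefix_free(codewords):
    """After sorting, any prefix relation shows up between adjacent codewords."""
    words = sorted(codewords)
    return all(not later.startswith(earlier)
               for earlier, later in zip(words, words[1:]))

def kraft_sum(codewords):
    """Sum of 2^(-length) over binary codewords; at most 1 for a prefix-free code."""
    return sum(2.0 ** -len(w) for w in codewords)

good_code = ["0", "10", "110", "111"]
bad_code = ["0", "01", "11"]  # "0" is a prefix of "01"
```

The Kraft sum is what ties code lengths to probabilities: a codeword of length L behaves like an event of probability 2^(-L), which is the bridge MDL uses between compression and model selection.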