AlexNet Paper Implementation

AlexNet needs no introduction. This blog is more of an exercise for myself where I read a paper, replicate it and document the process. The “ImageNet Classification with Deep Convolutional Neural Networks” paper is a good start for this.

Introduction

AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, marked a significant milestone in deep learning. Before AlexNet’s win in the 2012 ImageNet competition, convolutional neural networks (CNNs) were rarely applied to large-scale image classification: training deep networks was computationally expensive, and the architectures in use were comparatively shallow.

Dataset Used

The paper used the ImageNet ILSVRC dataset, a subset of the full ImageNet collection with approximately 1.2 million training images, 50,000 validation images and 150,000 test images across 1,000 classes. The images come in varying resolutions, but the model requires a constant input size, so the authors down-sampled every image to a fixed 256x256 pixels. (256x256 is not the network’s input size; the 224x224 inputs are cropped from these images, as described under Data Augmentation.)

For my implementation I used CIFAR-10, a much smaller dataset. It contains 60,000 color images in total, 6,000 per class across 10 classes, split into 50,000 training images and 10,000 test images. I held out 10,000 of the training images as a validation set, leaving 40,000 for training. CIFAR-10 images are low resolution, only 32x32 pixels.

Figure 1: Sample Image of CIFAR-10 Dataset
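
The 40,000/10,000 train/validation split described above could be produced roughly as follows. This is a minimal sketch, not the code from my repository: the function name `split_indices` and the fixed seed are assumptions made for illustration.

```python
import random

def split_indices(n_total=50_000, n_val=10_000, seed=0):
    """Shuffle 0..n_total-1 and return (train_indices, val_indices)."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    indices = list(range(n_total))
    rng.shuffle(indices)
    # The first n_val shuffled indices form the validation set.
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))  # 40000 10000
```

The two index lists can then be used to select disjoint subsets of the CIFAR-10 training set.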

Data Augmentation

The paper used two augmentation methods that are cheap enough to apply on the fly, so the transformed images never have to be stored on disk. The first method extracts random 224x224 patches from the 256x256 images, together with their horizontal reflections, and trains the network on these patches rather than on the full images.

The images I worked with are only 32x32, so cropping them to something even smaller would discard too much of the image. Instead, I padded each image by 4 pixels on every side and then took a random 32x32 crop of the padded image, which shifts the content by up to 4 pixels in each direction. Finally, I applied horizontal reflections to the cropped images.

Figure 2: Image After Padding and Random Cropping and After Horizontal Reflection
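
A pad, crop and flip step of this kind can be sketched in NumPy as below. This is an illustration under assumptions rather than the exact pipeline: the function name `pad_crop_flip`, zero-valued padding, and a 50% flip probability are all choices made here for the example.

```python
import numpy as np

def pad_crop_flip(image, rng, pad=4, flip_prob=0.5):
    """Zero-pad an HxWxC image, take a random HxW crop, then maybe mirror it."""
    h, w, _ = image.shape
    # Assumption: pad with zeros (black) on all four sides.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Top-left corner of the crop; offsets 0..2*pad keep the window inside.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < flip_prob:
        crop = crop[:, ::-1]         # horizontal reflection (reverse columns)
    return crop

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = pad_crop_flip(image, rng)
print(augmented.shape)  # (32, 32, 3)
```

Because the crop has the same size as the original image, the network input size is unchanged and only the position (and orientation) of the content varies.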

The second augmentation method is PCA-based colour jitter to mimic global lighting variation. They first computed the eigenvectors (principal components) and eigenvalues of the 3 × 3 RGB covariance matrix over the zero-centred, unit-scaled ImageNet training set. For every training image they added the vector

\[\Delta \;=\; \sum_{i=1}^{3} v_i \bigl(\lambda_i \alpha_i\bigr),\]

where \(v_i\) is the i-th eigenvector, \(\lambda _i\) its eigenvalue, and \(\alpha_i\) is drawn from a zero-mean Gaussian with standard deviation 0.1, i.e. \(\alpha_i \sim N(0, 0.1^2)\). The same \(\Delta\) is added to every pixel of the image, so colour shifts occur along directions of naturally high variance while the spatial structure stays untouched.

In my CIFAR-10 implementation I keep the one-shift-per-image idea but apply a tweak. Because I compute PCA on raw 0–255 pixels (not normalised data), the eigenvalues are several orders of magnitude larger than those in the paper. Multiplying by \(\lambda _i\) therefore pushed many channels beyond the valid 0–255 range; clipping them produced almost pure black or white frames. To avoid that I scale each axis by \(\sqrt{\lambda_i}\), its standard deviation, rather than the eigenvalue itself, and still multiply by a Gaussian noise term with \(\sigma = 0.1\). Empirically this keeps the jitter amplitude comparable to the original augmentation while preventing saturation, yet it still biases the perturbation toward directions of greater colour variance.

Figure 3: Image after PCA Color Jitter
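
The square-root variant described above can be sketched in NumPy as follows. Treat it as a hedged illustration: the helper names `fit_rgb_pca` and `pca_jitter` are hypothetical, the random array only stands in for real CIFAR-10 data, and the rounding and clipping at the end is an assumption about how the result is converted back to 8-bit pixels.

```python
import numpy as np

def fit_rgb_pca(images):
    """Eigen-decompose the 3x3 RGB covariance of an (N, H, W, 3) uint8 array."""
    pixels = images.reshape(-1, 3).astype(np.float64)   # raw 0-255 values
    cov = np.cov(pixels, rowvar=False)                   # 3x3 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)               # columns are v_i
    return np.clip(eigvals, 0.0, None), eigvecs          # guard tiny negatives

def pca_jitter(image, eigvals, eigvecs, rng, sigma=0.1):
    """Add one colour shift, scaled by sqrt(eigenvalue), to every pixel."""
    alpha = rng.normal(0.0, sigma, size=3)               # one draw per image
    # delta = sum_i v_i * sqrt(lambda_i) * alpha_i, a single RGB offset
    delta = eigvecs @ (np.sqrt(eigvals) * alpha)
    shifted = image.astype(np.float64) + delta           # broadcast over H, W
    return np.clip(np.rint(shifted), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
eigvals, eigvecs = fit_rgb_pca(images)
jittered = pca_jitter(images[0], eigvals, eigvecs, rng)
print(jittered.shape, jittered.dtype)  # (32, 32, 3) uint8
```

Replacing `np.sqrt(eigvals)` with `eigvals` recovers the paper's formula, which on raw 0–255 pixels produces the saturation problem described above.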

Model Architecture

The original AlexNet was designed for 224x224 ImageNet images and used large convolutional kernels (11x11 in the first layer) with aggressive downsampling (stride 4). It used overlapping 3x3 max-pooling (stride 2) and split its computation across two GPUs, because the network, with about 60 million parameters, did not fit in the memory of a single GPU. The fully connected (FC) layers (4096 → 4096 → 1000 neurons) were massive and ended in a 1,000-way output for ImageNet’s classes. In contrast, my implementation adapts to CIFAR-10’s 32x32 images by scaling down the kernels (3x3 in the first layer), reducing the stride (stride 2), and using non-overlapping 2x2 pooling to preserve spatial detail. The FC layers keep their 4096-unit hidden widths but end in a 10-way output (4096 → 4096 → 10 neurons) to match CIFAR-10’s 10 classes, and dropout is applied to all FC layers (vs. only the first two in the original) to combat overfitting on the smaller dataset.
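
To see how quickly these settings shrink the feature maps, the standard output-size formula, floor((n + 2p - k) / s) + 1, can be evaluated directly. The sketch below makes two assumptions: it uses a 227x227 input for the original network (the paper states 224x224, but the 55x55 first-layer maps it reports correspond to 227x227 without padding, or 224x224 with 2 pixels of padding), and it assumes 1 pixel of padding for the first CIFAR-10 convolution, which is not specified above.

```python
def out_size(n, kernel, stride, padding=0):
    """Spatial output size of a conv or pooling layer (floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1

# Original AlexNet, first stage (assumed 227x227 input, no padding).
conv1 = out_size(227, kernel=11, stride=4)               # 55
pool1 = out_size(conv1, kernel=3, stride=2)              # 27

# CIFAR-10 variant, first stage (padding=1 is an assumption).
c_conv1 = out_size(32, kernel=3, stride=2, padding=1)    # 16
c_pool1 = out_size(c_conv1, kernel=2, stride=2)          # 8

print(conv1, pool1, c_conv1, c_pool1)  # 55 27 16 8
```

Applying the original 11x11, stride-4 convolution to a 32x32 image would leave only a 6x6 map after the very first layer, which is why the kernels and strides have to be scaled down.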

The adjustments address two core challenges: scale and generalization. CIFAR-10’s 32x32 images contain far fewer pixels than ImageNet images, so large kernels and aggressive pooling are counterproductive: they would erase most of the spatial information within the first few layers. Smaller kernels and gentler downsampling retain discriminative features. Similarly, a classifier with the original FC layers’ enormous parameter count would be prone to overfitting CIFAR-10’s limited training data (50k vs. ImageNet’s 1.2M images), so a lighter classifier combined with broader dropout provides stronger regularization. These changes preserve AlexNet’s overall design while adapting it to a much smaller dataset.
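
For a sense of scale, the parameter count of the original classifier can be worked out by hand. The figures below describe the original AlexNet only (its last convolutional stage outputs 256 feature maps of 6x6); they are not the counts for my CIFAR-10 model, whose flattened feature size is not stated here.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Original AlexNet classifier: 9216 -> 4096 -> 4096 -> 1000
original = [
    linear_params(256 * 6 * 6, 4096),
    linear_params(4096, 4096),
    linear_params(4096, 1000),
]
print(original, sum(original))
# [37752832, 16781312, 4097000] 58631144

# Swapping only the output layer for 10 classes saves about 4M parameters.
print(linear_params(4096, 10))  # 40970
```

Almost all of the roughly 60 million parameters sit in these three layers, and the first of them dominates, so the size of the flattened feature map feeding the classifier matters far more than the number of output classes.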

Training Setup

The paper trained on 1.2 million ImageNet images using two NVIDIA GTX 580 GPUs (3GB VRAM each) with cross-GPU parallelization, where each GPU held half of the kernels and the GPUs communicated only in certain layers. The authors used SGD with momentum (0.9) and a starting learning rate of 0.01, divided manually by 10 whenever the validation error rate stopped improving. Training ran for roughly 90 epochs with mini-batches of 128 images and weight decay of 0.0005; no gradient clipping was used. Weights were initialized from a zero-mean Gaussian distribution (σ=0.01). Biases were set to 1 in conv2, conv4, conv5 and the hidden FC layers, and to 0 in the remaining layers. Data augmentation was limited to random crops, horizontal flips, and PCA jitter.
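
For reference, the paper writes the resulting update rule for a weight \(w\) as

\[
v_{i+1} \;=\; 0.9\, v_i \;-\; 0.0005\,\epsilon\, w_i \;-\; \epsilon \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i},
\qquad
w_{i+1} \;=\; w_i + v_{i+1},
\]

where \(i\) is the iteration index, \(v\) the momentum variable, \(\epsilon\) the learning rate, and the bracketed term the gradient of the loss averaged over the \(i\)-th mini-batch \(D_i\). The factor 0.9 is the momentum and 0.0005 the weight decay quoted above.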

Using PyTorch on a single T4 GPU, I trained the model on CIFAR-10 for 49 epochs using SGD with momentum (0.9) and weight decay (5e-4) at a batch size of 128. Of the 50,000 training images, 20% (10,000) were held out for validation monitoring. Compared with the paper’s setup, I added learning rate scheduling (a step decay from 0.01 to 0.001 after epoch 30) and gradient clipping (max norm 2.0), and used an augmentation pipeline of padding, random cropping, horizontal flips, and PCA-based jitter. Weight initialization followed PyTorch’s defaults (a Kaiming-style uniform initialization) rather than the paper’s Gaussian scheme.
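
The step decay and gradient clipping can be illustrated in plain Python, independent of any framework. This is a sketch of the two ideas rather than my training loop: the function names are hypothetical, epochs are assumed to be counted from 1, and in PyTorch the same roles are played by `torch.optim.lr_scheduler.StepLR` and `torch.nn.utils.clip_grad_norm_`.

```python
import math

def step_lr(epoch, base_lr=0.01, drop_after=30, factor=0.1):
    """Learning rate for a 1-indexed epoch: base_lr up to drop_after, then scaled."""
    return base_lr if epoch <= drop_after else base_lr * factor

def clip_by_global_norm(grads, max_norm=2.0, eps=1e-6):
    """Rescale gradient values so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink: gradients already inside the ball are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

print(step_lr(30), step_lr(31))          # rate before and after the drop
clipped, norm = clip_by_global_norm([3.0, 4.0])
print(norm)                              # 5.0
```

Clipping rescales the whole gradient vector by one factor, so its direction is preserved and only its length is capped.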

Result & Analysis

The paper reported a top-5 test error rate of 15.3% in the ILSVRC-2012 competition, roughly 11 percentage points ahead of the second-best entry’s 26.2%. This success stemmed from scaling depth (8 learned layers = 5 convolutional + 3 fully-connected), GPU training, and techniques such as ReLU activations and dropout. PCA-based colour augmentation alone reduced top-1 error by over 1%, which suggests that robustness to changes in illumination matters for generalization. These accuracy gains came at significant computational cost: five to six days of training on two GPUs.

My adapted AlexNet achieved 80.45% test accuracy after 49 epochs of training, with final training accuracy at 82.88% and validation accuracy at 79.29%. The model showed consistent improvement throughout training, with a significant accuracy jump after the learning rate decay at epoch 30 (from 76.53% to 79.73% train accuracy in one epoch).

Figure 4: Training/Validation Loss and Accuracy

Conclusion

My tweaked AlexNet reached 80.45% test accuracy on CIFAR-10 after just 49 epochs; training longer with stronger augmentations would likely lift it further. The complete implementation and training code can be found in my GitHub repository and on Colab.

References

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
CIFAR-10 dataset: https://www.cs.toronto.edu/~kriz/cifar.html