I was taking a look at Convolutional Neural Network from CS231n Convolutional Neural Networks for Visual Recognition. In Convolutional Neural Network, the neurons are arranged in 3 dimensions(height, width, depth). I am having trouble with the depth of the CNN. I can't visualize what it is.

In the link they said The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.

For example loook at this picture. Sorry if the image is too crappy.

I can grasp the idea that we take a small area off the image, then compare it with the "Filters". So the filters will be collection of small images? Also they said We will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron. So is the receptive field has the same dimension as the filters? Also what will be the depth here? And what do we signify using the depth of a CNN?

So, my question mainly is, if i take an image having dimension of [32*32*3] (Lets say i have 50000 of these images, making the dataset [50000*32*32*3]), what shall i choose as its depth and what would it mean by the depth. Also what will be the dimension of the filters?

Also it will be much helpful if anyone can provide some link that gives some intuition on this.

EDIT: So in one part of the tutorial(Real-world example part), it says The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96].

Here we see the depth is 96. So is depth something that i choose arbitrarily? or something i compute? Also in the example above(Krizhevsky et al) they had 96 depths. So what does it mean by its 96 depths? Also the tutorial stated Every filter is small spatially (along width and height), but extends through the full depth of the input volume.

So that means the depth will be like this? If so then can i assume Depth = Number of Filters?