
If you use your smartphone to snap a photo of a plant, landmark, or animal, chances are good that iOS or Android will identify the subject. In doing so, the software is leveraging a form of AI known as image classification. If you’re a tech pro exploring AI skills to learn, image classification is well within your reach—let’s walk through how to build an example of it.
Most of us have heard of Large Language Models (LLMs), which are meant for processing human language. For image classification, you need a different group of models called Convolutional Neural Networks (CNNs), which are built to process image data, pixel by pixel, as well as audio and even signal data.
For images, the model breaks the image down into layers of features: the early layers detect simple elements such as edges and textures, while deeper layers detect more sophisticated parts of the image. After the layers are combined, the model classifies the image based on probability. It might return information like “There’s an 85 percent chance this image contains a dog, an 8 percent chance it contains a bear, a 3 percent chance it contains a cat, and a 3 percent chance it contains something unknown to the model.”
If you’ve studied LLMs and sentence transformers, you’ve seen how models built for language transform a word or a sentence into a vector (i.e., a series of numbers). In the case of sentence transformers, two sentences that are similar in meaning will produce two vectors that point in a similar direction; the similarity is determined by calculating the cosine of the angle between them. The cosine is a decimal number between -1 and 1; the closer the value is to 1, the more similar the sentences are.
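To make that concrete, here’s a quick sketch of the cosine similarity calculation using toy three-dimensional vectors (real embeddings have hundreds of dimensions, and the values here are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine of the angle between two vectors (a value from -1 to 1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- real ones come from a model and have hundreds of dimensions
vec_a = np.array([0.9, 0.1, 0.3])
vec_b = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(vec_a, vec_b))  # close to 1, so the sentences are similar
```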
CNNs do something similar with images: the vector the model produces for a new image is compared to what the CNN has already learned from the millions of images used in training. For example, if the presented image includes a cat, its vector will be close to the vectors of training images that contain cats.
Now just to be clear, we’re not talking about things like facial recognition. The app won’t look at a portrait and say, “That’s Sally.” Instead it will determine that there are three humans in the photo.
As when working with LLMs and sentence transformers, you don’t need to learn the details of how the model does its job (unless you’re interested in exploring that as a career). Instead, you can make use of existing libraries that let you insert an image and get back data about it. However, you’ll want at least a basic understanding of the steps the model takes and why each step is important.
So what can you build with this technology? You could build a personal photo album app that recognizes scenes and items and groups them together. Then you could query the app: “Show me the photos from when we went to the Grand Canyon,” or “Show me the pics from the picnic we had at the park last summer.” Or you could build an app for a car collector that classifies car pictures.
Apps like this also have a place in medicine, such as dermatology or dentistry. The possibilities are nearly endless; for example, if you work in a school system, it might be fun to build an AI that identifies different types of art for an art class.
Hardware Requirements
These AI models need lots of processing power. Most desktop CPUs don’t have what’s needed, but higher-end GPUs do (such as those from Nvidia). If you have a decent Nvidia card for gaming or crypto mining, your AI apps can make use of it. If you don’t, Amazon Web Services (AWS) offers servers with the most advanced GPUs Nvidia makes. Just remember to shut them down when you’re finished: they cost 50 cents per hour or more, and if you forget to turn them off, you could get hit with hundreds of dollars on your next bill.
Practicing Image Classification in Python
Let’s walk through the steps to create a simple image classifier that can recognize everyday subjects like cats and dogs, with short code sketches along the way.
First, you’ll want to use a couple of Python libraries:
- PyTorch: This is one of the most widely used libraries for AI and machine learning.
- Torchvision: This is the computer vision library that accompanies PyTorch.
And you’ll want to configure torchvision to use a model called resnet18. The pretrained version of this model recognizes a specific set of things: the 1,000 categories from the ImageNet dataset. Scroll through the class list in the torchvision documentation and you’ll find a lot of very specific entries; search for “cat,” for instance, and you’ll find tabby cat, tiger cat, Persian cat, Siamese cat, and so on.
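Here’s a minimal sketch of loading that model (this assumes a recent version of torchvision, 0.13 or later, where pretrained weights are requested through a weights argument):

```python
import torchvision.models as models
from torchvision.models import ResNet18_Weights

# Load ResNet-18 with weights pretrained on the ImageNet dataset
weights = ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()  # switch to inference mode; we're classifying, not training
```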
After loading the model, you’ll rely on Compose, a class in the torchvision.transforms module that chains preprocessing steps together (a code sketch follows this list). Here are the steps:
- Resize the image to the size the model needs.
- Crop the image from the center.
- Create a tensor. That’s an image version of the vectors we mentioned earlier. (The difference is these tensors have several dimensions.)
- Normalize the data in the tensor. This refers to the statistical concept of normalization: it shifts and scales the values in each color channel so they center around 0 and match the statistics of the images the model was trained on, putting your image on the same footing as the ones the model learned from.
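Here’s what those four steps look like chained together with Compose, using the standard ImageNet preprocessing values that torchvision’s pretrained ResNet models expect:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),          # resize the shorter side to 256 pixels
    transforms.CenterCrop(224),      # crop to the 224x224 center the model needs
    transforms.ToTensor(),           # convert the image to a multi-dimensional tensor
    transforms.Normalize(            # center the data around 0, using the mean and
        mean=[0.485, 0.456, 0.406],  # standard deviation of the ImageNet
        std=[0.229, 0.224, 0.225],   # training images
    ),
])
```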
Although you can find these steps in the documentation, try asking ChatGPT or Google Gemini about them, with the simple prompt: “Show me some sample Python code that uses the Compose method in torchvision.” (It will likely show you an entire app, not just this part of it.)
It’s important to understand that Compose returns a function that serves as a preprocessor, which you can then use over and over for the images you load. You can actually call Compose before you read in any images; then you can read in an image and send it to the preprocessor that Compose returns.
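In practice, that looks something like the following (the filename is hypothetical, and Pillow is the usual library for reading the image file):

```python
from PIL import Image

img = Image.open("dog.jpg")              # hypothetical filename
input_tensor = preprocess(img)           # run the Compose pipeline built above
input_batch = input_tensor.unsqueeze(0)  # the model expects a batch dimension
```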
Next, you’ll take the preprocessed image data and pass it into the model for the actual classification. This returns a list of possible matches, each with a score (such as 85 percent for “dog,” as we mentioned earlier). However, the output won’t actually contain the word “dog”; instead, it will contain the index of that class in the list of ImageNet labels.
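Continuing the sketch from above, that step looks like this:

```python
import torch

with torch.no_grad():  # inference only, so skip the gradient bookkeeping
    output = model(input_batch)

# Convert the raw scores to probabilities and grab the five most likely classes
probs = torch.nn.functional.softmax(output[0], dim=0)
top_probs, top_indices = torch.topk(probs, 5)
```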
Finally, you’ll map that index back to a human-readable name. The classic approach is to load the list of class names (for example, from a JSON file) into an array and find the item whose index matches what you got back from the model.
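If you loaded the model through the weights API shown earlier, recent versions of torchvision bundle the class names with the weights, so you can skip the separate download entirely; a sketch:

```python
categories = weights.meta["categories"]  # the 1,000 ImageNet class names

for prob, idx in zip(top_probs, top_indices):
    print(f"{categories[idx]}: {prob.item():.1%}")  # e.g., "tabby: 85.0%"
```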
And there you have it! You can create an image classifier pretty quickly.
Learning More and Writing Code
After understanding the above, you’ll be ready for a complete program. The sketches in this article cover only the individual steps, so head over to ChatGPT and put in the following prompt:
“Can you show me a basic example of an image classifier in Python using the torchvision library and the resnet18 model? I want to be able to search for things like cats, dogs, station wagons, and lighthouses. The code should include a call to transforms.Compose.”
You should get a nice example that follows the same steps we just described. Then try it with Google Gemini and see if you get a similar example (spoiler alert: you will).
Look through each line carefully, and then ask either chatbot to explain the lines. For example, when I did this, I saw several lines calling Compose to build the preprocessor, so I started asking Gemini about that code. Here are some possible questions you could ask:
- “In this code, does transforms.Compose do the actual classification?”
- “What type of object does transforms.Compose return?”
Pro tip: The second question returned a nice explanation of how Compose returns a “callable object,” whereas earlier I said it returns a function. There’s a technical difference (although most people just say “function”); if you’re new to Python, you might want to ask a follow-up question such as:
- “In Python, what is the difference between a function and a callable object?”
Continue going through the AI-generated code, and keep asking questions to learn what each line does. Here are more questions to consider:
- “What does the no_grad function do?”
- “Is the object returned from the resnet18 method a function or a callable object? I see that I’m using it like a function.”
You’ll probably get a pretty good discussion of what each of these does under the hood and how it works. But if not, ask!
Conclusion
The rapid evolution of AI has made this an interesting time for tech professionals. Not only can you build interesting tools like an image classifier, but you can also rely on AI (via chatbots such as ChatGPT and Gemini) to help you understand how they work. Good luck!