Building an AI-powered Recommendation System in Python

When you visit a website such as Amazon, you’re instantly inundated with product recommendations. But these recommendations are anything but random: Amazon’s systems are carefully analyzing the behavior of every visitor to the site, tracking searches, product clicks, and sales.

Amazon is monitoring not only your behavior, but also customers with similar behavior, all in the name of delivering better-targeted recommendations. For example, when you search for hiking gear, Amazon is evaluating your choices against the choices of everyone else who shopped for tents and similar products.

Amazon pioneered this concept of AI-powered recommendation systems. And now, with the help of libraries, you can build your own recommendation system. Let’s talk about how you can do this in Python.

Ultimately, what you’re doing is filtering a product catalog: leaving out products the user is unlikely to purchase, and zeroing the algorithm in on products with a higher likelihood of purchase. There are three main approaches to this filtering:

  • Content-based filtering: This refers to tracking the user’s own actions (i.e., what they searched for, what they clicked on, what they ultimately purchased).
  • Collaborative filtering: This refers to the cumulative actions of other shoppers similar to the user (i.e., what they clicked on, searched for, and purchased).
  • Hybrid filtering: This refers to the combination of the two previous filters.

Additionally, you’ll track how many times a user has clicked on a particular product, as well as the overall popularity of each product. These counts give certain products higher scores when you’re deciding whether to display them.
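As a rough sketch of how you might compute those scores with pandas, here’s one way to roll up a hypothetical event log (the column names and event weights here are invented purely for illustration):

```python
import pandas as pd

# Hypothetical event log: one row per click or purchase.
events = pd.DataFrame({
    "user_id":    ["u1", "u2", "u1", "u3", "u2", "u3"],
    "product_id": ["tent-4p", "tent-4p", "stove", "stove", "lantern", "tent-4p"],
    "event":      ["click", "click", "purchase", "click", "click", "purchase"],
})

# Weight purchases more heavily than clicks (the weights are arbitrary).
events["score"] = events["event"].map({"click": 1, "purchase": 5})

# Per-product popularity: a simple score you can always fall back on.
popularity = events.groupby("product_id")["score"].sum().sort_values(ascending=False)
print(popularity)
```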

When a user first creates an account in your system, they won’t have any product search history or purchase history. You likely won’t know anything about them yet, and so you can’t look for similar users. With new users, you’ll just suggest the most popular products at first. But as they type in search queries and click on different products, you’ll soon gather more data about them.

Every click they make, you record. You even record how long they spend on a product page. For example, if they click a product and four seconds later click a different one, most likely they weren’t interested in that first product.
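If you log clicks with timestamps, one simple (and admittedly rough) way to approximate time on a product page is to measure the gap until the user’s next click. This sketch uses a made-up click log and an arbitrary five-second threshold:

```python
import pandas as pd

# Hypothetical click log: one row per product-page visit.
clicks = pd.DataFrame({
    "user_id":    ["u1", "u1", "u1", "u2", "u2"],
    "product_id": ["tent-4p", "stove", "lantern", "chair", "desk"],
    "clicked_at": pd.to_datetime([
        "2024-05-01 10:00:00", "2024-05-01 10:00:04", "2024-05-01 10:02:30",
        "2024-05-01 11:00:00", "2024-05-01 11:05:00",
    ]),
})

# Approximate time on page as the gap until the user's next click.
clicks = clicks.sort_values(["user_id", "clicked_at"])
clicks["next_click"] = clicks.groupby("user_id")["clicked_at"].shift(-1)
clicks["seconds_on_page"] = (clicks["next_click"] - clicks["clicked_at"]).dt.total_seconds()

# Very short visits (the threshold is arbitrary) count as weak interest signals.
clicks["weak_signal"] = clicks["seconds_on_page"] < 5
print(clicks[["user_id", "product_id", "seconds_on_page", "weak_signal"]])
```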

You can also track demographics: user age, where they’re located, what they do for a living, and so on. Not all shopping sites track such information, but if they do, it’s another influence on the algorithm. Gradually, you’ll build up a profile for every user and what you think they might be interested in buying.

To build a recommendation system, you’ll want to combine different types of libraries:

  • General data libraries: These are libraries that can read in large amounts of data and provide general data analysis. For this category we’re talking about libraries such as:
    • NumPy: This is a fast, efficient numerical computing library. It works especially well with large arrays and matrices.
    • pandas: This library builds on NumPy to provide tools for working with structured data in the form of tables (called DataFrames).
  • Machine learning libraries: This is where much of the AI happens. These include scikit-learn, Surprise, and LightFM, all of which we’ll come back to below.

Additionally, there’s a tool called TfidfVectorizer, part of the scikit-learn library, which converts product descriptions into mathematical vectors. These vectors can be compared using what’s called cosine similarity to determine whether they’re similar. In this way, if a user clicks on one product, you can use cosine similarity to find similar products and recommend them.

Be aware, however, that TfidfVectorizer is just one piece of a much bigger library, and its documentation leaves a bit to be desired. You’ll want to read third-party blogs about it, such as this one from Geeks for Geeks.
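To make that concrete, here’s a minimal sketch using scikit-learn’s TfidfVectorizer and cosine_similarity on a handful of made-up product descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product descriptions.
products = {
    "tent-4p": "Four person dome tent, waterproof, easy setup for camping",
    "stove":   "Compact propane camp stove for outdoor cooking",
    "lantern": "Rechargeable LED camping lantern, 3 brightness modes",
    "desk":    "Adjustable standing desk for home office",
}

names = list(products.keys())
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(products.values())

# Pairwise cosine similarity between every product description.
similarity = cosine_similarity(vectors)

# Products most similar to the tent, excluding the tent itself.
tent_idx = names.index("tent-4p")
ranked = sorted(zip(names, similarity[tent_idx]), key=lambda p: p[1], reverse=True)
print([name for name, score in ranked if name != "tent-4p"])
```

Each row of the similarity matrix tells you which other products have the most similar descriptions to a given product.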

As you gather data from the users (i.e., their product clicks, their queries, their product purchases), you need to present this data in the right form for the libraries you’re using. Internally, the libraries build large matrices that can easily be searched and sorted; but thankfully, they take care of that behind the scenes, and you don’t need to worry about such things. (Still, it’s important to have a basic understanding of how they work so you’re not just blindly using the tools.)
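If you’re curious what that internal structure looks like, here’s a small example of turning a “long” event log into a user-item matrix with pandas (the data is, of course, made up):

```python
import pandas as pd

# Hypothetical interaction log in "long" form: one row per (user, product) event.
events = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u3", "u3"],
    "product_id": ["tent-4p", "stove", "tent-4p", "lantern", "stove"],
    "clicks":     [3, 1, 2, 4, 1],
})

# The "wide" user-item matrix recommender libraries build internally:
# one row per user, one column per product, zeros where there is no interaction.
matrix = events.pivot_table(
    index="user_id", columns="product_id", values="clicks", fill_value=0
)
print(matrix)
```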

Once the data is ready, you feed it into the different libraries depending on your particular needs. That’s when the modeling part happens and decisions are made. The model might learn things like:

  • Users who buy such-and-such often buy this other item later.
  • Users similar to the logged-in user usually prefer certain qualities about a product (such as eco-friendly).
  • Products with these tags tend to be purchased together.

The easiest library to start with here is Surprise. It’s built with beginners in mind and has built-in algorithms for making these connections. However, like many such libraries, it has its limitations; in this case, it doesn’t scale well to huge datasets.
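Here’s a minimal Surprise sketch, assuming you’ve already converted clicks and purchases into a rough 1-to-5 “interest” rating per user-product pair (the ratings and IDs below are hypothetical):

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

# Hypothetical "interest" ratings derived from clicks and purchases.
ratings = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u2", "u3"],
    "product_id": ["tent-4p", "stove", "tent-4p", "lantern", "stove"],
    "interest":   [5, 3, 4, 2, 5],
})

# Surprise expects (user, item, rating) triples plus the rating scale.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["user_id", "product_id", "interest"]], reader)

# Train a matrix-factorization model (SVD) on the full dataset.
model = SVD()
model.fit(data.build_full_trainset())

# Predict how interested user u3 would be in a product they haven't seen yet.
print(model.predict("u3", "tent-4p").est)
```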

For larger datasets, the LightFM library works great. Alternatively, if you want to work with vectors and cosine similarity, the aforementioned TfidfVectorizer and the rest of scikit-learn are great options.
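And here’s a comparable sketch with LightFM, which works directly with implicit interactions such as clicks; the users, products, and interaction pairs are invented for illustration:

```python
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

# Hypothetical users, products, and (user, product) interactions.
users = ["u1", "u2", "u3"]
items = ["tent-4p", "stove", "lantern"]
pairs = [("u1", "tent-4p"), ("u1", "stove"), ("u2", "tent-4p"), ("u3", "lantern")]

# Map raw IDs to the internal matrix indices LightFM expects.
dataset = Dataset()
dataset.fit(users=users, items=items)
interactions, _ = dataset.build_interactions(pairs)

# WARP is a ranking loss that works well with implicit data like clicks.
model = LightFM(loss="warp")
model.fit(interactions, epochs=10)

# Score every product for the first user; higher means "more likely to engage".
scores = model.predict(0, np.arange(len(items)))
print([items[i] for i in np.argsort(-scores)])
```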

A couple other libraries you could explore are:

  • spaCy: This is a modern Python library for natural language processing.
  • NLTK: Standing for “Natural Language Toolkit,” this library is a bit older, but it’s widely used and respected, providing various linguistic tools and language models.

In any case, at this stage, products start to emerge as similar to the products the user searched for or purchased.

We mentioned earlier the issue of a brand-new user and what to recommend. There are actually two related “cold start” problems to handle here.

First, there’s the issue of the brand-new user. You don’t really know anything about their interests or spending habits; the best approach is to fall back on recommending broadly popular items until you learn more about the user.

Next is the new-product problem: you’ve added some new products to your inventory, but nobody has viewed them yet, and you don’t yet know how they fit into your recommendations. Since these products don’t appear in any of your models yet, you need to be careful that they don’t get skipped altogether.

This is where cosine similarity comes in. Cosine similarity is a big deal in AI in general, and we already mentioned that it can help find similar products. If you stick to purely collaborative libraries such as Surprise and LightFM, you might find your new products getting left out. That’s when it’s time to bring in content-based tools such as TfidfVectorizer as well.
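A simple way to keep a brand-new product from being orphaned is to vectorize its description and let it piggyback on its nearest neighbors in the catalog. A sketch, with made-up descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Existing catalog descriptions (hypothetical).
catalog = {
    "tent-4p": "Four person dome tent, waterproof, easy setup for camping",
    "stove":   "Compact propane camp stove for outdoor cooking",
    "lantern": "Rechargeable LED camping lantern, 3 brightness modes",
}

vectorizer = TfidfVectorizer(stop_words="english")
catalog_vectors = vectorizer.fit_transform(catalog.values())

# A brand-new product nobody has clicked on yet.
new_description = "Ultralight two person backpacking tent with rainfly"
new_vector = vectorizer.transform([new_description])

# Content similarity lets the new product borrow from its nearest neighbors:
# anyone who engaged with those products is a candidate to see the new one.
scores = cosine_similarity(new_vector, catalog_vectors)[0]
neighbors = sorted(zip(catalog.keys(), scores), key=lambda p: p[1], reverse=True)
print(neighbors)
```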

In other words, you’ll be using a combination of approaches to come up with the initial list of product recommendations. Then once you have this initial list, you can move on to ranking them, as we describe next.

After the above analysis, your system will have a good idea of what products to recommend. But you’re not done yet. You’ll likely have many products, and you still need to do further analysis to decide which ones to present to the user.

There are different options at this stage, and you’ll likely have to craft your own algorithm. But in general you will:

  • Look at the user’s purchase history and decide where to go next. For example, you could list products that are similar to what they purchased. But do you really want to? Suppose they just bought a four-person tent. Does it make sense to recommend another four-person tent? Probably not. Instead, you’ll want to suggest related items, such as a camp stove.
  • Compare what came next for similar users. If your logged-in user purchased a tent, what did similar users purchase after that? Then look through your list and find the best match.

You will then rank the candidate products from most to least likely to be purchased, pick the best recommendations, and present them to the user.
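There’s no single right formula for that ranking, but as a sketch, you might blend a similarity score with a popularity score and filter out items the user already owns; the weights and helper function below are entirely hypothetical:

```python
# A minimal ranking sketch: blend a similarity score with popularity,
# and drop items the user already bought. Weights are arbitrary; tune
# them against your own data.
already_bought = {"tent-4p"}

candidates = {
    # product_id: (similarity_to_user_profile, popularity_score)
    "tent-2p": (0.92, 40),
    "stove":   (0.71, 55),
    "lantern": (0.65, 80),
    "tent-4p": (0.99, 120),  # will be excluded: they just bought one
}

def rank(candidates, already_bought, w_sim=0.7, w_pop=0.3):
    max_pop = max(pop for _, pop in candidates.values()) or 1
    scored = [
        (pid, w_sim * sim + w_pop * (pop / max_pop))
        for pid, (sim, pop) in candidates.items()
        if pid not in already_bought
    ]
    return sorted(scored, key=lambda p: p[1], reverse=True)

print(rank(candidates, already_bought))
```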

Next, you’ll have to decide whether to store this list of recommendations in the database, and gradually suggest them. This can get tricky, because one user might search for and purchase products that are unrelated to each other. Today they might purchase a tent; tomorrow they might purchase a gaming chair. But that’s okay! They’re still the same user, and if you notice such non-patterns, you can still make recommendations unrelated to their current search but related to their purchase habits, which could lead towards a new purchase.

The science of product recommendations is still young, and as such, not always perfect. That’s why you’ll need to test whether your recommendations are even working, and if not, adjust your algorithms and models.

Many companies use what’s called A/B testing. In this context, they show one set of recommendations to a randomly chosen group of users and a different set to the remaining users, then evaluate which set performed better.
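One common way to split users is to hash their IDs so each user consistently lands in the same group; a quick sketch (the group names are hypothetical):

```python
import hashlib

# Deterministically assign each user to variant A or B so they always
# see the same version of the recommendations.
def ab_group(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

for uid in ["u1", "u2", "u3", "u4"]:
    print(uid, ab_group(uid))
```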

You also need to consider feedback loops. This is an unfortunate scenario where, over time, only the most popular products bubble to the top and become the most recommended items, meaning less popular products effectively get buried and no longer appear in searches and recommendations.

The solution here is to inject a little randomness when you rank your products. Don’t just go with the raw ranking; shuffle it slightly. Maybe pull a few products from the bottom of the list toward the top just to see how they perform.
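For example, a light-touch version of this might swap a couple of low-ranked products into the list you actually show; the function below is just one hypothetical way to do it:

```python
import random

def add_exploration(ranked_ids, n_promote=2, top_n=10, seed=None):
    """Move a few low-ranked products into the shown list so they get exposure."""
    rng = random.Random(seed)
    top, tail = ranked_ids[:top_n], ranked_ids[top_n:]
    promoted = rng.sample(tail, min(n_promote, len(tail)))
    # Replace the last few top slots with the promoted products.
    return top[: top_n - len(promoted)] + promoted

ranked = [f"product-{i}" for i in range(1, 21)]
print(add_exploration(ranked, seed=42))
```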

Building an AI-powered recommendation system in Python is not only a useful exercise, but a solid approach to boosting sales, even for smaller companies. Don’t feel like this kind of system should only be used by massive companies such as Amazon. If you’re building a website for a smaller company with only 100 products, you can still make use of these algorithms. The idea is that product recommendations lead to increased sales, which can generate revenue and profits for new projects.