Dataquest: Tutorial: Find Dominant Colors in an Image through Clustering

Tutorial: Find Dominant Colors in an Image through Clustering

Analyzing images with code can be difficult. How do you make your code "understand" the context of an image?

A common first step in analyzing an image is finding its dominant colors. In this tutorial, we’re going to find dominant colors in images using matplotlib’s image module together with SciPy’s clustering functions. Finding dominant colors is also something you can do with third-party APIs, but we’re going to build our own system so that we have total control over the process.

We will first look at converting an image into its component colors in the form of a matrix, and then perform k-means clustering on it to find the dominant colors.

Prerequisites

This tutorial assumes that you know the basics of Python, but you don’t need to have worked with images in Python before.

This tutorial is based on the following:

  • Python version 3.6.5
  • matplotlib version 2.2.3: to decode images and visualize dominant colors
  • scipy version 1.1.0: to perform clustering that determines dominant colors

The packages matplotlib and scipy can be installed with pip, Python’s package manager. You may want to install specific versions of packages in a virtual environment to make sure there are no clashes with dependencies of other projects you are working on.

pip install matplotlib==2.2.3
pip install scipy==1.1.0

We analyze JPG images in this tutorial, and matplotlib can read them only when an additional package, pillow, is installed.

pip install pillow==5.2.0 

Alternatively, you can just use a Jupyter notebook. The code for this tutorial was run on a Jupyter notebook in Anaconda version 1.8.7. The packages above come pre-installed in Anaconda. You can check which versions your environment provides as follows; as the output shows, they may differ slightly from the versions listed above.

import matplotlib
matplotlib.__version__
'3.0.2'

import PIL
PIL.__version__
'5.3.0'

import scipy
scipy.__version__
'1.1.0'

Decoding images

Images may have various extensions — JPG, PNG, TIFF are common. This post focuses on JPG images only, but the process for other image formats should not be very different. The first step in the process is to read the image.

A JPG image, once decoded, is held in memory as a grid of dots known as pixels. A pixel, or picture element, represents a single dot in the image. The color of each pixel is determined by a combination of three values: its red, green and blue (RGB) components.

Image source: Datagenetics

Let us use Dataquest’s logo for the purpose of finding dominant colors in the image. You can download the image here.

To read an image in Python, you need to import the image module of matplotlib (documentation). The imread() function of the image module decodes an image into its RGB values. The output of imread() is an array with the dimensions M x N x 3, where M and N are the height and width of the image in pixels.

from matplotlib import image as img

image = img.imread('./dataquest.jpg')
image.shape
(200, 200, 3)
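To make the M x N x 3 layout concrete, here is a small sketch that builds a tiny synthetic image with NumPy instead of reading a file. The array name tiny and its colors are made up for illustration and are not taken from the Dataquest logo.

```python
import numpy as np

# A hypothetical 2 x 3 image: 2 rows, 3 columns, 3 color channels (R, G, B).
tiny = np.zeros((2, 3, 3), dtype=np.uint8)
tiny[0, :] = (255, 255, 255)   # top row: white pixels
tiny[1, :] = (75, 101, 82)     # bottom row: a made-up darker color

print(tiny.shape)      # rows, columns, channels
print(tiny[1, 0])      # the RGB triple of the pixel at row 1, column 0
print(tiny[..., 0])    # the red channel alone, as a 2 x 3 matrix
```

Indexing with two positions selects one pixel, while indexing the last axis selects one color channel for every pixel.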

You can use the imshow() function of matplotlib’s pyplot module to display an image that is in the form of a matrix of RGB values.

%matplotlib inline

from matplotlib import pyplot as plt

plt.imshow(image)
plt.show()

The matrix we get from the imread() function depends on the type of the image being read. For instance, PNG images may also include a fourth element measuring each pixel’s level of transparency. This post will only cover JPG images.
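Although this post sticks to JPG images, the following sketch shows one way an image with a transparency channel could be reduced to plain RGB. It uses a hand-written 2 x 2 array as a stand-in for a decoded PNG; the names rgba, rgb and alpha are assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-in for a decoded PNG with transparency:
# 2 x 2 pixels, 4 channels per pixel (R, G, B, A).
rgba = np.array([
    [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.5]],
    [[0.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
])

rgb = rgba[..., :3]     # keep the first three channels, discard alpha
alpha = rgba[..., 3]    # the transparency channel on its own

print(rgba.shape, rgb.shape, alpha.shape)
```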

Before moving on to clustering the images, we need to perform an additional step. In the process of finding out the dominant colors of an image, we are not concerned about the position of the pixel. Hence, we need to convert the M x N x 3 matrix to three individual lists, which contain the respective red, green and blue values. The following snippet converts the matrix stored in image into three individual lists, each of length 40,000 (200 x 200).

r = []
g = []
b = []

for line in image:
    for pixel in line:
        temp_r, temp_g, temp_b = pixel
        r.append(temp_r)
        g.append(temp_g)
        b.append(temp_b)

The snippet above creates three empty lists, and then loops through each pixel of our image, appending the RGB values to our r, g, and b lists, respectively. If done correctly, each list will have a length of 40,000 (200 x 200).
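As an alternative to the explicit loops, the same flattening can be sketched with NumPy’s reshape(). The example below uses a small synthetic array in place of the decoded logo, and it produces NumPy arrays rather than Python lists, which is an assumption you may or may not want in your own code.

```python
import numpy as np

# Synthetic 2 x 2 RGB image standing in for the array returned by imread().
image = np.array([
    [[255, 255, 255], [75, 82, 101]],
    [[67, 72, 91], [255, 255, 255]],
], dtype=np.uint8)

# Collapse the two spatial axes: one row per pixel, one column per channel.
pixels = image.reshape(-1, 3)

# Transposing gives the three channel sequences in R, G, B order.
r, g, b = pixels.T

print(pixels.shape)
print(r.tolist(), g.tolist(), b.tolist())
```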

Clustering Basics

Now that we have stored all of the component colors of our image, it is time to find the dominant colors. Let us now take a moment to understand the basics of clustering and how it is going to help us find the dominant colors in an image.

Clustering is a technique that helps in grouping similar items together based on particular attributes. We are going to apply k-means clustering to the three lists of color values that we have just created above.

The colors at each cluster center will reflect the average of the attributes of all members of a cluster, and that will help us determine the dominant colors in the image.
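To show how a cluster center ends up being the average of its members, here is a minimal pure-Python sketch of the k-means idea. It is not the SciPy implementation used later in this tutorial; the function names, the fixed number of iterations and the six sample pixels are all assumptions chosen to keep the example short.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two color tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise average of a list of color tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def simple_kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k of the points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to its nearest center
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # move each center to the mean of its members (keep it if the cluster is empty)
        centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

# Made-up pixels: three near-white and three darker colors.
pixels = [(250, 250, 250), (255, 255, 255), (252, 254, 253),
          (70, 80, 100), (75, 82, 101), (68, 78, 95)]

centers = sorted(simple_kmeans(pixels, 2))
print(centers)
```

Each returned center is the mean of one group of similar pixels, so it need not coincide with any single pixel in the input.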

How Many Dominant Colors?

Before we perform k-means clustering on the pixel data points, it might be good for us to figure out how many clusters are ideal for a given image, since not all images will have the same number of dominant colors.

Since we are dealing with three variables for clustering (the red, green and blue values of pixels), we can visualize these variables in three dimensions to understand how many dominant colors may exist.

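To make a 3D plot in matplotlib, we will use the Axes3D class (documentation). Importing it registers the 3D projection, and passing projection='3d' to the figure’s add_subplot() method returns an Axes3D instance. This form works on both older and recent matplotlib releases, whereas constructing Axes3D(fig) directly may no longer attach the axes to the figure in recent releases. After initializing the axes, we call the scatter() method with the three lists of color values as arguments.

from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(r, g, b)
plt.show()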

In the resultant plot, we can see that the distribution of points forms two elongated clusters. Looking at the image itself confirms this, since it is primarily composed of two colors. Therefore, we will focus on creating two clusters in the next section.

First, though, note that a 3D plot may not yield distinct clusters for some images. Additionally, if we were using a PNG image, we would have a fourth variable (each pixel’s transparency value), which would make plotting in three dimensions impossible. In such cases, you may need to use the elbow method to determine the ideal number of clusters.
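The elbow method mentioned above can be sketched as follows: run k-means for several values of k, record the distortion each time, and look for the point where adding clusters stops helping much. This sketch uses synthetic pixel data with two made-up color groups rather than the logo, and the range of k values is an arbitrary assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.RandomState(42)

# Synthetic "pixels": two made-up color groups with a little noise.
light = rng.normal(loc=(250, 250, 250), scale=3, size=(200, 3))
dark = rng.normal(loc=(75, 82, 101), scale=3, size=(200, 3))
pixels = whiten(np.vstack([light, dark]))

# Record the distortion reported by kmeans() for k = 1..5.
distortions = []
for k in range(1, 6):
    _, distortion = kmeans(pixels, k)
    distortions.append(distortion)

for k, d in zip(range(1, 6), distortions):
    print(k, round(float(d), 3))
```

With data like this, the distortion should drop sharply from one cluster to two and change little afterwards; that bend is the "elbow". Plotting the number of clusters against the distortion with pyplot makes the bend easier to spot.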

Perform Clustering in SciPy

In the previous step, we’ve determined that we’d like two clusters, and now we are ready to perform k-means clustering on the data. Let’s create a Pandas data frame to easily manage the variables.

import pandas as pd

df = pd.DataFrame({'red': r,
                   'blue': b,
                   'green': g})

There are essentially three steps involved in the process of k-means clustering with SciPy:

  1. Standardize the variables by dividing each data point by the standard deviation of its variable. We will use the whiten() function of the vq module.
  2. Generate cluster centers using the kmeans() function.
  3. Generate cluster labels for each data point using the vq() function of the vq module.

The first step above ensures that the variation in each variable affects the clusters equally. Imagine two variables with largely different scales. If we ignore the first step, the variable with the larger scale and variation would have a larger impact on the formation of clusters, thus making the process biased. We therefore standardize each variable using the whiten() function. Here, we pass whiten() a list or array of the values of one variable, and it returns the standardized values. After standardization, we print a sample of the data frame. Notice that the variation in the standardized columns is considerably smaller than in the original columns.

from scipy.cluster.vq import whiten

df['scaled_red'] = whiten(df['red'])
df['scaled_blue'] = whiten(df['blue'])
df['scaled_green'] = whiten(df['green'])
df.sample(n = 10)

       red  blue  green  scaled_red  scaled_blue  scaled_green
24888  255   255    255    3.068012     3.590282      3.170015
38583  255   255    255    3.068012     3.590282      3.170015
2659    67    91     72    0.806105     1.281238      0.895063
36954  255   255    255    3.068012     3.590282      3.170015
39967  255   255    255    3.068012     3.590282      3.170015
13851   75   101     82    0.902357     1.422033      1.019377
14354   75   101     82    0.902357     1.422033      1.019377
18613   75   100     82    0.902357     1.407954      1.019377
6719    75   100     82    0.902357     1.407954      1.019377
14299   71   104     82    0.854231     1.464272      1.019377
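To see what this standardization amounts to, here is a small pure-Python sketch that divides a column by its standard deviation. It assumes the population standard deviation, and the helper whiten_column and the sample values are made up for illustration; it is not the SciPy function itself.

```python
from statistics import pstdev

def whiten_column(values):
    """Divide every value by the column's population standard deviation."""
    sd = pstdev(values)
    return [v / sd for v in values]

red = [255, 255, 67, 75, 71]      # made-up sample of one color channel
wide = [v * 100 for v in red]     # the same data on a 100x larger scale

# After whitening, both columns have unit standard deviation.
print(pstdev(whiten_column(red)))
print(pstdev(whiten_column(wide)))
```

Because both columns come out identical after whitening, a variable’s original scale no longer decides how much it influences the clusters.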

The next step is to perform k-means clustering with the standardized columns. We will use the kmeans() function to perform clustering. The kmeans() function (documentation) has two required arguments: the observations and the number of clusters. It returns two values: the cluster centers and the distortion. In SciPy, the distortion is the mean (non-squared) Euclidean distance between each observation and its nearest cluster center. We will not be using distortion in this tutorial.

from scipy.cluster.vq import kmeans

cluster_centers, distortion = kmeans(df[['scaled_red', 'scaled_green', 'scaled_blue']], 2)

The final step in k-means clustering is generating cluster labels. However, in this exercise, we won’t need to do that. We are only looking for the dominant colors, which are represented by the cluster centers.
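Although labels are not needed for the rest of this tutorial, the following sketch shows how the vq() function could be used to generate them, for example to count how many pixels belong to each dominant color. It runs on synthetic data with made-up colors and group sizes, not on the logo.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.RandomState(0)

# Synthetic "pixels": 300 light and 100 dark made-up colors.
light = rng.normal(loc=(250, 250, 250), scale=3, size=(300, 3))
dark = rng.normal(loc=(75, 82, 101), scale=3, size=(100, 3))
pixels = whiten(np.vstack([light, dark]))

cluster_centers, _ = kmeans(pixels, 2)

# vq() returns, for every observation, the index of its nearest center
# and the distance to that center.
labels, distances = vq(pixels, cluster_centers)

# Pixels per cluster; the larger count marks the more dominant color.
counts = np.bincount(labels, minlength=len(cluster_centers))
print(counts)
```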

Display Dominant Colors

We have performed k-means clustering and generated our cluster centers, so let’s see what values they contain.

print(cluster_centers) 
[[2.94579782 3.1243935  3.52525635]
 [0.91860119 1.05099931 1.4465091 ]]

As you can see, the results we get are standardized versions of RGB values. To get the original color values we need to multiply them with their standard deviations.

We will display the colors in the form of a palette using the imshow() function of matplotlib’s pyplot module. However, to display colors, imshow() needs the values of RGB in the range of 0 to 1, where 1 signifies 255 in our original scale of RGB values. We therefore must divide each RGB component of our cluster centers by 255 in order to get a value between 0 and 1, and display them through the imshow() function.

Finally, we have one more consideration before we plot the colors using the imshow() function (documentation). The dimensions of the cluster centers are N x 3, where N is the number of clusters. imshow() is originally intended to display an A x B matrix of colors, so it expects a 3D array of dimensions A x B x 3 (three color elements for each block in the palette). Hence, we need to convert our N x 3 matrix to a 1 x N x 3 matrix by passing the colors of the cluster centers as a list with a single element. For instance, if we stored our colors in colors, we need to pass [colors] as an argument to imshow().
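The conversion steps described above can be sketched in plain Python before we apply them to the real cluster centers. The centers and standard deviations below are hypothetical round numbers, and the helpers to_rgb_unit and to_hex are made up for this illustration.

```python
def to_rgb_unit(center, stds):
    """Undo the standardization (multiply by std), then rescale 0-255 to 0-1."""
    return tuple(c * s / 255 for c, s in zip(center, stds))

def to_hex(rgb_unit):
    """Format a 0-1 RGB triple as a hex color string, clamped to 0-255."""
    channels = (min(255, max(0, int(round(v * 255)))) for v in rgb_unit)
    return '#' + ''.join(format(ch, '02x') for ch in channels)

# Hypothetical standardized centers and per-channel standard deviations.
centers = [(2.0, 1.0, 0.5), (0.5, 2.0, 1.0)]
stds = (100.0, 50.0, 200.0)

colors = [to_rgb_unit(c, stds) for c in centers]
palette = [colors]   # wrap once: 1 row x N columns x 3 channels

print(len(palette), len(palette[0]), len(palette[0][0]))
print([to_hex(c) for c in colors])
```

Wrapping the list once is what turns N colors into a single-row image that imshow() can draw; the hex strings are just a convenient way to copy the colors elsewhere.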

Let’s explore the dominant colors in our image.

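colors = []

# whiten() divides by the population standard deviation, so use ddof=0 here
r_std, g_std, b_std = df[['red', 'green', 'blue']].std(ddof=0)

for cluster_center in cluster_centers:
    # kmeans() received the columns in red, green, blue order
    scaled_r, scaled_g, scaled_b = cluster_center
    colors.append((
        scaled_r * r_std / 255,
        scaled_g * g_std / 255,
        scaled_b * b_std / 255
    ))

plt.imshow([colors])
plt.show()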

As expected, the colors in the palette are quite similar to the prominent colors in the image that we started with. That said, you’ve probably noticed that the light blue color above doesn’t actually appear in our source image. Remember, each cluster center is the mean of the RGB values of all pixels in its cluster. So, the resultant cluster center may not actually be a color in the original image; it is just the RGB value at the center of a cluster of similar-looking pixels from our image.
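If you would rather report a color that really occurs in the image, one option is to pick the pixel closest to each cluster center. The sketch below does this in plain Python on the original 0 to 255 scale; the function nearest_pixel, the sample pixels and the center are assumptions for illustration only.

```python
def nearest_pixel(center, pixels):
    """Return the pixel whose RGB value is closest (Euclidean) to center."""
    return min(pixels, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, center)))

# Made-up pixels and a made-up cluster center on the 0-255 scale.
pixels = [(255, 255, 255), (75, 82, 101), (67, 72, 91), (71, 82, 104)]
center = (71.0, 78.7, 98.7)

print(nearest_pixel(center, pixels))
```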

Conclusion

In this post, we looked at a step-by-step implementation for finding the dominant colors of an image in Python using matplotlib and scipy. We started with a JPG image and converted it to its RGB values using the imread() function of the image module in matplotlib. We then performed k-means clustering with scipy to find the dominant colors. Finally, we displayed the dominant colors using the imshow() function of the pyplot module in matplotlib.

Want to learn more Python skills? Dataquest offers a full sequence of courses that can take you from zero to data scientist in Python. Sign up and start learning for free!

Planet Python

Dataquest: How to Learn Python for Data Science In 5 Steps

Why Learn Python For Data Science?

Before we explore how to learn Python for data science, we should briefly answer why you should learn Python in the first place.

In short, understanding Python is one of the valuable skills needed for a data science career.

Though it hasn’t always been, Python is the programming language of choice for data science. Here’s a brief history:

  • In 2016, it overtook R on Kaggle, the premier platform for data science competitions.
  • In 2017, it overtook R on KDNuggets’s annual poll of data scientists’ most used tools.
  • In 2018, 66% of data scientists reported using Python daily, making it the number one tool for analytics professionals.

Data science experts expect this trend to continue with increasing development in the Python ecosystem. And while your journey to learn Python programming may be just beginning, it’s nice to know that employment opportunities are abundant (and growing) as well.

According to Indeed, the average salary for a Data Scientist is $127,918.

The good news? That number is only expected to increase. The experts at IBM predicted a 28% increase in demand for data scientists by the year 2020.

So, the future is bright for data science, and Python is just one piece of the proverbial pie. Fortunately, learning Python and other programming fundamentals is as attainable as ever. We’ll show you how in five simple steps.

But remember – just because the steps are simple doesn’t mean you won’t have to put in the work. If you apply yourself and dedicate meaningful time to learning Python, you can not only pick up a new skill, but potentially bring your career to a new level.

How to Learn Python for Data Science

First, you’ll want to find the right course to help you learn Python programming. Dataquest’s courses are specifically designed for you to learn Python for data science at your own pace.

In addition to learning Python in a course setting, your journey to becoming a data scientist should also include soft skills. Plus, there are some complementary technical skills we recommend you learn along the way.

Step 1: Learn Python Fundamentals

Everyone starts somewhere. This first step is where you’ll learn Python programming basics. You’ll also want an introduction to data science.

One of the important tools you should start using early in your journey is Jupyter Notebook, which comes prepackaged with Python libraries to help you learn these two things.

Kickstart your learning by: Joining a community

By joining a community, you’ll put yourself around like-minded people and increase your opportunities for employment. According to the Society for Human Resource Management, employee referrals account for 30% of all hires.

Create a Kaggle account, join a local Meetup group, and participate in Dataquest’s members-only Slack discussions with current students and alums.

Related skills: Try the Command Line Interface

The Command Line Interface (CLI) lets you run scripts more quickly, allowing you to test programs faster and work with more data.

Step 2: Practice Mini Python Projects

We truly believe in hands-on learning. You may be surprised by how soon you’ll be ready to build small Python projects.

Try programming things like calculators for an online game, or a program that fetches the weather in your city from Google. Building mini projects like these will help you learn Python. Programming projects like these are standard for all languages, and a great way to solidify your understanding of the basics.

You should start to build your experience with APIs and begin web scraping. Beyond helping you learn Python programming, web scraping will be useful for you in gathering data later.

Kickstart your learning by: Reading

Enhance your coursework and find answers to the Python programming challenges you encounter. Read guidebooks, blog posts, and even other people’s open source code to learn Python and data science best practices – and get new ideas.

Automate The Boring Stuff With Python by Al Sweigart is an excellent and entertaining resource.

Related skills: Work with databases using SQL

SQL is used to talk to databases to alter, edit, and reorganize information. SQL is a staple in the data science community, as 40% of data scientists report consistently using it.*

Step 3: Learn Python Data Science Libraries

Unlike some other programming languages, in Python, there is generally a best way of doing something. The three best and most important Python libraries for data science are NumPy, Pandas, and Matplotlib.

NumPy and Pandas are great for exploring and playing with data. Matplotlib is a data visualization library that makes graphs like you’d find in Excel or Google Sheets.

Kickstart your learning by: Asking questions

You don’t know what you don’t know!

Python has a rich community of experts who are eager to help you learn Python. Resources like Quora, Stack Overflow, and Dataquest’s Slack are full of people excited to share their knowledge and help you learn Python programming. We also have an FAQ for each mission to help with questions you encounter throughout your programming courses with Dataquest.

Related skills: Use Git for version control

Git is a popular tool that helps you keep track of changes made to your code, which makes it much easier to correct mistakes, experiment, and collaborate with others.

Step 4: Build a Data Science Portfolio as you Learn Python

For aspiring data scientists, a portfolio is a must.

These projects should include several different datasets and should leave readers with interesting insights that you’ve gleaned. Your portfolio doesn’t need a particular theme; find datasets that interest you, then come up with a way to put them together.

Displaying projects like these gives fellow data scientists something to collaborate on and shows future employers that you’ve truly taken the time to learn Python and other important programming skills.

One of the nice things about data science is that your portfolio doubles as a resume while highlighting the skills you’ve learned, like Python programming.

Kickstart your learning by: Communicating, collaborating, and focusing on technical competence

During this time, you’ll want to make sure you’re cultivating those soft skills required to work with others, making sure you really understand the inner workings of the tools you’re using.

Related skills: Learn beginner and intermediate statistics

While learning Python for data science, you’ll also want to get a solid background in statistics. Understanding statistics will give you the mindset you need to focus on the right things, so you’ll find valuable insights (and real solutions) rather than just executing code.

Step 5: Apply Advanced Data Science Techniques

Finally, aim to sharpen your skills. Your data science journey will be full of constant learning, but there are advanced courses you can complete to ensure you’ve covered all the bases.

You’ll want to be comfortable with regression, classification, and k-means clustering models. You can also step into machine learning – bootstrapping models and creating neural networks using scikit-learn.

At this point, programming projects can include creating models using live data feeds. Machine learning models of this kind adjust their predictions over time.

Remember to: Keep learning!

Data science is an ever-growing field that spans numerous industries.

At the rate that demand is increasing, there are exponential opportunities to learn. Continue reading, collaborating, and conversing with others, and you’re sure to maintain interest and a competitive edge over time.

How Long Will It Take To Learn Python?

After reading these steps, the most common question we have people ask us is: “How long does all this take?”

There are a lot of estimates for the time it takes to learn Python. For data science specifically, estimates range from 3 months to a year of consistent practice.

We’ve watched people move through our courses at lightning speed and others who have taken it much slower.

Really, it all depends on your desired timeline, the free time that you can dedicate to learning Python programming, and the pace at which you learn.

Dataquest’s courses are created for you to go at your own speed. Each path is full of missions, hands-on learning and opportunities to ask questions so that you can get an in-depth mastery of data science fundamentals.

Get started for free. Learn Python with our Data Scientist path and start mastering a new skill today.

Resources and studies cited: