sun, 13-mar-2016, 08:27

Introduction

There are now 777 photos in my photolog, organized in reverse chronological order (or chronologically if you append /asc/ to the url). With that much data, it occurred to me that there ought to be a way to organize these photos by color, similar to the way some people organize their books. I didn’t find a way of doing that, unfortunately, but I did spend some time experimenting with image similarity analysis using color.

The basic idea is to generate histograms (counts of the pixels in the image that fit into pre-defined bins) for red, green and blue color combinations in the image. Once we have these values for each image, we use the chi square distance between the values as a distance metric that is a measure of color similarty between photos.

Code

I followed this tutorial Building your first image search engine in Python which uses code like this to generate 3D RGB histograms (all the code from this post is on GitHub):

import cv2

def get_histogram(image, bins):
   """ calculate a 3d RGB histogram from an image """
   if os.path.exists(image):
      imgarray = cv2.imread(image)

      hist = cv2.calcHist([imgarray], [0, 1, 2], None,
                           [bins, bins, bins],
                           [0, 256, 0, 256, 0, 256])
      hist = cv2.normalize(hist, hist)

      return hist.flatten()
   else:
      return None

Once you have them, you need to calculate all the pair-wise distances using a function like this:

def chi2_distance(a, b, eps=1e-10):
   """ distance between two histograms (a, b) """
   d = 0.5 * np.sum([((x - y) ** 2) / (x + y + eps)
                     for (x, y) in zip(a, b)])

   return d

Getting histogram data using OpenCV in Python is pretty fast. Even with 32 bins, it only took about 45 minutes for all 777 images. Computing the distances between histograms was a lot slower, depending on how the code was written.

With 8 bin histograms, a Python script using the function listed above, took just under 15 minutes to calculate each pairwise comparison (see the rgb_histogram.py script).

Since the photos are all in a database so they can be displayed on the Internet, I figured a SQL function to calculate the distances would make the most sense. I could use the OpenCV Python code to generate histograms and add them to the database when the photo was inserted, and a SQL function to get the distances.

Here’s the function:

CREATE OR REPLACE FUNCTION chi_square_distance(a numeric[], b numeric[])
RETURNS numeric AS $_$
   DECLARE
      sum numeric := 0.0;
      i integer;
   BEGIN
      FOR i IN 1 .. array_upper(a, 1)
      LOOP
            IF a[i]+b[i] > 0 THEN
               sum = sum + (a[i]-b[i])^2 / (a[i]+b[i]);
            END IF;
      END LOOP;

      RETURN sum/2.0;
   END;
$_$
LANGUAGE plpgsql;

Unfortunately, this is incredibly slow. Instead of the 15 minutes the Python script took, it took just under two hours to compute the pairwise distances on the 8 bin histograms.

When your interpreted code is slow, the solution is often to re-write compiled code and use that. I found some C code on Stack Overflow for writing array functions. The PostgreSQL interface isn’t exactly intuitive, but here’s the gist of the code (full code):

#include <postgres.h>
#include <fmgr.h>
#include <utils/array.h>
#include <utils/lsyscache.h>

/* From intarray contrib header */
#define ARRPTR(x) ( (float8 *) ARR_DATA_PTR(x) )

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(chi_square_distance);
Datum chi_square_distance(PG_FUNCTION_ARGS);

Datum chi_square_distance(PG_FUNCTION_ARGS) {
   ArrayType *a, *b;
   float8 *da, *db;

   float8 sum = 0.0;
   int i, n;

   da = ARRPTR(a);
   db = ARRPTR(b);

   // Generate the sums.
   for (i = 0; i < n; i++) {
      if (*da - *db) {
            sum = sum + ((*da - *db) * (*da - *db) / (*da + *db));
      }
      da++;
      db++;
   }

   sum = sum / 2.0;

   PG_RETURN_FLOAT8(sum);
}

This takes 79 seconds to do all the distance calculates on 8 bin histograms. That kind of improvement is well worth the effort.

Results

After all that, the results aren’t as good as I was hoping. For some photos, such as the photos I took while re-raising the bridge across the creek, sorting by the histogram distances does actually identify other photos taken of the same process. For example, these two photos are the closest to each other by 32 bin histogram distance:

//media.swingleydev.com/img/photolog/2014/08/end_of_the_log_raised_to_the_bank_2014-08_600.jpg

//media.swingleydev.com/img/photolog/2014/08/moving_heavy_things:_log_edition_2014-08_600.jpg

But there are certain images, such as the middle image in the three below that are very close to many of the photos in the database, even though they’re really not all that similar. I think this is because images with a lot of black in them (or white) wind up being similar to each other because of the large areas without color. It may be that performing the same sort of analysis using the HSV color space, but restricting the histogram to regions with high saturation and high value, would yield results that make more sense.

//media.swingleydev.com/img/photolog/2016/01/sunrise_at_abr_2016-01_600.jpg

//media.swingleydev.com/img/photolog/2013/01/arrival_600.jpg

//media.swingleydev.com/img/photolog/2012/09/chinook_sunrise_600.jpg

tags: photos SQL photolog OpenCV C color RGB