A Blog About Anime, Code, and Pr0 H4x

imgSeek - Overcoming Performance Bottlenecks by Clustering Iskdaemon Instances

March 13, 2013 at 12:00 PM

imgSeek (and its server-side variant iskdaemon) is an open source image matching engine developed by Ricardo Cabral. In a nutshell, imgSeek makes it possible (with just a little bit of hacking) to perform reverse image searches and visual similarity comparisons against a specific group of images of your choice, just like Google Images or TinEye (but without the beefy monthly fees). For more information, take a look at the official imgSeek documentation.

For those of you who may not see the value in being able to do reverse image searches and visual similarity comparisons on an arbitrary set of images, allow me to elaborate with a real-world example of how I use imgSeek.

Ever since mid-November I've been using imgSeek to handle all of the server-side logic behind the "identify a card by taking a picture of it" feature of my iOS app Duel Master: Yu-Gi-Oh Edition. With this feature all a user has to do is take a picture of a Yu-Gi-Oh card, and within seconds all of the relevant information about that particular card (details, rulings, prices, etc) will be presented to them. This is all accomplished by uploading the picture they took to one of the Studio Bebop API servers, feeding it through imgSeek, and then returning the results to the app for additional processing and presentation. You can see a demonstration of this feature in action in the promo video below.

As you can see, imgSeek definitely works, and it works pretty well too. However, don't let my super awesome app promo fool you: there are some serious drawbacks and kinks to using imgSeek. Despite the fact that imgSeek is technically stable enough to be deployed for real-world projects, the truth is that it is still in-development software, and as such suffers from bugs, hiccups, and scary segmentation faults.

Moreover, imgSeek doesn't scale very well, especially when it comes to handling lots of requests at once. While there is some vague mention of a "clustered mode" within the default iskdaemon configuration file, it isn't actually an implemented feature yet. As such, if you hit a stock copy of iskdaemon with lots of requests at a time, you're going to start to see some serious lag, because a stock copy of iskdaemon can currently only process one request at a time. So if it takes roughly one second for your copy of iskdaemon to perform a visual similarity comparison, and you've got twenty or thirty requests in the queue, the last request in line will be waiting twenty or thirty seconds. (Which will only grow worse as you add more images to your database.)
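To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every request takes the same amount of time and that each instance handles exactly one request at a time; the function name and the 12-worker figure are purely illustrative.

```python
def worst_case_wait(queued_requests, seconds_per_request, workers=1):
    """Seconds until the last queued request finishes.

    Assumes each worker handles one request at a time and every
    request takes the same amount of time (a simplification).
    """
    # Ceiling division: how many "rounds" of work the queue needs.
    rounds = (queued_requests + workers - 1) // workers
    return rounds * seconds_per_request


print(worst_case_wait(30, 1.0))      # one stock instance: 30.0
print(worst_case_wait(30, 1.0, 12))  # twelve instances in parallel: 3.0
```

Real-world numbers will be messier than this, but it shows why a single instance that handles one request at a time falls over so quickly under load.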

Luckily for all of you, I've taken it upon myself to implement fixes for all of these gripes (and a few more I didn't bother mentioning), and put them into a special branch of iskdaemon called iskdaemon-clustered!

Clustered Access Layout

The basic theory behind clustering instances of iskdaemon is pretty straightforward. First you launch multiple instances of iskdaemon, each listening on a different port, but all sharing the same image database file. Then you use Nginx (or whatever HTTP daemon floats your boat) to handle the actual load balancing via a round-robin style proxy pass. Simple, right? WRONG! Well, sort of...

The above layout will work fine as long as you're just performing read requests (queryImgID, queryImgBlob, queryImgPath), but once you start doing write requests (addImgBlob, addImg) through the load balancing proxy pass, that's when things start to break. To put it simply, your instance nodes will start to develop database inconsistencies with each other. By which I mean that some nodes will have an image, while others won't. Moreover, once you start trying to save/load to the same database file things get even worse, because that's when you'll start to see endless loops of database errors and/or crashes.

To overcome this shortcoming, I decided to tweak things so that there is a separate instance of iskdaemon running outside of the proxy pass group, that is specifically dedicated to performing write requests. Then with the addition of some fancy h4x, I made it so that the reader iskdaemon instances in the proxy pass group automatically update their local copies of the image database so that they are always up to date.
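From the client's point of view, this split just means picking the right endpoint for each call. Here's a minimal sketch of that routing. The URLs are assumptions (the ports match the setup described later in this post, and the /RPC path is my assumption about where the XML-RPC interface lives), and endpoint_for is a hypothetical helper, not something that ships with iskdaemon-clustered.

```python
# Assumed endpoints: reads go through the Nginx proxy, writes go straight
# to the dedicated writer instance. Adjust both to match your own setup.
READ_ENDPOINT = "http://localhost:81/RPC"
WRITE_ENDPOINT = "http://localhost:1336/RPC"

# Method names taken from this post; anything else is rejected.
READ_METHODS = frozenset(["queryImgID", "queryImgBlob", "queryImgPath"])
WRITE_METHODS = frozenset(["addImg", "addImgBlob", "saveDb"])


def endpoint_for(method_name):
    """Return the URL a given iskdaemon call should be sent to."""
    if method_name in READ_METHODS:
        return READ_ENDPOINT
    if method_name in WRITE_METHODS:
        return WRITE_ENDPOINT
    raise ValueError("don't know how to route %r" % method_name)


print(endpoint_for("queryImgBlob"))  # the load-balanced reader cluster
print(endpoint_for("addImgBlob"))    # the single writer instance
```

The returned URL can then be handed to whatever XML-RPC client you're using. The important part is that write calls never go through the proxy pass group.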

Implementing the separate writer instance was pretty straightforward, but the logic behind keeping all of the reader instances up to date is a bit more complicated. In this next section I'll be going over in detail how I do that. You don't have to read the next section to compile/install iskdaemon-clustered, but you probably should. If you don't feel like it, skip ahead to Installing iskdaemon-clustered On Your Server.

Overcoming Database Inconsistencies With Multiple Iskdaemon Instances


Clustered access layout with separate writer instance.

The reason that parallel iskdaemon instances can develop database inconsistencies in the first place lies in the fact that iskdaemon reads and writes its image data to a single database file that is only read into memory when the iskdaemon process first starts. Any image data you add via addImgBlob or addImg is held only in memory until you call saveDb.

So when you add an image to your images database using one of your parallel instances of iskdaemon, the other instances won't know about it until they reread the database file, which normally only happens when you start iskdaemon. To overcome this hurdle I've modified queryImgBlob and queryImgID so that they call a special function that checks to see if the images database has been modified since the last time there was a read request, and rereads it into memory if there have been any changes, before doing any actual image matching work.
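To illustrate the idea, here's a minimal sketch of a "check before every read" wrapper. It assumes the change check is done by comparing the file's modification time, which is one simple way to do it and not necessarily what the iskdaemon-clustered code does. ReloadingDb and load_lines are hypothetical names, and a throwaway text file stands in for the real image database.

```python
import os
import tempfile


class ReloadingDb(object):
    """Hypothetical wrapper that rereads a database file when it changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader      # callable that reads the file into memory
        self.last_mtime = None
        self.data = None

    def refresh_if_changed(self):
        """Reread the file if its mtime differs from the last one seen."""
        mtime = os.path.getmtime(self.path)
        if mtime == self.last_mtime:
            return False
        self.data = self.loader(self.path)
        self.last_mtime = mtime
        return True

    def query(self, match):
        self.refresh_if_changed()  # bring memory up to date before matching
        return match(self.data)


def load_lines(path):
    with open(path) as f:
        return f.read().split()


# Usage: a temp file plays the part of the shared image database.
handle, db_path = tempfile.mkstemp()
os.close(handle)
with open(db_path, "w") as f:
    f.write("image-1\n")

db = ReloadingDb(db_path, load_lines)
first = db.query(len)   # first read loads the file: one image

with open(db_path, "w") as f:   # the "writer" saves a second image
    f.write("image-1\nimage-2\n")
# Two quick writes can land on the same timestamp, so force a newer mtime.
os.utime(db_path, (db.last_mtime + 10, db.last_mtime + 10))

second = db.query(len)  # change detected, file reread: two images
os.remove(db_path)
```

Note the caveat in the usage section: modification times have limited resolution, so a timestamp-based check can miss two writes that happen in very quick succession.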

Unfortunately rereading the database file into memory is trickier than you might think. If for instance you try to reread the database file while your writer process is in the middle of saving its new changes, you'll more than likely run into a fat load of read errors that could potentially send your reader instances into an infinite loop of database errors. To work around this issue, I implemented another special function that copies the main images database file into a temporary file, which is read, and then deleted. If the read fails for some reason, the function recurses into itself until it successfully rereads the image database file. I'll be the first to admit that it's not the most elegant of solutions, but it's simple, and it works.
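Here's a minimal sketch of that copy, read, delete, retry dance. One deliberate difference: where my implementation recurses until it succeeds, this sketch uses a bounded loop with a short pause, so a permanently broken file can't hang it forever. read_db_snapshot, the loader callable, and the exception types caught are all hypothetical stand-ins for whatever actually parses the database file.

```python
import os
import shutil
import tempfile
import time


def read_db_snapshot(db_path, loader, max_attempts=5, delay=0.1):
    """Copy db_path to a temp file, load the copy, then delete the copy.

    Retries if the copy or the load fails, e.g. because the writer was
    in the middle of saving when the copy was made.
    """
    for _ in range(max_attempts):
        handle, tmp_path = tempfile.mkstemp(suffix=".isk-snapshot")
        os.close(handle)
        try:
            shutil.copyfile(db_path, tmp_path)
            return loader(tmp_path)
        except (IOError, OSError, ValueError):
            time.sleep(delay)    # give the writer a moment, then try again
        finally:
            os.remove(tmp_path)  # the snapshot is always cleaned up
    raise RuntimeError("could not read %s after %d attempts"
                       % (db_path, max_attempts))
```

Reading from a private copy means a reader never holds the shared file open for long, and a half-written snapshot just gets thrown away and retried.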

Below is a flow chart that outlines the process iskdaemon-clustered uses to handle read requests without running into database inconsistencies.



Installing iskdaemon-clustered On Your Server

Please note that the following instructions are for compiling/installing on a *nix system. You're on your own, Windows users.

First up, make sure you have all of the necessary prerequisites. (If you are using Gentoo, you should be able to emerge all of this stuff without any problems.)

  • git
  • nginx
  • python version >= 2.5 (but not 3.0 yuck!) and python development libraries
  • python twisted matrix libs 8.x or later
  • python SOAPpy package 0.12
  • C/C++ compilers
  • libmagick
  • libmagick++
  • SWIG

Next, clone the iskdaemon-clustered Github repository.

git clone https://github.com/StudioBebop/iskdaemon-clustered.git
          

Now compile iskdaemon-clustered!

$ cd iskdaemon-clustered
$ cd src
$ python setup.py build
$ sudo python setup.py install
          

Assuming that you have all of the necessary prerequisites, and nothing went wrong, iskdaemon-clustered should now be installed on your system. If you are having problems compiling, try taking a look at the installation instructions on the imgSeek website.

Now for the fun part, configuring your iskdaemon cluster! As I explained earlier, the basic concept here is to launch multiple instances of iskdaemon.py in parallel that all share the same database file. To make this easier for you, I've included a Python script (launch-clustered-isk.py) that handles it (again, it's a little hacky, but it gets the job done without too much work).

First copy launch-clustered-isk.py to wherever you'd like to hold the database and other files for your iskdaemon cluster. (I just use my home directory.)

cp iskdaemon-clustered/launch-clustered-isk.py ~
          

launch-clustered-isk.py should work right out of the box, but for the sake of learning, let's take a quick peek at its configuration options.

Open launch-clustered-isk.py up in your favorite text editor (Nano master race reporting in). Lines 19-23 are where you can make adjustments if you need/want to. Each line is commented, but I'll give you a quick overview anyway.

  • instance_count = 13
    This sets how many instances of iskdaemon.py you'd like to launch. With the default configuration 13 instances will be launched. 1 for writing, and 12 for reading.
  • start_port = 1336
    This sets the port to start your instances listening on. With each instance the listening port will be incremented by 1.
    • Instance 1 - listens on port 1336
    • Instance 2 - listens on port 1337
    • Instance 3 - listens on port 1338
  • execpath = "/usr/bin/iskdaemon.py"
    This sets the path to iskdaemon.py. The default value should work, but you will need to adjust it if you ended up installing iskdaemon.py somewhere else.
  • isk_root = os.path.join(os.path.abspath("."), "isk-cluster")
    This sets the path that will be created to hold all of the iskdaemon cluster files. You should probably leave this line alone.
  • iskdbpath = os.path.join(os.path.abspath("."), "isk-db")
    This sets the path to the main iskdaemon image database file. You should probably leave this line alone.
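To make the port numbering concrete, here's a small sketch that works out which port belongs to which instance from instance_count and start_port. The cluster_ports helper is hypothetical (it isn't part of launch-clustered-isk.py); it just mirrors the layout used in this post, where the first instance is the writer and the rest are readers.

```python
def cluster_ports(instance_count=13, start_port=1336):
    """Return (writer_port, reader_ports) for a cluster.

    Assumes the first instance is the writer and every following
    instance is a reader, each one port higher than the last.
    """
    ports = [start_port + i for i in range(instance_count)]
    return ports[0], ports[1:]


writer_port, reader_ports = cluster_ports()
print(writer_port)                                   # 1336
print("%d-%d" % (reader_ports[0], reader_ports[-1]))  # 1337-1348
print(len(reader_ports))                             # 12
```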

Once you have launch-clustered-isk.py configured just the way you want it, it's time to configure Nginx to handle proxying requests to your cluster.

Open up Nginx's config file (should be /etc/nginx/nginx.conf on Gentoo) in your favorite text editor, and in the http{} section, add the following lines.

http {
    upstream isk-cluster {
        server localhost:1337;
        server localhost:1338;
        server localhost:1339;
        # ... skipping some lines, and assuming you configured 12 reader instances
        server localhost:1346;
        server localhost:1347;
        server localhost:1348;
    }

    # listen on localhost on port 81
    server {
        listen 81;
        server_name localhost;

        location / {
            proxy_pass http://isk-cluster;
        }
    }
}
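If you'd rather not type out a dozen server lines by hand, the upstream block above can be generated. This is just a convenience sketch: upstream_block is a hypothetical helper, and the default name and host simply mirror the config shown here.

```python
def upstream_block(reader_ports, name="isk-cluster", host="localhost"):
    """Build an Nginx upstream block with one server line per reader port."""
    lines = ["upstream %s {" % name]
    lines += ["    server %s:%d;" % (host, port) for port in reader_ports]
    lines.append("}")
    return "\n".join(lines)


# 12 readers on ports 1337 through 1348; the writer on 1336 is left out
# on purpose so that write requests never go through the proxy.
block = upstream_block(range(1337, 1349))
print(block)
```

Paste the output into the http{} section in place of the hand-written upstream block.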
          

Now (re)start Nginx, and then run launch-clustered-isk.py. If everything went right, you should see a bunch of lines about launching iskdaemon instances and listening on different ports. If you see error messages, something has gone terribly terribly wrong, and it's up to you to figure out what.

Assuming everything went as planned, you should now be able to access your iskdaemon read cluster from http://localhost:81, and your writing instance via http://localhost:1336. Have fun!

Miscellaneous Tips and Information

  • launch-clustered-isk.py isn't a daemon process. If you want to launch and forget it, do what I do, and launch it in a screen.

    $ screen
    $ ./launch-clustered-isk.py &
  • launch-clustered-isk.py uses infinite loops to keep your various iskdaemon instances running, even if they crash. If you want to shut down your iskdaemon cluster, execute the following commands.

    $ killall launch-clustered-isk.py
    $ killall iskdaemon.py
  • I've modified the queryImgBlob and addImgBlob functions a little bit. To use them, send them base64 encoded image data. I did this so that I could use these functions with Ruby's default XMLRPC library.
  • An additional way you can give your iskdaemon instances a boost is by renicing them. To renice your entire iskdaemon cluster, execute the following command.

    $ sudo renice -n -10 -p $(pgrep iskdaemon.py)
  • Iskdaemon and iskdaemon-clustered are resource monsters. Make sure you have the hardware resources necessary to run things smoothly. You don't want to be dipping heavily into swap space because you ran out of RAM.
    • I'm running my iskdaemon cluster on a server with 64 GB of RAM and an SSD for the database files.
  • I use iskdaemon-clustered in the following projects.

  • If you have an idea to improve iskdaemon-clustered, or are using it in a project, let me know!
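As for the base64 tip above, here's a minimal sketch of preparing image data on the client side. encode_image_for_rpc is a hypothetical helper. It only produces the base64 string, since this post doesn't spell out the full argument lists for queryImgBlob and addImgBlob.

```python
import base64


def encode_image_for_rpc(image_bytes):
    """Return raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


# The first three bytes of any JPEG file, as a tiny stand-in for a photo.
jpeg_magic = b"\xff\xd8\xff"
print(encode_image_for_rpc(jpeg_magic))  # /9j/
```

In practice you'd read the whole image file in binary mode, run it through a function like this, and pass the resulting string as the image data argument of queryImgBlob or addImgBlob.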
