Tuesday, November 11, 2014

PHP Search Speed

I recently had a problem with a PHP command line script I use to sanity check some of our systems.

One particular check compares user accounts across a number of systems. There are ~100,000 user accounts. The script loads data in to memory from various LDAP and SQL interfaces. Once in memory, it iterates through the records and compares them. This process takes a few minutes, mostly spent loading data via LDAP.

I had identified a data problem in one system which uses SQL to get the user data. The SQL in question already joins a few tables and rather than introduce another complex join, I decided to just load the data about the new item into a new array using a separate SQL statement. This added just a few seconds to the script run time. However, once I implemented the record compare, the speed of the script slowed to a halt.

This was unexpected because I regularly process data sets this size and larger in PHP without and problems (once I raise the memory_limit).

This is what I discovered while debugging this problem:
It is well known that the PHP in_array & array_search functions are slow. They perform a linear search on the array resulting in O(n) performance.
Many people have contributed code which can implement a binary search on an array. This in theory reduces the performance to O(log n). My testing shows an improvement in search times but they are still quite slow and not representitive of the perforamce I am use to.
So how was it that I have never had this problem before? A quick inspection of my existing code shows that I always retrieve data by using the key. This results in a hashtable lookup of O(1).
So I changed my code to store my new data in the array key and use isset($array[$key]) and bang! my script is running on time again.

Footnote:

The performance of these functions may not be exactly O(n), O(log n) and O(1) but are close enough to enable comparing the performance of these functions against each other.

No comments:

Post a Comment