Analyzing WordPress Hook Usage with Azure Data Lake

2017-06-14

WordPress provides a large number of hooks that allow plugins to extend and modify its behavior. A few months ago, I was curious about which of these hooks are popular, and which of them are hardly ever used. I was also looking for an excuse to give Microsoft’s Data Lake Analytics a spin. U-SQL looked especially attractive as it brought back fond memories of petabyte-scale data crunching at Bing.

With that in mind, I set out to build some tools that would calculate the usage of WordPress’s hooks. Breaking that up into smaller steps, I came up with:

Crawl all published plugins on WordPress.org
Extract which hooks are used by each plugin
Extract a list of WordPress hooks
For each WordPress hook, calculate its usage

On the technical side, I set the following goals for this project:

The code should be developed in C# and U-SQL
The project should use .NET Core so that it’s cross-platform (Windows, Linux, Mac)
The project should be usable in Visual Studio, VS Code or from the command line

In this article I talk about the approach and algorithms in general. For the nitty-gritty details, you can check out the source code here: https://github.com/nabsul/WordPressPluginAnalytics.

See the README.md file for instructions on building and running the code.

Crawling for Plugins

I decided to crawl the WordPress.org plugins directory to extract a list of all the plugins. All of the plugins can also be accessed from a common SVN repository, but with different branches and tag folders, I felt that would be slightly more tedious than crawling the HTML pages to extract the official link to each zip file. The HtmlAgilityPack library makes parsing HTML and extracting information very easy. I used it to parse each page of the directory for the links to the individual plugin pages, and then parsed each plugin page for the zip file URL.

Once I had the zip file URL, I uploaded the zip file to Azure Blob Storage. I considered skipping this and working directly with the data from WordPress.org, but this approach gave me a stable snapshot of the original data to experiment on without repeatedly hitting WordPress.org for the same data.

Running the process sequentially took nearly 5 hours from a DigitalOcean droplet, but about 90% of that time was spent waiting on I/O, so adding some parallelism made a lot of sense. I did this very simply by fetching all 12 plugins on each page in parallel, which brought the run time down to just over an hour.
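The per-page parallelism described above can be sketched roughly as follows. This is an illustrative Python analog, not the project's C# code: fetch_plugin and process_page are hypothetical names, the example.invalid URLs are placeholders, and the fetch function is a stand-in that performs no real network I/O. Threads are a reasonable fit here because the work is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor

PLUGINS_PER_PAGE = 12  # page size mentioned in the article


def fetch_plugin(url):
    """Hypothetical stand-in for fetching one plugin zip and storing it."""
    # A real implementation would perform blocking network I/O here.
    return f"stored:{url}"


def process_page(plugin_urls, fetch=fetch_plugin):
    """Fetch every plugin on one listing page concurrently.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=PLUGINS_PER_PAGE) as pool:
        return list(pool.map(fetch, plugin_urls))


if __name__ == "__main__":
    page = [f"https://example.invalid/plugin-{i}.zip" for i in range(12)]
    for result in process_page(page):
        print(result)
```

Pages are still processed one after another in this sketch; only the plugins within a page run concurrently, which keeps the number of simultaneous requests bounded at 12.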

Extracting Data

Now that I have my raw data, the next step is to extract useful information out of it. I used System.IO.Compression.ZipArchive to iterate over each PHP file in the zip file. I then considered writing my own code to parse each PHP file, but quickly gave up on the idea when I realized how complicated that would get. So I looked around and found Devsense.Php.Parser. Using this library, I was able to work directly on tokenized data and avoided all the hassle of parsing text myself.
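For readers who want to see the shape of the zip iteration step, here is a minimal Python sketch using the standard zipfile module as an analog of System.IO.Compression.ZipArchive. The function name iter_php_files and the decision to decode as UTF-8 with replacement are assumptions made for illustration, not details from the project.

```python
import io
import zipfile


def iter_php_files(zip_bytes):
    """Yield (path, source_text) for each .php entry in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".php"):
                continue
            # Plugin files are not guaranteed to be valid UTF-8, so
            # undecodable bytes are replaced instead of raising an error.
            yield info.filename, archive.read(info).decode("utf-8", errors="replace")
```

Each yielded source string would then be handed to a PHP tokenizer or parser.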

With that library, I extracted each hook usage and creation in the PHP files. I only count instances where the hook name is a constant string, since it would be impossible to predict the hook name for code like add_action( "updated_$myvar", ...).
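To make the "constant string only" rule concrete, here is a rough Python sketch. Note that it swaps in a regular expression for the tokenizer-based approach described above, so it is cruder than the real thing (for example, it also matches method calls such as $this->add_action and ignores comments). The names HOOK_CALL, KINDS and extract_hooks are hypothetical, and treating add_action/add_filter as "used" and do_action/apply_filters as "created" is my reading of the text.

```python
import re

# Maps a PHP function name to (event, hook type).
KINDS = {
    "add_action": ("used", "action"),
    "add_filter": ("used", "filter"),
    "do_action": ("created", "action"),
    "apply_filters": ("created", "filter"),
}

HOOK_CALL = re.compile(
    r"""\b(add_action|add_filter|do_action|apply_filters)\s*\(\s*
        (?:'([^'\\]*)'|"([^"\\$]*)")   # quoted literal; no $ in double quotes
        \s*[,)]                        # the literal must be the whole argument
    """,
    re.VERBOSE,
)


def extract_hooks(php_source):
    """Return (event, prefixed_hook_name) pairs for constant hook names."""
    hooks = []
    for match in HOOK_CALL.finditer(php_source):
        event, hook_type = KINDS[match.group(1)]
        name = match.group(2) if match.group(2) is not None else match.group(3)
        hooks.append((event, f"{hook_type}_{name}"))
    return hooks
```

Calls like add_action( "updated_$myvar", ...) or add_action( 'save_' . $type, ...) are skipped because the first argument is not a single constant string.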

The final result needed to be in a format that can be easily analyzed with U-SQL and Azure Data Lake Analytics. U-SQL comes with built-in TSV extractors, so if you upload your raw data in that format, you don’t need custom C# code to process it. Data Lake Analytics can also automatically decompress gzipped files, which is great since my TSV files compress to about 10% of their uncompressed size.
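As a small illustration of the output format, the following Python sketch writes rows as gzipped TSV. The column layout (plugin, event, hook) and the function names are assumptions for the example; the project's actual columns may differ. Fields containing tabs or line breaks are rejected outright rather than quoted, since I am not certain how a given TSV reader would treat quoting.

```python
import gzip


def to_tsv_line(fields):
    """Join fields with tabs, refusing values that would break the format."""
    values = [str(field) for field in fields]
    for value in values:
        if "\t" in value or "\n" in value or "\r" in value:
            raise ValueError(f"field contains a tab or line break: {value!r}")
    return "\t".join(values) + "\n"


def rows_to_tsv_gz(rows):
    """Serialize rows as UTF-8 TSV text and gzip-compress the result."""
    text = "".join(to_tsv_line(row) for row in rows)
    return gzip.compress(text.encode("utf-8"))


if __name__ == "__main__":
    data = rows_to_tsv_gz([("my-plugin", "used", "action_init")])
    with open("plugins.tsv.gz", "wb") as handle:  # hypothetical file name
        handle.write(data)
```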

Extracting the plugins takes less than 1 hour, so I didn’t bother to run parts of that code in parallel.

Running the Analysis

The final step of the process is running a U-SQL script to analyze the data and generate the final report. First, upload the data, either manually or using the command line tool included in the project. You should have two extraction files: one for the WordPress source code and one for all the plugins. Then run the U-SQL script. Again, you can edit and submit the script manually, or, if you followed the naming conventions used in the program, you can submit the job using the command line tool.

U-SQL is a SQL-like language. If you’re familiar with SQL, the code in the script should all make sense. The raw data is read from the uploaded files. The WordPress data is filtered by hooks created and the plugins are filtered by hooks used. Hook usage is counted using a GROUP BY statement. The hooks from WordPress and the plugins are then cross-referenced using a JOIN. The graph of the job looks like this:

[Image: graph of the U-SQL job]

The Cost of Data Lake Analytics

The job should take a couple of minutes to run and costs around $0.03 (US). However, I learned a few important lessons on the pricing of Data Lake jobs. First, when running on a few GB of data, make sure you run with a parallelism of 1. Increasing the parallelism on a small data set is just a waste of money. For example, my 3-cent job cost 12 cents when I ran it with a parallelism of 5. I also suspect that compressing my data files helped reduce the cost of jobs. Compressed data should mean less data travelling over the network, which can often result in significantly faster (and cheaper) jobs.

The second and more important point is about using custom code and libraries in your scripts. It is possible to upload and use custom .NET DLLs in your U-SQL scripts, but I highly recommend avoiding that unless it’s absolutely necessary. I experimented with uploading the individual plugin zip files to Data Lake storage and using a custom extractor library that directly processed the zip files and tokenized the PHP. The cost of running such a job was around $5. This is far more than the cost of working on TSV files, but it does make sense: doing the zip extraction and PHP parsing on Microsoft’s Azure infrastructure will consume far more CPU cycles than if you do most of the pre-processing separately.

As you can see, unlike simpler services like storage, the cost of using this type of service can vary widely depending on how you design your data pipelines. It is therefore important to spend some time researching and carefully considering these decisions before settling on an approach.

Viewing the Results

The final result of running the script is a small TSV-formatted report with the following pieces of information:

Hook Name: The name of the hook (prefixed with action_ or filter_ to differentiate those two types of hooks)
Num Plugins: Number of plugins using the hook
Num Usages: Number of times the hook is used
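To show how these three columns relate to the filter, GROUP BY and JOIN steps, here is a minimal in-memory Python sketch. It is not the project's U-SQL script: the row layouts, the "created"/"used" labels and the function name hook_report are assumptions, and I have assumed a left join so that WordPress hooks no plugin uses still appear with zero counts. The actual script may join differently.

```python
from collections import defaultdict


def hook_report(core_rows, plugin_rows):
    """Build (hook, num_plugins, num_usages) rows.

    core_rows:   iterable of (event, hook) from the WordPress source.
    plugin_rows: iterable of (plugin, event, hook) from all plugins.
    """
    # Filter: keep only hooks that WordPress itself creates.
    core_hooks = {hook for event, hook in core_rows if event == "created"}

    # Group by hook: distinct plugins and total usages.
    plugins_by_hook = defaultdict(set)
    usages_by_hook = defaultdict(int)
    for plugin, event, hook in plugin_rows:
        if event != "used":
            continue
        plugins_by_hook[hook].add(plugin)
        usages_by_hook[hook] += 1

    # Left join from core hooks to the grouped plugin data.
    report = [
        (hook, len(plugins_by_hook.get(hook, ())), usages_by_hook.get(hook, 0))
        for hook in core_hooks
    ]
    # Most widely used hooks first; ties broken by name.
    report.sort(key=lambda row: (-row[1], row[0]))
    return report
```

Hooks that plugins define for themselves, or use without WordPress creating them, drop out at the join because they never appear in core_hooks.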

The data can be imported into a spreadsheet for further analysis and charting:

https://1drv.ms/x/s!AoNGbuElNYPMjMUVzq5931eX9YzSuA

Conclusions

Overall, I felt there was definitely a learning curve to Azure Data Lake services, but it wasn’t all that bad. I’m curious how all of this could be done in the Hadoop ecosystem, which I’m much less familiar with. If anyone would like to try replicating these results in Hadoop, I would greatly appreciate a tutorial and/or shared source code.

This code could easily be expanded to perform other types of analysis. For example, it might be interesting to see the usage of various WordPress functions and classes. It also might be interesting to reduce the list of plugins to the most popular ones to get more realistic usage information for the hooks.