April 25, 2016    python image-processing statistics


For a long time, I’ve been interested in with web technology. In high school, I read Jesse Liberty’s Complete Idiot`s Guide to a Career in Computer Programming learning about Perl, CGI (common gateway interface), HTML, and other technologies. It wasn’t until I finished a degree in mathematics that I really started learning the basics, namely HTML, CSS, and JavaScript.

At that point, folks were just starting to come out of the dark ages of table-base layouts and experimenting with separating content (HTML) from presentation (CSS) from behavior (JavaScript). A popular discussion was over what the best type of layout was. I remember reading discussions of left vs right handed vs centered pages. (The latter is still a pain to implement but the web has still come a long way since then.) Those discussions stuck with me and motivates the study I’m running right now.

In mathematics, people have been using fundamental matrix algebra to help them accomplish some very interesting things. One application is in image decomposition, basically, breaking images into simpler, more basic components. A space of vectors (or coordinates that can represent data points) have “eigenvectors,” which are vectors that you can use to reconstruct other vectors that you have in front of you. In facial recognition, applying this technique to images of faces yields Eigenfaces, patterns that are common to many of the images.

Coming back to the idea of website layouts, I reasoned there must be a way to see this in real web data somehow. If we took images of the most popular websites on the web, how would their “eigenlayouts” look overall. Would certain layouts (like left or right or center) just pop out of the data somehow. Well, to answer that question, we need some data, we need to analyze it, and then we need to interpret it.

Data Retrieval

To run the analysis on these websites, we need to turn them into images. For that, I turned to PhantomJS, a headless browser geared toward rendering websites. For the sake of having a portable solution (able to run just about anywhere without needing too much bootstrapping), I decided to use a community-submitted Docker image that I found that that did that job nicely. Basic usage has you specifying the website you want to render as well as the output filename for the image. You can additionally pass in arguments for your webpage resolution. I went with 800px by 1200px because it’s a sensible minimum that people are creating websites for.

When I need to run a lot of commands and don’t need a who programming language, I typically turn to GNU make for defining a pipeline of work and for parallelizing it out.

IMAGEDIR := images/
DOMAINDIR := domains/

DOMAINLIST := $(shell find $(DOMAINDIR) -type f)

# Docker volume requires an absolute path
VOLUME := $(abspath $(IMAGEDIR))


    mkdir -p $(IMAGEDIR)
    $(eval DOMAIN := $(patsubst $(IMAGEDIR)%.png,%,$@))
    sudo docker run -t --rm -v $(VOLUME):/raster-output herzog31/rasterize http://$(DOMAIN) "$(notdir $@)" 1200px*800px 1.0

# This rule will create 10,000 files in the $(DOMAINDIR) directory
    mkdir -p $(DOMAINDIR)
    cut -d, -f2 top-1m.csv | head -10 | xargs -n1 -I{} touch "$(DOMAINDIR){}"

The gist of this is that first you run make mkdomains to create a directory domains/ filled with dummy targets of each domain you want to look up (10,000 take up about 80MB of space). Then, running make will use that seeded directory of domains to pull each one down and drop the image file in the images/ directory. You can file all of this under “stupid Makefile hacks.”

Next Steps

So far, this covers a very small portion of the 80% of data science that’s cleaning and munging data. The next blog post will focus on loading and analyzing this image data in Python using the [scikit-image](http://scikit-image.org/) library. If you liked this post, please consider subscribing to updates and following me on Twitter.