March 7, 2015

Clojure web scraping with Enlive

This is my Leeroy Jenkins moment in Clojure.

I am creating a web app that combines my interest in judo and desire to learn Clojure. This app will combine various sources of data to create a calendar of upcoming grappling events. I am not limiting it to judo... I have found several other sources of info for BJJ and submission wrestling as well. Most are web sources but a few others are more interesting: Facebook API and even CalDAV could be touched. I wish it to be as automated as possible so scraping is the order of the day. I already have a somewhat detailed plan for the app, to be posted about later.

I think the community may find it useful -- because of the competition among the various sport governing bodies this information is scattered.

The first order of business is getting some data. A lot of this is thanks to David Nolen's Enlive Tutorial. The first data source will be USA Judo's Event Calendar. I would like to give a shout-out to USA Judo for having an easily parsed, well-designed website. It makes this process easier.

The structure of the calendar is

  • paginated, with 10 results per page
  • the page number is indicated with a single url parameter
  • the CSS selector for a single event is "div.list-item.clearfix"
  • requesting a page number for which there are no results (e.g. page 10 when there are only 80 results) yields a 200 status code but an empty selection
The scraping code starts with a few basics:
(ns grapplevents-rest.scraper.usa-judo
  (:require [net.cgrand.enlive-html :as html]
            [grapplevents-rest.scraper.utils :as utils]))

;base url for USA Judo published events
(def base-url "")

;the pages are structured to return 10 results per page
(def results-per-page 10)

(defn url-by-page
  "Construct a URL for a single page of results"
  [page]
  (str base-url "?pg=" page))
Unit tests for this are not very interesting
(ns grapplevents-rest.scraper.usa-judo-test
  (:require [clojure.test :refer :all]
            [grapplevents-rest.scraper.usa-judo :as usa]
            [grapplevents-rest.scraper.utils :as utils]))

(deftest usa-judo-scraper
  (testing "url-by-page"
    (let [test-url (usa/url-by-page 1)]
      ;the url contains the base-url
      (is (re-find (re-pattern usa/base-url) test-url)))))
Now we can do something interesting -- use Enlive to grab data from the built url and select the stuff we want out.

First, I have defined a scraper.utils namespace for Enlive functions that I expect to share among multiple web scrapers.

(ns grapplevents-rest.scraper.utils
  (:require [net.cgrand.enlive-html :as html]))

(defn fetch-url
  "Grab the contents of the url specified"
  [url]
  (html/html-resource (java.net.URL. url)))
Now we can select the actual elements from the page

(defn get-page-events
  "Return a list of events found on a given page, a map with enlive tags"
  [p]
  (-> (url-by-page p)
      (utils/fetch-url)
      (html/select [:div.list-item.clearfix])))
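
To see what that selector does without touching the network, here is a minimal sketch that runs it against an inline HTML snippet. The markup is invented for illustration and is not USA Judo's actual page; the only thing it shares with the real site is the class names.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Made-up markup: two divs carry both classes, one carries only list-item.
(def sample-nodes
  (html/html-snippet
    "<div class=\"list-item clearfix\"><h3>Event A</h3></div>"
    "<div class=\"list-item\"><h3>Not an event</h3></div>"
    "<div class=\"list-item clearfix\"><h3>Event B</h3></div>"))

;; :div.list-item.clearfix only matches divs that have BOTH classes.
(def sample-events
  (html/select sample-nodes [:div.list-item.clearfix]))

(count sample-events)
;=> 2
```

Each selected event is an Enlive node (a map with :tag, :attrs and :content keys), so later code can dig further into it with more selectors.
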
We should use a mock for testing this last bit. I manually captured the text source of a page that yielded 10 events, and saved it to feed in as input. To mock the web request we override the definition of our utility function with one that responds with the saved text file. That means adding the following function to the utils namespace:
(defn fetch-file
  "Grab the contents of the file specified"
  [filename]
  (html/html-resource (java.io.File. filename)))
And now we can test as follows:
   (testing "get-page-events"
    ;mock the web request
    (with-redefs [utils/fetch-url
                    (fn [_]
                      (utils/fetch-file "test/grapplevents_rest/scraper/10-results.txt" ))]
      (let [events (usa/get-page-events 1)]
        (is (seq events))
        (is (= 10 (count events))))))
The argument to the mock function is discarded and we always get the contents of the text file.
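
For anyone new to with-redefs, here is a minimal, self-contained sketch of the same mocking pattern. The names fetch-remote and count-items are hypothetical stand-ins, not part of the scraper.

```clojure
;; Hypothetical stand-in for a function that would hit the network.
(defn fetch-remote
  [url]
  (throw (ex-info "network access not expected in unit tests" {:url url})))

;; Hypothetical caller that reaches fetch-remote through its var.
(defn count-items
  [url]
  (count (fetch-remote url)))

;; Inside with-redefs the var's root binding is swapped for the stub;
;; the original definition is restored when the body exits.
(def stubbed-count
  (with-redefs [fetch-remote (fn [_] [:a :b :c])]
    (count-items "http://example.invalid/?pg=1")))
;; stubbed-count is 3
```

One caveat: the redefinition only lasts for the body of with-redefs, so any lazy sequence built from the mocked function has to be realized (with count, doall, etc.) before the body exits.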

Mocking this is fine for a lot of obvious reasons. But at some point there should be a true integration test that does a sanity check on the underlying assumptions of this scraper (that there are 10 results per page, or that the CSS selector continues to be correct, etc). Today is not the day I will address this though.

Now I have a function that will spit out a sequence of events from a single page. What would be neat to have is a producer function that concatenates the events from each discrete page into one single sequence of events. This will abstract the 10-events-per-page detail away from the persistence layer.

I will take a similar approach to what was used in the Longest Increasing Subsequence post. I can use an ascending index value, coupled with modulo and knowing that there are 10 results per url, to figure out which page and the index on that page of the next event in the sequence.
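
The arithmetic can be sketched on its own. The helper name locate and the constant per-page are made up for this illustration; per-page mirrors results-per-page.

```clojure
(def per-page 10) ; mirrors results-per-page

(defn locate
  "Map a 0-based event index to [page-number index-on-page].
  Pages are 1-indexed, positions within a page are 0-indexed."
  [c]
  [(inc (quot c per-page)) (mod c per-page)])

(map locate [0 9 10 11 25])
;=> ([1 0] [1 9] [2 0] [2 1] [3 5])
```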

After some futzing around in the REPL we have

(defn get-all-events
  "Get lazy seq of events harvested from all of the populated pages"
  ([] (get-all-events 0))
  ([c]
   (let [pg (inc (quot c results-per-page))
         idx (mod c results-per-page)
         l (get-page-events pg)]
     (if (< idx (count l)) ;The events come in X blocks per page so decompose a bit
       (cons (nth l idx)
             (lazy-seq (get-all-events (inc c))))))))
Some notes:
  • the url pages are 1-indexed so the value pg is always incremented at the start
  • when we get to the last page of results it will potentially have fewer than the full 10 results. That is why the (count l) is used rather than just using a hardcoded 10 value
  • once we scroll off the end of the last page of results, idx is no longer less than (count l) (on a partial last page idx reaches the count, on an empty page the count is zero), so the if returns nil and terminates the sequence
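
The termination behaviour can be checked in the REPL without any HTML at all. This is a sketch under assumptions: fake-pages, fake-page-events and all-fake-events are invented stand-ins that use plain numbers instead of Enlive nodes, with a full first page and a partial second page.

```clojure
;; Hypothetical in-memory "site": page 1 is full, page 2 is partial.
(def fake-pages {1 (vec (range 10))
                 2 [10 11]})

(defn fake-page-events
  "Stand-in for get-page-events; unknown pages yield an empty vector."
  [pg]
  (get fake-pages pg []))

(defn all-fake-events
  "Same index-walking shape as get-all-events, over the fake pages."
  ([] (all-fake-events 0))
  ([c]
   (let [pg  (inc (quot c 10))
         idx (mod c 10)
         l   (fake-page-events pg)]
     (when (< idx (count l))
       (cons (nth l idx)
             (lazy-seq (all-fake-events (inc c))))))))

(count (all-fake-events))
;=> 12
(take 3 (all-fake-events))
;=> (0 1 2)
```
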
To test this behavior, I captured a second page of results from the live website. The second page has only 2 events, so we have a grand total of 12 events to play with. Our mocked function to replace the web requests follows:
;mock a sequence of 12 events, ten in the first request and 2 in the second
(def get-files (fn [url]
                    ;the page number is the run of digits at the end of the url
                    (let [idx (Integer/parseInt (re-find #"\d+$" url))]
                      (cond
                        (= idx 1) (utils/fetch-file "test/grapplevents_rest/scraper/10-results.txt")
                        (= idx 2) (utils/fetch-file "test/grapplevents_rest/scraper/02-results.txt")
                        :else (utils/fetch-file "test/grapplevents_rest/scraper/00-results.txt")))))
So page 3 and beyond always yield zero results from the selector.

We add tests to verify the sequence ends

  (testing "get-all-events"
    (with-redefs [utils/fetch-url get-files]
      (let [all-events (usa/get-all-events)]
        (is (seq all-events))
        (is (= 12 (count all-events)))
        (is (= 12 (count (take 23 all-events)))))))

Looking good so far. But now we have created a producer function that requests each page from the source website up to 10 times, once per event on that page. Eeek! We should memoize this. Use core.memoize for nifty TTL functionality.

;memoize this to be good scraping citizens.  Cache for 60 seconds
;(needs [clojure.core.memoize :as memo] in the ns :require)
(def cached-page-events
  (memo/ttl get-page-events :ttl/threshold 60000))
and then update the main producer function to call the memoized version
(defn get-all-events
  "Get lazy seq of events harvested from all of the populated pages"
  ([] (get-all-events 0))
  ([c]
   (let [pg (inc (quot c results-per-page))
         idx (mod c results-per-page)
         l (cached-page-events pg)]
     (if (< idx (count l)) ;The events come in X blocks per page so decompose a bit
       (cons (nth l idx)
             (lazy-seq (get-all-events (inc c))))))))
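
To get a feel for how much the cache saves, here is a rough sketch that counts page fetches with and without memoization. It is not the scraper itself: the pages are fake in-memory data, all the names are invented for the illustration, and it swaps in plain clojure.core/memoize (no TTL) so that it runs without extra dependencies.

```clojure
(def fetch-count (atom 0))

;; Hypothetical data: a full first page and a partial second page.
(def fake-pages {1 (vec (range 10))
                 2 [10 11]})

(defn slow-page-events
  "Stand-in for get-page-events that counts how often it is called."
  [pg]
  (swap! fetch-count inc)
  (get fake-pages pg []))

(defn events-via
  "Same index-walking producer, parameterised on the page function."
  ([page-fn] (events-via page-fn 0))
  ([page-fn c]
   (let [pg  (inc (quot c 10))
         idx (mod c 10)
         l   (page-fn pg)]
     (when (< idx (count l))
       (cons (nth l idx)
             (lazy-seq (events-via page-fn (inc c))))))))

(reset! fetch-count 0)
(dorun (events-via slow-page-events))
(def uncached-fetches @fetch-count) ; 13: ten for page 1, three for page 2

(reset! fetch-count 0)
(dorun (events-via (memoize slow-page-events)))
(def cached-fetches @fetch-count)   ; 2: one per distinct page
```

Unlike the TTL cache, plain memoize never expires its entries, which is fine for a throwaway count but not for a scraper that should eventually pick up new events.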

Yay! Now we have a sequence of events that we can manipulate and persist at our leisure.

Tags: clojure grapplevents enlive