
Friday, June 13, 2008

WideFinder 2 in Clojure (naive port from Ruby)

I ported the reference implementation of Wide Finder 2 from Ruby to Clojure nearly line by line.
On my box, this code runs more than 25% faster than the original Ruby on 10M lines (2'45" versus 3'45"), but Ruby is faster on inputs of up to 100k lines.
;; Accumulators. `binding` below gives each one a thread-local map
;; that `acc` can then update with set!.
(def ^:dynamic u-hits)
(def ^:dynamic u-bytes)
(def ^:dynamic s404s)
(def ^:dynamic clients)
(def ^:dynamic refs)

;; Add v to the count stored under key k in the map held by var h.
(defmacro acc [h k v]
  `(set! ~h (assoc ~h ~k (+ (get ~h ~k 0) ~v))))

;; The n entries of h with the largest values, largest first.
;; `compare` avoids the int overflow a subtracting comparator can hit
;; on large byte counts.
(defn top [n h]
  (take n (sort #(compare (val %2) (val %1)) h)))

(defn record [client u bytes ref]
  (acc u-bytes u bytes)
  (when (re-matches #"^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$" u)
    (acc u-hits u 1)
    (acc clients client 1)
    ;; count the referrer unless it is "-" or matches the exclusion pattern
    (when-not (or (= ref "\"-\"") (re-find #"^\"" ref))
      (acc refs (subs ref 1 (dec (count ref))) 1)))) ; lose the quotes

;; Like clojure.core/printf, but always formats with the ENGLISH locale.
(defn printf-en [^String fmt & args]
  (let [f (java.util.Formatter. *out*)]
    (.format f java.util.Locale/ENGLISH fmt (to-array args))))

(defn report
  ([label hash] (report label hash false))
  ([label hash shrink]
   (println (str "Top " label ":"))
   (let [fmt (if shrink " %9.1fM: %s\n" " %10d: %s\n")]
     (doseq [[key val] (top 10 hash)]
       (let [key (if (< 60 (count key)) (str (subs key 0 60) "...") key)
             ;; divide by doubles: %f cannot format a Clojure ratio
             val (if shrink (/ val 1024.0 1024.0) val)]
         (printf-en fmt val key))))))

(binding [u-hits {} u-bytes {} s404s {} clients {} refs {}]
  (doseq [line (-> System/in
                   (java.io.InputStreamReader. "US-ASCII")
                   java.io.BufferedReader.
                   line-seq)]
    (let [f (.split #"\s+" line)]
      (when (= "\"GET" (get f 5))
        (let [[client u status bytes ref] (map #(get f %) [0 6 8 9 10])]
          (cond
            (= "200" status) (record client u (Integer/parseInt bytes) ref)
            (= "304" status) (record client u 0 ref)
            (= "404" status) (acc s404s u 1))))))

  (print (count u-hits) "resources," (count s404s) "404s," (count clients) "clients\n\n")

  (report "URIs by hit" u-hits)
  (report "URIs by bytes" u-bytes true)
  (report "404s" s404s)
  (report "client addresses" clients)
  (report "referrers" refs))
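
To see what acc and top do in isolation, here is a small sketch on a handful of made-up URIs. It assumes a current Clojure release, where a var used with binding and set! has to be declared ^:dynamic. The names *hits*, count-hits and sample-hits are illustrative only and do not appear in the program above.

```clojure
;; Hypothetical stand-in for one of the accumulators (u-hits, refs, ...).
(def ^:dynamic *hits*)

;; Add v to the count stored under key k in the map held by var h.
(defmacro acc [h k v]
  `(set! ~h (assoc ~h ~k (+ (get ~h ~k 0) ~v))))

;; The n entries of h with the largest values, largest first.
(defn top [n h]
  (take n (sort #(compare (val %2) (val %1)) h)))

(defn count-hits
  "Counts how often each URI occurs in uris and returns the final map."
  [uris]
  (binding [*hits* {}]          ; thread-local map, so set! is allowed
    (doseq [u uris]
      (acc *hits* u 1))
    *hits*))

(def sample-hits (count-hits ["/a" "/b" "/a" "/c" "/a" "/b"]))

;; Print the two most requested URIs with their counts.
(doseq [[uri n] (top 2 sample-hits)]
  (println n uri))
```

Each call to acc rebinds the var to a new immutable map, which is how the port keeps the shape of the Ruby code (mutating a hash in place) without any mutable collection.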

My next post will show how one can achieve some parallelization without altering the logic much:
(pdoseq line (-> (. System in) (java.io.InputStreamReader. "US-ASCII") java.io.BufferedReader. line-seq)
  [u-hits (merge-with +), u-bytes (merge-with +), s404s (merge-with +), clients (merge-with +), refs (merge-with +)]
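
The pdoseq macro itself is not defined in this post, so the following is not its implementation. It is only a minimal sketch of the idea the teaser hints at: count separate chunks of lines independently, then combine the partial maps with merge-with +. The functions count-by, pcount-by and uri-of, the chunk size and the sample lines are all assumptions made for illustration, and pmap is swapped in as the simplest way to run the chunks in parallel.

```clojure
(require '[clojure.string :as str])

(defn count-by
  "Returns a map from (f line) to its number of occurrences in lines."
  [f lines]
  (reduce (fn [m line] (update m (f line) (fnil inc 0))) {} lines))

(defn pcount-by
  "Same result as count-by, but counts chunks of chunk-size lines in
  parallel with pmap and merges the partial maps with merge-with +."
  [f chunk-size lines]
  (->> (partition-all chunk-size lines)
       (pmap #(count-by f %))
       (reduce #(merge-with + %1 %2) {})))

;; Hypothetical two-field lines: client address, then URI.
(def sample-lines
  ["1.2.3.4 /a" "5.6.7.8 /b" "1.2.3.4 /a" "9.9.9.9 /a" "5.6.7.8 /c"])

(defn uri-of [line]
  (second (str/split line #"\s+")))

(def parallel-counts (pcount-by uri-of 2 sample-lines))
(println parallel-counts)
```

Because addition is associative and commutative, merging the per-chunk maps gives the same totals as a single sequential pass, whatever the chunk boundaries are. In a standalone script, call (shutdown-agents) once the work is done so the JVM exits promptly after pmap.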
