I’ve long been interested in working with Cascading but didn’t relish the thought of jumping back into Java. Thankfully with the arrival of Clojure I can now happily play in my less-typing world.
I wanted to get a basic Cascading demo running so I took the Logparser example from the distribution and ported it over to Clojure. Here’s a zip (casclojure.zip) with the Ant build.xml, the Clojure source, the directory layout, etc. I’ll leave it up to you to get the required jars & libs; I’m using Hadoop 0.18.3 and Cascading 1.0.11. The build.xml has the usual config stuff for setting up where you keep your jars. I gave up trying to be smart about managing jars so I now dump everything into a single dir. Sure makes things like Ant config easier.
Incidentally, this is a perfect example of standing on the shoulders of genius and/or the sufficiently motivated. Most of the build.xml I picked up from Kyle Burton’s “Creating Executable Jars For Your Clojure Application”. (He has some other interesting Clojure posts talking about Incanter & Quartz scheduler btw.)
Clojure is concise. This example isn’t particularly idiomatic Clojure however; you wouldn’t use CamelCase for instance. But it does the job. (Also, gross, definitely need a syntax highlighter now.)
(ns logparser.app
  (:gen-class)
  (:import
    (java.util Properties)
    (cascading.flow Flow FlowConnector)
    (cascading.operation.regex RegexParser)
    (cascading.pipe Each Pipe)
    (cascading.scheme TextLine)
    (cascading.tap Hfs Lfs Tap)
    (cascading.tuple Fields)
    (org.apache.log4j Logger)))

;; Names for the six regex groups pulled out of each log line.
(def apacheFields
  (new Fields (into-array ["ip" "time" "method" "event" "status" "size"])))

(def apacheRegex
  "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$")

(def allGroups (int-array [1 2 3 4 5 6]))

(def parser (new RegexParser apacheFields apacheRegex allGroups))

;; Run the parser over the "line" field of every input tuple.
(def importPipe
  (new Each "parser" (new Fields (into-array ["line"])) parser))

(def properties (new Properties))
(FlowConnector/setApplicationJarClass properties logparser.app)

(defn -main [& args]
  (let [localLogTap   (new Lfs (new TextLine) (first args))
        remoteLogTap  (new Hfs (new TextLine) (last args))
        parsedLogFlow (. (new FlowConnector properties)
                         connect localLogTap remoteLogTap importPipe)]
    ;; Kick off the flow, then block until it finishes.
    (. parsedLogFlow start)
    (. parsedLogFlow complete)))
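If you want to poke at the regex before involving Hadoop at all, something like the following at a plain Clojure REPL should do. It's only a rough sketch: parse-line, field-names and the sample log line are made up for illustration, and it uses Clojure's re-matches rather than Cascading's RegexParser, so treat it as a sanity check on the pattern and not as a picture of what Cascading does internally.

```clojure
;; Same pattern string as apacheRegex in the port, compiled with re-pattern.
(def apache-regex
  (re-pattern
    "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$"))

;; Same six names the port hands to Fields.
(def field-names ["ip" "time" "method" "event" "status" "size"])

;; Hypothetical helper: returns a map of field name -> captured text,
;; or nil when the line doesn't match the pattern.
(defn parse-line [line]
  (when-let [[_ & groups] (re-matches apache-regex line)]
    (zipmap field-names groups)))

;; Made-up sample line in Apache access log format.
(def sample-line
  "127.0.0.1 - - [01/Sep/2007:00:01:03 +0000] \"GET /index.html HTTP/1.1\" 200 1234")

(println (parse-line sample-line))
(println (parse-line "not a log line"))
```

The first call gives back a map with all six fields filled in; the second gives nil because the line never gets as far as the bracketed timestamp.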
Items to note:
1. Having to use into-array, int-array, etc. Just a few Java-isms.
2. Roughly, Hadoop needs a Jar with a Main. Clojure uses :gen-class to handle that sorta thing. :gen-class at first seemed really complicated. Turns out it is and it isn’t. Keep it simple and no problem! For my purposes I just needed to make sure I had a -main definition.
3. You’ll see logparser.app everywhere. It’s the tie that binds this all together. Most of this exercise was really about getting a common namespace set up for everything: the Clojure, the build.xml, the Jar contents, the runtime environment.
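On the first item, here's a small REPL-style sketch of what into-array and int-array actually hand back. Nothing in it needs Cascading on the classpath. The names field-names, as-comparables and groups are just for illustration, and the two-argument into-array call is only there to show you can name the component type yourself; the port above gets by without it.

```clojure
;; into-array picks the component type from the first element,
;; so a vector of strings becomes a Java String[].
(def field-names (into-array ["ip" "time" "method"]))
(println (class field-names))      ; the String[] class

;; The two-argument form names the component type explicitly.
(def as-comparables (into-array Comparable ["ip" "time" "method"]))
(println (class as-comparables))   ; the Comparable[] class

;; int-array builds a primitive int[], which is what a Java
;; method declared with an int[] parameter expects.
(def groups (int-array [1 2 3 4 5 6]))
(println (alength groups) (aget groups 0))
```

A String[] works wherever a Java method wants a Comparable[] because Java arrays are covariant, which is presumably why the plain one-argument into-array is enough for Fields here.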
From the build.xml the key tasks are the compile and the jar-with-manifest. They demonstrate what needs to happen to make Clojure compilation possible and to make a Hadoop-happy Jar.
If you’re following along with Cascading’s Gentle Introduction you can use your newly generated Jar in place of the one mentioned:
hadoop jar ./build/logparser-0.1.jar data/apache.200.txt output
I’ve only run this in the local Hadoop mode, no distribution, no cluster. Running it on a cluster will be for another post perhaps.
I hope this helps. Let me know otherwise. I’ll try to help but I’m far from an expert on any of this.