Play with the Machine » clojure
http://www.machinelake.com
Sat, 03 Sep 2011 16:08:33 +0000

Exploring LingPipe with Clojure & Cljr
http://www.machinelake.com/2010/09/11/exploring-lingpipe-with-clojure-cljr/
Sat, 11 Sep 2010 21:33:57 +0000, gavin

Clojure’s REPL is already a great start when it comes to exploring new Java packages. In the past I’d throw together a disposable lein project with the unknown code and open a REPL and get exploring. But now I have a new tool in the toolbox that lets me skip this step. Cljr is more focused on the workspace (for lack of a better name) rather than the lein style project. In short, you get both a REPL and swank (emacs’ clojure integration) backed by global package management. Cljr can pull in the random jars you have locally or it can pull them in from Clojars.

I’m still exploring various ideas I have for mining Twitter for artist & band related info. This time I’m looking at using LingPipe to do the heavy lifting. LingPipe is very comprehensive, very deep and comes with a steep learning curve. Where do I begin?

My process for learning a new package of code is to find the tutorial or demos, pick one and immediately start rewriting it. LingPipe has some great documentation with annotated code so the picking was the easy part. Here’s the Interesting Phrases tutorial converted to Clojure and using a sampling of Twitter search results:

(Keep in mind this isn’t truly executable code (mind the laziness.) It’s a dump of my REPL session mostly.)

(import '[com.aliasi.tokenizer IndoEuropeanTokenizerFactory TokenizerFactory]
        '[com.aliasi.lm TokenizedLM]
        '[com.aliasi.util Files ScoredObject AbstractExternalizable])
(import '[java.io File])
(require '[clojure.contrib.str-utils2 :as s]
         '[clojure.contrib.duck-streams :as ds]
         '[org.danlarkin [json :as json]])
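(def n-gram 3)                      ; n-gram order of the language model
(def min-count 5)                   ; minimum count for a collocation to be reported
(def max-ngram-reporting-length 2)  ; not used in this session
(def ngram-reporting-length 2)      ; report two-token phrases
(def max-count 100)                 ; maximum number of collocations returned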
(def tweet-text (ds/read-lines "/tmp/tweets.txt"))
(def tweet-json (map json/decode-from-str tweet-text))
(def tweet-texts (remove empty? (map :text tweet-json)))
(def tokenizer-factory IndoEuropeanTokenizerFactory/INSTANCE)
(defn build-model
  [tf ngram t]
  (let [m (TokenizedLM. tf ngram)
        tweets t]
    (doseq [text tweets]
      (. m handle text))
    (. (. m sequenceCounter) prune 3)
    m))
(defn report-filter
  [score toks]
  (seq [score (s/join " " toks)]))
(defn report
  [ngrams]
  (map #(report-filter (. % score) (. % getObject)) ngrams))
;; Training background model
(def background-model
     (build-model tokenizer-factory n-gram tweet-texts))
;; Assembling collocations in Training
(def coll
     (. background-model collocationSet ngram-reporting-length min-count max-count))

The LingPipe code had a bunch more Java for displaying the results but since I’m in a REPL I’m ok using Clojure’s built-in pretty print:

((4264.0 "please dear")
 (4264.0 "Broken Social")
 (4092.4745660377357 "Social Scene")
 (4012.2317491968543 "Betty White")
 (3898.3688775373626 "Lupe Fiasco")
 (3788.437748482946 "dear God")
 (3376.0359201766046 "White when")
 (3365.5167372444303 "God let")

Looking at the output, the tweets I collected obviously included Broken Social Scene and Lupe Fiasco, both musical acts, which makes sense. But Betty White? There’s no escaping her. Still, according to LingPipe, this is what’s interesting in this mess of tweets.
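
To give a rough feel for what a “collocation” is, here is a minimal Python sketch that just counts adjacent token pairs and applies a minimum-count cutoff. This is not LingPipe’s scoring (the scores above come from its own statistics, not raw counts); the function name `bigram_counts`, the whitespace tokenizer, and the three sample lines are all assumptions made up for illustration.

```python
import json
from collections import Counter

def bigram_counts(texts, min_count=5):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        tokens = text.split()                   # naive whitespace tokenizer (assumption)
        counts.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    return [(" ".join(pair), n)
            for pair, n in counts.most_common()
            if n >= min_count]

# Hypothetical input: one JSON object per line with a "text" key.
lines = [
    '{"text": "Broken Social Scene tonight"}',
    '{"text": "love Broken Social Scene"}',
    '{"text": ""}',
]
# Decode each line, pull out the text, and drop the empty ones.
texts = [t for t in (json.loads(line).get("text", "") for line in lines) if t]
top = bigram_counts(texts, min_count=2)
```

Raw counts like these favor common word pairs, which is why a real collocation finder scores pairs statistically instead.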

The experiment was a success! With a minimal time investment and some throwaway code I’ve decided I want to spend some more time with LingPipe and dig deeper. Quick & easy thanks to Cljr.

Side by side: Python, Common Lisp, Clojure
http://www.machinelake.com/2009/11/25/side-by-side-python-common-lisp-clojure/
Wed, 25 Nov 2009 23:12:16 +0000, gavin

UPDATE: Find the rest of the code at http://bitbucket.org/gavinmcgovern/clj-bayes/.

My holiday project (of the sort that doesn’t involve cooking at least) is porting the Bayesian spam code from Peter Seibel’s great book “Practical Common Lisp” to Clojure. This will be a big part of the next-gen Big In Twitter that’s slowly coming together. Although I’ve been using Bayes, I haven’t really understood what was going on under the hood. I’m finding Peter’s chapter a fantastic walkthrough & approach to understanding it. He presents the basics and then goes on and adds optimizations. Perfect.

It was smooth sailing up until the other day. One little function tripped me up. I’ll show you.

Peter based some of his spam filter Common Lisp on a Python implementation from an article by Gary Robinson. Specifically Peter created a chi square function from this Python:

import math

def chi2P(chi, df):
    assert df & 1 == 0
    m = chi / 2.0
    sum = term = math.exp(-m)
    for i in range(1, df//2):
        term *= m / i
        sum += term
    return min(sum, 1.0)

There’s a lot to like there (I removed the comments): concise, minimal noise, short. Even if you didn’t know the math (like me!) you could probably follow along. Would probably be even shorter if you used Python’s list comprehensions.

Here’s what Peter came up with for the Common Lisp version:

(defun inverse-chi-square (value degrees-of-freedom)
  (assert (evenp degrees-of-freedom))
  (min
   (loop with m = (/ value 2)
      for i below (/ degrees-of-freedom 2)
      for prob = (exp (- m)) then (* prob (/ m i))
      summing prob)
   1.0))

He uses the oddball loop macro. It’s a DSL for iteration. It’s charming, it’s weird, it doesn’t seem very Lispy. I like how it has synonyms, “summing” for “sum”, “collecting” for “collect”, etc. Verb tense agreement is important!

While there’s a loop in Clojure it isn’t at all related to Common Lisp’s loop. This is where things got a little muddy for me. Spent a lot of time trying various approaches and while I was able to achieve parts of the original function I wasn’t able to get the whole thing. The combination of “term *= m/i ” and the “sum += term” was killing me; so much happening at once.

Taking a breather I started poking around clojure-contrib. There is so much buried in there. A real gold mine. I eventually stumbled upon seq-utils and the “reductions” function. And that was exactly 100% what I needed. After Seq-utils and a little of Clojure’s list comprehensions and 10 minutes of coding I had this:

(defn inverse-chi-square
  [chi df]
  (assert (even? df))
  (let [m (/ chi 2.0)]
    (min
      (reduce +
        (reductions * (Math/exp (- m)) (for [i (range 1 (/ df 2))] (/ m i))))
      1.0)))
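
For anyone coming at this from the Python side, the same “running product, then sum” shape can be sketched with `itertools.accumulate`, which plays roughly the role that `reductions` plays above. This is only an illustration, assuming Python 3.8 or later for the `initial` argument; the name `inverse_chi_square` is borrowed from the Lisp versions and is not part of Gary Robinson’s original.

```python
import math
from itertools import accumulate
from operator import mul

def inverse_chi_square(chi, df):
    """Same computation as chi2P, written as a running product that is then summed."""
    assert df % 2 == 0
    m = chi / 2.0
    # The multipliers m/1, m/2, ... applied to successive terms.
    factors = (m / i for i in range(1, df // 2))
    # accumulate with initial=exp(-m) yields exp(-m), exp(-m)*m/1, exp(-m)*m/1*m/2, ...
    terms = accumulate(factors, mul, initial=math.exp(-m))
    return min(sum(terms), 1.0)
```

The loop’s two updates (“term *= m / i” and “sum += term”) separate cleanly: the accumulate call handles the first, the sum handles the second.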

It’s been many many years since I did any sort of Common Lisp programming but one lasting memory was the vast quantity of high quality code freely available. Lots of motivated people writing excellent Common Lisp. I’m finding the same with the Clojure community. I love just being able to reach into the common libs, pull out a few gems and slap them together. Thanks! (Btw, anything wrong with my version?!)

Cascading’s Logparser example in Clojure
http://www.machinelake.com/2009/07/02/cascadings-logparser-example-in-clojure/
Thu, 02 Jul 2009 16:31:50 +0000, gavin

I’ve long been interested in working with Cascading but didn’t relish the thought of jumping back into Java. Thankfully with the arrival of Clojure I can now happily play in my less-typing world.

I wanted to get a basic Cascading demo running so I took the Logparser example from the distribution and ported it over to Clojure. Here’s a zip (casclojure.zip) with the Ant build.xml, the Clojure source, the directory layout, etc. I’ll leave it up to you to get the required jars & libs–I’m using Hadoop 0.18.3 and Cascading 1.0.11. The build.xml has the usual config stuff for setting up where you keep your jars. I gave up trying to be smart about managing jars so I now dump everything into a single dir. Sure makes things like Ant config easier….

Incidentally, this is a perfect example of standing on the shoulders of genius and/or the sufficiently motivated. Most of the build.xml I picked up from Kyle Burton’s “Creating Executable Jars For Your Clojure Application”. (He has some other interesting Clojure posts talking about Incanter & Quartz scheduler btw.)

Clojure is concise. This example isn’t particularly idiomatic Clojure however; you wouldn’t use CamelCase for instance. But it does the job. (Also, gross, definitely need a syntax highlighter now.)

(ns logparser.app
  (:gen-class)
  (:import
     (java.util Properties)
     (cascading.flow Flow FlowConnector)
     (cascading.operation.regex RegexParser)
     (cascading.pipe Each Pipe)
     (cascading.scheme TextLine)
     (cascading.tap Hfs Lfs Tap)
     (cascading.tuple Fields)
     (org.apache.log4j Logger)))
(def apacheFields (new Fields (into-array ["ip" "time" "method" "event" "status" "size"])))
(def apacheRegex
  "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$")
(def allGroups (int-array [1 2 3 4 5 6]))
(def parser (new RegexParser apacheFields apacheRegex allGroups))
(def importPipe (new Each "parser" (new Fields (into-array ["line"])) parser))
(def properties (new Properties))
(FlowConnector/setApplicationJarClass properties logparser.app)
(defn -main [& args]
 (let [localLogTap (new Lfs (new TextLine) (first args))
       remoteLogTap (new Hfs (new TextLine) (last args))
       parsedLogFlow (. (new FlowConnector properties) connect localLogTap remoteLogTap importPipe)]
  ;; Start the flow, then block until it completes. A let body is an implicit do.
  (. parsedLogFlow start)
  (. parsedLogFlow complete)))

Items to note:

1. Having to use into-array, int-array, etc. Just a few Java-isms.

2. Roughly, Hadoop needs a Jar with a Main. Clojure uses :gen-class to handle that sorta thing. :gen-class at first seemed really complicated. Turns out it is and it isn’t. Keep it simple and no problem! For my purposes I just needed to make sure I had a -main definition.

3. You’ll see logparser.app everywhere. It’s the tie that binds this all together. Most of this exercise was really about getting a common namespace set up for everything: the Clojure, the build.xml, the Jar contents, the runtime environment.
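
To make the parsing step itself a bit more concrete, here is a small Python sketch that applies the same regular expression to a single log line and labels the six capture groups with the same field names. It only mimics what the regex and field list do together; it is not how Cascading runs anything, and the sample log line and the `parse_line` helper are made up for illustration.

```python
import re

# Field names in the same order as the six capture groups.
APACHE_FIELDS = ["ip", "time", "method", "event", "status", "size"]

# Same pattern as apacheRegex above, written as a Python raw string.
APACHE_REGEX = re.compile(
    r'^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +"([^ ]*) ([^ ]*) [^ ]*" ([^ ]*) ([^ ]*).*$'
)

def parse_line(line):
    """Return a dict of the six fields, or None if the line does not match."""
    match = APACHE_REGEX.match(line)
    if match is None:
        return None
    return dict(zip(APACHE_FIELDS, match.groups()))

# Hypothetical log line in Apache common log format.
sample = '127.0.0.1 - - [02/Jul/2009:16:31:50 +0000] "GET /index.html HTTP/1.1" 200 1234'
record = parse_line(sample)
```

Every captured value comes back as a string, so anything numeric (status, size) would still need converting downstream.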

From the build.xml the key tasks are the compile and the jar-with-manifest. They demonstrate what needs to happen to make Clojure compilation possible and to make a Hadoop-happy Jar.

If you’re following along with Cascading’s Gentle Introduction you can use your newly generated Jar in place of the one mentioned:

hadoop jar ./build/logparser-0.1.jar data/apache.200.txt output

I’ve only run this in the local Hadoop mode, no distribution, no cluster. Running it on a cluster will be for another post perhaps.

I hope this helps. Let me know otherwise. I’ll try to help but I’m far from an expert on any of this.

Another reason I like Clojure
http://www.machinelake.com/2009/06/25/another-reason-i-like-clojure/
Thu, 25 Jun 2009 20:44:36 +0000, gavin

Exploring new packages of code, say the latest jar from Echo Nest, is incredibly easy and just plain fun. For instance:

(import '(com.echonest.api.v3.artist Artist ArtistAPI DocumentList))
(def artist-api (new ArtistAPI "YOUR DEV KEY HERE"))
(def hot-artists (. artist-api getTopHotttArtists 10))
(println (.. (first hot-artists) getItem getName))

Will spit out “Papercuts” (as of the time of this writing), the current #1 artist on Echo Nest’s Hottt list.

I started looking for a nice plugin for WordPress that would’ve done some pretty syntax highlighting on the Clojure code but it was taking too long. I actually spent less time writing up the Clojure.
