Exploring LingPipe with Clojure & Cljr

Clojure’s REPL is already a great start when it comes to exploring new Java packages. In the past I’d throw together a disposable lein project around the unknown code, open a REPL and get exploring. Now I have a new tool in the toolbox that lets me skip that step. Cljr is focused on the workspace (for lack of a better name) rather than the lein-style project. In short, you get both a REPL and swank (Emacs’ Clojure integration) backed by global package management. Cljr can pull in the random jars you have locally, or it can pull them in from Clojars.

I’m still exploring various ideas I have for mining Twitter for artist & band related info. This time I’m looking at using LingPipe to do the heavy lifting. LingPipe is very comprehensive, very deep and comes with a steep learning curve. Where do I begin?

My process for learning a new package of code is to find the tutorial or demos, pick one and immediately start rewriting it. LingPipe has some great documentation with annotated code so the picking was the easy part. Here’s the Interesting Phrases tutorial converted to Clojure and using a sampling of Twitter search results:

(Keep in mind this isn’t truly executable code; mind the laziness. It’s mostly a dump of my REPL session.)

(import '[com.aliasi.tokenizer IndoEuropeanTokenizerFactory TokenizerFactory]
        '[com.aliasi.lm TokenizedLM]
        '[com.aliasi.util Files ScoredObject AbstractExternalizable])
(import '[java.io File])
(require '[clojure.contrib.str-utils2 :as s]
         '[clojure.contrib.duck-streams :as ds]
         '[org.danlarkin [json :as json]])
(def n-gram 3)
(def min-count 5)
(def max-ngram-reporting-length 2)
(def ngram-reporting-length 2)
(def max-count 100)
(def tweet-text (ds/read-lines "/tmp/tweets.txt"))
(def tweet-json (map json/decode-from-str tweet-text))
(def tweet-texts (remove empty? (map :text tweet-json)))
(def tokenizer-factory IndoEuropeanTokenizerFactory/INSTANCE)
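(defn build-model
  "Trains a TokenizedLM on the given texts, then prunes rare n-grams."
  [tf ngram texts]
  (let [m (TokenizedLM. tf ngram)]
    ;; doseq walks the (lazy) seq of tweets, so every text gets handled
    (doseq [text texts]
      (.handle m text))
    ;; prune n-grams whose count is below 3
    (.. m sequenceCounter (prune 3))
    m))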
(defn report-filter
  [score toks]
  (seq [score (s/join " " toks)]))
(defn report
  [ngrams]
  (map #(report-filter (. % score) (. % getObject)) ngrams))
;; Training background model
(def background-model
     (build-model tokenizer-factory n-gram tweet-texts))
;; Assembling collocations in Training
(def coll
     (. background-model collocationSet ngram-reporting-length min-count max-count))
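
About the laziness caveat: map and remove return lazy sequences, so the defs for tweet-json and tweet-texts don’t parse anything when they’re evaluated. The work happens when something walks the sequence, which here is the doseq inside build-model. A minimal sketch of that behavior in plain Clojure (parse-count and fake-parse are hypothetical stand-ins, not part of the session above):

```clojure
;; Hypothetical stand-ins: count how many times the "parser" actually runs.
(def parse-count (atom 0))

(defn fake-parse [line]
  (swap! parse-count inc)
  (.toUpperCase line))

;; map is lazy: defining the seq does not call fake-parse yet.
(def parsed (map fake-parse ["a" "b" "c"]))
(def before @parse-count)

;; doseq walks the seq, which forces every element (as build-model does).
(doseq [p parsed] p)
(def after @parse-count)
```

Wrapping a seq in doall forces it at definition time, which makes parse errors show up there instead of later inside build-model.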

The LingPipe code had a bunch more Java for displaying the results but since I’m in a REPL I’m ok using Clojure’s built-in pretty print:

((4264.0 "please dear")
 (4264.0 "Broken Social")
 (4092.4745660377357 "Social Scene")
 (4012.2317491968543 "Betty White")
 (3898.3688775373626 "Lupe Fiasco")
 (3788.437748482946 "dear God")
 (3376.0359201766046 "White when")
 (3365.5167372444303 "God let")
 ...)
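
The call that printed this isn’t in the dump above. Presumably it was something like pretty-printing (report coll). Here is a self-contained sketch of that last step, with plain maps standing in for LingPipe’s ScoredObject. The fake-collocations data reuses two rows from the output, report-row is a hypothetical name, and clojure.string and clojure.pprint are assumptions for a current Clojure, where the session used clojure.contrib:

```clojure
(require '[clojure.string :as str]
         '[clojure.pprint :as pp])

;; Stand-in for the ScoredObject results: a score plus the tokens.
(def fake-collocations
  [{:score 4012.2317491968543 :tokens ["Betty" "White"]}
   {:score 3898.3688775373626 :tokens ["Lupe" "Fiasco"]}])

(defn report-row
  "Same shape as report-filter above: a list of score and joined tokens."
  [{:keys [score tokens]}]
  (list score (str/join " " tokens)))

(def rows (map report-row fake-collocations))

;; Pretty-print the rows the way a REPL session would.
(pp/pprint rows)
```

With the real model loaded, the equivalent would be (pp/pprint (report coll)).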

Looking at the output, I can tell the tweets I collected obviously had Broken Social Scene and Lupe Fiasco in them. Both are music acts, so that makes sense. But Betty White? There’s no escaping her. Still, according to LingPipe this is what’s interesting in this mess of tweets.
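
As for what “interesting” means: LingPipe’s documentation describes the collocation score as a chi-squared independence statistic, i.e. how much more often two tokens show up together than chance would predict. A rough sketch of that idea for a single bigram, using the textbook Pearson formula for a 2x2 table. The chi-squared function and its counts are hypothetical illustrations, not LingPipe’s actual code or my tweet data:

```clojure
(defn chi-squared
  "Pearson chi-squared for a 2x2 table of bigram counts:
   both = w1 followed by w2, w1-only = w1 followed by something else,
   w2-only = something else followed by w2, neither = all other bigrams."
  [both w1-only w2-only neither]
  (let [n     (+ both w1-only w2-only neither)
        diff  (- (* both neither) (* w1-only w2-only))
        denom (* (+ both w1-only) (+ w2-only neither)
                 (+ both w2-only) (+ w1-only neither))]
    (/ (* n diff diff) (double denom))))

;; Hypothetical counts, not taken from the tweet data.
;; A pair that only ever occurs together scores the full total, n.
(def always-together (chi-squared 10 0 0 990))

;; A pair that co-occurs exactly as often as chance predicts scores 0.
(def by-chance (chi-squared 1 9 99 891))
```

Under that formula a score equal to the total bigram count is the ceiling, which may be why 4264.0 shows up twice at the top of the list: those would be pairs that never appear apart in this sample. That reading is a guess from the math, not something checked against LingPipe’s source.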

The experiment was a success! With a minimal time investment and some throwaway code I’ve decided I want to spend some more time with LingPipe and dig deeper. Quick & easy thanks to Cljr.
