If you have to pick one, I would choose Clojure, for the following reasons:
- It's a Lisp - everyone should learn a Lisp. See http://www.paulgraham.com/avg.html
- It has a unique approach to concurrency - see http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
- It's a JVM language, which makes it immediately useful from a practical perspective: the library and tool ecosystem on the JVM is extremely good, better than on any other platform IMHO. If you want to do serious technical work in the enterprise or startup space, a good knowledge of the JVM is very helpful. FWIW, Scala also falls into this category of "interesting JVM languages".
Also, Clojure makes parallel map-reduce very easy. Here's one to start with:
(reduce + (pmap inc (range 1000)))
=> 500500
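Using pmap rather than map is enough to give you a parallel mapping operation. Note that pmap only pays off when the function applied to each element is expensive; for something as cheap as inc, the coordination overhead dominates. There are also parallel reducers if you use Clojure 1.5 or later; see the reducers framework (clojure.core.reducers) for more details.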
Apart from that, you can also use Scalding, a Scala library built on top of Cascading that hides low-level Hadoop details. It was developed at Twitter and seems mature enough today that you can start using it without too much trouble.
Here is an example of how you would do a word count in Scalding:
package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
I think it's a good candidate: because it uses Scala, it's not too far from regular Map/Reduce Java programs, and even if you don't know Scala, it's not too hard to pick up.
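To get a feel for what the Scalding job computes before touching Hadoop, the same flatMap / groupBy / size shape can be tried on plain Scala collections. This is only a rough local sketch, not Scalding code: the object name LocalWordCount and the wordCount helper are made up for illustration, and the tokenizer additionally drops empty tokens, which the job above does not do.

```scala
object LocalWordCount {
  // Lowercase the text, strip punctuation, split on whitespace.
  // Unlike the Scalding example, empty tokens are filtered out here.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .replaceAll("[^a-z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  // flatMap plays the role of the map phase; groupBy followed by size
  // plays the role of the reduce phase.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("Hello, world!", "hello Scala"))
    // Print tab-separated pairs, sorted by word, similar to a Tsv output.
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

Running main on the two sample lines should print hello with a count of 2, then scala and world with a count of 1 each. The Scalding version expresses the same pipeline, but over fields ('line, 'word) of a distributed data set rather than an in-memory Seq.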