Definitely check out Scalding. Speaking as a user and occasional contributor, I've found it to be a very useful tool. The Scalding API is designed to closely mirror the standard Scala collections API: just as you can call flatMap, map, or groupBy on normal collections, you can do the same on Scalding Pipes, which you can think of as a distributed List of tuples. There's also a typed version of the API which provides stronger type-safety guarantees. I haven't used Scoobi, but its API looks similar to Scalding's.
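To make the collections analogy concrete, here is a minimal word-count sketch written against plain Scala collections only. It deliberately does not import Scalding, so it runs anywhere without a Hadoop setup; the names `WordCountSketch` and `wordCounts` are made up for illustration. A Scalding job would chain the same kind of flatMap / map / groupBy calls, but on a Pipe read from a source rather than on an in-memory Seq.

```scala
object WordCountSketch {
  // Plain Scala collections only; no Scalding dependency is assumed here.
  // The chain of calls is the point: flatMap, map, groupBy.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // one line becomes many words
      .filter(_.nonEmpty)         // drop empty tokens from leading whitespace
      .map(_.toLowerCase)         // normalize case before grouping
      .groupBy(identity)          // word -> all occurrences of that word
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val lines = Seq("Hello world", "hello Scalding")
    // Sort by word so the printed order is stable.
    wordCounts(lines).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word\t$count")
    }
  }
}
```

The in-memory version above holds everything in one JVM; the appeal of Scalding is that the same style of pipeline is planned and executed as distributed jobs instead.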
Additionally, there are a few other benefits:
- Scalding is heavily used in production at Twitter and has been battle-tested on Twitter-scale datasets.
- It has several active contributors, both inside and outside Twitter, who are committed to making it great.
- It is interoperable with your existing Cascading jobs.
- In addition to the Typed API, it has a Fields API which may be more familiar to users of R and data-frame frameworks.
- It provides a robust matrix library.