I had a similar task: exporting data from a Hadoop cluster of fewer than 6 nodes to an Oracle data warehouse. I tested the following:
- Sqoop
- Sqoop with OraOop
- Oracle Loader for Hadoop from the "Oracle Big Data Connectors" suite
- A Hadoop streaming job that uses SQL*Loader (sqlldr) as the mapper; in its control file you can read from stdin using:
load data infile "-"
Considering speed alone, the Hadoop streaming job with SQL*Loader as the mapper was the fastest way to transfer the data, but you have to install SQL*Loader on every machine in the cluster. I tried it mostly out of personal curiosity and would not recommend exporting data this way: the logging capabilities are limited, and it is likely to have a bigger impact on your data warehouse's performance.
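To make the streaming approach concrete, here is a minimal Python sketch of a mapper wrapper, not the exact setup I used. The table name, column list, credentials string, and file paths are all hypothetical placeholders; the only part taken from above is the `INFILE "-"` line that makes SQL*Loader read from stdin. It assumes `sqlldr` is on the PATH of every node and that the input is plain delimited text.

```python
import subprocess
import sys


def control_file_text(table, columns, delimiter=","):
    """Build SQL*Loader control file text that reads records from stdin."""
    cols = ", ".join(columns)
    return (
        "LOAD DATA\n"
        'INFILE "-"\n'  # "-" tells SQL*Loader to read from stdin
        "APPEND\n"
        f"INTO TABLE {table}\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        f"({cols})\n"
    )


def sqlldr_command(userid, control_path, log_path):
    """Build the sqlldr argument list (userid, control and log are sqlldr parameters)."""
    return [
        "sqlldr",
        f"userid={userid}",        # hypothetical, e.g. user/password@service
        f"control={control_path}",
        f"log={log_path}",
    ]


def run_mapper(userid, control_path, log_path):
    """Pipe this mapper's stdin (the input split) straight into sqlldr."""
    return subprocess.call(
        sqlldr_command(userid, control_path, log_path),
        stdin=sys.stdin,
        stdout=sys.stderr,  # keep sqlldr's messages out of the map output
    )


# Only act as a mapper when all three arguments are supplied.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(run_mapper(*sys.argv[1:]))
```

A script like this would be shipped with the streaming job and run map-only (no reducers). Note that passing credentials on the command line exposes them in the process list, which is one more reason I would not use this in production.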
The winner was Sqoop. It is pretty reliable, it is the import/export tool of the Hadoop ecosystem, and it was the second-fastest solution in my tests (1.5x slower than first place).
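For reference, a Sqoop export is a single command. The sketch below only assembles that command in Python so it can be logged or handed to `subprocess`; the JDBC URL, user, password file, table, and HDFS directory are hypothetical placeholders, and the mapper count is something you would tune for your own cluster and database.

```python
import shlex


def sqoop_export_command(jdbc_url, username, password_file, table,
                         export_dir, num_mappers=4, field_sep=","):
    """Build the argument list for a basic `sqoop export` run."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # file holding the password
        "--table", table,                   # target Oracle table
        "--export-dir", export_dir,         # HDFS directory with the data
        "--input-fields-terminated-by", field_sep,
        "--num-mappers", str(num_mappers),  # parallel connections to Oracle
    ]


# Hypothetical values, for illustration only.
cmd = sqoop_export_command(
    jdbc_url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
    username="etl_user",
    password_file="/user/etl/.oracle_password",
    table="SALES",
    export_dir="/data/out/sales",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The number of mappers is the main knob here: more mappers means more parallel inserts, which speeds up the export but also puts more load on the data warehouse.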
Sqoop with OraOop (last updated in 2012) was slower than the latest version of Sqoop and requires extra configuration on the cluster.
Finally, the worst time came from Oracle's Big Data Connectors, although on a big cluster (>100 machines) it should not be as bad as the time I measured. The export was done in two steps. The first step reprocesses the output and converts it to an Oracle format that plays nicely with the data warehouse. The second step transfers the result to the data warehouse. This approach is better if you have a lot of processing power, and it would not impact the data warehouse's performance as much as the other solutions.