What is the easiest way to do text analytics with Hadoop?

Question

I am experimenting with Hadoop and the Hortonworks and Cloudera distributions in order to do some simple text analytics. All the examples I have found on the web so far, e.g. wordcount, deal with only one column. But I have many text files to which wordcount must be applied, and the results must be saved in a spreadsheet, with each file's results in a separate column. So I was wondering what the easiest way is to do text analytics with Hadoop in conjunction with spreadsheets. The functions I need are:

transform to lower case
filter stopwords
transpose results
write to Excel

Can this be accomplished easily with Pig, RHadoop, or something else?

Frankie · Answer 1 · Nov 22, 2018

Apache Pig provides the CSVExcelStorage class (in its Piggybank library) for loading and storing data in CSV format; it follows the CSV conventions of Excel 2007. Apart from that, I have also experimented with storing the results from Pig in MongoDB and then reading them into R using the rmongodb library.
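To make the four steps from the question concrete (lower case, stopword filtering, one column per file, Excel-readable output), here is a minimal local sketch in plain Python. It is not Pig or Hadoop code, only an illustration of the logic a Pig script or streaming job would have to reproduce. The stopword list, the function names and the sample file names are all assumptions made up for the example.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real job would load a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def count_words(text):
    """Lowercase the text, split it into words, drop stopwords, count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def counts_table(docs):
    """Return rows with one word per row and one count column per document.

    `docs` maps a document name (e.g. a file name) to its text.
    """
    counts = {name: count_words(text) for name, text in docs.items()}
    names = sorted(counts)
    vocab = sorted(set().union(*counts.values()))
    header = ["word"] + names
    # A Counter returns 0 for words that a document does not contain.
    rows = [[w] + [counts[n][w] for n in names] for w in vocab]
    return [header] + rows


def to_excel_csv(table):
    """Serialize the table using Python's Excel-style CSV dialect."""
    buf = io.StringIO()
    csv.writer(buf, dialect="excel").writerows(table)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical sample input standing in for the text files.
    docs = {"a.txt": "The cat and the Dog", "b.txt": "A dog is a DOG"}
    print(to_excel_csv(counts_table(docs)))
```

The resulting CSV can be opened directly in Excel. On a cluster, the per-file counting would be the distributed part, and the final pivot into columns is usually small enough to run on a single machine.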