What is the best way to integrate SAS with Hadoop without losing the parallel processing capacity of Hadoop?

0 votes
I am trying to understand the integration between SAS and Hadoop. From what I understand, SAS procedures such as PROC SQL can only work against a SAS data set; I cannot issue PROC SQL against a text file on a Hadoop node. Is that correct?

If so, I need to use ETL jobs to first take the data out of HDFS and convert it to SAS tables. But if I do that, I will lose the parallel processing capabilities of Hadoop.

So what is the ideal way of integrating SAS and Hadoop and still use the parallel processing power of Hadoop?

I understand you can call a MapReduce job from inside SAS, but can a MapReduce job be written in SAS? I think not.
Oct 16, 2018 in Big Data Hadoop by Neha
• 6,280 points
294 views

1 answer to this question.

0 votes

One of the major pushes at SAS Global Forum 2015 was the new options for connecting to Hadoop and Teradata. FedSQL and DS2, new in SAS 9.4, exist in part specifically to let SAS work better with Hadoop. You can execute code directly on your Hadoop nodes, and you can also do much more efficient processing in SAS itself.
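To make the difference between running code on the Hadoop nodes and pulling the data out first more concrete, here is a minimal Python sketch. It is not SAS code and does not use any SAS or Hadoop API; the `PARTITIONS` data and all function names are hypothetical, and threads merely stand in for cluster nodes. It only illustrates the general idea that each node can summarize its own rows in parallel, so that only small summaries have to travel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows held on one node.
PARTITIONS = [
    [("east", 10.0), ("west", 5.0)],
    [("east", 2.5), ("north", 7.0)],
    [("west", 1.0), ("north", 3.0)],
]


def partial_totals(rows):
    """Sum amounts per region for one set of rows (work done where the data lives)."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def merge(partials):
    """Combine the small per-partition summaries into one final result."""
    result = {}
    for part in partials:
        for region, amount in part.items():
            result[region] = result.get(region, 0.0) + amount
    return result


def pushed_down(partitions):
    """Summarize every partition in parallel; only the summaries are merged."""
    with ThreadPoolExecutor() as pool:
        return merge(pool.map(partial_totals, partitions))


def extract_then_aggregate(partitions):
    """Copy every row to one place first, then aggregate in a single pass."""
    all_rows = [row for part in partitions for row in part]
    return partial_totals(all_rows)


if __name__ == "__main__":
    print(pushed_down(PARTITIONS))
    print(extract_then_aggregate(PARTITIONS))
```

Both functions return the same totals. The difference is where the work happens and how much data has to move, which is the concern raised in the question about extracting HDFS data into SAS tables before processing it.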

Assuming you have the most recent release of SAS (9.4 TS1M3), you can look at the SAS release notes (current as of 9/3/2015; in the future this will point to later versions). They include information like the following:

In the second maintenance release for SAS 9.4, the SAS In-Database Code Accelerator for Hadoop runs the DS2 data program as well as the thread program inside the database. Several new functions have been added. The HTTP package enables you to construct an HTTP client to access web services and a new logger enables logging of HTTP traffic. A connection string parameter is available when instantiating an SQLSTMT package.

SAS FedSQL is a SAS proprietary implementation of the ANSI SQL:1999 core standard. It provides support for new data types and other ANSI 1999 core compliance features and proprietary extensions. FedSQL provides data access technology that brings a scalable, threaded, high-performance way to access, manage, and share relational data in multiple data sources. FedSQL is a vendor-neutral SQL dialect that accesses data from various data sources without submitting queries in the SQL dialect that is specific to the data source. In addition, a single FedSQL query can target data in several data sources and return a single result table. The FEDSQL procedure enables you to submit FedSQL language statements from a Base SAS session. The first maintenance release for SAS 9.4 adds support for Memory Data Store (MDS), SAP HANA, and SASHDAT data sources.

In the second maintenance release for SAS 9.4, SAS FedSQL supports Hive, HDMD, and PostgreSQL data sources. Data types can be converted to another data type. You can add DBMS-specific clauses to the end of the CREATE INDEX statement, and you can write a SASHDAT file in compressed format.

In the third maintenance release of SAS 9.4, FedSQL has added support for HAWQ and Impala distributions of Hadoop, enhanced support for Impala, new data types, and more.
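As a loose analogy only for the idea of one query spanning several data sources and returning a single result table, the sketch below uses Python's standard `sqlite3` module with two separate in-memory databases. This is SQLite, not FedSQL, and it does not touch Hadoop; the `warehouse` alias, the table names, and the sample rows are all made up for illustration.

```python
import sqlite3


def federated_totals():
    """Join tables that live in two separate SQLite databases with one query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Attach a second, independent in-memory database under its own name.
        conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

        conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
        conn.execute(
            "CREATE TABLE warehouse.orders (customer_id INTEGER, amount REAL)"
        )
        conn.executemany(
            "INSERT INTO main.customers VALUES (?, ?)",
            [(1, "Ann"), (2, "Bob")],
        )
        conn.executemany(
            "INSERT INTO warehouse.orders VALUES (?, ?)",
            [(1, 20.0), (1, 5.0), (2, 7.5)],
        )

        # A single statement reads from both databases and returns one result set.
        return conn.execute(
            "SELECT c.name, SUM(o.amount) "
            "FROM main.customers AS c "
            "JOIN warehouse.orders AS o ON o.customer_id = c.id "
            "GROUP BY c.name ORDER BY c.name"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for name, total in federated_totals():
        print(name, total)
```

In the FedSQL case the sources would be things like Hive or SASHDAT rather than SQLite files, and how much of the query is executed inside each source depends on the SAS configuration, which this sketch does not model.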

Hadoop Support

The first maintenance release for SAS 9.4 enables you to use the SPD Engine to read, write, and update data in a Hadoop cluster through the HDFS. In addition, you can now use the HADOOP procedure to submit configuration properties to the Hadoop server.

In the second maintenance release for SAS 9.4, performance has been improved for the SPD Engine access to Hadoop. The SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS is available from the support.sas.com third-party site for Hadoop.

In the third maintenance release of SAS 9.4, access to data stored in HDFS is enhanced with a new distributed lock manager and therefore easier access to Hadoop clusters using Hadoop configuration files.

Beyond this, there is extensive documentation on the subject, and many papers have been written about it; the documentation for the SAS Connector for Hadoop is one example.

answered Oct 16, 2018 by Frankie
• 9,810 points
