Concatenate columns in apache spark dataframe

+5 votes

I need to concatenate two columns in a DataFrame. Is there a function in Spark SQL to do this?

Apr 26, 2018 in Apache Spark by Shubham
• 13,290 points
27,999 views

11 answers to this question.

+1 vote

You can use the following code in Scala:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))

For Python

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))
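For example, here is a minimal usage sketch in Scala (the column names k and v and the sample data are just placeholders, and an active SparkSession named spark is assumed for the implicits):

import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq(("foo", "1"), ("bar", "2")).toDF("k", "v")
df.select(concat($"k", lit(" "), $"v").alias("joined")).show()

which gives:

+------+
|joined|
+------+
| foo 1|
| bar 2|
+------+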

answered Apr 26, 2018 by kurt_cobain
• 9,240 points
0 votes

You can use CONCAT in Spark SQL:

For Scala:

import sqlContext.implicits._

val df = sc.parallelize(Seq(("scala", 1), ("implementation", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")

For Python:

df = sqlContext.createDataFrame([("python", 1), ("implementation", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")
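Note that sqlContext and registerTempTable belong to the older Spark 1.x API. On Spark 2.x the same idea would look roughly like this in Scala (assuming an active SparkSession named spark):

import spark.implicits._

val df = Seq(("scala", 1), ("implementation", 2)).toDF("k", "v")
df.createOrReplaceTempView("df")
spark.sql("SELECT CONCAT(k, ' ', v) AS joined FROM df").show()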

answered Jun 13, 2018 by shams
• 3,580 points
0 votes

You can also do it with a UDF:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class MyDf(col1: String, col2: String)

// Here is our DataFrame
val df = sqlContext.createDataFrame(sc.parallelize(
    Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))

// Define a UDF that concatenates two string values with a space
val getConcatenated = udf((first: String, second: String) => first + " " + second)

// Use withColumn to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2"))
  .select("newColName", "col1", "col2")
  .show()
answered Nov 13, 2018 by Vaishnavi
0 votes

This code is helpful if you don't know the number or the names of the columns:

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
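For instance, a quick self-contained sketch (the column names c1, c2, c3 are made up for illustration, and an active SparkSession named spark is assumed):

import org.apache.spark.sql.functions.{col, concat_ws}
import spark.implicits._

val dfSource = Seq(("A", "B", "C"), ("D", "E", "F")).toDF("c1", "c2", "c3")
val dfResults = dfSource.select(
  concat_ws(",", dfSource.columns.map(c => col(c)): _*).alias("all_columns"))
dfResults.show()

which gives:

+-----------+
|all_columns|
+-----------+
|      A,B,C|
|      D,E,F|
+-----------+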
answered Nov 13, 2018 by Sagar
0 votes

Use the following code:

import pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as sf

sc = pyspark.SparkContext()
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([('row11', 'row12'), ('row21', 'row22')], ['colname1', 'colname2'])
df.show()

which gives:

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

Now create a new column by concatenating the two:

df = df.withColumn('joined_column', 
                    sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+
answered Nov 13, 2018 by Nabarupa
0 votes

From Spark 2.3.0 onward, you can use the || operator:

spark.sql( """ select '1' || column_a from table_a """)
answered Nov 13, 2018 by Jino
• 5,560 points
0 votes

Try this:

spark.sql( """ select '1' || column_a from table_a """)
answered Nov 13, 2018 by Kalgi
• 40,420 points
0 votes
You can do it in PySpark using sqlContext.
answered Nov 13, 2018 by Maverick
• 10,040 points
Can you explain how?

Something like this:

# Suppose we have a DataFrame:
from pyspark.sql.functions import concat
df = sqlContext.createDataFrame([('row1_1', 'row1_2')], ['colname1', 'colname2'])

# Now we can concatenate the columns and give the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Yes, I agree with @Ali. Have a look at it, @Kalgi.
0 votes

You can use the concat function with select, something like this:

val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
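The same null-safe concatenation can also be written with the DataFrame functions coalesce, concat and lit. A rough equivalent, keeping the COL1/COL2 names from the snippet above:

import org.apache.spark.sql.functions.{coalesce, col, concat, lit}

// coalesce(col, lit("")) plays the same role as nvl(COL, '') in the SQL expression above
val newDf = df.select(
  concat(coalesce(col("COL1"), lit("")), coalesce(col("COL2"), lit("")))
    .alias("NEW_COLUMN"))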
answered Nov 27, 2018 by Kalgi
• 40,420 points
0 votes

Using concat and withColumn:

import org.apache.spark.sql.functions.{col, concat, lit, when}

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
answered Nov 27, 2018 by Kailash
+1 vote
It's late, but this is how you can achieve it.

If you want to add a delimiter:

import org.apache.spark.sql.functions.{concat, lit}

df.withColumn("crimes", concat($"E_CATEGORY", lit("|"), $"E_C_TYPE", lit("|"), $"E_SUB_TYPE"))

Otherwise:

df.withColumn("crimes", concat($"E_CATEGORY", $"E_C_TYPE", $"E_SUB_TYPE"))
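If the same delimiter goes between every column, concat_ws is a slightly shorter alternative; unlike concat, it also skips null inputs rather than turning the whole result into null. A sketch with the same column names:

import org.apache.spark.sql.functions.concat_ws

df.withColumn("crimes", concat_ws("|", $"E_CATEGORY", $"E_C_TYPE", $"E_SUB_TYPE"))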
answered Mar 21 by anonymous
