Concatenate columns in apache spark dataframe

+5 votes

I need to concatenate two columns in a dataframe. Is there any function in Spark SQL to do this?


Apr 26, 2018 in Apache Spark by Shubham
• 13,480 points
62,597 views

11 answers to this question.

+1 vote

In Scala, you can use the concat function from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))

In Python:

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

answered Apr 26, 2018 by kurt_cobain
• 9,390 points
0 votes

You can also use CONCAT with Spark SQL.

In Scala:

import sqlContext.implicits._

val df = sc.parallelize(Seq(("scala", 1), ("implementation", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")

In Python:

df = sqlContext.createDataFrame([("python", 1), ("implementation", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")

answered Jun 13, 2018 by shams
• 3,660 points
0 votes
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class MyDf(col1: String, col2: String)

//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
    Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))

//define a UDF that concatenates two string values
val getConcatenated = udf( (first: String, second: String) => first + " " + second )

//use withColumn to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
answered Nov 13, 2018 by Vaishnavi
0 votes

This approach is helpful if you don't know the number or names of the columns:

val dfResults = dfSource.select(concat_ws(",", dfSource.columns.map(c => col(c)): _*))
answered Nov 13, 2018 by Sagar
If my column names are stored in a list, say col_list, and I want to concatenate them with a space between each column value, how can I do this in a PySpark DataFrame? Any idea?
I think you can loop over the list, fetch the columns one by one, and add a space between them.
0 votes

Use the following code:

import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()

which gives:

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

Create a new column by concatenating:

df = df.withColumn('joined_column', 
                    sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+
answered Nov 13, 2018 by Nabarupa
0 votes

In Spark 2.3.0 and later, you can use the || string-concatenation operator in Spark SQL:

spark.sql( """ select '1' || column_a from table_a """)
answered Nov 13, 2018 by Jino
• 5,810 points
0 votes

Try this:

spark.sql( """ select '1' || column_a from table_a """)
answered Nov 13, 2018 by Kalgi
• 52,310 points
0 votes
You can do it in PySpark using sqlContext.
answered Nov 13, 2018 by Maverick
• 10,840 points
Can you explain how?

Something like this:

from pyspark.sql.functions import concat

#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])

# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Yes, I agree with @Ali, have a look at it @Kalgi.
0 votes

You can use concat inside selectExpr. Note that concat returns null if any input is null, so nvl substitutes an empty string first:

val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
answered Nov 27, 2018 by Kalgi
• 52,310 points
0 votes

Using concat and withColumn:

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
answered Nov 27, 2018 by Kailash
+1 vote
It's late, but this is how you can achieve it.

If you want to add a delimiter:

df.withColumn("crimes", concat($"E_CATEGORY", lit("|"), $"E_C_TYPE", lit("|"), $"E_SUB_TYPE"))

Otherwise:

df.withColumn("crimes", concat($"E_CATEGORY", $"E_C_TYPE", $"E_SUB_TYPE"))
answered Mar 21, 2019 by anonymous
