Concatenate columns in apache spark dataframe

+5 votes

I need to concatenate two columns in a dataframe. Is there any function in Spark SQL to do this?


Apr 26, 2018 in Apache Spark by Shubham
• 13,480 points
62,597 views

11 answers to this question.

+1 vote

In Scala, you can use the concat function from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))

In Python:

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

answered Apr 26, 2018 by kurt_cobain
• 9,390 points
0 votes

You can also use CONCAT with Spark SQL.

In Scala:

import sqlContext.implicits._

val df = sc.parallelize(Seq(("scala", 1), ("implementation", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")

In Python:

df = sqlContext.createDataFrame([("python", 1), ("implementation", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")

answered Jun 13, 2018 by shams
• 3,660 points
0 votes
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class MyDf(col1: String, col2: String)

//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
    Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))

//define a UDF that concatenates two string values
val getConcatenated = udf( (first: String, second: String) => first + " " + second )

//use withColumn to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
answered Nov 13, 2018 by Vaishnavi
0 votes

This approach is helpful if you don't know the number or names of the columns:

val dfResults = dfSource.select(concat_ws(",", dfSource.columns.map(c => col(c)): _*))
answered Nov 13, 2018 by Sagar
If my column names are stored in a list, say col_list, and I want to concatenate them with a space between each column value, how can I do this in a PySpark DataFrame? Any idea?
I think you can loop over the list, fetch the columns one by one, and add a space between them.
0 votes

Use the following code:

import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()

which gives:

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

Create a new column by concatenating:

df = df.withColumn('joined_column', 
                    sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+
answered Nov 13, 2018 by Nabarupa
0 votes

In Spark 2.3.0 and later, you can use the || string-concatenation operator in Spark SQL:

spark.sql( """ select '1' || column_a from table_a """)
answered Nov 13, 2018 by Jino
• 5,810 points
0 votes

Try this:

spark.sql( """ select '1' || column_a from table_a """)
answered Nov 13, 2018 by Kalgi
• 52,310 points
0 votes
You can do it in PySpark using sqlContext.
answered Nov 13, 2018 by Maverick
• 10,840 points
Can you explain how?

Something like this:

from pyspark.sql.functions import concat

#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])

# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Yes, I agree with @Ali, have a look at it @Kalgi.
0 votes

You can use concat inside selectExpr. Note that concat returns null if any input is null, so nvl substitutes an empty string first:

val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
answered Nov 27, 2018 by Kalgi
• 52,310 points
0 votes

Using concat and withColumn:

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
answered Nov 27, 2018 by Kailash
+1 vote
It's late, but this is how you can achieve it.

If you want to add a delimiter:

df.withColumn("crimes", concat($"E_CATEGORY", lit("|"), $"E_C_TYPE", lit("|"), $"E_SUB_TYPE"))

Otherwise:

df.withColumn("crimes", concat($"E_CATEGORY", $"E_C_TYPE", $"E_SUB_TYPE"))
answered Mar 21, 2019 by anonymous
