Spark withColumn() is a DataFrame transformation function used to add a new column to a DataFrame, change the value of an existing column, convert the datatype of a column, or derive a new column from an existing one. In this post, I will walk you through commonly used DataFrame column operations with Scala examples.
First, let’s create a simple DataFrame to work with.
import spark.implicits._
import org.apache.spark.sql.functions.col
val data = Seq(("111",50000),("222",60000),("333",40000))
val df = data.toDF("EmpId","Salary")
df.show(false)
This yields the below output.
+-----+------+
|EmpId|Salary|
+-----+------+
|111 |50000 |
|222 |60000 |
|333 |40000 |
+-----+------+
Using withColumn() to Add a New Column
withColumn() is used to add a new column or update an existing column on a DataFrame. Here, I will explain how to add a new column derived from an existing column. The withColumn() function takes two arguments: the first is the name of the new column, and the second is the value of the column as a Column type.
//Derive a new column from an existing one
df.withColumn("CopiedColumn", col("Salary") * -1)
.show(false)
Here, we have added a new column CopiedColumn by multiplying the existing Salary column by -1. This yields the below output.
+-----+------+------------+
|EmpId|Salary|CopiedColumn|
+-----+------+------------+
|111 |50000 |-50000 |
|222 |60000 |-60000 |
|333 |40000 |-40000 |
+-----+------+------------+
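As mentioned at the start, withColumn() can also change the value of an existing column or convert its datatype; passing an existing column name as the first argument replaces that column. A minimal sketch, reusing the df created above (the multiplier 100 is just an illustrative value):

```scala
import org.apache.spark.sql.functions.col

// Update an existing column: replace Salary with Salary * 100
val updatedDf = df.withColumn("Salary", col("Salary") * 100)

// Convert the datatype of a column: cast Salary from Int to String
val castedDf = df.withColumn("Salary", col("Salary").cast("String"))
castedDf.printSchema() // Salary now shows as string in the schema
```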
You can also add columns based on conditions; please refer to Spark Case When and When Otherwise examples.
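For instance, a conditional column can be built with when() and otherwise(). A minimal sketch against the df above (the Grade column name and the 50000 threshold are hypothetical, for illustration only):

```scala
import org.apache.spark.sql.functions.{col, when}

// Derive a conditional "Grade" column from Salary
val gradedDf = df.withColumn("Grade",
  when(col("Salary") >= 50000, "High")
    .otherwise("Low"))
gradedDf.show(false)
```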
Using Select to Add Column
The above statement can also be written using select() as shown below, and it yields the same output. You can also add multiple columns using select().
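A sketch of the select() equivalent, assuming the same df as above; col("*") keeps all existing columns and as() names the derived one:

```scala
import org.apache.spark.sql.functions.col

// Equivalent of the withColumn() example using select()
df.select(col("*"), (col("Salary") * -1).as("CopiedColumn"))
  .show(false)
```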