How to create dummy variables based on a categorical variable of lists in R

Question

I have a data frame with a categorical variable that holds lists of strings of varying lengths. Consider the example below:

data <- data.frame(x = 1:5)
data$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
data

  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E

The required output has one dummy variable for each unique string that appears anywhere in data$y, i.e.:

data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))

  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1

The approach I have chosen is very slow on big data frames. Here it is:

unique_Strings <- unique(unlist(data$y))
n <- ncol(data)
for (i in seq_along(unique_Strings)) {
  # 1 if the i-th unique string occurs in the row's list element, otherwise 0
  data[, n + i] <- sapply(data$y, function(x) ifelse(unique_Strings[i] %in% x, 1, 0))
  colnames(data)[n + i] <- unique_Strings[i]
}

Any suggestions on how I can improve my code?

CodingByHeart77 · Answer 1 · Apr 13, 2018

You can use mtabulate from the qdapTools package in the following way:

library(qdapTools)
cbind(data[1], mtabulate(data$y))
#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
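
If installing qdapTools is not an option, the same result can probably be reached in base R. The following is only a sketch under the assumptions of the example above (a list column y whose elements are non-empty character vectors); the names row_id, counts, dummies and result are illustrative and not part of the original question. It cross-tabulates a repeated row index against the flattened strings instead of looping over each unique string.

```r
data <- data.frame(x = 1:5)
data$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# Row index repeated once per string in that row's list element
row_id <- rep(seq_along(data$y), lengths(data$y))

# Cross-tabulate row index against string; the factor levels keep every row,
# even one whose list element happens to be empty
counts <- table(factor(row_id, levels = seq_along(data$y)), unlist(data$y))

# Turn the counts into 0/1 indicators and bind them to the original column
dummies <- as.data.frame.matrix(counts)
dummies[] <- lapply(dummies, function(col) as.integer(col > 0))
result <- cbind(data["x"], dummies)
result
```

Because table() does the counting in a single pass, this avoids one sapply call per unique string, which is where the original loop spends most of its time. The columns come out in sorted order of the strings rather than in order of first appearance.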

answered Apr 13, 2018 by CodingByHeart77