How to create dummy variables based on a categorical variable of lists in R

Question

I have a data frame with a categorical variable that holds lists of strings of various lengths. Consider the example below:

data <- data.frame(x = 1:5)
data$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
data

  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E

The required form has one dummy variable for each unique string that appears anywhere in data$y, i.e.:

data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))

  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1

My current approach is very slow on big data frames. Here it is:

unique_Strings <- unique(unlist(data$y))
n <- ncol(data)
# Add one 0/1 column per unique string, scanning every list each time
for (i in seq_along(unique_Strings)) {
  data[, n + i] <- sapply(data$y, function(x) ifelse(unique_Strings[i] %in% x, 1, 0))
  colnames(data)[n + i] <- unique_Strings[i]
}

Any suggestions on how I can improve my code?

CodingByHeart77 · Answer 1 · Apr 13, 2018

You can use mtabulate from the qdapTools package in the following way:

library(qdapTools)
cbind(data[1], mtabulate(data$y))
#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
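
If installing qdapTools is not an option, a similar result can probably be obtained in base R by cross-tabulating row positions against the unlisted strings. The following is only a sketch under a few assumptions: the names row_id, counts, dummies and result are illustrative and not part of the original question, at least one list in data$y is non-empty, and a string repeated within one list should still give 1 rather than a count.

```r
data <- data.frame(x = 1:5)
data$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# Row index repeated once per string in that row's list.
# Using a factor with all row levels keeps rows whose list is empty.
row_id <- factor(rep(seq_along(data$y), lengths(data$y)),
                 levels = seq_along(data$y))

# Cross-tabulate row index against string, then turn counts into 0/1 flags
counts <- table(row_id, unlist(data$y))
dummies <- as.data.frame.matrix((counts > 0) + 0L)

# Attach the dummy columns to the original identifier column
result <- cbind(data["x"], dummies)
result
```

Because the work is done by a single call to table rather than one sapply pass per unique string, this may scale better than the loop in the question, though that is worth benchmarking on your own data.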


