I have a data frame with a categorical variable stored as a list column, where each element is a character vector of varying length. Consider the example below:
data <- data.frame(x = 1:5)
data$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
data
  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E
The desired output has one dummy (0/1) column for each unique string that appears anywhere in data$y, i.e.:
data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))
  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1
My current approach, shown below, is very slow on large data frames:
unique_Strings <- unique(unlist(data$y))
n <- ncol(data)
# Add one 0/1 column per unique string, scanning every row each time
for (i in seq_along(unique_Strings)) {
  data[, n + i] <- sapply(data$y, function(x) ifelse(unique_Strings[i] %in% x, 1, 0))
  colnames(data)[n + i] <- unique_Strings[i]
}
Any suggestions on how I can improve this code?
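One possible direction, offered as a sketch rather than a definitive answer: the loop above rescans data$y once per unique string, so flattening the list column a single time and cross-tabulating row indices against strings with base R's table() may avoid the repeated sapply() calls. The names row_id, values, counts, indicator and result are made up for illustration, and the sketch assumes the dummy columns may come out in sorted order rather than in order of first appearance.

```r
# Example data from the question
data <- data.frame(x = 1:5)
data$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# Flatten the list column once, remembering which row each string came from
row_id <- rep(seq_len(nrow(data)), lengths(data$y))
values <- unlist(data$y)

# Cross-tabulate rows against strings; the explicit factor levels keep
# any row whose list element is empty (it becomes a row of zeros)
counts <- table(factor(row_id, levels = seq_len(nrow(data))), values)

# Turn counts into 0/1 indicators, so a string repeated within one row
# still yields 1; columns follow the sorted order used by table()
indicator <- matrix(as.integer(counts > 0),
                    nrow = nrow(counts),
                    dimnames = list(NULL, colnames(counts)))

# Attach the dummy columns to the original identifier column
result <- cbind(data["x"], as.data.frame(indicator))
print(result)
```

Whether this is actually faster on a given data set would need to be checked with a benchmark; it trades the per-string loop for one pass over the flattened values.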