Cobalt Survey Analysis


1. Consumer Segmentation

Choosing an appropriate distance metric

In this step, we clustered consumers into groups using five variables: purpose of visit, brand of interest, gender, age, and income. These variables are categorical or ordinal, so clustering algorithms built on numerical distance metrics such as L-norm distance, cosine similarity, or Pearson correlation are not applicable in this data setting. We considered two ways to address this: (1) converting the categories into numeric coordinates with Multiple Correspondence Analysis (MCA) and clustering in that space, and (2) measuring distances directly on the original variables with Gower's dissimilarity coefficient (Gower, 1971).

library(FactoMineR)
# Method 1: MCA converts the categorical variables into numeric coordinates
data.mca = MCA(data, ncp = 20, graph = FALSE)
data.mca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## dim 1      0.3177                  6.354                             6.354
## dim 2      0.2347                  4.693                            11.047
## dim 3      0.2234                  4.467                            15.515
## dim 4      0.2155                  4.310                            19.824
## dim 5      0.2105                  4.210                            24.034
## dim 6      0.2081                  4.163                            28.197
## dim 7      0.2058                  4.115                            32.312
## dim 8      0.2056                  4.113                            36.425
## dim 9      0.2038                  4.075                            40.500
## dim 10     0.2026                  4.052                            44.551
## dim 11     0.2010                  4.020                            48.572
## dim 12     0.2004                  4.008                            52.579
## dim 13     0.1998                  3.995                            56.575
## dim 14     0.1985                  3.970                            60.545
## dim 15     0.1971                  3.941                            64.486
## dim 16     0.1962                  3.924                            68.410
## dim 17     0.1952                  3.904                            72.314
## dim 18     0.1950                  3.899                            76.214
## dim 19     0.1925                  3.851                            80.064
## dim 20     0.1872                  3.744                            83.808
## dim 21     0.1838                  3.675                            87.484
## dim 22     0.1770                  3.539                            91.023
## dim 23     0.1698                  3.395                            94.418
## dim 24     0.1557                  3.113                            97.532
## dim 25     0.1234                  2.468                           100.000
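Had we pursued the MCA route further, the individual coordinates stored in data.mca$ind$coord could be fed to a standard Euclidean clustering method. The following is a minimal sketch only; the kmeans call and the number of centers are illustrative and not part of the original analysis.

# sketch only: k-means on the MCA coordinates (centers = 10 is illustrative)
clust.kmeans = kmeans(data.mca$ind$coord, centers = 10, nstart = 20)
table(clust.kmeans$cluster)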
The MCA eigenvalues, however, are spread almost uniformly across the 25 dimensions: the first dimension explains only 6.35% of the variance, and about 20 dimensions are needed to reach 84%, so no compact numeric representation emerges. We therefore measured distances on the original variables with Gower's dissimilarity; the examples below illustrate it on three observations.

options(width = 1200)
# preview the three observations used in the Gower distance examples below
data[c(1, 2, 6), ]
##                               Purpose   Brand Gender      Age            Income
## 2          Shopping for a new vehicle Unknown   Male 55 to 64 $45,000 - $54,999
## 4  Get Service info or Schedule Servi Unknown   Male 25 to 34     Over $100,000
## 10                      Just browsing Unknown   Male 25 to 34 $45,000 - $54,999
library(StatMatch)
# Gower distance between observation 1 and observation 6 (row names 2 and 10 above)
gower.dist(data[1, ], data[6, ])
##      [,1]
## [1,]  0.4
# Gower distance between observation 1 and observation 2 (row names 2 and 4 above)
gower.dist(data[1, ], data[2, ])
##      [,1]
## [1,]  0.6
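With all five variables compared as unordered categories (which the values above suggest: 0.4 and 0.6 are exactly 2 and 3 mismatches out of 5), the Gower distance reduces to the fraction of attributes on which two observations disagree. A quick check with an illustrative helper that is not part of the original script:

# illustrative only: fraction of attributes on which two rows disagree
mismatch.frac = function(x, y) mean(mapply(function(a, b) as.character(a) != as.character(b), x, y))
mismatch.frac(data[1, ], data[6, ])  # Purpose and Age differ -> 2/5 = 0.4
mismatch.frac(data[1, ], data[2, ])  # Purpose, Age and Income differ -> 3/5 = 0.6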

Computing the dissimilarity matrix and running hierarchical clustering

We used the daisy function from the cluster package to compute a dissimilarity matrix containing the pairwise Gower distances among 31,311 of the 50,203 data points with unique SessionIDs (observations with NA values were removed to handle missing data). With the pairwise dissimilarities in hand, we examined two clustering algorithms, PAM (Partitioning Around Medoids) and agglomerative hierarchical clustering, and chose the hierarchical method because its dendrogram provides a more informative visualization of the cluster structure than PAM.

library(cluster)
# distance matrix with the Gower dissimilarity function
demo.agr = daisy(data, metric = "gower")
# agglomerative clustering with Ward's method ("ward.D" is the current name of the
# former "ward" option); complete and average linkage were also tried, Ward was retained
clust.ward = hclust(demo.agr, method = "ward.D")
plot(clust.ward, labels = FALSE)

Figure: dendrogram of the agglomerative (Ward) clustering
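For reference, the PAM alternative mentioned above can be run directly on the same Gower dissimilarity matrix. This is a minimal sketch only (the value of k anticipates the choice of 11 clusters made in the next step):

# sketch only: PAM on the precomputed Gower dissimilarities
clust.pam = pam(demo.agr, k = 11, diss = TRUE)
table(clust.pam$clustering)  # cluster sizes under PAM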

Choosing the number of clusters and cutting the tree

We used the average silhouette width as the criterion for choosing the number of clusters. The silhouette width of an observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean dissimilarity to the other members of its own cluster and b(i) is its mean dissimilarity to the nearest other cluster; a value near 1 means the observation clearly belongs to its cluster, while a value near -1 means it would fit better elsewhere. We chose the number of clusters K that maximizes the average silhouette width over all data points. As the figure below shows, K = 11 appears to be the best choice, balancing a high average silhouette width with clusters of manageable size.

# choose the number of clusters by the average silhouette width over K = 2..20
avg_widths = rep(0, 20)
for (i in 2:20) {
    clust = cutree(clust.ward, k = i)
    s = silhouette(clust, demo.agr)
    avg_widths[i] = summary(s)$avg.width
}
plot(avg_widths, ylab = "Silhouette Average Width", xlab = "Number of clusters", 
    xlim = c(2, 20), col = "purple")
Figure: choosing the number of clusters (K = 11)
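With K = 11 selected, the final segmentation is obtained by cutting the tree at that level, just as in the loop above. A minimal sketch (clust.final is our illustrative name, not from the original script):

# cut the Ward tree into the 11 chosen consumer segments
clust.final = cutree(clust.ward, k = 11)
table(clust.final)  # consumers per segment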

2. Topic Analysis

Classifier performance on the two comment categories, before and after applying PLDA:

                                   Failed Task Completion (other)       Site change Suggestions
                                   Before PLDA       After PLDA         Before PLDA      After PLDA
Multinomial Naïve Bayes                  44.6%            60.9%               56.6%           66.8%
Multinomial Logistic Regression          50.4%            65.8%               58.0%           67.9%
Figure: Topic Analysis Process

Reference:

  1. Bellman, Richard. “Dynamic programming and Lagrange multipliers.” Proceedings of the National Academy of Sciences of the United States of America 42.10 (1956): 767.

  2. Gower, John C. “A general coefficient of similarity and some of its properties.” Biometrics (1971): 857-871.