PMKT – Revista Brasileira de Pesquisas de Marketing, Opinião e Mídia
ISSN: 1983-9456 (Print)
ISSN: 2317-0123 (Online)
Editor: Fauze Najib Mattar
Review system: Triple Blind Review
Languages: Portuguese and English
Publisher: ABEP – Associação Brasileira de Empresas de Pesquisa
How to Seriously Damage a Segmentation Study: Use Factor Analysis as Input for Cluster Analysis¹
Submission: Mar./28/2014 - Approval: Apr./22/2014
Luiz Sá Lucas
Master in Mathematical Programming from the Federal University of Rio de Janeiro (UFRJ/COPPE). Electrical Engineer (Systems) from the Pontifical Catholic University of Rio de Janeiro
(PUC-RJ). Technical Director at Ibope Intelligence. Board Member of the European Society of
Marketing Research - ESOMAR. He has published or presented several papers on mathematical
programming techniques in specialized journals and Congresses/Conferences in Latin America,
Europe and Asia.
E-mail: [email protected]; [email protected]
Professional Address: Rua da Assembléia, nº 98, 12º andar, 20011-000, Rio de Janeiro/RJ, Brasil.
Wagner Esteves
Master in Production Engineering from Universidade Federal Fluminense - UFF. Graduate degree
in Statistics from the National School of Statistical Sciences - ENCE / IBGE. Planning Coordinator
at Ibope Intelligence.
E-mail: [email protected]
Larissa Catalá
Postgraduate in Market Research by the School of Communication and Arts, University of São
Paulo - ECA-USP. Graduate degree in Statistics from the University of Campinas - UNICAMP.
Technologist in Statistics from IBGE.
E-mail: [email protected]; [email protected]
1 This was one of the papers presented at ABEP’s 6th Brazilian Market, Opinion and Media Research Congress (held on
March 24 and 25, 2014), turned into an article by its author(s), submitted to PMKT, and approved for publication.
ABSTRACT
Consumers are not equal. They have different needs, buying behaviors, propensities, brand loyalties,
etc. Hence segmentation becomes one of the key techniques in positioning a brand. However,
a bad and widespread habit has crystallized in market research: running a factor analysis and then
applying cluster analysis to the resulting factors. This article argues against that practice, in
particular against using Principal Component Analysis (PCA) as the first step. We present an
exercise: an extensive set of segmentations using both approaches (PCA and the original variables),
analyzing which of the two gave the better result across the 720 segmentations we
performed.
KEYWORDS:
Segmentation, factor analysis, principal components analysis, cluster analysis, brand positioning.
PMKT – Revista Brasileira de Pesquisas de Marketing, Opinião e Mídia (ISSN 1983-9456 Impressa e ISSN 2317-0123 On-line), São Paulo, Brasil,
V. 14, pp. 132-142, Abril, 2014 - www.revistapmkt.com.br
1. INTRODUCTION
The title of this article is somewhat exaggerated. The damage is not always disastrous and, as will
be seen, there are a few cases where Principal Component Analysis (PCA) is justified. However,
for the kind of problems handled in marketing, the superiority of Cluster Analysis (CA) on the
original variables becomes clear in the examples presented.
2. SEGMENTATION THROUGH TIME
The 1930s were extremely fruitful in creating different factor analysis techniques, including PCA.
Cluster analysis algorithms only began to appear in the mid-1960s, and even in the 1980s
the dominance of factor analysis (FA) techniques was such that cluster analysis itself was
performed using FA (the so-called Q factor analysis). An excellent description of this history can
be found in Myers (1996) and Stewart (1981). It is worth quoting a short excerpt from
Myers (1996):
Although the general concept of market segmentation was formally introduced for the first time by
Wendell Smith (1956), markets had already been segmented for many decades before that.
Perhaps the earliest forms of segmentation were based on the marketing mix (product, price,
promotion and distribution). Even earlier, merchants most likely brought to market products that
differed in terms of type and desired quality, acceptable price levels and/or distribution methods.
Myers (1996) also comments on the transition from Henry Ford's 1900 position, “consumers can
have any car color they want provided it is black” (emphasis on production, as demand was
sufficient to sustain this position), to Alfred Sloan's (GM) in 1920: “a car for every budget and
purpose” (emphasis on the market). Thus, the concept of segmentation is old and based on the fact
that consumers are not equal.
Indeed, the current trend in marketing modeling seeks to go further: choice models, such as
conjoint analysis, not only segment customers but also estimate individual preferences. There
are several texts on the subject, but perhaps the most complete is Wedel and
Kamakura (2000). It was this text, together with an ESOMAR workshop with Steve Cohen, that drew our
attention to the difficulties associated with using PCA and cluster analysis in
tandem.
3. SEGMENTATION AND MARKETING
Gray (2013) gives a brief and useful description of segmentation. According to his text,
segmentation is one of the most important methodologies in market research. In many ways it
facilitates better decisions and increases profitability, as it helps to:
• Understand what motivates the behavior of different consumers in a product or service category.
• Reveal patterns of consumer behavior and motivation and link them to consumer categories (demographics, for example).
• Indicate how the various brands position themselves relative to the needs of consumer segments.
• Identify unmet needs.
• Modify existing offerings to attract a higher volume of customers.
• Develop new products.
• Improve the relationship with consumers.
Gray (2013) also indicates several ways to segment the market, but here the focus will be on the most
usual one: applying cluster analysis methods to measurements of needs, brand preferences,
lifestyles, demographics, database information, etc.²
4. SEGMENTATION AND MARKETING STRATEGY
According to strategic needs, segmentation can take different forms:
• Needs-based segmentation.
• Lifestyle segmentation.
• Demographic segmentation.
• Behavioral segmentation.
Each has specific goals, but a detailed analysis is beyond the scope of this article. For greater depth
on the subject we suggest Kaden, Linda and Prince (2013). Other references include King and
Wang (2007), McDonald and Dunbar (2004), Dibb and Simkin (2010; 1996) and Wedel and
Kamakura (2000).
5. WEAKNESSES OF PCA VERSUS ORIGINAL VARIABLES
We begin with some comments on PCA. A popular account of the weaknesses of factor
analysis, and of PCA in particular, is presented in Stephen Jay Gould's seminal book “The
Mismeasure of Man”, published in 1981. Based on the arguments presented there, one must be
careful when interpreting the axes of any factor analysis.
Quoting another work, still in development but available on the Web (SHALIZI, 2014):
PCA is a very good tool when you need to reduce the dimensionality of the data and are not sure
exactly what to use. It has some interesting mathematical properties ... the dimensions found by PCA ...
can be real characteristics of the data or just reasonable and convenient fictions and abstractions. That they
are real is a hypothesis that these methods may suggest, but for which they can provide only very weak
evidence. This matters because in the end we do data mining to discover knowledge on which we can act.
It is one thing to treat our action only as a simulation that helps us adjust our models to practice, and
another to try to act on the world based on how its parts interact with each other. To do that right, we
need to know what these parts really are.
A factor is essentially a geometric construct and does not necessarily correspond to a real phenomenon.
By assuming such a correspondence, we would be doing what Gould (1981) called "to reify, that is, to
imagine that something is real only because we can build it in an abstract way."
5.1 PCA VERSUS ORIGINAL VARIABLES
The following comments draw heavily on Yeung and Ruzzo (2001). As these authors note,
different clustering algorithms provide different solutions.
That leads to a question: which one is the right one? Our view is that clustering is an
ad hoc technique that separates elements into groups and provides a description of the universe
that is useful for marketing purposes (especially for discovering niches and target markets of
interest), but that there is no single right answer (SÁ LUCAS, 2007).
2 For more information on the link between research and databases, see Sá Lucas (2007).
PCA is a dimensionality reduction technique that transforms the original variables of a data set
into new ones (principal components, PCs) in order to summarize the characteristics
of the data.
These components are uncorrelated (though not necessarily independent) and can be ordered so that the
k-th component has the k-th largest variance among the PCs. The
traditional approach is to use the first few PCs because they capture most of the variation in the
original data set.
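The two properties just described can be stated compactly. The notation below ($x_i$ for an observation, $\Sigma$ for the sample covariance matrix, $v_k$ and $\lambda_k$ for its eigenvectors and eigenvalues, $z_{ik}$ for the component scores) is introduced here only for illustration and is not taken from the works cited:

```latex
\Sigma = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}
       = V \Lambda V^{\top},
\qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p),
\quad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

z_{ik} = v_k^{\top} (x_i - \bar{x}),
\qquad
\operatorname{Var}(z_k) = \lambda_k,
\qquad
\operatorname{Cov}(z_j, z_k) = 0 \quad (j \neq k)
```

Nothing in this construction refers to group membership: the ordering of the components depends only on total variance, which is why a high-variance component need not be the one that separates clusters.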
In contrast, the last PCs are regarded as capturing residual noise in the data. A note is
in order here: in the theory of random signals it is very common to use the concept of white noise,
which here is simply a random variable with normal distribution, zero mean and non-zero
variance. White noise carries no information at all, and yet it has variance.
So taking variance as a measure of the amount of information is rash, to say the least. When the first PCs
(those with the largest variance) are retained, the expectation is that they capture the clustering structure
of the data. There are rules of thumb for the number of factors to extract, but these rules are informal and ad hoc.
On the other hand, according to Yeung and Ruzzo (2001), there are theoretical considerations
indicating that the first few PCs may not contain the cluster information.
Assuming that the data consist of a mixture of two multivariate normal distributions with different
means but the same within-cluster variance-covariance matrix, Chang (1983) demonstrated that
the first PCs may contain less cluster information than others with smaller variance. He also
generated an artificial two-group example in which the best separation between
the groups occurred in the subspace spanned by the first and last PCs.
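The effect described by Chang (1983) can be reproduced with a small two-dimensional toy case. The sketch below is not Chang's own example: the group means, standard deviations, sample size and seed are assumptions chosen only to make the point visible, and the helper names (`make_two_groups`, `pca_2d`, `sign_split_accuracy`) are hypothetical. Two normal groups share the same within-group covariance; the high-variance direction carries no group information, while the low-variance direction separates the groups.

```python
import math
import random


def make_two_groups(n_per_group=500, seed=0):
    """Two bivariate normal groups with identical within-group covariance."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, mean_y in ((0, -1.5), (1, 1.5)):
        for _ in range(n_per_group):
            x = rng.gauss(0.0, 5.0)     # large variance, same mean in both groups
            y = rng.gauss(mean_y, 0.5)  # small variance, carries the group split
            points.append((x, y))
            labels.append(label)
    return points, labels


def pca_2d(points):
    """PCA for 2-D data via the closed-form principal-axis angle."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Direction of largest variance (first PC); the second PC is orthogonal.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))

    def variance_along(axis):
        ux, uy = axis
        return a * ux * ux + 2.0 * b * ux * uy + c * uy * uy

    scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1])
              for x, y in centered]
    return scores, (variance_along(pc1), variance_along(pc2))


def sign_split_accuracy(scores_1d, labels):
    """Share of points recovered by splitting one component at zero."""
    hits = sum((s > 0) == (lab == 1) for s, lab in zip(scores_1d, labels))
    frac = hits / len(labels)
    return max(frac, 1.0 - frac)  # the sign of a component is arbitrary


points, labels = make_two_groups()
scores, variances = pca_2d(points)
acc_pc1 = sign_split_accuracy([s[0] for s in scores], labels)
acc_pc2 = sign_split_accuracy([s[1] for s in scores], labels)
print("variance of PC1, PC2:", variances)
print("group recovery using PC1 only:", acc_pc1)
print("group recovery using PC2 only:", acc_pc2)
```

Under these assumptions, keeping only the first PC discards almost all of the cluster information, while the lower-variance second PC recovers the groups almost perfectly. The zero-threshold split is a deliberately crude stand-in for a clustering algorithm.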
5.2 EXAMPLE
In our example we use an approach similar to that of Yeung and Ruzzo (2001). Using an R package
(clusterGeneration), we generated different databases containing 3, 4, 5, 6, 7 and 8 groups
with three different degrees of separation between them (QIU; JOE, 2006a; 2006b).
In each case we used up to 20 variables, and PCs were computed in all 720 cases. To select
the most important variables, we used an ad hoc but very powerful technique: we took a grouping
into five clusters and generated a predictor for it by Random Forest (BREIMAN, 2001).
In each case (3 through 8 groups) we used up to 20 descriptors, adding at each step a new
PC and a new variable, in the orders indicated (variance for PCs and the Random Forest ranking for
variables). The clustering algorithm was WDM (SÁ LUCAS, 2007).
As the correct solution for the clusters was known, being provided by the cluster generation, we
calculated the accuracy of each application of the algorithm by the adjusted Rand index
(aRI) (RAND, 1971; HUBERT; ARABIE, 1985). When the algorithm perfectly reproduces the clusters, the index is equal to
1. For a random assignment, the expected value of the index is zero. Results are shown in Figures 1 to 9.
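For readers who want to check how the index behaves, the following is a minimal sketch of the adjusted Rand index in the pair-counting form of Hubert and Arabie (1985), using only the Python standard library. The function name `adjusted_rand_index` and the toy label vectors are illustrative assumptions and are not part of the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two partitions of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("both label sequences must have the same length")
    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Contingency table cells and its row/column totals.
    cell_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    # Numbers of item pairs placed together in both, or in each, partition.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one big cluster
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
# Same partition with the label names swapped: perfect agreement.
print(adjusted_rand_index(truth, [1, 1, 1, 0, 0, 0]))
# A partition that cuts across the true groups: the index falls below zero.
print(adjusted_rand_index(truth, [0, 1, 2, 0, 1, 2]))
```

The index depends only on which items are grouped together, not on the label names, so it is suitable for comparing a clustering solution with a known generating partition. Values below zero indicate agreement worse than expected by chance.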
[Figure 1: two panels, BaRI3 and BaRI4; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 1
Comparing the methods – Clusters badly discriminated (3 and 4 groups).
In Figure 1 the clusters are badly discriminated and, in general, PCA performs better
than the original variables, although both have an aRI well below 1 (the clustering could not
reproduce the correct segmentation well).
[Figure 2: two panels, BaRI5 and BaRI6; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 2
Comparing the methods – Clusters badly discriminated (5 and 6 groups).
In Figure 2, the original variables are beginning to stand out, though with aRI still well below 1.
[Figure 3: panel BaRI8; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 3
Comparing the methods – Clusters badly discriminated (7 and 8 groups).
The situation in Figure 3 is similar to that in Figure 2: the aRI is still well below 1.
[Figure 4: two panels, MaRI3 and MaRI4; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 4
Comparing the methods - Clusters moderately discriminated (3 and 4 groups).
When discrimination increases, the original variables begin to stand out, as in Figure 4, but now
with aRI closer to 1.
[Figure 5: two panels, MaRI5 and MaRI6; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 5
Comparing the methods - Clusters moderately discriminated (5 and 6 groups).
In Figure 5, with increasing discrimination, the original variables dominate, with aRI closer to 1.
[Figure 6: two panels, MaRI7 and MaRI8; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 6
Comparing the methods - Clusters moderately discriminated (7 and 8 groups).
Again, in Figure 6 the same phenomenon is repeated: with increasing discrimination, the original
variables dominate, with aRI closer to 1.
[Figure 7: two panels, AaRI3 and AaRI4; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 7
Comparing the methods - Clusters highly discriminated (3 and 4 groups).
When discrimination increases further, as shown in Figure 7, the original variables dominate, with aRI
even reaching 1.
[Figure 8: two panels, AaRI5 and AaRI6; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 8
Comparing the methods - Clusters highly discriminated (5 and 6 groups).
The pattern repeats in Figure 8: as discrimination increases, the original variables dominate, with
aRI reaching 1.
[Figure 9: two panels, AaRI7 and AaRI8; horizontal axis: Nvar; legend (Method): 1.WDM3, 2.PCA3.]
FIGURE 9
Comparing the methods - Clusters highly discriminated (7 and 8 groups).
In this case too (Figure 9), with increased discrimination, the original variables dominate, with aRI
reaching 1.
6. CONCLUSION
We notice that, as the discrimination between groups increases (better segmentation), the dominance
of the original variables increases. In fact, if there is no discrimination between groups,
the segmentation is poor. In these cases, segmenting via PCA can reduce the distortion.
The heatmaps shown in Figure 10 summarize the performance of the methods. The percentages indicate
how often the original variables performed as well as or better than the PCs, as discrimination
among groups increases.
FIGURE 10
Heatmaps as discrimination among groups increases.
As discrimination among groups increases, the share of cases in which the original variables
prevail becomes more evident: light blue represents the points where the
original variables perform as well as or better than the principal components.
Furthermore, using too few or all of the variables hurts performance (see Figures 3 and 8). So we
suggest, as a general criterion, using a medium share of the original variables (about
25% to 75%). Besides, the use of principal components would be justified only if we knew beforehand
that the grouping was unclear.
7. REFERENCES
BREIMAN, L. Random Forests, Machine Learning, 45, (1) pp. 5-32, 2001.
CHANG, W. On using principal components before separating a mixture of two multivariate
normal distributions, Applied Statistics, 32, pp. 267-275, 1983.
DHILLON, I.; MODHA, D.; SPANGLER, W. Class Visualization of High-Dimensional Data with
Applications, Computational Statistics and Data Analysis, 41, pp. 59-90, 2002.
DIBB, S.; SIMKIN, L. Target segment strategy. In: BAKER, M.; SAREN, M. (Org.). Marketing
Theory, 2010.
DIBB, S.; SIMKIN, L. The Market Segmentation Workbook. London: Routledge, 1996.
GOULD, S. J. The mismeasure of man. New York: W.W. Norton & Company, 1981.
GRAY, K. Think you Know Segmentation? Think Again! A Close Look at 4 Core Analysis. Quirk´s
marketing research media e-newsletter, 2013. Available at: <www.quirks.com/articles/2013/201312252.aspx>. Accessed on: Feb. 5, 2014.
HUBERT, L.; ARABIE, P. Comparing Partitions. Journal of Classification, 1985, pp. 193-218.
KADEN, R.; LINDA, G.; PRINCE, M. Leading edge marketing research. Los Angeles: Sage
Publications. 2013.
KING, D.; WANG, F. Time to Re-think Segmentation, 2007. Available at:
<http://www.dmnews.com/time-to-re-think-segmentation/article/98990/>. Accessed on: Jan. 6,
2014.
MCDONALD, M.; DUNBAR, I. Market Segmentation – How to do it, How to profit from it.
Oxford: Elsevier, 2004.
MYERS, J. H. Segmentation and positioning for strategic marketing decisions, Chicago: American
Marketing Association, 1996.
QIU, W.; JOE, H. Generation of Random Clusters with Specified Degree of Separation. Journal of
Classification, 23 (2), 2006a, pp. 315-334.
QIU, W.; JOE, H. Separation Index and Partial Membership for Clustering. Computational Statistics
and Data Analysis, 50, 2006b, pp. 585-603.
RAND, W. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American
Statistical Association, 66, 1971, pp. 846-850.
SÁ LUCAS, L., Joint segmenting consumers using both behavioral and attitudinal data.
Proceedings of the Sawtooth Software Conference, 2007, pp. 199-218.
SHALIZI, C. The Truth about Principal Components and Factor Analysis. Advanced Data Analysis
from an Elementary Point of View, 2014. Available at:
<http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf>. Accessed on: Feb. 4, 2014.
SMITH, W. Product Differentiation and Market Segmentation as Alternative Marketing Strategies.
Journal of Marketing, 21 (July), 1956, pp. 3-8.
STEWART, D. The Application and Misapplication of Factor Analysis in Marketing Research.
Journal of Marketing Research, 18 (February), 1981, pp.51-62.
WEDEL, M.; KAMAKURA, W. Market Segmentation – Conceptual and Methodological
Foundations, International Series in Quantitative Marketing, Boston: Kluwer Academic Publishers,
2000.
YEUNG, K.; RUZZO, W. An Empirical study of Principal Component Analysis for Clustering
Gene Expression Data, Bioinformatics, 2001.
Note: Authors are solely responsible for the translation of their articles from Portuguese to English.