Oh my darling Clementine

Data mining may not have had much to offer the research community in the past, but as Tim Macer reports, SPSS is bringing it into the fold.

Data mining, with its post hoc connotations of desperately searching for something - anything - that might be interesting, seems to many researchers like the antithesis of quantitative research. Its associations with large databases, direct marketing and fraud detection make it hard to see the link with MR. So it is refreshing to see a data mining tool developed under guidance from SPSS MR to understand the geography of survey data.
Clementine was designed for researchers and business analysts to use - you need to be a hands-on computer user, but not a technical specialist; understanding your data is more important. And it is not just for huge databases: it will read standard SPSS files too, which means, by extension, it can be used on most 'flat' MR survey data.
This is a different approach, and a different experience, from interrogating data with a cross-tab tool: synthesis rather than analysis, as you tend to work with all your variables at once. The outcome is typically a 'predictor' - such as the likelihood that someone will purchase frozen pizza, compare rates before renewing their insurance, shop at Ikea or belong to a gym.
You can use Clementine as another segmentation tool, or you can turn these findings into action using something called the Clementine Solutions Publisher. This creates an executable program that you or anyone else can run against other samples to predict how they will behave - even samples of one.
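Clementine's published programs are proprietary, but the pattern they embody is easy to sketch in general-purpose code. A minimal Python analogue - scikit-learn on invented data, not anything Clementine actually generates - would train a predictor, save it, then reload it to score a single new respondent:

    # Illustrative only: Clementine's published executables are proprietary.
    # This mimics the same pattern with scikit-learn, on invented data.
    import pickle
    from sklearn.linear_model import LogisticRegression

    # Toy training data: [age, store visits per month] -> bought frozen pizza?
    X = [[25, 8], [40, 2], [31, 6], [55, 1], [22, 9], [47, 3]]
    y = [1, 0, 1, 0, 1, 0]
    model = LogisticRegression().fit(X, y)

    # 'Publish' the predictor so anyone can run it against other samples...
    with open("predictor.pkl", "wb") as f:
        pickle.dump(model, f)

    # ...even a sample of one.
    with open("predictor.pkl", "rb") as f:
        predictor = pickle.load(f)
    print(predictor.predict_proba([[29, 7]])[0, 1])  # purchase probability

The point is the deployment pattern: once built, the model travels independently of the tool that built it.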
When you open Clementine, it presents you with a blank canvas and four palettes of tools. From these you construct a model or 'stream' that will take your data on a journey from source to final results.
This is where the beauty of this program lies: you can see everything that is going on. In fact, for a technique many consider to be a bit of a black box, it does a better job of revealing what is happening than just about any standard MR analysis tool out there.
You start by dragging a data source 'node' on to the screen and telling it where to look. For high-end users, this could be an SQL query executed server-side by, say, an Oracle database (which guarantees good performance); or it could be a thousand records from a survey data file.
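By way of analogy only - Clementine's palette is graphical, and the table, file and column names here are invented - the two kinds of source might look like this in Python with pandas:

    # Hypothetical data sources, sketched with pandas (not Clementine itself).
    import sqlite3
    import pandas as pd

    # Option 1: push the query to the database and pull back only the result.
    conn = sqlite3.connect("enterprise.db")
    customers = pd.read_sql_query(
        "SELECT customer_id, age, region, spend FROM customers", conn)

    # Option 2: read a flat survey file (here, an SPSS export saved as CSV).
    survey = pd.read_csv("survey.csv")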
Next, you add other nodes to perform tasks and transformations on the data, like editing, filtering or weighting. Admittedly, there's a lot of databasey terminology to learn and some 'false friends' to be wary of, like 'field operations' (editing and cleaning), or 'filter', which removes entire variables rather than selecting cases.
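Spelt out as code rather than nodes - again a pandas sketch with invented column names, not Clementine's own syntax - the false friend becomes obvious: 'filter' drops whole variables, while selecting cases is a different step altogether:

    # Hypothetical node operations written as pandas steps.
    import pandas as pd

    survey = pd.read_csv("survey.csv")

    # A 'filter' node removes whole variables (columns)...
    survey = survey.drop(columns=["interviewer_id", "verbatim_text"])

    # ...whereas selecting cases (rows) is a separate operation entirely.
    adults = survey[survey["age"] >= 18].copy()

    # A field operation: derive a recoded variable from an existing one.
    adults["spend_band"] = pd.cut(adults["spend"],
                                  bins=[0, 50, 200, 10_000],
                                  labels=["low", "medium", "high"])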
Clementine from SPSS MR
Pros
  • Easy data mining: build, refine and repeat at will
  • Handles multiple data sources: merge survey data and enterprise databases
  • Create predictor programs to run anywhere using the Solutions Publisher
Cons
  • Some confusing terminology
  • Ugly output
  • Hefty price tag
Most usefully, a 'merge' node will combine data from two different data sources, for instance a database and a related survey - messy and error-prone in most tab packages, but achieved gracefully here. And all the time, you benefit from instant feedback on screen through colour coding and other visual cues, plus the ability to check your partial results at any point.
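For the record, the equivalent of that merge in pandas - purely illustrative, with an invented customer_id key linking respondents to database records - is almost a one-liner:

    # A rough pandas analogue of the 'merge' node (key and files invented).
    import pandas as pd

    survey = pd.read_csv("survey.csv")        # one row per respondent
    customers = pd.read_csv("customers.csv")  # one row per customer account

    # Join survey answers to enterprise data on a shared key,
    # keeping only the respondents that match.
    combined = survey.merge(customers, on="customer_id", how="inner")
    print(f"{len(combined)} of {len(survey)} respondents matched")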
Eventually, you reach the point where you choose one of the built-in algorithms for transforming the data and building the predictive model. Clementine has open, published interfaces, so it is possible to plug in third-party algorithms as extra nodes on your palette, or even to develop your own models.
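The shape of the modelling step itself will be familiar from any statistics library. As a stand-in for one of Clementine's algorithm nodes, here is a decision tree fitted with scikit-learn to hypothetical merged data (column names invented):

    # Not Clementine: an analogous modelling step using scikit-learn.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    data = pd.read_csv("combined.csv")
    X = data[["age", "spend", "visits_per_month"]]
    y = data["renewed_policy"]  # 1 = renewed, 0 = lapsed

    # Hold back some cases so the predictor can be checked honestly.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
    print(f"Holdout accuracy: {tree.score(X_test, y_test):.2f}")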
A pause of a minute or more may now ensue - it seems churlish to criticise this, considering that, a few years ago, such work would have required an overnight run. Then the results appear, and you learn either that you have found something interesting or, more likely, that you have not. No matter: you simply refine your stream slightly or try a different algorithm, and a few seconds later run it again. This repeatability is a real advantage.
On the minus side, it is still obvious that Clementine is a UNIX program masquerading as Windows software, especially when it comes to the output, which is ugly and uninspiring. A full Windows version is due out later this year, which should improve things. The other minus for many a potential user is the price tag, which, I warn you, starts at £33,000.
David Pihlens, Managing Partner of Commetrix, a marketing analysis consultancy, uses several statistical packages but tends to turn to Clementine either when there is a lot of data or when it is useful to be able to repeat the process later. He comments: "The idea of streams is very interesting: it fits well with all the manipulations you have to perform and helps you to cope with what is a choppy process. It also means you have a good visual tool to explain the process to the client."
Pihlens praises the ease with which you can get at different kinds of data - he even used it to analyse web site traffic logs directly, which are, as data goes, notoriously unstructured.
"The ability deploy your models using the solutions publisher is attractive. It brings stuff alive." Pihlens continues. "We are trying to get away from the dusty report that is flavour of the week and now we always try to build front ends like these. These days, people want to be able to use the analysis, not just read about it."
Combining survey data with enterprise data to give actionable, verifiable results is what we keep hearing research consumers want. Persuade them to pay handsomely for it, and Clementine could give them exactly that.

SPSS MR: www.spssmr.com/clementine

Tim Macer writes as an independent specialist and advisor in software for market research. His website is at www.meaning.uk.com

Published in Research, the magazine of the Market Research Society, June 2002, Issue 433.

© Copyright Tim Macer/Market Research Society 2002. All rights reserved. Reproduced with permission.
