
WordStat reviewed

When analysing data, words tend to fail most software solutions. Tim Macer reviews a textual analysis tool that can sift through open-ended responses.

Add a few open-ended questions to an Internet survey and - if your respondents are at all engaged with the subject matter - you will find yourself awash with verbatim responses. Processing these answers using classic coding methods is not only time-consuming, but unless you have the time to read every raw comment, leaves you with the uneasy feeling that the most valuable insights may have been left on the coding room floor.

WordStat is a statistically-based textual analysis tool that can overcome this by providing a very different and stunningly effective way to handle open-ended answers. It does not try to ape verbatim coding, but uses a range of advanced computer-based textual analysis methods to search out recurrent words and phrases. But words and short phrases can be misleading, and a poor indicator of meaning. Who knows what the two words “cool reception” might mean without seeing more of the answer? Is it a begrudging welcome, air conditioning set to 18°C or even, just possibly, the ultimate in iconic urban foyer design? Here WordStat excels by always letting you zoom between the macro- and micro-view of your respondents’ comments so you can always get to the context and intervene with your own expert judgement.

The tool comes from Provalis Research, a software company based in Montreal. It is offered as a bolt-on module to their SimStat statistical analysis program, or to QDA Miner, a qualitative code-and-retrieve analysis tool. A single licence to SimStat with WordStat costs around £600 at current exchange rates.

The first step is to get the data in, and that works best from SPSS or Excel. Both categorical variables and verbatim texts are imported together. Unfortunately, there is no support for Triple-S or other common MR data collection formats, which is likely to result in some re-keying of labels on all the profile variables you import. You use the SimStat interface to access WordStat. This is a fully featured stats package with its own cross-tab facilities. If you refer to a variable that contains literal verbatim text, SimStat switches into the specific WordStat capabilities. It is all quite seamless and, overall, the interface and navigation through the program are pretty well designed - though it is somewhat eye-poppingly sprawling, in the way that statistical software so often is.

The first time you use the system, you will need to spend longer than usual populating the dictionary - which is really your own subject-specific wordlist - with all the terms you encounter in your data. The program provides a range of tools and helpers here; listing all of the words it has found in your data, for example. There is a complementary pre-populated list of ‘exclusion’ words: several hundred uninteresting utility words like ‘the’, ‘of’, ‘and’ and so on. This too you can edit in full.

There are many different ways to identify and analyse textual answers. The best place to start is with simple frequencies of words found in your word list. You can also look at frequencies of the words not found in your list, which is a good way to build the list in the first place, and also to spot emerging trends or differences in any longitudinal studies.
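The mechanics of this first step are easy to illustrate. The Python sketch below counts word frequencies across a batch of verbatims while skipping utility words; the exclusion list and sample responses are invented for illustration, and this is the general technique rather than WordStat's own implementation.

```python
from collections import Counter
import re

# Invented exclusion list of utility words; WordStat ships its own
# pre-populated, fully editable list of several hundred.
EXCLUSIONS = {"the", "of", "and", "a", "to", "was", "were", "is", "it"}

def word_frequencies(responses):
    """Count words across verbatim responses, skipping excluded words."""
    counts = Counter()
    for text in responses:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in EXCLUSIONS:
                counts[word] += 1
    return counts

responses = [
    "The reception was cool and the staff were friendly",
    "Cool design, friendly reception staff",
]
print(word_frequencies(responses).most_common(3))
```

Words that fall through both the exclusion list and your own wordlist are exactly the ‘not found’ residue worth reviewing when building the dictionary.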

The dictionary applies lemmatisation rules, which you may also adjust - these ensure that word stems, singular and plural forms, abbreviations and even common misspellings get treated as one word.
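In outline, a lemmatisation pass is a lookup table plus a few fallback rules. The entries below are invented examples; WordStat's actual dictionary rules are editable and far more extensive.

```python
# Invented lemmatisation table mapping variant forms (plurals,
# abbreviations, misspellings) onto one canonical word.
LEMMAS = {
    "receptions": "reception",
    "recieved": "received",  # common misspelling
    "recd": "received",      # abbreviation
}

def lemmatise(word):
    """Return the canonical form of a word, falling back on a crude
    strip-the-plural-s rule when there is no explicit entry."""
    word = word.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word
```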

From viewing words by frequency, a Key-Word-in-Context very usefully lets you drill down to the actual occurrences found - you will see a short single-line snippet of each, with the immediate words around it. A further click will reveal the whole verbatim, with the key words helpfully highlighted. Then it starts to get clever. You can also cross-tab words with other categorical variables in your data, and start to see associations. There is a phrase finder, where you can specify a minimum and maximum word length, and again these can be cross-tabbed and viewed in context.
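A key-word-in-context display can be sketched in a few lines: for each occurrence of the target word, keep a fixed window of surrounding words. The window size and snippet formatting here are arbitrary choices, not WordStat's.

```python
import re

def kwic(responses, keyword, window=3):
    """Return one single-line snippet per occurrence of `keyword`,
    with up to `window` words of context either side."""
    snippets = []
    for text in responses:
        words = text.split()
        for i, w in enumerate(words):
            # Strip punctuation before comparing with the keyword.
            if re.sub(r"\W", "", w.lower()) == keyword:
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                snippets.append(f"...{left} [{w}] {right}...")
    return snippets
```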

Having reached a set of words you are happy with, you can then go on to produce a categorisation, which is effectively a hierarchical codeframe. As if by magic, the program will suggest words to go into each category - at least it does in English, as the program ships with the open source WordNet lexical dictionary, which contains some 200,000 English word definitions and inter-relationships. Furthermore, once you have created your categorisation, you can re-use it on subsequent new data. Each category will be populated according to the same set of rules you have determined with regard to relevant words and phrases.
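Re-using a categorisation on new data amounts to re-applying the same word rules. A minimal sketch, with a flat two-category codeframe invented for illustration (WordStat's codeframes are hierarchical and its matching rules richer):

```python
# Invented codeframe: each category is defined by a set of trigger words.
CODEFRAME = {
    "Service/Staff": {"staff", "service", "helpful", "rude"},
    "Environment/Temperature": {"cold", "hot", "draughty", "aircon"},
}

def categorise(text):
    """Return every category whose trigger words appear in the text."""
    words = set(text.lower().split())
    return [cat for cat, triggers in CODEFRAME.items() if words & triggers]
```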

For many users, these features would be quite enough to analyse their verbatim data, but time permitting, you can go much further. You can start to plot found words and phrases in three-dimensional correspondence plots, which amazingly you can rotate on screen. These can plot actual verbatim words and phrases against hard categorical data such as demographics or derived cluster groups. You can produce tree-like dendrograms on word and phrase affinity with some devilish ‘heat’ plots that show relative correlation by colour spectrum, from blue to red.
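The affinity underlying such plots can be measured in several ways; one simple, plausible choice (an assumption, not necessarily the statistic WordStat uses) is the Jaccard co-occurrence of two words across responses:

```python
def affinity(responses, word_a, word_b):
    """Jaccard co-occurrence: of the responses mentioning either word,
    the proportion that mention both (1.0 = always together, 0.0 = never)."""
    has_a = {i for i, t in enumerate(responses) if word_a in t.lower().split()}
    has_b = {i for i, t in enumerate(responses) if word_b in t.lower().split()}
    union = has_a | has_b
    return len(has_a & has_b) / len(union) if union else 0.0
```

A matrix of such pairwise affinities is the natural input to the hierarchical clustering that a dendrogram displays.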

For the really advanced user, the program also contains machine learning text analysis capabilities and multivariate statistical analysis on texts: you can actually perform a cluster analysis on the words and phrases in your corpus of verbatim texts.

The user view: Ralph Bishop at International Survey Research

Ralph Bishop is manager for qualitative research at International Survey Research. He uses WordStat to analyse verbatim answers in employee surveys in up to six different languages, for which he has developed subject-specific dictionaries.

He comments: “The most appealing thing about WordStat is its own open-endedness. You really don’t need to code your text, though you can if you want. But so far I have found that using text analysis and properly choosing the context works just fine for me. It is good for any kind of research where there is open-ended material of any kind.”

Dr Bishop advises that to get the best out of it, some hard graft is needed first, to build a good domain-specific dictionary or lexicon. Once this has been done, it can be re-used on any other project sharing the same subject domain - though it benefits from a periodic refresh to keep it relevant. “The trick is to understand how to manipulate the dictionaries” he confides.

Once that is out of the way, he reports it means you deal with large volumes of verbatims very quickly. “This is most helpful because everything we do is done under very tight deadlines.”

He cautions: “You still have to be very careful about noise, typically words that are used in different contexts. Words like competitive - ‘competitive environment’, ‘competitive salary’ and so on. Natural language is full of these kinds of ambiguities so you cannot expect always to get absolute clear-cut distinctions.”

But Dr Bishop also sees potential for the tool to make researchers more adventurous in their use of verbatim questions. “If people are going to use this, it will give them a little room to expand on the open-ended responses on surveys. You might be able to ask some more reflective questions and get a little bit deeper into the psychology of what appeals to people about particular products or services.”

Published in Research, the magazine of the Market Research Society, August 2005, Issue 471.

© Copyright Tim Macer/Market Research Society 2005. All rights reserved. Reproduced with permission.

WordStat is a statistically-based textual analysis tool for handling verbatim texts or interview transcripts and deriving re-usable text classification or categorisation models.