Visualisation Techniques for Statistics

– the Current State of Play

 

Antony Unwin

Dept of Computer-Oriented Statistics and Data Analysis, Mathematics Institute, University of Augsburg, 86135 Augsburg, Germany

 

Summary

Graphic displays are very common, but good visualisation of information is not. There are many interesting ideas, but they are not coherently implemented in any currently available software. This paper sketches the requirements for statistical visualisation systems, given today’s state of knowledge.

 

1 Introduction

Visualisation has always been a vital tool for statisticians. There is a long tradition of visualising data, though the modern age really began with Playfair and his graphics representing trading and other information for the UK (a pertinent example of an application relevant to Official Statistics having great influence on the development of statistical practice). Visualisation is a large field, including the technical aspects of producing displays (discussed, for instance, in Wegman and Carr (1993)) and the psychological aspects of perception and interpretation (discussed in many, many publications). Neither is discussed here, where we concentrate on visualisation techniques and how they are implemented. Work on visualisation related to statistics may be separated into several streams.

Presentation graphics: static displays of data to convey results. Thanks to the increasing power and availability of computers, presentation graphics have become much more common, and in many ways much worse. Although there is much good advice around (cf. the books of Tufte and Cleveland), presentation graphics often rely more on apparently random use of complex options than on a genuine concern to convey information. Presentation graphics software works with positioning and formatting of the graphic display objects rather than with the data set itself. There are many systems for drawing presentation graphics, some with sophisticated defaults and helpful guidance, but it is essential to know what you want to display before deciding how to display it. There is a mismatch between the software’s capacity to generate a multitude of graphic formats and the occasional user’s understanding of appropriateness and content. As Cleveland says "Our tendency is to be misled into thinking we are absorbing relevant information when we see a lot" (Cleveland (1993) p1).

Scientific visualisation: this tends to concentrate on 3-d representations, often animated over time. There is great emphasis on colour and exact picturing of complex phenomena, such as geological structures or meteorological events. The aim is to be able to display sufficient detail to highlight critical features. Scientific visualisation is particularly impressive at displaying complex models and functions. While these techniques are valuable, they are more suited for modelling than data analysis and are not designed for high dimensional data in general. There is an excellent annotated webpage of scientific visualisation sites at

http://science.nas.nasa.gov/Groups/VisTech/visWeblets.html.

Cartographic visualisation (MacEachren (1995)): traditionally the best cartographers have impressed by their ability to summarise large amounts of spatial data in visually pleasing and informative maps. However, the maps have often been extremely crowded as the cartographers strive to include as much information as possible. Modern developments in cartographic visualisation (cf. the special issue of Computers and Geosciences, May 97 and the book edited by Hearnshaw and D. Unwin (1994)) have moved away from the static restrictions imposed by printed graphics. There are many interesting developments, and cartographers, with their primary emphasis on location, and statisticians, with their primary interest in data, have much to learn from one another. (Results of such a collaboration between the statistics group in Augsburg and the geography departments in Leicester and Birkbeck were published this year in a special issue of JRSS D "The Statistician" Vol 47 Part 3, devoted to spatial statistics.)

Information visualisation: graphic displays of concepts and qualitative relations. There are many novel and exciting ideas being tried out with displaying information on the web and this is very much an area to watch, even though it is usually not related to quantitative data.

Statistical visualisation: displays for showing structure in the data rather than detail, conveying essentials rather than decoration. This should not be confused with presentation graphics, for two reasons: the aim is to convey statistical information, and dynamic and interactive tools may be employed (Eick and Wills (1995)). Interaction has added substantially to the power of statistical graphics and the full potential of the approach has still barely been realised, especially for multivariate data.

Only statistical visualisation is discussed in this paper, but fruitful, innovative ideas may be found in each of the other areas.

2 Implementation of visualisation techniques

There are two kinds of flexibility needed in visualisation techniques:

the capability of specifying exactly what is to be drawn for a very wide range of possibilities (plot types, size, colour, scale, screen position, ...) in ways which can be recorded and repeated;

the capability of directly interacting with displays to query them for information, to link them, and to change their shapes, scales and other characteristics so that a wide range of views can be scanned.

These two kinds of flexibility are not incompatible but they represent quite different system philosophies. Precise, repeatable commands are the traditional computer science approach and are associated with displaying optimal results for well-defined questions. Direct, interactive interfaces are a more modern approach and are associated with searching for information or interesting structures without fully specified questions. Both have their place, but there has been, and continues to be, substantial research on the precision approach, while less attention has been paid to the second, interactive approach. There is a provocative, philosophical article by Huber which reviews some of these issues (Huber (1994)).

Command-line interfaces are flexible in design, but not in use. They provide precision control easily (so that, for instance, you can draw four histograms of specified size and scaling at defined positions on screen) but not enough direct manipulation (so that the histograms cannot be interactively redrawn to discover other informative scalings). It is the contrast between specific power and general flexibility. With a more flexible interactive system the displays may not be as polished. Should interesting information emerge for which an optimal display is then sought, the user should be able to switch to any one of several presentation systems for creating a final display.
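The contrast can be made concrete with a small sketch in Python. The function `histogram_counts` and its parameters `anchor` and `binwidth` are hypothetical names chosen for illustration and do not belong to any package discussed in this paper. A command-line system would evaluate such a specification once with fixed values, whereas an interactive system would in effect re-evaluate it continuously as the user drags the bin width or anchor point. The data are made up.

```python
import math

def histogram_counts(values, anchor, binwidth):
    """Count values in bins [anchor + k*binwidth, anchor + (k+1)*binwidth).

    Returns a dict mapping the bin index k to its count.
    Hypothetical helper for illustration only.
    """
    if binwidth <= 0:
        raise ValueError("binwidth must be positive")
    counts = {}
    for v in values:
        k = math.floor((v - anchor) / binwidth)
        counts[k] = counts.get(k, 0) + 1
    return counts

# Made-up data containing two clusters
data = [1.0, 1.2, 1.4, 1.6, 3.0, 3.2, 3.4, 3.6]

# One fixed, repeatable specification, as a command would give it
coarse = histogram_counts(data, anchor=0.0, binwidth=4.0)

# The same data rescaled, as an interactive user might do by dragging
fine = histogram_counts(data, anchor=0.0, binwidth=1.0)
```

With a bin width of 4 all eight values fall into a single bar; with a bin width of 1 the two clusters separate. Neither scaling is wrong, but only a user who can change the scaling easily is likely to see both.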

Many older statistical software systems have a strong modular structure which enables new methods to be added easily. These methods are of course then only weakly linked to the system and have to be used more or less independently (sometimes to the extent of operating as separate programmes with only a common data format). More modern systems use an object-oriented approach which also permits extending the software, but potentially in a more integrated way. A totally unrestricted system lacks consistency and does not look good (always an important criterion for visualisation software). A tightly constrained system inhibits inventiveness, but is consistent, which makes it easier to use, and highly integrated, which makes it fast.

Statisticians have not interested themselves in interface design problems a great deal. There is an extensive Human Computer Interface (HCI) literature, but it is difficult to find directly relevant material. Donald Norman’s books (Norman (1988) and (1993)) and Apple’s implementation of many interface ideas are honourable exceptions but are only beginnings. For tools to be used they must not only be well implemented, they must also match potential users’ work needs. Looking up manuals continually for explanations of a long list of parameters is not appropriate for analyses. Mixed-up inconsistent menus are not suitable. Complicated command-line interfaces can be very powerful but are not for occasional or intermittent users. Direct interaction requires implicit context-sensitive support to guide use. The lack of effective systems has undoubtedly inhibited the development of new visualisation ideas. It is only when these ideas can be tested in practice that their value can become established.

3 Visualisation feature requirements

Many packages provide a large set of graphical displays with a wide range of options, but as a collection of features rather than an integrated system, and without any genuine interactive capability. A visualisation system should provide the basic displays, the tools to manipulate them interactively and the possibility of extending the software with more specialist or advanced graphics.

(a) Basic

Window types:

single variable windows (histograms, boxplots, dotplots...)

multiple variable windows (e.g. scatterplots)

multiple display windows (missing value plots, subgroup plots...)

data and results tables

Functionality within each graphic:

context-sensitive querying

zooming, flexible scaling

resizing of graphics

resizing of objects (e.g. points in scatterplots)

colouring of objects

selecting (ideally by points, areas or lassos)

masking, grouping.

Functionality between graphics and displays:

tiling and arranging windows, common scaling, linking of cases.
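Linking of cases is the feature on which most of the others depend, and its logic is simple to state. The following is a minimal sketch in Python, assuming a data set held as a list of rows; the function names `select_range` and `linked_barchart` and the data are hypothetical and are not taken from any of the systems discussed here. A selection made in one display is stored as a set of case indices, and every other display reports, for each of its graphical objects, how many of its cases are currently selected.

```python
from collections import Counter

# Made-up data set: each case is one row, identified by its index.
cases = [
    {"region": "North", "income": 21},
    {"region": "North", "income": 35},
    {"region": "South", "income": 48},
    {"region": "South", "income": 52},
    {"region": "South", "income": 19},
]

def select_range(cases, variable, low, high):
    """Selection made in one display: indices of cases with low <= value <= high."""
    return {i for i, case in enumerate(cases) if low <= case[variable] <= high}

def linked_barchart(cases, variable, selected):
    """Bar chart of another variable: (total, highlighted) count per category."""
    totals = Counter(case[variable] for case in cases)
    highlighted = Counter(cases[i][variable] for i in selected)
    return {cat: (totals[cat], highlighted[cat]) for cat in totals}

# Select incomes between 30 and 60 in an (imagined) income histogram ...
selected = select_range(cases, "income", 30, 60)

# ... and see the same cases highlighted in a bar chart of region.
bars = linked_barchart(cases, "region", selected)
```

Because the selection is held as case indices rather than as a property of any one plot, any number of displays can be linked in the same way.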

(b) Advanced

Displays

Mosaic plots for multivariate categorical data. Although these were suggested by Hartigan almost twenty years ago, they have only really become usable through added interactive capability (Unwin et al (1996)).

Parallel co-ordinates for multivariate continuous data (Inselberg (1998))

Biplots (for categorical data as in correspondence analysis but also for continuous data) (Gower and Hand (1996))

Rotating plots and projection pursuit (Swayne et al (1998))

Trellis graphics (Becker et al (1996))

Displays for time-dependent data (Unwin and Wills (1988))

Displays for spatial data (Unwin and Hofmann (1998))

Scalable options for amending standard displays to cope with large data sets (histograms, boxplots, scatterplots...)

Special displays for large data sets (Wills (1995)).

Linking of formats and of ordering of categories across displays

Linking across data sets/relations

Sorting by any criteria

Taking account of missings in graphic displays (Unwin et al (1996)).

Interactive selection sequences (Hofmann and Theus (1998))

No current system offers this complete range of features and few offer even the basic set.
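As one illustration of the advanced displays listed above, the geometry of a mosaic plot for two categorical variables can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the algorithm of any particular package: the function name `mosaic_tiles` is hypothetical, no gaps are drawn between tiles, categories are ordered alphabetically, and the data are made up. The unit square is first divided into columns with widths proportional to the counts of the first variable, and each column is then divided into tiles with heights proportional to the counts of the second variable within that column, so that each tile's area is proportional to its cell count.

```python
from collections import Counter

def mosaic_tiles(pairs):
    """Return tiles for a two-variable mosaic plot on the unit square.

    pairs: iterable of (a, b) category pairs, one per case.
    Each tile is (a, b, x, y, width, height). Column width is proportional
    to the count of a; tile height within a column is proportional to the
    count of b given a. Empty cells get height zero.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return []
    a_counts = Counter(a for a, _ in pairs)
    cell_counts = Counter(pairs)
    b_levels = sorted({b for _, b in pairs})
    tiles = []
    x = 0.0
    for a in sorted(a_counts):
        width = a_counts[a] / n
        y = 0.0
        for b in b_levels:
            height = cell_counts[(a, b)] / a_counts[a]
            tiles.append((a, b, x, y, width, height))
            y += height
        x += width
    return tiles

# Made-up example: 8 cases cross-classified by two binary variables
data = ([("F", "no")] * 1 + [("F", "yes")] * 3
        + [("M", "no")] * 2 + [("M", "yes")] * 2)
tiles = mosaic_tiles(data)
```

Extending the idea to more variables means splitting each tile again, alternating between horizontal and vertical divisions. The order of the variables then matters a great deal, which is one reason why interactive reordering makes the display so much more usable.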

4 Statistical software and visualisation

All statistical packages offer a more or less limited range of visualisation techniques, primarily presentation graphics tools. The classic packages have some exploratory tools, but concentrate on their traditional testing and modelling strengths. More recent programmable packages such as S+, XploRe and LispStat all offer flexible modelling facilities, but little directly for modern graphical analysis. S+ was path-breaking ten years ago for interactive graphics with scatterplot brushing, but has not improved that feature nor added other interactive tools. The addition of trellis graphics in the last couple of years is impressive, but this has only been done in a static way without interaction with the graphics. The strength of the package remains its relatively easy programmability, which has enabled many to prototype implementations of their theoretical ideas quickly and successfully. XploRe is similar in that it offers considerable statistical programming power but not so much for flexible and intuitive working with graphics. It has recently added an Internet capability, which is interesting for specialist statistical modelling. LispStat has some elegant graphic features, but cannot handle discrete variables graphically and uses an ineffective linking structure. Perhaps less flexible, but certainly very powerful, SAS can be used as a programming environment too. None of these systems has a strong modern graphic capability. They are excellent for statistical modelling but not for interactive work. All are specialist packages with steep learning curves and so tend to be relevant only to the statistical profession and not the wider group of ‘data analysts’, that is, all those non-statisticians who have the task of working with, and making some sense of, the data around them.

Some smaller packages originally designed for PCs have now been developed so that they are much more powerful than corresponding mainframe packages of a few years ago. Data Desk and JMP both offer substantial interactive power. JMP tends to play down its graphics features and emphasises its extensive range of analyses instead. Data Desk provides an excellent integrated set of graphics tools with a consistent interface but is, like JMP, not programmable.

Amongst specialist packages, there is XGobi which is currently the sole package offering interactive projection pursuit, but it can only display point plots and is therefore weak for categorical variables. It has good interactive capabilities for controlling its multidimensional rotating plots, but requires a new copy of the programme for each linked display. IBM market a system for interactive parallel co-ordinate plots, PVE. This is impressive, but is really restricted to the one display type and cannot deal appropriately with categorical variables. Other parallel co-ordinate systems are in development at Eurostat and by Al Inselberg in Israel. Lucent Technologies have started promoting a new system, Visual Insights, which runs both on PCs and workstations. Its main strengths are the large data sets that can be handled and some novel graphics ideas (for instance in the handling of categorical variables in parallel co-ordinate plots, in the nicheworks tool and in logical zooming on tables). At the moment there is little in the way of interactive querying. The Augsburg research software, MANET, was originally designed to account for missings in linked graphic displays, but has been substantially extended to include interactive mosaic plots, biplots and spatial displays. It is, however, solely a graphics package with no analytic capabilities.

5 Conclusions

Current software may not be as effective as we would like, yet it is extraordinary how powerful modern computer systems can be for producing elegant and informative graphic displays. It is just frustrating that the software could be a lot better and that it could be a lot better used. It is easy enough to outline requirements for better software, but not so easy to encourage better use. At least if higher standards were requested (and demanded) by users, then new ideas and better software would become available more quickly. It is in all our interests to raise the standards of statistical visualisation.

Acknowledgements

The department of Computer-Oriented Statistics and Data Analysis is supported by the Volkswagen Foundation. Thanks are due to the members of the Augsburg group for many helpful discussions.

References

Becker, R., Cleveland, W.S., Shyu, M-J. (1996). The Visual Design and Control of Trellis Display. JCGS, 5, 123-155

Cleveland, W. S. (1993). Visualizing Data. Summit, New Jersey, USA: Hobart Press

Cleveland, W. S. (1994). The Elements of Graphing Data (Revised ed.). Summit, New Jersey, USA: Hobart Press

Eick, S. G., Wills, G. J. (1995). High Interaction Graphics. European Journal of Operational Research, 445-459

Gower, J. C., Hand, D.J. (1996). Biplots. London: Chapman & Hall.

Härdle, W., Klinke, S., Turlach, B. A. (1995). XploRe: An Interactive Statistical Computing Environment. New York: Springer

Hearnshaw, H. M., Unwin, D. J. (Eds.). (1994). Visualization in GIS. Chichester: Wiley.

Hofmann, H., Theus, M. (1998). Selection Sequences in MANET. Computational Statistics, 13(1), 77-87.

Huber, P. J. (1994). Languages for Data Analysis. In P. Dirschedl and R. Ostermann (Eds.), Computational Statistics. Heidelberg: Physica.

Inselberg, A. (1998). Visual Data Mining with Parallel Coordinates. Computational Statistics, 13(1), 47-63.

MacEachren, A. M. (1995). How Maps Work. NY: Guilford Press.

SAS (1994). JMP 3.1

Norman, D. A. (1988). The Design of Everyday Things. New York: Doubleday

Norman, D. A. (1993). Things that Make Us Smart. New York: Addison-Wesley

Swayne, D. F., Cook, D., Buja, A. (1998). XGobi: Interactive Dynamic Data Visualization in the X Window System. JCGS, 7(1).

Tierney, L. (1990). Lisp-Stat. New York: Wiley.

Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, Connecticut: Graphics Press.

Tufte, E. R. (1997). Visual Explanations. Cheshire, Connecticut: Graphics Press.

Unwin, A. R., and Wills, G. (1988). Eyeballing Time Series. In Proceedings of the 1988 ASA Statistical Computing Section (pp. 263-268).

Unwin, A. R., Hawkins, G., Hofmann, H., and Siegl, B. (1996). Interactive Graphics for Data Sets with Missing Values - MANET. Journal of Computational and Graphical Statistics, 5(2), 113-122.

Unwin, A. R., Hofmann, H. (1998). New interactive graphics tools for exploratory analysis of spatial data. In S. Carver (Ed.), Innovations in GIS 5 (pp. 46-55). London: Taylor & Francis.

Velleman, P. F. (1997). Data Desk. Ithaca NY: Data Description.

Wegman, E., Carr, D. (1993). Statistical Graphics and Visualization. In C. R. Rao (Ed.), Computational Statistics (Handbook of Statistics Vol 9) (pp. 857-958). North Holland.

Wills, G. J. (1995). Visual Exploration of Large Structured Datasets. In New Techniques and Trends in Statistics, (237-246). IOS Press.

Some visualisation software web addresses:

Data Desk http://www.datadesk.com

JMP http://www.sas.com/otherprods/jmp/

MANET http://www1.math.uni-augsburg.de/Manet/

Visual Insights http://www.lucent.com/visualinsights