IEEE International Conference on Multimedia & Expo 2001
Tokyo, Japan
Electronic Proceedings
(C) 2001 IEEE


Content-Based Shape Retrieval Using Different Shape Descriptors: A Comparative Study

Dengsheng Zhang
Gippsland School of Computing and Info Tech
Monash University
Churchill, Victoria 3842
Australia
Tel: 61-3-9902 6772
Fax: 61-3-9902 6842
Email: dengsheng.zhang@infotech.monash.edu.au
Homepage: http://www-mugc.cc.monash.edu.au/~dengs

Guojun Lu
Gippsland School of Computing and Info Tech
Monash University
Churchill, Victoria 3842
Australia
Tel: 61-3-9902 6857
Fax: 61-3-9902 6842
Email: Guojun.Lu@infotech.monash.edu.au
Homepage: http://www.gscit.monash.edu.au/~guojunl


Abstract

Shape representation is a fundamental issue in the newly emerging multimedia applications. In the content based image retrieval (CBIR), shape is an important low level image feature. Many shape representations have been proposed. However, for CBIR, a shape representation should satisfy several properties such as affine invariance, robustness, compactness, low computation complexity and perceptual similarity measurement. Against these properties, in this paper we attempt to study and compare several shape descriptors which have been widely adopted for CBIR, they are: Fourier descriptors (FD), curvature scale space (CSS) descriptors (CSSD), Zernike moment descriptors (ZMD) and grid descriptors (GD). The strengths and limitations of these methods are analyzed and clarified. Retrieval results are given to show the comparison.

Keywords: Fourier descriptors, curvature scale space, Zernike moments, grid, CBIR, shape


Table of Contents


Introduction

In the newly emerging multimedia applications such as MPEG-4 and MPEG-7, shape plays an important role in supporting the so called content based functionalities. Many shape representation have been proposed for various purposes. These methods can generally be grouped into contour-based and region-based. Contour- based methods, such as chain code [FS78], shape signature [Davies97], polygonal approximation [Gu95], autoregressive models [KSP95], FD [PF77], [ZR72] and CSS [MM86], [MAK96], exploit shape boundary information which is crucial to human perception in judging shape similarity. Region-based methods, such as geometric moments [Hu62], Zernike moments [Teague80], [MKL97], grid representation [LS99] and area, exploit only shape interior information, therefore can be applied to more general shapes. Some geometric features such as average scale, skew, kurtosis, etc. reflected in the region based methods (geometric moments) are also important perceptual features. For CBIR purpose, a shape representation should be affine invariant, robust, compact, easy to derive and matching, and perceptually meaningful. In terms of these properties, FD, CSSD, ZMD and GD have been recognized as suitable for CBIR. CSSD and ZMD have been adopted in MPEG-7 as shape descriptors. In this paper we attempt to compare these two shape descriptors with other two shape descriptors: FD and GD. The rest of the paper is organized as following. In Section 2, we describe each technique in detail and clarify the strengths and limitations of each technique. Section 3 gives the experimental results and discussion follows. Section 4 concludes the paper.

Back to Table of Contents


Shape Descriptors

There are generally two types of shape representations, i.e., contour-based and region-based. Contour-based methods need extraction of boundary information which in some cases may not be available. Region-based methods, however, do not necessarily rely on shape boundary information, but they do not reflect local features of a shape. Therefore, for generic purposes, both types of shape representations are necessary. In this section we describe four important shape descriptors, known as: FD, ZMD, GD and CSSD. FD and CSSD are contour-based, while ZMD and GD are region-based.


Fourier Descriptors

Fourier descriptors [ZR72] are obtained by applying Fourier transform on shape boundary (usually represented by a shape signature), the Fourier transformed coefficients are called the Fourier descriptors of the shape. For good shape description, an appropriate shape signature is essential to obtaining Fourier descriptors. As has been shown in [ZL01] that shape signature using centroid distance outperforms other shape signatures in shape based retrieval.

Shape boundary is a set of coordinates (xi, yi), i = 1, 2, ..., L, which are extracted out in the preprocessing stage by contour tracing technique. The centroid distance function is expressed by the distance of the boundary points from the centroid (xc, yc) of the shape

[2.1.1]

where xc, yc are averages of x coordinates and y coordinates respectively. Due to the subtraction of centroid (which represents the position of the shape) from boundary coordinates, the centroid distance representation is invariant to shape translation.

In order to apply Fourier transform, all the shapes in database are normalized to the same number of boundary points. For the shape signature described above, assuming it is normalized to N points (128 in our case) in the normalization stage, the discrete Fourier transform of ri, i = 0, 1, ..., N-1 is then given by

[2.1.2]

The coefficients un, n = 0, 1, ..., N-1, are usually called Fourier descriptors (FD) of the shape, denoted as FDn, n = 0, 1, ..., N-1.

The FDs acquired in this way is translation invariant due to the translation invariance of centroid distance. To achieve rotation invariance, phase information of the FDs are ignored and only the magnitudes |FDn| are used. Scale invariance is achieved by dividing the magnitudes by the DC component, i.e., |FD0|. Since centroid distance is a real value function, only half of the FDs are needed to index the shapes. Finally, the following feature vector are used as the Fourier descriptors to index the shape

[fefeature.gif]

the similarity measure of the query shape and a target shape in the database is simply the Euclidean distance between the query and the target shape feature vectors.

The nice properties of Fourier descriptors are its robustness, being able to capture some perceptual characteristics of the shape and easy to derive. With Fourier descriptors, coarse shape features or global shape features are captured by lower order coefficients and the finer shape features are captured by higher order coefficients. Noise is not a problem with Fourier descriptors, for noise only appears in very high frequencies which are truncated out. Slight changes around the shape boundary doesn't cause much difference in the final representation. With fast Fourier transform (FFT), the computation is efficient. Because usually, only a small number of low order coefficients are enough to capture the overall shape features, the representation is also compact. The disadvantage of Fourier descriptors is local features cannot be located, because in Fourier transform only the magnitudes of the frequencies, not the locations, are known.

Back to Shape Descriptors

Back to Table of Contents


CSS descriptors

CSS descriptors [MM86] are essentially the descriptors of key local shape features. By dealing shape in scale space, not only the locations of, but also the degree of convexities (or concavities) of shape boundaries are detected. Since curvature is a very important local measure of how fast a planar contour is turning, therefore, curvature scale space is exploited. The CSS descriptors are obtained by first calculating the CSS contour map, the map is a multi-scale organization of the inflection points (or curvature zero-crossing points). To calculating CSS contour map, curvature is derived from shape boundary points (xi, yi), i = 1, 2, ..., L:

[2.2.1]

where 1st derivatives and 2nd derivatives are the first and the second derivatives at location i respectively. Curvature zero-cross points are then located in the shape boundary. The shape is then evolved into next scale by applying Gaussian smooth:

[2.2.2]

where Å means convolution, and g(i, s) is Gaussian function. As s increases, the evolving shape becomes smoother and smoother. New curvature zero-crossing points are located at the new scale. This process continues until no curvature zero-crossing points are found. The final CSS contour map is composed of all zero-crossing points zc(i, s), where i is the location s is the scale at which the zc is obtained. The peaks, or the maxima of the CSS contour map (only those peaks higher than the threshold are considered) are then extracted out and sorted in descending order as CSS descriptors to index the shape.

CSS descriptors is translation invariant. Scale invariance is achieved by normalizing all the shapes into fixed number of boundary points (128 in our case). Rotation invariance is achieved by circular shifting the highest peak to the origin of the CSS map. The similarity between two shapes is measured by the sum of the peak differences between all the matched peaks and the peak values of all the unmatched peaks [MAK96]. The difficulties of matching between two set of CSS descriptors are that the dimensions of the two set of CSS descriptors are usually different and the CSS peaks of two similar shapes are usually not matching. Besides, the peak orders of two similar shapes can be quite different. All these factors have to be considered during matching process. Mirror shapes and flipped shapes need also to be considered separately.

The powerfulness of CSS descriptors owes to its ability to capture those key local features such as the locations and the degrees of convexity (or concavity) of curve segments on the shape boundary, these features are very important to human perception in judging the similarity between shapes. The dimension of CSS descriptors is very low and the representation is robust due to Gaussian smooth process. But there are still some problems associate with CSS descriptors. In addition to the above mentioned matching difficulties, it only captures the local shape features, the global features which are also important to shape representation are missed out from the representation. As such, global features such as eccentricity, circularity and number of CSS peaks should be combined to form more practical descriptors. There may be no CSS descriptors for smooth convex shapes such as polygon composed of straight lines. Besides, due to the boundary sampling and the thresholding in the extraction of CSS peaks , the CSS descriptors may not reflect the true numbers of convex (or concave) segments on the shape boundary. All these factors affect the retrieval result.

Back to Shape Descriptors

Back to Table of Contents


Zernike Moments

Teague [Teague80] has proposed the use of orthogonal moments to recover the image from moments based on the theory of orthogonal polynomials, and has introduced Zernike moments, which allow independent moment invariants to be constructed to an arbitrarily high order. The complex Zernike moments are derived from Zernike polynomials:

[2.3.1]

and

[2.3.2]

where n and m are subject to n-|m| = even, |m|£ n. Zernike polynomials are a complete set of complex-valued function orthogonal over the unit disk, i.e., x2 + y2 = 1. Then the complex Zernike moments of order n with repetition m are defined as:

[2.3.3]

The theory of Zernike moments is similar to that of Fourier transform, to expand a signal into series of orthogonal basis. The precision of shape representation depends on the number of moments truncated from the expansion, the first 36 moments are used in our implementation.

Since Zernike basis functions take the unit disk as their domain, this disk must be specified before moments can be calculated. In our implementation, all the shapes are normalized into a unit circle of fixed radius of 64. The unit disk is then centered on the shape centroid. This makes the obtained moments scale and translation invariant. Rotation invariance is achieved by only using magnitudes of the moments. The magnitudes are then normalized by dividing them by the mass of the shape. The similarity between two shapes indexed with Zernike moments descriptors is the Euclidean distance between the two Zernike moments vectors.

Zernike moments descriptors does not need to know boundary information, making it suitable for more complex shape representation. Like Fourier descriptors, Zernike moments descriptors can be constructed to arbitrary order, this overcomes the drawback of geometric moments in which higher order moments are difficult to construct. However, Zernike moments descriptors lose the important perceptual meanings as those reflected in Fourier descriptors and geometric moments. In other words, we don't know what shape feature each Zernike moment represents.

Back to Shape Descriptors

Back to Table of Contents


Grid Descriptors

In grid shape representation [LS99], a shape is projected onto a grid of fixed size, 16x16 grid cells for example. The grid cells are assigned the value of 1 if they are covered by the shape (or covered beyond a threshold) and 0 if they are outside the shape. A shape number consisting of a binary sequence is created by scanning the grid in left-right and top-bottom order, and this binary sequence is used as shape descriptors to index the shape.

For two shapes to be comparable using grid descriptors, several normalization processes have to be done to achieve scale, rotation and translation invariance. It begins with finding out the major axis, i.e., the line joining the two furthest points on the boundary. Rotation normalization is achieved by turning the shape so that the major axis is parallel with x-axis. To avoid multi normalization results for mirrored shape and flipped shape, the centroid of the rotated shape may be restricted to the lower-left part, or a mirror and a flip operation on the shape number are applied in the matching stage. Scale normalization can be done by resizing the shape so that the length of the major axis is equal to the grid width, and by shift the shape to the upper-left of the grid, the representation is translation invariant. The distance between two set of grid descriptors is simply the number of elements having different values. For example, the grid descriptors for the two shapes in Figure 1 (a) and (b) are 001111000 011111111 111111111 111111111 111110011 001100011 and 001100000 011100000 111100000 111100000 011111100 000111000 respectively, and the distance between the two shapes will be 27.

grid1.gif

grid2.gif

Figure 1. Grid representation of shape

Grid representation is a straightforward shape representation which may be suitable for shape coding as is adopted in MPEG-4. However, for retrieval purpose, it is questionable, because a slight shape distortion, such as affine transform, can cause very big difference in the similarity measurement. Furthermore, since the normalizations are mainly based on major axis (which is unreliable in essence) and eccentricity (which is only reliable for convex shapes or compact shapes), shapes otherwise similar may be treated as different due to this normalization. For example, the two shapes in Figure 1 (c) and (d) are perceptually similar, but are very different under grid representation, for the major axis of shape (c) is horizontal while the major axis of shape (d) will be vertical. The online retrieval also involves high computation due to the usually high dimensionality of the feature vectors (for a shape of 192x192 pixels using cell size of 12x12 pixels, the dimension is 196).

Back to Shape Descriptors

Back to Table of Contents


Retrieval Experiments

In order to compare retrieval results using the four methods discussed above, we create a shape database and build a Java client-server retrieval framework to test the retrieval results. The shapes in our database are generated from a original database composed of 1098 fish shapes used by the SQUID system [SQUID] (also by MPEG-7), 70 synthetic polygon shapes and 26 bottle shapes used by Lu and Sajjanhar [LS99]. For each of these shapes, we generate three affine distorted shapes, one scaled shape, one mirror shape and one flip shape. The whole database is composed of 8358 shapes with single contours. Each shape is indexed with the four shape descriptors described in Section 2. The query method is query by example (QBE). As a query shape is submitted by the user from the client site which is a Java applet, the server side compares the query shape descriptors with each model shape descriptors in the database and ranks them in descending order in terms of similarity, or in ascending order in terms of distance. Since the database is large, sequential traversal of all the shapes in the database can cause long delay, a filter consists of shape eccentricity, circularity and number of CSS peaks is used to filter out those very different shapes. The filter process significantly reduces online retrieval time. 10 original shapes are randomly selected as the queries. For each query shape, a group of people are asked to select its similar shapes from the original database, and the affine, zoom, mirror and flip shapes generated from these similar shapes are also treated as the similar shapes to the query shape. The average precision and recall performance of the 10 retrievals are shown in Figure 2. We have also attempted to combine FD and ZMD, CSSD and ZMD, however, no significant gain has been obtained, therefore, they are not shown in the figure.

precision-recall.gif

Figure 2. Retrieval results from four shape descriptors. The numbers are in percentage.

From the experiment, it has been found that none of the region based methods outperforms the contour based methods significantly although region based methods are often claimed to be more robust in discriminating shapes. All the four methods tend to have relatively low precision when the database is large (refer to [LS99]). Although ZMD is used in MPEG-7 as region-based shape descriptors for complex shapes such as trade marks, it is also suitable for contour shapes. It performs slightly better than other methods. On average, FD is better than CSSD, while FD is much easier to derive, match, normalize and more compact compared with CSSD. Therefore, we view that FD should also be included as shape descriptors in MPEG-7. Although it is reported in [LS99] that GD outperforms FD, however, the database is small and the shapes are mostly synthetic polygons. In our experiment, retrieval using GD has a fast drop on precision and recall. In fact, it usually can only hit those scaled (zoomed), mirror and flip shapes, shapes with even slight skew are missed out from the retrieval.

It should be noted that the retrieval performance of these shape descriptors depends on shape database. Ideally, all retrieval test should be done on a standard shape database. In MPEG-7, there are two shape databases, one consisted of trade marks is for region-based shape test, it is not publicly available. The other is for contour-based shape test, it is consisted of only marine fishes which may be judged by ordinary observers as being too many similarities in most cases. In order to test the practicality of these shape descriptors in retrieval, distorted shapes have been generated and other types of shapes have been added into the database. The database issue is a common issue in CBIR, it needs to be further addressed in future research.

Back to Retrieval Experiments

Back to Table of Contents


Conclusions

In this paper we have made a study and a comparison on four shape descriptors using for shape retrieval. Results show that in terms of affine invariance, robustness, compactness, low computation complexity, ZMD and FD get more credits than the other two methods. Although ZMD lacks the perceptual meanings as reflected in FD and CSSD, retrieval results favor its performance. Although CSSD capture strong perceptual shape features, many negative factors have affected its performance. GD are suitable for cases where very similar shapes are wanted.

Back to Table of Contents


Bibliography

[Davies97]
E. R. Davies, Machine Vision: Theory, Algorithms, Practicalities, Academic Press, 1997.
[FS78]
H. Freeman and A. Saghri, "Generalized Chain Codes for Planar Curves," In Proceedings of the 4th International Joint Conference on Pattern Recognition, Kyoto, Japan, November 7-10, 1978, pp.701-703.
[Gu95]
Chuang Gu. "Multivalued Morphology and Segmentation-based Coding." Ph.D thesis of Signal Processing Lab. (LTS) of Swiss Federal Institute of Technology at Lausanne (EPFL), 1995.
[Hu62]
Ming-Kuei Hu, "Visual pattern Recognition by Moment Invariants," IRE Transactions on Information Theory, IT-8:179-187, 1962.
[KSP95]
Hannu Kauppinen, Tapio Seppanen and Matti Pietikainen, "An Experimental Comparison of Autoregressive and Fourier-Based Descriptors in 2D Shape Classification," IEEE Trans. PAMI, 17(2):201-207, 1995
[LS99]
G. J. Lu and A. Sajjanhar, "Region-based shape representation and similarity measure suitable for content-based image retrieval," Multimedia System 7:165-174, 1999.
[MAK96]
F. Mokhtarian, S. Abbasi and J. Kittler, "Efficient and Robust Retrieval by Shape Content through Curvature Scale Space," Int. Workshop on Image DataBases and Multimedia Search, Amsterdam, The Netherlands, 1996, pp. 35-42.
[MKL97]
B.M.Mehtre, Mohan S, Kankanhalli and Wing Foon Lee, "Shape Measures for Content based Image Retrieval: A Comparison," Information Processing & Management, 33(3): 319-337, 1997.
[MM86]
Farzin Mokhtarian and Alan Mackworth, "Scale-Based Description and Recognition of Planar Curves and Two-Dimensional Shapes," IEEE PAMI, 8(1):34-43, 1986
[PF77]
Eric Persoon and King-sun Fu, "Shape Discrimination Using Fourier Descriptors," IEEE Trans. On Systems, Man and Cybernetics, SMC-7(3):170-179, 1977.
[SQUID]
http://www.ee.surrey.ac.uk/Research/VSSP/ imagedb
[Teague80]
Michael Reed Teague, "Image Analysis Via the General theory of Moments," Journal of Optical Society of America, 70(8):920-930, 1980.
[ZL01]
D. S. Zhang and G. J. Lu, "Shape Retrieval Using Fourier Descriptors," Int. Conference on Multimedia and Distance Education, Fargo, ND, USA, June 2001
[ZR72]
Charles T. Zahn and Ralph Z, "Roskies. Fourier Descriptors for Plane closed Curves," IEEE Trans. On Computer, c-21(3):269-281, 1972.

Back to Table of Contents