IEEE International Conference on Multimedia & Expo 2001
Tokyo, Japan
Electronic Proceedings
(C) 2001 IEEE
Keywords: Fourier descriptors, curvature scale space, Zernike moments, grid, CBIR, shape
Shape boundary is a set of coordinates (xi, yi), i = 1, 2, ..., L, which are extracted out in the preprocessing stage by contour tracing technique. The centroid distance function is expressed by the distance of the boundary points from the centroid (xc, yc) of the shape
where xc, yc are averages of x coordinates and y coordinates respectively. Due to the subtraction of centroid (which represents the position of the shape) from boundary coordinates, the centroid distance representation is invariant to shape translation.
In order to apply Fourier transform, all the shapes in database are normalized to the same number of boundary points. For the shape signature described above, assuming it is normalized to N points (128 in our case) in the normalization stage, the discrete Fourier transform of ri, i = 0, 1, ..., N-1 is then given by
The coefficients un, n = 0, 1, ..., N-1, are usually called Fourier descriptors (FD) of the shape, denoted as FDn, n = 0, 1, ..., N-1.
The FDs acquired in this way is translation invariant due to the translation invariance of centroid distance. To achieve rotation invariance, phase information of the FDs are ignored and only the magnitudes |FDn| are used. Scale invariance is achieved by dividing the magnitudes by the DC component, i.e., |FD0|. Since centroid distance is a real value function, only half of the FDs are needed to index the shapes. Finally, the following feature vector are used as the Fourier descriptors to index the shape
the similarity measure of the query shape and a target shape in the database is simply the Euclidean distance between the query and the target shape feature vectors.
The nice properties of Fourier descriptors are its robustness, being able to capture some perceptual characteristics of the shape and easy to derive. With Fourier descriptors, coarse shape features or global shape features are captured by lower order coefficients and the finer shape features are captured by higher order coefficients. Noise is not a problem with Fourier descriptors, for noise only appears in very high frequencies which are truncated out. Slight changes around the shape boundary doesn't cause much difference in the final representation. With fast Fourier transform (FFT), the computation is efficient. Because usually, only a small number of low order coefficients are enough to capture the overall shape features, the representation is also compact. The disadvantage of Fourier descriptors is local features cannot be located, because in Fourier transform only the magnitudes of the frequencies, not the locations, are known.
where
and
are the first and
the second derivatives at location i respectively. Curvature zero-cross points are then located in
the shape boundary. The shape is then evolved into next scale by applying Gaussian smooth:
where Å means convolution, and g(i, s) is Gaussian function. As s increases, the evolving shape becomes smoother and smoother. New curvature zero-crossing points are located at the new scale. This process continues until no curvature zero-crossing points are found. The final CSS contour map is composed of all zero-crossing points zc(i, s), where i is the location s is the scale at which the zc is obtained. The peaks, or the maxima of the CSS contour map (only those peaks higher than the threshold are considered) are then extracted out and sorted in descending order as CSS descriptors to index the shape.
CSS descriptors is translation invariant. Scale invariance is achieved by normalizing all the shapes into fixed number of boundary points (128 in our case). Rotation invariance is achieved by circular shifting the highest peak to the origin of the CSS map. The similarity between two shapes is measured by the sum of the peak differences between all the matched peaks and the peak values of all the unmatched peaks [MAK96]. The difficulties of matching between two set of CSS descriptors are that the dimensions of the two set of CSS descriptors are usually different and the CSS peaks of two similar shapes are usually not matching. Besides, the peak orders of two similar shapes can be quite different. All these factors have to be considered during matching process. Mirror shapes and flipped shapes need also to be considered separately.
The powerfulness of CSS descriptors owes to its ability to capture those key local features such as the locations and the degrees of convexity (or concavity) of curve segments on the shape boundary, these features are very important to human perception in judging the similarity between shapes. The dimension of CSS descriptors is very low and the representation is robust due to Gaussian smooth process. But there are still some problems associate with CSS descriptors. In addition to the above mentioned matching difficulties, it only captures the local shape features, the global features which are also important to shape representation are missed out from the representation. As such, global features such as eccentricity, circularity and number of CSS peaks should be combined to form more practical descriptors. There may be no CSS descriptors for smooth convex shapes such as polygon composed of straight lines. Besides, due to the boundary sampling and the thresholding in the extraction of CSS peaks , the CSS descriptors may not reflect the true numbers of convex (or concave) segments on the shape boundary. All these factors affect the retrieval result.
and
where n and m are subject to n-|m| = even, |m|£ n. Zernike polynomials are a complete set of complex-valued function orthogonal over the unit disk, i.e., x2 + y2 = 1. Then the complex Zernike moments of order n with repetition m are defined as:
The theory of Zernike moments is similar to that of Fourier transform, to expand a signal into series of orthogonal basis. The precision of shape representation depends on the number of moments truncated from the expansion, the first 36 moments are used in our implementation.
Since Zernike basis functions take the unit disk as their domain, this disk must be specified before moments can be calculated. In our implementation, all the shapes are normalized into a unit circle of fixed radius of 64. The unit disk is then centered on the shape centroid. This makes the obtained moments scale and translation invariant. Rotation invariance is achieved by only using magnitudes of the moments. The magnitudes are then normalized by dividing them by the mass of the shape. The similarity between two shapes indexed with Zernike moments descriptors is the Euclidean distance between the two Zernike moments vectors.
Zernike moments descriptors does not need to know boundary information, making it suitable for more complex shape representation. Like Fourier descriptors, Zernike moments descriptors can be constructed to arbitrary order, this overcomes the drawback of geometric moments in which higher order moments are difficult to construct. However, Zernike moments descriptors lose the important perceptual meanings as those reflected in Fourier descriptors and geometric moments. In other words, we don't know what shape feature each Zernike moment represents.
For two shapes to be comparable using grid descriptors, several normalization processes have to be done to achieve scale, rotation and translation invariance. It begins with finding out the major axis, i.e., the line joining the two furthest points on the boundary. Rotation normalization is achieved by turning the shape so that the major axis is parallel with x-axis. To avoid multi normalization results for mirrored shape and flipped shape, the centroid of the rotated shape may be restricted to the lower-left part, or a mirror and a flip operation on the shape number are applied in the matching stage. Scale normalization can be done by resizing the shape so that the length of the major axis is equal to the grid width, and by shift the shape to the upper-left of the grid, the representation is translation invariant. The distance between two set of grid descriptors is simply the number of elements having different values. For example, the grid descriptors for the two shapes in Figure 1 (a) and (b) are 001111000 011111111 111111111 111111111 111110011 001100011 and 001100000 011100000 111100000 111100000 011111100 000111000 respectively, and the distance between the two shapes will be 27.
Figure 1. Grid representation of shape
Grid representation is a straightforward shape representation which may be suitable for shape coding as is adopted in MPEG-4. However, for retrieval purpose, it is questionable, because a slight shape distortion, such as affine transform, can cause very big difference in the similarity measurement. Furthermore, since the normalizations are mainly based on major axis (which is unreliable in essence) and eccentricity (which is only reliable for convex shapes or compact shapes), shapes otherwise similar may be treated as different due to this normalization. For example, the two shapes in Figure 1 (c) and (d) are perceptually similar, but are very different under grid representation, for the major axis of shape (c) is horizontal while the major axis of shape (d) will be vertical. The online retrieval also involves high computation due to the usually high dimensionality of the feature vectors (for a shape of 192x192 pixels using cell size of 12x12 pixels, the dimension is 196).
Figure 2. Retrieval results from four shape descriptors. The numbers are in percentage.
From the experiment, it has been found that none of the region based methods outperforms the contour based methods significantly although region based methods are often claimed to be more robust in discriminating shapes. All the four methods tend to have relatively low precision when the database is large (refer to [LS99]). Although ZMD is used in MPEG-7 as region-based shape descriptors for complex shapes such as trade marks, it is also suitable for contour shapes. It performs slightly better than other methods. On average, FD is better than CSSD, while FD is much easier to derive, match, normalize and more compact compared with CSSD. Therefore, we view that FD should also be included as shape descriptors in MPEG-7. Although it is reported in [LS99] that GD outperforms FD, however, the database is small and the shapes are mostly synthetic polygons. In our experiment, retrieval using GD has a fast drop on precision and recall. In fact, it usually can only hit those scaled (zoomed), mirror and flip shapes, shapes with even slight skew are missed out from the retrieval.
It should be noted that the retrieval performance of these shape descriptors depends on shape database. Ideally, all retrieval test should be done on a standard shape database. In MPEG-7, there are two shape databases, one consisted of trade marks is for region-based shape test, it is not publicly available. The other is for contour-based shape test, it is consisted of only marine fishes which may be judged by ordinary observers as being too many similarities in most cases. In order to test the practicality of these shape descriptors in retrieval, distorted shapes have been generated and other types of shapes have been added into the database. The database issue is a common issue in CBIR, it needs to be further addressed in future research.