IMAGE Team - The Pantheon project

# Tutorial: Evaluation and ranking of image segmentation algorithms

## Identification

This tutorial describes a method for ranking several image segmentation algorithms according to their performances. The originality of the method is that it takes into account the requirements of the segmentation task to calculate the performance measures and perform the ranking. The method is described by way of a running example.

# Contents

## I. Method

Performance evaluation is based on discrepancy measures between segmentation results and reference segmentations called ground truths.

Vocabulary: For the sake of clarity, the connected components that compose segmentation results are named "segments" and the connected components that compose ground truths are named "regions".

Prerequisite:
• The test images are supposed to be stored in a folder named "sourceimages".
• The results of each segmentation algorithm are stored in another folder, for example "resultimages". Inside, each algorithm has its own subfolder, for example "algo001", "algo002", etc. Subfolders are organized like the "sourceimages" folder, with the same image names.
• The ground truths of each expert are stored in a third folder, for example "groundtruths". Each expert has its own subfolder, for example "expert001", "expert002", etc. Subfolders are organized like the "sourceimages" folder, with the same image names.
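With hypothetical image and subfolder names (the actual names are free, as long as they match across folders), the expected layout looks like:

```
sourceimages/
  image001.ppm
  image002.ppm
resultimages/
  algo001/
    image001.ppm
    image002.ppm
  algo002/
    ...
groundtruths/
  expert001/
    image001.ppm
    image002.ppm
  expert002/
    ...
```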

### I.1. Performance indicators

Performances are evaluated with five indicators, four geometric and one topological. Each indicator is measured by two discrepancy errors between the segmentation results and the related ground truths. Depending on the segmentation task, the performance value for the indicator is calculated by a tradeoff between the two error values. The evaluator should choose the tradeoff type among the six propositions given for each indicator below.

1. Detection accuracy.

This indicator is concerned with how many regions of the ground truth have been detected by the segments of the segmentation result. A segment detects a region if a part of the segment covers a part of the region. There are possibly two types of error. The false negatives are regions of the ground truth that are not covered by a segment. The false positives are segments that do not cover any region.

If the detection accuracy indicator is deemed important for a given application, then it is necessary to specify which of the two previous errors is the most acceptable. More precisely, it is required to decide which of the following propositions fits the application best.

 1. Both errors are acceptable, meaning do not consider any error for this indicator.
 2. Both errors are undesirable, meaning penalize under-detection and over-detection errors equally.
 3. Prefer false negatives, meaning penalize errors due to over-detection more than those due to under-detection.
 4. Prefer false positives, meaning penalize errors due to under-detection more.
 5. Do not penalize false negatives, meaning penalize only over-detection.
 6. Do not penalize false positives, meaning penalize only under-detection.
2. Fragmentation consistency.

Ideally, each region should be detected by exactly one segment and each segment should detect exactly one region. But two errors can occur: over-segmentation corresponds to the subdivision of a region into multiple segments, and under-segmentation corresponds to the grouping of multiple regions into one segment.

As far as fragmentation consistency is concerned for a given application, one of the following propositions must be specified regarding the task objectives.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer under-segmentation.
 4. Prefer over-segmentation.
 5. Do not penalize under-segmentation.
 6. Do not penalize over-segmentation.
3. Boundary precision.

Boundary precision is an important concern for many applications. The maximum precision is achieved when the ground truth regions are covered exactly by the related segments. The two types of errors are the excess of pixels, when segment pixels are positioned outside the related region boundary, and the deficit of pixels, when region pixels are positioned outside the related segment boundary.

Assessment of boundary precision must be done regarding one of the following propositions.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer deficit of pixels.
 4. Prefer excess of pixels.
 5. Do not penalize deficit of pixels.
 6. Do not penalize excess of pixels.
4. Shape fidelity.

Shape fidelity assesses how far the region shape is from the related segment shape. It measures to what extent the excess and the deficit of pixels affect the shape of the detected regions: are the mis-segmented pixels spread along the region boundary, which slightly modifies the shape, or are they concentrated in extra areas, which modifies the shape significantly? This indicator distinguishes discrepancy errors due to area commission, which affect the region exterior, from those due to area omission, which affect the region interior.

If fidelity to the shape is an important indicator, preference should be given to one of the following propositions.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer shape error due to area omission.
 4. Prefer shape error due to area commission.
 5. Do not penalize area omission.
 6. Do not penalize area commission.
5. Topology preservation.

The topology criterion consists of counting the number of holes. Indeed, in 2D, two regions are topologically equivalent if they have the same number of inner holes.

An added hole is a connected component of false negative pixels that lies completely inside a segment, i.e., a connected component that has only one neighboring segment (the background counting as a segment). So, a hole in a segment that does not intersect the corresponding regions is not an added hole. Similarly, a deleted hole is a connected component of false positive pixels that lies completely inside the related region.

The two possible errors are the number of hole additions and the number of hole deletions in the detected regions. The related propositions are the following.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer hole addition.
 4. Prefer hole deletion.
 5. Do not penalize addition of holes.
 6. Do not penalize deletion of holes.

### I.2. Computing the performance indicator values

Each indicator value is computed using a tradeoff between the two related antagonistic errors, regarding what the evaluator defined as the most acceptable error for the application. The tradeoff implementation depends on the acceptable error chosen for the indicator by the evaluator:

1. Both errors are acceptable. This leads to excluding this indicator from the ranking:

   $indicator_i = 0$

2. Both errors are undesirable. The two errors are equally penalized:

   $indicator_i = 1 - \left( \frac{0.5}{1 - error_2} + \frac{0.5}{1 - error_1} \right)^{-1}$

3. Prefer error1 to error2. Error2 is penalized twice as much as error1:

   $indicator_i = 1 - \left( \frac{0.8}{1 - error_2} + \frac{0.2}{1 - error_1} \right)^{-1}$

4. Prefer error2 to error1. On the contrary, error1 is penalized twice as much as error2:

   $indicator_i = 1 - \left( \frac{0.2}{1 - error_2} + \frac{0.8}{1 - error_1} \right)^{-1}$

5. Do not penalize error1. Only the error2 value is used:

   $indicator_i = error_2$

6. Do not penalize error2. Only the error1 value is used:

   $indicator_i = error_1$
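The six tradeoff cases above can be sketched in Python (a minimal illustration, not part of the Pandore distribution; the function name and the 1-6 choice encoding are assumptions):

```python
def indicator(error1, error2, choice):
    """Tradeoff between the two antagonistic error values of one indicator.

    `choice` is the acceptable-error type (1-6) selected by the evaluator.
    """
    if choice == 1:
        return 0.0            # both errors acceptable: indicator excluded
    if choice == 5:
        return error2         # do not penalize error1
    if choice == 6:
        return error1         # do not penalize error2
    # weights on the error2 and error1 terms of the harmonic tradeoff
    w2, w1 = {2: (0.5, 0.5),  # both undesirable: equal penalty
              3: (0.8, 0.2),  # prefer error1: penalize error2 twice as much
              4: (0.2, 0.8),  # prefer error2: penalize error1 twice as much
              }[choice]
    return 1.0 - 1.0 / (w2 / (1.0 - error2) + w1 / (1.0 - error1))
```

For instance, with a recall error of 0.0 and a precision error of 0.083 under choice #4, the indicator evaluates to about 0.018.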

### I.3. Ranking

Finally, the algorithm ranking is performed using the five computed indicators.

After specifying the most acceptable error for each indicator, the evaluator should then set priorities among these indicators regarding the specific needs of the application; ties are possible.

To make the ranking, the priorities are first converted into weights assigned to each indicator. This is done by distributing the weights over the range [0..1] according to the "Rank Order Centroid" approach. The weight $w_i$ of the indicator with priority $i$ is obtained by:

$w_i = \frac{1}{n} \sum_{j=i}^{n} \frac{1}{j}$

For example, if $n=5$ then $w_1=0.4567$, $w_2=0.2567$, $w_3=0.1567$, $w_4=0.0900$, and $w_5=0.0400$.
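As a check, the Rank Order Centroid weights can be computed directly (a short Python sketch, not part of Pandore):

```python
def roc_weights(n):
    """Rank Order Centroid weights: w_i = (1/n) * sum_{j=i}^{n} 1/j, for i = 1..n."""
    return [sum(1.0 / j for j in range(i, n + 1)) / n for i in range(1, n + 1)]

# For n = 5 this reproduces the weights quoted above:
# [0.4567, 0.2567, 0.1567, 0.0900, 0.0400] (rounded to 4 decimals); they sum to 1.
```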

The final ranking is given, for each algorithm $i$, by the weighted sum of its rank $r_{ij}$ for each indicator $j$ multiplied by the corresponding weight $w_j$:

$r_i = \sum_{j=1}^{n} r_{ij} \cdot w_j$
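Putting the two steps together, the final ordering can be sketched as follows (a Python illustration with hypothetical ranks for three algorithms over n = 3 prioritized indicators):

```python
# Per-indicator ranks r_ij of three hypothetical algorithms
# (indicator order = priority order).
ranks = {"A": [1, 3, 2], "B": [2, 1, 1], "C": [3, 2, 3]}

# Rank Order Centroid weights for n = 3 priorities.
n = 3
w = [sum(1.0 / j for j in range(i, n + 1)) / n for i in range(1, n + 1)]

# Weighted rank sum r_i = sum_j r_ij * w_j; the smallest sum gives the best algorithm.
scores = {a: sum(r_ij * w_j for r_ij, w_j in zip(r, w)) for a, r in ranks.items()}
ordering = sorted(scores, key=scores.get)   # best algorithm first
```

Note that a low weighted rank sum is better: the algorithm ranked first on the highest-priority indicator is strongly favored.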

## II. Example

### II.1. Description of the application

This experiment is a biological application in the context of cancer screening. The purpose is to assist cytotechnologists in finding suspect or abnormal cells from the analysis of cytology images. The image segmentation goal is to locate each serous cell as an individual region, as presented in Figure 1.2. These regions will then be fed into a classifier which has been trained to identify cell types.

The complete definition of the problem needs to specify which error is the most acceptable for each of the five indicators. In our study case, these acceptable errors are suggested by the classification requirements:

• Detection accuracy: acceptable error #4: Prefer false positives.
The images contain only a small number of cells and even fewer suspect or abnormal cells (around 2% of the cells). So, under-detection is a serious error, as it means that the segmentation could have missed some suspect cells. Over-detection is a less damaging error: the false positive regions will later be discarded by the classifier.
• Fragmentation consistency: acceptable error #2: both errors are undesirable.
Over-segmentation destroys cell integrity and under-segmentation creates cell aggregates. Both errors lead to a misclassification of the related regions.
• Boundary precision: acceptable error #3: prefer deficit of pixels.
Since the classifier uses photometric features of regions, excess of pixels biases the feature values with background pixels.
• Shape fidelity: acceptable error #2: both errors are undesirable.
Region shape features (e.g., roundness, eccentricity) are also used by the classifier. So, commission and omission shape errors should be penalized equally.
• Topology preservation: acceptable error #1: both errors are acceptable.
Holes are not considered for this application.
Figure 1.1. A cytology image with serous cells and red blood corpuscles.
Figure 1.2. The ground truth is superimposed in white.

### II.2. Evaluation of image segmentation algorithm performances

The Pandore script used to perform evaluation of algorithm performances is:

```
passesssegmentationalgorithm -v \
0 \
0.5 \
cytology/resultimages/a001 \
cytology/groundtruths \
/tmp/detail_errors_algo1.pan \
/tmp/total_errors_algo1.pan
```

The following table (Table 1) gives the ten discrepancy errors resulting from the comparison of the segmentation result shown in Figure 2.3 against the two ground truths shown in Figures 2.1 and 2.2. The matching algorithm used (type 0) allows one region to be matched by n segments and one segment to match n regions. The detection threshold is set to 50% (0.50) in order to keep only almost complete cells. A lower threshold would result in too fragmented cells and would not allow proper identification by the classifier.

| Indicator | Error | Value |
|---|---|---|
| Detection accuracy | error11: Recall error | 0.000 |
| Detection accuracy | error12: Precision error | 0.083 |
| Fragmentation consistency | error21: Under-segmentation error | 0.196 |
| Fragmentation consistency | error22: Over-segmentation error | 0.000 |
| Boundary precision | error31: Deficit error | 0.112 |
| Boundary precision | error32: Excess error | 0.046 |
| Shape fidelity | error41: Shape omission error | 0.262 |
| Shape fidelity | error42: Shape commission error | 0.241 |
| Topology preservation | error51: Hole addition error | 0.000 |
| Topology preservation | error52: Hole deletion error | 0.000 |

The interpretation of the results is as follows:

• Detection accuracy. All the ground truth regions have been detected, but 8.3% of the segments are false positives. Actually, 2 segments out of 24 are false positives.

• Fragmentation consistency. The segmentation result suffers from an under-segmentation defect only. The under-segmentation error value of 0.196 indicates that there are $2^{0.196/(1-0.196)} = 1.184$ regions per correct segment. Actually, 3 segments out of 22 merge two regions each.

• Boundary precision. The boundary precision errors are mostly due to pixel deficit. The segment boundaries are placed rather inside the corresponding regions: 11.2% of the correctly detected region area is not detected, and 4.6% of the correct segment area is spurious.

• Shape fidelity. The mis-segmented pixels, whether due to pixel deficit or pixel excess, modify the region shape only slightly. The deficit pixels affect the shape more than the excess pixels. The average distance between the omission area boundary and the correct region boundary is $10^{0.262/(1-0.262)} - 1 = 1.3$ pixels, and the average distance between the commission area boundary and the correct region boundary is $10^{0.241/(1-0.241)} - 1 = 1.1$ pixels.

• Topology preservation. There are no hole addition or hole deletion errors.
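The interpretive quantities used above follow directly from the error values (a small Python sketch of the two inverse transforms quoted in the text):

```python
def regions_per_segment(under_seg_error):
    """Average number of ground-truth regions merged into one correct segment."""
    return 2 ** (under_seg_error / (1 - under_seg_error))

def mean_boundary_distance(shape_error):
    """Average distance, in pixels, between the mis-segmented area boundary
    and the correct region boundary."""
    return 10 ** (shape_error / (1 - shape_error)) - 1

# With the error values reported above:
# regions_per_segment(0.196)    -> about 1.184 regions per segment
# mean_boundary_distance(0.262) -> about 1.3 pixels (omission)
# mean_boundary_distance(0.241) -> about 1.1 pixels (commission)
```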

Figure 2.1. A cytology image with serous cells and red blood corpuscles. A first ground truth is superimposed in white.
Figure 2.2. The same cytology image with a second ground truth superimposed in white.
Figure 2.3. The segmentation result of the algorithm A superimposed on the source image.

### II.3. Ranking of Image Segmentation Algorithm

We consider five different image segmentation algorithms, respectively named A, B, C, D and E, and one result yielded by each algorithm while segmenting the image given in Figure 3.1. The results are presented in Figures 3.2 to 3.6.

Figure 3.1. A cytology image with serous cells and red blood corpuscles. The first ground truth is superimposed in white.
Figure 3.2. The segmentation result of the algorithm A superimposed on the source image.
Figure 3.3. The segmentation result of the algorithm B superimposed on the source image.
Figure 3.4. The segmentation result of the algorithm C superimposed on the source image.
Figure 3.5. The segmentation result of the algorithm D superimposed on the source image.
Figure 3.6. The segmentation result of the algorithm E superimposed on the source image.

The algorithm performances are given in Table 2 below.

| Error | A | B | C | D | E |
|---|---|---|---|---|---|
| Error11: Recall error | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Error12: Precision error | 0.083 | 0.077 | 0.000 | 0.000 | 0.000 |
| Error21: Under-segmentation error | 0.196 | 0.154 | 0.292 | 0.000 | 0.000 |
| Error22: Over-segmentation error | 0.000 | 0.000 | 0.000 | 0.189 | 0.401 |
| Error31: Deficit error | 0.112 | 0.045 | 0.000 | 0.004 | 0.090 |
| Error32: Excess error | 0.046 | 0.017 | 0.349 | 0.112 | 0.005 |
| Error41: Shape omission error | 0.262 | 0.250 | 0.000 | 0.231 | 0.275 |
| Error42: Shape commission error | 0.241 | 0.258 | 0.341 | 0.270 | 0.255 |
| Error51: Hole addition error | 0 | 0 | 0 | 0 | 0 |
| Error52: Hole deletion error | 0 | 0 | 0 | 0 | 0 |

Using the acceptable errors and the indicator priorities given below, the ranking is shown in Table 3:

• Priority 1: Detection accuracy with preference to false positives (acceptable error #4).
• Priority 2: Fragmentation consistency while penalizing equally over-segmentation and under-segmentation (acceptable error #2).
• Priority 3: Boundary precision with preference to pixel deficit (acceptable error #3).
• The other indicators are not taken into account.
| | A | B | C | D | E |
|---|---|---|---|---|---|
| Indicator1 | 0.018 | 0.016 | 0.000 | 0.000 | 0.000 |
| Indicator2 | 0.108 | 0.084 | 0.171 | 0.100 | 0.251 |
| Indicator3 | 0.060 | 0.023 | 0.300 | 0.092 | 0.024 |
| Indicator4 | 0 | 0 | 0 | 0 | 0 |
| Indicator5 | 0 | 0 | 0 | 0 | 0 |
| Rank | 5 | 1 | 4 | 2 | 3 |

The ordering given in Table 3 indicates that algorithms B and D are the best for this application. B is better on average, but D performs well on the indicator with the highest priority. For this application, both B and D are relevant.

The Pandore script used to rank the algorithms is:

```
pranksegmentationalgorithmsfromfolders \
0 0.5 \
4 1 \
2 2 \
3 3 \
1 4 \
1 4 \
cytology/resultimages/ cytology/groundtruths \
/tmp/indicators2.pan /tmp/rank2.pan
```

## III. Code

Evaluation and ranking programs are provided with the Pandore distribution. The code provided below does not require Pandore. It loads label images in ppm format (see example). However, the code can easily be adapted to other image formats by changing the image loading functions in the file myinputoutput.h

(last update 08/27/15)

### III.2. Compilation

• Unix / Linux / macOS X:
```
make -f makefile.unix pranksegmentationalgorithmsfromfolders
```
• Windows with Visual C++:
```
nmake -f makefile.msdos pranksegmentationalgorithmsfromfolders.exe
```

### III.3. Running

#### Usage:

```
pranksegmentationalgorithmsfromfolders [-v] matching_algorithm_id matching_threshold acceptable_error1 priority1 acceptable_error2 priority2 acceptable_error3 priority3 acceptable_error4 priority4 acceptable_error5 priority5 segmentation_result_path ground_truth_path result_indicators result_ranks
```

#### Example

```
pranksegmentationalgorithmsfromfolders \
0 0.5 \
4 1 \
2 2 \
3 3 \
1 4 \
1 4 \
./example/resultimages/ \
./example/groundtruths \
/tmp/indicators2.txt /tmp/rank2.txt
```

The Pantheon project
Image Team GREYC Laboratory
UMR CNRS 6072 - ENSICAEN - University of Caen, France