IMAGE Team - The Pantheon project

# Tutorial: Evaluation and ranking of image segmentation algorithms

## Identification

This tutorial describes a method for ranking several image segmentation algorithms according to their performances. The originality of the method is that it takes into account the requirements of the segmentation task to calculate the performance measures and perform the ranking. The method is described by way of a running example.

# Contents

## I. Method

Performance evaluation is based on discrepancy measures between segmentation results and reference segmentations called ground truths.

Vocabulary: For the sake of clarity, the connected components that compose segmentation results are named "segments" and the connected components that compose ground truths are named "regions".

Prerequisite:
• The test images are supposed to be stored in a folder named "sourceimages".
• The results of each segmentation algorithm are stored in another folder, for example "resultimages". Inside, each algorithm has its own subfolder, for example "algo001", "algo002", etc. Subfolders are organized like the "sourceimages" folder, with the same image names.
• The ground truths of each expert are stored in a third folder, for example "groundtruths". Each expert has its own subfolder, for example "expert001", "expert002", etc. Subfolders are organized like the "sourceimages" folder, with the same image names.
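With hypothetical image and subfolder names (the actual names are free, as long as they match across folders), the expected layout looks like:

```
sourceimages/
  image001.ppm
  image002.ppm
resultimages/
  algo001/
    image001.ppm
    image002.ppm
  algo002/
    ...
groundtruths/
  expert001/
    image001.ppm
    image002.ppm
  expert002/
    ...
```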

### I.1. Performance indicators

Performances are evaluated with five indicators, four geometric and one topological. Each indicator is measured by two discrepancy errors between the segmentation results and the related ground truths. Depending on the segmentation task, the performance value for the indicator is calculated by a tradeoff between the two error values. The evaluator should choose the tradeoff type among the six propositions given for each indicator below.

1. Detection accuracy.

This indicator is concerned with how many regions of the ground truth have been detected by the segments of the segmentation result. A segment detects a region if a part of the segment covers a part of the region. There are possibly two types of error. The false negatives are regions of the ground truth that are not covered by a segment. The false positives are segments that do not cover any region.

If the detection accuracy indicator is deemed important for a given application, then it is necessary to specify which of the two previous errors is the most acceptable. More precisely, it is required to decide which of the following propositions fits the application best.

 1. Both errors are acceptable, meaning do not consider any error for this indicator.
 2. Both errors are undesirable, meaning penalize under-detection and over-detection errors equally.
 3. Prefer false negatives, meaning penalize errors due to over-detection more than those due to under-detection.
 4. Prefer false positives, meaning penalize errors due to under-detection more.
 5. Do not penalize false negatives, meaning penalize only over-detection.
 6. Do not penalize false positives, meaning penalize only under-detection.
2. Fragmentation consistency.

Ideally, each region should be detected by exactly one segment and each segment should detect exactly one region. But two errors can occur: over-segmentation corresponds to the subdivision of a region into multiple segments, and under-segmentation corresponds to the grouping of multiple regions into one segment.

As far as fragmentation consistency is concerned for a given application, one of the following propositions must be specified regarding the task objectives.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer under-segmentation.
 4. Prefer over-segmentation.
 5. Do not penalize under-segmentation.
 6. Do not penalize over-segmentation.
3. Boundary precision.

Boundary precision is an important concern for many applications. The maximum precision is achieved when the ground truth regions are covered exactly by the related segments. The two types of errors are the excess of pixels, when segment pixels are positioned outside the related region boundary, and the deficit of pixels, when region pixels are positioned outside the related segment boundary.

Assessment of boundary precision must be done regarding one of the following propositions.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer deficit of pixels.
 4. Prefer excess of pixels.
 5. Do not penalize deficit of pixels.
 6. Do not penalize excess of pixels.
4. Shape fidelity.

Shape fidelity assesses how far the region shape is from the related segment shape. It measures to what extent the excess and the deficit of pixels affect the shape of the detected regions: are the mis-segmented pixels spread along the region boundary, which slightly modifies the shape, or are they concentrated in extra areas, which modifies the shape significantly? This indicator distinguishes discrepancy errors due to area commission, which affect the region exterior, from those due to area omission, which affect the region interior.

If fidelity to the shape is an important indicator, preference should be given to one of the following propositions.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer shape error due to area omission.
 4. Prefer shape error due to area commission.
 5. Do not penalize area omission.
 6. Do not penalize area commission.
5. Topology preservation.

The topology criterion consists of counting the number of holes. Indeed, in 2D, two regions are topologically equivalent if they have the same number of inner holes.

An added hole is a connected component of false negative pixels that lies completely inside a segment, i.e., a connected component that has only one neighboring segment (the background counting as a segment). So, a hole in a segment that does not intersect the corresponding regions is not an added hole. Similarly, a deleted hole is a connected component of false positive pixels that lies completely inside the related region.

The two possible errors are the number of hole additions and the number of hole deletions in the detected regions. The related propositions are the following.

 1. Both errors are acceptable.
 2. Both errors are undesirable.
 3. Prefer hole addition.
 4. Prefer hole deletion.
 5. Do not penalize addition of holes.
 6. Do not penalize deletion of holes.

### I.2. Computing the performance indicator values

Each indicator value is computed using a tradeoff between the two related antagonistic errors, regarding what the evaluator defined as the most acceptable error for the application. The tradeoff implementation depends on the acceptable error chosen for the indicator by the evaluator:

1. Both errors are acceptable. This leads to excluding this indicator from the ranking:

   $indicator_i = 0$

2. Both errors are undesirable. The two errors are equally penalized:

   $indicator_i = 1 - \left( \frac{0.5}{1 - error_2} + \frac{0.5}{1 - error_1} \right)^{-1}$

3. Prefer error1 to error2. Error2 is penalized twice as much as error1:

   $indicator_i = 1 - \left( \frac{0.8}{1 - error_2} + \frac{0.2}{1 - error_1} \right)^{-1}$

4. Prefer error2 to error1. On the contrary, error1 is penalized twice as much as error2:

   $indicator_i = 1 - \left( \frac{0.2}{1 - error_2} + \frac{0.8}{1 - error_1} \right)^{-1}$

5. Do not penalize error1. Only the error2 value is used:

   $indicator_i = error_2$

6. Do not penalize error2. Only the error1 value is used:

   $indicator_i = error_1$
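The six tradeoff cases above can be sketched in Python (a minimal illustration, not part of the Pandore distribution; the function name and the 1-6 choice encoding are assumptions):

```python
def indicator(error1, error2, choice):
    """Tradeoff between the two antagonistic error values of one indicator.

    `choice` is the acceptable-error type (1-6) selected by the evaluator.
    """
    if choice == 1:
        return 0.0            # both errors acceptable: indicator excluded
    if choice == 5:
        return error2         # do not penalize error1
    if choice == 6:
        return error1         # do not penalize error2
    # weights on the error2 and error1 terms of the harmonic tradeoff
    w2, w1 = {2: (0.5, 0.5),  # both undesirable: equal penalty
              3: (0.8, 0.2),  # prefer error1: penalize error2 twice as much
              4: (0.2, 0.8),  # prefer error2: penalize error1 twice as much
              }[choice]
    return 1.0 - 1.0 / (w2 / (1.0 - error2) + w1 / (1.0 - error1))
```

For instance, with a recall error of 0.0 and a precision error of 0.083 under choice #4, the indicator evaluates to about 0.018.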

### I.3. Ranking

Finally, the algorithm ranking is performed using the five computed indicators.

After specifying the most acceptable error for each indicator, the evaluator should then set priorities among these indicators regarding the specific needs of the application; ties are possible.

To make the ranking, the priorities are first converted into weights assigned to each indicator. This is done by distributing the weights over the range [0..1] according to the "Rank Order Centroid" approach. The weight $w_i$ of the indicator with priority $i$ is obtained by:

$w_i = \frac{1}{n} \sum_{j=i}^{n} \frac{1}{j}$

For example, if $n=5$ then $w_1=0.4567$, $w_2=0.2567$, $w_3=0.1567$, $w_4=0.0900$, and $w_5=0.0400$.
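As a check, the Rank Order Centroid weights can be computed directly (a short Python sketch, not part of Pandore):

```python
def roc_weights(n):
    """Rank Order Centroid weights: w_i = (1/n) * sum_{j=i}^{n} 1/j, for i = 1..n."""
    return [sum(1.0 / j for j in range(i, n + 1)) / n for i in range(1, n + 1)]

# For n = 5 this reproduces the weights quoted above:
# [0.4567, 0.2567, 0.1567, 0.0900, 0.0400] (rounded to 4 decimals); they sum to 1.
```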

The final ranking is given, for each algorithm $i$, by the weighted sum of its rank $r_{ij}$ for each indicator $j$ multiplied by the corresponding weight $w_j$:

$r_i = \sum_{j=1}^{n} r_{ij} \cdot w_j$
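Putting the two steps together, the final ordering can be sketched as follows (a Python illustration with hypothetical ranks for three algorithms over n = 3 prioritized indicators):

```python
# Per-indicator ranks r_ij of three hypothetical algorithms
# (indicator order = priority order).
ranks = {"A": [1, 3, 2], "B": [2, 1, 1], "C": [3, 2, 3]}

# Rank Order Centroid weights for n = 3 priorities.
n = 3
w = [sum(1.0 / j for j in range(i, n + 1)) / n for i in range(1, n + 1)]

# Weighted rank sum r_i = sum_j r_ij * w_j; the smallest sum gives the best algorithm.
scores = {a: sum(r_ij * w_j for r_ij, w_j in zip(r, w)) for a, r in ranks.items()}
ordering = sorted(scores, key=scores.get)   # best algorithm first
```

Note that a low weighted rank sum is better: the algorithm ranked first on the highest-priority indicator is strongly favored.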

## II. Example

### II.1. Description of the application

This experiment is a biological application in the context of cancer screening. The purpose is to assist cytotechnologists in finding suspect or abnormal cells from the analysis of cytology images. The image segmentation goal is to locate each serous cell as an individual region, as presented in Figure 1.2. These regions will then be fed into a classifier which has been trained to identify cell types.

The complete definition of the problem needs to specify which error is the most acceptable for each of the five indicators. In our study case, these acceptable errors are suggested by the classification requirements:

• Detection accuracy: acceptable error #4: Prefer false positives.
The images contain only a small number of cells and even fewer suspect or abnormal cells (around 2% of the cells). So, under-detection is a serious error, as it means that the segmentation could have missed some suspect cells. Over-detection is a less damaging error: the false positive regions will later be discarded by the classifier.
• Fragmentation consistency: acceptable error #2: both errors are undesirable.
Over-segmentation destroys cell integrity and under-segmentation creates cell aggregates. Both errors lead to a misclassification of the related regions.
• Boundary precision: acceptable error #3: prefer deficit of pixels.
Since the classifier uses photometric features of regions, excess of pixels biases the feature values with background pixels.
• Shape fidelity: acceptable error #2: both errors are undesirable.
Region shape features (e.g., roundness, eccentricity) are also used by the classifier. So, commission and omission shape errors should be penalized equally.
• Topology preservation: acceptable error #1: both errors are acceptable.
Holes are not considered for this application.
Figure 1.1. A cytology image with serous cells and red blood corpuscles.
Figure 1.2. The ground truth is superimposed in white.

### II.2. Evaluation of image segmentation algorithm performances

The Pandore script used to perform evaluation of algorithm performances is:

```
passesssegmentationalgorithm -v \
0 \
0.5 \
cytology/resultimages/a001 \
cytology/groundtruths \
/tmp/detail_errors_algo1.pan \
/tmp/total_errors_algo1.pan
```

The following table (Table 1) gives the ten discrepancy errors resulting from the comparison of the segmentation result shown in Figure 2.3 against the two ground truths shown in Figures 2.1 and 2.2. The matching algorithm used (type 0) allows one region to be matched by n segments and one segment to match n regions. The detection threshold is set to 50% (0.50) in order to keep only almost complete cells. A lower threshold would result in too fragmented cells and would not allow proper identification by the classifier.

| Indicator | Error | Value |
|---|---|---|
| Detection accuracy | error11: Recall error | 0.000 |
| Detection accuracy | error12: Precision error | 0.083 |
| Fragmentation consistency | error21: Under-segmentation error | 0.196 |
| Fragmentation consistency | error22: Over-segmentation error | 0.000 |
| Boundary precision | error31: Deficit error | 0.112 |
| Boundary precision | error32: Excess error | 0.046 |
| Shape fidelity | error41: Shape omission error | 0.262 |
| Shape fidelity | error42: Shape commission error | 0.241 |
| Topology preservation | error51: Hole addition error | 0.000 |
| Topology preservation | error52: Hole deletion error | 0.000 |

The interpretation of the results is as follows:

• Detection accuracy. All the ground truth regions have been detected, but 8.3% of the segments are false positives. Actually, 2 segments out of 24 are false positives.

• Fragmentation consistency. The segmentation result suffers from an under-segmentation defect only. The under-segmentation error value of 0.196 indicates that there are $2^{0.196/(1-0.196)} = 1.184$ regions per correct segment. Actually, 3 segments out of 22 merge two regions each.

• Boundary precision. The boundary precision errors are mostly due to pixel deficit. The segment boundaries are placed rather inside the corresponding regions: 11.2% of the correctly detected region area is not detected, and 4.6% of the correct segment area is spurious.

• Shape fidelity. The mis-segmented pixels, whether due to pixel deficit or pixel excess, modify the region shape only slightly. The deficit pixels affect the shape more than the excess pixels. The average distance between the omission area boundary and the correct region boundary is $10^{0.262/(1-0.262)} - 1 = 1.3$ pixels, and the average distance between the commission area boundary and the correct region boundary is $10^{0.241/(1-0.241)} - 1 = 1.1$ pixels.

• Topology preservation. There are no hole addition or hole deletion errors.
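The interpretive quantities used above follow directly from the error values (a small Python sketch of the two inverse transforms quoted in the text):

```python
def regions_per_segment(under_seg_error):
    """Average number of ground-truth regions merged into one correct segment."""
    return 2 ** (under_seg_error / (1 - under_seg_error))

def mean_boundary_distance(shape_error):
    """Average distance, in pixels, between the mis-segmented area boundary
    and the correct region boundary."""
    return 10 ** (shape_error / (1 - shape_error)) - 1

# With the error values reported above:
# regions_per_segment(0.196)    -> about 1.184 regions per segment
# mean_boundary_distance(0.262) -> about 1.3 pixels (omission)
# mean_boundary_distance(0.241) -> about 1.1 pixels (commission)
```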

Figure 2.1. A cytology image with serous cells and red blood corpuscles. A first ground truth is superimposed in white.
Figure 2.2. The same cytology image with a second ground truth superimposed in white.
Figure 2.3. The segmentation result of the algorithm A superimposed on the source image.

### II.3. Ranking of Image Segmentation Algorithm

We consider five different image segmentation algorithms, respectively named A, B, C, D and E, and one result yielded by each algorithm while segmenting the image given in Figure 3.1. The results are presented in Figures 3.2 to 3.6.

Figure 3.1. A cytology image with serous cells and red blood corpuscles. The first ground truth is superimposed in white.
Figure 3.2. The segmentation result of the algorithm A superimposed on the source image.
Figure 3.3. The segmentation result of the algorithm B superimposed on the source image.
Figure 3.4. The segmentation result of the algorithm C superimposed on the source image.
Figure 3.5. The segmentation result of the algorithm D superimposed on the source image.
Figure 3.6. The segmentation result of the algorithm E superimposed on the source image.

The algorithm performances are given in Table 2 below.

| Error | A | B | C | D | E |
|---|---|---|---|---|---|
| Error11: Recall error | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Error12: Precision error | 0.083 | 0.077 | 0.000 | 0.000 | 0.000 |
| Error21: Under-segmentation error | 0.196 | 0.154 | 0.292 | 0.000 | 0.000 |
| Error22: Over-segmentation error | 0.000 | 0.000 | 0.000 | 0.189 | 0.401 |
| Error31: Deficit error | 0.112 | 0.045 | 0.000 | 0.004 | 0.090 |
| Error32: Excess error | 0.046 | 0.017 | 0.349 | 0.112 | 0.005 |
| Error41: Shape omission error | 0.262 | 0.250 | 0.000 | 0.231 | 0.275 |
| Error42: Shape commission error | 0.241 | 0.258 | 0.341 | 0.270 | 0.255 |
| Error51: Hole addition error | 0 | 0 | 0 | 0 | 0 |
| Error52: Hole deletion error | 0 | 0 | 0 | 0 | 0 |

Using the acceptable errors and the indicator priorities given below, the ranking is shown in Table 3:

• Priority 1: Detection accuracy with preference to false positives (acceptable error #4).
• Priority 2: Fragmentation consistency while penalizing equally over-segmentation and under-segmentation (acceptable error #2).
• Priority 3: Boundary precision with preference to pixel deficit (acceptable error #3).
• The other indicators are not taken into account.
| | A | B | C | D | E |
|---|---|---|---|---|---|
| Indicator1 | 0.018 | 0.016 | 0.000 | 0.000 | 0.000 |
| Indicator2 | 0.108 | 0.084 | 0.171 | 0.100 | 0.251 |
| Indicator3 | 0.060 | 0.023 | 0.300 | 0.092 | 0.024 |
| Indicator4 | 0 | 0 | 0 | 0 | 0 |
| Indicator5 | 0 | 0 | 0 | 0 | 0 |
| Rank | 5 | 1 | 4 | 2 | 3 |

The ordering given in Table 3 indicates that algorithms B and D are the best for this application. B is better on average, but D performs well on the indicator with the highest priority. For this application, both B and D are relevant.

The Pandore script used to rank the algorithms is:

```
pranksegmentationalgorithmsfromfolders \
0 0.5 \
4 1 \
2 2 \
3 3 \
1 4 \
1 4 \
cytology/resultimages/ cytology/groundtruths \
/tmp/indicators2.pan /tmp/rank2.pan
```

## III. Code

Evaluation and ranking programs are provided with the Pandore distribution. The code provided below does not require Pandore. It loads label images in ppm format (see example). However, the code can easily be adapted to other image formats by changing the image loading functions in the file myinputoutput.h

(last update 08/27/15)

### III.2. Compilation

• Unix / Linux / macOS X:
```
make -f makefile.unix pranksegmentationalgorithmsfromfolders
```
• Windows with Visual C++:
```
nmake -f makefile.msdos pranksegmentationalgorithmsfromfolders.exe
```

### III.3. Running

#### Usage:

```
pranksegmentationalgorithmsfromfolders [-v] matching_algorithm_id matching_threshold acceptable_error1 priority1 acceptable_error2 priority2 acceptable_error3 priority3 acceptable_error4 priority4 acceptable_error5 priority5 segmentation_result_path ground_truth_path result_indicators result_ranks
```

#### Example

```
pranksegmentationalgorithmsfromfolders \
0 0.5 \
4 1 \
2 2 \
3 3 \
1 4 \
1 4 \
./example/resultimages/ \
./example/groundtruths \
/tmp/indicators2.txt /tmp/rank2.txt
```

The Pantheon project
Image Team GREYC Laboratory
UMR CNRS 6072 - ENSICAEN - University of Caen, France