Tutorial: Evaluation and ranking of image segmentation algorithms

Identification

Pantheon Project
Author: Régis CLOUARD
Date: Aug 27, 2015
Required: Pandore

This tutorial describes a method for ranking several image segmentation algorithms according to their performances. The originality of the method is that it takes into account the segmentation task requirements to calculate the performnace measures and perform ranking. The method is described by the way of a running example.

Method
Example
Code

I. Method

Performance evaluation is based on discrepancy measures between segmentation results and reference segmentations called ground truths.

Vocabulary: For the sake of clarity, connected components that composed segmentation results are named "segments" and connected components that composed ground truths are named "regions".

Prerequisite:

The test images are supposed to be stored in a folder named "sourceimages".
The results of each segmentation algorithm are stored in another subfolder, for example "resultimages". Inside, each algorithm has its own subfolder, for example "algo001", "algo002", etc. Subfolders are organized like the "sourceimage" folder with the same image names.
The ground truths of each expert are stored in a third folder, for example "groundtruths". Each expert has its own subfolder, for example "expert001", "expert002", etc. Subfolders are organized like the "sourceimage" folder with the same image names.

I.1. Performance indicators

Performances are evaluated with five indicators, four geometric and one topological. Each indicator is measured by two discrepancy errors between segmentation results and related ground truths. Regarding the segmentation task, the performance value for the indicator is calculated by a tradeoff between the two d'error values. The evaluator should choose the tradeoff type among the 6 given in the right column of the table below.

1. Detection accuracy.

This indicator is concerned with how many regions of the ground truth have been detected by the segments of the segmentation result. A segment detects a region if a part of the segment covers a part of the region. There are possibly two types of error. The false negatives are regions of the ground truth that are not covered by a segment. The false positives are segments that do not cover any region.

If the detection accuracy indicator is deemed important for a given application, then it is necessary to specify which of the two previous errors is the most acceptable. More precisely, it is required to decide which of the propositions on the right fits the application the best.

1	Both errors are acceptable, meaning do not consider any error for this indicator.
2	Both errors are undesirable, meaning penalize equally under-detection and over-detection errors.
3	Prefer false negatives, meaning penalize more errors due to over-detection than to under-detection.
4	Prefer false positives, meaning penalize more errors due to under-detection.
5	Do not penalize false negatives, meaning penalize only over-detection.
6	Do not penalize false positives, meaning penalize only under-detection.

2. Fragmentation consistency.

Ideally, each region should be detected by exactly one segment and each segment should detect exactly one region. But two errors can occur: the over-segmentation corresponds to a subdivision of a region into multiple segments, and the under-segmentation corresponds to a grouping of multiple regions into one segment.

As far as fragmentation consistency is concerned for a given application, one of the propositions on the right must be specified regarding the task objectives.

1	Both errors are acceptable.
2	Both errors are undesirable.
3	Prefer under-segmentation.
4	Prefer over-segmentation.
5	Do not penalize under-segmentation.
6	Do not penalize over-segmentation.

3. Boundary precision.

Boundary precision is an important concern for a lot of applications. The maximum precision is achieved when the ground truth regions are covered exactly by the related segments. The two types of errors are the excess of pixels when segment pixels are positioned outside the related region boundary, and the deficit of pixels when region pixels are positioned outside the related segment boundary.

Assessment of boundary precision must be done regarding one of the propositions on the right.

1	Both errors are acceptable.
2	Both errors are undesirable.
3	Prefer deficit of pixels.
4	Prefer excess of pixels.
5	Do not penalize deficit of pixels.
6	Do not penalize excess of pixels.

4. Shape fidelity.

The shape fidelity assesses how far region shape is from related segment shape. It measures to what extent the excess and the deficit of pixels affect the shape of the detected regions: are the mis-segmented pixels rather spread along the region boundary which slightly modify the shape, or are they concentrated on extra areas which modify significantly the shape? This indicator distinguishes discrepancy errors due to area commission which affects the region exterior and to area omission which affects the region interior.

If fidelity to the shape is an important indicator, preference should be given to one of the propositions on the right.

1	Both errors are acceptable.
2	Both errors are undesirable.
3	Prefer shape error due to area omission.
4	Prefer shape error due to area commission.
5	Do not penalize area omission.
6	Do not penalize area commission.

5. Topology preservation.

The topology criterion consists in counting the number of holes. Indeed, in 2D, two regions are topologically equivalent if they have the same number of inner holes.

An added hole is a connected component composed of false negative pixels that are completely inside a segment, i.e., a connected component that has only one neighboring segment including the background. So, a hole in a segment that does not intersect the corresponding regions is not an added hole. Similarly, a deleted hole is a connected component composed of false positive pixels that are completely inside the related region.

The two possible errors are the hole addition and the hole deletion numbers in the detected regions. The related propositions are those on the right.

1	Both errors are acceptable.
2	Both errors are undesirable.
3	Prefer hole addition.
4	Prefer hole deletion.
5	Do not penalize addition of holes.
6	Do not penalize deletion of holes.

I.2 Computing the performance indicators values

Each indicator value is computed using a tradeoff between the two related antagonist errors regarding what the developer defined as the most acceptable error for the application. The tradeoff implementation depends of the acceptable error chosen for the indicator bu the evaluator:

Both errors are acceptable. This leads to exclude this indicator for the ranking:
```
indicator_i=0.
```

Both errors are undesirable. The two errors are equally penalized:

                  |   0.5             0.5      |^-1
indicator_i = 1 -  | --------   +   ----------  |
                  |  1 - error₂    1 - error₁  |

Prefer error₁ to error₂. We penalize error₂ twice times more than error₁:

                  |   0.8             0.2      |^-1
indicator_i = 1 -  | --------   +   ----------  |
                  |  1 - error₂    1 - error₁  |

Prefer error₂ to error₁. On the contrary, error₁ is penalized twice times more than error₂:

                  |   0.2             0.8      |^-1
indicator_i = 1 -  | --------   +   ----------  |
                  |  1 - error₂    1 - error₁  |

Do not penalize error₁. This means to use only the error₂ value:
```
indicator_i = error_i2
```
Do not penalize error₂. This means to use only the error₁ value:
```
indicator_i = error_i1
```

I.3. Ranking

Finally, algorithm ranking is performed using the five indicator computed.

After specifying the most acceptable error for each indicator, the evaluator should then give priority between these indicators regarding the specific needs of the application; ties are possible.

To make the ranking, the priorities are first converted to weights affecting each indicator. This is done using a uniform distribution of the weights over the range [0..1] according to the "Rank Order Centroid" approach. Each weight w_i of the indicator i is obtained from the priority j by:

w_i = 1/n S_j=iⁿ{1/j}

for example, if n=5 then w₁=0.4567, w₂=0.2567, w₃=0.1567, w₄=0.0900, and w₅=0.0400.

The final ranking is given by the weighted sum of the rank r_ij of each algorithm i for each indicator j multiplied by the corresponding weight w_j:

r_i=S_j=1ⁿ{r_ij.w_j}

II. Example

II.1. Description of the application

This experiment is a biological application in the context of cancer screening. The purpose is to assist cytotechnologists in finding suspect or abnormal cells from analysis of cytology images. The image segmentation goal is to locate each serous cell as individual region as presented in Figure 1.2. These regions will then be fed into a classifier which has been trained to identify cell types.

The complete definition of the problem needs to specify which error is the most acceptable for each of the five indicators. In our study case, these acceptable errors are suggested by the classification requirements:

Detection accuracy: acceptable error #4: Prefer false positives.
The images contain only small numbers of cells and even fewer suspect or abnormal cells (around 2% of the cells). So, under-detection is a serious error as it means that segmentation could have missed some suspect cells. Over-detection is less damaging error, the false positives regions will be further discarded by the classifier.
Fragmentation consistency: acceptable error #2: both errors are undesirable.
Over-segmentation destroys cell integrity and under-segmentation creates cell aggregates. Both errors lead to a misclassification of the related regions.
Boundary precision: acceptable error #3: prefer deficit of pixels.
Since the classifier uses photometric features of regions, excess of pixels biases the feature values with background pixels.
Shape fidelity: acceptable error #2: both errors are undesirable.
Region shape features (eg., roundness, eccentricity) are also used by the classifier. So, commission and omission shape errors should be penalized equally.
Topology preservation: acceptable error #1: both errors are acceptable.
Holes are not considered for this application.

Figure 1.1. A cytology image with serous cells and red blood corpuscles.

Figure 1.2. The ground truth is superimposed in white.

II.2. Evaluation of image segmentation algorithm performances

The Pandore script used to perform evaluation of algorithm performnaces is:

passesssegmentationalgorithm -v \
    0 \
    0.5 \
    cytology/resultimages/a001 \
    cytology/groundtruths \
    /tmp/detail_errors_algo1.pan \
    /tmp/total_errors_algo1.pan

The following table gives the ten discrepancy errors resulted from the comparison of the segmentation result given in Figure 2.3 against the two ground truths given in Figure 2.1 and Figure 2.2. The matching algorithm used (type 0) allows that one region is matched by n segments and one segment matches n regions. The detection threshold is set to 50% (0.50) in order to keep only almost complete cells. A lower thresholding would result in too fragmented cells and would not allow proper identification by classification.

Table 1. The error values resulting from the comparison of the segmentation result given in Figure 2.3 with the ground truths given in Figure 2.1 and Figure 2.2.

Detection accuracy	error₁₁: Recall error	0.000
Detection accuracy	error₁₂: Precision error	0.083
Fragmentation consistency	error₂₁: Under-segmentation error	0.196
Fragmentation consistency	error₂₂: Over-segmentation error	0.000
Boundary precision	error₃₁: Deficit error	0.112
Boundary precision	error₃₂: Excess error	0.046
Shape fidelity	error₄₁: Shape omission error	0.262
Shape fidelity	error₄₂: Shape commission error	0.241
Topology preservation	error₅₁: Hole addition error	0.000
Topology preservation	error₅₂: Hole deletion error	0.000

The interpretation of the results is as follows:

Detection accuracy. All the ground truth regions have been detected, but 8.3% of the segments are false positive segments. Actually, 2 segments over 24 are false positive segments.
Fragmentation consistency. The segmentation result suffers from under-segmentation defect only. The under-segmentation error value of 0.196 indicates that there is $2$ ^{0.196/(1-0.196)} = 1.184 region per correct segment. Actually, 3 segments over 22 merge two regions each.
Boundary precision. The boundary precision errors are mostly due to pixel deficit. The segment boundaries are placed rather inside the corresponding region: 11.2% of correctly detected region area is not detected, and 4.6% of correct segment area is spurious.
Shape fidelity. The mis-segmented pixels whether due to pixel deficit or pixel excess modify a little the region shape. The deficit pixels affect more the shape than the excess pixels. The average distance between the omission area boundary and the correct region boundary is of $10$ ^{0.262/(1-0.262)}-1=1.3 pixel, and the average distance between the commission area boundary and the correct region boundary is of $10$ ^{0.241/(1-0.241)}-1=1.1 pixel.
Topology preservation. There is no hole addition or hole deletion errors.

Figure 2.1. A cytology image with serous cells and red blood corpuscles. A first ground truth is superimposed in white.

Figure 2.2. The same cytology image with a second ground truth superimposed in white.

Figure 2.3. The segmentation result of the algorithm A superimposed on the source image.

II.3. Ranking of Image Segmentation Algorithm

We consider five different image segmentation algorithms, respectively named A, B, C, D and E, and one result yielded by each algorithm while segmenting the image given Figure 3.1. The results are presented in Figures 3.2 to 3.6.


Figure 3.1. A cytology image with serous cells and red blood corpuscles. The first ground truth is superimposed in white.	Figure 3.2. The segmentation result of the algorithm A superimposed on the source image.	Figure 3.3. The segmentation result of the algorithm B superimposed on the source image.

Figure 3.4. The segmentation result of the algorithm C superimposed on the source image.	Figure 3.5. The segmentation result of the algorithm D superimposed on the source image.	Figure 3.6. The segmentation result of the algorithm E superimposed on the source image.

The algorithm performances are given in the following Table 2.

Table 2. The error values resulted from the comparison of the five segmentation results against the ground truth.

	A	B	C	D	E
Error₁₁: Recall error	0.000	0.000	0.000	0.000	0.000
Error₁₂: Precision error	0.083	0.077	0.000	0.000	0.000
Error₂₁: Under-segmentation error	0.196	0.154	0.292	0.000	0.000
Error₂₂: Over-segmentation error	0.000	0.000	0.000	0.189	0.401
Error₃₁: Deficit error	0.112	0.045	0.000	0.004	0.090
Error₃₂: Excess error	0.046	0.017	0.349	0.112	0.005
Error₄₁: Shape omission error	0.262	0.250	0.000	0.231	0.275
Error₄₂: Shape commission error	0.241	0.258	0.341	0.270	0.255
Error₅₁: Hole addition error	0	0	0	0	0
Error₅₂: Hole deletion error	0	0	0	0	0

Using the acceptable errors and the priority between indicators given below, the ranking is shown in Table 3:

Priority 1: Detection accuracy with preference to false positives (acceptable error #4).
Priority 2: Segmentation consistency while penalizing equally over-segmentation and under-segmentation (acceptable error #2).
Priority 3: Boundary precision with preference to pixel deficit (acceptable error #3).
Other error aare not taken into account.

Table 3. The indicator values of the five algorithms and the related ranking with propositions {4, 2, 3}.

	A	B	C	D	E
Indicator₁	0.018	0.016	0.000	0.000	0.000
Indicator₂	0.108	0.084	0.171	0.100	0.251
Indicator₃	0.060	0.023	0.300	0.092	0.024
Indicator₄	0	0	0	0	0
Indicator₅	0	0	0	0	0
Rank	5	1	4	2	3

The ordering given Table 3 indicates that algorithms B and D are the best for this application. B is better in avarage, but D has good performances for the indicator with the higher priority. For this application, either B and D are relevant.

Pandore script used to rank the algorithms is:

 pranksegmentationalgorithmsfromfolders \
    0 0.5\
    4 1 \
    2 2 \
    3 3 \
    1 4 \
    1 4 \
    cytology/resultimages/ cytology/groundtruths \
    /tmp/indicators2.pan /tmp/rank2.pan

III. Code

Evaluation and ranking programs are provided with the Pandore distribution. The code provided below does not Pandore. It loads label image in ppm format (see example). However, the code can be easily adapted to any other image formats by changing the image loading functions in the file myinputoutput.h

III.2. Download

Code Download
(last update 08/27/15)

III.2. Compilation

Unix / linux / macos X:

make -f makefile.unix pranksegmentationalgorithmsfromfolders

Windows with visual C++:

nmake -f makefile.msdos pranksegmentationalgorithmsfromfolders.exe

III.3. Running

Usage:

pranksegmentationalgorithmsfromfolders [-v] matching_algorithm_id matching_threshold acceptable_error1 priority1 acceptable_error2 priority2 acceptable_error3 priority3 acceptable_error4 priority4 acceptable_error5 priority5 segmentation_result_path ground_truth_path result_indicators result_ranks

Example

pranksegmentationalgorithmsfromfolders \
    0 0.5\
    4 1 \
    2 2 \
    3 3 \
    1 4 \
    1 4 \
    ./example/resultimages/ \
    ./example/groundtruths \
    /tmp/indicators2.txt /tmp/rank2.txt

The Pantheon project
Image Team GREYC Laboratory
UMR CNRS 6072 - ENSICAEN - University of Caen, France
This page was last modified on 26 August 2015