Over the years, many verification measures have been devised. Some are specific to one type of forecast, while others can be applied to multiple forecast elements. The purpose of this web page is to describe several verification indices or scores that are commonly used in meteorology. For details on these measures, see the two references listed at the end of this web page.
Let's consider a forecast event that either occurs or does not occur. This event is categorical, non-probabilistic, and discrete. Examples of this type of forecast include rain versus no rain, or a severe weather warning. This type of forecast can be represented by a 2x2 contingency table.
| | Observed: Yes | Observed: No | Total |
| Forecast: Yes | a | b | a+b |
| Forecast: No | c | d | c+d |
| Total | a+c | b+d | n = a+b+c+d |
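This table looks at four possible outcomes:
a = event forecast and observed (a hit)
b = event forecast but not observed (a false alarm)
c = event not forecast but observed (a miss)
d = event not forecast and not observed (a correct "no" forecast)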
Several measures can be derived from this table of data.
Percent Correct (PC) is the fraction of all forecasts that are correct. Specifically,
$$\mathrm{PC} = \frac{a + d}{n}$$
PC ranges from zero (0) for no correct forecasts to one (1) when all forecasts are correct.
It is not useful for low frequency events such as severe weather warnings. In these cases there is a high frequency of "not forecast/not occurred" (d) events. This gives high PC values that are misleading with regard to the forecasting of the low frequency event. This shortcoming is compensated for by the next three scores.
The Hit Rate (H) is the fraction of observed events that is forecast correctly. It is calculated as follows:
$$H = \frac{a}{a + c}$$
It is also known as the Probability of Detection (POD). It ranges from zero (0) at the poor end to one (1) at the good end.
The False Alarm Ratio (FAR) is the fraction of "yes" forecasts that were wrong, i.e., were false alarms. It is calculated as follows:
$$\mathrm{FAR} = \frac{b}{a + b}$$
It ranges from zero (0) at the good end to one (1) at the poor end.
The Threat Score (TS) or Critical Success Index (CSI) combines Hit Rate and False Alarm Ratio into one score for low frequency events. It is calculated as follows:
$$\mathrm{TS} = \mathrm{CSI} = \frac{a}{a + b + c}$$
This score ranges from zero (0) at the poor end to one (1) at the good end. It does not consider "not forecast/not occurred" (d) events.
CSI, POD and FAR are used extensively by the National Weather Service to verify severe thunderstorm and tornado warnings.
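Bias compares the number of times an event was forecast to the number of times an event was observed. Specifically,
$$\mathrm{Bias} = \frac{a + b}{a + c}$$
A bias greater than one means the event was forecast more often than it was observed (overforecasting). A bias less than one means it was forecast less often than it was observed (underforecasting).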
John Finley was a sergeant in the U.S. Army Signal Service in the 1880s. He made 2,803 tornado forecasts for 18 regions east of the Rocky Mountains. His results can be examined using the 2x2 contingency table.
| | Tornadoes Observed: Yes | Tornadoes Observed: No | Total |
| Tornadoes Forecast: Yes | 28 | 72 | 100 |
| Tornadoes Forecast: No | 23 | 2,680 | 2,703 |
| Total | 51 | 2,752 | 2,803 |
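These data produce the following statistics:
PC = (28 + 2,680) / 2,803 = 0.966
POD = 28 / 51 = 0.549
FAR = 72 / 100 = 0.720
CSI = 28 / (28 + 72 + 23) = 0.228
Bias = 100 / 51 = 1.96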
You can see why PC is not a good measure for tornado forecasting. These statistics say that the forecasts were correct 96.6 percent of the time. However, 95.6 percent of all forecasts were correct "no tornado" forecasts (d), so nearly all of the PC comes from not forecasting tornadoes. A POD of 54.9 percent is admirable considering the state of meteorology in the 1880s, but a FAR of 72.0 percent is rather high. The bias of about 1.96 (100 tornado forecasts against 51 observed tornadoes) implies a tendency to overforecast the occurrence of tornadoes.
An interesting slant on these statistics occurs when the Finley data are modified to indicate that no tornadoes are forecast.
| | Tornadoes Observed: Yes | Tornadoes Observed: No | Total |
| Tornadoes Forecast: Yes | 0 | 0 | 0 |
| Tornadoes Forecast: No | 51 | 2,752 | 2,803 |
| Total | 51 | 2,752 | 2,803 |
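For this case, these data produce the following revised statistics:
PC = (0 + 2,752) / 2,803 = 0.982
POD = 0 / 51 = 0
FAR = 0 / 0, which is undefined because no "yes" forecasts were made
CSI = 0 / 51 = 0
Bias = 0 / 51 = 0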
In this case where tornadoes were never forecast, the PC went up to 98.2 percent.
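The Finley numbers can be checked with a few lines of Python. This is a minimal sketch, not an official verification tool: the function name `contingency_scores` and the choice to return `None` for a score with a zero denominator are assumptions made here for illustration.

```python
def contingency_scores(a, b, c, d):
    """Return basic 2x2 verification scores as a dict.

    a = hits, b = false alarms, c = misses, d = correct "no" forecasts.
    A score whose denominator is zero is returned as None (undefined);
    that convention is an assumption of this sketch.
    """
    n = a + b + c + d

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "PC": ratio(a + d, n),           # percent correct
        "POD": ratio(a, a + c),          # hit rate / probability of detection
        "FAR": ratio(b, a + b),          # false alarm ratio
        "CSI": ratio(a, a + b + c),      # threat score / critical success index
        "BIAS": ratio(a + b, a + c),     # forecast "yes" count over observed "yes" count
    }


# Finley's original forecasts, then the "never forecast a tornado" variant.
finley = contingency_scores(28, 72, 23, 2680)
never = contingency_scores(0, 0, 51, 2752)

for label, scores in (("Finley", finley), ("never forecast", never)):
    rounded = {k: (None if v is None else round(v, 3)) for k, v in scores.items()}
    print(label, rounded)
```

The second case shows the weakness of PC directly: PC rises while POD and CSI drop to zero and FAR cannot be computed at all.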
Skill Score (SS) measures forecast accuracy relative to a control or reference forecast. It essentially answers the question: how much better is the forecast than the reference?
Control or reference forecasts commonly include climatology, persistence, random chance, and guidance such as MOS.
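Skill Score is basically the percentage improvement over the reference forecast. It is expressed as follows:
$$SS = \frac{A - A_{ref}}{A_{perf} - A_{ref}} \times 100\%$$
where:
A = accuracy measure of the forecast being evaluated
A_{ref} = accuracy measure of the reference forecast
A_{perf} = accuracy measure of a perfect forecast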
If A = A_{perf}, SS = 100%.
If A = A_{ref}, SS = 0 (no skill).
Please note that SS can be either positive or negative.
Two skill scores can be applied to the 2x2 contingency table. These are the Heidke Skill Score and the Gilbert Skill Score.
For the Heidke Skill Score (HSS), the reference measure is the proportion correct that would be expected by random forecasts that are statistically independent of the observations.
From the 2x2 contingency table, the marginal probability of a yes forecast is (a+b)/n.
From the 2x2 contingency table, the marginal probability of a yes observation is (a+c)/n.
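Thus, the probability of a correct yes forecast by chance is:
$$\frac{a+b}{n} \cdot \frac{a+c}{n} = \frac{(a+b)(a+c)}{n^2}$$
The probability of a correct no forecast by chance is:
$$\frac{c+d}{n} \cdot \frac{b+d}{n} = \frac{(b+d)(c+d)}{n^2}$$
Let:
A = PC = (a+d)/n
A_{ref} = [(a+b)(a+c) + (b+d)(c+d)] / n^2
A_{perf} = 1
and substitute these values into the general skill score formula. This gives the following expression for HSS:
$$HSS = \frac{2(ad - bc)}{(a+c)(c+d) + (a+b)(b+d)}$$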
HSS is independent of n. HSS = 1 for a perfect forecast; HSS = 0 shows no skill. If HSS < 0, the forecast is worse than the reference forecast.
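As a check on the algebra, the sketch below computes HSS two ways for the Finley table: once by substituting A, A_{ref} and A_{perf} = 1 into the general skill score formula, and once with the standard closed form 2(ad - bc) / [(a+c)(c+d) + (a+b)(b+d)]. The function names are illustrative assumptions, and the score is returned as a fraction rather than a percentage.

```python
def heidke_from_skill_formula(a, b, c, d):
    """HSS via the general skill score, with proportion correct as the accuracy."""
    n = a + b + c + d
    accuracy = (a + d) / n
    # Proportion correct expected from random forecasts with the same marginals.
    accuracy_ref = ((a + b) * (a + c) + (b + d) * (c + d)) / n**2
    accuracy_perf = 1.0
    return (accuracy - accuracy_ref) / (accuracy_perf - accuracy_ref)


def heidke_closed_form(a, b, c, d):
    """HSS via the simplified expression, in which n no longer appears."""
    return 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))


finley = (28, 72, 23, 2680)
hss_general = heidke_from_skill_formula(*finley)
hss_closed = heidke_closed_form(*finley)
hss_never = heidke_closed_form(0, 0, 51, 2752)

print(round(hss_general, 3), round(hss_closed, 3), round(hss_never, 3))
```

Both routes should agree to rounding error, and the "never forecast a tornado" table should score zero skill despite its higher PC.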
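For the Gilbert Skill Score (GSS), the reference measure is the threat score (TS or CSI) for random forecasts, in which the number of hits expected by chance is:
$$a_{ref} = \frac{(a+b)(a+c)}{n}$$
This gives the following expression for GSS:
$$GSS = \frac{a - a_{ref}}{a - a_{ref} + b + c}$$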
In this formula, a_{ref} depends upon n.
GSS is also known as the Equitable Threat Score (ETS).
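A short sketch of the GSS calculation for the Finley table follows. It assumes the usual definition in which a_{ref} = (a+b)(a+c)/n is the number of hits expected by chance; the function name is illustrative.

```python
def gilbert_skill_score(a, b, c, d):
    """Gilbert Skill Score (Equitable Threat Score) for a 2x2 table."""
    n = a + b + c + d
    a_ref = (a + b) * (a + c) / n  # hits expected from random forecasts
    return (a - a_ref) / (a - a_ref + b + c)


gss_finley = gilbert_skill_score(28, 72, 23, 2680)
gss_never = gilbert_skill_score(0, 0, 51, 2752)

# For comparison: the plain threat score ignores the chance hits.
csi_finley = 28 / (28 + 72 + 23)

print(round(gss_finley, 3), round(csi_finley, 3), round(gss_never, 3))
```

Because a_{ref} is small for a rare event such as a tornado, GSS comes out only slightly below CSI here.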
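Mean Absolute Error (MAE) is a scalar accuracy measure that is calculated as follows:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |f_i - o_i|$$
where:
f_i = forecast value
o_i = observed value
n = number of forecast-observation pairs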
Each forecast-observation pair gives an error value. This measure sums the absolute values of these errors and divides by the number of forecasts to give an average error. MAE = 0 for a perfect forecast.
MAE is commonly used for verifying maximum and minimum temperature forecasts.
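Mean Square Error (MSE) is a scalar accuracy measure that is calculated as follows:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2$$
As with MAE, f_i is the forecast value, o_i is the observed value, and n is the number of forecast-observation pairs.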
In this case the forecast-observation errors are squared before they are averaged. Because of the squaring, large errors contribute disproportionately to the average, so MSE is more sensitive to large errors (outliers) than MAE. MSE = 0 for a perfect forecast.
The square root of MSE is the Root Mean Square Error (RMSE).
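Mean Error (ME) is a scalar accuracy measure that is calculated as follows:
$$ME = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)$$
As with MAE and MSE, f_i is the forecast value, o_i is the observed value, and n is the number of forecast-observation pairs.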
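Unlike MAE and MSE, ME keeps the sign of each error, so positive and negative errors can offset one another in the average. As a result, ME shows whether forecasts tend to run too high or too low, and it is also known as bias.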
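The four scalar measures can be compared side by side on one small sample. The temperatures below are hypothetical values chosen for illustration, not real verification data, and the error is taken as forecast minus observed.

```python
import math


def error_measures(forecasts, observations):
    """Return MAE, MSE, RMSE and ME for paired forecasts and observations."""
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("need equal-length, non-empty sequences")
    errors = [f - o for f, o in zip(forecasts, observations)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "ME": sum(errors) / n,
    }


# Hypothetical maximum temperature forecasts and observations (degrees F).
forecast_max = [72, 75, 68, 80, 77]
observed_max = [70, 76, 68, 74, 78]

measures = error_measures(forecast_max, observed_max)
print({name: round(value, 2) for name, value in measures.items()})
```

The single 6-degree miss dominates MSE and RMSE far more than it does MAE, while the positive ME indicates that this sample of forecasts ran warm on average.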
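Brier Score (BS) is an accuracy measure for probabilistic forecasts of dichotomous events. A dichotomous event is one that either occurs or does not occur. For example, rain either occurs or does not occur. Brier Score is calculated as follows:
$$BS = \frac{1}{n} \sum_{i=1}^{n} (p_i - o_i)^2$$
where:
p_i = forecast probability of the event
o_i = 1 if the event occurred and 0 if it did not
n = number of forecast-observation pairs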
This is essentially the formula for MSE applied to probabilities. BS ranges from zero (0) to one (1), with BS = 0 for a perfect forecast.
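Brier Score can be converted to a skill score by assuming the following:
A = BS
A_{ref} = BS_{ref}, the Brier Score of the reference forecast
A_{perf} = 0, the Brier Score of a perfect forecast
Thus, the formula for Brier Skill Score (BSS) is:
$$BSS = \frac{BS - BS_{ref}}{0 - BS_{ref}} = 1 - \frac{BS}{BS_{ref}}$$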
If BS > BS_{ref}, then BSS < 0 or your forecast is worse than the reference forecast.
If BS < BS_{ref}, then BSS > 0 or your forecast is better than the reference forecast.
If you use MOS as the reference forecast, and your BSS is negative, you can be replaced by MOS.
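A small sketch of the Brier Score and Brier Skill Score follows. The POP values and outcomes are hypothetical, and the reference forecast used here is a constant probability equal to the sample frequency of rain, which is one common choice assumed for illustration; a MOS forecast could be substituted as the reference in the same way.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and outcome (1 or 0)."""
    pairs = list(zip(probabilities, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def brier_skill_score(probabilities, reference_probabilities, outcomes):
    """BSS = 1 - BS / BS_ref; negative when the reference forecast is better."""
    bs = brier_score(probabilities, outcomes)
    bs_ref = brier_score(reference_probabilities, outcomes)
    return 1.0 - bs / bs_ref


# Hypothetical POP forecasts and whether rain occurred (1) or not (0).
pop = [0.9, 0.1, 0.7, 0.2, 0.6]
rain = [1, 0, 1, 0, 0]

# Reference: the same probability every day, set to the sample rain frequency.
reference = [sum(rain) / len(rain)] * len(rain)

bs = brier_score(pop, rain)
bs_ref = brier_score(reference, rain)
bss = brier_skill_score(pop, reference, rain)
print(round(bs, 3), round(bs_ref, 3), round(bss, 3))
```

In this made-up sample BS is smaller than BS_{ref}, so BSS is positive and the forecasts beat the reference.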
Another approach to evaluating probability forecasts is the Reliability Diagram. This diagram plots the forecast probability along the x-axis and the observed frequency of occurrence for each probability value along the y-axis.
For example, if for all of your 40 percent probability of precipitation (POP) forecasts you found that it rained on 35 percent of these forecasts, you would plot 0.35 in the y-axis direction for 40 percent POP on the x-axis. Ideally, you would like to see a 40 percent frequency of occurrence for your 40 percent POP forecasts.
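The points of a reliability diagram can be tabulated before any plotting is done. The sketch below groups forecasts to the nearest 10 percent, which is an assumption about how the POP values are issued, and uses made-up data that reproduces the 40 percent example above.

```python
from collections import defaultdict


def reliability_points(probabilities, outcomes):
    """Return sorted (forecast probability, observed frequency, count) tuples."""
    counts = defaultdict(int)
    occurrences = defaultdict(int)
    for p, o in zip(probabilities, outcomes):
        key = round(p, 1)  # group to the nearest 10 percent (assumption)
        counts[key] += 1
        occurrences[key] += o
    return [(p, occurrences[p] / counts[p], counts[p]) for p in sorted(counts)]


# Hypothetical sample: twenty 40 percent forecasts with rain on 7 of them,
# and ten 80 percent forecasts with rain on 8 of them.
pop = [0.4] * 20 + [0.8] * 10
rain = [1] * 7 + [0] * 13 + [1] * 8 + [0] * 2

points = reliability_points(pop, rain)
for forecast_prob, observed_freq, count in points:
    print(forecast_prob, observed_freq, count)
```

Each tuple gives one point on the diagram: the x value, the y value, and how many forecasts stand behind it. Points on the diagonal (y equal to x) indicate reliable probabilities.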
Most of the verification measures discussed up to now were applied to point forecasts. However, you can also verify model grid forecasts using some of these measures. Described below are several verification measures that have been used for any type of gridded forecast data.
For any grid of forecast values, you can apply the MSE and RMSE formulae from above. The forecast-observation pairs are corresponding forecast and analyzed grid values. Thus for any forecast grid you can calculate an MSE/RMSE value for that grid. These numbers are a broad measure of accuracy in terms of an average error across the grid.
If you are interested in more detail about grid errors, you will likely look at other measures. Something as simple as a plot of the forecast minus analyzed value at each grid point provides such detail.
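Anomaly Correlation is a more complex statistical approach to grid verification. You start by determining the anomaly at each grid point using the following method:
forecast anomaly = forecast value - climatological value
analyzed anomaly = analyzed value - climatological value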
Use the anomaly grid point pairs to run a standard statistical correlation on the pairs of points to measure the field relationship.
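The two steps can be sketched as follows, assuming that the anomaly is the grid value minus the climatological value at the same point and that the "standard statistical correlation" is the Pearson correlation coefficient. Operational centers may use variants (for example, area weighting or an uncentered form), so treat this as an illustration only. The grids are flattened to simple lists and the height values are hypothetical.

```python
import math


def anomaly_correlation(forecast, analysis, climatology):
    """Pearson correlation between forecast and analyzed anomalies."""
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    a_anom = [a - c for a, c in zip(analysis, climatology)]
    n = len(f_anom)
    f_mean = sum(f_anom) / n
    a_mean = sum(a_anom) / n
    covariance = sum((f - f_mean) * (a - a_mean) for f, a in zip(f_anom, a_anom))
    f_spread = sum((f - f_mean) ** 2 for f in f_anom)
    a_spread = sum((a - a_mean) ** 2 for a in a_anom)
    return covariance / math.sqrt(f_spread * a_spread)


# Hypothetical 500 mb heights (m) at four grid points.
climatology = [5500, 5520, 5540, 5560]
analysis = [5480, 5530, 5570, 5550]
forecast = [5490, 5525, 5560, 5570]

ac = anomaly_correlation(forecast, analysis, climatology)
ac_perfect = anomaly_correlation(analysis, analysis, climatology)
print(round(ac, 3), round(ac_perfect, 3))
```

A forecast identical to the analysis gives a correlation of one; values near zero mean the forecast anomaly pattern bears little relation to the analyzed pattern.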
Another question that may arise is: How good are the patterns generated by the forecast models? For example, how close are the surface low pressure center forecasts to the observed surface low pressure center location?
One approach is to plot the error in location as a function of error along the track and error perpendicular to the track. This type of plot gives you a sense of whether the surface low forecast is fast or slow, or to the left or to the right of the track.
Using these data you can develop a set of probability ellipses that indicates the chance of a surface low being within a specific distance of its forecast position.
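One way to obtain the along-track and cross-track errors is to project the position error onto the direction of motion. The sketch below is an illustration under several assumptions not stated in the text: positions are (x, y) in kilometers on a local flat map with x pointing east and y pointing north, the error is forecast minus observed, and the track direction is taken from the previous observed position to the current one.

```python
import math


def track_errors(forecast_xy, observed_xy, previous_observed_xy):
    """Split a position error into (along-track, cross-track) components in km.

    along > 0: forecast low is ahead of the observed low (forecast too fast).
    cross > 0: forecast low is to the left of the observed track.
    """
    tx = observed_xy[0] - previous_observed_xy[0]
    ty = observed_xy[1] - previous_observed_xy[1]
    length = math.hypot(tx, ty)
    if length == 0:
        raise ValueError("track direction is undefined for a stationary low")
    tx, ty = tx / length, ty / length  # unit vector along the track

    ex = forecast_xy[0] - observed_xy[0]
    ey = forecast_xy[1] - observed_xy[1]

    along = ex * tx + ey * ty  # projection onto the track direction
    cross = tx * ey - ty * ex  # signed distance perpendicular to the track
    return along, cross


# Hypothetical low moving due east; forecast position is 30 km ahead and 40 km south.
along, cross = track_errors((130.0, -40.0), (100.0, 0.0), (0.0, 0.0))
print(along, cross)
```

Collecting these pairs over many cases gives the scatter of points from which a fast or slow tendency, or a left or right tendency, can be read.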
The purpose of this web page was to describe several verification methods that are commonly used in meteorology. Many more than described here are available, and in some cases, variations on these measures are designed to fit what is forecast.
If you are interested in more details on what has been described here, you are referred to the two texts listed in the references below. Remember, however, verification can become very heavy from a statistical perspective. Be prepared.