ValiPop Validation

Contingency Tables

Once the simulation has ended, if the output_tables config option is set to true, ValiPop generates contingency tables for the target population's deaths, partnerships, separations, ordered births, and multiple births. A contingency table is a CSV file that records the actual and expected frequency of each kind of event that can occur for that property. The following is a snippet of the deaths contingency table:

Source,YOB,Sex,Age,Died,Date,freq
SIM,1861,M,26,false,1887,6
STAT,1879,F,81,false,1960,1.814276533165328
STAT,1924,F,43,false,1967,16.05255878948318
SIM,1904,F,21,false,1925,10

Source indicates whether the frequency (freq) is the actual number (SIM) or the expected number (STAT) for each event. An event here is a unique tuple of YOB (year of birth), Sex, Age, Died, and Date. Each tuple should have at most one actual and one expected frequency. Entries with a frequency of zero are omitted from the contingency table.

The fields that make up the tuple vary between the types of contingency table, but every table contains the Source and freq fields.
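To make the SIM/STAT pairing concrete, the following is a minimal Python sketch (not part of ValiPop) that groups the rows of a contingency table by event tuple. It assumes only the layout described above: a Source column, a freq column, and any number of other columns that together form the tuple. The function name pair_frequencies and the sample rows are hypothetical. The 5.2 value in particular is made up so that one tuple has both a SIM and a STAT row.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows in the deaths-table layout; the STAT value 5.2 is invented.
SAMPLE = """Source,YOB,Sex,Age,Died,Date,freq
SIM,1861,M,26,false,1887,6
STAT,1861,M,26,false,1887,5.2
STAT,1879,F,81,false,1960,1.814276533165328
SIM,1904,F,21,false,1925,10
"""

def pair_frequencies(csv_text):
    """Return {event_tuple: {"SIM": actual, "STAT": expected}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column other than Source and freq is treated as part of the tuple,
    # so the same function works for tables with different tuple fields.
    key_fields = [f for f in reader.fieldnames if f not in ("Source", "freq")]
    # Zero frequencies are omitted from the file, so a missing side defaults to 0.
    paired = defaultdict(lambda: {"SIM": 0.0, "STAT": 0.0})
    for row in reader:
        event = tuple(row[f] for f in key_fields)
        paired[event][row["Source"]] = float(row["freq"])
    return dict(paired)

pairs = pair_frequencies(SAMPLE)
for event, freqs in pairs.items():
    print(event, freqs)
```

Because the tuple fields are derived from the header rather than hard-coded, the same sketch should apply to the other table types, provided they keep the Source and freq columns.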

Validation

Validation is performed in R using the contingency tables. The script is stored within the ValiPop JAR file and written to the results directory so that it can be executed when needed.

The R script reads in and cleans the data from the contingency table, and then passes it to the geepack function geeglm. geeglm fits a generalized linear model using Generalized Estimating Equations (GEE) to determine whether the expected and actual event frequencies differ significantly. The results of this analysis are written to an output file, which ValiPop then reads and interprets.

ValiPop counts the number of ‘stars’ generated by the analysis. A star appears only when a p value (which indicates the significance of a result) falls below a certain threshold. The more ‘stars’ counted, the more significantly the simulated population differs from the input distributions. Ideally, we want 0 ‘stars’ counted, as that indicates that no statistically significant difference was found between the simulated population and the input distributions.
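As a rough illustration of star counting, the Python sketch below scans R-style summary text and totals the star markers at the end of coefficient rows, skipping the "Signif. codes" legend line. It is not ValiPop's implementation. The output layout, the coefficient names, the numbers in SAMPLE_OUTPUT, and the function name count_stars are all assumptions made for the example. Which terms ValiPop actually counts is not specified here.

```python
import re

# Hypothetical summary text in the style R prints for a fitted model.
# All names and numbers below are invented for illustration.
SAMPLE_OUTPUT = """\
Coefficients:
            Estimate Std.err  Wald Pr(>|W|)
SourceSTAT    0.0100  0.0500  0.04    0.841
Age           0.2000  0.0400 25.00  5.7e-07 ***
SourceSTAT:Age 0.0900 0.0400  5.06    0.024 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
"""

def count_stars(text):
    """Total the trailing star markers across all coefficient rows."""
    total = 0
    for line in text.splitlines():
        # The legend line contains stars but is not a result row.
        if line.startswith("Signif. codes"):
            continue
        # Match one to three stars at the end of the line, after whitespace.
        match = re.search(r"\s(\*{1,3})\s*$", line)
        if match:
            total += len(match.group(1))
    return total

print(count_stars(SAMPLE_OUTPUT))
```

A result of 0 from such a count corresponds to the ideal outcome described above. A stricter variant might count only rows that involve the Source term, since those compare simulated and expected frequencies directly, but that choice is an assumption and not something this document states.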