Generates synthetic genealogical populations at (small) country scale.
Each individual distribution file contains at least the following meta information at the start of the file.
YEAR <year>
POPULATION <location>
SOURCE <source>
LABELS <tab-separated column labels>
DATA
...
Each meta field and value must be separated by a tab character. You may include your own fields, but only the following are read by ValiPop:
YEAR
specifies what year the distribution applies to.POPULATION
specifies what population the distribution is based on.SOURCE
specifies where the distribution was acquired.LABELS
specifies the labels of each column in the data, separated by tab characters.Everything after the DATA
field will be recorded as the input distribution data.
The format of the data depends on what the distribution is for, but all formats use tab characters to separate values.
Notably, ranges can be used to represent several positive integer values within the data. Ranges can either be of the form a-b
representing the values between positive integers a
and b
inclusive, or as c+
representing the values of c
and greater.
The data is separated into a ‘year’ and ‘value’ column. Each row specifies the value for the given year. For any time, ValiPop will use the value of the nearest given year. The YEAR
meta field is ignored by ValiPop here as the data represents the values across multiple years.
The following shows an input distribution for the property birth/ratio_birth
(proportion of births born male), which uses single input data.
...
DATA
1600 0.5
1650 0.56
1700 0.51
1801 0.48
1892 0.47
The data is separated into a ‘name’ and ‘probability’ column. Each row specifies the probability for a name. The sum of the probabilities should sum to one.
The following shows the name data for the property annotations/female_forename/
(probability of female forenames).
...
DATA
Aaisha 3.37840120541355e-05
Aaishah 2.02704072324813e-05
Aalia 3.04056108487219e-05
Aaliya 3.7162413259549e-05
Aaliyah 0.000375002533801
Aamena 2.02704072324813e-05
Aamenah 1.68920060270678e-05
Aamina 4.72976168757897e-05
...
The data is a 2D table with age or age ranges in the first column, and probabilities in the remaining columns. Each remaining column represents an enumerated value, and each row represents the probability distribution of the enumerated values at an age or age range. The sum of the probabilities on each row should sum to one. The LABELS
meta field should specify the enumerated value of each column (skipping the age column)
The following shows the 2D age-dependent enumerated data for the property annotations/occupation/male/
(probability of male occupations at a given age). The first labelled column ‘ ‘ represents unemployment in this case.
...
LABELS Farmer Teacher Chimney Sweeper
DATA
0-10 1.0 0.0 0.0 0.0
11-16 0.82 0.0 0.0 0.18
17-18 0.61 0.34 0.0 0.05
19 0.41 0.38 0.2 0.01
20-31 0.14 0.5 0.36 0
32+ 0.16 0.52 0.32 0
The data is a 2D table with both row and columns representing enumerated values. The first column specifies the enumerated value for each rows and the LABELS
meta field specifies the enumerated value of each column. The values represent probabilities and each row should sum to one.
The following shows 2D enumerated data for the property annotations/occupation/change/male/
(proportion of occupations males change to for each current occupation).
...
LABELS Farmer Teacher Chimney Sweeper
DATA
0.7 0.25 0.05 0.0
Farmer 0.15 0.8 0.05 0.0
Teacher 0.2 0.2 0.6 0.0
Chimney Sweeper 0.6 0.3 0.0 0.1
The data is separated into ‘age or age range’ and ‘value’ columns. Each row specifies a value for a given age or age range.
The following shows 1D age-dependent data for the property death/males/lifetable
(probability of death of at a given age or age range).
...
DATA
0-4 0.06089
5-10 0.00821
11-14 0.00483
15-19 0.00724
20-29 0.00916
30-39 0.01058
40-49 0.01443
50-59 0.02170
60-69 0.04430
70-79 0.09948
80-89 0.20741
90-99 0.36215
100+ 0.28125
The data is a 2D table with age or age ranges for each row, and some numerical value or value range for each column. The first column specifies the age or age range for each row, and the LABELS
meta field specifies the value or value range for each column.
The following shows 2D double age-dependent data for the property birth/ordered_birth/
(probabilities of having some number of children for each age).
LABELS 0 1 2 3 4 5+
DATA
0-14 0 0 0 0 0 0
15-19 0.0622909 0.010868 0.0015411 0 0 0
20-24 0.065174 0.033412 0.018144 0.004536 0.001134 0.0001134
25-29 0.03808 0.03872 0.044088 0.016032 0.00668 0.000668
30-34 0.011442424 0.025012121 0.030751515 0.012872727 0.005721212 0.0005721212
35-39 0.0022 0.00658 0.00946 0.00462 0.00264 0.000264
40-49 0.000264 0.000662 0.000864 0.000528 0.000432 0.0000432
50-54 0 0 0 0 0 0
55+ 0 0 0 0 0 0
The structure of the input distributions directory is shown in the following tree:
my-input-distribution/
│
├───annotations/
│ ├───female_forename/
│ ├───male_forename/
│ ├───surname/
│ ├───geography/
│ │
│ ├───occupation/
│ │ ├───change/
| │ │ ├───female/
| │ │ └───male/
| | |
│ │ ├───female/
│ │ └───male/
| |
│ └───migration/
│ ├───female_forename/
│ ├───male_forename/
│ ├───surname/
│ └───rate/
|
├───birth/
│ ├───adulterous_birth/
│ ├───multiple_birth/
│ ├───ordered_birth/
│ └───ratio_birth/
│
├───death/
│ ├───females/
│ │ ├───cause/
│ │ └───lifetable/
│ │
│ └───males/
│ ├───cause/
│ └───lifetable/
│
└───relationships/
├───marriage/
├───partnering/
└───separation/
annotations/female_forename
The probability of each name a newborn female could be given. Uses the Name Data format.
annotations/male_forename
The probability of each name a newborn male could be given. Uses the Name Data format.
annotations/surname
The probability of each surname a newly spawned family could be given. Uses the Name Data format.
annotations/geogrpahy
The geography the population is set in. This requires a single JSON file which defines the array of Areas a person can inhabit. An Area is defined by the following minimal JSON:
{
"place_id": <OSM place id>,
"road": <road>,
"suburb": <suburb>,
"town": <town>,
"county": <county>,
"state": <country>,
"postcode": <postcode>,
"boundingbox": [<min lat>, <max lat>, <min long>, <max long>]
}
annotations/occupation/change/female
The proportion of occupations a female will change to from their current occupation. Uses the 2D double enumerated data format.
annotations/occupation/change/male
The proportion of occupations a male will change to from their current occupation. Uses the 2D double enumerated data format.
annotations/occupation/female
The probabilities of occupations for a male at a given. Uses the 2D age-dependent enumerated data format.
annotations/occupation/male
The probabilities of occupations for a male at a given. Uses the 2D age-dependent enumerated data format.
annotations/migration/female_forename
The probability of each forename a newly immigrated female could have. Uses the Name Data format.
annotations/migration/male_forename
The probability of each forename a newly immigrated male could have. Uses the Name Data format.
annotations/migration/surname
The probability of each surname a newly immigrated person could have. Uses the Name Data format.
birth/adulterous_birth
The proportion of illegitimate births among all births. Uses the 1D age-dependent data format.
birth/multiple_birth
The proportion of maternities producing a given number of children. For example, whether a pregnancy results in twins, triplets, or just a single child. Uses the 2D age-dependent data format, with the number of children produced on the columns. Each row should sum to one (or zero if no births allowed at that age).
birth/ordered_birth
The probability of a mother having a given number of children at a given age. Uses the 2d age-dependent data format.
birth/ratio_birth
The proportion of children born male. Uses the single input data format.
death/females/cause
The proportions of causes of death for female deaths at a given age. Uses the HICOD notation to enumerate causes of deaths. Uses the 2d-age-dependent-enumerated data format.
death/females/lifetable
The probability of a female dying at a given age. Uses the 1d age-dependent data format.
death/males/cause
The proportions of causes of death for male deaths at a given age. Uses the HICOD notation to enumerate causes of deaths for each column. Uses the 2d-age-dependent-enumerated data format.
death/males/lifetable
The probability of a male dying at a given age. Uses the 1d age-dependent data format.
relationships/marriage
The proportion of children born within a marriage, as opposed to a civil partnership. Uses the 1d age-dependent data format.
relationships/partnering
The proportion of male ages females of a given age will partner with. Female ages for each row, male ages for each column. Uses the 2d age-dependent data format.
relationships/separation
Out of the total number of marriages with children, how many divorce in a given year for each number of children in the marriage. Uses the 2d-age-dependent-data format.
Each end directory (directory without sub directories) represents a property of the population. Within an end directory is any number of input distribution files for that property, often for different years. The files can have any name, but must be located in the correct end directory.
In the following example, the marriage property (which defines the proportion of parents that are married), contains three input distributions from different years. (The different years must be specified in the YEAR
meta field for ValiPop to understand).
└───relationships/
└───marriage/
├───marriage_1938.txt
├───marriage_1953.txt
└───marriage_1973.txt
Each input distribution of a property will apply for a period of time during the simulation. The length of the period is defined by the input_width
option in the config file. ValiPop will divide the given input distributions into these equal periods based on which input distribution is closest to the end time of that period.
For example, using the input distributions defined above in a simulation running from years 1900 to 2000 with an input width of 10 years. The input distributions will be divided over the following periods.
marriage_1953.txt
┌─────┴─────┐
marriage_1938.txt marriage_1973.txt
┌───────────┴───────────┐ ┌───────────┴───────────┐
1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤