Generates synthetic genealogical populations at (small) country scale.
This guide explains how to use ValiPop Factor Search when your population's validation score is too high.
Once the target population has been simulated, the validation phase will determine how similar it is to the input distributions. Due to the inherent randomness of population simulation, sometimes the population may differ noticeably from the input distributions, indicated by a high validation score.
Fortunately, ValiPop provides two configuration options which dynamically compensate for deviations from the input distributions during the simulation runtime. recovery_factor compensates for deviations from one-dimensional input distributions. proportional_recovery_factor compensates for deviations from two-dimensional input distributions. The larger the value, the more strictly ValiPop corrects deviations, where 0 means no corrections are made during the simulation. The default value for both factors is 1, which is typically enough to ensure most populations remain close to the input distributions.
In the unlikely case where more finely tuned recovery factors are needed, the ValiPop repository provides a factor search program to identify effective values for recovery_factor and proportional_recovery_factor.
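As a rough illustration, the two factors might be set in a configuration file as shown below. This is only a sketch: it assumes that ValiPop's configuration file uses key = value lines, the file name config.txt and the chosen values are hypothetical, and only the two option names come from the text above. A complete configuration would need further options.

```shell
# Write a hypothetical configuration fragment (file name and values are illustrative).
cat > config.txt <<'EOF'
# Assumed key = value syntax; check the ValiPop documentation for the exact format.
recovery_factor = 0.5
proportional_recovery_factor = 1.0
EOF

# Show the fragment that was written.
cat config.txt
```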
The factor search program takes a set of configuration properties and a list of recovery factors to test, and generates the configurations with which to simulate the population. It then attempts to simulate each population in parallel using Apache Spark. The validation score of each combination of factors can then be read from the results summary file to determine the best combination.
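To get a feel for how many simulations a search involves, the sketch below enumerates the combinations for two example lists of factors. It assumes that every recovery factor is paired with every proportional recovery factor. The variable names are illustrative and are not part of ValiPop.

```shell
# Candidate values, written the same way they are passed to the program.
recovery_factors="0,0.5,1"
proportional_recovery_factors="0,0.5,1"

count=0
# Split each comma-separated list and pair every value with every other value.
for rf in $(echo "$recovery_factors" | tr ',' ' '); do
  for prf in $(echo "$proportional_recovery_factors" | tr ',' ' '); do
    echo "recovery_factor=$rf proportional_recovery_factor=$prf"
    count=$((count + 1))
  done
done
echo "combinations: $count"
```

Under that assumption the number of simulations grows with the product of the two list lengths, so short lists are a sensible starting point.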
The program accepts 10 arguments:
var_data_files
t0_pop_size
run_purpose
0,0.5,1.0
0,0.5,1.0
results_save_location
summary_results_save_location
ct_tree_precision
project_location
Factor search can be run on your local computer using the ValiPop JAR. It requires the dependencies of the ValiPop JAR, and additionally requires Apache Spark to be installed.
The JAR may then be passed to the spark-submit script included with the Spark installation, with the required argument --class uk.ac.standrews.cs.valipop.implementations.DistributedFactorSearch
The following demonstrates this with additional machine-specific Spark configuration:
# Windows/macOS/Linux terminal
# (Windows may require all arguments to be on the same line)
spark/bin/spark-submit \
--class uk.ac.standrews.cs.valipop.implementations.DistributedFactorSearch \
--master "local[*]" \
--driver-memory 24G \
--conf spark.driver.host=localhost \
--conf spark.driver.port=5055 \
valipop.jar \
src/main/resources/valipop/inputs/synthetic-scotland/ \
10000 \
distributed \
1 \
"0,0.5,1" \
"0,0.5,1" \
results \
results \
1E-66 \
.
The above example runs the factor search in parallel on the local machine.
--master "local[*]" specifies that all available local cores should be used.
--driver-memory 24G specifies the memory available to the Spark driver.
--conf spark.driver.host=localhost and --conf spark.driver.port=5055 together specify the address localhost:5055, which can be visited to view the progress of the search.
Alternatively, the address of a Spark-compatible cluster manager can be given to --master to distribute the program across a networked cluster.
Read about the supported cluster manager types.
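As an illustration of passing a cluster address to --master, the sketch below assumes a Spark standalone cluster, whose master URLs take the form spark://HOST:PORT. The host and port are placeholders for your own cluster manager, and the remaining arguments are copied from the local example.

```shell
# Placeholder address of the cluster manager; replace with your own.
LEADER_HOST=localhost
LEADER_PORT=23177
MASTER_URL="spark://${LEADER_HOST}:${LEADER_PORT}"

# Prints the command for review; remove the leading "echo" to submit the job.
echo spark/bin/spark-submit \
  --class uk.ac.standrews.cs.valipop.implementations.DistributedFactorSearch \
  --master "$MASTER_URL" \
  valipop.jar \
  src/main/resources/valipop/inputs/synthetic-scotland/ \
  10000 distributed 1 "0,0.5,1" "0,0.5,1" results results 1E-66 .
```

Depending on your network, further Spark options such as the driver host may also be needed.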
The ValiPop repository provides some pre-configured Docker images to create a standalone Spark cluster. These include a leader image and a worker image, which contain the dependencies needed to run ValiPop. The images can be installed by running the following commands:
# Windows/macOS/Linux terminal
docker pull ghcr.io/stacs-srg/valipop-leader:master
docker pull ghcr.io/stacs-srg/valipop-worker:master
The leader image takes two arguments when run:
localhost
23177
The worker image takes two arguments when run:
localhost:23177
localhost
These images may then be launched on different machines on the same network to establish the cluster with the following commands:
# Windows/macOS/Linux terminal
# To run the leader
docker run ghcr.io/stacs-srg/valipop-leader:master
# To run the worker
docker run ghcr.io/stacs-srg/valipop-worker:master
The worker additionally runs with the following environment variables set:
SPARK_WORKER_MEMORY=30G: The memory allocated to the worker
SPARK_WORKER_CORES=12: The number of cores available to the worker
SPARK_WORKER_INSTANCES=1: The number of instances per worker
These can be overridden during container execution using the -e option with Docker.
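For example, a worker on a smaller machine could be started with lower limits. This is a sketch: the values 8G and 4 are hypothetical, and -e is Docker's standard option for setting an environment variable in the container.

```shell
# Hypothetical limits for a smaller machine.
WORKER_MEMORY=8G
WORKER_CORES=4

# Prints the command for review; remove the leading "echo" to start the worker.
echo docker run \
  -e "SPARK_WORKER_MEMORY=${WORKER_MEMORY}" \
  -e "SPARK_WORKER_CORES=${WORKER_CORES}" \
  ghcr.io/stacs-srg/valipop-worker:master
```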
The ValiPop repository also provides a Docker image to run the factor search. This can be installed with the following command:
docker pull ghcr.io/stacs-srg/valipop-search:master
The factor search image takes 12 arguments. The first two relate to the cluster management:
local[*]
localhost
The remaining 10 arguments are passed to the factor search and are described earlier.
The image may be run with the following command:
docker run ghcr.io/stacs-srg/valipop-search:master
As Docker runs isolated from your local machine, you may need to mount directories to give Docker access to them. Read more about working with Docker.
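A minimal sketch of mounting a results directory with Docker's -v option follows. The path /results inside the container is an assumption made for illustration. The image's actual working directory may differ, and the image's 12 arguments would follow the image name.

```shell
# Host directory to share with the container (created if missing).
HOST_RESULTS="$(pwd)/results"
mkdir -p "$HOST_RESULTS"
# Hypothetical path inside the container; the image's real layout may differ.
CONTAINER_RESULTS=/results

# Prints the command for review; remove the leading "echo" to run it.
echo docker run \
  -v "${HOST_RESULTS}:${CONTAINER_RESULTS}" \
  ghcr.io/stacs-srg/valipop-search:master
```

With a mount like this, the results location arguments would point at the container-side path so that output written there appears in the host directory.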