11 Optimisation File

The optimisation file is used to specify the type of optimisation to be used, along with any parameters that are needed for the optimisation algorithm. This file is specified by a “-opt <filename>” command line option when Gadget is started; for example, this would take the optimisation information from a file called “optinfo.txt”:

gadget -l -opt optinfo.txt

There are three types of optimisation algorithms currently implemented in Gadget: one based on the Hooke & Jeeves algorithm, one based on the Simulated Annealing algorithm and one based on the Broyden-Fletcher-Goldfarb-Shanno (“BFGS”) algorithm. These algorithms are described in more detail in the following sections. Gadget can also combine two or more of these algorithms into a single hybrid algorithm, which should result in a more efficient search for an optimum solution.

All the optimisation techniques used by Gadget attempt to minimise the likelihood function. That is, they look for the best set of parameters to run the model with, in order to get the best fit according to the likelihood functions you have specified. Thus, the optimiser is attempting to minimise a single one-dimensional measure of fit between the model output and the data, which can lead to unexpected results.

The optimisation process will tend to produce realistic values for the most significant parameters (typically growth, recruitment in large year classes and fishing selectivity) before finding realistic values for less significant parameters. It should be noted that the parameters governing recruitment and growth in the last few years with data in a simulation will be among the last to settle on realistic values. It is important to ensure that an optimum solution has been reached before accepting the model, especially if the model is to be used to predict the population into the future or to examine the strength of the last few year classes.

11.1 Hooke & Jeeves

11.1.1 Overview

Hooke & Jeeves is a simple and fast optimising method, but somewhat unreliable, and it is often described as a ‘hill climbing’ technique. From the initial starting point the algorithm takes a step in various ‘directions’, and conducts a new model run. If the new likelihood score is better than the old one then the algorithm uses the new point as its best guess. If it is worse then the algorithm retains the old point. The search proceeds in a series of these steps, each step slightly smaller than the previous one. When the algorithm finds a point which it cannot improve on with a small step in any direction then it accepts this point as being the ‘solution’, and exits. It can be seen that this renders the scheme vulnerable to producing local solutions, accepting a local dip as being the solution even if a better solution exists elsewhere, beyond a local ‘hill’ that the algorithm cannot see past. In order to combat this tendency it is strongly recommended that you re-run the optimisation, using the final point of one run as the start of the next. This will effectively re-set the searching step size to large steps, and give Gadget a chance of escaping from local solutions. Finding the same result twice in a row does not guarantee it is the best possible solution, but finding different results certainly indicates that the larger result is not the solution you are seeking.
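To make the mechanics concrete, the following is a minimal sketch of this type of exploratory search in Python (illustrative only, not Gadget’s C++ implementation; the pattern-move stage of the full Hooke & Jeeves algorithm is omitted, and the function f and starting point x0 are placeholders):

import numpy as np

def hooke_jeeves(f, x0, rho=0.5, eps=1e-4, maxiter=1000):
    x = np.asarray(x0, dtype=float)
    best = f(x)
    step = rho                             # initial step length (lambda = 0 means "use rho")
    evals = 1
    while step > eps and evals < maxiter:
        improved = False
        for i in range(len(x)):            # try a step up and down in each 'direction'
            for delta in (step, -step):
                trial = x.copy()
                trial[i] += delta
                score = f(trial)
                evals += 1
                if score < best:           # better point found: keep it
                    x, best = trial, score
                    improved = True
                    break
        if not improved:
            step *= rho                    # no improvement in any direction: shrink the step
    return x, best

Re-running the optimisation from the final point, as recommended above, corresponds to calling hooke_jeeves again with the returned point as the new starting point, which resets the step length to its large initial value.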

The Hooke & Jeeves algorithm used in Gadget is derived from that originally presented by R. Hooke and T. A. Jeeves, “Direct Search Solution of Numerical and Statistical Problems” in the April 1961 (Vol. 8, pp. 212-229) issue of the Journal of the ACM, with improvements presented by Arthur F. Kaupe Jr., “Algorithm 178: Direct Search” in the June 1963 (Vol. 6, pp. 313-314) issue of the Communications of the ACM.

Hooke & Jeeves is the default optimisation method used for Gadget, and will be used if no optimisation information file is specified.

11.1.2 File Format

To specify the Hooke & Jeeves algorithm, the optimisation file should start with the keyword “[hooke]”, followed by (up to) 4 lines giving the parameters for the optimisation algorithm. Any parameters that are not specified in the file are given default values, which work reasonably well for simple Gadget models. The format for this file, and the default values for the optimisation parameters, are shown below:

[hooke]
hookeiter  1000  ; number of hooke & jeeves iterations
hookeeps   1e-04 ; minimum epsilon, hooke & jeeves halt criteria
rho        0.5   ; step length adjustment factor
lambda     0     ; initial value for the step length

11.1.3 Parameters

11.1.3.1 hookeiter

This is the maximum number of Gadget model runs that the Hooke & Jeeves algorithm will use to try to find the best solution. If this number is exceeded, Gadget will select the best point found so far, and accept this as the ‘solution’, even though it has not met the convergence criteria. A warning that Gadget has stopped without finding a solution that meets the convergence criteria will be printed.

11.1.3.2 hookeeps

This is the criterion for halting the Hooke & Jeeves algorithm at a minimum, and accepting the current point as the ‘solution’. The algorithm has “converged” when the size of each step in the search has been reduced to less than this value, and no further improvements can be made. This means that each parameter in the Gadget model is less than hookeeps from its value at the (local) optimum. Larger values of hookeeps give a quicker running time, but a less accurate estimate of the minimum. Conversely, smaller values will give a longer running time, but a more accurate estimate of the minimum.

11.1.3.3 rho

This is the resizing multiplier for the step length of the search made by the algorithm. The Hooke & Jeeves algorithm works by taking “steps” from one estimate of an optimum to another (hopefully better) estimate of the optimum. Taking big steps gets to the minimum more quickly, at the risk of ‘stepping over’ an excellent point. When no improvements to the minimum can be made by taking steps of the current length, the step length is reduced by multiplying it by rho.

Small values of rho correspond to big step length changes, which make the algorithm run more quickly. However, there is a chance that these big changes will accidentally overlook a promising point, leading to non-convergence. Larger values of rho correspond to smaller step length changes, which force the algorithm to carefully examine nearby points instead of optimistically forging ahead, which improves the probability of convergence. The value of rho must be between 0 and 1.

11.1.3.4 lambda

This is the initial value for the size of the steps in the search, which will be used for the first search, before any modification to the step length. All the parameters in the Gadget model are initially scaled so that their value is 1, and the initial search will then look at the points \(1 \pm \lambda\) for the next optimum. Setting lambda to zero will set the initial value for the step length equal to rho. The value of lambda must be between 0 and 1.
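As a small illustration of this scaling (hypothetical names, not Gadget variables), the first exploratory points for each parameter are:

rho, lam = 0.5, 0.0
step = lam if lam > 0 else rho            # a lambda of 0 means the first step length equals rho
first_trials = (1.0 + step, 1.0 - step)   # each parameter is scaled to 1 before the search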

11.2 Simulated Annealing

11.2.1 Overview

Simulated Annealing is a global optimisation method that distinguishes between different local minima. From the initial starting point the algorithm takes a random step in various directions, and conducts a new model run. If the new likelihood score is better than the old one then the algorithm uses the new point as its best guess. If it is worse then the algorithm may accept this point, based on the probabilistic “Metropolis Criteria”, and thus the algorithm can escape from a local minimum. The search proceeds in a series of these steps, with the point that gives the overall lowest likelihood score stored as the “best point”. The algorithm exits when a stable point is found which cannot be improved on with a small step in any direction, and the Metropolis Criteria rejects all the steps away from the current best point. The best point is then accepted as being the ‘solution’.

The Metropolis Criteria will accept a move in an uphill direction (ie. a point with a worse likelihood score) based on a function of both the size of the move and a parameter termed “temperature”, and is given in equation (11.1) below:

\[\begin{equation} \tag{11.1} \begin{aligned} M &= e^{\displaystyle \frac{-\Delta F}{t}} \\ P &= \begin{cases} 1 & \textrm{if } M > r \\ 0 & \textrm{otherwise} \end{cases} \end{aligned} \end{equation}\]

where: \(\Delta F\) is the change in likelihood score, \(t\) is the “temperature” of the algorithm, and \(r\) is a random number between 0 and 1.

Note that when the “temperature” is very high (t \(\rightarrow \infty\)), the Metropolis Criteria will always accept any ‘uphill’ move, and the Simulated Annealing algorithm will simplify to a form of a random search algorithm. Conversely, when the temperature is very low (t \(\rightarrow\) 0), the Metropolis Criteria will always reject any ‘uphill’ move, and the Simulated Annealing algorithm will simplify to a form of a hill-climbing algorithm, similar to Hooke & Jeeves. By slowly reducing the temperature of the algorithm, the number of uphill moves that are accepted is reduced and the algorithm will find a minimum.
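A minimal sketch of the acceptance rule from equation (11.1) in Python (illustrative only; random here is Python’s standard generator, not the one used by Gadget):

import math, random

def metropolis_accept(delta_f, t):
    # delta_f > 0 corresponds to an 'uphill' move (a worse likelihood score)
    if delta_f <= 0:
        return True                        # downhill moves are always accepted
    m = math.exp(-delta_f / t)             # M from equation (11.1)
    return m > random.random()             # accepted when M exceeds a random r in [0, 1)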

The terminology for the Simulated Annealing algorithm comes from a metaphor with the thermodynamic process of cooling molten metals. At very high temperatures, the atoms in molten metals move very fast, and slow considerably at lower temperatures. If the metal is cooled slowly (termed “annealing”), the atoms have time to line up to form a highly ordered, strong, crystalline structure. If the metal is cooled too quickly (“quenching”) the resulting solid will have a weaker, less ordered, structure.

In comparison to the Hooke & Jeeves optimisation algorithm, which performs a ‘local’ stepwise search, Simulated Annealing searches much more widely over the likelihood surface in order to find the best point. By doing this it is less likely than Hooke & Jeeves to be fooled by a local minimum, and more likely to home in on the true optimum. However, the price to be paid for this is that it can take considerably more computer time to reach a solution.

The Simulated Annealing algorithm used in Gadget is derived from that presented by Corana et al., “Minimising Multimodal Functions of Continuous Variables with the ‘Simulated Annealing’ Algorithm” in the September 1987 (Vol. 13, pp. 262-280) issue of the ACM Transactions on Mathematical Software, and Goffe et al., “Global Optimisation of Statistical Functions with Simulated Annealing” in the January/February 1994 (Vol. 60, pp. 65-100) issue of the Journal of Econometrics.

11.2.2 File Format

To specify the Simulated Annealing algorithm, the optimisation file should start with the keyword “[simann]”, followed by (up to) 11 lines giving the parameters for the optimisation algorithm. Any parameters that are not specified in the file are given default values, which work reasonably well for simple Gadget models. The format for this file, and the default values for the optimisation parameters, are shown below:

[simann]
simanniter 2000  ; number of simulated annealing iterations
simanneps  1e-04 ; minimum epsilon, simann halt criteria
t          100   ; simulated annealing initial temperature
rt         0.85  ; temperature reduction factor
nt         2     ; number of loops before temperature adjusted
ns         5     ; number of loops before step length adjusted
vm         1     ; initial value for the maximum step length
cstep      2     ; step length adjustment factor
lratio     0.3   ; lower limit for ratio when adjusting step length
uratio     0.7   ; upper limit for ratio when adjusting step length
check      4     ; number of temperature loops to check

11.2.3 Parameters

11.2.3.1 simanniter

This is the maximum number of Gadget model runs that the Simulated Annealing algorithm will use to try to find the best solution. If this number is exceeded, Gadget will select the best point found so far, and accept this as the ‘solution’, even though it has not met the convergence criteria. A warning that Gadget has stopped without finding a solution that meets the convergence criteria will be printed.

11.2.3.2 simanneps

This is the criterion for halting the Simulated Annealing algorithm at a minimum, and accepting the current point as the ‘solution’. The algorithm has “converged” if the point found at the end of a temperature loop is within simanneps of the best point, and is also within simanneps of the points found at the end of the last ‘check’ temperature loops. This will mean that during each of the last ‘check’ temperature loops, the accepted point has moved less than simanneps away from the overall best point, showing that the best point found so far is stable. Lower values for simanneps will give a more stable estimate of the minimum, at the cost of a longer running time. Note that the convergence criterion is only checked at the end of each temperature loop.
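The rule can be paraphrased as the following check, run at the end of each temperature loop (a sketch with illustrative names, not Gadget’s source; it compares likelihood scores, with history holding the scores from the ends of the previous temperature loops):

def has_converged(f_loop, f_best, history, simanneps=1e-4, check=4):
    # f_loop: score at the end of the current temperature loop
    # f_best: overall best score found so far
    if abs(f_loop - f_best) >= simanneps:
        return False                       # not yet close enough to the best point
    return all(abs(f_loop - f_prev) < simanneps
               for f_prev in history[-check:])   # stable over the last 'check' loops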

11.2.3.3 t

This is the initial value for the “temperature” of the Simulated Annealing algorithm, used to control the speed of the convergence of the algorithm, using the Metropolis Criteria given in equation (11.1) above. High values for the temperature (t \(\gg \Delta F\)) will mean that most of the uphill moves are accepted, and the algorithm will be performing a random search, which is computationally very intensive. Low values for the temperature (t \(\ll \Delta F\)) will mean that most of the uphill moves are rejected and the algorithm will be performing an inefficient local search.

11.2.3.4 rt

This is the temperature reduction factor, used to lower the temperature of the algorithm as it moves closer to the minimum. Higher values here will mean that the temperature is reduced slowly, so that the algorithm will examine the search area more thoroughly, leading to better convergence at a cost of much higher computational time. Lower values of rt will mean that the temperature is reduced quickly, leading to faster convergence, possibly to a local minimum that the algorithm cannot escape from (this is analogous to “quenching” in the thermodynamic metaphor). The value of rt must be between 0 and 1.

11.2.3.5 nt

This is the number of loops that the algorithm will perform before reducing the temperature parameter. Higher values here will mean that the algorithm will explore the current search area more thoroughly, improving the chance of finding a global minimum, at the cost of considerably higher computational time. Conversely, lower values will mean that the current search area is not explored as thoroughly, with a corresponding reduction in computation time. Note that each temperature loop consists of all of the “ns” step loops.

11.2.3.6 ns

This is the number of loops that the algorithm will perform before adjusting the maximum step length parameter. The maximum step length is periodically adjusted so that approximately half of all moves are accepted, since too many, or too few, accepted moves is a waste of computational effort.

When the algorithm adjusts the maximum step length, the ratio of accepted moves to rejected moves since the last adjustment of the maximum step length is taken into account. If more moves are accepted than rejected, the steps are too small for the terrain and the maximum step length will rise, widening the search area. However, if more moves are rejected than accepted, the maximum step length will fall, so the search will focus on the most promising area (see equation (11.2) below).

11.2.3.7 vm

This is the initial maximum step length for the Simulated Annealing algorithm. The maximum step length will be adjusted during the optimisation process. Note that the steps taken by the Simulated Annealing algorithm are not of uniform size, each having a random length of between zero and the maximum step length.

It is important to note that unlike the Hooke & Jeeves algorithm, the parameters are not scaled during the optimisation process, and so the maximum step length should be selected so that the initial search area will cover the area of interest. For efficient use of the Simulated Annealing algorithm, it is recommended that the parameters to be estimated are scaled (by the user) so that they are all approximately the same order of magnitude.

11.2.3.8 cstep, lratio and uratio

These parameters control how the maximum step length for the Simulated Annealing algorithm is adjusted, at the end of each of the “ns” step loops. The algorithm keeps track of the number of moves that are accepted and rejected for each direction, and then adjusts the value of the maximum step length for each direction so that the ratio of accepted to rejected moves is approximately 50:50. The maximum step length for each direction is adjusted according to equation (11.2) below:

\[\begin{equation} \tag{11.2} vm_{i} = \begin{cases} vm_{i} * (1 + \frac{C (R_{i} - U)}{L}) & \textrm{if $R_{i} > U$} \\ vm_{i} / (1 + \frac{C (L - R_{i})}{L}) & \textrm{if $R_{i} < L$} \\ vm_{i} & \textrm{otherwise} \end{cases}\end{equation}\]

where: \(R_{i}\) is the ratio of accepted moves for direction \(i\), \(C\) is the value of the “cstep” parameter, \(U\) is the value of the “uratio” parameter, and \(L\) is the value of the “lratio” parameter.
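A direct transcription of equation (11.2) in Python (illustrative names), for the maximum step length in a single direction:

def adjust_step(vm, ratio, cstep=2.0, lratio=0.3, uratio=0.7):
    # vm: current maximum step length for this direction
    # ratio: fraction of moves accepted in this direction since the last adjustment
    if ratio > uratio:                                        # too many moves accepted:
        return vm * (1.0 + cstep * (ratio - uratio) / lratio) # widen the search
    if ratio < lratio:                                        # too many moves rejected:
        return vm / (1.0 + cstep * (lratio - ratio) / lratio) # narrow the search
    return vm                                                 # acceptable ratio: no change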

11.2.3.9 check

This is the number of temperature loops that the Simulated Annealing algorithm will check to confirm that the current best point that has been found is a stable minimum, so that it can be accepted as a solution.

11.3 BFGS

11.3.1 Overview

BFGS is a quasi-Newton optimisation method that uses information about the gradient of the function at the current point to calculate the best direction to look in to find a better point. Using this information, the BFGS algorithm can iteratively calculate a better approximation to the inverse Hessian matrix, which will lead to a better approximation of the minimum value.

From an initial starting point, the gradient of the function is calculated and then the algorithm uses this information to calculate the best direction to perform a linesearch for a point that is “sufficiently better”. The linesearch that is used in Gadget to look for a better point in this direction is the “Armijo” linesearch. The algorithm will then adjust the current estimate of the inverse Hessian matrix, and restart from this new point. If a better point cannot be found, then the inverse Hessian matrix is reset and the algorithm restarts from the last accepted point.
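For reference, a better approximation to the inverse Hessian matrix \(H\) is typically built up with the standard BFGS update after each accepted step (this is the textbook form from the quasi-Newton literature, shown here for illustration; the exact form used in the Gadget source may differ):

\[ H_{k+1} = \left(I - \frac{s_{k} y_{k}^{T}}{y_{k}^{T} s_{k}}\right) H_{k} \left(I - \frac{y_{k} s_{k}^{T}}{y_{k}^{T} s_{k}}\right) + \frac{s_{k} s_{k}^{T}}{y_{k}^{T} s_{k}} \]

where: \(s_{k}\) is the step taken from the current point to the new point, and \(y_{k}\) is the corresponding change in the gradient vector.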

For a point at a stable minimum, the magnitude of the gradient vector will be zero, since there is no direction that the algorithm can move in to find a better local point. However, finding such a point using an iterative process can take an infinite number of steps, so the algorithm exits when the magnitude of the gradient vector is less than a small number. The current point is then accepted as being the ‘solution’.

The “Armijo” linesearch calculates the stepsize that is to be used to move to a point that is “sufficiently better”, along the search direction vector calculated from the gradient, to be \(\beta^n\), where n is the first integer that satisfies the Armijo rule given by inequality (11.3) below:

\[\begin{equation} \tag{11.3} f(x) - f(x + \beta^{n}d) \geq - \sigma \beta^{n} \nabla f(x)^{T} d\end{equation}\]

where: \(\nabla f(x)\) is the gradient of the function at the current point, and \(d\) is the search direction vector.
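A sketch of this rule in Python (illustrative only; f, grad, x and d are placeholders for the likelihood function, its gradient, the current point and the search direction):

import numpy as np

def armijo_step(f, grad, x, d, sigma=0.01, beta=0.3, max_n=30):
    # Return the first step length beta**n that satisfies inequality (11.3).
    x, d = np.asarray(x, dtype=float), np.asarray(d, dtype=float)
    fx = f(x)
    slope = np.dot(grad(x), d)             # directional derivative; negative for a descent direction
    for n in range(max_n):
        step = beta ** n
        if fx - f(x + step * d) >= -sigma * step * slope:
            return step                    # a 'sufficiently better' point has been found
    return None                            # linesearch failed; BFGS resets and restarts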

In comparison to the other optimising algorithms that are currently implemented in Gadget, BFGS performs a local search, and so it does not cover the wide search area that the Simulated Annealing algorithm can use to look for an optimum. BFGS will usually take more computer time to find an optimum than the Hooke & Jeeves algorithm, since numerically calculating the gradient of the function is computationally very intensive. However, the optimum found by the BFGS algorithm will usually be better than that found by the Hooke & Jeeves algorithm, since a gradient search method is usually more accurate than a stepwise search method.

The BFGS algorithm used in Gadget is derived from that presented by Dimitri P. Bertsekas, “Nonlinear Programming” (\(2^{nd}\) edition, pp. 22-61), published by Athena Scientific. The forward difference gradient algorithm used to calculate the gradient is derived from that presented by Dennis and Schnabel, “Numerical Methods for Unconstrained Optimisation and Nonlinear Equations” (“Classics” edition, published by SIAM).

11.3.2 File Format

To specify the BFGS algorithm, the optimisation file should start with the keyword “[bfgs]”, followed by (up to) 7 lines giving the parameters for the optimisation algorithm. Any parameters that are not specified in the file are given default values, which work reasonably well for simple Gadget models. The format for this file, and the default values for the optimisation parameters, are shown below:

[bfgs]
bfgsiter   10000 ; number of bfgs iterations
bfgseps    0.01  ; minimum epsilon, bfgs halt criteria
sigma      0.01  ; armijo convergence criteria
beta       0.3   ; armijo adjustment factor
gradacc    1e-06 ; initial value for gradient accuracy
gradstep   0.5   ; gradient accuracy adjustment factor
gradeps    1e-10 ; minimum value for gradient accuracy

11.3.3 Parameters

11.3.3.1 bfgsiter

This is the maximum number of Gadget model runs that the BFGS algorithm will use to try to find the best solution. If this number is exceeded, Gadget will select the best point found so far, and accept this as the ‘solution’, even though it has not met the convergence criteria. A warning that Gadget has stopped without finding a solution that meets the convergence criteria will be printed.

11.3.3.2 bfgseps

This is the criterion for halting the BFGS algorithm at a minimum, and accepting the current point as the ‘solution’. The algorithm has “converged” if the magnitude of the gradient vector is less than bfgseps. Lower values of bfgseps will give a better estimate of the minimum, at the cost of a considerably increased running time. At the true minimum, the magnitude of the gradient vector will be zero, since there would be no direction for the algorithm to move in, but this point could take an infinite number of iterations to find.

11.3.3.3 sigma

This is the criterion for stopping the Armijo linesearch at a point that is “sufficiently better”, and recalculating the gradient of the function at this new point. Lower values of sigma will increase the size of the acceptance region of the Armijo linesearch (by relaxing the condition for the inequality, see equation (11.3) above), which will mean that the linesearch will stop earlier, leading to larger steps by the BFGS algorithm. Conversely, higher values of sigma will decrease the size of the acceptance region, which will mean that the BFGS algorithm will take smaller steps. The value of sigma must be between 0 and 1, and should be close to zero.

11.3.3.4 beta

This is the adjustment factor for the Armijo linesearch, used to calculate the size of the step taken by the linesearch to find the next point. Lower values of beta will mean that the size of the step that the algorithm tries to take in the direction of the gradient vector is smaller, but the step is more likely to be in the acceptance region. Conversely, higher values of beta will result in larger steps being tried, but these steps are more likely to be rejected since they may be outside the acceptance region. The value of beta must be between 0 and 1.

11.3.3.5 gradacc, gradstep and gradeps

These parameters control the accuracy that is used for the gradient calculations. The gradient of the function is calculated numerically by Gadget using a forward difference algorithm, as shown in equation (11.4) below:

\[\begin{equation} \tag{11.4} \nabla f(x) \approx \frac{f(x + \delta x) - f(x)} {\delta x}\end{equation}\]

where: \(\delta\) is the value of the “gradacc” parameter

When the BFGS algorithm is reset (that is, if the Armijo linesearch fails to find a better point) the gradient accuracy parameter is made smaller to increase the level of accuracy that is used in the gradient calculations. This is done by multiplying the gradacc parameter by the gradstep parameter, which is a simple reduction factor (and as such must be between 0 and 1). To prevent the gradacc parameter getting too small, the BFGS algorithm will stop once the value of gradacc is less than the value of gradeps. Both gradacc and gradeps must be between 0 and 1, with gradeps smaller than gradacc, and the gradient calculations are more accurate when the gradacc parameter is very small.
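A sketch of the forward difference calculation of equation (11.4), together with the accuracy adjustment, in Python (illustrative names, not Gadget’s source; a real implementation would guard against zero-valued parameters):

import numpy as np

def forward_gradient(f, x, gradacc=1e-6):
    # Equation (11.4): perturb each parameter by the relative amount gradacc
    x = np.asarray(x, dtype=float)
    fx = f(x)
    g = np.zeros(len(x))
    for i in range(len(x)):
        h = gradacc * x[i]                 # the delta-x of equation (11.4)
        xp = x.copy()
        xp[i] += h
        g[i] = (f(xp) - fx) / h
    return g

# When the Armijo linesearch fails, the accuracy is tightened before retrying:
#     gradacc *= gradstep                  # e.g. 1e-06 * 0.5
# and the algorithm stops once gradacc falls below gradeps.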

11.4 Combining Optimisation Algorithms

11.4.1 Overview

This method attempts to combine the global search of the Simulated Annealing algorithm and the more rapid convergence of the local searches performed by the Hooke & Jeeves and BFGS algorithms. It relies on the observation that the likelihood function for many Gadget models consists of a large ‘valley’ in which the best solution lies, surrounded by much more ‘rugged’ terrain.

A period of Simulated Annealing at the start of the optimisation run serves to move the search into this valley, at which point the Simulated Annealing algorithm becomes inefficient so the Hooke & Jeeves algorithm and/or the BFGS algorithm takes over and homes in on a solution within that valley. Hopefully the Simulated Annealing algorithm will have moved the current point to the correct side of any ‘hills’ to avoid the local searches becoming trapped into an unrealistic local minimum.

11.4.2 File Format

To specify a combination of optimisation algorithms, the optimisation file should simply list the algorithms that should be used, along with the parameters for each optimisation algorithm. Any parameters that are not specified in the file are given default values, which work reasonably well for simple Gadget models. As an example, the format for this file used to specify all three optimisation algorithms, and the default values for the optimisation parameters, is shown below:

[simann]
simanniter 2000  ; number of simulated annealing iterations
simanneps  1e-04 ; minimum epsilon, simann halt criteria
t          100   ; simulated annealing initial temperature
rt         0.85  ; temperature reduction factor
nt         2     ; number of loops before temperature adjusted
ns         5     ; number of loops before step length adjusted
vm         1     ; initial value for the maximum step length
cstep      2     ; step length adjustment factor
lratio     0.3   ; lower limit for ratio when adjusting step length
uratio     0.7   ; upper limit for ratio when adjusting step length
check      4     ; number of temperature loops to check
[hooke]
hookeiter  1000  ; number of hooke & jeeves iterations
hookeeps   1e-04 ; minimum epsilon, hooke & jeeves halt criteria
rho        0.5   ; step length adjustment factor
lambda     0     ; initial value for the step length
[bfgs]
bfgsiter   10000 ; number of bfgs iterations
bfgseps    0.01  ; minimum epsilon, bfgs halt criteria
sigma      0.01  ; armijo convergence criteria
beta       0.3   ; armijo adjustment factor
gradacc    1e-06 ; initial value for gradient accuracy
gradstep   0.5   ; gradient accuracy adjustment factor
gradeps    1e-10 ; minimum value for gradient accuracy

It should be noted that the optimisation algorithms will be performed in the order that they are specified in the input file, so for this example the order will be: first Simulated Annealing, second Hooke & Jeeves and finally BFGS.

11.4.3 Parameters

The parameters for this combined optimisation algorithm are the same as for the individual algorithms, and are described in sections 11.1.3 (for the Hooke & Jeeves parameters), 11.2.3 (for the Simulated Annealing parameters) and 11.3.3 (for the BFGS parameters).

11.5 Repeatability

The optimisation algorithms used by Gadget contain a random number generator, used to randomise the order of the parameters (to ensure that the order of the parameters has no effect on the optimum found) and to generate the initial direction chosen by the algorithm to look for a solution. For the Simulated Annealing algorithm, this also affects the Metropolis Criteria used to accept any changes in an ‘uphill’ direction.

The use of a random number generator means that it is unlikely that two optimising runs will produce exactly the same ‘optimum’, even if they start from the same point (although it is hoped that the optima would be very similar if a stable starting point was chosen). This causes a problem with repeatability, since there is no guarantee that an optimising run can be repeated. However, it is possible to ‘seed’ the random number generator with an integer to avoid this problem. To specify a seed for the random number generator, an extra line must be added to the optimisation file, as shown below:

seed       n     ; seed the random number generator with n

It is also possible to specify the value that is used to seed the random number generator from the command line, by starting Gadget with the “-seed <number>” switch. It should be noted that any value for the seed that is specified in the optimisation file will override the value that is specified as a command line option.
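For example, the following would run Gadget using the optimisation information from “optinfo.txt”, with the random number generator seeded with the value 1234:

gadget -l -opt optinfo.txt -seed 1234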