Incorporating observation data into models • gadget3

Each run of a model will generate a likelihood score, by summing the result of any likelihood actions in the model together. To fit a model to observation data, we need to add a likelihood action that will compare observation data to the model.

To do this you either need g3l_catchdistribution() or g3l_abundancedistribution(). As the names suggest, g3l_catchdistribution() will compare catch data of stocks from the provided fleets/predators, g3l_abundancedistribution() will compare total abundance of stocks. Otherwise they are identical.

This comparison is a 2 stage process:

Aggregation: The columns in the observation data are inspected, and converted into an array. Model data is then aggregated into an identically-sized array ready for the next step
Comparison: A comparison function (function_f) is supplied to g3l_*distribution(), to convert to a likelihood score. For example g3l_distribution_sumofsquares(), which compares the relative abundance of each grouping vs. the model, and sums the square of these

We saw the following example in vignette("introduction-single-stock"):

# TODO: This isn't a brilliantly-named example, something else?
# Generate random data
expand.grid(year = 1990:1994, step = 3, area = 'IXa') |>
    # Fill in a number column with total numbers in that year/step/area combination
    mutate(number = runif(n(), min = 10000, max = 100000)) ->
    dist_si_acoustic

actions_likelihood_si_acoustic <- list(

  g3l_abundancedistribution(
    "dist_si_acoustic",
    obs_data = dist_si_acoustic,
    
    stocks = list(fish),
    function_f = g3l_distribution_surveyindices_log(alpha = NULL, beta = 1),
    area_group = area_names,
    report = TRUE,
    nll_breakdown = TRUE),
  NULL)

The first step generates random observation data. We have the following columns:

year: We want to do comparisons in 5 years, 1990..1994. Other years nothing will happen.
step: We compare abundance in the autumn step
area: We only compare with stock within the ‘IXa’ area
number: We compare against number individuals.

There are no age or length columns, so this is total abundance within that area/time.

We use g3l_distribution_surveyindices_log(), which calculates the likelihood score by performing a linear fit using the log scale, the slope (beta) is fixed.

The model optimisation process will minimise the total likelihood score, and in the process fit the model to the observation data.

gadget2<->gadget3 translation

The function of many of gadget2’s likeihood components has been collapsed down into these 2 functions. Here is a summary of how to translate:

CatchDistribution likelihood component
- Use g3l_catchdistribution()
- Set function_f parameter to the matching g3l_distribution_*() function, e.g. g3l_distribution_sumofsquares()
- Supply data with columns year, step, area, age, length, number
SurveyIndices likelihood component
- Use g3l_abundancedistribution()
- Set function_f parameter to g3l_distribution_surveyindices_log() or g3l_distribution_surveyindices_linear()
- Supply data with columns year, step, area, length, number
CatchInKilos likelihood component
- Use g3l_catchdistribution(function_f = g3l_distribution_sumofsquares())
- Supply data with columns year, step, area, fleet, weight

Observation data format

gadget3 bases it’s decision on how to aggregate on the incoming data. Whilst it tries to do the “right thing” in most cases, it’s important to get the shape of this data to match what you require.

In doing so, the incoming data.frame() is converted into an array(). We can use the g3_distribution_preview() function to see what that array will look like, and in doing so see how gadget3 will treat the observation data.

Critically, your data needs to have column names it recognises. The following breaks down what can be by each column.

weight and number columns

At least a weight or number column needs to be supplied. If a number column is present then each value will be compared to number of individuals in that group. If a weight column is present then each value will be compared to total biomass in that group.

The weight is suitable for total catches in kilos, otherwise the number column will generally be used.

year and step columns: Landings, abundance indices

If a year column is given, then catch/abundance will be aggregated by year. Gaps are allowed, if so then no comparisons will be made for that year/step.

The following observations will be compared against the total number of indivduals caught in years 1999, 2000, 2002, 2003:

g3_distribution_preview(read.table(header = TRUE, text="
year  number
1999    1000
2000    1002
2002    1004
2003    1008
"))

##        time
## length  1999 2000 2002 2003
##   0:Inf 1000 1002 1004 1008

If step is given, then it will restrict it to that step within the year (see ?g3a_time for how to define steps in a year).

In the following example, we will aggregate spring of 1999 & 2000, autumn of 2001. Any catch in spring of 2001, or other periods not mentioned, will be ignored:

g3_distribution_preview(read.table(header = TRUE, text="
year step number
1999    2   1020
2000    2   2040
2001    3   1902
"))

##        time
## length  1999-02 2000-02 2001-03
##   0:Inf    1020    2040    1902

length column: Length distribution data

Adding a length column will aggregate catch/abundance data by the same length bins as in the observation data.

For instance:

g3_distribution_preview(read.table(header = TRUE, text="
year  length number
1999  [0,10)  1023
1999 [10,20)  2938
1999 [20,30)  3948
1999 [30,40)  3855
2000  [0,10)  1023
2000 [10,20)  2938
# NB: No [10,30)
2000 [30,40)  3855
"))

##        time
## length  1999 2000
##   0:10  1023 1023
##   10:20 2938 2938
##   20:30 3948   NA
##   30:40 3855 3855

Note that unlike with the year & step columns, here gaps in data are preserved, in the preview output we see NA for the missing year & length combination. By default we will compare to 0 at this point, this behaviour is controlled with the missing_val parameter to ?g3l_catchdistribution.

Length aggregations do not have to be hand-crafted like we do above, a length column could be generated using dplyr::group_by() and cut():

# Generate an unaggregated length distribution
ldist.lln.raw <- data.frame(
    year = c(1999, 2000),
    length = sample(10:75, 100, replace = TRUE),
    number = 1,
    stringsAsFactors = FALSE)

# Group length into 10-long bins
ldist.lln.raw |> dplyr::group_by(
  year = year,
  length = cut(length, breaks = seq(10, 100, by = 10), right = FALSE)
) |> dplyr::summarise(number = sum(number), .groups = 'keep') -> ldist.lln.agg
ldist.lln.agg

## # A tibble: 14 × 3
## # Groups:   year, length [14]
##     year length  number
##    <dbl> <fct>    <dbl>
##  1  1999 [10,20)     11
##  2  1999 [20,30)      5
##  3  1999 [30,40)      7
##  4  1999 [40,50)     11
##  5  1999 [50,60)      8
##  6  1999 [60,70)      7
##  7  1999 [70,80)      1
##  8  2000 [10,20)      3
##  9  2000 [20,30)      6
## 10  2000 [30,40)     14
## 11  2000 [40,50)     10
## 12  2000 [50,60)      9
## 13  2000 [60,70)      6
## 14  2000 [70,80)      2

# NB: The last 2 bins are empty, but because cut() creates a factor column,
#     gadget3 knows about them even though they don't appear in the data.
g3_distribution_preview(ldist.lln.agg)

##         time
## length   1999 2000
##   10:20    11    3
##   20:30     5    6
##   30:40     7   14
##   40:50    11   10
##   50:60     8    9
##   60:70     7    6
##   70:80     1    2
##   80:90    NA   NA
##   90:100   NA   NA

gadget3 will also automatically read the aggregation attributes used by ?mfdb::mfdb_sample_count.

# Import data into a temporary database
library(mfdb)
mdb <- mfdb(tempfile(fileext=".duckdb"))
ldist.lln.raw$month <- 1
ldist.lln.raw$areacell <- 'all'  # NB: We have to have an areacell mapping for MFDB
mfdb_import_area(mdb, data.frame(name = c('all'), size = c(5)))
mfdb_import_survey(mdb, ldist.lln.raw)

# Use mfdb_sample_count to extract & group in the same manner as above
ldist.lln.agg <- mfdb_sample_count(mdb, c('length'), list(
    year=1999:2000,
    length = mfdb_interval("len", seq(10, 100, by = 10)) ))[[1]]
g3_distribution_preview(ldist.lln.agg, area_group = c(all=1))

mfdb_disconnect(mdb)

age column: Age-Length distribution data

Age-length aggregations can be performed by adding an age column in a very similar manner to the length column:

We can both group by individual age values:

g3_distribution_preview(read.table(header = TRUE, text="
year  age  length number
1999    1  [0,10)  1026
1999    1 [10,20)  2936
1999    1 [20,30)  3962
1999    1 [30,40)  3863
1999    2  [0,10)  1026
1999    2 [10,20)  2936
1999    2 [20,30)  3962
1999    2 [30,40)  3863
"))

## , , time = 1999
## 
##        age
## length  age1 age2
##   0:10  1026 1026
##   10:20 2936 2936
##   20:30 3962 3962
##   30:40 3863 3863

…or group ages together:

g3_distribution_preview(read.table(header = TRUE, text="
year  age    length number
1999  [1,1]    [0,10)  1026
1999  [1,1]   [10,20)  2936
1999  [1,1]   [20,30)  3962
1999  [1,1]   [30,40)  3863
1999  [2,4]  [0,10)  1011
# Missing [2,4] [10,20)
1999  [2,4] [20,30)  3946
1999  [2,4] [30,40)  3872
"))

## , , time = 1999
## 
##        age
## length   1:1  2:4
##   0:10  1026 1011
##   10:20 2936   NA
##   20:30 3962 3946
##   30:40 3863 3872

As before, gaps in data are preserved, and missing_val is used to decide what to do with them.

Again, gadget3 will also interpret aggregation generated by group_by(age = cut(...)) or ?mfdb::mfdb_sample_count.

area column

If a stock is divided up into multiple areas, then data can be broken down by area

area_names <- g3_areas(c('IXa', 'IXb', 'IXc'))

g3_distribution_preview(read.table(header = TRUE, text="
year    area    number
1999   IXa   1000
1999   IXb   4305
2000   IXa   7034
2000   IXb   2381
2001   IXb   3913
"), area_group = area_names)[length = '0:Inf',,]

##       area
## time    IXa  IXb
##   1999 1000 4305
##   2000 7034 2381
##   2001   NA 3913

As before, gaps in data are preserved, and missing_val is used to decide what to do with them.

However, if an area isn’t mentioned at all (note that IXc does not figure in the above data), then it won’t be compared.

stock column: Maturity stage distribution

If you have multiple stocks, for example because you have divided up your species into mature and immature substocks, you can use this division in likelihood components also:

st_imm <- g3_stock(c(species = 'fish', 'imm'), 1:10)
st_mat <- g3_stock(c(species = 'fish', 'mat'), 1:10)

g3_distribution_preview(read.table(header = TRUE, text="
year    stock    number
1999   fish_imm   1000
1999   fish_mat   4305
2000   fish_imm   7034
2000   fish_mat   2381
2001   fish_mat   3913
"), stocks = list(st_imm, st_mat))[length = '0:Inf',,]

##           time
## stock      1999 2000 2001
##   fish_imm 1000 7034   NA
##   fish_mat 4305 2381 3913

The stock names have to match what gadget3 uses, or an error will be generated.

You can also use partial stock names, for example:

stocks <- list(
   g3_stock(c(species = 'fish', 'imm', 'f'), 1:10),
   g3_stock(c(species = 'fish', 'imm', 'm'), 1:10),
   g3_stock(c(species = 'fish', 'mat', 'f'), 1:10),
   g3_stock(c(species = 'fish', 'mat', 'm'), 1:10) )

drop(g3_distribution_preview(read.table(header = TRUE, text="
year    stock    number
1999   imm   1000
1999   mat   4305
2000   imm   7034
2000   mat   2381
"), stocks = stocks))

##      time
## stock 1999 2000
##   imm 1000 7034
##   mat 4305 2381
## attr(,"stock_map")
## fish_imm_f fish_imm_m fish_mat_f fish_mat_m 
##      "imm"      "imm"      "mat"      "mat"

The imm columns will compare to the sum of fish_imm_f & fish_mat_f.

The parts do not have to be in order, the following is also valid:

drop(g3_distribution_preview(read.table(header = TRUE, text="
year    stock    number
1999   fish_f   1000
1999   fish_m   4305
2000   fish_f   7034
2000   fish_m   2381
"), stocks = stocks))

##         time
## stock    1999 2000
##   fish_f 1000 7034
##   fish_m 4305 2381
## attr(,"stock_map")
## fish_imm_f fish_imm_m fish_mat_f fish_mat_m 
##   "fish_f"   "fish_m"   "fish_f"   "fish_m"

NB: A stock can only appear in one grouping:

# NB: Wrong!
drop(g3_distribution_preview(read.table(header = TRUE, text="
year    stock    number
1999   f         1000
1999   imm       4305
2000   f         7034
2000   imm       2381
"), stocks = stocks))

##      time
## stock 1999 2000
##   f   1000 7034
##   imm 4305 2381
## attr(,"stock_map")
## fish_imm_f fish_imm_m fish_mat_f fish_mat_m 
##        "f"      "imm"        "f"         NA

fish_imm_f will only appear in the f rows, not imm rows. In this case, multiple likelhood components would be a better approach.

As before, gaps in data are preserved, and missing_val is used to decide what to do with them.

fleet column

fleets <- list(
    g3_fleet(c('comm', country = 'se')),
    g3_fleet(c('comm', country = 'fi')),
    g3_fleet(c('surv', country = 'se')) )

g3_distribution_preview(read.table(header = TRUE, text="
year    fleet    number
1999   comm      1000
1999   surv_se   4305
2000   comm      7034
2000   surv_se   2381
2001   surv_se   3913
"), fleets = fleets)[length = '0:Inf',,]

##          time
## fleet     1999 2000 2001
##   comm    1000 7034   NA
##   surv_se 4305 2381 3913

The name matching works in the same way as stocks above, and should be either the name gadget3 uses or parts of it.

As before, gaps in data are preserved, and missing_val is used to decide what to do with them.

predator column

predators <- list(
    g3_stock(c('seal', 'imm', 'f'), 10:20),
    g3_stock(c('seal', 'mat', 'f'), 10:20),
    g3_stock(c('seal', 'imm', 'm'), 10:20),
    g3_stock(c('seal', 'mat', 'm'), 10:20) )

drop(g3_distribution_preview(read.table(header = TRUE, text="
year    predator    number
1999   seal_f   1000
1999   seal_m   4305
2000   seal_f   7034
2000   seal_m   2381
"), predators = predators))

##         time
## predator 1999 2000
##   seal_f 1000 7034
##   seal_m 4305 2381
## attr(,"predator_map")
## seal_imm_f seal_mat_f seal_imm_m seal_mat_m 
##   "seal_f"   "seal_f"   "seal_m"   "seal_m"

The name matching works in the same way as stocks above, and should be either the name gadget3 uses or parts of it.

As before, gaps in data are preserved, and missing_val is used to decide what to do with them.