--- title: "Available Optimizers: How to Find Maximum A Posteriori?" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Available Optimizers: How to Find Maximum A Posteriori?} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk[["set"]]( collapse = TRUE, comment = "#>", cache = FALSE ) old_options <- options(scipen = 999) # turn off scientific notation ``` ```{r setup} library(gips) ``` ```{r help, eval=FALSE} ?find_MAP() ``` # What are we optimizing? The goal of the `find_MAP()` is to find the permutation $\sigma$ that maximizes the a posteriori probability (MAP - Maximum A Posteriori). Such a permutation represents the most plausible symmetry given the data. This a posteriori probability function is described in-depth in the **Bayesian model selection** section of the `vignette("Theory", package="gips")`, also available as a [pkgdown page](https://przechoj.github.io/gips/articles/Theory.html). `gips` can calculate the logarithm of it by the `log_posteriori_of_gips()` function. In the following paragraphs, we will refer to this a posteriori probability function as $f(\sigma)$. We have $f(\sigma) > 0$. # Available optimizers The space of permutations is enormous - for the permutation of size $p$, the space of all permutations is of size $p!$ ($p$ factorial). Even for $p=12$, this space is practically impossible to browse. This is why `find_MAP()` implements multiple (3) optimizers to choose from: * `"brute_force"`, `"BF"`, `"full"` | recommend for $p\le 9$. * `"Metropolis_Hastings"`, `"MH"` | recommend for $p\ge 10$. * `"hill_climbing"`, `"HC"` ### Note on computation time The `max_iter` parameter functions differently in Metropolis-Hastings and hill climbing. For Metropolis-Hastings, it computes a posteriori of `max_iter` permutations, whereas for hill climbing, it computes ${p\choose 2} \cdot$ `max_iter` of them. In the case of the Brute Force optimizer, it computes all $f(\sigma)$ values. The number of all different $\sigma$s follows [OEIS sequence A051625](https://oeis.org/A051625). ## Brute Force It searches through the whole space at once. This is the only optimizer that will certainly find the actual MAP Estimator. Brute Force is **only recommended** for small spaces ($p \le 9$). It can also browse bigger spaces, but the required time is probably too long. We tested how much time it takes to browse with Brute Force (Apple M2, single core), and we show the time in the table below: | p=2 | p=3 | p=4 | p=5 | p=6 | p=7 | p=8 | p=9 | p=10 | |-----------|-----------|----------|---------|-------|---------|--------|-------|---------------| | 0.005 sec 1 | 0.010 sec | 0.025 sec | 0.075 sec | 0.3 sec | 1.8 sec | 13 sec | 1.8 min | 47 min | ### Example Let’s say we have the data `Z` from the unknown process: ```{r brute_force_2} dim(Z) number_of_observations <- nrow(Z) # 13 perm_size <- ncol(Z) # 6 S <- cov(Z) # Assume we have to estimate the mean g <- gips(S, number_of_observations) g_map <- find_MAP(g, optimizer = "brute_force") g_map ``` Brute Force needed 362 calculations, as predicted in [OEIS sequence A051625](https://oeis.org/A051625) for $p = 6$. ## Metropolis-Hastings This optimizer implements the *Second approach* from [[1, Sec 4.1.2]](https://arxiv.org/abs/2004.03503). It uses the Metropolis-Hastings algorithm to optimize the space; [see Wikipedia](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm). 
## Metropolis-Hastings

This optimizer implements the *Second approach* from [[1, Sec. 4.1.2]](https://arxiv.org/abs/2004.03503). It uses the Metropolis-Hastings algorithm to optimize the space; [see Wikipedia](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm). The algorithm used in this context is a special case of Simulated Annealing, which the reader may be more familiar with; [see Wikipedia](https://en.wikipedia.org/wiki/Simulated_annealing).

### Short description

In every iteration $i$, the algorithm considers a permutation, say, $\sigma_i$. Then a transposition $t_i = (j,k)$ is drawn uniformly at random, and the value of $f(\sigma_i \circ t_i)$ is computed.

* If the new value is bigger than the previous one (i.e., $f(\sigma_i \circ t_i) \ge f(\sigma_i)$), then we set $\sigma_{i+1} = \sigma_i \circ t_i$.
* If the new value is smaller ($f(\sigma_i \circ t_i) < f(\sigma_i)$), then we set $\sigma_{i+1} = \sigma_i \circ t_i$ with probability $\frac{f(\sigma_i \circ t_i)}{f(\sigma_i)}$. Otherwise, we set $\sigma_{i+1} = \sigma_i$ with the complementary probability $1 - \frac{f(\sigma_i \circ t_i)}{f(\sigma_i)}$.

The final value is the best $\sigma$ ever computed.

### Notes

This algorithm was tested in multiple settings and turned out to be an outstanding optimizer for this problem, especially given that it does not need any hyperparameter tuning.

The only parameter it depends on is `max_iter`, which determines the number of steps described above. One should choose this number rationally. If it is too small, one misses the opportunity to find a much better permutation. If it is too big, time and computational power are spent without meaningful improvement. We recommend plotting the convergence with a logarithmic OX scale: `plot(g_map, type = "best", logarithmic_x = TRUE)`, and then deciding whether the line has already flattened. Keep in mind that the OY scale is also logarithmic; for example, a marginal change on the OY scale can mean a change of $10000$ **times** in the a posteriori probability. For more information about continuing the optimization, see the **Continuing the optimization** section below.

This algorithm has been analyzed extensively by statisticians. Thanks to the ergodic theorem, the frequency of visits to a given state converges almost surely to the probability of that state. This is the approach explained in [[1, Sec. 4.1.2]](https://arxiv.org/abs/2004.03503) and shown in [[1, Sec. 5.2]](https://arxiv.org/abs/2004.03503). One can obtain estimates of posterior probabilities by setting `return_probabilities = TRUE`.

### Example

Let's say we have the data `Z` from the unknown process:

```{r Metropolis_Hastings_2}
dim(Z)
number_of_observations <- nrow(Z) # 50
perm_size <- ncol(Z) # 70
S <- cov(Z) # Assume we have to estimate the mean

g <- gips(S, number_of_observations)

suppressMessages( # message from ggplot2
  plot(g, type = "heatmap") +
    ggplot2::scale_x_continuous(breaks = c(1, 10, 20, 30, 40, 50, 60, 70)) +
    ggplot2::scale_y_reverse(breaks = c(1, 10, 20, 30, 40, 50, 60, 70))
)

g_map <- find_MAP(g, max_iter = 150, optimizer = "Metropolis_Hastings")
g_map
```

After just a hundred and fifty iterations, the found permutation is unimaginably more likely than the starting permutation $\sigma_0 = $ `()`.

```{r Metropolis_Hastings_3}
plot(g_map, type = "best", logarithmic_x = TRUE)
```

## Hill climbing

It uses the hill climbing algorithm to optimize the space; [see Wikipedia](https://en.wikipedia.org/wiki/Hill_climbing). It performs local optimization iteratively.

### Short description

In every iteration $i$, the algorithm considers a permutation; call it $\sigma_i$. Then the values of $f(\sigma_i \circ t)$ are computed for every possible transposition $t = (j,k)$. The next permutation $\sigma_{i+1}$ is the neighbor with the biggest value:

$$\sigma_{i+1} = \underset{\sigma' \in \text{neighbors}(\sigma_i)}{\text{argmax}} f(\sigma'),$$

where:

$$\text{neighbors}(\sigma) = \{\sigma \circ (j,k) : 1 \le j < k \le p\}.$$

The algorithm ends when all neighbors are less likely than the current permutation or when `max_iter` is reached. In the first case, the algorithm finishes at a local maximum, but there is no guarantee that this is also the global maximum.

### Pseudocode
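A minimal R-style sketch of the procedure described above (not evaluated). It assumes two hypothetical helpers that are not part of `gips`: `log_f()`, returning $\log f(\sigma)$, and `compose_with_transposition()`, returning $\sigma \circ (j,k)$:

```{r hill_climbing_pseudocode, eval = FALSE}
hill_climb <- function(perm, p, max_iter) {
  for (i in seq_len(max_iter)) {
    best_value <- log_f(perm) # hypothetical helper: log f(sigma)
    best_perm <- NULL
    # Examine all choose(p, 2) neighbors of the current permutation:
    for (j in seq_len(p - 1)) {
      for (k in seq(j + 1, p)) {
        candidate <- compose_with_transposition(perm, j, k) # hypothetical
        if (log_f(candidate) > best_value) {
          best_value <- log_f(candidate)
          best_perm <- candidate
        }
      }
    }
    if (is.null(best_perm)) {
      return(perm) # local maximum: no neighbor is more likely
    }
    perm <- best_perm # move to the most likely neighbor
  }
  perm # max_iter reached before finding a local maximum
}
```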
### Example

Let's say we have the data `Z` from the unknown process:

```{r hill_climbing_2}
dim(Z)
number_of_observations <- nrow(Z) # 20
perm_size <- ncol(Z) # 25
S <- cov(Z) # Assume we have to estimate the mean

g <- gips(S, number_of_observations)
plot(g, type = "heatmap")

g_map <- find_MAP(g, max_iter = 2, optimizer = "hill_climbing")
g_map

plot(g_map, type = "best")
```

The above warnings are expected.

# Continuing the optimization

When `max_iter` is reached during Metropolis-Hastings or hill climbing, the optimization stops and returns the result. Users are encouraged to plot the result and determine whether it has converged. If necessary, the optimization can be continued, as shown below.

```{r continuing_2}
g <- gips(S, number_of_observations)
g_map <- find_MAP(g, max_iter = 50, optimizer = "Metropolis_Hastings")
plot(g_map, type = "best")
```

The algorithm was still significantly improving the permutation. It is reasonable to continue it:

```{r continuing_3}
g_map2 <- find_MAP(g_map, max_iter = 100, optimizer = "continue")
plot(g_map2, type = "best")
```

The improvement has slowed down significantly, so it is fair to stop the algorithm here. Keep in mind that the OY scale is logarithmic: the visually "small" improvement between iterations 100 and 150 multiplied the a posteriori probability by $10^{52}$.

# Additional parameters

The `find_MAP()` function has two additional parameters, `show_progress_bar` and `save_all_perms`, which can be set to `TRUE` or `FALSE`.

When `show_progress_bar = TRUE`, `gips` will print "=" characters on the console during optimization. Remember that when `return_probabilities = TRUE` is set, a second progress bar will indicate the calculation of the probabilities after optimization.

`save_all_perms = TRUE` saves all visited permutations in the returned object, which significantly increases the required RAM. For instance, with $p=150$ and `max_iter = 150000`, we needed 400 MB to store them, whereas `save_all_perms = FALSE` required only 2 MB. However, `save_all_perms = TRUE` is necessary for `return_probabilities = TRUE` and for more complex path analysis.
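For illustration, a sketch of a call combining these parameters (not evaluated here; it assumes the `g` object from the earlier examples, and that the estimated probabilities can be extracted with `get_probabilities_from_gips()`):

```{r additional_parameters_example, eval = FALSE}
g_map <- find_MAP(
  g,
  max_iter = 150,
  optimizer = "Metropolis_Hastings",
  show_progress_bar = TRUE, # prints "=" characters during optimization
  save_all_perms = TRUE, # required for return_probabilities = TRUE
  return_probabilities = TRUE
)
get_probabilities_from_gips(g_map) # estimated posterior probabilities
```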
# Discussion

We are also considering implementing the **First approach** from [1] in the future. In this approach, the Markov chain travels along cyclic groups rather than permutations.

We encourage everyone to discuss available and potential new optimizers in [ISSUE#21](https://github.com/PrzeChoj/gips/issues/21). There, one can also see why some optimizers were implemented but not yet added to `gips`.

# References

[1] Piotr Graczyk, Hideyuki Ishi, Bartosz Kołodziejek, Hélène Massam. "Model selection in the space of Gaussian models invariant by symmetry." The Annals of Statistics, 50(3), 1747-1774, June 2022. [arXiv link](https://arxiv.org/abs/2004.03503); [DOI: 10.1214/22-AOS2174](https://doi.org/10.1214/22-AOS2174)

[2] Adam Chojecki, Paweł Morgen, Bartosz Kołodziejek (`gips` developers). "Learning permutation symmetries with gips in R." Available on [arXiv:2307.00790](https://arxiv.org/abs/2307.00790).

```{r options_back, include = FALSE}
options(old_options) # back to the original options
```