Design of experiments (DoE) in machine learning research

May 4, 2020

In this post, we will study the applicability of design of experiments (DoE) in machine learning (ML) experiments, to do so we will use a machine learning paper as a case study. I assume that the reader is familiar with RNN’s. For a simple introduction to factorial designs with replication, I consider these slides a great resource. Some starting code in python for factorial designs can be oun here. All the code necessary to reproduce these experiments can be found on here.

Motivation

Machine learning (ML) is kind of a special science field, while sometimes unnecessary mathematical and theoretical and many times rightfully so, the majority of machine learning progress is driven by experiments, either by proposing new architectures, new optimizers, or new paradigms of training. As a mainly experimental field, it is of extreme importance to carry out rigorous experiments to support the claims that cannot be proven mathematically and to make sure that results cannot be explained by alternative hypotheses other than our own.

Recently, there have been concerns about the reproducibility or applicability of a significant proportion of published research in ML. Some of these include the work of Henderson et. al. [7] that concludes that recent advancements in deep reinforcement learning might not be significant due to problems in experimental design and evaluation. Likewise the work of Musgrave et. al. [8] which has cast doubt at advancements in metric learning caused by several flaws in experiment design, unfair comparisons, and testing set leakage. These are not isolated events as it is easy to find other examples in other sub-areas of ML [9, 10].

These problems are not unique to ML research, in fact fields such as psychology where lack of experimental rigor that have been going for years have eroded trust in the entire field [11]. In ML, the trends driving these problems are recent and have been identified by some scholars (shamefully taken from this excellent paper [6] from professor Zachary Lipton); these are:

Failure to distinguish between explanation and speculation.
Failure to identify the sources of empirical gains, e.g. emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning.
Mathiness: the use of mathematics that obfuscates or impresses rather than clarifies, e.g. by confusing technical and non-technical concepts.
Misuse of language, e.g. by choosing terms of art with colloquial connotations or by overloading established technical terms.

Int this post we only focus on trends 1 and 2.

Disclaimer

The fact that these problems have been identified by people from within the community is itself a great step. Likewise, many researchers have already started suggesting solutions to these problems by proposing workflows for lab collaboration and reproducibility [12], by proposing widespread use of statistical tools like ablation studies [6], significance testing and other statistical methods. In some sense, this blog post can be seen as a lightweight version of the work of Boquet et. al. [13] (if you have the time please read this paper), where they propose the use of design of experiments (DoE) which can help address the failure to identify the factors of empirical variation and failure to distinguish between explanation and speculation highlighted by [6]. Unlike them, I only choose one paper [1] to provide a concrete example, while I understand that the results of presented in this blog can be seen as an attack on the authors and I do stand by the claims made here, it is not my intention to, in any way attack the authors of said paper. Finally, I try to end on the positive note, namely that this work has inspired more research [2, 3, 4] tipping the balance of knowledge in science in a positive way.

Case Study: Using Fast Weights to Attend to the Recent Past [1]

What are fast weights?

Fast Weights extend standard vanilla recurrent neural network architectures with an associative memory. In the context of this paper, the authors identify two types of memory in traditional recurrent neural networks, hidden activity vectors \(\mathbf{h}_t\), that are updated every time-step, and serve as short-term memory and slow weights (traditional weights matrices) that are updated at the end of a batch and that have more memory capacity. The authors motivate a third type of memory called fast weights that has much higher storage capacity than the neural activities but much faster dynamics than the standard slow weights [1]. (We note, as the author did, that these concepts were developed much early in [14] and [15])

The author also give biological motivations for the concept of fast weights, namely that human do not store exact patterns of neural activity as memory, instead memory retrieval involves reconstructing neural patterns through a set of associative weights which can map to many other memories as well.


Figure 1: The fast associative memory model. Extracted from [1].

Figure 1 shows a diagram of how fast weights affects hidden activity vector. After hidden activity \(\mathbf{h}_t\) is computed a brief iterative settling process (of size \(S\)) is started, during this process a fast weight matrix \(\mathbf{A}\) is updated using a form of Hebbian short-term synaptic plasticity (outer product)

\[\begin{aligned} \mathbf{A}_t = \lambda \mathbf{A}_{t-1} + \eta \mathbf{h}_t \mathbf{h}_t^\intercal, \end{aligned} \tag{1}\label{1}\]

where \(\lambda\) and \(\eta\) are called decay rate and fast learning rate respectively. \(\mathbf{A}_t\) (assumed to be zero at the start of the sequence) maintains a dynamically changing short-term memory of the recent history of hidden activities in the network.

The next hidden activity is computed unrolling an inner loop of size \(S\) that progressively changes the hidden state (red path in Figure 1) using the input \(x_t\) and the previous hidden vector. At each iteration of the inner loop, the fast weight matrix is exactly equivalent to attention mechanism between past hidden vectors and the current hidden vector, weighted by a decay factor [1]. The final equation for the model is

\[\begin{aligned} \mathbf{h}_{t+1} = f\left(\mathcal{LN}\left[\mathbf{W}_h \mathbf{h}_{t} + \mathbf{W}_x \mathbf{x}_{t} + (\eta \sum_{\tau=1}^{\tau=t-1} \lambda^{t - \tau -1} \mathbf{h}_{\tau} \mathbf{h}_{\tau}^\intercal ) f(\mathbf{W}_h \mathbf{h}_{t} + \mathbf{W}_x \mathbf{x}_{t})\right]\right) \end{aligned} \tag{2}\label{2}\]

where \(\mathcal{LN}(.)\) refers to layer normalization (LN) and the unrolling is run for \(S=1\) steps.

Main claim of the fast weights paper

Using four experiments the authors try to justify the advantages of fast weights over traditional recurrent architectures. These experiments are associative retrieval, MNIST classification using visual glimpses, Facial expression recognition using visual glimpses and reinforcement learning. Results of these experiments seem to suggest that the incorporated fast weight matrix is the sole responsible for the observed superior performance. However, there are two factors of variation not accounted for in the paper by Ba et. al. [1]. I am not the first person to identity these factors in fact researcher Emin Orhan is the first to identify these problems in [5] (His blog is great, you should definitely check it out). These factors are:

As proposed in equation \(\eqref{2}\) the model has more depth than standard recurrent architectures. In [5] Orhan noted that as proposed this architecture is not biologically plausible and that there are ways to incorporate the fast weight matrix without increasing the effective depth. This is in fact how the original fast weights were proposed in [14].
Layer normalization has been shown to improve the performance of vanilla recurrent networks and no classical RNN with layer normalization or fast weight RNN without are tested in the paper, this implies that some of the improvement is due to LN.
Ba et. al. [1] hypothesize that fast weights allows RNN’s to use their recurrent units more effectively, allowing to reduce the hidden vector size without harming performance. To show this the authors compare with an LSTM, but the comparison should be carried out using standards RNN to see if the performance gains are not due to factors (1) and (2) or better initialization schemes, or the use of the optimizer.

The role of Design of experiments (DoE)

Quoting the work of [13]:

We control almost completely the environment where the experiments are run and thus the data-generating process, we can define a specific design to reason about statistical reproducibility while comparing the results of different runs of different algorithms.

For our purposes, this design is the result of having formulated three hypotheses about different factor that could explain the superior performance seen in [1] rather than the fast weight matrix. We turn to Design of experiments (DoE) for a framework that will allow us to test these hypotheses. More specifically we will perform a \(2^k r\) factorial designs with replications where \(k\) is the number of factors and \(r\) the number of replications. In our simple case we will assume that our observations are i.i.d, this turns the problem of estimating the effects of each factor into a linear regression model with a binary explanatory variable

\[\begin{aligned} \mathbf{y} = \mathbf{X}^\intercal \mathbf{q} + \mathbf{q_0} + \epsilon \end{aligned} \tag{3}\label{3}\]

where \(\mathbf{y}\) is vector of responses, \(\mathbf{X}\) is a binary matrix that encodes the factors and their iterations (linear, quadratic, cubic), \(\mathbf{q}\) and \(\mathbf{q_0}\) are called fixed effects and \(\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) is random noise. We the must perform experiments for all factor combinations (and interactions), namely \(2^k\).

For our case study these factors are:

DEPTH: a binary variable that represent whether we use \(\eqref{2}\) which increases the overall depth of the network or the following: \(\begin{aligned} \mathbf{h}_{t+1} = f\left(\mathcal{LN}\left[ \left(\mathbf{W}_h + \eta \sum_{\tau=1}^{\tau=t-1} \lambda^{t - \tau -1} \mathbf{h}_{\tau} \mathbf{h}_{\tau}^\intercal \right) \mathbf{h}_{t} + \mathbf{W}_x \mathbf{x}_{t}\right]\right) \end{aligned} \tag{4}\label{4}\) which doesn’t increase the effective depth and would be more biologically plausible.
LN: a binary variable for Layer normalization
HS: a binary variable that encodes the hidden size of the network (64, 128)

For control, we perform two more experiments dubbed CTRL with no fast weights and no LN, only varying the hidden size. Every experiment is repeated three times (More would not be possible for me due to limited computational budget).

The task that we choose, to perform to test our hypotheses, is associative retrieval. We start with various key-value pairs in a sequence. At the end of the sequence, one of the keys is presented and the model must predict the value that was temporarily associated with the key. Like Ba et. al. [1] we used strings that contained characters from English alphabet as keys, together with the digits 0 to 9 as values.

Input string	Target
c9k8j3f1??c	9
j0a5s5z2??a	5

We followed the same experimental protocol as [1] and generated 100000 training examples and 10000 validation examples. Figure 2 shows the results on the validation set of all models created after all factor combinations.

<b>Figure 2</b>: Cross entropy and accuracy of all models created after all factor combinations. — **Figure 2**: Cross entropy and accuracy of all models created after all factor combinations.

Right from the start, we observe the following in Figure 2:

The model with all factor combinatios RNN-FW-LN-DEPTH-HS=128 reaches the highest accuracy basically solving the task.
Simple RNN models, RNN-CTRL-HS=64 and RNN-CTRL-HS=128 are a powerful baseline reaching median accuracies of 67% and 71%, respectively.
The simple RNN models RNN-CTRL-HS=64 and RNN-CTRL-HS=128 reach superior performance than their fast weights counterparts with no increased depth RNN-FW-HS=64 and RNN-FW-HS=128. Although statistical test suggest that the diference is not significant with a p-value of 0.9766 for HS=64 and 0.8672 of HS=128. This would suggest that fast weights with no extra depth and a simple RNN would reach the same accuracy on this task.
Layer normalization seems to be hurting the model RNN-FW-LS-HS=64 but not its counterpart RNN-FW-LS-HS=128, which in fact reaches higher accuracies than the simple RNN baselines but this difference is still not significant with a p-value of 0.3253. This would suggest that fast weights are not in fact different from simple RNN in how the two use efficiently their weight connections.

In order to calculate effects and percentage of variance explained by the models, we use as response the increase in accuracy of the fast weights model over the average of the control models, this is \(y_{\text{FW-MODEl}} / \text{AVG}(y_{\text{CTRL-MODEl}})\). We summarize the result of the design in Table 1.

Factor	Effects	%Variance	Conf. Interval
I	0.162	---	(0.116, 0.209)
LN	0.060	1.429	(0.013, 0.106)
DE	0.388	60.773	(0.342, 0.435)
HS	0.227	20.779	(0.180, 0.274)
LN-DE	0.069	1.923	(0.022, 0.116)
LN-HS	0.103	4.261	(0.056, 0.150)
DE-HS	-0.021	0.184	(-0.068, 0.025)*
LN-DE-HS	-0.059	1.392	(-0.106, -0.012)
Error	---	9.259	---
Table 1: Effects and variance explained by the models. *=not significant

From Table 1, we observe:

The biggest effect is due to the increased depth, this means that as we increase this factor level from a model with no extra depth to a model with extra depth we will see an increase of in accuracy of the fast weights model over the control models. This factor accounts for 60% of the variance observed in the response, meaning that most of the observed increase in performance in this task is due to the extra depth.
Layer normalization plays a similar role as depth, accounting for 20% of the variance observed in the response.
Two-ways effects have a positive effect on the response, except for the combination DEPTH-HS, which has no significant effect.
Factor of variation not considered (learning rate, initialization, optimizer and other hyperparameters) account for 10% of the observed variation in the response.

Finally, we can plot the response surface which offer a visual alternative to Table 1.

<b>Figure 3</b>: Response surface plot — **Figure 3**: Response surface plot

Fast weights: Do they really work as advertised?

Note: The title of this section is a reference to [5] which first suggested that the increased performance observed in the paper of Ba et. al. [1] might actually be to factors not accounted in the authors experimental design. We note that the author of [5], unlike us, did not carry out the experiments necessary to prove this.

On this final section of our case study, I am inclined to answer that “it depends”. I think that fast weights as introduced in the paper do work, they increase the performance on the task, and reach higher accuracy faster than baseline methods. Our experimental design seems to strongly indicate that this increase in performance is mainly due to the extra depth added to the model. While the author of [5] seems to suggest that the Ba et. al. have fallen in trends 1 and 2 identified in [6]. I instead suggest that Ba et. al. only fail to identify the sources of empirical gains. Does this invalidate the results of the paper? Not at all, I consider fast weights as a type of self-attention mechanism, unlike common self-attention mechanisms which use the scaled dot-product, fast weights use the outer-product becoming a sort of “outer-product self-attention”. The developments of fast weights have followed a similar fashion to its dot-product counter parts which too started with very little to no extra depth added [16], and now extra depth is such an important part of the model that it is made explicit with extra matrices to modulate the output (these matrices are the key, query and value weights) [17]. In this vein, we see a similar story, of incrementing fast weights with extra depth, like in [2] where they replace equation \(\eqref{1}\) with

\[\begin{aligned} \mathbf{A}_t = \mathbf{W}_A \odot \mathbf{A}_{t-1} + \mathbf{W}_H \odot \mathbf{h}_t \mathbf{h}_t^\intercal, \end{aligned} \tag{5}\label{5}\]

which according to the author allows the network to intelligently distribute inputs in \(\mathbf{A}_t\) to increase memory capacity. Similarly, the work of [3] which leverages several outer products to create tensors of higher orders embedded even with mode depth and capacity which as the author hypothesize introduce the combinatorial bias necessary to solve relational tasks. Finally, we note the work of [4] which update the original fast weights with keys, value and queries pair similar to the attention mechanism used in transformers, [17] but using outer-product rule instead of dot-product, which allows them to be state of the art in Question Answering on bAbi.

Conclusion

In this post, we have outlined the use of Design of Experiments (DoE) as tool for machine learning researches. We motivate its use with a discussion on recent troublesome trend in ML scholarship. We showed the strengths of DoE using a case study where we tested the claims of a ML paper [1] using DoE and found that the observed increases in performance where due to factors not accounted in original experimental design. This methodology can be extended by including other hyper-parameters such as learning rate, optimizer, initialization schemes, random seeds, among others; and replacing the simple linear model used here with a hierarchical model as done in [13]. The response surface methodology shown here can be used to select optimal combination of hyper-parameters in ML experiments as well as to give clarity into the true sources of empirical gains.

Cited as:

@article{hernandez2020experiments,
  title   = "Design of experiments (DoE) in machine learning research",
  author  = "Hernandez, Jefferson",
  journal = "https://jeffhernandez1995.github.io/",
  year    = "2020",
  url     = "https://jeffhernandez1995.github.io/design_of_experiments,/machine_learning,/english/2020/05/04/fastweights/"
}

References

[1] Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., & Ionescu, C. (2016). Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems (pp. 4331-4339).

[2] Zhang, W., & Zhou, B. (2017). Learning to update auto-associative memory in recurrent neural networks for improving sequence memorization. arXiv preprint arXiv:1709.06493.

[3] Schlag, I., & Schmidhuber, J. (2018). Learning to reason with third order tensor products. In Advances in neural information processing systems (pp. 9981-9993).

[4] Le, H., Tran, T., & Venkatesh, S. (2020). Self-Attentive Associative Memory. arXiv preprint arXiv:2002.03519.

[5] Orhan, E. (2017). A note on fast weights: do they really work as advertised?. url.

[6] Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341.

[7] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018, April). Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

[8] Musgrave, K., Belongie, S., & Lim, S. N. (2020). A metric learning reality check. arXiv preprint arXiv:2003.08505.

[9] Lucic, M., Kurach, K., Michalski, M., Gelly, S., & Bousquet, O. (2018). Are gans created equal? a large-scale study. In Advances in neural information processing systems (pp. 700-709).

[10] Melis, G., Dyer, C., & Blunsom, P. (2017). On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.

[11] Open S1ience Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

[12] Chen, X., Dallmeier-Tiessen, S., Dasler, R., Feger, S., Fokianos, P., Gonzalez, J. B., … & Rodriguez, D. R. (2019). Open is not enough. Nature Physics, 15(2), 113-119.

[13] Boquet, T., Delisle, L., Kochetkov, D., Schucher, N., Atighehchian, P., Oreshkin, B., & Cornebise, J. (2019). DECoVaC: Design of Experiments with Controlled Variability Components. arXiv preprint arXiv:1909.09859.

[14] Hinton, G. E., & Plaut, D. C. (1987, July). Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society (pp. 177-186).

[15] Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1), 131-139.

[16] Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

[17] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).