HiRE: High Recall Approximate Top$k$ Estimation for Efficient LLM Inference
Abstract
Autoregressive decoding with generative Large Language Models (LLMs) on accelerators (GPUs/TPUs) is often memorybound where most of the time is spent on transferring model parameters from high bandwidth memory (HBM) to cache. On the other hand, recent works show that LLMs can maintain quality with significant sparsity/redundancy in the feedforward (FFN) layers by appropriately training the model to operate on a top$k$ fraction of rows/columns (where $k\approx 0.05$), there by suggesting a way to reduce the transfer of model parameters, and hence latency. However, exploiting this sparsity for improving latency is hindered by the fact that identifying top rows/columns is datadependent and is usually performed using full matrix operations, severely limiting potential gains. To address these issues, we introduce HiRE (High Recall Approximate Top$k$ Estimation). HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) $\textrm{DATOP}k$: an efficient multidevice approximate topk operator. We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device (Cloud, ).
1 Introduction
Deploying large language models (LLMs) is almost prohibitively expensive due to cost of running inference on accelerators and also has significant environmental costs. For example, (Patterson et al., 2021) attribute $8090\%$ of the total carbon emissions during the life cycle of an LLM to running inference.
Latency/cost of generative LLMs is dominated by the autoregressive nexttoken generation which inturn is memorybound on standard accelerators (GPU/TPUs) due to shuttling large matrices from highbandwidth memory (HBM) to the accelerator cache. Several recent approaches seek to address this challenge by using block parallel decoding (Stern et al., 2018), speculative decoding (Leviathan et al., 2023), mixture of experts (Du et al., 2022), more efficient architectures (Gu & Dao, 2023) as well as by using classical methods such as distillation (Liang et al., 2021).
Despite these techniques, generative LLM inference is still primarily memorybound, indicating significant room for latency/cost reduction. Furthermore, for typical input length (e.g. $\lessapprox 2000$), and even for models with up to one billion parameters, softmax and feedforwad (FFN) layers account for $\geq 90\%$ of the latency. So in this work, we primarily focus on these two components^{1}^{1}1while our ideas are applicable to the attention layer as well, we leave a more detailed investigation of that to future work.. Multiple recent works show that LLMs have inherent “sparsity” or redundancy which seem to actually grow with larger models. For example, (Li et al., 2022; Mirzadeh et al., 2023) show that the feedforward (FFN) layers activations – with ReLU based activation functions – are very sparse. More concretely, for a $1$hidden layer FFN: $\overrightarrow{\mathbf{x}}_{\textrm{out}}=\sum_{j=1}^{m}\phi(\langle% \overrightarrow{\mathbf{u}}_{j},\overrightarrow{\mathbf{x}}\rangle)% \overrightarrow{\mathbf{v}}_{j}$ with input $\overrightarrow{\mathbf{x}}_{\textrm{in}}$, output $\overrightarrow{\mathbf{x}}_{\textrm{out}}$, $\phi(\cdot)$ a ReLU based activation function and $\overrightarrow{\mathbf{u}}_{j}$ and $\overrightarrow{\mathbf{v}}_{j}$ being the first and second layer weights of the FFN respectively, for any input $\overrightarrow{\mathbf{x}}_{\textrm{in}}$, only a small fraction of weight vectors $\overrightarrow{\mathbf{u}}_{j}$ have $\phi(\langle\overrightarrow{\mathbf{u}}_{j},\overrightarrow{\mathbf{x}}\rangle% )\neq 0$. In fact, the sparsity level can be further enhanced ($\lessapprox 5\%$ nonzeros) by using explicit top$k$ procedure in the FFN layer computation during training. Similarly, softmax layers usually need to focus only on tokens corresponding to the highest magnitude outputs (logits).
This raises the following question that we seek to answer: can we compute top elements of softmax output and/or FFN activations efficiently (on accelerators) and accurately?
Challenges with topk computation in softmax/FFN: Firstly, the topk elements can be input dependent, so the challenge is to estimate top elements without fully evaluating the softmax/FFN layer. Second, even after estimating the top elements’ coordinates, it is challenging to gather the top elements efficiently on standard accelerators (GPUs/TPUs).
Contributions: In this work, we propose ideas to overcome these challenges and show that they substantially improve the inference latency of autoregressive LLMs.
Estimating top$k$ elements efficiently: We propose using an approximate but high recall procedure to estimate the indices of top$k$ outputs of FFN/softmax layers. In particular, we employ a smaller dimensional/low rank projection as well as aggressive quantization for doing this approximate estimation, but the framework is more general. While there are a few recent works (Liu et al., 2023; Csordás et al., 2023; Alizadeh et al., 2023) which employ an approximate top$k$ estimation procedure, they suffer a loss in performance due to the approximation. One of our key ideas is to identify that if this approximate procedure has high recall, then following it by exact computation on the predicted set can regain the lost performance. We call this approach HiRE, standing for High Recall Approximate Top$k$ Estimation.
Distributed approximate top$k$ (DATOPk): Due to the size of LLMs, even during inference, the models are sharded onto multiple devices (say $\ell$ devices), and similarly computations (and outputs/activations) in each of the softmax/FFN layers are also distributed across the devices. In this context, a vanilla application of the top$k$ operator turns out to be very expensive since parameters stored on all the different devices are collected into a central location before performing the top$k$ operation. To overcome this, we propose a distributed, approximate top$k$ operator which performs the top$\frac{k}{\ell}$ operation on each of the $\ell$ devices, and then uses a union of these to approximate the overall top$k$ operator.
HiRESoftmax: In the softmax layer, we are interested only in the top few ($k\lessapprox 32$) outputs from a very large output space with hundreds of thousands of tokens. On a model with about one billion parameters, our implementation of HiRE gives an endtoend latency improvement of about $1.22\textrm{x}$ without any quality drop.
HiREFFN: While LLMs trained with a top$k$ operation have relatively sparse activations, $k\gtrapprox 5\%$ is usually required to ensure that there is no loss in accuracy (in contrast to the softmax layer, where the effective sparsity is $\lessapprox 0.3\%$). Now, the amount of time taken to gather the relevant top$k$ columns of the weight matrix from HBM to cache turns out to be substantially more than that required to bring an equivalently sized dense matrix. To overcome this hardware limitation, we propose a group sparse top$k$ operation, where columns of the weight matrix are grouped into groups of small sizes ($8$ in our experiments) which significantly improves the efficiency of memory transfers. Second, motivated by the observation that there is substantial overlap in nonzero activations across tokens, we propose two approaches to further enhance sparsity levels. For exploiting static overlap (i.e., activations that are common to most tokens), we propose to use a hybrid FFN layer which combines a standard/dense (but small) FFN, together with a sparse/top$k$ FFN layer. For exploiting dynamic overlap (i.e., activations that are common to related responses), while doing inference with a larger batch size or generating multiple parallel responses for the same query, we propose to select sparsity depending on the size of union of nonzero activation indices for each of the tokens, instead of using a fixed level of sparsity. On the same model as above, HiREFFN gives an endtoend latency improvement of about $1.16\textrm{x}$, again without any quality drop.
Putting together: A combination of all the ideas discussed above give an endtoend latency improvement of about $1.47\textrm{x}$ compared to the fully dense model.
We note that concurrent to our work, (Alizadeh et al., 2023) also proposes an efficient, approximate procedure to estimate nonzero activations. While this approach is similar to our “estimating top activations cheaply” contribution, we note that emphasis of the two works is quite different. In particular, we demonstrate that along with estimation of top activations, we also need to adjust thresholds for high recall, implement DATOPk style operator, and promote group sparsity+overlap to provide compelling endtoend latency reduction on accelerators.
Outline: In Section 2, we describe some closely related works. In Section 3, we explain the key ideas. In Section 4 we present the experiment setting and our main results, and provide additional results in appendix.
Notation: We use nonbold letters $a,A$ etc. for scalars, $\overrightarrow{\mathbf{a}},\overrightarrow{\mathbf{b}}$ etc. for vectors and $\mathbf{A},\mathbf{B}$ etc. for matrices. Given a set $S$ of coordinates, we also use $\overrightarrow{\mathbf{x}}_{S}$ to denote the restriction of $\overrightarrow{\mathbf{x}}$ to the coordinates in $S$.
2 Related Work
Sparsity of activations: As mentioned earlier, recent works (Li et al., 2022) have demonstrated that large language models have highly sparse activation patterns with ReLU based activation functions. (Mirzadeh et al., 2023) show that using ReLU activation indeed achieves comparable performance to and argue for reinstating the ReLU activation function instead of more popular alternatives such as GeLU. There have been several recent attempts to exploit activation sparsity for faster inference such as (Liu et al., 2023; Grimaldi et al., 2023; Piórczyński et al., 2023; Csordás et al., 2023; Alizadeh et al., 2023), but all of them suffer some loss in quality since the activation pattern cannot be predicted exactly. In contrast, one of our key insights is that estimating activation sparsity with high recall can ensure matching quality, since exact computation can be performed on all of the activations that are estimated to be nonzero, and even if some of them later turn out to be zero, it doesn’t affect the actual output. We also note that we introduce several new ideas such as group sparsity with very small group sizes, using a common dense path, exploiting activation overlap across tokens, and distributed gathering of weights, which are critical to improving efficiency without loss in accuracy and in extending this approach to larger batch sizes, larger number of samples and larger models with model partitioning.
Sparse attention: There have also been several approaches to enforce and exploit sparsity in the attention layers (Wang et al., 2021; Madaan et al., 2023). In particular, they directly approximate full alltoall attention with sparse attention and we believe that similar to what we do for activation sparsity in this paper, estimating a high recall set of nonzero attention weights and performing exact attention computation on them can ensure that there is no loss in quality with sparse attention.
Structured sparsity and hardware support: From the hardware/systems point of view, there have been several works developing efficient sparse operations on GPUs and TPUs such as (Gale et al., 2020). Since unstructured sparsity is not supported with efficient execution on GPUs, there have been works that try either different forms of structured sparsity such as $1$ in $4$ sparsity (Jaszczur et al., 2021), group sparsity with large group sizes (Dong et al., 2023) or explore the efficiency of sparse transformers on CPUs (Song et al., 2023; Zeng et al., 2023). In this work, we show that while fully unstructured sparsity is not efficiently supported on TPUs, surprisingly, group structured sparsity with even small group sizes like $8$, is efficiently supported on TPUs (see Figure 3), while not losing in terms of pretraining or downstream quality.
Mixture of experts (MoE): MoE (Du et al., 2022) can be thought of as an extreme version of structured sparsity, where the group size is very large, and have been explored both for training (Li et al., 2022) as well as inference (Yi et al., 2023) efficiency. However, MoE models are usually harder to train, since one needs to both learn a gating mechanism, as well as ensure that tokens are roughly equally distributed across experts. In contrast, small group sizes (or experts) are much easier to train, particularly with the Top$k$ operation.
Complementary approaches for inference efficiency: There have been several complementary approaches for speeding up LLM inference such as model compression (Han et al., 2015; Jaiswal et al., 2023; Xia et al., 2023), quantization (Li et al., 2023; Ahmadian et al., 2023; Lin et al., 2023; Zhao et al., 2023), speculative decoding (Leviathan et al., 2023; He et al., 2023), early exit and parallel decoding (Bae et al., 2023), structured matrices in FFN layers (Baykal et al., 2023), other systems aspects such as effective partitioning of the model (Pope et al., 2023), communication protocols across multiple devices (Aminabadi et al., 2022; Sheng et al., 2023) etc.
3 Main Ideas
In this section, we will introduce the precise setting and present the main ideas behind HiRE more rigorously.
3.1 Problem Setup
A transformer begins with an embedding layer which produces the representation for the input tokens, then processes them through a sequence of alternating attention and feed forward layers, and finally ends with a softmax layer that produces the output probabilities (Vaswani et al., 2017). In this work, we will be primarily concerned with the softmax and feed forward (FFN) layers, which we now describe.
The softmax layer takes an input $\overrightarrow{\mathbf{x}}\in\mathbb{R}^{d}$ and outputs:
$\displaystyle\textrm{Softmax}\left(\overrightarrow{\mathbf{x}}\right):=\frac{% \exp({\mathbf{W}}\overrightarrow{\mathbf{x}})}{\left\\exp({\mathbf{W}}% \overrightarrow{\mathbf{x}})\right\_{1}}\in\mathbb{R}^{c},$  (1) 
where ${\mathbf{W}}\in\mathbb{R}^{d\times c}$, $d$ is called the model dimension and $c$ is the number of output classes. In almost all of the applications, we only care about accurately estimating the conditional distribution on the top$k$ output probabilities for some $k$ i.e.,
$\displaystyle\textrm{Softmax}_{k}\left(\overrightarrow{\mathbf{x}}\right):=% \frac{\textrm{Top}_{k}\left(\textrm{Softmax}\left(\overrightarrow{\mathbf{x}}% \right)\right)}{\left\\textrm{Top}_{k}\left(\textrm{Softmax}\left(% \overrightarrow{\mathbf{x}}\right)\right)\right\_{1}}\in\mathbb{R}^{c},$  (2) 
where $\textrm{Top}_{k}\left(\cdot\right)$ operation takes a vector as input, retains the values of the top$k$ entries and sets the remaining to zero. In particular, we only care about estimating the location and values of the top$k$ entries of $\textrm{Softmax}\left(\overrightarrow{\mathbf{x}}\right)$.
The FFN layer takes an input $\overrightarrow{\mathbf{x}}\in\mathbb{R}^{d}$ and outputs:
$\displaystyle\textrm{FF}\left(\overrightarrow{\mathbf{x}}\right):=\sum_{j=1}^{% m}\phi(\langle\overrightarrow{\mathbf{u}}_{j},\overrightarrow{\mathbf{x}}% \rangle)\overrightarrow{\mathbf{v}}_{j},$  (3) 
where $\phi(\cdot)$ is an activation function, $m$ is the number of hidden units and $\overrightarrow{\mathbf{u}}_{j}\in\mathbb{R}^{d}$ and $\overrightarrow{\mathbf{v}}_{j}\in\mathbb{R}^{d}$ are the first and second layer weights respectively. We refer to equation 3 as the standard or dense FFN layer. It turns out that for specific activation functions like (powers of) ReLU, for any given $x$, the number of $j$ for which the activations i.e., $\phi(\langle\overrightarrow{\mathbf{u}}_{j},\overrightarrow{\mathbf{x}}\rangle)$ is nonzero in a trained LLM is small ($\lessapprox 10\%$) (Li et al., 2022; Mirzadeh et al., 2023). If we explicitly do a top$k$ step in the feedforward evaluation i.e.,
$\displaystyle\textrm{FF}_{k}\left(\overrightarrow{\mathbf{x}}\right):=\sum_{j% \in S(\overrightarrow{\mathbf{x}})}\phi(\langle\overrightarrow{\mathbf{u}}_{j}% ,\overrightarrow{\mathbf{x}}\rangle)\overrightarrow{\mathbf{v}}_{j},$  (4) 
where $S(\overrightarrow{\mathbf{x}}):=\textrm{TopInd}_{k}\left(\phi(\mathbf{U}^{\top% }{\overrightarrow{\mathbf{x}}})\right)$, where $\mathbf{U}$ is the matrix whose $j^{\textrm{th}}$ column is $\overrightarrow{\mathbf{u}}_{j}$ and $\textrm{TopInd}_{k}\left(\overrightarrow{\mathbf{z}}\right)$ denotes the set of indices where the top$k$ entries occur in $\overrightarrow{\mathbf{z}}$, then the number of nonzeros reduces even further without drop in quality for $k\approx 5\%$ (Li et al., 2022). We refer to equation 4 as the sparse or top$k$ FFN layer.
We now abstract out the key component that can speed up both top$k$ based softmax equation 2 as well as top$k$ based FFN equation 4 as follows. Given a matrix ${\mathbf{Z}}\in\mathbb{R}^{d\times\ell}$ and a vector $\overrightarrow{\mathbf{x}}\in\mathbb{R}^{d}$, we wish to compute:
$\displaystyle S=\{(i,\phi(\langle\overrightarrow{\mathbf{z}}_{i},% \overrightarrow{\mathbf{x}}\rangle)):i\in\textrm{TopInd}_{k}\left(\phi({% \mathbf{Z}}^{\top}{\overrightarrow{\mathbf{x}}})\right)\},$  (5) 
where $\overrightarrow{\mathbf{z}}_{i}$ is the $i^{\textrm{th}}$ column of ${\mathbf{Z}}$ and $\phi(\cdot)$ is any given activation function (could be identity for softmax equation 2), and $d\ll\ell$. Our goal is to design an efficient mechanism to compute $S$ in equation 5 compared to the baseline approach of computing $\phi({\mathbf{Z}}^{\top}{\overrightarrow{\mathbf{x}}})$ followed by a top$k$ operation.
Theoretical proxy for quantifying efficiency
: Autoregressive decoding for small batch sizes is memory bound i.e., the time taken for transferring model parameters across different hierarchies of memory (e.g., HBM/RAM to cache) of the accelerator device (GPU/TPU) is the largest component of the total inference latency (Leviathan et al., 2023). To quantify this intuition, and get a sense of how much we are improving inference latency, only in the current section, we use the number of effective parameters (measured in terms of bytes taken to store them) used by a given algorithm $\mathcal{A}$ for the computation in equation 5 as a proxy for its latency, and refer to it as $\textrm{PS}\left(\mathcal{A}\right)$. For example, the Baseline which first computes $\phi({\mathbf{Z}}^{\top}{\overrightarrow{\mathbf{x}}})$, followed by a top$k$ operation has $\textrm{PS}\left(\textrm{Baseline}\right)=2d\ell$ since ${\mathbf{Z}}$ is a $d\times\ell$ matrix, which is stored in $16$bit floating point (bf16) format.
3.2 HiRE: Overall Approach
In this section, we present the key ideas behind HiRE, along with its application to the softmax and FFN layers.
3.2.1 Solving equation 5 accurately and efficiently
The key idea behind HiRE is that given an approximate, and smaller version of the matrix $Z$, denoted by ${\mathbf{Z}_{\textrm{approx}}}$, we first approximate $\phi({\mathbf{Z}}^{\top}\overrightarrow{\mathbf{x}})$ with $\phi({\mathbf{Z}_{\textrm{approx}}}^{\top}\overrightarrow{\mathbf{x}})$, use it to compute the set of top$k^{\prime}$ elements $S^{\prime}$ for some $k^{\prime}>k$, compute $\phi({\mathbf{Z}}_{S^{\prime}}^{\top}\overrightarrow{\mathbf{x}})$ for only indices restricted to $S^{\prime}$, and then perform the top$k$ operation on the resulting vector. A pseudocode of HiRE is presented in Algorithm 1 and a schematic diagram is presented in Figure 1. It is easy to see that as long as $\textrm{TopInd}_{k}\left(\phi({\mathbf{Z}}^{\top}\overrightarrow{\mathbf{x}})% \right)\subseteq S^{\prime}$, we have that $\textrm{Top}_{k}\left(\phi({\mathbf{Z}}^{\top}\overrightarrow{\mathbf{x}})% \right)=\textrm{Top}_{k}\left(\phi({\mathbf{Z}}_{S^{\prime}}^{\top}% \overrightarrow{\mathbf{x}})\right)$. If the size of ${\mathbf{Z}_{\textrm{approx}}}$ is chosen to be substantially smaller than that of ${\mathbf{Z}}$, and if $k^{\prime}\ll\ell$, then HiRE executes much faster than the naive version of implementing equation 5. In this paper, we consider two ways of choosing ${\mathbf{Z}_{\textrm{approx}}}$.
HiRELR
: In this case, we choose ${\mathbf{Z}_{\textrm{approx}}}$ to be a low rank matrix factorized as ${\mathbf{Z}}_{1}*{\mathbf{Z}}_{2}^{\top}$, where ${\mathbf{Z}}_{1}\in\mathbb{R}^{d\times r}$ and ${\mathbf{Z}}_{2}\in\mathbb{R}^{\ell\times r}$ for some $r\ll d$. In this case, $\textrm{PS}\left(\textrm{HiRE}\textrm{LR}\right)=2(dr+r\ell+dk^{\prime})$, which will be much smaller than $2d\ell$ since $r\ll d\ll\ell$ and $k^{\prime}\ll\ell$. Note that this requires us to train ${\mathbf{Z}_{\textrm{approx}}}$ either in an end to end manner, or by performing low rank decomposition on ${\mathbf{Z}}$. However, there is a high potential for latency gains here since $r$ could potentially be chosen to be very small.
HiREQ
: In this case, we choose ${\mathbf{Z}_{\textrm{approx}}}$ to be a aggressively quantized version of ${\mathbf{Z}}$ for example in $4$bit integer format (int4). In this case, $\textrm{PS}\left(\textrm{HiRE}\textrm{Q}\right)=\frac{d\ell}{2}+2dk^{\prime}$, which will be much smaller than $2d\ell$ since $k^{\prime}\ll\ell$. Note that this does not require any training since we can just apply a standard quantization routine to ${\mathbf{Z}}$ to obtain ${\mathbf{Z}_{\textrm{approx}}}$. However, the potential gains are limited due to inherent limits on quantization. Note that we can also combine both HiRELR and HiREQ to obtain further inference latency improvements.
3.2.2 $\textrm{DATOP}k$: Multidevice serving with approximate distributed top$k$ operation
Very large models do not fit on the RAM of a single GPU/TPU, so for serving such models, the parameters of the model are usually distributed on a cluster of multiple devices. In this case, using HiRE (Algorithm 1) as is leads to large costs in communication. To mitigate the communication costs, we modify HiRE to use distributed, approximate top$k$ computation. The pseudocode for the resulting algorithm is given in Algorithm 2. We will now present the application of Algorithms 1 and 2 to make softmax and FFN layers more efficient.
3.3 HiRESoftMax: Efficient Softmax using HiRE
Since we usually care only about the top$k$ logits of the softmax layer equation 4, we can directly apply Algorithms 1 and 2 to solve equation 2 more efficiently. We note that while obtaining ${\mathbf{W}_{\textrm{approx}}}$, an approximate version of ${\mathbf{W}}$, is straightforward with quantization, obtaining a low rank approximation of ${\mathbf{W}_{\textrm{approx}}}$ is also relatively cheap since we can distill the inputs and outputs of ${\mathbf{W}}$ to obtain ${\mathbf{W}_{\textrm{approx}}}$.
3.4 HiREFFN: Efficient FFN using HiRE
Recalling equation 4, we note that we can compute $S(\overrightarrow{\mathbf{x}}):=\textrm{TopInd}_{k}\left(\phi(\mathbf{U}^{\top% }{\overrightarrow{\mathbf{x}}})\right)$ using HiRE.
While HiREFFN is indeed theoretically more efficient compared to a naive computation of equation 4, it turns out that for the relative values of $k^{\prime}$ and $m$ that are needed to ensure that there is no accuracy drop ($\frac{k^{\prime}}{m}\approx 0.05$), the transfer of parameters across different hierarchies of memory is too inefficient to give any gains^{2}^{2}2In contrast, the corresponding factor in the softmax layer is much smaller, around $10^{4}$, which gives us inference latency improvements, despite this inefficiency in memory transfer.. Our key insight is that group sparse structure substantially improves the efficiency of parameter transfer across different hierarchies of memory. For example, Figure 3 in Appendix B shows that the efficiency of the transfer operation of columns of a matrix across different hierarchies of memory improves substantially when we transfer groups of adjacent columns instead of single columns. More concretely, we first divide the entire set of $m$ hidden units into $m/g$ groups with $g$ hidden units each (in all our experiments, we use $g=8$). Given an approximate group activation computation procedure $\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{{\frac{m}{g}}}$, i.e., $\Phi(\overrightarrow{\mathbf{x}})\approx\left(\sum_{k=0}^{g1}\left\phi(% \langle\overrightarrow{\mathbf{u}}_{jg+k},\overrightarrow{\mathbf{x}}\rangle)% \right:j\in[m/g]\right)$ let us denote $S^{\prime}_{g}:=\textrm{TopInd}_{k^{\prime}}\left(\Phi(\overrightarrow{\mathbf% {x}})\right)$, and use:
$\displaystyle\widetilde{\textrm{FF}}_{g,\frac{k^{\prime}}{g}}\left(% \overrightarrow{\mathbf{x}}\right):=\sum_{j\in\widetilde{S}_{g}}\sum_{\ell=0}^% {g}\phi(\langle\overrightarrow{\mathbf{u}}_{g*j+\ell},\overrightarrow{\mathbf{% x}}\rangle)\overrightarrow{\mathbf{v}}_{g*j+\ell},$  (6) 
where $\widetilde{S}_{g}\subseteq[m/g]$ is a subset of groups selected as $\widetilde{S}_{g}:=\textrm{GroupTopInd}_{g,k/g}\left(\phi({\mathbf{U}}_{{S^{% \prime}}_{g}}^{\top}\overrightarrow{\mathbf{x}})\right)$, where ${S^{\prime}}_{g}$ is an estimate of the groups of neurons that will actually be activated. We can again use Algorithm 1 (resp. Algorithm 2) to compute $\widetilde{S}_{g}$ in the single (resp. multiple) device settings.
3.4.1 Enhancing sparsity further by exploiting static and dynamic activation overlap
Motivated by the observation that there is substantial overlap in nonzero activations across tokens, we also propose the following approaches to exploit static and dynamic activation overlap.
Static overlap
: The first idea is to augment the sparse FFN layer with a dense common path of neurons that are activated for all tokens, so that the number of nonzeros in the sparse part can be reduced further. More concretely, the feedforward layer comprises of two sets of $m_{1}$ and $m_{2}$ hidden units respectively, where the first set of neurons is used in a dense manner equation 3, while the second set is used in a group sparse manner equation 6. We have:
$\displaystyle\widetilde{\textrm{FF}}_{\textrm{CommonPath}}\left(% \overrightarrow{\mathbf{x}}\right)=\sum_{j=1}^{m_{1}}\phi(\langle% \overrightarrow{\mathbf{u}}_{j}^{d},\overrightarrow{\mathbf{x}}\rangle)% \overrightarrow{\mathbf{v}}_{j}^{d}$  
$\displaystyle\quad+\sum_{j\in\widetilde{S}_{g}(\overrightarrow{\mathbf{x}})}% \sum_{\ell=0}^{g}\phi(\langle\overrightarrow{\mathbf{u}}_{g*j+\ell}^{gs},% \overrightarrow{\mathbf{x}}\rangle)\overrightarrow{\mathbf{v}}_{g*j+\ell}^{gs},$  (7) 
where $\overrightarrow{\mathbf{u}}^{d},\overrightarrow{\mathbf{v}}^{d}$ denote the first set of hidden units used in a dense manner, while $\overrightarrow{\mathbf{u}}^{gs},\overrightarrow{\mathbf{v}}^{gs}$ denote the second set of neurons used in a sparse manner.
Dynamic overlap
: In order to further improve efficiency while producing a larger number of samples for the same query (e.g., to rank these responses and output the best one (Mudgal et al., 2023)), we first compute $\widetilde{S}_{g}(\overrightarrow{\mathbf{x}}_{1}),\cdots,\widetilde{S}_{g}(% \overrightarrow{\mathbf{x}}_{s})$ for the latest token from each of the $s$ samples and use their union $\cup_{u\in[s]}\widetilde{S}_{g}(\overrightarrow{\mathbf{x}}_{u})$ on line $1$ of Algorithm 1 and 2.
$\mathcal{M}_{\textrm{sm}}$  
$\mathcal{M}_{\textrm{base}}$  HiRELR ($r=25\%$, $k^{\prime}=384$)  HiREQ (int4, $k^{\prime}=128$)  HiRELR ($r=25\%$, $k^{\prime}=384$) + HiREQ (int4)  
Pretraining Performance  Top1 Accuracy  57.15%  57.07%  57.12%  57.07% 
Top32 Intersection  32.0  29.26  31.48  29.03  
Downstream Performance  Machine Translation  47.92  47.73  47.9  47.77 
SuperGLUE Benchmark  62.0  61.39  61.56  61.02  
Question Answering  29.65  29.58  29.54  29.58  
Discriminative Tasks  51.69  50.51  50.92  50.40  
Speedup  1.0x  1.16x  1.16x  1.22x 
4 Experiments
In this section, we present experimental evaluation of HERD and demonstrate its efficacy in reducing inference latency.
4.1 Training details
We mostly follow the training setup described in (Anil et al., 2023) including the training dataset and optimizer. Most experiments were conducted on a model with about $1$ billion parameters – we denote it by $1B$ model. $d_{\textrm{m}}$ denotes its model dimensions, i.e., dimension of representations.
HiREFFN: After pretraining $1B$ for a large number of steps ($\mathcal{M}$), we continue to pretrain it for a similar number of additional steps with group sparse top$k$ operation (equation 6) on odd FFN layers to obtain $\mathcal{M}_{\textrm{ffn}}$ i.e., if the model has $\ell$ layers, then we modify only the $1^{\textrm{st}},3^{\textrm{rd}},...,(2*\lfloor\frac{\ell}{2}\rfloor1)^{% \textrm{th}}$ layers to be group sparse. For fair comparison we continue to pretrain $\mathcal{M}$ for the same number of additional steps as $\mathcal{M}_{\textrm{ffn}}$, but without the top$k$ operation to obtain the baseline model $\mathcal{M}_{\textrm{base}}$. While the exact computations are all performed in bf16 (i.e., $16$ bit floating point), the approximate computations in HiRE are performed in int4.
HiRESoftmax: We apply HiRESoftmax to $\mathcal{M}_{\textrm{base}}$ or $\mathcal{M}_{\textrm{ffn}}$ with different choices for approximate computation.

1.
For approximate computation with low rank, we train ${\mathbf{W}_{\textrm{approx}}}$ using cross entropy (CE) loss with respect to ground truth labels, with rank $r$ chosen to be $r=\frac{d_{\textrm{m}}}{4}$.

2.
For approximate computation with quantization, we perform the softmax computation in int4.
We also present results combining both low rank and quantization approximation. We refer to the model with HiRE on both Softmax and FFN layers as $\mathcal{M}_{\textrm{full}}$, while HiRESoftmax applied to $\mathcal{M}_{\textrm{base}}$ is referred to as $\mathcal{M}_{\textrm{sm}}$.
Baseline  $\mathcal{M}_{\textrm{ffn}}$ (HiREQ)  $\mathcal{M}_{\textrm{ffn}}$ (HiREQ) + $\mathcal{M}_{\textrm{sm}}$ (HiREQ + HiRELR)  
Pretraining Performance  Top1 Accuracy  57.15%  57.03%  56.93% 
Perplexity  2.045  2.056  NA  
Downstream Performance  Machine Translation  47.92  46.95  46.94 
SuperGLUE Benchmark  62.0  62.49  61.74  
Question Answering  29.65  30.88  30.86  
Discriminative Tasks  51.69  51.14  50.08  
Speedup  1.0x  1.16$\times$  1.47$\times$ 
$\mathcal{M}_{\textrm{ffn}}$ (HiREQ)  $\mathcal{M}_{\textrm{ffn}}$ + HiREQ + (DATOPk)  
Pretraining Performance  Top1 Accuracy  57.03%  56.82% 
Perplexity  2.056  2.064  
Downstream Performance  Machine Translation  46.95  47.03 
SuperGLUE Benchmark  62.49  61.37  
Question Answering  30.88  30.11  
Discriminative Tasks  51.14  50.33  
Speedup  1.0$\times$  2.27$\times$ 
4.2 Evaluation
We evaluate the performance of different models both in terms of quality as well as inference latency. We measure quality using evaluation on pretraining dataset as well evaluations on multiple downstream datasets. Following standard works in the domain, we compute downstream performance of the model by applying it to multiple tasks, each focusing on a specific capability of LLMs. Collectively these datasets provide us a thorough assessment framework. To facilitate evaluation of our pretrained base models we perform 1shot evaluations on all datasets.
 1.
 2.
 3.

4.
Discriminative Tasks: We use additional tasks to evaluate commonsense reasoning, ranking and textual entailment. See appendix for the full list of datasets.
For Question Answering evaluations we use Exact Match as our metric (Except F1score for TyDiQA), Accuracy for Discriminative tasks and Character ngram Fscore (Popović, 2015) for machine translation. For each task group, we report the MacroAverage of task metrics in our downstream evaluations.
For evaluating inference latency, we use a single TPUv5e device with batch size of one, and for generating one full response per query. We also demonstrate the importance of DATOPk in scaling to multiple devices, by presenting results for a slice of $4$ TPUv5e devices. We report Speedups by measuring number of response tokens generated per unit time step.
4.3 Results
HiRESoftmax: Table 1 presents a comparison of $\mathcal{M}_{\textrm{sm}}$ (i.e., HiRE applied to the softmax layer of $\mathcal{M}_{\textrm{base}}$) with the baseline dense model ($\mathcal{M}_{\textrm{base}}$). Note that both HiREQ and HiRELR provide about $1.16\times$ speedup over baseline while providing almost the same nexttoken prediction accuracy on the pretraining dataset. Downstream accuracy is also within $0.2\%$ of the baseline. We observe that combining $\textrm{HiRE}\textrm{Q}$ and $\textrm{HiRE}\textrm{LR}$ improves the latency speedup to $1.22\times$ with less than $0.5\%$ overall drop in accuracy.
HiREFFN: Table 2 presents a comparison of both HiREFFN ($\mathcal{M}_{\textrm{ffn}}$) as well as the full HiRE approach applied to both the Softmax and FFN layers ($\mathcal{M}_{\textrm{full}}$), with respect to the original model ($\mathcal{M}_{\textrm{base}}$). Similar to softmax layer, here again combining HiRELR and HiREQ leads to as much as $1.47\times$ latency reduction while maintaining almost similar pretraining and downstream evaluation accuracy.
Multiple devices: In latency critical applications, LLMs are served on multiple accelerator devices, with their weights distributed across them. Implementing HiRE as is on multiple devices in this fashion turns out to be suboptimal due to the cost of identifying top$k$ activations jointly across all the devices. To overcome this, we proposed a distributed version of the top$k^{\prime}$ operation DATOPk, which first performs top$\frac{k^{\prime}}{d}$ operation on each of the $d$ devices and then takes a union of those activations to approximate top$k^{\prime}$ (Algorithm 2). As we can see from Table 3 improves the latency by $2.27\times$, compared to the vanilla implementation of HiRE (Algorithm 1), with comparable quality on average across downstream tasks.
Why high recall estimation?
HiRELR for Softmax (r=25%)  $k^{\prime}=\textrm{None}$  $k^{\prime}=32$  $k^{\prime}=64$  $k^{\prime}=128$  $k^{\prime}=256$  $k^{\prime}=384$  $k^{\prime}=512$  
Pretraining Performance  Top1 Accuracy  51.87%  56.43%  56.63%  56.77%  56.86%  56.89%  56.92% 
Top32 Intersection  19.24  19.24  24.13  26.94  28.72  29.26  29.93  
HiREQ for Softmax (int4)  $k^{\prime}=\textrm{None}$  $k^{\prime}=2$  $k^{\prime}=4$  $k^{\prime}=8$  $k^{\prime}=32$  $k^{\prime}=64$  $k^{\prime}=128$  
Pretraining Performance  Top1 Accuracy  55.94%  56.76%  56.94%  56.98%  56.99%  56.99%  56.99% 
Top32 Intersection  27.68  1.98  3.96  7.90  27.68  30.86  31.30 
Consider Algorithm 1 and equation 5. We define Recall as the intersection between the set of elements, $S^{\prime}$ computed using ${\mathbf{Z}_{\textrm{approx}}}$ matrix and $S$ using ${\mathbf{Z}}$ matrix. In case of Softmax, we compute the intersection between $S^{\prime}$ and $S$ for both HiRELR and HiREQ in Table 4. We see a systematic increase of recall with increase in $k^{\prime}$ of the algorithm. In case of $\mathcal{M}_{\textrm{ffn}}$, we evaluate models with different ($k^{\prime}$, $k$) pairs to study the importance of high recall in pretraining metrics. From Table 6, we observe that, as we increase $k^{\prime}$, the model shows an improvement, further signifying the importance of high recall.
Importance of learned projections in HiRELR: While HiREQ is much easier to implement since it does not require any training, the potential for gains is limited due to the inherent nature of quantization. On the other hand, HiRELR has the potential to deliver larger latency improvements, but requires retraining as discussed in Section 4.1. In this section, we explore whether random projections can be directly used instead of learning them. Table 7 in Appendix B demonstrates that random projections can lead to as much as $10\%$ reduction in downstream evaluation accuracy at similar rank justifying our approach of learned lowrank projections.
Location of sparse layers: In all the results so far, we have modified only the odd FFN layers to be sparse (equation 6), while the remaining were dense (equation 3). In Table 8, we present results for different choices of sparse layers, which demonstrate that modifying all the layers to be sparse leads to substantial drop in accuracy, and the location of sparse layers makes an effect on the overall accuracy. It is interesting to identify the optimal selection of sparse layers.
Baseline  Sparse  CommonPath  
Sparsity  Static Sparsity  100%  0%  16.66% 
Adaptive Sparsity  NA  5.55%  3.33%  
Pretraining Performance  Top1 Accuracy  57.15%  57.03%  57.03% 
Perplexity  2.045  2.056  2.055  
Downstream Performance  MT  47.92  46.95  46.93 
SuperGLUE  62.07  62.49  61.66  
Q & A  29.65  30.88  29.94  
Disc. Tasks  51.69  51.14  51.14  
Speedup  1.0$\times$  1.16 $\times$  1.22$\times$ 
Common path for increasing sparsity further: We now present results demonstrating the utility of a common path, to further enhance activation sparsity as described in Section 3.4.1, equation 7. The results, presented in Table 5 show that our commonpath technique is indeed able to improve dynamic sparsity by $2\%$ while still maintaining accuracy, which leads to about $6\%$ lower latency.
Exploiting activation overlap for larger num samples with adaptive sparsity: It has been shown that generating multiple diverse responses for a single query, and subsequent ranking aids in selecting the most suitable output (Mudgal et al., 2023). We hypothesize that such responses share significant overlap in their nonzero neural activations, offering potential for enhancing effective sparsity in FFN layers. This is supported empirically (Figure 2), where the union of nonzero activations across multiple responses is consistently smaller than the sum of number of individual response activations.
Broader Impact Statement
We expect utilizing HiRE can lead to more energy efficient large model inference. Beyond that there are potential societal consequences of helping large models become more accessible, none which we feel must be specifically highlighted here.
References
 Ahmadian et al. (2023) Ahmadian, A., Dash, S., Chen, H., Venkitesh, B., Gou, S., Blunsom, P., Ustun, A., and Hooker, S. Intriguing properties of quantization at scale. ArXiv, abs/2305.19268, 2023. URL https://api.semanticscholar.org/CorpusID:258967189.
 Alizadeh et al. (2023) Alizadeh, K., Mirzadeh, I., Belenko, D., Khatamifard, K., Cho, M., Del Mundo, C. C., Rastegari, M., and Farajtabar, M. Llm in a flash: Efficient large language model inference with limited memory. arXiv preprint arXiv:2312.11514, 2023.
 Aminabadi et al. (2022) Aminabadi, R. Y., Rajbhandari, S., Awan, A. A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al. Deepspeedinference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15. IEEE, 2022.
 Anil et al. (2023) Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J. H., Shafey, L. E., Huang, Y., MeierHellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Ábrego, G. H., Ahn, J., Austin, J., Barham, P., Botha, J. A., Bradbury, J., Brahma, S., Brooks, K., Catasta, M., Cheng, Y., Cherry, C., ChoquetteChoo, C. A., Chowdhery, A., Crepy, C., Dave, S., Dehghani, M., Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E., Feinberg, V., Feng, F., Fienber, V., Freitag, M., Garcia, X., Gehrmann, S., Gonzalez, L., and et al. Palm 2 technical report. CoRR, abs/2305.10403, 2023. doi: 10.48550/ARXIV.2305.10403. URL https://doi.org/10.48550/arXiv.2305.10403.
 Bae et al. (2023) Bae, S., Ko, J., Song, H., and Yun, S.Y. Fast and robust earlyexiting framework for autoregressive language models with synchronized parallel decoding. arXiv preprint arXiv:2310.05424, 2023.
 Baykal et al. (2023) Baykal, C., Cutler, D., Dikkala, N., Ghosh, N., Panigrahy, R., and Wang, X. Alternating updates for efficient transformers. arXiv preprint arXiv:2301.13310, 2023.
 Berant et al. (2013) Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on freebase from questionanswer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 1821 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1533–1544. ACL, 2013. URL https://aclanthology.org/D131160/.
 Bisk et al. (2020) Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: reasoning about physical commonsense in natural language. In The ThirtyFourth AAAI Conference on Artificial Intelligence, AAAI 2020, The ThirtySecond Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 712, 2020, pp. 7432–7439. AAAI Press, 2020. doi: 10.1609/AAAI.V34I05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.
 Bojar et al. (2014) Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., SaintAmand, H., et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pp. 12–58, 2014.
 Clark et al. (2019) Clark, C., Lee, K., Chang, M., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACLHLT 2019, Minneapolis, MN, USA, June 27, 2019, Volume 1 (Long and Short Papers), pp. 2924–2936. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N191300. URL https://doi.org/10.18653/v1/n191300.
 Clark et al. (2020) Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. Tydi qa: A benchmark for informationseeking question answering in ty pologically di verse languages. Transactions of the Association for Computational Linguistics, 8:454–470, 2020.
 Clark et al. (2018) Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
 (13) Cloud, G. Cloud tpu v5e inference. URL https://cloud.google.com/tpu/docs/v5einference. Accessed on Feb 1, 2024.
 Csordás et al. (2023) Csordás, R., Irie, K., and Schmidhuber, J. Approximating twolayer feedforward networks for efficient transformers. arXiv preprint arXiv:2310.10837, 2023.
 de Marneffe et al. (2019) de Marneffe, M.C., Simons, M., and Tonhauser, J. The commitmentbank: Investigating projection in naturally occurring discourse. 2019. URL https://api.semanticscholar.org/CorpusID:203595067.
 Dong et al. (2023) Dong, H., Chen, B., and Chi, Y. Towards structured sparsity in transformers for efficient inference. In Workshop on Efficient Systems for Foundation Models, at ICML2023, 2023.
 Du et al. (2022) Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. Glam: Efficient scaling of language models with mixtureofexperts. In International Conference on Machine Learning, pp. 5547–5569. PMLR, 2022.
 Gale et al. (2020) Gale, T., Zaharia, M., Young, C., and Elsen, E. Sparse gpu kernels for deep learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14. IEEE, 2020.
 Grimaldi et al. (2023) Grimaldi, M., Ganji, D. C., Lazarevich, I., and Sah, S. Accelerating deep neural networks via semistructured activation sparsity. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1179–1188, 2023.
 Gu & Dao (2023) Gu, A. and Dao, T. Mamba: Lineartime sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
 Han et al. (2015) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 He et al. (2023) He, Z., Zhong, Z., Cai, T., Lee, J. D., and He, D. Rest: Retrievalbased speculative decoding. arXiv preprint arXiv:2311.08252, 2023.
 Jaiswal et al. (2023) Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., and Yang, Y. Compressing llms: The truth is rarely pure and never simple. arXiv preprint arXiv:2310.01382, 2023.
 Jaszczur et al. (2021) Jaszczur, S., Chowdhery, A., Mohiuddin, A., Kaiser, L., Gajewski, W., Michalewski, H., and Kanerva, J. Sparse is enough in scaling transformers. Advances in Neural Information Processing Systems, 34:9895–9907, 2021.
 Joshi et al. (2017) Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
 Khashabi et al. (2018) Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262, 2018.
 Kwiatkowski et al. (2019) Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
 Lai et al. (2017) Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. H. RACE: largescale reading comprehension dataset from examinations. In Palmer, M., Hwa, R., and Riedel, S. (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 911, 2017, pp. 785–794. Association for Computational Linguistics, 2017. doi: 10.18653/V1/D171082. URL https://doi.org/10.18653/v1/d171082.
 Levesque et al. (2012) Levesque, H., Davis, E., and Morgenstern, L. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
 Leviathan et al. (2023) Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.
 Li et al. (2023) Li, L., Li, Q., Zhang, B., and Chu, X. Norm tweaking: Highperformance lowbit quantization of large language models. arXiv preprint arXiv:2309.02784, 2023.
 Li et al. (2022) Li, Z., You, C., Bhojanapalli, S., Li, D., Rawat, A. S., Reddi, S. J., Ye, K., Chern, F., Yu, F., Guo, R., et al. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. In The Eleventh International Conference on Learning Representations, 2022.
 Liang et al. (2021) Liang, K. J., Hao, W., Shen, D., Zhou, Y., Chen, W., Chen, C., and Carin, L. Mixkd: Towards efficient distillation of largescale language models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 37, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=UFGEelJkLu5.
 Lin et al. (2023) Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. Awq: Activationaware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
 Liu et al. (2023) Liu, Z., Wang, J., Dao, T., Zhou, T., Yuan, B., Song, Z., Shrivastava, A., Zhang, C., Tian, Y., Re, C., et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pp. 22137–22176. PMLR, 2023.
 Madaan et al. (2023) Madaan, L., Bhojanapalli, S., Jain, H., and Jain, P. Treeformer: Dense gradient trees for efficient attention computation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 15, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=DWn1TEb2fK.
 Mihaylov et al. (2018) Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31  November 4, 2018, pp. 2381–2391. Association for Computational Linguistics, 2018. doi: 10.18653/V1/D181260. URL https://doi.org/10.18653/v1/d181260.
 Mirzadeh et al. (2023) Mirzadeh, I., Alizadeh, K., Mehta, S., Del Mundo, C. C., Tuzel, O., Samei, G., Rastegari, M., and Farajtabar, M. Relu strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023.
 Mostafazadeh et al. (2016) Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., and Allen, J. F. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696, 2016. URL http://arxiv.org/abs/1604.01696.
 Mudgal et al. (2023) Mudgal, S., Lee, J., Ganapathy, H., Li, Y., Wang, T., Huang, Y., Chen, Z., Cheng, H., Collins, M., Strohman, T., Chen, J., Beutel, A., and Beirami, A. Controlled decoding from language models. CoRR, abs/2310.17022, 2023. doi: 10.48550/ARXIV.2310.17022. URL https://doi.org/10.48550/arXiv.2310.17022.
 Nie et al. (2020) Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., and Kiela, D. Adversarial NLI: A new benchmark for natural language understanding. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. R. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 510, 2020, pp. 4885–4901. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.ACLMAIN.441. URL https://doi.org/10.18653/v1/2020.aclmain.441.
 Patterson et al. (2021) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.M., Rothchild, D., So, D., Texier, M., and Dean, J. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
 Pilehvar & CamachoCollados (2019) Pilehvar, M. T. and CamachoCollados, J. Wic: the wordincontext dataset for evaluating contextsensitive meaning representations. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACLHLT 2019, Minneapolis, MN, USA, June 27, 2019, Volume 1 (Long and Short Papers), pp. 1267–1273. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N191128. URL https://doi.org/10.18653/v1/n191128.
 Piórczyński et al. (2023) Piórczyński, M., Szatkowski, F., Bałazy, K., and Wójcik, B. Exploiting transformer activation sparsity with dynamic inference. arXiv preprint arXiv:2310.04361, 2023.
 Pope et al. (2023) Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.
 Popović (2015) Popović, M. chrf: character ngram fscore for automatic mt evaluation. In Proceedings of the tenth workshop on statistical machine translation, pp. 392–395, 2015.
 Rajpurkar et al. (2018) Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. In Gurevych, I. and Miyao, Y. (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 1520, 2018, Volume 2: Short Papers, pp. 784–789. Association for Computational Linguistics, 2018. doi: 10.18653/V1/P182124. URL https://aclanthology.org/P182124/.
 Roemmele et al. (2011) Roemmele, M., Bejan, C. A., and Gordon, A. S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS1106, Stanford, California, USA, March 2123, 2011. AAAI, 2011. URL http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418.
 Sakaguchi et al. (2021) Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
 Sheng et al. (2023) Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. Flexgen: Highthroughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pp. 31094–31116. PMLR, 2023.
 Song et al. (2023) Song, Y., Mi, Z., Xie, H., and Chen, H. Powerinfer: Fast large language model serving with a consumergrade gpu. arXiv preprint arXiv:2312.12456, 2023.
 Stern et al. (2018) Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
 Wang et al. (2019) Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Superglue: A stickier benchmark for generalpurpose language understanding systems. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’AlchéBuc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 814, 2019, Vancouver, BC, Canada, pp. 3261–3275, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6Abstract.html.
 Wang et al. (2021) Wang, H., Zhang, Z., and Han, S. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on HighPerformance Computer Architecture (HPCA), pp. 97–110. IEEE, 2021.
 Xia et al. (2023) Xia, H., Zheng, Z., Li, Y., Zhuang, D., Zhou, Z., Qiu, X., Li, Y., Lin, W., and Song, S. L. Flashllm: Enabling costeffective and highlyefficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285, 2023.
 Yi et al. (2023) Yi, R., Guo, L., Wei, S., Zhou, A., Wang, S., and Xu, M. Edgemoe: Fast ondevice inference of moebased large language models. arXiv preprint arXiv:2308.14352, 2023.
 Zellers et al. (2019) Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 August 2, 2019, Volume 1: Long Papers, pp. 4791–4800. Association for Computational Linguistics, 2019. doi: 10.18653/V1/P191472. URL https://doi.org/10.18653/v1/p191472.
 Zeng et al. (2023) Zeng, Z., Davies, M., Pulijala, P., Sankaralingam, K., and Singh, V. Lookupffn: making transformers computelite for cpu inference. In International Conference on Machine Learning, pp. 40707–40718. PMLR, 2023.
 Zerva et al. (2022) Zerva, C., Blain, F., Rei, R., Lertvittayakumjorn, P., de Souza, J. G. C., Eger, S., Kanojia, D., Alves, D. M., Orasan, C., Fomicheva, M., Martins, A. F. T., and Specia, L. Findings of the WMT 2022 shared task on quality estimation. In Koehn, P., Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costajussà, M. R., Federmann, C., Fishel, M., Fraser, A., Freitag, M., Graham, Y., Grundkiewicz, R., Guzman, P., Haddow, B., Huck, M., JimenoYepes, A., Kocmi, T., Martins, A., Morishita, M., Monz, C., Nagata, M., Nakazawa, T., Negri, M., Névéol, A., Neves, M., Popel, M., Turchi, M., and Zampieri, M. (eds.), Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 78, 2022, pp. 69–99. Association for Computational Linguistics, 2022. URL https://aclanthology.org/2022.wmt1.3.
 Zhang et al. (2018) Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Durme, B. V. Record: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885, 2018. URL http://arxiv.org/abs/1810.12885.
 Zhao et al. (2023) Zhao, Y., Lin, C.Y., Zhu, K., Ye, Z., Chen, L., Zheng, S., Ceze, L., Krishnamurthy, A., Chen, T., and Kasikci, B. Atom: Lowbit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102, 2023.
Appendix A Dataset details
 1.

2.
SuperGLUE Benchmark: We use multiple SuperGLUE (Wang et al., 2019) tasks such as BoolQ (Clark et al., 2019), CommitmentBank (de Marneffe et al., 2019), Choice of Plausible Alternatives (Roemmele et al., 2011), Multi Sentence Reading Comprehension (Khashabi et al., 2018), Recognizing Textual Entailment, Words in Context (Pilehvar & CamachoCollados, 2019), Winograd Schema Challenge (Levesque et al., 2012) and Reading Comprehension with Commonsense Reasoning (ReCoRD) (Zhang et al., 2018).
 3.

4.
Discriminative Tasks: We use additional tasks to evaluate commonsense reasoning (ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), PIQA (Bisk et al., 2020) and Story Cloze (Mostafazadeh et al., 2016)), question answering (RACE, (Lai et al., 2017) and OpenBookQA (Mihaylov et al., 2018)) and Textual Entailment (ANLI (Nie et al., 2020)).
Appendix B Additional Results
(128, 128)  (192, 128)  (256, 128)  (Full, 128)  Baseline  
Pretraining Performance  Top1 Accuracy  56.81%  56.96%  57.03%  57.05%  57.15% 
Perplexity  2.071  2.06  2.056  2.054  2.045 
Baseline  Trained, k=384  Random, k=384  
r=25%  r=25%  r=33%  r=50%  r=66%  
Pretraining Performance  Top1 Accuracy  57.40%  57.29%  46.78%  49.26%  54.02%  55.81% 
Top32 Intersection  32.0  29.26  15.53  18.42  24.21  26.89 
Baseline  ($D$ $S$) x $L/2$  Last $L/2$  ($D$ $S$ $S$) x $L/3$  Last $2L/3$  $L$ layers  
Pretraining Performance  Top1 Accuracy  57.15%  57.03%  57.03%  56.84%  56.83%  56.24% 
Perplexity  2.045  2.056  2.058  2.069  2.071  2.108  
Downstream Performance  Machine Translation  47.92  46.95  46.35  46.17  45.83  44.35 
SuperGLUE Benchmark  62.07  62.49  60.43  59.37  60.74  59.3  
Question Answering  29.65  30.88  29.86  28.88  29.09  27.51  
Discriminative Tasks  51.69  51.14  50.99  50.37  50.8  49.65  
Speedup  1.0x  1.16x  1.18x  1.17x  1.18x  1.44x 