Hi readers, I have been busy. I enjoy the consulting at SemiAnalysis and there is a lot to be done, so my blog has languished. I can't say that will change soon, but I do have a few short notes on ideas that might interest some readers.

Clocks have gone brrrr for centuries. All three topics today relate to time.
The Gravity Telescope
This is my current project. For those who did not see it mentioned in my earlier blogs, it uses the GNSS satellites (the ones used for satellite navigation) to try to get an accurate fix on the direction of the sun's gravitational pull. Why would I do that? I'm curious to know whether that direction will be the same as where we see the sun, which is where the sun was 8 minutes ago, or whether it will point to where the sun is now. There is a reasonable argument the “now” direction will be correct.
The GNSS satellites have enough data precision (years of data for 80 satellites, typically good to better than 30 mm at orbit heights above 22,000 km) for this distinction to be made, but I got distracted by work. If you have ever come back to a programming project intermittently, you know how much of the time goes to remembering what you were doing and how little goes to taking the next step.
I did improve on this by taking better notes, so it's easier to figure out where I left off, and by starting to use Cursor for coding, which lets me move quickly through steps I had already planned. The step blocking me now is getting an optimizer to fit the orbital calculations to the actual measurements. I've tried off-the-shelf optimization libraries and they failed pretty miserably. The problem is highly nonlinear, and standard optimizers easily get lost in a local minimum. If I get a day or two I may hammer on that problem a bit more, as the libraries do have parameters you can tweak to try to avoid that kind of failure, but it has been a time waster so far.
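One tactic I may try before giving up on the libraries is a multi-start wrapper: perturb the initial state, run a standard least-squares fit from each perturbation, and keep the best result. A minimal sketch with scipy, where propagate_orbit is a hypothetical stand-in for my finite-step integrator and the spread is a guess:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(state0, times, observed):
    # propagate_orbit is a hypothetical stand-in for the real
    # finite-step integrator; it returns predicted positions at `times`
    predicted = propagate_orbit(state0, times)
    return (predicted - observed).ravel()

def multistart_fit(x0, times, observed, n_starts=32, spread=1e-6, seed=0):
    """Fit the initial state from many perturbed starting guesses,
    keeping the best converged answer to dodge local minima."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        trial = x0 + spread * np.abs(x0) * rng.standard_normal(x0.shape)
        fit = least_squares(residuals, trial, args=(times, observed),
                            method="trf", x_scale="jac")
        if best is None or fit.cost < best.cost:
            best = fit
    return best
```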
There are off-the-shelf programs that do fit orbits, yes, I'm aware of that. The problem is that they all return Kepler parameters for the orbit, and it is not intuitive to work backward from those to isolate the direction of the sun. Still, I should be able to use one to greatly refine my starting velocity, and that may make the optimizer easier to fine-tune, since I would likely be starting within the valley of the best optimum.
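Working backward is at least mechanical, even if not intuitive. The textbook conversion from Kepler elements back to the position and velocity state a direct fit works with looks roughly like this sketch (the two-body assumptions and default gravitational parameter are illustrative simplifications):

```python
import numpy as np

def kepler_to_state(a, e, i, raan, argp, M, mu=3.986004418e14):
    """Convert Kepler elements to a Cartesian position/velocity state.
    Angles in radians, a in meters; mu defaults to Earth's GM."""
    # Solve Kepler's equation M = E - e*sin(E) by Newton iteration
    E = M if e < 0.8 else np.pi
    for _ in range(50):
        E -= (E - e * np.sin(E) - M) / (1 - e * np.cos(E))
    # Position and velocity in the perifocal (orbit-plane) frame
    r = a * (1 - e * np.cos(E))
    pos_pf = np.array([a * (np.cos(E) - e),
                       a * np.sqrt(1 - e**2) * np.sin(E), 0.0])
    vel_pf = (np.sqrt(mu * a) / r) * np.array([-np.sin(E),
                       np.sqrt(1 - e**2) * np.cos(E), 0.0])
    # Rotate into the inertial frame: Rz(raan) @ Rx(i) @ Rz(argp)
    def Rz(t): return np.array([[np.cos(t), -np.sin(t), 0],
                                [np.sin(t),  np.cos(t), 0], [0, 0, 1]])
    def Rx(t): return np.array([[1, 0, 0], [0, np.cos(t), -np.sin(t)],
                                [0, np.sin(t),  np.cos(t)]])
    R = Rz(raan) @ Rx(i) @ Rz(argp)
    return R @ pos_pf, R @ vel_pf
```

Feeding the velocity that comes out of this into the optimizer as its starting point is the refinement I have in mind.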
By themselves the finite-step calculations seem quite accurate. Short time steps and quad-precision math bring an orbit's end point back to its start point with about 20 microns of accuracy. I intend to keep working on it.
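For the curious, the propagation step itself is nothing exotic. Here is a stripped-down sketch of the kind of stepper involved, with mpmath at 34 digits standing in for quad precision; the bare point-mass force model is a simplification, since the real calculation must of course include the sun's pull:

```python
from mpmath import mp, mpf, sqrt

mp.dps = 34  # roughly IEEE quad precision, in decimal digits

MU = mpf("3.986004418e14")  # Earth GM in m^3/s^2 (point mass only)

def accel(pos):
    """Point-mass gravity; the real model adds at least the sun."""
    r2 = pos[0]**2 + pos[1]**2 + pos[2]**2
    r3 = r2 * sqrt(r2)
    return [-MU * p / r3 for p in pos]

def leapfrog(pos, vel, dt, steps):
    """Symplectic leapfrog integrator: good long-term energy behavior,
    so a closed orbit should return very near its starting point."""
    a = accel(pos)
    for _ in range(steps):
        vel = [v + ai * dt / 2 for v, ai in zip(vel, a)]
        pos = [p + v * dt for p, v in zip(pos, vel)]
        a = accel(pos)
        vel = [v + ai * dt / 2 for v, ai in zip(vel, a)]
    return pos, vel
```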
Resonant Clocks
One of the wastes of power in VLSI circuit design is clocks. Most power in CMOS goes to driving transitions on the gates: the gates have the most capacitance, and moving charge through resistance between two voltages is how power gets wasted. Capacitance is not changing much with new processes, since it fundamentally relates to how much current is needed to drive loads and the gate dimensions needed to control that current in silicon. Voltages are not changing much because they need to be multiples of the thermal (Boltzmann) voltage kT/q in order to have a good on/off ratio.
However, clocks have one potential advantage: they can be regular. Traditionally, low-power CMOS devices use variable clock rates to save power when load is light, but in AI the load is always high; you can switch off machines when you don't have much load, but the ones that stay on run as fast as they can go. That makes optimizations based on a fixed clock worth considering. Inductors are a complex kind of component in VLSI, so you might not want too many; variable clocks could use switched inductors if you really want both benefits.
Circuits with an inductor and capacitor working together in resonance do not need to dissipate energy in resistance. Inductance added to resonate with the gate capacitances would create a resonant clock circuit that dissipates very little power. The voltage swing may need to be higher than normal to ensure the resonant waveform moves through the gate's on/off transition range in a small fraction of the cycle. Within some limit that will not damage the circuit. We can probably do this with a resonant swing of less than 1 volt peak to peak. The critical on/off transition occupies about 50 mV, which is under 5% of the cycle for a 1 V peak-to-peak swing. That leaves 95% available for logic between clock transitions.
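A quick back-of-the-envelope script makes those numbers concrete. The capacitance and frequency below are illustrative assumptions, not measurements from any chip; the script finds the inductance that resonates with the clock network and the fraction of a sinusoidal cycle spent inside a 50 mV transition band:

```python
import math

C = 100e-12   # assumed total clock-network gate capacitance, 100 pF
f = 3e9       # assumed fixed clock rate, 3 GHz

# Inductance that resonates with C at f: f = 1 / (2*pi*sqrt(L*C))
L = 1 / ((2 * math.pi * f) ** 2 * C)

# For v(t) = (Vpp/2)*sin(2*pi*f*t), find how long each zero crossing
# spends inside the +/- band/2 window, then total it over the cycle.
Vpp, band = 1.0, 0.050                              # 1 V pp, 50 mV band
t_cross = math.asin(band / Vpp) / (2 * math.pi * f) # zero to +band/2
fraction = 4 * t_cross * f   # two crossings, each sweeping both halves

print(f"L = {L * 1e12:.1f} pH, transition fraction = {fraction:.1%}")
```

For these assumed values the transition band occupies only about 3% of the cycle, consistent with the under-5% figure above.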
Clock circuits are pervasive in modern pipelined logic and may use half the power. Regular logic CMOS pairs switch on average 50% of clock cycles, while the clock transistors must switch twice every cycle, for about four times the power density. If we can reduce the clock losses by adding some inductance, it could be a valuable gain.
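That four-times figure is just a ratio of activity factors in the standard dynamic-power formula P = alpha * C * V^2 * f; a couple of lines make it explicit (the activity values are the usual rough assumptions):

```python
# Dynamic CMOS power: P = alpha * C * V**2 * f, where alpha is the
# average number of transitions per cycle (illustrative values only).
alpha_logic = 0.5   # a typical logic node toggles about half the cycles
alpha_clock = 2.0   # the clock net rises and falls every single cycle

print(f"clock vs logic power density: {alpha_clock / alpha_logic:.0f}x")
```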
VLSI power density is now constraining how fast chips can be clocked. As we pack in denser gates we have to slow down, so we don't get that much more computation per unit of energy. Moving to a fixed clock goes against recent practice, but it may be worth looking at whether a resonant fixed clock is a better path forward for compute-intensive workloads.
Where did the time go?
BERT was one of the seminal transformer models and is still in use. The B in BERT stands for bidirectional: forming attention relationships between tokens in both the forward and backward directions. You might write an essay with the choice of words at the start influenced by the points you intend to make later. Words later in an essay definitely bear upon the correct meaning of words earlier in it, and extracting those meanings and relations is the purpose of attention. It may seem odd that the model seems to work backward in time, but really it is just left and right across the whole input to the model.
The transformer family BERT belongs to has two phases, encode and decode. In encode the model takes all the material prepared: the prompts, any information retrieved by a process like RAG or search, and the query describing what is to be generated. The encode phase is all about extracting meaning and relationships from these given materials. Encode is also known as prefill.
The decode phase is when new tokens are generated, usually one or two tokens per decode pass. Decode runs repeatedly through the model, all the while using the prefill context as a guide. Prior output tokens extend the context as later tokens are generated. The prefill does not need to be recalculated, because it does not change while later tokens are generated. In some sense, as you make choices on output you are resolving ambiguities in the prefill and deciding to go one way or another, but current LLMs mostly block backward changes.
At some point LLMs switched to decode-only models, for example Llama. These perform both the prefill and the decode with the same model coefficients. A mask is used in the matrix arithmetic which essentially allows only forward calculations in time and zeros out any backward calculations. This guarantees that anything previously calculated will not change, which is a wonderful improvement in speed for token decoding.
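In code the mask is nothing exotic. A minimal numpy sketch of single-head causal attention, with names and shapes of my own choosing for illustration:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head attention with a causal mask: token i may only
    attend to tokens j <= i, so earlier results never change."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly future
    scores[mask] = -np.inf        # softmax turns these into weight 0
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```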
This unchanging prior history becomes the KV cache, which can be loaded instead of recalculated, saving 99% or more of the calculation in the decode phase.
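Continuing the sketch above, a decode step then computes attention only for the one new token against the cached keys and values; no mask is needed, because everything in the cache is already that token's past:

```python
import numpy as np

def decode_step(q_new, k_new, v_new, K_cache, V_cache):
    """One decode step: append this token's key/value to the cache
    and attend over the whole cache with a single query row."""
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    d = q_new.shape[-1]
    scores = q_new @ K_cache.T / np.sqrt(d)   # one row, length T+1
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache, K_cache, V_cache
```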
However, since the model is decode-only, the backward-blocking mask applies to prefill as well. It doesn't save anything on inference prefill, because prefill is the first pass: it creates the KV cache rather than consuming it. The cache for all prior tokens is calculated in that one pass.
Why are we not allowing bidirectional relationships during prefill? Not only does it seem to lose some potential extraction of meaning and relationships, it appears to add to the calculation, since applying the mask is an extra step. There seems to be some loss of potential information and accuracy in the model, with no efficiency gain and in fact a small efficiency loss. At least that is the inference-time point of view.
It may be that handling the decode and encode phases separately makes training more complicated. Until recently training difficulties dominated discussions, so that would have been a prudent optimization. With the rise of inference decode, which is even needed for some forms of training, that trade-off seems worth re-examining. I suspect there are clever ways of training a decode-only model and then fine-tuning the prefill to work without the masking, on a slightly different set of weights.
Looking forward
So maybe one day I will get around to working on the clocks or bidirectional prefill ideas. If any of you do, let me know! And if it's already been done, let me know that as well. I'd be curious to know whether it worked and, if it does not work, why not.
I think pure causal is conclusively better.
> but it actually adds to the calculations since the mask calculation is an extra step.
Masking saves prefill compute w/ flash attention (or, more generically, any attn impl that computes mask(q@k.T) block-wise, instead of separating the matmul & mask steps). So on FLOPs, going unmasked is definitely worse.
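A toy tile count illustrates the block-wise saving; the sequence length and block size here are arbitrary assumptions:

```python
def tiles_computed(seq_len, block, causal):
    """Count the (q_block, k_block) tiles a block-wise attention kernel
    must compute. Causal masking skips tiles entirely in the future,
    roughly halving the work at long sequence lengths."""
    n = (seq_len + block - 1) // block
    tiles = 0
    for qb in range(n):
        for kb in range(n):
            if causal and kb > qb:
                continue  # whole tile masked: no matmul needed
            tiles += 1
    return tiles

print(tiles_computed(8192, 128, causal=True),   # 2080 tiles
      tiles_computed(8192, 128, causal=False))  # 4096 tiles
```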
> There seems to be some loss of potential information and accuracy in the model
You can take the performance of encoder-based MLLMs as a proxy for the potential perf gains of noncausal attn. Because ViTs (and other modality encoders) are often bidirectional (for good reason), the use of full attention (for the cross-modal subsequence) is common.
For example, in https://arxiv.org/html/2409.03206#:~:text=Table%203%3A,Different%20Method%20Settings, it is found that a block-causal mask obtains better video understanding results than a full mask.
So it is possible that causal attention is actually beneficial for text overall.
> I suspect there are clever ways of training a decode only model and then fine tuning the prefill to work without the masking on a slightly different set of weights.
If I had to hedge, perhaps it will later be discovered that models trained on a pure next-token-prediction objective fail to benefit from said fine-tuning, but models that were pretrained with a mixture-of-denoisers objective will be more amenable to your suggested approach.