<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Konstantin Weitz's Blog]]></title><description><![CDATA[Hard stuff explained to ordinary programmers.]]></description><link>https://weitz.blog</link><image><url>https://substackcdn.com/image/fetch/$s_!Oiuy!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5152859-dd27-441b-bc91-1e328f985273_500x500.png</url><title>Konstantin Weitz&apos;s Blog</title><link>https://weitz.blog</link></image><generator>Substack</generator><lastBuildDate>Tue, 19 May 2026 03:04:25 GMT</lastBuildDate><atom:link href="https://weitz.blog/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Konstantin Weitz]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[weitz@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[weitz@substack.com]]></itunes:email><itunes:name><![CDATA[Konstantin Weitz]]></itunes:name></itunes:owner><itunes:author><![CDATA[Konstantin Weitz]]></itunes:author><googleplay:owner><![CDATA[weitz@substack.com]]></googleplay:owner><googleplay:email><![CDATA[weitz@substack.com]]></googleplay:email><googleplay:author><![CDATA[Konstantin Weitz]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Management Lessons Learned from Playing Factorio]]></title><description><![CDATA[An After-the-Fact Justification]]></description><link>https://weitz.blog/p/factorio-management-lessons</link><guid isPermaLink="false">https://weitz.blog/p/factorio-management-lessons</guid><dc:creator><![CDATA[Konstantin Weitz]]></dc:creator><pubDate>Fri, 14 Nov 2025 19:27:34 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!emL9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!emL9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!emL9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 424w, https://substackcdn.com/image/fetch/$s_!emL9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 848w, https://substackcdn.com/image/fetch/$s_!emL9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!emL9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!emL9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg" width="1024" height="785" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:273889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://weitz.blog/i/178836685?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5b7c02c-6776-4a5f-8877-cfdeff89ca5b_1024x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!emL9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 424w, https://substackcdn.com/image/fetch/$s_!emL9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 848w, https://substackcdn.com/image/fetch/$s_!emL9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!emL9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10d3f229-927d-4bd4-bcac-8b3fb853bdab_1024x785.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I&#8217;ve sunk more hours into playing Factorio than I&#8217;m willing to admit. This post is my after-the-fact justification for how that was all really just a ploy for professional development.</p><p>I mean, from the outset, you wouldn&#8217;t expect to learn any management skills from playing a video game. But with enough desperation, you can convince yourself that Factorio&#8217;s focus on task delegation makes it a surprisingly effective simulation for many parts of the management job; it even holds two great advantages over real-world management: First, the barrier to entry is very low &#8212; you don&#8217;t have to be entrusted with a 1,000-person team to start. 
Second, the stakes are a lot lower: it&#8217;s far easier to reload a previous factory snapshot after it&#8217;s been overrun by enemies than it is to restart your career after your entire team walks out.</p><p>Anyway, here&#8217;s what I&#8217;m pretending to have learned:</p><p>Disclaimer: I&#8217;m a manager working on ML Infrastructure at Google. All opinions are my own, not necessarily those of my employer.</p><p><strong>Delegation is key.</strong> Factorio thrives on the idea that you can delegate a lot of your tasks to automation. In the beginning, you mine your own resources, but you can quickly build a miner to do that for you. Crafting items is quickly outsourced to assemblers, and moving items around gets delegated to belts and inserters. Eventually, you get construction robots to build your factory, and artillery to deal with enemies. You can even download blueprints for your factory and outsource the factory design. While you can do everything yourself, you can ultimately only do a single thing at a time, so delegation is absolutely key to making your factory grow, and the factory must grow. As a real manager, you have almost unlimited freedom to delegate tasks, and you have to use that power to scale.</p><p><strong>Protecting your time.</strong> In Factorio, you can almost always get &#8220;that one thing&#8221; done faster by just doing it manually. You need just five more assemblers to finish a build? It&#8217;s easiest to just get all the items and craft them by hand. Have a couple thousand items stuck in a chest? Go run around and distribute them to the right places in your factory. Are you overrun by enemies and need laser turrets ASAP? Just craft them by hand. Your spaceship just ran out of ammo and is about to be destroyed? Jump into the action and manually move ammo around. While those moments feel great and meaningful, and are sometimes necessary, they are distractions from building your actual factory. 
If you want to be successful, you need to build reliable, self-sufficient systems that minimize the number of interruptions. It&#8217;s the same for managers. When things go wrong, you need to be there to quickly get things back on track, but to scale, you need to build a team that doesn&#8217;t need your help all the time, even if that means things will take longer while you get that system running.</p><p><strong>Management isn&#8217;t for everyone.</strong> My kids love playing Minecraft. They love mining resources, they love building houses, and they love running around collecting chicken eggs. That&#8217;s also how they play Factorio. They don&#8217;t want to build a miner, because they enjoy mining themselves. No offense to my children, but that&#8217;s not how you beat Factorio. Personally, while I outsource almost everything in Factorio, I still do all the design myself, because that&#8217;s what I love, but I would be much more effective if I just downloaded the best blueprints. In real life, you may be an incredible individual contributor, and be great at what you do, but being a manager and delegating everything is a very different job. You may not be good at it, you may not enjoy it, and you may have to delegate the very thing you love the most.</p><p><strong>You need to know how things are going.</strong> Your factory will fail from time to time. Maybe one of your resource patches ran dry, maybe some biters made it through your defense, maybe a new part of your factory is draining resources from some other part, and now your whole production has come to a halt. When this happens, it is your job to discover that things are broken, and quickly get things back on track. This means you need good visibility into how things are going. 
End-to-end metrics, like science per minute produced, are key, but you also want some lower-level metrics for crucial things that are very hard to fix once broken, like the available fuel for your spaceship, the bioflux available for your biter spawners, etc. You also want alerts when those metrics go below acceptable levels. Once an alert goes off, you need to be very quick to come up to speed on the failed component and figure out a way to fix it. This is the same for managers. You need to be aware of key metrics: what is your customer&#8217;s perception of your product, are your projects on track, how do you compare to competitors, etc. You need to be alerted before things go really bad, and you need to be able to quickly find out what&#8217;s going wrong, and fix it.</p><p><strong>Managers need to be versatile.</strong> In Factorio, you need to be incredibly versatile. While your miner can be good at just mining, and your assembler can be good at just assembling, and your laser turret can be good at just shooting biters, in an emergency, you have to be able to do it all. You don&#8217;t have to be the best, but you need to be passable at it, until you find a way to delegate it. It&#8217;s the same for managers. If your product manager/software engineer/recruiter/etc. quits, or turns out to not be up to the task, you can&#8217;t just throw up your hands and quit; you have to fill the gap. You are the ultimate backstop.</p><p><strong>The limits of delegation.</strong> In Factorio, at some point you will have a factory that automatically mines all its resources, moves them around, processes them into useful stuff, automatically kills all enemies, repairs stuff, and self-replicates to take over the whole world. And all you have left to do is sit there, marvel at your creation, and see the numbers go up. Well, at my real job, I&#8217;m not quite there yet, and there&#8217;s still a lot of stuff I haven&#8217;t figured out how to delegate. 
But, in my free time, I have this little side hustle of running the world economy. And all I do is put some money in my worldwide diversified index fund every once in a while, and that&#8217;s it. I just sit back, relax, and see the numbers go up. All the hard parts are efficiently delegated away. Some active traders make sure I invest in the right things, market makers ensure that I can trade anytime almost for free, someone catalogs all the companies in the world for me, etc. The one thing you cannot delegate is the overall responsibility; you are the ultimate backstop. When the government changes its tax scheme, or hyperinflation ravages your country, your investment strategy, or place of residence, may need some rethinking.</p><p>Please take these insights with a grain of salt. For all its complexities, Factorio is still just a very crude simulation of real life. Also, this post is likely just an after-the-fact rationalization to make me feel better about playing a video game, so maybe don&#8217;t take it too seriously. 
If you liked this post and want to encourage me to create more, please subscribe or find some other way to let me know I&#8217;m not just screaming into the void.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://weitz.blog/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://weitz.blog/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[You don't need tensors to understand attention]]></title><description><![CDATA[If you&#8217;ve ever tried to understand Large Language Models (LLMs), you&#8217;ve likely run into a wall of tensors and matrix multiplication ...]]></description><link>https://weitz.blog/p/attention-explained-to-ordinary-programmers</link><guid isPermaLink="false">https://weitz.blog/p/attention-explained-to-ordinary-programmers</guid><dc:creator><![CDATA[Konstantin Weitz]]></dc:creator><pubDate>Wed, 22 Oct 2025 23:45:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7c324d95-292c-462b-986b-935466b16ce6_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gAxX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gAxX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!gAxX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!gAxX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!gAxX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gAxX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5602344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://weitz.blog/i/174211261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!gAxX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!gAxX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!gAxX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!gAxX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9068ddd1-6975-488a-a14d-6a2a41158562_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>If you&#8217;ve ever tried to understand Large Language Models (LLMs), you&#8217;ve likely run into a wall of tensors and matrix multiplication. It&#8217;s easy to assume that without the heavy math, any explanation is doomed to be hand-wavy or imprecise. The surprising truth, however, is that we can describe the attention mechanism with total precision, without relying on tensors or linear algebra.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>And that&#8217;s exactly what we&#8217;ll be doing in this blog post. We are going to build a linear-algebra-free implementation of attention in pure Python, capable of running Llama 2 (albeit slowly) at around one token per second on a CPU with 64 GB of memory (no GPU required).</p><p>Disclaimer: I work on ML Infrastructure at Google. All opinions are my own, not necessarily those of my employer.</p><p>Reading time: 7 minutes</p><h1>Introduction</h1><p>LLMs are essentially autocomplete on steroids, so a call to an <em>llm</em> function may look like this:</p><pre><code><code>llm("Apples are a kind of") == "Fruit"</code></code></pre><p>Internally, LLMs implement this autocomplete by using an attention mechanism to focus on certain parts of the input. In this example, the LLM may be focusing (among many other things) on the noun (i.e. a person, place, or thing) of the input sentence. The somewhat simplified internals of the LLM may thus look something like this:</p><pre><code><code>def llm(input):
  ...
  noun = attend_to(input, is_noun)
  ...
  if noun == "Apples":
    result = "Fruit"
  ...</code></code></pre><p>In this post we&#8217;ll talk about how this <em>attend_to</em> function is implemented.</p><p>But before we dive deep into attention, we quickly need to understand the concept of <strong>tokenization</strong>. Instead of processing strings directly, LLMs split their input into a sequence of tokens. A simple tokenization could look like the one below, where the tokens (e.g. APPLES) are just arbitrary numbers:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><pre><code>llm([APPLES, ARE, A, KIND, OF]) == FRUIT</code></pre><p>With this out of the way, let&#8217;s dive right into attention.</p><h1>Attention Scores</h1><p>Attention is basically just a search function that takes a list of <em>inputs</em> and a <em>score</em> function, and returns the inputs whose score is non-zero.</p><p>For example, if you have the input sequence [APPLES, ARE, A, KIND, OF] and a score function that returns 1 for nouns (i.e. things, places, or people) and 0 for everything else, then you&#8217;d get:</p><pre><code><code>attend_to([APPLES, ARE, A, KIND, OF], is_noun) = APPLES</code></code></pre><p>I&#8217;m sure you can think of a million easy ways to implement this. The way we&#8217;ll be implementing this is by multiplying all the tokens by their score and summing them up:</p><pre><code><code>= is_noun(APPLES)  * APPLES +
  is_noun(ARE)     * ARE + 
  is_noun(A)       * A + 
  is_noun(KIND)    * KIND + 
  is_noun(OF)      * OF

= 1 * APPLES + 
  0 * ARE + 
  0 * A + 
  0 * KIND + 
  0 * OF 

= APPLES</code></code></pre><p>And here&#8217;s the full code:</p><pre><code>def attend_to(inputs, score):
  result = 0
  for input in inputs:
    result += score(input) * input
  return result</code></pre><p>And believe it or not, this is pretty close to what an LLM actually does! Internally, LLMs will call the <em>attend_to</em> function many times to figure out what token to generate next; e.g. the LLM will attend to nouns to figure out what the sentence is all about, and then generate a reasonable completion, like FRUIT.</p><p>While we provided a concrete implementation of the <em>score</em> function here (i.e. <em>is_noun</em>), LLMs usually learn their score functions during the training phase. We won&#8217;t go into detail in this post, but these learned functions are very simple. They don&#8217;t have loops, don&#8217;t call APIs, don&#8217;t have side-effects, and don&#8217;t have much control flow. They mostly just do a bunch of number crunching.</p><h1>Attention Values</h1><p>The above approach works well if the score function returns 1 for exactly one input, and 0 for everything else. But when there are multiple inputs with a non-zero score, the output may become nonsensical (with the input in the example below, we&#8217;d literally be adding APPLES and ORANGES).</p><p>To fix this, we pass one additional argument to the attention function. The <em>value</em> argument is a function that maps each input to a value that actually makes sense to add up.</p><p>For example, we could pass <em>sweetness</em> as the value function, which returns 1 for very sweet tokens, and 0 for tokens that are not sweet at all. The attention function would then again pay attention to the input tokens that are nouns, but instead of adding up the nouns themselves, it would add up the nouns&#8217; sweetness.</p><p>So to figure out the sweetness of a fruit salad, we&#8217;d write:</p><pre><code>attend_to([APPLES, ORANGES, AND, FIGS, ARE], is_noun, sweetness)

= is_noun(APPLES)  * sweetness(APPLES) +
  is_noun(ORANGES) * sweetness(ORANGES) + 
  is_noun(AND)     * sweetness(AND) + 
  is_noun(FIGS)    * sweetness(FIGS) + 
  is_noun(ARE)     * sweetness(ARE)

= 1 * 0.6 +
  1 * 0.4 +
  0 * 0.0 + 
  1 * 0.9 +
  0 * 0.0

= 1.9</code></pre><p>With real LLMs, just like the <em>score</em> function, the <em>value</em> function is usually quite simple, and learned during the training phase.</p><p>And that's pretty close to what we want! Just two more small additions: 1) We normalize the output by the total score, so our sweetness level doesn&#8217;t grow higher and higher the more fruits we add to our salad. Hmm, yummy 0.633 (= 1.9 / 3) sweetness fruit salad. 2) For context, we also pass the last element of the <em>inputs</em> sequence to the <em>score</em> function.</p><p>With those in mind, here is our final attention implementation:</p><pre><code>def attend_to(inputs, score, value):
  result = 0
  total_score = 0
  last_input = inputs[-1]

  for input in inputs:
    result += score(input, last_input) * value(input)
    total_score += score(input, last_input)

  return result / total_score</code></pre><p>And that&#8217;s it! You can literally use this attention function verbatim in an LLM implementation and it works. Here is an <a href="https://github.com/konne88/ml">example Llama2 implementation</a> which, when prompted with &#8220;Always answer with Haiku. I am going to Paris, what should I see?&#8221;, prints:</p><div class="pullquote"><p>Eiffel Tower high<br>Love locks on bridge embrace<br>River Seine's gentle flow</p></div><p>But how can it be so simple, you may ask. If you&#8217;ve previously read about attention, you probably heard terms like embeddings, keys and queries, softmax, and KV caches. What about all of that? Good questions. Here we go:</p><h1>Embeddings</h1><p>We&#8217;ve been passing tokens directly into the attention function; real LLMs use embeddings instead. An embedding replaces a token with the properties of that token. For example, the embedding structure for a token could be:</p><pre><code>from dataclasses import dataclass

@dataclass
class Embedding:
  is_fruit: float
  is_animal: float
  is_noun: float
  is_plural: float
  sweetness: float</code></pre><p>The embedding of APPLES would then be:</p><pre><code>embed(APPLES) == Embedding(
  is_fruit=1.0,
  is_animal=0.0,
  is_noun=1.0,
  is_plural=1.0,
  sweetness=0.6,
)</code></pre><p>Here we&#8217;ve provided a concrete embedding of APPLES. Real LLMs usually learn the embedding for every single token. With embeddings, our <em>score</em> and <em>value</em> functions become simpler &#8212; which also means they become much easier for the LLM to learn:</p><pre><code>def is_noun(embedding):
  return embedding.is_noun

def sweetness(embedding):
  return embedding.sweetness</code></pre><p>Llama2&#8217;s 7B model embeds tokens into vectors of 4096 floats (instead of the 5 floats in our example); and embeds values (returned by the value function) into vectors of 128 floats (instead of the single sweetness float used in our example)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><h1>Keys and Queries</h1><p>Unlike the <em>value</em> function which is completely learned, an LLM&#8217;s score function has some fixed structure, and only some aspects can be learned. This fixed structure is the following:</p><pre><code>def score(input, last_input):
  return combine(key(input), query(last_input))</code></pre><p>First, the score function computes some properties of the <em>input</em> with the learned <em>key</em> function; then it computes some properties of the <em>last_input</em> in the input sequence with the learned <em>query</em> function; and lastly it combines the two using a fixed <em>combine</em> function.</p><p>Both the <em>key</em> and <em>query</em> functions are usually quite simple and learned during the training phase. In Llama2 7B, both the <em>key</em> and <em>query</em> functions return vectors of 128 floats.</p><h1>Softmax</h1><p>The <em>combine</em> function is fixed (not learned) and is the only part of the attention mechanism that I can&#8217;t explain without using linear algebra; sorry. The combine function takes the key and query vectors, and computes their <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> to turn them into a single floating point number, which it then divides by a constant (in Llama2 7B, that constant is the square root of 128). I&#8217;m not quite sure why this dot product and division is the right thing to do, and maybe there are other strategies that could make sense, but this is what it is.</p><p>Finally, the combine function raises <em>e</em> to the power of the score computed so far. This has the effect that the difference between scores gets amplified, so that the value with the highest score gets the vast majority of the attention (even if the score difference to the other values is relatively small). This property of putting most of the weight behind the maximum is why this mechanism is referred to as softmax.</p><p>Here&#8217;s the full <em>combine</em> implementation for Llama2 7B:</p><pre><code>import math
import torch

def combine(key, query):
  return math.exp(torch.dot(query, key) / math.sqrt(128))</code></pre><h1>KV Cache</h1><p>LLMs call the attention function many times for similar input sequences, so a lot of computational resources can be saved by using a cache to <a href="https://en.wikipedia.org/wiki/Memoization">memoize</a> the <em>key</em> and <em>value</em> function calls. This cache is called the key-value cache (aka KV cache).</p><h1>Summary</h1><p>For easy reference, here&#8217;s a copy of all the important code:</p><pre><code><code>def attend_to(inputs, score, value):
  result = 0
  total_score = 0
  last_input = inputs[-1]

  for input in inputs:
    result += score(input, last_input) * value(input)
    total_score += score(input, last_input)

  return result / total_score

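# A hypothetical torch-free sketch of combine, for illustration only. The
# names dot and combine_plain are assumptions made for this sketch, not part
# of the linked Llama2 implementation. Vectors are plain lists of floats.
import math

def dot(xs, ys):
  # Multiply the vectors element by element and add up the products.
  return sum(x * y for x, y in zip(xs, ys))

def combine_plain(key, query):
  # Scale the dot product by the square root of the vector length, then
  # raise e to that power so that larger scores stand out more.
  return math.exp(dot(key, query) / math.sqrt(len(key)))

# Usage sketch with made-up two-element vectors: the key that lines up with
# the query ends up with most of the normalized weight.
toy_keys = [[1.0, 0.0], [0.0, 1.0]]
toy_query = [2.0, 0.0]
toy_scores = [combine_plain(key, toy_query) for key in toy_keys]
toy_weights = [s / sum(toy_scores) for s in toy_scores]
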
def combine(key, query):
  # len(key) is the length of the key and query vectors (128 in Llama2 7B).
  return math.exp(torch.dot(query, key) / math.sqrt(len(key)))</code></code></pre><h1>Conclusion</h1><p>Hopefully, I&#8217;ve provided you with a decent explanation of the attention function inside LLMs. And yet, there are still many things to talk about. How does the attention function really get used inside an LLM, e.g. what are encoders/decoders, what are transformers? What is the linear algebra used to implement the learned key, value, and query functions? What is normalization, what is batching? How do you actually make this run fast on a GPU using matrix multiplication, how do you train it efficiently on 20K GPUs (this is what I actually do on a daily basis)? How do you generate images? I&#8217;d love to write more about these. If you&#8217;d like to read them, please subscribe or something, so I know this is useful, and I&#8217;m not just screaming into the void :-)</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://weitz.blog/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://weitz.blog/subscribe?"><span>Subscribe now</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The trick to make this possible is to treat matrix multiplication of tensors as the invocation of regular functions that happen to be linear.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The tokenization above is a bit simplistic; e.g. 
real tokenizers usually split words like &#8220;fishing&#8221; into the tokens FISH and ING.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>When values are vectors, multiplication of the value vector V with the score S is interpreted as scalar multiplication, i.e. every element of V is multiplied by S.</p></div></div>]]></content:encoded></item></channel></rss>