{"id":54,"date":"2026-04-29T15:48:05","date_gmt":"2026-04-29T15:48:05","guid":{"rendered":"https:\/\/shaheenahmedc.com\/?page_id=54"},"modified":"2026-04-29T15:52:02","modified_gmt":"2026-04-29T15:52:02","slug":"ai-safety-camp-mechanistic-interpretability-open-source-library","status":"publish","type":"page","link":"https:\/\/shaheenahmedc.com\/?page_id=54","title":{"rendered":"AI Safety Camp: Mechanistic Interpretability Open-Source Library"},"content":{"rendered":"\n<p>[Presentation <a href=\"https:\/\/youtu.be\/s0LE1tfn1Vs?si=26h9_sV0wmjkSFwn\">link<\/a>]<\/p>\n\n\n\n<p>If you want to know what&#8217;s happening inside an open-source LLM, you must first grapple with a four-dimensional grid of activations spread across layers, positions, heads, and hidden dimensions. The mechanistic interpretability community has invented a set of methods for this, such as logit lenses, future lenses, activation patching and attention-head probing, each typically living in a separate codebase with potentially subtly different implementation details. <\/p>\n\n\n\n<p>In January 2024, Google PAIR&#8217;s&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2401.06102\" target=\"_blank\" rel=\"noreferrer noopener\">Patchscopes paper<\/a>&nbsp;made a nice observation: every one of these methods is the&nbsp;<em>same operation<\/em>&nbsp;with different parameters. You take a hidden state from a source forward pass&nbsp;and patch it (possibly via some mapping) into a target forward pass. The paper parameterises this operation as a tuple of choices: the logit lens is one such tuple, token identity is another, cross-model patching is another. 
Once you stare at it, every &#8220;new&#8221; lens is just another point in a 5-tuple parameter space, and the case for implementing each one separately weakens.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"738\" src=\"https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1024x738.png\" alt=\"\" class=\"wp-image-55\" srcset=\"https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1024x738.png 1024w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-300x216.png 300w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-768x554.png 768w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1536x1107.png 1536w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image.png 1584w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><code class=\"\">obvs<\/code>\u00a0is the library\u00a0my AI Safety Camp team wrote (<a href=\"https:\/\/github.com\/obvslib\/obvs\">GitHub<\/a>, <a href=\"https:\/\/pypi.org\/project\/obvs\/\">PyPI<\/a>)\u00a0to provide mechanistic interpretability researchers with a single implementation of the methods falling under this abstraction. Much of the library is essentially three objects: a\u00a0<code class=\"\">SourceContext<\/code>\u00a0and a\u00a0<code class=\"\">TargetContext<\/code>\u00a0that mirror the paper&#8217;s tuples one-for-one, and a\u00a0<code class=\"\">Patchscope<\/code>\u00a0that, given the preceding two objects, handles tracing the forward passes, the residual-stream intervention, and the generation that follows, all on top of\u00a0<a href=\"https:\/\/nnsight.net\/\" target=\"_blank\" rel=\"noreferrer noopener\"><code class=\"\">nnsight<\/code><\/a>. Because every lens is now the same object with different parameters,\u00a0experimentation moves closer to playing around with configuration files, rather than battling with implementation details. 
Loop over layers to watch the model&#8217;s answer take shape, swap the target prompt to ask the model to describe its own hidden state in plain English, or patch into a bigger LLM to get a richer description. It&#8217;s a small attempt to make a few mechanistic interpretability methods more accessible to researchers.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"722\" src=\"https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1-1024x722.png\" alt=\"\" class=\"wp-image-56\" srcset=\"https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1-1024x722.png 1024w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1-300x211.png 300w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1-768x541.png 768w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-1.png 1422w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"896\" height=\"1024\" src=\"https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-2-896x1024.png\" alt=\"\" class=\"wp-image-57\" srcset=\"https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-2-896x1024.png 896w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-2-262x300.png 262w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-2-768x878.png 768w, https:\/\/shaheenahmedc.com\/wp-content\/uploads\/2026\/04\/image-2.png 1046w\" sizes=\"auto, (max-width: 896px) 100vw, 896px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>[Presentation link] If you want to know what&#8217;s happening inside an open-source LLM, you must first grapple with a four-dimensional grid of activations spread across layers, positions, heads, and hidden dimensions. 
The mechanistic interpretability community has invented a set of methods for this, such as logit lenses, future lenses, activation patching and attention-head probing, each [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-54","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=\/wp\/v2\/pages\/54","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=54"}],"version-history":[{"count":2,"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=\/wp\/v2\/pages\/54\/revisions"}],"predecessor-version":[{"id":60,"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=\/wp\/v2\/pages\/54\/revisions\/60"}],"wp:attachment":[{"href":"https:\/\/shaheenahmedc.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=54"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}