WHY THIS MATTERS IN BRIEF
Reliable long-context retrieval turns AI from assistant into autonomous coordinator, reshaping how companies staff and run engineering.
Matthew Griffin is the World’s #1 Futurist Keynote Speaker and Global Advisor for the G7 and Fortune 500, specialising in exponential disruption across 100 countries.
In the artificial intelligence world, every few months we get a new headline number, whether it is a bigger model, a longer context window, a higher benchmark score, or a flashier demo. Claude Opus 4.6 arrived with one of those headline numbers: a 5x context window expansion, going from 200,000 tokens to 1 million tokens. But if you actually look at what makes Opus 4.6 feel different, the million-token window is not the main story. The real story is what happens when AI systems can reliably retrieve, coordinate, and act across large contexts. That shift is subtle, but it changes everything. Because once retrieval becomes dependable, the model stops being a chatbot that remembers more text. It becomes something closer to a system that can hold a working mental model of an entire project, a team, and even an organisation. And that is where things start getting interesting.
A million tokens sounds impressive, and it is, but most people misunderstand what it means. A big context window is like having a huge filing cabinet; you can stuff a lot of documents inside it, but that does not guarantee you can find the right page when you need it. This is exactly what many large-context models looked like in early 2026. They could accept long inputs, but their ability to retrieve specific details deep inside the context was unreliable, making it feel as though the model could see the beginning of the text clearly while the rest became blurry. Opus 4.6 changes that because it dramatically improves its score on MRCRv2, a benchmark originally developed by OpenAI that measures long-context retrieval quality and is often described as a “needle-in-a-haystack” test. The benchmark is simple in spirit, testing whether a model can find a specific piece of information hidden inside a long document. In real-world work, that is basically the entire game, since codebases, audit logs, technical specs, legal documents, incident reports, architecture diagrams, and compliance evidence are all haystacks.
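To make the “needle-in-a-haystack” idea concrete, here is a minimal single-needle sketch in Python. It is a toy harness, not the MRCRv2 implementation, and every name in it (`build_haystack`, `retrieval_accuracy`, `keyword_baseline`) is hypothetical. The `answer_fn` parameter stands in for a real model call, which is assumed rather than shown.

```python
import random


def build_haystack(needle: str, filler: str, n_lines: int, position: int) -> str:
    """Return n_lines of filler text with the needle inserted at a line index."""
    lines = [f"{filler} ({i})" for i in range(n_lines)]
    lines.insert(position, needle)
    return "\n".join(lines)


def retrieval_accuracy(answer_fn, secrets, n_lines=1000, seed=0):
    """Fraction of trials in which answer_fn returns the hidden secret."""
    rng = random.Random(seed)
    hits = 0
    for secret in secrets:
        # Hide the needle at a random depth so position bias shows up in the score.
        position = rng.randrange(n_lines + 1)
        needle = f"The access code is {secret}."
        haystack = build_haystack(needle, "Routine log entry", n_lines, position)
        if answer_fn(haystack, "What is the access code?") == secret:
            hits += 1
    return hits / len(secrets)


def keyword_baseline(haystack: str, question: str) -> str:
    """Hypothetical stand-in for a model: scans every line for the needle."""
    prefix = "The access code is "
    for line in haystack.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].rstrip(".")
    return ""


if __name__ == "__main__":
    print(retrieval_accuracy(keyword_baseline, ["A1", "B2", "C3"]))
```

Swapping `keyword_baseline` for a function that calls a language model would turn this into a rough retrieval check, with the caveat that real benchmarks use multiple needles, distractors, and far longer inputs.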
The retrieval numbers illustrate the shift: Claude Sonnet 4.5 scored 18.5% and Gemini 3 Pro reached 26.3%, while Opus 4.6 achieved a striking 76% at 1 million tokens and 93% at 256,000 tokens. This is not a small improvement; it represents a different category of system. A model that retrieves correctly only about 1 out of 5 times is not reliable enough to act autonomously because it constantly needs supervision, retries, and human verification. A model that retrieves correctly 3 out of 4 times begins to feel usable, but a model that retrieves correctly 9 out of 10 times starts behaving like something you can trust to operate inside real production workflows. This is why the 5x context expansion is almost a distraction. The real breakthrough is that the model can now use the context window as working memory, and working memory is where reasoning becomes genuinely useful.
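A rough way to see why these percentages feel like different categories is to treat each score as a per-retrieval success probability and ask what happens over several retrievals. That is a simplifying assumption of this sketch: benchmark scores are not literally per-step probabilities, and real retrievals are not independent. The function names are hypothetical.

```python
def chain_success(p: float, steps: int) -> float:
    """Probability that `steps` independent retrievals all succeed."""
    return p ** steps


def expected_attempts(p: float) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    return 1.0 / p


# Scores quoted in the article, read here as success probabilities (an assumption).
scores = {"18.5%": 0.185, "76%": 0.76, "93%": 0.93}

for label, p in scores.items():
    print(
        f"{label}: chance that 5 retrievals in a row succeed = "
        f"{chain_success(p, 5):.4f}, expected attempts per success = "
        f"{expected_attempts(p):.2f}"
    )
```

Under these assumptions, small differences in single-step accuracy compound quickly once a task depends on several correct retrievals, which is the intuition behind the supervision and retry burden described above.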
This transformation matters deeply for engineering, introducing what can be thought of as the Senior Engineer Effect. If you compare a contractor reading code to a senior engineer who understands a system holistically, the difference is clear. Senior engineers do not just read files; they carry a living mental model of how modules interact, where state is shared, what will break if you change a seemingly unrelated component, and which part of the system is brittle and why. That is what makes them valuable, not their typing speed. Opus 4.6 is capable of holding around 50,000 lines of code in memory, up from around 10,000 previously, but the line count itself is not the key point. The breakthrough is that the model can reason across the whole system at once, allowing it to plan changes without constantly losing context. In practical terms, this is the difference between simple AI autocomplete and an AI engineer with true architectural awareness. Once that happens, the bottleneck stops being code generation and becomes specification quality, validation, and decision-making, which outlines a very different job description.
A clear example of this shift is the Rakuten deployment story, which shows why Opus 4.6 is not just another model update. Rakuten deployed Claude Code in production and observed something striking: Opus 4.6 autonomously closed 13 issues, assigned 12 issues to the correct team members in a single day, managed routing across a 50-person engineering organisation spanning six separate repositories, and knew exactly when to escalate problems to humans. This is important because it is no longer just AI helping engineers write code; it is AI doing something closer to operational leadership. It is handling triage, delegation, ownership mapping, and dependency-aware assignment. In other words, it is managing the operational parts of engineering management that are a massive time sink in modern software organisations, filling Jira boards and consuming entire sprint planning cycles.
Rakuten’s case suggests that agentic AI is not only capable of doing tasks, but can also handle coordination, which is where organisations spend a significant portion of their resources. The surprise was that the model did not just write code; it understood the org chart. It grasped which team owned which repository, which engineer had context on a specific subsystem, and what could be handled autonomously versus what needed to be escalated. That is not a normal code assistant capability; it is organisational reasoning, which is one of the most overlooked areas of AI disruption. Most discussions about AI replacing jobs focus purely on the act of producing output, like writing code, creating content, or generating designs. However, large companies rarely fail because they cannot produce output; they fail because they cannot coordinate output at scale. A model that can coordinate work across multiple repositories and teams is no longer just an engineering tool, but an organisational primitive, in effect a “people” manager.
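The three judgments described here (who owns a repository, who has context on a subsystem, and when to escalate) can be pictured as a simple routing function. The sketch below is purely illustrative: it is a rule-based lookup, not how Opus 4.6 or Rakuten's deployment works, and the `Issue` fields, the severity rule, and all team and engineer names are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Issue:
    repo: str
    subsystem: str
    severity: str  # "low", "medium", or "high" (hypothetical scale)


def route(
    issue: Issue,
    repo_owners: Dict[str, str],
    subsystem_experts: Dict[Tuple[str, str], str],
) -> Tuple[str, Optional[str]]:
    """Return (action, assignee), where action is "assign" or "escalate"."""
    # Assumed policy: high-severity issues always go to a human owner.
    if issue.severity == "high":
        return ("escalate", repo_owners.get(issue.repo))
    # Prefer the engineer with context on the specific subsystem.
    expert = subsystem_experts.get((issue.repo, issue.subsystem))
    if expert is not None:
        return ("assign", expert)
    # Otherwise fall back to the team that owns the repository.
    team = repo_owners.get(issue.repo)
    if team is not None:
        return ("assign", team)
    # No known owner: a human has to decide.
    return ("escalate", None)


if __name__ == "__main__":
    owners = {"payments": "team-payments"}
    experts = {("payments", "ledger"): "alice"}
    print(route(Issue("payments", "ledger", "low"), owners, experts))
```

The interesting part of the Rakuten story is that the model reportedly inferred this kind of mapping from context rather than being handed it as explicit tables.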
There is another Anthropic demonstration that received less mainstream attention but highlighted a different kind of reasoning. Opus 4.6 was given tools like Python, debuggers, and fuzzers, and pointed at an open-source codebase with no specific vulnerability-hunting instructions. It reportedly discovered over 500 previously unknown, high-severity zero-day vulnerabilities. If true, this is not just impressive, but alarming, because traditional security automation relies mostly on pattern matching through static analysis, known vulnerability signatures, fuzzing frameworks, and heuristic scanning. When the model got stuck during this test, it decided on its own to inspect the Git history to understand how the code had evolved over time, using commit logs as reasoning evidence without any instruction to do so. That behaviour is much closer to how human security researchers think, asking what changed recently, what was rushed, where assumptions were patched rather than fixed, and what the evolution of the code reveals. This is not simple tool use; it is hypothesis-driven investigation, implying that reasoning paired with working memory can create a new kind of vulnerability discovery engine that is far less dependent on known exploit patterns.
Perhaps the most important revelation is how agent teams naturally converge into hierarchies without humans forcing them to do so. Anthropic introduced agent teams, internally referred to as team swarms, where multiple Claude Code instances run in parallel. Each instance possesses its own context window, and they coordinate through a task system featuring a lead agent, specialist agents, peer-to-peer communication, and shared task states like pending, in progress, and completed. This looks suspiciously like a human engineering organisation, and that is precisely the point. It is easy to assume hierarchy is purely cultural, believing humans invented management because they like power structures. However, a more functional explanation is that hierarchy is simply a coordination strategy that naturally emerges when intelligent systems try to solve complex problems with heavy dependencies. If tasks can be parallelised, you inherently need decomposition, ownership, status tracking, dependency management, conflict resolution, and integration responsibility. This is not corporate politics; it is just distributed systems theory applied to cognition, suggesting that agent hierarchy is structural rather than cultural. Once viewed this way, you realise that agentic AI is not just a model with tools, but a new organisational building block.
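The shared task states and dependency tracking described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not Anthropic's task system: the `TaskBoard` class, its methods, and the agent and task names are hypothetical, and peer-to-peer messaging is left out entirely.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


@dataclass
class Task:
    name: str
    depends_on: Set[str] = field(default_factory=set)
    status: Status = Status.PENDING
    owner: Optional[str] = None


class TaskBoard:
    """Shared task state that a lead agent and specialist agents could consult."""

    def __init__(self, tasks: List[Task]) -> None:
        self.tasks: Dict[str, Task] = {t.name: t for t in tasks}

    def ready(self) -> List[str]:
        """Names of pending tasks whose dependencies are all completed."""
        return [
            t.name
            for t in self.tasks.values()
            if t.status is Status.PENDING
            and all(self.tasks[d].status is Status.COMPLETED for d in t.depends_on)
        ]

    def claim(self, name: str, agent: str) -> None:
        """Mark a ready task as in progress and record which agent owns it."""
        if name not in self.ready():
            raise ValueError(f"{name} is not ready to be claimed")
        self.tasks[name].status = Status.IN_PROGRESS
        self.tasks[name].owner = agent

    def complete(self, name: str) -> None:
        """Mark an in-progress task as completed, unblocking its dependents."""
        if self.tasks[name].status is not Status.IN_PROGRESS:
            raise ValueError(f"{name} is not in progress")
        self.tasks[name].status = Status.COMPLETED


if __name__ == "__main__":
    board = TaskBoard([Task("parser"), Task("codegen", {"parser"})])
    board.claim("parser", "specialist-1")
    board.complete("parser")
    print(board.ready())
```

Even in this toy version, ownership, status tracking, and dependency management appear as soon as work is split between parallel workers, which is the structural point the paragraph makes.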
This leads to milestones that would have sounded absurd just a year ago, such as the C compiler story. In this instance, 16 Opus 4.6 agents coded autonomously for two weeks straight and delivered a functional C compiler consisting of over 100,000 lines of Rust, which was capable of building the Linux kernel and passing torture test suites. The most interesting part is not that it wrote a compiler, but that the autonomy lasted for two entire weeks. This marks a completely new scaling regime. In 2025, most autonomous coding sessions broke down after 30 minutes because the model would lose context, drift, hallucinate constraints, or get stuck in repetitive loops. The difference between 30 minutes and 2 weeks is not an incremental improvement; it is a total phase change where the model is no longer assisting you, but running a parallel engineering effort.
This shifts the nature of employment toward what Anthropic’s Scott White calls "vibe working." The idea is simple: you describe outcomes rather than steps. You do not tell the model which formula to use, but rather what the spreadsheet should show, and you do not describe how to build a dashboard, but instead describe the dashboard you want. This major shift aligns perfectly with modern enterprise modernisation and AI-first architecture work. The technical bottleneck is no longer typing code or writing queries; the bottleneck is intent clarity. People who can articulate requirements clearly, evaluate quality, and spot subtle risks will gain incredible leverage, while people who only know how to execute mechanical tasks will see their advantage evaporate. This is not because execution is unimportant, but because execution is becoming incredibly cheap.
The new competitive metric reflecting this shift is revenue per employee, as seen in the latest ratios of AI-native companies. Cursor reached $100 million ARR with around 20 people, Midjourney generated roughly $200 million with about 40 people, and Lovable reached approximately $200 million in just 8 months with around 15 people. Even if these numbers fluctuate, the underlying trend is undeniable. Small teams equipped with agentic AI can now deliver output that previously required massive engineering organisations, fundamentally changing the economics of startups, product development, and enterprise transformation. In my own work I have spent years helping organisations modernise brittle systems into scalable, event-driven architectures. Historically, the hardest part was never building the new services, but rather coordinating the change safely across multiple teams, repositories, and legacy dependencies. If AI agents can coordinate and execute in parallel, modernisation becomes less about staffing and more about governance, architectural decisions, and risk containment. The old question of how many engineers are needed is quickly being replaced by asking what the right ratio of agents per engineer should be.
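The revenue-per-employee ratios implied by those figures are easy to work out. The sketch below uses the article's approximate numbers as given, so the results are rough by construction; note also that the Cursor figure is ARR while the others are stated simply as revenue, and the sketch treats them alike for illustration.

```python
# (revenue in USD, headcount), taken from the approximate figures quoted above.
companies = {
    "Cursor": (100_000_000, 20),
    "Midjourney": (200_000_000, 40),
    "Lovable": (200_000_000, 15),
}


def revenue_per_employee(revenue: float, headcount: int) -> float:
    """Revenue divided by headcount."""
    return revenue / headcount


for name, (revenue, headcount) in companies.items():
    ratio = revenue_per_employee(revenue, headcount)
    print(f"{name}: ${ratio:,.0f} per employee")
```

On these inputs every company lands at several million dollars per employee.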
Stepping back, it becomes obvious that Opus 4.6 is not just a better model, but a preview of a future where AI systems maintain deep working memory, reason over time and history, coordinate in teams, create hierarchical structures naturally, and route work across real organisations. These are structural capabilities emerging right now in production deployments. The biggest misunderstanding about AI today is the belief that the future relies on one single, all-powerful model, when the real future looks more like many models (a trend called Compound AI) and many agents coordinated like a team. Once agent teams become the norm, the software industry will not just speed up; it will completely reorganise. Software development is not only about writing code; it is about coordination, and coordination is exactly what Opus 4.6 is beginning to automate.
A 1 million token context window is an impressive engineering feat, but the real story is that a model capable of reliably retrieving information, coordinating tasks, and sustaining autonomy for weeks is no longer a tool; it is an actor. When these actors can form teams, create hierarchies, and manage dependencies, you are no longer looking at simple AI assistance, but at the early shape of AI-native organisations.
That is what Opus 4.6 truly reveals, and it is why a 76% retrieval score matters so much more than a 5x context window. Memory without retrieval is just storage, but memory with retrieval is intelligence.
Can AI agents really manage human engineers?
Opus 4.6 has begun handling triage, task routing and escalation across real engineering teams, taking on coordination work that was previously a human management role.















