The Haystack and the Needles: Why Lossless Cybersecurity Analytics Require Deterministic Pipelines

Charles Herring 23 June 2026

We cut the haystack by more than 95% and lost not one needle. I have come to believe that is only possible one way, and that the probabilistic shortcut the whole industry is reaching for can never get there.

Almost 25 years ago I was standing in our brand-new SOC at the US Naval Postgraduate School in Monterey, trying to work out how to keep up with the alert volume pouring out of every tool we owned. I had the same realisation thousands of practitioners have had in the decades since: humans cannot process anything this voluminous. We have to distil it somehow.

The years since have produced a long line of attempts. We started by pivoting alerts to hosts so we could build "top hosts." Then we grouped hosts and users by their relationships (UEBA, which I was already picking apart on stage by 2019). Then we wired up elaborate (and brittle) SOAR playbooks to connect and enrich records, which I have complained about at length. Each step distilled a little. None of them solved the underlying problem.

Around the same time, SIEM platforms started tipping over under their own data volume and the hardware bill that came with processing and storing it. That triggered a mandate every one of us knew, instinctively and intellectually, was the wrong one: throw away the signal that is probably not useful. A whole category of pipeline tools grew up around it, promising to "intelligently triage" the firehose. (I wrote a post back in 2013 about the word "solution"; this is exactly the kind of thing I had in mind.) The triage was always probabilistic, a best guess, and it always produced a degraded analytic in the SIEM. The variations on the theme were endpoint-only, or user-only, or network-only. But anyone who has run a full incident-response cycle knows what throwing data away actually does: it forces responders, and the businesses behind them, to make critical decisions on a guess.

Those wrong turns gave us a decade of products that ran on either limited visibility or probabilistic algorithms. And now, with the whole world (rightly) awestruck by how accurate large language models can be, we have doubled down. We sample the evidence with heuristics and we point agentic AI at the gaps, asking it to guess what happened and what to do about it. I am not anti-AI; anyone who has read this blog over the past year knows I am building a cybersecurity platform with the stuff. (I have also been parsing overconfident vendor claims here since 2014, and I would rather not trade vendor magic-thinking for AI magic-thinking.) There is a difference between using a guess to start an investigation and using a guess as the verdict.

A little more than 10 years ago, my fellow WitFookins and I decided to get off the "probabilistic road of guessing better" and try something else. We ran more than 4,000 experiments to find out whether a lossless, deterministic cybersecurity analytics pipeline was even possible.

We scoped success carefully. We took the strongest SIEM and XDR tools we could put our hands on and fed them unsampled data from known attacks. That was our control. For our pipeline to count as "lossless," it had to reproduce 100% of those true positives with 100% of the supporting evidence an investigator would need to close the case. The second bar was efficiency: the processing required had to drop by at least 90%.
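Those two bars can be written down as a check. The following is a minimal sketch, assuming each true positive is represented as an incident identifier plus the set of evidence records needed to close it; the names `Finding`, `is_lossless`, `reduction`, and `meets_both_bars` are hypothetical and chosen only for illustration, not taken from the experiments themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One true positive plus the evidence records needed to close the case."""
    incident_id: str
    evidence_ids: frozenset


def is_lossless(control, candidate):
    """True if every control finding reappears with all of its evidence."""
    by_id = {f.incident_id: f for f in candidate}
    for expected in control:
        found = by_id.get(expected.incident_id)
        # A missing incident or a missing evidence record is a lost needle.
        if found is None or not expected.evidence_ids <= found.evidence_ids:
            return False
    return True


def reduction(control_cost, candidate_cost):
    """Fraction of processing removed relative to the unsampled control run."""
    return 1.0 - candidate_cost / control_cost


def meets_both_bars(control, candidate, control_cost, candidate_cost,
                    min_reduction=0.90):
    """Bar one: nothing lost. Bar two: processing down by at least 90%."""
    return (is_lossless(control, candidate)
            and reduction(control_cost, candidate_cost) >= min_reduction)
```

The point of the shape is that the two bars are independent: a pipeline that cuts processing by 99% but drops one evidence record fails, and so does one that keeps everything but saves nothing.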

So the question reduced to something almost childishly simple to state: how do you shrink a haystack by 90% without losing a single needle?

The first thing that became obvious is that detection and response fall apart the moment you discard evidence an investigation needs. As we pulled the data apart and laid it against the way humans actually work a case, graph theory kept presenting itself as a solid foundation for deciding when a signal could be safely dropped. A signal really does only one job: it adds to (increments) what we know about an object such as a user, a computer, a file, or a service (a node, in graph terms) and how those objects relate to one another (the edges).
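To make the "signal increments a graph" idea concrete, here is a minimal sketch, assuming a signal has already been reduced to the nodes it mentions (each with a few facts) and the edges between them. The `Graph` class and the dictionary layout of a signal are assumptions made for illustration only; the real data model is not described here.

```python
class Graph:
    """What is known so far: objects (nodes) and how they relate (edges)."""

    def __init__(self):
        self.nodes = {}     # node id -> set of (attribute, value) facts
        self.edges = set()  # (source id, relation, target id) tuples

    def apply(self, signal):
        """Merge a signal into the graph; return True only if it added anything."""
        changed = False
        for node_id, facts in signal["nodes"].items():
            if node_id not in self.nodes:
                self.nodes[node_id] = set()
                changed = True
            new_facts = set(facts) - self.nodes[node_id]
            if new_facts:
                self.nodes[node_id] |= new_facts
                changed = True
        for edge in signal["edges"]:
            if edge not in self.edges:
                self.edges.add(edge)
                changed = True
        return changed


login = {
    "nodes": {
        "user:alice": [("role", "admin")],
        "host:web01": [("os", "linux")],
    },
    "edges": [("user:alice", "logged_into", "host:web01")],
}

graph = Graph()
first = graph.apply(login)   # adds two nodes, two facts, and one edge
second = graph.apply(login)  # adds nothing: the graph already knows all of it
```

Under this model the second, identical signal contributes no node, fact, or edge, which is the sense in which it can be dropped without losing anything an investigator would need.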

That reframing led to the question we now ask very early in the pipeline: will this signal change the graph at all? We answer it with something we call ProtoGraph processing (a ProtoGraph being an early, lightweight graph). ProtoGraph looks at every object, action, and characteristic in a signal and keeps state on whether that exact combination has already gone downstream to the SIEM or XDR.
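The stateful check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the ProtoGraph implementation: it assumes a signal has already been parsed into `objects`, an `action`, and a `characteristics` mapping, and the names `protograph_key` and `SeenFilter` are hypothetical. How the real system treats timestamps, counts, and state expiry is not covered here, so the sketch simply leaves them out of the key.

```python
def protograph_key(signal):
    """Order-independent key for a signal's objects, action, and characteristics."""
    return (
        frozenset(signal["objects"]),
        signal["action"],
        frozenset(signal["characteristics"].items()),
    )


class SeenFilter:
    """Remembers which exact combinations have already gone downstream."""

    def __init__(self):
        self._seen = set()  # assumption: unbounded in-memory state, for brevity
        self.received = 0
        self.forwarded = 0

    def should_forward(self, signal):
        """True the first time a combination is seen, False for every repeat."""
        self.received += 1
        key = protograph_key(signal)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.forwarded += 1
        return True

    def reduction(self):
        """Fraction of received signals that were not sent downstream."""
        if self.received == 0:
            return 0.0
        return 1.0 - self.forwarded / self.received


beacon = {
    "objects": ["host:web01", "ip:203.0.113.7"],
    "action": "connection_blocked",
    "characteristics": {"port": 443, "protocol": "tcp"},
}

seen = SeenFilter()
decisions = [seen.should_forward(beacon) for _ in range(20)]
```

In this toy run a product that emits the same blocked connection twenty times sends one record downstream, and the filter decides from state it holds rather than from a guess about which records matter.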

The deduplication results were startling. Because so many security products are non-stateful in what they emit, they repeat the same ProtoGraph set thousands of times an hour, and not one of those repeats improves the organisation's graph. The real numbers depend entirely on the environment, but our first round of tests with enterprise and MSSP partners came in north of 95% reduction, with no needles thrown away. (The figure WitFoo publishes today, measured across deployments, is a 90 to 98% range.)

None of this works unless the data feeding ProtoGraph has been analysed deterministically (at least, that is the only way we have ever made it work). We use an approach we call Adaptive Parsing to fingerprint every signal, with non-overlapping fingerprints, and map those fingerprints to semantic frames that tell us how to parse the message, why it fired, where it came from, and what business impact it might be pointing at. The pipeline does not guess what a message is. It can prove what it is. That deterministic parsing is what makes ProtoGraph deduplication (and the other deterministic techniques layered on top of it, like graph theory and Temporal Link Analysis) possible in the first place. If you want the longer version, the Empathetic Processing whitepaper lays out the whole methodology.
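One way to picture non-overlapping fingerprints is a parser that refuses to classify a message unless exactly one fingerprint matches it. The sketch below is an assumption-laden illustration, not Adaptive Parsing itself: it uses regular expressions as fingerprints, and the `Frame` fields, the two sample patterns, and the exception names are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """A semantic frame: what the message is, where it came from, why it fired."""
    name: str
    source: str
    reason: str
    fingerprint: re.Pattern


class UnknownSignal(Exception):
    """No fingerprint matched; the parser will not guess."""


class AmbiguousSignal(Exception):
    """More than one fingerprint matched; the fingerprints overlap."""


# Illustrative patterns only, not a real rule set.
FRAMES = [
    Frame("ssh_login_failed", "sshd", "authentication failure",
          re.compile(r"^sshd\[\d+\]: Failed password for (?P<user>\S+) from (?P<src>\S+)")),
    Frame("ssh_login_accepted", "sshd", "successful authentication",
          re.compile(r"^sshd\[\d+\]: Accepted password for (?P<user>\S+) from (?P<src>\S+)")),
]


def parse(message, frames=FRAMES):
    """Return the single matching frame and its fields, or raise."""
    matches = []
    for frame in frames:
        match = frame.fingerprint.match(message)
        if match is not None:
            matches.append((frame, match))
    if not matches:
        raise UnknownSignal(message)
    if len(matches) > 1:
        raise AmbiguousSignal([frame.name for frame, _ in matches])
    frame, match = matches[0]
    return {
        "frame": frame.name,
        "source": frame.source,
        "reason": frame.reason,
        "fields": match.groupdict(),
        "matched_by": frame.fingerprint.pattern,  # the evidence for the claim
    }


record = parse("sshd[812]: Failed password for root from 10.0.0.5 port 22 ssh2")
```

The design choice that matters is the two exceptions: an unmatched or doubly matched message is surfaced as a parsing gap rather than assigned to the most likely frame, and every successful parse carries the fingerprint that justified it.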

In 2024, partners and customers started asking whether we could lift just the signal processing out of what was then our all-in-one platform, Precinct, so they could run it in front of their existing SIEM, XDR, or data lake. In October 2025 we released WitFoo Conductor, which does exactly that. It understands every signal it sees by reference to evidence it can produce, and it can show its working (the kind of thing that holds up in a boardroom or a courtroom, which is the entire point of my Birthing Perjury-free AI talks). A probabilistic heuristic can make a compelling, often correct guess, but it cannot back the claim up. That is precisely why it fails what I have started calling the needle test.

I have made this argument on the blog before, most directly in Which Detective Would You Hire?, where the choice is between the detective who pattern-matches a fast, confident guess and the one who can lay every claim on the table with a label on it. This post is the engineering underneath that choice: the haystack reduction, the graph, the parsing. It sits alongside the more theoretical Empathetic Processing and Temporal Link Analysis and the hands-on Three Prompts That Turn Your Data Lake Into an Empathetic Processor. (And away from security entirely, The Weight Guesser and the Scale makes the same point: prediction and measurement are simply two different jobs.)

Wrap Up

There is only one path I have found that delivers lossless reduction, and it is a deterministic, defensible pipeline. I have said it before and I will keep saying it: it should never be acceptable for a SOC to make decisions that rest on either discarded data or a probabilistic determination. None of this means AI has no place. Its place is generating the hypotheses that warm up an investigation, not signing the verdict at the end of it. Cybersecurity is not really a technology problem. It is an evidence problem. And if we want it taken seriously as a business, we have to make it defensible, not just believable.
