Recent investigations at Target and Neiman Marcus both revealed that alarms concerning their respective breaches were sent to responders but never acted on. The uninitiated may scratch their heads over such a thing, but those who have spent time in network security operations know the pain of “alarm numbness.” In breaches like these, responders ignored the alarms from their detection products because they had little confidence in them. Several factors lead to this type of apparent failure.
There are some cases when law enforcement agencies reach out to the public and ask for “leads” regarding specific crimes. This is not done for every crime because the number of leads produced is very high and their quality is low. To process all the junk information that comes in, law enforcement has to increase staffing severalfold. It is a massive undertaking that is not sustainable for long periods of time.
Security operations centers (SOCs) run into the same challenge with the events being processed. Various systems send alerts that may or may not contain actionable intelligence. In many production networks that I visit, security alarm volumes can exceed 1 million events per day. Even the largest SOC can’t possibly investigate them all.
Why Volumes Are High and Quality Is Low
One of the reasons alarm volumes are so high is the prevalence of false positives. Detection mechanisms like antivirus, advanced malware protection and intrusion detection systems get it wrong. In many environments, the false positive rate can exceed 85%. The causes break down into 1) insufficient data available to make an accurate assessment, 2) flawed logic used in processing the data, or 3) broken components in data collection or analysis.
Insufficient Data Leads to Heuristics
To make accurate assessments, sufficient evidence is needed. In many detection systems, the information collected is very limited. When a detection engine doesn’t have enough information to make a conclusive call, it has two options. The first is not to alarm. This reduces false positives but opens the door to false negatives (failing to alarm when it should). Many vendors prefer false positives because in the wake of a breach they can make irresponsible claims of detection while ignoring the high percentage of alarms that only generated noise. As a result, they tend to take the second approach: heuristic detection.
A heuristic is essentially an “educated guess.” This approach lets vendors claim detection when certainty does not exist. In one of the recent breaches, the alarms triggered on the infections leading to the breach were given the generic categorization of “malware.binary,” a label for which the victim organization saw regular false positive noise.
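The trade-off between the two options can be sketched with a toy example. The scores, threshold values and labels below are entirely fabricated for illustration; the point is only that moving a heuristic suspicion threshold shifts errors between false positives and false negatives rather than eliminating them.

```python
# Hypothetical sketch: a score-based heuristic detector. Lowering the
# alarm threshold trades false negatives for false positives.

def fires(score, threshold):
    """Alarm when the suspicion score meets the threshold."""
    return score >= threshold

# (score, is_actually_malicious) pairs -- made-up illustration data
samples = [
    (0.95, True), (0.80, True), (0.55, True),    # real infections
    (0.70, False), (0.60, False), (0.20, False), # benign activity
]

def rates(threshold):
    fp = sum(1 for s, bad in samples if fires(s, threshold) and not bad)
    fn = sum(1 for s, bad in samples if not fires(s, threshold) and bad)
    return fp, fn

print(rates(0.5))  # low threshold: (2, 0) -- noisy but "detects" everything
print(rates(0.9))  # high threshold: (0, 2) -- quiet but misses infections
```

A vendor optimizing for “we alarmed on it” picks the low threshold and leaves the SOC to absorb the noise.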
The second cause of false positives is flawed logic in the engines processing the data. This normally occurs when detection criteria are created without regard for the benign conditions that will also trigger them. An example is a host-based behavioral signature looking for the email-worm behavior of reading the address book on a client. While that behavior can be associated with sending email worms or stealing contact information, it can also be associated with sync and backup software. The resulting alarms could number in the millions each day in an organization that regularly syncs contact data.
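The address-book signature above can be expressed as a rule to show why it floods the SOC. The process names and event format here are invented for illustration; the contrast is between a rule keyed only on the behavior and one that also accounts for the benign software known to exhibit it.

```python
# Hypothetical sketch of the address-book signature described above.
# Process names and the exclusion list are invented for illustration.

KNOWN_SYNC_SOFTWARE = {"contacts_sync.exe", "backup_agent.exe"}

def naive_rule(event):
    # Fires on ANY address-book read -- millions of alarms per day
    return event["action"] == "read_address_book"

def refined_rule(event):
    # Same behavior, but known sync/backup software is excluded
    return (event["action"] == "read_address_book"
            and event["process"] not in KNOWN_SYNC_SOFTWARE)

events = [
    {"process": "contacts_sync.exe", "action": "read_address_book"},
    {"process": "invoice.pdf.exe",   "action": "read_address_book"},
]

print([naive_rule(e) for e in events])    # [True, True]  -- noise
print([refined_rule(e) for e in events])  # [False, True] -- actionable
```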
In complex networks it is also possible for false positives to occur when components are misconfigured. Detection systems rely on endpoints, infrastructure, probes, cloud services and other components working correctly. When one or more of these components fails, the entire detection capability can be degraded. Depending on how these components are configured, these failure conditions will lead to false positives, false negatives or both.
Fixing the Problem
To address the problem of false positives and make alarms actionable, a few areas need attention.
The first approach is “tuning.” Tuning means taking the broad, generic configuration of a detection product and customizing it to a specific organization. This includes increasing the available data; whitelisting accepted traffic, applications and files; and disabling alarms known to yield false positives. This process is largely neglected in many deployments, as the work can take weeks or months. When management fails to invest in these operations, the products purchased become more of a problem than a help to the SOC. In addition to initial configuration, ongoing tuning must be part of sustainable operations.
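At its simplest, tuning is a filter applied to the raw alarm stream. The sketch below is a minimal illustration of the idea; the signature names, IP addresses and alarm format are hypothetical, not taken from any real product.

```python
# Minimal sketch of alarm tuning: suppress known-noisy signatures and
# whitelist trusted sources. All signature IDs and hosts are invented.

DISABLED_SIGS = {"malware.binary.generic"}   # pure-noise signatures
WHITELISTED_SRC = {"10.0.5.12"}              # e.g. a known backup server

def keep_alarm(alarm):
    if alarm["signature"] in DISABLED_SIGS:
        return False  # disabled during tuning: always false-positive here
    if alarm["src_ip"] in WHITELISTED_SRC:
        return False  # accepted traffic from a trusted host
    return True

raw_alarms = [
    {"signature": "malware.binary.generic", "src_ip": "10.0.9.4"},
    {"signature": "trojan.beacon",          "src_ip": "10.0.5.12"},
    {"signature": "trojan.beacon",          "src_ip": "10.0.9.4"},
]

actionable = [a for a in raw_alarms if keep_alarm(a)]
print(len(actionable))  # 1 -- only the third alarm survives tuning
```

In a real deployment this logic lives inside the detection product’s own suppression and whitelist configuration rather than in post-processing, but the effect on alarm volume is the same.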
When systems are properly tuned, automated blocking and response is possible. This not only prevents events from occurring but reduces the number of events that need to be investigated. Only statistical analysis needs to be performed on prevented events.
Once tuning is in place and alarms are being received, they need to be investigated. One reason this fails to happen is that processes have not been established for handling alarms. Just as elementary schools have plans for fires (and practice them), so must an information security organization institute incident response (IR) planning.
When organizations start building detailed incident response plans, they soon discover that they lack the tools necessary to execute those plans. The collection of audit trails from system logs, network metadata (NetFlow) and strategic packet capture (PCAP) must be set up before an event occurs. If reliable evidence collection is not in place before a breach, investigators cannot come to meaningful conclusions about an alarm or answer all the questions in the IR plan.
Collecting the data creates the possibility of effective incident response, but it is rendering the relevant data in a useful form that makes sustainable investigations practical. Incident response tools need to answer the questions contained in the incident response plan effectively, quickly and inexpensively. The reduction in Mean Time to Resolution (MTTR) produced by a tool directly impacts an organization’s ability to keep up with alarm volumes. No amount of tuning or enforcement will make actual events go away completely, which means MTTR must be optimized to reduce operational costs (payroll for investigators) and ensure all events can be thoroughly investigated.
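The operational cost of MTTR is simple arithmetic. The alarm volume and working hours below are illustrative assumptions, not figures from the breaches discussed, but they show how directly investigation time drives headcount.

```python
# Back-of-the-envelope capacity math (all numbers are illustrative):
# how MTTR drives the analyst headcount needed to investigate everything.

ALARMS_PER_DAY = 500        # actionable alarms remaining after tuning
ANALYST_HOURS_PER_DAY = 8

def analysts_needed(mttr_minutes):
    total_hours = ALARMS_PER_DAY * mttr_minutes / 60
    return total_hours / ANALYST_HOURS_PER_DAY

print(analysts_needed(30))  # 30-minute MTTR -> 31.25 analysts
print(analysts_needed(5))   # 5-minute MTTR  -> ~5.2 analysts
```

Cutting MTTR from 30 minutes to 5 shrinks the required team sixfold, which is why tooling that answers IR-plan questions quickly pays for itself.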
Some questions to consider when evaluating tools used in incident response include:
- Do you have a way, when an event fires, to get more context in order to determine whether or not that event is real and deserves further investigation?
- How expensive is it to obtain that context? Do you have to go out and look at the potentially infected computer, or do you have telemetry flowing back from that computer into a system that is accessible to the SOC that they can investigate?
- If you have telemetry, what kind? Is it system-level telemetry that can be manipulated post-breach, or is it network-level telemetry that is hard to manipulate?
- How close to the source are you collecting telemetry – are you capturing everything that infected host is doing or just its communications out to the Internet?
Where response cannot be automated because of complexity, analysts must investigate. Regardless of what a vendor says, a notice coming out of a detection system is not an alarm until it generates a traceable response ticket assigned to an individual. The reason ticket generation is so essential is that it provides a mechanism for auditing investigations (to ensure steps were not skipped). Ticketing can also be used to present investigation findings to management for sign-off. This level of accountability is commonly applied to routine IT problems such as forgotten passwords, yet it remains neglected for serious security breaches.
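The accountability properties described above can be sketched as a minimal data structure. The field names, the two-step minimum and the ticket IDs are hypothetical choices for illustration, not a real ticketing system’s API.

```python
# Hypothetical sketch: an alarm only "counts" once it becomes a ticket
# assigned to a named analyst, with an auditable trail of steps.

from dataclasses import dataclass, field

@dataclass
class Ticket:
    alarm_id: str
    assignee: str                  # accountability: a named individual
    steps: list = field(default_factory=list)
    signed_off: bool = False

    def log_step(self, note):
        self.steps.append(note)    # audit trail for later review

    def close(self, manager):
        # Sign-off fails if the investigation skipped required steps
        # (two steps is an arbitrary illustrative minimum)
        if len(self.steps) < 2:
            raise ValueError("investigation incomplete")
        self.signed_off = True
        self.steps.append(f"closed, approved by {manager}")

t = Ticket(alarm_id="ALM-1042", assignee="analyst1")
t.log_step("pulled NetFlow for the flagged host")
t.log_step("confirmed benign sync traffic; no infection")
t.close(manager="soc_lead")
print(t.signed_off)  # True
```

The key design point is that closure requires both a recorded trail and a named approver, which is exactly what makes investigations auditable after the fact.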
The high volumes of events coming into a SOC make responding to each one impossible. Vendors tend to build products to alarm on benign events so the product doesn’t fail to alarm on real ones. To remedy these challenges, organizations need to invest the time to fully configure detection systems and build detailed response processes.
- Special thanks to Lancope’s Tom Cross for his contribution to this article.