Quickly validate controversial antivirus decisions using machine learning

This is how Avast collects potential false detections and reacts quickly

To provide optimal security, Avast employs a sophisticated ecosystem of detection technologies. Whenever a file enters or executes on your computer, a series of checks is performed, usually within a fraction of a second. Most malicious files are caught by fast, lightweight detectors that run directly on your computer. Sometimes, however, these detectors cannot evaluate a file reliably (for example, when the file is new and unique). In these cases, the cloud backend comes into play: it can run deep and extensive tests, leveraging all the knowledge Avast has. Such testing requires far more computing power and data than a standalone computer with antivirus software installed could handle. The interaction between the client antivirus engine and its cloud backend ensures that files are detected with the greatest achievable accuracy.
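
To make the client–cloud interaction concrete, here is a minimal sketch of the tiered-detection idea in Python. The function names (`fast_local_scan`, `cloud_deep_scan`), the placeholder logic, and the confidence threshold are illustrative assumptions, not Avast's actual implementation.

```python
# Minimal sketch of tiered detection: a fast local check first, escalating
# to the cloud backend only when the local verdict is not confident enough.
# All names, logic, and thresholds are illustrative, not Avast's real code.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str        # "clean", "malicious", or "unknown"
    confidence: float  # 0.0 .. 1.0

def fast_local_scan(file_bytes: bytes) -> Verdict:
    """Cheap on-device detector (signatures, heuristics)."""
    # Placeholder logic: treat any non-empty file as weakly "clean".
    return Verdict("clean", 0.55) if file_bytes else Verdict("unknown", 0.0)

def cloud_deep_scan(file_bytes: bytes) -> Verdict:
    """Heavy backend analysis (sandboxing, global reputation, ML models)."""
    return Verdict("clean", 0.99)

def classify(file_bytes: bytes, escalate_below: float = 0.9) -> Verdict:
    local = fast_local_scan(file_bytes)
    # Escalate new or ambiguous files the local detectors cannot judge reliably.
    if local.label == "unknown" or local.confidence < escalate_below:
        return cloud_deep_scan(file_bytes)
    return local

print(classify(b"MZ..."))  # Verdict(label='clean', confidence=0.99)
```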

However, even at this level of accuracy, it occasionally happens that we evaluate a file as malicious and block it when the user thinks (or knows) it isn’t. There is a “grey area” of files that can sometimes do harmful things without being designed to do so – files that use typical malicious techniques for benign purposes, or new benign files that happen to be very similar to known malware. Not surprisingly, users may occasionally disagree when their antivirus software blocks files and renders their apps unusable. In these cases (however rare they may be), the AV may indeed be wrong, although the error may also be on the user’s side. Either way, we’ve given users the option to dispute the blocking action taken by our AV. The user can do this through the dialog shown in Figure 1.

Figure 1. After we block a file, users can send it to us for analysis.

User-submitted reports are collected into a cohort, and each contested verdict is re-evaluated using a dedicated automated analysis pipeline. This targeted file analysis can go deeper than standard automated antivirus systems, but such analysis can be slow and may require the intervention of a human analyst.

The pipeline itself consists of a large number of expert systems. These include our custom proprietary tools and publicly available platforms, such as sandboxes, which provide static and dynamic analysis of a given binary. As you can imagine, all these systems output not only binary classifications, but additional structured data as well. This can include values such as certificate information, file descriptors, or a simple taxonomy decomposition, as shown in Figure 2.

Figure 2. Example of structured output from one of our many systems. Other systems can provide additional information such as timestamps, certificates, behavioral patterns, etc.
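
To illustrate what such loosely structured output can look like in code, here is a hypothetical record in Python. The field names are modeled loosely on Figure 2 and are not a real Avast schema.

```python
# Illustrative example of the kind of loosely structured output a single
# expert system might emit for one binary. Field names are hypothetical.
report = {
    "system": "static_analyzer",
    "verdict": "suspicious",
    "certificate": {
        "issuer": "Example CA",
        "valid": True,
    },
    "file_descriptors": {
        "size": 182_272,
        "entry_point": "0x0040a1c0",
    },
    "taxonomy": ["PE", "packed", "network-capable"],
}

# Different systems return different, partially overlapping structures,
# which is why the final decision function has to handle nested,
# variable-schema data rather than a fixed feature vector.
```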

The final step in the pipeline is a complex decision function that aggregates the outputs of the various systems into a final verdict. In some tricky cases, the available information may not be conclusive, so these binaries cannot be resolved automatically. For these cases, there is a manual queue of binaries that need to be checked by one of our analysts.
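
A minimal sketch of this aggregation step is shown below. The scoring scheme, the thresholds, and the system names are illustrative assumptions, not the production decision function.

```python
# Sketch of the aggregation step: combine per-system scores into a final
# decision, deferring inconclusive cases to a manual queue.
from typing import Dict, Optional

def aggregate(verdicts: Dict[str, float],
              clean_below: float = 0.3,
              malicious_above: float = 0.7) -> Optional[str]:
    """verdicts maps system name -> maliciousness score in [0, 1].
    Returns "clean"/"malicious", or None when the case needs an analyst."""
    if not verdicts:
        return None
    score = sum(verdicts.values()) / len(verdicts)
    if score <= clean_below:
        return "clean"
    if score >= malicious_above:
        return "malicious"
    return None  # inconclusive -> manual queue

manual_queue = []
decision = aggregate({"sandbox": 0.2, "static": 0.1, "reputation": 0.6})
if decision is None:
    manual_queue.append("sample-123")
print(decision)  # "clean" here; borderline cases land in manual_queue
```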

Historically, the decision function started out as simple weighted majority voting. Since then, it has been improved to cover more and more nuances, such as specific malware families. As often happens, the number of systems in the pipeline has increased significantly since deployment, and with it the complexity of the final decision function. At some point, manually maintaining the rules inside the function became impractical. The pipeline has been running in production for a long time and has accumulated a reasonable amount of historical binaries and verdicts, so we could take the next step and learn the decision function from the data, which was not possible before.
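
For reference, the weighted-majority-vote baseline that the decision function grew out of can be sketched in a few lines; the systems and weights below are made up for illustration.

```python
# Minimal sketch of a weighted majority vote over per-system verdicts.
from typing import Dict

def weighted_majority(votes: Dict[str, str], weights: Dict[str, float]) -> str:
    """votes maps system name -> "clean" or "malicious"; missing weights default to 1.0"""
    tally = {"clean": 0.0, "malicious": 0.0}
    for system, vote in votes.items():
        tally[vote] += weights.get(system, 1.0)
    return max(tally, key=tally.get)

votes = {"sandbox": "malicious", "static": "clean", "reputation": "clean"}
weights = {"sandbox": 2.5, "static": 1.0, "reputation": 1.0}
print(weighted_majority(votes, weights))  # "malicious": sandbox outweighs the two clean votes
```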

As described in a previous blog post, Avast is developing a next-generation machine learning platform designed to automate data-processing pipelines. The platform is capable of modeling loosely structured data. More specifically, it can:

  • Infer the schema of the input data
  • Suggest projections of non-numeric features to scalars or vectors
  • Build a neural network model that matches the data structure
  • Perform training and inference

Traditionally, learning classifiers from deeply nested tree-like data (such as the data we have) requires a lot of manual feature engineering. Furthermore, the implementation of the feature extractor must be maintained over time. With our machine learning platform, this process is greatly simplified: schema inference and numerical encoding are built in, and feature engineering happens implicitly when the neural network is trained.
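
The toy sketch below conveys only the schema-inference idea on a single nested record; the real platform goes much further and builds a neural network that mirrors this structure, so everything here is illustrative rather than the platform's API.

```python
# Conceptual sketch of schema inference over loosely structured reports:
# walk a nested record and collect the value type observed at each path.
def infer_schema(value, path=""):
    """Return {json_path: type_name} for one nested record."""
    schema = {}
    if isinstance(value, dict):
        for key, child in value.items():
            schema.update(infer_schema(child, f"{path}.{key}" if path else key))
    elif isinstance(value, list):
        for child in value:
            schema.update(infer_schema(child, f"{path}[]"))
    else:
        schema[path] = type(value).__name__
    return schema

report = {
    "verdict": "suspicious",
    "certificate": {"issuer": "Example CA", "valid": True},
    "taxonomy": ["PE", "packed"],
}
print(infer_schema(report))
# {'verdict': 'str', 'certificate.issuer': 'str',
#  'certificate.valid': 'bool', 'taxonomy[]': 'str'}
```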

This all makes the platform ideal for our specific use case: training a meta-classifier on top of a rapidly evolving set of expert systems. The learned decision function has been running in production for some time. During this period, a major data migration occurred; with the platform’s support, we could worry much less about sudden changes in the structure of the input data.

At this point, we can safely say that deploying the new decision function significantly improved the overall performance of the pipeline. Our goal is to resolve as many potential false-positive reports as quickly and confidently as possible. Figure 3 shows the daily percentage of reports that we can resolve immediately without putting them into the manual queue. Note the change point on July 7, 2021 in the chart: deploying the learned decision function nearly doubled the share of reports processed automatically, from 38% to 74%.

Figure 3. This graph shows how the percentage of reported binaries resolved automatically evolves over time. Note the change point on July 7, 2021: it corresponds to the deployment of the learned decision function and a significant improvement in throughput. This gain comes from the machine learning model’s ability to learn more nuanced patterns from the data than handcrafted rules.

That’s not all; there are other advantages to using our machine learning platform. As we described in an earlier blog post on interpretability in machine learning, we developed a framework, mil.jl, which helps us better understand the model’s individual decisions. It does this by estimating the importance of the individual input features present in a sample using a game-theoretic approach. In an upcoming article, we’ll show how we leverage this framework to improve the UI of the decision support system that provides analysts with the outputs of the various systems mentioned earlier.
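
The sketch below illustrates the underlying game-theoretic idea, a Monte Carlo estimate of Shapley values over random feature orderings, on a toy linear model. It is an assumption-laden illustration of the concept, not the mil.jl implementation.

```python
# Estimate each feature's Shapley value by averaging its marginal
# contribution over random feature orderings. The toy "model" and the
# baseline scheme are assumptions made for illustration only.
import random
from typing import Dict

def model(features: Dict[str, float]) -> float:
    """Toy maliciousness score: a weighted sum of expert-system outputs."""
    weights = {"sandbox": 0.5, "static": 0.3, "reputation": 0.2}
    return sum(weights[k] * v for k, v in features.items())

def shapley(features: Dict[str, float], baseline: float = 0.0,
            n_samples: int = 2000, seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    names = list(features)
    contrib = {name: 0.0 for name in names}
    for _ in range(n_samples):
        rng.shuffle(names)
        present = {name: baseline for name in names}  # all features "absent"
        prev = model(present)
        for name in names:
            present[name] = features[name]            # reveal one feature
            cur = model(present)
            contrib[name] += cur - prev
            prev = cur
    return {name: total / n_samples for name, total in contrib.items()}

print(shapley({"sandbox": 0.9, "static": 0.1, "reputation": 0.8}))
# ≈ {'sandbox': 0.45, 'static': 0.03, 'reputation': 0.16}
```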

All in all, Avast collects potential false detections reported by customers, and these reports require a quick response. For this, we have a dedicated pipeline that processes each reported binary and tries to automatically determine whether it is clean. In some cases, analysts need to intervene to make the final ruling. By learning the decision function from the data, we are able not only to reproduce the behavior of the previous decision function, but also to capture additional patterns from the data. This is reflected in a 1.8x increase in the number of samples we are able to resolve automatically, which in turn significantly reduces the average reaction time, as samples do not have to wait in the manual queue for an analyst to take action.
