macOS System Extension Compatibility Issues

We’re encountering issues with a system extension that subscribes to multiple events. Some users are experiencing performance problems when running our extension alongside other system extensions like Microsoft Defender and CrowdStrike, which seem to generate a high volume of events. However, on certain Macs with an identical setup, there are no performance issues, making it difficult to pinpoint the cause.

Has anyone found ways to improve compatibility with other system extensions? Currently, we’re ignoring and caching events from other extensions to avoid unnecessary processing.

The specific ES events contributing to the issue seem to be:

•	ES_EVENT_TYPE_AUTH_EXEC
•	ES_EVENT_TYPE_AUTH_OPEN

I realize this is a broad question, but the documentation for endpoint security extensions is quite limited. Any insights or suggestions would be greatly appreciated!

Answered by DTS Engineer in 811665022

Has anyone found ways to improve compatibility with other system extensions?

That is a really, really hard problem which doesn't really have any easy answer. The first thing to understand here is what the problem actually is. Architecturally, an ES client has two primary concerns to deal with:

  1. Process individual events quickly.
  2. Minimize activities that generate additional events.

In the basic case, the second concern is easy to overlook. It's easy enough for a client to mute its own helper processes ("direct workers") and the work being done by other processes like Apple's daemons is (typically) relatively small. As long as the client has a certain amount of parallelism, number two can often be relatively "hidden".

That breaks down once you hit this scenario:

Some users are experiencing performance problems when running our extension alongside other system extensions

At that point, what you basically have is a pipeline of ES clients which are both generating AND approving auth events for each other. Performance problems occur because some particular combination of specific events and external factors causes event processing to slow down, which then slows the entire machine down. That does raise an issue around this:

making it difficult to pinpoint the cause

In my experience, the specific "cause" isn't actually all that relevant to understanding or improving the situation. If a specific mitigation is possible ("ignore this file"), all you've really done is removed THAT particular factor/trigger, not resolved the underlying processing issues that led to visible failure. You need to be thinking in terms of how your engine processes data, not patching around specific glitches.

In terms of how you address these issues, mitigations like this do help:

Currently, we’re ignoring and caching events from other extensions to avoid unnecessary processing.

...however:

  • Many clients offload work to helper processes and reliably identifying those direct workers is a non-trivial task, assuming you've even realized this was an issue and made some attempt to solve it.

  • Indirect workers are still a factor without any easy fix. You won't be able to identify the source of the actual auth event (the daemon can't tell you "why" it's doing whatever it's doing) and you won't be able to mute the daemon (otherwise, you would have).

My biggest suggestion here is to be your own worst enemy. I have some more specific suggestions below, but most of what I've suggested would have been found by a testing regime that was ACTIVELY focused on breaking the client. A few examples of what that can look like:

  • For design and testing purposes, turn off "is_es_client" checks, client muting, and even results caching. All of those reduce the volume of events into your client and that's not helpful when your goal is to increase your client's ability to process events. A client that works well without those will only work better with them.

  • Modify our sample code to be "pathological". For example, have it specifically open every file it receives an ES_EVENT_TYPE_AUTH_OPEN for. For bonus points, have it open a few more as well, all while processing only a single event at a time.

  • Create duplicate "versions" of your ES client with different bundle IDs and then run all of them at the same time. This can replicate some of the chaos multiple ES clients create while also giving you a clear view of what all the ES clients are actually doing.

  • Focus on controlled tests that actually stress the system, not just real world testing. A single test app that is simply trying to generate open calls as quickly as possible is often FAR more useful than hours or even days of real world testing.

This approach shifts the focus from the details of exactly what happens under real world conditions to designing for conditions that are significantly worse than real world conditions.
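As a rough illustration of the "single test app that is simply trying to generate open calls as quickly as possible" idea, here is a minimal sketch in plain POSIX C. The function name hammer_open and its parameters are made up for this example and are not part of any Apple sample code. It assumes that each open() of the target path is the kind of call an installed ES client subscribed to ES_EVENT_TYPE_AUTH_OPEN would be asked to authorize; whether a given open actually reaches your client depends on muting and caching in the clients that are running.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: open and immediately close `path` `iterations` times.
 * Returns the number of opens that succeeded. If elapsed_out is non-NULL,
 * the elapsed wall-clock time in seconds is stored there. */
static long hammer_open(const char *path, long iterations, double *elapsed_out)
{
    assert(path != NULL && iterations >= 0);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    long ok = 0;
    for (long i = 0; i < iterations; i++) {
        /* Each open() is a separate system call; no data is read. */
        int fd = open(path, O_RDONLY);
        if (fd >= 0) {
            ok++;
            close(fd);
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    if (elapsed_out != NULL) {
        *elapsed_out = (double)(end.tv_sec - start.tv_sec)
                     + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    }
    return ok;
}
```

Calling hammer_open from a small main() with a large iteration count, and comparing the opens-per-second figure with your client disabled, enabled, and enabled alongside other ES clients, gives a repeatable number to design against. Running several copies in parallel, or pointing them at many different paths, should make the load closer to the "significantly worse than real world" conditions described above.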

Finally, I want to return to this point:

  1. Minimize activities that generate additional events.

Ultimately, all of these disruptions are caused by your ES client generating events that the other client(s) need to approve. The simplest solution to that issue is to ensure that your processing engine simply does NOT generate auth events. That isn't necessarily easy but, at a minimum, you need to be very aware of EXACTLY what's involved in processing critical events like ES_EVENT_TYPE_AUTH_OPEN.

Related to that point, the fact you're having issues with ES_EVENT_TYPE_AUTH_EXEC could be somewhat concerning, as processing ES_EVENT_TYPE_AUTH_EXEC is typically based on the event data itself, not any kind of external resource. What are you actually doing in this event?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
