Application bundle is corrupted during installation

Hello,

I'm facing a weird issue where the application bundle is corrupted during installation. The corruption always looks the same - a certain sequence of bytes is erased (zeroed) in the file at multiple places, which obviously breaks the bundle signature.

It's a pkg built with productbuild, containing three component packages. Up until recently no issue occurred. Everything is correctly signed and notarized, and I don't do anything special in the preinstall or postinstall scripts. However, I run gktool scan in postinstall, and it reports "Scan completed, but failed because the software has been altered", which makes me believe that the bundle is corrupted during installation or decompression. I'm using --compression latest for the bundle's component package.

I don't have a reliable repro but I see it happen a lot to our customers. I was suspecting a deployment tool or "security software" but I've seen the issue for manually installed packages too. It seems to happen only on Sonoma, but that may just be because most users are on the latest macOS.

Is there any known issue with the installer which could lead to a bundle being corrupted this way? Has anyone faced anything similar?

Starting with the direct question first...

Is there any known issue with the installer which could lead to a bundle being corrupted this way? Has anyone faced anything similar?

No, I haven't heard of anything like this and what you're describing here is extremely weird:

I'm facing a weird issue where the application bundle is corrupted during installation. The corruption always looks the same - a certain sequence of bytes is erased (zeroed) in the file at multiple places, which obviously breaks the bundle signature.

By definition, the kind of corruption you're describing can only be caused by a few different cases:

  1. The bytes in the original data source were "wrong" from the start.

That's obviously worth verifying, but (presumably) you've already ruled that out.

  2. The installer wrote the data out wrong.

It's difficult to see how/why this would happen. The installer basically only has one job ("Go Forth and Write Files") and it's hard to see how/why it would ***** up that badly. More cynically, it's used so broadly that it's unlikely to have "small" mistakes.

  3. Something in the system corrupted the data.

This is very difficult to rule out, but it's hard to see what component would/could have done something like this. The system is generally pretty good about only opening for "write" when it actually needs to write, which means that much of the system couldn't make the kind of modification you're describing. I did look at "gktool" and all it actually does is use XPC to trigger exactly the same scan process you see in the broader system. It never actually opens any file (syspolicyd is what's actually doing all the work) and most of its implementation is actually about turning error codes from syspolicyd into error messages.

  4. "Something else is going on".

This is the generic "catch-all" which covers all the unknowns but, in a case like this, it almost always involves some kind of 3rd party component/configuration/issue. You pointed to one obvious possibility, but even that raises odd questions:

I was suspecting a deployment tool or "security software" but I've seen the issue for manually installed packages too.

Security software using EndpointSecurity (and its predecessor, "kauth") has a long history of causing all kinds of strange failures, as the power to "veto" all of the most critical system calls is obviously quite powerful and dangerous. However, I don't see how an ES client would/could do something like this:

a certain sequence of bytes is erased (zeroed) in the file at multiple places

The issue here is that, because of how the I/O and VM systems are integrated, "writes" aren't something an ES client can actually block or interact with. That is, an ES client can control access by blocking "open", but if they allow the file to be opened they can't later block individual "write" calls.

A few questions I'd be looking at:

  • What were the actual modifications that occurred and what files did they occur on? You said that specific bytes were modified but what was actually changed?

  • Related to that point, how do you "know what you know" and are you sure you understand what actually happened? Having a whole "file" become zero'd is VERY different than specific bytes changing inside an otherwise intact file, so it's important to make sure you're sure about what's actually happening.

  • What other information have you gathered? What makes an issue like this appear "random" is almost always one (or more) "differences", which are what actually create the failure. Finding those differences will at least let you reproduce the issue, assuming they don't directly get you to the failure.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you, Kevin. I'll try to answer some of the questions.

What were the actual modifications that occurred and what files did they occur on? You said that specific bytes were modified but what was actually changed?

The modification is always a replacement of a meaningful ASCII string in a Mach-O file (main executable or dylibs). It's often a single sequence, but sometimes it's two, where the second one is a subsequence of the first. It's interesting that not all occurrences of the string are replaced in the whole file. I'm trying to make some sense of it. It looks like the string is always replaced in both architectures in the Mach-O, so it makes me think that whatever is doing that is aware of the executable format.
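For anyone who wants to check the "both architectures" observation on their own files, here is a minimal Python sketch that maps a file offset to the slice of a universal binary it falls in. It assumes a classic 32-bit fat header (magic 0xCAFEBABE, big-endian, 20-byte arch entries) and only names the two CPU types relevant here. The function names `fat_slices` and `slice_for_offset` are made up for illustration; this is not a full Mach-O parser.

```python
import struct

FAT_MAGIC = 0xCAFEBABE  # 32-bit universal (fat) header, stored big-endian
CPU_NAMES = {0x01000007: "x86_64", 0x0100000C: "arm64"}


def fat_slices(data: bytes):
    """Return [(arch_name, offset, size)] for a fat binary, or [] for anything else."""
    if len(data) < 8:
        return []
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        return []  # thin Mach-O (or not a Mach-O at all)
    slices = []
    for i in range(count):
        # Each entry: cputype, cpusubtype, offset, size, align (all 32-bit).
        cputype, _sub, offset, size, _align = struct.unpack_from(">IIIII", data, 8 + 20 * i)
        slices.append((CPU_NAMES.get(cputype, hex(cputype)), offset, size))
    return slices


def slice_for_offset(slices, file_offset):
    """Return (arch_name, offset_within_slice) for a file offset, or None if outside all slices."""
    for name, offset, size in slices:
        if offset <= file_offset < offset + size:
            return name, file_offset - offset
    return None
```

If the zeroed runs land at the same slice-relative offset in both architectures, that would point at something format-aware; if they only share the same byte pattern, a plain pattern match over the whole file would explain it just as well.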

Related to that point, how do you "know what you know" and are you sure you understand what actually happened?

I get the zipped application bundles from customers and then compare the corrupted files to the bundles I extracted from the pkg they used for the install. Another clue is the install.log: in our postinstall we run gktool, so I see its output, and it already complains there that the bundle is altered.
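The comparison itself can be sketched roughly like this in Python. This is an illustration of the idea, not the exact tooling used; `differing_runs` is a hypothetical helper name, and the byte-by-byte loop is slow on large binaries but easy to follow.

```python
def differing_runs(original: bytes, installed: bytes):
    """Return [(offset, original_bytes, installed_bytes)] for each maximal run of differing bytes.

    Only the common prefix length is compared; a size mismatch should be checked separately.
    """
    runs = []
    start = None
    n = min(len(original), len(installed))
    for i in range(n):
        if original[i] != installed[i]:
            if start is None:
                start = i  # a new differing run begins here
        elif start is not None:
            runs.append((start, original[start:i], installed[start:i]))
            start = None
    if start is not None:  # run extends to the end of the compared range
        runs.append((start, original[start:n], installed[start:n]))
    return runs


def is_zeroed(run) -> bool:
    """True if the installed side of a run consists only of zero bytes."""
    _offset, _orig, inst = run
    return len(inst) > 0 and inst.count(0) == len(inst)
```

Reading both files with `open(path, "rb").read()` and printing each run's offset and original bytes is enough to see which strings were wiped and where.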

The component package which is getting corrupted has its PackageInfo set up like this:

This forces the installer to basically just replace the destination bundle instead of updating it. So if the target bundle were corrupted before installation, it would not have any effect (unless it has a newer version), as the installer would remove it and put the new bundle in its place.

That all leads me to conclude that the new bundle has to be corrupted either during installation or right after it, but before the postinstall script runs.

I'll be honest, this has me completely baffled. I have a few comments below, but here's what I'd like you to do:

  • File a code support incident using the button at the bottom of this page. In that request, make sure you mention that I asked you to and include the link to this forum post. I want to get my hands on the actual files involved, so let's move this off the forums.

  • I'd like to get my hands on "all" the data you've got and it may be too big to send as an attachment. If you can include a link I can download it all from, that would be helpful.

  • If you're in contact with any of your users who are able to reproduce the issue, then what would be most helpful is for them to reproduce the issue (install the app, wait a moment, launch the app and watch it fail), then collect a sysdiagnose. A few notes on that:

  1. It's helpful to reboot the machine before testing and to avoid running anything that isn't part of the test.

  2. It doesn't matter if the log is collected a few minutes after the issue and, in fact, it makes it easier to follow the log if you wait a few minutes before you trigger the sysdiagnose. The collection process itself generates a lot of logging, so doing it "immediately" can end up mixing that log noise into the "interesting" data.

  3. Don't reboot the machine before you collect the sysdiagnose. The reboot process ends up deleting a lot of log data, which often makes it basically useless.

On that last point, if a user has experienced the issue and HASN'T rebooted since it happened then the sysdiagnose might still be useful, even several days after the fact. Just make sure they can provide some indication of when the issue actually happened, as digging through a large log "blind" doesn't really work.

  • The sysdiagnose is going to be large (300+ MB), so you'll need to work out how the user gets it to you and then over to me.

Looking at the details:

The modification is always a replacement of a meaningful ASCII string in a Mach-O file (main executable or dylibs). It's often a single sequence, but sometimes it's two, where the second one is a subsequence of the first.

Is it always the same "5f45 4545" value being replaced?

It looks like the string is always replaced in both architectures in the Mach-O, so it makes me think that whatever is doing that is aware of the executable format.

The bizarre thing here is that it looks like those are all mangled C++ method names, which means those are probably just symbol names that were left in the executable. There's really NO reason anything would be modifying that data, as it doesn't have any impact on actual execution. Frankly, if you were specifically trying to corrupt an executable, this seems like a silly way to do so.

I'd like to understand how on earth this is happening but if you're looking for a solution "now", it's possible that making sure your executable was completely stripped of symbols before signing would bypass the entire problem. It's not exactly elegant, but if this is really tied to a particular byte sequence, then the problem won't happen without that byte sequence.

On this point:

It's interesting that not all occurrences of the string are replaced in the whole file.

How closely have you looked at the ones that aren't replaced? It's tricky to confirm this from an image but, as far as I can tell, all three of those changed this exact byte sequence:

45 4e53 5f39 616c 6c6f 6361 746f 7249 5337 5f45 4545 45

To this one:

45 4e53 5f39 616c 6c6f 6361 746f 7249 5337 0000 0000 45

I have no idea why that would have happened, but it would be interesting to confirm whether or not that pattern holds more broadly.
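To make the two hex dumps above easier to read, here is a small Python sketch that decodes them and reports which bytes differ. It only uses the two sequences quoted above; the variable names are for illustration.

```python
# The "before" and "after" byte sequences, exactly as quoted in the hex dumps.
before = bytes.fromhex("45 4e53 5f39 616c 6c6f 6361 746f 7249 5337 5f45 4545 45")
after = bytes.fromhex("45 4e53 5f39 616c 6c6f 6361 746f 7249 5337 0000 0000 45")

# Indices where the two sequences disagree.
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
wiped = before[changed[0]:changed[-1] + 1]

print(before.decode("ascii"))  # ENS_9allocatorIS7_EEEE
print(changed)                 # [17, 18, 19, 20]
print(wiped)                   # b'_EEE'
```

So the fragment is the tail of a mangled C++ name ending in "allocatorIS7_EEEE", and the four zeroed bytes are the "_EEE" right after "IS7", with the final "E" left intact.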

I get the zipped application bundles from customers and then compare the corrupted files to the bundles I extracted from the pkg they used for the install.

Just to clarify, was that the specific installer pkg they had, not just the same version? I want to make sure you've ruled out the possibility that the bad file was already inside the installer.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you Kevin, I'll file the incident in a few hours. I'll try to get some fresh sysdiagnose logs as well. We didn't collect those because most users are in managed environments so we need to arrange everything with their admins.

Is it always the same "5f45 4545" value being replaced?

No, it's the same value within each instance, but it differs from one instance to the next.

I have no idea why that would have happened, but it would be interesting to confirm whether or not that pattern holds more broadly.

Definitely not all instances are replaced. I can still find the same sequence elsewhere in the file. I'll send you the originals and the corrupted files so you can see for yourself. At the beginning it looked like all instances of the sequence were replaced, but once we got more of these corrupted files it became clear that's not the case - we were just lucky with the first ones.
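A rough Python sketch of how the surviving and wiped occurrences can be counted in one installed file. The helper names `find_all` and `classify` are hypothetical, and the caller has to supply which part of the pattern was zeroed (start index and length within the pattern), which is an assumption taken from the corrupted samples.

```python
def find_all(data: bytes, needle: bytes):
    """Return the offsets of every (possibly overlapping) occurrence of needle in data."""
    hits = []
    pos = data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits


def classify(installed: bytes, pattern: bytes, zero_start: int, zero_len: int):
    """Split occurrences of pattern into intact ones and ones with the given sub-range zeroed."""
    zeroed = pattern[:zero_start] + bytes(zero_len) + pattern[zero_start + zero_len:]
    return {
        "intact": find_all(installed, pattern),
        "zeroed": find_all(installed, zeroed),
    }
```

For example, `classify(data, b"ENS_9allocatorIS7_EEEE", 17, 4)` lists where the full string still exists and where its "_EEE" part has been replaced with zeros.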

Just to clarify, was that the specific installer pkg they had, not just the same version? I want to make sure you've ruled out the possibility that the bad file wasn't already inside the installer.

I think I've pretty much ruled this out. I was originally suspecting that admins were installing some modified packages, so I asked if I could get those, but they were either links to our CDN or just unmodified packages.
