Tip: BASE64 Encoded PowerShell Scripts are Recognizable by the Amount of Letter As

Published: 2019-06-03. Last Updated: 2019-06-03 22:05:17 UTC
by Didier Stevens (Version: 1)
2 comment(s)

We've often shown BASE64 encoded PowerShell scripts in our diary entries. And you might have noticed they contain lots of A characters (uppercase letter a).

Take, for example, the PowerShell script in one of our last diary entries. I've highlighted the As for you here:

It's a characteristic of BASE64 encoded PowerShell that helps with its identification.

But why is the prevalence of letter A high?

A PowerShell script passed as a command-line argument (option -EncodedCommand) has to be UNICODE text, encoded in BASE64, per PowerShell's help:

Property Unicode of System.Text.Encoding is little-endian UTF16. ASCII text (e.g. most PowerShell commands) requires only 7 bits to encode, but is encoded with 16 bits (2 bytes) in UTF16. These extra 9 bits are given value 0. Hence you have at least one byte (8 bits) that is composed of only 0 bits: byte 0.

Little-endian means that the least significant byte is stored first. Take letters ISC. In hexadecimal (ASCII), that's 49 53 43. In little-endian UTF16, we take 2 bytes instead of 1 byte to encode each character, hence it becomes: 49 00 53 00 43 00 (big-endian is 00 49 00 53 00 43).
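The ISC example can be reproduced with a short Python sketch. It only relies on Python's built-in codecs; the helper name hex_bytes is my own and is not part of any tool mentioned in this diary.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated, uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

text = "ISC"

ascii_hex = hex_bytes(text.encode("ascii"))        # 1 byte per character
utf16le_hex = hex_bytes(text.encode("utf-16-le"))  # low byte first
utf16be_hex = hex_bytes(text.encode("utf-16-be"))  # high byte first

print(ascii_hex)    # 49 53 43
print(utf16le_hex)  # 49 00 53 00 43 00
print(utf16be_hex)  # 00 49 00 53 00 43
```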

So, what I've shown with this example is that ASCII text encoded in UTF16 contains a lot of bytes with value 0.

In BASE64, the sequence of bytes to be encoded is split into groups of 6 bits. This means that a byte with value 0 (8 zero bits) will, 2 times out of 3, produce a 6-bit group of zeroes, depending on where that byte falls within its block of 3 bytes.

Let's illustrate this with a FF 00 FF 00 sequence:

11111111 00000000 11111111  00000000 11111111 00000000  11111111 00000000 11111111  00000000 11111111 00000000

111111 110000 000011 111111 000000 001111 111100 000000 111111 110000 000011 111111 000000 001111 111100 000000

The first line shows the bits grouped per 8 (i.e. a byte), and the second line shows the same bits grouped per 6 (i.e. a BASE64 unit). Of the 16 BASE64 units, there are 4 with value 000000 (that's 25%).
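The regrouping of the FF 00 sequence can be checked with a small Python sketch. The function six_bit_groups is a hypothetical helper written for this illustration; it assumes the input length is a multiple of 3 bytes, so no padding is needed.

```python
def six_bit_groups(data):
    """Split a byte sequence into 6-bit groups, returned as bit strings."""
    bits = "".join(f"{b:08b}" for b in data)
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]

# The 12-byte sequence FF 00 FF 00 ... used in the illustration above.
sample = bytes([0xFF, 0x00] * 6)

groups = six_bit_groups(sample)
zero_share = groups.count("000000") / len(groups)

print(" ".join(groups))
print(f"{groups.count('000000')} of {len(groups)} units are 000000")
```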

With true ASCII characters (most-significant bit is 0), there will be even more 000000 values (i.e. more than 25%).

Each possible BASE64 unit (there are 64 possibilities) is represented by a character: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/.

Unit 000000 is represented by character A, 000001 by character B, ...
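To see the mapping from units to characters in practice, here is a minimal Python sketch using the standard base64 module. The name ALPHABET is my own; it just rebuilds the 64-character list given above.

```python
import base64
import string

# The 64 BASE64 characters, in unit order: index 0 is A, index 1 is B, ...
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Three bytes with value 0 are four 000000 units, hence four As.
encoded_zeroes = base64.b64encode(b"\x00\x00\x00").decode("ascii")

# The letters ISC as little-endian UTF16 (49 00 53 00 43 00), then BASE64.
encoded_isc = base64.b64encode("ISC".encode("utf-16-le")).decode("ascii")

print(ALPHABET[0], ALPHABET[1])  # A B
print(encoded_zeroes)            # AAAA
print(encoded_isc)               # SQBTAEMA
```

In SQBTAEMA, 2 of the 8 characters are A, which matches the 25% minimum.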

Conclusion

Let's put all this together:

  1. ASCII text encoded as UTF16 contains many bytes with value 0 (50%)
  2. This sequence prepared for BASE64 contains many 000000 units (minimum 25%)
  3. And represented in BASE64, this sequence contains many A characters (minimum 25%)
  4. BASE64 encoded, command-line PowerShell scripts contain many A characters (minimum 25%)

In fact, the prevalence of character A in the example above is 41.417%.
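The prevalence of the letter A can be measured for any command with a few lines of Python. This is only a sketch: encode_command and a_ratio are hypothetical helper names, and the sample command is made up for the illustration (it is not the script from the diary entry, so it will not give the same percentage).

```python
import base64

def encode_command(script):
    """Encode text as little-endian UTF16, then BASE64 (as for -EncodedCommand)."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")

def a_ratio(b64_text):
    """Return the fraction of characters in b64_text that are the letter A."""
    if not b64_text:
        return 0.0
    return b64_text.count("A") / len(b64_text)

# Hypothetical sample command, chosen only for this illustration.
sample_script = "Write-Output 'Hello from the ISC'"

encoded = encode_command(sample_script)
ratio = a_ratio(encoded)

print(encoded)
print(f"{ratio:.1%} of the characters are A")
```

A high ratio is a hint, not proof: other BASE64 data, such as binary content with long runs of zero bytes, can also contain many As.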

 

Didier Stevens
Senior handler
Microsoft MVP
blog.DidierStevens.com DidierStevensLabs.com

Keywords: BASE64 PowerShell

Comments

Hi Didier,

Thanks for the amazing tip. Quick question: Is there a way to create IDS rules based upon this rule? Do you have a template for it?

Thanks,
Let me think about that.
