Reverse engineering – Supercell – part 6

Hello there!!!!! long time passed since the part 5.. lot of days.. months… but now, I’m back here writing what, until now, is the best part of all the journey – one of the, probably the most, hardest reverse engineering i did – even more fun then reverse engineering the full Arena Of Valor protocol and encryption.

In the past

write-ups about the Supercell cat mouse, the most hardest thing was figuring out a single byte changed inside the Sodium encryption, for instance, the rounds of salsa and hsalsa core.

During the summer

Supercell engineer, which I finally met during his trip in Italy – and was one of the greatest day of my life – among with his team, made huge changes to the Sodium crypto.
The most relevant, was in fact the substitution of the whole salsa core functions which is now replaced with a custom hand written function. To make the things even more harder, this function as been obfuscated with Arxan, turning static and dynamic analysis a nightmare – very close to the word “impossible”. So I’m glad to say that i’ve finally achieved another goal – reverse engineering a whole encryption, which obfuscated, turns in around 4/5 millions, MILLIONS, of cyclic ARM instructions.

First

let’s take a look on how thois function looks with Arxan obfuscation which highlight 3 different kind of obfuscation techniques: – Opaque predicate, Control flow flattening, stack based obfuscation.

A lot of dead instructions, but also, relevant instructions inside stack obfuscation as highlighted in this screenshot.

To recap – understanding the code flow by looking at the disasm is just impossible, mostly because it’s quite impossible to understand where the code jumps on something like this:

where the CMP values are stored and moved continually during the code execution. Trying to figure a way to the de-obfuscate the code from any of IDA/Binja would probably turns in way more effort then cleaning it up later during emulation with the correct flow of execution.

Here is another screen for relevant encryption code, together with obfuscation and dead code:

Let’s now move to the practical part

First of, as I did for the single byte change reverse, the approach I used was and will always be, to emulate the relevant ASM code among with the complete memory layout from the device. There is no need to say that an approach like that, won’t be stopped or prevented in any way unless we start speaking of sick methods to prevent the phase 1 (taking the memory dumps) which will turn the things way more complex but way more fun since there will be something new to break. In this case, which is a nightmare even with a similar approach the routine was:

Finding a way to kang the memory layout from the device through the cyclic arxan crc (let’s not forget that anything that is not hardware, inject instructions to the code in order to debug/instrument so GDB, lldb, frida and so on)
Understanding the WHOLE encryption logic. The screens i just shown are for the crypto core. There is nonce generation, keypair and shared key generation happenings before this, which imagine what? are obfuscated as well.
Once the memory layout of the device was taken, it’s time to setup the emulator – which is more easy to say then to do. There are lot of patches needed to let the emulation run correctly and imagine once again what? I always forget basic stuffs like i started it by “hey, let’s NOP stack check guards, all the libc plt etc etc” – until after 9/10 hours of emulations probably stucked in some Arxan loops I said myself: oh.. NOP instructions alter the code… F**K my head.
Once the emulation reach the point I wanted it to reach, well… it’s time to dig though 3 GB of logs, instruction by instruction. But ok… maybe I can figure out some way to clean this sh*t up? So, let’s code even this.
Once the 3 GB of logs became… 2 – I said myself once again… let’s just go line by line

Coff… coff…. and COFF…

Let’s take a deeper look at all the code now, which as usual is open-sourced for the 2 main purposes I always care: a) sharing knowledge b) find out someone that text me saying “hey noob. you could save days by using this approach instead”.

First repo: https://github.com/iGio90/COC_2k18_pt2
which contains:

The frida script to kang the memory layout from the device
The emulator
An additional frida script very similar to the main one which helps to determine where the code jump (just something random I coded)

I won’t go so far by explaining everything in the details – to me – it looks quite understandable, but..

let’s start with the frida script

function attachSVC()

Alright, i forget to mention that with the recent updates, Supercell introduced some minor major changes to prevent debugging/instrumentation. One of those is the replacement of most of the involved libc plt with SVC instructions (syscalls). The 2 we are targeting are sendto and recvfrom which for some reasons was kept outside Arxan cyclic crc.

The code that follow, obviously require a pre-understanding of the game logic, for instance – the game send an unencrypted packet – ClientHello – which is followed by the receive of ServerHello, which is followed by the encrypted login.

if (waitingHello) {
    waitingHello = false;
    Interceptor.attach(Module.findExportByName('libc.so', 'open'), function () {
        var what = Memory.readUtf8String(this.context.r0);
        if (what.indexOf('urandom') >= 0) {
            var readWhat;
            var readLen;
            var rInt = Interceptor.attach(Module.findExportByName('libc.so', 'read'), {
                onEnter: function () {
                    readWhat = this.context.r1;
                    readLen = parseInt(this.context.r2);
                },
                onLeave: function () {
                    if (readLen < 32) {
                        Memory.writeByteArray(readWhat, [0xde, 0xad, 0xbe, 0xef, 0x12, 0x34, 0x56, 0x78,
                            0xde, 0xad, 0xbe, 0xef, 0x12, 0x34, 0x56, 0x78, 0xde, 0xad, 0xbe,
                            0xef, 0x12, 0x34, 0x56, 0x78]);
                        nonce = Memory.readByteArray(readWhat, readLen);
                        console.log('nonce');
                        console.log(nonce);
                        rInt.detach();
                    } else if (readLen === 32) {
                        Memory.writeByteArray(readWhat, [0xde, 0xad, 0xbe, 0xef, 0x12, 0x34, 0x56, 0x78,
                            0xde, 0xad, 0xbe, 0xef, 0x12, 0x34, 0x56, 0x78, 0xde, 0xad, 0xbe,
                            0xef, 0x12, 0x34, 0x56, 0x78, 0xde, 0xad, 0xbe, 0xef, 0x12, 0x34, 0x56, 0x78]);
                        key = Memory.readByteArray(readWhat, parseInt(readLen));
                        console.log('key');
                        console.log(key);
                        Interceptor.detachAll();

Once the hello got received, I chose to start from a specific point i know the game is using since the introduction of sodium crypto – /dev/urandom.
urandom is used twice before the encryption. The first time is used to build the sNonce, which is the nonce incremented by 2 each request that will be used to encrypt everything that follows the login. The second time is used to build the private key of the keypair.
Another smart security check introduced in the latest version, which for some weird reason stopped a lot of people, was a cycle 4 bytes check on the private key – for instance, when I did the reversing of the single byte change, I hardcoded 32 bytes of 0, which now would turn the game to crash – very smart, but in the end why none of the folks just tried to set some random 32 bytes to check if the crash was caused by a check or by altering the memory? welp, no matter.
Once the key is hardcoded in order to obtain always the same encryption result (which differed only on the first 24 bytes – a random generated session key sent in ServerHello) it’s time to play my usual game with trampolines to skip Arxan crc. There is nothing much I can explain on this, it’s a matter of finding them through the obfuscated blocks in order to properly hook whatever we need without crashing the game. Here are the trampolines I used:

Interceptor.attach(base.add(0x154214 + 1), function () {
    Interceptor.detachAll();

    console.log('taking dumps');
    kang(this.context);

    Interceptor.attach(base.add(0x2C06D8 + 1), function () {
        Interceptor.detachAll();
        Interceptor.attach(base.add(0x00171E20 + 1), function () {
            Interceptor.detachAll();
            Interceptor.attach(base.add(0x17242A + 1), function () {
                Interceptor.detachAll();
                var i = 0;
                Interceptor.attach(base.add(0x1721E2 + 1), function () {
                    if (i === 2054) {
                        Interceptor.attach(base.add(0x1E130C + 1), function () {
                            var p = Process.findRangeByAddress(this.context.r0);
                            send(p['base'], Memory.readByteArray(p['base'], p['size']));
                        });

                        Interceptor.attach(base.add(0x478E7C), function () {
                            Interceptor.detachAll();
                            console.log('login');
                            console.log(Memory.readByteArray(this.context.r1, parseInt(this.context.r2)));
                        });
                    }
                    i++;
                });
            });
        });
    });
});

First of, the function “kang”. The code is read-able from the sources and what it does is basically checking registers and the whole stack (looping on 4 bytes – since we are dealing with arm 32) to find ALL the ranges being used by the code. I found out while coding the emulator that a missing range was actually used, which is dumped later here:

Interceptor.attach(base.add(0x1E130C + 1), function () {
    var p = Process.findRangeByAddress(this.context.r0);
    send(p['base'], Memory.readByteArray(p['base'], p['size']));
});

kang function will also print the registers values which are needed by the emulator. It will saves the ranges with the base address as name – this, because I’m a programmer, at first, and I like to automate whatever can be automated and simplified, in this case, the emulator will just read everything from the dumps directory and map it at the right address (the name of the file), because you can understand the amount of times i inject the frida script into the process to take the dumps – which had different base address on each run.
Another special highlight goes to the hook that needs to be hit 2054 times before attaching to anything else. How has this found out is a special secret – but nothing that goes far from dealing with this since more then 1 year and test, test, test, test – but let’s just say that attaching anytime before it got hit 2054 times would turn the game crash by Arxan. The last interceptor is once again on the SVC sendto instruction, to log the encrypted login in order to see the result.

The emulation

and the needed patches, mostly for memcpy, memmove, memclr, free and stack check guards are done without altering the code but instead by hardcoding PC value to let it jump and do the stuffs – I.E memcpy which is done through python. The emulator will log all the instructions executed, highlighting the correct flow of execution. For a better readability, the logs are written and splitted up into multiple log files. During the emulation, I’ve also found out that unicorn require an “additional trick” to run ARM Neon instructions which has been achieved with the following code:

VFP = "4ff4700001ee500fbff36f8f4ff08043e8ee103a"

# Enable VFP instr
mu.mem_map(0x1000, 1024)
mu.mem_write(0x1000, binascii.unhexlify(VFP))
mu.emu_start(0x1000 | 1, 0x1000 + len(VFP))
mu.mem_unmap(0x1000, 1024)

From the moment i ran the frida script, till the crypto implementation, i just closed up IDA and Binja as static analysis on this it’s just not useful at all. Unfortunately i kept no backlogs of the initial crypto i wrote, which was like over 3000 lines of py code written line by line translating relevant ASM code.

The crypto implementation

which is now living in the second repository:
https://github.com/iGio90/supercell_sodium_core
is way clean and could be cleaned even more.

All the functions i’ve created in my implementation came from my mind and fantasy.

def h_stract(a, b):
    return ((a - 1) - ((-1 - b) & 0xffffffff)) & 0xffffffff


def h_spack(a):
    return [
        a & 0xff,
        a >> 8 & 0xff,
        a >> 16 & 0xff,
        a >> 24 & 0xff
    ]


def h_mul_add(a, b, m, r):
    if r > 0:
        b = (((b + (b << 2)) & 0xfffffff) << 6) & 0xffffffff
    return (b * a + m) & 0xffffffff


def h_aaas(a, b, c):
    b = (((c + a) & 0xffffffff + b) & 0xffffffff) & 0x000000ff
    c = c >> 8
    return b, c


def h_aas(a, b):
    return (b + a) & 0xffffffff, b >> 8

But there is a lot of code that could be translated even better to attempt to reduce the time it takes:

def h_core_a(a):
    r = [0] * 19
    s = [0] * 31

    r1 = a[0] >> 16
    r3 = a[1]
    r0 = r1 | a[2]
    r6 = a[3]
    r1 = (r0 + r3) & 0xffffffff
    r2 = a[16]
    r3 = r1 ^ r6
    r2 = (r2 + __ROR4__(r3, 20)) & 0xffffffff
    s[0] = r2
    r0 ^= r2
    r2 = r0 >> 0x18
    r2 = (-1 - r2) & 0xffffffff
    r0 = bic(r2, r0)
    r2 = a[4]
    r0 = (-1 - r0) & 0xffffffff
    s[1] = r0
    r0 = (r0 + r1) & 0xffffffff
    s[2] = r0
    r0 = (r0 ^ __ROR4__(r3, 20)) & 0xffffffff
    r1 = a[5]
    r0 = __ROR4__(r0, 0x19)
    s[7] = r0
    r0 = a[6]
    r3 = (r0 + r1) & 0xffffffff
    r6 = r3 ^ r2
    r5 = (r6 << 0x10) & 0xffffffff

    r0 = a[7]
    r2 = (r5 | __ROR4__(r6, 16)) & 0xffffffff
    r1 = a[6]
    r0 = (r0 + r2) & 0xffffffff
    r1 = (r1 ^ r0) & 0xffffffff
    r3 = (r3 + __ROR4__(r1, 20)) & 0xffffffff
    s[3] = r3
    r2 = (r2 ^ r3) & 0xffffffff
    r0 = (r0 + __ROR4__(r2, 24)) & 0xffffffff
    s[4] = r0
    r0 = (r0 ^ __ROR4__(r1, 20)) & 0xffffffff
    r1 = a[8]
    r3 = __ROR4__(r2, 0x18)
    r0 = __ROR4__(r0, 0x19)
    s[9] = r3
    s[10] = r0
    r0 = a[9]
    r0 = (r0 + r1) & 0xffffffff
    r1 = a[10]
    s[5] = r0
    r0 = r0 ^ r1
    s[6] = r0
    r0 = (r0 << 0x10) & 0xffffffff

    r1 = s[6] >> 16
    r3 = a[11]
    r0 |= r1
    r6 = a[9]
    r1 = (r0 + r3) & 0xffffffff
    r2 = s[5]
    r3 = (r1 ^ r6) & 0xffffffff
    r2 = (r2 + __ROR4__(r3, 20)) & 0xffffffff
    s[8] = r2
    r0 = (r0 ^ r2) & 0xffffffff
    r2 = __ROR4__(r0, 0x18)
    r0 = (r1 + __ROR4__(r0, 24)) & 0xffffffff
    s[11] = r0
    r0 = (r0 ^ __ROR4__(r3, 20)) & 0xffffffff
    r1 = a[12]
    r0 = __ROR4__(r0, 0x19)
    s[12] = r2
    s[13] = r0
    r0 = a[13]
    r0 = (r0 + r1) & 0xffffffff
    r1 = a[14]
    s[14] = r0
    r0 = (r0 ^ r1) & 0xffffffff
    s[15] = r0
    r0 = (r0 << 0x10) & 0xffffffff

    r2 = r0
    r1 = s[15] >> 16
    r6 = a[15]
    r5 = a[13]
    r1 = (r1 | r2) & 0xffffffff
    r2 = (r1 + r6) & 0xffffffff
    r3 = (r2 ^ r5) & 0xffffffff
    r0 = s[14]
    r3 = __ROR4__(r3, 0x14)

    r0 = (r0 + r3) & 0xffffffff
    s[16] = r0
    r0 = (r0 ^ r1) & 0xffffffff
    r1 = (r2 + __ROR4__(r0, 24)) & 0xffffffff
    s[17] = r1
    r1 = (r1 ^ r3) & 0xffffffff
    r2 = s[10]
    r1 = __ROR4__(r1, 0x19)
    s[18] = r1
    r1 = s[0]
    r1 = (r1 + r2) & 0xffffffff
    s[19] = r1
    r0 = (r1 ^ __ROR4__(r0, 24)) & 0xffffffff
    r0 = __ROR4__(r0, 0x10)
    s[20] = r0

    r0 = s[11]
    r1 = s[20]
    r0 = (r0 + r1) & 0xffffffff
    s[21] = r0
    r0 = s[10]
    r1 = s[21]
    r0 = (r0 ^ r1) & 0xffffffff
    r1 = __ROR4__(r0, 0x14)
    s[22] = r1
    r1 = s[19]
    r0 = (r1 + __ROR4__(r0, 20)) & 0xffffffff
    r1 = s[20]
    s[23] = r0
    r0 = (r0 ^ r1) & 0xffffffff
    r0 = __ROR4__(r0, 0x18)
    r[0] = r0

    r1 = s[21]
    r0 = (r0 + r1) & 0xffffffff
    r1 = s[22]
    r[1] = r0
    r0 = (r0 ^ r1) & 0xffffffff
    r0 = __ROR4__(r0, 0x19)
    r[2] = r0

    r0 = s[13]
    r1 = s[3]
    r3 = s[1]
    r0 = (r0 + r1) & 0xffffffff
    r1 = s[17]
    r6 = (r3 + r0) & 0xffffffff
    r3 = r3 | r0
    r2 = s[13]
    r6 = -(r6 - (r3 << 1))
    r3 = (r1 + __ROR4__(r6, 16)) & 0xffffffff
    r5 = __ROR4__(r6, 0x10)
    r1 = (r3 ^ r2) & 0xffffffff
    r0 = (r0 + __ROR4__(r1, 20)) & 0xffffffff
    r[3] = r0
    r6 = __ROR4__(r1, 0x14)

    r0 = (r0 ^ r5) & 0xffffffff
    r1 = __ROR4__(r0, 0x18)
    r0 = (r3 + __ROR4__(r0, 24)) & 0xffffffff
    r[4] = r0
    r0 = (r0 ^ r6) & 0xffffffff
    s[24] = r1
    r0 = __ROR4__(r0, 0x19)
    r[5] = r0
    r0 = s[8]
    r1 = s[18]
    r2 = s[9]
    r0 = (r0 + r1) & 0xffffffff
    r1 = (r2 ^ r0) & 0xffffffff
    r2 = __ROR4__(r1, 0x10)
    s[25] = r2
    r2 = s[2]
    r1 = (r2 + __ROR4__(r1, 16)) & 0xffffffff
    s[26] = r1
    r1 = s[18]
    r2 = s[26]
    r1 = (r1 ^ r2) & 0xffffffff
    r2 = __ROR4__(r1, 0x14)
    r0 = (r0 + __ROR4__(r1, 20)) & 0xffffffff
    s[27] = r2
    r[6] = r0

    r1 = s[25]
    r0 = (r0 ^ r1) & 0xffffffff
    r1 = __ROR4__(r0, 0x18)
    r[7] = r1
    r1 = s[26]
    r0 = (r1 + __ROR4__(r0, 24)) & 0xffffffff
    r1 = s[27]
    r[8] = r0
    r0 = (r0 ^ r1) & 0xffffffff
    r0 = __ROR4__(r0, 0x19)
    r[9] = r0

    r0 = s[16]
    r1 = s[7]
    r2 = s[4]
    r0 = (r0 + r1) & 0xffffffff
    s[28] = r0
    r0 = s[12]
    r1 = s[28]
    r0 = (r0 ^ r1) & 0xffffffff
    r1 = __ROR4__(r0, 0x10)
    r0 = (r2 + __ROR4__(r0, 16)) & 0xffffffff
    s[29] = r0
    r0 = s[7]
    s[30] = r1
    r1 = s[29]
    r0 = (r0 ^ r1) & 0xffffffff

    r1 = r0
    r2 = s[28]
    r1 = r1 >> 0x14
    r0 = (r1 | r0 << 12) & 0xffffffff
    r1 = (r0 + r2) & 0xffffffff
    r2 = s[30]
    r[10] = r1

    r1 = (r1 ^ r2) & 0xffffffff
    r2 = r1 >> 0x18
    r2 = bic(r2, r1)
    r1 = (r2 + (r1 << 8)) & 0xffffffff
    r2 = s[29]
    r[11] = r1
    r1 = (r1 + r2) & 0xffffffff
    r0 = (r0 ^ r1) & 0xffffffff
    r[12] = r1
    r0 = __ROR4__(r0, 0x19)
    r[13] = r0

    r1 = s[23]
    r0 = (r0 + r1) & 0xffffffff
    r[14] = r0
    r1 = s[24]
    r0 = (r0 ^ r1) & 0xffffffff
    r[15] = r0
    r[16] = (r0 << 0x10) & 0xffffffff
    r[17] = s[23]
    r[18] = s[24]
    return r

The Supercell C implementation would probably be way more different then this but I’m very happy to say that it took 22 milliseconds to decrypt the largest message, which looks fine enough at me.
As is written in the notes of the crypto and the readme of the repo, the generation of the nonce which is used to encrypt and decrypt the login/loginOk messages among with the shared key generation among with the server public key, will be given only to people that I’m sure will make a good use of it. Supercell, but me at first, are a bit tired of people selling stuffs built on top of the works made for free and passion without even knowing whose to thank for them but also, this time, I want to really make sure that the stuffs that are going to be created are not meant to allocate damage to the gaming company itself.

See y’all the next chapter <3

Some “bonus” screenshots with my notes and logs: