Introduction to Dynamic and Static Analysis

Learn fundamental malware analysis techniques including static and dynamic analysis of binary executables, and reverse engineering.

Lab Overview

In this lab on Reverse Engineering and Malware Analysis, you will delve into the world of malicious code analysis and gain a deeper understanding of how to dissect compiled binary programs. By learning the techniques and tools for static and dynamic analysis, you will equip yourself with the skills necessary to identify and analyze malware, investigate security incidents, and develop countermeasures against these threats.

Throughout this lab, you will learn about the two fundamental approaches to malware analysis: static and dynamic analysis. In static analysis, you will explore the contents of binary executable files, deciphering machine code instructions, and extracting useful information such as strings and metadata. You will use tools like hexdump, readelf, and strings to dissect the structure of executable files and understand their behavior without executing them. In dynamic analysis, you will run malware in a controlled environment, monitoring system calls and dynamic library functions using tools like strace and ltrace. Additionally, you will participate in practical exercises by solving reverse engineering Capture The Flag (CTF) challenges that apply the concepts learned in the lab. By the end of this lab, you will have a strong foundation in malware analysis techniques, preparing you for further exploration of lower-level concepts such as C and assembly code, which are essential in the world of cybersecurity.

Authors: Z. Cliffe Schreuders, and Tom Shaw

License: CC BY-SA 4.0

CyBOK Knowledge Areas: MAT: Malware Taxonomy (dimensions; kinds); MAT: Malware Analysis (analysis techniques; analysis environments)

Tags: malware-analysis reverse-engineering static-analysis dynamic-analysis ctf binary-analysis

Malware analysis

Malware analysis is the study of malicious code. Motivations for conducting malware analysis include: investigating an incident to assess damage and determine what information was accessed; identifying the source of the compromise, and whether it is a targeted attack or just malware that has found its way onto our network; and recovering the system(s) after an attack. Malware analysis is also essential when developing antivirus and/or IDS/IPS signatures to prevent infection of other systems.

There are a number of analysis techniques that can be used:

Static analysis: analysing the contents of the file(s) without running the program. For example, comparing hashes, using antimalware scans, looking at the ASCII contents, executable metadata and dropper detection, and inspection of the machine instructions / source code.
Dynamic analysis: running the malware and potentially infecting a (virtualised) system to see what it does. This can involve manually stepping the malware through each instruction (debugging), or letting it run while tracking which files and registry entries change, along with the network connections and traffic that are involved.
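
Comparing hashes, one of the static techniques listed above, can be sketched in a few lines of Python. This is only an illustration and not part of the lab tooling: the function names and the choice of SHA-256 are assumptions made for the example, and any set of known hashes would have to come from your own sources.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Return the SHA-256 hex digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks, so a large sample is never fully loaded in memory."""
    with open(path, "rb") as handle:
        return sha256_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def is_known_sample(file_hash, known_hashes):
    """Check a hex digest against a collection of known digests (case-insensitive)."""
    return file_hash.lower() in {known.lower() for known in known_hashes}
```

For example, `sha256_of_file("/bin/ls")` returns a digest that could be compared against a list of hashes of known samples. Note that hashing only reads the file; it never executes it.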

A safe analysis environment

Warning: When doing any analysis of malware it is important to ensure you are working in a controlled environment, and when doing dynamic analysis that you have some kind of system you are willing to infect, for example a virtual machine and a dedicated host that has any available security updates applied.

Keep in mind that malware often “phones home” to the original attacker: connecting back to either a server controlled by the person that deployed or created the malware, or a botnet (which could be either centralised or distributed).

Preventing network connections using an isolated network is often a good idea because it:

Prevents the infected system that is being analysed from receiving instructions
Prevents attackers from learning your IP address (which may result in retaliation, and further attacks)

However, sometimes you do want to analyse the complete behaviour of the malware, and often malware downloads payloads from remote servers, which would be prevented if isolated.

Warning: Also, keep in mind that an analysis VM may not provide enough protection, as the malware may attempt to compromise the host OS.

The nature of compiled binary programs

Most malware (and, more generally, most systems software) is distributed in the form of a compiled program. Software is typically written in a language such as C, which (some) humans can understand, and compiled (and linked against code libraries) into an executable file that includes machine code that a CPU / operating system can execute. The original source code is usually not available to the analyst, and yet there are analysis approaches we can apply to figure out the behaviour and details of the code.

You are going to apply some fundamental analysis techniques to the standard ls program, and then apply those same techniques to a security reverse engineering challenge. While some of these tools are Linux-specific, most of the same concepts can be applied to analysing Windows PE files.

Static malware analysis

Viewing the file contents (hex and ASCII)

The most direct way of exploring the contents of an executable file is by viewing the exact data that is stored to disk. An executable file is stored in an OS-specific file format (more on this in the next section), and is typically a binary file, meaning it contains data (such as machine instructions, images, or sound) that is not meant to be interpreted directly as text to be read by humans.

If we try to read a binary executable file in a standard text reader, much of what is displayed cannot be represented in a human-readable way.

==action: Print the contents of the ls program to the screen, run:==

cat /bin/ls

Scroll up, and you can see that there is some human readable text, but mostly the values can’t even be clearly represented in a readable way. That’s because standard tools such as cat (and most text editors) assume a human-readable format such as ASCII (American Standard Code for Information Interchange), which represents each printable character as a value.

ASCII table showing character codes and their hexadecimal representations https://commons.wikimedia.org/wiki/File:Ascii-proper-color.svg

However, executable binary files are actually made up of machine code instructions, which include many values that are not in these standard printable ranges, and interpreting these instructions as ASCII (as with the cat output) for the most part does not provide meaningful information.

Hex is the standard format for viewing binary data, since binary representation, zeros and ones, is unmanageable for human interpretation.

“Unlike the common way of representing numbers with ten symbols, it uses sixteen distinct symbols, most often the symbols “0”–“9” to represent values zero to nine, and “A”–“F” (or alternatively “a”–“f”) to represent values ten to fifteen. Hexadecimal numerals are widely used by computer system designers and programmers, as they provide a human-friendly representation of binary-coded values. Each hexadecimal digit represents four binary digits, also known as a nibble (or nybble), which is half a byte. For example, a single byte can have values ranging from 00000000 to 11111111 in binary form, which can be conveniently represented as 00 to FF in hexadecimal.” https://en.wikipedia.org/wiki/Hexadecimal
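
The relationship between a byte, its two nibbles, and its two hex digits can be checked with a short Python sketch. The function name `describe_byte` is made up for this illustration.

```python
def describe_byte(value):
    """Return the binary text, hex text, and nibble pair for one byte (0-255)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("a byte must be in the range 0-255")
    binary = format(value, "08b")  # eight binary digits
    hexa = format(value, "02x")    # two hex digits, one per nibble
    return binary, hexa, (binary[:4], binary[4:])


for value in (0x00, 0x45, 0x7F, 0xFF):
    binary, hexa, (high, low) = describe_byte(value)
    # Each group of four binary digits corresponds to exactly one hex digit.
    print(f"decimal {value:3d}  binary {high} {low}  hex {hexa}")
```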

==action: View the hex of the program:==

hexdump /bin/ls

Output of hexdump showing offsets and file contents in hexadecimal format

Note that the output includes the offset from the start of the file (shown in hexadecimal in the left-hand column), followed by the file contents displayed in hexadecimal format.

It is also helpful to have an ASCII representation column, in case the data does include text information that a human could understand.

==action: View the hex data, along with ASCII:==

hexdump -C /bin/ls
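
To see how the three columns of this view relate to the underlying bytes, here is a minimal Python sketch that produces a similar layout. It is not how hexdump itself is implemented, and the exact spacing differs slightly from the real tool; the function name `hexdump_lines` is an assumption for the example.

```python
def hexdump_lines(data, width=16):
    """Yield lines laid out roughly like `hexdump -C`: offset, hex bytes, ASCII."""
    pad = width * 3 - 1  # width of the hex column when a row is full
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Printable ASCII is 0x20-0x7e; every other byte is shown as a dot.
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        yield f"{offset:08x}  {hex_part:<{pad}}  |{ascii_part}|"


# The first four bytes of any ELF file, followed by some unprintable bytes.
for line in hexdump_lines(b"\x7fELF\x02\x01\x01\x00"):
    print(line)
```

Passing the bytes read from a file opened in binary mode (for example the first 64 bytes of /bin/ls) gives output you can compare against the real tool.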

Xxd is a similar hex viewing program, which can also show us the actual binary 0s and 1s of the program, as it’s stored physically on disk:

xxd -b /bin/ls

However, an actual literal binary view such as this is so unhelpful, it’s very rarely, if ever, used in practice.

On the other hand, the parts of a program that hold string data (bytes that fall in the printable ASCII range) often include helpful information about the program, such as IP addresses, so extracting strings can be an effective and easy technique.

==action: Extract the ASCII text from a binary file using the strings command:==

strings -a /bin/ls

While this can often yield useful information, it often doesn’t tell you much about the technical details of the executable or how the program behaves.
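
The idea behind the strings command can be approximated in Python: scan the raw bytes for runs of printable ASCII characters of at least a minimum length (four by default in the real tool). This is a rough sketch of the technique, not a reimplementation; the function name `extract_strings` is hypothetical.

```python
import re


def extract_strings(data, min_length=4):
    """Return runs of at least `min_length` printable ASCII bytes found in `data`."""
    # 0x20-0x7e covers space through tilde, the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# A made-up byte sequence mixing unprintable bytes with embedded text.
sample = b"\x00\x01GNU\x00/lib64/ld-linux\x00ab\x00192.168.0.1\xff"
print(extract_strings(sample))
```

Here the short runs "GNU" and "ab" are dropped because they are below the minimum length, which is how most random byte noise gets filtered out.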

Executable metadata

Each executable is stored in an executable file format that is readable by the operating system. On Windows, executable files are stored in the Portable Executable (PE, also known as PE32) format, or, for 64-bit executables, in the PE32+ format. On Linux and most other Unix systems, the Executable and Linkable Format (ELF) is used. macOS (formerly Mac OS X) uses the Mach-O format.

==action: Use the file program to identify the type of a file, regardless of its extension:==

file /bin/ls

Question: What format is this executable?
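
Tools such as file work largely by recognising "magic numbers": fixed byte sequences at the start of a file. The Python sketch below checks only a handful of well-known signatures for the formats mentioned above; the real file program consults a far larger database, and the names `MAGIC_NUMBERS` and `identify_format` are assumptions for this example.

```python
# A small, deliberately incomplete table of leading byte signatures.
MAGIC_NUMBERS = [
    (b"\x7fELF", "ELF"),
    (b"MZ", "PE (starts with an MZ DOS header)"),
    (b"\xcf\xfa\xed\xfe", "Mach-O 64-bit (little endian)"),
    (b"\xce\xfa\xed\xfe", "Mach-O 32-bit (little endian)"),
]


def identify_format(data):
    """Guess an executable format from the first bytes of a file."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return "unknown"


print(identify_format(b"\x7fELF\x02\x01\x01\x00"))
```

Because only the content is inspected, renaming a file or changing its extension does not change the result.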

Executable files can contain metadata, including information such as the date the program was compiled, version information, linking information (to libraries and shared code), the machine instructions themselves, variables, debug symbols, icons, and so on.

Note: This information is helpful for malware analysis, but be aware that it can be intentionally misleading.

An ELF file starts with an ELF header (describing things such as whether the architecture is 32-bit or 64-bit), followed by a program header table (with execution information), then the sections of the program itself, including:

.text: the machine code instructions
.data: initialised global and static variables
.bss: uninitialised global and static variables (including arrays)
.rodata: constant data, such as strings

ELF file structure showing different sections and their purposes https://en.wikipedia.org/wiki/File:Elf-layout--en.svg

Have a look at this visualisation:

Detailed ELF file format diagram showing the complete structure


==action: Manually look at the contents of the ELF header:==

hexdump -n 64 -C /bin/ls

Note: -n specifies how many bytes to read; the ELF header is 64 bytes long in a 64-bit executable (52 bytes in a 32-bit one).

All ELF files start with the hex “7f 45 4c 46” (the byte 0x7f followed by the ASCII characters “ELF”), and you could manually look up what each of the following bytes in the header represents. For reference, see the ELF specification.
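
As an illustration of what "looking up each byte" involves, the following Python sketch decodes a few fields from the start of an ELF header using the offsets defined in the ELF specification. It covers only a small subset of fields and machine types, and the function and table names are assumptions for this example rather than anything provided by the lab.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "ELF32", 2: "ELF64"}               # byte 4: EI_CLASS
ELF_DATA = {1: "little endian", 2: "big endian"}     # byte 5: EI_DATA
ELF_TYPES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}
# Only a few common e_machine values are listed here.
ELF_MACHINES = {3: "Intel 80386", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def parse_elf_header(header):
    """Decode a handful of fields from the first 32 or more bytes of an ELF file."""
    if len(header) < 32:
        raise ValueError("need at least 32 bytes of header")
    if header[:4] != ELF_MAGIC:
        raise ValueError("missing ELF magic number 7f 45 4c 46")
    elf_class, data_encoding = header[4], header[5]
    if elf_class not in ELF_CLASSES or data_encoding not in ELF_DATA:
        raise ValueError("unrecognised class or data encoding")
    # Multi-byte fields follow the byte order declared in EI_DATA.
    order = "<" if data_encoding == 1 else ">"
    e_type, e_machine = struct.unpack_from(order + "HH", header, 16)
    # The entry point is 4 bytes wide in ELF32 and 8 bytes wide in ELF64.
    entry_format = "I" if elf_class == 1 else "Q"
    (e_entry,) = struct.unpack_from(order + entry_format, header, 24)
    return {
        "class": ELF_CLASSES[elf_class],
        "data": ELF_DATA[data_encoding],
        "type": ELF_TYPES.get(e_type, hex(e_type)),
        "machine": ELF_MACHINES.get(e_machine, hex(e_machine)),
        "entry": e_entry,
    }
```

Reading the first 64 bytes of /bin/ls in binary mode and passing them to `parse_elf_header` gives values you can cross-check against the hexdump output above.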

==action: Use ReadELF to help parse and extract information:==

readelf --file-header /bin/ls

Note: You can replace /bin/ls with the path to any Linux binary executable file. To start with, simply use “/bin/ls”, to see what kinds of information it extracts.

You can alternatively use “-h” rather than “--file-header”, but we’ll use the long versions here, as they are clearer to read.

Question: For the ls program, what is the entry under “Data:”, and what does it mean?

Question: Is this a 32bit or 64bit executable? What does that mean?

==action: Use ReadELF to view the program headers, and dynamic libraries:==

readelf --program-headers /bin/ls

readelf --dynamic /bin/ls

Question: Does this program load code from external libraries? Which one(s)?

==action: Use ReadELF to view the section headers:==

readelf --section-headers /bin/ls

The output includes a list of all the sections of the program itself.

Look at the output and note the section numbers (in square brackets) of the .text and .rodata sections.

Look at the flags (for example, WA, AX, or A) and note the “Key to Flags” at the bottom of the screen. Some sections are marked as executable code (X), others as writable variables (W).
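
The flag letters are a compact rendering of bits in each section header's flags field. The Python sketch below decodes a few of the flag bits defined in the ELF specification into those letters; it handles only five flags, and the names used are assumptions for the example.

```python
# (bit value, letter shown by readelf, meaning) for a few common section flags.
SECTION_FLAGS = [
    (0x1, "W", "write"),
    (0x2, "A", "alloc"),
    (0x4, "X", "execute"),
    (0x10, "M", "merge"),
    (0x20, "S", "strings"),
]


def decode_section_flags(sh_flags):
    """Turn a numeric sh_flags value into flag letters such as 'AX' or 'WA'."""
    return "".join(letter for bit, letter, _ in SECTION_FLAGS if sh_flags & bit)


for value in (0x6, 0x3, 0x2):
    print(f"{value:#x} -> {decode_section_flags(value)}")
```

A value of 0x6 (alloc plus execute) is what you would expect for a section holding code, while 0x3 (write plus alloc) suits writable data.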

==action: Use ReadELF to view a listing of the symbol table:==

readelf --symbols /bin/ls

The symbol table can include the names of functions and other objects, and the mapping of those into the instructions. This alone can provide some useful insights into the external functions called by the program. For example, we can compare the output from ls to the output of the symbols in the ping program:

readelf --symbols /bin/ping

Notice that the output now includes many networking-related functions, so even from this very cursory look we can tell that this program involves network communication.

ReadELF can dump the contents of sections.

==action: Dump the contents of the .rodata section:==

readelf --hex-dump= ==edit:X== /bin/ls

readelf --string-dump= ==edit:X== /bin/ls

Note: Where X is the section number of the .rodata you noted earlier. (There is no space between the = and the X.)

Note that this includes loads of really useful data that could be helpful to understand the program.

==action: Dump the contents of the .text section:==

readelf --hex-dump= ==edit:X== /bin/ls

Note: Where X is the section number of the .text you noted earlier.

Now you are viewing the specific machine instructions, as they are stored to disk, for the program. However, even with this targeted hex representation it is very hard (although technically possible) to make sense of the actual behaviour of the program.

Reverse engineering and disassembly: inspection of the machine instructions / source code

As previously mentioned, system software is typically developed using a high-level programming language, such as C or C++, then compiled into machine code instructions that a CPU can execute. The machine code is then saved into an executable file (along with metadata and so on).

Very few people work directly with machine code in a binary or hex view, since this is almost indecipherable for a human; it is much more intuitive to view the instructions in an executable file as assembly code. Assembly language describes the low-level instruction steps for a CPU using (many) short lines of code, each representing a machine code instruction. At one point in history (before the 1980s) assembly was the primary way that program code was written. The figure below shows an example of compiled machine code (such as “B9FFFFFFFF”) and the assembly that describes the instruction (“mov ecx, -1”). In this case, the instruction sets the ECX CPU register to the value “-1”, which is clearly easier to understand from the assembly code than from the machine code that the computer runs.

Example machine code, and corresponding assembly code, and description¹
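
To make the mapping from “B9FFFFFFFF” to “mov ecx, -1” concrete, here is a toy Python decoder for this single x86 instruction form (opcodes B8 to BF, which load a 32-bit immediate value into a register). A real disassembler handles thousands of encodings; this sketch handles exactly one, and its names are assumptions for the example.

```python
# 32-bit registers in the order used by the x86 instruction encoding.
REGISTERS_32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_imm32(machine_code):
    """Decode the 5-byte x86 instruction `mov r32, imm32` (opcodes B8-BF)."""
    if len(machine_code) != 5 or not 0xB8 <= machine_code[0] <= 0xBF:
        raise ValueError("not a 5-byte mov r32, imm32 instruction")
    # The low three bits of the opcode select the destination register.
    register = REGISTERS_32[machine_code[0] - 0xB8]
    # The four following bytes are the value, stored least significant byte first.
    immediate = int.from_bytes(machine_code[1:], "little", signed=True)
    return f"mov {register}, {immediate}"


print(decode_mov_imm32(bytes.fromhex("B9FFFFFFFF")))  # prints: mov ecx, -1
```

The opcode byte B9 is B8 plus 1, which selects ECX, and the bytes FF FF FF FF are the two's complement representation of -1.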

There are various programs, known as disassemblers, that can be used to display an executable file’s instructions, as assembly code.

==action: Use objdump to disassemble a program and view the assembly instructions for ls:==

objdump -dj .text /bin/ls | less

Tip: This is piped through to less, so you can scroll through the output.

==action: Scroll through, and press Q to exit.==

As you can see, even simple software such as this can contain an extensive number of machine instructions.

Some of the most popular tools for malware analysis and reverse engineering of executables are Ghidra, Radare, and IDA Pro. We will visit these tools later.

Introducing reverse engineering CTF challenges for this module

In this module, marks for lab challenges are awarded using a binary analysis capture the flag (CTF) approach. Each week you will find some programs in your home directory (under a challenges directory).

==action: Browse the challenges directory:==

ls /home/==edit:user==/challenges

When you run the program it will give you instructions and hints on how to solve the challenge.

==action: Run the first challenge (after changing to the directory):==

cd ~/challenges/Ch1_Readelf

./Ch1_Readelf

The ~ (tilde symbol) represents your home directory.

Flag: You can apply the techniques you have learned above to solve this challenge, to determine the password that will provide you with a flag.

Don’t forget to write up your solution and technical description in your log book.

Note that there are often multiple ways to solve each challenge, and you should record at a minimum the approach asked for (and you may even pick up extra marks by showing alternative solutions).

Dynamic malware analysis

Dynamic analysis involves running the malicious code, to analyse what it does, and how it interacts with its environment.

Often it’s simpler (but can lead to incomplete analysis) to run a program in a safe environment, and simply monitor its resulting behaviour.

There are various ways this can be achieved, including running the program in a full operating system / VM, a container / sandbox, or by debugging the program’s behaviour, either by instruction execution, or at a higher level of interaction.

Two tools available for running a Unix program while monitoring its higher level interactions are strace and ltrace.

Strace runs a program and intercepts and outputs the system calls that the program is making; for example, where the program asks the operating system to open or close a file.

Ltrace outputs the dynamic library functions that are called; for example, when a program calls a function from code it loads from shared libraries.

==action: Run these commands, and see how much of the behaviour you can make sense of:==

strace /bin/ls

ltrace /bin/ls
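
Trace output can be long, so it can help to summarise it. Both tools typically print one call per line in the form `name(arguments) = result`, and the Python sketch below counts how often each call appears in such lines. It assumes the trace has been saved to a file (for example with the `-o` option of strace) and ignores lines that do not fit the simple pattern; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines shaped like: name(arguments) = result [optional extra text]
CALL_LINE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<result>\S+)")


def parse_trace_line(line):
    """Return (name, arguments, result) for a call line, or None for other lines."""
    match = CALL_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("name"), match.group("args"), match.group("result")


def count_calls(lines):
    """Count how many times each call name appears in an iterable of trace lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_trace_line(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts
```

Passing an open trace file to `count_calls` and printing `most_common(10)` on the result gives a quick overview of which calls dominate a program's behaviour, before you read individual lines in detail.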

Dynamic analysis CTF challenges

==action: Run the dynamic analysis challenge (after changing to the directory):==

cd ~/challenges/Ch1_Ltrace

./Ch1_Ltrace

Flag: Use the techniques you have learned to dynamically reverse engineer the program to solve the challenge.

Warning: The challenge programs run with setuid (elevated user privilege), but for security reasons they do not run setuid while you trace or debug them. Once you have the solution, you need to run the program again without tracing/debugging in order to successfully access the flag.

Tip: Always make sure you cd into the directory before running the binary.

Conclusion

At this point you have:

Performed static analysis of ELF files, dissecting them to understand what is included within executables, extracted useful information from strings, and seen that the program is made up of machine instructions;
Performed dynamic analysis to understand the high-level behaviour of programs by running and monitoring them for system calls and dynamic library calls, and extracted useful information this way.

Well done!

This sets the stage nicely for learning about reverse engineering, and the importance of understanding lower level concepts, including C and assembly code.

¹ Based on an example from Wikipedia (Creative Commons Attribution-ShareAlike License).