Introduction
Although malware can be written in any language, the vast majority is still written in C, as this language offers developers unparalleled control of the computer’s memory and resources. This week, we will cover basic C programming. We will learn to write, compile and run C programs in preparation for disassembling them next week.
High and low level languages
When you run a program on your computer, what your CPU does behind the scenes is read and execute a number of machine language instructions. Ultimately, computers process zeros and ones, which is why machine code is represented in binary. Binary instructions however are very difficult for humans to read and comprehend. Can you make any sense of the instructions below?
Output of “xxd -b simple” showing binary representation of the “simple” executable file
Assembly language is a human-readable way of representing machine code. It uses mnemonics to represent instructions that the CPU executes, such as “push”, “mov” and “call”. Look at the assembly code below. Is it not slightly more human-friendly? Don’t worry about trying to understand what it does for now. By the end of this module, you will be able to follow assembly instructions like these to work out what the code is doing.
Disassembly of main() function of the “simple” executable in gdb
Both assembly language and machine code are low level languages. For each instruction in machine code, there is exactly one instruction in assembly language. You can think of assembly language as a human-friendly way of representing the binary instructions that the processor executes, which is why it is so important to be able to understand it when you are analysing malware.
Assembly language and machine code are architecture-specific, so code that runs on one type of CPU architecture, such as x86, will not run on another type, such as ARM. Since x86 is the most common type of CPU architecture used in personal computers, we are going to focus on it during this module.
Although it is not common to write programs directly in assembly language nowadays, particularly given the power and complexity of modern CPUs and programs, assembly language is still used by developers looking to achieve maximum utilisation or full control of the processor’s behaviour such as, for example, when working with micro-devices.
The diagram above shows how an assembler converts assembly instructions written by a developer into machine code which a computer can execute. Since the assembly instructions have a one-to-one relationship with machine code instructions, this is a simple conversion to make. As a malware analyst, you should note that it is also just as simple to reverse the process and convert machine code into assembly code which you can read.
Question: If humans can read and write assembly code, why do we need high-level languages?
If you have some programming experience, it is likely that you are familiar with languages operating at a much higher level of abstraction such as Java, C++ or Python. These are called high-level languages. They are architecture-agnostic, so the code is written the same way regardless of the architecture used to execute it. High-level languages abstract a great number of instructions away from the developer, who typically does not care about where the stack pointer is during the program’s execution, or whether certain registers have been populated before a function is called.
The figure above shows the typical process of compiling source code written in a high-level language into machine code that the computer can execute.
Question: How about interpreted languages, how are they different from compiled languages?
C is a high-level language. It does, however, offer developers great control over low-level features such as manual memory allocation and deallocation. Modern reverse-engineering tools such as IDA Pro and Ghidra usually go beyond disassembly and attempt to reconstruct the source code. The language used is C, since it can most accurately represent the low-level instructions contained in the executable. If a different high-level language were to be used for the reconstruction, it would probably leave out a lot of the instructions contained in the assembly code, some of which could be crucial for you as a malware analyst.
Remember the assembly code we were looking at at the start? Here is the source code, in C. Isn’t it much easier to read now?
Source code of the “simple” executable
Getting started with C
Note: You may have noticed that this week’s lab is a little different. There are no flags to capture, although you should still attempt the programming exercises suggested.
The code we are writing this week is very simple, and the exercises can all be solved using the vi text editor and gcc compiler provided in this week’s VM. Since there are no flags in the actual VMs, however, you could choose to write your code on your desktop or a development environment of your choice. It is entirely up to you. If you prefer to use an IDE, go ahead and pick one you like. IDEs are generally recommended for larger, more complex projects, but it is fine to use them if you find that they suit your learning style better.
Hello world
Let’s start this week’s lab with a very simple “hello world” program in C.
==action: Open a terminal in your VM and type:==
vi hello-world.c
==action: Press the “i” key to enter “insert mode”.==
==action: Enter and save this content (Ctrl + Shift + V to paste):==
#include <stdio.h>
int main (void) {
printf("Hello, world!\n");
return 0;
}
==action: Press the “ESC” key to exit “insert mode”.==
==action: Now to quit and save the file press the “:” key, followed by “wq” (write quit), and press Enter.==
==action: Let’s compile our program using gcc. Type:==
gcc hello-world.c -o hello-world
Note how you now have two files: the source code and the executable file.
==action: To run it, type:==
./hello-world
Did you get the same output?
Hello world with runtime argument
Now that we have written our first C program, let’s see if we can make it a little more interesting. ==action: Change the code slightly to capture a string argument passed in when the program is run:==
#include <stdio.h>
int main (int argc, char *argv[]) {
printf("Hello, %s!\n", argv[1]);
return 0;
}
==action: Compile your program.==
==action: To run the program with a runtime argument, type:==
./hello-world Thalita
Note how the main function’s signature changed to accept two arguments: an int and an array of character pointers. We will look at pointers in more detail later on so, for now, just think of argv as an array of strings. In C, there are two accepted signatures for the main function, one which takes no arguments (void) and the one which you used in this example.
Let’s take a closer look at the printf() function. Note how the “%s” placeholder is used to indicate where to insert the string passed to the program at runtime. Note also how the runtime argument is accessed as argv[1]. Arrays in C are zero-based.
Question: What does argv[0] contain?
Question: Change the code so it accepts two arguments and prints your name and surname.
Example output showing two arguments being passed to the program
Variables and data types
C is a statically typed language. Effectively, this means that, when you declare a variable to store a value of a certain type, its type is fixed and you can’t later use it to store a value of a different type. There are four basic type specifiers in C: char, int, float and double. The C standard does not fix all of their sizes, but with gcc on the x86 systems used in this module they are:
- char 1 byte (8 bits)
- int 4 bytes (32 bits)
- float 4 bytes (32 bits)
- double 8 bytes (64 bits)
Tip: It will really help to memorise the basic data types above and their sizes since we will use them throughout the module.
The basic data types can be used in conjunction with the modifiers signed, unsigned, short and long. The signed and unsigned modifiers change the range of values an integer type can hold, while short and long change how much space is allocated to it.
Question: How large is a long int, in bytes? How about a short int? Consider writing a short program to find out.
==action: Let’s write some code to familiarise ourselves with these data types.==
==action: Create a file called data-types.c and enter the following code:==
#include <stdio.h>
int main (int argc, char *argv[]) {
int quantity = 12;
char size = 'S';
float price = 2.85;
double total = price * quantity;
printf("I bought %d size %c coffees. \n", quantity, size);
printf("They cost £%f each. \n", price);
printf("I paid a total of £%f. \n", total);
return 0;
}
==action: Compile and run your code.==
Look at the different format specifiers for printf():
https://www.tutorialspoint.com/c_standard_library/c_function_printf.htm
Question: Can you find a way of printing the float and double values neatly, without the trailing zeroes?
Action: Try modifying the first printf statement so that, instead of coffee, you are buying whatever is specified at runtime.
Example output showing data types and their values
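You can assign binary, octal or hexadecimal literals to integer variables by using the prefixes 0b (for binary), 0 (for octal) and 0x (for hexadecimal). Note that the 0b prefix was a compiler extension (supported by gcc) until it became part of the language in the C23 standard. So, for example, the number 26 in decimal could be assigned as 0b11010, 032 or 0x1a.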
All of these give the same resulting value (26):
- int quantity = 26;
- int quantity = 0b11010;
- int quantity = 032;
- int quantity = 0x1a;
If you are not familiar with converting numbers between bases, or if it’s been a while since you learned it, here is a good 16min video to brush up your knowledge:
Number Systems - Converting Decimal, Binary and Hexadecimal
Action: Update your code to initialise (initially set) the quantity variable with a hexadecimal value instead of decimal.
Action: Do the same, but this time give it a binary value.
Action: Observe that the output is the same in both cases.
Arrays
Arrays are data structures used to store a predetermined number of elements of the same type. Since the elements of an array are stored next to each other in memory, once declared, an array cannot be extended or reduced in size. You can, however, modify the elements that are stored in it. The syntax for declaring and accessing elements of an array in C is similar to other high-level languages. In fact, you have been using arrays already. Go back a few steps and look at how you accessed the argv variable containing values passed to the program at runtime.
Let’s write some code to practise working with arrays.
==action: Create a file called arrays.c and enter the following code:==
#include <stdio.h>
int main(int argc, char *argv[]) {
int numbers[3];
numbers[0] = 1;
numbers[1] = 2;
numbers[2] = 4;
printf("The sum of all %zu integers is %d.\n", sizeof(numbers)/sizeof(int), numbers[0] + numbers[1] + numbers[2]);
return 0;
}
==action: Compile and run your code.==
Question: Can you replace the first four lines of the main function with a one-line initialiser?
Tip: Ensure you understand the line of code that starts with printf, just above the return statement.
Strings
Sorry to break the news to you, but there are no strings in C. That’s right, the concept of a string in other high level languages is an abstraction that does not really exist in C. So how can we represent the word “hello” in C? Well, all we need to do is store the characters ‘h’, ‘e’, ‘l’, ‘l’ and ‘o’ next to each other in memory, right? So we use an array of characters.
Let’s check it out with some code.
==action: Create a file called strings.c and enter the following code:==
#include <stdio.h>
int main(int argc, char *argv[]) {
char word [5];
word[0] = 'h';
word[1] = 'e';
word[2] = 'l';
word[3] = 'l';
word[4] = 'o';
printf("The word is: %c%c%c%c%c.\n", word[0], word[1], word[2], word[3], word[4]);
return 0;
}
==action: Compile and run your code.==
Yes, it works. But how incredibly verbose and time-consuming! Luckily, because arrays of characters are so common, there are some shortcuts we can use.
==action: Replace the code above with the following:==
#include <stdio.h>
int main(int argc, char *argv[]) {
char word [5] = "hello";
printf("The word is: %s.\n", word);
return 0;
}
Much better, don’t you think? Now, if you run it, you might notice something odd happening after the word “hello” is printed. What is going on there? What are these funny characters?
Since a string is just an array, and arrays are just a bunch of spaces in memory used contiguously, the printf function does not know when to stop printing. If there is something else using the space in memory right next to the last character of your word, it will print that too, and will carry on printing. To prevent this behaviour, functions that deal with strings conventionally look for the null terminator ‘\0’ in an array of characters and stop when they reach it. In fact, the “hello” literal you used in the previous exercise already ends with a null terminator, but an array of size 5 has no room left to store it. If you change the array size to 6, you will see that the null terminator is now added for you, and the string is printed correctly.
#include <stdio.h>
int main(int argc, char *argv[]) {
char word [6] = "hello";
printf("The word is: %s.\n", word);
return 0;
}
Note that, if you are populating the array manually as we did in the first exercise, you will have to add the null terminator yourself.
#include <stdio.h>
int main(int argc, char *argv[]) {
char word [6];
word[0] = 'h';
word[1] = 'e';
word[2] = 'l';
word[3] = 'l';
word[4] = 'o';
word[5] = '\0';
printf("The word is: %s.\n", word);
return 0;
}
Conditionals
Conditional statements are a fundamental part of any programming language. Without conditional statements, all instructions would have to be executed in sequence, regardless of what happened at runtime. As an example, a very simple program to look for your missing house keys could be: “Look on the kitchen counter. If they are there, stop. If not, look under the bed. If they are there, stop. If not, check if the dog has eaten them. If the dog has eaten them, wait 24h. If not, call the locksmiths and change the lock.” Without conditional statements, your program would look in all places, wait 24h, call the locksmiths and change the locks - even if the keys happened to be under the bed!
C uses “if” and “switch” statements for making decisions based on conditions that evaluate to true or false. Let’s check how both of these work by writing some code:
==action: Create a file called conditionals.c and enter the following code:==
#include <stdio.h>
int main(int argc, char *argv[]) {
if (argc == 1) {
printf("You haven't entered any arguments.\n");
} else if (argc > 1 && argc < 3) {
printf("Your argument is: %s\n", argv[1]);
} else {
printf("You entered too many arguments.\n");
}
return 0;
}
==action: Compile and run your code.==
==action: Test the three possible execution paths.==
==action: Change the “else if” condition to something less verbose.==
==action: Change the program so that it prints up to three arguments.==
Aim for the output below:
You may have noticed that your code can quickly become quite verbose when there are too many if statements involved. Switch statements can be used in C (as in other high-level programming languages) to successively compare a value to a number of other values. The syntax for a switch statement is:
switch (expression) {
case constant:
statement;
break;
case constant:
statement;
break;
default :
statement;
}
Note that the breaks are optional, as is the default statement. Let’s rewrite the code from the previous exercise using a switch statement.
#include <stdio.h>
int main(int argc, char *argv[]) {
switch (argc) {
case 1:
printf("You haven't entered any arguments.\n");
break;
case 2:
printf("Your argument is: %s\n", argv[1]);
break;
case 3:
printf("Your arguments are: %s and %s\n", argv[1], argv[2]);
break;
case 4:
printf("Your arguments are: %s, %s and %s\n", argv[1], argv[2], argv[3]);
break;
default :
printf("You entered too many arguments.\n");
}
return 0;
}
Question: What happens if you remove the breaks?
==action: Create a file called fallthrough.c and enter the following code:==
#include <stdio.h>
int main(int argc, char *argv[]) {
if (argc == 1) {
printf("You haven't entered any arguments.\n");
return 1;
} else if (argc > 4) {
printf("You have entered too many arguments.\n");
return 1;
}
printf("Your arguments are:");
switch (argc) {
//TODO: ADD YOUR CODE HERE
}
return 0;
}
==action: Implement the switch statement above using fallthrough to achieve the outcome below.==
Loops
Loops are used in programming languages to execute statements repeatedly until some condition is met (or sometimes infinitely). There are three types of loops in C: the “for loop”, the “while loop” and the “do while loop”. They are similar to loops you may have encountered in other programming languages.
A typical for loop has three elements: initialisation, test and increment.
for (int i = 0; i < 10; i++) {
printf("%d\n", i);
}
The printf function is called 10 times in the example above, printing the numbers 0-9.
All three elements are optional.
Question: What happens when you run the code below?
#include <stdio.h>
int main (int argc, char *argv[]) {
for (; ; ) {
printf("%d\n", 0);
}
}
The while loop is similar to the for loop, but only the test can be specified within the brackets. Both initialisation and increment must be done outside the brackets. The code below is equivalent to our first for loop example.
int i = 0;
while (i < 10) {
printf("%d\n", i);
i++;
}
Finally, the do while loop is similar to the other two loops, but with one important difference.
Question: Can you figure out what it is?
int i = 0;
do {
printf("%d\n", i);
i++;
} while (i < 10);
Let’s put this into practice by writing some code.
==action: Using argc and argv, write some code that loops through the arguments passed by the user at runtime and prints them to the screen.==
Aim for the output below:
Functions
So far, we have been writing all our code inside the main function. However, we have called the printf() function to output text to the terminal. (Earlier we also used sizeof to calculate the size of an array but, despite the brackets, sizeof is an operator built into the language rather than a function.) When the main code calls a function, it passes the required values to it before delegating execution to the function. While the function is executing, the main function waits. Once the function returns, the main code continues from where it had stopped. We will take a much more in-depth look at what happens behind the scenes once we move onto assembly. For now, let’s get some practice writing functions.
==action: Create a file called simple-function.c and enter the following code:==
#include <stdio.h>
float average(float a, float b);
int main (int argc, char *argv[]) {
float x = 7.5f;
float y = 10.5f;
float result = average(x, y);
printf("The average is: %g\n", result);
}
float average(float a, float b) {
return (a + b)/2;
}
==action: Compile and run your code.==
Note how the function signature is declared before the main function.
Question: What happens if you remove that line? Why? Could you rearrange the code so you don’t need to declare the function signature?
It is your turn now.
Action: Modify the function above to calculate the average of three floating point numbers.
Action: Write a new function to calculate the VAT (20%) of a given price.
Pointers
Pointers are perhaps the most dreaded feature of the C language, but they are widely used and one of the most useful concepts for malware analysis. Malware analysts rarely have the luxury of being able to access the source code, so it is important to develop a solid understanding of how programs are loaded into memory and executed by the CPU, down to which values are stored in which memory addresses and how and when these are accessed by malicious code. Of course, it all begins with a good foundation in C, and you will find that pointers are particularly helpful, so let’s get started.
A good way to think of how pointers are represented in your computer’s memory is to imagine that each byte in memory is a bucket which can be empty or can contain data. Each bucket also has an address, as illustrated in the figure below.
When we store data in memory, we simply put it in these buckets. The number of buckets we use depends on the size of the data, so a char, for example, would take up 1 bucket, but an int would take up 4 buckets since it occupies 4 bytes of memory.
Let’s store the character ‘A’ in a bucket.
We know that the bucket’s address is 0x0c. What if we wanted to store this information somewhere? Well, first of all, we need to know the size of the address data type so we know how many buckets to allocate to it. As it happens, in 32-bit systems, an address or pointer occupies 4 bytes in memory. Perfect, so let’s reserve 4 buckets to store our address.
Now buckets 0x0e to 0x11 contain data of type “pointer”. What exactly is this data though? It is simply the address of another bucket storing 8 bits or 1 byte of data (in this case, our character ‘A’). It is worth taking a slight diversion here to look at the concept of endianness. Let’s think of the address 0x0c as 0x0000000c (they both represent the number 12 in decimal). As you can see in the illustration above, the least significant byte of the address, 0c, is stored in the lowest address in memory. The highest address in the picture (the last bucket) stores the most significant byte of the address, in this case, 00. This is called little-endian ordering. Some architectures use the opposite ordering, big-endian, when storing bytes in memory.
Watch the first 3 min of this video on endianness:
Endianness Explained With an Egg - Computerphile
Pointers in practice
Let’s go back to our pointers to see what they look like in code.
char myChar = 'A';
Note: Nothing new here, right? I simply declared a variable called myChar and put the character ‘A’ in there.
char *myCharPointer;
Note: Ok, so what is going on here? This is where C syntax differs from other high-level languages you may be familiar with. Whenever you see a variable being declared with an asterisk (*) before its name, the variable is a pointer. So, in the example above, read: declare a variable called myCharPointer, of type pointer (to a char).
Now, let’s assign an address to our new variable:
myCharPointer = &myChar;
Note: Whenever you see a value being assigned with the ampersand (&) prefix, it means “the address of” in C. So, in the example above, read: assign the address of myChar to the myCharPointer variable.
Let’s put it all together and run some code.
==action: Create a file called pointers.c and enter the following code:==
#include <stdio.h>
int main(int argc, char *argv[]) {
char myChar = 'A';
char *myCharPointer;
myCharPointer = &myChar;
printf("The value of myChar is: %c\n", myChar);
printf("The value of myCharPointer is: %p\n", (void *)myCharPointer);
return 0;
}
==action: Compile and run your code.==
Did you get the same output?
==action: Modify the code to use an int instead of a char. Print the sizes of both variables.==
Aim for the output below.
(don’t forget to compile your code as a 32-bit program by using the -m32 option as shown above)
Tip: If you get an error when trying to use the -m32 option, it could be because there may be a dependency missing in this week’s VM. Please use Week 1’s VM to run this exercise, and let us know (via GitHub).
Here is one last bit of syntax to remember for now. Suppose I have a variable of type char pointer, as in the first example we looked at.
char myChar = 'A';
char *myCharPointer = &myChar;
We know that the myCharPointer variable stores an address to a char. Now what if we wanted to follow the address and print the value of the char stored in there? We can use the dereferencing “*” operator:
#include <stdio.h>
int main(int argc, char *argv[]) {
char myChar = 'A';
char *myCharPointer = &myChar;
char c = *myCharPointer;
printf("The value stored in the address that is stored in myCharPointer is: %c\n", c);
return 0;
}
Here we create a new char variable called c and assign it the value stored in the address stored in myCharPointer, which evaluates to ‘A’. You can read it as: go to the address you have and give me the value stored there.
==action: Modify the code you used to print integer pointers. Add a line showing how to access the value stored in the address that your pointer variable is pointing to.==
To recap:
Code | Explanation |
---|---|
int *myPointer; | Declare a variable called myPointer of type pointer (to an int). |
myPointer = &myInt; | Assign the address of myInt to myPointer. |
int x = *myPointer; | Follow myPointer and assign the value at that address to x. |
Pointers and arrays
This is where pointers get really interesting. Pointers are very similar to arrays and, a lot of the time, they can be used interchangeably. When you declare an array of 4 elements of type integer, you are saying: “give me a space in memory the size of 4 integers (remember each integer occupies 4 bytes, so we are asking for 16 bytes) and point to the first element in there”.
int myNumbers [4] = {1, 2, 3, 4};
In most expressions, the name of an array is converted to a pointer to its first element, so the computer knows where in memory the array starts. The array also has a fixed type and size, so the compiler knows how much memory it occupies and where it ends.
Since the array name can stand in for the address of its first element, you can assign it to a variable of type pointer.
int *myNumbersPointer = myNumbers;
Question: What do you think the code below will print?
#include <stdio.h>
int main(int argc, char *argv[]) {
int myNumbers [4] = {1, 2, 3, 4};
int *myNumbersPointer = myNumbers;
printf("The myNumbers array is: %p\n", myNumbers);
printf("The myNumbersPointer pointer is: %p\n", myNumbersPointer);
return 0;
}
==action: Create a file, add the code above, compile and run it to check if you were right.==
Action: Modify the code above to print the elements of the array using square brackets [ ] to access each element.
Question: Does the bracket notation work with both arrays and pointers? Try it out.
==action: Create a file called arrays-and-pointers.c and enter the following code:==
#include <stdio.h>
int main(int argc, char *argv[]) {
int myNumbers [4] = {1, 2, 3, 4};
int *myNumbersPointer = myNumbers;
for (int i = 0; i < 4; i++) {
printf("%d ", myNumbers[i]);
}
printf("\n");
for (int i = 0; i < 4; i++) {
printf("%d ", *(myNumbersPointer + i));
}
return 0;
}
==action: Compile and run the code.==
A final note on pointers: you can have double pointers! Take a look at the example below:
char *names [] = {"this", "is", "the", "end"};
char **namesPointer = names;
Action: Write two for loops like you did in the arrays-and-pointers.c program to print the contents of the array and pointer declared above.
Conclusion
At this point you have:
- Written, compiled and executed C programs to familiarise yourself with basic C syntax
- Used arrays, conditionals, loops and functions in your programs
- Learned more advanced memory management features of the C language and practised using pointers.
Well done!
We have covered quite a lot this week. If any of the concepts are not clear, feel free to create your own coding samples to investigate them further and, of course, feel free to share them with us on the Discord channel. All the good work you have done in C will set the foundation for being able to understand assembly instructions in the weeks to come.