How Compiling Works

Introduction

Beginning programmers write a C program like this:

#include <stdio.h>

int main()
{
  printf("Hello, world!\n");

  return 0;
}

… then think that "stdio.h" contains all the printing functionality. But that is not true. The "stdio.h" file contains only information about the printf() function, namely its declaration. The actual printf() function is defined elsewhere. This How-To document will explain the “where.” I'll also explain why C programs work this way.

Minimal stdio.h

If I were to write a minimal version of stdio.h, this is what I'd write:

#ifndef STDIO_H_
#define STDIO_H_

int printf(char* text);

#endif /* STDIO_H_ */

This file has one purpose: to declare the printf() function's prototype. Once the prototype is declared, any file that includes this file will know how to call the printf() function. You can tell from the above prototype, for example, that my minimal printf() takes one string argument and returns an integer. The file also contains #ifndef, #define, and #endif. This logic (an "include guard") exists only to make sure that the contents of stdio.h are processed only once even if you try to include the file multiple times. Repeating a bare function prototype happens to be harmless in C, but real headers also define things such as struct types, and defining those twice is a compiler error. For example, without the guard, a program like this would see everything in stdio.h twice:

#include <stdio.h>
#include <stdio.h>  /* include stdio.h again; without the guard, its contents would be repeated */

int main()
{
  printf("Hello, world!\n");

  return 0;
}

… but thanks to #ifndef, #define, and #endif inside stdio.h, the C preprocessor (a part of the compiler that runs before the actual compiling process) will make sure that anything between #ifndef and #endif is included only once.

Now you could argue you'd never include the same file twice, and ideally you'd be right. However, when your program gets large you'll be including multiple header files… and you may end up including two header files, one called "otherfile1.h" and the other "otherfile2.h", both of which include "stdio.h"… and you'd be including "stdio.h" twice. Now, thanks to #ifndef, #define, and #endif, you won't have to worry about the compiler error.
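To see the guard at work without juggling several files, here is a small sketch that pastes the contents of a hypothetical header, point.h, twice into one file, which is roughly what the preprocessor produces after two #include lines. The names POINT_H_, struct point, and point_sum() are made up for this illustration and do not come from any real library.

```c
/* First "#include" of the hypothetical point.h */
#ifndef POINT_H_
#define POINT_H_

struct point
{
  int x;
  int y;
};

#endif /* POINT_H_ */

/* Second "#include" of the same header: POINT_H_ is already defined,
   so the preprocessor skips everything up to the matching #endif. */
#ifndef POINT_H_
#define POINT_H_

struct point
{
  int x;
  int y;
};

#endif /* POINT_H_ */

/* Uses the single surviving definition of struct point. */
int point_sum(struct point p)
{
  return p.x + p.y;
}
```

If you delete the #ifndef, #define, and #endif lines, the compiler should reject the second struct point as a redefinition.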

Minimal stdio.c

Now that printf() has a prototype, a definition of printf() is required. We'll put this definition into a separate file, called stdio.c, and I'll write the following code:

#include <stdio.h>

int printf(char* text)
{
  /* Pseudocode: the syntax for inline assembly is compiler-specific. */
  asm
  {
    /* some assembly language code */
  }

  return 0;
}

That's it! (Of course, the hard part is knowing what to put in the asm{} block.)
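If you'd rather not write assembly, one way to sketch the same idea is to hand the string to the operating system. The version below assumes a POSIX system, where write() and STDOUT_FILENO are declared in <unistd.h>. The name my_print() is hypothetical, chosen so that it does not clash with the real printf(), and like the minimal printf() above it handles no format specifiers.

```c
#include <string.h>   /* strlen() */
#include <unistd.h>   /* write(), STDOUT_FILENO (POSIX, an assumption) */

/* Writes text to standard output.
   Returns the number of bytes written, or -1 on error. */
int my_print(const char *text)
{
  size_t length = strlen(text);
  ssize_t written = write(STDOUT_FILENO, text, length);

  if (written < 0)
  {
    return -1;
  }

  return (int)written;
}
```

Calling my_print("Hello, world!\n") from main() would then stand in for the printf() call in hello.c.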

Compiling the files together

After writing the above code, this is how you'd create the final program, "hello":

  • Compile "hello.c" to create an object file (an object file is an intermediary, an "almost fully compiled" file) called "hello.o"
  • Compile "stdio.c" to create an object file called "stdio.o"
  • Link "hello.o" and "stdio.o" together to create the final program, "hello".

There are several reasons for splitting the build into these separate steps:

  • As you write bigger and bigger programs, it'll take you longer and longer to compile your code into the final program. By breaking up the compilation process into smaller compilation steps, you get to recompile the program quickly when you modify a small portion of your program — simply recompile only the portions affected by your code change then link them together.
  • If you ever want to let someone else use your code, but don't want to reveal your source, you can just give them the object file (the .o file) and the header files (the .h files) and still allow them to compile programs including your code.

Normally, a compiler will do all the compiling and linking for you. For example, on UNIX, you can just type this to produce "hello" from the two .c files:

% cc hello.c stdio.c -o hello

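If you try this on a real system, however, expect trouble: printf() is already declared by the system's real stdio.h and defined by the standard C library, so the simplified version above conflicts with it (typically the compiler complains about conflicting types for printf()).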

… but be assured that, behind the scenes, the C preprocessor, the compiler, and the linker are all run one after another, automatically, on your behalf. Under UNIX, the C preprocessor is usually named cpp, the compiler cc, and the linker ld. But “cc” will run the preprocessor and the linker for you, to make things simpler. You could, however, run the steps separately; on UNIX, you can compile and link the files like this:

% cc hello.c -c    (compile hello.c into hello.o)
% cc stdio.c -c    (compile stdio.c into stdio.o)
% cc hello.o stdio.o -o hello  (link hello.o and stdio.o to generate hello)

If you try this now, the exact failure depends on your toolchain: the compile step may already reject the conflicting printf() declaration, and even if it doesn't, the link step also pulls in the standard C library, which has its own printf().
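The split between declaration and definition that makes separate compilation possible can be sketched in a single file. The three parts below mimic a header, a caller, and an implementation file. The file names in the comments and the functions greeting_length() and total_length() are hypothetical, invented for this illustration.

```c
#include <string.h>   /* strlen() */

/* --- what would live in "greet.h": a declaration only --- */
int greeting_length(const char *name);

/* --- what would live in "main.c" ---
   The compiler can translate this call knowing only the prototype;
   the address of greeting_length() is filled in later, at link time. */
int total_length(void)
{
  return greeting_length("world");
}

/* --- what would live in "greet.c": the actual definition --- */
int greeting_length(const char *name)
{
  /* Length of "Hello, " followed by the name. */
  return (int)strlen("Hello, ") + (int)strlen(name);
}
```

Here everything ends up in one object file, so there is nothing for the linker to resolve across files. With the parts in separate .c files, the call inside total_length() would be left as an unresolved reference in one object file until the link step.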

Where is printf()?

When you normally compile a program, however, you don't have to worry about manually linking your program to the built-in library that defines printf(). This is because the standard C library (which defines printf()) is automatically linked to your program by default. You can suppress this using the -nodefaultlibs option of the GNU C Compiler. Other compilers should have a similar option.

The standard C library's filename is usually /usr/lib/libc.a on many UNIX systems. The .a file is an “archive” file containing multiple .o files. You can create one, too, using the ar program on UNIX. I'll let you look up its usage on your own.

Correspondingly, if you try to build hello.c alone using gcc and the -nodefaultlibs option, you'll get a linker error saying that printf() is not defined. In principle, building hello.c and stdio.c together with -nodefaultlibs resolves the reference to printf(), because stdio.o now supplies the definition.

If you check out your /usr/lib directory, you'll see a bunch of archive files. You can link against any of them manually using the -l option. To link with the libyoyo.a file, for example, you use the -lyoyo option:

% cc myprog.c -lyoyo

… and the compiler will automatically look for the libyoyo.a file and link your program with the *.o files within the archive.

Sometimes, you'll also see *.so files on UNIX, such as libyoyo.so. These are “dynamically linked” libraries. Such libraries are linked when you run your program as opposed to when you build it. The advantage of this mechanism is that it lets you update the library your program links to without updating the program itself. It also saves disk space and memory, because a single copy of the library on disk and in memory can be shared by many programs. The equivalent file on Windows has the *.dll extension. Writing dynamically linked libraries is system-dependent and is beyond the scope of this document.

What is in *.o?

Given what you know now, you should be able to guess what information should be included in the *.o file.

A *.o file should have just enough information to allow it to be linked to create the final program; all other information should be compiled away thoroughly so as to minimize the information that should be “compiled” by the linker when the final program is created.

In general, this means that an object file contains only machine code, except in the places where it references a function or a variable of another object file. It should also contain information about how other object files can access its variables and functions. When the linker links the object files together, it'll figure out how to convert these references into machine code to create the final program.

These functions and variable references are called symbols. A function is a symbol that points to an address within the object file; a variable is also a symbol that points to an address within the object file. To the linker, symbols are all pretty much the same thing — the linker just has to know the symbols and the addresses they refer to, and the linker will simply cross-reference the symbols in each object file to each other.

When you write your code, you can control whether a variable or function's symbol can be “exported.” An exported symbol is linkable from another object file; a non-exported symbol cannot be.

In C, you control this with the “static” keyword; any static global variable or any static function will not be addressable from outside the .c file because the symbol will not be available when the .o file is created.

However, the physical location of the variable or function still exists, so if you find a way to reach its memory location, the data will still be accessible.
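A minimal sketch of both points follows: static symbols stay private to their file, yet their storage can still be reached through a pointer. All names here (hidden_counter, bump(), next_id(), counter_address()) are hypothetical.

```c
/* Internal linkage: not visible to other object files. */
static int hidden_counter = 0;

/* Also internal: only code in this .c file can call bump() by name. */
static void bump(void)
{
  hidden_counter++;
}

/* External linkage: other object files can link against next_id(). */
int next_id(void)
{
  bump();
  return hidden_counter;
}

/* Hands out the address of the static variable. Anyone holding this
   pointer can read or modify hidden_counter, even though the symbol
   itself is not exported. */
int *counter_address(void)
{
  return &hidden_counter;
}
```

Another file could not refer to hidden_counter by name, but after calling counter_address() it could still change the value that next_id() returns.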

There are programs to manipulate an object file's symbol table. strip is a popular program used to strip out all the symbols in an object file, making the object file non-linkable. The -s option of the GNU C compiler does the same for the program it links. A non-linkable object file is not very useful, but you can run “strip” on the final program, which will reduce the size of your program slightly. Stripping the symbol table from the final program, however, has a nasty side effect: debuggers and core dumps won't be able to give you much useful information when you analyze your program, because they often rely on the symbol table to present that information. Some vendors also run strip on the final program to hide its symbols, which can deter reverse-engineering of the program.

What should be in *.h?

Notice that a *.h file can be included by multiple files. For example, stdio.h was included from hello.c as well as stdio.c. This won't cause any problem as long as stdio.h contains only the function prototype for printf(). But if you had defined an actual printf() function in stdio.h (instead of stdio.c), then it would cause a problem because the symbol “printf” would become available in hello.o as well as in stdio.o… which will be a problem when linking because the linker will find two definitions of “printf” (one in hello.o, one in stdio.o) and won't know which one to use.

The same is true of variables. A header should contain no actual variable definition, only its declaration (its “prototype”). A global variable declaration should be preceded by the keyword “extern” to show that it is only a declaration, not an actual definition.
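As a sketch of that rule, the block below puts the parts of a hypothetical header (totals.h) and its implementation file (totals.c) one after the other. The names shared_total and add_to_total() are invented for this example.

```c
/* --- what would go in "totals.h": declarations only --- */
extern int shared_total;          /* declares the variable, reserves no storage */
void add_to_total(int amount);    /* function prototype */

/* --- what would go in "totals.c": the one and only definitions --- */
int shared_total = 0;             /* storage is reserved here, exactly once */

void add_to_total(int amount)
{
  shared_total += amount;
}
```

Any number of .c files can include the header part, because it defines nothing; only totals.c contributes the shared_total symbol to the link.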

You can include #define constants and macros (macros are functions created using #define… if you don't know how that works, don't worry about it) because constants and macros do not export any symbols. These are preprocessor constants and functions that will disappear at the end of the preprocessing stage, long before the linker comes into play.
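For readers who haven't met them, here is a small sketch of a #define constant and a macro. BUFFER_SIZE, SQUARE(), and buffer_area() are hypothetical names used only for illustration.

```c
/* A preprocessor constant: every BUFFER_SIZE becomes 64 before compiling. */
#define BUFFER_SIZE 64

/* A function-like macro: SQUARE(n) is replaced by ((n) * (n)).
   The extra parentheses keep arguments such as 1 + 2 grouped correctly. */
#define SQUARE(x) ((x) * (x))

/* After preprocessing, the compiler sees: return ((64) * (64)); */
int buffer_area(void)
{
  return SQUARE(BUFFER_SIZE);
}
```

Only buffer_area() shows up as a symbol in the object file; BUFFER_SIZE and SQUARE have been replaced by their expansions before the compiler proper runs.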

It is okay to include static variables and static functions, but remember that static variables and static functions can be referenced only within the current .c file. Every .c file that includes the header gets its own private copy. So two static variables may share the same name in your code, but if you access that static variable from two different .c files, you are touching two separate variables, and the variable cannot be used to pass values back and forth between the two .c files.

Conclusion

Building a program is broken up into compilation and linking stages (and, for a C program, a preprocessing stage as well). Breaking the build into separate stages gives you faster recompilation and the ability to distribute your code to others without revealing its source. However, these privileges come with the need to obey certain rules when writing the code, without which you can end up with software that cannot be compiled or linked.
