Build Systems (Basics)

In this unit we'll cover the basics of build systems and illustrate how they work, using maven as an example. We'll start with a short recap of language compilation, notably in the context of the java programming language, and then investigate the challenges associated with assembling binaries. Finally, we take a look at the core features of maven, notably dependency management, and make first attempts to customize build system behaviour with a configuration file.

Lecture upshot

Build systems have one purpose: make sure your source code can be reliably translated into a usable product. Although this sounds straightforward, it is anything but a simple task. Build systems are a powerful and highly configurable means to bring order and reliability to the path from source code to product.

Java compiler recap

For a start, we will take a closer look at how a simple program is executed on a computer.

Executing code

There are two ways to execute code. Which path is taken depends on the programming language.

Using an interpreter: The computer tries to make sense of your source code, as it processes it, line by line. Examples:
- Python
- Bash
- JavaScript
Using a compiler: The computer does not make sense of your source code directly, but expects you to first translate it into machine-executable code, using a compiler. Examples:
- Basic
- C/C++
- Java ???

About binaries

Usually, compiled code is bound to a specific target platform. That is, once a compiler has translated source code to native machine code, the outcome can only be used on the hardware and operating system it was compiled for.

Interpreted VS compiled languages

So which one is better?

Interpreted languages:
- Cross-platform compatibility
- Slightly faster development, no need to wait for compiler
- Often "easier" syntax, beginner friendly
Compiled languages:
- More performant: compiler optimizations for target platform, executed in native code
- Safer: Fewer runtime errors, more compile time errors

Note that a compiler only needs to do its job once, whereas an interpreter needs to run every time a program is executed.

And what about Java?
Java is a special case...

Java is a compiled language
Java bytecode works on every platform, because it is interpreted by a virtual machine, the JVM.
Java unites some advantages of both:
- Cross-platform compatibility
- Performant compiler optimizations for VM
- Security by compiler checks
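
To make the cross-platform point tangible, here is a minimal sketch of a class whose compiled bytecode can be copied unchanged to any machine with a JVM, while its output depends on the host. The class name PlatformInfo is a hypothetical choice for illustration; the system properties it reads are standard ones.

```java
// Sketch: the same compiled class prints different values depending on the host it runs on.
final class PlatformInfo {

  /** Returns a one-line description of the host OS and the JVM executing this bytecode. */
  static String describe() {
    String os = System.getProperty("os.name");      // operating system of the host
    String arch = System.getProperty("os.arch");    // CPU architecture of the host
    String vm = System.getProperty("java.vm.name"); // name of the JVM implementation
    return "Running on " + os + " (" + arch + ") inside " + vm;
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
```

Compile it once with javac, then run the resulting class file on two different operating systems: the bytecode stays identical, only the printed line changes.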

Java compiler illustration

Step 1: Write java code

A human developer writes human-readable java code:

class HelloWorld {
  public static void main(String[] args) {
    System.out.println("Hello, World!");
  }
}

Step 2: Compile to java bytecode

The java compiler is called: javac HelloWorld.java and produces bytecode:

CAFE BABE 0000 0042 001D 0A00 0200 0307 0004 0C00 0500 0601 0010 6A61 7661 
2F6C 616E 672F 4F62 6A65 6374 0100 063C 696E 6974 3E01 0003 2829 5609 0008 
0009 0700 0A0C 000B 000C 0100 106A 6176 612F 6C61 6E67 2F53 7973 7465 6D01 
0003 6F75 7401 0015 4C6A 6176 612F 696F 2F50 7269 6E74 5374 7265 616D 3B08 
000E 0100 0D48 656C 6C6F 2C20 576F 726C 6421 0A00 1000 1107 0012 0C00 1300 
1401 0013 6A61 7661 2F69 6F2F 5072 696E 7453 7472 6561 6D01 0007 7072 696E 
746C 6E01 0015 284C 6A61 7661 2F6C 616E 672F 5374 7269 6E67 3B29 5607 0016 
0100 0A48 656C 6C6F 576F 726C 6401 0004 436F 6465 0100 0F4C 696E 654E 756D 
6265 7254 6162 6C65 0100 046D 6169 6E01 0016 285B 4C6A 6176 612F 6C61 6E67 
2F53 7472 696E 673B 2956 0100 0A53 6F75 7263 6546 696C 6501 000F 4865 6C6C 
6F57 6F72 6C64 2E6A 6176 6100 2000 1500 0200 0000 0000 0200 0000 0500 0600 
0100 1700 0000 1D00 0100 0100 0000 052A B700 01B1 0000 0001 0018 0000 0006 
0001 0000 0001 0009 0019 001A 0001 0017 0000 0025 0002 0001 0000 0009 B200 
0712 0DB6 000F B100 0000 0100 1800 0000 0A00 0200 0000 0300 0800 0400 0100 
1B00 0000 0200 1C

(Hex dump produced with: xxd -u -p HelloWorld.class | sed 's/..../& /g')

Do you see something unusual?

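The hex dump of java bytecode shows that every compiled class starts with CAFEBABE. These four bytes are the class file format's magic number, which lets the JVM recognize a class file. The value itself is an easter egg, added by the java developers.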
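
The leading bytes can also be inspected programmatically. The following is a minimal sketch, with the hypothetical class name ClassFileHeader, that checks for the CAFEBABE prefix and reads the major version stored in bytes 6 and 7 (0x0042, that is 66, in the dump above, which corresponds to Java 22).

```java
import java.nio.ByteBuffer;

final class ClassFileHeader {

  /** The four bytes every class file starts with. */
  static final int MAGIC = 0xCAFEBABE;

  /** Returns true if the given bytes start with the class file magic number. */
  static boolean looksLikeClassFile(byte[] bytes) {
    if (bytes == null || bytes.length < 4) {
      return false;
    }
    // ByteBuffer reads big-endian by default, matching the class file layout.
    return ByteBuffer.wrap(bytes).getInt() == MAGIC;
  }

  /** Returns the major version (bytes 6 and 7), or -1 if the header is too short. */
  static int majorVersion(byte[] bytes) {
    if (bytes == null || bytes.length < 8) {
      return -1;
    }
    return ByteBuffer.wrap(bytes).getShort(6) & 0xFFFF;
  }

  public static void main(String[] args) {
    // First eight bytes of the hex dump above: CAFE BABE 0000 0042
    byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 0x42};
    System.out.println(looksLikeClassFile(header) + ", major version " + majorVersion(header));
  }
}
```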

Step 3: Run byte code on JVM

Finally, the byte code is distributed to various target systems.

Note that any system needs a JVM to run java bytecode.
Other compiled languages do not have this requirement, as they directly produce CPU-executable code.

How is the JVM best described?

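The JVM is an interpreter. It reads in java bytecode and immediately translates it into execution instructions in the host CPU's native machine language. Modern JVMs additionally compile frequently executed bytecode to native code at runtime (just-in-time compilation).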

JARs

In most cases your java program will be more than a single class.
- You could still translate all classes, and ship them, maybe as a zip-file
- But Java already has a file format for that: JARs.
JAR stands for "Java ARchive"
- JARs are zip files
- JARs have all classes, and a manifest, with meta information, notably the entry point to your application
JARs are still executed by the JVM, and in the best case run on all systems

JAR usage

Creating a JAR file from sources is relatively simple:

# Compile all java files to *.class files, place them in a new build directory
javac -d ./build *.java

# Enter build directory
cd build

# Create a java archive (JAR) file, using all *.class files.
# Add MANIFEST.MF pointing to HelloWorld as launcher class.
jar cfe MyDeliverable.jar HelloWorld *.class

This produces a JAR file: MyDeliverable.jar

Content:

MyDeliverable.jar
 ├── HelloWorld.class
 └── META-INF
     └── MANIFEST.MF

With MANIFEST.MF content:

Manifest-Version: 1.0
Created-By: 22.0.2 (Oracle Corporation)
Main-Class: HelloWorld

The JAR file can be directly executed, using the JVM: java -jar MyDeliverable.jar
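
What the jar tool and java -jar do with the manifest can be approximated with the standard java.util.jar API. This is a minimal sketch under assumptions: the class name JarManifestSketch is hypothetical, the JAR lives only in memory, and the class entry holds placeholder bytes instead of real bytecode.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

final class JarManifestSketch {

  /** Builds a tiny JAR in memory whose manifest names the given main class. */
  static byte[] buildJar(String mainClass) {
    Manifest manifest = new Manifest();
    Attributes attributes = manifest.getMainAttributes();
    // Without the version attribute the manifest would be written without any attributes.
    attributes.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    attributes.put(Attributes.Name.MAIN_CLASS, mainClass);

    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (JarOutputStream jar = new JarOutputStream(buffer, manifest)) {
      // Placeholder entry: a real JAR would contain compiled bytecode here.
      jar.putNextEntry(new JarEntry("HelloWorld.class"));
      jar.write(new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE});
      jar.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** Reads the Main-Class attribute back, or returns null if there is none. */
  static String readMainClass(byte[] jarBytes) {
    try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
      Manifest manifest = jar.getManifest();
      return manifest == null
          ? null
          : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The first two bytes of the generated archive are the letters P and K, the signature of the zip format, which illustrates that a JAR is a zip file with a manifest inside.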

JAR files for libraries

JAR files are also a great way to provide functionality to other programmers. Most java libraries are provided as JAR files. Whoever uses your code is most likely only interested in the functionality you offer, not the source code itself.

Dependencies

Most of the time you do not want to program everything from scratch (see the previous lecture on reuse-oriented development).

JSON example

We will now look at how compiling and execution changes when additional libraries are involved.

Imagine we want to serialize a java object, that is, create a machine-readable string representation of it:

class Student {
  private final int age;
  private final String firstName;
  private final String lastName;

  public Student(int age, String firstName, String lastName) {
    this.age = age;
    this.firstName = firstName;
    this.lastName = lastName;
  }

  //... and getters
}

A student object, as created by new Student(34, "Maximilian", "Schiedermeier") should be serialized to:
```
{
  "age": 34,
  "firstName": "Maximilian",
  "lastName": "Schiedermeier"
}
```

Manual string creation

Of course, I could manually construct a JSON String:

    // Create student
    Student myStudent = new Student(34, "Maximilian", "Schiedermeier");

    // Export student
    String jsonString =
        "{\n"
            + "\t\"age\": " + myStudent.getAge()
            + ",\n\t\"firstName\": \"" + myStudent.getFirstName()
            + "\", \n\t\"lastName\": \"" + myStudent.getLastName()
            + "\"\n}";
    System.out.println(jsonString);

But what if I need to export another object? What if the object structure changes?

The GSON library

A lot easier would be to reuse the existing Google GSON library:

import com.google.gson.Gson;

class MainWithGson {

  public static void main(String[] args) {

    // Create student
    Student myStudent = new Student(34, "Maximilian", "Schiedermeier");

    // Export student
    String jsonString = new Gson().toJson(myStudent);
    System.out.println(jsonString);
  }
}

However, we are now using code that is not ours, and the compiler, as well as the JVM, need to know about this dependency.
- Download the Gson library JAR file (here: gson-2.11.0.jar).
- This time we compile with the -cp (classpath) argument, telling the compiler that there are additional classes to consider: javac -cp gson-2.11.0.jar *.java
- Likewise, when running the compiled bytecode, the JVM must know about the GSON library: java -cp gson-2.11.0.jar:. MainWithGson (on Windows, the classpath separator is ; instead of :)

What could possibly go wrong?

By re-using the Google GSON library we have created a "dependency". Without that library at hand, our code can be neither compiled, nor executed.
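
Whether a dependency is actually present at runtime can be probed without using it directly. The sketch below, with the hypothetical class name DependencyProbe, simply asks the class loader for a class by name; the name com.google.gson.Gson is taken from the import statement shown earlier.

```java
final class DependencyProbe {

  /** Returns true if a class with the given fully qualified name can be loaded. */
  static boolean isOnClasspath(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // The class (or something it needs) is missing from the classpath.
      return false;
    }
  }

  public static void main(String[] args) {
    // Prints true only when the Gson JAR was passed via -cp.
    System.out.println("Gson available: " + isOnClasspath("com.google.gson.Gson"));
  }
}
```

Running it with java DependencyProbe and again with the Gson JAR added via -cp shows the difference between a satisfied and a missing dependency.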

The problem with JARs

JARs are a straightforward way to pass around functionality, but as projects grow, several issues tend to persist:

The more dependencies you have, the more JARs you carry with you.
- Where to store the JARs? In the repo? What if you need the same JAR in multiple projects, do you store them twice?
- Every time a new developer joins the project you need to pass on all the JARs and have them manually extend their classpath.
- Just compiling your project becomes somewhat tedious, because you always have to check that a long list of dependencies is correctly installed.
- The client complains that your software is not running. Most likely they forgot to install a JAR, or installed the wrong version. How do you find out which one it is?
A JAR is a snapshot, it is one fixed version.
- What if a security vulnerability is found in a JAR you've downloaded? How would you know?
- You lost a JAR that you need to build your project, where do you find it again? Which version was it again that works with your project?

A true horror story

In a previous research lab we had a piece of software that was particularly hard to work with. Before a developer could even write a single line of code, they needed to spend at least 30 minutes to 1 hour on manual project configuration. The project even had JARs where no one knew exactly where they came from, whether they were still needed, or what exactly they were contributing. There was a rumor that an intern, who had been around three years earlier, had created the JARs. But the intern was long gone and no one had contact information. At the same time, these were fat software artefacts that bloated our software executable.
Countless developer hours were wasted, because of poor dependency management.

Dependency management

Dependency management aims to eliminate all of the aforementioned issues: instead of manually managing JAR files, you specify which dependencies exist (and where to get them).

In essence, the ingredients for any dependency management tool are:

An online repository, systematically archiving all versions of all libraries
A local configuration file, describing for every dependency:
- A unique identifier, e.g. "Google GSON library"
- The specific version, e.g. "2.11.0"

Advantages:

Configuration files are textual and lightweight. They can be stored in the project itself.
Configuration files are written in a machine-interpretable syntax. A tool can collect all dependencies for you and even modify the classpath when needed.
You have a clear trace of all exact dependency versions. You can easily scan your project for security vulnerabilities.
No damage is done if you lose a library JAR, you can easily retrieve it again from the repository.

Maven

Maven is a build system for Java that offers exactly these two components:

A central repository, with almost every java library ever created: mavencentral.org
A project configuration file that (among others) lists all project dependencies: pom.xml
- POM stands for "Project Object Model"
- XML is a machine-readable file format
- A dependency is stated as:
```
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.11.0</version>
</dependency>
```

Instead of ourselves downloading JAR files and placing them on the classpath, we ask maven to ensure all listed dependencies are in place.

Never ever

Never ever manually interfere with dependency management in a maven-ready project. If you need an additional library, edit the pom.xml, but never ever drag-and-drop a JAR file into your project, or edit the classpath.

Repositories

The local repository:

Maven also maintains a local repository on your computer, the ~/.m2 directory. Every library you ever used is cached in this directory.
The local repository has two purposes:
- Performance: It is faster to reuse a cached JAR file, than to download it from the internet every time
- Offline mode: You might not be online all the time. With the dependencies cached, you can develop without an internet connection

Third party repositories:

You might encounter situations where you need a library that is not in the official maven central repository.
Examples:
- Libraries that are not free to use, and therefore not publicly accessible
- Your own libraries, that you do not want to upload
Anyone can set up their own repository
- An online repository is just a few files accessible over an HTTP webserver
- However, by default maven does not know about third-party repositories. If you want maven to search your own repository, you need to edit the pom.xml file and indicate the location of your third-party repository.
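
That a repository is just files behind a web server becomes clearer when looking at how coordinates translate into a download location. The following is a minimal sketch of the standard repository layout; the class and method names are hypothetical, and the base URL is the one that appears in maven's download log.

```java
final class RepositoryPathSketch {

  /** Maven coordinates of one artifact, as they appear in a dependency block. */
  record Coordinates(String groupId, String artifactId, String version) {}

  /** Builds the URL of the artifact's JAR below the given repository base URL. */
  static String jarUrl(String repositoryBase, Coordinates c) {
    // Dots in the groupId become directory separators.
    String groupPath = c.groupId().replace('.', '/');
    String fileName = c.artifactId() + "-" + c.version() + ".jar";
    return repositoryBase + "/" + groupPath + "/" + c.artifactId()
        + "/" + c.version() + "/" + fileName;
  }

  public static void main(String[] args) {
    Coordinates gson = new Coordinates("com.google.code.gson", "gson", "2.11.0");
    System.out.println(jarUrl("https://repo.maven.apache.org/maven2", gson));
  }
}
```

Pointing the same method at a different base URL gives the location maven would look up in a third-party repository.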

Maven's dependency resolution algorithm

To build a project, maven tries to satisfy all dependencies with corresponding artifacts (the JAR files, and some metadata). To satisfy a dependency, maven will:

First check the local .m2 repository for a cached file.
If not cached, it will check if any third-party repo is defined. (Usually there are none defined)
Contact the official maven repository servers to retrieve the needed artifact

flowchart LR
    resolve[\Resolve dependency/]
    resolve --> localcheck{Artifact in local repo ?}
    localcheck -.  yes .-> done([Success])
    localcheck ==>|no| remotecheck{3rd party repo defined ?}
    remotecheck -.  yes .-> 3rdpartycheck{Artifact in 3rd party ?}
    3rdpartycheck -.  yes .-> done
    3rdpartycheck -.  no .-> centralcheck{Artifact in central ?}
    remotecheck ==>|no| centralcheck
    centralcheck ==>|yes| done
    centralcheck -.  no .-> fail([Fail])

What happens when a project is built for the second time?

Maven will already have all dependencies cached. It will take the topmost path.
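
The lookup order can be sketched in a few lines of code. This is a simplified in-memory model, not maven's actual implementation: repositories are plain sets of artifact names, and all class, enum and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ResolutionSketch {

  /** Where an artifact was eventually found. */
  enum Source { LOCAL_CACHE, THIRD_PARTY, CENTRAL, NOT_FOUND }

  private final Set<String> localCache = new HashSet<>();
  private final List<Set<String>> thirdPartyRepos;
  private final Set<String> central;

  ResolutionSketch(List<Set<String>> thirdPartyRepos, Set<String> central) {
    this.thirdPartyRepos = thirdPartyRepos;
    this.central = central;
  }

  /** Resolves one artifact, caching it locally once it was found remotely. */
  Source resolve(String artifact) {
    if (localCache.contains(artifact)) {
      return Source.LOCAL_CACHE; // topmost path of the flowchart
    }
    for (Set<String> repo : thirdPartyRepos) {
      if (repo.contains(artifact)) {
        localCache.add(artifact);
        return Source.THIRD_PARTY;
      }
    }
    if (central.contains(artifact)) {
      localCache.add(artifact);
      return Source.CENTRAL;
    }
    return Source.NOT_FOUND; // the build fails
  }
}
```

Resolving the same artifact twice shows the effect of the cache: the first call reports CENTRAL, the second LOCAL_CACHE.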

Maven in action

We'll now cover some basic usage scenarios for maven.

Maven project layout

Maven projects stipulate a specific internal structure.
We are not going to create the project structure manually, but use maven to initialize our projects:

  mvn archetype:generate \
  -DgroupId=ca.uqam.info \
  -DartifactId=MavenHelloWorld \
  -DarchetypeArtifactId=maven-archetype-quickstart \
  -DinteractiveMode=false

Note: Some shells (e.g. the Windows command prompt) do not understand \ as a line continuation. There, remove the \ and place everything on a single line.

Let's take apart the above command:

archetype translates to "we want to use a project template"
- There are different archetypes, for different purposes. E.g. for a webapp, or server backend we would have used a different archetypeArtifactId.
Similar to any dependencies you might need, your own software should have a unique identifier. Other developers might actually end up using your software as a library!
- groupId represents an organization-specific string, usually this is just the reversed domain name of the company you are working for. Since we are all at UQAM's computer science department we use ca.uqam.info
- artifactId stands for the software you are building. It should be a descriptive name, indicating what your software does.

Once executed the above command will have created the following folder and file structure:

MavenHelloWorld/
├── pom.xml
└── src
    ├── main
    │   └── java
    │        └── ca
    │            └── uqam
    │                └── info
    │                    └── App.java
    └── test
        └── java
            └── ca
                └── uqam
                    └── info
                        └── AppTest.java

12 directories, 3 files

For now, we are only interested in the pom.xml and the initial class file App.java. We will deal with tests in a later lecture.

Initial App class

The initial App class is just a stub HelloWorld class:

package ca.uqam.info;

/**
 * Hello world!
 *
 */
public class App {
  public static void main(String[] args) {
    System.out.println("Hello World!");
  }
}

Package structures

Notice how the initial groupId argument has affected the project's package naming and internal folder structure?
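
The relation between groupId, package name and folder structure is purely mechanical. Here is a minimal sketch, assuming the package equals the groupId as in the generated project; the class and method names are hypothetical.

```java
final class PackagePathSketch {

  /** Returns the fully qualified class name for a class in the groupId's package. */
  static String qualifiedName(String groupId, String className) {
    return groupId + "." + className;
  }

  /** Returns the location of the source file inside a maven project. */
  static String sourcePath(String groupId, String className) {
    // Every dot in the package name corresponds to one folder level.
    return "src/main/java/" + groupId.replace('.', '/') + "/" + className + ".java";
  }

  public static void main(String[] args) {
    System.out.println(qualifiedName("ca.uqam.info", "App"));
    System.out.println(sourcePath("ca.uqam.info", "App"));
  }
}
```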

Initial pom file

The initial pom file, as created by the archetype, looks as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>ca.uqam.info</groupId>
    <artifactId>MavenHelloWorld</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>MavenHelloWorld</name>
    <url>http://maven.apache.org</url>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>

We already see a first dependency entry, namely for junit.

In the spirit of good software development, maven assumed that we will test our software.
However, junit is not part of standard java. Hence, we need a dependency block.

Anything peculiar about the dependency block?

The junit dependency block actually has an additional <scope>test</scope> entry. This is because maven makes a distinction between dependencies needed to build and run the software itself, and dependencies only needed to compile and run its tests. Junit is not needed at runtime, therefore maven added a test scope tag.
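
The effect of a scope can be pictured as a filter over the dependency list. The sketch below is a simplified model with hypothetical names, not maven's actual logic: it only distinguishes the test scope from the default compile scope.

```java
import java.util.List;

final class ScopeSketch {

  /** One dependency entry; scope is "compile" when no scope tag is given. */
  record Dependency(String artifactId, String scope) {}

  /** Returns the artifacts that belong on the classpath of the application itself. */
  static List<String> mainClasspath(List<Dependency> dependencies) {
    return dependencies.stream()
        .filter(d -> !"test".equals(d.scope())) // test-only libraries are left out
        .map(Dependency::artifactId)
        .toList();
  }

  public static void main(String[] args) {
    List<Dependency> dependencies = List.of(
        new Dependency("gson", "compile"),
        new Dependency("junit", "test"));
    System.out.println(mainClasspath(dependencies));
  }
}
```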

Building with maven

Let's use maven to build the project, that is, create java bytecode. The corresponding command is mvn package.

The first time you run mvn package, you'll actually see how maven downloads junit, along with the plugins it needs for the build.

There will be some logging messages:

...
Downloading from central: 
  https://repo.maven.apache.org/maven2/org/apache
  /maven/surefire/common-java5/3.2.5/common-java5-3.2.5.pom
Downloaded from central: 
  https://repo.maven.apache.org/maven2/org/apache
  /maven/surefire/common-java5/3.2.5/common-java5-3.2.5.pom
(2.8 kB at 156 kB/s)
...

Once the command is finished, we'll find a new directory target, with the following content:

target/
├── MavenHelloWorld-1.0-SNAPSHOT.jar
├── classes
│        └── ca
│              └── uqam
│                     └── info
│                            └── App.class
...

21 directories, 10 files

Among other files, the target directory contains exactly what we could have created manually, using the java compiler:
- A jar file
- Class files for our source code

Running Maven artifacts

Running the generated artifacts is almost identical to running manually created binaries.

Class files

We can run the generated class files without issues. Note however, that we must be at the package structure's root to call our program:

Calling App.class program from wrong location:

$ cd target/classes/ca/uqam/info; java App
Error: Could not find or load main class App
Caused by: java.lang.NoClassDefFoundError: App
(wrong name: ca/uqam/info/App)

Calling App.class program from package root location:

$ cd target/classes/
$ tree
.
└── ca
    └── uqam
        └── info
            └── App.class
$ java ca/uqam/info/App
Hello World!

Jar files

Running the jar file is not possible without specifying the main class, as by default the manifest does not contain a reference to a launcher class.

Trying to run jar file without arguments:

$ cd target; java -jar MavenHelloWorld-1.0-SNAPSHOT.jar
no main manifest attribute, in MavenHelloWorld-1.0-SNAPSHOT.jar

When we inspect the jar internal MANIFEST file, we see there is no launcher specified:
```
Manifest-Version: 1.0
Created-By: Maven JAR Plugin 3.4.1
Build-Jdk-Spec: 22
```

Running the jar file by placing it on the classpath and naming the main class explicitly:

$ java -cp MavenHelloWorld-1.0-SNAPSHOT.jar ca.uqam.info.App
Hello World!

Note: Maven of course offers a way to integrate a working MANIFEST into the produced jar file. More on that in a bit.

A clean build

The target directory accumulates all artifacts ever built. If you modify your code or pom.xml and rebuild, new files might be added, and it can be confusing to distinguish between old and new files. A good trick is to always use the clean argument before building, which wipes the entire target directory. Build your project systematically with **mvn clean package**.
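
Conceptually, cleaning is nothing more than a recursive delete of the target directory. The following is a minimal sketch of that idea, not maven's implementation; the class name and the demo method, which works on a throwaway temp directory, are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class CleanSketch {

  /** Deletes the directory and everything below it, children before parents. */
  static void deleteRecursively(Path directory) {
    if (!Files.exists(directory)) {
      return; // nothing to clean
    }
    try (Stream<Path> paths = Files.walk(directory)) {
      // Reverse order lists the deepest paths first, so folders are empty when deleted.
      paths.sorted(Comparator.reverseOrder()).forEach(path -> {
        try {
          Files.delete(path);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates a stale build tree in a temp directory, cleans it and reports success. */
  static boolean demo() {
    try {
      Path target = Files.createTempDirectory("target");
      Path classes = Files.createDirectories(target.resolve("classes/ca/uqam/info"));
      Files.writeString(classes.resolve("App.class"), "stale build output");
      deleteRecursively(target);
      return !Files.exists(target);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```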

Maven plugins

Apart from downloading and caching dependencies for use on the local classpath, maven also has a second purpose: modifying the build pipeline.

By default, all that happens on mvn clean package is the standard build: compiling the source files (using any specified libraries for the process), running the tests, and packaging the result into a JAR.
But most of the time you want to do more, e.g. produce human-readable documentation, or create a build artifact with all dependencies included.
Maven's behaviour regarding the build pipeline can be modified with plugins.

A plugin is a short (or sometimes not so short) snippet in a dedicated plugins section of the pom.xml. There can be as many plugins as you want in the pom.xml:

<project>
    <build>
        <plugins>
            <!-- First plugin details -->
            <plugin>
                ...
            </plugin>
            <!-- Second plugin details -->
            <plugin>
                ...
            </plugin>
            ...
        </plugins>
    </build>
</project>

Every plugin has a default location in the build pipeline, because most tasks only make sense at a given moment of the process.
Example: building a jar with all dependencies inside should happen at the end, after all classes are compiled, all tests have passed etc.
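
The idea of a fixed position in the pipeline can be sketched with an ordered list of phases. The enum below lists only a simplified subset of maven's default lifecycle phases, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LifecycleSketch {

  /** Simplified subset of the default lifecycle, in execution order. */
  enum Phase { VALIDATE, COMPILE, TEST, PACKAGE, VERIFY, INSTALL, DEPLOY }

  /** Returns all phases that run when the given phase is requested. */
  static List<Phase> phasesUpTo(Phase target) {
    List<Phase> executed = new ArrayList<>();
    for (Phase phase : Phase.values()) {
      executed.add(phase); // earlier phases always run first
      if (phase == target) {
        break;
      }
    }
    return executed;
  }

  public static void main(String[] args) {
    // Requesting "package" implies everything before it.
    System.out.println(phasesUpTo(Phase.PACKAGE));
  }
}
```

A plugin bound to the package phase therefore only runs after compilation and tests have succeeded, which matches the jar example above.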

In a future lecture we'll look in more detail at how plugins work, and at the points in the build process where maven lets plugins hook in. For now we'll look at some short, useful plugin examples.

Exec

The exec plugin lets you specify a main class for your code, that should be called by default when the code is executed.

This is closest to the infamous green triangle (" ▶")
All you need to do is point to the main class to be called on execution:

<!-- Specify main class for exec goal -->
<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>exec-maven-plugin</artifactId>
    <version>1.6.0</version>
    <executions>
        <execution>
            <goals>
                <goal>java</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <mainClass>full.package.name.YourMainClassLauncher</mainClass>
    </configuration>
</plugin>

Once the plugin is defined, you can conveniently run your program with: mvn clean compile exec:java

Add an IDE maven run configuration

Once the exec plugin is defined in your pom.xml, modify the IDE's "Run Configuration" (a.k.a. what is called when the green triangle is clicked) to simply call maven's exec plugin!

Maven Jar

The Maven jar plugin allows you to add additional information when your program is packaged into a JAR.

Previously we've seen that a maven-produced JAR cannot be launched without explicitly stating the main class.
The maven-jar-plugin allows you to specify which main class should be listed by default in the JAR's manifest.

<!-- specify main class for JAR manifest-->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>3.2.0</version>
    <configuration>
        <archive>
            <manifest>
                <mainClass>full.package.name.YourMainClassLauncher</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>

JavaDoc

In the second lab session you've learned a command to manually extract all JavaDoc information from your code, to generate a human-readable website. The JavaDoc plugin lets you automate this step, as a standard component of the build process.

Enabling the JavaDoc plugin is also a good practice, as you directly see whether there are issues in your code style, whenever you compile your code.
Ideally the plugin is configured to fail on warnings, so no developer is ever tempted to work with or produce undocumented code.
- "I'll document that later", easily turns into "I'll document that never."

<!-- Plugin to ensure all functions are commented and generate javadoc -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-javadoc-plugin</artifactId>
    <version>3.4.1</version>
    <configuration>
        <javadocExecutable>${java.home}/bin/javadoc</javadocExecutable>
        <reportOutputDirectory>${project.reporting.outputDirectory}/docs</reportOutputDirectory>
        <failOnWarnings>true</failOnWarnings>
        <quiet>true</quiet>
    </configuration>
    <executions>
        <execution>
            <id>attach-javadocs</id>
            <goals>
                <goal>jar</goal>
            </goals>
        </execution>
    </executions>
</plugin>
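
For reference, this is roughly what documented code looks like. The class Greeter is a hypothetical example, not part of the lecture code; in a real project it would be public and live in its own file, so that the javadoc tool picks it up with its default settings.

```java
/**
 * Utility for building greetings. Hypothetical example class for illustration only.
 */
final class Greeter {

  /** Utility class, not meant to be instantiated. */
  private Greeter() {}

  /**
   * Builds a greeting for the given name.
   *
   * @param name the name to greet, must not be null
   * @return the greeting text, for example "Hello, World!"
   * @throws IllegalArgumentException if name is null
   */
  static String greet(String name) {
    if (name == null) {
      throw new IllegalArgumentException("name must not be null");
    }
    return "Hello, " + name + "!";
  }
}
```

Every class, constructor and method carries a comment, and every parameter, return value and exception is described with its own tag.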

Use a snippet library

Most developers do not manually create their pom.xml line by line, but stitch it together from prepared blocks. Use a snippet library, e.g. https://m5c.github.io/MavenSnippetLibrary/ to rapidly create a working build pipeline.

Literature

Inspiration and further reads for the curious minds: