Continuous Integration

In this unit we discuss the underlying concepts of server-side continuous integration practices, using GitLab as an example. You will learn how to create a reliable container environment for a Maven project and how to integrate automatic quality checks on every pushed commit.

Lecture upshot

GitLab runners allow you to systematically run the entire Maven build lifecycle on every code modification. Direct feedback on the quality checks performed on every commit is essential for ensuring a reliable and stable codebase.

CI definition and motivation

Continuous Integration (or short "CI") aims at supporting project success by having developers regularly send their code to a central repository and performing automated builds and tests.
In previous lectures you've already seen two building blocks to support these goals:
- Build systems, e.g. Maven: highly configurable, with a textual definition of the build process and the code quality requirements.
- Version control systems, e.g. Git: synchronize the work of developers who use separate machines.

A need for CI configurations

Both provide an essential building block for CI, but:
- There is no guarantee that developers reliably run the build system before pushing code.
- There is no direct visual feedback on the build results when work is combined.
CI is most relevant when combining work, i.e. when pushing commits or merging branches. In essence, you want to...
- enforce that developers use branches for features, and cannot directly work on main.
- be sure the work about to be merged into main is not breaking anything previously working.
Without a CI configuration, you have no guarantee about the state of your main branch, in the worst case:
- Your TP submission misses points, because you did not submit something well tested and working.
- You present recent progress to a client, but the demoed version actually has fewer working features because the last fast push broke everything.
- You send out a security patch, and now millions of PCs are caught in an infinite boot loop.

CI configurations are a safety mechanism

A good CI configuration ensures one thing: a protected main branch that cannot be corrupted, neither by honest mistake nor by blatant recklessness. Whatever happens, your project only goes forward, never backward.

In the remainder of this lecture we'll take a look at how to set up various protection mechanisms using GitLab.

Server sided checks example

Trust is good, control is better: The only way to reliably know whether a commit is "good", is to assess it on server side.
What checks should be run on server-side to assess "good" ?
- Does the software actually compile?
- Is it documented?
- Does it respect checkstyle?
- Is it tested, is the coverage high enough?
- Is there clutter in the repo, e.g. class files?
- ...
As we'll see shortly, exactly this is possible with GitLab:
- Once configured correctly, GitLab will provide you with fine-grained information on various quality checks for each commit:
- Based on the test results, we directly get a sense for a commit's quality:
  - All checks Passed
  - Some checks with Warning
  - At least one check Failed

Note: With "checks", we're not only referring to unit tests, but all thinkable code quality checks (including unit tests).

Merge, don't push

A key protection for any repository is to prohibit direct pushes to main.
- See: Settings -> Repository -> Branches -> Protected branches
- At the start of the semester, main was protected (you were not able to push)
- Now at the end we'll go back to that habit (because you now know how to create branches and merges)

Whatever feature is added must first be pushed to a branch.

If someone attempts to push directly to main:

$ git commit -m "reckless direct push to main"
...
$ git push
...
remote: GitLab: You are not allowed to push code to protected branches on this project.
error: failed to push some refs

Instead of merging code locally and then pushing, you'll merge on server side, using merge-requests.

Merge requests

Merge requests translate to:
"I've created something useful on a branch, please add it to main".
- Often the person actually merging is not the developer.
- To initiate the process, the developer creates a new merge request. (GitLab webui, big banner)
Ideally, the merge request itself offers all information on the server-side checks at a glance.

The actual CI merit

It does not matter if the code works on the developer's machine, it only matters if the code works for the client. Server-side checks give peace of mind to whoever has to decide on a merge request.

Containers as CI background

The idea of server-sided checks is charming.
But testing the software also means the server must be able to compile and run the software.
- Reminder: Dynamic testing requires code execution.
We cannot run server side tests, unless the server has:
- The source code
- An operating system, allowing us to run code
- All program related SDK requirements: Maven, Java compiler, JVM
GitLab naturally has the source code, but not the environment.

How to provide an environment

The classic approach of manually installing a software development environment is not viable.
You cannot walk up to the server running GitLab and install Maven, a Java compiler, and a JVM yourself:
- Time-consuming
- Requires root access
- Not reliably replicable
In the classic, native approach you have a stack of three components:
- Software
- Libraries needed for software
- Operating system providing kernel to run libraries and software.
There are two ways of interfering with this stack to obtain software, without installing requirements manually. Both modify the above native stack:

Virtual machines

Virtual machines are a snapshot of an entire operating system, i.e.
- An OS kernel (e.g. windows)
- All libraries
- The actual software
Shipping a virtual machine is reliable, however:
- Performance drop: An intermediate hypervisor is needed to run an entire guest OS on top of the existing host system.
- Voluminous: An entire OS is shipped along with the few software components actually needed.

The JVM is not that kind of VM.

The JVM is not to be confused with operating system VMs. The JVM only executes Java bytecode. Operating system VMs can run any machine code, that is, any software the simulated machine could run (e.g. a JVM).

Docker containers

Docker containers are a response to operating system virtual machines.
Docker containers only provide the required libraries and software, and reuse the existing host OS kernel.
Usually not even the container itself is shipped, only instructions on how to create it step by step (also called Images).
Compared to VMs, docker images are:
- Small in size.
Compared to VMs, docker containers are:
- Almost as performant as the host system (e.g. the GitLab server on which they run).

GitLab context

Docker images are like blueprints, telling a machine what exactly is needed to work with a project.
A server, e.g. GitLab, can use such an image to construct a reliable environment, e.g. to obtain a java compiler, JVM, maven, etc...
- Pointing to the right image is a single line of code
- Once the image is provided, there is no need to manually install a Java compiler, JVM, Maven, etc.

Why is this useful for GitLab ?

We can tell GitLab to use an image leading to a reliable environment.
Using that environment we can assess our source code on server side.
We can assess extensively, and fast.

Gitlab CI

Almost all GitLab CI configuration is done in a single file: .gitlab-ci.yml

It only needs to exist, that is, as soon as it is in your project repo and pushed, GitLab will use it.
On every commit, GitLab will try to assess the source code based on the instructions contained in this file.

The first thing we add to the file, is the reference to the docker image to use.

In the context of INF2050, we'll always use the line: image: maven:3.9.8-amazoncorretto-21
This image leads to a container environment with:
- Java 21 (JVM + Compiler)
- Maven
Since the container runs on a linux server, we also get access to all standard linux commands !
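
To make this concrete, here is a minimal sketch of a complete .gitlab-ci.yml that only inspects the container environment. The job name print-environment-job is a made-up example. The image line is the one used in this course.

```yaml
# .gitlab-ci.yml
image: maven:3.9.8-amazoncorretto-21

# Hypothetical job name. The job only prints what the container provides.
print-environment-job:
  script:
    - java -version   # JVM shipped with the image
    - mvn --version   # Maven shipped with the image
    - uname -a        # a standard linux command, available as well
```

A job like this is a cheap first test: if it passes, GitLab found the file, pulled the image, and started a container.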

YAML syntax

Next we'll take a deeper dive into the exact way of defining the CI process for GitLab, using the file: .gitlab-ci.yml
For things to be understood by GitLab, we have to stick to the exact requested keywords.
- The file ending is yml, which stands for YAML. The acronym originally meant Yet Another Markup Language; today it officially stands for YAML Ain't Markup Language.
- YAML files, like XML or JSON files, must respect the correct formatting and keywords.
We've already seen how to specify the CI image to use.
Next we'll look into how to specify the main components of a CI configuration, using GitLab's formatting and keywords.
In detail we'll look at:
- How to define custom stages (the order of things to happen)
- How to define jobs (the exact individual things to happen)

What's the relationship between YAML and GitLab ?

GitLab uses the YAML notation for configuring the CI behaviour. There are many other YAML files, using the same syntax, but not necessarily the same keywords.

General YAML notation

Most YAML files, including the GitLab CI configuration, are dictionaries at the top level and use a key/value notation.

(Optional) document start marker: ---
(Optional) document end marker: ...
Dictionary key: foo:
List of item values: -
- Abbreviated form, only values: ['value1', 'value2', '...']
- Abbreviated form, dictionaries: { name: Max, job: Professor, age: 34 }

Example:

---
# After file start marker, enumerate all key/value pairs.
university: UQAM
course: INF2050
students: 164
# Next an entry with multiple values for same key
staff:
  - Max
  - Ahmed
  - André-Pierre
  - Felix
  - George
prerequisites: [ 'INF1070', 'INF1120' ]
...
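
The example above does not use the abbreviated dictionary form. The following sketch shows the same kind of entry once in block form and once inline. The key contact is invented for illustration; the values are taken from the notation overview.

```yaml
---
course: INF2050
# A list whose single item is a dictionary, written in block form.
staff:
  - name: Max
    job: Professor
    age: 34
# An equivalent dictionary, written in the abbreviated inline form.
contact: { name: Max, job: Professor, age: 34 }
...
```

Both forms describe the same structure. The block form is usually easier to read once entries get longer.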

YAML is a JSON superset

YAML is a JSON superset with emphasis on human readability. Every JSON file is also a valid YAML file, but not the other way round.
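
As an illustration of the superset relation, the earlier course example can be written in pure JSON notation and still be read by a YAML parser. The block below is a sketch and deliberately contains no comments, since JSON does not allow them.

```yaml
{
  "university": "UQAM",
  "course": "INF2050",
  "students": 164,
  "staff": ["Max", "Ahmed", "André-Pierre", "Felix", "George"],
  "prerequisites": ["INF1070", "INF1120"]
}
```

The reverse does not hold: the indentation-based version with `#` comments is valid YAML, but a JSON parser would reject it.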

Defining stages

Similar to maven, GitLab's CI configuration foresees a certain order of things, common to most software projects:

preparation
building the software
testing
deploying
post completion

These are called stages:

Whenever we define a new job, we state in which stage it should take place (a job without a stage is assigned to test).
- For this we use the below keywords:
```
.pre
build
test
deploy
.post
```
If we want to add additional stages, we can do so by listing them in the .gitlab-ci.yml:
- Note however, that the default stages build, test and deploy are replaced as soon as we define our own set. Only .pre and .post always remain available.
```
# Definition of custom stages, to provide an implicit job order.
stages:
  - lint
  - build
  - test
  - deploy
```
The individual jobs (which we'll define next) will each belong to exactly one stage.
- Since the stages define an order, we also refer to the CI execution as a "CI pipeline".
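
The following sketch combines a custom stage list with the built-in .pre stage, which GitLab documents as always available, even when it is not listed. Both job names are hypothetical, and the echo commands are placeholders for real work.

```yaml
stages:
  - lint
  - build
  - test

# Hypothetical job: runs before all other stages, although .pre is not listed above.
announce-job:
  stage: .pre
  script:
    - echo "Pipeline started for commit $CI_COMMIT_SHORT_SHA"

# Hypothetical job in one of the custom stages.
build-job:
  stage: build
  script:
    - echo "Building"
```

Note that a pipeline consisting only of .pre or .post jobs is not started, which is why the sketch also contains a regular job.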

Defining jobs

A job definition is in essence one or multiple commands to execute.
- We can use any of the commands provided by the container
- Since the container is built on an image for java / maven, we have the java, javac, and mvn command.
- On top we can use any standard linux command.
In YAML syntax, we have to describe
1. The job name.
2. The stage in which to execute the job.
3. The commands to be called by the job.

Example:

sample-job:
  stage: build
  script:
    - echo "Using a linux command to log something to console"

time-consuming-job:
  stage: test
  script:
    - sleep 20
    - echo "Hello, $GITLAB_USER_LOGIN!"

Runner stage behaviour

The definition of stages seems somewhat contrived, why would we ever need to define stages, and not just define all jobs in a sequence ?

There's a natural interest in the CI execution not taking overly long.
If all jobs are performed sequentially, we're potentially not making best use of the server resources.
But not all jobs should be run in parallel.
Examples:
- Testing for Checkstyle and Javadoc in parallel is ok. There is no dependency.
- Building a JAR file in parallel to compiling files is not ok. There is a dependency.

Stages allow improved resource consumption:

All jobs within the same stage are executed in parallel.
The stages themselves are executed sequentially, in the defined order. The jobs of later stages are skipped if at least one job of an earlier stage failed.

Visualization of three phases, with parallel and sequential jobs:

Source: doc.gitlab.com.
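
A sketch of the Checkstyle / Javadoc example from above, expressed as a pipeline. It assumes that the project's pom.xml configures the checkstyle and javadoc plugins. The job names and the two-stage layout are illustrative choices.

```yaml
image: maven:3.9.8-amazoncorretto-21

stages:
  - lint
  - package

# Both jobs belong to the same stage, so GitLab may run them in parallel.
checkstyle-job:
  stage: lint
  script:
    - mvn checkstyle:check -B

javadoc-job:
  stage: lint
  script:
    - mvn javadoc:javadoc -B

# Starts only after every job of the lint stage has succeeded.
package-job:
  stage: package
  script:
    - mvn clean package -B
```

Whether the two lint jobs really run at the same time depends on how many runners are available to the project.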

Maven CI configuration

The previous example was pretty pointless:

We do not just want to call linux commands...
We want to call commands that assess our source code!

Simplest case

In the simplest case, we use some of the standard linux commands to verify if the repository is free of compiled code:

check-clutter-job:
  stage: lint
  script:
    - CLUTTER=$(find . -name \*.class)
    - if [[ ! -z $CLUTTER ]]; then exit 1; else exit 0; fi

Explanation:

The first script line stores a list of all class files in a variable
The second script line checks if the variable is empty:
- Variable not empty: the job exits with 1 (job failure)
- Variable empty: the job exits with 0 (job success)
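
A possible variant of the same check is sketched below. It additionally prints the offending files to the job log, so that a failed job explains itself. The stage list is included only to keep the snippet self-contained.

```yaml
stages:
  - lint

check-clutter-job:
  stage: lint
  script:
    # Collect all compiled class files found in the repository.
    - CLUTTER=$(find . -name '*.class')
    # Fail with an explanatory message if the list is not empty.
    - |
      if [ -n "$CLUTTER" ]; then
        echo "Compiled files found in the repository:"
        echo "$CLUTTER"
        exit 1
      fi
```

If the variable is empty, the script simply ends and the job succeeds.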

Atomic maven build

Still, we're not yet making use of the container!

We've selected the image, specifically because the resulting container allows us to use maven
We already have a pom.xml configuration in our project, with many quality checks, let's use it!

image: maven:3.9.8-amazoncorretto-21

maven-job:
  stage: build
  script: "mvn clean package -B"

In this case the CI pipeline is atomic. It has just a single pipeline job, doing all the heavy lifting.

Can you use the above image for a python project?

No. The image has to match the project requirements and a java/maven image cannot be used to create a container for python processing. A Runner cannot possibly work if not all requirements are satisfied by the container.
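
For comparison, a python project would point to a python image instead. The following is only a sketch: the image tag python:3.12 and the job name are assumptions and not part of this course's setup.

```yaml
# Assumed image tag, chosen for illustration only.
image: python:3.12

# Hypothetical job name.
python-check-job:
  script:
    - python --version        # the interpreter provided by the image
    - python -m compileall .  # byte-compile all python sources in the repo
```

The structure of the file stays the same. Only the image and the commands change with the project's technology.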

Maven phases as stages

Using a pipeline with just a single command works as code quality check.
- However, if our build fails, we do not immediately see what is the issue.
- Notably for merge requests, this is inconvenient.
- It would be a lot better to have individual pipeline stages, for the individual maven phases.
The first step would be to define all maven phases as stages:

stages:
  - validate
  - compile
  - test
  - package
  - verify
  - javadoc

Afterward, we can define individual maven jobs for each pipeline step:

validate-job:
  stage: validate
  script: "mvn clean validate -B"

compile-job:
  stage: compile
  script: "mvn clean compile -B"

test-job:
  stage: test
  script: "mvn clean test -B"

package-job:
  stage: package
  script: "mvn clean package -B"
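
The stage list above also contains verify and javadoc, for which no job is shown. A sketch of the two missing jobs could look as follows, assuming the javadoc plugin is configured in the pom.xml. The job names follow the pattern of the existing ones.

```yaml
verify-job:
  stage: verify
  script: "mvn clean verify -B"

# javadoc is not a maven phase, so this job calls the plugin goal directly.
javadoc-job:
  stage: javadoc
  script: "mvn javadoc:javadoc -B"
```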

Lifecycle redundancy

By default, maven will execute the entire lifecycle until the specified phase, for every individual mvn command in a runner.
- That is quite wasteful, e.g. we only need to lint once, to ensure the code is correctly formatted.
- Better would be to run the phases individually.
Unfortunately maven does not quite allow single-phase execution.
- But there's a trick: We can carry forward build artifacts, and run individual plugins (not phases) on them.
- We'll practice this technique for resource optimization in the next lab session.

Runners are isolated

Careful though when building a sequential pipeline for individual maven phases. The artifacts added to target are not automatically carried forward from one runner to the other. For subsequent runners to work, we have to manually define which artifacts, from which previous runner to reuse.

Using build artifacts

Each runner lives in its own environment, which you can think of as a sandbox.

Files produced by one runner are not immediately visible to another runner. It is as if they were operating on two separate machines.
But sometimes you want to preserve something generated by a runner, to retain information on a commit.
Examples:
- The surefire plugin produces a test report in target
- The jacoco plugin produces a coverage report in target
- The javadoc plugin produces a navigable documentation website in target
All these generated files are lost, as soon as the creating runner goes back to sleep.

Luckily there's a configuration keyword to instruct GitLab to extract files from a runner, before it's dissolved:

job-name:
  script:
    # Run some command that produces files, e.g. maven calling javadoc plugin.
    - command-that-produces-files
  # Define which files / folders created by this runner should survive the build process.
  artifacts:
    paths:
      - folder-to-survive-runner

Next we'll take a look at how artifacts from various build stages are best used for additional feedback on your commit quality.

Test example

Usually it does not suffice to know that some test failed, you want to know exactly which tests failed.
GitLab provides a dedicated interface to display this information, however you have to provide the data.
- This is done by means of a build artifact!
- You can simply extract the test report created by maven (in your target directory), and GitLab will automatically add a new tab to each commit with the detailed test report:

If you also have a coverage plugin configured, e.g. jacoco, you can likewise extract coverage reports for additional insight.

test-job:
  stage: test
  script: "mvn clean test -B"
  artifacts:
    when: always
    reports:
      junit:
        - target/surefire-reports/TEST-*.xml
        - target/failsafe-reports/TEST-*.xml
      coverage_report:
        coverage_format: jacoco
        path: target/site/jacoco/jacoco.xml

JavaDoc example

A second artifact type that you should extract is generated documentation.

The JavaDoc plugin already creates decent, human-readable documentation in the target folder.
- Likewise, GitLab has a built-in feature to host files on a web server (GitLab Pages), for convenient browser access.
- However, unless you are an oldschool webpage coder, your website is most likely generated, not hand-written:
  - HTML files.
  - CSS files.
  - JavaScript files.
But wait! We do not want any generated files in our repository!!
- Let's use a CI pipeline job to generate the documentation on server-side, save the artifact, and only then host the documentation on the internet!
- The first thing we need to do is configure the CI job to move whatever documentation created into a folder named public.

pages:
  script:
    - mvn javadoc:aggregate
    - mkdir public
    - cp -r target/site/docs/apidocs/* public
  # Define which files / folders created by this runner should survive the build process.
  artifacts:
    paths:
      - public

Once the runner is configured, you still need to tell GitLab to actually host the javadoc on its file server:
1. Access your GitLab project on the GitLab webui
2. On the left side-bar, select Deploy -> Pages
3. Optional: Deselect the custom URL checkbox
4. Access your project's webpage.

Only on main

A caveat with documentation, is that it is only relevant for released software, i.e. no one cares about documentation for some feature still in the making on some secret branch.
That translates to: we only want to deploy documentation for commits on the main branch.
Luckily there is an extra keyword for restricting runners to specific branches:

pages:
  script:
    - ...
  # Only keyword allows restricting for which git branch the job is applied.
  only:
    - main
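
GitLab's documentation describes only as no longer actively developed and recommends the rules keyword instead. A sketch of the same restriction with rules, reusing the pages job from the JavaDoc example:

```yaml
pages:
  script:
    - mvn javadoc:aggregate
    - mkdir public
    - cp -r target/site/docs/apidocs/* public
  artifacts:
    paths:
      - public
  # Run this job only for commits on the main branch.
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
```

Both variants express the same intent. The only version shown above keeps working, so switching is optional in the context of this course.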

MISC

This course content is provided to you, with the help of a GitLab CI pipeline !

I write sources in MarkDown format
I push the sources to a GitLab repo
GitLab runs a CI pipeline that:
1. Creates a container with python support
2. Uses a python program to translate:
  - Markdowns into navigable and indexed HTML pages
  - Mermaids into SVGs
  - HTML and SVGs into a PDF
3. Stores the produced HTML/SVG/PDF artifact in the public folder
GitLab serves the course websites on its file server, at https://inf2050.uqam.ca/en/

Every time I fix a single typo, the entire CI pipeline is re-executed, and the webpages automatically update on every push. :)
