5 Infrastructure
Yes, there is a third area besides your research and carpentry-level programming that you should get an idea of. Again, you do not have to master hosting servers or even clusters, but a decent overview and a sense of when to use what will help you tremendously to plan ahead.
Admittedly, unless your favorite superhero is Inspector Gadget or you have always had a knack for Arduinos, Raspberry Pis or the latest beta version of the software you use, infrastructure may be the one area researchers perceive as overhead.
Once I was troubleshooting a case in which a colleague was unhappy with the speed gain from using the university's heralded cluster. It turned out he had passed his behemoth of a computation problem on to the login node of the ETH cluster. Unsurprisingly, the bouncer was not particularly good at math.
Before you laugh: do you know how many cores, how much disk space or how much RAM your own notebook has? I have seen many bright researchers who were not able to tell. Note, I am not writing this to make you feel bad or to put you in front of another fat rock. There is a reason why we have sysadmins, database admins and other specialists, and I am not asking researchers to become one of them. A lot of developers lack a decent understanding of infrastructure, too.
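If you are unsure yourself, a few standard commands will tell you. The ones below are for Linux; macOS and Windows have their own equivalents.

nproc     # number of CPU cores
free -h   # total and available RAM, human readable
df -h     # disk space per file system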
While specialization has been the story of the last few decades, there is a strong case for hybrid profiles. In software development, DevOps is such a sought-after hybrid profile: a software developer with knowledge of operations, or a server operator who is able to develop programs.
The following sections intend to share insights that help you set up a playground for trying out new things, get a rough idea of the computation hours you need, or simply put you in a better position to discuss your needs with your organization's High Performance Computing (HPC) group.
5.1 Benchmarking
Benchmarking refers to observing computational performance in order to evaluate the efficiency of a program or simply to assess the computation hours a task will need. Some programs benefit from parallel computing a lot more than others, and speed gains are not always linear in the resources you add. Being able to set up a basic toy example and get a rough idea of how long a computation might take is a very valuable skill for an empirical researcher. And it's up for grabs. Really.
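As a minimal sketch: the shell's built-in time command wraps any program and reports how long it ran. Here, simulation.R is a hypothetical stand-in for whatever script you want to profile; run a downscaled toy version of your job and extrapolate from there.

# measure wall clock ('real'), user and system time of a single run
time Rscript simulation.R

# repeat the measurement a few times to smooth out random fluctuation
for i in 1 2 3; do time Rscript simulation.R; done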
5.2 Your Local Machine / localhost (It’s NOT always about Computing Power!)
A simple benchmark can help you assess whether your local machine, i.e., your desktop or notebook computer, is sufficient for your 'big' data project. Maybe an overnight computation with home court advantage is a better solution than an away game in a cloud environment you have never worked with before.
Also, it is important to understand that it is not always about computing power. Your local computer can be a great testing and development environment for many things that might run elsewhere in production: a report rendering service, a website and much more. Many of today's applications run on some kind of webserver or in a web-based architecture. Remember: a server does not necessarily look like a fridge with an emergency power supply. It's just a program that listens. Such a program can run on a fat rack and on your little notebook alike, which allows you to test an application locally. The popular static website generator hugo, for example, ships with a little webserver to test things at home before going public:
hugo serve
When run inside a folder that contains a hugo-based website, the command spins up a local webserver and exposes the current state of the site through port 1313. This allows the developer to visit the site with their favorite web browser while it is running on their local machine.
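You can convince yourself that something is indeed listening on that port without a browser, too; curl fetches the rendered start page from the local server:

curl http://localhost:1313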
5.3 Where Should I Host?
But what if you do not want to run your stuff locally? Maybe because you want to expose a web application or website that has to be online 24/7. Maybe because you need more computing power than your own machine can offer.
In principle, there are four options:
Bare Metal. Buy hardware, hook it up to the internet, install your software, have fun. If you have a consistently high usage rate and can maintain the server efficiently, this is likely the cheapest computation hour you will get. You saw that 'if', didn't you?
On-Premise. Use physical or virtual hardware and have your local IT department host your service on-premise (in-house). You give up a little bit of your flexibility in exchange for help with maintenance and server administration.
Cloud Computing. Rent storage and computing power on demand from a large provider (see the next subsection).
Software-as-a-Service (SaaS). Let a provider run the entire application for you (see section 5.4).
5.3.1 Cloud Computing
Cloud computing is the on-demand flavor of hosting. It continues to be popular with users looking for storage and computing power, particularly when usage has high peaks and long valleys of idle time.
From a business perspective, cloud computing is one of the most compelling business cases of the past few decades. Economies of scale are just tremendous for companies with the right (read: massive) infrastructure. Amazon, which happened to have such infrastructure thanks to its global retail business, developed into one of the leading cloud computing companies, turning hosting into one of its fastest growing and most profitable branches.
For users, cloud computing is just convenient. Yet, pound-for-pound, cloud computing resources used around the clock are more expensive than traditional hosting. Also, limiting your budget is not trivial, especially if you are sharing the costs with multiple stakeholders inside your organization.
5.4 Software as a Service (SaaS)
Software-as-a-Service is the most hands-off of the four options: a provider runs the entire application, from the hardware up to software updates, and you simply log in through your web browser. Think of services like Overleaf or GitHub that many researchers use daily.
5.5 Virtual Machines
Just like any software, tools like hugo have their dependencies. Depending on your operating system and the software you want to run, dependencies may become an issue: they can conflict with other software or with each other, e.g., one program might depend on Python 2 while another depends on Python 3, and so forth. The more stuff you install, the higher the chance that you run into a conflict or some hard-to-install piece of software.
One good solution to these issues is virtual machines. A virtual machine (VM) is basically a computer inside a computer, like a Linux environment running on your Windows notebook. VMs can run locally using software such as Oracle's free VirtualBox, which allows you to run a virtual Windows or Linux inside macOS and vice versa. Running a VM locally may not be the most performant solution, but it lets you keep several test environments without altering your main environment. These days, though, virtual machines are mostly known as the flexible computation nodes offered by (cloud) hosting providers. Automated, scripted procedures to set up and spin up these virtual servers make VMs a very powerful, reproducible and scalable approach.
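Vagrant is one popular free tool for exactly such scripted VMs (it is not covered further here; the box name below is a real example box from the Vagrant documentation). A minimal sketch, assuming Vagrant and VirtualBox are installed:

vagrant init hashicorp/bionic64   # writes a Vagrantfile describing an Ubuntu box
vagrant up                        # downloads the box and boots the VM
vagrant ssh                       # logs you into the running machine
vagrant destroy                   # shuts the VM down and deletes it again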
5.6 Docker (Containers)
At first glimpse, containers look very much like virtual machines to the practitioner. The difference is that every virtual machine brings along its own operating system, while containers share the host OS and run on top of a container engine. This makes containers very lightweight: they spin up within a few seconds, while booting a virtual machine can take minutes, just like booting a physical computer. Hence containers are often used as single purpose environments: fire up a container, run a task in that environment, store the results outside of the container and shut the container down again.
Docker, the most popular container solution, has almost become a synonym for environments configured in a file. A Dockerfile, essentially a text file, describes the operating system and the software packages installed inside the container. From such a file you build a docker image, a blueprint from which any number of containers can be started; images are built layer by layer on top of other, less specific images. Containers run inside a docker host and can be used either interactively or in batch mode, executing a single task in an environment specifically built for this task.
One of the reasons why docker is attractive to researchers is its open character: Dockerfiles are a good way to share a configuration as a simple, reproducible script. Less experienced users also benefit from the Dockerhub platform, which hosts a plethora of pre-built images, from ready-to-go databases to Python machine learning environments to minimal Linux systems. Side-effect-free working environments for all sorts of tasks are especially appealing to developers with limited experience in system administration.
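To make the workflow concrete, here is a minimal sketch. The r-base image is a real, official R image hosted on Dockerhub; the image tag my-r-env and the script analysis.R are hypothetical stand-ins for your own setup.

# write a Dockerfile describing the environment
cat > Dockerfile <<'EOF'
# build on the official R base image from Dockerhub
FROM r-base
# add the packages your task needs
RUN Rscript -e 'install.packages("data.table", repos = "https://cloud.r-project.org")'
EOF

# turn the Dockerfile into an image and tag it
docker build -t my-r-env .

# fire up a container, run one task, keep the results outside the container
# (the current folder is mounted into /work), and remove it again (--rm)
docker run --rm -v "$PWD:/work" -w /work my-r-env Rscript analysis.R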
Besides simplifying system administration, docker is known for its ability to work in the cloud. All major cloud providers offer docker hosts and the ability to deploy docker containers that were previously developed and tested locally. You can also use docker to tackle throughput problems with container orchestration tools such as Docker Swarm or K8s (say: Kubernetes), which can run hundreds of containers, depending on your virtual resources.
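To give a rough idea of what such orchestration looks like, here is a sketch using Docker Swarm on a single machine. The service name web is arbitrary; nginx is a real webserver image from Dockerhub.

docker swarm init                                  # turn this machine into a one-node swarm
docker service create --name web --replicas 5 -p 8080:80 nginx
docker service scale web=10                        # scale the service up without downtime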