DockOne share #86: an in-depth look at DC/OS 1.8 – a highly reliable platform for managing microservices and big data

[Editor's Note] Apache Mesos is a leading cluster resource management and scheduling system, proven in large production deployments. DC/OS builds on Apache Mesos and adds a unified UI and command line, service discovery, load balancing, elastic scaling, and package management for microservices and big data frameworks, so that users can manage an entire data center as if it were a single machine. That is why DC/OS is called the data center operating system. DC/OS 1.8, released in September, is based on the latest Mesos 1.0 and introduces many new features.

People who work with containers have probably heard of DC/OS, and often assume it is just Marathon + Mesos. In fact DC/OS contains many more components. DC/OS 1.8 was released in September, and this share gives everyone an introduction to it.

The basic idea of DC/OS

DC/OS stands for data center operating system. The basic idea is to let operators manage an entire data center the way they would operate a single computer.

What technology does DC/OS use to achieve this?
In the figure, the left side is an ordinary Linux operating system and the right side is DC/OS; the comparison is made layer by layer.

Any operating system needs to manage external hardware devices. The four most important hardware resources are CPU, memory, storage, and network.

In the early days, programmers writing in assembly had to specify exactly which hardware resources to use: which register, which memory address, which serial port to read or write. They had to keep all of this in their heads; one wrong JUMP and the program would not run. This is like the early days of operating physical machines in a data center: which machine a program ran on, how much memory and disk it used, all had to be tracked by hand.

To free programmers from operating the hardware directly and improve programming efficiency, the operating system layer was introduced to manage hardware resources uniformly. Which CPU a program runs on, which part of memory and which part of the disk it uses: the program only needs to call an API, and the operating system allocates and manages the resources. In essence, the one thing an operating system does is scheduling. The data center likewise needs a scheduler, to free operators from the pain of assigning physical or virtual machines by hand, and that scheduler is Mesos. Mesos is thus the kernel of the data center operating system.

With an operating system, we can write drivers to recognize new hardware and kernel modules (such as openvswitch.ko) to intervene in how hardware is used. With Mesos, you can likewise write an Isolator to recognize new resources such as GPUs, and an Executor to intervene in how resources are used.

Above the kernel sit the system services, such as systemd, which keep processes running: after systemctl enable xxx, a service that crashes is automatically restarted. In DC/OS, the component that keeps services running long-term is Marathon. But Marathon alone is not enough, because services are started across many machines and depend on one another: when a service dies and is restarted on another machine, how do the calls between services keep working without manual intervention? This requires an additional layer, service discovery, which is mostly implemented with DNS, load balancing, virtual IPs, and similar techniques.

Using an operating system also means installing software, which is why package managers like Yum exist: they separate the software's users from its packagers. The packager knows which packages the software depends on and where it is installed; the user only needs yum install. DC/OS has such a package manager too. Whereas other container platforms require you to build your own Docker images, write your own YAML, and manage dependencies yourself, a DC/OS user only needs dcos package install; the software's configuration, node counts, and dependencies are all set by the packager.

At the outermost layer, DC/OS, like an ordinary operating system, offers a unified UI and command line. Through them you can manage packages, manage nodes, and run tasks. DC/OS is not merely a platform for running containers; if it only ran containers, it would be a container management platform, not a data center operating system. With DC/OS you can run a command across all machines for unified configuration without logging in to each one. You can run container applications and big data analytics side by side, sharing resources and discovering each other, which matches modern Internet applications, where microservices and big data are inseparable. And the Mesos architecture is very open: by writing Frameworks, Executors, Modules, Hooks, and so on, you can easily intervene in the execution of microservices or big data jobs and customize your applications. This also fits the microkernel philosophy of operating systems.

The DC/OS kernel module

The Mesos architecture is as follows:

This diagram is well known and has been explained in many articles; for the details see the article "Mesos Architecture", so I will not repeat them here.

As the figure shows, Mesos consists of a Framework (which contains a Scheduler), the Master (which contains the Allocator), Agents, Executors, and Tasks. Scheduling happens in two layers: in the first layer, the Allocator inside the Master distributes resources fairly among Frameworks; in the second layer, each Framework's own Scheduler assigns those resources to its Tasks.

Over the lifecycle of a running task, these Mesos roles relate as follows:
the Agent reports its resources to the Master; the Master offers resources to a Framework's Scheduler according to the Allocator's policy; the Scheduler may accept the offer and launch a Task; the Master passes the Task to the Agent, and the Agent hands it to an Executor, which actually runs the Task.

This diagram is relatively simple; the real process is considerably more complex. The blog "DC/OS data center operating system" analyzes the whole task launch process at the code level and includes a swim-lane diagram.
When studying Mesos, being familiar with this whole process is very important: when a task fails, it lets you locate and resolve the problem faster. Mesos splits the simple act of running a task across so many layers and roles precisely to achieve two-level scheduling and flexible configuration, which is exactly what a kernel should do.

How do we intervene in the operation of a Task?

First, write a Framework

If you want full control over how Tasks run, rather than letting Marathon keep a stateless Task running long-term, you need to write your own Framework. Inside it, you define the relationships between tasks yourself. Unlike Marathon, where Task * 3 gives you three tasks that know nothing about each other, your Framework can make one of the three Tasks the master, control their startup order, and pass the first Task's IP and location to the other two through environment variables.

Writing a Framework requires writing a Scheduler that implements some interfaces, as described in the documentation … uide / .

Then use MesosSchedulerDriver to run the Scheduler.
Under the hood, the Mesos modules communicate with each other through messages defined in Protocol Buffers. Requiring every Framework developer to learn how to exchange Protocol Buffer messages with the Mesos Master would be painful, so MesosSchedulerDriver does this for you: you only need to implement the interfaces defined on Scheduler, without knowing who calls them, when they are called, or how the messages reach the Mesos Master.

Of all the interfaces, the most important is the resourceOffers function: based on the offers (how many resources each slave has), it creates a set of tasks and calls MesosSchedulerDriver's launchTasks function, which wraps the tasks in a LaunchTasksMessage and sends it to the Mesos Master.
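To make the shape of this interface concrete, here is a minimal illustrative sketch in Python. It is not the real Mesos binding: Offer and FakeDriver are simplified stand-ins, and only the resourceOffers → launchTasks round trip is shown.

```python
# Illustrative sketch of the two calls that matter in a Framework
# scheduler: resourceOffers() receives offers, launchTasks() answers them.
# Offer and FakeDriver are simplified stand-ins for the real Mesos types.
class Offer:
    def __init__(self, offer_id, cpus, mem):
        self.id, self.cpus, self.mem = offer_id, cpus, mem

class FakeDriver:
    """Stand-in for MesosSchedulerDriver: records launched tasks."""
    def __init__(self):
        self.launched = []
    def launchTasks(self, offer_id, tasks):
        self.launched.append((offer_id, tasks))

class MyScheduler:
    TASK_CPUS, TASK_MEM = 1.0, 128.0

    def resourceOffers(self, driver, offers):
        # For each offer, launch as many tasks as the offer can hold.
        for offer in offers:
            n = int(min(offer.cpus / self.TASK_CPUS, offer.mem / self.TASK_MEM))
            tasks = [{"task_id": f"task-{offer.id}-{i}",
                      "cpus": self.TASK_CPUS, "mem": self.TASK_MEM}
                     for i in range(n)]
            driver.launchTasks(offer.id, tasks)

driver = FakeDriver()
MyScheduler().resourceOffers(driver, [Offer("o1", 2.0, 512.0)])
print(len(driver.launched[0][1]))  # 2 tasks fit a 2-CPU offer
```

In the real binding, the driver, not your code, decides when resourceOffers is invoked; your Scheduler only reacts to it.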

Second, write an Allocator

As described above, Mesos schedules in two layers; the first layer is the Allocator, which distributes resources among Frameworks.

Mesos lets users plug in their own Allocator as a module: compile it into a shared object (.so), load it at startup, and specify on the command line which Module to use.

In practice, custom Allocators are rare, because the DRF algorithm is the heart of Mesos; if you do not use it, you lose much of what makes Mesos worthwhile.

The default Allocator in the Mesos source, HierarchicalDRFAllocator, lives in $MESOS_HOME/src/master/allocator/mesos/hierarchical.hpp, and the per-Framework Sorter used by DRF lives in $MESOS_HOME/src/master/allocator/sorter/drf/sorter.cpp; read the source to understand how it works.

Basic Principles of Hierarchical DRF

The decision of which Framework receives an offer is made by the resource allocation module, the Allocator, which lives in the Master. The allocation module determines the order in which Frameworks receive offers, while ensuring that resources are shared fairly and utilization is maximized.

Because Mesos schedules resources for heterogeneous data centers, where resource requirements are also heterogeneous, allocation is harder than ordinary scheduling. Mesos therefore uses DRF (Dominant Resource Fairness).

A Framework's dominant share is the highest fraction it holds of any single resource type. The DRF algorithm computes the dominant share of every registered Framework and ensures that each Framework receives an equitable share of its dominant resource.

For example:

Consider a system with 9 CPUs and 18 GB of RAM and two users. Each task run by user A has the demand vector {1 CPU, 4 GB}; each task run by user B has the demand vector {3 CPU, 1 GB}. Each user wants to run as many tasks as possible.

In this scheme, each of A's tasks consumes 1/9 of the total CPU and 2/9 of the total memory, so A's dominant resource is memory. Each of B's tasks consumes 1/3 of the total CPU and 1/18 of the total memory, so B's dominant resource is CPU. DRF balances the users' dominant shares by running three of A's tasks and two of B's. A's three tasks consume {3 CPU, 12 GB} in total and B's two tasks consume {6 CPU, 2 GB}; under this allocation each user's dominant share is equal: A gets 2/3 of the RAM and B gets 2/3 of the CPU.

The allocation can be derived as follows. Let x and y be the number of tasks for users A and B respectively; then A consumes {x CPU, 4x GB} and B consumes {3y CPU, y GB}. A's dominant share is 4x/18 = 2x/9 and B's is 3y/9 = y/3. The DRF allocation is the solution to the following optimization problem:

  max(x, y)        # maximize allocations

  subject to

  x + 3y <= 9      # CPU constraint
  4x + y <= 18     # memory constraint
  2x/9 = y/3       # equalize dominant shares

Solving gives x = 3 and y = 2, so user A gets {3 CPU, 12 GB} and user B gets {6 CPU, 2 GB}.
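The result can also be checked numerically. The following sketch (plain Python, not Mesos code) runs DRF as progressive filling: while resources remain, it repeatedly gives the next task to the user with the lowest dominant share.

```python
# Progressive-filling DRF on the example: 9 CPUs, 18 GB RAM,
# user A tasks need {1 CPU, 4 GB}, user B tasks need {3 CPU, 1 GB}.
total = {"cpu": 9.0, "mem": 18.0}
demand = {"A": {"cpu": 1.0, "mem": 4.0}, "B": {"cpu": 3.0, "mem": 1.0}}
tasks = {"A": 0, "B": 0}

def dominant_share(user):
    return max(tasks[user] * demand[user][r] / total[r] for r in total)

def fits(user):
    used = {r: sum(tasks[u] * demand[u][r] for u in tasks) for r in total}
    return all(used[r] + demand[user][r] <= total[r] for r in total)

while True:
    # Offer the next task to the user with the lowest dominant share.
    candidates = [u for u in tasks if fits(u)]
    if not candidates:
        break
    user = min(candidates, key=dominant_share)
    tasks[user] += 1

print(tasks)  # {'A': 3, 'B': 2}
```

The loop reproduces the analytic solution: A ends with 3 tasks and B with 2, and both dominant shares reach 2/3.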

The core HierarchicalDRF algorithm is implemented in the HierarchicalAllocatorProcess::allocate function in src/master/allocator/mesos/hierarchical.cpp.

In outline, three Sorters (quotaRoleSorter, roleSorter, frameworkSorter) are used to sort all the Frameworks, which determines who receives resources first and who later.

Allocation proceeds in two passes: first, roles that have a quota are satisfied, via the quotaRoleSorter; then the remaining resources are allocated to the other roles, via the roleSorter.

Each pass sorts on two levels: the first level sorts by role; the second level sorts the different Frameworks within the same role, using the frameworkSorter.

At every level, the sort is by computed share, which determines who gets resources first and who gets them next.
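The two-level sort can be illustrated with a toy sketch. The role and framework names and share values below are made up, and real Mesos sorts by DRF dominant share with weights and quotas handled first; this only shows the ordering idea.

```python
# Simplified two-level sort: roles ordered by share, then frameworks
# within each role ordered by share. Names and numbers are made up.
roles = {
    "analytics": {"share": 0.5, "frameworks": {"spark": 0.4, "flink": 0.1}},
    "web":       {"share": 0.2, "frameworks": {"marathon": 0.2}},
}

def allocation_order(roles):
    order = []
    for role in sorted(roles, key=lambda r: roles[r]["share"]):
        fw = roles[role]["frameworks"]
        order += sorted(fw, key=fw.get)
    return order

print(allocation_order(roles))  # ['marathon', 'flink', 'spark']
```

Frameworks with the smallest shares come first at both levels, which is how fairness is enforced over time.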

There are a few easily confused concepts here: Quota, Reservation, Role, and Weight:

  • Each Framework can have a Role, used both for permissions and for resource allocation.
  • A role can return Offer::Operation::RESERVE during resourceOffers to reserve resources on a particular slave. A Reservation is very specific: which resources on which machine belong to which Role.
  • A Quota is a minimum guarantee for a Role. It is not tied to a specific node; it only guarantees that this much is available somewhere in the cluster.
  • Reserved resources are counted toward the Quota.
  • Different Roles can have different Weights.

At the end of its run, the Allocator calls Master::offer, which ultimately invokes the resourceOffers callback of each Framework's Scheduler to perform the second level of scheduling. That ties the whole flow together.

Third, write a Hook

You can write a Hook module that inserts code at many key steps, rewriting the launch process of an Executor, a Docker container, or a Task.

The points where a Hook can intervene are defined in mesos/hook.hpp.

The Hook class is defined as follows:
A commonly used hook is slavePreLaunchDockerHook, which lets you do preparatory work before a Docker container starts.

There is also slaveRemoveExecutorHook, which can do cleanup work when an Executor ends.
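To illustrate how such hooks are invoked, here is a toy dispatch loop in Python. The real interface is the C++ class in mesos/hook.hpp; the LoggingHook here is hypothetical.

```python
# Sketch of hook dispatch: the agent calls every registered hook module
# at fixed points such as "before Docker launch" and "after executor exit".
class Hook:
    def slavePreLaunchDockerHook(self, container):   # before docker run
        pass
    def slaveRemoveExecutorHook(self, executor_id):  # after executor exit
        pass

class LoggingHook(Hook):
    """Hypothetical hook that records events, e.g. for auditing."""
    def __init__(self):
        self.events = []
    def slavePreLaunchDockerHook(self, container):
        self.events.append(("pre-launch", container))
    def slaveRemoveExecutorHook(self, executor_id):
        self.events.append(("removed", executor_id))

hooks = [LoggingHook()]
for h in hooks:                       # agent side: invoke every module
    h.slavePreLaunchDockerHook("nginx-ct")
    h.slaveRemoveExecutorHook("exec-1")
print(hooks[0].events)
```

The key design point is that hooks are passive: the agent decides when each callback fires, and a module only supplies the bodies.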

Fourth, write an Isolator

When you have a new kind of resource to manage, and each Task needs to be isolated with respect to that resource, you need to write an Isolator.

For example, by default containers cannot dynamically specify and limit how much disk a task uses, so mesos-containerizer ships a "disk/du" isolator that periodically checks how much disk each task is using and takes action when the limit is exceeded.

src/slave/containerizer/mesos/containerizer.cpp lists the currently supported isolators; you can also implement your own and load it through the modules parameter.

An Isolator defines the following functions.
When a container is launched, each isolator's isolate function is called; through it you can restrict resources, for example by writing cgroup files. Disk usage, however, has no cgroup that can be set, so it must be measured with du from time to time. This is what the watch function is for: it periodically checks disk usage and takes action when needed.
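A "disk/du"-style watch loop can be sketched as follows. This is illustrative Python, not the real isolator: disk usage is faked, and offenders are merely reported rather than having their containers destroyed.

```python
# Sketch of a "disk/du"-style watcher: poll each container's disk
# usage and flag those above their limit. Usage numbers are faked
# stand-ins for what `du -s` would report.
def check_disk(limits_mb, usage_mb):
    """Return the containers whose measured usage exceeds their limit."""
    over = []
    for name, limit in limits_mb.items():
        if usage_mb(name) > limit:
            over.append(name)
    return over

limits = {"web": 100, "db": 500}
fake_du = {"web": 150, "db": 200}          # pretend per-container MB used
print(check_disk(limits, fake_du.get))     # ['web']
```

In the real isolator this check runs on a timer, and exceeding the limit leads to the container being torn down rather than just listed.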

Fifth, write an Executor

If you are running an ordinary container or command line, you do not need to implement an Executor; the default Mesos Executor is enough. If you need to do a lot of custom work inside the Executor, you need to write your own.

Writing an Executor means implementing a few interfaces, the most important being launchTask, and then using MesosExecutorDriver to run the Executor.

Like a Framework, an Executor communicates with the Mesos Agent over the Protocol Buffer protocol; thanks to MesosExecutorDriver you do not need to care about the protocol, only implement the interfaces.

DC/OS core modules

The following diagram depicts a DC/OS deployment.
In the DC/OS view, all nodes are divided into three zones. The first is the management zone, which handles management operations on services: create, delete, modify, start, stop, scale. For high availability there can be multiple Master nodes, fronted by a load balancer. The second is the public (external service) zone, containing the services the outside world can reach, such as an outward-facing Nginx; marathon-lb runs here as the external load balancer, and a load balancer is also needed in front of all the public agent nodes. The third is the private (internal service) zone, for internal services such as databases and message buses; these nodes cannot be reached from outside.

First, Admin Router

Admin Router is a reverse proxy that completely isolates the external zone from the internal one: outside the Admin Router you access via public network addresses, while everything behind it is on private addresses. This provides a secure, unified access mechanism.

After installing Open DC/OS, install the dcos command-line tool; with it you can SSH to the master node:

  eval `ssh-agent -s`
  ssh-add .ssh/aws01.pem
  dcos node ssh --master-proxy --leader

On this node, under /etc/systemd/system, there are three systemd services; all Open DC/OS components are managed with systemd.

  ip-10-0-7-1 system# ls -l | grep adminrouter
  lrwxrwxrwx. 1 root root 135 Oct 3 08:00 dcos-adminrouter-reload.service -> /opt/mesosphere/packages/adminrouter--cee9a2abb16c28d1ca6c74af1aff6bc4aac3f134/
  lrwxrwxrwx. 1 root root 133 Oct 3 08:00 dcos-adminrouter-reload.timer -> /opt/mesosphere/packages/adminrouter--cee9a2abb16c28d1ca6c74af1aff6bc4aac3f134/
  lrwxrwxrwx. 1 root root 128 Oct 3 08:00 dcos-adminrouter.service -> /opt/mesosphere/packages/adminrouter--cee9a2abb16c28d1ca6c74af1aff6bc4aac3f134/

You can see that dcos-adminrouter.service points into /opt/mesosphere/packages; all components of Open DC/OS are installed under this path.

Under /opt/mesosphere/packages/adminrouter--cee9a2abb16c28d1ca6c74af1aff6bc4aac3f134/nginx/conf there is a file nginx.master.conf; open it and you will see familiar Nginx configuration:

  upstream mesos {
      server leader.mesos:5050;
  }

  upstream marathon {
      server master.mesos:8080;
  }

  location /mesos/ {
      access_by_lua 'auth.validate_jwt_or_exit()';
      proxy_set_header Host $http_host;
      proxy_pass http://mesos/;
  }

  location /marathon/ {
      # Enforce access restriction. Auth-wise, treat /marathon*
      # equivalently to /service/marathon*.
      access_by_lua 'auth.validate_jwt_or_exit()';
      proxy_set_header Host $http_host;
      proxy_pass http://marathon/;
  }

As this configuration shows, all internal access to the Marathon and Mesos pages goes through leader.mesos, a domain name provided by mesos-dns that maps to an internal IP address. From outside, the Marathon or Mesos pages must be reached through the Admin Router, at http://admin-router-external-ip/marathon or http://admin-router-external-ip/mesos.

Second, Mesos-DNS

For a data center operating system, service discovery and load balancing are the most essential functions. Only with them can the physical placement of services, the dependencies between services, and the automatic recovery of failed services become invisible to users, letting them use the whole data center like a single computer.

If services call each other by domain name rather than by IP address, the problem becomes much simpler.
As the figure shows, Mesos-DNS obtains every Task running on Mesos by calling the Mesos Master API, and assigns each Task a domain name mapped to its IP. If one Task needs to reach another, it only needs to be configured with the domain name: no matter how the task fails or gets rescheduled onto other nodes, the domain name does not change. The Task's IP may change, but Mesos-DNS will update it. Each Mesos Agent only needs /etc/resolv.conf configured to point at Mesos-DNS.

When a Task runs, Mesos-DNS creates a domain name <task>.<service>.mesos that corresponds to:

  • the IP address of the Mesos-Agent
  • if the Mesos Containerizer is used, the IP of the Task's internal container

In addition, <task>.<service>.slave.mesos always provides the IP address of the physical machine. Together with the host port, the domain names given by Mesos-DNS are enough to implement service discovery.
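The naming rule can be illustrated with a small sketch (plain Python, with made-up task data; the record layout is simplified compared to real Mesos-DNS):

```python
# Sketch of the Mesos-DNS naming rule: each running task gets
# <task>.<service>.mesos (task IP) and <task>.<service>.slave.mesos
# (agent IP). The task list here is made up.
def dns_records(tasks):
    records = {}
    for t in tasks:
        base = f"{t['name']}.{t['service']}"
        records[f"{base}.mesos"] = t["task_ip"]
        records[f"{base}.slave.mesos"] = t["agent_ip"]
    return records

tasks = [{"name": "nginx", "service": "marathon",
          "task_ip": "10.0.4.2", "agent_ip": "10.0.1.78"}]
recs = dns_records(tasks)
print(recs["nginx.marathon.mesos"])        # 10.0.4.2
print(recs["nginx.marathon.slave.mesos"])  # 10.0.1.78
```

Callers resolve the stable name; only the records behind it change when a task is rescheduled.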

Third: Marathon-lb

DNS gives you service discovery, but it is not well suited to load balancing or elastic scaling; marathon-lb provides those functions.
Marathon-lb is an HAProxy-based load balancer that listens on the Marathon event bus: whenever the set of services registered with marathon-lb changes, it automatically updates the HAProxy configuration file to implement load balancing. Marathon-lb can load-balance external traffic as well as calls between internal services.

To install it, search for Marathon-LB in the Universe in the UI, or run dcos package install marathon-lb on the command line; by default this installs the external load balancer.

We create the following application in Services:

  {
    "id": "nginx",
    "container": {
      "type": "DOCKER",
      "docker": {
        "image": "nginx:1.7.7",
        "network": "BRIDGE",
        "portMappings": [
          {"hostPort": 0, "containerPort": 80, "servicePort": 10000}
        ],
        "forcePullImage": true
      }
    },
    "instances": 1,
    "cpus": 0.1,
    "mem": 65,
    "healthChecks": [{
      "protocol": "HTTP",
      "path": "/",
      "portIndex": 0,
      "timeoutSeconds": 10,
      "gracePeriodSeconds": 10,
      "intervalSeconds": 2,
      "maxConsecutiveFailures": 10
    }],
    "labels": {
      "HAPROXY_GROUP": "external"
    }
  }
In this application, servicePort: 10000 means we register port 10000 with Marathon-lb, and the label HAPROXY_GROUP: external means we register with the external load balancer.

Now, visiting port 10000 on the public slave shows the started Nginx page; other applications can reach this Nginx through http://marathon-lb.marathon.mesos:10000.

If we look at the HAProxy configuration page on the public slave, we can see the mapping:
marathon-lb listens on port 10000 externally and maps it to port 20215 internally; the Services page confirms that the actually started Nginx is listening on port 20215.
Next we deploy marathon-lb-autoscale, which monitors HAProxy and, when the RPS (requests per second) exceeds a threshold, elastically scales the application:

  {
    "id": "marathon-lb-autoscale",
    "args": [
      "--marathon", "http://leader.mesos:8080",
      "--haproxy", "http://marathon-lb.marathon.mesos:9090",
      "--target-rps", "100",
      "--apps", "nginx_10000"
    ],
    "cpus": 0.1,
    "mem": 16.0,
    "instances": 1,
    "container": {
      "type": "DOCKER",
      "docker": {
        "image": "brndnmtthws/marathon-lb-autoscale",
        "network": "HOST",
        "forcePullImage": true
      }
    }
  }
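The scaling decision marathon-lb-autoscale makes boils down to keeping per-instance RPS at or below --target-rps. A simplified sketch of that calculation (the real tool smooths measurements over sampling intervals):

```python
import math

# Simplified marathon-lb-autoscale decision: size the app so that
# measured RPS divided by the instance count stays at or below target-rps.
def target_instances(measured_rps, target_rps, min_instances=1):
    return max(min_instances, math.ceil(measured_rps / target_rps))

# With target-rps 100, a measured load of 800 RPS calls for 8 instances,
# matching the scale-out seen in the demo below.
print(target_instances(800, 100))  # 8
```

When load drops back to zero, the same formula floors at min_instances, which is why the app shrinks again after the load generator stops.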

Next, we deploy Siege to send requests to Nginx:

  {
    "id": "siege",
    "args": [
      "http://marathon-lb.marathon.mesos:10000/"
    ],
    "cpus": 0.5,
    "mem": 16.0,
    "instances": 1,
    "container": {
      "type": "DOCKER",
      "volumes": [],
      "docker": {
        "image": "yokogawa/siege",
        "network": "HOST",
        "privileged": false,
        "parameters": [],
        "forcePullImage": false
      }
    }
  }

On the HAProxy stats page we can see the requests arriving. Now we scale Siege to 10 instances to put pressure on Nginx.
After a while, marathon-lb-autoscale takes action,
scaling Nginx from 1 instance to 8.
When we scale Siege back from 10 to 0, the Nginx instances are scaled back down as well.

Fourth, Minuteman

Minuteman is the internal, east-west load balancer. It can be used to set up a VIP, with multiple instances sharing the same VIP for load balancing.
When creating a service, select Load Balanced, and a line like address:80 appears below; this is the VIP assigned by Minuteman.
Once the service is created, it can be reached via curl http://nginxdocker.marathon.l4 … ry:80. But pinging that domain name does not work, and the IP address it resolves to looks strange: that IP is the VIP.

How is this done? The Minuteman load balancer is based on Netfilter. On a DC/OS slave node we can see four extra iptables rules: the first two are in the raw table, the last two in the filter table.

  -A PREROUTING -p tcp -m set --match-set minuteman dst,dst -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j NFQUEUE --queue-balance 50:58
  -A OUTPUT -p tcp -m set --match-set minuteman dst,dst -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j NFQUEUE --queue-balance 50:58
  -A FORWARD -p tcp -m set --match-set minuteman dst,dst -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
  -A OUTPUT -p tcp -m set --match-set minuteman dst,dst -m tcp --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable

According to iptables semantics, rules in the raw table are applied first, so matching packets are diverted there; any Minuteman packet that does reach the filter table is rejected.

The NFQUEUE target hands the packet over to a user-space process. --queue-balance spreads packets across several queues; the user-space process connects to those queues with libnetfilter_queue, reads the packets out, and decides, based on their contents, what to send back to the kernel.

The Minuteman process runs on every Mesos-Agent node, listening on these queues, and we can view the VIP mappings through its API: curl http://localhost:61421/vips.
We can see that the VIP maps to two nodes on port 4989, corresponding exactly to the two Nginx instances.

How DC/OS manages microservices and big data

DC/OS is based on Mesos, and Mesos's flexible framework mechanism lets DC/OS deploy both containers and big data frameworks. A big data framework that is not running any tasks consumes almost no resources, which enables true resource sharing between microservices and big data frameworks.

Previously, deploying a container meant writing Marathon JSON yourself, which required the people using a service to be as expert as the people designing it.

DC/OS instead uses a package management mechanism: the various configurations needed to run a microservice or framework are turned into a template by professionals and uploaded to a package repository. The user need not be so professional; running dcos package install is enough.

Mesosphere provides the official package repository, called Universe; the corresponding code can be found on GitHub.
A package typically contains the following parts:

  • package.json: holds some metadata, for example for Spark:

    {
      "name": "spark",
      "description": "Spark is a fast and general cluster computing system for Big Data. Documentation:",
      "licenses": [{
        "name": "Apache License Version 2.0",
        "url": ""
      }],
      "tags": [
    }
  • config.json: holds the configurable options, for example for Spark:

    {
      "name": {
        "default": "spark",
        "description": "The Spark Dispatcher will register with Mesos with this as a framework name. This service will be available at http://<dcos_url>/service/<name>/",
        "type": "string"
      },
      "cpus": {
        "default": 1,
        "description": "CPU shares",
        "minimum": 0.0,
        "type": "number"
      },
      "mem": {
        "default": 1024.0,
        "description": "Memory (MB)",
        "minimum": 1024.0,
        "type": "number"
      },
      "role": {
        "description": "The Spark Dispatcher will register with Mesos with this role.",
        "type": "string",
        "default": "*"
      }
    }
  • marathon.json.mustache: a template in which the variables are replaced with values from config.json, yielding a request that can be sent directly to Marathon. For Spark:

    {
      "id": "{{}}",
      "cpus": {{service.cpus}},
      "mem": {{service.mem}},
      "container": {
        "type": "DOCKER",
        "docker": {
          "image": "{{resource.assets.container.docker.spark_docker}}",
          "network": "HOST",
          "forcePullImage": true
        }
      }
    }

  • resource.json: resources such as images and tar.gz files:

    {
      "assets": {
        "container": {
          "docker": {
            "spark_docker": "mesosphere/spark:1.0.2-2.0.0"
          }
        }
      }
    }
All of these configuration templates are written in advance; installation is a click in the UI or a single command.
Of course, clicking Advanced Installation lets you customize every configuration item.
Just as with Yum, the maker of the mysql-server package and the MySQL user are separated: an ordinary user does not need to understand many details, only to use it.

If you want package management inside your own data center, you can generate a local universe and put your own applications into it: a professional designs the package once, and it can be reused many times. You can also install several packages at once as a group, containing both microservices and big data frameworks, which can then reach each other through service discovery.

Here we install Spark as an example.
Right after installation, we find only one Docker container.
Isn't Spark a cluster computing framework; how can there be only one Docker container? This is how Mesos handles big data frameworks: while Spark is running no tasks, it occupies only this one container, which is in fact the framework itself.
The installation process is as follows:

  1. dcos package install spark submits the request to the Admin Router
  2. The Admin Router forwards the request to cosmos, the package management service
  3. Cosmos combines config.json, resource.json, and marathon.json into a Marathon request and submits it to Marathon
  4. Marathon passes the request to the mesos-master, which passes it to a mesos-agent
  5. The mesos-agent starts a container running Spark
  6. The Spark container registers with Mesos as a new framework

Only when a Spark job actually runs are other, resource-consuming tasks created:

  dcos spark run --submit-args='-Dspark.mesos.coarse=true --driver-cores 1 --driver-memory 1024M --class org.apache.spark.examples.SparkPi https://downloads.mesosphere.com/spark/assets/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30'

The Spark job then runs as follows:

  1. dcos spark run submits the job to the Admin Router
  2. The Admin Router submits the job to the Spark framework
  3. The Spark framework submits the tasks to the mesos-master
  4. The mesos-master distributes the tasks to mesos-agents for processing
  5. When the job finishes, all the resources occupied on the mesos-agents are released

It is this model that lets microservices and big data frameworks share resources. Compare it with deploying a Spark cluster with Docker and managing the cluster yourself, outside Mesos: when creating such a cluster you must give each Spark worker a fixed amount of resources, say 16 GB, and that 16 GB cannot be used by any other framework, whether or not a computation is running.

New features in DC/OS 1.8

For the latest DC/OS 1.8, the blog post "Introducing DC/OS 1.8 GA!" describes the new features.

The first important feature is Mesos 1.0 and the Universal Container Runtime, which lets the mesos-containerizer run Docker images. This shows DC/OS becoming more and more independent in how it manages containers.

We can inspect the mesos-agent on a machine:

  ip-10-0-1-78# ps aux | grep mesos-agent
  root 1824 0.6 0.3 1069204 46948 ? Ssl Oct03 9:57 /opt/mesosphere/packages/mesos--19a545facb66e57dfe2bb905a001a58b7eaf6004/bin/mesos-agent

The mesos-agent is installed under /opt/mesosphere/packages/mesos--19a545facb66e57dfe2bb905a001a58b7eaf6004. Its startup parameters are set in the dcos-mesos-slave.service unit shown below; from the Mesos documentation we know that Mesos parameters can be set through environment variables.

  ip-10-0-1-78 # cat dcos-mesos-slave.service
  [Unit]
  Description=Mesos Agent: DC/OS Mesos Agent Service

  [Service]
  Restart=always
  StartLimitInterval=0
  RestartSec=5
  KillMode=control-group
  Delegate=true
  LimitNOFILE=infinity
  TasksMax=infinity
  EnvironmentFile=/opt/mesosphere/environment
  EnvironmentFile=/opt/mesosphere/etc/mesos-slave-common
  EnvironmentFile=/opt/mesosphere/etc/mesos-slave
  EnvironmentFile=/opt/mesosphere/etc/proxy.env
  EnvironmentFile=-/opt/mesosphere/etc/mesos-slave-common-extras
  EnvironmentFile=-/var/lib/dcos/mesos-slave-common
  EnvironmentFile=-/var/lib/dcos/mesos-resources
  EnvironmentFile=-/run/dcos/etc/mesos-slave
  ExecStartPre=/bin/ping -c1 ready.spartan
  ExecStartPre=/bin/ping -c1 leader.mesos
  ExecStartPre=/opt/mesosphere/bin/bootstrap dcos-mesos-slave
  ExecStartPre=/opt/mesosphere/bin/ /var/lib/dcos/mesos-resources
  ExecStartPre=/bin/bash -c 'for i in /proc/sys/net/ipv4/conf/*/rp_filter; do echo 2 > $i; echo -n "$i: "; cat $i; done'
  ExecStart=/opt/mesosphere/packages/mesos--19a545facb66e57dfe2bb905a001a58b7eaf6004/bin/mesos-agent

The file /opt/mesosphere/etc/mesos-slave-common configures a large number of mesos-agent parameters:

  MESOS_MASTER=zk://zk-1.zk:2181,zk-2.zk:2181,zk-3.zk:2181,zk-4.zk:2181,zk-5.zk:2181/mesos
  MESOS_LOG_DIR=/var/log/mesos
  MESOS_MODULES_DIR=/opt/mesosphere/etc/mesos-slave-modules
  MESOS_CONTAINER_LOGGER=org_apache_mesos_LogrotateContainerLogger
  MESOS_ISOLATION=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume
  MESOS_DOCKER_VOLUME_CHECKPOINT_DIR=/var/lib/mesos/isolators/docker/volume
  MESOS_NETWORK_CNI_CONFIG_DIR=/opt/mesosphere/etc/dcos/network/cni
  MESOS_NETWORK_CNI_PLUGINS_DIR=/opt/mesosphere/active/cni/
  MESOS_WORK_DIR=/var/lib/mesos/slave
  MESOS_EXECUTOR_ENVIRONMENT_VARIABLES=file:///opt/mesosphere/etc/mesos-executor-environment.json
  MESOS_DOCKER_STORE_DIR=/var/lib/mesos/slave/store/docker
  GLOG_drop_log_memory=false

The default mesos-containerizer isolation includes only CPU and memory, but the latest Mesos version adds the Provisioner layer. With the isolators configured above (MESOS_ISOLATION=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume), the mesos-containerizer can start Docker images.
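Since the isolation flag is just a comma-separated list, splitting it is an easy way to check which isolators an agent has enabled. A minimal sketch using the value quoted above:

```shell
# The MESOS_ISOLATION value is a comma-separated list of isolators; splitting
# it line-by-line makes the docker/runtime isolator (which lets the Mesos
# containerizer provision Docker images) easy to spot.
MESOS_ISOLATION="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume"
echo "$MESOS_ISOLATION" | tr ',' '\n'
```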

The second important feature is CNI, the Container Network Interface.
CNI involves three parts:

First, DC/OS does not require an external IPAM; instead, the mesos-master's replicated_log manages IP address allocation. For this, Mesos must load the overlay network modules at startup.

Under the path /opt/mesosphere/etc/mesos-slave-modules is the file overlay_slave_modules.json:

  ip-10-0-1-78 mesos-slave-modules # cat overlay_slave_modules.json
  "file": "/opt/mesosphere/active/mesos-overlay-modules/lib/mesos/",
  "name": "com_mesosphere_mesos_OverlayAgentManager",
  "key": "agent_config",
  "value": "/opt/mesosphere/etc/overlay/config/agent.json"

Second, the CNI isolator must be loaded; it is already configured in the MESOS_ISOLATION environment variable shown above.

Finally, the navstar service is required so that IP addresses on different nodes can reach one another.

Each mesos-agent machine has /opt/mesosphere/packages/navstar--589afdaef03114a17576ee648ae433a052f7a4b9/ and runs a navstar process.

Each machine creates a network interface d-dcos; Docker containers that obtain an IP through CNI are attached to this bridge rather than to docker0.

Each machine creates a network interface m-dcos; Mesos containers that obtain an IP through CNI are attached to this bridge.

The d-dcos and m-dcos subnets differ from machine to machine.

Each machine creates a vtep1024 interface, which acts as the VTEP of the VXLAN overlay.

Each machine creates default routes so that traffic to the overlay subnets of other nodes goes out through vtep1024 (routes of the form "<subnet> via <remote VTEP IP> dev vtep1024").

The DC/OS network configuration is under the path /opt/mesosphere/etc/dcos/network/cni.
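For reference, a minimal CNI network configuration in the standard format generally looks like the following. This is purely illustrative: the plugin types (bridge, host-local) come from the CNI specification, while the subnet and file name are assumptions of mine, not DC/OS's actual settings under that path.

```shell
# Illustrative only: a generic CNI config using the spec's standard bridge
# and host-local plugins. The subnet and path are assumed, not DC/OS's real
# configuration.
cat > /tmp/example-cni.conf <<'EOF'
{
  "cniVersion": "0.2.0",
  "name": "dcos",
  "type": "bridge",
  "bridge": "m-dcos",
  "ipam": {
    "type": "host-local",
    "subnet": "9.0.1.0/24"
  }
}
EOF
cat /tmp/example-cni.conf
```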
To test these two new features, we first create a Mesos container through CNI, but start it from Docker's nginx image:

"Id": "nginxmesos",
"Cmd": "env; ip -o addr; sleep 3600"
"Cpus": 0.10,
"Mem": 512,
"Instances": 1,
"IpAddress": {
"NetworkName": "dcos"
"Container": {
"Type": "MESOS",
"Docker": {
"Network": "USER",
"Image": "nginx",
"PortMappings": [
"Host_port": 0,
"Container_port": 80,
"Protocol": "tcp"

In the task log, the container's printed IP address is in the m-dcos subnet.
Then we start a Docker container through CNI:

"Id": "nginxmesos1",
"Cmd": "env; ip -o addr; sleep 3600"
"Cpus": 0.10,
"Mem": 512,
"Instances": 1,
"IpAddress": {
"NetworkName": "dcos"
"Container": {
"Type": "DOCKER",
"Docker": {
"Network": "USER",
"Image": "nginx",
"PortMappings": [
"Host_port": 0,
"Container_port": 80,
"Protocol": "tcp"

From the log we see that the allocated IP is in the d-dcos subnet, not the docker0 subnet.
From Mesos we see that the two containers run on two different nodes.
Logging in to the Docker container and pinging the Mesos container's CNI IP works fine:

  ip-10-0-1-78 cni # docker ps
  e7908deb3017  nginx  "/bin/sh -c 'env; ip"   28 minutes ago  Up 28 minutes  80/tcp, 443/tcp    mesos-b3fbe6d9-236a-4856-a986-9babbba9c02c-S2.e3c96fa7-b5ff-4af6-9099-bbed399c7c37
  a992929fb0d1  nginx  "nginx -g 'daemon off"  6 hours ago     Up 6 hours     443/tcp, ->80/tcp  mesos-b3fbe6d9-236a-4856-a986-9babbba9c02c-S2.fca41f8d-816c-49cd-9b19-ba059b95e885
  8032756dd66e  nginx  "nginx -g 'daemon off"  6 hours ago     Up 6 hours     443/tcp, ->80/tcp  mesos-b3fbe6d9-236a-4856-a986-9babbba9c02c-S2.c0fdd3db-6f17-41d3-ab05-6f2d4d0bfa13
  ip-10-0-1-78 cni # docker exec -it e7908deb3017 bash
  root@e7908deb3017:/# ip addr
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet scope host lo
         valid_lft forever preferred_lft forever
      inet6 ::1/128 scope host
         valid_lft forever preferred_lft forever
  51: eth0@if52: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420 qdisc noqueue state UP group default
      link/ether 02:42:09:00:03:82 brd ff:ff:ff:ff:ff:ff
      inet scope global eth0
         valid_lft forever preferred_lft forever
      inet6 fe80::42:9ff:fe00:382/64 scope link
         valid_lft forever preferred_lft forever
  root@e7908deb3017:/# ping
  PING ( 56 data bytes
  64 bytes from  icmp_seq=0 ttl=62 time=0.709 ms
  64 bytes from  icmp_seq=1 ttl=62 time=0.535 ms

Q & A

Q: What is the largest number of nodes DC/OS has been tested with in a cluster?

A: Twitter and Apple have publicly described production Mesos deployments above the ten-thousand-node level; our company currently runs clusters at the thousand-node level.

Q: What are the current application scenarios for DC/OS?

A: The main scenario for DC/OS today is the mixed deployment of micro-services and big data, which is also the core of its design.

Q: Can stateful applications, such as MySQL, be deployed?

A: When adopting DC/OS we recommend starting with stateless services; as operational experience grows, stateful services can be deployed as well. However, we do not recommend deploying stateful services with Marathon: on the one hand, Marathon cannot distinguish between instances; on the other hand, unified storage is required. For a stateful service you can implement a dedicated Framework. For example, the database we use is MongoDB, and we wrote our own Framework for it.

The content above is based on the WeChat group sharing session on the evening of October 4, 2016. The speaker is Liu Chao, Chief Architect at Linker Networks, with 10 years of R&D and architecture experience in cloud computing and a contributor to Open DC/OS. He focuses on enterprise applications and products built on OpenStack, Hadoop, Docker, Spark, Mesos and other open-source software. Personal blog: . DockOne organizes a targeted technical sharing session every week; interested readers are welcome to add the WeChat account liyoujiesz with the topic they want to hear about or share.
