Some Thoughts on Cloud Native
Part I: Definitions
When Craig and I started building Heptio, we did a lot of thinking about the direction of our industry. We had both been at Google for a long time (sixteen years between the two of us) and had developed a deep understanding of how Google builds and runs systems. But most people have never had that experience. So how should the developers and operators of an ordinary company understand and apply these fast-evolving ideas?
Cloud native has no fixed or rigid definition; in fact, it overlaps with several other concepts and ideologies. At its core, though, cloud native is about organizing teams, culture, and technology to harness automation and architecture to manage complexity and unlock velocity. Operating this way is not just a way to scale technology; it is a way to scale people.
It is important to note that a system does not have to run in the cloud to be called "cloud native." These techniques can be adopted where appropriate, and they help smooth any eventual transition to the cloud.
The real value of cloud native is far more than a basket of related technologies. To really understand where our industry is going, we need to think about where these approaches can make companies, teams, and individual people more successful.
At present these techniques have been validated by technology-centric, forward-looking companies that have committed real effort and resources to them. Think Google, Netflix, or Facebook. Smaller, more nimble companies have realized value as well. Beyond those early adopters, however, this philosophy is rarely applied. Projected across the whole of IT, we are still near the start of the journey.
As early experience keeps being validated and shared, what themes are emerging today?
- More efficient and happier teams. Cloud native tooling cuts big problems into smaller ones that can be handled by more focused, more agile teams.
- Less boring, repetitive work. Automating manual operations avoids the toil, pain, and downtime that hand-operated systems invite. This usually takes the form of self-healing, self-managing infrastructure: the system itself is expected to handle more.
- More reliable infrastructure and applications. Automation built to handle predictable failures usually yields better failure modes for unforeseeable events and failures too. For example, if deploying an application takes a single command or the click of a button, it is far easier to deploy it automatically in a disaster-recovery scenario (whether triggered manually or automatically).
- Auditable, diagnosable, and debuggable. Complex systems can become very opaque. The tools typical of cloud native environments give a much clearer picture of what is happening inside an application.
- Defense in depth. Much of IT today has a hard outer shell and a soft, fragile interior. Modern systems should be secure by default and adopt a minimal-trust posture. Cloud native lets application developers play an active role in application security.
- More efficient use of resources. Deploying and managing applications and services in an automated, cloud-like way opens the door to algorithmic automation. For example, a cluster scheduler/orchestrator can place workloads automatically, rather than having an ops team manage placement in a spreadsheet.
At Heptio, we are especially excited to help the broader IT world realize the benefits of cloud native. In the sections that follow we discuss integration with existing systems, DevOps, containers and orchestration, microservices, and security.
Part II: Practice
As in any field undergoing change, the cloud native world is full of tangled concepts, and the list in the previous part does not tell anyone how to apply them properly. At the same time, many critical systems are either too large or too important to be rewritten wholesale. We therefore think it is best to apply these new constructs in new projects, or in new parts of old projects, and to take the time to learn and adopt further techniques as the older parts of a system are improved. Look for ways to decompose new features or systems into microservices.
There are no hard and fast rules. Every organization is different, and software development practice must be adapted to the team and the project at hand. The ideal differs from reality: some projects can absorb experimental setbacks, but many critical projects warrant a more cautious attitude. There is also a middle ground, where a technology, even a proven one, must be standardized and exercised at scale before it can be trusted in core systems.
The cloud native definition is inseparable from better tools and systems. Without that toolchain, every new service deployed to production carries a high operational cost: monitoring, tracing, configuration, and so on each add to the burden. This overhead is one of the main reasons microservices should be sized sensibly: the velocity a development team gains must be weighed against the cost of running more services in production. Likewise, introducing new technologies and languages, however fresh and stimulating, calls for a careful weighing of risk and cost. Charity Majors gives an excellent talk on this topic.
Automation is the key to containing the operational cost of building and running new services. Systems such as Kubernetes, containers, CI, and CD share the same important goal: to help application development and operations teams work more efficiently and build more reliable products faster.
To realize the cloud native vision, it is best to reach for this new generation of tools and systems rather than traditional configuration management tools, because they help decompose the problem so that different teams can own different pieces. In general, the newer tools let independent development and operations teams stay autonomous, raising productivity through self-service IT.
Part III: DevOps
DevOps can be seen as a cultural transformation. Developers now need to care about how their applications run in production, and operations staff know how the application works and are empowered to play an important part in making it reliable. The key is to deepen understanding and build empathy between these teams.
But we can go further. If we rethink how we structure our applications and our operations teams, we can deepen the relationship further.
Google has no operations teams in the traditional sense. Instead it defines a new kind of engineer, the SRE (Site Reliability Engineer). SREs are highly skilled engineers (on the same compensation scale as other engineers). They do more than keep the site up around the clock; they are empowered, and given the standing they need, to play a critical part in making applications more robust.
When an alert fires at two in the morning, whoever responds does the same thing: figure out what is wrong, then get back to sleep as soon as possible. What really defines an SRE is what happens at ten the next morning. Does the ops group merely complain, or does it work with the development team to ensure the same alert never fires again? SREs share the development team's goal of making the application as stable and reliable as possible. Combined with blameless post-mortems, this keeps a project healthy instead of letting technical debt pile up.
SREs are among the most valued people at Google. In practice, many projects launch without SRE involvement, with the development team shouldering the expectation of running its product in production. Bringing an SRE team on board usually requires the development team to prove to the SREs that the product is ready: all the groundwork must be in place, including monitoring and alerting, alert-response playbooks, and release processes. The development team must be able to show that alert volume is down to a minimum and that the vast majority of problems have been automated away.
As operations roles cooperate more deeply with, and become more specific to, particular applications, it stops making sense for one operations team to control the entire operational stack. This leads to operational specialization, which in a sense is an "anti-DevOps" move. Let's look at it from the bottom up:
- Hardware ops. This layer is clearly separable; in fact, it is easy to think of cloud IaaS as "hardware ops as a service."
- Operating system ops. Someone has to make sure machines boot cleanly with a good kernel. Separating this from application dependency management is reflected in the minimal operating system releases aimed at hosting containers (CoreOS, Red Hat Project Atomic, Ubuntu Snappy, RancherOS, VMware Photon, Google Container-Optimized OS).
- Cluster ops. In a containerized world, a compute cluster becomes a logical infrastructure platform. The cluster system (such as Kubernetes) provides a set of primitives that turn many traditional operational tasks into self-service.
- Application ops. Each application can now have a dedicated operations team as needed; where appropriate, the development team can, and should, play this role itself. Such a team can go deeper on the application precisely because it no longer has to go deep at every other layer. At Google, for example, the AdWords frontend SRE team talks with the AdWords frontend development team far more than with the cluster teams, and gets better results for it.
There is room for other specialized SRE teams too. Storage, for instance, may be split out as a separate service, or a team may be charged with vetting, per policy, the base container images that every other team builds on.
Part IV: Containers and Clusters
A lot of people are excited about container technology. It is worth understanding the root causes of all that excitement. In my opinion, there are three:
- Packaging and portability
- Efficiency
- Security
Let's look at each in turn.
First, containers provide a packaging mechanism. This decouples the build system from the deployment process. Moreover, the built artifact, the image, is far more portable than traditional units such as virtual machine images. Finally, deployment becomes more atomic. Traditional configuration management systems (Puppet, Chef, Salt, Ansible) make it easy to leave a machine in a half-configured state that is hard to debug, and easy to leave stray wrong versions lurking unnoticed on a machine.
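The atomicity point comes from content addressing: an image is identified by a cryptographic digest of its contents, so the same inputs always yield the same immutable identifier, and a deployment either references that exact artifact or it does not. A minimal sketch (the helper name is mine, not any registry API):

```python
import hashlib

def image_digest(artifact: bytes) -> str:
    # Content-address an artifact the way container registries do:
    # the identifier is derived purely from the bytes themselves.
    return "sha256:" + hashlib.sha256(artifact).hexdigest()

# Identical inputs always produce the identical, immutable image ID,
# so there is no half-applied or drifted state to debug.
assert image_digest(b"app binary + deps") == image_digest(b"app binary + deps")
```

Contrast this with mutable, in-place configuration management, where the state of a machine is the sum of every run that ever touched it.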
Second, containers are lightweight, which drives resource utilization up. This is one of the main reasons Google created cgroups (one of the core kernel technologies underlying containers). By sharing a kernel and allowing more fluid overcommit, containers make it easier to use every part of the system ("use every part of the cow"). Over time, expect ever more sophisticated ways of balancing the needs of containers coexisting on one host, putting an end to noisy neighbors.
Finally, many users see containers as a security boundary. Containers can be more secure than a plain Unix process, but be careful about treating them as a hard boundary. The guarantees provided by Linux namespaces are adequate for "soft" multi-tenant environments running semi-trusted workloads, but not for hard multi-tenant environments running hostile workloads.
There are many efforts to blur the line between containers and virtual machines. Some early research, such as unikernels, is interesting, but remains years away from production.
Containers are an easy way to achieve these goals, but not a necessary one. Netflix, for example, has long run a very modern stack while using virtual machine images in a container-like way for packaging.
While most of the effort around containers focuses on making software on a single node more reliable and predictable, the next step in this transformation focuses on the cluster (and what is often called the orchestrator): taking a set of nodes, binding them together with an automated system, and offering development and operations teams a self-service set of logical infrastructure.
Clusters help eliminate operational toil. With a container cluster we let the computer take over and decide which machine should run each workload. When hardware fails, the cluster silently repairs the damage instead of having to page someone.
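The placement decision the cluster takes over can be illustrated with a toy scheduler. This is a sketch of the idea only; real schedulers such as Kubernetes' weigh many more factors, and the "free CPU" model and function names here are my own simplification:

```python
def schedule(pods, free_cpu):
    """Toy placement: put each pod on the node with the most free CPU."""
    free = dict(free_cpu)               # node -> remaining CPU units
    placement = {}
    for pod, cpu in pods.items():
        node = max(free, key=free.get)  # least-loaded node wins
        if free[node] < cpu:
            raise RuntimeError(f"no node can fit {pod}")
        placement[pod] = node
        free[node] -= cpu
    return placement

def heal(placement, pods, free_cpu, failed_node):
    """Self-healing: silently move workloads off a failed node."""
    survivors = {n: c for n, c in free_cpu.items() if n != failed_node}
    for pod, node in placement.items():
        if node in survivors:
            survivors[node] -= pods[pod]   # capacity already in use
    displaced = {p: pods[p] for p, n in placement.items() if n == failed_node}
    moved = schedule(displaced, survivors)
    return {**{p: n for p, n in placement.items() if n != failed_node}, **moved}
```

With nodes `{"n1": 4, "n2": 4}` and pods `{"web": 1, "db": 2}`, losing "n1" simply reassigns its pod to "n2", capacity permitting; no human gets paged.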
The first advantage of clusters is that they enable the operational specialization described in Part III, allowing cluster operations to develop as a discipline in its own right. With a well-defined cluster interface, application developers can focus on problems specific to their own applications.
The second benefit is that clusters make it practical to launch and manage many more services, which in turn lets development teams adopt more highly factored architectures (through the microservices described in the next section).
Part V: Microservices
Microservices are a new name for an idea that has been around for a long time. Basically, they are a way of slicing a large application into pieces that can be developed and managed independently. The key concepts:
- Strong, clear interfaces. Tight coupling between services must be avoided. Documented, versioned interfaces help enforce this contract while preserving a degree of freedom for both consumers and producers of a service.
- Independently deployed and managed. It should be possible to update a single microservice without synchronizing with any other, and it should be easy to roll a microservice back to an earlier version. This means the deployed binaries must be forward and backward compatible, both in their APIs and in any data formats. This is a touchstone for how well the operations and development teams cooperate and communicate.
- Resilience built in. Microservices should be built and tested to be independently resilient. Code that consumes a service should strive to keep working and respond reasonably when that service is unavailable or misbehaving. Likewise, every service should protect itself from unexpected load and malformed input.
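The resilience point can be sketched as code: a minimal defensive client that retries transient failures with backoff, then degrades to a fallback rather than propagating the outage to its own callers. The names and parameters are illustrative, not any particular library's API:

```python
import time

def resilient_call(fetch, fallback, retries=2, backoff=0.01):
    # Try the dependency a few times, backing off exponentially...
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt < retries:
                time.sleep(backoff * (2 ** attempt))
    # ...then degrade gracefully (cached data, a default answer) instead
    # of letting the dependency's outage cascade through the system.
    return fallback()
```

Production systems layer on timeouts and circuit breakers (Netflix's Hystrix popularized the pattern), but the principle is the same: a consumer plans for its dependencies to misbehave.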
Getting the size of a microservice right is hard. My advice is to resist carving out tiny "pico-services," and instead let services form along natural boundaries (programming language, asynchronous queues, scaling requirements) while keeping team sizes reasonable (the two-pizza team, for example).
An application's architecture should be allowed to grow in a practical, organic way. Rather than starting with 20 microservices, start with 2 or 3 and split services out as the complexity of the domain demands. The architecture of an application is often not thoroughly understood until the application is well into development. This also reflects the fact that very few applications are ever finished; they are always works in progress.
Are microservices a new idea? Not at all. They are another form of software componentization. We used to cut code into libraries; microservices simply move the "linker" from build time to run time. (Fittingly, Buoyant has an interesting project called linkerd, built on Twitter's Finagle system.) This resembles the SOA craze of years past, minus all the flavors of XML. Seen from another angle, databases have almost always been microservices, by virtue of how they are implemented and how well they satisfy the points above.
Constraints can be turned into productivity. Although it is tempting to let every team choose whatever language or framework it likes for each microservice, consider blessing only a handful. Doing so helps knowledge and skills accumulate and circulate within the organization and makes staff turnover easier to absorb. Stay open-minded about breaking the policy when genuinely necessary, though. This is a key advantage over the vertically integrated structure of a PaaS: the constraints come from policy, not from a lack of capability, and a matter of policy can always be revisited.
While most people see microservices as a way of implementing a single large application, services actually span a spectrum:
- Service as an implementation detail. As described above, this is useful for cutting a large application team into small development and operations teams.
- Shared artifact, private instance. Here the development of the service is shared across many running instances. There may be one development team and many operations teams, or a joint operations team specializing in running the shared artifact's instances. Many databases fall into this category: plenty of teams run private instances of the same MySQL build.
- Shared instance. Here, one team in the organization runs a shared service for many applications or teams. The service may partition data and operations per user (multi-tenancy), or provide a broad but simple capability (serving generically branded HTML, say, or a machine-learning model).
- Big-S Services. Most businesses will never build these, but many consume them. This is the classic "hard" multi-tenant service, built to serve a large number of unrelated customers. Such services need billing and hardening that are usually unnecessary inside an enterprise. SendGrid and Twilio fall into this category.
As a service evolves from an implementation detail into a shared fixture of the business, the service network evolves from a per-application concept into one that spans the entire company. Allowing such dependencies is both an opportunity and a worry.
Part VI: Security
Note: this article does not cover every emerging security aspect of "cloud native," and although I am not a security expert, security has been a concern throughout my career. Please treat this as a partial list of things worth thinking about.
Security remains a big problem in the cloud native world. Older techniques do not map cleanly onto the new applications, and in the early going cloud native can even look like a step backwards. But this daunting area is also full of opportunity.
Container image security
A number of tools can help audit container images to ensure they carry all the relevant patches. Among the many options, I have no strong personal preference.
The real question is what to do once a flawed image is found. The market has not yet provided a good answer here. The moment a vulnerability is detected in an image, the problem turns from a technical one into a program/process one: you want to find out which parts of the organization are affected, where in the image hierarchy the fix should land, and how best to test and release a patched version.
CI/CD (Continuous Integration / Continuous Deployment) is a key part of responding well, because it can build and publish new images quickly and automatically. Integrated with the orchestration system, it lets you identify which users are running a flawed image, and verify that the fixed version has actually reached production. Finally, deployment policy can keep containers with known vulnerabilities from booting at all (in Kubernetes this is the role of admission control).
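The admission idea can be reduced to a simple policy check. A real Kubernetes admission controller is a webhook receiving an AdmissionReview object; in this sketch the pod spec and the vulnerable-image list are plain dictionaries, purely for illustration:

```python
def admit(pod_spec, vulnerable_images):
    """Refuse to start any container built from an image with known CVEs."""
    for container in pod_spec.get("containers", []):
        if container["image"] in vulnerable_images:
            return False, f"denied: image {container['image']} has known CVEs"
    return True, "allowed"

# The scanner's findings feed the deny list; the cluster enforces it at
# scheduling time, closing the gap between "patched" and "deployed".
```

The value of putting this check in the deployment path, rather than in a periodic scan, is that a flawed image can never boot in the first place.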
Microservices and network security
But even if every piece of software running on the cluster is fully patched, there is no guarantee that the network is free of untrusted activity.
In the container world, characterized by dynamic scheduling and short lifetimes, traditional network-based security tools cannot achieve the desired results. A short-lived container may not survive long enough for a traditional scanner to reach it, or the scan report arrives only after the container in question is already gone.
In a dynamic orchestrator, IP addresses no longer carry long-lived meaning, and they are automatically reused. The solution is to integrate network analysis tools with the orchestrator so that logical names (along with other metadata) can be used alongside IP addresses. This should also make alerts easier to act on.
Many network technologies use encapsulation to provide "an IP per container." This can confuse network tracing and diagnostic tools, which must be updated accordingly if such a network is deployed in production. Fortunately, much of this is standardized (VXLAN, for instance), or avoids encapsulation and virtualization altogether, so these systems can be supported by existing tools.
In my personal view, however, the biggest problems are the microservice-related ones. When many services run in production, it is essential to ensure that a given service can only be called by authorized clients. And because IPs are reused, a client also needs assurance that the service it is calling is the right one. Broadly speaking, this problem remains unsolved. There are two (not mutually exclusive) approaches to it.
The first is to make the network systems that implement host-level firewall rules (outside any container) flexible enough to express fine-grained access policies governing which containers may call which others. I call this network micro-segmentation. One challenge is configuring such policy under dynamic scheduling. Although the area is still in its infancy, many companies are working to simplify it, whether by adding support in the network layer, by cooperating with the orchestrator, or by defining policy at the application level. The big problem is that the more widely a service is used, the less effective micro-segmentation becomes: once a service has hundreds of callers, the simple model in which "access implies authorization" no longer applies.
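Micro-segmentation boils down to default-deny plus an allow list. A toy evaluator makes the model concrete (the policy shape here is invented for illustration; real systems express this as, for example, Kubernetes NetworkPolicy objects):

```python
def allowed(src, dst, policies):
    # Default deny: traffic passes only if some policy explicitly
    # permits this (source service, destination service) pair.
    return any(p == {"from": src, "to": dst} for p in policies)
```

Note how the scaling caveat shows up even in the toy: once "orders" has hundreds of entries in the allow list, the list stops encoding any meaningful notion of authorization.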
The second approach is to give applications a bigger role in authentication and encryption inside the data center. This suits the large, "soft multi-tenant" services found in big organizations, and it requires an identity system for production services. I have started a side project called SPIFFE (Secure Production Identity Framework For Everyone) around these ideas. They have been proven inside companies like Google but have not yet been implemented widely elsewhere.
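A SPIFFE identity is a URI of the form spiffe://&lt;trust domain&gt;/&lt;workload path&gt;. A minimal parser sketch (my own helper for illustration, not the official SPIFFE libraries, and stricter than the spec in requiring a workload path):

```python
def parse_spiffe_id(uri):
    """Split a SPIFFE ID into its trust domain and workload path."""
    prefix = "spiffe://"
    if not uri.startswith(prefix):
        raise ValueError("not a SPIFFE ID")
    trust_domain, sep, path = uri[len(prefix):].partition("/")
    if not trust_domain or not sep:
        raise ValueError("expected spiffe://<trust-domain>/<path>")
    return trust_domain, "/" + path
```

A service would then accept a connection only when the identity presented by the peer (typically carried in a certificate) parses to the expected trust domain and path, independent of whatever IP address the orchestrator happened to assign.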
Security is a very deep topic; there are surely many threats and countermeasures not covered here, and much still to explore.
And with that, our cloud native series comes to an end. Please share your thoughts on Medium, or reach out to jbeda, cmcluck, or heptio on Twitter.
- Cloud Native Part 1: Definition
- Cloud Native Part 2: In Practice
- Cloud Native Part 3: DevOps
- Cloud Native Part 4: Containers and Clusters
- Cloud Native Part 5: Microservices
- Cloud Native Part 6: Security