DockOne Share (98): Insta360's Container & DevOps Journey

As a panorama / VR start-up, with growing headcount and a push toward globalization, our slash-and-burn CI / CD approach could no longer meet our needs. Taking into account our team and the ongoing expansion of our technology stack, we built a development, testing, and deployment pipeline based on Alibaba Cloud, with Docker at its core and third-party services as supporting tools, along with internal practices for code submission and version management.

Background

Our company is an Internet start-up combining hardware R&D and software development. 2016 has been called the first year of panorama / VR, which means opportunity has arrived, but it also meant we would face some unprecedented problems; we stepped into countless pits along the way, but those are beyond the scope of this article, so let's skip the small talk and get straight to the topic.

Our user-facing business currently consists of three parts:

  1. Video & image sharing (2C)
  2. Panorama / VR live streaming (2B)
  3. News media cooperation (2B)

Video and image sharing serves C-side users all over the world; it must let users everywhere share quickly and easily and provide a good browsing experience. Because sharing is point-to-point, normal traffic for this part is not too heavy. News media cooperation, however, plays a special role: on November 25, because Fenghuang Wang (Phoenix) embedded our sharing page on its home page, from 8:00 to 9:30 we sustained (n) Gbps of traffic and (n)K requests per second for over an hour, almost a DDoS, which instantly dragged down our back-end statistics server and made the service unavailable to ordinary users. Similarly, panorama / VR live streaming is still a beta feature, but it carries the same potential risk. This required us to establish a rapid-response mechanism and a ready fallback plan.

Challenges

The challenges we faced, briefly listed:

  • Cluster deployment
  • Differentiated deployment
  • Global deployment
  • Environmental differences
  • Low resource utilization
  • Growing number of projects & languages

Getting specific: first, we need to deploy front-end server clusters in multiple regions to share the access pressure, and in some cases a cluster also needs to provide an online test environment (unlike a conventional test environment, this is a beta identical to the production environment), which requires support for differentiated deployment.

Second, because of our globalization strategy, the business must serve not only domestic users but also provide a consistent experience for overseas users, which requires global deployment.

The environment differences are large: after adopting front-end / back-end separation for development, the front end and Web services run on Redis + Node.js, while the back end is a mix of PHP + Java + Python + C. Traditional deployment could no longer meet the demand for rapid response; Ansible could meet it, but its configuration was cumbersome, so it was abandoned as well.

Meanwhile, considering business growth, standalone deployment of all the environments above means reserving a certain amount of resources for emergencies; and even if we package a machine image of the current environment, recovery after an incident takes too long and responds too slowly. Weighing the points above, we wanted an approach that guarantees stability and availability while curing the low resource utilization.

Finally, from the first few projects to today's dozens (with 10 to 20 updates daily), continuing the old way would require someone working full-time just on deployment. For a start-up, it is far more important to spend that energy developing new features and providing a better user experience. In short, all of these problems demanded that we change our original CI / CD approach; adopting a lighter, simpler solution became imperative.

Options

  • SSH / Fabric
  • Ansible
  • Docker

SSH / Fabric was the solution I tried first, but it required a series of custom development and was abandoned at the experimental stage. Ansible is certainly powerful, but it still failed to solve the challenges above; its main problems were:

  1. Cumbersome configuration
  2. Poor scalability (relatively speaking)
  3. Poor reliability (runs over SSH, so it is affected by the network)

Docker became the natural choice at this point; its advantages are self-evident:

  1. Flexible: the application and its system are containerized together, with no additional dependencies
  2. Convenient: any Linux distribution with Docker Engine configured can run it
  3. Open source & free: low cost, driven by the Linux kernel
  4. Lightweight: just add or remove images, and one server can host many containers
  5. Environment consistency: the image contains the runtime environment, avoiding the anomalies and risks caused by inconsistent environments

Architecture / containerization

First-generation architecture

  • SSH
  • Fabric / Ansible

The slash-and-burn SSH approach together with the Docker experimentation phase; it has been completely abandoned.

Second-generation architecture

02-stack-dep-physic.jpg

  1. Alibaba Cloud VPC, with Ansible managing servers internally
  2. Container deployment by running Docker commands through Ansible
  3. Back-end services & RabbitMQ still deployed in the traditional way

Third-generation architecture

01-arch.png
The third-generation architecture is still being refined. The diagram above briefly describes how our services are distributed across three regions:

  • Hangzhou / US West
    • Front-end services
    • Data storage
    • Image registry
    • Image processing / video transcoding workers
    • Configuration service (Redis slave)
      • Service IP addresses
      • Service domain name information
      • Service configuration information
  • Hong Kong
    • Data center (databases)
    • Middleware (third-party services)
    • Statistics system
    • Message queue
    • Configuration service (Redis master)

Front-end service

The front-end service here is mainly the browsing service, composed of CDN + SLB + (Node.js + Redis):
03-progress.jpg
After a user accesses the domain name, a dns-load-balancer first distributes the load by returning different CNAMEs pointing to the CDN; the CDN then determines the request type:

  1. Resource requests (mp4 / mp3 / jpg / png) are returned to the client directly
  2. Other requests are forwarded to the SLB, which performs second-level load balancing by weighted round-robin
  3. Requests reach a front-end server (Node.js), which fetches data from the region's internal Redis
  4. If the data is there, it is returned; if not, the request goes on to the data center, the result is cached, and then returned to the client
  5. As the CDN origin, Nginx / HAProxy reverse-proxies OSS over Alibaba Cloud's internal network to serve resources externally
    04-dep-hz-internal.jpg
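Step 2's weighted round-robin can be sketched in a few lines of Python. This is the "smooth" variant popularized by Nginx, not necessarily what Aliyun SLB implements internally; the server names and weights are made up for illustration.

```python
def smooth_wrr(servers):
    """Yield server names so that, over each full cycle, every server is
    picked in proportion to its weight (smooth weighted round-robin)."""
    current = {name: 0 for name, _ in servers}
    total = sum(weight for _, weight in servers)
    while True:
        for name, weight in servers:
            current[name] += weight       # every server earns its weight
        best = max(current, key=current.get)
        current[best] -= total            # the winner pays the total back
        yield best

# Hypothetical pool: one larger front-end node and two smaller ones.
picks = smooth_wrr([("fe-a", 2), ("fe-b", 1), ("fe-c", 1)])
one_cycle = [next(picks) for _ in range(4)]
print(one_cycle)   # fe-a appears twice, fe-b and fe-c once each
```

The smooth variant interleaves picks instead of sending bursts to the heaviest server, which matters when the back ends keep per-connection state.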
Data storage

Our storage is divided into two categories:

  • OSS: Aliyun OSS storage services for media resources such as videos and images
  • Volume: Docker Volumes built on Aliyun ossfs, for persistent data

Image / video processing

Ordinary video transcoding currently uses the Aliyun MTS transcoding service. At the same time, due to the specifics of our industry, panoramic video and images need extra processing, done by workers built from Python + Celery + C. This part is centrally managed by RabbitMQ in the Hong Kong data center: when a message arrives at RabbitMQ it is automatically dispatched to a free worker, and the result is returned through the MQ as well. (We previously tried returning results over HTTP, but the network environment is harsh and HTTP requests could fail to arrive, making the error-handling logic troublesome; so we switched to the MQ with a certain expiration time, and if no result is received, the task is re-sent.) In the current optimized architecture, the MQ has been replaced by Kafka.
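The expire-and-resend logic described above can be sketched without a real broker; here `queue.Queue` stands in for the result queue, and all names are illustrative, not our production code.

```python
import queue

def dispatch_with_retry(task, send, results, timeout_s=1.0, max_attempts=3):
    """Send `task` and wait for its result on `results` (a queue.Queue).
    If the wait times out (the result "expired"), re-send the task,
    up to max_attempts times."""
    for _attempt in range(max_attempts):
        send(task)
        try:
            return results.get(timeout=timeout_s)
        except queue.Empty:
            continue  # no result in time: assume it was lost, re-send
    raise TimeoutError(f"no result for {task!r} after {max_attempts} attempts")

# Demo: a worker stub that only answers the second delivery of a task.
results = queue.Queue()
deliveries = []

def flaky_send(task):
    deliveries.append(task)
    if len(deliveries) >= 2:          # first delivery is "lost"
        results.put(("done", task))

outcome = dispatch_with_retry("transcode:clip", flaky_send, results, timeout_s=0.1)
print(outcome, len(deliveries))       # ('done', 'transcode:clip') 2
```

The same shape maps onto Celery or a raw MQ client: the key is that the producer owns the timeout, so a lost message costs one expiry window rather than a hung request.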

In terms of memory behavior, Kafka performs far better than RabbitMQ: with a standalone RabbitMQ deployment, once the number of queues reached roughly 10K it began to stall; on an identically configured machine running Kafka, memory stayed intact during tests of around 1M tasks.

Configuration service

The configuration service is a simple Redis master / slave setup whose main job is to maintain configuration information: service IP addresses (measurements showed that DNS resolution by overseas carriers has serious problems, so we gave up domain names in favor of IPs); service configuration information such as service names and changes to the data requested by front-end services; and so on. We use Redis for the same reason as elsewhere: it maintains its own replication state, letting us give up manual intervention as far as possible. Because this part needs few resources, the master is persisted while slaves can run directly; using the Alpine image, it is only about 10 MB.
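A minimal sketch of the lookup side of such a configuration service. The key layout is a hypothetical convention; a real client would talk to the Redis slave (e.g. via redis-py), but a plain dict keeps the sketch self-contained.

```python
# Hypothetical key convention: service:<name>:<field>
CONFIG = {
    "service:frontend:ip": "203.0.113.10",    # raw IPs, not domain names:
    "service:frontend:name": "web-frontend",  # overseas DNS proved unreliable
    "service:worker:ip": "203.0.113.11",
}

def service_field(store, service, field, default=None):
    """Read one configuration field for a service, with a fallback."""
    return store.get(f"service:{service}:{field}", default)

print(service_field(CONFIG, "frontend", "ip"))          # 203.0.113.10
print(service_field(CONFIG, "db", "ip", "10.0.0.99"))   # falls back to default
```

Keeping the fallback in the client means a missing key degrades to a sane default instead of an outage, which fits the "minimize manual intervention" goal above.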

Data center

The data in this section is what is stored in databases:

  1. Aliyun RDS (as business volume grows, we may transition to DRDS)
  2. Aliyun MongoDB

Task queue / message queue

  • RabbitMQ
  • ZooKeeper cluster
  • Kafka (currently standalone; storage via ossfs)

Conventions / process

Development

Project structure:

05-prj.png

  • Dockerfile
  • src: holds the project code
  • root: holds Docker configuration information, overriding the container's internal system configuration

Code submission

  • Branches
    • dev: development branch, builds development images (local build & test)
    • test: test branch, builds online-test images
    • master: main branch, builds the latest images
  • Versions / tags
    • Rule: release-v{version}.{month}.{date}.{order}
    • Example: release-v5.12.05.02

This versioning scheme follows the Aliyun image service's automatic build rules.
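The tag rule above is strict enough to validate mechanically. A small parser (an illustrative sketch, not part of the actual build service) might look like:

```python
import re

TAG_RE = re.compile(
    r"^release-v(?P<version>\d+)\.(?P<month>\d{2})\.(?P<date>\d{2})\.(?P<order>\d{2})$"
)

def parse_release_tag(tag):
    """Split a release tag into its numeric fields, or raise on bad input."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not a release tag: {tag!r}")
    return {key: int(value) for key, value in m.groupdict().items()}

print(parse_release_tag("release-v5.12.05.02"))
# {'version': 5, 'month': 12, 'date': 5, 'order': 2}
```

Rejecting malformed tags early keeps the downstream hook (which deploys by tag) from acting on typos.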

Build

We currently have three build services in total:

  • Aliyun image service: automatically builds images for official production releases
  • CircleCI: automatic builds and tests for GitHub projects
  • DroneCI: internal builds, mainly automatic builds and tests on the internal network

After a successful build, a Webhook pushes a notification to the Web group members in BearyChat:
05-notify.png
The hook interface processes the event and returns:

  • Time
  • Name
  • Version
  • Namespace
  • Full image name

Testing

When the hook service receives the information, it decides from the tag which BearyChat group should be notified:
05-notify-group.png
After receiving the prompt, members of the test group log in to the internal test platform (built on Rancher), select the corresponding application to test, feed the results back to the product & project managers, and complete acceptance.
05-test.png

Deployment

A developer creates a tag and pushes it to the Aliyun image service; once the image build completes, the hook system calls the deployment API according to the tag for automatic deployment. (Services on Aliyun do not use the API; to control risk, they are still updated manually.)

Reflections

  1. How can we further improve the workflow?
    Build out more automation services to further reduce the cost of manual communication, e.g. when a developer pushes, automatically collect the git commit messages and send them to the test group.
  2. Others

Q&A

Q: Why don't you directly use the container service provided by Aliyun?

A: Because of our globalization goal: at the time, Aliyun's container service was not yet complete, with no nodes in the US or Hong Kong (Hong Kong still has none), so our Hong Kong data center is built with Rancher.

Q: Do you use Jenkins?

A: No. Our services are heavily Node.js and Python; we package some base images on GitHub, and projects simply reference them.

Q: Do you use Docker in production? What do you pay attention to in terms of performance? Many people in the industry have criticized Docker's networking.

A: A good question to answer (because we were just bitten by this recently...).

  1. Networking: we gave up on Overlay, one very important reason being that we could not obtain the real client IP; we also found that past a certain order of magnitude, access slowed down. Detailed test results will be published later.
  2. Performance: so far the main finding is that RabbitMQ's performance is relatively poor. We rely heavily on Node.js, and its performance in front-end services has been trustworthy; in the Phoenix incident just mentioned, the back end went down completely, but the front end showed no anomalies thanks to Redis.

Q: Docker Swarm?

A: Swarm is only being tested on our internal network, not formally deployed at scale; problems such as not being able to obtain the client IP made us pass on it.

Q: How do you collect information about container logs and business logs? What are the options?

A: We currently have two main kinds of logs. The first is the Aliyun container service log, which is exported through the API and analyzed on our private ELK stack. The other is container logs, for which we use Kafka: ELK retrieves the logs from Kafka, processes them, and feeds the results back to the statistics system.

Q: Does Grunt run during the image build or when the container starts?

A: We don't use Grunt at all. Node.js projects are packaged with Webpack: each release is compiled once, and then node_modules + static resources go into the image together.

Q: If you don't use Overlay, what do you use instead? Routing rules, or something else?

A: Our cloud services run mainly on Alibaba Cloud and AWS; on Alibaba Cloud we load-balance with SLB. One thing worth mentioning: we attach a numeric label to each application. For example, service A is labeled 80, so host port 8080 means its container port 80; service B is labeled 81, so 8180 follows the same rule.
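The labeling convention can be written down as a tiny helper (a sketch of the rule as described, not production code):

```python
def host_port(label, container_port):
    """Map a service's numeric label plus its container port to the host
    port: label 80 with container port 80 -> 8080, label 81 -> 8180."""
    return label * 100 + container_port

print(host_port(80, 80), host_port(81, 80))  # 8080 8180
```

The nice property is that a host port is self-describing: the leading digits identify the service, the trailing two digits identify the container port behind it.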

Q: Do Nginx and Kafka run in containers online? How is the performance?

A: Kafka is deployed entirely in containers; as just mentioned, a single container handles a million tasks with no pressure. As for Nginx, all CDN back-to-origin traffic now goes through an Nginx reverse proxy to OSS over the internal network, serving (n) TB of static resources, and we have not hit performance problems yet.

Q: Do you package the source code into the image? Or pull it from a storage service when the container starts?

A: Our code lives in Git; at build time we compile and package it, and the compiled code goes into the image. We considered pulling at startup, but network conditions vary too much and it is unstable; we recommend against it.
Also, to speed up builds, we layer our images: for example, for a Python application that uses PIL, we build a base Python image with PIL compiled in; the application image is then built on that base, adding its own package, and started.

Q: Are environment variables set in the image or when the container starts?

A: Both. Generally the container ships with default environment variables; if nothing is set at startup, the defaults (for a given region) apply. Environment variables set at startup have the highest priority and can override the internal ones.
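The precedence just described, image defaults first, start-time variables winning, can be sketched as follows; the variable names and values are illustrative assumptions.

```python
import os

# Defaults baked into the image (illustrative values).
IMAGE_DEFAULTS = {"REGION": "hz", "REDIS_HOST": "10.0.0.2"}

def resolve_config(environ=None):
    """Merge image defaults with the runtime environment; anything set at
    container start (the runtime environ) takes priority."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, default)
            for key, default in IMAGE_DEFAULTS.items()}

print(resolve_config({}))                      # pure image defaults
print(resolve_config({"REGION": "us-west"}))   # start-time override wins
```

This mirrors Docker's own behavior, where `docker run -e` values override `ENV` defaults from the Dockerfile.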

The above content is based on a WeChat group share on December 6, 2016. The sharer, Su Yi (Yang Heqiang), is a senior front-end engineer and the Web group's technical lead at Insta360 (Shenzhen Lan Feng Chuang Network Technology Co., Ltd., a leading brand of 360° panoramic / VR cameras), responsible for front-end technology selection and architecture design. DockOne organizes targeted technology sharing every week; interested readers are welcome to add WeChat: liyingjiesz, and leave us a message with the topics you want to hear or share.
