Server Scale Up - Production Server Spec for Special Events

Zoom session explaining each item in this doc:

https://us06web.zoom.us/rec/share/ezgXk4L-p5-NLJ1n_qvA59SwIHE942TgMipWcy06qYc8rWMrv8xyFr_PjNAmAyUz.HTLAkP2Xcegf2BP6 Passcode: @&Na$5f6

Update Event start time and end time

Update Redis max connection automatically

FIXME: Log in to any sidekiq-default pod, for example:

kubectl exec hungryhub-sidekiq-default-785c4ffbf6-68lzv -n hungryhub -c service -ti -- bash

Execute bin/rc inside the pod, then run these commands to disable some features that slow down the servers:

AdminSetting.event_start_time = "#{Time.thai_time.to_date} 09:00"
AdminSetting.event_end_time = "#{Time.thai_time.to_date} 11:00"
AdminSetting.redis_lru_max_connection = 300
AdminSetting.redis_lru_timeout = 3
Flipper.enable :reviews_v2

Then revert them when the event is over:

AdminSetting.redis_lru_max_connection = 100
AdminSetting.redis_lru_timeout = 3
Flipper.disable :reviews_v2

We update the following configuration in Jenkins.

URL: jenkins.hungryhub.com (ask Saiqul for the account)

The following settings may already be outdated.

Sidekiq

  • normal days: cache.m6g.large
  • event days: cache.m6g.xlarge (not an r instance)

LRU

  • normal days: m6g.xlarge
  • event days: m6g.8xlarge

Caution:

Flush Redis before we downgrade it: the data currently in memory is larger than what the xlarge machine can hold, so we need to clear the memory first.

redis-cli -h <redis-host> flushall async

Log in to any hh-server pod and install redis-cli first.
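The flush step can be wrapped in a small script with a dry-run guard so the command can be previewed before it runs. This is a sketch: the host name is a placeholder, and the apt-based install is an assumption about the pod image.

```shell
# Sketch: flush a Redis host before downgrading it.
# Assumes a Debian-based pod image; install the client first if missing:
#   apt-get update && apt-get install -y redis-tools
flush_redis() {
  local host="$1"
  local cmd="redis-cli -h $host flushall async"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"   # print what would run instead of running it
  else
    $cmd
  fi
}

# Preview the command first, then rerun without DRY_RUN once it looks right:
DRY_RUN=1 flush_redis my-redis-host.example.cache.amazonaws.com
```

The `async` flag makes Redis free the old dataset in the background, so the flush does not block other clients.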

Memcached

  • normal days: 2
  • event days: 20

Upgrade RDS

Booking production:

normal days:

  • primary: r5.xlarge
  • Reader:
    • auto scaling 40%
    • 2 replicas

event days:

  • primary: r5.xlarge
  • Reader:
    • auto scaling 25%
    • 3 replicas

Change the DB auto scaling settings accordingly.

Huginn:

Upgrade the hh-production-db instance from db.m5.2xlarge to db.m5.12xlarge.

Update Auto Scaling Group

Use 50% Spot, 50% On-Demand instances.

Update Node groups

  • normal days: min 2
  • event days: min 20

HH Server Pods

normal days:

  • min 10
  • max 1000

event days:

  • min 750
  • max 1500

Sidekiq Critical Pods

normal days:

  • min 2
  • max 20

event days:

  • min 5
  • max 20

Scale up Huginn server pods

  • normal days: 2
  • event days: 5

Scale up Huginn DB

  • normal days: t3.small
  • event days: t3.large

Server scale up/down using Jenkins

When the Jenkins server is down, we can scale up/down manually; see this doc: Private (https://app.clickup.com/9003122396/docs/8ca1fpw-14210/8ca1fpw-14290)

Open https://jenkins.hungryhub.com/job/scale-infra/ and click the "Build with Parameters" or "Schedule Build" button in the sidebar.

For the scheduled build, it uses Bangkok timezone.

Components

There are a few components that we can scale up.

SSM

We store environment variables in SSM. Some variables have different values for normal events and big events.

RDS

This is to change all RDS DBs. If we scale up/down during low-traffic hours, we can use Jenkins. Otherwise, hh-server will be very slow because we haven't implemented a Blue-Green Deployment strategy for our RDS Aurora yet. Check the scripts/scale_down_rds_aurora.sh script in our terraform code to scale down RDS during busy hours.

Redis

This is to change all Redis instances. If we scale up/down during low-traffic hours, we can use Jenkins. Otherwise, hh-server will be very slow because we don't use a Redis cluster strategy yet. Check the scripts/manual_blue_green_deployment_for_inventory_redis.sh and scripts/manual_blue_green_deployment_for_app_lru_redis.sh scripts in our terraform code to scale down the Redis instances during busy hours.

Memcache

This is to scale our memcached ElastiCache instances up/down. So far we just add or remove memcached instances to handle the event.

EKS, Hungryhub-apps, EKS-services

These components are related to Kubernetes. So far there has been no issue with changing their infra spec at any time.

Auto Approved

If the value is "No", then we need to confirm via the button before Jenkins executes the pipeline.

Environment

dev is the sandbox AWS account (hhstaging.dev)

Event Types

We have 3 types:

  1. big
  2. normal
  3. no event

If normal_event is false and big_event is false, it means we use the no-event config.
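The selection rule above can be sketched like this (the variable names are illustrative, not the actual Jenkins parameter names):

```shell
# Sketch of the event-type decision: big wins, then normal, else no event.
event_type() {
  local big_event="$1" normal_event="$2"
  if [ "$big_event" = "true" ]; then
    echo "big"
  elif [ "$normal_event" = "true" ]; then
    echo "normal"
  else
    echo "no event"   # both flags false -> use the no-event config
  fi
}

event_type false false   # prints "no event"
```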

A normal event is expected to handle 20k requests per minute. A big event is expected to handle 150k requests per minute (we already tested this in the 10.10 2023 flash sale event).
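For intuition, those per-minute targets convert to per-second rates like so:

```shell
# Convert requests/minute targets to requests/second (integer division).
rpm_to_rps() { echo $(( $1 / 60 )); }

echo "normal event: $(rpm_to_rps 20000) rps"    # 333 rps
echo "big event:    $(rpm_to_rps 150000) rps"   # 2500 rps
```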

Date

We installed the SMTP plugin to notify us whenever a pipeline succeeds or fails. This date is added to the email just to provide more context, because it is confusing when we get many emails from Jenkins and don't know which pipeline each one is for. The date field is especially useful for scheduled pipelines.

Current issues

The hh-server app depends on the DB, Redis, and Memcached. Changing the AWS RDS or Redis instance type causes connection issues in the hh-server app, so typically we should create a new Redis/DB instance and let hh-server connect to the new instance. Once hh-server is connected to the new instance, we can modify the original instance, then switch the connection back to the original instance when the modification is done.

Devops Team

Host list

  • CI server: 78.47.203.51
  • VPS staging: 62.55.33.126
  • data-team-server: 135.181.80.79
  • fast-api: 18.142.150.51
  • hh-menu: 65.108.213.91
  • madison: 5.161.124.128

CI CD Runners

We currently have two CI/CD systems:

  1. GitLab
  2. GitHub

Both systems use the same runner; its IP address is 78.47.203.51. Once you have been granted access, you can log in with:
ssh root@78.47.203.51

What to do when GitLab CI doesn't work

Usually it is caused by the storage being full; you can clear it with:

docker system prune -a
docker volume prune
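Before pruning, it can help to confirm that disk space really is the problem (`df --output` is GNU coreutils, which the Linux runner host should have):

```shell
# Print root filesystem usage as a bare percentage number (e.g. "87").
disk_used_pct() {
  df --output=pcent / | tail -1 | tr -dc '0-9'
}
disk_used_pct
```

`docker system df` also shows how much of that space is held by images, containers, and volumes.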

What to do when GitHub Actions doesn't work

Usually the runner itself has stopped working; restart it:

cd /home/github_runner
./svc.sh start # if this doesn't work, try run.sh below
./run.sh

Validate it by checking the running processes:

ps aux | grep github_runn
ps aux | grep run.sh

How to give someone access to kube cluster

  1. Create the IAM user from the root account and give them the EKS access role (accessKubernetesApi) (group kube-cluster-only)
  2. Update the aws-auth configmap on the cluster and add the IAM user (it should be added to terraform too):
kubectl edit configmap aws-auth -n kube-system
  mapUsers: |
    - "userarn": "arn:aws:iam::123456789012:user/eks-user"
      "username": "eks-user"
  3. Then the user should update their kubeconfig:
aws eks update-kubeconfig --name <cluster>

Host List (Our IP Addresses) - Vendors have to whitelist these IP addresses

  • Prod
  • AWS Sandbox
    • For sandbox the IP is dynamic to reduce cost
  • CI server: 78.47.203.51
  • fast-api: 209.97.160.103
  • DigitalOcean Kubernetes Cluster
    • do-staging: 159.223.70.101
    • do-production: 152.42.213.61

CI CD system

We currently have 3 CI/CD systems:

  1. AWS CodePipeline
    1. This CI/CD builds the Docker images for the Kubernetes cluster and other services
  2. GitHub runner self-hosted on a Mac Mini
    1. This CI/CD builds the iOS and Android apps
  3. Jenkins
    1. This CI/CD builds the infra using Terraform IaC


When the Terraform remote state lock fails in the pipeline

  1. Delete all DynamoDB lock items
  2. Re-run the Terraform pipeline on CodePipeline
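Instead of deleting items by hand in the console, the lock can be cleared from the CLI. This is a sketch: the table and lock names below are placeholders, and Terraform's own `terraform force-unlock <LOCK_ID>` is usually the safer first resort.

```shell
# Build the delete command for a stale Terraform lock entry.
# Terraform's S3 backend keys its DynamoDB lock items by the LockID attribute.
build_unlock_cmd() {
  local table="$1" lock_id="$2"
  printf 'aws dynamodb delete-item --table-name %s --key '\''{"LockID": {"S": "%s"}}'\''\n' \
    "$table" "$lock_id"
}

# Placeholder names; copy the real LockID from the pipeline error message.
build_unlock_cmd terraform-locks my-bucket/path/to/terraform.tfstate
```

Printing the command first makes it easy to double-check the LockID before actually deleting it.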

Robusta

  • robusta prod
    • channel #monitoring-k8s-prod
  • robusta sandbox
    • channel #monitoring-k8s
  • robusta digitalocean
    • channel #monitoring-k8s

Nameserver for AWS Route53 and cloudflare

AWS Route53

Architectures

HH-Server

Recommendation Service

domain: https://rec.hungryhub.com/
repo: https://github.com/hungryhub-team/rec-service
diagram:

recommendation Diagram.drawio

GMT20221117-103412_Recording.m4a

https://t9003122396.p.clickup-attachments.com/t9003122396/74eaf3bb-8bfc-472b-abb6-de7b9aad8d7a/GMT20221117-103412_Recording_1920x1080.mp4?open=true

HH Menu

GMT20220428-081155_Recording.m4a

https://t9003122396.p.clickup-attachments.com/t9003122396/9ecb1ad1-bd48-481c-8bd8-1b3c5e1117f2/GMT20220428-081155_Recording_1440x900.mp4?open=true

AOA WebView

Currently, AOA still wants to use the old UI, so we can't merge it into the master because the master already uses the new UI. To address this, we have created a new production deployment called "aoa-master". We will implement the new UI once AOA agrees to adopt it in the future.

Staging

Domain: https://hh-aoa.netlify.app
Repo: https://github.com/hungryhub-team/hh-web-new-ui/tree/feat/aoa-integration

Production

Domain: https://hh-aoa-production.netlify.app
Repo: https://github.com/hungryhub-team/hh-web-new-ui/tree/aoa-master

Plumber

Repo: https://github.com/hungryhub-team/plumbers
Domain: http://plumber.utility for every cluster
Docs: https://github.com/hungryhub-team/plumbers/blob/main/docs/index.md

Reserve With Google

There are several errors that might happen when uploading the feed to Reserve with Google.

locality

Usually this happens because Google cannot resolve a city or state from the latitude and longitude we sent. For example, for restaurant 1836 (The Coffee Club Day Inn (Phuket)), the latitude and longitude in the admin dashboard looked wrong: opening Google Maps with those coordinates shows the sea instead of Phuket. https://www.google.com/maps/@7.8942718,98.7412999276,14z?entry=ttu

If this happens, we need to update the restaurant data in the admin dashboard and set the correct latitude and longitude.

country

This happens when the wrong phone number country code is set in the restaurant settings.

How to Match Location

After we successfully upload our restaurant feed, the data will be available under Inventory data:

But there are cases where a restaurant doesn't match any Google Maps data, so we need to fix it manually: click Match now, select the right map data, and then click Save.

Backend Team

Private (https://app.clickup.com/9003122396/docs/8ca1fpw-12905/8ca1fpw-34296) Private (https://app.clickup.com/9003122396/docs/8ca1fpw-12905/8ca1fpw-34316)

Database configuration

The Rails app uses config/database.yml to store the DB connection config. On the production environment, database.yml is generated by buildspec.yaml:

      - aws ssm get-parameter --region $AWS_REGION --name "/$Environment/hh_server_config_database" --with-decryption --output text --query Parameter.Value > config/database.yml

This means the CodePipeline script fetches the config from SSM Parameter Store and writes database.yml. Before starting the pod (see the Kubernetes pod docs if unfamiliar), we modify the database configuration in the container's args (manifest/prod/hungryhub-server.yml): server-entrypoint.sh is a script generated dynamically in manifest/prod/hungryhub-server.yml, and it contains a sed command that updates the DB pool. Original file:

production:
  adapter: <%= ENV.fetch('DB_ADAPTER') { 'mysql2' } %>
  encoding: utf8mb4
  charset: utf8mb4
  collation: utf8mb4_unicode_ci
  username: <%= ENV['MYSQL_USER'] || raise('undefined db username') %>
  password: <%= ENV['MYSQL_PASSWORD'] || raise('undefined db password') %>
  database: booking_production
  host: <%= ENV['MYSQL_SERVER_HOST'] %>
  port: <%= ENV['MYSQL_PORT'] %>
  pool: 20
  makara:
    id: mysql
    blacklist_duration: 5
    primary_ttl: 5
    primary_strategy: round_robin
    sticky: true
    connections:
      - role: master
        host: ${master}
        url: 'mysql2://${username}:${password}@${master}:3306/booking_production'
        disable_blacklist: true
      - role: slave
        host: ${slave1}
        url: 'mysql2://${username}:${password}@${slave1}:3306/booking_production'

modified file:

production:
  adapter: <%= ENV.fetch('DB_ADAPTER') { 'mysql2' } %>
  encoding: utf8mb4
  charset: utf8mb4
  collation: utf8mb4_unicode_ci
  username: <%= ENV['MYSQL_USER'] || raise('undefined db username') %>
  password: <%= ENV['MYSQL_PASSWORD'] || raise('undefined db password') %>
  database: booking_production
  host: <%= ENV['MYSQL_SERVER_HOST'] %>
  port: <%= ENV['MYSQL_PORT'] %>
  pool: 1
  makara:
    id: mysql
    blacklist_duration: 5
    primary_ttl: 5
    primary_strategy: round_robin
    sticky: true
    connections:
      - role: master
        host: <%= ENV['MYSQL_MASTER_HOST'] %>
        url: <%= ENV['MYSQL_MASTER_URL'] %>
        disable_blacklist: true
      - role: slave
        host: <%= ENV['MYSQL_SLAVE_HOST'] %>
        url: <%= ENV['MYSQL_SLAVE_URL'] %>
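The sed step described above can be demonstrated on a sample fragment (a sketch; the real server-entrypoint.sh and its paths may differ):

```shell
# Recreate the pool-override edit on a throwaway copy of database.yml.
tmp=$(mktemp)
printf 'production:\n  pool: 20\n' > "$tmp"

# The entrypoint is described as rewriting the pool size with sed:
sed -i 's/pool: 20/pool: 1/' "$tmp"

cat "$tmp"   # the fragment now reads "pool: 1"
```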

Web server

We use a Passenger web server to serve any HTTP requests passenger has passenger-status command to check the running processes: The screenshot above shows that we use one Passenger process to respond to all incoming requests. Adding more processes will consume more CPU and Memory, so we need to calculate the CPU and Memory usage first to decide how many processes we can use in a single server. If the server has a lot of CPU and Memory, then we can use more processes.