Deploying Spark on Kubernetes

Let's build a custom Docker image for Spark 3.4.1, designed for Spark Standalone Mode.

Dockerfile. I found the scripts in the spark-kubernetes repo on GitHub and changed the Spark and Hadoop versions.

# base image
FROM openjdk:11

# define spark and hadoop versions
ENV SPARK_VERSION=3.4.1
ENV HADOOP_VERSION=3.3.6

# download and install hadoop
RUN mkdir -p /opt && \
    cd /opt && \
    curl https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
        tar -zx hadoop-${HADOOP_VERSION}/lib/native && \
    ln -s hadoop-${HADOOP_VERSION} hadoop && \
    echo Hadoop ${HADOOP_VERSION} native libraries installed in /opt/hadoop/lib/native

# download and install spark
RUN mkdir -p /opt && \
    cd /opt && \
    curl https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz | \
        tar -zx && \
    ln -s spark-${SPARK_VERSION}-bin-hadoop3 spark && \
    echo Spark ${SPARK_VERSION} installed in /opt

# add scripts and update spark default config
ADD common.sh spark-master spark-worker /
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
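
The common.sh, spark-master, and spark-worker scripts added above come from that repo and are not reproduced here. As a rough sketch (not the repo's exact code), the two launcher scripts boil down to starting the Standalone daemons with spark-class

# /spark-master (sketch): start the Standalone master (cluster port 7077, web UI 8080)
/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master \
    --port 7077 --webui-port 8080

# /spark-worker (sketch): register with the master service; the worker UI listens on 8081
/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://spark-master:7077 --webui-port 8081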

Dockerfile for the Spark UI proxy. I used this repo on GitHub

FROM python:2.7-alpine
COPY ./spark-ui-proxy.py /

ENV SERVER_PORT=80
ENV BIND_ADDR="0.0.0.0"

EXPOSE 80

ENTRYPOINT ["python", "/spark-ui-proxy.py"]

Build the Spark image with Docker

docker build -t spark-standalone-cluster:3.4.1 .


List the image

docker image ls spark-standalone-cluster:3.4.1

Upload the Docker image to ECR
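
Pushing to Amazon ECR Public requires logging in first. Assuming the default public registry, the login looks like this (ecr-public authentication is only available in us-east-1)

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws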

docker tag spark-standalone-cluster:3.4.1 public.ecr.aws/n4rxxxxx/sparkp:341
docker push public.ecr.aws/n4rxxxxx/sparkp:341


Build the Spark UI image and upload it to ECR

docker build -t spark-ui:1.0.1 .
docker tag spark-ui:1.0.1 public.ecr.aws/n4rxxxxx/sparkui:v1
docker push public.ecr.aws/n4rxxxxx/sparkui:v1



With the Docker images in ECR, we can deploy the Spark Master, Worker, and UI on EKS.
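
All of the manifests below target the spark namespace, so create it before applying anything

kubectl create namespace spark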

Spark Master

The ReplicationController runs a single master pod. The headless service gives the pod a stable DNS name through its hostname and subdomain fields, while the spark-master service exposes port 7077 for cluster traffic and 8080 for the web UI.

kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
  namespace: spark
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      hostname: spark-master-hostname
      subdomain: spark-master-headless
      containers:
        - name: spark-master
          image: public.ecr.aws/n4rxxxxx/sparkp:341
          imagePullPolicy: Always
          command: ["/spark-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 1000m
              memory: 2Gi
          volumeMounts:
          - mountPath: "/mnt/crm"
            name: volume
            readOnly: false
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: crm-poc-pvc
---
kind: Service
apiVersion: v1
metadata:
  name: spark-master-headless
  namespace: spark
spec:
  ports:
  clusterIP: None
  selector:
    component: spark-master
---
kind: Service
apiVersion: v1
metadata:
  name: spark-master
  namespace: spark
spec:
  ports:
    - port: 7077
      targetPort: 7077
      name: spark
    - port: 8080
      targetPort: 8080
      name: http
  selector:
    component: spark-master

Create the PVC

The master, worker, and CronJob pods all mount the same claim, which is why it uses the ReadWriteMany access mode.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: crm-poc-pvc
  namespace: spark
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: azurefile-csi-premium
  resources:
    requests:
      storage: 5Gi
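
The master and worker pods stay Pending until this claim is bound, so check it before applying the controllers

kubectl -n spark get pvc crm-poc-pvc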

Spark Worker

kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-worker-controller
  namespace: spark
spec:
  replicas: 1
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: public.ecr.aws/n4rxxxxx/sparkp:341
          imagePullPolicy: Always
          command: ["/spark-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 2000m
              memory: 2Gi
          volumeMounts:
          - mountPath: "/mnt/crm"
            name: volume
            readOnly: false
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: crm-poc-pvc
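
The worker controller starts with a single replica; more workers can be added later without editing the manifest, for example

kubectl -n spark scale rc spark-worker-controller --replicas=3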

Spark UI 

kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-ui-proxy-controller
  namespace: spark
spec:
  replicas: 1
  selector:
    component: spark-ui-proxy
  template:
    metadata:
      labels:
        component: spark-ui-proxy
    spec:
      containers:
        - name: spark-ui-proxy
          image: public.ecr.aws/n4rxxxxx/sparkui:v1
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
          args:
            - spark-master:8080
          livenessProbe:
              httpGet:
                path: /
                port: 80
              initialDelaySeconds: 120
              timeoutSeconds: 5
---
kind: Service
apiVersion: v1
metadata:
  name: spark-ui-proxy
  namespace: spark
spec:
  ports:
    - port: 80
      targetPort: 80
  selector:
    component: spark-ui-proxy
  type: LoadBalancer
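
Because the spark-ui-proxy service is of type LoadBalancer, the UI is reachable on port 80 at whatever external address the cloud provider assigns, which shows up under EXTERNAL-IP

kubectl -n spark get svc spark-ui-proxy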

To trigger the Spark job automatically, we create a CronJob; it mounts the PVC so it can retrieve the source code. Before setting up the CronJob, we need to create a Role, RoleBinding, and ServiceAccount that grant the CronJob pod access to the Spark pods in EKS.

Role
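
The Role manifest is not reproduced here. A minimal sketch of what kubectl exec against svc/spark-master needs (the name spark-exec-role is an assumption)

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-exec-role          # assumed name, referenced by the RoleBinding below
  namespace: spark
rules:
  # resolve svc/spark-master to its backing pod
  - apiGroups: [""]
    resources: ["services", "pods"]
    verbs: ["get", "list"]
  # open the exec session on that pod
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]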

RoleBinding
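
The RoleBinding ties the Role above to the ServiceAccount used by the CronJob (again a sketch; the binding name is an assumption)

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-exec-rolebinding   # assumed name
  namespace: spark
subjects:
  # the ServiceAccount referenced by serviceAccountName in the CronJob
  - kind: ServiceAccount
    name: job-runner
    namespace: spark
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-exec-role          # must match the Role name above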

Service Account
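
The ServiceAccount just needs to exist in the spark namespace under the name the CronJob references

apiVersion: v1
kind: ServiceAccount
metadata:
  name: job-runner               # referenced by serviceAccountName in the CronJob below
  namespace: spark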

Cronjob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-encrypt-cron-job
  namespace: spark
spec:
  schedule: "*/1 * * * *" # Runs every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: job-runner
          containers:
          - name: cron-job-container
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - kubectl -n spark exec -it svc/spark-master -- sh /mnt/crm/aks_test/check_shuup_aks_encrypt.sh
            #command: ["echo", "Hello from the CronJob! This is a sample command."]
            volumeMounts:
            - mountPath: "/mnt/crm"
              name: volume
              readOnly: false
          volumes:
            - name: volume
              persistentVolumeClaim:
                claimName: crm-poc-pvc
          restartPolicy: OnFailure
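
Once applied, the CronJob spawns a Job every minute. Recent runs and their output can be checked with kubectl (the job name is a placeholder)

kubectl -n spark get cronjob spark-encrypt-cron-job
kubectl -n spark get jobs
kubectl -n spark logs job/<job-name>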

Shell script called by the CronJob. It checks whether the Spark job is already running and only triggers the shell script that runs the Spark Streaming job if it is not

ps -ef|grep pyspark_test_aks_confluent_encrypt|grep -v grep

STATUS=$?
current_time=$(date +"%Y-%m-%d %H:%M:%S")
echo "Current date and time: $current_time"

if [ $STATUS -ne 0 ] ; then
  echo $current_time ' spark aks job is not running-- > ' > /mnt/crm/aks_test/log/spark_aks_not_running.log
  sh /mnt/crm/aks_test/shuup_aks_confluent_encrypt.sh
else
  echo $current_time ' spark aks job is running-- > ' > /mnt/crm/aks_test/log/spark_aks_running.log
fi

Shell script for the Spark Streaming job. It submits the job to the Standalone master with spark-submit

spark-submit --name "sync_encrypt_aks" --master spark://spark-master:7077 \
--py-files "/mnt/crm/aks_test/db_connect.py,/mnt/crm/aks_test/shuup_pg.py" \
--executor-memory 2G \
--total-executor-cores 2 \
--packages org.postgresql:postgresql:42.2.14,org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1 \
/mnt/crm/aks_test/pyspark_test_aks_confluent_encrypt.py > /mnt/crm/aks_test/tmpoutfile1 2>&1


Services created for the Spark Master and UI

Pods created for the Spark Master, Worker, and UI

Spark UI