Question 1

What is DevOps and why do companies adopt it?

Accepted Answer

DevOps is a culture + set of practices that merges software DEVelopment and IT OPerationS into one continuous pipeline — code, build, test, deploy, monitor, repeat. Goals: ship faster, fail less, recover quicker. Adoption is driven by the four DORA metrics: deployment frequency, lead time for changes, mean time to recovery, and change failure rate. Elite teams deploy multiple times per day with under 1 hour MTTR — DevOps is what makes that possible.

💡 Real-life example

🚀 Amazon is widely cited as deploying code every 11.7 seconds on average. Without DevOps (automation, CI/CD, monitoring), that pace would be practically impossible.

Question 2

What's the difference between DevOps, Agile, and SRE?

Accepted Answer

Agile is about HOW you BUILD software: small iterations, customer feedback, working software over documentation. DevOps is about HOW you DELIVER software: bridging dev and ops to ship continuously. SRE (Site Reliability Engineering, Google's invention) is DevOps with engineering rigor: formal SLOs, error budgets, blameless postmortems, and a hard cap on time spent on toil. Google's shorthand is 'class SRE implements DevOps': SRE is one disciplined way to do DevOps.

💡 Real-life example

💡 Agile gets the feature designed in 2 weeks. DevOps ships it to production by Friday. SRE makes sure it stays up at 99.95% with documented error budgets.

Question 3

What are the key phases of the DevOps lifecycle?

Accepted Answer

Plan → Code → Build → Test → Release → Deploy → Operate → Monitor → back to Plan. It's a continuous loop, not a line. Each phase has dedicated tools: Jira for plan, Git for code, Jenkins/GitHub Actions for build, Selenium/JUnit for test, Docker registry for release, Kubernetes for deploy, Datadog/Prometheus for operate and monitor. Monitoring closes the loop — alerts and metrics feed the next planning cycle.

💡 Real-life example

♾️ Often drawn as an infinity symbol — emphasizes there's no 'done', only iteration. Toyota's lean production line was the inspiration.

Question 4

What KPIs do you measure DevOps success by?

Accepted Answer

The four DORA metrics, made famous by the State of DevOps Report: (1) Deployment Frequency — how often you push to production. (2) Lead Time for Changes — commit to production. (3) Mean Time to Recovery (MTTR) — how fast you recover from an incident. (4) Change Failure Rate — % of deployments that cause an incident. Elite performers: deploy multiple times/day, lead time <1 hour, MTTR <1 hour, change failure rate 0-15%.

💡 Real-life example

📊 Netflix: thousands of deployments per day, MTTR measured in minutes. Your bank's monthly batch deploy with 8-hour rollbacks? Bottom-tier DORA.
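
The four metrics can be computed from a plain deployment log. The sketch below is a minimal illustration, assuming a hypothetical `Deployment` record with commit, deploy, and restore timestamps; real teams usually pull these values from their CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, for illustration only.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause an incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute the four DORA metrics for a non-empty list of deployments."""
    failures = [d for d in deploys if d.failed]
    lead_hours = [(d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "lead_time_hours": mean(lead_hours),
        "mttr_hours": mean(restore_hours) if restore_hours else 0.0,
        "change_failure_rate": len(failures) / len(deploys),
    }


t0 = datetime(2024, 1, 1, 9, 0)
history = [
    Deployment(t0, t0 + timedelta(hours=1)),
    Deployment(t0, t0 + timedelta(hours=3), failed=True,
               restored_at=t0 + timedelta(hours=3, minutes=30)),
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=2)),
]
metrics = dora_metrics(history, period_days=2)
print(metrics)
```

With this sample log, one of four deployments failed, so the change failure rate is 25%, and the single incident took 30 minutes to restore.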

Question 5

What is DevSecOps and how does it differ from DevOps?

Accepted Answer

DevSecOps integrates SECURITY into every phase of DevOps — not as a final audit gate, but baked into code, build, deploy and runtime. Practices: SAST (static analysis), DAST (dynamic analysis), SBOM (software bill of materials), secret scanning in CI, container image scanning, runtime policies (OPA/Gatekeeper). Standard DevOps treats security as 'someone else's problem'; DevSecOps says everyone owns it. Shift-LEFT = catch vulnerabilities in code, not in production.

💡 Real-life example

🔒 A Snyk scan in CI catches a critical CVE in a dependency before merge. Standard DevOps would catch it weeks later in pen-test; DevSecOps catches it in the PR.

Question 6

What is CI/CD and how does it work?

Accepted Answer

CI (Continuous Integration) = developers merge code into a shared branch many times per day, and every merge automatically runs build + tests. CD (Continuous Delivery/Deployment) = every successful build is automatically released to a staging or production environment. The pipeline: git push → CI server detects → runs lint/tests/build → packages artifact (Docker image) → deploys to environment → runs smoke tests. Goal: catch bugs in minutes, not weeks.

💡 Real-life example

🏭 A factory conveyor belt vs a once-a-year manual assembly. CI/CD catches the defective part before it leaves the factory floor.

Question 7

What's the difference between Continuous Delivery and Continuous Deployment?

Accepted Answer

Continuous Delivery: every change passes through the pipeline and is READY to release — but a human clicks 'deploy'. Continuous Deployment: every change that passes the pipeline goes to production AUTOMATICALLY — no human gate. Most enterprises do Delivery (regulated industries, financial systems). Pure web companies (Etsy, Netflix) do Deployment. The harder one is Deployment — it requires extreme test coverage and confident rollback automation.

💡 Real-life example

🚦 Delivery: the train is at the platform, conductor waves it on. Deployment: the train auto-departs every 5 minutes if signals are green.

Question 8

What is Jenkins and what is a Jenkinsfile?

Accepted Answer

Jenkins is an open-source automation server, one of the earliest widely adopted CI/CD tools and still everywhere. A Jenkinsfile is a text file (Groovy DSL) committed to your repo that defines the entire pipeline as code: stages (build, test, deploy), agents, environment variables, and post-actions. Declarative syntax is preferred for new projects (cleaner). Scripted syntax exists for complex flows. Pipeline-as-code beats clicking through the Jenkins UI: version-controlled, reviewable, repeatable.

💡 Real-life example

📜 A short Jenkinsfile can encode five steps: build with Maven → run tests → build Docker image → deploy to staging → notify Slack on failure. All in Git, history forever.

Question 9

What are the most popular CI/CD tools and when do you pick which?

Accepted Answer

Jenkins: most flexible, on-prem friendly, plugin ecosystem; downside: maintenance overhead. GitHub Actions: native to GitHub, zero infra, great for OSS and GitHub-hosted code. GitLab CI: built-in if you use GitLab, fast and integrated. CircleCI: fast, great Docker support, good cloud option. ArgoCD: GitOps for Kubernetes specifically. AWS CodePipeline: when you're all-in on AWS. Pick by where your code lives + how much infrastructure you want to own.

💡 Real-life example

🛠️ Startup on GitHub? GitHub Actions, done in 5 minutes. Big enterprise with strict on-prem? Jenkins. Kubernetes-native shop? ArgoCD for GitOps.

Question 10

What is a build pipeline and what are its typical stages?

Accepted Answer

A build pipeline is the sequence of automated stages every code change passes through. Common stages: (1) Source — fetch latest code. (2) Lint — code style. (3) Build — compile/transpile, produce artifact. (4) Unit Tests — fast, isolated. (5) Integration Tests — actual DB/services. (6) Security Scan — SAST + dependency CVE check. (7) Package — Docker image. (8) Publish — push to registry. (9) Deploy — staging/prod. (10) Smoke Test — verify deploy. Each stage gates the next; failure halts the pipeline.

💡 Real-life example

🏭 GitHub Actions workflow with 8 parallel jobs — code change to production in 12 minutes. Without parallelism, the same flow runs in 40+ minutes.
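
The "each stage gates the next" rule can be sketched in a few lines. This is a toy runner, not how any particular CI server is implemented; the stage names and the lambda steps are placeholders chosen for illustration.

```python
from typing import Callable

# A stage is a (name, step) pair; the step returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure. Returns a result log."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # later stages never run
    return log


stages = [
    ("lint", lambda: True),
    ("build", lambda: True),
    ("unit-tests", lambda: False),  # simulated failure
    ("deploy", lambda: True),
]
result = run_pipeline(stages)
print(result)
```

Because the unit tests fail, the deploy stage is never reached, which is exactly the protection the gating provides.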

Question 11

Explain Jenkins master-slave (controller-agent) architecture.

Accepted Answer

Jenkins MASTER (now called Controller) orchestrates — schedules jobs, serves UI, stores config. Jenkins AGENTS (slaves) execute the actual build/test work on dedicated machines. The controller distributes jobs to agents based on labels (linux, windows, gpu). Benefits: parallel builds, isolation (one bad job can't poison the controller), specialized agents (run iOS builds on Mac agents, Linux builds elsewhere). Modern best practice: ephemeral agents in Kubernetes pods — spin up for one job, destroyed after.

💡 Real-life example

🎯 100 PRs hit at once. Controller schedules across 50 Kubernetes-based ephemeral agents — each PR builds in parallel. Without agents, they'd queue serially for hours.

Question 12

Explain Docker architecture in one breath.

Accepted Answer

Docker has three pieces: (1) Docker Client — the `docker` CLI you type into. (2) Docker Daemon (dockerd) — the background service that builds images and runs containers. (3) Docker Registry — stores and distributes images (Docker Hub, ECR, GCR). Workflow: you write a Dockerfile → `docker build` creates an image (read-only template) → `docker run` starts a container (running instance) → `docker push` uploads the image to a registry → another machine `docker pull`s it and runs.

💡 Real-life example

🐳 Dockerfile = recipe. Image = ready-to-bake frozen meal. Container = the meal being eaten. Registry = the freezer aisle at the supermarket.

Question 13

What's the difference between a Docker image and a container?

Accepted Answer

Image = an immutable, layered SNAPSHOT of a filesystem + metadata (entrypoint, env, exposed ports). Built once, doesn't change. Container = a RUNNING instance of that image, with its own writable layer, network namespace, and processes. You can run 100 containers from the same image; each has independent state. Stopping a container keeps its writable layer; removing it (`docker rm`) discards that state unless it lives in a mounted volume. The image is the class; the container is the instance.

💡 Real-life example

📀 Image is a CD-ROM. Container is the game session — your save state lives in the container's writable layer (or a mounted volume), not on the CD.
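
The relationship between a read-only image and per-container writable layers can be modelled with Python's `ChainMap`. This is only an analogy, assuming a made-up two-file image; Docker's real storage drivers work on filesystem layers, not dictionaries.

```python
from collections import ChainMap
from types import MappingProxyType

# The image: a read-only mapping of path -> content (illustrative paths).
image = MappingProxyType({"/app/server.js": "v1", "/etc/config": "default"})

# Each container gets its own empty writable layer stacked on the shared image.
container_1 = ChainMap({}, image)
container_2 = ChainMap({}, image)

# Writes land in the container's writable layer only.
container_1["/etc/config"] = "edited in container 1"

print(container_1["/etc/config"])  # the container's own copy
print(container_2["/etc/config"])  # still the image's value
print(image["/etc/config"])        # the image itself never changes

# "Removing" the container and starting a fresh one discards the writable layer.
container_1 = ChainMap({}, image)
print(container_1["/etc/config"])
```

Reads fall through to the image when the writable layer has no entry, which mirrors how a container sees unmodified image files.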

Question 14

How do you reduce Docker image size in production?

Accepted Answer

Five proven techniques: (1) Use a slim base (alpine, distroless, scratch), not `ubuntu:latest`. (2) Multi-stage builds: build with the full toolchain, copy only artifacts into a minimal final image. (3) Combine RUN commands so that files deleted later are not kept in an earlier layer. (4) `.dockerignore` to skip node_modules, .git, etc. (5) Order layers from least- to most-changing to maximize cache hits (this speeds up builds and pulls rather than shrinking the image). A naive Node.js image can reach about 1 GB; multi-stage plus a minimal base can bring it down to the 50-150 MB range.

💡 Real-life example

🏋️ A 1 GB image takes 15 seconds to pull on each pod restart. At 50 MB, it's <1 second. Multiply across 1000 pods rolling out — massive deploy speedup.

Question 15

What is a Dockerfile and what are the key instructions?

Accepted Answer

A Dockerfile is a text recipe for building a Docker image. Key instructions: FROM (base image), WORKDIR (set working dir), COPY/ADD (copy files in), RUN (execute commands at build time, creating a new layer), ENV (set env vars), EXPOSE (document port), CMD (default command, easy to override), ENTRYPOINT (the executable, harder to override), VOLUME (mount points), USER (drop root for security). Each instruction is a CACHED build step (RUN, COPY, and ADD add filesystem layers; the rest only change metadata), so order matters for cache efficiency.

💡 Real-life example

📜 A 6-line Dockerfile: FROM node:20-alpine, WORKDIR /app, COPY package*.json ./, RUN npm ci, COPY . ., CMD ["node","server.js"]. Produces a runnable image in 30 seconds.

Question 16

What's the difference between CMD and ENTRYPOINT in Dockerfile?

Accepted Answer

ENTRYPOINT is the EXECUTABLE that always runs when the container starts. CMD provides DEFAULT ARGUMENTS to that executable (or the executable itself if no ENTRYPOINT). Together they form the actual command. ENTRYPOINT is hard to override (need `--entrypoint` flag); CMD is easy to override (pass args after the image name). Use ENTRYPOINT for 'this container IS X'; use CMD for 'X with these defaults'.

💡 Real-life example

🧰 ENTRYPOINT ["python"] + CMD ["app.py"] → `docker run image` runs `python app.py`. `docker run image other.py` runs `python other.py`. Container always runs Python; only the script changes.
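
How ENTRYPOINT, CMD, and `docker run` arguments combine can be written as a small function. This is a simplified model of the exec (JSON array) form only; `effective_command` is a name invented for this sketch, not part of Docker.

```python
def effective_command(entrypoint, cmd, run_args=(), entrypoint_override=None):
    """Return the argv a container would start with (exec form, simplified)."""
    if entrypoint_override is not None:
        # --entrypoint replaces ENTRYPOINT and also discards the image's default CMD.
        return [entrypoint_override, *run_args]
    # Arguments after the image name replace CMD; otherwise CMD supplies the defaults.
    args = list(run_args) if run_args else list(cmd)
    return [*entrypoint, *args]


default_run = effective_command(["python"], ["app.py"])
other_script = effective_command(["python"], ["app.py"], run_args=["other.py"])
shell_run = effective_command(["python"], ["app.py"], run_args=["-c", "ls"],
                              entrypoint_override="sh")
print(default_run, other_script, shell_run)
```

The three calls correspond to `docker run image`, `docker run image other.py`, and `docker run --entrypoint sh image -c ls`.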

Question 17

What is Docker Compose and when do you use it?

Accepted Answer

Docker Compose defines and runs MULTI-CONTAINER apps via a `docker-compose.yml` file. Specify all services (web, db, redis, cache) with their networks, volumes, env vars in ONE file. `docker compose up` brings the whole stack online. Mostly used for LOCAL DEVELOPMENT and integration tests where you need a full app stack on your laptop. NOT for production — for that, use Kubernetes, ECS, or Docker Swarm.

💡 Real-life example

🎼 A local dev compose: postgres + redis + node-app + nginx all wired and started with one command. Without Compose, you'd manually `docker run` 4 containers, link them, manage their state.

Question 18

What is Kubernetes and why do teams use it?

Accepted Answer

Kubernetes (k8s) is an open-source container orchestrator — it schedules containers across a cluster of machines, restarts them when they crash, scales them up/down based on load, and routes traffic to them. You describe the DESIRED state in YAML (10 replicas of nginx, exposed via this service) and k8s makes it true and keeps it true. Replaces the manual work of 'SSH in, start container, set up load balancer, configure restarts'.

💡 Real-life example

🎯 Like an air-traffic controller for containers — knows where each is, where it should land, redirects automatically when a runway (node) goes down.

Question 19

What are Pods, Deployments, and Services in Kubernetes?

Accepted Answer

Pod = the smallest deployable unit — one or more containers that share network and storage. You almost never deploy Pods directly. Deployment = a controller that manages a SET of identical pods — handles rolling updates, rollbacks, scaling. ReplicaSet sits underneath. Service = a stable network endpoint for a group of pods (pods are ephemeral, IPs change). Service routes traffic to whichever pods match its selector. Together: write a Deployment, expose it via a Service.

💡 Real-life example

🏘️ Pod = one apartment. Deployment = the building manager keeping N apartments occupied. Service = the building's street address that always works regardless of which apartments are occupied.
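
A Service finds its pods purely by label matching. The sketch below shows that selection rule with plain dictionaries; the pod names and labels are invented, and real Services also consider pod readiness.

```python
def select_pods(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """Return names of pods whose labels contain every selector key/value pair."""
    return [
        pod["name"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


pods = [
    {"name": "api-7d9f-abc", "labels": {"app": "api", "tier": "backend"}},
    {"name": "api-7d9f-def", "labels": {"app": "api", "tier": "backend"}},
    {"name": "web-5c8b-xyz", "labels": {"app": "web"}},
]
endpoints = select_pods({"app": "api"}, pods)
print(endpoints)
```

When a Deployment replaces a pod, the new pod carries the same labels, so the Service picks it up without any change to the Service itself.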

Question 20

A pod is stuck in CrashLoopBackOff. How do you debug it?

Accepted Answer

Step 1: `kubectl describe pod <pod-name>` shows events (image pull error? scheduling failure? OOMKilled?). Step 2: `kubectl logs <pod-name> --previous` shows logs of the LAST run before it crashed (the current container may not have logged anything yet). Step 3: check resource limits; OOMKilled means the memory limit is too low. Step 4: check liveness and startup probes; a failed liveness probe restarts the container, while a failed readiness probe only takes the pod out of Service endpoints. Step 5: `kubectl exec` won't work for crashing pods; use a debug container or temporarily change the command to `sleep infinity` to inspect the filesystem. Step 6: image issue? misconfigured env? missing secret?

💡 Real-life example

🔁 CrashLoopBackOff = 'I tried to start, I died, k8s will retry'. The 'BackOff' part means k8s exponentially delays restarts: 10s → 20s → 40s, capped at 5 minutes. Most cases come down to bad config, a missing env var, or OOM.
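
The restart delay behind CrashLoopBackOff is a capped exponential. A minimal sketch, assuming the documented defaults of a 10-second base and a 300-second cap; the function name is made up for illustration.

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Delay in seconds before each restart: doubles every time, up to the cap."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


delays = crashloop_delays(7)
print(delays)
```

After a handful of crashes the pod is restarted only once every five minutes, which is why a fixed pod can still take a while to come back unless it is deleted and recreated.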

Question 21

What's the difference between Deployment, StatefulSet, and DaemonSet?

Accepted Answer

Deployment — for stateless apps (web servers, APIs). Pods are interchangeable, replaced freely, scheduled anywhere. StatefulSet — for stateful apps (databases, Kafka). Pods get stable network IDs (`pod-0`, `pod-1`), stable storage, and ordered startup/shutdown. DaemonSet — runs exactly ONE pod on EVERY node (or a subset). Used for node-level agents: log collectors (Fluentd), monitoring (Datadog agent), CNI plugins. Rule: stateless → Deployment, identity matters → StatefulSet, per-node → DaemonSet.

💡 Real-life example

🏢 Deployment = call-center agents (any one can take the next call). StatefulSet = doctors (Dr. Smith has her own patients, ordered shifts). DaemonSet = security cameras (one per building floor, always on).

Question 22

What is a Kubernetes Ingress and how does it differ from a Service?

Accepted Answer

A SERVICE exposes pods inside the cluster (or via NodePort/LoadBalancer to outside) — but you'd need one LB per service. An INGRESS is an HTTP/HTTPS router at the cluster edge that maps URLs to services: `foo.com/api` → api-service, `foo.com/web` → web-service. One Ingress, many services behind it, one TLS cert. Requires an Ingress Controller (nginx-ingress, Traefik, AWS ALB Controller) to actually implement the routing.

💡 Real-life example

🛣️ A Service is a door to one office; an Ingress is the building's front desk that knows which office handles which request.
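
The host-and-path routing an Ingress describes can be sketched as a lookup with prefix matching. This is an illustrative model only, assuming the `foo.com` rules from the answer; actual behaviour depends on the Ingress Controller in use.

```python
from typing import Optional

# (host, path prefix) -> backend Service name, mirroring the example rules.
RULES = {
    ("foo.com", "/api"): "api-service",
    ("foo.com", "/web"): "web-service",
}


def route(host: str, path: str, rules: dict = RULES) -> Optional[str]:
    """Pick the Service whose host matches and whose prefix is the longest match."""
    matches = [
        (prefix, service)
        for (rule_host, prefix), service in rules.items()
        # Match whole path segments: /api matches /api and /api/x, but not /apix.
        if rule_host == host
        and (path == prefix or path.startswith(prefix.rstrip("/") + "/"))
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: len(match[0]))[1]


print(route("foo.com", "/api/users"))
print(route("foo.com", "/web"))
print(route("foo.com", "/apix"))
```

A request that matches no rule returns `None` here; a real controller would send it to a default backend or answer with a 404.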

Question 23

What's the difference between a ConfigMap and a Secret in Kubernetes?

Accepted Answer

ConfigMap holds NON-SENSITIVE config (env vars, config files, JSON/YAML blobs). Secret holds SENSITIVE data (passwords, API keys, TLS certs). Secret values are only base64-encoded by default, which is not encryption, so encryption at rest in etcd should be ENABLED. Both are exposed to pods as env vars or mounted files. Critical difference: Secrets get more care. They are RBAC-restricted, can be backed by external providers (Vault, AWS Secrets Manager via CSI driver), and shouldn't be logged. Never commit Secrets to Git in plain form.

💡 Real-life example

🗄️ ConfigMap: `LOG_LEVEL=info`, `FEATURE_FLAGS=enabled`. Secret: `DB_PASSWORD`, `JWT_SIGNING_KEY`. Same Pod, two different YAMLs.
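
It is worth seeing why base64 in a Secret manifest offers no protection. The snippet below uses a made-up password to show that anyone who can read the encoded value can recover the original in one call.

```python
import base64

password = "s3cr3t-password"  # illustrative value only

# What ends up in the `data:` field of a Secret manifest.
encoded = base64.b64encode(password.encode()).decode()

# Anyone with read access to the manifest or the API object can reverse it.
decoded = base64.b64decode(encoded).decode()

print(encoded)
print(decoded)
```

Base64 exists so that binary values fit in YAML and JSON; confidentiality has to come from RBAC, encryption at rest, or an external secret store.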

Question 24

Explain Kubernetes namespaces — when should you use them?

Accepted Answer

A namespace is a logical PARTITION inside a Kubernetes cluster. Resources (Pods, Services, Deployments) are scoped to a namespace. Use cases: (1) separate environments in one cluster (dev, staging, prod), though prod is usually a separate cluster. (2) multi-tenancy: with RBAC, team A can't see team B's resources. (3) RBAC scoping: limit who can touch what. (4) resource quotas: cap CPU/memory per team. NOT a network boundary by itself: pods in different namespaces can still talk by default; use NetworkPolicies to restrict that.

💡 Real-life example

🏢 `kubectl get pods -n payments` shows only the payments team's stuff. The infrastructure team isolates noise and applies quotas — payments can't accidentally use 100% of cluster RAM.

Question 25

What is a Helm chart and why use Helm in Kubernetes?

Accepted Answer

Helm is the Kubernetes PACKAGE MANAGER. A Helm CHART is a templated, versioned bundle of YAML manifests for an app + its dependencies. Benefits: (1) templating — use values.yaml to customize per environment (dev/staging/prod) without copy-pasting YAML. (2) versioning — track app deploys like npm releases. (3) `helm rollback` — easy version reverts. (4) reusable charts — nginx, postgres, redis are pre-packaged on Artifact Hub. Modern alternative: Kustomize (no templating, just overlays).

💡 Real-life example

📦 `helm install my-app ./mychart -f prod-values.yaml` deploys 14 manifests with prod-specific values. Without Helm, you'd duplicate the YAML for each env.

Question 26

What is Infrastructure as Code (IaC) and why do we need it?

Accepted Answer

IaC is managing infrastructure (servers, networks, databases, load balancers) through MACHINE-READABLE FILES instead of clicking buttons in a cloud console. Benefits: version control (git diff your infra changes), code review (pull requests for infra), repeatability (spin up identical environments in seconds), disaster recovery (re-create everything from code). The opposite — 'click-ops' — leads to undocumented prod, snowflake servers, and 'works on my staging' problems.

💡 Real-life example

📝 A 200-line Terraform file describes: VPC, subnets, RDS, S3 bucket, IAM roles, security groups. Run `terraform apply` and all of it is created in minutes. Run `terraform destroy` and it is all gone.

Question 27

What's the difference between Terraform and Ansible?

Accepted Answer

Terraform — declarative, cloud-resource PROVISIONING. You describe what should exist (3 EC2 instances, an RDS, an S3 bucket); Terraform figures out the API calls. Built around CLOUD STATE. Ansible — procedural, CONFIGURATION management. You describe steps to perform ON EXISTING servers (install nginx, copy this config, restart service). Uses SSH or WinRM. Many teams use both: Terraform to create the servers, Ansible to configure what runs on them. Modern Kubernetes-shops use Terraform + Helm + Kubernetes manifests instead.

💡 Real-life example

🏗️ Terraform builds the house (foundation, walls, plumbing). Ansible furnishes it (installs the fridge, paints the walls, hangs art).

Question 28

What is Terraform state and why is it so critical?

Accepted Answer

Terraform state is a JSON file (`terraform.tfstate`) mapping your code to real cloud resources — 'this `aws_instance.web` in code is the EC2 instance `i-abc123` in AWS'. Without state, Terraform can't tell what to change vs destroy. CRITICAL because: (1) it contains secrets and IDs — never commit it. (2) corrupting it means Terraform thinks resources don't exist and tries to recreate them. (3) two people running Terraform at once = corruption. Production fix: REMOTE state on S3 + locking via DynamoDB (or use Terraform Cloud). NEVER edit state by hand.

💡 Real-life example

📋 Like the inventory book of a warehouse manager. Burn the book — they think they have nothing. Two managers writing in it at once — chaos. Lock it, store it safely, back it up.

Question 29

What's the difference between Terraform plan, apply, and destroy?

Accepted Answer

`terraform plan`: DRY-RUN. Shows what WILL change without making changes. Output: `+` (create), `~` (update), `-` (destroy). Always review before apply. `terraform apply`: actually makes the changes. Asks for confirmation unless `-auto-approve` is passed. Updates the state file. `terraform destroy`: removes everything Terraform manages. Production rule: always plan first, review the diff in the PR, then apply via CI/CD with state locking.

💡 Real-life example

🛡️ A plan output: '+ aws_instance.web (new), - aws_security_group.old (destroy)'. The reviewer catches 'wait, we're destroying that SG that another service uses!' — prevents an outage.
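
The `+`, `~`, `-` symbols in a plan come from comparing what the configuration wants with what the state records. The sketch below is a toy version of that comparison using dictionaries; it is not Terraform's algorithm, and the resource addresses and attributes are invented.

```python
def plan(desired: dict[str, dict], state: dict[str, dict]) -> list[str]:
    """Diff desired config against recorded state and list the planned actions."""
    actions = []
    for address in sorted(desired.keys() | state.keys()):
        if address not in state:
            actions.append(f"+ {address}")   # in config, not yet created
        elif address not in desired:
            actions.append(f"- {address}")   # still exists, removed from config
        elif desired[address] != state[address]:
            actions.append(f"~ {address}")   # exists, attributes differ
    return actions


desired = {
    "aws_instance.web": {"instance_type": "t3.micro"},
    "aws_s3_bucket.logs": {"versioning": True},
}
state = {
    "aws_s3_bucket.logs": {"versioning": False},
    "aws_security_group.old": {"name": "old"},
}
actions = plan(desired, state)
print("\n".join(actions))
```

Nothing is changed by computing the diff, which is the point of reviewing a plan before anyone runs apply.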

Question 30

What are Terraform modules and why use them?

Accepted Answer

A Terraform MODULE is a reusable, composable chunk of Terraform configuration. Define once (e.g., 'standard VPC with public/private subnets and NAT'), reuse across projects. Benefits: DRY, consistent infra (every team's VPC looks the same), versioning (pin to module `v1.2.0`), better than copy-pasting HCL. Sources: local dirs, Git repos, Terraform Registry. Modules have INPUT variables and OUTPUT values exposed to the caller.

💡 Real-life example

🧱 Instead of 200 lines of VPC config in every project, a `module "vpc"` block sets `source = "terraform-aws-modules/vpc/aws"` and `cidr = "10.0.0.0/16"`, one argument per line. About 6 lines instead of 200.

Question 31

Which AWS services are core to a DevOps stack?

Accepted Answer

EC2 — virtual machines. S3 — object storage and artifact host. ECR — Docker image registry. ECS/EKS — container orchestration (EKS = managed Kubernetes). RDS — managed databases. CloudFormation — AWS-native IaC (or use Terraform). CodePipeline/CodeBuild/CodeDeploy — native CI/CD. CloudWatch — logs and metrics. IAM — permissions for everything. VPC — networking. Route 53 — DNS. Combine these and you have an end-to-end DevOps stack.

💡 Real-life example

☁️ A typical AWS DevOps stack: code in CodeCommit → CodeBuild builds → image to ECR → CodeDeploy ships to ECS (for EKS, a CodeBuild step typically runs `kubectl` or Helm) → CloudWatch monitors → IAM secures everything.

Question 32

What's the difference between IaaS, PaaS, and SaaS?

Accepted Answer

IaaS (Infrastructure-as-a-Service) — you get raw compute, storage, network. You manage the OS, runtime, apps. Examples: AWS EC2, Azure VMs, Google Compute Engine. PaaS (Platform-as-a-Service) — provider manages OS + runtime; you deploy code only. Examples: Heroku, AWS Elastic Beanstalk, Google App Engine. SaaS (Software-as-a-Service) — fully managed app you just use. Examples: Gmail, Salesforce, Slack. Tradeoff: more abstraction = less control + less ops burden.

💡 Real-life example

🍕 IaaS = renting an oven; you bake. PaaS = pizza kit; you assemble. SaaS = pizza delivered.

Question 33

What is Auto Scaling and how does it work?

Accepted Answer

Auto Scaling automatically adjusts CAPACITY based on load. Two flavors: HORIZONTAL (add/remove instances) and VERTICAL (resize instances). AWS Auto Scaling Group (ASG) example: define min/max/desired instances + scaling policies (CPU > 70% for 5 min → add 2 instances). Triggers: CPU, memory, request count, custom metrics. Combined with Load Balancers — new instances auto-register. Pitfall: reactive scaling lags behind sudden spikes — use predictive scaling for known events.

💡 Real-life example

📈 Black Friday: ASG scales from 10 to 200 EC2 instances at 8am, back to 10 at midnight. You pay only for what you use — saves ~80% vs always running 200.
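
A scaling policy like "CPU > 70% for 5 min → add 2 instances" reduces to a small decision function. This is a sketch under stated assumptions: one CPU sample per minute, and a scale-in rule (below 30%, remove one instance) that is hypothetical and not taken from the answer above.

```python
def next_capacity(current: int, cpu_last_5_min: list[float],
                  min_size: int, max_size: int) -> int:
    """Return the new desired instance count, clamped to the group's bounds."""
    if cpu_last_5_min and all(sample > 70.0 for sample in cpu_last_5_min):
        target = current + 2   # sustained high load: scale out
    elif cpu_last_5_min and all(sample < 30.0 for sample in cpu_last_5_min):
        target = current - 1   # hypothetical scale-in rule
    else:
        target = current       # mixed readings: hold steady
    return max(min_size, min(max_size, target))


print(next_capacity(10, [75, 80, 90, 71, 72], min_size=10, max_size=200))
print(next_capacity(199, [95, 96, 97, 98, 99], min_size=10, max_size=200))
print(next_capacity(10, [20, 20, 20, 20, 20], min_size=10, max_size=200))
```

Requiring every sample in the window to breach the threshold is what makes the policy reactive, and it is also why it lags behind a sudden spike.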

Question 34

What is a Load Balancer? What types exist and when do you use each?

Accepted Answer

A Load Balancer distributes incoming traffic across multiple backend servers. Types on AWS: (1) Application Load Balancer (ALB) — Layer 7 (HTTP), routes by path/host headers, integrates with auth and WAF. Best for web apps. (2) Network Load Balancer (NLB) — Layer 4 (TCP/UDP), ultra-low latency, millions of connections. Best for game servers, TLS pass-through. (3) Classic ELB — legacy. Algorithms: round-robin, least-connections, weighted. Health checks remove unhealthy targets automatically.

💡 Real-life example

💧 ALB routes `/api/*` → backend, `/static/*` → static-asset service, `/ws/*` → websocket service. One DNS name, smart routing, automatic failover.
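
Two of the algorithms named above, round-robin and least-connections, fit in a short class, along with the effect of a failed health check. This is a teaching sketch with invented target names, not a model of how AWS load balancers are implemented.

```python
class LoadBalancer:
    def __init__(self, targets: list[str]) -> None:
        self.targets = targets
        self.healthy = set(targets)               # updated by health checks
        self.active = {t: 0 for t in targets}     # open connections per target
        self._next = 0

    def round_robin(self) -> str:
        """Cycle through targets in order, skipping unhealthy ones."""
        for _ in range(len(self.targets)):
            target = self.targets[self._next % len(self.targets)]
            self._next += 1
            if target in self.healthy:
                return target
        raise RuntimeError("no healthy targets")

    def least_connections(self) -> str:
        """Pick the healthy target with the fewest open connections."""
        candidates = [t for t in self.targets if t in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy targets")
        return min(candidates, key=lambda t: self.active[t])


lb = LoadBalancer(["a", "b", "c"])
lb.healthy.discard("b")                  # "b" fails its health check
picks = [lb.round_robin() for _ in range(4)]
lb.active.update({"a": 5, "c": 2})
quietest = lb.least_connections()
print(picks, quietest)
```

Round-robin suits uniform, short requests; least-connections tends to behave better when request durations vary widely.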

Question 35

What is GitOps and how does it differ from regular DevOps?

Accepted Answer

GitOps is a SPECIFIC FLAVOR of DevOps for Kubernetes (mostly): the Git repository is the SINGLE source of truth for both application code AND cluster state. A controller (ArgoCD, Flux) watches Git and continuously RECONCILES the cluster to match. Push to Git = deploy. Roll back = git revert. No one runs `kubectl apply` manually. Benefits: full audit trail, easy rollbacks, drift detection, no privileged credentials in CI. DevOps is the umbrella culture; GitOps is one operational pattern under it.

💡 Real-life example

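🌳 ArgoCD watches `infra-repo/prod/`. You merge a YAML change and, with auto-sync enabled, it goes live in the cluster shortly after (within the polling interval, or almost immediately with a webhook). Cluster drifts? With self-heal on, ArgoCD reverts it to match Git. The git log IS the deploy log.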
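
The reconcile loop at the heart of GitOps can be sketched as a function that makes the cluster match Git and reports what it did. This is a toy model with dictionaries standing in for manifests; in ArgoCD, automatic pruning and self-healing are options that have to be enabled, whereas this sketch always does both.

```python
def reconcile(git_state: dict[str, str], cluster: dict[str, str]) -> list[str]:
    """Mutate `cluster` until it matches `git_state`; return the actions taken."""
    actions = []
    for name, spec in git_state.items():
        if cluster.get(name) != spec:
            actions.append(f"apply {name}")   # missing or drifted resource
            cluster[name] = spec
    for name in list(cluster):
        if name not in git_state:
            actions.append(f"prune {name}")   # exists in cluster, not in Git
            del cluster[name]
    return actions


git_state = {"api": "image: api:1.3", "web": "image: web:2.0"}
cluster = {"api": "image: api:1.2", "web": "image: web:2.0", "debug-pod": "image: busybox"}

first_pass = reconcile(git_state, cluster)
second_pass = reconcile(git_state, cluster)
print(first_pass, second_pass)
```

The second pass does nothing because the cluster already matches Git, which is the idempotence that makes continuous reconciliation safe to run on a timer.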

Question 36

What is ArgoCD and how does GitOps with ArgoCD work?

Accepted Answer

ArgoCD is a declarative GitOps tool for Kubernetes. It runs IN your cluster and watches a Git repo containing Kubernetes manifests (or Helm charts, Kustomize overlays). When Git changes, ArgoCD applies the diff to the cluster — continuous RECONCILIATION. UI shows sync status, diff view, rollback button. Features: auto-sync, sync waves (ordered rollout), health checks, RBAC, multi-cluster management. The deploy IS the git merge.

💡 Real-life example

🔁 Merge a PR updating `apps/prod/api.yaml` from version 1.2 → 1.3. ArgoCD detects the change on its next poll (every 3 minutes by default, sooner with a webhook) and applies it. Cluster drift detected? With self-heal enabled, ArgoCD reverts to match Git. Audit log = git log.

Question 37

What's the difference between git fetch and git pull?

Accepted Answer

`git fetch` — downloads remote changes into your local repo's tracking branches (e.g., `origin/main`) but does NOT touch your working branch. Safe to run anytime. `git pull` = `git fetch` + `git merge origin/<branch>` (or `--rebase` if configured). It applies the remote changes onto your current branch. If you're worried about losing local work, fetch first, look at what's coming, then pull or rebase.

💡 Real-life example

📬 Fetch = checking your mailbox; the letters arrive but you don't open them. Pull = opening them and acting immediately. Fetch is always safe; pull can create conflicts.

Question 38

Git rebase vs merge — when do you pick which?

Accepted Answer

Merge — creates a new commit that joins two branches, preserving the full branch history. Pros: non-destructive, accurate history. Cons: cluttered log with merge commits. Rebase — moves your commits ON TOP of the target branch, rewriting their parents. Pros: linear, clean history; easier to read. Cons: rewrites history — NEVER rebase a branch other people have pulled. Common workflow: rebase your feature branch onto main before merging (clean history), then merge into main with a merge commit (clear PR boundaries).

💡 Real-life example

🪡 Merge = stitching two threads together (you see the seam). Rebase = unwinding your thread and re-attaching it to the new end (looks like one thread).

Question 39

What is a merge conflict and how do you resolve it?

Accepted Answer

A merge conflict happens when Git can't auto-merge because the same lines were changed differently on both branches. Git marks the file with `<<<<<<<`, `=======`, `>>>>>>>` markers. To resolve: (1) open the file, decide which version (or combine) is correct, delete the markers. (2) `git add <file>` to mark it resolved. (3) `git commit` (for merge) or `git rebase --continue` (for rebase). Prevention: pull/rebase frequently, communicate with teammates editing the same files, keep PRs small.

💡 Real-life example

✏️ Two editors edit the same paragraph in different ways. Git is honest enough to say 'I can't decide which is right — you choose.' Better than silently picking one.
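
To make the marker layout concrete, here is a minimal Python sketch that extracts both sides of each conflict from a file's text. It assumes the default two-way marker style (no `|||||||` base section); the function name `find_conflicts` and the sample content are hypothetical.

```python
def find_conflicts(text: str) -> list:
    """Return one dict per conflict with the 'ours' and 'theirs' lines."""
    conflicts = []
    current = None
    side = None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            current = {"ours": [], "theirs": []}
            side = "ours"
        elif line.startswith("=======") and current is not None:
            side = "theirs"
        elif line.startswith(">>>>>>>") and current is not None:
            conflicts.append(current)
            current, side = None, None
        elif current is not None:
            current[side].append(line)
    return conflicts


# Hypothetical conflicted config file.
sample = "\n".join([
    'greeting = "hi"',
    "<<<<<<< HEAD",
    "timeout = 30",
    "=======",
    "timeout = 60",
    ">>>>>>> feature/slow-api",
    "retries = 3",
])

for conflict in find_conflicts(sample):
    print("ours:  ", conflict["ours"])
    print("theirs:", conflict["theirs"])
```

A check like this can also serve as a cheap CI guard: if any tracked file still yields a non-empty result, someone committed unresolved markers.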

Question 40

What is git stash and when do you use it?

Accepted Answer

`git stash` saves your uncommitted changes on a local stack and resets your working directory to a clean state. By default only tracked files are stashed; add `-u` to include untracked files. `git stash pop` reapplies the latest stash and removes it from the stack. Use cases: (1) you need to switch branches quickly for a hotfix without committing WIP. (2) you need to pull from remote but have local changes that would conflict. (3) you're experimenting and want to set changes aside 'just in case'. Stash is local: it is never pushed.

💡 Real-life example

🥡 You're 30% into a feature, urgent bug arrives. `git stash`, switch to main, fix and merge. `git checkout feature && git stash pop` → exactly where you left off.

Question 41

What's the difference between git reset, revert, and checkout?

Accepted Answer

`git reset`: moves the current branch pointer to another commit (usually an earlier one). Three flavors: `--soft` (keep changes staged), `--mixed` (default, keep changes unstaged), `--hard` (DISCARD changes in the index and working tree). It rewrites history, so use it only on commits you haven't shared. `git revert <commit>`: creates a NEW commit that undoes a previous one. Safe for shared branches. `git checkout <commit/file>`: switches branches or restores files. Modern Git splits this into `git switch` (branches) and `git restore` (files) for clarity. Rule: revert for shared branches, reset for private branches.

💡 Real-life example

⏪ Bad commit pushed to main: `git revert abc123` → creates an 'undo' commit, safe. Bad commit on your local branch: `git reset --hard HEAD~1` → wipes it, only safe for your own branch.

Question 42

Git branching strategies — Git Flow vs GitHub Flow vs Trunk-Based?

Accepted Answer

GIT FLOW — multiple long-lived branches: main (prod), develop (integration), feature/*, release/*, hotfix/*. Complex but disciplined. Good for products with strict release cycles. GITHUB FLOW — single main + feature branches. Merge to main → deploys. Simpler, fits CI/CD. TRUNK-BASED — everyone commits to main directly or via very short-lived branches (hours). Requires feature flags + heavy automation. Used by Google, Facebook, Netflix. Tradeoff: simplicity (Trunk) vs control (Flow).

💡 Real-life example

🌳 A bank: Git Flow (slow, regulated releases). A SaaS startup: GitHub Flow (multiple deploys/day). Google: Trunk-based (thousands of deploys/day with feature flags).

Question 43

What's the difference between monitoring and observability?

Accepted Answer

Monitoring tracks KNOWN UNKNOWNS — predefined metrics, alerts on thresholds (CPU > 80%, error rate > 1%). Tells you THAT something is broken. Observability is the ability to answer arbitrary questions about your system from its outputs — UNKNOWN UNKNOWNS. Tells you WHY it's broken. Built on three pillars: METRICS (numbers over time), LOGS (text events), TRACES (request flow across services). Modern systems are too complex for monitoring alone — you need observability.

💡 Real-life example

🚗 Monitoring: the dashboard tells you the engine is hot. Observability: you can ask the car 'show me which cylinder is running rich and why' — and get a useful answer.

Question 44

What is Prometheus and how does it differ from traditional monitoring?

Accepted Answer

Prometheus is the de-facto open-source monitoring system for cloud-native workloads. Differences: (1) PULL-based — Prometheus scrapes metrics from `/metrics` endpoints on each service (vs push-based agents). (2) Dimensional data model — metrics have key/value labels (`http_requests{method="GET", status="200"}`), so you can slice arbitrarily. (3) PromQL — powerful query language for rates, aggregations, alerting rules. (4) Designed for ephemeral, scaling-up-and-down workloads (Kubernetes). Pair with Grafana for dashboards and Alertmanager for alerts.

💡 Real-life example

📡 A traditional monitor watches one server's heartbeat. Prometheus asks 1000 ephemeral pods 'what's your latency?' every 15 seconds and stores it as a time series you can slice by HTTP route, version, region, anything.
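
The dimensional data model is easier to see with a toy example. The Python sketch below is not Prometheus code: it models labeled samples as plain tuples and slices them by one label, which is roughly what a PromQL query like `sum by (status) (http_requests)` does. The sample values, the `route` label, and the `sum_by` helper are assumptions for illustration.

```python
from collections import defaultdict

# Each sample: (metric name, labels, value), mirroring
# http_requests{method="GET", status="200"} in the answer above.
samples = [
    ("http_requests", {"method": "GET", "status": "200", "route": "/cart"}, 120),
    ("http_requests", {"method": "GET", "status": "500", "route": "/cart"}, 3),
    ("http_requests", {"method": "POST", "status": "200", "route": "/checkout"}, 40),
    ("http_requests", {"method": "POST", "status": "500", "route": "/checkout"}, 10),
]


def sum_by(samples, metric: str, label: str) -> dict:
    """Aggregate one metric's values, grouped by a single label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


print(sum_by(samples, "http_requests", "status"))
print(sum_by(samples, "http_requests", "route"))
```

The same four samples answer both "how many errors?" and "which route is busiest?" without defining a new metric for each question, which is the point of labels.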

Question 45

What are the three pillars of observability?

Accepted Answer

(1) METRICS — numerical time-series data: CPU%, request rate, error rate, latency. Aggregated, cheap, fast queries. Best for dashboards and alerts. (2) LOGS — discrete events with full context: 'User X logged in at 10:23'. Verbose, expensive to query but rich. Best for postmortems. (3) TRACES — distributed call graphs across services: 'Request A spent 200ms in API, 400ms in DB, 150ms in cache'. Best for finding bottlenecks in microservices. Modern stack: Prometheus (metrics) + Loki/ELK (logs) + Jaeger/Tempo (traces) — unified in Grafana.

💡 Real-life example

🔍 Customer says 'checkout is slow'. Metrics: p99 latency 8s. Traces: pinpoint the DB query taking 6s. Logs: show the specific query and parameters. Without all three, you'd still be guessing.

Question 46

What's the difference between SLA, SLO, and SLI?

Accepted Answer

SLI (Service Level INDICATOR) is what you MEASURE: actual availability %, p99 latency, error rate. The raw metric. SLO (Service Level OBJECTIVE) is the TARGET you set internally: '99.9% availability over 30 days'. The goal SREs work toward. SLA (Service Level AGREEMENT) is the CONTRACTUAL promise to customers, usually with refunds attached. The SLA is normally LOOSER than the SLO (you promise less than you target), so an internal miss surfaces before a contractual breach. Error budget = 100% - SLO. Use it: spend on risky deploys when budget is high; freeze when low.

💡 Real-life example

📊 SLI: API uptime = 99.95% last month. SLO: 99.9% (we're ahead!). SLA: 99.5% (no breach — no refunds). Error budget for the month: 0.05% remaining out of 0.1%.
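
The arithmetic in this example can be checked with a short Python sketch. The function names are made up for illustration, and the 30-day window is taken from the SLO wording above; values are rounded to hide floating-point noise.

```python
def error_budget(slo_pct: float, sli_pct: float) -> dict:
    """Compare measured availability (SLI) against the target (SLO)."""
    budget = 100.0 - slo_pct   # total allowed unavailability
    used = 100.0 - sli_pct     # unavailability actually observed
    return {
        "budget_pct": round(budget, 4),
        "used_pct": round(used, 4),
        "remaining_pct": round(budget - used, 4),
    }


def allowed_downtime_minutes(slo_pct: float, days: int = 30) -> float:
    """Translate the budget into minutes of downtime over the window."""
    return round((100.0 - slo_pct) / 100.0 * days * 24 * 60, 1)


print(error_budget(slo_pct=99.9, sli_pct=99.95))
print(allowed_downtime_minutes(99.9))
```

For a 99.9% SLO over 30 days the budget works out to about 43 minutes of downtime, and an SLI of 99.95% means roughly half of it has been spent.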

Question 47

What is the ELK Stack and how does it differ from Prometheus?

Accepted Answer

ELK = Elasticsearch + Logstash + Kibana — a LOG management stack. Logstash ingests logs, Elasticsearch stores and indexes them, Kibana visualizes. Optimized for: full-text search, structured queries on log fields, alerting on log patterns. Prometheus = METRICS only — time-series numerical data. Different problem domains. Most production stacks use BOTH: Prometheus for 'what's the error rate' and ELK for 'show me the actual error log lines'. Modern alternatives: Loki (cheaper logs), OpenSearch (Elasticsearch fork).

💡 Real-life example

🔎 'Error rate just doubled' → Prometheus alerts. 'Show me the actual stack traces' → Kibana search. Different tools, same incident.

Question 48

Explain Blue/Green, Canary, and Rolling deployment strategies.

Accepted Answer

Rolling — replace pods one (or a batch) at a time. Default for Kubernetes Deployments. Pros: simple, no extra resources. Cons: mixed versions running simultaneously, slow rollback. Blue/Green — run TWO full environments (blue=current, green=new). Switch the load balancer when green is ready. Pros: instant rollback (flip the switch back), no mixed versions. Cons: 2x resources during deploy. Canary — route a SMALL % of traffic (5%) to new version, monitor metrics, gradually increase. Pros: catches bugs in real production traffic with minimal blast radius. Cons: complex routing logic, needs good observability.

💡 Real-life example

🎨 Painting a wall: Rolling = paint one stripe at a time. Blue/Green = paint a new wall behind the old one, then swap. Canary = paint a small test patch first, see if you like the color, then commit.
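
The canary decision ("monitor metrics, gradually increase") can be sketched as a small function. This is a simplified illustration, not how any particular rollout tool works: the step schedule, the error-rate tolerance, and the name `next_canary_weight` are all assumptions.

```python
STEPS = [5, 25, 50, 100]  # percent of traffic on the new version (assumed schedule)


def next_canary_weight(
    current: int,
    canary_error_rate: float,
    baseline_error_rate: float,
    tolerance: float = 0.01,
) -> int:
    """Return the next traffic percentage, or 0 to abort the canary."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0  # canary is clearly worse than the stable version: roll back
    later = [step for step in STEPS if step > current]
    return later[0] if later else 100


print(next_canary_weight(5, canary_error_rate=0.002, baseline_error_rate=0.001))
print(next_canary_weight(25, canary_error_rate=0.05, baseline_error_rate=0.001))
```

Comparing the canary against the current baseline, rather than against a fixed threshold, avoids aborting a healthy release just because the whole system is having a noisy day.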

Question 49

Horizontal vs Vertical scaling — when do you pick which?

Accepted Answer

Vertical scaling (scale UP) — add more CPU/RAM/disk to existing servers. Pros: zero code changes, works for stateful apps. Cons: hard ceiling (no machine is infinite), downtime to resize, single point of failure. Horizontal scaling (scale OUT) — add more identical servers. Pros: nearly infinite, fault-tolerant (one dies, others survive), elastic. Cons: requires stateless services or shared state stores (Redis, DB), load balancing complexity. Modern best practice: design stateless, scale horizontally. Vertical is the fallback for legacy stateful systems.

💡 Real-life example

💪 Vertical = give the chef a bigger knife. Horizontal = hire 10 more chefs. At enough orders, only horizontal works — but only if your kitchen is designed for parallel cooks.

Question 50

What is a rollback strategy and how do you implement it?

Accepted Answer

A rollback returns the system to the previous known-good state quickly. By deploy type: (1) Blue/Green: flip the load balancer back to blue (instant). (2) Kubernetes Deployment: `kubectl rollout undo deployment/x` reverts to the previous ReplicaSet (seconds). (3) Canary: reduce canary traffic to 0, then delete the canary. (4) Database migrations: often not cleanly reversible. Plan for it with backward-compatible migrations, so the old app version still runs against the new schema, and keep a tested backup restore as the last resort. Production rule: every deploy should be REVERSIBLE in <5 min, and the rollback path should be TESTED, not theoretical.

💡 Real-life example

⏪ Friday 5pm deploy breaks checkout. ArgoCD shows the bad sync — click 'Rollback'. 30 seconds later: old version is back. Without practiced rollbacks, this would be a multi-hour incident.

Question 51

What is feature flagging and how does it relate to deployments?

Accepted Answer

A feature flag (or toggle) is an IF statement around new code: `if (flags.newCheckout) { ... } else { ... }`. The flag is controlled by a config service (LaunchDarkly, Flagsmith, Unleash) — flippable at runtime without redeploying. Decouples DEPLOY from RELEASE: deploy disabled code to prod safely → enable for 1% of users → gradually expand → enable for everyone. Also enables: A/B testing, kill switches, gradual rollouts, percentage-based experiments.

💡 Real-life example

🚦 Deploy the new checkout dark on Monday. Enable for internal staff Tuesday. 5% of users Wednesday. 50% Thursday. 100% Friday. Each step: monitor, abort with one click if metrics regress.
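
A percentage rollout needs each user to get a stable answer, and users enabled at 5% should stay enabled at 50%. One common way to get that is hash-based bucketing, sketched below in Python. This is an illustrative scheme, not the algorithm of LaunchDarkly, Flagsmith, or Unleash; `bucket`, `is_enabled`, and the flag name are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(user_id: str, flag: str, rollout_pct: int) -> bool:
    """A user is in the rollout if their bucket falls below the percentage."""
    return bucket(user_id, flag) < rollout_pct


users = [f"user-{i}" for i in range(1000)]
for pct in (0, 5, 50, 100):
    enabled = sum(is_enabled(u, "newCheckout", pct) for u in users)
    print(f"{pct:>3}% rollout -> {enabled} of {len(users)} users enabled")
```

Including the flag name in the hash means different flags pick different subsets of users, so the same people are not always the first to receive every experiment.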

Question 52

What is a Service Mesh (Istio, Linkerd) and why use one?

Accepted Answer

A service mesh handles SERVICE-TO-SERVICE communication in microservices, pulled OUT of your app code and into the network layer. Sidecar proxies (Envoy in Istio, a purpose-built Rust proxy in Linkerd) sit next to the app container in every pod and intercept all traffic. Features: mutual TLS (encryption + auth automatic), traffic shifting (canary deploys via config not code), retries, circuit breaking, distributed tracing, observability. Tradeoff: adds complexity + CPU overhead. Worth it for: 50+ microservices needing standardized communication. Overkill for: small monoliths.

💡 Real-life example

🕸️ Istio: shift 5% of traffic to the new service version via YAML config — no app code change. mTLS between all services without anyone writing crypto code. Observe every request automatically.

Question 53

What is Configuration Management? Ansible vs Chef vs Puppet?

Accepted Answer

Configuration Management = the practice of keeping servers in a known-good state automatically. Ansible — agentless (uses SSH), YAML playbooks, simple to learn, push-based by default. Chef — agent-based, Ruby DSL, pull-based, mature in enterprise. Puppet — agent-based, custom DSL, declarative, also enterprise-mature. Modern winner for most teams: Ansible (low barrier + works everywhere). Container-era teams skip CM entirely — the Docker image IS the configuration; just deploy a new image.

💡 Real-life example

🖥️ Ansible playbook: 'all web servers should have nginx 1.24, this config file, and these firewall rules.' Run it → 200 servers obey. Run again → no-op (idempotent).
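
Idempotency is the key property here: describe the desired state, and applying it a second time changes nothing. The Python sketch below illustrates the idea on a list of config lines. It is not Ansible code; `ensure_line` and the sample config are assumptions, loosely modelled on what a "make sure this line exists" task does.

```python
def ensure_line(lines: list, wanted: str) -> tuple:
    """Return (new_lines, changed). Adds `wanted` only if it is missing."""
    if wanted in lines:
        return lines, False
    return lines + [wanted], True


# Hypothetical nginx-style config.
config = ["worker_processes auto;"]

config, changed = ensure_line(config, "server_tokens off;")
print(changed)   # first run: the line was missing, so state changed

config, changed = ensure_line(config, "server_tokens off;")
print(changed)   # second run: already in the desired state, no-op
```

Reporting `changed` separately from the result is what lets a tool decide whether follow-up actions, such as a service restart, are needed at all.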

Question 54

How do you handle secrets in CI/CD pipelines?

Accepted Answer

NEVER commit secrets to Git (even private repos — they leak). NEVER paste them in plaintext in pipeline YAML. Layer 1: a secrets manager — HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, sealed-secrets for Kubernetes. Layer 2: the CI/CD tool's encrypted env vars (GitHub Actions secrets, GitLab CI masked variables, Jenkins credentials). Layer 3: short-lived credentials over long-lived ones (OIDC to AWS instead of static keys). Layer 4: scan repos with tools like git-secrets, truffleHog. Rotate everything regularly.

💡 Real-life example

🔐 A static AWS access key in a public repo gets harvested by bots within MINUTES. With OIDC-based federated auth there is no long-lived key in the repo or CI settings to steal; the pipeline gets short-lived credentials for each run.
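
Layer 4 (scanning) can be illustrated with a minimal Python sketch. It only looks for strings shaped like long-term AWS access key IDs (`AKIA` followed by 16 uppercase letters or digits); real scanners such as git-secrets and truffleHog cover far more patterns. The `scan` function is hypothetical, and the key in the sample is the placeholder value used in AWS documentation.

```python
import re

# Shape of a long-term AWS access key ID: "AKIA" + 16 uppercase alphanumerics.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan(text: str) -> list:
    """Return (line number, match) pairs for anything that looks like a key ID."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            findings.append((lineno, match.group(0)))
    return findings


sample = 'region = "us-east-1"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'

for lineno, key in scan(sample):
    print(f"line {lineno}: possible AWS access key ID {key[:4]}...")
```

Printing only a prefix of the match keeps the scanner's own output from leaking the secret into CI logs.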

Question 55

What is HashiCorp Vault and why use it?

Accepted Answer

Vault is a centralized SECRETS MANAGER. Stores secrets (passwords, API keys, certificates) encrypted at rest, accessed via API + audit-logged. Features: (1) DYNAMIC secrets — generates short-lived database/AWS credentials on demand (no static keys). (2) Secret rotation — automatic. (3) Encryption-as-a-service — encrypt/decrypt without managing keys. (4) Identity-based access — apps authenticate via Kubernetes/AWS IAM, get short-lived tokens. Eliminates static secrets sprawled across CI/CD, config files, env vars.

💡 Real-life example

🔐 App needs a DB password. Old way: hardcoded env var, rotated yearly. With Vault: app requests credentials at startup, gets DB user `temp-abc-2hours`. DB auto-creates and expires. Compromised? 2-hour blast radius.
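
The value of short-lived credentials comes down to a lease with an expiry. The Python sketch below models only that idea; it does not call Vault or use its API. `Lease`, `issue_credentials`, `is_valid`, and the 2-hour TTL (taken from the example above) are assumptions for illustration.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Lease:
    username: str
    password: str
    expires_at: float  # seconds, on the same clock as `now`


def issue_credentials(now: float, ttl_seconds: int = 7200) -> Lease:
    """Generate a unique, random credential pair that expires after the TTL."""
    return Lease(
        username=f"temp-{secrets.token_hex(3)}",
        password=secrets.token_urlsafe(24),
        expires_at=now + ttl_seconds,
    )


def is_valid(lease: Lease, now: float) -> bool:
    """A leaked credential is only useful until the lease expires."""
    return now < lease.expires_at


lease = issue_credentials(now=0.0)
print(lease.username)
print(is_valid(lease, now=3600.0))   # one hour in: still usable
print(is_valid(lease, now=7200.0))   # at the two-hour mark: expired
```

Because every app instance gets its own username, an audit log can tie a suspicious query back to the exact lease that issued it.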

Question 56

What is an Ansible playbook and how does it differ from a role?

Accepted Answer

PLAYBOOK: a YAML file with the full automation flow: targets (hosts), tasks (steps), handlers (tasks that run only when notified of a change, e.g. restart nginx after its config is updated). One playbook can configure many machines. ROLE: a REUSABLE, structured directory of tasks/handlers/templates/vars. Playbooks USE roles. Think: playbook = the recipe ('make Italian dinner'), role = the technique ('make-pasta'). Roles are shareable on Ansible Galaxy. Best practice: keep playbooks short, so they orchestrate roles rather than list tasks directly.

💡 Real-life example

🧰 A 200-line playbook for setting up a web server → refactor into roles: nginx, postgres, monitoring. Playbook becomes 10 lines invoking roles. Reuse roles across 20 projects.

Question 57

Your CI pipeline used to take 10 minutes; now it takes 45. How do you debug it?

Accepted Answer

Step 1: look at the pipeline TIMELINE in the CI UI: which stage ballooned? Usually it's one specific step. Step 2: compare to a recent fast run: what changed? New dependencies? New tests? Larger Docker image? Step 3: caching: is the dependency cache being hit? Reusing `node_modules`, `.m2`, or the pip cache often cuts build time substantially. Step 4: test parallelism: split tests across runners. Step 5: Docker layer caching: order Dockerfile commands so unchanged layers stay cached. Step 6: lint/test scope: only run on changed files for PR builds. Step 7: agents: are runners under-provisioned or queued?

💡 Real-life example

🔍 Last week's `npm install`: 90 seconds (cached). This week's: 8 minutes. Why? Someone added `npm cache clean --force` to the Dockerfile. Fix: remove it, and build times recover. Caching regressions are one of the most common causes of CI slowdowns.
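
Dependency caches usually work by deriving a key from the lockfile: same lockfile, same key, cache hit. The Python sketch below shows that idea in isolation. The `cache_key` function, the `npm` prefix, and the sample lockfile contents are assumptions; each CI system has its own syntax for this.

```python
import hashlib


def cache_key(lockfile_bytes: bytes, prefix: str = "npm") -> str:
    """Derive a cache key that changes only when the lockfile changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"


# Hypothetical lockfile contents before and after a dependency bump.
before = b'{"dependencies": {"left-pad": "1.3.0"}}'
after = b'{"dependencies": {"left-pad": "1.3.1"}}'

print(cache_key(before) == cache_key(before))  # unchanged lockfile: cache hit
print(cache_key(before) == cache_key(after))   # dependency bump: cache miss
```

When a pipeline suddenly slows down, checking whether this key changed on every run (for example because a timestamp leaked into the hashed input) is a quick first test.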

Question 58

Production deployment failed at 3am — walk me through your incident response.

Accepted Answer

Step 1: Acknowledge the alert (PagerDuty). Step 2: Open the on-call runbook for this service. Step 3: Quick triage — is the issue user-facing? Severity 1/2/3? Step 4: Open an incident channel in Slack, ping on-call + cross-functional partners. Step 5: ROLLBACK first, debug later (`kubectl rollout undo` or LB swap). If rollback isn't possible, mitigate (rate limit, return cached responses, maintenance page). Step 6: After mitigation: dig in (logs, metrics, traces). Step 7: Communicate status updates every 15 min. Step 8: After resolution: blameless postmortem, fix the underlying issue, update the runbook.

💡 Real-life example

🚨 The rule: MITIGATE in 5 min, INVESTIGATE in 30 min, FIX permanently within a week. Don't debug live — that's how 2-hour incidents become 6-hour ones.

Question 59

A pod is consuming 100% CPU constantly. How do you debug it?

Accepted Answer

Step 1: `kubectl top pod` confirms the issue. Step 2: `kubectl logs <pod>`: anything looping in the logs? Step 3: `kubectl exec -it <pod> -- top` (or `htop`, if the image includes it) to see processes inside. Step 4: for Java/Node, find the hot code path: take a few thread dumps with `kubectl exec` + `jstack <pid>` (Java) and look for threads stuck on the same frames, or sample with a profiler such as `perf` (general). Identify the hot method. Step 5: likely causes: infinite loop, regex catastrophic backtracking, runaway recursion, memory thrash (GC eating CPU), large payload processing. Step 6: mitigate by deleting the pod so its controller restarts it (a brief restart is usually safer than a sustained CPU hog) while you fix the code.

💡 Real-life example

🔥 Pod hits 100% CPU. CPU profile shows 80% time in a regex match. Code review: someone added a user-input regex without validation — `\d+(\d+)+` causes exponential backtracking. Fix: validate length + use a regex linter.
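
In a backtracking regex engine, a nested quantifier such as `(\d+)+$` can take exponential time when a long run of digits is followed by a non-digit. The Python sketch below shows one way to apply the fix described above: cap the input length and use a pattern with a single quantifier. `MAX_LEN` and `is_valid_number` are assumptions for illustration.

```python
import re

MAX_LEN = 64                 # assumed upper bound for this input field
DIGITS = re.compile(r"\d+")  # single quantifier, no nesting


def is_valid_number(value: str) -> bool:
    """Accept only short, all-digit strings."""
    if len(value) > MAX_LEN:
        return False  # reject before the regex engine ever sees it
    return DIGITS.fullmatch(value) is not None


print(is_valid_number("12345"))
print(is_valid_number("1" * 30 + "x"))   # near-miss input, rejected quickly
print(is_valid_number("1" * 1000))       # too long, rejected by the length cap
```

The length check bounds the worst case regardless of the pattern, so it still protects the service if someone later reintroduces a risky regex.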

Question 60

Your CI/CD pipeline succeeded but the app is broken in production. What do you check?

Accepted Answer

This is the worst case: green pipeline, red prod. Check in order: (1) What's the failure mode? 500s? slow? wrong data? (2) Pull production logs from the affected pods. (3) Diff the deployed image SHA against last known good. (4) Environment-specific config: prod has different env vars/secrets/feature flags vs staging. Maybe an env var is missing or has a typo. (5) Database: migrations ran but schema is incompatible? Long-running tx blocking? (6) External dependencies: third-party API down? Rate limit hit? (7) Resource limits: pod hitting a CPU/memory ceiling that staging never tested. (8) If users are affected, don't wait for the root cause: roll back IMMEDIATELY and keep investigating in parallel.

💡 Real-life example

🔍 Deploy works in staging, breaks in prod. Diff env vars: prod has `DATABASE_URL` pointing at a deprecated read replica. Set the correct URL, redeploy, and it works. Lesson: env parity matters; keep environment config in a single source of truth such as Vault or SSM.
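
The "diff env vars" step is easy to automate. The Python sketch below compares two environments represented as dictionaries; the `diff_env` helper, the variable names, and the hostnames are hypothetical. Values that differ are only listed for review, since many of them are meant to differ between staging and prod.

```python
def diff_env(staging: dict, prod: dict) -> dict:
    """Report keys that are missing, extra, or different in prod."""
    s_keys, p_keys = set(staging), set(prod)
    return {
        "missing_in_prod": sorted(s_keys - p_keys),
        "extra_in_prod": sorted(p_keys - s_keys),
        "different": sorted(k for k in s_keys & p_keys if staging[k] != prod[k]),
    }


# Hypothetical environment snapshots.
staging = {
    "DATABASE_URL": "postgres://db.staging.internal/app",
    "FEATURE_NEW_CHECKOUT": "true",
    "LOG_LEVEL": "debug",
}
prod = {
    "DATABASE_URL": "postgres://old-replica.prod.internal/app",
    "LOG_LEVEL": "info",
}

print(diff_env(staging, prod))
```

Missing keys are the highest-signal finding: a variable the app reads in staging but never receives in prod is a frequent cause of a green pipeline and a red production.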

Top 60 DevOps Interview Questions with Real-World Examples

🧠 DevOps Fundamentals (5)

🔄 CI/CD Pipelines (6)

🐳 Docker & Containers (6)

☸️ Kubernetes (8)

🏗️ Infrastructure as Code (5)

☁️ Cloud & AWS (4)

🔁 GitOps & ArgoCD (2)

🌿 Git & Version Control (6)

📊 Monitoring & Observability (5)

🚀 Deployment & Scaling (5)

🔐 Config & Secrets (4)

🔥 Real Scenarios (4)
