Master AWS Best Practices: Optimize Your Cloud Performance

INTERNAL DOCUMENT: POST-MORTEM REPORT – PROJECT “SILVER LINING” (MIGRATION FAILURE)
FROM: Senior Systems Architect (Infrastructure & Physical Security)
TO: The C-Suite and the “Cloud Native” Evangelists who broke the bank.
DATE: 2024-05-22
SUBJECT: Why we are broke and why my pager didn’t stop buzzing for 72 hours.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDevsToBreakEverything", …

Machine Learning Best Practices: 7 Tips for Success

INTERNAL POST-MORTEM: PROJECT “ICARUS” / INCIDENT REPORT #8842-B
TO: Engineering Leadership, DevOps, and anyone else who thinks they can “just run a script”
FROM: Silas Thorne, Principal Systems Architect (Infrastructure & Recovery)
SUBJECT: The Smoldering Remains of our “Machine Learning” Pipeline
It is 4:42 AM. I have been awake for thirty-eight hours. The air in …

What is Kubernetes? A Simple Guide to Container Orchestration

$ kubectl get pods -A
NAMESPACE     NAME                               READY   STATUS                       RESTARTS      AGE
kube-system   coredns-7689d884b-l2v98            0/1     CrashLoopBackOff             42 (3m ago)   72h
kube-system   kube-proxy-z4m2n                   0/1     Error                        15            72h
production    api-gateway-v2-7f5d9c8d4b-9w2k1    0/2     ImagePullBackOff             0             14m
production    order-processor-5566778899-abc12   0/1     CreateContainerConfigError   0             12m
production    payment-service-8899aabbcc-xyz34   0/1     Terminating                  0             72h
production    auth-service-66778899aa-def56      0/1     Pending                      0             5m
monitoring    prometheus-server-0                0/1     CrashLoopBackOff             112           72h
…

Docker Image Guide: How to Build, Run, and Manage Images

[root@prod-node-04 ~]# docker pull registry.internal.corp/analytics/data-cruncher:latest
latest: Pulling from analytics/data-cruncher
d5a1f291072d: Already exists
f23a467d5e21: Pull complete
4f4fb5514a3d: Pull complete
7d23456789ab: Extracting [==================================================>] 2.1GB/2.1GB
8e34567890cd: Extracting [========================> ] 1.2GB/2.4GB
failed to register layer: Error processing tar file(exit status 1): write /usr/lib/x86_64-linux-gnu/libLLVM-15.so.1: no space left on device
[root@prod-node-04 ~]# df -h /var/lib/docker
Filesystem Size Used Avail Use% …

Top DevOps Best Practices for Faster Software Delivery

Incident ID: #8829-OMEGA.
Status: Resolved (Barely).
Subject: The day the load balancer decided to become a random number generator.
Incident Summary
* Duration: 02:04 UTC to 06:12 UTC (4 hours, 8 minutes).
* Impact: Total loss of ingress traffic for the api.production.internal and checkout.production.internal zones. Estimated revenue loss: $2.1M.
* Root Cause: A “minor” update …

Docker Best Practices: Build Faster, Secure Containers

POST-MORTEM REPORT: THE DAY THE LAYERS COLLAPSED
DATE: October 14, 2023
AUDITOR: Lead Infrastructure Engineer (Hardened Systems Division)
STATUS: CRITICAL / FORENSIC COMPLETE
INCIDENT REF: #882-ALPHA-FAILURE
I’ve spent the last 72 hours staring at hex dumps and cleaning up the radioactive sludge left behind by a “standard” deployment. My eyes are bloodshot, my caffeine intake …

Top Artificial Intelligence Best Practices for Success

[2023-10-27T14:22:01.442Z] kernel: [12409.552101] python3[14201]: segfault at 0 ip 00007f8e12a34b12 sp 00007ffc8e12a340 error 4 in libtorch_cuda.so[7f8e10000000+12a34000]
[2023-10-27T14:22:01.443Z] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.50 GiB (GPU 0; 23.65 GiB total capacity; 18.21 GiB already allocated; 4.12 GiB free; 19.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try …