How to hunt down a bug or an issue
It's not working. It gives an error. How to resolve a bug or an infra issue without losing it
Find out what the error is. It can manifest in different ways. Sometimes it's an error traceback, sometimes it's a network call that never gets through, and sometimes it's a pod crashing repeatedly.
Look one level deeper
Whatever way the error manifests, look into it, like really look into it.
If it is a traceback, figure out which library is causing it.
If it's a network call, understand the path it can potentially take before reaching the target server.
If it's a pod not starting, look at its logs and events.
Sometimes the errors are generic. That's when you take a step back and get a 360-degree view of all the systems involved.
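For the traceback case, a minimal Python sketch of "figure out which library is causing it" could look like the following. The helper names `frames_by_origin` and `failing_module` are made up for illustration, and the broken JSON string is only there to trigger an error from a library. The idea is to walk to the innermost frame and ask which module it belongs to.

```python
import json
import traceback


def frames_by_origin(exc: BaseException) -> list[tuple[str, int, str]]:
    """Return (filename, line number, function) per frame, outermost first."""
    return [
        (frame.filename, frame.lineno, frame.name)
        for frame in traceback.extract_tb(exc.__traceback__)
    ]


def failing_module(exc: BaseException) -> str:
    """Name the module that owns the innermost Python frame of the traceback."""
    tb = exc.__traceback__
    if tb is None:
        return "<no traceback>"
    while tb.tb_next is not None:  # walk down to the frame that raised
        tb = tb.tb_next
    return tb.tb_frame.f_globals.get("__name__", "<unknown>")


try:
    json.loads("{not valid json")  # deliberately broken input
except ValueError as exc:
    culprit = failing_module(exc)
    frames = frames_by_origin(exc)
    print("raised inside module:", culprit)
    for filename, lineno, func in frames:
        print(f"  {filename}:{lineno} in {func}")
```

The last frame printed is usually the best place to start reading, because it tells you whether the failure is in your code or inside a dependency.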
Questions to ask yourself
Have I seen this before? If yes, life is easy peasy.
Have I seen something like this before?
Searching the internet
A simple Google search
GitHub issues
Community forums
Experimentation
Based on the above notes, come up with a series of hypotheses for the potential causes of the issue.
Devise a way to test them. Sometimes it's entering the prod system and running some scripts; sometimes it's changing some configuration.
A different error is progress. Know when you're complicating the issue.
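One way to keep experiments honest is to write each hypothesis down next to the check that would support or rule it out. The sketch below is only an illustration: `Hypothesis`, `run_experiments`, and the `config` dictionary are hypothetical names, not part of any real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    description: str
    check: Callable[[], bool]  # True means the evidence supports the hypothesis


def run_experiments(hypotheses: list[Hypothesis]) -> list[tuple[str, str]]:
    """Run every check and record the outcome, even when a check itself fails."""
    results = []
    for h in hypotheses:
        try:
            outcome = "supported" if h.check() else "ruled out"
        except Exception as exc:  # a failing check is a finding too
            outcome = f"check failed: {exc!r}"
        results.append((h.description, outcome))
    return results


# Hypothetical configuration under investigation.
config = {"timeout_seconds": 0, "endpoint": "https://example.invalid"}

hypotheses = [
    Hypothesis("timeout is set too low", lambda: config["timeout_seconds"] < 1),
    Hypothesis("endpoint is missing", lambda: not config.get("endpoint")),
    Hypothesis("retries are disabled", lambda: config["retries"] == 0),
]

results = run_experiments(hypotheses)
for description, outcome in results:
    print(f"{description}: {outcome}")
```

The third check fails because the `retries` key does not exist. That is a different error from the one being hunted, and it still narrows the search.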
Communicate
People like answers, not process.
But it is better to communicate that what you're doing is an experiment.
Feelings
Some issues have the power to take you over
Like a 24/7 process
Take a break
Talk to someone
Rubber duck your process
Writing/Bug report
Revisit your notes
Answer the damn question
Add details later
Why should every software engineer do production support?
Understand the needs of the customer
Understand the system beyond the context of your own development work
Train your brain to work through issues quicker
Debugging a GitHub Action
Debug the steps
Local system has aws creds
Work with env variables
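When a step passes locally but fails in the workflow, one thing worth comparing is which environment variables each side actually has. The sketch below assumes the standard AWS variable names read by the AWS CLI and SDKs, and the `GITHUB_ACTIONS` variable that GitHub sets to `true` on its runners. The `describe_env` helper is a hypothetical name. It reports only whether a value is present and how long it is, so it is safe to run in a CI log.

```python
import os

# Standard AWS environment variables; adjust the list to what your step needs.
AWS_ENV_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
    "AWS_PROFILE",
]


def describe_env(names, environ=None):
    """Report which variables are set without ever printing their values."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in names:
        value = environ.get(name)
        report[name] = f"set ({len(value)} chars)" if value else "missing"
    return report


running_in_ci = os.environ.get("GITHUB_ACTIONS") == "true"
print("running inside GitHub Actions:", running_in_ci)
for name, status in describe_env(AWS_ENV_VARS).items():
    print(f"{name}: {status}")
```

Running the same script locally and as a workflow step gives two reports to compare side by side.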
Debugging Databricks with Airflow
The wrong variable unsupported
What can you do as a software engineer
Write better error messages
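As a small illustration of what a better error message can look like, the sketch below turns a bare `KeyError` into a message that says what was missing, where it was expected, what was found instead, and what to do next. `ConfigError`, `get_required`, and the `job.yaml` source name are hypothetical and not taken from any particular library.

```python
class ConfigError(Exception):
    """Raised when a required setting is absent."""


def get_required(config: dict, key: str, source: str):
    """Fetch a required setting or fail with an actionable message."""
    try:
        return config[key]
    except KeyError:
        available = ", ".join(sorted(config)) or "<none>"
        raise ConfigError(
            f"Missing required setting '{key}' in {source}. "
            f"Available settings: {available}. "
            f"Add '{key}' to {source} and retry."
        ) from None


try:
    get_required({"region": "eu-west-1"}, "cluster_id", "job.yaml")
except ConfigError as exc:
    message = str(exc)
    print(message)
```

Whoever is on production support at 2 a.m. can act on this message without opening the source code.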