How to hunt down a bug or an issue

It's not working. It gives an error. How to resolve a bug or an infra issue without losing it.

  1. Find out what the error is. It can manifest in different ways. Sometimes it's an error traceback, sometimes a network call that doesn't go through, and sometimes a pod crashing repeatedly.

  2. Look one level deeper. However the error manifests, look into it, like really look into it.

  3. If it's a traceback, figure out which library is causing it.

  4. If it's a network call, understand the path it can take before reaching the server.

  5. If it's a pod not starting, look at its logs and events.

  6. Sometimes the errors are generic. That's when you take a step back and get a 360-degree view of all the systems involved.

  7. Questions to ask yourself

    1. Have I seen this before? If so, life is easy peasy.

    2. Have I seen something like this before?

  8. Searching the internet

    1. A simple Google search

    2. GitHub issues

    3. Community forums

  9. Experimentation

    1. Based on the notes above, come up with a series of hypotheses for the potential causes of the issue.

    2. Devise a way to test them. Sometimes it's entering the prod system and running some scripts, sometimes it's changing some configuration.

    3. A different error is progress. Know when you're complicating the issue.

  10. Communicate

    1. People like answers, not process

    2. But it's better to say up front that what you're doing is an experiment

  11. Feelings

    1. Some issues have the power to take you over

    2. Like a 24/7 process

    3. Take a break

    4. Talk to someone

    5. Rubber duck your process

  12. Writing/Bug report

    1. Revisit your notes

    2. Answer the damn question

    3. Add details later

  13. Why should every software engineer do production support?

    1. Understand the needs of the customer

    2. Understand the system beyond the context of your own development work

    3. Train your brain to work through issues quicker

  14. Debugging a GitHub Action

    1. Debug the steps

    2. The local system has AWS creds

    3. Work with env variables

  15. Debugging Databricks with Airflow

    1. A wrong or unsupported variable

  16. What can you do as a software engineer

    1. Write better error messages
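
Here is a minimal sketch of what a better error message could look like, in Python. The names (ConfigError, require_env, the hint text) are made up for illustration and don't come from any particular library. The idea is that the message says what was expected and what to try next, so whoever is on support doesn't have to dig for it.

```python
class ConfigError(Exception):
    """Hypothetical error type for a missing or empty setting."""


def require_env(name, environ, hint=""):
    """Return environ[name], or fail with a message that suggests a next step.

    `environ` is any mapping, for example os.environ.
    """
    value = environ.get(name)
    if not value:
        # Say which variable is missing, then add the caller's hint if there is one.
        message = f"Required environment variable {name!r} is not set."
        if hint:
            message += f" {hint}"
        raise ConfigError(message)
    return value
```

In a script you might call require_env("AWS_REGION", os.environ, hint="Set it in the workflow's env block.") at startup, so a missing variable fails early with something readable instead of a generic error three calls later.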

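Going back to step 3: one possible way to see which library a traceback comes from is to look at the file path of the innermost frame. This is a rough Python sketch using the standard traceback module; innermost_frame and describe_origin are hypothetical helper names. A path under site-packages usually points to a third-party library, while a path inside your repo points to your own code.

```python
import json
import traceback


def innermost_frame(exc):
    """Return the traceback frame closest to where the exception was raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    return frames[-1] if frames else None


def describe_origin(exc):
    """Summarise the exception type and the file, line, and function it came from."""
    frame = innermost_frame(exc)
    if frame is None:
        return f"{type(exc).__name__}: no traceback available"
    return (
        f"{type(exc).__name__} raised in "
        f"{frame.filename}:{frame.lineno} ({frame.name})"
    )


if __name__ == "__main__":
    try:
        # Deliberately malformed input so the json module raises.
        json.loads("{not valid json")
    except ValueError as exc:
        print(describe_origin(exc))
```

The innermost frame is only where the error surfaced. The cause can still be a bad value passed in from further up the stack, so treat it as the place to start reading.
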