Production Incident

(Update 01-07-2024) I was doing some #productionSupport this past week, for the most part. At one point it reminded me of a time at the credit card company when we left one node (server, out of 6) running for 8+ hours during the maintenance window before we brought the other 5 nodes back online. #theFun #withProduction #SRE

(Original 11-16-2022) We had a production incident recently, right after a production deployment. I was not intimately familiar with the Oracle index change (a drop) and its impact on other apps (lack of visibility, and lack of a performance testing environment). It is hard to prevent this sort of thing entirely, but I think as developers we should learn from those mistakes and try to avoid similar ones in the future.
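To make the lesson concrete, here is a minimal sketch (not the actual change from that incident) of how an index drop can be staged more defensively in Oracle: check whether any cached plans still reference the index, and prefer making it INVISIBLE, which is reversible, over dropping it outright. The schema, index name, credentials, and DSN below are all placeholders, and it assumes the python-oracledb driver.

```python
# Hedged sketch of a safer index-drop workflow; names and connection details
# are placeholders, not the real ones from the incident.
import oracledb

INDEX_OWNER, INDEX_NAME = "APP_SCHEMA", "ORDERS_CUST_IDX"  # hypothetical index

conn = oracledb.connect(user="deploy_user", password="***", dsn="prod-db/ORCL")
cur = conn.cursor()

# Which cached statements (possibly from other teams' apps) reference this index?
cur.execute(
    "SELECT DISTINCT sql_id FROM v$sql_plan "
    "WHERE object_owner = :owner AND object_name = :name",
    owner=INDEX_OWNER, name=INDEX_NAME,
)
dependents = [row[0] for row in cur.fetchall()]

if dependents:
    # Reversible step: hide the index from the optimizer and watch for
    # regressions before committing to the drop.
    cur.execute(f"ALTER INDEX {INDEX_OWNER}.{INDEX_NAME} INVISIBLE")
    print(f"{len(dependents)} cached statements still use the index; made it INVISIBLE instead of dropping.")
else:
    cur.execute(f"DROP INDEX {INDEX_OWNER}.{INDEX_NAME}")
    print("No cached plans reference the index; dropped it.")

conn.close()
```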

For production incidents, it is best for the dev team (or production support team) to know before the customers call in, especially in the case of external customers. I recall at my immediate previous employer, during the pandemic we had this “Screen and Go” web app. One morning the app went down; it turned out to be an autoscaler issue. Another time, the Twilio SMS messages were blocked by the carriers. We found out about both via the customer service desk.
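One way to hear about an outage before the service desk does is a simple synthetic check that pings the app and raises an alert on failure. Below is a minimal sketch under stated assumptions: the /health endpoint and the alert webhook URL are hypothetical, and in practice this would live in a monitoring tool rather than a standalone cron script.

```python
import json
import urllib.request

APP_URL = "https://screen-and-go.example.com/health"  # hypothetical health endpoint
ALERT_WEBHOOK = "https://alerts.example.com/hook"      # hypothetical chat/paging webhook


def check_and_alert() -> None:
    try:
        # 4xx/5xx responses raise HTTPError and timeouts raise URLError,
        # so any failure lands in the except block below.
        with urllib.request.urlopen(APP_URL, timeout=10):
            return  # app answered: nothing to do
    except Exception as exc:
        reason = repr(exc)

    body = json.dumps({"text": f"Health check failed for {APP_URL}: {reason}"})
    req = urllib.request.Request(
        ALERT_WEBHOOK,
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)


if __name__ == "__main__":
    check_and_alert()  # run every minute from cron or a scheduler
```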

Technical Assessments

At the credit card company I worked at a while ago, we had a process called technical assessments for a Change. A Change is usually a production deployment of a code or infrastructure change. The author of the Change needs to gather technical assessments from the impacted teams; I recall one Change where I had to include technical assessments from 16 external teams. It took some time and effort for me to get their blessing. But on the plus side, if we had done this for the incident mentioned above, it might not have happened (assuming the impacted teams seriously evaluated the potential impact to their apps).

Btw, I just realized I have not written much about production, other than this post.
