I first encountered the “production environment” in 2010, when I worked as a contractor for a major railway company. Before that I was mostly in CAD software development and consulting, an environment where the word “production” did not come up often. To be precise, at Siemens PLM/UGS we developers did have access to various production releases, and from time to time we validated bugs and bug fixes against them. Our code went out in yearly or quarterly releases. But production was not as significant as the maintenance releases; that is the world of shrink-wrap (engineering) software.
Then I came to the world of business applications, or the web. The first thing I learned is that it’s not a good idea for newbies to touch production data. For that matter, it’s not a good idea for devs to touch it either. Very few people have production access: besides the admins (database, web), the few who do are usually product owners, business analysts, or product support people. Fast forward five or six years, and I became one of the latter. This is a privilege. Here are some things I have learned over the past year:
1) Start with baby steps: e.g., if we want to update 1000 records, start with one or two records, do the update, validate, and if everything looks good, do the mass update. This goes the way of divide and conquer too: for example, if I need to delete 3 or 4 million records in one script (one run), I know it will be a long operation, and I don’t want the operation to hang or fail in the middle. So what do I do? I divide the delete into a few operations, each deleting half a million records. That is much more manageable, and I will either get it completed much faster or at least get feedback much faster.
The baby-steps idea applies to career level and experience as well. I recall that when I was a new developer at the local consulting company I mentioned above, I did not have access rights to prod. My supervisor did. I recall one of her main tasks was to fix (I assume mostly update) the data in the Oracle database, probably because of her familiarity with the data and the application. It’s not a process I personally like, though, as I believe that in an ideal world we should have no bugs in the application serious enough to require manual intervention like that. Manual intervention has its downsides as well: for example, a wrong update command gets run, or data gets deleted. Obviously there are ways to recover.
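To make the baby-steps idea from item 1 concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for a real production database. The table `records`, the column `obsolete`, and the function `delete_in_batches` are all hypothetical names, and the SQL for limiting a delete differs between databases, so treat this as an illustration and not as something to paste into prod.

```python
import sqlite3


def delete_in_batches(conn, batch_size=500_000, max_batches=None):
    """Delete rows flagged obsolete in chunks, committing after each chunk.

    `records` and `obsolete` are hypothetical names used for illustration.
    Pass max_batches=1 with a tiny batch_size for a trial run.
    """
    total = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        cur = conn.execute(
            "DELETE FROM records WHERE rowid IN ("
            "SELECT rowid FROM records WHERE obsolete = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # each chunk is its own short transaction
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
        batches += 1
    return total
```

A trial run would be something like `delete_in_batches(conn, batch_size=2, max_batches=1)`, followed by a manual check of the data. Only when that looks right would I run it again with the default batch size and no batch limit.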
2) Backup, backup, backup, whenever I can. That way, when I find I have made a mistake deleting or updating some data, I can put it back easily without asking a favor from the DBA (who, I assume, would find it nearly impossible to shut down the database or application just to restore my small mistake).
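One way to picture this kind of small, self-service backup is to copy the rows I am about to touch into a side table first. The sketch below again uses Python's sqlite3 module as a stand-in. The table names `accounts` and `accounts_backup` and both function names are hypothetical, and `INSERT OR REPLACE` is SQLite syntax; other databases would need a different restore statement.

```python
import sqlite3


def backup_rows(conn, ids):
    """Copy the rows about to be changed into a side table (hypothetical names)."""
    ids = list(ids)
    marks = ",".join("?" for _ in ids)
    conn.execute("DROP TABLE IF EXISTS accounts_backup")
    # Create an empty table with the same columns, then copy the chosen rows.
    conn.execute("CREATE TABLE accounts_backup AS SELECT * FROM accounts WHERE 0")
    conn.execute(
        f"INSERT INTO accounts_backup SELECT * FROM accounts WHERE id IN ({marks})",
        ids,
    )
    conn.commit()


def restore_rows(conn):
    """Put the saved rows back, whether they were updated or deleted."""
    # Relies on `id` being the primary key of `accounts` (an assumption).
    conn.execute("INSERT OR REPLACE INTO accounts SELECT * FROM accounts_backup")
    conn.commit()
```

The point of the design is that the backup covers only the rows in question, so restoring it does not disturb anything else in the table.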
3) Separate the environments. Once I made a mistake during testing, and an email went out to an actual client. It turned out that the code had no checks for different environments. We implemented that feature after the incident.
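I won't reproduce our actual fix here, but an environment check of this kind could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `APP_ENV` variable, the `qa-inbox@example.com` mailbox, and the `resolve_recipients` function are hypothetical names, not what our application used.

```python
import os

SAFE_INBOX = "qa-inbox@example.com"  # hypothetical test mailbox


def resolve_recipients(recipients, env=None):
    """Return the addresses mail should really go to in this environment."""
    if env is None:
        # APP_ENV is an assumed variable name; default to the safe side.
        env = os.environ.get("APP_ENV", "dev")
    if env == "prod":
        return list(recipients)
    # Anywhere else, never mail real clients: redirect to the test mailbox.
    return [SAFE_INBOX]
```

Defaulting to "dev" when the variable is missing means a misconfigured machine fails safe: the worst case is a client email that does not go out, not a test email that does.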
4) On call: I started on-call duty roughly a year ago, and I feel I have gained much more confidence recently. Ironically, that happened when the pager was going off left and right and I was learning under fire. Hardware-wise, I have a Martian Notifier watch that vibrates when a text message comes in. Response-wise, I acknowledge the alert first, then investigate and validate. I recall that in the early days I did not have a lot of confidence, and I would rely on my more experienced teammates for help. I still do (but much less frequently). Once I figure out the issue (or non-issue), I respond appropriately.
5) Examples of production issues. I recall two issues that kind of dragged on for a while, both performance related. In the first, the JDBC database connections would be exhausted. It turned out to be a SQL performance issue, but it took a while to nail down because it was complex and data dependent (note that there are tools for SQL analysis and performance tuning). The second was in the Java code, and in that case we had a code analysis tool to nail it down (it looked like an infinite loop that spiked CPU usage). Both were interesting problems that expanded my horizons. Previously I had used the Eclipse plugin (MAT) to analyze memory usage.
I will add more as time goes on.