Sunday, December 14, 2008

Release Process !?!

Two days ago, my teammate had to release to production a minor code change he had made to one of the web modules. Since we were both new to the client's release process, we were given a short briefing on it.

We first have to release the change to a common dev environment, where it stays until peer review and internal testing are complete. Then there are three documents to be filled in meticulously for UAT (User Acceptance Testing). One of them gives clear step-by-step instructions for releasing to the QA environment, written as if for a four-year-old. Once the UAT release is done, the development team gets an email notification, and after the users test it, there is yet another email giving sign-off for a production release.

We then have to prepare another bunch of documents, attach the UAT sign-off email and the UAT release notification email, and send everything to the deployment and maintenance team on the client side, who take care of the release. They follow a semi-automated release process in which the release manager just specifies the steps to be executed, and the scheduled deployment runs once every 3 hours.

On the release date, if you never hear back from the release manager, things are good. But if you get an email with a high alert… you probably woke up on the wrong side of the bed that day.
Well, apparently that’s what my teammate had done, because he had just received such an email. He had a senior colleague to help him figure out the problem, so I decided to keep away from it, after telling him to compare the code changes in source control and to holler if he needed any assistance. I stayed out of it partly because I did not want to interfere with the senior colleague’s approach and partly because I had a couple of burning issues of my own to attend to.

While I was keeping busy, I heard from my teammate later that evening that they had to roll back the changes after trying a couple of things, that the issue had been escalated, and that the change needed to be re-released the next day. They had identified a few things that did not match between the QA and production environments. But they had figured out what the problem was and were prepared for the next day’s release. While he was explaining, I asked a few questions for which he either had no answers or was not confident of them, which made me a bit nervous for him.
Well, it was the next day, same time, and sadly the same story. It was a Friday, people were all the more frustrated, and this colleague of mine was being held responsible while having no clue what was going on. When things were heading towards yet another rollback, I decided to jump in, invited or not.

The website had been published and released to UAT, where it worked fine, but when released to production it failed to load even the default page and redirected to an "authentication failed" error page. The “fix” after the first rollback was this: they had compared the config files, found some application-specific role IDs to be different, assumed this must be the issue, and prepared to give it another shot, which eventually failed, miserably.
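The kind of config comparison they did can be sketched roughly as below. This is only an illustration: the setting names and values are made up, and the settings are assumed to have already been read into plain dictionaries. As this story shows, a difference found this way is a lead, not a diagnosis.

```python
def diff_settings(uat: dict[str, str], prod: dict[str, str]) -> dict[str, tuple]:
    """Return every key whose value differs, or that exists on only one side."""
    differences = {}
    for key in sorted(uat.keys() | prod.keys()):
        uat_value = uat.get(key)    # None means the key is missing in UAT
        prod_value = prod.get(key)  # None means the key is missing in production
        if uat_value != prod_value:
            differences[key] = (uat_value, prod_value)
    return differences


# Hypothetical settings, not the client's real configuration.
uat_settings = {"AdminRoleId": "12", "LoginUrl": "/login.aspx", "Timeout": "30"}
prod_settings = {"AdminRoleId": "47", "LoginUrl": "/login.aspx", "UseLegacyAuth": "true"}

for key, (uat_value, prod_value) in diff_settings(uat_settings, prod_settings).items():
    print(f"{key}: UAT={uat_value!r} PROD={prod_value!r}")
```

Listing keys that exist on only one side matters as much as listing changed values, since a setting present only in production hints at behaviour that UAT never exercised.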

The code in source control looked intact, leaving me no option but to decompile. Reflector, my favorite tool in many situations, came to the rescue: I decompiled the app code binary currently in production and compared it with ours. The decompiled login-related method contained quite a lot of additional changes, changes that were not present in source control!
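A cheaper first check, before reaching for a decompiler, is to compare file hashes. The sketch below assumes a copy of the last officially released package is kept somewhere (we had no such thing, so the directory layout here is hypothetical). It compares against that archived package rather than a fresh rebuild, because a rebuild of the same source is not guaranteed to be byte-identical.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def find_drift(released_dir: Path, deployed_dir: Path) -> list[str]:
    """List files that differ between the two trees or exist in only one."""
    released = file_digests(released_dir)
    deployed = file_digests(deployed_dir)
    return sorted(
        name
        for name in released.keys() | deployed.keys()
        if released.get(name) != deployed.get(name)
    )
```

Calling `find_drift(Path("last_release"), Path("production_copy"))` with two such folders returns the relative paths worth decompiling. It only says that something changed, not what, so it narrows the search rather than replacing the decompile.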

We then figured out a way to get this working by making a few changes in the configuration file. Call it a tweak or a hack, but it saved the day and the weekend. I made everyone involved aware of it, including the client, who had an emergency regression test run, and I promptly created another request to get this mess cleaned up.

I have worked quite a bit in this onsite/offshore setup and have my own experiences to recount. But almost no client I have worked with before had such an elaborate documentation process. So at first I appreciated this way of working, but this incident clearly proved that no system or process is immune to errors, not when people have the audacity to bypass certain rules.
This whole thing was obviously caused by someone who had been here long enough and knew how to sidestep a few landmines, but unfortunately did not know or care about the consequences and complications it would create for the next bunch of people who would work on it. The blame game had started, and something tells me that whoever did this must be long gone.

Even though I come from a background where documentation was not much emphasized, I could still suggest maintaining a “deviation log”, used mainly to record any such deviations at any step of the development life cycle.
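To make the idea concrete, here is a minimal sketch of what one entry in such a log might hold. The field names, the stage labels and the sample entry are my own assumptions, not anything the client uses, and a real log would be written to a shared file or tracker rather than kept in an in-memory list.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class Deviation:
    logged_on: str    # ISO date the deviation was recorded
    stage: str        # e.g. "dev", "uat", "production" (hypothetical labels)
    author: str       # who deviated from the process
    description: str  # what was done outside the normal process
    follow_up: str    # what is needed to bring things back in line


def append_deviation(log_lines: list[str], entry: Deviation) -> None:
    """Store the entry as one JSON line so the log stays easy to search."""
    log_lines.append(json.dumps(asdict(entry), sort_keys=True))


def deviations_for_stage(log_lines: list[str], stage: str) -> list[dict]:
    """Return all logged deviations for one stage of the life cycle."""
    entries = [json.loads(line) for line in log_lines]
    return [entry for entry in entries if entry["stage"] == stage]


log: list[str] = []
append_deviation(log, Deviation(
    logged_on=date(2008, 12, 12).isoformat(),
    stage="production",
    author="someone",  # placeholder name
    description="Login method patched directly on the server",
    follow_up="Merge the patch into source control and re-release",
))
print(deviations_for_stage(log, "production"))
```

Even a log this small would have pointed us at the patched login method before the first failed release.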

The bottom line is this - 
Even though there are many steps in place to streamline a work process, nothing will change unless people learn to respect and follow them.
