Final step to put new website into production deleted it instead

Who, Me? Welcome to Monday! The Register hopes you arrive at your desk well-rested after a pleasant weekend, and not stressed out by working late as is the case in this week's instalment of "Who, Me?" – the reader-contributed column that chronicles your mistakes and escapes.

This week, meet a reader we'll Regomize as "Tom" who in 2009 contracted for a major supermarket chain that operated a website for ordering groceries and decided it needed another on which to sell general merchandise.

"We worked for about 18 months on the new site and one of my responsibilities was building out the development environments and defining the production deployment processes," Tom told The Register.

Tom found the production environment "a bit of a mess when I got there" due to poorly documented patching. He also found "absolutely massive" deployment scripts.

"One of the things I did was remove 6,000 lines from the scripts to attempt to make them more manageable," Tom wrote. "And the scripts were only part of the process, I had to define all the other steps needed and script what I could, or document what I couldn't."

While Tom tidied things up as best he could, the supermarket wouldn't let him touch production systems – only employees were allowed to make those critical keystrokes.

After months of work, the new site was ready.

"We needed to deploy our general merchandise patch on top of the existing groceries site," Tom explained. "We had carried out multiple dry runs, deployed and rolled back in pre-production a number of times. And we had a four-hour window from 2:00 AM to 6:00 AM when the business would allow the site to be down for this process."

Tom sat next to the employee who was allowed to make the change.

"I had supplied all the steps in detail, and all he really needed to do was cut and paste a few commands," he wrote.

But the employee decided to do it his way: instead of deploying to each server in turn, he opened PuTTYCS – a tool that can send commands to multiple machines simultaneously – and tried to update all the servers at once.

The staffer did ask Tom to confirm that the first step in the upgrade process was to remove the contents of one directory.

"Yes, just clear that directory," Tom replied, then watched in horror as the staffer ignored the command in the procedure and instead typed rm -rf * – the "delete everything" command that often gets readers into trouble.

"Because he used PuTTYCS, the command went to every production server at once," Tom pointed out.

This happened at 02:00 AM and by then Tom had been working since 08:30 AM the previous day, after rising far earlier to make the two-hour commute to the supermarket giant's premises.

"Maybe I wasn't sharp enough at that point to catch him before he hit Enter," he mused. "I think I managed to get an anguished 'Nooooo!' just as he hit it."

Time for a sitrep: It's just after 02:00 AM, Tom is exhausted, and the supermarket's entire production environment is gone.

The next four hours of his life involved frantic server rebuilds, application installs, patching, and restoring things to the state in which the infrastructure was ready for the upgrade that should have taken a few minutes.

And it worked.

At 07:00 AM the supermarket chain's e-commerce director arrived and asked if the upgrade went well.

"No issues," said one of the supermarket's staffers.

"They went to get coffee," Tom concluded. "I went to get some sleep in the break area."

Have you failed to follow a procedure and flamed out afterwards? If so, don't make another mistake by not sharing your story with Who, Me? Instead, click here to send us an email so we can share your story on a future Monday. ®

Source: The Register
