Minor maintenance turning into an epic crash… and some data loss

SmartBots Blog August 17, 2019 by Glaznah Gassner

If you found the SmartBots website offline today, we are really sorry! It was the result of an epic crash we faced today. The techy details follow for those who are interested or curious.

Once is chance

This morning we decided to run some minor maintenance. It was supposed to be a routine replacement of an SSD drive (these drives wear out every 1.5 to 2 years). The drives (say, A and B) are joined in a RAID array, so it was enough to pull out one drive (drive B) and insert a fresh one.

The Data Center technician took out the worn drive B. The drives are installed quite close to each other, so the healthy drive A got slightly moved as well. The technician decided to adjust it, pushed it back with a finger… and, as we now know, killed BOTH drives with static electricity. One should always wear an anti-static bracelet!

He did not realize that both A and B had just died… so he proceeded with the maintenance, and the damaged drive A started writing random zeroes to the database, erasing recently created bots and purchases.

Twice is coincidence

Within 15 minutes, we discovered that the database was being zeroed. Another 15 minutes went into confirming that it was not a hacker attack, and 30 more minutes into finding a healthy drive and putting it into the system.

Once the cause was found, our backup guy was sent to fetch the off-site backup, the one stored on an external server. He took a spare drive, connected it to the external backup, and started downloading gigabytes of data. Unfortunately, he accidentally took the ALREADY BROKEN drive B (he was in a hurry, yeah).

Three times is either malicious intent or stupidity

The data restored onto this drive was also FULL OF UNEXPECTED ZEROES. We were overwriting broken data with more broken data. It took two more hours to discover that… and unfortunately, during those hours the broken data was being written over the fresh backup. We had simply forgotten to stop the live data replication.

So we ended up with a completely ruined database and a corrupted live backup. The previous full backup was a few days old. We spent hours restoring as much data as possible. 99% of bots are running smoothly now; however, some data was lost.

How we live with that now

In your SmartBots account, you’ll see a notification asking you to check your bots and Wallet. If you feel that something is missing, please contact us. We will check our records and credit you the missing funds.

We sincerely apologize for the worry and the downtime! Please give us a few days to sort out the missing data, and then we’ll try to compensate all customers for the outage. Also, thank you for reading through this post; we hope it was not too boring and revealed some interesting details.

Thank you for running your business with SmartBots!
Despite all the issues we sometimes hit together (ʘ‿ʘ)

6 Comments

  1. o.O RAID array on a server with no drive sleds?!?!? What, are these being run off workstation towers??? Servers use hot-swap sleds, and the drives are individual. If your servers don't use them, I would recommend getting them or new machines. This would have been avoided if the machines were properly set up for them.

    Comment by Derange D — August 17, 2019 @ 4:20 am

  2. Sorry to hear this, guys. I know what a punch to the gut an HD failure and a backup failure feel like; I had it happen to me once.

    You’re probably still putting stuff back together, so you may already be aware: the APIs aren’t working in-world… I can’t access my SmartBots customer Terminal, my group joiners don’t work, and our bot doesn’t respond to control HUDs.

    There’s nothing urgent: the expiry date on the bot seems OK and she’s running, but obviously I don’t have remote group invites.

    Comment by Lyndka Cochrane — August 17, 2019 @ 4:58 am

  3. My notices aren’t being sent; this is very upsetting.
    Also, you have my bot as unpaid, and I was paid up for another week. I have to go to RL work, so please just credit my account so my notices will be sent. Thank you.
    I am also missing my Advanced Notices add-on.

    Comment by annabel coveria — August 17, 2019 @ 6:36 am

  4. I renewed my SmartBot on 8/15 and can’t see it on the website.

    2019-08-15 02:13:04 196ce7ed Destination: Glaznah Gassner
    Payment
    Region: DuoLife
    Description: SmartBots Terminal v3.3 L$1015

    Comment by Goddessgirl — August 17, 2019 @ 6:51 am

  5. My personally owned bot is still not working for my enrollment fee group. I don’t think it is in your database any longer.
    Do I need to do something to get him back up and running after the HD mishaps?
    Let’s please rectify this as quickly as possible; I’m losing memberships fast.
    Thank you
    group name: Franks Place Elite Club, located on Franks Place 7
    Bot Name: FrankElite

    Thank you
    Nanceee Sinatra

    Comment by Nanceee Sinatra — August 18, 2019 @ 1:53 am

  6. Oh god, I’ve literally been through these situations in IT, where trying to do something minor cascades into a huge catastrophe and just about everything that can go wrong does go wrong. You always end up feeling like the stress took a few days off your life! It can leave you shell-shocked for a long time, that is for sure!

    Comment by Chaz Longstaff — August 18, 2019 @ 6:47 am
