GreekChat.com Forums - (Resolved) GreekChat Recent Outages & Server Issues

GreekChat.com Forums (https://greekchat.com/gcforums/index.php)

- Greek Life (https://greekchat.com/gcforums/forumdisplay.php?f=24)

- - (Resolved) GreekChat Recent Outages & Server Issues (https://greekchat.com/gcforums/showthread.php?t=240332)

Whenever the spontaneous reboots occur it usually causes some crashed tables and other tables not being closed properly. Fortunately, as far as I'm aware, so far none of this has caused any major database problems.

Anyhow, everything should be back in order and functioning properly again.

Thanks for everything, John!

Had another spontaneous server hard reboot today...

2 spontaneous reboots on March 1st. One around 9:30 AM and the other around 10:00 PM. That's what took GC offline for a while last night. Some database issues just slow things down a bunch but others take the site completely offline.

Just finished fixing things up.

With the reboots happening more often, I might move GC to a different server temporarily until things with the current server or datacenter equipment are sorted out.

Thanks for keeping us updated, John. Gotta love hardware!

Happened again 8pm tonight. All sorted out now.

Yesterday I set things up so I receive text and email notifications immediately after unexpected server reboots which will speed up how fast I can get any resulting issues corrected when/if it occurs again.

Also yesterday I discovered that all these server hard reboots are causing plenty of other problems that I'll probably be needing to sort out sometime soon as well. I'll post details regarding these other issues either later tonight or tomorrow.

You're the best John.

Another hard reboot around an hour ago... All fixed up again.

Quote:

Originally Posted by FSUZeta (Post 2454566)

You're the best John.

Sure doesn't feel like it during times like this. :o :eek: I suppose on the positive side it's good that GC hasn't had any technical issues like this in probably over 10 years. And once all the issues are sorted out more permanently GC should end up with a server setup that is much more resilient to issues such as what we're facing now. :)

What you do is totally voodoo to me.

I do appreciate it!

Quote:

Originally Posted by AZTheta (Post 2454572)

What you do is totally voodoo to me.

My first instinct was to respond that it's not as complicated as it seems... but it probably is. I've just been doing this stuff for so long that much of it has become second nature.

Quote:

Originally Posted by John (Post 2454551)

I'll post details regarding these other issues either later tonight or tomorrow.

So...

The forum software we use here at GC, similar to most forum type software, uses the MySQL database software. MySQL, at least when this version of the forum software we are on was developed, defaulted to the MyISAM database storage engine.

And it turns out that the MyISAM database storage engine is not particularly resilient to sudden power loss as has happened with GC's server quite a few times in the past month.

Essentially, if the database server was in the process of saving any pertinent information when the power was disrupted, only part of the data may have saved and the other part lost/corrupted. Which may or may not cause corruption to various important data in the database.

Up until March 1st this, as far as I can tell, wasn't a big issue since problems seemed to always impact non essential areas of the database. But, on March 1st the two reboots crashed the user database table. After checking with the forum software developer, this sort of crash (despite being "repaired" using MySQL's repair functions) may have corrupted some GCer account records which may then not be recoverable and for impacted accounts, they would need to start a new account.

I'm definitely not okay with that, so will be doing everything I can to ensure GC data is minimally impacted once all the server issues are sorted out. Nobody has emailed me so far about problems accessing their GC account, so maybe no account corruptions so far.

Also, I don't know for certain that the MySQL repair functions leave data without issues untouched. So maybe there is data corruption that is currently undetected. This is something that I'll be looking into.

---

What I'll be doing:

1. Stabilizing the GC hosting environment.

Currently I'm waiting for the datacenter to replace a faulty/failing power strip/distribution unit. After that I'll test the server hardware to determine if these problems are due to the server going bonkers or if it's the datacenter's PDU that caused the problems.

2. I've been researching what changes to make and I will either reinstall the current server or setup a new server in such a way where GC's database will be resilient (or at least significantly more resilient) to future power disruptions.

3. Possible data corruption. I'll try to determine if there is data corruption. If not, then we should be good from that point. However, if there is data corruption I might restore the last trusted database backup (which is from just before the first hard reboot back in December) and will merge all of the new stuff from then to current back into that known good copy of the database.

What that will do is limit any potential resulting data corruption issues to only the past 3 months rather than the entire history of GC.

Unsure about that part but it's something I'm considering.

---

And one last piece of info in this extra long message:

Code:

# ls -f | wc -l

1160443

That's a Linux command to list the number of files in a directory. That number (1,160,443) is the number of database email error messages that have been sent to an email account on the server related to all these issues. Probably just from the times immediately after the reboots and before I repaired the database. I haven't seen that number increase for hours, so chances are it might mean there aren't any (or aren't many) lingering database problems due to the reboots.

All those emails also aren't likely unique errors. There may just be a few dozen errors each repeated thousands of times each. If it becomes necessary for me to look through the errors I'll write a software program to sort through all that and return just one message for each unique error.

---

That's it for now. Thanks for staying tuned in to GC!

^^^ I'm very glad it makes sense to you. Still voodoo to me. I don't speak that language!

Thanks again for everything you do, it is appreciated.

John, I do not understand even a little bit of what you said, but I am very thankful for all you do!!

Quote:

Originally Posted by rockwallgreek (Post 2454596)

John, I do not understand even a little bit of what you said, but I am very thankful for all you do!!

Let's say you had one chance to write down an important message with some requirement that the pen must not be removed from the paper until complete. You could not make corrections or finish the message if you remove the pen from the paper before you complete it.

Then, while writing, the paper is abruptly yanked away. Now your message is only half written with part of it not legible and that's how it must remain.

That's sort of what happens when there is a power outage with the web server. Anything in the process of being saved when the power is cut might end up a mess / corrupted and only partially saved to the database. Corrupted data could result in some things not working correctly on the website or maybe not at all.

Although I'm not certain, so far it seems that we may be in the clear regarding any data corruption.

^ I like that explanation, John.

I've worked with MySQL, and my personal preference for engine is InnoDB, not MyISAM. Mainly because InnoDB supports foreign keys and transactions. I'm guessing you don't have control over which engine is used.

Do you have any tools available to you to analyze DB performance? (e.g. NewRelic)

Thank you again for everything you do for us.

Quote:

Originally Posted by aephi alum (Post 2454606)

my personal preference for engine is InnoDB

InnoDB does seem really good. I've been reading up on it this week. That's what I'll likely be switching to, specifically for the transactions feature. Seems that may help significantly with data integrity in the face of all these power issues.

Once the power issues are sorted out hopefully there won't be any related problems again for a long time. But, if there are problems at least I'll know InnoDB may be able to handle it much better.

In addition, I'm going to test ZFS with my setup and if it works out well I'll place the MySQL data folder on a zpool for the additional data integrity benefits.

Quote:

Originally Posted by aephi alum (Post 2454606)

I'm guessing you don't have control over which engine is used

On the server, yes, but no so much with regards to the software.

Back when this version of the forum software was developed they decided to go with MyISAM. InnoDB was available then, but MyISAM was the default for MySQL at the time so maybe that's why they went with it.

I recall back then that InnoDB wasn't necessarily recommended for vBulletin, unsure exactly why but one of the issues mentioned was relating to full text search being available in MyISAM but not in InnoDB (which I read this week that it does now have that feature). Apparently, though, full text in MyISAM only impacted the search engine of the forum software but instead of fixing the search to work with InnoDB they just went with MyISAM tables.

Anyhow, there is a path to switching over to InnoDB which works with the current software that I'll be looking more into. (Although, I'm not planning to keep GC on this forum software much longer, but that is an entirely different topic that I'll be starting a new thread about soon.)

Quote:

Originally Posted by aephi alum (Post 2454606)

Do you have any tools available to you to analyze DB performance? (e.g. NewRelic)

I've read about NewRelic but never used it before. The server isn't too overloaded so I never really found it necessary to explore squeezing more performance out of the DB in that way.

Quote:

Originally Posted by NinjaPoodle (Post 2455072)

John, I just saw the IP addresses.

From the message I posted with the server log file? All those IPs, replaced with # symbols, were my IPs when I was logged in to the server working on stuff. I've been logged in a bunch more times since while dealing with the reboot issues, etc.

We're still not in the clear yet with the server reboot problems, though. The datacenter is taking excessively long with replacing the power distribution unit. GC's server is still powered through that failing PDU but they did reduce the load on it by moving any servers off of it that they could. It was a month ago when I notified their staff of the issue and they were able to confirm the PDU is failing.

Until they replace the PDU it causes uncertainty as to whether any of the reboot issues are due to GC's server hardware or if it's solely due to their PDU being on its way out.

I have a temporary server that I'll be moving GC to soon. Then after the datacenter PDU is taken care of I'll get things set back up on the current web server again.

On the blue header bar of each message above the join date, next to the "report spam " icon", is the computer icon, when you move the cursor over it, it shows the IP addy.

GC was offline for a while Saturday evening. This time the culprit was not a power disruption & reboot, although it was the indirect cause.

Each time the server reboots due to the power issues the server saves some messages into the system error log which is part of the BIOS. Not much space there for logs, so this one could only hold 512 messages. And, it turns out, that once that log fills up it will cause the system to wait at a specific startup screen simply to notify about the full system error log. I had to press the F1 key to get it going again.

Not quite what I was expecting when I saw that the server was completely unresponsive. But glad it was a relatively easy fix.

Thanks for the update!

John, it's all black magic to me, but I truly thank you for all you do.

Received notification from the datacenter staff that this week they will have a time set for replacement of the power distribution unit. My guess is that it will probably be replaced within the next 7 to 10 days.

As long as the PDU is the only issue, this should resolve the server power disruption problems and resulting site downtime it has been causing.

If the sporadic reboots continue after the PDU replacement then that will mean it is very likely to be a hardware issue with the server and a bit more work to be done to sort this all out.

I couldn't get onto GC earlier this evening (approx 6pm CDT). I am (obviously) online now, 10pm CDT. Just mentioning this as troubleshooting info for you and the datacenter folks.

Thanks for reporting that. Lately whenever the site is offline it's while I'm running scans/repairs on the database following a power disruption & resulting server hard reboot. I usually stop the httpd server while running maintenance on the database, so at those times GC is completely inaccessible usually for 15 to 25 minutes. Hopefully these issues will be all sorted out soon.

Other than the inconvenience of taking the site offline as well as having to handle this each time the server reboots, so far things have gone okay managing it this way. Although, very far from optimal.

Yesterday was six weeks since I notified the datacenter staff of the issue. After inquiring again yesterday, the datacenter support staff reported the current status in part as: "there are roughly a dozen clients sharing the cabinet and we have to get confirmation from everyone before we can perform any maintenance"

I suppose that also means they are now in possession of the replacement PDU. I'm extremely disappointed with how long the datacenter has taken to replace the failing PDU. However, this has been the only problem in the 4.5 years I've been hosting with them, so am trying to be patient and reasonable about it. I'm not sure what the norm is in this situation of needing to replace a failing, but not yet failed, PDU. Surely if the PDU completely stopped functioning it would have been replaced almost immediately.

The data centers excuse for why it’s been 6 weeks and the PDU isn’t replaced is 100% unacceptable. This is not routine maintenance, it’s emergency maintenance. Even for co-location arrangements with data centers, responsible vendors have provisions where the can do emergency maintenance with little notice in situations like this. The fact that they are treating this in such a lax manner would have me doing everything I can to find another data center. What a mess.

Also, thanks so much for keeping things running for us as best you can.

Finally, PDU replacement maintenance window is scheduled for 8:00PM tonight EST. So only around an hour away from that.

Quote:

Originally Posted by QueenD (Post 2455459)

6 weeks and the PDU isn’t replaced is 100% unacceptable. This is not routine maintenance, it’s emergency maintenance.

I agree. The way that they handled this, taking so long after verifying that the PDU was "showing signs of failure" on Feb. 23rd is very disappointing.

Quote:

Originally Posted by QueenD (Post 2455459)

The fact that they are treating this in such a lax manner would have me doing everything I can to find another data center.

Definitely a reason to consider moving away from them. Before all this happened I was actually considering other options but for different reasons.

With the current datacenter they have a lot of pros. The price is right, full peering with Atlantia TIE network, all traffic goes through an Internap FCP. If my server had two power supplies they would be able to provide two independent power feeds (so the server would still stay online if one of the power feeds (or even PDUs) went down).

Also, in 4.5 years (until now), no other unplanned power disruptions or even network disruptions. They may have had one or two planned maintenance windows that took us offline for a few minutes in all that time.

Besides all that, I'm also unsure yet if any of the power issues are due to hardware problems on GC's server. IPMI continues to report that the PSU is OK, but I'm not sure how certain that assessment is. If there are no power disruptions by tomorrow after the PDU replacement tonight then I think at that point it will be very likely to have been all due to the failing datacenter PDU.

Once I can rule out potential server hardware causes I'll be contacting the datacenter company CEO to see what he thinks about how it was handled. One of the reasons I also moved colo hosting to them back in 2013 was because they had a very solid reputation. Even if my server has other issues it still really surprises me that it took 6 weeks to get this sorted out.

Anyhow, hopefully no more power disruptions after tonight.

Sunday evening, April 8th, the data center's power distribution unit was replaced. Since then not a single power disruption to GC's web server.

As a few days have passed, I think it's now safe to say that the immediate hardware issues which were causing all of the random power losses/reboots are now resolved.

Great news!

Thank you again for all your hard work, John.