GreekChat.com Forums

Greek Life: (Resolved) GreekChat Recent Outages & Server Issues (https://greekchat.com/gcforums/showthread.php?t=240332)

John 03-03-2018 09:48 PM

Happened again 8pm tonight. All sorted out now.

Yesterday I set things up so that I receive text and email notifications immediately after unexpected server reboots, which will speed up how quickly I can correct any resulting issues if it happens again.

Also yesterday I discovered that all these hard reboots are causing plenty of other problems that I'll probably need to sort out soon as well. I'll post details about these other issues either later tonight or tomorrow.
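
The post does not say how the reboot notifications were set up. As a minimal sketch of one possible approach, assuming a marker file that a clean-shutdown hook writes and a boot-time script checks, something like the following could flag a reboot that happened without an orderly shutdown. The function names, the marker path, and the sender address are all hypothetical.

```python
from email.message import EmailMessage
from pathlib import Path


def mark_clean_shutdown(marker: Path) -> None:
    """Call from a shutdown hook: records that the shutdown was orderly."""
    marker.write_text("clean\n")


def reboot_was_unexpected(marker: Path) -> bool:
    """Call once at boot. A missing marker means no orderly shutdown preceded it."""
    if marker.exists():
        marker.unlink()  # consume the marker so the next boot is judged afresh
        return False
    return True


def build_alert(hostname: str, recipient: str) -> EmailMessage:
    """Build (but do not send) the notification email."""
    msg = EmailMessage()
    msg["Subject"] = f"Unexpected reboot on {hostname}"
    msg["From"] = f"monitor@{hostname}"  # hypothetical sender address
    msg["To"] = recipient
    msg.set_content(
        "The server came back up without a clean-shutdown marker. "
        "Check the database tables."
    )
    return msg
```

At boot, a script could call `reboot_was_unexpected(...)` and, if it returns True, hand the message to a local mail server with `smtplib.SMTP("localhost").send_message(msg)`. Note that the very first boot after installing this would also raise an alert, since no marker exists yet.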

FSUZeta 03-04-2018 06:40 AM

You're the best John.

John 03-04-2018 01:49 PM

Another hard reboot around an hour ago... All fixed up again.

Quote:

Originally Posted by FSUZeta (Post 2454566)
You're the best John.

Sure doesn't feel like it during times like this. :o :eek: I suppose on the positive side it's good that GC hasn't had any technical issues like this in probably over 10 years. And once all the issues are sorted out more permanently GC should end up with a server setup that is much more resilient to issues such as what we're facing now. :)

AZTheta 03-04-2018 02:06 PM

What you do is totally voodoo to me.

I do appreciate it!

John 03-05-2018 01:38 AM

Quote:

Originally Posted by AZTheta (Post 2454572)
What you do is totally voodoo to me.

My first instinct was to respond that it's not as complicated as it seems... but it probably is. I've just been doing this stuff for so long that much of it has become second nature.

Quote:

Originally Posted by John (Post 2454551)
I'll post details regarding these other issues either later tonight or tomorrow.

So...

The forum software we use here at GC, like most forum software, stores its data in MySQL. At least when this version of the forum software was developed, MySQL defaulted to the MyISAM storage engine.

And it turns out that the MyISAM database storage engine is not particularly resilient to sudden power loss as has happened with GC's server quite a few times in the past month.

Essentially, if the database server was in the process of saving any pertinent information when the power was disrupted, only part of the data may have been saved and the rest lost or corrupted. That may or may not corrupt important data in the database.

Up until March 1st this wasn't a big issue, as far as I can tell, since the problems always seemed to affect non-essential areas of the database. But on March 1st the two reboots crashed the user database table. According to the forum software developer, this sort of crash (despite being "repaired" using MySQL's repair functions) may have corrupted some GCer account records. Those records may not be recoverable, in which case the affected members would need to start a new account.

I'm definitely not okay with that, so I will be doing everything I can to ensure GC data is minimally impacted once all the server issues are sorted out. Nobody has emailed me about problems accessing their GC account, so maybe no accounts have been corrupted so far.

Also, I don't know for certain that the MySQL repair functions leave data without issues untouched. So maybe there is data corruption that is currently undetected. This is something that I'll be looking into.
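
One way to look for silent changes, sketched here under assumptions rather than as the method actually used: compare each row in a trusted backup with the same row after the repair and list the ones that differ. The sketch assumes both copies have already been loaded as dictionaries keyed by row ID; `row_digest` and `changed_rows` are hypothetical helper names.

```python
import hashlib


def row_digest(row: tuple) -> str:
    """Fingerprint a row so two copies can be compared cheaply."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def changed_rows(backup: dict, repaired: dict) -> list:
    """Return the IDs present in both copies whose contents differ."""
    shared_ids = backup.keys() & repaired.keys()
    return sorted(
        row_id
        for row_id in shared_ids
        if row_digest(backup[row_id]) != row_digest(repaired[row_id])
    )


# Hypothetical sample data: (username, post count) keyed by user ID.
backup = {1: ("alice", 120), 2: ("bob", 45), 3: ("carol", 7)}
repaired = {1: ("alice", 120), 2: ("bob", 0), 3: ("carol", 7), 4: ("dave", 1)}
suspect_ids = changed_rows(backup, repaired)
```

A differing row is only a candidate for review: legitimate edits made after the backup would show up in the same list.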

---

What I'll be doing:

1. Stabilizing the GC hosting environment.

Currently I'm waiting for the datacenter to replace a faulty/failing power strip/distribution unit. After that I'll test the server hardware to determine if these problems are due to the server going bonkers or if it's the datacenter's PDU that caused the problems.

2. I've been researching what changes to make, and I will either reinstall the current server or set up a new server in such a way that GC's database will be resilient (or at least significantly more resilient) to future power disruptions.

3. Possible data corruption. I'll try to determine if there is data corruption. If not, then we should be good from that point. However, if there is data corruption I might restore the last trusted database backup (which is from just before the first hard reboot back in December) and will merge all of the new stuff from then to current back into that known good copy of the database.

What that will do is limit any potential resulting data corruption issues to only the past 3 months rather than the entire history of GC.

Unsure about that part but it's something I'm considering.
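
To illustrate the restore-and-merge idea in step 3, here is a rough sketch assuming each table can be treated as a dictionary keyed by row ID. Rows that already exist in the trusted backup are kept as-is, and only rows created since the backup are carried over from the current database. The function name and the sample data are hypothetical, and a real merge would have to handle edits to old rows, which this sketch deliberately discards.

```python
def merge_into_backup(backup_rows: dict, current_rows: dict) -> dict:
    """Start from the trusted backup and add only rows it does not already have."""
    merged = dict(backup_rows)
    for row_id, row in current_rows.items():
        if row_id not in merged:
            merged[row_id] = row  # created after the backup was taken
    return merged


# Hypothetical sample: post ID -> post text.
backup_posts = {101: "old post", 102: "another old post"}
current_posts = {101: "old post", 102: "an0th#r old p", 103: "new post"}
merged_posts = merge_into_backup(backup_posts, current_posts)
```

Here post 102 comes from the backup (so the damaged copy is dropped) and post 103 is carried over, which is what limits any corruption to the rows added since the backup.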

---

And one last piece of info in this extra long message:

Code:

# ls -f | wc -l
1160443

That's a Linux command that counts the entries in a directory (the -f option skips sorting, which helps with this many files, and also counts the two special entries . and ..). So that number (1,160,443) is roughly the number of database error emails that have been sent to an email account on the server as a result of all these issues. They are probably just from the periods immediately after the reboots, before I repaired the database. I haven't seen that number increase for hours, so chances are there aren't any (or aren't many) lingering database problems due to the reboots.

Those emails also aren't likely to be unique errors. There may be just a few dozen errors, each repeated thousands of times. If it becomes necessary for me to look through the errors, I'll write a program to sort through all that and return just one message for each unique error.
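
A program like that could be fairly short. The sketch below is only an illustration under assumptions: it treats every file in the directory as one error message, and it replaces runs of digits with # so that messages differing only in timestamps or IDs are grouped together. Real mail files also carry headers that would need stripping first. The function names are hypothetical.

```python
import os
import re
from collections import Counter


def normalize(text: str) -> str:
    """Collapse digit runs so messages differing only by IDs or times match."""
    return re.sub(r"\d+", "#", text.strip())


def unique_errors(mail_dir: str) -> Counter:
    """Count how often each distinct (normalized) message appears."""
    counts: Counter = Counter()
    # os.scandir yields entries unsorted, so it copes with very large directories.
    with os.scandir(mail_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            with open(entry.path, encoding="utf-8", errors="replace") as fh:
                counts[normalize(fh.read())] += 1
    return counts
```

Calling `unique_errors(path).most_common()` would then list each distinct error once, with the most frequent first.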

---

That's it for now. Thanks for staying tuned in to GC!

AZTheta 03-05-2018 12:06 PM

^^^ I'm very glad it makes sense to you. Still voodoo to me. I don't speak that language!

Thanks again for everything you do, it is appreciated.

rockwallgreek 03-05-2018 03:29 PM

John, I do not understand even a little bit of what you said, but I am very thankful for all you do!!

John 03-05-2018 05:23 PM

Quote:

Originally Posted by rockwallgreek (Post 2454596)
John, I do not understand even a little bit of what you said, but I am very thankful for all you do!!

Let's say you had one chance to write down an important message with some requirement that the pen must not be removed from the paper until complete. You could not make corrections or finish the message if you remove the pen from the paper before you complete it.

Then, while writing, the paper is abruptly yanked away. Now your message is only half written with part of it not legible and that's how it must remain.

That's sort of what happens when there is a power outage with the web server. Anything in the process of being saved when the power is cut might end up a mess / corrupted and only partially saved to the database. Corrupted data could result in some things not working correctly on the website or maybe not at all.

Although I'm not certain, so far it seems that we may be in the clear regarding any data corruption.

aephi alum 03-05-2018 10:56 PM

^ I like that explanation, John.

I've worked with MySQL, and my personal preference for engine is InnoDB, not MyISAM. Mainly because InnoDB supports foreign keys and transactions. I'm guessing you don't have control over which engine is used.

Do you have any tools available to you to analyze DB performance? (e.g. NewRelic)

Thank you again for everything you do for us.

John 03-06-2018 01:39 AM

Quote:

Originally Posted by aephi alum (Post 2454606)
my personal preference for engine is InnoDB

InnoDB does seem really good. I've been reading up on it this week. That's what I'll likely be switching to, specifically for the transactions feature. Seems that may help significantly with data integrity in the face of all these power issues.

Once the power issues are sorted out hopefully there won't be any related problems again for a long time. But, if there are problems at least I'll know InnoDB may be able to handle it much better.
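
For anyone curious what the transactions feature buys, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in, not MySQL or InnoDB, and the table, names, and numbers are made up. The point is only the all-or-nothing behavior: if something goes wrong between two related writes, the first one is undone rather than left half-finished.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, posts INTEGER NOT NULL)")
conn.execute("INSERT INTO account VALUES ('alice', 10), ('bob', 5)")
conn.commit()


def move_posts(conn, src, dst, n, fail_midway=False):
    """Move n posts from src to dst as a single all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET posts = posts - ? WHERE name = ?", (n, src)
            )
            if fail_midway:
                raise RuntimeError("simulated failure between the two writes")
            conn.execute(
                "UPDATE account SET posts = posts + ? WHERE name = ?", (n, dst)
            )
    except RuntimeError:
        pass  # the first UPDATE has already been rolled back


move_posts(conn, "alice", "bob", 3, fail_midway=True)
after_failure = dict(conn.execute("SELECT name, posts FROM account"))

move_posts(conn, "alice", "bob", 3)
after_success = dict(conn.execute("SELECT name, posts FROM account"))
```

After the simulated failure both counts are unchanged, and after the clean run both have moved together. A raised exception is not the same as a power cut, but InnoDB applies the same idea during its crash recovery, discarding transactions that were never committed.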

In addition, I'm going to test ZFS with my setup and if it works out well I'll place the MySQL data folder on a zpool for the additional data integrity benefits.

Quote:

Originally Posted by aephi alum (Post 2454606)
I'm guessing you don't have control over which engine is used

On the server, yes, but not so much with regards to the software.

Back when this version of the forum software was developed they decided to go with MyISAM. InnoDB was available then, but MyISAM was the default for MySQL at the time so maybe that's why they went with it.

I recall that back then InnoDB wasn't necessarily recommended for vBulletin. I'm unsure exactly why, but one of the issues mentioned was that full-text search was available in MyISAM but not in InnoDB (I read this week that InnoDB now has that feature). Apparently, though, full-text search in MyISAM only affected the forum software's search engine, and instead of fixing the search to work with InnoDB they just went with MyISAM tables.

Anyhow, there is a path to switching over to InnoDB which works with the current software that I'll be looking more into. (Although, I'm not planning to keep GC on this forum software much longer, but that is an entirely different topic that I'll be starting a new thread about soon.)
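
The post doesn't spell out that path. As a general sketch, and not necessarily the route vBulletin supports, MySQL can list the tables that still use MyISAM and convert each one with an ALTER TABLE statement. The Python below only builds the SQL text; the table names passed in are placeholders, and running the statements against a live forum would call for a backup and some downtime first.

```python
# Query to list MyISAM tables; %s is a driver placeholder for the schema name.
FIND_MYISAM_SQL = (
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = %s AND engine = 'MyISAM'"
)


def quote_identifier(name: str) -> str:
    """Quote a table name MySQL-style, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def conversion_statements(table_names):
    """Build one ALTER TABLE statement per table to switch it to InnoDB."""
    return [
        f"ALTER TABLE {quote_identifier(name)} ENGINE=InnoDB;"
        for name in table_names
    ]


# Placeholder table names for illustration only.
statements = conversion_statements(["user", "post"])
```

Tables that carry full-text indexes would need a MySQL version whose InnoDB supports them before they could be converted this way.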

Quote:

Originally Posted by aephi alum (Post 2454606)
Do you have any tools available to you to analyze DB performance? (e.g. NewRelic)

I've read about NewRelic but never used it before. The server isn't too overloaded so I never really found it necessary to explore squeezing more performance out of the DB in that way.

NinjaPoodle 03-23-2018 05:46 PM

John, I just saw the IP addresses.

Thank you!!

John 03-23-2018 08:07 PM

Quote:

Originally Posted by NinjaPoodle (Post 2455072)
John, I just saw the IP addresses.

From the message I posted with the server log file? All those IPs, replaced with # symbols, were my IPs when I was logged in to the server working on stuff. I've been logged in a bunch more times since while dealing with the reboot issues, etc.

We're still not in the clear yet with the server reboot problems, though. The datacenter is taking excessively long with replacing the power distribution unit. GC's server is still powered through that failing PDU but they did reduce the load on it by moving any servers off of it that they could. It was a month ago when I notified their staff of the issue and they were able to confirm the PDU is failing.

Until they replace the PDU, it's uncertain whether any of the reboot issues are due to GC's server hardware or solely due to their PDU being on its way out.

I have a temporary server that I'll be moving GC to soon. Then after the datacenter PDU is taken care of I'll get things set back up on the current web server again.

NinjaPoodle 03-23-2018 09:23 PM

On the blue header bar of each message, above the join date and next to the "report spam" icon, is a computer icon. When you move the cursor over it, it shows the IP addy.

John 03-25-2018 02:57 AM

GC was offline for a while Saturday evening. This time the direct culprit was not a power disruption and reboot, although those were the indirect cause.

Each time the server reboots due to the power issues, it saves some messages to the system error log, which is part of the BIOS. There isn't much space there for logs, so this one could only hold 512 messages. And it turns out that once that log fills up, the system waits at a specific startup screen simply to report that the system error log is full. I had to press the F1 key to get it going again.

Not quite what I was expecting when I saw that the server was completely unresponsive. But glad it was a relatively easy fix.

NinjaPoodle 03-25-2018 01:13 PM

Thanks for the update!


All times are GMT -4.
