Whenever one of these spontaneous reboots occurs, it usually causes some crashed tables and leaves other tables not closed properly. Fortunately, as far as I'm aware, none of this has caused any major database problems so far.
Anyhow, everything should be back in order and functioning properly again. |
Thanks for everything, John!
|
Had another spontaneous server hard reboot today...
|
Two spontaneous reboots on March 1st: one around 9:30 AM and the other around 10:00 PM. That's what took GC offline for a while last night. Some database issues just slow things down a bunch, but others take the site completely offline.
Just finished fixing things up. With the reboots happening more often, I might move GC to a different server temporarily until things with the current server or datacenter equipment are sorted out. |
Thanks for keeping us updated, John. Gotta love hardware!
|
Happened again at 8 PM tonight. All sorted out now.
Yesterday I set things up so that I receive text and email notifications immediately after any unexpected server reboot, which will speed up how quickly I can correct any resulting issues when/if it happens again. Also yesterday, I discovered that all these server hard reboots are causing plenty of other problems that I'll probably need to sort out sometime soon as well. I'll post details about these other issues either later tonight or tomorrow.
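For the curious, the notification part is nothing fancy. The rough idea is below; the file path and email address are just placeholders rather than my actual setup, and my phone carrier is what turns the email into a text message.
Code:
# /etc/cron.d/gc-reboot-alert -- placeholder path; cron runs this line once at every boot
@reboot root echo "GC server booted at $(date)" | mail -s "GC server rebooted" admin@example.com
|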
You're the best John.
|
Another hard reboot around an hour ago... All fixed up again.
|
What you do is totally voodoo to me.
I do appreciate it! |
The forum software we use here at GC, like most forum software, uses the MySQL database software. MySQL, at least when this version of the forum software was developed, defaulted to the MyISAM storage engine. And it turns out that MyISAM is not particularly resilient to sudden power loss, which is exactly what has happened to GC's server quite a few times in the past month. Essentially, if the database server was in the middle of saving anything when the power was disrupted, part of the data may have been saved and the rest lost or corrupted, which may or may not corrupt various important data in the database.

Up until March 1st this, as far as I can tell, wasn't a big issue, since the problems always seemed to hit non-essential areas of the database. But on March 1st the two reboots crashed the user database table. After checking with the forum software developer, a crash like that (despite being "repaired" using MySQL's repair functions) may have corrupted some GCer account records, which may then not be recoverable; anyone with an impacted account would need to start a new one. I'm definitely not okay with that, so I will be doing everything I can to ensure GC data is minimally impacted once all the server issues are sorted out. Nobody has emailed me so far about problems accessing their GC account, so maybe there are no account corruptions yet. Also, I don't know for certain that the MySQL repair functions leave otherwise healthy data untouched, so there may be data corruption that is currently undetected. This is something I'll be looking into.

---

What I'll be doing:

1. Stabilizing the GC hosting environment. Currently I'm waiting for the datacenter to replace a faulty/failing power strip/distribution unit. After that I'll test the server hardware to determine whether these problems are due to the server going bonkers or to the datacenter's PDU.

2. I've been researching what changes to make, and I will either reinstall the current server or set up a new server in such a way that GC's database will be resilient (or at least significantly more resilient) to future power disruptions.

3. Possible data corruption. I'll try to determine whether there is any data corruption. If not, then we should be good from that point on. However, if there is, I might restore the last trusted database backup (from just before the first hard reboot back in December) and merge everything new from then to now back into that known good copy of the database. That would limit any potential corruption issues to only the past 3 months rather than the entire history of GC. I'm unsure about that part, but it's something I'm considering.

---

And one last piece of info in this extra long message:
Code:
# ls -f | wc -l
All those emails also aren't likely to be unique errors. There may just be a few dozen errors, each repeated thousands of times. If it becomes necessary for me to look through the errors, I'll write a small program to sort through all that and return just one message for each unique error.

---

That's it for now. Thanks for staying tuned in to GC!
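P.S. For the programmers in the thread, the "sort through all that" part would probably end up being little more than a shell one-liner. This is just a sketch: it assumes each error email sits as its own file in a Maildir-style folder, and the path and the "Error" marker are placeholders for whatever the messages actually look like.
Code:
# Count how many times each distinct error line appears across all the email files,
# most frequent first (folder path and "Error" pattern are only examples)
cd /var/mail/gc-errors/new
find . -type f -print0 | xargs -0 grep -h "Error" | sort | uniq -c | sort -rn | head -50
|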
^^^ I'm very glad it makes sense to you. Still voodoo to me. I don't speak that language!
Thanks again for everything you do, it is appreciated. |
John, I do not understand even a little bit of what you said, but I am very thankful for all you do!!
|
Imagine writing a message down on a piece of paper. Then, while writing, the paper is abruptly yanked away. Now your message is only half written, part of it isn't legible, and that's how it must remain. That's sort of what happens when there is a power outage with the web server. Anything in the process of being saved when the power is cut might end up a mess / corrupted and only partially saved to the database. Corrupted data could result in some things not working correctly on the website, or maybe not at all. Although I'm not certain, so far it seems that we may be in the clear regarding any data corruption. |
^ I like that explanation, John.
I've worked with MySQL, and my personal preference for a storage engine is InnoDB, not MyISAM, mainly because InnoDB supports foreign keys and transactions. I'm guessing you don't have control over which engine is used. Do you have any tools available to analyze DB performance (e.g., NewRelic)? Thank you again for everything you do for us.
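In case it's useful, here's a quick way to see which engine each table is on (the database name below is only an example):
Code:
# List each table and its storage engine for one database ("gcforum" is just an example name)
mysql -u root -p -e "SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'gcforum';"
|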
Once the power issues are sorted out, hopefully there won't be any related problems again for a long time. But if there are, at least I'll know InnoDB should be able to handle them much better. In addition, I'm going to test ZFS with my setup, and if it works out well I'll place the MySQL data folder on a zpool for the additional data integrity benefits.
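The rough plan I'll be testing looks something like the following. The disk device, pool/dataset names, and service name are placeholders for whatever the new setup ends up using, and I'd still tune the ZFS settings before anything went live.
Code:
# Create a pool on a spare disk, then a dataset for the MySQL data
# (device, pool, and dataset names are placeholders)
zpool create mysqlpool /dev/sdb
# 16K record size roughly matches InnoDB's default page size
zfs create -o recordsize=16k mysqlpool/mysql-data
# Stop MySQL, copy the existing data directory onto the pool,
# point datadir in my.cnf at the new location, then start it back up
systemctl stop mysqld
rsync -a /var/lib/mysql/ /mysqlpool/mysql-data/
systemctl start mysqld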
Back when this version of the forum software was developed, they decided to go with MyISAM. InnoDB was available then, but MyISAM was MySQL's default at the time, so maybe that's why they chose it. I recall that InnoDB wasn't necessarily recommended for vBulletin back then; I'm not sure exactly why, but one of the issues mentioned was that full-text search was available in MyISAM but not in InnoDB (though I read this week that InnoDB now has that feature too). Apparently full text only mattered for the forum software's search engine, but instead of making search work with InnoDB they just went with MyISAM tables. Anyhow, there is a path to switching over to InnoDB that works with the current software, which I'll be looking more into (roughly sketched below). Although, I'm not planning to keep GC on this forum software much longer, but that is an entirely different topic that I'll be starting a new thread about soon.
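At the database level the conversion itself is the easy part; the real question is how the software behaves afterward. Roughly speaking (the database and table names below are only examples):
Code:
# Convert a single table ("gcforum" and "user" are example names)
mysql -u root -p gcforum -e "ALTER TABLE user ENGINE=InnoDB;"
# Or generate the ALTER statements for every MyISAM table still left in the database
mysql -u root -p -N -e "SELECT CONCAT('ALTER TABLE \`', TABLE_NAME, '\` ENGINE=InnoDB;') FROM information_schema.TABLES WHERE TABLE_SCHEMA='gcforum' AND ENGINE='MyISAM';"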
|
We're still not in the clear with the server reboot problems, though. The datacenter is taking excessively long to replace the power distribution unit. GC's server is still powered through that failing PDU, but they did reduce the load on it by moving off whatever servers they could. It was a month ago that I notified their staff of the issue, and they were able to confirm the PDU is failing. Until they replace it, there's no way to be sure whether any of the reboot issues are due to GC's server hardware or solely due to their PDU being on its way out. I have a temporary server that I'll be moving GC to soon. Then, after the datacenter PDU is taken care of, I'll get things set back up on the current web server again. |
On the blue header bar of each message, above the join date and next to the "report spam" icon, is the computer icon; when you move the cursor over it, it shows the IP addy.
|
GC was offline for a while Saturday evening. This time the culprit was not a power disruption and reboot, although one was the indirect cause.
Each time the server reboots due to the power issues, it saves some messages into the system error log, which is part of the BIOS. There isn't much space there for logs, so this one can only hold 512 messages. And it turns out that once that log fills up, the system waits at a specific startup screen simply to notify you that the error log is full. I had to press the F1 key to get it going again. Not quite what I was expecting when I saw that the server was completely unresponsive, but I'm glad it was a relatively easy fix.
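Assuming that log is the IPMI system event log (which it usually is on server boards), it can also be read and cleared remotely so the server never has to sit at that F1 prompt again; something along these lines:
Code:
# Show the logged events, check how full the log is, then clear it
ipmitool sel list
ipmitool sel info
ipmitool sel clear
|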
Thanks for the update!
|
John, it's all black magic to me, but I truly thank you for all you do.
|
Received notification from the datacenter staff that this week they will have a time set for replacement of the power distribution unit. My guess is that it will probably be replaced within the next 7 to 10 days.
As long as the PDU is the only issue, this should resolve the server power disruption problems and the resulting site downtime they have been causing. If the sporadic reboots continue after the PDU replacement, that will mean it is very likely a hardware issue with the server, and there will be a bit more work to do to sort this all out. |
I couldn't get onto GC earlier this evening (approx 6pm CDT). I am (obviously) online now, 10pm CDT. Just mentioning this as troubleshooting info for you and the datacenter folks.
|
Thanks for reporting that. Lately, whenever the site is offline, it's because I'm running scans/repairs on the database following a power disruption and the resulting server hard reboot. I usually stop the httpd server while running maintenance on the database, so at those times GC is completely inaccessible, usually for 15 to 25 minutes. Hopefully these issues will all be sorted out soon.
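In case anyone is curious what those 15 to 25 minutes look like, it's roughly the routine below (the service commands assume systemd, and sometimes I run a full repair rather than just a check):
Code:
# Stop the web server so nothing writes to the database during the check
systemctl stop httpd
# Check every table, and auto-repair any crashed MyISAM tables that turn up
mysqlcheck -u root -p --all-databases --check
mysqlcheck -u root -p --all-databases --auto-repair
# Bring the site back up
systemctl start httpd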
Other than the inconvenience of taking the site offline, as well as having to handle this each time the server reboots, things have gone okay managing it this way so far, although it's very far from optimal.

Yesterday was six weeks since I notified the datacenter staff of the issue. After inquiring again yesterday, the datacenter support staff reported the current status in part as: "there are roughly a dozen clients sharing the cabinet and we have to get confirmation from everyone before we can perform any maintenance". I suppose that also means they are now in possession of the replacement PDU.

I'm extremely disappointed with how long the datacenter has taken to replace the failing PDU. However, this has been the only problem in the 4.5 years I've been hosting with them, so I'm trying to be patient and reasonable about it. I'm not sure what the norm is in this situation of needing to replace a failing, but not yet failed, PDU. Surely if the PDU had completely stopped functioning it would have been replaced almost immediately. |
The data center's excuse for why it's been 6 weeks and the PDU isn't replaced is 100% unacceptable. This is not routine maintenance, it's emergency maintenance. Even for co-location arrangements with data centers, responsible vendors have provisions where they can do emergency maintenance with little notice in situations like this. The fact that they are treating this in such a lax manner would have me doing everything I can to find another data center. What a mess.
Also, thanks so much for keeping things running for us as best you can. |
Finally, the PDU replacement maintenance window is scheduled for 8:00 PM EST tonight. So we're only around an hour away from that.
The current datacenter has a lot of pros. The price is right, there's full peering with the Atlantia TIE network, and all traffic goes through an Internap FCP. If my server had two power supplies, they would be able to provide two independent power feeds (so the server would stay online even if one of the power feeds, or even one of the PDUs, went down). Also, in 4.5 years (until now) there have been no other unplanned power disruptions or even network disruptions; they may have had one or two planned maintenance windows that took us offline for a few minutes in all that time.

Besides all that, I'm also still unsure whether any of the power issues are due to hardware problems on GC's server. IPMI continues to report that the PSU is OK, but I'm not sure how reliable that assessment is. If there are no power disruptions by tomorrow after the PDU replacement tonight, then I think it will be very likely that this was all due to the failing datacenter PDU.

Once I can rule out potential server hardware causes, I'll be contacting the datacenter company's CEO to see what he thinks about how this was handled. One of the reasons I moved colo hosting to them back in 2013 was their very solid reputation. Even if my server turns out to have other issues, it still really surprises me that it took 6 weeks to get this sorted out. Anyhow, hopefully no more power disruptions after tonight.
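For what it's worth, the PSU status I keep mentioning is just what the BMC reports over IPMI, i.e. something along the lines of:
Code:
# Show the power supply sensor readings/status as reported over IPMI
ipmitool sdr type "Power Supply"
|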
Sunday evening, April 8th, the data center's power distribution unit was replaced. Since then not a single power disruption to GC's web server.
As a few days have passed, I think it's now safe to say that the immediate hardware issues which were causing all of the random power losses/reboots are now resolved. |
Great news!
Thank you again for all your hard work, John. |
Thank you so much!
|