|
Post by petefinnigan on Mar 13, 2007 17:02:02 GMT -5
Hi All,
I have found a bug in 1.7.2. - I dont have a solution yet but some information on what happens. Any help will be gladly received.
I have upgraded my testblog and my web development blog to 1.7.2 and rebuilt all the files with no problem but these are quite small blogs. This evening I updated my main Oracle security blog to 1.7.2 and went to rebuiltd all the files and it errors with a 500 internal server error.
There is no additional information on the screen, the access lof has these entries:
xx.xxx.xxx.xx - - [13/Mar/2007:21:07:32 +0000] "POST /weblog/cgi-bin/gm.cgi HTTP/1.1" 200 8718 "http://www.petefinnigan.com/weblog/cgi-bin/gm.cgi" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; Alexa Toolbar; .NET CLR 2.0.50727)" xx.xxx.xxx.xx - - [13/Mar/2007:21:07:34 +0000] "POST /weblog/cgi-bin/gm.cgi HTTP/1.1" 200 7121 "http://www.petefinnigan.com/weblog/cgi-bin/gm.cgi" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; Alexa Toolbar; .NET CLR 2.0.50727)" xx.xxx.xxx.xx - - [13/Mar/2007:21:07:35 +0000] "POST /weblog/cgi-bin/gm.cgi HTTP/1.1" 200 4005 "http://www.petefinnigan.com/weblog/cgi-bin/gm.cgi" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; Alexa Toolbar; .NET CLR 2.0.50727)" xx.xxx.xxx.xx - - [13/Mar/2007:21:07:38 +0000] "GET /weblog/cgi-bin/gm.cgi?gmtoken=gm1173798457&thomas=rebuildupdate&rebuilding=everything&rebuildfrom=logarchives&connectednumber= HTTP/1.1" 500 620 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; Alexa Toolbar; .NET CLR 2.0.50727)"
The third entry was when i pressed rebuild everything, this was at 21:07:35
The counter was updated at 21:07:37 - nothing actually chnaged in the counter.
Then the index page was updated at 21:07:37
Then the access_log shown in the last of the 4 entries above shows the url generated from the meta-refresh, the time id is shown.
Next the oldest archive index page for September 2004 is updated at 21:07:39 -
Then the first entry page on the archive 00001010.htm is generated at 21:08:07, then after that from 21:08:09 to 21:10.01 the other 30 monthly archive index pages are generated. Finally the gm-token.cgi file is generated with the next timestamp.
Whilst this goes on the screen says "Rebuilt log archives. Now rebuilding archive master index..." and never gets to the next screen, it hangs and finally gives a 500 error. The error_log says
[Tue Mar 13 21:09:38 2007] [error] [client 83.104.132.73] Premature end of script headers: gm.cgi
This means that the first IF in the function gm_rebuildupdate is executed and the second elsif
} elsif ($IN{'rebuildfrom'} eq "logarchives") {
is most likely executed as the gm-token is written. This means the code fails after the token but before the screen is updated to say "Rebuilt log archives. Now rebuilding archive master index..." but I cannot see the reason why.
any ideas???
cheers
Pete
|
|
|
Post by coldstone on Mar 14, 2007 12:39:10 GMT -5
I hate the error: premature end of script header, it is so unhelpful.
Can you put some warn statements in the code, like right before and after that elsif?
The only change to that section was a fix for blogs whose first entry is the only entry for that week/month. I'll try and replicate this at home and debug.
|
|
|
Post by petefinnigan on Mar 14, 2007 14:37:44 GMT -5
Hi Coldstone,
I did a diff on that function last night and saw that you added a single if statement. I commented it out and I still have the error. I have just spent the last 40 mins writing a Gm_Trace module and getting it to compile. I have added a single test call at entry of the rebuild function. I will now instrument the cmplete function so that i can trace what happens.
I have written Gm_Trace so that it can be switched on/off and so that it can set trace levels. 1=error, 2=function entry, 3=flow,4=debug. you can set it to print level 1 say or levels 1 to3 say.
at the moment its hard coded but it could be controlled by config or some othre mechanism so that the call is there but not doing anything. That way in the future with this sort of issue we can turn on trace and hopefully work out the issue.
I will report more latr
cheers
Pete
|
|
|
Post by petefinnigan on Mar 14, 2007 17:29:56 GMT -5
Hi Coldstone, Sorry no news on the fix yet. Quick update. I have implemented a Gm_Trace package - I will post up the package tomorrow. I am tired and need to get some sleep I have instrumented the code in gm_rebuildupdate - again I will post the instrumented code tomorrow. I have also instrumented gm.cgi around the head of the file and the new call to check the IN params. The rebuild still fails in the same place after processing all the monthly archive templates in then seems to hang for 2 minutes and errors with premature file headers etc. I have run "top" on the web server and the CPU goes to 99.7 - 100% CPU during the run. The strange thing is that both my small sites rebuild fine. I have just ran the rebuild again on my test blog and it works fine. 1.7.1 also works fine against my main blog, therefore it has to be one of the chnages made in 1.7.2. My trace shows some strange behaviour, I will enhance the trace tomorrow to include the IP address and post tomorrow. The stage where the archives are being created shows gm.cgi being started 4 times, I need to check if this is all me, someone could be posting a comment to cause this. cheers Pete
|
|
|
Post by coldstone on Mar 14, 2007 22:55:28 GMT -5
Hi Pete,
Sounds like you have a good handle on this. However, if someone posts a comment, you aren't going to get a hit on gm.cgi.
That CPU spike is bad, thats usually a sign of an infinite loop. Until you got into the trace stuff, I was going to suggest removing the new web param processing stuff (by uncommenting old and commenting out Gm_Web call) and see if it works then.
Ok, after digging awhile this just gets strange. As you point out, if you see the lines 'Rebuilt log archives...' then it is almost out of that function (which ends in an exit). However, if you see the 'Rebuilt log archives...' then you should see another access log entry which has '...rebuildfrom=masterindex...' in the url.
Can you post (or message me if you prefer) a complete access log from login to 500 error? I am guessing how Gm process the data has subtly changed some data that is used for loop execution or something. One last question, do you have connected files with any of your blogs?
|
|
|
Post by petefinnigan on Mar 15, 2007 12:53:45 GMT -5
Hi Coldstone, Thanks very much for your reply. You are of course correct on the comments, i was tired last night I thought the same about the spike but i have not found any evidence in the trace to suggest there is an infinite loop yet. That doesnt mean that there isnt one though. I have added the instrumented library, gm.cgi, gm-utils, gm-web, the trace code Gm_Trace.pm and a trace file to web.petefinnigan.com/greymatter_mods.htm where you can see the trace code, how i have added trace to the files and also the trace file from the run. I find this a strange bug as the issue doesnt occur on my small blogs. I also tested a rebuild on archive log indexes and that also failed with a 500 error. this is called slightly differently but ends up in the same code. I have restored 1.7.1 for now and tested a complete rebuild of all files. This worked correctly. I will upgrade back to 1.7.2 again later I have tried commenting out the new IN parameter code and it still fails on 1.7.2 The issue seems to be the fact that the next screen is not drawn (returned to the browser), therefore the META-EQUIV REFRESH is never called to start the next section. I have tried adding a http request header to that bit of code at the end of the function, that didnt work either. The code gets to the end of the function, yiou can see that in the trace as the html is included for the next screen, it simply never gets to the browser and for the life of me I cannot see why. I am wondering if its a performance issue. The old code was slow and maybe we have just tipped it over the edge with the size of my blog and also the code getting bigger. If this is the case and not a simple bug then I am unsure how we fix it cheers Pete
|
|
|
Post by coldstone on Mar 15, 2007 14:14:29 GMT -5
I gotta say I like your commenting style, very polished.
That trace is cool. I see what you mean about the end of the script. It looks like it doesn't finish drawing the page (I don't see the closing body or html tags - UPDATE I just realized that trace won't print out the closing tags, however, it could still be a flush buffer issue). I am almost wondering if this is a flushing issue. A new line should cause the buffer to flush (add a 'print "\n\n";' just before exit), but it can be forced with '$|=1;'.
I think this issue can definitely be fixed, since it works in 1.7.1 and we have heard from people with 3k+ entries. We will get it, it'll just take some tinkering.
|
|
|
Post by petefinnigan on Mar 15, 2007 14:34:42 GMT -5
Hi Coldstone,
Thanks for your post. Yes i also noticed the lack of closing body and html tags but the first call to the function has them missing as well.
I need to re-update to 1.7.2 and try the flush. I wanted to put it back to 1.7.1 as a sanity check. I will try the newlines first. I also thought about file handles and tried a select(STDOUT) but no go. I will try the flush later. I am still concerned about the 100% CPU though as well, maybe its a memory issue as well?
I will update later.
cheers
Pete
|
|
|
Post by petefinnigan on Mar 16, 2007 11:59:02 GMT -5
Hi Coldstone, Some news. I have found out what the issue is. Well not totally , let me explain. The reason the next page in the rebuild (The HTML) at the end of the trace is not displayed in the example trace from yesterday is that Apache times out before completing sending the HTML to the browser. The request header is printed at the beginning of gm.cgi, the rest of the HTML code including the META-EQUIV refresh is at the end after all the processing. I realised this last night when testing. The issue is that the request header is dispatched to the browser and then Apache times out because the rest of the page is not sent to the browser before the Apache "TimeOut" httpd.conf value is exceeded. So OK, we now know the reason it fails with a 500 error but not the real reason. I increased the "TimeOut" value on my apache server this morning before work and let a complete rebuild run on a browser on my spare PC in the background. I have just checked the timestamps on the files - it didnt fail this time - and the elapsed time for a complete rebuild of my site was over 6 hours. with 1.7.1 it took between 40 minutes and an hour depending on server load. So we know what the problem is at a high level but not in detail. We have to assume that because i did a restore last night to 1.7.1 and did a rebuild that worked reasonably fast that 1.7.2 has introduced some slow code or some error that causes unncecessary processing. The proble is finding what this is. I think the only way forward is to use the Gm_Trace package i created to test for this issue and to instrument every function in GM at entry and exit. this will record something like: Thu, 15 Mar 2007 07:44:59:2:12345:gm.cgi:main Thu, 15 Mar 2007 07:44:59:2:12345:gm.cgi:funca Thu, 15 Mar 2007 07:44:59:2:12345:gm.cgi:funcb Thu, 15 Mar 2007 07:44:59:2:12345:gm.cgi:funcc Thu, 15 Mar 2007 07:44:59:3:12345:gm.cgi:funcc Thu, 15 Mar 2007 07:44:59:3:12345:gm.cgi:funcb Thu, 15 Mar 2007 07:45:01:3:12345:gm.cgi:funca Thu, 15 Mar 2007 07:45:01:2:12345:gm.cgi:main
I can modify Gm_Trace to make it output the file, line, function and also identify entry and exit of functions by using level 2for entry and level 3 for exit. I also need to identify execution runs byt a unique token and also change the time to inc at least mili-seconds. If I can quickly add the calls to all the code we can start to test where the bottlenecks are cheers Pete
|
|
|
Post by coldstone on Mar 16, 2007 14:58:02 GMT -5
That fits with what I have seen as well. That 1.7.2 seems to be slower and its causing the timeout, my workaround was to change the number of files its processes in a batch (from 9 to 5). However, you are correct there seems to be a bigger issue. And of course since the archive logs are done all at once, there is no way to break up their processing. I made very few changes to gm-library, there is one particular place I remember changing to fix a defect with the calendar. I will see what has changed and perhaps that too will give us a clue. I am very curious what I could have done to give Gm a x10 performance slowdown
|
|
|
Post by coldstone on Mar 16, 2007 22:55:17 GMT -5
Versions Affected: GreyMatter 1.7.2 Severity: High (Unable to rebuild files, depending on circumstances)
Issue Description: GreyMatter experienced a significant performance loss with the release of 1.7.2, causing request timeouts when trying to rebuild. The issue has been discovered. Due to my fix for the previous/next links, there was a 30-45 second increase in a loop that is executed 9 times for every entry file built (leading to a cumulative 4:30 to 6:45 per request).
Resolution: The code is a smallish fix at line 1954:
do { foreach my $entry ( sort { $gmentrylist->{$b}{'id'} <=> $gmentrylist->{$a}{'id'} } keys( %$gmentrylist ) ) {
to:
my @sortedentries = sort { $gmentrylist->{$b}{'id'} <=> $gmentrylist->{$a}{'id'} } keys( %$gmentrylist ); do { foreach my $entry ( @sortedentries ) {
Fixed in GM 1.7.3. The problem area of code is when GM determines the next/previous link and to do this, loops through every single line in the gm-entrylist, which means that as rebuilding progresses, this loop becomes slower and slower.
|
|