Fun with the overnight crew...
Dateline: Sunday morning, 2:30am (my week on call)...
Ring!!!!!
Kevin3NF: {snort} thisiskevin
Caller: Hey man...sorry to wake you but we have some t-log backups failing for Acme Widgets
Kevin3NF: whatstheerrormessagein the job (waking up a bit, not out of bed)
Caller: I can't find the job in the SQL Agent job list
Kevin3NF: WhatIsTheExactAlertFromNetIQ? (slightly clearer now)
Caller: SQL Task failed: 'Backup Job 17' failed
Kevin3NF: The job list is alphabetical...did you sort by name?
Caller: Yeah...but theres like 2000 jobs...most have numbers and letters and stuff
Kevin3NF: Those are reporting services jobs...just ignore them and look for the Backup job
Caller: I'm looking for Job 17, but its not in the 'J's
Kevin3NF: Did you look under 'B', since the job's name starts with 'Backup...' ?
Caller: silence
Caller: I'm an idiot
Kevin3NF: Goodnight.
Caller: Click
Kevin3NF: snore.
Still makes me laugh...I went back to sleep knowing I would post this little 2 minute interaction :)
Kevin3NF
12/10/2007
11/20/2007
Why we maintain databases...
I was blissfully unaware of this issue until I came in this morning and saw the debris...
The players: Freakshow and Tiny (my co-workers), the Customer, and the application Vendor.
Customer: Everything is slow and my cluster keeps failing over! Must be SQL Server!
Freakshow: We didn't change anything on your servers sir...
Customer: You must have!
Freakshow: Could be bad indexes or stats...
Customer's Vendor: Oracle doesn't have this issue
Freakshow: Grrrr....
Tiny: Double Grrrrrr.....
Freakshow: You have 49 million rows, and are table scanning/locking...
Customer: We have 3 years of data
Vendor: You should have 90 days, and you are 3 years behind on versions of the application...
Freakshow: Let me update the stats....
Customer: Hey...its working!!!
Vendor: You know that 'purge' utility we provide?....it comes with fries :)
Freakshow: I'm going back to bed
Tiny: I'm going home (6 hours past end of the shift...)
Moral: Purge your data, keep your versions up, and maintain your indexes folks...
I was blissfully unaware of this issue until I came in this morning and saw the debris...
The players: Freakshow and Tiny (my co-workers), the Customer, and the application Vendor.
Customer: Everything is slow and my cluster keeps failing over! Must be SQL Server!
Freakshow: We didn't change anything on your servers sir...
Customer: You must have!
Freakshow: Could be bad indexes or stats...
Customer's Vendor: Oracle doesn't have this issue
Freakshow: Grrrr....
Tiny: Double Grrrrrr.....
Freakshow: You have 49 million rows, and are table scanning/locking...
Customer: We have 3 years of data
Vendor: You should have 90 days, and you are 3 years behind on versions of the application...
Freakshow: Let me update the stats....
Customer: Hey...its working!!!
Vendor: You know that 'purge' utility we provide?....it comes with fries :)
Freakshow: I'm going back to bed
Tiny: I'm going home (6 hours past end of the shift...)
Moral: Purge your data, keep your versions up, and maintain your indexes folks...
10/30/2007
17803 errors all over the place....
We have close to 400 SQL Server instances to manage. Within the last month we are starting to see 17803 errors at least once a week on completely different servers, different customers, different hardware, etc. We used to get one maybe every 6 months. SQL 2000, SP4 + .2187 hotfix rollup on most.
Anyone else seeing this? I'm wondering if something at the O/S level is the commonality (assuming there is one).
We have close to 400 SQL Server instances to manage. Within the last month we are starting to see 17803 errors at least once a week on completely different servers, different customers, different hardware, etc. We used to get one maybe every 6 months. SQL 2000, SP4 + .2187 hotfix rollup on most.
Anyone else seeing this? I'm wondering if something at the O/S level is the commonality (assuming there is one).
10/25/2007
Do what?
So....I've got this job that backs up the T-log every 15 minutes (SQL 2005). Works like a champ across all servers.
Out of the blue, it starts failing with a "No full backup detected", despite the fact that one occurred less than an hour ago. Its recorded in the msdb system tables and the file is right where the full backup job left it.
The customer had decided for reasons unknown to create a job that does this (every hour):
Use MyDB
DBCC SHRINKFILE(MyDB_Log, EMPTYFILE)
BACKUP LOG MyDB WITH TRUNCATE_ONLY
DBCC SHRINKFILE(MyDB_Log, EMPTYFILE)
go
I kid you not. And no...there is not a second file to EMPTY the log contents into.
For those of you new to the game, when you Truncate a T-log, the sequence is broken and SQL Server barfs on future log backups until you run a full.
Kudos to my co-worker (code name Freakshow) for catching this as I was headed to the ERRORLOG to see what was going on. He nailed in 10 seconds what would have taken me 7 minutes. And then he knew it was a new job while I was still reading the log. Scary smart, that dude.
No fries, but ketchup squirted all over that client.
Kevin3NF
So....I've got this job that backs up the T-log every 15 minutes (SQL 2005). Works like a champ across all servers.
Out of the blue, it starts failing with a "No full backup detected", despite the fact that one occurred less than an hour ago. Its recorded in the msdb system tables and the file is right where the full backup job left it.
The customer had decided for reasons unknown to create a job that does this (every hour):
Use MyDB
DBCC SHRINKFILE(MyDB_Log, EMPTYFILE)
BACKUP LOG MyDB WITH TRUNCATE_ONLY
DBCC SHRINKFILE(MyDB_Log, EMPTYFILE)
go
I kid you not. And no...there is not a second file to EMPTY the log contents into.
For those of you new to the game, when you Truncate a T-log, the sequence is broken and SQL Server barfs on future log backups until you run a full.
Kudos to my co-worker (code name Freakshow) for catching this as I was headed to the ERRORLOG to see what was going on. He nailed in 10 seconds what would have taken me 7 minutes. And then he knew it was a new job while I was still reading the log. Scary smart, that dude.
No fries, but ketchup squirted all over that client.
Kevin3NF
10/24/2007
9/12/2007
Performance on a clustered SQL 2005 instance
Customer sez: I've been getting timeouts all morning
Kevin3NF sez: CPUs are good, memory is fine (buffer cache hit ratio and page life expectancy...my two favorite memory counters for quick hit analysis..)
Kevin3NF: Uh-oh....why is the disk queue so high for drive N? Houston, we found the problem...need clearance for root cause.
Root cause: Someone decided to expand the volume on the SAN that included the data drive, T-log drive and backup drive while the instance was running, and at HIGH priority. The second it finished, everything went back to normal.
Sadly, I needed a Windows/Storage guy to tell me what was going on after I identified I/O as the resource bottleneck.
Kevin3NF
Customer sez: I've been getting timeouts all morning
Kevin3NF sez: CPUs are good, memory is fine (buffer cache hit ratio and page life expectancy...my two favorite memory counters for quick hit analysis..)
Kevin3NF: Uh-oh....why is the disk queue so high for drive N? Houston, we found the problem...need clearance for root cause.
Root cause: Someone decided to expand the volume on the SAN that included the data drive, T-log drive and backup drive while the instance was running, and at HIGH priority. The second it finished, everything went back to normal.
Sadly, I needed a Windows/Storage guy to tell me what was going on after I identified I/O as the resource bottleneck.
Kevin3NF
8/20/2007
2005 Log Shipping
ok...I'm behind the times and just started working with SQL Server Log Shipping instead of writing my own custom scripts (Backup, Copy, Restore...just ain't that hard...)
Messing with a dead LS database this weekend, I tried to script out the existing configuration from the Log Shipping GUI. Has anyone else seen this thing drop all of the information on the destination server when it created the script?
I didn't bother to Google it and just re-created from scratch.
It bothers me that you have to drop and re-create to "re-initialize" a log shipped database.
No fries for me :(
ok...I'm behind the times and just started working with SQL Server Log Shipping instead of writing my own custom scripts (Backup, Copy, Restore...just ain't that hard...)
Messing with a dead LS database this weekend, I tried to script out the existing configuration from the Log Shipping GUI. Has anyone else seen this thing drop all of the information on the destination server when it created the script?
I didn't bother to Google it and just re-created from scratch.
It bothers me that you have to drop and re-create to "re-initialize" a log shipped database.
No fries for me :(
8/15/2007
LazyWriter: warning, no free buffers found.
SQL 2005, 9.0.3042
Got this in my ERRORLOG, along with a ton a memory messages not worth repeating here.
8GB RAM in the box, SQL Server set to dynamically allocate, and using 6.3 gb.
SQL was not allowing connections, giving the misleading "remote connections are not allowed/TCP" messages (translated: "something bad happened")
Solution (as seen in other places, but not many)...hard reboot of the server, bring up SQL, configure max memory to 6GB (anything other than dynamic, but leave some for the O/S).
No fries...took 30 minutes to get to the resolution. Dessert maybe :)
Kevin3NF
SQL 2005, 9.0.3042
Got this in my ERRORLOG, along with a ton a memory messages not worth repeating here.
8GB RAM in the box, SQL Server set to dynamically allocate, and using 6.3 gb.
SQL was not allowing connections, giving the misleading "remote connections are not allowed/TCP" messages (translated: "something bad happened")
Solution (as seen in other places, but not many)...hard reboot of the server, bring up SQL, configure max memory to 6GB (anything other than dynamic, but leave some for the O/S).
No fries...took 30 minutes to get to the resolution. Dessert maybe :)
Kevin3NF
6/21/2007
4/24/2007
2/06/2007
Enterprise Manager hang
My customer's SQL 2000 .818 box has started timing out when he tries to use Enterprise manager to backup or restore a database.
After giving him the command to run in Query Analyzer, I took a look at his backup tables in MSDB and found roughly 3.8 million records in each...dating back to mid 2005.
I'll be running sp_delete_backuphistory tonight. :)
If you don't want to use the cursor driven proc provided by MSFT, click here for Tara Kizer's custom script:
isp_DeleteBackupHistory
Sorry for the long delay between postings...its been slow :)
Kevin3NF
My customer's SQL 2000 .818 box has started timing out when he tries to use Enterprise manager to backup or restore a database.
After giving him the command to run in Query Analyzer, I took a look at his backup tables in MSDB and found roughly 3.8 million records in each...dating back to mid 2005.
I'll be running sp_delete_backuphistory tonight. :)
If you don't want to use the cursor driven proc provided by MSFT, click here for Tara Kizer's custom script:
isp_DeleteBackupHistory
Sorry for the long delay between postings...its been slow :)
Kevin3NF
1/24/2007
1/16/2007
Database in Read-Only after detach/attach
I got a "middle of the night panic call" this morning. My customer was moving 2 user databases to a new drive on the same server.
After detach and copy, he re-attached successfully. But, the databases are in read-only mode and cannot be changed (Error 5105, Device activation error. The physical file name '%.*ls' may be incorrect.)
SQL 2000, post Sp3, Windows 2003, sp1.
Additional errors:
2007-01-16 03:06:41.02 spid61 Starting up database 'MyDatabase'.
2007-01-16 03:06:41.02 spid61 udopen: Operating system error 5(Access is denied.) during the creation/opening of physical device F:\SQLData\MyDatabase_Data.mdf.
2007-01-16 03:06:41.02 spid61 FCB::Open failed: Could not open device F:\SQLData\MyDatbase_Data.mdf for virtual device number (VDN) 1.
2007-01-16 03:06:41.02 spid61 udopen: Operating system error 5(Access is denied.) during the creation/opening of physical device F:\SQLData\MyDatabase_Log.ldf.
2007-01-16 03:06:41.02 spid61 FCB::Open failed: Could not open device F:\SQLData\MyDatabase_Log.ldf for virtual device number (VDN) 2.
Turned out that the SQL Server startup account did not have the correct security permissions on the folder or the files. Whoops. :)
Kevin3NF
I got a "middle of the night panic call" this morning. My customer was moving 2 user databases to a new drive on the same server.
After detach and copy, he re-attached successfully. But, the databases are in read-only mode and cannot be changed (Error 5105, Device activation error. The physical file name '%.*ls' may be incorrect.)
SQL 2000, post Sp3, Windows 2003, sp1.
Additional errors:
2007-01-16 03:06:41.02 spid61 Starting up database 'MyDatabase'.
2007-01-16 03:06:41.02 spid61 udopen: Operating system error 5(Access is denied.) during the creation/opening of physical device F:\SQLData\MyDatabase_Data.mdf.
2007-01-16 03:06:41.02 spid61 FCB::Open failed: Could not open device F:\SQLData\MyDatbase_Data.mdf for virtual device number (VDN) 1.
2007-01-16 03:06:41.02 spid61 udopen: Operating system error 5(Access is denied.) during the creation/opening of physical device F:\SQLData\MyDatabase_Log.ldf.
2007-01-16 03:06:41.02 spid61 FCB::Open failed: Could not open device F:\SQLData\MyDatabase_Log.ldf for virtual device number (VDN) 2.
Turned out that the SQL Server startup account did not have the correct security permissions on the folder or the files. Whoops. :)
Kevin3NF
1/12/2007
New Year's resolutions kill Log Shipping!
ok...funniest issue I've run across in a while :)
We have some custom log shipping script that was working just fine until the end of 2006. Starting around January 2, the restore process couldn't restore fast enough to keep the standby server current.
I looked at the T-Log files for the last 2 weeks and saw that the T-log backups between 3 and 5 am were 2 to 3 times the size they used to be.
Why? Because the customer here is a national Fitness center business that had a huge influx of memberships and activities :)
People trying to lose weight broke log shipping.
The fix? Run the restore process more often (it was only running off hours).
You get fries AND a low-fat shake with this one...
Kevin3NF
ok...funniest issue I've run across in a while :)
We have some custom log shipping script that was working just fine until the end of 2006. Starting around January 2, the restore process couldn't restore fast enough to keep the standby server current.
I looked at the T-Log files for the last 2 weeks and saw that the T-log backups between 3 and 5 am were 2 to 3 times the size they used to be.
Why? Because the customer here is a national Fitness center business that had a huge influx of memberships and activities :)
People trying to lose weight broke log shipping.
The fix? Run the restore process more often (it was only running off hours).
You get fries AND a low-fat shake with this one...
Kevin3NF
Autogrow continued:
re: 12/19/2006 SQL 2005 Autogrow issue
I have confirmed with Microsoft that the bug will be fixed in SP2, and that there is no workaround other than monitoring.
I have written a script for my customer that will check the 'growth' and 'status' values in the sysfiles table of each database on their system. If the growth number is over x and the status is over 'y'...we have a problem.
The best way to avoid this whole mess:
MANAGE YOUR DATAFILES!!
If you rely on Autogrow, you are not properly administering your database. Autogrow is a failsafe mechanism for when you get unexpected growth, and nothing more.
Kevin3NF
re: 12/19/2006 SQL 2005 Autogrow issue
I have confirmed with Microsoft that the bug will be fixed in SP2, and that there is no workaround other than monitoring.
I have written a script for my customer that will check the 'growth' and 'status' values in the sysfiles table of each database on their system. If the growth number is over x and the status is over 'y'...we have a problem.
The best way to avoid this whole mess:
MANAGE YOUR DATAFILES!!
If you rely on Autogrow, you are not properly administering your database. Autogrow is a failsafe mechanism for when you get unexpected growth, and nothing more.
Kevin3NF
Subscribe to:
Posts (Atom)