Problem Determination
Through the use of monitoring tools, the Data Center typically receives alerts of problem situations. The Operations Specialist's primary duty is to respond to, repair, or report problem situations to Data Center Management. Alerts can arrive through real-time monitoring, email messages, or SMS text paging. Problems are handled in order of their severity and their effect on the campus community.
Let's consider a real example of a problem encountered with an enterprise backup application, and follow the steps required to troubleshoot the issue and formulate a solution. The scenario below occurred during the fall of 2013 and was documented as part of my independent study course (SED 599C).
The Product
The Data Center at California State University Northridge currently uses a data protection product called NetBackup. The product was originally created by Veritas, which has since been purchased by Symantec. NetBackup is an enterprise backup and recovery software solution designed for large-scale, multi-location computing environments. The application is client-server based, with server policies that initiate system backups based on a client schedule. It performs system backups for a combination of 350 virtual and physical servers and stores data locally and at offsite locations on disk or tape media for file recovery and disaster recovery purposes.
The Problem
While reviewing the job Activity Monitor within NetBackup, I noticed that some jobs were not displaying increasing kilobyte counts or increasing KB/sec data transfer rates. This indicates that no data is being transferred from the client to the master server. Other executing jobs were terminating with an error code of "13 - Read File Failed".
Figure 1 - NetBackup Activity Monitor
Job IDs marked with a "blue man" completed successfully, a "green running man" marks a currently executing job, and a "red x" marks a failed job.
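The stall symptom described above (an active job whose kilobyte count stops growing) can be checked by comparing two samples of job activity taken a few minutes apart. The following is a minimal sketch, not part of NetBackup: the `JobSample` record, its state labels, and the `find_stalled_jobs` helper are hypothetical names used only for illustration, and it assumes the samples have already been collected by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSample:
    """One observation of a backup job (hypothetical structure, not a NetBackup API)."""
    job_id: int
    state: str      # assumed labels: "active", "done", "failed"
    kilobytes: int  # cumulative kilobytes transferred so far


def find_stalled_jobs(previous, current):
    """Return IDs of active jobs whose kilobyte count did not grow between two samples."""
    earlier = {sample.job_id: sample.kilobytes for sample in previous}
    stalled = []
    for sample in current:
        if sample.state != "active":
            continue  # finished or failed jobs are not stalled
        # A job seen for the first time has no baseline yet, so it is skipped.
        if sample.job_id in earlier and sample.kilobytes <= earlier[sample.job_id]:
            stalled.append(sample.job_id)
    return sorted(stalled)


if __name__ == "__main__":
    before = [JobSample(101, "active", 5000), JobSample(102, "active", 800)]
    after = [JobSample(101, "active", 5000), JobSample(102, "active", 1600)]
    print(find_stalled_jobs(before, after))
```

A job that appears only in the second sample is deliberately ignored, since one reading cannot show whether its transfer is progressing.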
The Architecture
The NetBackup application runs on the SunOS v11 operating system, while the client is a Microsoft Windows Server 2003 machine running Exchange 2007. To research the issue further, a team of systems analysts was assembled to represent the respective operating systems and applications involved. The team was managed by the Operations Manager.
Troubleshooting Efforts
Over the course of the next few weeks the team met daily to review the situation, gather log information, and form an idea as to what was causing the error. Since a workaround was available (terminating the hung processes and re-running the failed jobs), the severity of the issue was low and did not warrant an immediate solution. Here is a summary of what the team found.
- The job failures were isolated to the Exchange mail server backup and were random in nature.
- The client job processes in question were executing with 0% job load on the system.
- Client network loads would drop significantly minutes into job execution.
- Some client jobs would not terminate normally while others terminated within 2 minutes.
- Detailed system logs indicated communication issues between the two servers.
- A reboot of the server and client machines cleared the jobs but did not solve the issue.
Problem Identification
At this point the problem appeared to originate from the client and not the server, but this had not yet been confirmed. Prior to engaging enterprise support, the Operations Manager performed additional troubleshooting to ensure the NetBackup master server was not at fault. Based on previous customer support cases, the vendor's best practices typically request that the master server be upgraded to the latest application version and operating system patch level. This was scheduled through IT's internal change control process, and the application was upgraded from v6.5 to v7.5. An additional maintenance patch became available a month after the upgrade and was installed as well, which brought the master server application up to version 7.5.0.6.
Additional Observations
I also noticed that for each executing process on the master server there was an associated "bpbkar.exe" process executing on the client, as shown in Windows Task Manager. Over time the hung jobs would continue executing and multiply until they reached a limit set by a registry key value. Once that value was reached, all future jobs would receive a "57 - Can't Connect to Client" error message. This means that all communication sockets for the application have been exhausted and further requests for a communication channel will be denied. To clear this condition, all currently executing processes must be terminated using the Windows "tasklist" and "taskkill" command line utilities, or the NetBackup client service must be restarted from within Microsoft Windows Services.
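The cleanup described above can be scripted. The following is a minimal sketch, assuming it runs on the Windows client with sufficient privileges; the helper names are hypothetical, and the default image name is the one reported in this write-up, which may differ on other installations. It lists matching processes with "tasklist" in CSV form and builds the corresponding "taskkill" commands, running them only when `dry_run` is turned off.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(output, image_name="bpbkar.exe"):
    """Extract PIDs for image_name from `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # Skip blank lines and the single-field "INFO: No tasks ..." message
        # that tasklist prints when nothing matches the filter.
        if len(row) < 2 or row[0].lower() != image_name.lower():
            continue
        if row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def list_backup_pids(image_name="bpbkar.exe"):
    """Run tasklist on the local Windows machine and return the matching PIDs."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist_csv(result.stdout, image_name)


def kill_pids(pids, dry_run=True):
    """Build one forced taskkill command per PID; execute them only if dry_run is False."""
    commands = [["taskkill", "/PID", str(pid), "/F"] for pid in pids]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

Keeping `dry_run=True` as the default lets an operator review the commands before any process is terminated; counting the PIDs returned by `list_backup_pids` over time also shows how quickly hung processes are accumulating.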
Community Support
After all the logs have been researched and the problem diagnosed and summarized, the next step in every System Analyst's playbook should be to reach out to the user community and see who else has experienced a similar issue. There are countless forums and user support groups available based on the application and operating systems involved to assist with most technology problems. In this case I accessed the Symantec Connect Backup & Recovery User Community.
A quick search of this forum based on the client symptoms described above turned up an article about a computing environment running the same operating system and the same version of Exchange that we were running. They were running a lower version of the server application, v6.0, but this supported my suspicion that the issue was with the client and not the server. They were experiencing intermittent backup failures on the first storage group of their Exchange servers, while we were experiencing failures on random storage groups. The article further pointed to additional Microsoft/Exchange knowledge base articles specific to Service Pack 1 of the Windows Server 2003 operating system that were of great value.
The Solution
Support cases were opened with Symantec NetBackup and Microsoft, and all related log files were gathered and delivered to their technical staff. After reviewing our case and performing additional troubleshooting with their level II engineering staff, they recommended that we upgrade our Exchange server operating system to Service Pack 2 and upgrade the NetBackup client to the latest version. They found no issues with our master server, which had previously been upgraded to the latest version in response to earlier issues with our Exchange backups.
An internal change management ticket was opened and discussed in IT's weekly change management meeting with senior management staff. The ticket was approved by the committee and the upgrades were installed by the Systems Administrator. System backups completed successfully without errors and all support cases were closed.
By following the five basic strategies below, we were able to diagnose and solve this particular issue. Other system administrators may find them useful for identifying and solving both routine and complex issues.
- Assume ownership of the problem.
- Inform management and solicit support from co-workers, vendors, colleagues and other sources.
- Assign tasks and gather information.
- Develop a strategy or plan of attack.
- Implement a solution, documenting the issue for future reference.