Alexandre Polozoff - Network Early Warning System
The following paper is based on a network management project carried out at Continental Bank of Illinois (now Bank of America). The primary developers were myself (also the manager), Dick Brandt and Mike Davenport. Marty McManus had significant input on the requirements, design and overall ease of use of the system.

The Problem

CBK was split among three buildings, with about 3,500 employees working on computers. Like any other company, it was building up its LAN on an almost daily basis. If a server went down, as many as 100 people could be effectively unable to do their jobs.
Additionally, some departments had specialized databases on which their users depended. Sometimes the database server itself would go down, which essentially locked users out of the database until the machine was restarted. The ability to monitor access to separate databases and to report a failed access was needed. Likewise, various groupware products in use at CBK had their own server software, which would periodically crash, so the ability to monitor different vendor applications was also required. A further requirement was the ability to have different support people paged, only on their shift and only for their particular area of expertise.

CBK did not want to add software to the servers that was not involved with day-to-day banking business, so the network management application needed to do its work without using 'agents.' This requires polling the services. While not necessarily the best solution, polling does work in a small environment such as this.

The Design of a Solution

The design went through several code iterations before finally gelling into the solution discussed here. Prototyping is still the best way to test and tweak a design. The network manager would be composed of several different applications running on more than one workstation.
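The agentless requirement means that every check has to be started from the manager's side on a timer. The following is a minimal sketch of that idea in Python, not the original code (which ran on OS/2). The name `poll_due`, the shape of the `checks` table and the choice to count a test that raises an error as a failure are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical check table: name -> (interval in seconds, test function).
# Each test runs on the manager's side and returns True when the service answers.
Checks = Dict[str, Tuple[float, Callable[[], bool]]]


def poll_due(checks: Checks, last_run: Dict[str, float], now: float) -> Dict[str, bool]:
    """Run every check whose interval has elapsed and return {name: available}."""
    results = {}
    for name, (interval, test) in checks.items():
        if now - last_run.get(name, float("-inf")) >= interval:
            try:
                results[name] = bool(test())
            except Exception:
                # Assumption: a test that blows up is treated as "not available".
                results[name] = False
            last_run[name] = now
    return results
```

A caller would invoke `poll_due` in a loop with the current clock value. Nothing is installed on the servers being tested, which is the point of the requirement.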
NEWS Daemon

The NEWS daemon is the heart of the entire network manager. The daemon reads in his database on startup. He correlates the information into internal structures that let him run separate tests at timed intervals and page the appropriate support staff in the event of an error. An outage record is then created and the daemon continues.

The daemon is multi-threaded. Given the enormous number of network services and elements that needed to be tested, doing so in a single-threaded environment posed certain problems. Testing each service and element sequentially could take almost five minutes before the first element could be tested again. This was an unacceptable lag because most users were able to detect a problem within five minutes of its occurrence. The multi-threaded environment was capable of testing all the services and elements within a one-minute period, providing substantially better response time.

Correlation occurs in the daemon's paging logic. Once a fault has occurred, he looks to see if any correlation rules exist for that particular element or service. If correlation rules are found and matched, support personnel are paged according to those rules. Otherwise a page is issued immediately.

A correlation rule that was found to be acceptable for all outage reports works as follows. A page is sent on the initial outage detection. Subsequently, a page is issued every 5 minutes for the next half hour. For the half hour after that, a page is issued every 15 minutes. From then on, a page is sent once an hour. This was used successfully and kept support personnel from ignoring their pagers.

Paging Server

The paging server is a standalone application that waits for NETBIOS connections. Upon a connection it receives two pieces of data.
The first is the pager unit to send the message to; the second is the message itself. Once the message is received, the server connects to the paging service provider via modem and sends the message to the provider. For better throughput it maintains the connection to the service provider for a minimum of one minute, so if a subsequent NETBIOS message is received it is unnecessary to reconnect. This saves time, since several support personnel can be paged per network failure.

NEWS Databases

Needless to say, the daemon's network database contains extensive information. Each time a new network element is added to the network, the database has to be updated manually. No discovery is built into this design because some network elements are test environments. Performing fault management on these elements would have been useless and time-consuming for the support staff. The argument could be made that the testbeds could be handled by the correlation logic, which would cancel the notification. The problem with this is that the test machines were not announced to the support staff. If a new testbed came online and was automatically discovered, no correlation logic would apply until the database could be updated. This would result in pages being issued in error, outage records that skew the final reports, and a frustrated support staff.

In addition to the test information, there is a field for the timing interval. This allowed for a special test that monitors the network manager itself; it doesn't help if the network manager goes down and no one knows about it. Essential during early development, this feature became obsolete as bugs were eradicated from the manager.

The daemon also had a maintenance table. Periodic maintenance is not uncommon in a network. While it may appear as an outage, it really is a planned event. The table prevented the daemon from sending spurious messages to personnel who were actively engaged in maintenance.
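The maintenance table and the repeat-paging rule from the NEWS Daemon section both reduce to small time calculations. The sketch below is one possible reading of them in Python, not the original implementation. The names `in_maintenance`, `repeat_interval` and `page_offsets`, the table layout and the sample dates are all hypothetical.

```python
from datetime import datetime

# Hypothetical maintenance table: element name -> list of (start, end) windows.
MAINTENANCE = {
    "notes-server-1": [(datetime(2000, 1, 8, 22, 0), datetime(2000, 1, 9, 2, 0))],
}


def in_maintenance(element, now, table=MAINTENANCE):
    """Return True if `now` falls inside a planned window for `element`."""
    return any(start <= now < end for start, end in table.get(element, []))


def repeat_interval(minutes_down):
    """Minutes to wait before the next page, following the rule in the text:
    every 5 minutes in the first half hour, every 15 in the second, then hourly."""
    if minutes_down < 30:
        return 5
    if minutes_down < 60:
        return 15
    return 60


def page_offsets(limit_minutes):
    """Minutes after detection at which a page is due, up to and including the limit."""
    offsets, t = [], 0
    while t <= limit_minutes:
        offsets.append(t)
        t += repeat_interval(t)
    return offsets
```

Under this reading, an outage lasting three hours produces pages at 0, 5, 10, 15, 20, 25, 30, 45, 60, 120 and 180 minutes, and a failed test on an element inside one of its maintenance windows would simply not be paged.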
The daemon also has an output database that contains outage information and which support personnel were notified. Every network element has three kinds of entry for an outage: (a) the initial outage record, (b) a continuing outage record each time a subsequent test of the element fails, and (c) the terminal outage record when the element is back online.

Applets

The daemon itself is ignorant of the network content. All he is aware of is a particular network address, be it a LAN Server ID, a TCP/IP address, a Lotus Notes Server ID, a database table name or whatever. The daemon tests a particular service through applets written specifically to either query the status of the service or use the service in order to determine its availability. An applet can have only two results: the service is available or it is not. In many cases the applets contain at most 20 lines of code. But an applet has to be written to test each particular service, so there are several applets. Care had to be taken to ensure that an applet does not put an unnecessary burden on the network. Applets could be written in any language; we had applets both in C and as REXX command scripts.

Another useful applet was file monitoring. There are several critical configuration files for both operating systems and applications. The applet would notify a network administrator when one of these files had changed and who changed it. The only way someone could circumvent this was to gain access to the server itself, which is behind a locked door. A version control system also helped by making backups when a change occurred.

The applets also meant that no additional applications had to reside on the element. This was important not only because it was a requirement but also because several different operating systems could be in use.
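To give a feel for how small such an applet can be, here is a hedged sketch of an availability test for a TCP/IP address, written in Python rather than the C or REXX used on the project. The function names and the exit-status convention (0 for available, 1 for not available) are assumptions for illustration, not the original interface between daemon and applet.

```python
import socket


def service_available(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # Opening and immediately closing one connection keeps the load on the network small.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable and timed-out connections all count as "not available".
        return False


def exit_status(host, port):
    """Map the two possible outcomes onto a process exit status (assumed convention)."""
    return 0 if service_available(host, port) else 1
```

A command-line wrapper would pass the result of `exit_status` to `sys.exit`, so that the caller only has to look at the exit status and needs no knowledge of what was tested.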
Keeping extra software off the elements also reduced the need to have development environments to support different systems, and it helped keep maintenance costs lower.

Applets could also serve as periodic maintenance applications. In some cases it is beneficial to reboot a server or restart an application remotely. By writing the appropriate code, work that normally needed human intervention could be done without it.

Reports

Reports from the NEWS database provided several different views of the network. One report was used by the network administrators to pinpoint particular services that had unusually high levels of problems. This provided a systematic way of deciding where to focus attention in problem determination. Other reports provided billing information with respect to the amount of downtime. In a Big Brother-ish manner, the reports could also determine whether particular support personnel were not responding immediately to detected problems. The reports were charted using the Visualizer product on OS/2. Reports were produced on a daily, weekly and monthly basis.

Results

The results of this project were highly successful. Support personnel were aware of problems before the users. In some instances where the correction took some time, the support staff was able to respond intelligently that a problem existed and was being investigated. This produced two results. One was more confidence in the network support and administration. The other was more network uptime and less loss of business. Since network administration was billed on uptime, revenues for the network increased.

One problem that occurred was due strictly to human behaviour. Much like the subjects of Pavlov's experiments, many support personnel were conditioned to expect one type of network outage because it occurred more frequently than others. When an unusual network problem showed up, the support personnel would try to solve the more common outage problem.
This resulted in some confusion until someone read the actual message. One way of solving this was to have messages in different formats for different problems.

The Future

With the advent of two-way paging upon us, several additions can be made to a system like this. Active responses from support personnel can be recorded. Likewise, messages could be sent to the daemon notifying it of unscheduled maintenance work being conducted.
Copyright © 1999, Alexandre Polozoff. All Rights Reserved.