Services
- Choosing and implementing Services
- Estimate Support Requirements
- Keep Familiar, Read the Manual
- Unique Service Names
- Case-Study: News-Server
- Managing Rollouts
- Case-Study: E-mail
- Icing: Staff who can Patch
Services are the core of any IT operation: they are what you rely on and what your customers pay for. Needless to say, they need to be well managed. A large part of this is service monitoring, but since our colleague Ann Harding covered that in her previous talk at SAGE-IE, we're going to concentrate on some of the other aspects.
Choosing and implementing Services
You should choose both the services you offer and their implementations wisely. Your choice of services will be mostly guided by customer requirements and management decisions, but it's important that you volunteer your input for consideration whenever possible. If you have any ideas for new services, volunteer those too. You might be surprised at how good they are.
When it comes to choosing implementations, don't get stuck in a "we use this because we've always used this" rut. Outline sensible requirements for your implementation and examine alternatives. Here are our current service evaluation criteria:
- Good IPv6 capability
- Compliance with Relevant RFCs
- Ease of management and configurability
- Maintainership and frequency of security issues
- Feature-set
For you, those criteria may not be suitable, but it is important that you outline some. This way you can come to some sane, rational decisions about what software you need to use. It also means that you have an accountability trail, which is important.
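One way to turn a criteria list like the one above into a recorded, repeatable decision is a simple weighted score. The sketch below is only an illustration: the weights, the 0-5 rating scale and the function names are assumptions of ours, not something the evaluation process above prescribes.

```python
# Hypothetical weights for the five criteria listed above. The relative
# importance of each criterion is an assumption made for illustration.
WEIGHTS = {
    "ipv6": 3,
    "rfc_compliance": 3,
    "manageability": 2,
    "maintainership": 2,
    "features": 1,
}


def score(ratings):
    """Return the weighted average of 0-5 ratings, one per criterion."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"unrated criteria: {sorted(missing)}")
    total = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())


def rank(candidates):
    """Sort candidate implementations from best to worst weighted score."""
    scored = [(name, score(ratings)) for name, ratings in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Keeping the ratings and the resulting ranking alongside the decision gives you the accountability trail mentioned above, whatever weights you settle on.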
Estimate Support Requirements
You should also estimate the support requirements for each service. Just because you have the hardware to run a service does not mean you have the manpower. Some services eat up a lot more time than others, and this needs to be a factor in your decision.
If you do not properly estimate support requirements, you will degrade your level of service as a whole. A good example is a USENET service: news is disproportionately time-consuming. Despite its relatively small userbase, USENET generates a lot of traffic, disk I/O and admin issues. You'll also have to deal with customer complaints and ongoing filtering concerns.
Keep Familiar, Read the Manual
We keep saying it, and it's still true: read the manual. Keep an interest in the services you run if at all possible. Subscribe to the mailing lists, read new documentation and review new releases as they come out.
You'll pick up golden nuggets of information that will help you improve your service, and help you when you need to debug it. By seeing the problems others encounter, and their resolution, you can avoid the same problems yourself.
Whilst doing this for every little piece of software you run would be inefficient and exhausting, it is certainly a good idea for your major services.
Unique Service Names
If you supply a service to more than one customer, you should consider unique service names. Most services are accessed via a name of some sort, usually a DNS hostname. If you can, give customers their own unique names; DNS records cost nothing. If warranted, consider allocating extra IP space for exceptional cases.
If you have 10 customers sharing a mail server, for example, consider giving them access hostnames such as customer1.mail.example.com, customer2.mail.example.com and so on. That way, should you ever need to migrate the service or individual customers, you can do it one by one with zero interruption to customer service and no effort on their part.
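The per-customer naming scheme above can be sketched as a small generator of zone-file CNAME records. This is a minimal illustration under stated assumptions: the function name, the TTL default and the target hostnames in the usage note are hypothetical, and only the customerN.mail.example.com pattern comes from the text.

```python
def customer_cnames(customers, service_zone, default_target,
                    overrides=None, ttl=3600):
    """Build one zone-file CNAME line per customer.

    Every customer name points at default_target unless an entry in
    overrides names a different machine, which is how a single customer
    can be migrated without touching anyone else's record.
    """
    overrides = overrides or {}
    lines = []
    for customer in customers:
        target = overrides.get(customer, default_target)
        lines.append(f"{customer}.{service_zone}. {ttl} IN CNAME {target}.")
    return lines
```

For instance, calling it with customers customer1 and customer2, the zone mail.example.com, a default target of mailhost-old.example.com and an override sending customer2 to mailhost-new.example.com (both target names are made up) yields two records that differ only in their target. Moving customer2 then requires no change on the customer's side.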
Even for simple things like file and print services, unique names should exist for each service even if provided by a single system. You never know when that one system will no longer be enough for the workload. Making changes to lots of user systems is wasted effort. Make your life easier.
Many people have been stuck with huge install bases of systems with actual machine names in their configs, and so have had to either undergo the pain of changing all of these systems or keep supporting that hostname as a valid identifier for the service.
Case-Study: News-Server
At HEAnet we upgraded our USENET service during 2002. Since all peers and customers used the one hostname, news.heanet.ie, to refer to the service, it was impossible to migrate customers one by one without having them make configuration changes at their end. Since the risk of a big-bang cutover was too great, we had to have customers make these changes, but by having them switch to unique service names we could make this a one-off cost.
External peers and customers who receive news via NNTP point at name.feeds.news.heanet.ie, which is a CNAME to the machine that provides the service. Likewise, those using NNRP-based services point at name.reader.news.heanet.ie.
This means that any changes in service infrastructure
can be made in a seamless fashion.
Managing Rollouts
Rollouts, migrations and service upgrades should be managed and should affect service as little as possible. For new services, the rollout is relatively simple: an installation and configuration followed by incremental periods of testing and feedback. For migrating existing services, the matter can be more complicated.
During our period of re-organisation we had quite a few service migrations, some of which are still ongoing. The central element of all of our migrations has been dual provision of service. The first step in any migration should be getting the new server or implementation working. Testing it as much as you can afford to at this point is a good idea.
Once the new target system had been tested, our next step was generally to move customers over one by one, testing each time. At this stage, service for that customer was provided by both the original server and the target server. We then allowed customers to test their service on the target server and, when they were satisfied, changed the service names to reflect the new location. After a period of time, service was disabled on the original server.
At no stage was it impossible to quickly and easily roll back to a known, working service implementation.
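The per-customer migration described above can be pictured as a small sequence of stages, each of which can be stepped back. The sketch below is one possible model, not a record of tooling we used: the stage names and the customer_ok flag are assumptions introduced for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    ORIGINAL_ONLY = 0      # customer served only by the original server
    DUAL = 1               # both servers serve; name still points at original
    SWITCHED = 2           # name points at target; original still running
    ORIGINAL_DISABLED = 3  # service on the original server turned off


def advance(stage, customer_ok):
    """Move one stage forward, but only after testing has passed."""
    if not customer_ok or stage == Stage.ORIGINAL_DISABLED:
        return stage
    return Stage(stage + 1)


def roll_back(stage):
    """Step back to the previous known, working stage."""
    return Stage(max(stage - 1, 0))
```

Tracking one such stage per customer makes it obvious who can still be rolled back by a name change alone (DUAL and SWITCHED) and who would need service re-enabled on the original server first.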
Case-Study: E-mail
A good example of this process was our own E-mail requirements. As the size of our staff and our E-mail requirements grew, it became necessary to upgrade our mail server.
After installing a new PowerEdge server for the task, we followed our own guidelines and consulted everyone on what was needed. We settled on Exim as the MTA, Courier for POP and IMAP, and SquirrelMail for webmail. We also centralised our virtual domain mail handling and dedicated an IP address to virtual domain MX and v-pop.
To manage the migration, we first got our new server into an acceptable, working state, and tested it until we were satisfied that it did what we needed. We then set up its MTA to simply relay all mail to the old server and went about transferring DNS records. During this time, all mail arriving at either the new mail server or the old one was delivered to the same place.
When we were ready for the cut-over, we implemented the opposite: we configured our new mail server to accept mail for delivery, and our old one to relay to the new one. Again, all incoming mail came to the same place. This ensured that mail was handled consistently across the migration. We transferred the old mailstores once we were sure that the new server was functioning acceptably in production. Again, at no stage was it impossible to quickly and painlessly roll back to a known, working server.
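The relay arrangement in the two phases above can be modelled in a few lines, which makes the key property easy to see: whichever server a message arrives at, it ends up on exactly one delivering server. This is a toy model only, with phase and server labels invented for the illustration. It is not Exim configuration.

```python
# Which server performs final delivery in each phase (labels are hypothetical).
DELIVERING_SERVER = {
    "before_cutover": "old",  # new server relays everything to the old one
    "after_cutover": "new",   # old server relays everything to the new one
}


def final_destination(phase, arrived_at):
    """Return (delivering server, relay hops) for a message.

    arrived_at is the server that DNS happened to send the message to,
    which may be either one while records are being transferred.
    """
    if arrived_at not in ("old", "new"):
        raise ValueError(f"unknown server: {arrived_at!r}")
    delivering = DELIVERING_SERVER[phase]
    hops = 0 if arrived_at == delivering else 1
    return delivering, hops
```

Rolling back in this model is just a matter of switching the phase label back, which mirrors swapping the relay direction between the two servers.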
Icing: Staff who can Patch
Although by no means a necessity, having staff who can competently patch software can be a major bonus. In an open-source environment, the ability to customise and develop functionality can be a great source of service development and improvement.
By writing a new lookup method patch for Exim, we were able to handle a massive Distributed Denial of Service attack relatively easily, without it becoming service-affecting. During the initial period of general release of Apache 2.0 we worked to improve its suEXEC, CGI and IPv6 functionality until it became production capable. All patches were submitted back to the open-source community.
Custom patch-writing can be a source of over-dependency on staff, though. There's no point having a great patch writer if the code is undocumented and unmaintained and they get hit by a bus. At HEAnet, a simple policy was implemented: custom patches could be used if they had been integrated into the upstream software projects, if they were very well documented, or in exceptional short-term circumstances (such as a DDoS).
This policy allowed us to reap the benefits of custom development with the associated plus of long-term maintainership.
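A policy this simple is easy to encode as a check, for example in a deployment checklist script. The sketch below assumes a hypothetical Patch record with one flag per condition in the policy. The field and function names are ours.

```python
from dataclasses import dataclass


@dataclass
class Patch:
    name: str
    merged_upstream: bool = False       # integrated into the upstream project
    well_documented: bool = False       # judged "very well documented"
    short_term_emergency: bool = False  # exceptional case, such as a DDoS


def may_deploy(patch):
    """A custom patch is acceptable if any one policy condition holds."""
    return (patch.merged_upstream
            or patch.well_documented
            or patch.short_term_emergency)
```

A patch that meets none of the three conditions is rejected, which is the case the policy exists to catch.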
