Comment: Avoiding and recovering from nasty network configuration mistakes

Network configuration changes can have unexpected consequences – from service interruptions to performance degradation, and even downtime

Ongoing changes to network and security device configuration are a necessary and unavoidable part of doing business, but they are also risky. These changes can have unexpected consequences – from service interruptions to performance degradation, and even downtime.

How can you reduce the risk associated with configuration changes? Here is a three-tier strategy:

Reduce the likelihood of configuration errors:

Monitor and review changes
Establish change procedures and processes
Establish a test plan for all changes

Detect problems as early as possible:

Monitor the environment
Listen to your users

Ensure that you can make a fast recovery if something goes wrong:

Maintain accessible, actionable audit information
Prepare for rapid recovery

Finally, solutions that automate error-prone, repetitive tasks and maintain vigilance 24 hours a day can go a long way toward preventing, and recovering from, human configuration errors.

Monitor and review changes

Even if they look simple, all configuration changes should be monitored and reviewed. For example, suppose you're adding a host to a network group in order to provide access, and you are unaware that the same group is used in a different place to block traffic. The change would grant the access you intended and, at the same time, block the host somewhere else. Another pair of eyes will often catch something you missed.
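To make the network group example concrete, here is a minimal sketch of a pre-change check that lists every rule referencing a group before you modify it. The rule base, the group names and the field names are all hypothetical and do not follow any particular vendor's format.

```python
from collections import defaultdict

# Hypothetical rule base: each rule names one source group, one
# destination group and an action. Real rule bases are richer than this.
RULES = [
    {"id": 10, "action": "allow", "source": "web-servers", "destination": "db-servers"},
    {"id": 20, "action": "deny", "source": "web-servers", "destination": "backup-net"},
    {"id": 30, "action": "allow", "source": "admins", "destination": "web-servers"},
]


def rules_using_group(rules, group):
    """Return every rule that references the group as source or destination."""
    return [r for r in rules if group in (r["source"], r["destination"])]


def review_group_change(rules, group):
    """Summarise, by action, which rule ids a change to `group` would touch."""
    by_action = defaultdict(list)
    for rule in rules_using_group(rules, group):
        by_action[rule["action"]].append(rule["id"])
    return dict(by_action)


impact = review_group_change(RULES, "web-servers")
print(impact)  # {'allow': [10, 30], 'deny': [20]}
```

Any entry under "deny" is a signal that the group is doing double duty and that the change deserves a second reviewer.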

Establish change procedures and processes

Change requests must be consistently communicated so that the right people can review them and assess their impact. Many problems can be avoided simply with good communication. Some organizations schedule weekly change review meetings to understand and plan complex changes. But the most effective way to ensure that changes are reviewed and approved is by enforcing a change process workflow.
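One way to picture an enforced change process workflow is as a small state machine that refuses to skip steps. The sketch below is illustrative only: the state names and the ChangeRequest class are assumptions, not the workflow of any specific ticketing product.

```python
# Allowed transitions in a hypothetical change-request workflow.
TRANSITIONS = {
    "requested": {"reviewed", "rejected"},
    "reviewed": {"approved", "rejected"},
    "approved": {"implemented"},
    "implemented": {"verified"},
}


class ChangeRequest:
    """A change request that can only move along the allowed transitions."""

    def __init__(self, summary):
        self.summary = summary
        self.state = "requested"
        self.history = ["requested"]

    def advance(self, new_state):
        """Move to new_state, refusing any step the workflow does not allow."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)


request = ChangeRequest("Add host 192.0.2.10 to group web-servers")
request.advance("reviewed")
request.advance("approved")
request.advance("implemented")
```

A request that tries to jump straight from "requested" to "implemented" raises an error, which is the point: the review and approval steps cannot be bypassed.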

Establish a test plan for all changes

It may sound surprising, but many changes are not tested until hours or days after implementation, and some are never tested at all. A test plan for every change is a critical part of the change process. Sometimes this isn’t as easy as it sounds and involves coordinating end users, business partners, and professional testers. The work you put in here will give your team a reputation for doing things right.
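For simple connectivity changes, part of a test plan can be written down as data and run immediately after implementation. The following is a rough sketch under stated assumptions: the hosts, ports and expectations are made up, and a plain TCP connection attempt stands in for whatever testing your change really needs.

```python
import socket

# Hypothetical test plan: (host, port, should_connect).
# The addresses come from a documentation range and are placeholders.
TEST_PLAN = [
    ("192.0.2.10", 443, True),   # new web server must be reachable
    ("192.0.2.10", 22, False),   # SSH to it must stay blocked
]


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_test_plan(plan, probe=tcp_probe):
    """Return the plan entries whose observed result differs from the expectation."""
    failures = []
    for host, port, expected in plan:
        if probe(host, port) != expected:
            failures.append((host, port, expected))
    return failures
```

Calling run_test_plan(TEST_PLAN) from a host on the relevant network returns an empty list when every expectation holds. The probe is a parameter so that the plan can be rehearsed against a stub before the change window.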

Monitor the environment

The firewall environment should be continuously monitored, and abnormal behavior should automatically trigger alerts. The firewall environment might include the operating system, the network interfaces, the firewall software, the firewall hardware, and the firewall rule base. Alerts from these sources should be analyzed and correlated and, if necessary, escalated for a closer look.
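What counts as "abnormal" has to be defined before an alert can fire. A minimal sketch, assuming you already collect a numeric metric such as CPU load, is to compare each new sample with recent history. The three-standard-deviation threshold and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_abnormal(history, sample, threshold=3.0):
    """Flag a sample more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to say what normal looks like
    sigma = stdev(history)
    if sigma == 0:
        return sample != history[0]  # any deviation from a flat baseline
    return abs(sample - mean(history)) / sigma > threshold


# Hypothetical CPU load percentages sampled from a firewall.
cpu_history = [20, 22, 21, 19, 23]
print(is_abnormal(cpu_history, 22))  # False
print(is_abnormal(cpu_history, 95))  # True
```

The same check can be applied to interface error counters or connection-table size, each with its own history.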

Listen to your users

A helpdesk should be in place so that users can easily report problems. The helpdesk should be staffed by trained personnel and have clear processes for handling incidents. Have a plan for correlating multiple incidents to a single problem. Each team should have tools to assist root cause analysis before escalation to the next level.
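Correlating multiple incidents to a single problem can start very simply, for example by grouping reports about the same service that arrive close together in time. The sketch below assumes each incident carries a service name and a report time; the field names, the sample tickets and the 30-minute window are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(incidents, window=timedelta(minutes=30)):
    """Group incidents on the same service reported within `window` of the previous one."""
    clusters_by_service = defaultdict(list)
    for inc in sorted(incidents, key=lambda i: i["reported"]):
        clusters = clusters_by_service[inc["service"]]
        if clusters and inc["reported"] - clusters[-1][-1]["reported"] <= window:
            clusters[-1].append(inc)  # same problem as the previous report
        else:
            clusters.append([inc])    # start a new candidate problem
    return [c for clusters in clusters_by_service.values() for c in clusters]


# Hypothetical helpdesk tickets.
INCIDENTS = [
    {"id": 1, "service": "vpn", "reported": datetime(2024, 3, 5, 14, 0)},
    {"id": 2, "service": "vpn", "reported": datetime(2024, 3, 5, 14, 10)},
    {"id": 3, "service": "mail", "reported": datetime(2024, 3, 5, 14, 5)},
    {"id": 4, "service": "vpn", "reported": datetime(2024, 3, 5, 18, 0)},
]

problems = correlate(INCIDENTS)
```

Here tickets 1 and 2 end up in one cluster, while tickets 3 and 4 each form their own.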

Maintain accessible, actionable audit information

Each and every change must be properly documented and recorded in an audit trail. A comprehensive audit trail should include the target device, the exact time of the change, the configuration details, the people who were involved (requestor, approvers, implementer), and the change context, such as the project or application.

A detailed audit trail, however, is not enough on its own. The information must also be presented in an easy-to-read format so that you can readily access it when needed. Additionally, you'll want to have filtering and querying capabilities on top of the data to speed up searches and lookups.
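As an illustration of what filtering and querying over an audit trail might look like, the sketch below stores the fields listed above in a small record type and filters on any combination of them. The class, the field names and the sample entries are assumptions, not the schema of any real product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditRecord:
    device: str        # target device
    timestamp: datetime  # exact time of the change
    details: str       # configuration details
    requestor: str
    approvers: tuple   # names of everyone who approved the change
    implementer: str
    context: str       # project or application


def query(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Hypothetical audit trail entries.
TRAIL = [
    AuditRecord("fw-edge-1", datetime(2024, 3, 5, 13, 40), "added rule 42",
                "alice", ("bob",), "carol", "crm-rollout"),
    AuditRecord("fw-core-2", datetime(2024, 3, 5, 9, 0), "edited group web-servers",
                "dave", ("bob", "erin"), "carol", "web-refresh"),
]

for record in query(TRAIL, implementer="carol", device="fw-edge-1"):
    print(record.timestamp.isoformat(), record.device, record.details)
```

In practice the trail would live in a database or log store, but the idea is the same: every field in the record should be usable as a filter.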

Prepare for rapid recovery

Now comes the incident. Despite everything, something bad has happened and you need to respond. You will be judged by the time it takes to recover, so you want to be well-prepared with tools, staff and processes to handle this event. You want to keep stress down to a minimum.

If you have set up the aforementioned procedures, then you are already in pretty good shape. Either you caught the problem during the change process or, if it was missed, you can discover it early, before users and services are affected. Thanks to the audit trail, you know exactly what changes have been made lately, by whom, and why. Experts agree that most recovery time is spent figuring out what changed; so if you already know, recovery times will be much shorter. Run some quick queries to pinpoint likely suspects and you can quickly roll back the changes.
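A "quick query" for likely suspects can be as simple as listing the changes made shortly before the incident, newest first, optionally narrowed to the devices involved. This is a minimal sketch; the change records, the device names and the 24-hour lookback are hypothetical.

```python
from datetime import datetime, timedelta


def likely_suspects(changes, incident_time, lookback=timedelta(hours=24), devices=None):
    """Return changes made within `lookback` before the incident, newest first."""
    suspects = [
        c for c in changes
        if incident_time - lookback <= c["time"] <= incident_time
        and (devices is None or c["device"] in devices)
    ]
    return sorted(suspects, key=lambda c: c["time"], reverse=True)


# Hypothetical change records taken from an audit trail.
CHANGES = [
    {"id": 101, "device": "fw-edge-1", "time": datetime(2024, 3, 1, 10, 0)},
    {"id": 102, "device": "fw-core-2", "time": datetime(2024, 3, 5, 9, 0)},
    {"id": 103, "device": "fw-edge-1", "time": datetime(2024, 3, 5, 13, 40)},
]

incident = datetime(2024, 3, 5, 14, 0)
print([c["id"] for c in likely_suspects(CHANGES, incident)])  # [103, 102]
```

The most recent change is listed first because it is usually the first candidate for rollback.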

There are a number of tools on the market that can help control changes, detect problems, and recover from errors – and these can make your life a whole lot easier. These tools provide:

A complete audit trail with full accountability and integration with ticketing systems
Comprehensive change reports and side-by-side diffs for rule bases, objects and textual configurations
Real-time change notifications with filtering (by change type, device, affected networks)
Central console for viewing all recent changes across all devices, regardless of vendor and model
A policy analysis tool for determining which firewalls and rules are blocking services across an environment
Rule and object change history reports
Business process automation to manage the change process and integration with existing ticketing systems
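To give a feel for what a diff of textual configurations offers, here is a small sketch using Python's standard difflib module. It produces a unified diff rather than a side-by-side view, and the configuration lines are invented for illustration; they are not the syntax or output of any particular product.

```python
import difflib


def config_diff(before, after, name="device.conf"):
    """Return a unified diff between two versions of a textual configuration."""
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
        lineterm="",
    ))


# Hypothetical configuration snapshots taken before and after a change.
BEFORE = """permit tcp any host 192.0.2.10 eq 80
deny ip any any"""

AFTER = """permit tcp any host 192.0.2.10 eq 80
permit tcp any host 192.0.2.10 eq 443
deny ip any any"""

for line in config_diff(BEFORE, AFTER):
    print(line)
```

Lines prefixed with "+" were added and lines prefixed with "-" were removed, so a reviewer sees only what changed instead of rereading the whole configuration.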

You can recover from configuration mistakes – it’s a case of putting in the right rules and procedures and combining these with the right tools.


Reuven Harrison is CTO and co-founder of Tufin Technologies, where he leads the company’s development staff and manages all product architecture while ensuring seamless integration with all leading firewall vendors. Harrison has more than 20 years of software development experience, holding two key senior developer positions at Check Point Software, as well as other key positions at Capsule Technologies and ECS. He received a bachelor’s degree in mathematics and philosophy from Tel Aviv University.