One Little Checkbox
In technology, experience and persistence are often what separate the good troubleshooters from the not-so-good ones. Every once in a while, you come across an issue that comes down to one little checkbox, and finding that checkbox can take a long time.
There is a disconnect between those who manage technology and those who use it. The two groups see issues, and the responses to them, from completely different perspectives. As the saying goes, sometimes the IT manager has to kiss a lot of frogs before they find their prince. I’m hoping to shed some light on what that process entails, and provide a different perspective to those who wonder why it took so long to fix an issue.
IT issues take many different forms. Most issues involve one or more of the following:
- Software problems, bugs that require patching.
- Hardware problems, equipment that has malfunctioned.
- Configuration issues, a setting that needs to be changed.
Determining what is causing a specific issue, and how best to address it, can be challenging to say the least, especially if the issue falls into more than one category.
A practical example of this appeared recently. Out of the blue, users were intermittently unable to access the Internet. As any experienced IT professional can tell you, intermittent problems are the worst. Why? Because when there are multiple potential causes, you don’t want to start pulling various levers to try to fix the problem, only to find they either (A) don’t fix the issue or (B) cause issues of their own.
There are many devices between a user and the Internet in an enterprise network. A quick list might look like this: wireless AP, multiple switches, one or more security appliances, routers and an ISP. Some quick troubleshooting and a liberal application of the process of elimination should narrow things down fairly quickly (which still isn’t fast enough when your users can’t work). In our example, a thorough review showed that the wireless and switch infrastructure seemed to be operating properly. The routers and ISP seemed to be working as well. So that left the security appliances.
Sometimes as an IT support person, you wish that when something fails it fails hard. In this case, redundant systems were in place to ensure that in the event of a device failure, access to the Internet would be automatically re-routed. Had the device failed hard, it would have been easy to identify. Additionally, because the issue was intermittent, the majority of the time things were working. When things work most of the time, it can be hard to justify taking them down completely to perform a fix.
Once you’ve narrowed down the possibilities, the fun part begins. What is it about one of these devices that has changed? Was there an update applied, did someone change the configuration, did usage patterns change, is it just buggy? Reviewing often cryptic logs, talking with the people who have access, and reconstructing recent events start to build a picture of what might be the culprit. Even so, that thorough analysis may not point to a specific event. In our example, it didn’t. Nothing in particular stood out.
So, it was time to apply the process of elimination. One by one, each device was tested to see where the problem manifested. And finally, the device causing the issue was identified.
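For readers who think in code, the device-by-device test can be sketched roughly as follows. This is only an illustration, not the procedure or tooling used in this incident: the function name `first_failing_hop`, the `probe` callback, the device labels, and the failure-rate threshold are all hypothetical. Because the problem is intermittent, each device is probed many times and judged on its failure rate rather than on a single result.

```python
from typing import Callable, Optional, Sequence


def first_failing_hop(
    hops: Sequence[str],
    probe: Callable[[str], bool],
    attempts: int = 20,
    max_failure_rate: float = 0.05,
) -> Optional[str]:
    """Return the first hop whose failure rate exceeds max_failure_rate.

    `probe` is a caller-supplied (hypothetical) check that returns True
    when traffic through the given hop succeeds. Returns None if every
    hop stays under the threshold.
    """
    for hop in hops:
        # Probe repeatedly: a single success proves nothing when the
        # fault only shows up some of the time.
        failures = sum(1 for _ in range(attempts) if not probe(hop))
        if failures / attempts > max_failure_rate:
            return hop
    return None


# Demo with a simulated probe: the security appliance drops every
# fourth request, everything else always succeeds.
path = ["wireless AP", "access switch", "core switch",
        "security appliance", "router", "ISP"]
calls = {"n": 0}


def fake_probe(hop: str) -> bool:
    calls["n"] += 1
    return not (hop == "security appliance" and calls["n"] % 4 == 0)


print(first_failing_hop(path, fake_probe))  # prints "security appliance"
```

In a real network the probe is the hard part: it might mean bypassing a device, moving a test client, or watching traffic on either side of it, and each of those carries its own risk of disruption.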
Great, so now we know which device was causing the issue, but what we don’t know is why. Why did this device suddenly start operating differently than it had historically? It had been in place, operating well, for a full year. The information provided in the logs didn’t point to a specific issue. That is, it didn’t point to a specific issue until the software running on the device was updated during the troubleshooting process. At that point, the entries in the logs changed. The update had changed the logging itself to provide more specific information. It was the difference between saying “There is an issue” and saying “This is the issue.” It did in fact point to a configuration setting. It pointed to one, little, checkbox…
Uncheck the box and the floodgates open, and the Internet is once again available for all. No one had changed the setting prior to the issue occurring, so there was nothing to suggest the setting was a problem at all. Traffic patterns had changed, a particular threshold had been crossed, and the device acted per its configuration. No polling could show that the network was nearing this threshold, so there was no warning that an issue was coming. One setting among thousands of possible settings, buried deep in an advanced configuration, was the culprit. It is odd to think about how complex this issue was when it boiled down to one little checkbox.