In 2012, then Oracle President Mark Hurd predicted that increasing IT complexity would drive more customers to the cloud. “Eliminate complexity — simplify IT” was his message. What he didn’t mention was that this simplification would create its own set of complexities.
Whether you’re outsourcing your servers and major infrastructure to the cloud or relying on one or more cloud-based third-party services, there’s a nasty by-product to our growing dependence on outside resources: the risk of performance blind spots, the outages or slowdowns that impede digital service delivery to our own end users. This creates a challenge for organizations, which must now keep track of an increasing number of external resources, many beyond their direct control.
Based on our experience, we’ve identified the most common recurring blind spots. Let’s examine where the dividing line falls between what is under our control and what is not, as well as the best ways to address each.
Where You Have Little or No Control
Migrating your servers and applications to the cloud offers many benefits, but it also means you no longer directly control much of your infrastructure. Even though your cloud service provider offers a monitoring panel, you likely won’t get advance notice if a server is about to go down. You won’t have much direct control over your SaaS applications either, aside from the ability to customize features. But you can take some steps to minimize the risk:
Migrate with pre-migration standards in mind
Know exactly what your performance (load time, availability) metrics need to be before migrating. This will help you choose the proper cloud instance type. Also, benchmark against these metrics after you’ve migrated and again before you go live, so you have an opportunity to make any necessary improvements or adjustments.
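As a rough illustration, the sketch below benchmarks a staging environment against availability and load-time targets before go-live. The URL, threshold and run count are placeholders for your own pre-migration standards, and it assumes the Python "requests" library.

```python
# A minimal benchmarking sketch. The staging URL, target and run count are
# hypothetical; substitute your own pre-migration standards.
import time
import requests

STAGING_URL = "https://staging.example.com/"   # hypothetical endpoint
MAX_LOAD_TIME_S = 2.0                          # example target, not a standard
RUNS = 20

def benchmark(url: str, runs: int = RUNS) -> None:
    """Measure simple availability and response time against pre-set targets."""
    timings, failures = [], 0
    for _ in range(runs):
        start = time.perf_counter()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            timings.append(time.perf_counter() - start)
        except requests.RequestException:
            failures += 1
    availability = 100.0 * (runs - failures) / runs
    avg = sum(timings) / len(timings) if timings else float("inf")
    print(f"availability: {availability:.1f}%  avg response: {avg:.2f}s")
    if avg > MAX_LOAD_TIME_S:
        print("Average response time exceeds the pre-migration target.")

if __name__ == "__main__":
    benchmark(STAGING_URL)
```

Run the same script against your current environment, the migrated environment and the pre-launch configuration so the numbers are directly comparable.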
Once you’re live, monitor
Nodes placed on major cloud service providers’ infrastructures can measure a platform’s response times through synthetic monitoring. This is a technique in which scripted “dummy” traffic is generated against an infrastructure to gauge its availability and response times, giving a picture of what real end users are likely to experience. Organizations also have the option of placing dedicated nodes within their own public cloud instances (known as private enterprise nodes).
However, while these measurements taken directly from the cloud provide some visibility into infrastructure health, they only tell part of the story: what end users ultimately experience depends on many other geography-specific variables standing between them and the cloud datacenter (CDNs, regional and local ISPs, end users’ browsers and more). This type of monitoring offers some reassurance, but it cannot serve as the ultimate litmus test of what end users around the world are experiencing.
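Conceptually, a synthetic probe is nothing more than scripted traffic on a schedule. The toy loop below (standard library only; the target URL and interval are placeholders) shows the idea; commercial tools simply run checks like this from many distributed nodes, including nodes inside your cloud instances.

```python
# A toy stand-in for a synthetic monitoring probe. The URL and interval are
# placeholders; real products run equivalent checks from distributed nodes.
import time
import urllib.error
import urllib.request

TARGET = "https://www.example.com/"   # hypothetical page to probe
INTERVAL_S = 60                       # probe frequency, adjust as needed

def probe_once(url: str) -> None:
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            elapsed = time.perf_counter() - start
            print(f"{time.ctime()}  status={resp.status}  time={elapsed:.2f}s")
    except urllib.error.URLError as exc:
        print(f"{time.ctime()}  DOWN ({exc.reason})")

if __name__ == "__main__":
    while True:
        probe_once(TARGET)
        time.sleep(INTERVAL_S)
```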
Build in redundancies
More organizations started heeding this advice after the Amazon S3 outage of February 2017. Build in backups for all your critical cloud-based services: while you can’t control a cloud service provider or third-party outage, you can prepare for it.
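As one illustration of this kind of redundancy, the sketch below copies critical objects into a backup S3 bucket in a second region. It assumes the boto3 SDK, and the bucket names, region and keys are hypothetical; in production you would more likely configure S3 Cross-Region Replication rules than run ad hoc copies.

```python
# A minimal redundancy sketch: copy critical objects to a backup bucket in a
# second region. Assumes boto3; bucket names, region and keys are hypothetical.
import boto3

PRIMARY_BUCKET = "my-critical-data"          # hypothetical source bucket
BACKUP_BUCKET = "my-critical-data-backup"    # hypothetical bucket in another region
BACKUP_REGION = "us-west-2"                  # example region

s3 = boto3.client("s3", region_name=BACKUP_REGION)

def backup_object(key: str) -> None:
    """Copy a single object from the primary bucket into the backup bucket."""
    s3.copy_object(
        Bucket=BACKUP_BUCKET,
        Key=key,
        CopySource={"Bucket": PRIMARY_BUCKET, "Key": key},
    )

if __name__ == "__main__":
    for key in ("config/app.json", "exports/orders.csv"):   # example keys
        backup_object(key)
        print(f"backed up {key}")
```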
Have a clear outage plan in place
When an outage occurs, who’s responsible for communicating with the cloud service provider or the cloud-based third party? Which team members need to talk internally? How will your end users (customers) be alerted? Will they quickly see a “we’re working on it” notice that conveys confidence, or will they be left in the dark? Plan these details in advance and you’ll minimize emergencies and maintain goodwill with end users.
Be discerning about third-party site elements
Many third-party site elements, such as social media tags and video feeds, are themselves cloud-based. While they can enrich the end-user experience, each one also introduces risk. Make sure any third-party element you deploy is absolutely required, aligns with your business needs and can back its performance commitments with an SLA or other assurances.
Minimize tags, especially during peak traffic periods
When US-based news organizations were grappling with GDPR privacy compliance in the spring of 2018, some were compelled to strip many marketing, advertising, conversion, analytics and other tags from their EU sites. The result? Those sites loaded much faster than the US versions. Inadvertently, they were following the lead of many ecommerce sites, which trim non-essential tags ahead of Black Friday traffic. When you eliminate tags you don’t absolutely need, you reduce risk and speed load times.
Use a tag manager
Tag managers enable you to implement and control tags from a single interface. We recommend a reputable tag manager, as it will help you quickly locate and eliminate poorly performing tags before they slow overall page load times and degrade the end-user experience.
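If you want a quick sense of how many third-party tags a page carries, and how slow they are, a rough audit script can help. The sketch below (hypothetical page URL; assumes the Python "requests" library) lists the external scripts on a page and times each download; a tag manager then gives you the controls to act on what you find.

```python
# A rough tag audit: list third-party <script> tags on a page and time each
# download. The page URL is hypothetical; results are indicative only.
import time
from html.parser import HTMLParser
from urllib.parse import urlparse
import requests

PAGE_URL = "https://www.example.com/"   # hypothetical page to audit

class ScriptCollector(HTMLParser):
    """Collect absolute script URLs from the page HTML."""
    def __init__(self):
        super().__init__()
        self.sources = []
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src and src.startswith("http"):
                self.sources.append(src)

page = requests.get(PAGE_URL, timeout=10)
collector = ScriptCollector()
collector.feed(page.text)

site_host = urlparse(PAGE_URL).netloc
for src in collector.sources:
    if urlparse(src).netloc == site_host:
        continue                        # first-party script, skip
    start = time.perf_counter()
    try:
        requests.get(src, timeout=10)
        print(f"{src}  {time.perf_counter() - start:.2f}s")
    except requests.RequestException:
        print(f"{src}  FAILED")
```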
Where You Have Some Control
This category of potential blind spots involves elements found within applications and services residing in private clouds or your own datacenter. These do provide some level of control, but still bear close watching. Here are the main ones we’ve found tripping up our customers:
API
APIs are the “glue” that holds modern application components together. Monitoring them helps you uncover slow or broken calls, whether the API is internal or external (third-party). When an online shopper places an order on an ecommerce site, for example, the payment gateway uses APIs to verify the shopper’s credit card data. If the API that integrates the payment options is broken, the result is not only an abandoned cart but also a frustrated shopper and a negative user experience.
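A minimal synthetic API check might look like the sketch below. The endpoint and latency threshold are hypothetical, and the same pattern applies to internal and third-party APIs alike.

```python
# A minimal synthetic API check. Endpoint and threshold are hypothetical.
import time
import requests

API_URL = "https://payments.example.com/v1/health"   # hypothetical endpoint
MAX_LATENCY_S = 1.0                                   # example latency target

def check_api(url: str) -> bool:
    """Return True if the API responds successfully within the latency target."""
    start = time.perf_counter()
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException as exc:
        print(f"API unreachable: {exc}")
        return False
    latency = time.perf_counter() - start
    if resp.status_code != 200:
        print(f"API broken: HTTP {resp.status_code}")
        return False
    if latency > MAX_LATENCY_S:
        print(f"API slow: {latency:.2f}s (target {MAX_LATENCY_S}s)")
        return False
    print(f"API healthy: {latency:.2f}s")
    return True

if __name__ == "__main__":
    check_api(API_URL)
```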
DNS
DNS acts as the “phonebook” of the internet, routing site visitors to their chosen URLs. It is the beginning of the web journey and the first critical point in the process: everything else can be working perfectly, but when DNS fails, no one can reach your site or services. The major Dyn DNS outage of October 2016 was a wake-up call for many to monitor DNS and to build in DNS redundancies. New advances in end-user monitoring enable site managers to see, for the first time, when real end users attempting to access a site cannot, due to region-specific issues such as DNS failures.
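A simple way to watch DNS from code is to time resolution against more than one public resolver, which can surface provider- or region-specific failures. The sketch below assumes the third-party dnspython package (2.x API); the domain and resolver list are examples only.

```python
# A small DNS check using dnspython (pip install dnspython). The domain and
# resolver list are examples; timing several resolvers helps spot
# provider-specific DNS failures.
import time
import dns.exception
import dns.resolver

DOMAIN = "www.example.com"                               # hypothetical domain
RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1"}

for name, ip in RESOLVERS.items():
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [ip]       # query this resolver only
    start = time.perf_counter()
    try:
        answer = resolver.resolve(DOMAIN, "A", lifetime=5)
        elapsed = (time.perf_counter() - start) * 1000
        addresses = ", ".join(record.address for record in answer)
        print(f"{name}: {elapsed:.0f} ms -> {addresses}")
    except dns.exception.DNSException as exc:
        print(f"{name}: DNS lookup failed ({exc})")
```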
MQTT
This lightweight machine-to-machine messaging protocol underpins much of the Internet of Things. If you wrote your own IoT application, you control its use of MQTT, including the quality of service (QoS) level the application or device needs to meet its functional requirements. By monitoring MQTT you can spot disruptions between your devices or those of your users, and pinpointing MQTT issues will help your team improve mean time to resolution (MTTR), preferably before end users are affected.
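As a sketch of what MQTT monitoring can look like, the snippet below subscribes to a heartbeat topic and reports the broker connection result. It assumes the paho-mqtt client (2.x callback API), and the broker host, topic and QoS level are placeholders.

```python
# A basic broker-connectivity watchdog using paho-mqtt (2.x API assumed).
# The broker host, topic and QoS level are placeholders.
import paho.mqtt.client as mqtt

BROKER = "broker.example.com"     # hypothetical broker
TOPIC = "devices/+/heartbeat"     # hypothetical heartbeat topic
QOS = 1                           # at-least-once delivery for this check

def on_connect(client, userdata, flags, reason_code, properties):
    if reason_code.is_failure:
        print(f"connection refused: {reason_code}")
    else:
        print("connected; subscribing to heartbeats")
        client.subscribe(TOPIC, qos=QOS)

def on_message(client, userdata, msg):
    print(f"heartbeat from {msg.topic}: {msg.payload!r}")

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER, 1883, keepalive=60)
client.loop_forever()   # blocks; run as a long-lived watchdog process
```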
SMTP
Email delivery and receipt are vital to any organization. Monitoring your SMTP server helps you ensure email application availability, quickly detect outages or protocol failures, and determine whether a problem stems from a connection failure or from a failed SSL/TLS negotiation.
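A basic SMTP availability check is easy to script with the standard library alone. The sketch below (hypothetical mail host) distinguishes a plain connection failure from a failed STARTTLS negotiation, mirroring the distinction above.

```python
# A simple SMTP availability check using the standard library. The mail host
# is hypothetical; port 587 is the common submission port with STARTTLS.
import smtplib
import socket

MAIL_HOST = "mail.example.com"   # hypothetical SMTP server
MAIL_PORT = 587

def check_smtp(host: str, port: int) -> None:
    try:
        server = smtplib.SMTP(host, port, timeout=10)
    except (socket.error, smtplib.SMTPException) as exc:
        print(f"connection failed: {exc}")
        return
    try:
        server.ehlo()
        if server.has_extn("starttls"):
            server.starttls()
            server.ehlo()
            print("SMTP reachable, STARTTLS negotiated")
        else:
            print("SMTP reachable, but STARTTLS not offered")
    except smtplib.SMTPException as exc:
        print(f"protocol failure: {exc}")
    finally:
        try:
            server.quit()
        except smtplib.SMTPException:
            pass

if __name__ == "__main__":
    check_smtp(MAIL_HOST, MAIL_PORT)
```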
For each of these more “controllable” areas, synthetic monitoring can be the key to catching potential problems before real end users are affected. Make sure you’re monitoring from geographic locations close to your customers, as blind spots are often regional in nature. Finally, as with the less controllable elements, institute solid backup, contingency and communication plans for multiple scenarios.
In a world of growing interdependence, it would be easy to adopt an “it’s not my fault” attitude, but the clear takeaway here is to exercise as much control as possible as your IT systems grow. Simplicity in IT is gone forever, but with some planning you can make your life much easier and keep your end users happy.