Mastering the complexities of managing IoT-connected device fleets at large scale
By Thomas Ryd, Mender CEO, and Farshad Tavakoli, Mender head of technical product marketing
With massive digitization, IoT device management is becoming a key strategic issue at board level. Previously, IoT projects were predominantly small, siloed efforts run as innovation projects on the periphery of the enterprise, where shortcuts may have been taken to demonstrate fast results. IoT device management is now taking center stage, fusing with security and robustness concerns. It is beginning to interact with mission-critical infrastructure and business innovations such as machine learning models on edge devices for equipment maintenance and production-process monitoring.
Furthermore, by 2025 there will be 55.7 billion IoT-connected devices worldwide, and 75% will be connected to an IoT infrastructure. With this projection, enterprises have to plan for a flexible yet scalable IoT infrastructure.
And yet research from Microsoft found that too few enterprises are properly prepared to tackle scalability. The report found that when organizations run an IoT proof-of-concept, they typically do not take the full-scale scenario into account at this stage. As a result, they fail to deliver on robustness and security when they move from hundreds of devices to thousands, hundreds of thousands or even millions of devices.
Scalability must be tested and planned for at the inception stage of the project.
Use cases driving the business
IoT device management at scale is essential to get right. It will enable key use cases for the business. At a basic level, device management includes monitoring the status of production machines remotely, and supporting maintenance of those machines with a data-driven approach. At a more advanced level, this could include predictive maintenance where the condition of the machine is analyzed so as to predict when a component might fail, and the optimization of machine processes to help reduce waste and energy consumption, and to increase the quality of production. The real data flow also facilitates the digital twin, where subject-matter experts can access the raw data and build better-modeled equipment that paves the way for improved automation based on machine learning and streaming engines.
Challenges in IoT device management scalability
There are many complex challenges in IoT device-management scalability. Getting the diverse stakeholders aligned on priorities at the very beginning of the IoT device management project is the hardest challenge and yet the most important. We at Mender believe the technical challenges of scaling IoT device management are easier to solve than the people and process challenges. It is easy to be irrational in highly bureaucratic organizations and, as a result, planning can be very poor. Very basic questions, such as the need for over-the-air software updates for the devices, often come up very late in the IoT device life cycle.
Unified device fleet management
Another key challenge is accessing and managing a large, distributed and heterogeneous fleet of devices. An organization should have a standard means to manage these different devices with automation and version-controlled code. Manual administration increases the likelihood of human error and creates security and operational risks. The device-management infrastructure should also be designed to be flexible and extensible. This means that when the next generation of hardware, software and new products is developed, the device-management infrastructure should evolve to support them too.
Often, if the device-management infrastructure is developed in house, adapting it to new products is non-trivial. Many organizations fall into the trap of building yet another homegrown device management solution, which will need to be developed and maintained. Larger companies typically end up with more than a handful of homegrown solutions, leading to a dispersed and diverging fleet management situation.
Latency
When managing a growing IoT device fleet, timely over-the-air (OTA) software updates are needed to keep the devices operating optimally and securely. This poses a technical challenge: low latency is essential for timely updates, which means keeping the same response time for each device as the number of requests increases. Response time is determined by the quality of the server application design, the data model in the database and the ability to handle parallel requests trying to access and update the records. Both the server and the client design must treat scalability as a key criterion for success.
Robustness
During a software update, IoT devices in the field can fail due to power or connectivity loss. A device left unusable in this way is called a bricked device, and once a device is bricked, it must be recalled from the field to be reset. With 100 devices, the chances of something malfunctioning might be relatively low, and manual surveillance feasible. However, with 100,000 devices in the field, one or more updates will fail and manual controls will not be feasible. The importance of robust OTA software updates therefore increases with scale. For example, if we assume each device has a 1% probability of failing during an update, that is one expected failure in a fleet of 100 but 1,000 in a fleet of 100,000. The potential damage from such an event is therefore significantly greater at scale.
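The arithmetic behind this example can be sketched in a few lines of Python. The 1% figure is the illustrative probability used above; treating device failures as independent of one another is an added assumption, made only to keep the sketch simple.

```python
def expected_failures(fleet_size: int, failure_probability: float) -> float:
    """Expected number of devices that fail during one update campaign."""
    return fleet_size * failure_probability


def probability_of_any_failure(fleet_size: int, failure_probability: float) -> float:
    """Chance that at least one device fails, assuming independent failures."""
    return 1.0 - (1.0 - failure_probability) ** fleet_size


for fleet_size in (100, 100_000):
    print(
        fleet_size,
        expected_failures(fleet_size, 0.01),
        probability_of_any_failure(fleet_size, 0.01),
    )
```

Under these assumptions, a failure somewhere in a fleet of 100,000 is practically certain, which is why the update mechanism itself has to tolerate failures rather than rely on them being rare.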
There are mechanisms to avoid bricking at scale. One example is an A/B partition design on the device: if a software image update fails, the device reverts to the previous software version, so a failed update does not leave the device corrupted.
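The A/B idea can be illustrated with a small simulation. This is a conceptual sketch only, not how any particular update client is implemented: the `ABDevice` class, its slot names and the `health_check` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ABDevice:
    """Hypothetical device with two software slots, one active at a time."""
    active: str = "A"
    versions: Dict[str, str] = field(
        default_factory=lambda: {"A": "1.0", "B": "1.0"}
    )

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def update(self, new_version: str, health_check: Callable[[str], bool]) -> bool:
        """Install to the inactive slot, try it, and roll back on failure."""
        target = self.inactive()
        previous = self.active
        # The new image is written to the inactive slot only,
        # so the currently running software is never overwritten.
        self.versions[target] = new_version
        # Switch to the new slot tentatively.
        self.active = target
        if health_check(new_version):
            return True  # commit: the new slot stays active
        # Roll back: the previous slot is still intact.
        self.active = previous
        return False
```

For example, `ABDevice().update("2.0", lambda version: False)` returns `False` and leaves the device running version 1.0 from slot A.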
Operational complexity
IoT devices in a large, distributed fleet can be highly constrained by a number of factors. Firstly, the hardware is a commodity and most of the value is in the software-defined functionality, yet more code makes devices more vulnerable to exploitation. IoT devices are also hard to reach, and resources such as battery or wireless connectivity are likely to fail at some point. Connectivity can also be a challenge where the network is unreliable or the bandwidth simply too narrow to fulfill a task.
Considerations when planning a large-scale deployment
We advise adopting a common respect for the challenge at every level of the organization. Managing a fleet of IoT devices at scale successfully requires end-to-end coordination of different technologies and of different people in different roles with different responsibilities. A holistic overview of what needs to be done is also required, along with buy-in from the necessary stakeholders. There are multiple stakeholders involved in an IoT device management project and all need to be aligned and involved in the planning and scalability modeling:
● The product management team will view IoT and OTA software updating as integral product design components to help get their product out onto the market before they have finalized their product roadmap
● The embedded engineering team will design and configure the IoT device hardware
● The logistics team will be responsible for getting the right devices to the right places
● DevOps and DevSecOps will manage the devices after they have been installed. The security team will insist on OTA updates as software will have vulnerabilities that will need to be addressed through the device lifecycle
● Engineering QA are typically the final stakeholder to be involved in the planning as they seek OTA updating as a means to patch software bugs.
Each stakeholder should understand how the scalability of the project will affect them, and what they can contribute from their domains of expertise to make scalability achievable. Once the project is up and running, building a chain of trust with features such as role-based access control and two-factor authentication in the device management infrastructure helps ensure that only the right people get access to the right devices at the right time and make the changes they are authorized to make.
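As a rough illustration of how role-based access control and a second authentication factor combine, consider the following sketch. The role names, actions and device groups are hypothetical and are not taken from any specific product.

```python
from typing import Dict, Set

# Hypothetical roles and the actions each one may perform.
ROLE_ACTIONS: Dict[str, Set[str]] = {
    "viewer": {"read"},
    "operator": {"read", "deploy"},
    "admin": {"read", "deploy", "decommission"},
}


def is_authorized(
    role: str,
    action: str,
    allowed_groups: Set[str],
    device_group: str,
    second_factor_verified: bool,
) -> bool:
    """Allow an action only for a verified user whose role and scope permit it."""
    if not second_factor_verified:
        return False  # no access without the second factor
    if device_group not in allowed_groups:
        return False  # user is not scoped to this part of the fleet
    return action in ROLE_ACTIONS.get(role, set())
```

With these rules, an operator scoped to one device group can deploy there, but cannot decommission devices or touch devices in other groups.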
Cost of network traffic
Fast data transfer and lower bandwidth use mean that an IoT device management project at scale need not eat into the company’s bottom line. IoT devices need OTA software updates to stay healthy and to evolve, yet the cost of ensuring cellular connectivity and sending the updates to the devices can be a considerable financial challenge. The average cost of transferring a 269-megabyte full software update to a device over cellular LTE/4G/5G has been estimated at $3.33. However, mechanisms such as delta updating transfer only the difference between the installed and the new software. In this example the transfer shrinks to about 30 megabytes, for an average data cost of roughly $0.37 and savings in the region of 90%.
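The cost figures can be reproduced with a short calculation. The sizes and the full-update cost come from the estimate above; the assumption that cost scales linearly with the number of megabytes transferred is ours.

```python
FULL_UPDATE_MB = 269
FULL_UPDATE_COST_USD = 3.33
DELTA_UPDATE_MB = 30

# Assumption: data cost is proportional to the amount of data transferred.
cost_per_mb = FULL_UPDATE_COST_USD / FULL_UPDATE_MB
delta_update_cost_usd = DELTA_UPDATE_MB * cost_per_mb
savings = 1 - delta_update_cost_usd / FULL_UPDATE_COST_USD

print(f"Delta update cost per device: ${delta_update_cost_usd:.3f}")
print(f"Savings versus a full update: {savings:.0%}")
```

Multiplying the per-device difference by the fleet size and the number of releases per year gives a first estimate of what delta updates are worth for a given deployment.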
Avoiding vendor lock-in
In the server-oriented data center world, infrastructure management is a manageable challenge as the servers are in close proximity to each other. With IoT device management, the opposite is the case: the devices are distributed and built on various hardware and software platforms. Also, in these hard-to-reach environments, a loss of service or disconnection of the data stream could cost millions of dollars in damage within a short period of time. The best practice is to have a fully optimized end-to-end IoT software management infrastructure that minimizes fleet operation complexity. It should integrate with the organization’s cloud infrastructure, software and hardware, with no lock-in to any specific platform or development tools. The software management should support updating all device software, from the kernel and root file system all the way to user-level application updates with containers, packages, files and directories.
Remote management of devices is another critical component of fleet lifecycle management: provisioning, troubleshooting, configuring and monitoring devices remotely and securely. For security, remote access should be carried out through the same encrypted communication channel through which the OTA software updates are delivered.
End to end interoperability
When an IoT device fleet scales, a key strategic concern is ensuring that there is a process to get the necessary software updates to the devices in a frictionless and controlled manner. The best practice is to use APIs to integrate the software updating and device management infrastructure with the continuous integration/continuous delivery (CI/CD) system, so that new software build outputs can be automatically uploaded to a server and deployed to the intended IoT devices.
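A CI/CD job might trigger a deployment with a request along the following lines. This is a sketch under stated assumptions: the base URL, the `/deployments` path and the payload fields are hypothetical placeholders, not the API of any real device management server, so the actual endpoint and schema should be taken from the vendor's API documentation. The function only builds the request; a pipeline step would send it with `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_deployment_request(
    base_url: str, token: str, artifact_name: str, device_group: str
) -> urllib.request.Request:
    """Build (but do not send) a request asking the server to deploy a build."""
    # Hypothetical payload: which build output to deploy and to which devices.
    payload = {"artifact_name": artifact_name, "group": device_group}
    return urllib.request.Request(
        url=f"{base_url}/deployments",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_deployment_request(
    "https://devices.example.com", "ci-token", "release-1.2.3", "pilot"
)
print(request.get_method(), request.full_url)
```

Keeping this step in the pipeline means that every build that passes its tests can reach the intended devices without a manual upload.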
Achieving risk tolerance
Risk management is an essential consideration for IoT fleets at scale, and the software management system needs a high level of risk tolerance built in. The best practice is to use time-based and phased rollouts to control how software is deployed to the IoT device fleet. Even though the software may have been rigorously tested by engineering QA, performance can only truly be understood when it is released to a device in the field. To be safe, this release must first be limited to a small number of devices. This makes it possible to understand the consequences of the change across differences in time zones, latency, hardware and customer usage patterns, without negatively impacting the whole fleet. Once the test has built confidence that the change works, it can be applied to the whole fleet.
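A phased rollout can be sketched as follows. The phase sizes (1%, then 10%, then the whole fleet) and the 5% failure threshold are illustrative assumptions, and `deploy` stands in for whatever call actually updates a device and reports success.

```python
from typing import Callable, List, Sequence, Tuple


def plan_phases(
    device_ids: Sequence[str], cumulative_shares: Sequence[float]
) -> List[List[str]]:
    """Split a fleet into phases, given cumulative shares such as (0.01, 0.10, 1.0)."""
    phases = []
    start = 0
    for share in cumulative_shares:
        end = max(start, min(len(device_ids), round(len(device_ids) * share)))
        phases.append(list(device_ids[start:end]))
        start = end
    return phases


def run_rollout(
    phases: List[List[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.05,
) -> Tuple[List[str], bool]:
    """Deploy phase by phase; stop before the next phase if too many devices fail."""
    attempted: List[str] = []
    for phase in phases:
        failures = sum(1 for device_id in phase if not deploy(device_id))
        attempted.extend(phase)
        if phase and failures / len(phase) > max_failure_rate:
            return attempted, False  # halt: the rest of the fleet is untouched
    return attempted, True
```

For a fleet of 1,000 devices, `plan_phases(fleet, (0.01, 0.10, 1.0))` yields phases of 10, 90 and 900 devices, so a bad release is caught after affecting at most the first 10.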
Using automation where possible
Automation makes device management faster and more reliable by reducing the room for human error. It also increases deployment consistency and reduces update cycle times. Examples of automation in device management include dynamic groups, where devices that are being commissioned are automatically assigned to predefined groups based on certain attributes. These attributes can include hardware type and geography, and devices within these groups then receive software updates based on agreed parameters. Another example would be a scheduled deployment of a software update to certain devices, or an automatic retry if the update to the device failed in the first instance.
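Both examples can be sketched briefly. The group names, attribute keys and retry limit below are hypothetical; a real system would also wait between retries, which is omitted here for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rules: (group name, attributes a device must have to join).
# The first matching rule wins, so more specific rules come first.
GROUP_RULES: List[Tuple[str, Dict[str, str]]] = [
    ("gateways-eu", {"hardware": "gateway", "region": "eu"}),
    ("gateways", {"hardware": "gateway"}),
]


def assign_group(
    attributes: Dict[str, str],
    rules: List[Tuple[str, Dict[str, str]]] = GROUP_RULES,
    default: str = "unassigned",
) -> str:
    """Return the first group whose required attributes the device matches."""
    for name, required in rules:
        if all(attributes.get(key) == value for key, value in required.items()):
            return name
    return default


def deploy_with_retry(deploy: Callable[[], bool], max_attempts: int = 3) -> int:
    """Call deploy until it succeeds; return attempts used, or 0 if all failed."""
    for attempt in range(1, max_attempts + 1):
        if deploy():
            return attempt
    return 0
```

A newly commissioned device reporting `{"hardware": "gateway", "region": "eu"}` would land in `gateways-eu` without anyone assigning it by hand.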
Getting to grips with linear scaling
The device-management software system should respond quickly and cost effectively to requests from the devices within the fleet. Typically, IoT devices will poll the server at regular intervals to look for updates or to update their inventory attributes. The number of these requests scales linearly with the number of devices accepted into the system. IoT device fleet planners have to think about the effect of latency on device performance and response time as the fleet that needs to be managed grows. They must consider and plan for the time it will take the management server to update devices as the fleet grows beyond the single thousands and into the hundreds of thousands and then millions.
To help understand linear scaling in device management, the Mender Engineering Group has modeled the 50th percentile response time (or latency) for device API requests with an OTA software updating server performing at scale. With the recommended update poll interval of 30 minutes, every device in the system contacts the server at least once every 30 minutes. At a scale of 1 million devices in a fleet, the 50th percentile response time from the server to the device is 10 ms.
At the 95th percentile, the measured response time is around 200 ms at a scale of 750,000 to 1 million devices.
In this model, it would take the management server around 14 hours of operation to update 1 million devices in the fleet.
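To make the linear relationship concrete, the figures above can be turned into request and update rates. The 30-minute poll interval and the 14-hour figure come from the model described here; the assumption that polls are spread evenly across the interval is ours, and real traffic may be burstier.

```python
def poll_requests_per_second(devices: int, poll_interval_minutes: float) -> float:
    """Average poll rate, assuming polls are spread evenly over the interval."""
    return devices / (poll_interval_minutes * 60)


def update_throughput_per_second(devices: int, hours: float) -> float:
    """Average number of devices updated per second over a rollout."""
    return devices / (hours * 3600)


for devices in (1_000, 100_000, 1_000_000):
    rate = poll_requests_per_second(devices, 30)
    print(f"{devices:>9} devices: {rate:8.1f} poll requests per second")

throughput = update_throughput_per_second(1_000_000, 14)
print(f"1,000,000 devices in 14 hours: {throughput:.1f} devices per second")
```

At 1 million devices this works out to roughly 556 poll requests per second on average, and about 20 devices updated per second over the 14-hour rollout.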
Conclusion
The growth and proliferation of IoT devices and dependent use cases that vary from basic to advanced are necessitating that organizations take a proactive, strategic stance in planning for scalability of their IoT device fleets. The respective domain experts within the organization must align and agree on the scaling plan first, and only then can the technology plan be considered. A chain of trust from when the device is first commissioned to when it is decommissioned must be created. Good practices such as an end-to-end software-management infrastructure that is highly interoperable must be established to manage and update the devices.
Automation can be used to minimize administrative burden and human error from manual intervention with the devices. Cost of data transfer and connectivity can also be tightly controlled with clever automation features such as delta updating. Latency between the management server and the IoT devices must also be carefully considered and planned for as the device fleet grows, because poor latency could have negative impacts on device performance and ultimately on customer satisfaction with the service.
A robust and secure infrastructure for provisioning OTA software updates to the devices in the fleet is the best means to keep the devices operational, evolving, healthy and secure. This checklist provides a useful guide to setting up a software management system for IoT devices at scale.