Clearing data-quality roadblocks: Unlocking AI in manufacturing


Today, manufacturers collect information from an ever-broadening network of sources. Whether it’s time-series data from traditional physical sensors, real-time video streams, or unstructured and manually entered reports, data are at the core of day-to-day operational decision making.

That said, industry leaders looking to leverage their data to power sophisticated AI models are discovering that poor data quality is a consistent roadblock for the highest-value AI use cases. Some common data-quality issues include missing data points, broken or miscalibrated sensors, incomplete data mappings or dictionaries, incompatible systems, architectural limitations, slow access speeds, and insufficient understanding of existing sources. These shortcomings are often compounded in manufacturing, where sensors typically function in more physically taxing environments than in other industries.

Legacy system architecture and weak data governance also contribute to poor data quality. Without modernized industrial data management, an alternative ecosystem of multitab spreadsheets driven by well-intentioned subject matter experts often feeds critical processes. As a result, data at different plants are manipulated in different ways, rendering them effectively ungovernable.

Data quality needs to be incorporated as a central component of data infrastructure and governance. While modernizing technology and data infrastructure can be a significant undertaking, data quality can be addressed along the way in targeted and iterative ways that focus on the longest-standing and most important issues for the business.

Low-quality data are too often accepted as unavoidable, or as something that only large-scale system redesign, years of data collection, and significant labor and capital can correct. For these reasons, many AI use cases are shelved for years while companies wait for the problems to resolve themselves, leaving significant value on the table. This article illustrates how industrialized tools can clean and enrich existing data to remove data-quality roadblocks and unlock the full potential of AI in manufacturing.

Data-quality challenges in manufacturing

In recent years, machine learning–based modeling has changed from a barrier to entry to an open-source commodity. AI is now table stakes across many industries. Today, the availability of high-quality raw data is the chief source of competitive advantage in analytics.

Although there are many off-the-shelf tools that can detect data-quality issues, creating a list of the problems is only the first step. Fixing these problems has historically required significant effort (see sidebar “Data quality is a significant challenge across industries”). When issues are discovered, data engineers can spend weeks querying multiple systems to identify where they’re broken. And if the issues can be fixed, the remediation often requires deep dives into the IT stack. Instances in which the data were never collected in the first place can be even tougher to resolve.

Technology is not the only challenge. Data teams are often narrowly resourced and centralized away from core business units. This centralized structure limits the frequency of interaction between business units and data teams. As a result, teams face endless cycles of chasing issues instead of going after root causes and focusing on the data issues with the biggest impact on the business.

Correcting data at scale: An agile, data-centric approach

Data management has traditionally been treated as a big-build waterfall IT project rather than a series of agile quick-cycle iterations, blocking progress downstream until the data are “perfect.” Moving to an agile framework—one that many projects using data downstream may have already adopted—can help organizations tackle concrete use cases in parallel, prioritizing those that will deliver the most value. This shift to a fast-paced operating model is critical to meeting continuously evolving data priorities.

With this in mind, industry leaders are increasingly setting up agile data-quality “SWAT teams” (Exhibit 1). These teams are typically composed of three to five data scientists, engineers, and data experts, as well as aligned translators (or data owners) from the business units, who work closely with the business to identify and remediate the problems blocking the highest-value AI use cases. They are often part of a model centered around a central data-governance and architecture team that manages a set of common standards and processes. This enables the SWAT team to embed data units in each business to tailor data management while adhering to common standards.

Data-centric AI is an emerging approach that focuses on unlocking machine-learning-model performance through improved data. Data-centric AI tools can accelerate data-quality remediation with increased levels of automation, including programmatic labeling, synthetic data generation, and the enrichment of data by integrating internal and third-party sources. Although previous solutions could sometimes take months to work out, these tools enable data-quality SWAT teams to create solutions in days (see sidebar “Five innovations that are driving impact today”).


Data quality is also a core component of machine-learning operations, which is the practice that brings together data scientists, data engineers, and DevOps (software development and IT operations) to build and continuously improve production-grade machine-learning models. Automated approaches to data quality can help ensure such models are built on the right data and that data-quality blockers are resolved within the machine-learning development process instead of causing delays—for example, using programmatic approaches to label data for subject matter experts when the data are missing.

There is no magic bullet for improving data quality. Instead, the data-quality SWAT team will need tool kits equipped with reusable data-quality building blocks and components, such as entity resolution and anomaly detection, which the team can use to quickly assemble solutions (Exhibit 2). Furthermore, automated data lineage can ensure that changes to the data are tracked so they can be more easily explained and tested.
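As an illustration of one such building block, the following is a minimal sketch of entity resolution for sensor tags that different plants spell differently. The tag names, the `split_tag` and `resolve_entities` helpers, and the 0.85 similarity threshold are all hypothetical and are not taken from any specific tool kit. The sketch assumes that numeric identifiers must match exactly, so that two distinct assets are never merged because their names look alike.

```python
import re
from difflib import SequenceMatcher


def split_tag(tag: str) -> tuple[str, tuple[str, ...]]:
    """Split a raw tag into its letters and its numeric identifiers."""
    key = tag.lower()
    digits = tuple(re.findall(r"\d+", key))
    letters = re.sub(r"[^a-z]", "", key)
    return letters, digits


def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two tags as one entity if numbers match and letters are similar."""
    letters_a, digits_a = split_tag(a)
    letters_b, digits_b = split_tag(b)
    if digits_a != digits_b:  # assumption: asset numbers must agree exactly
        return False
    return SequenceMatcher(None, letters_a, letters_b).ratio() >= threshold


def resolve_entities(tags: list[str]) -> dict[str, str]:
    """Map every raw tag to a canonical tag (the first matching tag seen)."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for tag in tags:
        match = next((c for c in canonical if same_entity(tag, c)), None)
        if match is None:
            canonical.append(tag)
            match = tag
        mapping[tag] = match
    return mapping


# Hypothetical tags as they might appear in different plants' spreadsheets.
tags = ["PUMP-101_FLOW", "pump101 flow", "Pump_101.Flow",
        "PUMP-102_FLOW", "MILL-7_POWER"]
mapping = resolve_entities(tags)
```

Here the three spellings of the pump 101 flow tag resolve to one canonical name, while pump 102 and the mill tag stay separate. In practice, proposed matches would still need review by subject matter experts before being applied.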

Modern governance, systems, architecture, and processes also need to replace artisanal spreadsheet-driven processes with industrialized data management. This is required to ensure a sustained adoption of AI-based products, which could lose traction if users lose faith in the underlying data.

Case study: Clearing data-quality roadblocks for an aerospace manufacturer

Predicting equipment failure can help manufacturers mitigate costly unplanned downtime and extend the life of their assets. That said, the maintenance records needed, which elaborate on root causes or failure modes, are often buried in handwritten forms or computer log files unreadable by humans. Although manual review and interpretation can help predict when and where equipment may fail, doing so often requires specific expertise or experience and doesn’t scale well for data-hungry AI models. The following case study describes a data-centric approach that avoided these pitfalls.

An aerospace manufacturer had a problem: satellites attempting to transmit data back to Earth through a dedicated ground station were encountering frequent communication failures. These failures were especially disruptive. Critical customer data were being lost or delayed, and some satellites had only one chance per orbit (or “pass”) to transmit data. A system failure meant that the satellite would have to circle the globe before it could try again. Persistent issues could seriously disrupt a satellite while engineers manually attempted to diagnose and correct one of dozens of potential problems to get things working again.

The company needed to develop a faster approach to respond to communication failures between spacecraft and the ground station. When weather, human error, or equipment caused a failure, the engineers typically learned the cause only after manually sifting through log files and field notes. The proposed solution was to identify failures the instant they occurred via an AI-based tool that could highlight the suspected root cause and help engineers accelerate a resolution (Exhibit 3).

The company sought to use AI to detect failures but was blocked by a data-quality problem. Labeled data identifying specific time-stamped failures and their root causes, which are required to train a machine-learning model to identify these failures, did not exist. However, the company had more than enough low-quality data, including gigabytes of communications telemetry, signal power and noise measurements, hand-entered incident and operational logs, and computer-generated system logs. It also had access to third-party data sources for information such as weather and high-altitude cloud coverage.

To turn this collection of low-quality sources into a single high-quality data set, the company deployed programmatic labeling with “weak supervision.” Engineers provided a set of first-principles-derived heuristics, which were then used to label certain troublesome scenarios, such as when a particular frequency drifted out of tolerance, a power level dipped too low, or a rain cloud was likely to be right above an antenna. These heuristics were combined and deconflicted to optimally label failures and identify their root causes.
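The general idea of programmatic labeling can be sketched as follows. This is an illustrative example only, not the company's implementation: the field names, thresholds, label names, and weights are all hypothetical, and a simple weighted vote stands in for the statistical label models that weak-supervision tools typically use to combine and deconflict heuristics.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ABSTAIN = None  # a heuristic returns ABSTAIN when it has no opinion


@dataclass
class SatellitePass:
    """One transmission attempt; all fields are hypothetical."""
    freq_offset_khz: float
    signal_power_dbm: float
    cloud_cover_pct: float
    link_ok: bool


def lf_frequency_drift(p: SatellitePass) -> Optional[str]:
    return "FREQUENCY_DRIFT" if abs(p.freq_offset_khz) > 5.0 else ABSTAIN


def lf_low_power(p: SatellitePass) -> Optional[str]:
    return "LOW_POWER" if p.signal_power_dbm < -110.0 else ABSTAIN


def lf_weather(p: SatellitePass) -> Optional[str]:
    return "WEATHER" if p.cloud_cover_pct > 80.0 else ABSTAIN


def lf_nominal(p: SatellitePass) -> Optional[str]:
    return "NO_FAILURE" if p.link_ok else ABSTAIN


# Each heuristic carries an assumed trust weight supplied by engineers.
LABELING_FUNCTIONS: list[tuple[Callable[[SatellitePass], Optional[str]], float]] = [
    (lf_frequency_drift, 1.0),
    (lf_low_power, 0.6),
    (lf_weather, 0.8),
    (lf_nominal, 0.5),
]


def label_pass(p: SatellitePass) -> Optional[str]:
    """Combine heuristic votes; abstain when nothing fires or on a tie."""
    scores: dict[str, float] = {}
    for lf, weight in LABELING_FUNCTIONS:
        vote = lf(p)
        if vote is not ABSTAIN:
            scores[vote] = scores.get(vote, 0.0) + weight
    if not scores:
        return ABSTAIN
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence: leave for human review
    return ranked[0][0]


rainy = SatellitePass(freq_offset_khz=0.3, signal_power_dbm=-115.0,
                      cloud_cover_pct=90.0, link_ok=False)
label = label_pass(rainy)  # weather outweighs low power in this sketch
```

Passes that no heuristic covers, or on which heuristics tie, stay unlabeled so that engineers can review them and add or refine heuristics.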

Once the time-stamped failures and their root causes were labeled, a machine-learning model was calibrated and implemented into a suite of tools that engineers could use to identify and respond to failures more proactively.

Case study: Making data quality a priority at a mine

Modern mines have no shortage of data. The typical mine has thousands of sensors distributed throughout the site, plant operators have racks of screens showing hundreds of live measurements and trends, trucks are tracked with GPS, and routes are optimized hundreds of times per minute by complex algorithms. In other words, large capital investments are allocated and spent to ensure that mines can accurately track and report how much finished product they extract and how efficiently operations are run.

All too often, however, these data are looked at only once before they are stowed away in a process historian, where they sit until someone pulls them up to conduct an analysis—sometimes long after the specific context has been forgotten.

One mining company encountered this problem in the middle of an effort to develop a machine learning–based predictive model of a process in its mill. Although model performance during development had been satisfactory, the team repeatedly encountered the same scenario in debugging sessions: unusual data had led to unusual model performance. In response, the mine’s subject matter experts worked with the team to determine the root cause of the unexpected behavior, which often was the undocumented replacement of a sensor, a missed calibration date, a problem with a piece of equipment, or a missing piece of data.

At one point during development, a new dashboard was deployed to give downstream-processing plant operators insight into critical data on upstream ore hours before that ore needed to be processed. This was seen as a major improvement over the status quo, in which operators had limited advance insight into the ore they were processing until it was inside the mill. Initial excitement over this new tool’s ability to limit surprises and associated drops in efficiency was tempered when the company realized that the data the dashboard relied on, which had never been used for a tool like this before, were often not accurate enough to be useful. To support advanced applications, the mine needed to solve some accumulated data problems.

Mine operators learned a few lessons as they strove to bake data-quality discipline into the organization. The mine started rigorously tracking errors caused by faulty data, and, with the team’s help, identified a few root causes that could be addressed. For example, extra radio technicians were engaged to make sure that GPS antennas on key equipment were always functioning. A deep review with the mine’s engineers of nearly every sensor crucial to the model caught dozens of additional small bugs in how sensors were interpreted and incorporated, and a dynamic recalibration algorithm was developed to correct drifting on-stream analyzer measurements. This algorithm was then applied backwards to the entire historical data set for all analyzers.
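The article does not describe the recalibration algorithm itself. As a rough sketch of what drift correction can look like, the example below assumes that occasional laboratory assays are available as trusted reference values, and it applies an exponentially smoothed offset to the analyzer readings. The function name, the data, and the smoothing factor `alpha` are hypothetical.

```python
def recalibrate(
    readings: list[tuple[int, float]],
    lab_samples: dict[int, float],
    alpha: float = 0.5,
) -> list[tuple[int, float]]:
    """Correct drifting analyzer readings against sparse reference samples.

    readings    -- (timestamp, analyzer value) pairs sorted by timestamp
    lab_samples -- timestamp -> trusted lab value (assumed to exist)
    alpha       -- weight given to the newest observed offset (0 to 1)
    """
    offset = 0.0
    corrected = []
    for timestamp, value in readings:
        if timestamp in lab_samples:
            observed_offset = lab_samples[timestamp] - value
            # Blend the new offset with the running estimate.
            offset = (1.0 - alpha) * offset + alpha * observed_offset
        corrected.append((timestamp, value + offset))
    return corrected


# Hypothetical history: the true grade is 10.0 but the analyzer drifts upward.
history = [(0, 10.0), (1, 10.2), (2, 10.4), (3, 10.6)]
lab = {1: 10.0, 3: 10.0}
fixed = recalibrate(history, lab, alpha=1.0)
```

Because the function takes the full list of readings, the same routine can be run over an entire historical data set as well as over new data. With `alpha=1.0`, each lab sample fully resets the offset; smaller values smooth out noisy assays.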

In the end, taking the time with subject matter experts to be sure institutional knowledge had been encoded back into the data resulted in a substantial model performance boost, and time devoted to ensuring the capture of high-quality data resulted in a significantly more useful set of tools.

How to get started

For the past decade, AI and machine-learning systems have matured at a faster pace than data management, effectively making data quality a roadblock for disruptive innovation. As data quality and governance issues are resolved, disruptive ecosystems will likely evolve to drive both revenue and efficiency. In some cases, these ecosystems will spawn new business models.

Five steps can help owners and operators prioritize data-quality issues according to their impact on the business (including environmental risk, engineering, and downtime) and set up a small team with the right tools to quickly resolve any problems.


Step 1: Identify the operational or business problems that are rooted in poor data quality

Identify areas in which data quality or availability is blocking validated AI use cases (such as predictive maintenance, schedule optimization, or safety analytics) or is affecting operations. Next, build a prioritized backlog in collaboration with users or operators.

Step 2: Set up a data-quality SWAT team and agile operating model

Assemble a small AI SWAT team of data scientists, data engineers, and subject matter experts who are tasked solely with resolving data-quality issues. This team will work with business units to prioritize and develop technical solutions to the most critical business issues.

Step 3: Provide modular tools that can make corrections at scale

Provide a modular tool kit that includes entity resolution, time-series anomaly detection, and pipeline monitoring with the ability to chain modules together. This will enable a rapid assembly of solutions that can fix data at scale with the flexibility needed to resolve the biggest issues. For example, the flexibility of AI for data quality (AI4DQ) has enabled it to solve large-scale problems in aerospace, mining, banking, and telecommunications.
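One way to picture chainable modules is as small functions that share a common input and output type. The sketch below is an assumption about how such a tool kit might be structured, not a description of AI4DQ. It chains a time-series anomaly detector based on the median absolute deviation with a simple gap filler; the function names, the threshold of 3.5, and the sample data are hypothetical.

```python
from statistics import median
from typing import Callable, Optional

Series = list[Optional[float]]      # None marks a missing reading
Module = Callable[[Series], Series]


def drop_spikes(series: Series, threshold: float = 3.5) -> Series:
    """Blank out points whose robust z-score exceeds the threshold."""
    present = [x for x in series if x is not None]
    if not present:
        return list(series)
    center = median(present)
    mad = median([abs(x - center) for x in present])
    if mad == 0:
        return list(series)  # no spread: nothing can be flagged
    return [
        None if x is not None and 0.6745 * abs(x - center) / mad > threshold else x
        for x in series
    ]


def forward_fill(series: Series) -> Series:
    """Fill each missing reading with the last known value."""
    filled, last = [], None
    for x in series:
        if x is not None:
            last = x
        filled.append(last)
    return filled


def chain(*modules: Module) -> Module:
    """Compose modules so the output of one feeds the next."""
    def run(series: Series) -> Series:
        for module in modules:
            series = module(series)
        return series
    return run


clean = chain(drop_spikes, forward_fill)
raw = [10.1, 10.0, None, 10.2, 55.0, 10.1, 9.9]
result = clean(raw)
```

In this sample the reading of 55.0 is treated as a spike and removed, and both it and the original gap are then filled from the preceding value. Because every module has the same signature, a team can reorder or swap modules without rewriting the pipeline.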

Step 4: Engage subject matter experts to focus on the biggest problems

Set up a working model with users to validate corrective measures and bring subject matter expertise into the process, particularly in data governance, privacy, safety, and engineering.

Step 5: Ensure processes are in place to maintain high quality of data

To fully realize the value of data in manufacturing, organizations need industrialized data-quality systems, architecture, and processes. Evaluate the current state of the system architecture and data governance and develop a modernization road map to remove manual processes, such as spreadsheets, and drive consistency. Ensure that the right processes are in place to sustain data-quality improvements. Set up monitoring and accountability of data-quality metrics with alerts that trigger action. Finally, build a team of data product owners who continuously drive quality and remediate any roadblocks.
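Monitoring with alerts can start from a few simple metrics. The following minimal sketch checks completeness (the share of non-missing readings) and staleness (the time since the last reading) for one sensor. The metric choices, thresholds, sensor name, and the `Alert` structure are illustrative assumptions; a real deployment would route alerts to whoever is accountable for the data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    sensor: str
    metric: str
    value: float
    threshold: float


def completeness(values: list[Optional[float]]) -> float:
    """Fraction of readings that are present (0.0 for an empty window)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def check_sensor(
    sensor: str,
    values: list[Optional[float]],
    last_seen_s: float,
    now_s: float,
    min_completeness: float = 0.95,   # hypothetical threshold
    max_staleness_s: float = 300.0,   # hypothetical threshold
) -> list[Alert]:
    """Return one alert per data-quality metric that breaches its threshold."""
    alerts = []
    share = completeness(values)
    if share < min_completeness:
        alerts.append(Alert(sensor, "completeness", share, min_completeness))
    staleness = now_s - last_seen_s
    if staleness > max_staleness_s:
        alerts.append(Alert(sensor, "staleness", staleness, max_staleness_s))
    return alerts


# Hypothetical truck GPS feed with gaps and no update for ten minutes.
alerts = check_sensor("GPS-TRUCK-12", [1.0, None, None, 2.0],
                      last_seen_s=0.0, now_s=600.0)
```

Tracking the same metrics over time also gives data product owners a record of whether remediation work is improving quality.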


Manufacturers that can clear data-quality roadblocks will fully and quickly realize the benefits of AI by unlocking new use cases and safeguarding their long-term adoption by ensuring they are built on data that users trust.
