Insight Starts with Data Integrity
Setting the stage to make informed scientific, engineering and business decisions
Carly Fiorina, former executive, president and chairwoman of Hewlett-Packard, once said, “The goal is to transform data into information, and information into insight.” Before any of that can happen, we need to ask ourselves which business questions we need answered. And if we trace the adage back to its logical origin, we arrive at data integrity. With the business goal in mind, we can break data integrity down into its components:
• First, there is the issue of the type of data to be collected. It can be numeric, text, video, audio or a mixture of these. If numeric, it can be attribute, variable or locational.
• The measurement scale needs to be decided, whether nominal, ordinal (ranking), interval or ratio.
• The person or device collecting the data needs to be considered. If a person, then their level of training matters. If a device, then its capacity and capability matter.
• Once the data is collected, the person in charge of maintaining it needs to be involved in setting up a system for backing up the information and, ideally, for detecting data corruption.
• The owner of the data needs to provide support regarding finances and duration of data storage, as well as frequency of backups.
• The format in which the data is collected and stored will be driven by the type of data, the finances available, the method by which it will be analyzed, the frequency with which it will be backed up, and how it will be archived.
• The manner in which the data will be analyzed should be considered (but often isn’t) before a system is set up. The type of analysis could range from simple overall metrics consumed by management to real-time data mining by statisticians.
• The type and location of any output from the analysis should be thought through, such as posting to a Web site rather than languishing on a server.
• Finally, there is the question of whether the data system supports any regulated systems. If it does, the associated validation requirements need to be planned for early.
Despite the best efforts in planning, data integrity issues will arise. The most basic of these are data entry errors caused by humans. More complex problems arise from database architecture, software and infrastructure:
• Entity integrity mandates that no attribute of a primary key may have a null value. Otherwise, it is impossible to identify a record. Null values in other fields can be valid, but they need special treatment.
• Referential integrity states that all non-null foreign key values need to match an existing primary key value. Inconsistent relationships between tables lead to problems.
• Software bugs are inevitable and can compromise data integrity.
• Finally, transmission errors between computers, hardware crashes and natural disasters can all contribute to data integrity issues.
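The entity and referential integrity rules listed above can be seen in action with a small sketch. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for whatever database is actually in use; the table and column names (lot, measurement, lot_id) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical parent table: the primary key may never be null (entity integrity).
conn.execute("""
    CREATE TABLE lot (
        lot_id TEXT NOT NULL PRIMARY KEY
    )
""")
# Hypothetical child table: a non-null lot_id must match an existing lot
# (referential integrity); a null lot_id is still permitted.
conn.execute("""
    CREATE TABLE measurement (
        measurement_id INTEGER PRIMARY KEY,
        lot_id TEXT REFERENCES lot(lot_id),
        value REAL
    )
""")


def try_insert(sql, params):
    """Run one insert and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid_lot": try_insert("INSERT INTO lot (lot_id) VALUES (?)", ("A100",)),
    "null_key": try_insert("INSERT INTO lot (lot_id) VALUES (?)", (None,)),
    "matching_fk": try_insert(
        "INSERT INTO measurement (lot_id, value) VALUES (?, ?)", ("A100", 1.5)),
    "orphan_fk": try_insert(
        "INSERT INTO measurement (lot_id, value) VALUES (?, ?)", ("ZZZ", 2.5)),
    "null_fk": try_insert(
        "INSERT INTO measurement (lot_id, value) VALUES (?, ?)", (None, 3.5)),
}
print(results)
```

In this sketch the null primary key and the orphaned foreign key are rejected, while the null foreign key is accepted, which is the "valid null" case that needs special treatment during analysis.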
Fortunately, there are solutions available to many of these problems. If data entry is manual, clear, printed forms should be used. These help to avoid emotional bias related to expected values, unnecessary rounding, and loss of critical information, such as the time sequence of data capture. The importance of not screening out any data upon entry cannot be overemphasized. Data coding can minimize fatigue. It can take the form of a factor (adding or subtracting a constant, or multiplying or dividing by a factor), substitution (e.g. an integer to substitute for inch fractions), or truncation of repetitive place values (e.g. for the series 0.23410, 0.23485, 0.23461… capture only the last 2 digits).
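The truncation example above can be sketched in a few lines. This is only an illustration under stated assumptions: the shared prefix ("0.234") and the code width (2 digits) are agreed in advance, readings are handled as text so no digits are lost to rounding, and the function names encode and decode are hypothetical.

```python
from decimal import Decimal


def encode(reading: str, prefix: str) -> int:
    """Strip the repeated leading digits and return the remainder as an integer code."""
    if not reading.startswith(prefix):
        raise ValueError(f"{reading!r} does not start with {prefix!r}")
    return int(reading[len(prefix):])


def decode(code: int, prefix: str, width: int) -> Decimal:
    """Rebuild the full reading from the shared prefix and a zero-padded code."""
    return Decimal(prefix + str(code).zfill(width))


readings = ["0.23410", "0.23485", "0.23461"]
PREFIX = "0.234"   # assumed constant part of every reading
WIDTH = 2          # assumed number of digits the operator writes down

codes = [encode(r, PREFIX) for r in readings]
restored = [decode(c, PREFIX, WIDTH) for c in codes]
print(codes)                       # what the operator records
print([str(x) for x in restored])  # what the analyst reconstructs
```

The point of the design is that decoding is exact and mechanical, so the shortcut taken at data entry costs nothing at analysis time.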
If data entry is automated using a database, security features, such as passwords, can be applied to control who is allowed to enter data. Then, at the field level, data entry errors can be minimized through an intelligent user interface that enforces the correct data types through input masks. Several further controls are available:
• Duplicate values can be allowed or forbidden.
• A requirement for a value in a field along with the judicious use of default values can be applied, as well as limitations on the list of possible entries. This facilitates later analysis.
• At the table level, rules that compare records with one another can be implemented.
• Separate validation tables can hold values which can be referenced to allow consistency.
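The controls in the list above can be illustrated with a minimal validation routine. Everything specific here is an assumption made for the example: the mask (one capital letter followed by three digits), the field names, the default unit, and the small set standing in for a separate validation table. A real database would normally enforce these rules through its own field properties and constraints.

```python
import re

LOT_MASK = re.compile(r"[A-Z]\d{3}")  # hypothetical input mask, e.g. "A100"
ALLOWED_UNITS = {"mm", "in"}          # stands in for a separate validation table
DEFAULT_UNIT = "mm"                   # hypothetical default value


def validate(record: dict, seen_ids: set) -> tuple:
    """Return (cleaned_record, errors) for one data-entry record."""
    errors = []
    cleaned = dict(record)

    # Default value for a field the operator left blank.
    if cleaned.get("unit") in (None, ""):
        cleaned["unit"] = DEFAULT_UNIT

    # Required field, input mask, and duplicate check against earlier records.
    lot_id = cleaned.get("lot_id")
    if not lot_id:
        errors.append("lot_id is required")
    elif not LOT_MASK.fullmatch(lot_id):
        errors.append("lot_id does not match mask")
    elif lot_id in seen_ids:
        errors.append("duplicate lot_id")

    # Limit entries to the values held in the validation table.
    if cleaned["unit"] not in ALLOWED_UNITS:
        errors.append("unit not in validation table")

    # Enforce the data type of the measured value.
    try:
        cleaned["value"] = float(cleaned.get("value"))
    except (TypeError, ValueError):
        errors.append("value must be numeric")

    if not errors:
        seen_ids.add(lot_id)
    return cleaned, errors


seen = set()
ok, ok_errors = validate({"lot_id": "A100", "value": "1.25"}, seen)
dup, dup_errors = validate({"lot_id": "A100", "value": "1.30", "unit": "in"}, seen)
bad, bad_errors = validate({"lot_id": "a1", "value": "n/a", "unit": "cm"}, seen)
print(ok, ok_errors)
print(dup_errors)
print(bad_errors)
```

Rejected records are returned with their error list rather than silently dropped, in keeping with the advice not to screen out data on entry.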
Once data entry is complete, error detection and correction software can be used. Statistics can be applied to determine whether the observed distribution of the data conforms to the distribution expected from theory. For example, tests of normality can be performed and outliers identified for investigation.
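As one hedged illustration of identifying outliers for investigation, the sketch below uses Tukey's fences (points more than 1.5 interquartile ranges beyond the quartiles). This rule is a choice made for the example, not a method prescribed here, and a formal normality test is not shown. The readings are made-up values, and the function name tukey_outliers is hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up readings for illustration only.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 9.8, 10.1, 14.5]
flagged = tukey_outliers(readings)

# The original data stays untouched; flagged points are only logged for follow-up.
audit_log = [
    {"index": i, "value": v, "status": "flagged, cause not yet determined"}
    for i, v in enumerate(readings)
    if v in flagged
]
print(flagged)
print(audit_log)
```

Note that the routine flags points and records them; it does not delete anything.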
It is critical to apply objective tests to identify outliers and to remove them only when a cause is found. A log of the original data and the reasons for outlier removal should be kept. An appropriate sampling plan should also be established. Random sampling can be used if the sample needs to be representative and there are no temporal or spatial trends to capture. If such trends exist, then either a sequential (time-based) or stratified (across a variable such as material lot) sampling plan is needed.
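The three sampling plans mentioned above might be sketched as follows. The interpretation of "sequential" as taking every n-th record in time order is an assumption, as are the record layout (a sequence number and a lot label) and the function names.

```python
import random
from collections import defaultdict


def simple_random_sample(records, k, seed=None):
    """Pick k records at random, ignoring order and grouping."""
    return random.Random(seed).sample(records, k)


def sequential_sample(records, step, start=0):
    """Take every step-th record in time order, preserving the sequence."""
    return records[start::step]


def stratified_sample(records, key, per_stratum, seed=None):
    """Pick up to per_stratum records at random from each stratum (e.g. each lot)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Made-up records: 12 time-ordered readings spread over three lots.
records = [{"seq": i, "lot": "ABC"[i % 3]} for i in range(12)]

random_pick = simple_random_sample(records, 4, seed=1)
sequential_pick = sequential_sample(records, step=3)
stratified_pick = stratified_sample(
    records, key=lambda r: r["lot"], per_stratum=2, seed=1)

print([r["seq"] for r in sequential_pick])
print(sorted(r["lot"] for r in stratified_pick))
```

The sequential plan keeps the time order intact so that drift can be seen, while the stratified plan guarantees that every lot is represented.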
Finally, regular backups of information limit the amount of data that is lost when a failure occurs.
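A backup is only useful if the copy is intact, so one possible approach is to verify each backup with a checksum and to reuse the same checksum later to detect corruption. The sketch below assumes file-based data and uses SHA-256 from Python's standard library; the file name, directory layout and function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} failed checksum verification")
    return target


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "readings.csv"           # hypothetical data file
    src.write_text("lot,value\nA100,1.25\n")
    copy = backup_and_verify(src, Path(tmp) / "backup")
    original_digest = sha256_of(copy)

    # Simulate later corruption of the backup and detect it by re-hashing.
    copy.write_text("lot,value\nA100,9.99\n")
    corrupted = sha256_of(copy) != original_digest

print(corrupted)
```

Storing the digest alongside each backup lets a scheduled job re-check old copies without needing the original file.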
Once sound principles of data integrity are applied and maintained, the stage is set for the data analysis which provides the insight needed to make informed scientific, engineering and business decisions.
Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at firstname.lastname@example.org