Scientific Computing

Extracting the Most Value from Your Data

New technologies for technical data management

Caroline Bright



TABLE 1: Comparison of characteristics of different file formats
As we continue to model the world around us, we are collecting more data than ever before. When developing a new measurement system, significant time and money are typically invested in evaluating and purchasing the equipment and developing the applications needed to acquire the test data. With increasing processor capabilities due to multicore technology and the decreasing cost of computer memory, we can collect and store more information. However, the question of how best to store and manage that data for post-processing and reporting often receives little attention, and neglecting it can create great inefficiencies within the application.

For example, imagine a shelf of books. It would be fairly easy in this situation to scan the shelf to find a particular title. Now imagine a library full of books. It is no longer efficient to scan each book to find a specific title. In this case, using the catalog system to find a particular title would be the best solution. This same need for cataloging, or data management, also applies to test systems. The ability to rapidly find, process and trend on large amounts of simulation and test data to arrive at usable results can be a significant differentiator in today’s fiercely competitive business environment.

The challenges in properly managing data can be broken into three key areas:
• how best to store the data to disk
• how to manage and organize the data for quick searchability and trending
• how to process large amounts of information to reach final results
New approaches and enabling technologies in each of these areas are helping the scientific community to extract the useful information from increasing amounts of data.

Comparing file formats
For many new test systems, how best to store data to disk is an afterthought in the overall application design. This leads many to choose whichever storage approach most easily meets the needs of the application in its current state. Yet the choice of storage format can have a large effect on the overall efficiency of the acquisition system, as well as on the post-processing of the raw data. There are many characteristics to consider when evaluating storage formats, such as:
• sharing and exchangeability
• footprint
• inclusion of meta information and properties
• writing speed

Depending on the application, you may prioritize certain characteristics over others. Common storage formats such as ASCII, binary and XML have strengths in different areas. A new file format, technical data management streaming (TDMS), aims to combine the strengths of each of these formats as well as to improve file storage.


FIGURE 1: The TDMS file format is a hierarchical format designed for saving properties with test data
Many prefer to store data using ASCII files because they are human-readable and easy to exchange. ASCII files can be opened in common software applications found on most computers today, such as Notepad, Wordpad and Microsoft Excel, making it simple to open files written by the acquisition application and view the data immediately, as well as to share data with colleagues. ASCII files, however, have several drawbacks. Each file has a large disk footprint, which can be an issue when storage space is limited. Also, reading and writing ASCII data can be significantly slower than with other formats and, in many cases, cannot keep up with the speed of the acquisition system, possibly resulting in the loss of data.

In contrast, binary files have a much smaller disk footprint and can be streamed to disk at extremely high speeds, making them ideal for high-channel-count applications. The shortcoming of saving data to a binary file format is that it is not human-readable. Also, different applications may interpret binary data in different ways. One application may read the binary values as textual characters while another may interpret the values as colors.

To share binary files with colleagues, you must provide them with an application that interprets your file correctly. Also, if you make changes to how the data is written in the acquisition application, these changes must be reflected within the application reading data. This can potentially cause long-term application versioning issues and, ultimately, lost data.
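The footprint and interpretation trade-offs described above can be illustrated with a minimal Python sketch. The sample values and the little-endian double layout are assumptions chosen for illustration, not part of any particular acquisition system.

```python
import struct

# Hypothetical sample set: 1,000 double-precision readings.
samples = [i * 0.001 + 1.0 / 3.0 for i in range(1000)]

# ASCII: one reading per line, written with full round-trip precision.
ascii_bytes = "".join(f"{value!r}\n" for value in samples).encode("ascii")

# Binary: each reading packed as an 8-byte little-endian IEEE 754 double.
binary_bytes = struct.pack(f"<{len(samples)}d", *samples)

print(len(ascii_bytes), "bytes as ASCII")
print(len(binary_bytes), "bytes as binary")

# A reader must know the layout ("<d" here) to recover the values.
decoded = struct.unpack(f"<{len(samples)}d", binary_bytes)

# Reading the same bytes with a different assumed layout (4-byte floats)
# raises no error, but the resulting numbers are meaningless.
misread = struct.unpack(f"<{len(samples) * 2}f", binary_bytes)
```

The last two lines show why a change in how the writer lays out its data has to be mirrored in every reader: nothing in the bytes themselves says how they should be interpreted.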

XML is another common file format for storing data that has gained popularity over the last several years. Its growing popularity is due to its ability to store complex data structures. With XML files, you can store not only the raw measurement data, but also data attributes and formatting. With the flexibility of the XML format, you can store additional information with the data in a structured manner.


FIGURE 2: Hybrid data management systems store only the properties of a file within the database for quicker searching
Another benefit of XML is that it is relatively human-readable and exchangeable. You can open XML files in many common text editors as well as XML-capable Internet browsers, such as Microsoft Internet Explorer. However, in its raw form, XML includes tags within the file that describe the structures, which also will appear when opened in these applications. For some, this limits the readability.

The weakness of the XML file format is that it has an extremely large disk footprint and cannot be used to stream data. There also can be considerable planning needed when designing the layout, or schema, of the XML structures.
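As a rough illustration of both the strength and the weakness of XML, the following Python sketch stores a small measurement with descriptive attributes and compares its size with a plain binary encoding of the same values. The element and attribute names are hypothetical; a real schema would need the planning mentioned above.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical measurement: 100 temperature readings.
samples = [20.0 + 0.25 * i for i in range(100)]

# Attributes travel with the data in a structured, self-describing way.
root = ET.Element("test", {"operator": "jsmith", "unit": "degC"})
channel = ET.SubElement(root, "channel", {"name": "temperature"})
for value in samples:
    ET.SubElement(channel, "sample").text = repr(value)

xml_bytes = ET.tostring(root, encoding="utf-8")
binary_bytes = struct.pack(f"<{len(samples)}d", *samples)

print(len(xml_bytes), "bytes as XML")
print(len(binary_bytes), "bytes as raw binary")

# Any XML parser can recover both the attributes and the values.
parsed = ET.fromstring(xml_bytes)
recovered = [float(node.text) for node in parsed.iter("sample")]
```

Every value is wrapped in an opening and a closing tag, which is where most of the extra footprint comes from.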

Several years ago, the TDMS file format was developed specifically to meet the needs of engineers and scientists collecting test data and to address the footprint, speed and exchangeability concerns described above. The TDMS file format is binary-based, so it has a small disk footprint and can stream data to disk at high speeds. At the same time, TDMS files contain a header component that stores descriptive information, or attributes, with the data. Some attributes, such as file name, date and file path, are stored automatically; however, users can easily add their own custom attributes as well.

Another advantage of the TDMS file format is the built-in hierarchy. TDMS files consist of three levels: file, group and channel. A TDMS file can contain an unlimited number of groups and each group can contain an unlimited number of channels. You also can add attributes at each of these levels. This hierarchy creates an inherent organization of the test data. Finally, although TDMS files are binary, you can open them in many common applications, such as Microsoft Excel and OpenOffice, for sharing with colleagues. Thus, TDMS files provide the benefits of easy exchangeability and inclusion of attributes without sacrificing speed and size.
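The three-level hierarchy can be pictured with a small Python sketch. This models only the file, group and channel organization and the attributes at each level; it is not the TDMS on-disk format or any National Instruments API, and all class, property and channel names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Channel:
    name: str
    data: List[float] = field(default_factory=list)
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    name: str
    channels: List[Channel] = field(default_factory=list)
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class DataFile:
    name: str
    groups: List[Group] = field(default_factory=list)
    properties: Dict[str, object] = field(default_factory=dict)


def describe(data_file):
    """Return one indented line per level of the hierarchy."""
    lines = [f"{data_file.name} {data_file.properties}"]
    for group in data_file.groups:
        lines.append(f"  {group.name} {group.properties}")
        for channel in group.channels:
            lines.append(
                f"    {channel.name} {channel.properties} "
                f"({len(channel.data)} samples)"
            )
    return lines


run = DataFile(
    name="engine_test",
    properties={"operator": "jsmith"},
    groups=[
        Group(
            name="warm_up",
            properties={"ambient_degC": 21.5},
            channels=[
                Channel("rpm", [800.0, 820.0, 815.0], {"unit": "1/min"}),
                Channel("oil_temp", [40.1, 42.7, 45.0], {"unit": "degC"}),
            ],
        )
    ],
)

for line in describe(run):
    print(line)
```

Because attributes can be attached at any level, a property such as the operator is stored once on the file, while units stay with the individual channels they describe.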

Organizing and managing data
After deciding how to store the data to disk, the next step is to evaluate the most effective way to organize and manage the information. For many, the solution appears to be the file- and folder-naming approach in which data is stored to files and then organized based on file names and folder structures. For example, you may create a directory structure based on a particular type of test, then have subdirectories that reflect the serial number of the product tested, and then name the files within that directory based on the test data or test operator.


FIGURE 3: Parametric searches can help find trends and correlations within large amounts of data.
Although this approach is a simple, low-cost and familiar way to store data, it often does not scale as application needs change and the amount of data being collected continues to grow. Often, data is stored on different machines and in different file formats, making it a challenge to find a particular data set, let alone correlate information from multiple files. Over time, the approach becomes highly ineffective and results in lost productivity. It also is fragile: the organization is easily corrupted when files are inadvertently moved or renamed.
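The fragility of this scheme can be shown with a short Python sketch. The directory layout, serial number and file names are hypothetical examples of the convention described above.

```python
from pathlib import PurePosixPath


def build_path(test_type, serial_number, test_date, operator):
    """Encode the metadata in the directory structure and file name."""
    return PurePosixPath(test_type) / serial_number / f"{test_date}_{operator}.csv"


def parse_path(path):
    """Recover the metadata from a path; it is only as good as the naming."""
    test_type, serial_number, file_name = PurePosixPath(path).parts
    test_date, _, operator = PurePosixPath(file_name).stem.partition("_")
    return {
        "test_type": test_type,
        "serial_number": serial_number,
        "test_date": test_date,
        "operator": operator,
    }


original = build_path("vibration", "SN-0042", "2010-03-15", "jsmith")
print(original)
print(parse_path(original))

# A well-meaning rename raises no error, but the metadata recovered
# from the path is now silently wrong.
renamed = original.with_name("final_results.csv")
print(parse_path(renamed))
```

Because the metadata lives only in the names, nothing detects the problem after the rename; the file simply stops matching searches by date or operator.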

At the opposite end of the spectrum is the use of a dedicated database to store and manage the technical data. Often, engineers and scientists use a database once their application outgrows the file- and folder-naming approach. Databases have a reputation for organizing data for easy search and retrieval. This makes them ideal in situations in which the amount of test data makes finding a data set difficult.

Yet, there are several drawbacks to using a standard database, such as Access or Oracle, to store measurement data. These databases are not designed specifically for test data, so a significant amount of work must be done up front to design the overall schema of the database. And, if test needs change, your schema may need to be expanded or altered to account for new data. This database modeling and scaling can take significant time and often requires the engagement of IT experts who may not have the bandwidth to help the test groups. The long-term investment in database modeling and maintenance can quickly become expensive, and this does not even account for the actual cost of the database.

A new hybrid approach to data management grew out of the advantages and shortcomings of the file- and folder-naming and database approaches. This hybrid approach, used in technology such as the NI DataFinder, uses a database to manage just the attributes stored in test files of any format. This means you can store your data in the file format that best meets your application needs, and the DataFinder automatically indexes, stores and organizes the properties and attributes stored within the test files. You can then search those attributes to find trends and correlations within the data. Behind the scenes, the application creates a database that is abstracted from the user, so no database expertise is needed. The hybrid approach provides the flexibility to store data in any file format while maintaining visibility into your data through searching, without the high cost and overhead of a dedicated database.
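The idea of indexing only the attributes, while leaving the bulk data in its original files, can be sketched in Python with the standard sqlite3 module. This is an illustration of the general concept under stated assumptions, not how the NI DataFinder is implemented; the file paths, attribute names and table layout are all hypothetical, and the attributes are assumed to have already been extracted from each file.

```python
import sqlite3

# Hypothetical attributes already extracted from test files of mixed formats.
# The measurement data itself stays in the files and is never loaded here.
extracted = {
    "runs/run_001.tdms": {"operator": "jsmith", "serial": "SN-0042", "max_temp": 81.5},
    "runs/run_002.csv": {"operator": "alee", "serial": "SN-0042", "max_temp": 97.2},
    "runs/run_003.xml": {"operator": "jsmith", "serial": "SN-0077"},
}

conn = sqlite3.connect(":memory:")
# One row per (file, attribute) pair, so a new attribute needs no schema change.
conn.execute("CREATE TABLE properties (file_path TEXT, name TEXT, value)")
conn.executemany(
    "INSERT INTO properties VALUES (?, ?, ?)",
    [
        (path, name, value)
        for path, attributes in extracted.items()
        for name, value in attributes.items()
    ],
)


def find_files(name, value):
    """Return the paths of files whose attribute `name` equals `value`."""
    rows = conn.execute(
        "SELECT file_path FROM properties "
        "WHERE name = ? AND value = ? ORDER BY file_path",
        (name, value),
    )
    return [row[0] for row in rows]


print(find_files("operator", "jsmith"))
print(find_files("serial", "SN-0042"))
```

The index stays small because it holds only descriptive attributes, and a search returns file paths that can then be opened with whatever tool suits each format.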

Post-processing and reporting
Once test data has been stored and organized on disk, the final step is to determine the best way to post-process and report on the results of the tests to reveal the most relevant information, find trends and clearly display the results. Several analysis and reporting tools, such as Microsoft Excel and NI DIAdem, can connect to external data sources, such as files and databases, and then load the data for further analysis. DIAdem also has an interface for searching data within the DataFinder index. Users can run quick keyword searches as well as more advanced parametric searches. A parametric search lets you build query conditions to locate files, or even dig down to the channel level, for trending across multiple files without the overhead of opening each file individually.
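A parametric search can be thought of as a set of conditions, each made up of a property, a comparison operator and a value, evaluated against indexed properties. The Python sketch below illustrates that idea at the channel level; it does not reproduce the DIAdem or DataFinder query interface, and the index entries and property names are hypothetical.

```python
import operator

# Comparison operators allowed in a query condition.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

# Hypothetical channel-level index: one entry per channel, per file.
index = [
    {"file": "run_001.tdms", "channel": "oil_temp", "operator": "jsmith", "maximum": 81.5},
    {"file": "run_001.tdms", "channel": "rpm", "operator": "jsmith", "maximum": 6200.0},
    {"file": "run_002.tdms", "channel": "oil_temp", "operator": "alee", "maximum": 97.2},
    {"file": "run_003.tdms", "channel": "oil_temp", "operator": "jsmith", "maximum": 102.4},
]


def parametric_search(records, conditions):
    """Return the records that satisfy every (property, op, value) condition."""
    matches = []
    for record in records:
        if all(
            prop in record and OPERATORS[op](record[prop], value)
            for prop, op, value in conditions
        ):
            matches.append(record)
    return matches


# Find every oil temperature channel that exceeded 90 degrees, in any file.
hits = parametric_search(
    index,
    [("channel", "==", "oil_temp"), ("maximum", ">", 90.0)],
)
for hit in hits:
    print(hit["file"], hit["channel"], hit["maximum"])
```

Only the index is consulted, so the search identifies the channels of interest across many files before any measurement data is loaded.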

Once the data of interest is found, it is important to use post-processing tools designed with engineers and scientists in mind. Many times, users will develop their own custom analysis applications using languages such as C, .NET or NI LabVIEW graphical programming software. Yet, there is significant overhead associated with developing and maintaining these custom applications. Post-processing tools, such as DIAdem, are designed specifically for the post-processing of technical data and include preconfigured analysis and graphing tools, removing the need for custom development. By using a tool targeted for engineers and scientists, you can move quickly from raw data to usable results that help you make decisions faster and spend more time on the acquisition of the data, as opposed to the analysis.

Conclusion
With today’s computers, researchers can collect more data than ever before, resulting in an increasing need for data management solutions to handle all this information. New technology and approaches for storing data to disk and managing the data for quick searching and trending, as well as off-the-shelf products for performing engineering analysis and reporting, can help you gain more in-depth insight into test data and get the most out of the investment made in acquiring this data.

Caroline Bright is data management product manager at National Instruments. She may be contacted at editor@ScientificComputing.com.

Acronyms
ASCII American Standard Code for Information Interchange | LWF Bavarian State Institute of Forestry | TDMS Technical Data Management Streaming | XML Extensible Markup Language



National Instruments
11500 N Mopac Expressway Bldg B
Austin TX 78759-3504
www.ni.com

 








Advantage Business Media © 2010 Advantage Business Media