Having all of your Big Data stored at a single location for long term storage, and making it available globally to all of the engineers in your organization, initially sounds like a fantastic way to manage data and put it into the hands of the engineers who need it. BUT IT IS NOT THE BEST CHOICE. If any of the engineers trying to access the data at the single central data center are geographically thousands of miles away, the performance of uploading raw data files and analyzing them later will be very poor. The only way a central location can reasonably service remote users globally is with a configuration like the one described below.
A distributed data center configuration can provide the best performance when it is properly configured and supported. If the primary use of the data is localized to groups of activities, and you put in place a means for all of the data collected at each location to be shared with every other distributed data center, then you can achieve the best overall performance. You will need global standards for processing the local data, and a good method for updating each location with new data from all of the other locations. Redundancy of the data, put in place to achieve good data analysis access performance, has obvious additional benefits. It is also essential that each location provide a very high speed network connection between each engineer and the server where the data resides.
A frequent scenario is the engineer who records the data and keeps it stored locally on a group shared drive, or on the engineer's laptop / PC. Assuming the laptop / PC and any local network connection used are all high performance, the data processing experience will be very good. The downside to this arrangement is that the data is not open for access by other engineers, and it is up to that engineer to properly categorize and organize the data.
When you think about the storage of Big Data from testing data acquisition activities, the paradigm is similar to working with many large video files, WITH THESE IMPORTANT EXCEPTIONS:
Can your relational or NoSQL database service all of those needs? These needs are more easily serviced by a binary, file-based approach to the storage of each data acquisition file, where the unique needs of this type of data are accommodated. One such structured file format is the NI TDM Data Model.
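To make the file/group/channel structure of that data model concrete, here is a minimal sketch that writes one channel of data plus descriptive properties to a TDMS file using the open-source npTDMS Python package (a reader/writer for the TDMS format outside of DIAdem). The file name, group/channel names, and property values are illustrative only.

```python
import numpy as np
from nptdms import TdmsWriter, RootObject, GroupObject, ChannelObject

samples = np.random.normal(size=10_000)  # stand-in for acquired sensor values

# File-level, group-level, and channel-level metadata travel with the data.
root = RootObject(properties={"operator": "A. Engineer", "test_run": "42"})
group = GroupObject("Vibration", properties={"location": "front axle"})
channel = ChannelObject("Vibration", "Accel_Z", samples,
                        properties={"unit_string": "g", "wf_increment": 0.001})

with TdmsWriter("vibration_run42.tdms") as writer:
    writer.write_segment([root, group, channel])
```

Because the metadata is stored alongside the bulk data, the file can later be found by searching its properties without reading the measurement values themselves.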
Transferring Big Data across geo-distributed locations thousands of miles apart is constrained primarily by the transport protocol, whether over the internet or a dedicated network, and ultimately by latency. The time it takes to send data between the source and destination points, plus the time it takes for delivery acknowledgment, is the round-trip time (RTT). Moving a 10 GB file or data set across the US will take 10-20 hours on a typical 100 Mbps line using standard TCP-based file transfer tools (source). Specialized services are available that can reduce that time to as little as 14 minutes, but beyond that, latency will limit what can be done. Only breaking the laws of physics will overcome the latency constraint. Latency will also prevent you from continuously sending a single channel of data from a sensor to a distant wired/wireless location at anything better than a sample rate of about 42 Hz.
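The arithmetic behind those figures is straightforward. The effective throughput value in this sketch is an assumption: standard TCP file transfer tools over a long, high-latency link often achieve only a small fraction of the nominal 100 Mbps line rate, which is what stretches a 10 GB transfer into hours.

```python
def transfer_hours(file_size_gb: float, effective_mbps: float) -> float:
    """Hours to move a file of file_size_gb at an effective throughput of effective_mbps."""
    return (file_size_gb * 8000.0) / effective_mbps / 3600.0

def latency_limited_rate_hz(rtt_seconds: float) -> float:
    """Upper bound for a send/acknowledge stream: one sample per round trip."""
    return 1.0 / rtt_seconds

print(transfer_hours(10, 100))          # ~0.2 h: theoretical best case at full line rate
print(transfer_hours(10, 1.5))          # ~15 h: in the 10-20 hour range quoted above
print(latency_limited_rate_hz(0.024))   # ~42 Hz for a ~24 ms coast-to-coast RTT
```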
If your budget is small, or your data sizes are even larger, and you can afford a few days of delay, then consider physically moving the portable storage device, or a copy of the data on a portable flash drive, by mail. Do it yourself, or consider Seagate's Lyve Mobile high capacity edge storage solution. NI's Edge Storage and Data Transfer Service (DTaaS) claims a storage capacity of 200+ TB and a data throughput of greater than 6 GB/s.
Once your raw data makes it to its final destination for long term storage and data analysis, look at how the TSDMS Bulk Operations can accelerate your extraction of information from that data. Bulk Import will help you decode and import thousands of raw data files. Bulk Analysis will perform statistical calculations on the entire set of data files. And Bulk Report will generate reports from the analyzed files, providing you with insight on your data in minutes.
The continuous streaming of a single channel of data is limited by latency. The table below provides a realistic expectation for the continuous transmission of a single channel of data by various available methods.
Type | Latency | Sample Rate | Maximum Measurable Event Frequency |
---|---|---|---|
Satellite | 800 ms | 1.3 Hz | 0.13 Hz |
4G Cellular (< 28000 mi) | 150 ms | 7 Hz | 0.7 Hz |
5G Cellular (< 4000 mi) | 20 ms | 50 Hz | 5.0 Hz |
Copper Wire (Ethernet limited to 100 m) | 24 ms | 42 Hz | 4.2 Hz |
Fiber Cable (across USA) | 24 ms | 42 Hz | 4.2 Hz |
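The table values appear to follow two rules: the sample rate is roughly one sample per round trip (1 / latency), and the maximum measurable event frequency allows about ten samples per event cycle. The factor of ten is inferred from the table itself rather than stated elsewhere, so treat this sketch as a reading of the numbers, not an additional specification.

```python
# Reproduce the table rows from latency alone.
latencies_ms = {
    "Satellite": 800,
    "4G Cellular": 150,
    "5G Cellular": 20,
    "Copper Wire / Fiber Cable": 24,
}

for name, latency_ms in latencies_ms.items():
    sample_rate_hz = 1000.0 / latency_ms   # one sample per round trip
    max_event_hz = sample_rate_hz / 10.0   # ~10 samples per event cycle
    print(f"{name}: {sample_rate_hz:.1f} Hz sample rate, {max_event_hz:.2f} Hz max event frequency")
```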
If your sampling requirements exceed what can be done with continuous data streaming, then consider using a data logger that can record the data directly to portable media, and then swapping out and postal mailing that portable media to your long term storage location. The TSDMS Bulk Operations will have you quickly extracting value from that data.
The software you choose to manage text-based test measurement data can have a significant influence on the time required to import a medium size (115 MiB) data file for analysis. If you need to analyze groups of files for trends, then the application's data processing requirements can be even more of a challenge.
Let's look at the case of reading medium size (115 MiB) and large (1.1 GiB) text files. The table below compares the time required to read text files with comma separated value (CSV) data, ranging in size from 1.1 MiB to 1.1 GiB. Excel does a fairly good job at reading the typical text file imported by a user for general business purposes. However, medium size test measurement files (115 MiB) and larger take considerably longer to import, and anything with more than 1 million rows of data will exceed the limits of the application. Look at the performance of a DIAdem DataPlugin created using the text file DataPlugin wizard. This is the performance you need when processing many new text-based measurement data files.
Number of Rows | File Size | DIAdem DataPlugin (~38-56 GiB/hr) | Excel |
---|---|---|---|
10,000 | 1.1 MiB | 0.2 sec | 5.3 sec |
100,000 | 11.5 MiB | 0.8 sec | 16 sec |
1,000,000 | 115 MiB | 7.2 sec | 168 sec |
10,000,000 | 1126 MiB (1.1 GiB) | 104 sec | 387 sec (1) |

(1) 387 sec (6.5 minutes) for 1,048,577 rows (10% of the data). Excel cannot store more than 1,048,576 rows of data in a worksheet.
Now consider the case where you need to analyze multiple files for trends. If you save the 1 million row text file you read into Excel to an Excel binary .xlsb file format, it will still cost you nearly ten seconds to read that file again with Excel. With DIAdem, you use a DataPlugin to import the data once, save it to the TDM/TDMS file format, and then you can read a file that is 10x bigger than what you can store in Excel in less than 0.04 seconds! If you need to analyze many data files, perhaps looking for trends over time, then consider the performance advantage of working with them in the DIAdem TDM/TDMS binary file format.
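The same "parse the text once, then work from binary" workflow can be sketched outside of DIAdem with pandas and the open-source npTDMS package. The file and column names below are placeholders, and the CSV is assumed to contain only numeric columns.

```python
import pandas as pd
from nptdms import TdmsFile, TdmsWriter, ChannelObject

# Slow path, done once: parse the text file.
df = pd.read_csv("measurement.csv")

# Convert each (numeric) column into a TDMS channel in one group.
with TdmsWriter("measurement.tdms") as writer:
    writer.write_segment(
        [ChannelObject("Import", name, df[name].to_numpy()) for name in df.columns]
    )

# Fast path, done many times: later analyses read the binary file instead of re-parsing text.
tdms = TdmsFile.read("measurement.tdms")
speed = tdms["Import"]["speed"][:]   # 'speed' is a hypothetical channel name
```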
Analyzing many large data files as a group for trends is commonly approached by concatenating those files sequentially in a specific order, and then analyzing the final assembled file as if it were one data file. The problem with that approach is that when you go to analyze the file, you probably need to read it completely into memory, and that causes you to quickly reach the limits of what you can do in Excel. In DIAdem, a channel (or measurement) can have up to 2 billion (2^31) values.
Processing / analyzing many large files in DIAdem is where this application really sets itself apart from the rest. To begin with, DIAdem has a file oriented storage methodology. The TDMS file format is designed to provide easy and fast access to the metadata associated with the data file (so you can search and find the files of interest). DIAdem provides several options for loading the data. For example, you can load only the structure of the data (not the data values), and then determine which channels you ultimately want to load. The file format was also designed to allow you to read only the specific channels of interest from a file. The combination of very fast file reading, and flexibility in what content is read from each file, results in an application that is superior for processing and analyzing test measurement data.
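The same two-step pattern, inspect the structure first and then pull only the channels you need, can be illustrated with the open-source npTDMS package. The file, group, and channel names are placeholders.

```python
from nptdms import TdmsFile

# Step 1: read only the structure and properties, not the bulk data values.
meta = TdmsFile.read_metadata("measurement.tdms")
for group in meta.groups():
    print(group.name, [channel.name for channel in group.channels()])

# Step 2: open the file in streaming mode and load just the channel of interest.
with TdmsFile.open("measurement.tdms") as tdms:
    accel = tdms["Vibration"]["Accel_Z"][:]   # reads only this channel's values
```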
Collecting field test data and putting it into the hands of engineers who can use it to make informed product / process decisions requires careful planning from the start of the process to the end. Any interruption in the timely delivery of the data to the engineers at the end of the process, or a compromise in the quality of the data, will affect the outcome.
Start with the end in mind. Identify from the end users of the data what their needs are, what they would really like, and when they need it. But before creating a data collection plan, carefully consider what analysis and data processing will be required to turn that raw data into useful information, and who is going to do that. You also need to perform a risk analysis on the data that has been identified as critical, and assess the potential that a sensor or other equipment could fail and compromise the quality or complete delivery of the data. Redundant sensors / hardware and even duplicate recorded data channels may be a prudent choice to save the test activity.
For some projects, it may be sufficient to collect the data and then begin the analysis some time later, after all of the recording is complete. However, this is very risky, and should be avoided unless a complete pilot run is performed to ensure that a sample of the data (perhaps collected in the laboratory where the test specimen is being instrumented) is recorded and then fully processed exactly as intended for the actual final data set. A much better approach is to do the pilot test, and then during the actual testing release small batches of data as frequently as possible and fully analyze them. This gives you the best opportunity to catch any mistakes or omissions as early as possible.
Think carefully about the size of the raw data you will be transferring from a remote field test location to the final long term storage destination. Continuous streaming of data over thousands of miles limits you to a sample rate of about 42 Hz. Raw data files collected daily (< 24 hours/day) with a total size in the gigabyte range will not transfer quickly over WiFi, cellular, or any network with thousands of miles between the two points unless a specialized service is employed. If you can afford a few days of delay between pulling a batch of files from the field test unit and the final analysis at the final destination, consider physically mailing the portable storage device.
Most data loggers and data acquisition equipment can be configured to control how often they generate a new file while streaming data to a flash drive. Whenever possible, try to keep the recorded file size to 100 MB or less. This may sound small, but large files will dramatically increase data transfer and conversion time later.
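A quick sizing calculation helps pick the rollover interval. The channel count, per-channel sample rate, and bytes per sample below are assumptions to replace with your own logger configuration.

```python
channels = 32
sample_rate_hz = 1000          # per channel
bytes_per_sample = 8           # e.g. double-precision values

bytes_per_second = channels * sample_rate_hz * bytes_per_sample
target_bytes = 100 * 1024**2   # ~100 MB target file size

rollover_seconds = target_bytes / bytes_per_second
print(f"{bytes_per_second / 1e6:.2f} MB/s -> start a new file roughly every "
      f"{rollover_seconds / 60:.1f} minutes")   # ~6.8 minutes with these assumptions
```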
If the data is CAN/LIN bus log data, then this data will expand 10x in size or more when decoded with the bus log databases. If you have a 3.3 GB raw CAN/LIN bus log data file, and decode it, it easily becomes a 30 GB file!
It is always easier to analyze and concatenate small files than it is to split large files. A DIAdem channel can have up to 2 billion (2^31) values, so it has outstanding capability for working with large amounts of file data. But the performance will be significantly better if you keep the analyzed data file sizes smaller. Ideally, one or more files should together cover an event of interest. You don't want to have to manually split large files later because the test specimen was performing a different task, or operating in a different environment.
Sometimes a sensor is not able to record a value at the time a sample is taken, or the value exceeds the maximum measurable limit of the sensor. These are just some of the conditions where the value will be stored as a NoValue when imported. NoValues are an important DIAdem feature that allows you to clearly differentiate when the value of the channel is not known at the particular time a sample was taken. It would be highly undesirable if the value were instead set to zero, the minimum allowed measured value, the sensor full scale value, or an interpolated value.
Some DIAdem functions work well with NoValues, and others require the user to resolve them to something else in order to perform the intended data manipulation or analysis. You have several options for the management of NoValues, including interpolation, using the last value, or deleting them. DIAdem includes tools to make it easy to manage NoValues in channel data.
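DIAdem NoValues correspond conceptually to NaN in Python, so the same three resolution strategies can be sketched with pandas. The channel values here are made up.

```python
import numpy as np
import pandas as pd

channel = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, np.nan])

interpolated = channel.interpolate()   # linear interpolation across the gaps
held = channel.ffill()                 # repeat the last known value
deleted = channel.dropna()             # remove samples that have no value
```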
If you configure your data logger / data acquisition equipment in one location while it is installed in the test specimen, then ship it to a test destination another time zone away, and then ship the recorded data to yet another time zone for long term storage and data analysis, will you really know when the data was recorded? The best practice is to have a procedure where the DAQ / data logger equipment is set to Coordinated Universal Time (UTC), and the time zone in effect at the time of the recording is also documented. If you cannot document the time zone, then an alternative is to record GPS location data, so that later the date/time relative to UTC can be determined, and the local time zone can be inferred from the GPS coordinates.
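A minimal sketch of that best practice, using only the Python standard library: store the timestamp in UTC and keep the local time zone name as metadata. The zone name and timestamp are placeholders.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

local_zone = "America/Chicago"   # documented alongside the recording
recorded_local = datetime(2021, 6, 15, 14, 30, 0, tzinfo=ZoneInfo(local_zone))
recorded_utc = recorded_local.astimezone(timezone.utc)

# Store the UTC timestamp with the data and the zone name as metadata,
# so the local wall-clock time can always be reconstructed later.
print(recorded_utc.isoformat(), "recorded in", local_zone)
```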
It is amazing what information can be determined about the past environmental conditions (weather) and topography at a particular location and date/time using online APIs. DIAdem has Shuttle Radar Topography Mission (SRTM) data tools that fetch high-resolution digital topographic data covering nearly all of the earth at a resolution of 30 m. Online API services such as https://timezonedb.com can tell you the country code, time zone name, and time zone offset for a pair of GPS coordinates. DIAdem includes tools that calculate the linear distance and elevation range between two GPS coordinates.
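The distance portion of that last capability is ordinary great-circle arithmetic, shown here as a haversine sketch. The coordinates are placeholders.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance between two latitude/longitude points in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

print(f"{haversine_km(41.8781, -87.6298, 29.7604, -95.3698):.0f} km")  # Chicago to Houston
```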
GPS signals include a date/time reference relative to Coordinated Universal Time (UTC). The resolution of this value can be as good as 1.5 seconds, but drift since the initial synchronization on 6 Jan 1980 has caused a considerable, but known, difference. As of 2021, the GPS reported time is 18 seconds ahead of the actual UTC value. The correction to the GPS reported time is calculated from the number of leap seconds that have occurred since 6 Jan 1980; each leap second adds one second of difference between the GPS reported time and the actual UTC value. When properly adjusted, the GPS signals for UTC date/time provide a valuable reference identifying when data was recorded.
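Applying that correction is a single subtraction. The 18 second offset matches the value quoted above (it has been constant since the leap second at the end of 2016), and the timestamp is a placeholder.

```python
from datetime import datetime, timedelta, timezone

GPS_UTC_OFFSET_SECONDS = 18   # as of 2021; changes only when a new leap second is added

gps_reported = datetime(2021, 6, 15, 12, 0, 18, tzinfo=timezone.utc)
actual_utc = gps_reported - timedelta(seconds=GPS_UTC_OFFSET_SECONDS)
print(actual_utc.isoformat())   # 2021-06-15T12:00:00+00:00
```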