Data Life Cycle
Data life cycle: the sequential steps that all business data goes through, from creation through use and storage to final disposal.
1. Define
Define what data the business needs and where to capture or retrieve it.
Determining what data is needed, and where it will come from, increases the likelihood that the selected data is relevant to the business's data collection goals.
2. Capture/Creation
Obtain the data, either by creating it internally or by capturing data that already exists externally.
Internal data is a type of digital asset that the company creates manually, automatically, or semi-automatically.
For external sources, consider the integrity, safety, and copyright of the data. A contract may need to be signed.
3. Preparation
Determine whether the data is complete, clean, current, encrypted, and user-friendly.
Enhancing completeness and integrity of data: any time data is moved from one location to another, some of the required data may be lost in the process. Completeness can be checked in four steps:
•Compare the number of records expected v. actual.
•Compare descriptive statistics for numeric fields, if you have access to checksums from the original data source. Comparing those statistics helps to detect missing data or incorrectly formatted fields.
•Validate that field formats are consistent with the source, to ensure that the formatting transferred correctly.
•Compare the character limits of the attributes in the source file with those in the new file.
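The four checks above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name compare_to_source, the field names (invoice_date, amount, customer), the date pattern, and the sample rows are all hypothetical, and both data sets are assumed to be lists of dictionaries with string values.

```python
import re

def compare_to_source(source_rows, loaded_rows, numeric_field, format_field,
                      text_field, pattern=r"\d{4}-\d{2}-\d{2}"):
    """Run four completeness checks on a loaded copy of a data set."""
    def numeric_summary(rows):
        values = [float(row[numeric_field]) for row in rows]
        return (round(sum(values), 2),
                min(values, default=None),
                max(values, default=None))

    def longest(rows):
        return max((len(row[text_field]) for row in rows), default=0)

    return {
        # 1. Number of records expected v. actual
        "record_count": len(source_rows) == len(loaded_rows),
        # 2. Descriptive statistics for a numeric field
        "numeric_stats": numeric_summary(source_rows) == numeric_summary(loaded_rows),
        # 3. Field format survived the transfer (assumed pattern: YYYY-MM-DD)
        "field_format": all(re.fullmatch(pattern, row[format_field])
                            for row in loaded_rows),
        # 4. Longest value is unchanged, so nothing was truncated
        "character_limit": longest(source_rows) == longest(loaded_rows),
    }

# Hypothetical sample data: the loaded copy lost a record, changed a date
# format, and truncated a customer name.
source = [
    {"invoice_date": "2024-01-05", "amount": "100.50", "customer": "Northwind Traders"},
    {"invoice_date": "2024-01-06", "amount": "-20.00", "customer": "Contoso"},
    {"invoice_date": "2024-01-07", "amount": "75.25", "customer": "Fabrikam"},
]
loaded = [
    {"invoice_date": "2024-01-05", "amount": "100.50", "customer": "Northwind"},
    {"invoice_date": "1/6/2024", "amount": "-20.00", "customer": "Contoso"},
]

for check, passed in compare_to_source(
        source, loaded, "amount", "invoice_date", "customer").items():
    print(check, "ok" if passed else "MISMATCH")
```

With this sample all four checks report a mismatch; comparing the source with itself passes all four.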
Data integration: when data is sourced externally, it is important to design the data architecture so that the data integrates and is updated/mirrored properly.
Data quality is important. Cleaning data includes:
•removing unnecessary headers
•cleaning leading zeros and nonprintable characters
•formatting negative numbers consistently
•identifying and correcting inconsistencies across the data in general
•addressing inconsistent data types
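A minimal sketch of these cleaning steps in Python, using only the standard library. The helper names, the sample rows, and the negative-number formats handled (parentheses, trailing minus, leading minus) are assumptions for illustration. The sketch also assumes leading zeros are unwanted padding; for codes such as postal codes they carry meaning and should be kept.

```python
def clean_text(value):
    """Drop nonprintable characters and surrounding whitespace."""
    return "".join(ch for ch in value if ch.isprintable()).strip()

def strip_leading_zeros(value):
    """Remove zero padding from an identifier, keeping a lone '0'."""
    return clean_text(value).lstrip("0") or "0"

def normalize_amount(value):
    """Turn '(1,234.50)', '15.00-' or '-3.25' into a negative float."""
    text = clean_text(value).replace(",", "")
    negative = (text.startswith("(") and text.endswith(")")) \
        or text.startswith("-") or text.endswith("-")
    number = float(text.strip("()-"))
    return -number if negative else number

def clean_rows(raw_rows):
    """Remove repeated header rows and give every field one data type."""
    header = raw_rows[0]
    records = [row for row in raw_rows[1:] if row != header]
    return [
        {"id": strip_leading_zeros(row[0]), "amount": normalize_amount(row[1])}
        for row in records
    ]

# Hypothetical raw extract with a repeated header, zero-padded ids,
# a NUL character, and three different negative-number formats.
raw = [
    ["id", "amount"],
    ["00042", "(1,234.50)"],
    ["id", "amount"],
    ["007\x00", "15.00-"],
    ["0100", " -3.25"],
]

cleaned = clean_rows(raw)
for record in cleaned:
    print(record)
```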
Data encryption: encrypt selected data when it is stored and when it is moved.
4. Synthesis
The bridge between preparation and usage. This step is not always necessary, but it may be needed to add to the data you already have so that you can use it for your own purposes.
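One common form of synthesis is deriving new fields from fields you already hold. The sketch below is only an illustration: the function name, the field names, and the 1000 threshold for a "large" order are hypothetical.

```python
def add_derived_fields(order, large_order_threshold=1000):
    """Return a copy of the order with fields derived from existing ones."""
    enriched = dict(order)  # leave the prepared record untouched
    enriched["total"] = round(order["quantity"] * order["unit_price"], 2)
    enriched["is_large_order"] = enriched["total"] >= large_order_threshold
    return enriched

print(add_derived_fields({"quantity": 4, "unit_price": 250.0}))
```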
5. Analytics and usage
The data is ready for practical use in the organization, to create reports and inform decisions. This stage lasts as long as the data remains useful. It focuses only on internal use.
6. Publication
Data prepared for internal users may also be shared with external users. Be careful about what is shared.
7. Archival
Once the need for them declines, data sets are moved from an active system to a passive system.
This frees up storage resources, improves active system performance, and reduces security risk.
Archived data is tested for accuracy and completeness both before and after archiving.
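One way to test completeness before and after archiving is to compare a checksum of the file in the active system with a checksum of the archived copy. The sketch below assumes file-based data and uses SHA-256; the function names, directory layout, and file name are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum of a file's full contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def archive_file(active_path, archive_dir):
    """Copy a file to the archive, verify the copy, then remove the original."""
    active_path = Path(active_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    before = sha256_of(active_path)       # test before archiving
    target = archive_dir / active_path.name
    shutil.copy2(active_path, target)
    after = sha256_of(target)             # test after archiving

    if before != after:
        target.unlink()
        raise ValueError("archived copy does not match the original")
    active_path.unlink()                  # free active storage only once verified
    return target, after

with tempfile.TemporaryDirectory() as tmp:
    active = Path(tmp) / "active" / "sales_2019.csv"
    active.parent.mkdir()
    active.write_text("id,amount\n1,10.00\n")
    archived, digest = archive_file(active, Path(tmp) / "archive")
    print(archived.name, active.exists())  # sales_2019.csv False
```

The original is deleted only after the checksums match, so a failed copy never leaves the data in neither system.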
8. Purging
The data is no longer useful, and no other requirement obliges the business to retain it. Make sure it is completely purged.
Types of Data Collection:
Extract, transform, and load (ETL): data that already exists is extracted from its original source, transformed into useful information, and loaded into the tool used for analysis.
ETL corresponds to the capture, preparation, and synthesis steps, but it is a more specific method: it collects existing data in order to answer a specific data analysis question.
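A minimal ETL sketch in Python, assuming the source is CSV text and the analysis tool is an in-memory SQLite database. The table name, column names, and sample orders are hypothetical; the question being answered here is "what is the total order amount per region?".

```python
import csv
import io
import sqlite3

# Hypothetical source data that already exists elsewhere.
SOURCE_CSV = """order_id,region,amount
1,east,100.00
2,west,250.50
3,east,49.50
"""

def extract(text):
    """Extract: read the existing records from their source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix data types and standardize the region code."""
    return [(int(r["order_id"]), r["region"].upper(), float(r["amount"]))
            for r in rows]

def load(records, connection):
    """Load: write the records into the analysis database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)

totals = connection.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('EAST', 149.5), ('WEST', 250.5)]
```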
Active data collection: new data is gathered directly from users with their knowledge, for example through surveys or forms.
Passive data collection: information is gathered without direct permission from users, for example by tracking web usage via cookies or by recording time stamps of when users interact with a website.
