
Data accumulation is the first stage of an AI project. During this stage, an organization needs to either create or acquire data. One alternative is to partner with other organizations that are willing to share data. The data most useful to an organization is data that accurately represents its own company-specific activities.
Organizations generally have many disparate sources of data. There may even be dark data that an organization does not know exists. Finding, organizing, cleansing and processing all this data can take substantial time. The organization may also not have maintained good data hygiene, resulting in data debt. Such debt refers to the cost of additional rework that needs to be done because of poor data practices of the past. Organizations should stop creating this debt; otherwise, someone will have to pay it in the future. Industry best practices should be used to groom the data garden and maintain good data hygiene regularly.
A disciplined approach is needed to cleanse data, standardize it, and integrate it. Data consistency is vital both for better results and for integration. Systematic techniques need to be used to deal with missing data and outliers. Ideally, an organization needs to get to a point where there is a single source of truth for its data. Currently, that is not the case in many organizations.
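As one illustration of a systematic way to handle missing data and outliers, the sketch below fills gaps with the median of the observed values and flags outliers with the common interquartile-range (IQR) rule. These two techniques, the function names, and the sample figures are assumptions chosen for illustration; the right approach depends on the dataset and the problem.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical monthly order counts with one gap and one suspicious spike.
orders = [120, 135, None, 128, 131, 990, 125]
cleaned = impute_missing(orders)
outliers = flag_outliers(cleaned)

for value, is_outlier in zip(cleaned, outliers):
    print(value, "<- review" if is_outlier else "")
```

Flagging rather than deleting outliers keeps the decision with someone who knows the business context: a spike may be a data entry error or a genuine event.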
AI models require a high volume of structured, high-quality data. Ideally, such data should be available in a form that AI models can ingest automatically. An organization should define the specific data quality attributes it wishes to maintain and set up systems to check them automatically. If an AI model is trained on data that is unclean or of suboptimal quality, its output may not be entirely trustworthy. Because data quality is so important, some large companies are investing heavily in data quality management.
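A minimal sketch of what automated quality checks could look like: each check is a named rule applied to every record, and the report gives the share of records that pass. The field names (`customer_id`, `amount`, `country`), the rules, and the sample records are hypothetical; an organization would substitute the quality attributes it has actually defined.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QualityCheck:
    name: str
    passes: Callable[[dict], bool]


# Hypothetical rules; replace with the organization's own quality definitions.
CHECKS = [
    QualityCheck("customer_id present", lambda r: bool(r.get("customer_id"))),
    QualityCheck(
        "amount is a non-negative number",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
    QualityCheck(
        "country is a 2-letter code",
        lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2,
    ),
]


def run_checks(records, checks=CHECKS):
    """Return a mapping of check name to the fraction of records that pass."""
    total = len(records)
    report = {}
    for check in checks:
        passed = sum(1 for r in records if check.passes(r))
        report[check.name] = passed / total if total else 1.0
    return report


# Hypothetical sample records with a few deliberate defects.
records = [
    {"customer_id": "C001", "amount": 25.0, "country": "CA"},
    {"customer_id": "", "amount": 40.0, "country": "US"},
    {"customer_id": "C003", "amount": -5, "country": "Canada"},
    {"customer_id": "C004", "amount": 12, "country": "GB"},
]
report = run_checks(records)

for name, rate in report.items():
    print(f"{name}: {rate:.0%} pass")
```

Run on a schedule or at ingestion time, a report like this can be compared against agreed thresholds so that data falling below them is caught before it reaches model training.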
During AI model development, it is crucial to focus cleansing and curation efforts on the datasets needed to solve specific problems. These efforts take substantial time and, if not prioritized, can delay AI projects. It is good practice to clarify what is expected of a given dataset before doing any major work with it.
Author: Dr. Jodie Lobana
Image Attribution: Programming Background photo created by kjpargeter – www.freepik.com