Leveraging Talend Data Catalog to Identify the Source of Data Discrepancies

Date: 14-Feb-2020
Location: Kuala Lumpur

Key Takeaways:
  • Shadow IT is any IT system, application or personal productivity tool used inside an organisation without explicit organisational approval. It is not only a security and compliance nightmare; it also creates data sprawl, where each group can build its own data silos.
  • According to a 2016 Cisco customer survey, organisations use 15 to 25 times more services than their IT departments are involved in. Furthermore, the explosion of cloud services is likely to accelerate this trend.
  • To curb shadow IT, it is essential to equip people with modern tools such as Talend Data Catalog so that they do not create uncontrolled copies of data.
  • Data citizens can then process data from source to destination without keeping copies in local storage, unknown or unprotected folders and systems, on-premise storage or uncontrolled cloud-based storage. Such copies are no longer acceptable given the rise of regulations (Basel II, IFRS, GDPR, CCPA, etc.) that mandate companies to take control of their data assets. Companies that fail to do so risk being non-compliant and exposed to significant regulatory fines.
  • Talend Data Catalog helps organisations create a central, governed catalog of enriched data that can be shared and collaborated on easily. It can automatically discover, profile, organise and document an organisation's metadata and make it easily searchable.
  • Imagine that you find inconsistent data that has been created and perpetuated in one of your datasets, and you are asked to identify, explain and correct it. Data lineage will dramatically speed up resolution by helping you spot the right problem in the right place. Moreover, when new datasets arrive in your data lake, establishing data lineage will help you identify these new sources very quickly.

  • The more shadow IT grows, the harder it is for users to access and protect data.
  • IDC estimates that data professionals spend 81% (and waste 24%) of their time searching for, preparing and protecting data before they can actually take advantage of it.
  • When data is not a team sport, everyone spends time creating silos and their own version of the truth, which drives up costs. Decisions are influenced by questionable data, which ultimately puts the organisation at risk.
  • IDC went deeper in a data governance webinar, highlighting how frequently business users rely on spreadsheets as a data integration tool. Data silos start here, as copy/paste is the most frequently used approach to bring data in.

Best Practices:

Data control should take place everywhere: when the data enters the system, along data pipelines, and at data consumption points through apps, APIs or analytics. As more and more data professionals move closer to operations to drive business outcomes "where the action is", there is a growing risk of data fragmentation and misalignment. There is a need for a central organisation that can empower people with data in a governed way while tracking and tracing data flows through data lineage.

But data lineage is not enough: organisations also have to cleanse the data without leaving local files on unsafe data systems.
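
To make the idea of tracing a discrepancy back through data lineage more concrete, here is a minimal sketch in Python. It does not use Talend Data Catalog or its API; the lineage map, the dataset names and the helper functions `trace_upstream` and `root_sources` are all hypothetical, chosen only to illustrate how walking a lineage graph upstream narrows down where inconsistent data could have originated.

```python
from collections import deque

# Hypothetical lineage map (an assumption for illustration, not Talend output):
# each dataset points to the datasets it is directly derived from.
LINEAGE = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["crm_orders", "finance_spreadsheet_copy"],
    "crm_orders": ["crm_system"],
    "finance_spreadsheet_copy": ["erp_system"],
}


def trace_upstream(dataset, lineage):
    """Return every dataset upstream of `dataset`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


def root_sources(dataset, lineage):
    """Return upstream datasets with no recorded parents (original sources)."""
    return [d for d in trace_upstream(dataset, lineage) if not lineage.get(d)]


if __name__ == "__main__":
    # Candidates to inspect when the dashboard shows inconsistent figures.
    print(trace_upstream("sales_dashboard", LINEAGE))
    # The systems where the data originally entered the pipeline.
    print(root_sources("sales_dashboard", LINEAGE))
```

In this sketch, an inconsistency seen in `sales_dashboard` can be checked hop by hop against each upstream dataset, and an uncontrolled copy such as the hypothetical `finance_spreadsheet_copy` shows up explicitly in the path instead of staying hidden in a silo.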

Managing data along the data pipelines and at data consumption
Photo: Paul Hanaoka (Unsplash)

Editor's comments:
  • Another subset of shadow IT is the consumerisation of IT: the use of consumer IT tools within an organisation to solve specific work-related issues without formal approval from the organisation's top management. For example, an HR department might use a cloud-based job posting website for a recruitment drive.
  • Think of Talend Data Catalog as a centralised data repository manager that not only catalogs and discovers data automatically, but also identifies the source of data that may have been duplicated across multiple documents, for checks and balances.