Data Engineering Q&A with Ranjan Bhattacharya, EQengineered's Chief Data Officer
EQengineered conducted a data engineering Q&A with Ranjan Bhattacharya, our Chief Data Officer. Here are his responses to a series of questions.
1. What is Data Engineering and why is it important?
AI and ML are becoming increasingly mainstream and being adopted into core business processes. Organizations have built Data Science teams, staffed with statisticians and mathematicians to explore, experiment, and build the right models and algorithms appropriate for the problems at hand.
However, developing ML models requires access to massive volumes of data of various kinds: structured data, such as that from traditional databases and flat files, and unstructured data, such as free-form text, images, audio, and video. The data may arrive in batches or in real time, from both internal and external sources.
Also, these ML models do not exist in isolation. They have to be deployed to production environments, given access to the right data, monitored for performance, versioned, and, if necessary, rolled back.
Organizations have realized that to make ML models production-ready, in addition to the data scientists, they need people to focus on the data pipeline (data acquisition, selection, cleansing, and versioning) and the ML pipeline (model versioning, deployment, scaling, and monitoring).
These tasks and processes do not fall cleanly under existing technology roles such as DBAs, who traditionally dealt with data management. Nor do they fall under the roles of application developers, back-end engineers, or deployment and DevOps engineers.
It has been clear for some time that we need a new role, that of a Data Engineer, whose responsibility is to move and process data in a scalable and reliable way.
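To make the data pipeline steps mentioned above a little more concrete, here is a minimal sketch in Python of two of them: cleansing and versioning. It is purely illustrative and not something described in the interview. The function names `cleanse` and `dataset_version`, the sample records, and the choice of a content hash as the version identifier are all assumptions made for this example.

```python
import hashlib
import json


def cleanse(records):
    """Drop records with missing values and normalize string fields.

    Hypothetical cleansing rule for illustration: a record is rejected
    if any field is None or an empty string.
    """
    cleaned = []
    for rec in records:
        if any(v is None or v == "" for v in rec.values()):
            continue
        cleaned.append(
            {k: v.strip().lower() if isinstance(v, str) else v
             for k, v in rec.items()}
        )
    return cleaned


def dataset_version(records):
    """Return a short content hash to use as a dataset version identifier.

    Identical data always yields the same identifier, so a model can be
    traced back to the exact data it was trained on.
    """
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical raw input, as it might arrive from an acquisition step.
raw = [
    {"customer": " Alice ", "spend": 120.0},
    {"customer": "", "spend": 75.5},
    {"customer": "BOB", "spend": None},
    {"customer": "Carol", "spend": 42.0},
]

clean = cleanse(raw)
version = dataset_version(clean)
print(len(clean), "records kept, dataset version", version)
```

In practice these steps would more likely be handled by dedicated pipeline and data versioning tools, but the idea is the same: every cleansed dataset gets a reproducible identifier that can be recorded alongside the model trained on it.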
2. What gets you excited about the current state and direction of data engineering?
Business problems being tackled by ML are becoming increasingly complex, leading to the need to implement, monitor and manage more and more advanced data and ML pipelines. It is exciting to see how software companies and cloud providers have risen to the challenge of creating tools for automating the building of ML pipelines.
It must be said, however, that this is a fast-changing and relatively immature field. A wide variety of tools is available, and an organization should be ready to spend some time evaluating and selecting the right tools for its needs.
It is also exciting to see that a set of best practices is taking shape and being adopted by practitioners.
The combination of these two trends, increased tooling and the adoption of best practices, will in my view help make data engineering more mainstream.
3. For enterprise organizations evaluating the quality, flexibility, and usefulness of their data and data engineering practices, what advice would you share as they consider their go-forward direction?
There is a well-known saying in the IT industry: “garbage in, garbage out.” Unless your data is reliable and clean, and your data practices are mature, building ML models will be a fool’s errand.
An organization that wishes to incorporate ML-based analytics into its business processes needs to assess the maturity of its data and analytics operating model: how clean and reliable its data is, whether it has data governance practices in place, whether it has defined appropriate data ownership, how mature its technical organization is, and so on.
My advice to an organization would be to undertake a maturity assessment of its data practice, identify the gaps, and only then start building a data engineering and data science practice.
It is often quicker to bring in an experienced consulting practice to accelerate this process.
4. Where do you believe data engineering is going?
In the beginning, data engineering meant making data ready for data science by acquiring, formatting, and cleansing it, with some automation. These days the role of the data engineer has evolved to also include infrastructure and DevOps-type activities: setting up automated model deployment pipelines, model monitoring, and so on. Today a data engineer can own the management and organization of massive volumes of data, scaling its storage, transmission, and processing.
Data engineers increasingly act as a bridge between the disciplines of software engineering and data science. They should be well-rounded in programming, scripting, and setting up and scaling infrastructure, and be knowledgeable about the tools and techniques available in the industry.
5. What are some of the pitfalls to avoid when enterprise organizations advance their data engineering practices?
The scope and responsibilities of a data engineering team change depending on the industry. It is not a one-size-fits-all kind of role. For technology companies, the role may be about dealing with scale. In industries like healthcare, the primary concern is dealing with compliance and auditability needs. For marketing and advertising companies, you need someone who is familiar with off-the-shelf tools. For complex industries like financial services, you may be looking for a generalist who is keen to learn new things and comfortable with complexity.
In my opinion, the key pitfall an organization should avoid is hiring talent whose capabilities do not align with the industry-specific responsibilities of the role. The responsibilities also vary with company size. Smaller companies can hire a generalist developer adept in Python, SQL, and other programming languages, who can quickly learn new tools and technologies. Enterprise companies may need to hire for different specialties: data warehousing, enterprise BI tools, architecture, and ops. Companies dealing with massive real-time data will need to look for an altogether different skill set, one suited to schema-on-read data.
The other pitfall to avoid is a misaligned organizational structure. A lot of companies separate the data engineering team from the data science team. A data science team is focused on building the right algorithm. It is not thinking about all the ancillary tasks, like monitoring, scaling, and recovery, that are necessary to take the model to production. When teams work separately, it is hard for them to communicate, collaborate, and gain a big-picture understanding of the projects they are working on.
6. How do the skills of data engineers and data scientists compare?
Data engineers focus on making data available and processing it in production environments. They should have knowledge of technologies like databases, data warehouses, ETL tools, Hadoop, streaming tools, and the various big data offerings from cloud providers, as well as programming ability in languages like SQL, Java, and Python. They should also have some ops knowledge, covering monitoring and scaling. In some cases they may even need to rewrite a model developed by a data science team to target a different framework like Spark, or hardware like GPUs, to improve scalability and performance.
Data scientists, on the other hand, focus on exploring and developing algorithms in languages like R and Python and, most importantly, possess analytical skills in areas like machine learning, statistics, and visualization.
You can compare the relationship between these two roles to the relationship between designers and front-end developers. One comes up with the ideas, and the other builds them. It helps if each role learns a subset of the other’s skills. Data engineers not only need to have a basic idea of data science, but also need to be capable of rewriting a model for a different framework and platform. Similarly, data scientists should understand infrastructure, scalability, and the considerations for production-readiness.
7. What makes an enterprise organization ready for ML and AI? How can an organization prepare and mature its data engineering practices in preparation for predictive and prescriptive analytics?
Companies trying to explore the techniques of ML and AI to take advantage of the capabilities hidden in their data should start thinking at an organizational level and start identifying key operational gaps in areas such as data collection and management, data hygiene, data governance, security, compliance, and ethics.
For an organization to become truly data-driven, and to speak the language of analytics in its day-to-day operations, the entire organization must commit to the journey, adopt an agile mindset, and bridge the gap between technology and business.
8. As a technologist, why have you focused on data engineering? What is most interesting to you about the data engineering space?
As more and more companies realize the applicability of machine learning to their specific industries, the role and skills of a data engineer are evolving and becoming more central to an organization’s data journey.
Our view of data engineering has evolved like this: initially it dealt with batch-oriented, structured data, with only a few systems integrated; next we came to integrate unstructured and even real-time data with many systems; and now we are talking about concepts like self-serve, DataOps, and data mesh.
The practice of data engineering will continue to mature over the coming years, and with it will come an increase in what companies can build and accomplish with their own data.
There’s a lot left to do, and with that a lot of opportunity. I can’t wait to see how the field evolves in the coming days.
9. What are your thoughts on ethics in AI? How do we police the ethical use of AI?
The ability to create machines that can think, act, and learn independently of human intervention has fueled a serious discussion of what is right, and what is enough, or too much.
Some of the key concerns are around areas like privacy and surveillance, behavior manipulation, bias, and the opacity of decision-making by AI systems. More broadly, there are questions around how to take social, moral, and ethical values into account in AI-based decision-making.
Ethical guidelines have been developed by the likes of Microsoft, Google, Apple, and others to ensure that transparent, principled, and ethical considerations around human dignity, rights, freedoms, and cultural diversity are subscribed to, but much work remains to be done. Where the line eventually gets drawn is ultimately our collective responsibility. But the important thing is to think about these concerns at an organizational level and not just leave them to the technologists.