As AI systems become increasingly integral to our daily lives, the demand for skilled people to build and work with these systems will only grow. Historically, data scientists were essential to building and managing AI systems. But as AI systems become easier to use and more accessible, are data scientists still essential to making AI work for most organizations?
AI systems are data-driven, so it’s important to know how to leverage data to deliver results. Typically, data scientists are tasked with developing models that transform large amounts of data into insights and patterns. These insights support a wide range of activities, from descriptive and diagnostic analytics to advanced machine learning models, across all seven patterns of AI.
For all the relevant skills they bring, data scientists are expensive and hard to find. The speed at which organizations want to implement and leverage AI capabilities far outpaces the market’s ability to supply skilled, experienced data scientists.
Using and Creating AI Models

When we think about the skills needed today and in the future, we must first distinguish between building AI models from scratch and simply using models that have already been developed. The power of generative AI systems and large language models (LLMs) has proven that easy access to AI capabilities is within everyone’s reach and can produce spectacular results.
You don’t need to be a data scientist to get the most out of LLM systems. Users will increasingly find AI capabilities embedded in their everyday tools and applications. So, simply using and leveraging AI systems doesn’t require data science skills.
Instead, organizations will need to develop their prompt engineering skills to take advantage of ready-made LLM systems. Effective prompt engineering relies more on soft skills than hard skills: you don’t need math, programming, or statistical analysis skills to be a good prompt engineer. What prompt engineering does require is knowing the right prompting approaches for different situations, along with strong critical thinking, creativity, collaboration, and communication skills. These liberal-arts-based skills are more widely available, less expensive, and easier to cultivate from an organization’s existing people than data science skills.
Fine-Tuning and RAG: New Skills

But what if you want to go further? While publicly available models perform well for general needs, they fall short on private data, on domain- and context-specific requirements, and on tasks outside what generative AI models were designed for. Of course, these public models are getting better every day, so the range of tasks that generative AI systems can perform keeps growing. Still, private and domain-specific needs remain, and meeting them requires more advanced skills beyond prompt engineering and the general skills that go with it. Even so, the bar is not as high as for machine learning engineering or data science.
When we want to adapt a generally trained machine learning model to give more domain-specific responses, we use an approach called fine-tuning. Fine-tuning involves collecting many examples of specific prompts and responses, then feeding those examples to the LLM through its API. For example, to fine-tune OpenAI’s GPT models on your own datasets, you would collect sample datasets and then use fairly basic Python scripts to submit those datasets to the OpenAI API, producing a custom, fine-tuned model.
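As a rough sketch of what that preparation looks like, the snippet below assembles hypothetical prompt/response pairs into the JSONL format that fine-tuning endpoints typically expect. The example data, file name, and chat-message schema are all illustrative assumptions; check the current OpenAI documentation for the exact format and API calls.

```python
import json

# Hypothetical prompt/response examples in a chat-style format
# (schema assumed; verify against the provider's current docs).
examples = [
    {"messages": [
        {"role": "user", "content": "What is our return policy?"},
        {"role": "assistant", "content": "Items may be returned within 30 days of purchase."},
    ]},
    {"messages": [
        {"role": "user", "content": "Do you ship internationally?"},
        {"role": "assistant", "content": "Yes, we ship to over 40 countries."},
    ]},
]

def to_jsonl(records):
    """Serialize training examples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

jsonl_data = to_jsonl(examples)
with open("training_data.jsonl", "w") as f:
    f.write(jsonl_data)

# From here, a script would upload the file and start a fine-tuning job
# via the provider's API, e.g. (OpenAI SDK, illustrative only):
#   client = openai.OpenAI()
#   f = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
#   client.fine_tuning.jobs.create(training_file=f.id, model="gpt-3.5-turbo")
```

The point is that the heavy lifting is data collection and formatting, not model mathematics — the API calls themselves are a few lines of scripting.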
If we want LLMs to work with proprietary or custom data, we can use the retrieval-augmented generation (RAG) approach. With RAG, we store our custom data in a database indexed using the same vector-embedding approach that is the cornerstone of LLMs. Then, when a user submits a query, the system first retrieves the relevant information from the database based on that query, and then asks the LLM to answer the user’s query using the retrieved data supplied as part of the prompt’s context. The skills required to build a RAG system are primarily programming skills, to coordinate between the LLM and the database, and data skills, to collect and process the data that goes into the RAG database.
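The retrieve-then-prompt flow can be sketched in a few lines. The toy example below uses word-count vectors and cosine similarity as a stand-in for real learned embeddings, and a plain list as a stand-in for a vector database; the documents and query are invented for illustration.

```python
from collections import Counter
import math

# Toy in-memory "vector database" of proprietary documents.
documents = [
    "Our warranty covers manufacturing defects for two years.",
    "The quarterly sales report is due on the first Monday of each month.",
    "Employees accrue 1.5 vacation days per month of service.",
]

def embed(text):
    """Stand-in embedding: a word-count vector (real systems use learned embeddings)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    """Assemble retrieved context plus the user's question into one LLM prompt."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How many vacation days do employees get?", documents)
print(prompt)  # this prompt would then be sent to the LLM
```

Note that nothing here requires data science: the work is retrieval plumbing (programming) plus curating what goes into the database (data skills), exactly the two skill sets described above.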
While data scientists can be helpful in the fine-tuning and RAG development process, they are not as necessary as they would be if you were developing a machine learning model from scratch. Since so many tasks can be accomplished through prompt engineering, fine-tuning, and RAG, the set of tasks that truly requires data scientists and machine learning engineers continues to shrink.
Will Data Engineers Get Their Due?

Data scientists remain essential to the continued development and advancement of AI, particularly in building and maintaining core models and in performing the full range of tasks that data scientists handle outside of AI. However, the common thread uniting advanced model development, prompt engineering, fine-tuning, and RAG development is the need for good-quality, relevant data.
While data science and data scientists have been in the spotlight over the past decade, it’s becoming clear that data engineering deserves greater visibility. Data engineering is all about making data available for AI and analytics. Data engineers move data, ensure its consistency and cleanliness, and manage the pipelines that keep data flowing to systems that depend on a steady stream of good data. From this perspective, data engineers are arguably even more critical to AI projects than data scientists. Perhaps the data engineer will be the most important hire of the next decade for putting AI into practice in the organization?