Databricks LLM in Data Engineering: Testing DBRX vs GPT-3.5

Introduction: In recent years, the field of data engineering has witnessed a significant transformation, largely driven by advancements in artificial intelligence and machine learning. One of the key contributors to this evolution is the emergence of powerful language models, such LLM’s. These models, built upon deep learning architectures, have demonstrated remarkable capabilities in natural language understanding and generation. In this Blog, we'll explore how data engineers can leverage LLM models to enhance various aspects of their workflows in generating insights.

DBRX is a transformer-based decoder-only LLM that was trained using next-token prediction. DBRX is having 132B  parameters, 36B parameters out of which are active on any input, with my intial testing experince I am pretty impressed with DBRX as I utilized it to parse unstructured data efficiently. With its natural language processing capabilities, this model can extract relevant information from text-heavy datasets, such as QnA,social media feeds, customer reviews, or news articles.

Refer Databricks page for more details on DBRX

But as per Databricks it surpasses GPT-3.5, and it is competitive with Gemini 1.0 Pro. I try to give it a quick comparison with GPT-3.5, DBRX is clear winner as you can see in below screenshots DBRX has parsed the unstructured data more efficiently when same set of prompts given to both LLM.

Response from GPT-3.5 : Having no characteristics like taste, color, But I am pretty sure this can be improved with good prompting.


Response from DBRX : Having characteristics like taste, color and winning the race



Recently, I've conducted experiments with leveraging the DBRX model to perform a natural language data processing task within a Databricks notebook. This was achieved by creating a Python-based UDF function. Detailed steps are mentioned below :

Step 1:  Create python function that can interact with DBRX Instruct, here is the code of making function that will do NLP and  provide humanly answers. 

NOTE: Exact same function can also be made using Open AI , so it will be more what user want to use, but the comparison above can help you choosing wisely.


Step 2: Registering the function as UDF, so that we can use it in our DataFrame


Below is the sample DataFrame that have one column having certain questions


Step 3: Using UDF, to create new column “answers” which will hold responses from DBRX Model in DataFrame


But if you need to use AI directly in SQL fashion and do not want UDF, Databricks has also provided ai_query SQL function that can be used in Databricks SQL doing similar kind of work for which we are creating the UDF. Unfortunately, this function cannot be used in Notebooks as of now and is only limited to in Databricks SQL.  Also you can request AI Functions Public Preview Enrollment if you want to experiment it.

Result: BINGO!!  DBRX has done great job in answering all the questions, this opens new possibilities for innovation. By incorporating these models into their toolkit, data engineers can unlock deeper insights, streamline processes, and drive greater value from their data assets.


However, it's essential to recognize the ethical considerations and potential biases associated with AI-driven solutions, ensuring responsible and equitable deployment in data engineering tasks. Gen AI aims to replicate the breadth and depth of human intelligence across a wide range of domains. While the advancements in artificial intelligence have been extraordinary, there are certain tasks that still require the human touch but combination of human and AI represents a powerful synergy that leverages the unique strengths of both entities to achieve unprecedented outcomes.

Comments

Popular Posts