By Kevin Rogers | Director, Data & Analytics
With the boom of attention on Generative AI, a variety of tools have emerged to help organizations “talk to their data” through natural language processing and code generation, a capability known as text-to-SQL or NL2SQL. It can be applied at many points in the data & analytics lifecycle – from data transformation, data quality, and analytics development to ad-hoc insights for business users. Harnessed correctly, text-to-SQL can deliver significant benefits: time and cost savings, speed and efficiency improvements, greater data availability for business users, and backlog reduction for data & analytics and report development teams. Junior developers can also increase their output, and business users with even minimal SQL skills can gain access to a new world of data insights.
While these tools continue to improve rapidly in performance, accuracy, and reliability, it is still good practice to keep a human in the loop to verify the output, particularly when first deploying them. They should be viewed as assistants that can generate simple queries or provide a starting point for more complex ones, rather than be relied upon for near-perfect accuracy without oversight. Like traditional SQL-based querying, these tools benefit greatly from well-structured data repositories with robust metadata, so the technology should not be viewed as a shortcut around general data quality standards and best practices.
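To make the assistant-with-oversight pattern concrete, the following is a minimal Python sketch of how a text-to-SQL call is typically assembled and gated: schema metadata goes into the prompt (this is where robust metadata pays off), and generated SQL is only auto-executed if it passes a read-only check, otherwise routing to a human reviewer. The schema, `build_prompt`, and `is_read_only` are hypothetical illustrations, and the model call itself is omitted – substitute whichever LLM API your platform provides.

```python
# Minimal sketch of schema-aware prompt assembly plus a safety gate before
# execution. All names here are illustrative, not from any specific tool.

SCHEMA_DDL = """
CREATE TABLE orders (order_id INT, customer_id INT, amount DECIMAL(10,2), ordered_at DATE);
CREATE TABLE customers (customer_id INT, name TEXT, region TEXT);
"""

def build_prompt(question: str, schema_ddl: str) -> str:
    # Supplying table/column metadata is what lets the model write valid SQL;
    # richer metadata (comments, sample values) generally improves accuracy.
    return (
        "You are a SQL assistant. Using only the tables below, "
        "answer with a single SQL query.\n"
        f"Schema:\n{schema_ddl}\n"
        f"Question: {question}\nSQL:"
    )

def is_read_only(sql: str) -> bool:
    # Human-in-the-loop gate: auto-run only single read-only statements;
    # anything else goes to a reviewer before execution.
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return False  # multiple statements always require review
    return stripped.split(None, 1)[0].upper() in {"SELECT", "WITH"}

prompt = build_prompt("Total order amount by region?", SCHEMA_DDL)
```

The gate is deliberately conservative: a `DROP` or multi-statement response is never executed automatically, regardless of how plausible the model's output looks.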
For an organization looking to gain these capabilities, Trace3’s recommended approach is to evaluate and run a proof of concept (POC) with technologies that may already be deployed in the organization, leveraging existing investments before deciding to buy an off-the-shelf product or custom-configure a solution. To this end, we’ve laid out a few categories of text-to-SQL offerings: Data Platform Integrated Solutions, Analytics Integrated Solutions, and Standalone Solutions.
Data Platform Integrated Solutions
Trace3 has determined data platform-based text-to-SQL solutions to be a smart choice for several reasons:
They are easy to POC and experiment upon.
The source data is generally already centralized and available.
They are optimized for the specific data platform against which they’ll be writing SQL.
They can leverage the metadata maintained within the platforms for better results.
Security such as RBAC can be leveraged in-platform and enforced for text-to-SQL tools.
Existing queries can be improved where they live rather than after export.
They often don’t have any additional cost beyond existing platform costs.
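As an illustration of the in-platform security point above, generated SQL should execute under the requesting user’s own permissions so the engine itself enforces access control, rather than the text-to-SQL tool. The sketch below uses SQLite’s authorizer purely as a stand-in for platform RBAC – Databricks and Snowflake enforce this natively through the connected user’s role – and the table names are hypothetical.

```python
import sqlite3

# Stand-in for platform RBAC: an authorizer mimicking a user's grants, so
# disallowed tables are blocked by the engine, not by the text-to-SQL tool.
def authorizer_for(allowed_tables):
    def check(action, table, column, db_name, trigger):
        if action == sqlite3.SQLITE_READ and table not in allowed_tables:
            return sqlite3.SQLITE_DENY  # engine refuses the read
        return sqlite3.SQLITE_OK
    return check

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(amount REAL);
    CREATE TABLE salaries(amount REAL);
    INSERT INTO sales VALUES (100.0);
    INSERT INTO salaries VALUES (90000.0);
""")
# This "user" may read sales but not salaries.
conn.set_authorizer(authorizer_for({"sales"}))

rows = conn.execute("SELECT amount FROM sales").fetchall()  # permitted

blocked = False
try:
    conn.execute("SELECT amount FROM salaries")  # generated SQL overreaches
except sqlite3.DatabaseError:
    blocked = True  # denied at the database layer, not in application code
```

Because the denial happens at query-preparation time inside the engine, even a model that hallucinates access to a sensitive table cannot leak its contents.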
Databricks
Databricks has released the Databricks Assistant, which can help with a variety of activities such as automating code commenting and using natural language to generate SQL.
The following article covers techniques to improve text-to-SQL performance that are applicable to any tooling, not just Databricks: Improving Text2SQL Performance with Ease on Databricks | Databricks Blog
Snowflake
Cortex is Snowflake’s suite of AI features, which includes Snowflake Copilot. Among other features, Copilot has strong natural language processing abilities for SQL generation. Notably, it uses Mistral Large in combination with Snowflake’s proprietary SQL generation model to achieve better performance and accuracy.
Analytics Integrated Solutions
Depending upon your acceleration use-case candidates, you may look at different Copilots within the Microsoft ecosystem for different points within the data lifecycle. The current set of Copilot offerings (subject to change and potential further renaming) all include some variety of text-to-code generation:
Power BI
If your organization is currently a Power BI Premium capacity customer, it is worth exploring Power BI Copilot (also known as Fabric Copilot for Power BI) to see whether it helps accelerate your analytics workflows. It allows business users and report developers to use natural language to generate visualizations and insights; code is automatically generated and executed against the curated datasets and semantic models already published to Power BI users. Trace3 has several customers experiencing success with this tool, particularly in boosting the efforts of departmental Power BI super users generating self-service analytics.
Standalone Solutions
There are countless standalone, off-the-shelf GenAI solutions available that can provide text-to-SQL capability, often in conjunction with other assistant features. These tools are often built to cater to specific industries or use cases, sometimes with fine-tuned AI models that augment foundational LLMs with industry-specific context. They may also tout direct data connectors to source systems commonly used by organizations in that industry. However, it is wise to proceed with caution when considering an investment in this category, as some tools may simply repackage open-source technologies with little intellectual property of their own. The following questions and considerations should be explored when evaluating this category of tools:
What are the differentiators of the solution that make it more effective for my specific industry or use case?
What LLM is utilized by the tool? Can you select from a variety of models or “bring your own”?
Is the tool fine-tuned for typical industry use cases?
Is there a single required user interface, or can the technology be embedded in existing tools to fit seamlessly into workflows?
Does the tool connect to all the source data locations my organization currently requires?
Does the tool have general connector options for future sources?
What is the approach for regression testing results when new software versions or updated models are rolled out?
What is the architecture of the tool from a security perspective?
Does it satisfy data privacy and compliance requirements, such as GDPR, for highly regulated industries?
Is customer data made available to the provider for uses such as model training?
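On the regression-testing question above, one common approach is a golden-query suite: a fixed set of prompts whose generated SQL is re-run against a stable test dataset after every tool or model upgrade, with results diffed against stored expected outputs. A minimal, hypothetical harness (names and schema are illustrative) might look like:

```python
import sqlite3

# Hypothetical golden-query regression harness.
GOLDEN = {
    "total revenue": [(300.0,)],  # expected result captured from a verified run
}

def regress(conn, generated_sql_by_question):
    """Return the questions whose regenerated SQL no longer matches golden."""
    failures = []
    for question, sql in generated_sql_by_question.items():
        if conn.execute(sql).fetchall() != GOLDEN.get(question):
            failures.append(question)
    return failures

# Stable test dataset the suite always runs against.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(amount REAL);
    INSERT INTO orders VALUES (100.0);
    INSERT INTO orders VALUES (200.0);
""")

# SQL produced by the *new* tool/model version for the same question;
# an empty failures list means the upgrade is safe for this suite.
failures = regress(conn, {"total revenue": "SELECT SUM(amount) FROM orders"})
```

Comparing result sets rather than SQL text matters: a new model version may legitimately generate different-looking SQL that still returns the correct answer.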
On the Horizon
Keep an eye out for developments with the newly coined “GenSQL,” which may eventually lead to a sea change in the approach to SQL altogether. For example, the following recent article from MIT describes the approach:
GenSQL involves querying a probabilistic model alongside the dataset rather than only the dataset in isolation. Compared to existing neural network approaches, researchers found “GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries in a few milliseconds while providing more accurate results.”