
BYOM: Harness the Power of Generative AI in the Enterprise (Part II)

By Lars Hegenberg | Trace3 Innovation | May 31, 2024

Please note that this is a continuation of Part 1 of the blog series BYOM: Harness the Power of Generative AI in the Enterprise.

With the rapid proliferation of Generative AI (Gen AI) and an increasingly confusing market landscape filled with “Copilots”, organizations are looking for ways to strengthen control and governance over the usage of Gen AI and the data that is shared with these tools. This has shifted the focus towards safer deployment options that reduce dependencies and avoid sharing data with third parties, while maximizing performance for custom business use cases. Enter “Bring Your Own Model” (BYOM), an approach where organizations deploy a dedicated instance of a large language model in their own environment. This two-part blog delves into the technical life cycle of BYOM – from pre-deployment to post-deployment – including strategic choices that need to be made along the way, such as data and model types, infrastructure, security, and version control.

 

Where can a Large Language Model be Deployed?

Public Cloud

When opting for safe deployment in a public cloud, organizations can choose between a dedicated instance of either a proprietary or an open-source LLM. The specific model options differ greatly between cloud service providers: Microsoft’s Azure OpenAI offering mainly enables the private deployment of OpenAI’s GPT models, while AWS lets you deploy models from a variety of providers such as Cohere, AI21 Labs, or Stability AI through its Bedrock platform.
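
To make this concrete, the snippet below is a minimal sketch of calling a model hosted through Bedrock using the AWS SDK for Python (boto3). The model ID and request fields are illustrative assumptions; each model family on Bedrock defines its own payload schema, so the provider’s documentation should be consulted.

```python
# Minimal sketch: invoking a model hosted through AWS Bedrock.
# The model ID and request/response fields are illustrative assumptions --
# each model family on Bedrock defines its own payload schema.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

model_id = "cohere.command-text-v14"  # example Bedrock model ID (assumption)
request_body = {
    "prompt": "Draft a one-paragraph summary of our data retention policy.",
    "max_tokens": 200,
    "temperature": 0.3,
}

response = bedrock.invoke_model(
    modelId=model_id,
    body=json.dumps(request_body),
    contentType="application/json",
    accept="application/json",
)

payload = json.loads(response["body"].read())
print(payload)  # exact response schema varies by model family
```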

The main advantage of going with a public cloud provider is that it allows you to offload most of the infrastructure management and scaling. Whether it’s compute in the form of GPU-powered instances, networking, or storage, a CSP can take away many of the intricacies of managing the infrastructure layer. It also ensures users always have access to the latest infrastructure technology (including compute), which is of high value considering the blistering pace at which the AI infrastructure space is moving. Other benefits include a quick time to market and a lower upfront investment, as well as easier version management post-deployment.

Some of the downsides of a public cloud deployment include higher latency compared to a deployment on-premises or at the edge, the risk of vendor lock-in, and increased privacy concerns. While a private deployment drastically increases the level of security, organizations with highly sensitive data should still take into account where that data is stored, how it is processed, and how the model is trained, to minimize any data privacy concerns.

 

Figure 1: This ranking is based on a multitude of factors and ranges from 0 (least desirable) to 5 (most desirable). Example: the more resource-intensive a deployment, the less desirable it is.

On-Premises

When going for a local deployment, organizations can either build a model from scratch or choose from one of the many open-source models widely available through hubs such as Hugging Face. A local deployment is a way to maximize security and control across the deployment lifecycle, as all the relevant infrastructure, data, and workloads are kept in-house. A local deployment can also address specific infrastructure requirements: depending on the planned deployment, very specific hardware may be necessary that is not obtainable from a cloud provider, such as GPU types that are not widely available, or certain memory, storage, or networking requirements.
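
As an illustration, the sketch below loads an open-source model from Hugging Face and runs inference on local hardware using the transformers library. The model name and generation settings are placeholder assumptions; a production on-prem deployment would typically sit behind a dedicated serving layer.

```python
# Minimal sketch: running an open-source model pulled from Hugging Face on
# local hardware. Model name and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example open-source model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # spread layers across available devices
)

prompt = "Summarize our Q2 incident review in three bullet points."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```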

In the future, many security- and privacy-conscious industries will also consider edge deployments, enabling training and inference of models on edge servers or devices for improved computational efficiency and data security. To date, however, most generative AI applications are not well suited to edge hardware. Currently, LLMs mostly extend the value of existing edge applications by adding new content discovery, search, and analytics capabilities in the cloud (via hybrid edge-cloud deployments).

The downsides of a local deployment are, first and foremost, the high fixed costs and resource requirements. This includes physical infrastructure such as accelerated computing silicon (e.g. GPUs), networking components including routers and switches, as well as storage systems that can handle large volumes of data and provide high-speed data access. With the enormous investments flowing into infrastructure technology and the rapid development we have seen over the last few months, there is also a high risk of infrastructure quickly becoming obsolete or outdated.

Physical infrastructure will also have to be complemented by software that can help orchestrate and manage the stack, as well as help with the fine-tuning and overall operationalization of LLMs. This includes ML frameworks, distributed computing frameworks, database management systems, and model serving & deployment tooling, among others. Last but not least, this requires significant talent across the different disciplines, not just during deployment, but also for ongoing maintenance.

Figure 2: This ranking is based on a multitude of factors and ranges from 0 (least desirable) to 5 (most desirable). Example: the more resource-intensive a deployment, the less desirable it is.

Specialty Cloud Provider

Over the past one to two years, a multitude of smaller cloud service providers have emerged that build out their own accelerated hardware infrastructure (compute, network, storage) optimized for AI workloads. These specialist CSPs often focus on specific niche markets or have moved further into software, first with development tools and then with a fully packaged, end-to-end software stack for training, fine-tuning, and running LLMs. The main advantages are usually around cost and compute efficiencies: specialized AI clouds offer alternative compute delivery models such as containers or batch jobs that can handle individual tasks without incurring the start-up and tear-down cost of an instance. Perhaps surprisingly, specialty CSPs have shown better GPU availability than some of the larger cloud providers. Based on hourly on-demand pricing, these startups sometimes offer 50%-70% cost savings on GPU hours for advanced NVIDIA A100s, along with unique access to the latest H100 chips. These platforms also allow you to host a wide range of open-source models of different sizes or to train your own, preventing any model provider lock-in.

Overall, what makes specialty cloud providers unique is their ability to differentiate themselves based on AI chip availability, accelerated compute, local presence, LLMOps tooling, multicloud support, and support for multiple types of legacy hardware. However, this means that functionality is not the same across these providers, and organizations need to carefully select a vendor that fits their specific use cases. Because these providers are still young, only a few have a proven track record when it comes to functionality and security.

 

Figure 3: This ranking is based on a multitude of factors and ranges from 0 (least desirable) to 5 (most desirable). Example: the more resource-intensive a deployment, the less desirable it is.

Factors Influencing the Deployment Location Decision
1.  Data Gravity

Data gravity refers to the tendency of data in one location to attract additional datasets and applications to that location, exerting a “gravitational” pull. Depending on how that data is accessed, it can increase overall costs through additional data management, movement, and processing needs. Gen AI has added another layer of complexity to data gravity: not only does it create more data to work with, but it also attracts more AI applications that rely on the data and insights from models. This means the location of a Gen AI deployment should be tightly coupled to an organization’s overall data strategy. For example, if a lot of data resides on-premises, the networking cost of uploading that data to the cloud will be higher, which may argue for bringing the compute closer to the data (i.e. the data center or edge). If most data is already in the cloud, it might be more efficient to deploy Gen AI models in the cloud. Specifically, if an organization is already leveraging one of the larger CSPs such as Microsoft Azure or AWS, taking advantage of their offerings could offer a very low barrier to entry. It is important to note that the location for training models and the location for using (inferencing) them can differ. Hence, the first decision is where the enterprise will train its AI models, and where it will run the resulting models.

2.  Privacy & Control

The method of Gen AI deployment greatly affects the level of control an organization has over the development and usage of different models, as well as how securely sensitive data is processed. When it comes to model performance, managing physical infrastructure can become a source of competitive advantage, as it allows for fine-grained control over training and inference. This may be required to achieve certain capabilities and model behavior, and can reduce marginal cost at large scale. An on-prem deployment can also be advantageous if there is very specific hardware that cannot be obtained from any of the cloud providers, such as GPU types that are not widely available, or unusual memory, storage, or networking requirements.

If organizations deal with a lot of sensitive data and fear their data will be used by third parties, they will be more likely to deploy Gen AI in a data center for model training and at the edge for model inferencing. While the other deployment methods covered in this blog also offer enhanced security features, on-premises remains the most secure, as sensitive information stays within the organization’s physical boundaries. It also helps ensure compliance with industry-specific regulations by keeping sensitive data and language model processing under direct control.

3.  Scale & Resource Availability

Before deploying Generative AI models, organizations need to engage in careful planning around the scale of their Gen AI deployment and usage, as well as which resources are readily available versus which need to be acquired.

An on-prem deployment requires a big up-front investment in hardware, making it more capex-heavy. The scale and specifications of this hardware depend on the model’s characteristics, including its size, the complexity of the tasks it performs, and the need for fine-tuning and inferencing. Economically, on-prem makes sense if organizations operate at a very large scale and expect high-volume inference workloads. Ideally, an organization choosing on-prem has a track record of managing physical infrastructure and has the financial and human resources necessary to manage such efforts. If not, a larger on-prem deployment may be out of reach.

Where usage cannot easily be predicted or is limited, a usage-based pricing model (e.g. specialty cloud) may be more appropriate, as this gives teams the flexibility to scale up and down as needed. Cloud providers also enable faster time to value and require less up-front cost, while allowing organizations to offload much of the infrastructure management. Finally, a cloud deployment guarantees access to the latest AI hardware and software innovations, ensuring consistent best-in-class performance.

Figure 4: This ranking is based on a multitude of factors and ranges from 0 (least desirable) to 5 (most desirable). Example: the more resource-intensive a deployment, the less desirable it is.

 
Post-Deployment Considerations

Version Management

With the rapid improvements in AI infrastructure and new training techniques, new and improved models are launched almost daily that outperform previous ones. To stay up to date and gain access to the latest innovations, organizations need to design their processes for optionality where possible. Generally, when exchanging or updating the models that existing systems and applications rely on, there are several requirements to ensure consistent performance:

  1. Assessing compatibility: Review release notes and documentation to understand if there are any breaking changes in the new version that could impact your existing applications. Things like model architecture, input/output formats, APIs, etc. may have changed.

  2. Testing functionality: Run thorough tests on your applications with the new model, ensuring the expected functionality works as intended. Check for changes in performance, accuracy, output formats, etc., and fix any integration issues (see the test sketch after this list).

  3. Evaluating compute requirements: Larger models or model updates may have higher memory, GPU and other computational requirements. If applicable, ensure your local hardware can support running the new model efficiently.

  4. Migrating data: If the model requires any data stored, like embeddings or dictionaries, plan data migration so the new model has what it needs to work optimally.
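
As referenced in step 2, below is a minimal sketch of a golden-prompt regression check that could run before applications are cut over to a new model version. The generate() helper, prompts, and checks are hypothetical placeholders for whatever serving stack and test cases an organization has in place.

```python
# Minimal sketch of a golden-prompt regression check before a model swap.
# `generate(model_version, prompt)` is a hypothetical helper wrapping the
# serving stack; prompts, required terms, and thresholds are examples only.
GOLDEN_PROMPTS = [
    ("Reset-password request", "How do I reset my VPN password?", ["password", "reset"]),
    ("Policy lookup", "Summarize the data retention policy.", ["retention"]),
]

def regression_check(generate, old_version: str, new_version: str) -> bool:
    passed = True
    for name, prompt, required_terms in GOLDEN_PROMPTS:
        old_out = generate(old_version, prompt)
        new_out = generate(new_version, prompt)
        # Basic sanity checks: non-empty output and required keywords present.
        if not new_out.strip():
            print(f"[FAIL] {name}: empty output from {new_version}")
            passed = False
        missing = [t for t in required_terms if t.lower() not in new_out.lower()]
        if missing:
            print(f"[FAIL] {name}: missing expected terms {missing}")
            passed = False
        # Flag large changes in output length for human review.
        if old_out and abs(len(new_out) - len(old_out)) > 0.5 * len(old_out):
            print(f"[WARN] {name}: output length changed significantly")
    return passed
```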

Overall, model versioning for proprietary models such as Azure OpenAI’s GPT models is more straightforward. Once available, a new model version can be deployed at the flick of a switch, and teams can freeze or take snapshots of a model, protecting it from potential disruptions caused by automatic upgrades or version changes. While applications simply have to be pointed at the new endpoint, additional fine-tuning may be necessary through the provider’s portal.

Open-source models are more heterogeneous, with a wide range of models available from different providers with different characteristics (domain, model type, size, training data, etc.). Not only is it very important to study the technical details, but deploying these locally will also require additional MLOps tooling for loading, training, and managing the different model versions. As for model integrations, enterprises should design their applications in a way that switching between models requires little more than an API change.
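
One way to keep that switch lightweight is a thin abstraction layer that applications code against, with the concrete model resolved from configuration. The sketch below is illustrative only; the class names, registry keys, and model identifiers are hypothetical.

```python
# Minimal sketch of a model abstraction layer so applications can switch
# models via configuration rather than code changes. Names are hypothetical.
from typing import Protocol

class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class AzureOpenAIModel:
    def __init__(self, deployment: str):
        self.deployment = deployment
    def complete(self, prompt: str) -> str:
        # Call the Azure OpenAI endpoint for `self.deployment` here.
        raise NotImplementedError

class LocalHFModel:
    def __init__(self, model_id: str):
        self.model_id = model_id
    def complete(self, prompt: str) -> str:
        # Run inference against a locally served Hugging Face model here.
        raise NotImplementedError

MODEL_REGISTRY = {
    "gpt-4-prod": lambda: AzureOpenAIModel(deployment="gpt-4-prod"),
    "mistral-7b-local": lambda: LocalHFModel(model_id="mistralai/Mistral-7B-Instruct-v0.2"),
}

def get_model(name: str) -> TextModel:
    """Resolve the configured model; applications depend only on TextModel."""
    return MODEL_REGISTRY[name]()
```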

Figure 5: Example Process for Version Management of Open-Source Models.

Model Monitoring

LLMs can exhibit biases, generate harmful or inappropriate content, or produce inconsistent outputs. It's crucial to continuously monitor the model's performance, outputs, and potential biases, and have processes in place to evaluate and mitigate any issues that arise. As new data becomes available or domain-specific requirements change, LLMs may also need to be updated or fine-tuned to maintain their performance and relevance. This involves retraining the model on new data or adapting it to specific tasks or domains. Finally, when integrated with existing enterprise systems, applications or workflows, organizations need to engage in ongoing development and maintenance of APIs, connectors, and user interfaces to ensure a smooth user experience.
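
As a starting point, the sketch below logs every completion together with a few lightweight checks (empty output, policy terms, latency) that could feed dashboards or alerts. The thresholds and blocked-term list are placeholder assumptions; mature deployments would layer dedicated evaluation and bias-detection tooling on top.

```python
# Minimal sketch of post-deployment output monitoring: log each completion
# with lightweight quality and policy checks. Thresholds and the blocked-term
# list are illustrative placeholders.
import json
import time

BLOCKED_TERMS = ["ssn", "credit card number"]  # placeholder policy list
MAX_LATENCY_SECONDS = 5.0

def monitor_completion(prompt: str, output: str, latency: float,
                       log_path: str = "llm_monitor.jsonl") -> dict:
    record = {
        "timestamp": time.time(),
        "prompt_chars": len(prompt),
        "output_chars": len(output),
        "latency_s": round(latency, 3),
        "empty_output": not output.strip(),
        "policy_flag": any(term in output.lower() for term in BLOCKED_TERMS),
        "slow_response": latency > MAX_LATENCY_SECONDS,
    }
    # Append one JSON line per completion for downstream dashboards/alerts.
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```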

Responsible AI Practices

Ensuring compliance with relevant data privacy, security, and regulatory requirements is an ongoing task when deploying LLMs, requiring organizations to move from a reactive compliance strategy to the proactive development of mature responsible AI capabilities. Responsible AI frameworks should cover principles & governance, risk, policy & control, as well as technology and culture, and should apply across the entire development life cycle.

While security at the data and model level is very important during deployment, continuous governance at the prompt and application layer is necessary as well. As such, securing prompts from malicious manipulation and injection attacks, and minimizing privacy risks from user inputs, should be of the highest priority. Prompt design should be continuously monitored and evaluated, and outputs optimized for soundness, fairness, explainability, and robustness. Governance should also extend to detecting the usage of shadow AI, where unsanctioned applications or models are used by employees.
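
For illustration, the sketch below shows a naive prompt-layer guardrail that rejects obvious injection patterns and redacts email addresses before a prompt reaches the model. The patterns are assumptions for demonstration; production deployments typically rely on purpose-built guardrail tooling rather than simple regexes.

```python
# Minimal sketch of a prompt-layer guardrail: reject obvious injection
# patterns and redact simple PII before a prompt reaches the model.
# Patterns are illustrative only, not a complete defense.
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"reveal (the )?system prompt",
]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def screen_prompt(prompt: str) -> str:
    lowered = prompt.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            raise ValueError("Prompt rejected: possible injection attempt")
    # Redact email addresses to reduce accidental PII exposure.
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", prompt)
```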

Finally, consider developing ethical guidelines that minimize AI safety issues, discourage abuse or misuse of AI models, and monitor and guard model robustness across the development and use of applications. Security capabilities may have to be augmented by third-party security tooling.

 
Conclusion

Each deployment method comes with its own set of pros & cons and there are many factors influencing deployment decisions. Some enterprises may opt for a hybrid approach, using external providers for certain inference needs while building in-house solutions for mission-critical or high-volume use cases.

Decision-makers should consider the security and technical requirements at each stage of the value chain, from pre-deployment to post-deployment. This includes complex model integrations with organizations’ existing systems and workflows, as well as the new applications that will be built on top of them and the ongoing maintenance of both. Ultimately, however, the method of deployment comes down to an organization’s use cases and individual preferences, which have to be defined in conjunction with an overall (data) strategy for the modern AI era.

Lars is an Innovation Researcher on Trace3's Innovation Team, where he is focused on demystifying emerging trends & technologies across the enterprise IT space. By vetting innovative solutions, and combining insights from leading research and the world's most successful venture capital firms, Lars helps IT leaders navigate through an ever-changing technology landscape.