
Tuesday, June 17, 2025

Llama3.1 LLM, RAG and production deployment PT 1



As a backend and Ops engineer, I was very curious about AI this year, because in my mind it was something only for PhDs, scientists, and big companies working on cutting-edge technologies.


I found that this is partially true: only big companies and specialized teams can develop foundational models (Gemini, Llama, DeepSeek), because it requires a lot of human resources and money.



On the other hand, fortunately, there are open-source AI models like Llama: a foundational model that required tons of training hours and compute power, yet is free to the world.


So you can build your own AI platform. You read that right: you don't need to pay for ChatGPT, Claude, etc., or share your personal information with them, but you will have limited resources; it's all about your budget.



Of course, this is not for everyone; you would want to have your own AI platform because you are a company, a researcher, an engineer, etc.


To simplify, I would say there are 3 levels:






  • Top level, AI consumers: people who use AI for assistance and as a source of information, paying for subscriptions.
  • Middle level, AI builders/MLOps: private companies, engineers, etc. who deploy their own AI systems on top of foundational models.
  • Bottom level, AI researchers: big companies and AI researchers who create the algorithms and training methodologies for new foundational models.


In that context I’m going to put myself in the middle level where I want to use a foundational model for my research needs.


AI World


  • LLM/text processing
  • Image processing
  • Video processing
  • Audio processing


LLMs, in my opinion, are the most popular and the "easiest" to start with, because there is currently a lot of information and there are plenty of platforms ready to work with them.


LLM - Foundational models


  • Llama
  • Deepseek
  • Mistral


Llama



Llama AI models are open source and are the ones with the most documentation and community. Don't forget that Llama was one of the first open models, and DeepSeek, Mistral, and the others were probably born thanks to it.

It's true that it is currently not the most accurate or fastest model, but again, it has a lot of documentation.

Another advantage is the PyTorch framework, which makes it easy to run Llama code on NVIDIA video cards. It's super intuitive and well documented for using CUDA.
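A quick sanity check that PyTorch actually sees the card (a minimal sketch, assuming torch is already installed in your Python environment):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import torch; print(torch.cuda.get_device_name(0))"   # only meaningful if the previous line printed True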


Hardware



A key component of working with AI is the hardware; understanding the requirements, costs, and power consumption is very important.


While you can start in Google Colab, AWS Bedrock, etc. and use cloud resources and GPUs, it's better to understand the foundations first.

We picked Llama3.1 8B-Instruct, so we are focused on NVIDIA hardware and software for local environments.



Llama3.1 Requirements:

https://github.com/meta-llama/llama3?tab=readme-ov-file#inference



Model              Nodes/Video cards   vRAM (FP16)
8B / 8B-Instruct   1                   16 GB
70B                8                   140 GB
405B               8++                 810 GB
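A rough way to sanity-check those numbers (my own rule of thumb, not from Meta's table): FP16 weights take about 2 bytes per parameter, and the KV cache and activations come on top of that.

python3 -c "print(round(8e9 * 2 / 1024**3, 1), 'GiB')"   # ~14.9 GiB just for the 8B model's weights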



Local AI System


  • Linux Ubuntu 24.04

  • GeForce 4070 Ti SUPER 16 GB

  • Nvidia T400 4 GB

  • Nvidia Tesla K80 24 GB

  • Intel i9-14900K

  • 64 GB RAM
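To double check that all the cards are detected (this assumes the NVIDIA driver is already installed so nvidia-smi is available):

nvidia-smi -L            # GPUs visible to the driver
lspci | grep -i nvidia   # GPUs visible on the PCI bus, even without a loaded driver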






 

Drivers



Driver   Video card
5xx      GeForce 4070 Ti SUPER
5xx      Nvidia T400
4xx      Nvidia Tesla K80
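On Ubuntu you can check which driver branch is recommended for each detected card and which driver is actually loaded (assuming the ubuntu-drivers-common package is installed):

ubuntu-drivers devices                                    # recommended driver per detected GPU
nvidia-smi --query-gpu=name,driver_version --format=csv   # driver currently in use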



Findings:


  • The K80 can't be combined with the 4070, so forget about 24 GB + 16 GB; the drivers are not compatible.

  • The Nvidia K80 only runs on Ubuntu 18.x.

  • Running torchrun needs the full 16 GB free, so gdm must be off and you work from a tty only (see the commands below).

  • The T400 and the 4070 can be combined, running gdm on the 4 GB card and the full Llama model on the 4070's 16 GB.
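These are the commands I mean (a sketch for Ubuntu 24.04, assuming GDM is the display manager):

# Switch to a text console first (e.g. Ctrl+Alt+F3), then stop the desktop so its vRAM is released
sudo systemctl stop gdm          # on some installs the unit is gdm3; "sudo systemctl stop display-manager" also works
# Optional: boot straight into text mode on the next reboot
sudo systemctl set-default multi-user.target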







Download and Run Llama3.1



At this point we already have 16 GB free for Llama and are ready to run examples on our local AI system.
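To confirm the 16 GB really is free before launching, a simple nvidia-smi query is enough:

nvidia-smi --query-gpu=name,memory.free,memory.total --format=csv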

torchrun --nproc_per_node 1 example_chat_completion.py \
  --ckpt_dir /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/ \
  --tokenizer_path /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/tokenizer.model \
  --max_seq_len 1200 --max_batch_size 1






Common errors are:

  • CUDA out of memory

  • Bad configuration for max_seq_len

  • Bad configuration for max_batch_size




Power Consumption


Idle: 111 W

torchrun: 366 W






Nvidia Tool for monitoring


nvidia-smi
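These are the two ways I look at it while the model is running (both are standard nvidia-smi usage):

watch -n 1 nvidia-smi                                                           # refresh the full view every second
nvidia-smi --query-gpu=power.draw,memory.used,memory.total --format=csv -l 1   # or log just power and memory once per second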








After getting a successful run of torchrun with Llama3.1 8B-Instruct, it's time to optimize and to understand max_seq_len and max_batch_size:

  • max_seq_len: the maximum sequence length in tokens (prompt plus generated response)

  • max_batch_size: how many prompts can be processed concurrently in one batch




Play around with a chat example prompt and watch the power consumption and memory in nvidia-smi.
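In practice that means re-running the same command with different values and watching nvidia-smi from another tty; the values below are just an example to probe the limit, not a recommendation:

torchrun --nproc_per_node 1 example_chat_completion.py \
  --ckpt_dir /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/ \
  --tokenizer_path /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/tokenizer.model \
  --max_seq_len 2048 --max_batch_size 2   # too big for a 16 GB card; it fails with CUDA out of memory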



For my system:


Max configurations allowed

  • max_seq_len: 1200

  • max_batch_size: 1


So, to allow bigger values and configurations, more vRAM is what you need.


At this point, having a foundational model running locally is a big achievement, and while it may look easy to get started, it's not; along the way there are many other issues to fix.



Next steps



Let’s say this is our development environment. We need to add features and then deploy them to a production environment.



It's true that we should expect better results in a production environment because of better resources, but the dev environment helps us understand all the little details needed to run it.



Part #2


  • API to receive prompts and return responses

  • Chat interface

  • RAG, to give the foundational model access to our own data and get answers from it




The goal for the first feature is to get answers from our own documents or database; that's the RAG (Retrieval-Augmented Generation) part.


Tools


  • Llama3.1 8B-Instruct

  • PyTorch

  • Vector storage system (TBD)




Part #3



Deploy to production: we are going to start with AWS Bedrock, since AWS is the cloud platform I have the most experience with.


While we might think it's the last step, it's not; this is the most critical part, since each project has its own needs, like how many users are going to use it and how many of them concurrently, in terms of max_batch_size processing.


Because it's the first time setting up this system, I don't expect the best results from the first deployment, so it will be a continuous improvement of configurations and resources.



The production deployment covers the API and the PyTorch/Llama3 processing, but also the automation for the RAG: we need something similar to Continuous Integration, but for data ingestion and training.


Conclusion


I started with zero AI knowledge and, to my surprise, I ended up running Llama3.1 locally, so at this point there is a light showing that you don't have to be a scientist to build with foundational models.

The truth is that, for now, this research is just for fun.


Links


paperweight


Nvidia Tesla K80






Tuesday, April 9, 2024

AWS IAM Identity Center and Datadog SSO

When you join a company it's very common to get access to many applications like Slack, Sentry, Datadog, Cloudflare, etc. But have you thought about it? How is that possible? How do you manage the increasing number of users across a whole ecosystem of applications and services?

Single Sign-On, also known as SSO, allows users to access multiple applications by signing in with a single existing account, and SAML is an XML-based open standard that can power an SSO implementation. It consists of two parts: the SAML identity provider (IdP) and the SAML service provider (SP).

IdP examples:

- Jumpcloud

- Okta

- AWS IAM Identity Center


SP examples:

- Datadog

- Slack

- Sentry

- AWS Resources


Whether you are starting to grow a team or looking for more IdP options, here we are going to show an integration using AWS IAM Identity Center as IdP and DataDog as SP.

Why AWS IAM Identity Center?

One important concern, in my opinion, is saving money, so we are looking for the best-priced solutions. Here comes AWS IAM Identity Center, a service that makes it easy to centrally manage access to multiple AWS accounts and business applications.





AWS IAM Identity Center is available to you at no additional cost. IAM Identity Center APIs are available in all regions supported by IAM Identity Center.

So, since we are already running servers on AWS, IAM Identity Center is free, and it supports many application integrations, we decided to go for it. Datadog is the first service we want to integrate for new team members, and since there is no clear information on the internet about this integration, we are going to share it step by step.


Setup


  1. You have an administrator account/root on AWS

  2. Log in to your AWS console

  3. Go to IAM Identity Center

  4. Enable IAM Identity Center and choose Enable with AWS Organizations


  5. Go to Applications and then click Add Application

  6. Select “I want to select an application from the catalog” and search for Datadog


  7. Download IAM Identity Center SAML metadata file and keep the window open for configurations


  8. You have an administrator account on Datadog

  9. Log in to your Datadog dashboard

  10. In the Datadog app, hover over your username in the bottom left corner and select Organization Settings. Select Login Methods and click on Configure under SAML.


  11. Upload the IdP metadata file we downloaded from IAM Identity Center by clicking the Choose File button. After choosing the file, click Upload File.


  12. After uploading the IdP metadata, return to the Login Methods page and turn SAML on by default

  13. Now copy the Single Sign-on URL from Datadog and paste it in the AWS IAM Identity Center window under Application Start URL


  14. Then go back to Datadog Service Provider Details and download the Service Provider Metadata file named sp_metadata.xml


  15. Also add your domain to the Just In Time Provisioning 


  16. Go back to AWS IAM Identity Center and, under Application metadata, select Upload application SAML metadata file

  17. Upload sp_metadata.xml from Datadog and submit


  18. Verify your Datadog application is active on AWS


  19. Click on your Datadog application, find the Actions button, and select Edit attribute mappings


  20. Verify you have givenName and sn; you are probably missing eduPersonPrincipalName. Add it, then save the changes.

    It's very important to map the attributes correctly, since each Service Provider has its own attributes and most of the time this is the source of authentication errors.
    You can also see these attributes in the sp_metadata.xml from Datadog.


  21. After that, the IdP and the SP are configured and we need to assign users

  22. Go to Users and Add User on IAM Identity Center

  23. Create a user with your domain email, name, etc.


  24. After creating the user, they will receive an invitation email from AWS IAM Identity Center


  25. Accept the invitation from the email

  26. Then the user will need to create a password and set up MFA

  27. At this moment the invited user is logged in to the AWS Access Portal, but without applications

  28. Now, back in the AWS administrator account, go to Applications and find Datadog

  29. Click on Assign Users & Groups, select the user we just created, and assign

  30. Back in the new user's AWS Access Portal, they should see the Datadog application


  31. Click on the Datadog application; the user should now have access to Datadog



Did you know that your RoR application can support SAML client authorization? You just need https://github.com/SAML-Toolkits/ruby-saml, but that is another story.


Conclusion


Each Service Provider has its own caveats, whether it's an optional start URL, attribute mappings, certificates, etc. Here we set up a Datadog integration that is not very clear in the DD documentation. FYI, we did not cover IAM Identity Center groups or Datadog role mapping. I hope this helps anyone start using SSO, or discover a new IdP like AWS IAM Identity Center, to manage users and access more easily and in a secure manner... you can find me on X as @torukmnk. Cheers!
