
Tuesday, June 17, 2025

Llama3.1 LLM, RAG and production deployment PT 1



As a backend and Ops engineer I was very curious about AI this year, because in my mind it was something only for PhDs, scientists, and big companies working on cutting-edge technologies.


I found that it's partially true, because only big companies and specialized teams can develop foundational models (Gemini, Llama, DeepSeek). It requires a lot of human resources and money.



On the other hand, fortunately there are open source AI models like Llama: a foundational model that required tons of training hours and power, and is free to the world.


So you can build your own AI platform. You read that right: you don't need to pay ChatGPT, Claude, etc. or share your personal information with them. But with limited resources, it's all about your budget.



Of course this is not for everyone; you would want your own AI platform because you are a company, researcher, engineer, etc.


To simplify it, I would say there are 3 levels:






  • Top level, AI consumers: use AI for assistance and as a source of information through paid subscriptions.
  • Middle level, AI builders/MLOps: private companies, engineers, etc. deploying their own AI systems on top of foundational models.
  • Bottom level, AI researchers: big companies and AI researchers creating the algorithms and training methodologies for new foundational models.


In that context I’m going to put myself in the middle level where I want to use a foundational model for my research needs.


AI World


  • LLM/text processing
  • Image processing
  • Video processing
  • Audio processing


In my opinion LLMs are the most popular and the "easiest" to start with, because currently there is a lot of information and there are many platforms ready to work with them.


LLM - Foundation models


  • Llama
  • Deepseek
  • Mistral


Llama



Llama AI models are open source and the ones with the most documentation and community. Don't forget it was one of the first open models, and DeepSeek, Mistral, and others were probably born thanks to Llama.

It's true that currently it is not the most accurate or fastest model but again it has a lot of documentation.

Another advantage is the PyTorch framework, which makes it easy to run Llama code on NVIDIA video cards. It's intuitive and well documented for using CUDA.
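
Before downloading any model it's worth confirming that PyTorch actually sees the card. A minimal sanity check, assuming PyTorch was installed with CUDA support:

import torch

# Confirm that this PyTorch build can talk to the NVIDIA driver before trying Llama.
print(torch.__version__, torch.version.cuda)   # PyTorch version and the CUDA version it was built against
print(torch.cuda.is_available())               # True only if the driver and CUDA runtime line up
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))       # e.g. the GeForce 4070 TI SUPER

If is_available() returns False, fix the driver/CUDA installation before going any further; torchrun will not work without it.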


Hardware



A key component when working with AI is the hardware; understanding the requirements, costs, and power consumption is very important.


While you can start in Google Colab, AWS Bedrock, etc. to use cloud resources and GPUs, it's better to understand the foundations first.

Because we picked Llama3.1 8B-Instruct, we are focused on NVIDIA hardware and software for local environments.



Llama3.1 Requirements:

https://github.com/meta-llama/llama3?tab=readme-ov-file#inference



Model              Nodes/Video Cards    vRAM (FP16)
8B / 8B-Instruct   1                    16 GB
70B                8                    140 GB
405B               8++                  810 GB



Local AI System


  • Linux Ubuntu 24.04

  • GeForce 4070 TI SUPER 16gb

  • Nvidia T400 4gb

  • Nvidia Tesla K80 24gb

  • Intel i9 14900k

  • 64GB RAM






 

Drivers



Driver    Video Card
5xx       GeForce 4070 TI SUPER
5xx       Nvidia T400
4xx       Nvidia Tesla K80



Findings:


  • Can't combine the K80 with the 4070, so forget about 24GB + 16GB; the drivers are not compatible

  • The Nvidia K80 only runs on Ubuntu 18.x

  • Running torchrun needs the full 16GB free, so GDM has to be off and you work from a TTY only

  • The T400 and the 4070 can be combined: GDM runs on the 4GB card and Llama gets the full 16GB of the 4070 (see the sketch after this list for picking the right device)
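
A quick way to confirm which card is which before launching anything is to list the visible CUDA devices and how much memory is actually free on each (a small sketch, assuming PyTorch is installed):

import torch

# Print every visible CUDA device with its free/total memory, so the big card
# (the 4070 with 16GB) is kept for Llama while the T400 keeps driving the desktop.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(i, torch.cuda.get_device_name(i),
          f"{free / 2**30:.1f} / {total / 2**30:.1f} GiB free")

Setting CUDA_VISIBLE_DEVICES to the index of the 4070 in front of the torchrun command pins the run to that card.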







Download and Run Llama3.1



At this point we already have 16GB free for Llama and are ready to run the examples on our local AI system.

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/ \
    --tokenizer_path /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/tokenizer.model \
    --max_seq_len 1200 --max_batch_size 1


[Screenshot #1]

[Screenshot #2]




Common errors are:

  • CUDA out of memory

  • Bad configuration for max_seq_len

  • Bad configuration for max_batch_size




Power Consumption


Idle: 111 W

torchrun: 366 W






Nvidia Tool for monitoring


nvidia-smi
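
Besides running it interactively, nvidia-smi can be polled from a small script to log power and memory while torchrun is busy. A sketch, assuming nvidia-smi is on the PATH (it ships with the driver):

import subprocess, time

# Log GPU name, power draw, memory use and utilization every 2 seconds.
QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,name,power.draw,memory.used,utilization.gpu",
         "--format=csv,noheader"]
while True:
    print(subprocess.run(QUERY, capture_output=True, text=True).stdout.strip())
    time.sleep(2)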








After getting a successful run of torchrun with Llama3.1 8B-Instruct, it's time to optimize and understand max_seq_len and max_batch_size:

  • max_seq_len: the maximum number of tokens in prompt plus response

  • max_batch_size: how many prompts are processed concurrently (see the estimate after this list)
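
To get an intuition for why 1200 and 1 end up being the ceiling on a 16GB card, a rough back-of-the-envelope estimate helps. This is only a sketch, assuming the published Llama3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) and FP16 everywhere:

# Rough memory budget for Llama3.1 8B-Instruct in FP16 (assumed architecture values).
params = 8e9
weights_gb = params * 2 / 2**30                              # ~15 GB just for the weights

n_layers, n_kv_heads, head_dim = 32, 8, 128                  # Llama3.1 8B config
max_seq_len, max_batch_size = 1200, 1
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * max_seq_len * max_batch_size * 2
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_bytes / 2**20:.0f} MB")

The weights alone leave very little headroom on 16GB, and the KV cache grows linearly with both max_seq_len and max_batch_size, which is why raising either one quickly ends in CUDA out of memory.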




Play around with a chat example prompt and look at the nvidia-smi power consumption and memory.



For my system:


Max configurations allowed

  • max_seq_len: 1200

  • max_batch_size: 1


So, to allow bigger numbers and configurations, more vRAM is what you need.


At this point, having a foundational model running locally is a big achievement. While it could appear easy to get started, it's not; along the way there are many other issues to fix.



Next steps



Let’s say this is our development environment. We need to add features and then deploy them to a production environment.



It's true that we should expect better results in production environments because of better resources, but the dev environment helps us understand all the little details needed to run it.



Part #2


  • API to receive prompts and return responses

  • Chat interface

  • RAG to feed our own data to the FM and get answers from it




The goal for the first feature is to get answers from our own documents or database; that's the job of RAG (Retrieval-Augmented Generation). A quick sketch of the flow follows the tools list below.


Tools


  • Llama3.1 8B-Instruct

  • PyTorch

  • Vector storage system (TBD)
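
To make the idea concrete, here is a minimal sketch of the RAG flow we want in Part 2: embed our documents, retrieve the one closest to the question, and let Llama answer from that context. Everything here is a placeholder (the crude hashing "embedding", the generate callback, the in-memory index), since the real embedding model and vector storage system are still TBD:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Crude stand-in for a real embedding model: hash words into a fixed-size unit vector.
    v = np.zeros(256)
    for w in text.lower().split():
        v[hash(w) % 256] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

documents = ["our first internal document ...", "our second internal document ..."]
index = [(doc, embed(doc)) for doc in documents]   # this is what the vector store will hold

def answer(question: str, generate) -> str:
    q = embed(question)
    best, _ = max(index, key=lambda item: float(q @ item[1]))   # highest cosine similarity
    prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}"
    return generate(prompt)   # e.g. a call into the Llama3.1 chat completion example

The vector storage system will replace the in-memory index, but the retrieve-then-generate shape stays the same.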




Part #3



Deploy to production: we are going to start with AWS Bedrock, since it's the cloud platform I have the most experience with.


While we could think this is the last step, it's not. This is the most critical part, since each project has its own needs, like how many users are going to use it and how many concurrently, in terms of max_batch_size processing.


Because it's the first time setting up this system, I don't expect the best results from the first deployment, so it will be a process of continuous improvement, tuning configs and resources.



The production deployment covers the API and the PyTorch/Llama3 processing, but also the automation for the RAG: we need something similar to Continuous Integration, but for feeding our data into the system.


Conclusion


I started with zero AI knowledge, and to my surprise I ended up running Llama3.1 locally, so at this point there is a light showing that you don't have to be a scientist to build with foundational models.

The truth is that, for now, this research is just for fun.


Links


  • paperweight

  • Nvidia Tesla K80