
Tuesday, June 17, 2025

Llama3.1 LLM, RAG and production deployment PT 1



As a backend and Ops engineer I was very curious about AI this year, because in my mind it was something only for PhDs, scientists, and big companies working on cutting-edge technologies.


I found that it's partially true, because only big companies and specialized teams can develop foundational models (Gemini, Llama, DeepSeek). It requires a lot of human resources and money.



On the other hand, fortunately there are open source AI models like Llama: a foundational model that required tons of training hours and power, and is free to the world.


So you can build your own AI platform. You read that right: you don't need to pay ChatGPT, Claude, etc. or share your personal information with them. But with limited resources, it's all about your budget.



Of course this is not for everyone; you would want your own AI platform because you are a company, researcher, engineer, etc.


To simplify it, I would say there are 3 levels:






  • Top level, AI consumers: use AI for assistance and as a source of information through paid subscriptions.
  • Middle level, AI builders/MLOps: private companies, engineers, etc. deploying their own AI systems on top of foundational models.
  • Bottom level, AI researchers: big companies and AI researchers creating the algorithms and training methodologies for new foundational models.


In that context I’m going to put myself in the middle level where I want to use a foundational model for my research needs.


AI World


  • LLM/text processing
  • Image processing
  • Video processing
  • Audio processing


In my opinion LLMs are the most popular and the "easiest" to start with, because currently there is a lot of information and there are many platforms ready to work with them.


LLM - Foundation models


  • Llama
  • Deepseek
  • Mistral


Llama



Llama AI models are open source and the ones with the most documentation and community. Don't forget it was one of the first open models, and DeepSeek, Mistral, and others were probably born thanks to Llama.

It's true that currently it is not the most accurate or fastest model but again it has a lot of documentation.

Another advantage is the PyTorch framework, which makes it easy to run Llama code on NVIDIA video cards. It's intuitive and well documented for using CUDA.
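
Before downloading any model it's worth confirming that PyTorch actually sees the card. A minimal sanity check, assuming PyTorch was installed with CUDA support:

import torch

# Confirm that this PyTorch build can talk to the NVIDIA driver before trying Llama.
print(torch.__version__, torch.version.cuda)   # PyTorch version and the CUDA version it was built against
print(torch.cuda.is_available())               # True only if the driver and CUDA runtime line up
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))       # e.g. the GeForce 4070 TI SUPER

If is_available() returns False, fix the driver/CUDA installation before going any further; torchrun will not work without it.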


Hardware



A key component when working with AI is the hardware; understanding the requirements, costs, and power consumption is very important.


While you can start in Google Colab, AWS Bedrock, etc. to use cloud resources and GPUs, it's better to understand the foundations first.

Because we picked Llama3.1 8B-Instruct, we are focused on NVIDIA hardware and software for local environments.



Llama3.1 Requirements:

https://github.com/meta-llama/llama3?tab=readme-ov-file#inference



Model              Nodes/Video Cards    vRAM (FP16)
8B / 8B-Instruct   1                    16 GB
70B                8                    140 GB
405B               8++                  810 GB



Local AI System


  • Linux Ubuntu 24.04

  • GeForce 4070 TI SUPER 16gb

  • Nvidia T400 4gb

  • Nvidia Tesla K80 24gb

  • Intel i9 14900k

  • 64GB RAM






 

Drivers



Driver    Video Card
5xx       GeForce 4070 TI SUPER
5xx       Nvidia T400
4xx       Nvidia Tesla K80



Findings:


  • Can't combine the K80 with the 4070, so forget about 24GB + 16GB; the drivers are not compatible

  • The Nvidia K80 only runs on Ubuntu 18.x

  • Running torchrun needs the full 16GB free, so GDM has to be off and you work from a TTY only

  • The T400 and the 4070 can be combined: GDM runs on the 4GB card and Llama gets the full 16GB of the 4070 (see the sketch after this list for picking the right device)
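
A quick way to confirm which card is which before launching anything is to list the visible CUDA devices and how much memory is actually free on each (a small sketch, assuming PyTorch is installed):

import torch

# Print every visible CUDA device with its free/total memory, so the big card
# (the 4070 with 16GB) is kept for Llama while the T400 keeps driving the desktop.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(i, torch.cuda.get_device_name(i),
          f"{free / 2**30:.1f} / {total / 2**30:.1f} GiB free")

Setting CUDA_VISIBLE_DEVICES to the index of the 4070 in front of the torchrun command pins the run to that card.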







Download and Run Llama3.1



At this point we already have 16GB free for Llama and are ready to run the examples on our local AI system.

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/ \
    --tokenizer_path /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/tokenizer.model \
    --max_seq_len 1200 --max_batch_size 1


[Screenshot #1]

[Screenshot #2]




Common errors are:

  • CUDA out of memory

  • Bad configuration for max_seq_len

  • Bad configuration for max_batch_size




Power Consumption


Idle: 111 W

torchrun: 366 W






Nvidia Tool for monitoring


nvidia-smi
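
Besides running it interactively, nvidia-smi can be polled from a small script to log power and memory while torchrun is busy. A sketch, assuming nvidia-smi is on the PATH (it ships with the driver):

import subprocess, time

# Log GPU name, power draw, memory use and utilization every 2 seconds.
QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,name,power.draw,memory.used,utilization.gpu",
         "--format=csv,noheader"]
while True:
    print(subprocess.run(QUERY, capture_output=True, text=True).stdout.strip())
    time.sleep(2)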








After getting a successful run of torchrun with Llama3.1 8B-Instruct, it's time to optimize and understand max_seq_len and max_batch_size:

  • max_seq_len: the maximum number of tokens in prompt plus response

  • max_batch_size: how many prompts are processed concurrently (see the estimate after this list)
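
To get an intuition for why 1200 and 1 end up being the ceiling on a 16GB card, a rough back-of-the-envelope estimate helps. This is only a sketch, assuming the published Llama3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) and FP16 everywhere:

# Rough memory budget for Llama3.1 8B-Instruct in FP16 (assumed architecture values).
params = 8e9
weights_gb = params * 2 / 2**30                              # ~15 GB just for the weights

n_layers, n_kv_heads, head_dim = 32, 8, 128                  # Llama3.1 8B config
max_seq_len, max_batch_size = 1200, 1
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * max_seq_len * max_batch_size * 2
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_bytes / 2**20:.0f} MB")

The weights alone leave very little headroom on 16GB, and the KV cache grows linearly with both max_seq_len and max_batch_size, which is why raising either one quickly ends in CUDA out of memory.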




Play around with a chat example prompt and look at the nvidia-smi power consumption and memory.



For my system:


Max configurations allowed

  • max_seq_len: 1200

  • max_batch_size: 1


So, to allow bigger numbers and configurations, more vRAM is what you need.


At this point, having a foundational model running locally is a big achievement. While it could appear easy to get started, it's not; along the way there are many other issues to fix.



Next steps



Let’s say this is our development environment. We need to add features and then deploy them to a production environment.



It's true that we should expect better results in production environments because of better resources, but the dev environment helps us understand all the little details needed to run it.



Part #2


  • API to receive prompts and return responses

  • Chat interface

  • RAG to feed our own data to the FM and get answers from it




The goal for the first feature is to get answers from our own documents or database; that's the job of RAG (Retrieval-Augmented Generation). A quick sketch of the flow follows the tools list below.


Tools


  • Llama3.1 8B-Instruct

  • PyTorch

  • Vector storage system (TBD)
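
To make the idea concrete, here is a minimal sketch of the RAG flow we want in Part 2: embed our documents, retrieve the one closest to the question, and let Llama answer from that context. Everything here is a placeholder (the crude hashing "embedding", the generate callback, the in-memory index), since the real embedding model and vector storage system are still TBD:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Crude stand-in for a real embedding model: hash words into a fixed-size unit vector.
    v = np.zeros(256)
    for w in text.lower().split():
        v[hash(w) % 256] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

documents = ["our first internal document ...", "our second internal document ..."]
index = [(doc, embed(doc)) for doc in documents]   # this is what the vector store will hold

def answer(question: str, generate) -> str:
    q = embed(question)
    best, _ = max(index, key=lambda item: float(q @ item[1]))   # highest cosine similarity
    prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}"
    return generate(prompt)   # e.g. a call into the Llama3.1 chat completion example

The vector storage system will replace the in-memory index, but the retrieve-then-generate shape stays the same.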




Part #3



Deploy to production: we are going to start with AWS Bedrock, since it's the cloud platform I have the most experience with.


While we could think this is the last step, it's not. This is the most critical part, since each project has its own needs, like how many users are going to use it and how many concurrently, in terms of max_batch_size processing.


Because it's the first time setting up this system, I don't expect the best results from the first deployment, so it will be a process of continuous improvement, tuning configs and resources.



The production deployment covers the API and the PyTorch/Llama3 processing, but also the automation for the RAG: we need something similar to Continuous Integration, but for feeding our data into the system.


Conclusion


I started with zero AI knowledge, and to my surprise I ended up running Llama3.1 locally, so at this point there is a light showing that you don't have to be a scientist to build with foundational models.

The truth is that, for now, this research is just for fun.


Links


  • paperweight

  • Nvidia Tesla K80