As a backend and Ops engineer, I was very curious about AI this year because, in my mind, it was something only for PhDs, scientists, and big companies working on cutting-edge technologies.
I found that this is partially true: only big companies and specialized teams can develop foundational models (Gemini, Llama, DeepSeek), because it requires a lot of people and money.
On the other hand, fortunately there are open-source AI models like Llama: a foundational model that required tons of training hours and compute, yet is free to the world.
So you can build your own AI platform. You read that right: you don't need to pay for ChatGPT, Claude, etc. or share your personal information with them, although with limited resources it all comes down to your budget.
Of course, this is not for everyone; you would want your own AI platform if you are a company, a researcher, an engineer, etc.
- Top level, AI consumers: use AI for assistance and as a source of information, paying subscriptions.
- Middle level, AI builders/MLOps: private companies, engineers, etc. deploying their own AI systems on top of foundational models.
- Bottom level, AI researchers: big companies and AI researchers creating the algorithms and training methodologies for new foundational models.
In that context, I'm going to put myself in the middle level, where I want to use a foundational model for my research needs.
AI World
- LLM/text processing
- Image processing
- Video processing
- Audio processing
LLM - Foundational models
- Llama
- DeepSeek
- Mistral
Llama
Llama models are open source and have the most documentation and the largest community. Don't forget that Llama was one of the first open models, and DeepSeek, Mistral, and the others were probably born thanks to it.
It's true that it is currently not the most accurate or fastest model, but again, it has a lot of documentation.
Another advantage is the PyTorch framework, which makes it easy to run Llama code on NVIDIA video cards. It's super intuitive and well documented for using CUDA.
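As a quick sanity check before anything else, a minimal PyTorch snippet (assuming torch was installed with CUDA support) confirms that the GPUs are actually visible:

```python
import torch

# Verify that PyTorch was built with CUDA support and that at least one GPU is visible.
if not torch.cuda.is_available():
    raise SystemExit("CUDA is not available: check the driver and the PyTorch build")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
```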
Hardware
A key component when working with AI is the hardware: understanding the requirements, costs, and power consumption is very important.
While you can start with Google Colab, AWS Bedrock, etc. to use cloud resources and GPUs, it's better to understand the foundations first.
Because we picked Llama3.1 8B-Instruct, we are focused on NVIDIA hardware and software for local environments.
Llama3.1 requirements:
https://github.com/meta-llama/llama3?tab=readme-ov-file#inference
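As a rough back-of-the-envelope estimate of my own (not an official requirement), the weights of an 8B-parameter model in 16-bit precision already take almost all of a 16 GB card, before counting the KV cache and activations:

```python
# Rough VRAM estimate for Llama3.1 8B-Instruct weights in bf16/fp16 (2 bytes per parameter).
params = 8.03e9          # approximate parameter count
bytes_per_param = 2      # bf16/fp16
weights_gib = params * bytes_per_param / 1024**3
print(f"Weights only: ~{weights_gib:.1f} GiB")  # ~15 GiB, leaving little headroom on a 16 GiB GPU
```

This is why, on the hardware below, Llama3.1 8B-Instruct only fits with small values of max_seq_len and max_batch_size.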
Local AI System
- Linux Ubuntu 24.04
- GeForce 4070 Ti SUPER 16 GB
- NVIDIA T400 4 GB
- NVIDIA Tesla K80 24 GB
- Intel i9-14900K
- 64 GB RAM

Drivers
Findings:
- The K80 can't be combined with the 4070, so forget about 24 GB + 16 GB: the drivers are not compatible.
- The NVIDIA K80 only runs on Ubuntu 18.x.
- Running torchrun needs the full 16 GB free: GDM off and TTY only.
- The T400 and the 4070 can be combined so GDM runs on the 4 GB card and Llama gets the full 16 GB of the 4070 (see the sketch below for picking the right GPU).
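With both cards installed, a small helper of my own (not part of the Llama repo) can pick the GPU with the most free VRAM, which on this system is the 4070 Ti SUPER while the T400 drives the display:

```python
import torch

# Pick the GPU with the most free VRAM (the 4070 Ti SUPER when the T400 runs GDM).
best, best_free = None, 0
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes
    name = torch.cuda.get_device_properties(i).name
    print(f"GPU {i}: {name}, free {free / 1024**3:.1f} GiB of {total / 1024**3:.1f} GiB")
    if free > best_free:
        best, best_free = i, free

if best is not None:
    torch.cuda.set_device(best)
    print(f"Using GPU {best} for Llama")
```

Alternatively, setting the CUDA_VISIBLE_DEVICES environment variable before launching torchrun hides the T400 from the process entirely.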
Download and Run Llama3.1
At this point we already have 16 GB free for Llama and are ready to run the examples on our local AI system:
torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/ --tokenizer_path /home/torukmnk/.llama/checkpoints/Llama3.1-8B-Instruct/tokenizer.model --max_seq_len 1200 --max_batch_size 1
Common errors are:
- CUDA out of memory
- Bad configuration for max_seq_len
- Bad configuration for max_batch_size
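The out-of-memory case can also be caught explicitly. This is a sketch of my own around the generator object built by the example script, not something the Llama examples do themselves:

```python
import torch

def run_generation(generator, dialogs, max_gen_len=256):
    # Wrap generation so an out-of-memory error gives an actionable hint
    # instead of a bare stack trace. `generator` is assumed to be the object
    # returned by Llama.build() in the example scripts.
    try:
        return generator.chat_completion(dialogs, max_gen_len=max_gen_len)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise RuntimeError(
            "CUDA out of memory: retry with a smaller --max_seq_len or --max_batch_size"
        )
```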
Power Consumption

NVIDIA tool for monitoring:
nvidia-smi
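For watching power and memory while a prompt is being processed, I find it convenient to poll nvidia-smi from a small script (assuming nvidia-smi is on the PATH):

```python
import subprocess
import time

# Poll nvidia-smi every 2 seconds and print per-GPU power draw and memory usage.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,power.draw,memory.used,memory.total",
    "--format=csv,noheader",
]

while True:
    print(subprocess.run(QUERY, capture_output=True, text=True).stdout.strip())
    time.sleep(2)
```

The same can be done straight from a terminal with watch -n 2 nvidia-smi.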
After getting a successful run of torchrun with Llama3.1 8B-Instruct, it's time to optimize and understand max_seq_len and max_batch_size:
- max_seq_len: the maximum sequence length in tokens (prompt plus response)
- max_batch_size: how many prompts are processed concurrently
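To get an intuition for why these two knobs matter, here is my own rough estimate (based on Llama 3.1 8B's published architecture: 32 layers, 8 KV heads, head dimension 128) of the extra memory the KV cache needs on top of the ~15 GiB of weights:

```python
# Approximate KV-cache size for Llama 3.1 8B in bf16 (2 bytes per value).
n_layers, n_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

def kv_cache_gib(max_seq_len, max_batch_size):
    # Keys and values (factor 2) are cached for every layer, position, and batch entry.
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * max_seq_len * max_batch_size / 1024**3

print(f"{kv_cache_gib(1200, 1):.2f} GiB")   # ~0.15 GiB: fits next to the weights on 16 GiB
print(f"{kv_cache_gib(8192, 4):.2f} GiB")   # ~4.0 GiB: would not fit on this card
```

As far as I can tell from the reference code, this cache is preallocated for the full max_seq_len and max_batch_size, which is why lowering these two flags frees memory immediately.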
Play around with a chat example prompt and watch the power consumption and memory in nvidia-smi.
For my system, the maximum configuration allowed was:
- max_seq_len: 1200
- max_batch_size: 1
So, to allow bigger numbers and configurations, more VRAM is what you need.
At this point, having a foundational model running locally is a big achievement. While it could appear easy to get started, it's not; along the way there are many other issues to fix.
Next steps
Let’s say this is our development environment. We need to add features and then deploy them to a production environment.
It's true that we should expect better results in a production environment because of better resources, but the dev environment helps us understand all the little details needed to run it.
Part #2
- API to receive prompts and return responses
- Chat interface
- RAG to feed our own data to the FM and get answers grounded in our documents
The goal for the first feature is to get answers from our own documents or database; that's what RAG (Retrieval-Augmented Generation) is for.
Tools
- Llama3.1 8B-Instruct
- PyTorch
- Vector storage system (TBD)
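Here is a minimal sketch of the retrieve-then-prompt flow I have in mind. The embed() function and the in-memory index are placeholders of my own; the real embedding model and vector storage system are still TBD:

```python
import numpy as np

# Placeholder embedding: in the real system this comes from an embedding model,
# not from this hash-seeded stand-in (which does NOT capture semantics).
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

documents = ["our internal runbook ...", "architecture notes ...", "billing policy ..."]
index = [(doc, embed(doc)) for doc in documents]  # tiny in-memory "vector store"

def retrieve(question: str, k: int = 2):
    # Rank documents by cosine similarity to the question embedding.
    q = embed(question)
    scored = sorted(index, key=lambda item: float(q @ item[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

def build_prompt(question: str) -> str:
    # The augmented prompt is what gets sent to Llama3.1 8B-Instruct.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Where is the billing policy defined?"))
```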
Part #3
Deploy to production. We are going to start with AWS Bedrock, since AWS is the cloud platform I have the most experience with.
While we could think this is the last step, it's not; it is the most critical part, since each project has its own needs, like how many users will use it and how many of them concurrently, which maps directly to max_batch_size.
Because it's my first time setting up this system, I don't expect the best results from the first deployment, so it will be a continuous improvement of configurations and resources.
The production deployment covers the API and the PyTorch/Llama3 processing, but also the automation for the RAG: we need something similar to Continuous Integration, but for keeping the data up to date.
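As a starting point for the Bedrock side, here is a first sketch with boto3. The model ID is my assumption for Llama 3.1 8B Instruct on Bedrock, and the request/response fields follow the Llama text-completion schema; both should be verified against the AWS documentation:

```python
import json
import boto3

# Assumed model ID for Llama 3.1 8B Instruct on Bedrock; verify in the AWS console.
MODEL_ID = "meta.llama3-1-8b-instruct-v1:0"

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(prompt: str) -> str:
    # Llama-style request body: prompt plus generation parameters.
    body = {"prompt": prompt, "max_gen_len": 512, "temperature": 0.6, "top_p": 0.9}
    response = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    return json.loads(response["body"].read())["generation"]

print(ask("Summarize what a KV cache is in two sentences."))
```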
Conclusion
The truth is that currently this research is only for fun.
Links
- https://github.com/oddmario/NVIDIA-Ubuntu-Driver-Guide/issues/2
- https://huggingface.co/blog/llama31





