How to Choose an AI Model (LLM)

There are far more AI models than the ones you see in the news and on social media. Hundreds exist, including open-source models, private ones, and the tech giants’ flagships such as Gemini, Claude, OpenAI’s GPT, Grok, and DeepSeek. An AI model is a neural network trained on massive amounts of data to recognize specific patterns. Now is the time to take advantage of them and choose wisely, whether for business, personal assistance, or creative work.
This guide is not about “model training”; it is geared toward individuals new to AI who want to better understand and leverage the technology. You can build with AI, not over it, so after reading this guide you should understand the general concepts, common usage, and how to measure accuracy. In this guide, you will learn the following:
- Categories of Models
- Corresponding Tasks of Models
- Naming Convention of Models
- Accuracy and Performance of Models
Whether you are a beginner or have only just heard of the popular tools, note that there isn’t one multi-use-case model that does everything you ask of it. From the interface it may appear that you are just typing to a chatbot, but a lot more is being executed in the background. Business analysts, product managers, and engineers adopting AI can identify the objective they have and select from a category of AI models.
Here are four categories of models, among many:
- Natural Language Processing (general)
- Generative (Image, Video, Audio, Text, Code)
- Discriminative (Computer Vision, Text Analysis)
- Reinforcement Learning
While most models specialize in one category, others are multimodal, with different levels of accuracy across categories. Every model has been trained on specific data and can therefore perform specific tasks related to the data it was trained on. Here’s a list of common tasks each category of models can handle:
Natural Language Processing
Enables computers to interpret, understand, and generate natural human language using tokenization and statistical models. Chatbots are the classic example; the best known is ChatGPT, which stands for “generative pre-trained transformer”. In fact, most modern language models are pre-trained transformers.
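To make tokenization concrete, here is a minimal sketch using the Hugging Face transformers library (an assumption on my part; the guide doesn’t prescribe a toolkit, and the gpt2 tokenizer downloads on first use):

```python
# A minimal tokenization sketch using Hugging Face "transformers".
# Assumes: pip install transformers; the "gpt2" tokenizer downloads on first use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "AI models read text as tokens, not words."
token_ids = tokenizer.encode(text)                    # text -> integer token IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # IDs -> subword strings

print(tokens)     # subword pieces, e.g. ['AI', 'Ġmodels', 'Ġread', ...]
print(token_ids)  # the integers the model actually consumes
```

The output shows that models split text into subword pieces rather than whole words, which is also why context limits are measured in tokens.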
Generative (Image, Video, Audio, Text, Code)
Many generative models are Generative Adversarial Networks (GANs), which use two sub-models known as a generator and a discriminator; others, like Stable Diffusion, use diffusion, currently the most popular approach for generating images and video. Either way, realistic imagery, audio, text, and code can be produced based on the huge volumes of data the model was trained on.
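As a rough sketch of what image generation looks like in code, here is an example using the diffusers library (an assumed tool choice, not something this guide mandates; the checkpoint ID is illustrative and a CUDA GPU is assumed):

```python
# A rough image-generation sketch using Hugging Face "diffusers".
# Assumes: pip install diffusers transformers torch, plus a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # substitute any available SD checkpoint
    torch_dtype=torch.float16,                      # half precision to reduce memory use
)
pipe = pipe.to("cuda")

image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```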
Discriminative (Computer Vision, Text Analysis)
These use algorithms designed to learn different classes within datasets for decision-making. Tasks include sentiment analysis, optical character recognition (OCR), and image classification.
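Here is a short sketch of one discriminative task, sentiment analysis, again using transformers (an assumed library choice; a default sentiment model downloads on first use):

```python
# A discriminative-model sketch: sentiment analysis with "transformers".
# Assumes: pip install transformers; a default sentiment model downloads on first use.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

reviews = [
    "The battery life on this laptop is fantastic.",
    "The screen cracked within a week. Very disappointed.",
]
for review in reviews:
    result = classifier(review)[0]  # e.g. {'label': 'POSITIVE', 'score': 0.999}
    print(f"{result['label']:>8}  {result['score']:.3f}  {review}")
```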
Reinforcement Learning
These use trial-and-error methods, rewards, and sometimes human feedback to produce goal-oriented outcomes, powering applications such as robotics, game playing, and autonomous driving.
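To show the trial-and-error idea at its simplest, here is a toy tabular Q-learning sketch (my own illustration, nowhere near the scale of robotics or self-driving systems): an agent learns to walk right along a five-cell corridor to reach a reward.

```python
# A toy reinforcement-learning sketch: tabular Q-learning on a 5-cell corridor.
# The agent starts at cell 0 and earns a reward only by reaching cell 4.
import random

n_states, n_actions = 5, 2               # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy with random tie-breaks so early episodes actually explore.
        if random.random() < epsilon or Q[state][0] == Q[state][1]:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print("Learned policy:", ["right" if Q[s][1] >= Q[s][0] else "left" for s in range(n_states)])
```

After a few hundred episodes the learned policy is “move right” in every cell, discovered purely from rewards rather than from labeled examples.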
Naming Convention of Models
Now that you understand the types of models and their tasks, the next step is to identify a model’s quality and performance. This begins with the model’s name, so let’s break one down. There is no official convention for naming AI models, but the most popular ones simply use a name followed by a version number, such as ChatGPT #, Claude #, Grok #, or Gemini #. Smaller open-source and task-specific models, however, tend to have longer names. You can see this on huggingface.co, where a model ID typically contains the organization name, the model name, the version, the parameter size, and variant or release tags. Let’s elaborate with two examples; a small parsing sketch follows them:
mistralai/Mistral-Small-3.1-24B-Instruct-2503
- mistralai is the organization
- Mistral-Small is the model name
- 3.1 is the version number
- 24B is the parameter count in billions; Instruct means the model was fine-tuned to follow instructions
- 2503 is the release tag (year and month: March 2025)
google/gemma-3-27b
- Google is the organization
- Gemma is the model name
- 3 is the version number
- 27B is the parameter size in billions
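To make the pattern concrete, here is a small sketch that splits such a model ID into its parts (the parse_model_id helper and its regex are my own illustration, not an official parser; real names vary too much for one regex to cover them all):

```python
# An illustrative (unofficial) parser for Hugging Face-style model IDs.
# Real model names vary widely; this only covers the common patterns above.
import re

def parse_model_id(repo_id: str) -> dict:
    org, _, name = repo_id.partition("/")
    size = re.search(r"(\d+(?:\.\d+)?)[bB]\b", name)   # e.g. "24B" or "27b"
    return {
        "organization": org,
        "model_name": name,
        "parameters_billions": float(size.group(1)) if size else None,
        "instruct_tuned": "instruct" in name.lower(),
    }

print(parse_model_id("mistralai/Mistral-Small-3.1-24B-Instruct-2503"))
print(parse_model_id("google/gemma-3-27b"))
```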
An additional detail you will see and need to know is the quantization format, measured in bits. The higher the bit width, the more RAM and storage are required to run the model. Quantization is commonly expressed as a floating-point or integer bit width such as 4, 6, 8, or 16. Other format labels include GPTQ, NF4, and GGML (since succeeded by GGUF), which indicate compatibility with specific tooling and hardware configurations.
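That relationship is easy to estimate on the back of an envelope: memory is roughly parameter count times bits per parameter, divided by eight to get bytes (real usage runs higher once activations, context cache, and framework overhead are added):

```python
# Back-of-the-envelope model memory estimate: parameters x bits-per-parameter.
# Real memory use is higher (activations, KV cache, framework overhead).
def estimate_memory_gb(params_billions: float, bits: int) -> float:
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9  # decimal gigabytes

for bits in (16, 8, 6, 4):
    print(f"24B model at {bits:>2}-bit: ~{estimate_memory_gb(24, bits):.0f} GB")
# prints ~48, ~24, ~18, and ~12 GB respectively
```

This is why a 24B model that needs roughly 48 GB at 16-bit can fit in about 12 GB when quantized to 4-bit, at some cost in accuracy.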
Accuracy and Performance of Models
If you’ve seen news headlines about a new model release, do not immediately trust the claimed results. Competition over AI performance is so fierce right now that companies cook up numbers for marketing hype. How many people will test models on their own instead of trusting the press release? Not many at all, so don’t fall for the “hallucinated figures”. References: https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/ and https://lmarena.ai/?leaderboard
The real way to determine model quality is to check benchmark scores and leaderboards. Several tests exist that you could call semi-standardized, but in reality we are testing “black boxes” with tons of variables. The best measure is to check the model’s answers against facts and other scientific sources.
Leaderboard websites show sortable rankings with vote counts and confidence-interval scores, usually expressed as percentages. The common benchmarks are tests that prompt the AI model with questions and measure its answers. They include: AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, WinoGrande, GSM8K, and HumanEval.
Here are brief descriptions of those benchmark tests:
AI2 Reasoning Challenge (ARC) – 7,787 multiple-choice grade-school science questions
HellaSwag – common-sense reasoning exercises through sentence completion
MMLU – Massive Multitask Language Understanding, multiple-choice problem solving across 57 subjects
TruthfulQA – assesses truthfulness with questions designed to lure models into repeating common human misconceptions
WinoGrande – a Winograd-schema-style challenge with pairs of near-identical sentences that differ by a trigger word
GSM8K – roughly 8,500 grade-school math word problems requiring multi-step reasoning
HumanEval – measures the ability to generate correct Python code across 164 programming problems (see the scoring sketch after this list)
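To demystify how a benchmark like HumanEval produces a score, here is a simplified sketch: execute a candidate solution, then run unit tests against it. The hard-coded candidate string stands in for what a real harness would request from the model, and real harnesses also sandbox execution:

```python
# A simplified HumanEval-style check: run a candidate solution against unit tests.
# In a real harness the candidate code comes from the model under test.
candidate = """
def add(a, b):
    return a + b
"""

def passes_tests(candidate_src: str) -> bool:
    namespace = {}
    try:
        exec(candidate_src, namespace)           # define the candidate function
        assert namespace["add"](2, 3) == 5       # unit tests the model must pass
        assert namespace["add"](-1, 1) == 0
        return True
    except Exception:
        return False

print("pass" if passes_tests(candidate) else "fail")  # score = passed / total problems
```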
Leaderboard websites
- https://www.vellum.ai/blog/llm-benchmarks-overview-limits-and-model-comparison
- https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
- https://artificialanalysis.ai/leaderboards/models
- https://epoch.ai/data/notable-ai-models
- https://openlm.ai/chatbot-arena
- https://lmarena.ai/?leaderboard
Armed with these facts, you can now take a little extra research time beyond reading the next Hacker News or TechCrunch headline to determine whether a new model actually performs as well as the press release claims.