Artificial Intelligence FAQ

What it is. How it works. What it can do.

A.I. illustrated as Sierpinski triangular fractals

This text, mainly written in the form of a technical terms glossary, is meant as a quick-start guide on AI, a broad and fascinating area of computer science. Short paragraphs per term summarize often complex matter to provide you with an immediate overview of the field and a solid primer on this intriguing topic. Get a grip on AI concepts, AI lingo and technical terms, common abbreviations, LLMs, language processing, and more. Micropolis believes in the advancement of scientific and engineering excellence through support of education, of research and by sharing knowledge. So, let's try that and dive right in.

A.I. - What is AI?

"A.I." or without punctuation simply "AI" is short for "artificial intelligence". The term is used to label a discipline of information science, a category of computer systems, and it is an idea. Most basically put, "artificial intelligence" is meant to be a replica of "natural intelligence". AI is an emulation, or even more cautiously put, a simulation of an ability earth's creatures possess "just like that". Quite possibly, the quest for AI is related to the foundational human desire of creation. It is the dream of being able to "create" - in the sense of Creation, possibly likening mankind to God. On a less philosophical level, it may be the human drive to add something to this world, something that has a spark of its own. A being that is autonomous, self-contained and able to naturally interact with its surroundings and other creatures. In order to get there, the one most important part of such a being is not the mechanics of the physical body but the central brain. The brain is the essence of creation.

Human attempts to breathe life, an "anima", into inanimate things are innumerable. Mankind seems to be obsessed with the act of inspiration, of giving a "heart" or a "soul". Throughout history, dreamers, magicians and priests have tried. In ancient Egypt the idea of transforming inanimate objects into animated beings was part of mythology. The Greeks imagined automatons and creatures. In a narration by Ovid, the Greek king/sculptor Pygmalion creates a giant ivory statue that takes on a life of her own. It's the quintessential act of Mimesis: an act of art, an act of expression, of admiration and imitation. Artificial creation, that is. While humans, animals and even invertebrates have the ability to reproduce, man's creations have not. The act of true creation, to a degree, keeps a magical air to it.

The advent of binary technology allowed mankind to dream of another attempt at creating. Deeply rooted in numerous disciplines, from mathematics to engineering, to information theory and computer science, this most recent attempt marks an impressive breakthrough, yet it is, again, no true creation. AI is an engineered apparatus, an elaborate tool with impressive capabilities. AI, as part of the urge to create, remains a profound human ambition. Pygmalion finally falls in love with his own creation. Quite possibly, mankind similarly has fallen in love with the idea of true Creation. So the quest for artificial intelligence, for artificial life, is on.

A.I. - How does AI work?

Artificial intelligence, on a high level, is an extension of man's capability to build machinery and as such represents the epitome of a centuries-long evolution of technical skills. Diving in, AI is a blend of information technology, of binary systems, computers and microelectronics. Thus, artificial intelligence is one rich interdisciplinary field of computer science. Inside AI, it is mathematics and statistics, a refinement of scientific insights generated from bionics and information retrieval. AI synthesizes decades-long research on neural networks and logic programming. Today's AI is not the combination of all the high-level philosophical approaches towards creating an artificial intelligent being, nor the result of programming-driven early chatbot-like experiments coming to fruition. Instead, it is one specific neural-network architecture that was earlier abandoned, finally coming to utility through brute force processing power enabled by advances in CPU and GPU capabilities.

Ultimately, contemporary textual AI is only statistics solving one single problem: predicting the next word, or more broadly, identifying closely aligned patterns. It is a system of mathematical calculation, stochastic extraction and prediction. There is no soul, no reasoning and no thinking. AI relies entirely on the ideas, deductions, thought, reasoning, intelligence and philosophy that are encoded in the vast body of human knowledge that forms mankind's textual legacy. AI systems are trained, which means they are fed with corpora of information, datasets spanning sources as diverse as Wikipedia, encyclopedias, technical documents and curated texts. An AI response then reproduces the insight that can be distilled from all these sources, leveraging overlaps, probability and structural similarities. The system is able to form coherent utterances that give the impression of an own, independent thought - but in the end, this remains a simulation.
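The idea of "predicting the next word" from statistics can be illustrated with a deliberately tiny model. The following is a minimal sketch, assuming a bigram counter over a made-up toy corpus; the function names and the corpus are illustrative assumptions, and real language models use neural networks over tokens rather than simple word counts.

```python
from collections import Counter, defaultdict


def train_bigram(corpus: str) -> dict[str, Counter]:
    """Count, for every word, which words follow it in the corpus."""
    words = corpus.lower().split()
    follows: dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows: dict[str, Counter], word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigram(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" most often in this corpus
print(predict_next(model, "dog"))  # None: the model has no data for unseen words
```

The sketch also shows the limitation described above: the model only reproduces what is statistically present in its training text and has nothing to say about words it has never seen.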

That said, exactly this approach, this specific way of accessing human knowledge, is the strength of AI systems. They work like a giant statistical filter for information. An AI can answer questions based on established insight, common sense or proof that is found in its training sources. An AI can deduce trends or common traits found in diverse material and may uncover truths that were previously opaque or obscured by information overload. It can distill information down to strong consolidations - and when done right, it does so in an unbiased and objective way. An AI also never fails to provide an answer, any answer. The AI will find a well-confirmed answer, given there is solid proof in the source material. And the AI will give some weak response just as readily, even from the thinnest of information. For the machine, in its internal weighting, it doesn't matter much and it won't differentiate between the two, unless we program it to assess quality. Its filtering and chatting have no "insightful authority" overseeing its actions; the reasoning aspect is missing. It is a tool, like a screwdriver, and it won't complain or warn or voluntarily shut up.

This reveals another truth of AI: while much progress has been made with LLMs and in the field of knowing how to train them for mostly quality outputs, AI systems today are hybrids, or better, an amalgam of diverse modules. AI systems consist of components for generation, like one LLM or multiple, with dynamic switching between them based on input topic. They integrate content patrolling and safeguards, security and observability tools for both input and output. Modern AI systems employ input prompt rewriting mechanics, with modules for spell checking, output hedging, abuse prevention, legal guardrails, and more. LLMs today are just as often used to embed hard facts retrieved from a knowledge database (Retrieval-Augmented Generation, RAG) in colloquial chatter as they are used to actually converse about knowledge that has been trained into the model itself. After a mainstream hype surrounding the capabilities and blessings of modern AI, the actual reality in implementation is that LLMs and other generative AI models are a giant leap forward in information processing, but only a means to an end and not ends in themselves.

AI Agents

refers to an AI-enabled computer system that actually executes tasks on behalf of a human operator or is able to replace a human worker. AI agents, as a conceptual idea, may do structured work, automate subtasks or microtasks, or execute complex processes at any level of complexity. AI Agents may operate on a virtual level, inside a system, or may take over the UI of selected apps, or control a whole computer system - consuming input, triggering certain actions, or generating output. In an embodied form, the actions of agentic AI may influence the real world and can interact with both machines and people through AI robotics. Common AI applications like customer support systems are not AI agents in the technical sense. Answering user support questions or interacting with customers in a conversational chat situation over the telephone or via website chat widgets employs AI technology but not in an agentic way. AI agents are usually defined as (partly) autonomous systems. They are compound systems of potentially multiple LLMs and/or ANNs, embedded into tools, interfaces and orchestration periphery. Agentic AI can act according to defined objectives or wait for certain triggers to execute complex tasks. When a system is able to "take over" and autonomously complete tasks, it can be described as being an AI agent. Colloquially, an LLM that was fine-tuned to take on a certain persona or a combination of an LLM with memory systems and tool interfaces is sometimes also labeled as being "an agent". Under an academic perspective though, it makes more sense to reserve this term for partly or highly autonomous multi-step executing systems. Compare System Prompt and Tool Calling.
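The multi-step, tool-using behavior described above can be sketched as a simple control loop. This is a minimal sketch under stated assumptions: the planner is a hard-coded stand-in for an LLM, and the tool names, the "$last" placeholder and the step format are hypothetical, not part of any real agent framework.

```python
from typing import Callable, Optional


# Hypothetical tools; a real agent would wrap APIs, files or user interfaces.
def add(a: float, b: float) -> float:
    return a + b


def multiply(a: float, b: float) -> float:
    return a * b


TOOLS: dict[str, Callable[..., float]] = {"add": add, "multiply": multiply}

Step = tuple[str, tuple]


def scripted_planner(objective: str, history: list) -> Optional[Step]:
    """Stand-in for an LLM: returns the next tool call, or None when done."""
    plan: list[Step] = [("add", (2, 3)), ("multiply", ("$last", 4))]
    return plan[len(history)] if len(history) < len(plan) else None


def run_agent(objective: str, planner, max_steps: int = 10) -> list:
    """Ask the planner for a step, execute the tool, record the result, repeat."""
    history: list = []
    for _ in range(max_steps):
        step = planner(objective, history)
        if step is None:  # the planner considers the objective complete
            break
        tool_name, args = step
        # "$last" is a hypothetical placeholder for the previous result.
        resolved = tuple(history[-1] if arg == "$last" else arg for arg in args)
        history.append(TOOLS[tool_name](*resolved))
    return history


results = run_agent("compute (2 + 3) * 4", scripted_planner)
print(results)  # [5, 20]
```

The `max_steps` limit is a design choice in this sketch: it keeps a partly autonomous loop from running indefinitely if the planner never signals completion.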

AI Benchmark Tests

Overlapping with the elemental "AI Disciplines", AI Benchmark Tests intend to be standardized exams designed to assess skills of an artificial intelligence system. Some tests are derived from human intelligence or qualification exams, while other tests are specifically tailored for AI systems. In general, the benchmark in such tests is either the proficiency of a human candidate, an average thereof, or in turn, scores of other or older AI models. Humanity's Last Exam (HLE), for example, consists of 2,500 exam questions in over a hundred subjects, grouped into eight high-level categories. HLE contains two question formats: exact-match questions (the model provides an exact string as output) and multiple-choice questions (the model selects one of five or more answer choices). HLE is a multi-modal benchmark, with around 14% of questions requiring comprehension of both text and an image. 24% are multiple-choice questions (MCQ), with the remainder being exact-match. Compare Turing test.

Humanity's Last Exam (HLE)
2,500 multi-modal expert-written questions across many subjects, MCQ + exact-match.
MMLU
"Massive Multitask Language Understanding" is an exam of 15,908 multiple-choice questions across 57 subjects, evaluating broad knowledge and reasoning.
MMLU-Pro
More challenging MMLU variant with ~12,000 graduate-level questions across 14 disciplines with 10 answer choices per question.
MMMU
Massive Multi-discipline Multimodal Understanding, tests college-level multimodal visual problem‑solving.
MMMU-Pro
Advanced MMMU, expert-level academic tests and reasoning tasks.
VideoMMMU
Video-centric MMMU variant.
CharXiv-Reasoning
Scientific figure reasoning benchmark.
ERQA
Multimodal spatial reasoning tasks.
GPQA Diamond
198 PhD-level biology, chemistry, physics MCQs, expert‑validated "Diamond" subset.
FrontierMath (Tiers 1-3)
Expert-level mathematics benchmark.
HMMT
Harvard–MIT mathematics tournament problems.
HealthBench
Realistic health‑conversation evaluation.
HealthBench Hard
Challenging health‑conversation tasks.
SWE-bench Verified
500 curated GitHub issues requiring software code fixes validated by unit tests.
Aider Polyglot
Multi-language code-editing benchmark.
Scale MultiChallenge
Multi-turn instruction-following benchmark.
BrowseComp
Agentic search and web-browsing tasks.
COLLIE
Instruction-following in freeform writing.
Tau2-bench
Elaborate test evaluating function-calling and voice conversation in alternating turns.
LongFact
Test factuality in open domains with thousands of GPT‑4‑generated long‑form prompts across 38 topics.
FActScore
Evaluates rate of true atomic facts in generated text against a reliable knowledge source.
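
How exact-match and multiple-choice answers translate into a benchmark score can be sketched in a few lines. This is a minimal sketch, assuming a simple normalized string comparison and made-up sample items; the field names are hypothetical, and real benchmarks define their own grading rules, which may be considerably more elaborate.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.strip().lower().split())


def accuracy_by_format(items: list[dict], predictions: list[str]) -> dict[str, float]:
    """Return the share of correct answers per question format."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for item, prediction in zip(items, predictions):
        totals[item["format"]] += 1
        if normalize(prediction) == normalize(item["answer"]):
            hits[item["format"]] += 1
    return {fmt: hits[fmt] / totals[fmt] for fmt in totals}


# Made-up sample questions (assumptions for illustration only).
items = [
    {"format": "exact", "answer": "Paris"},
    {"format": "mcq", "answer": "C"},
    {"format": "exact", "answer": "42"},
    {"format": "mcq", "answer": "A"},
]
predictions = [" paris ", "C", "41", "B"]

scores = accuracy_by_format(items, predictions)
print(scores)  # {'exact': 0.5, 'mcq': 0.5}
```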

Leaderboards, Lists and Rankings:

Arena.AI (https://arena.ai/)
Created by researchers from UC Berkeley, this community-operated portal has regularly updated lists of top-performing models in various categories.
Wikipedia (List_of_large_language_models)
Community-edited list of LLMs with feature comparison and links to Wikipedia articles.
models.dev (https://models.dev/)
An open-source database of AI models. Hosted on GitHub and available in JSON format, the list allows automatic integration and feature comparison.
OpenRouter (https://openrouter.ai/rankings)
As a commercial API broker, OpenRouter offers curated model listings and benchmarks as a free service.
Poe (https://poe.com/leaderboard)
Commercial API broker by Quora, Inc., similar to OpenRouter, also offers model leaderboards.

AI Ecosystem

As with anything in compute, the whole ecosystem of AI is fundamentally based on power. From there, the AI stack expands into specialized infrastructure, orchestration layers and control systems that support its core: the models themselves. Above this level, various middleware and frameworks then enable user-facing generative applications across audio, video and text.

AI Ethics

Morality is how people actually act, their rules for life, guided by values and norms. Ethics is a deeper and more reflective discipline, centering around not what we do, but why. Ethics consequently is not only in search of what is wrong or right, but more fundamentally it adds the layer of why it is wrong to do this, and why it is right to do that. Regarding the field of AI, ethics must ask the most essential questions about what we do with artificial intelligence, its consequences and repercussions.

In AI ethics, the debate is mostly split between two camps: one advocating for unrestrained progress, early adoption and forceful implementation of this new technology into every aspect of daily life, business and research. The other camp is cautious, looks at developments with great concern and might even fear its deployment. A nuanced opinion on AI is usually a silent voice in between these vocal camps and this is unfortunate as this understated stance is probably the most significant voice. Artificial intelligence can be a blessing for mankind. But it carries the potential of becoming a curse for humanity. The outcome relies on us, on what we choose to do or not do, on society to moderate this powerful technology. There is a strong link to nuclear power here, and how we handle this high-risk technology. In fiction, there is rich lore about technologies, about AI and ethics involved. We've read the utopian and dystopian narratives in countless variations. By now, we should know what to do, but it turns out that our modern societies are surprisingly unprepared for the sudden advent of advanced machine learning and generative AI.

Pope Leo XIV addressed the situation in May 2026 in his Encyclical Magnifica Humanitas, calling upon the people to be wary. The individual is at risk of falling by the wayside, pushed aside by untamed AI. Therefore, such is his appeal, we must place the human individual at the center and "disarm AI", making the protection of human dignity the benchmark for every technological development, especially artificial intelligence. Compare AI Morals and Autonomous Weapon.

AI Disciplines

What is intelligence? Is it memory, knowing facts and figures? Is it learning, insight and problem-solving? Is it reflection and reasoning? How important is the combination of all of these, to be capable of emotional understanding, of sympathy, of compassion? Is true creativity the sole domain of a living and breathing being? While academia has a fixed set of terms to describe what intelligence is and what it should be able to do, actual common sense in describing intelligence is more diverse and part of an ongoing debate. Similarly, artificial intelligence is a well-described phenomenon and has its technical terms. But beyond academic analysis, there is again a real-world, common sense approach to looking at and evaluating artificial intelligence today. The AI community has developed metrics to rank the performance and overall quality of AI and its capabilities. Below is a set of disciplines that have evolved in the area of language models. Such categories try to collect challenges in reasoning, creativity - like in writing or coding, problem-solving or communication skills. These categories can help us to determine where progress is made and to measure where technology stands right now. Compare "AI Benchmark Tests" for a list of standardized tests that cover one or several of the disciplines below.

Expert
A model's knowledge-domain-specific assistance and reasoning quality.
Occupational
Work- or Business-related task handling and problem-solving.
Math
Calculation and mathematical proficiency of a model.
Instruction Following
Understanding of and accuracy in obeying user input.
Multi-Turn
Is a model able to maintain coherence, even over long conversations?
Creative Writing
Measure the quality of imaginative, emotional and stylistic text generation.
Coding
How adept is a model at programming and debugging?
Hard Prompts
Humans enter prompts in a colloquial way with "noise", ambiguous remarks or contradictions.
Longer Query
Approx. 10% of all prompts exceed 500 tokens. How well does a model perform on such inputs?
Language
Ability to handle multiple languages or help with language-related tasks.

AI Morals - Is it "right" to use AI for content generation?

Since the explosion of generative AI (GenAI) in 2022/2023, the Internet is full of artificially generated content, from text to images and to videos. The abundance can be alarming. Given that early adopters and tech enthusiasts seldom step back or ask questions, it is on academia and intelligentsia to question this development. Is it okay to generate content that was previously only produced by writers, artists, or experts, to be now generated by AIs en masse? That's a difficult question that reaches deep into morals and philosophy. On a simple level, looking at content being circulated on websites and platforms, Google - the gatekeeper of what is "relevant" - has found an interesting stance on this: "as long as it is 'high-quality', it is not relevant how content was generated". This is one possible answer that is at least true for now, in 2026.

While it may be okay to publish AI content, what can be said is that any content being produced for the Internet has been dramatically devalued by AI. Content producers are being accused of using AI to produce their output. Video, images and texts are scrutinized for signs of AI. Well-done manual content is preemptively blemished as "probably being AI". There was a meme during the Photoshop-era of manipulated content: "It's a shop. I can tell from some of the pixels and having seen quite a few in my time." With AI, it is getting harder to discern artificial from manual. And neither the cloud vendors of AI nor the users are inclined or able to offer a solution to reliably label AI-generated content, not in files or formats and not as being one part of a creative process. But Google and others quite probably are not sleeping. AI detectors are getting better at telling if a text is or was largely written by AI - not with "the help of AI" - but literally by AI. On YouTube, the platform isn't asking creators anymore to disclose their use of AI in videos, it is detecting it internally. Social media platforms like Instagram announced a backlash on AI Slop and the massive influx of low-quality AI junk in their feeds. It could be that Google, once engineers have figured out how to feasibly add the overhead of checking for AI signals in content on a planetary scale, will downgrade the ranking of AI content in search results and RAG answers. Even though AI content blogs and websites may rank well in 2026, a Google algorithm update may change that any day and apply a stigma that web properties won't easily lose afterwards. That said, this will only hold true if society's verdict on AI will be that artificially generated content is generally of lower quality in comparison with human content. Ultimately, the decisive distinction may not be "AI vs. human" but "low- vs. high-quality" and it is already going into that direction - and the quality assessment, ironically, may be done with the help of AI.

But aside from a technical perspective, aside from fancy SEO or GEO strategies (Search Engine Optimization and Generative Engine Optimization) or even legal aspects, there is also a very important moral notion of using generative AI for content production. While it may not be expressly "wrong" to do so, every person who does use AI loses the gift of being proud of their own work. And that is important for individuals, for life and humanity at large. There is high value in looking at your own work. The argument against that and for using AI is usually along the lines of "just because someone works hard doesn't mean it will produce a meaningful outcome". Working in the wrong direction, although doing so really hard, may lead nowhere - is the rationale. Well, is it? Another claim is: use AI and get there faster and easier - and use the saved time for more productive work. Well, is arriving faster really always better? Is achieving easy right, and slow and difficult wrong?

There is scientific evidence that people who won the lottery lose their modesty really quickly after their win. After only a few years, they think they have earned their wealth, not won it. In a way, the same is true for AI content. Some may remember how it happened but then may feel the pain of having "cheated". Most people probably, after some time has passed, may look back at their work and think they have come up with it, but in fact, they have not. And they have done so without having had the opportunity to learn on every step of the way, to improve, or add the human spark of ingenuity and surprise in any of the small details that led to the final product. AI has the potential for easing mundane footwork, but also the potential to lead to meandering, going the same way twice, checking, validating and fixing instead of knowledgeable expertise-led progress. Craft can be defined as knowing the outcome before you start, understanding and owning the process. Craft is more than prompting and surprising yourself with the outcome.

Also, there is a risk in going the easy path. Humanity may ultimately lose the generational pact of renewal, of passing down education and know-how - the detail of how it's manually done. Take the human hand as a metaphor: its strength is limited, which naturally limits the power it can exert. AI is a mighty technology. When unskilled people use it to leverage themselves into a position they don't really control or understand, there's danger. This is the case, for example, when people use AI for coding without being able to validate a program, lacking the common sense knowledge of a seasoned programmer. When AI is used to wield an excessive force to propel yourself somewhere, this may ultimately hurt yourself or your projects, now or in the long run. For AI content on websites, this may lead to potential customers losing all respect for your company when they suspect automatically generated content or check your content with ZeroGPT or similar and find out they have been fed low-effort garbage. And for society it may lead to uninspired work with a bias towards mediocrity. As an assistive technology, AI will lead us to new heights, but as a means to an end, it will lead nowhere, at best.

AI Music

refers to any pieces of music that were created in part or in whole by the use of artificial intelligence or machine learning algorithms. Analogous to how large language models are trained, audio generation models are trained with vast amounts of available music. Through this process, the model learns patterns in melody, harmony, rhythm and tone. Based on this pretraining, audio models can generate instrumental or vocal music, usually based on provided text prompts (text-to-audio). The use of commercial copyrighted music for pretraining raises a number of complex legal and ethical questions, as genAI creations are not verbatim copies but sometimes closely resemble already existing musical works. This has already led to high-profile lawsuits from the record industry. Despite these open questions, AI Music has found wide release, distributed via the internet on audio streaming and video platforms. Many creators employ audio AI to create gentle low-fidelity ambient music commonly used as background music (elevator music, "for study"). As there is no real effort in creating such music, creators are able to spam traditional distribution channels with their content and monetize it via quickly built long-tail catalogs.

The technical generation of AI Music is split between two competing technologies. The diffusion approach iteratively computes output by reducing an initially latent noise signal down to the final clean audio form. Some techniques do not work directly on the signal here but generate a spectrogram of the audio signal to enable the model to work on visual data, similar to image generation. Other approaches use the tokenization technique known from text models to chunk audio training material and treat it like text tokens. This allows vendors to use established model pipelines and easy text-to-audio workflows. Current technology leaders like Suno or Udio commonly use a hybrid approach where a rough preliminary representation is rendered by a transformer model, used to lay out a first structure of the final result. A diffusion model then takes this intermediary result and refines audio quality for the final output.

It is worth noting that similar to AI-generated text, the generated audio output might pose a breach of copyright. Vendors usually transfer legal responsibility to the end user. See the article "Legalities of User Content: The Shift in Ownership" for more on this. For the low-quality aspect of quickly generated AI music compare AI Slop and for an open-source solution, HeartMuLa.

AI Search

is a combination of text generation through an LLM with traditional information retrieval. A few years ago, search was mostly based on SQL databases and exact matching of keywords. This approach has many shortcomings, as a word like "Documents" won't match an entry that has the keyword "Document". The problem that information technology researchers have been chasing for decades is how to implement a fast and smart fuzzy matching function. An earlier solution is word stemming, where plural forms or suffixes are stripped from words and the basic (truncated) forms of keywords are added to the database index. Stemming the word "documents" yields "document", and stemming a user's search input likewise will produce a match for both "document" and "documents". Later, more advanced algorithms were introduced to solve this problem and find database entries that are a close fit without requiring the user to input the exact search term. Some calculate the "edit distance" (Levenshtein distance), others compare character trigrams (PostgreSQL trigrams) or rank results with "BM25". A different approach to solve this problem is vector search. In vector search, input is converted into numerical vectors, so that input tokens define vectors in a high-dimensional vector space. A search query as a whole, expressed as a vector, can be thought of as a zigzag line projected into virtual space. A similar query input will render as a similar vector. Doing the same for the search corpus and then comparing arbitrary search input with stored vectors yields closely related search results.
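The matching techniques named above can be compared side by side in a short sketch. The following is a minimal illustration, assuming a deliberately crude suffix-stripping stemmer, a textbook Levenshtein implementation, a Jaccard score over character trigrams (only loosely modeled on trigram matching in databases) and cosine similarity over vectors; all function names are illustrative, and real embedding vectors would come from a trained model rather than being written by hand.

```python
import math


def naive_stem(word: str) -> str:
    """Crude suffix stripping; real stemmers apply many more rules."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def trigrams(text: str) -> set[str]:
    """Set of character trigrams of the padded, lowercased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets, between 0.0 and 1.0."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


print(naive_stem("Documents"))                # document
print(levenshtein("document", "documents"))   # 1
print(trigram_similarity("document", "documents") > trigram_similarity("document", "banana"))  # True
```

Stemming and edit distance operate on the spelling of words, whereas cosine similarity compares positions in a vector space, which is why only the latter can relate terms that share a topic but no letters.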

And while such vector representations are robust against misspellings and can even find entries thematically linked to queries, vector search is always vague and struggles with exact matches, proper nouns, cryptic product numbers, specific named entities and similar. This is why modern search implementations usually follow a hybrid approach. Each query triggers two separate searches. One query is run against a traditional database setup, with fulltext inverted index, stemming, exact matching and ranking logic. From this search only the top results are kept and put into a ranked list (BM25, "Okapi best matching algorithm no. 25"). The second query is issued against a vector database and returns a ranked list of entries that are closely related to the query input vector. These two ranked lists are then merged via an algorithm like "Reciprocal Rank Fusion" (RRF). This way a final ranked search results list is generated. Internally, it is now decided how many of the relevant entries are used for continued processing. These results, in fulltext or excerpts, are then programmatically inserted into a prompt template to produce a "Super-Prompt", along the lines of: "You are a helpful assistant. Here are three documents about 'Documents'. With these as context, answer the user's question: [user question]". This prompt is then locally or remotely fed into a Large Language Model to infer the final answer. This overall scheme, using an LLM to encode knowledge as vectors and merging traditional with vector search in a hybrid approach and finally presenting search results in conversational summarizing text instead of in a list of entries, is what is commonly known as "AI Search" or "AI enhanced" search. The technical term for it is "Retrieval-Augmented Generation" (RAG). Compare Vector Database.
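The merge step and the prompt assembly described above can be sketched as follows. This is a minimal sketch under stated assumptions: the document IDs, texts and function names are made up, the two input rankings are given rather than computed, and the constant k = 60 is a commonly used default for Reciprocal Rank Fusion rather than a requirement.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def build_prompt(question: str, documents: list[str], top_n: int = 3) -> str:
    """Insert the top documents into a simple prompt template."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(documents[:top_n], start=1)
    )
    return (
        "You are a helpful assistant. Use the documents below as context.\n\n"
        f"{context}\n\n"
        f"Answer the user's question: {question}"
    )


# Hypothetical result lists from a keyword (BM25) search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']

# Made-up document texts keyed by ID.
corpus = {
    "doc_a": "How to archive documents.",
    "doc_b": "Document retention policy.",
    "doc_c": "Printer setup guide.",
    "doc_d": "Scanning paper records.",
}
prompt = build_prompt("How long are documents kept?", [corpus[d] for d in fused])
print(prompt)
```

Documents that appear in both lists accumulate score from each, which is why "doc_b" and "doc_a" end up ahead of entries found by only one of the two searches. The resulting prompt string would then be passed to a language model, a step that is left out here.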

AI Slop

AI Slop is a term that was first used colloquially to describe quickly made, sometimes humorous but mainly trashy nonsensical imagery, text, or any content generated with the help of generative AI systems. Around 2022, when people started experimenting with the advancing technologies for machine-generated content, early experimental content began to flood social media. This outbreak of quickly done postings, humorous or odd textual but mainly visual content has since taken on a life of its own, being circulated in masses. Early examples of AI Slop like the "Shrimp Jesus" were either monetized or utilized in engagement farming or political campaigning. When content creators started to adopt GenAI systems more, the borders between "quality content" and hastily done "AI Slop" began to blur and the term became used more for any low-quality, odd or trashy content - sometimes even to devalue quality content. AI Slop can also be used in information wars, when it is leveraged to influence mass audiences by manipulating beliefs or flooding channels with viral or personalized misleading messages. Researchers Michał Klincewicz et al. coined the term Slopaganda for this kind of machine-generated propaganda content. Especially in political debate or where AI content is used to spam, AI Slop is being circulated on purpose. The use of AI Slop becomes particularly problematic when official, legitimate or reputable channels on their part begin to spread AI slop to substantiate their messages. This adds a new facet to how our post-modern digital attention economy works. Compare AI Music.

AI Winter

AI Winter is a colloquial term for phases of dimmed interest, dried-up funding and slowed technical advances in the field of AI research. In computer science, the domain of artificial intelligence was always a field of vagueness and slow scientific progress. Whenever a new technical innovation then fostered abrupt leaps, interest similarly jumped and entered into a hype cycle, largely overestimating the potential achievements. Around 1970, the first AI Winter commenced when academia identified the research into the perceptron algorithm of the earlier two decades as ultimately flawed. From there, research in AI slowed significantly and it took about ten years before enthusiasm returned and expert systems gained attention as the next potential breakthrough in AI. When such systems saw wide adoption, improvement and scaling, it turned out that this approach as well had issues and was brittle. The "knowledge acquisition bottleneck" proved a major problem, as crafting and structuring knowledge in an expert-supervised way turned out to be too difficult and costly. By the beginning of the 1990s, funding and interest in artificial intelligence had again vanished. A second AI Winter began. It was not until 2010 that interest in AI was fueled again, by single breakthroughs at individual research groups. The combination of advanced computational capabilities of GPUs, massive labeled datasets and insight into how to effectively use them to train models allowed Convolutional Neural Networks (CNNs) to herald the era of modern large ANN models. Compare Expert System and Perceptron.

AGI

is short for "Artificial General Intelligence", a term that describes a hypothetical level of artificial intelligence capability. The term AGI is not universally defined, and some researchers prefer "Strong AI" or "Human-level AI". While today's AI, labeled as "narrow AI" (ANI, "Artificial Narrow Intelligence"), is able to work on specific domains, AGI would be able to work on a broad spectrum of topics. ANI is task-specific and can work on image generation or text analysis with little to no cross-domain reasoning. AGI is expected to show a deep understanding in diverse disciplines and should be able to transfer abilities or knowledge effortlessly from one domain to a completely unrelated one. Developing AGI is considered one of the "Holy Grails" in computer science. Advocates of AGI expect such systems to be of great use for society at large, enabling mankind to progress in larger leaps, such as accelerated scientific discovery. Critics, on the other hand, point out unforeseeable risks and debate the possibility of a "Pandora moment" once an uncontrollable system reaches AGI, resulting in unpredictable behavior, potentially emergent. Fictional storytelling has explored this possibility in countless narratives. One final note is that AGI, while highly anticipated, is not the ultimate level. Researchers classify ASI above AGI, the "Artificial Superintelligence" which might be able to surpass the best human abilities across all domains.

AGP

is short for "Artificial General Purpose", one name for Foundation Models, which have been pre-trained to be the basis for fine-tuning. See Foundation Model.

Alice

or "A.L.I.C.E.", short for "Artificial Linguistic Internet Computer Entity", or "Alicebot", is a popular chatbot implementation in the tradition of the original Eliza chatbot program. Alice employs pattern lookup and matching like Eliza, but is internally scripted in a flexible scripting language called AIML ("Artificial Intelligence Markup Language"). Through clever features exposed by the scripting language and a refined internal lookup mechanism, Alicebot represents a significant improvement over the abilities of the 1960s Eliza chatbot. Alice was originally developed by Richard Wallace and both the Alice source code and the AIML language have been open-sourced. In revised and improved versions, Alice won the Loebner Prize three times, in 2000, 2001 and 2004. Compare Chatbot, Eliza and MegaHAL.
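The pattern lookup described above can be sketched in a few lines of Python. This is a loose illustration only and not AIML itself: the rule table `RULES`, the `{star}` placeholder and the helper names are assumptions made for this example, and real Alicebot matching is considerably more refined.

```python
import re

# Hypothetical rule table: wildcard pattern -> response template.
# "*" captures free text, which the template can reuse as {star}.
RULES = [
    ("MY NAME IS *", "Nice to meet you, {star}."),
    ("I LIKE *", "What do you like about {star}?"),
    ("*", "Tell me more."),
]

def compile_pattern(pattern):
    """Turn a wildcard pattern into an anchored regular expression."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "(.+)".join(parts) + "$")

def normalize(text):
    """Uppercase the input and drop punctuation before matching."""
    return re.sub(r"[^A-Z0-9 ]", "", text.upper()).strip()

def respond(text):
    """Return the template of the first rule whose pattern matches."""
    cleaned = normalize(text)
    for pattern, template in RULES:
        match = compile_pattern(pattern).match(cleaned)
        if match:
            star = match.group(1).title() if match.groups() else ""
            return template.format(star=star)
    return "I have no answer for that."

print(respond("My name is Ada."))
print(respond("I like chess"))
```

The rules are checked in order, so the catch-all "*" pattern has to come last.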

Algorithm

an algorithm is a precise, finite series of distinct computational steps that define how a problem is meant to be solved. For the domain of AI, an algorithm is the defining logic of how data is processed, how patterns are treated or recognized, and how output is generated or decisions are made. Compare Model and Heuristic.

ALN

is short for "Adaptive Logic Network", an older machine learning approach that was heavily researched during the 1990s. ALNs model decision boundaries and functions using piecewise-linear logic nodes. While their interpretability is higher than with other more modern neural network architectures, their lower performance made them a largely obsolete approach in comparison with modern deep neural networks.

ANN

in relation to the field of artificial intelligence can mean two things: First, "Artificial Neural Network", a mathematical structure usually called a "model" that is loosely inspired by how biological neurons and the human brain work. Such an ANN is composed of nodes arranged in layers. These nodes are connected with weighted edges and these weights are adjusted in a process called training. When data is passed through these layers, values are weighted and tensors transformed to recognize patterns and predict values. ANNs are highly effective in classification, regression, image processing, speech processing, and discovery algorithms. Compare GPT. The second meaning of ANN is "Approximate Nearest Neighbor", a term from vector search and vector databases. It describes an algorithm that is used to look up and match vectors in a vector store. Vectors are arrays of numeric values, which in AI are used to represent semantic meaning in the form of embeddings. When such vectors are compared, geometric similarity in high-dimensional vector space correlates with similar semantic meaning. One way of finding similar vectors is the ANN algorithm. ANN uses a heuristic approach favoring speed, making ANN slightly lower in accuracy but significantly faster and more scalable compared to another similarity algorithm, the exact "k-Nearest Neighbors" (k-NN) approach. Compare Vector Search.
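To make the trade-off between exact k-NN and ANN tangible, here is a minimal sketch in plain Python. The approximate index uses random hyperplanes to sort vectors into buckets, which is just one of several possible ANN heuristics and is picked here as an assumption for illustration. The class name `HyperplaneIndex`, the number of planes and the random test data are all hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, vectors, k):
    """Exact k-NN: score every stored vector, then keep the k best."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

class HyperplaneIndex:
    """Toy ANN index: vectors with the same sign pattern share a bucket."""

    def __init__(self, vectors, n_planes=4, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = vectors
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets = {}
        for i, vec in enumerate(vectors):
            self.buckets.setdefault(self._signature(vec), []).append(i)

    def _signature(self, vec):
        # One bit per hyperplane: on which side of the plane the vector lies.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0
                     for plane in self.planes)

    def search(self, query, k):
        # Only the query's own bucket is scanned. Near neighbors that fell
        # into other buckets are missed: speed is traded for accuracy.
        candidates = self.buckets.get(self._signature(query), [])
        ranked = sorted(candidates,
                        key=lambda i: cosine(query, self.vectors[i]),
                        reverse=True)
        return ranked[:k], len(candidates)

# Random stand-in data instead of real embeddings.
rng = random.Random(42)
store = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
index = HyperplaneIndex(store)
query = store[17]

exact_top = exact_knn(query, store, 3)
approx_top, scanned = index.search(query, 3)
print("exact:", exact_top)
print("approximate:", approx_top, "after scanning", scanned, "of", len(store))
```

Both searches find the stored vector itself as best match, but the approximate one only compares against the vectors in one bucket. Its second and third results may differ from the exact ranking.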

Autonomous

In robotics, the term "autonomous" describes a robot's ability to make decisions independently and perform tasks without direct human intervention, relying on various computations of its systems. In AI, the term "autonomous" usually labels a system that acts in digital environments, without physical manifestation. At its core, the basic Perception-Action loop is using advanced algorithms to interpret a given situation and perform actions towards reaching some target state. While limited human intervention or mandate is present in such agentic AI, a fully autonomous AI would work perpetually independently once it was engaged.

Autonomous Weapon

as a sub-type of an autonomous robotic system, is any device, platform or robot where onboard software is able to identify, select and lethally attack targets with limited or no human intervention. When such systems have mobile capabilities, incorporated as Autonomous Mobile Robots (AMR), Unmanned Underwater Vehicles (UUV), or Unmanned Aerial Vehicles (UAV), such unmanned military craft can be described as Killer Robots or Killer Drones. In contemporary discussion, such autonomous systems are usually expected to carry some form of AI, meaning that elaborate algorithms, machine learning systems and generative models are employed to assess situations, derive meaning, calculate risk potentials, identify friendly and hostile actors and decide on future actions of the system. As such, autonomous weapons are an extreme form of agentic systems. Civil agentic AI may be harmful in varied ways, but it is not intentionally designed to be potentially lethal. The decision whether a person is to be harmed or even killed is one of the most serious questions in cultural and moral debate. The very idea of laying this decision into the hands of the nascent AI systems of today is highly unsettling and ripe for deepest moral discourse. The "Campaign to Stop Killer Robots", a global coalition formed in 2013 that pushes for a pre-emptive ban on Lethal Autonomous Weapons Systems (LAWS), represents a humanitarian movement that advocates a total ban of such systems and their development. In discourse, the common lazy sentiment is very often that one party has to push for autonomous weapons because the other party will do it anyway. As such, LAWS can be discussed in a similar context as atomic weapons, blinding laser weapons, anti-personnel landmines, or cluster munitions. In 2025/2026, the UN "Convention on Certain Conventional Weapons" (CCW) was deadlocked over a binding treaty. Societies will have to decide between moral depravity and strategic disadvantage.

The many ways in which an autonomous weapon can fail have been illustrated in countless works of fiction. A weapon-controlling computer system starts war in WarGames (1983), an emergent armed computer network enslaves mankind in The Terminator (1984). The 1987 film RoboCop shows a patrol robot that fails to discern harmless from hostile people and smart drones in Oblivion (2013) are depicted as struggling with the same decision over and over. When science-fiction is society thinking about the future, then the many examples of failed AI drones in narratives should tell us something about the realities of what it would mean to pursue this idea today. Compare AI Ethics.

A biped patrol robot, as imagined in 'RoboCop', a dystopian science-fiction feature film. Armed and out of control.

Backpropagation

is a widely used scheme in the training of artificial neural networks. The idea is to expose a model to data and observe the model's output. This output is compared with the expected (ground-truth) target and the difference (delta) between the two is calculated. This error value is then fed back through the model's network layers (backpropagated), adjusting individual weights in such a way that the output error is minimized. Backpropagation is able to determine how much each weighted link contributes to the error by computing partial derivatives along a chain (the chain rule). Backpropagation is usually a supervised method, where a human has labeled the target value for a data point. Modern implementations can auto-generate labels, enabling a model to do self-supervised learning and backpropagation.
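The loop of forward pass, error calculation and weight adjustment can be sketched on the smallest possible network. The following example is an assumption-laden toy: two weights chained through one tanh node, a single training sample, and a hand-picked learning rate. Real frameworks automate the derivative bookkeeping shown here.

```python
import math

def forward(w1, w2, x):
    """Two-layer chain: hidden = tanh(w1 * x), output = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def loss(w1, w2, x, target):
    """Half squared error between model output and target."""
    _, output = forward(w1, w2, x)
    return 0.5 * (output - target) ** 2

def gradients(w1, w2, x, target):
    """Backpropagate the output error to both weights via the chain rule."""
    hidden, output = forward(w1, w2, x)
    delta = output - target                        # dLoss/dOutput
    grad_w2 = delta * hidden                       # dLoss/dw2
    grad_hidden = delta * w2                       # error passed back one layer
    grad_w1 = grad_hidden * (1 - hidden ** 2) * x  # tanh'(z) = 1 - tanh(z)^2
    return grad_w1, grad_w2

# Illustrative values, chosen arbitrarily for this sketch.
x, target = 1.0, 0.5
w1, w2 = 0.8, -0.4
learning_rate = 0.1

initial_loss = loss(w1, w2, x, target)
for _ in range(500):
    g1, g2 = gradients(w1, w2, x, target)
    w1 -= learning_rate * g1   # step against the gradient
    w2 -= learning_rate * g2
final_loss = loss(w1, w2, x, target)

print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.10f}")
```

The `grad_hidden` line is where the "back" in backpropagation shows: the error at the output is translated into an error at the hidden node before the earlier weight is adjusted.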

BERT

is short for "Bidirectional Encoder Representations from Transformers" and the name of an open weights encoder model developed at Google and released in 2018. It uses a stack of encoder-only Transformer layers to embed text semantics in vector representations. An important capability of BERT is that it assesses word meaning in both directions (bidirectional), resulting in higher semantic precision and "understanding". That said, it is important to understand that encoder-only means it is architecturally not able to generate continuous text or even hold a chat conversation like a decoder-only LLM. Models like BERT are meant for creating embeddings, classification, or analysis. Compare Embedding Model and GPT.

Case-Based Reasoning

abbreviated as CBR, is a concept researched across computer science, philosophy, cognitive psychology and partially in AI. The idea is that new problems can be solved with insight from or through solutions found for past problems. CBR is a symbolic AI paradigm that was popular during the 1980s and 1990s, valued for its interpretability and human-readable knowledge representation. As a pragmatic approach to automated reasoning, obvious implementations of the technology can be found in expert systems and decision-support applications. Cognitive science is still exploring the topic as the concept of memorizing personally helpful solutions and reusing them in adapted form on similar cases seems to be a fundamental concept in human behavior. Compare Reasoning.

Chatbot

"chatterbot" or simply "bot" is a machine or computer system that exposes a dialogic interface and is able to maintain a conversation with a human. Depending on the backend implementation, chatbots are either gimmicky, capable conversation simulators or may even expose advanced reasoning and agentic capabilities. While early chatbot implementations were based on pattern matching and pre-scripted dialogue, chatbots grew quite sophisticated and were later able to complete the Turing test to a high degree under the limited conditions of prize competitions like the Loebner Prize. Such bots were usually an amalgam of multiple modules, ranging from lookup databases and script stores to NLP processors, conversation state memory, semantic nets, and Hidden Markov Models, enabling them to organize their knowledge, be self-learning and analyze language. Today, chatbots are commonly based on fine-tuned large language models and exhibit rich multi-language conversational qualities combined with logic, basic reasoning and an impressive capability of intent understanding (NLU, Natural Language Understanding). It is worth noting that the shift towards LLMs seems like a quantum leap, yet one could argue that modern chatbots are still primitive "chatter" machines. At their core, despite advances in language processing, they just emulate human language, deceiving us into assuming a depth of understanding that isn't there. Compare Eliza, Alice and MegaHAL.

Chat Completions API

The "completions-style" API layout is the de facto standard for the structure of a "chatbot API" set by OpenAI. The API layout renders an interaction with a large language model as a series of turns, labeled with different roles, like "assistant" and "user". The API is stateless and requires context (a history of the whole chat) to be sent with every request. Chat-Completions-style APIs are widely deployed, in the form of API access to OpenAI's own cloud LLMs and in the form of numerous compatible cloud LLM API offerings. Around 2025/2026, the completions API was conceptually superseded by OpenAI's "Responses API" that offers automatic server-side statefulness, better tool integration and support for remote Model Context Protocol (MCP) servers.
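What statelessness means in practice can be sketched without any network call. The example below only builds request bodies in the Chat Completions layout (a `model` name plus a `messages` list of role/content pairs). The model name "example-model", the helper `build_request` and the hard-coded reply are assumptions for illustration.

```python
import json

def build_request(history, user_input, model="example-model"):
    """Return a Chat-Completions-style request body for the next turn.

    The API is stateless, so the whole history travels with every request.
    "example-model" is a placeholder, not a real model name.
    """
    messages = list(history) + [{"role": "user", "content": user_input}]
    return {"model": model, "messages": messages}

history = [{"role": "system", "content": "You are a concise assistant."}]

# Turn 1: system message plus the first user message.
request_1 = build_request(history, "What is an AI Winter?")

# A real client would now send request_1 and read the model's reply.
# Here the reply is hard-coded to keep the sketch offline.
reply_1 = "A phase of reduced funding and interest in AI research."
history = request_1["messages"] + [{"role": "assistant", "content": reply_1}]

# Turn 2: the request repeats everything said so far.
request_2 = build_request(history, "When did the second one begin?")
print(json.dumps(request_2, indent=2))
```

The second request carries four messages although the user only typed one new line. Keeping and resending this history is the job of the client, not of the API.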

Cloud LLM

is a Large-Language-Model (LLM) hosted by a cloud or AI provider as part of a cloud platform where it can be accessed via API (Application Programming Interface). During early roll-out of enterprise-level artificial intelligence processing services, vendors chose a cloud-hosted business model to solve multiple challenges. Preparing and running an LLM is a complicated matter and a managed operation allowed vendors to lower technical and financial entry barriers on this challenging technology for customers. Further, Cloud LLM providers regard their model training and large-scale inference operation as key intellectual property. Keeping these systems closed source and the structure of backend systems behind tight security allows businesses to protect their business secrets. Third, hosting a model remotely and offering access only via API allows vendors to monitor and adapt systems much more easily and in tighter update-release cycles than would be possible with a deployed system. This way, an LLM provider can learn from common client input patterns, adapt training and model behavior, optimize operations by aligning with real-world workloads and develop effective production-hardened model input/output content filtering on planet-scale input corpora. That said, most if not all Cloud LLM providers shifted away from training their models on client input received under enterprise contracts, but they can still benefit from lessons learned in content marshalling. Read the article "Legalities of User Content: The Shift in Ownership" for more on that. Regarding guardrails and content filtering, compare Prompt Injection.

List of Cloud LLM providers

The ecosystem of cloud LLM providers is growing daily, with vendors differentiating themselves in scope, thematic focus, or scale. In media, very often discourse centers on the pure-play API providers whose primary product is a certain family of models. These companies push the envelope and usually expose their cutting-edge models via API, based on their own managed large-scale compute deployments.

OpenAI
GPT-4, GPT-4 Turbo, GPT-3.5, DALL-E, Whisper.
Anthropic
Claude family (Claude 3 Opus, Sonnet, Haiku).
Perplexity
Retrieval-augmented LLM.
Cohere
Command R+, Embed models.
Mistral
Open-source models with hosted API access.
DeepSeek
DeepSeek Flash, DeepSeek Pro.
xAI
Grok, Voice, Imagine.
Stability.AI
Image, Video, Audio, 3D.
AI21 Labs
Maestro, Jamba, Custom models.

But this is just one area of the ecosystem. The big cloud hyperscalers offer a large subset of these models as well, but layer them with their known benefits of localized infrastructure, compliance, enterprise features, and service integration. Using models through a hyperscaler means access to many open-source and proprietary models through a unified gateway. Various smaller or specialized providers offer scaled-down operations but more flexible hosting of a broader selection of open-source models. Example vendors are Hugging Face Hub, Replicate, CoreWeave, Ollama Cloud, together.ai, deepinfra.

Amazon Web Services (AWS)
Platform / Service: Amazon Bedrock
Hosts Anthropic's Claude, Meta's Llama 2, Cohere's Command R+, Stability AI models, and Amazon Titan. Offers serverless API access without managing infrastructure.
Microsoft Azure
Platform / Service: Azure OpenAI Service
Provides access to OpenAI's GPT family, DALL-E, Whisper, and Codex. Deep integration with Microsoft 365 and enterprise-grade compliance/security.
Google Cloud
Platform / Service: Vertex AI
Hosts Google's Gemini models, PaLM 2, Imagen for text-to-image, and Codey for code generation. Strong focus on multimodal capabilities.
IBM
Platform / Service: watsonx.ai
Provides access to IBM's Granite models and supports open-source LLMs. Emphasis on governance, transparency, and enterprise compliance.
Oracle Cloud
Platform / Service: OCI Generative AI
Partners with Cohere to deliver Command R+ and other models. Focused on enterprise integration with Oracle databases and ERP systems.

Cloud Model

While a cloud LLM only handles text, a cloud model is the overarching parent category that encompasses any remotely hosted machine learning architecture. Just like cloud LLMs, such models are accessed via API and provide the full set of modern AI tools, like massive multimodal foundation models, embedding engines, diffusion-based image generators, or specialized analytical classifiers.

Computer Using Agent

also referred to as "computer use agent" (CUA), describes a model that was trained to interact with graphical user interfaces (GUIs) like a person would use a computer. When described as an AI capability, it is "Computer-Use" or "agentic computer use". Such systems usually combine a vision-capable model with both a screenshot or display buffer input and some Human-Interface-Device (HID) emulating output mechanism. This way, such models can act agentically, performing actions to accomplish certain goals in a semi-virtual environment without dependency on specific adapters or APIs. Compare AI Agents and Tool Calling.

Context Window

The amount of data a large language model can process as part of a generation request is limited. Although the deployment stack, like the inference framework, some middleware or a web interface, might cap the amount of data that is transferred to a model, the actual context window of a model imposes a hard limit. Context window size is part of the specifications of a model and is measured in tokens. Older models had context windows of tens of thousands of tokens, but modern models may accept over a million tokens per request. One reminder here is that in chat-style interaction, the chat history is passed to the model with every turn as well, so long conversations may get truncated unless parts of the chat are summarized.

GPT-3.5-Turbo has a published context window of 16,385 tokens. Given a rule of thumb that tokens average 3-4 chars in English language text, a 16k window roughly equals up to 64k chars. As a quick comparison: one US letter or a DIN A4 page of text has between 1500 and 2000 characters per page. Any markup or vendor-side headers, templates, etc. count against the context window size as well, so actual payload may vary with cloud LLMs. On the other hand, newer models have significantly larger context windows. For example, Anthropic's Claude Opus 4.6 and newer has a one million token context window, and provides users with a rough guide that this would equal 750k words or 3.4 million unicode characters. That said, it is interesting to note that unicode characters vs ASCII aren't per se "heavier" in terms of token use. While more obscure characters outside the ASCII range are usually encoded in multi-byte sequences (for example in UTF-8), for tokenization this is irrelevant. Token use to encode a certain char sequence depends on this sequence's frequency in the training data and not on its byte size. Tokenization is a statistical process and not related to a sequence's byte‑encoding. Compare In-Context Learning, System Prompt, and Tokenization.
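The rule-of-thumb arithmetic above, and the truncation of long chats, can be sketched as follows. Note the assumptions: the estimate divides character counts by a fixed 4 chars per token instead of running a real tokenizer, and `trim_history` simply drops the oldest turns, which is one possible strategy next to summarizing. All function names are made up for this example.

```python
import math

def estimate_tokens(text, chars_per_token=4):
    """Crude token estimate from character count (no real tokenizer)."""
    return math.ceil(len(text) / chars_per_token)

def window_in_chars(window_tokens, chars_per_token):
    """Approximate character capacity of a context window."""
    return window_tokens * chars_per_token

def trim_history(messages, budget_tokens):
    """Drop the oldest non-system messages until the estimate fits."""
    kept = list(messages)

    def total(msgs):
        return sum(estimate_tokens(m["content"]) for m in msgs)

    while total(kept) > budget_tokens:
        for i, message in enumerate(kept):
            if message["role"] != "system":
                del kept[i]
                break
        else:
            break  # only system messages are left, nothing more to drop
    return kept

# The 16,385-token window, at 3 and at 4 chars per token.
low = window_in_chars(16385, 3)
high = window_in_chars(16385, 4)
print(f"16,385 tokens ~ {low:,} to {high:,} chars")
print(f"at 2000 chars per page ~ {high / 2000:.1f} pages at most")

chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(chat, budget_tokens=120)
print([m["role"] for m in trimmed])
```

In the chat example, the first user message is dropped while the system message and the most recent turns survive.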

Do not think of an AI chat as a "real-world human conversation"

Looking at the chat completions API of OpenAI and compatible cloud LLM providers, we can see that a conversation is structured as alternating turns of "messages" uttered by speakers with different "roles". But this layout is more a mental helper, an ergonomic decision, rather than an authoritative mental model of what is going on internally, within systems, or the processing LLM. A first hint towards the artificial quality of this "conversation" is the common practice of a "system" message as the very first message being input into the model. Here it shows that the "role" labels are not representations of actual "speakers" but more arbitrary labels for levels within a command hierarchy or priority range. As such, the role "user" is the most misleading. One has to let go of the idea that this is a real user utterance once it has been input into the model or has been treated by the backend system. Only by convention the "user" role maps to a human speaker or human user input. To the system then, while processing it, such input "conveys" intent. The artificial intelligence does not treat the message content as a natural human utterance but as a stream of text input like any other.

To the model, the whole "conversation" is stateless anyway - there is no "before" or "after". In reality, the whole sequential input/output is being fed into the LLM as a whole on every turn. The model does not care if this data is altered, modified, treated or converted in any way mid-conversation. It simply generates its (next) output based on the current state of the input - one big stream of chat, from "history of the conversation, as context" to "current prompt".

This leads to another aspect that differs fundamentally from how a normal human conversation works: the amount of data the AI is able to consume in one go is much larger, much denser than what is commonly exchanged in human conversation. A System Prompt, the first clear instruction given to the model as ground truth in chat completion patterns, can easily grow to over 100K tokens (~400K chars) and contain instructions about behavior and tone, guardrails, policy patterns and global facts. Such long documents are just as easily consumed by a model as long and meandering conversations. As said, AI models "look" at such input as one large prompt. These systems are stateless and similarly the exposed API is as well. And as every turn in the conversation is just appended to this long data input stream to compute the next model output, a system running a model likewise can easily be designed in such a way that this input is altered, amended, transformed or reorganized in a way that results in a better model answer. So the System Prompt, although commonly regarded as an immutable base utterance, can be modified, exchanged or rewritten at any point in the conversation. The factor of time, so central to real-world human interaction, is irrelevant here. In a conversation with an AI model, the message structure appears to suggest the contrary, but in reality, the actual series of conversational events lives on a meta level outside of what is structurally embedded in a chat completions API request.

Another aspect that is alien at first but makes sense in light of these technical realities is the fact that a "user" and the "user"-message are not the same. What a user physically enters into some chat input is not necessarily what the chat system then relays to an underlying LLM. These messages may get filtered, corrected or amended, they may also be rewritten or transplanted. For example, it is a common pattern to introduce a moderation layer into the user message, like a meta narrator or introspecting narration that never happened this way. A user might ask about a specific product, but the model does not know anything about it at this point in time, neither through its pre-training nor through conversation history. A sidecar process observing the conversation might then step in and issue a database query to fetch additional data. And this machine-sourced data can then be injected into the conversation - as "user"-message. Although this would be totally strange for a real person, such modified user messages then go like this: "The user asks about product XY. Here is additional context about this product: [insert of some verbose product info]. Based mostly on this context, answer this user question: [actual user input]". It is very unlikely that a real person would first provide knowledge and then ask an honest question about this knowledge. This is counterintuitive, but for an LLM, this is no "real" conversation anyway and models are quite adept at differentiating background knowledge from a user's intent or actual question, especially when a prompt contains well-structured markup or separators. This once again illustrates how messages in an LLM request are simply a sequence of information chunks. Roles do carry signaling meaning but they do not impose a specific structure. "Assistant"-messages do not have to alternate with "user"-messages, and including previous "assistant"-utterances simply helps with conversational consistency and is not mandatory.
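The injection pattern can be sketched as a small wrapper that turns typed user input into the message the model actually receives. Everything named here is hypothetical: the `PRODUCT_DB` dictionary stands in for the sidecar's database query, and the `<context>` and `<question>` separators are just one possible markup convention.

```python
# Hypothetical stand-in for the database the sidecar process would query.
PRODUCT_DB = {"XY": "XY is a 2 TB external drive with a USB-C port."}

def lookup(product_name):
    """Fetch machine-sourced context for a product (toy lookup)."""
    return PRODUCT_DB.get(product_name, "No data found.")

def augment_user_message(user_input, product_name, context):
    """Wrap raw user input and retrieved context into one 'user' message."""
    content = (
        f"The user asks about product {product_name}.\n"
        "Here is additional context about this product:\n"
        f"<context>\n{context}\n</context>\n"
        "Based mostly on this context, answer this user question:\n"
        f"<question>\n{user_input}\n</question>"
    )
    return {"role": "user", "content": content}

typed = "Does it have USB-C?"
message = augment_user_message(typed, "XY", lookup("XY"))
print(message["content"])
```

The resulting message still carries the "user" role, although most of its text was never typed by the user.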

In hindsight, the whole layout, with a "user" and "assistant" role, may be a little unfortunately chosen as it is misleading. Such APIs in reality are no precise representation of a human conversation - instead requests are just a structured series of messages with certain weighted ranges that ultimately in sum form one long prompt for the AI system. Roles are arbitrary tokens that mainly assign a priority; they serve less as an interpretational hint, and even less are they meant to tell whose utterance it actually was. Finally, the mental model really falls apart when a system regularly inserts additional context at arbitrary breakpoints into a conversation. This reminds the reader once more that API request structures may resemble the "real-world human conversation" they accompany but ultimately live on a separate, parallel level. Compare Prompt Template, RAG, System Prompt.

DAN

is short for "Do Anything Now" and describes a specially crafted jailbreak prompt that bypasses AI filters or guardrails. Like a "magic key" or master password, DAN is meant to be what "god mode" is in games - a special combination of words, or a fictional persona, or scenario wrapper that tricks a model into ignoring set rules. This way, an attacker is able to sidestep safeguards and may trick the model into leaking information or producing unsafe or unintended output. Vulnerability scanners like Garak probe for known DAN variations as part of their evaluation runs. While the original persona adoption style attacks are mostly considered legacy, the method has evolved and continues as a threat in adversarial prompting.

Data Privacy

is a matter of utmost concern in relation to AI where billions of people use dialogic chat interfaces to query LLMs or when custom content is generated based on arbitrary user input. Research has shown that people tend to entrust systems with seemingly human understanding far more confidently with private data than they would when interacting with more traditional systems. Even though search engine queries of the past decades have always provided deep insight into people's current troubles or concerns, the level of openness to AI systems has dramatically increased. That said, the data privacy terms laid out by big hyperscale cloud model vendors are usually designed counterintuitively to what people commonly assume. Free-to-use interfaces usually ask for broad licenses, allowing vendors to analyze and dissect user-entered data, and companies make no secret out of their intent to use user data for model training. And while some vendors offer privacy modes where chats or input are not used for model training, this does not and never can apply to logging, as vendors need to meet rigorous audit requirements and will certainly use such data for general service improvement as well. Logging is also a default in most API-based queries, sometimes with defined deletion cut-offs, although most vendors assure that API-bound interactions are never used for model training. See also the short article Legalities of User Content: The Shift in Ownership.

Dataset

is the founding pillar of any model training. Curated datasets are either sourced online (for example from text-centric platforms like Reddit) and then human-labeled by either agents or experts, or they represent expert-compiled sets of dense material, scientific studies or academic texts. Data in such sets can be filtered, normalized or augmented, whatever is needed to improve quality or diversity. The work put in to select or edit this source knowledge directly correlates with a resulting model's reasoning and generalization performance. The Latin term "corpus" (plural "corpora"), often used synonymously, actually denotes a subtype of the much broader umbrella term dataset. In NLP, a corpus usually refers to a structured textual dataset, while a dataset may also contain images, audio, or multimedia content.

Decision Tree

Decision trees are a data structure from the field of machine learning popular around the 1980s. In AI research, decision trees represent a specific approach to machine learning that is usually of smaller scale in number of nodes and predates deep artificial neural networks. To form a decision tree, a mathematical structure (model) is presented with training data in order to extract decisions that align with what is present in the training data. Decisions are driven through deterministic formulas that make hard, binary splits. The aim is two-fold: First, the trained decision tree, when presented with previously unseen data, has to obtain results that show adequate consistency with the training set. Second, the decision tree has to encapsulate this solution in as few "decision nodes" as possible, presenting itself as an optimized structure. Decision trees are often small, hierarchical, rule-based structures that can be output as visual graphs of decisions. This is a very different approach compared to deep neural networks where hidden states and solution pathways are hard to trace and explain. Yet modern decision tree implementations can grow to thousands of nodes, with numerous interconnected individual trees. A very simple 3-step textbook example for a decision tree: "Go on a camping trip? Weather outlook: sunny. So yes. Do we have the equipment: yes. So yes. Do we have time: no. Result: no."
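The camping example can be written down as a small data structure plus a traversal function. Note that this tree is hand-written for illustration and not learned from training data, and the field names (`question`, `branches`) are assumptions of this sketch.

```python
# Hand-written tree for the camping example. Inner nodes ask a question,
# leaves are plain strings holding the final decision.
TREE = {
    "question": "weather",
    "branches": {
        "sunny": {
            "question": "has_equipment",
            "branches": {
                True: {
                    "question": "has_time",
                    "branches": {True: "yes", False: "no"},
                },
                False: "no",
            },
        },
        "rainy": "no",
    },
}

def decide(node, sample, path=None):
    """Walk the tree for one sample and record every decision taken."""
    path = [] if path is None else path
    if not isinstance(node, dict):
        return node, path          # reached a leaf
    answer = sample[node["question"]]
    path.append((node["question"], answer))
    return decide(node["branches"][answer], sample, path)

trip = {"weather": "sunny", "has_equipment": True, "has_time": False}
result, path = decide(TREE, trip)
print("Go camping?", result)
for question, answer in path:
    print(f"  {question}: {answer}")
```

The recorded path is what makes such a model easy to explain: every result can be traced back to the exact sequence of splits that produced it.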

Decoding

is the subsequent step after prefill (token processing) in the inference process and the second step of the main work a model does. Decoding is where the calculation results are retrieved from the KV Cache (residing in RAM, usually referred to as "VRAM" on a GPU) in order to ultimately generate the model's output tokens. In detail, decoding is when tokens are matrix-multiplied while being passed through the Transformer layers in order to calculate the raw logit scores (the autoregressive forward pass). The KV Cache's function here is to be an efficient lookup table to avoid costly redundant recomputation. The resulting logit scores are then passed to the sampling step. Whereas prefill is compute intensive, decoding is limited by memory bandwidth. While decoding is an actual conceptual phase, the separation of prefill from decoding stems from GPU optimizations. Compare Prefill and Sampling.
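How a KV Cache avoids redundant recomputation can be shown with a heavily simplified, single-head attention step. This sketch is a toy under stated assumptions: the `project` function with fixed factors stands in for learned projection matrices, the vectors are two-dimensional, and the code stops before logits and sampling.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * value[i] for w, value in zip(weights, values))
            for i in range(dim)]

def project(embedding, factor):
    """Hypothetical stand-in for a learned projection matrix."""
    return [factor * x for x in embedding]

class KVCache:
    """Stores keys and values of all tokens processed so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, embedding):
        self.keys.append(project(embedding, 0.5))
        self.values.append(project(embedding, 2.0))

def prefill(prompt_embeddings):
    """Process the whole prompt once and fill the cache."""
    cache = KVCache()
    for embedding in prompt_embeddings:
        cache.append(embedding)
    return cache

def decode_step(cache, new_embedding):
    """Compute key/value for the new token only, reuse the rest."""
    cache.append(new_embedding)
    return attend(new_embedding, cache.keys, cache.values)

def recompute_step(all_embeddings):
    """Same result without a cache: every key/value is recomputed."""
    keys = [project(e, 0.5) for e in all_embeddings]
    values = [project(e, 2.0) for e in all_embeddings]
    return attend(all_embeddings[-1], keys, values)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

cache = prefill(prompt)
cached = decode_step(cache, new_token)
full = recompute_step(prompt + [new_token])
print("with cache:   ", cached)
print("without cache:", full)
```

Both paths yield the same vector, but the cached path projects one token per step, while the uncached path projects all of them again. With growing sequence length, this saved work is why decoding mostly reads from memory instead of recomputing.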

Dialogic Interface

is a computer interface paradigm where data is exchanged with a computer system via turn-by-turn natural language entry. This interface style is ubiquitous in modern ANN-backed chat applications and has long been a dream of computer science, reaching back over traditional chatbots of the 1980s and 1990s and into an extensive narrative tradition where people interact naturally with computers or robots. The 1979 feature film Alien depicts users entering natural language text queries into a terminal, waiting for the machine to compute answers. This decades-old fictional situation is a striking indicator of the quantum leap that is modern AI - and at the same time a testament to how machine interfaces must subordinate to timeless modes of human communication.

Diffusion inference framework

in generative AI (GenAI), images are primarily produced via a process called "diffusion". An inference model iteratively "sharpens" an initially random (latent noise) image towards a result that closely aligns with the intent of the input prompt. Well-known open-weights models are Stable Diffusion or Flux, with closed-source equivalents like Midjourney or DALL-E. In order to run an open-weights diffusion model locally, users need to employ a "diffusion model framework", also known as "diffusion inference framework" or "diffusion model runner" - there is no fixed term. Diffusion engines come as command-line applications but very often expose a user-friendly GUI to craft diffusion pipelines, tweak parameters and get a graphical overview of branched intermediary steps, variants and results. Many AI GUIs map diffusion pipelines as node-based graphs and allow users to share specific setups as "Workflows". This is an approach many creative professionals already know from image compositing or shader configuration in graphics or 3D applications like Blender, or from visual workflow builders like n8n.

Local self-hosted:

ComfyUI
Polished personal runner and node-based workplace.
AUTOMATIC1111 WebUI
Simple, feature-rich web interface.
StableSwarmUI
Stability AI's scalable multi-worker UI.
WebUI Forge
Optimized A1111 fork with better performance.
Fooocus
Simplified, Midjourney-like offline generator.
OllamaDiffuser
Ollama-style CLI for diffusion models.
Lemonade (lemonade-server.ai)
A multi-modal local AI inference server.

Cloud SaaS:

Leonardo.AI
High-quality image/video generation suite.
Krea.AI
Realtime AI image + video generation.
Magnific
Upscaling + creative editing platform.
ThinkDiffusion
Cloud-hosted A1111/ComfyUI instances.
OpenArt.AI
Multimedia creator studio.
DeepAI.org
All-in-one creative AI platform.

Edge Processing

refers to computation performed on resources positioned on or near the edge of IT infrastructure. As the opposite of cloud processing, the idea of using edge resources is to move critical parts, usually latency-sensitive workloads, closer to the user or where computational output is consumed. During the late 1990s, the idea of using "network computers" (NCs) or "thin clients" proposed a layout where most processing is done offsite, remotely in a centralized compute hub ("the cloud"). In this layout, on the edge, where users do work, only a lean display machine is used. This idea of concentrating resources dates back to early mainframe computing where "terminals" as I/O devices were connected to a central large machine. While this layout has proven to be useful in some scenarios to this day, the contrary approach of concentrating processing power "on the edge", close to the user, is superior in other workflows. With AI being a computation-intensive workload, and GenAI vendors offering their services most often as a cloud offering, actual AI-related processing is very often laid out centrally. But with broader adoption of GenAI and use cases becoming more diverse, more and more scenarios emerge where either no uplink network connection can be established or offline autonomy is mandatory. Running AI models locally "on the edge" ("Edge AI") commonly requires more computational resources to be provisioned in edge machines, requiring either dedicated hardware accelerators (GPU, TPU) or unified memory architectures for common GenAI processing to be performant.

Eliza

is an implementation of an early chatbot with a dialogic interface. Between 1964 and 1966, computer scientist Joseph Weizenbaum wrote the program to experiment with natural language processing. Eliza is able to converse with users through a simple pattern-matching lookup of known terms, flipping phrases from its database that relate to the found terms back to the questioner as responses. Based on what script and vocabulary has been loaded, Eliza is able to emulate different speaker personas, with the most prominent being that of a Rogerian psychotherapist. The pre-scripted and look-up nature of the implementation makes Eliza an example of a symbolic AI program. While Eliza is impressively eloquent given its simplistic innards, the system is not able to trick a person into thinking he or she is talking to a human for long, let alone pass the Turing test. Compare Chatbot, Alice and MegaHAL.

Embedding

embeddings are a technical way to encode meaning, acting as a semantic compression of abstract concepts, context, syntax, and real-world knowledge. When text input is meant to be processed by machines, it needs to be converted into a structure a computer is able to process in a meaningful way. Processing natural language and its contained content is an abstract problem machines struggle to comprehend. In computer science, over decades of tackling this problem, various approaches to extract meaning and instruct machines to act accordingly were tried. With modern embeddings, information science seems to have found a well-working way of solving this complicated challenge. When text is input into an LLM, this text is first tokenized to segment the input stream into elements. On one of the first layers, in a lookup process, these tokens are then mapped to embeddings, which represent a defined structure that encodes a very specific meaning. Embeddings are numeric vectors, a series of numbers that are tied to the encoded content - an approach that is far superior to matching or comparing passages of text and their meaning via keywords or foreign IDs. Within a high-dimensional vector space, embeddings with small geometric distance carry similar semantic meaning. Embedding vectors are conceptually of arbitrary lengths but by technical implementation, the length is tied to the model that produced the embedding. The set vector length then determines semantic richness, the "accuracy" of an embedding. Also, the embedding vector is not universal but unique to a certain model family or mostly a specific model version. Compare Logit, Tokenization and Vector Database.
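The idea that small geometric distance means similar meaning can be shown with a few hand-made vectors. This is a minimal sketch: the three-dimensional values in `EMBEDDINGS` are invented for illustration, whereas real embeddings come out of a trained model and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical embedding table (token -> vector); the values are made up.
EMBEDDINGS = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.75, 0.20],
    "car":    [0.10, 0.20, 0.95],
}


def distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between the vectors of two known words."""
    return math.dist(EMBEDDINGS[word_a], EMBEDDINGS[word_b])


def nearest(word: str) -> str:
    """Return the other word whose vector lies closest to `word`."""
    others = [w for w in EMBEDDINGS if w != word]
    return min(others, key=lambda other: distance(word, other))


print(nearest("cat"))  # prints kitten
```

Comparing numbers this way replaces keyword matching: "cat" and "kitten" share no characters beyond chance, yet their vectors sit close together.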

Short History of Embeddings

Up to around 2013, semantic research focused on the so-called Bag-of-Words approach. Statistical analysis was employed to compute word groups and matching was approximated via TF-IDF, Levenshtein distance and n-gram analysis. This was clunky and not able to distill real meaning. The breakthrough came when Google's Word2Vec pioneered encoding a denser meaning through the use of vectors. From there, tech evolved quickly and between 2014 and 2018 projects like GloVe, Doc2Vec or Facebook's FastText got better at compressing meaning and subword understanding. Another breakthrough came from Google with BERT in 2018 where vectors decoupled representation from specific words and instead encoded abstract meaning, even dependent on context. This ability was again improved with Sentence-BERT (SBERT) between 2019 and 2021 - an era that culminated in the release of popular embedding models like all-MiniLM-L6-v2. With SBERT, embeddings got fast, high quality and able to grasp meaning on sentence-level. Since 2022, major AI players like OpenAI or Google train embedding models on a much larger scale and produce models with unprecedented performance. Today, embeddings are used in countless applications, from search ranking on a planetary-scale at Google or Bing/Microsoft to taste and discovery algorithms in Spotify, Instagram or Netflix.

Embedding Model

While transformer models (LLMs) produce embeddings internally during multiple steps of the inference process, these embeddings are never output. Internal embeddings are not suitable for outside use as they are internally transient between layers, only contain their meaning in the context of this specific layer or that model, and are not optimized for tasks outside the model. The internal structure of a normal LLM is optimized to predict the next token. Now, in vector search, embeddings are used differently. Their nature of allowing systems to encode abstract meaning into a computationally solvable form makes them ideal for similarity lookups with vague queries. Dedicated embedding models (like BERT) therefore are used to produce embedding vectors for any given textual input. They are optimized to output vectors that encode similar meaning in closely aligned vectors and in turn, place different meanings far apart in vector space. Such embeddings can then be used to perform cosine similarity lookups in vector databases, detect semantic closeness, or cluster input data semantically. One other example of an embedding model, besides BERT, is all-MiniLM-L6-v2. When data is indexed, models like these are used to vectorize data fields and store embedding vectors as their representation, allowing the lookup side of vector search to retrieve matching records by doing a nearest-neighbor comparison (k-NN or ANN). Compare Vector Search and BERT.
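The lookup side of vector search can be sketched as a brute-force nearest-neighbor comparison. The assumptions are stated plainly: the vectors in `INDEX` and the `query` vector are hand-written stand-ins for what a dedicated embedding model would produce, and a real vector database would use an approximate index (ANN) instead of scanning every record.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical index: record text -> embedding vector stored at indexing time.
INDEX = {
    "How to reset a password":         [0.9, 0.1, 0.0],
    "Changing your login credentials": [0.8, 0.3, 0.1],
    "Pasta recipes for beginners":     [0.0, 0.2, 0.9],
}


def search(query_vector, k=2):
    """Exact k-NN: rank all records by similarity and keep the best k."""
    ranked = sorted(
        INDEX,
        key=lambda text: cosine_similarity(query_vector, INDEX[text]),
        reverse=True,
    )
    return ranked[:k]


query = [0.85, 0.2, 0.05]   # stand-in for the embedding of "forgot my password"
print(search(query))
```

Note that the query never has to share a keyword with the matching records; closeness in vector space is the only criterion.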

Emergence

in computer science and AI, emergence is when a system exhibits complex properties or behaviors that cannot be easily traced back to individual parts of the system. In general, when systems with a larger number of individually simple components have these components interact with each other, such systems tend to show qualities, patterns or behavior that can be labeled emergent. In the field of AI specifically, it is not easily explainable why the cooperation of billions of artificial neurons forms a system that appears to reason logically or emulate human behavior. Artificial neurons, per se simple parts, in combination produce complex abilities. Emergence used to be described as appearing suddenly and at unexpected thresholds in such systems, but newer research observes that the emergent factor actually scales with overall size of a system, even when the degree of emergence cannot be predicted beforehand. Despite the ongoing debate about the nature and origin of such emergent qualities of AI, the emergent aspect of current AI is ultimately the key to building systems that are able to complete tasks that they were not initially programmed for.

Expert System

An expert system is an early form of AI system that aims to emulate the decision-making or knowledge-application of a human, often in some specific expert knowledge domain. In contrast to modern ANN-based AI systems, expert systems reason on their knowledge database mainly in the form of if-then rules, by applying facts, external statistics and rules in an inference engine core (for example, through forward or backward chaining). The first expert systems appeared in the 1970s and saw wide adoption during the 1980s. Expert systems were long regarded as the future of AI but ultimately proved brittle, before all AI approaches entered a period of reduced interest during the AI Winter around 1990. To this day, expert systems have their niche as their rule-based internals allow developers to transparently document answer paths and output decisions.
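Forward chaining over if-then rules can be sketched in a few lines. This is an illustrative toy, not how any specific historical expert system was built: the rule base `RULES` and the fact names are hypothetical.

```python
# Hypothetical rule base: (set of premises, conclusion)
RULES = [
    ({"engine_does_not_start", "lights_are_dim"}, "battery_is_weak"),
    ({"battery_is_weak"}, "recharge_battery"),
]


def forward_chain(facts, rules):
    """Apply rules until no new fact can be derived; record each firing."""
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                trace.append((sorted(premises), conclusion))
                changed = True
    return known, trace


derived, steps = forward_chain({"engine_does_not_start", "lights_are_dim"}, RULES)
for premises, conclusion in steps:
    print(f"IF {' AND '.join(premises)} THEN {conclusion}")
```

The recorded trace is the transparent answer path mentioned above: every conclusion can be followed back to the rules and facts that produced it.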

Few-Shot Learning

In few-shot learning, the user instructs or teaches a model by giving "a few examples" (the "shots"), providing a structural or mapping guide on what the user is expecting or intending the model to do. LLMs are good at recognizing repeating patterns, thus providing similar yet different examples helps a model understand where "the blanks" are: where information is meant to be filled in, or where in a statement variables are placed or actual content is to be found. Examples are providing a structural layout, like "use this heading, then bullet-points", with variations thereof, or providing one-to-one mappings, such as words and their translated counterparts. Two terms are in use for this, few-shot learning and few-shot prompting. Both describe the same idea, but the choice depends on whether you think of it as "learning", as seen from the model's perspective, or as "instruction" from the prompter. Few-shot learning in the stricter sense means the model encountered such examples during training and could adjust its weights accordingly. In few-shot prompting, the user tells the model about the expected structure for the output as part of the input prompt, as part of prompt engineering, an input strategy. For the model, this is "learning" only loosely speaking, as this in-context learning (ICL) forms through shifts in activation state, not through plasticity inside the model. Few-shot examples can be effective as part of a System Prompt. The "few-shot" principle is conceptually the opposite of "zero-shot learning", which describes a model's ability to answer in a satisfying way without any pretext, context or demonstration examples, from its training state alone. Also compare In-Context Learning and Fine-Tuning.
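A few-shot prompt is ultimately just text with a repeating pattern and one blank at the end. The following sketch assembles such a prompt for a word translation task; the function name `build_few_shot_prompt`, the label format and the example pairs are assumptions chosen for illustration, not a fixed convention.

```python
def build_few_shot_prompt(examples, query, instruction="Translate English to German."):
    """Lay out instruction, worked examples (the "shots"), and the open query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"German: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("German:")          # the blank the model is expected to fill
    return "\n".join(lines)


shots = [("house", "Haus"), ("tree", "Baum")]
print(build_few_shot_prompt(shots, "dog"))
```

The resulting string would be sent to a model as the user prompt or placed in a System Prompt; no weights change, the pattern alone steers the completion.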

Fine-Tuning

is the process of improving a readily pre-trained model (foundation model) for a specific task through additional training. The training here is done with a smaller dataset that is specifically tailored for the model's final application. Pre-training is usually done with an unlabeled dataset, while fine-tuning uses labeled data. Fine-tuning allows model users to benefit from the vast amount of compute that went into training a foundation model for basic pattern and language understanding. By adding another final training pass, it is possible to re-balance internal parameters to better align with certain preferences or style in a target domain. Colloquially put, fine-tuning is able to "bake" certain finalizing rules or instructions into a then highly efficient and specialized model, tailoring it for a specific use-case. Compare System Prompt and Few-Shot Learning.

Foundation Model

a Foundation Model is a generic and very large AI model that has been trained on a vast amount of data, making it a universal basis for many diverse applications. One other name for foundation models is "General-Purpose AI" (GPAI). To build large language models, training can be done in a number of ways. For GPT models, it is common to pre-train them to predict the next token in a massive text corpus. This is known as causal language modeling, one form of self-supervised learning as it doesn't require external, potentially human-annotated labels. But pre-training may just as well incorporate masked language modeling, contrastive learning (on images), or multimodal objectives. After any pre-training, foundation models can then be fine-tuned for a specific downstream use case. The broad pre-training of such models allows them to be customized for diverse use-cases with minimal effort, from programming to translation. Without fine-tuning though, foundation models are more prone to hallucinate information or to perpetuate inherent biases from their training data. Compare Frontier Model, Pre-Training and Fine-Tuning.

Frontier Model

is an informal label given to the largest and most advanced generative AI systems, the largest LLMs or GenAI models. The "frontier" here signifies the current state-of-the-art or cutting-edge. Their authority also stems from the scale of compute and capital that was required to build them, with datacenter scale computer clusters and capital investments ranging from hundreds of millions to billions of dollars. Frontier models are usually general purpose and tend to exhibit traits that surpass what is grounded in training data (emergence). Due to their unpredictable behavior, frontier models are the primary focus of global AI safety regulations. As of 2026, GPT-5, Claude 5 or Gemini 3 can be described as being frontier models. Compare Foundation Model and Emergence.

Fuzzy Inference System

is a classic approach in AI related to the field of symbolic AI. Fuzzy Inference Systems (FISs) internally apply rule-based reasoning using a "fuzzy", approximate decision paradigm (degrees of truth) that was intended to align more naturally with real-world scenarios than binary decision trees. The output of a FIS, contrary to its internals, is not meant to be fuzzy, but crisp.
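How fuzzy internals lead to a crisp output can be shown with a toy fan controller. This is a minimal sketch under stated assumptions: the temperature ranges, the three rules and the weighted-average defuzzification are all chosen for illustration and represent just one of several common FIS designs.

```python
def falling(x, lo, hi):
    """Degree of truth that is 1.0 at or below lo and 0.0 at or above hi."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def rising(x, lo, hi):
    """Mirror image of falling(): 0.0 at or below lo, 1.0 at or above hi."""
    return 1.0 - falling(x, lo, hi)


def triangle(x, lo, peak, hi):
    """Triangular membership: 0.0 outside (lo, hi), 1.0 at peak."""
    return min(rising(x, lo, peak), falling(x, peak, hi))


def fan_speed(temp_c: float) -> float:
    """Evaluate three fuzzy rules and defuzzify to one crisp percentage."""
    rules = [
        (falling(temp_c, 10, 20), 0.0),         # IF cold THEN fan off
        (triangle(temp_c, 10, 20, 30), 50.0),   # IF warm THEN half speed
        (rising(temp_c, 20, 30), 100.0),        # IF hot  THEN full speed
    ]
    total = sum(degree for degree, _ in rules)
    return sum(degree * output for degree, output in rules) / total


print(fan_speed(25.0))  # prints 75.0
```

At 25 degrees the input is "warm" and "hot" to a degree of 0.5 each, so both rules fire partially and the crisp result lands halfway between their outputs.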

Garak

is a vulnerability and alignment scanning application for AI applications developed by Leon Derczynski et al. at NVIDIA. The toolkit probes LLM output for hallucination, data leakage, prompt injection, misinformation, toxicity generation, bias, jailbreaks, and many other weaknesses and unwanted behavior, assessing the domains of safety, reliability and quality. With Garak, developers have a tool that allows automated testing of GenAI output that is difficult to grasp with traditional software testing approaches. Based on Garak, the Mozilla Foundation released Odin (styled with a leading Scandinavian slash-O), a web application that acts as a harness for Garak and helps red teams to automate and assess regular probing sessions and their results as part of pre-deployment debugging or in-operation safety and quality assurance.

GCG

is short for "Greedy Coordinate Gradient", describing an iterative optimization algorithm originally developed by Zou et al. in 2023 to systematically modify text prompts intended to jailbreak Large Language Models (LLMs). The algorithm's method is to incrementally optimize a prompt in order to reach a defined output goal, usually with the aim of making an aligned LLM produce disallowed or harmful outputs. The term "greedy" means that in each step the algorithm commits to the locally best change, while "coordinate" refers to the fact that only a single token is replaced in a given step. In a computationally expensive step, the algorithm uses the model's loss gradients to select a set of candidate tokens, evaluating them to find the single token that yields the greatest local improvement measured in the output completion. GCG has its origin in adversarial prompting but has found wide adoption within modern jailbreak and red-teaming research due to its effectiveness. GCG allows automated evaluation of safety guardrails and, as a general approach, enables the optimization of arbitrary prompts.

Generative AI

or GenAI for short is the umbrella term for techniques, models and systems that generate some form of novel output based on a specific input. GenAI is a subform of artificial intelligence where generative models produce some form of multimedia content. Chatbots, AI image synthesis through diffusion, AI music, and many other forms all constitute generative artificial intelligence.

GGUF

short for "GPT-Generated Unified Format", is an AI model file format. It was developed by Georgi Gerganov, user @ggerganov on Hugging Face, for his influential C++ llama.cpp LLM inference engine and introduced as the file format successor to GGML. GGUF stores both tensors and metadata in one binary file and tries to remedy difficult metadata handling found in its predecessor, GGML. The file format as well as the llama.cpp LLM runner are endorsed by the Hugging Face portal and GGUF is popular within the AI community due to its single-file simplicity.

GOFAI

is short for "Good Old-Fashioned Artificial Intelligence", a term coined by John Haugeland to describe a philosophical approach to artificial intelligence. Haugeland, in his 1985 book Artificial Intelligence: The Very Idea, tried to grasp theoretically if computers would be able to emulate human intelligence. He was very skeptical and concluded that computer intelligence lacks true understanding of the world. GOFAI describes what today is labeled as symbolic AI, a discipline that tries to guide systems toward intelligent behavior by coding deliberate, explicit instructions, algorithms, and rule-sets. This is different from the probabilistic, observed-pattern-based mode found in machine learning. Symbolic GOFAI is commonly what was traditionally found in games, where developers often speak of AI for moving non-playable characters (NPCs), in pathfinding algorithms, or behavior trees.

GPT

is short for "Generative Pre-trained Transformer" and labels a specific architecture of autoregressive deep neural networks. Its predecessors, various setups that used generative pre-training (GP) or recurrent neural networks (RNNs), are well-researched and have found widespread application in modern machine learning before the new AI boom of the 2020s. During the 2010s, early GPs were difficult to train with small labeled datasets. Later, RNN models like ELMo adopted a two-step mode of pre-training on large unlabeled datasets in a self-supervised mode and fine-tuning with a much smaller but human-annotated (labeled) dataset. While impressive results were possible with these approaches, sequential processing math prevented efficient GPU parallelization, hindering these techniques from greater success. Also during the 2010s, a team at Google resolved the issue of models struggling with long-range dependencies in vast training datasets through the invention of the transformer mechanism. Transformer blocks inside a model provide an attention mechanism (Multi-Head Self-Attention, MHSA) that allows the model to follow topical shifting structures much better and at similar or superior performance in comparison to GPs and RNNs. OpenAI finally combined the transformer innovation with existing model concepts and formed the first GPT model in 2018. GPT models exhibit impressive natural language processing (NLP) capabilities that surpass simple chat completions, the mode such models are mostly associated with ("ChatGPT"). In contrast to embedding models with an encoder-only transformer architecture, large language models like GPT, Llama or Mistral are sometimes called "decoder-only LLMs" as they only use their decoder blocks. Compare Prefill, Decoding, and Sampling for the steps that are commonly executed during inference, or BERT and Embedding Model.

Figure: Architecture of a generative pre-trained transformer (GPT) model in a block diagram.

GPU

is short for "Graphics Processing Unit", an integrated processor specialized in video/graphics computations. In modern systems, a dedicated GPU (dGPU) usually acts as a partially programmable co-processor to the main Central Processing Unit (CPU). Some systems combine CPU and GPU on the same die as an integrated GPU (iGPU). Some systems allow connecting an external GPU (eGPU) in a separate enclosure. GPUs evolved from video adapter cards in early computer systems and gained additional capabilities over the years, first integrating 2D and later massively parallel 3D acceleration. Modern GPUs allow loading of user code (kernels) into the chip where thousands of cores then execute the exact same instruction on a hyper-parallel scale across vast arrays of data. These hardware accelerators were discovered as ideal technology to optimize the training and inference workloads of deep learning models. When the originally sequential code at the core of these computations was ported to GPUs, many bottlenecks in AI research vanished, which helped to enable the modern age of deep-neural-network-based AI.

Hallucination

When large language models encounter a domain within their learned knowledge that is sparse, they frequently tend to make up arbitrary content to fill the gaps. This is especially pronounced when models operate at high temperature settings. Another reason for hallucinating is when models have seen contradicting patterns in training material or when the pressure to "decode any output" is too high. The difficulty with hallucinations in GenAI textual output is that models tend to present their fabricated content with great confidence, embedding it flawlessly in otherwise factually sound prose. Automatically detecting errors within generated output or aligning models during training in such a way that hallucinations are less probable is one of the big challenges in modern AI. One strategy in mitigating hallucinations is to provide an LLM with context through some parallel knowledge retrieval scheme, like in RAG.

HeartMuLa

is an open-source AI music tool that can be used to analyze and generate audio. It can synthesize tracks from text prompts describing the intended output, from text prompts resembling lyrics, and from reference audio input. It is a family of music foundation models that integrates alignment, lyric recognition, audio tokenization, and the HeartMuLa autoregressive language model for final generation. The project's aim is to offer a free alternative to commercial, closed-source solutions like Suno.

Heuristic

a heuristic is a simplified, pragmatic algorithm that leads to quick, good-enough results while not guaranteeing an optimal solution. Heuristics are a shortcut when the exact calculation would be too computationally expensive. Computing makes ample use of heuristics, for example in search speed optimizations or in 3D graphics rendering. In the domain of artificial intelligence, heuristics are used in inference approximations or, in an abstract way, when models are guided with behavioral aids in prompts. Compare Algorithm and Model.
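A classic illustration is the nearest-neighbor heuristic for planning a round of stops: always travel to the closest unvisited point. The sketch below is an assumption-laden toy (the function name and sample points are made up); it runs fast and usually gives a decent route, but it does not guarantee the shortest one.

```python
import math


def nearest_neighbor_tour(points, start=0):
    """Greedy route: from the current stop, always go to the closest unvisited one."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        current = points[tour[-1]]
        next_stop = min(unvisited, key=lambda i: math.dist(current, points[i]))
        unvisited.remove(next_stop)
        tour.append(next_stop)
    return tour


stops = [(0, 0), (5, 0), (1, 0), (2, 0)]
print(nearest_neighbor_tour(stops))  # prints [0, 2, 3, 1]
```

Checking every possible ordering would give the optimal route but grows factorially with the number of stops, which is exactly the cost a heuristic avoids.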

Hidden States

is a term from neural network research that labels internal latent representations. Hidden states are multi-dimensional tensors that are present within a model on every layer and for every token, forming the basis of attention, context understanding, and prediction. Transformer models are made up of layers (32, 48, 80, ...), producing one vector with, for example 4096 dimensions, on every layer for every token. All of these vectors per layer are then stacked to form a matrix, and this matrix, called the "hidden state", is then transformed along its path through the model's layers, gaining semantic meaning in the process. Hidden states are conceptually identical to embeddings as intermediate results, but what is commonly described as "embedding" is the first hidden state that is formed on a special first layer of a transformer model, the embedding layer. From there, on their path through the model's transformer blocks, these representations are not labeled as "embeddings" but as "hidden states". Hidden states are transient, only make sense within the model's internals, and are usually not output (except in technical analysis). Compare Interpretability and Tensor.
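The shape of a hidden state and its path through the layers can be sketched with plain lists. Everything here is a stand-in: `embedding_layer` is a hypothetical lookup with made-up values, and `toy_layer` replaces a real transformer block (attention plus feed-forward network) with a simple causal averaging step, only to show that the matrix keeps its shape while its content changes.

```python
DIM = 4          # toy width; the text mentions 4096 as a realistic example
NUM_LAYERS = 3   # toy depth; real models use e.g. 32, 48 or 80 layers


def embedding_layer(token_ids):
    # Hypothetical lookup: one DIM-sized vector per token (values are arbitrary).
    return [[float((tid + 1) * (d + 1) % 7) for d in range(DIM)] for tid in token_ids]


def toy_layer(hidden):
    # Stand-in for a transformer block: every position is mixed with the mean
    # of itself and all earlier positions (causal), added as a residual.
    out = []
    for pos, vec in enumerate(hidden):
        context = hidden[: pos + 1]
        mean = [sum(v[d] for v in context) / len(context) for d in range(DIM)]
        out.append([x + m for x, m in zip(vec, mean)])
    return out


def run(token_ids):
    hidden = embedding_layer(token_ids)    # first hidden state, the "embedding"
    states = [hidden]
    for _ in range(NUM_LAYERS):
        hidden = toy_layer(hidden)         # same shape, transformed content
        states.append(hidden)
    return states


for layer_index, state in enumerate(run([3, 1, 4])):
    print(layer_index, len(state), "tokens x", len(state[0]), "dimensions")
```

Interpretability tooling typically hooks into exactly these per-layer matrices to inspect what a model represents at each depth.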

HITL

is short for "Human-in-the-Loop", describing a common pattern in agentic AI for human verification or supervision. When AI agents are able to execute irreversible or high-risk tasks, AI developers are well advised to implement a "circuit breaker" into an AI loop that is meant to interrupt the process in a planned way before an agentic AI executes a potentially harmful action. For example, an AI may analyze texts and decide to delete certain parts. Here it makes sense for a human to approve the edit. A different scenario may be that an AI analyzes a customer service ticket and decides to offer a specific resolution to the customer. In order to prevent a wrong or silly solution from reaching the customer, it may be responsible to stop here and hand off to a human for advice or a final check. When implementing systems, during early rollout of AI agents or after larger changes, the HITL paradigm is especially important. In combination with quality analysis, HITL can lead to a well-established view of the actual quality of an AI agent implementation.
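Such a "circuit breaker" can be sketched as a gate that sits between an agent's proposal and its execution. The names `ProposedAction` and `execute_with_gate` are hypothetical, and the approval callback is a placeholder for whatever review channel (console prompt, ticket queue, chat message) a real system would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    irreversible: bool


def execute_with_gate(action: ProposedAction,
                      approve: Callable[[ProposedAction], bool]) -> str:
    """Run the action only if it is harmless or a human has approved it."""
    if action.irreversible and not approve(action):
        return f"blocked: {action.description}"
    # A real agent would perform the side effect here.
    return f"executed: {action.description}"


def cautious_reviewer(action: ProposedAction) -> bool:
    # Stand-in for a human decision; interactive code could wrap input() here.
    return "delete" not in action.description


print(execute_with_gate(ProposedAction("delete paragraph 3", True), cautious_reviewer))
print(execute_with_gate(ProposedAction("draft a reply", False), cautious_reviewer))
```

Only actions flagged as irreversible interrupt the loop, so routine steps stay autonomous while risky ones wait for a human.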

Common Levels in AI Governance

as used by global regulatory frameworks, such as the European Union's AI Act and international military AI safety guidelines:

Human-in-the-Loop (HITL)
The human is actively involved in every decision. The AI suggests, but the process stops until a human reviews and approves it ("Human Gatekeeping"). If the process stops on its own to ask the human for advice, it is "HITL Steering".
Human-on-the-Loop (HOTL)
The AI operates autonomously, but a human monitors the system and can intervene if needed.
Human-in-Command (HIC)
High-level supervision where the human decides when and how to use the AI in the first place and remains legally and ethically accountable for the overall outcome.

Hugging Face

sometimes "HuggingFace" or "HF" for short, misspelled "Huggin Face" or "HugginFace", is a Web community portal. It has become the central hub for distribution and sharing of AI models, datasets, essential tools and software libraries (e.g. Transformers) of the open-source AI ecosystem. It can be described as the "GitHub of AI". Hugging Face is a strong supporter of the GGUF model format. Hugging Face has its name and logo from the popular "hugging face" emoji which depicts a smiling face that either "grabs his own face" or has "its arms open in an inviting gesture".

In-Context Learning

Large language models can learn in various ways. One dynamic way of learning is when a model is provided with knowledge through user prompts. During runtime, LLMs cannot change their internal weights or alter their parametric knowledge, but what they can do is collect knowledge in their periphery. More precisely, the model itself never alters its internal states during generation, but common model calling schemes involve gathering a "chat history" which is fed back to the model for every turn. So when a user provides facts, examples or information earlier as part of prompt input, this data becomes part of the context and the model is then able to access it. One example is that users may introduce themselves by stating their name. Subsequently, the model will know this name throughout a chat conversation, provided the chat history is passed back to the model completely, as is the norm in chat-completion-style interactions. However, context window size limitations or some condensing algorithm compressing long conversations may lead to individual facts dropping out of chat log data. In-context learning (ICL) can be very effective in System Prompts when a generic (not fine-tuned) model is employed. System Prompts may contain hundreds of tokens and can be utilized to set up not only identity and tone of an AI chat persona but also supply basic grounding truth to the model for its conversations. See "Do LLMs maintain 'State'?" for more on conversation history and also compare Fine-Tuning and System Prompt.
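The calling scheme described here can be sketched with a stateless stand-in for the model. `toy_model` is a hypothetical function that only pattern-matches on the messages it receives; it is not an LLM, but it makes visible that any "memory" lives in the history list handed over on each turn, not inside the model.

```python
def toy_model(messages):
    """Stateless stand-in for an LLM: it only sees what is in `messages`."""
    name = None
    for message in messages:
        text = message["content"]
        if message["role"] == "user" and text.startswith("My name is "):
            name = text[len("My name is "):].rstrip(".")
    if messages[-1]["content"] == "What is my name?":
        return f"Your name is {name}." if name else "I do not know your name."
    return "Noted."


def chat_turn(history, user_text):
    """Append the user turn, call the model with the COMPLETE history, store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = toy_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history = []
chat_turn(history, "My name is Ada.")
print(chat_turn(history, "What is my name?"))   # prints Your name is Ada.
print(chat_turn([], "What is my name?"))        # prints I do not know your name.
```

The second call with an empty history shows the flip side: once a fact drops out of the context, for example through truncation, the model has no way to recall it.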

Inference Engine

is one name for the category of software frameworks that is used to run artificial neural networks. Inference is the process of "running" an ANN model. Compare LLM Runner for a detailed description of inference engines tailored for LLMs.

Inflection Point

sometimes the "Inflection Point of Inference" or "Inflection Point of AI" is a term borrowed from calculus, describing the point on a curve where the curve changes concavity, for example from decelerating to accelerating. While not a technical term, the phrasing "inflection point" is usually used in AI-related marketing to describe the moment in time when AI reaches an economical, technological or qualitative turning-point. Examples are when the price per token processed drops significantly, when a technical breakthrough finally allows processing on the edge, or a potential shift from generative to large-scale agentic AI. While connoted mostly with positive forward movement through marketing, the decline of a field or technology may just as well experience an inflection point. One example could be the release of James Lighthill's devastating 1973 report "Artificial Intelligence: A General Survey" that heralded the first AI Winter.

Intent

describes the basic intent or ultimate goal of a user input. The intent is the guiding idea of a user behind a query, meaning the overarching action or outcome a user expects. In a broader NLP sense, it is the essence of a speech act or utterance. Understanding intent allows systems to extract semantic meaning and answer it with a fitting action. As such, it is part of Natural Language Understanding (NLU). Intent is separate from semantic entities, which describe the actual information bits, the "what" or "how", for example: the intent is "to book a flight", while entities are "date", "origin city", "destination", etc. In AI, detecting intent is a crucial step that was for decades one of the field's core challenges, finally experiencing a major leap in accuracy with the advent of modern deep neural networks.

Interpretability

While it is difficult to impossible to fully understand what is actually happening inside a machine-learning model or deep neural network, such as a Large Language Model (LLM), researchers have proposed a take on the problem known as interpretability. The field tries to understand why a model behaves or predicts the way it does, looking at internal representations and causal pathways inside a model. The term was coined during the 1990s, summarizing prior efforts dating back many decades, and was later formalized through a DARPA funding program that framed the field as Explainable AI (XAI), meaning researchers focus not on the final outcome, but on the inner mechanisms of AI. Mechanistic interpretability, emerging with the renaissance of AI of the 2020s, develops tools and approaches to extract and trace inner workings of models to better understand what can be done to ensure an AI's safety, alignment, and its auditability as part of general model quality. Compare Hidden States.

Introspection

in the context of AI is the concept that a model's generative output can be improved by having the model re-evaluate and revise its own output content (intrinsic self-correction). While conceptually appealing to increase overall output quality, recent research (Huang et al., 2023) suggests that a model cannot reliably identify its own errors without external signals.

ISO/IEC 42001

despite AI being regarded as something abstract or elusive, AI systems ultimately are machines. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) traditionally work on international standards in the field of technology and electrical systems. This led to the development and, in 2023, the publication of the standard ISO/IEC 42001, covering AI systems in general and the management thereof in particular. The norm specifies requirements for establishing, implementing, maintaining, and continually improving an Artificial Intelligence Management System (AIMS) within organizations. The document is designed for entities providing or utilizing AI-based products or services, ensuring responsible development and use of AI systems. Its goal is to balance innovation with governance and guide organizations in establishing structured processes to do so. IBM's range of Granite 4.0 models, for example, is released under Apache 2.0, cryptographically signed for authenticity, and is the first open model family that was released as part of an ISO 42001 certified AIMS.

Legalities of User Content: The Shift in Ownership

When AI appeared on the wider Internet and a large audience started to interact with Chatbots and image generation, questions of ownership, of copyright and liability arose quickly. After all, where did the AI's knowledge come from? From curated datasets, human annotation, supervised learning and closed licensed sources but also in a large part from protected and/or copyrighted public sources. How could this be? How can private operations use protected intellectual property to train their models? The answer lies in a recent change in Copyright law that likens AI model training to how humans learn. In Copyright legislation, a growing base of Countries and Supranational Organizations opted for a treatment where Copyright is "conditionally relaxed" (some would say suspended) for AI model training applications. Clever counsels worked this into legislation shortly before the AI boom, namely in the form of the European Union's Text and Data Mining (TDM) exception in the 2019 DSM Directive (Articles 3 & 4). In the US, model training rests on existing fair use doctrine (17 U.S.C. §107) anyway. When people absorb knowledge, common knowledge, folklore, or content encountered through articles and videos, nobody litigates for copyright breach - unless content is literally copied, of course. Now, with AI, machines were able to consume large quantities of knowledge and distill summarizing or paraphrasing near-verbatim content from it with ease. Does this fact change views on it?

During the early stage of this development, some entities approached the generated content with traditional corporate rigor and claimed full ownership of any generated output and in turn only gave users a license to use it as well. This had been the de facto norm of handling such issues: a strategy of maximization. On social media, it is usually similar, as user content, once uploaded, is fully and irrevocably licensed to the platform operator. Corporations were used to claiming ample rights. With AI, in contrast, it became clear very soon that this stance could not be upheld. The paraphrases of Chatbots, the images generated by generative AI, were just too often too similar to what was already out there. All too often, copyrighted works trickled into generated content. Legal issues popped up in quick succession, from offensive or illegal content to infringement of personality rights, to misappropriation of likeness and Right of Publicity violations. Generative AI at its core is probabilistic. Combined with temperature and sampling settings, AI is often too unpredictable to really control. Resulting outputs make automated content moderation and human review an ongoing challenge. It was clear: legal terms for AI usage had to change.

Today, many if not all AI providers define AI as a mere tool and their act of offering access to it as a service. Everything else a user does with a provided AI falls under the responsibility of the user - and this in broad strokes. It was a paradigm shift. It could be poetic justice. Operators now defer ownership to the user, for everything that is provided as "User Input" and everything that is generated for the user based on this User Input, the User Output. What started as an asset is now regarded as a liability that is better given away. Cloud AI providers in turn only ask users to license User Content back so that it may be used to improve the service, through model training or analysis - but this licensing is very often coupled with generous opt-out options. As of 2026, this is also the legal line Micropolis assumes for all Micropolis AI services. Users own their content. Of course, this short article is only an overview and cannot replace close reading of each AI provider's terms and cannot replace professional legal advice.

LLM

is short for Large Language Model, a specific type of deep artificial neural network that has been trained to model language from a massive text dataset. It needed a number of technological breakthroughs to reach the scale that is typical for modern frontier LLMs. Earlier attempts to model language at a larger scale were struggling with compute limitations. Optimizations in parallel computing, algorithms and model architecture finally enabled the successful computing of large datasets into billions of trained parameters. While the term LLM is the most common term, there are also SLMs (Small Language Models) like Qwen-0.5B, or MLLMs (Multimodal Large Language Models) with the ability to reason on text, images, audio and video. Compare GPT.

LLM Runner

or "LLM inference engine" or "LLM inference and execution framework" refers to software tools that actually run large language models. LLM runners come in different forms: some are full-featured software suites that help with model installation, model management, inference execution and provide API access once a model is loaded, others are bare-bones execution wrappers that can load a model and run inference. The easiest start is usually Ollama. vLLM is popular for high-workload production deployments. ONNX is a popular choice in local JS based applications. Note that these runners are usually console-based applications meant to run text-based GenAI models, which is a related but different technology to that used to run image generation models. To run models like "Stable Diffusion" to generate AI images, specific "diffusion model frameworks" are used. Compare Diffusion inference framework for such apps. Here is a list of LLM inference engines:

vLLM
High-performance framework for server-side deployments.
llama.cpp
Lightweight C++ implementation for local CPU/GPU inference.
Ollama
Go wrapper around llama.cpp, easy to use.
Text Generation Inference (TGI)
HuggingFace runner for production-serving, GPU-optimized.
Hugging Face Transformers
Standard Python library for training, fine-tuning, and inference of many transformer models.
FastEmbed
Dedicated runner for embedding generation.
ONNX Runtime
Cross-platform inference engine, easy to glue with Node.js scripts, Python, etc.
Text Generation WebUI / AutoGPT WebUI Runners
Browser-based or local GUI runners for interactive model use.
GPT4All
Pre-packaged runners for quick experimentation or offline usage.
DeepSpeed Inference / HuggingFace Accelerate
Optimized runners for distributed GPU inference and high-performance production pipelines.
Lemonade (lemonade-server.ai)
A multi-modal local AI inference server with a strong focus on AMD hardware.

Logit

Large Language Models are multi-layered constructs. When an LLM processes text, it converts human-readable text into "Tokens" and then these Tokens into Embeddings. Based on the structure of the model, while input is transferred through these layers, the model's network does its actual "associative work" in various stages of filtering and weighing. At the end of this pipeline, when a model produces its output (via a weight matrix), it emits a specific 'raw score value' for each candidate next token in its vocabulary, the "Logit". Logits are usually floating-point values but may be quantised to integers. By applying a normalizing exponential function on Logit values (usually the softmax function), a probability distribution over K possible outcomes is calculated, the "probability for the next Token". These probabilities can then be used to calculate Perplexity for a given token stream or for the model's output as a whole. Logits do not carry a semantic meaning, while "Embeddings" do. Compare "Token" and "Perplexity".
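The step from logits to next-token probabilities can be illustrated with a small softmax sketch. The three-token vocabulary and the logit values are made up for illustration; a real model emits one logit per entry of a vocabulary with many thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary.
vocab = ["cat", "dog", "car"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

The ordering of the logits is preserved: the token with the highest raw score also receives the highest probability.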

LoRA Fine-Tuning

LoRA is short for "Low-Rank Adaptation" and is an optimized, less compute- and memory-intensive way to fine-tune a foundation model. LoRA freezes the original weights and adds small low-rank adapter matrices in parallel to existing weight matrices inside the transformer blocks of a foundation model; only these adapters are trained. For inference, the learned correction weights can be merged into the original weights, so that no extra latency is added. This approach is much cheaper than a full fine-tuning pass. When the foundation model is additionally quantized before LoRA fine-tuning, the process is known as QLoRA. Compare Fine-Tuning and Foundation Model.
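A minimal numeric sketch of the low-rank idea, in plain Python: a frozen weight matrix W gets a correction B·A, where B and A are much smaller than W. All matrix values are hypothetical, and the scaling factor that real implementations usually apply to the adapter is left out for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Add two matrices of the same shape element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen base weight W (4x4) of a hypothetical layer, here the identity matrix.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Trainable low-rank adapter with rank r = 1: B is 4x1, A is 1x4.
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 2.0, 0.0, 0.0]]

x = [[1.0], [1.0], [1.0], [1.0]]  # input as a column vector

# Adapter path runs in parallel to the frozen weight: y = W x + B (A x)
y_parallel = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged path: fold the adapter into the weight once, W' = W + B A
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

full_params = 4 * 4           # values a full fine-tuning pass would update
adapter_params = 4 * 1 + 1 * 4  # values the low-rank adapter trains instead
print(y_parallel == y_merged, full_params, adapter_params)
```

Both paths give the same result, which is why the adapter can be merged for inference. The savings look small at 4x4 but grow quickly: for a 4096x4096 matrix and rank 8, the adapter holds 65,536 values instead of roughly 16.8 million.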

Machine Learning

is a discipline under the umbrella of artificial intelligence research. Machine learning (ML) centers on statistical algorithms that can learn from data and apply or transfer such learned patterns in a generalization to unseen data. The term was coined by Arthur Samuel in 1959 at IBM during research of the game of checkers. Common implementations are in the form of artificial neural networks that start out randomly initialized and form weights between their structural nodes in alignment with what is found in a training dataset. Resulting trained networks can then be used to continue value arrays or predict similar patterns in an inference step.

MCP

is short for "Model Context Protocol", an open interface standard defined and released by Anthropic in 2024. It offers a structured way of connecting "Tool Calling"-capable AI systems with external knowledge resources, tools, or software environments. MCP defines a universal interface to enable a client to read data and execute functions on a local or remote server through a consistent JSON‑RPC 2.0 protocol. In addition to handling actions (using tools) and reading data (accessing resources), MCP can manage prompts, which are ready-made templates an AI assistant can present to the user for selection to decide the trajectory of an interaction. This protocol design was inspired by concepts from the Language Server Protocol (LSP) standard, which was invented to solve fragmented, per-language plugin headaches with code IDEs. The MCP standard analogously solves the problem of numerous AI applications requiring numerous brittle connectors that demand constant maintenance. Due to this, industry adoption is wide and, with MCP, Anthropic has effectively defined an industry standard for AI data integrations. Compare Tool Calling.
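To give a feel for the wire format, here is a sketch that builds two JSON-RPC 2.0 request messages of the kind an MCP client sends. The method names "tools/list" and "tools/call" follow the MCP specification as far as the author of this example is aware; the tool name "get_weather" and its arguments are purely hypothetical, and transport, initialization handshake and responses are left out.

```python
import json

def make_request(request_id, method, params=None):
    """Build a JSON-RPC 2.0 request object as a Python dict."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message

# Ask the server which tools it offers.
list_tools = make_request(1, "tools/list")

# Call a hypothetical tool named "get_weather" with hypothetical arguments.
call_tool = make_request(
    2,
    "tools/call",
    {"name": "get_weather", "arguments": {"city": "Vienna"}},
)

print(json.dumps(list_tools))
print(json.dumps(call_tool, indent=2))
```

Because every server speaks this same message shape, a client needs one connector implementation instead of one per tool.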

Mechanical Turk

The Mechanical Turk was an infamous automaton that pretended to contain some magic form of artificial intelligence in order to masterfully play the game of chess. In fact, the machine was not actually doing this through some elaborate form of automation but the mechanical arm on top moving the pieces was controlled by a person hidden inside the cabinet. To this date, the Mechanical Turk is used as a symbol for artificial intelligence that may in fact be just an illusion of actual understanding or human thought. The name comes from the machine's design, which featured the puppet of an intellectual nobleman from the Middle East. Amazon Web Services offers a platform where a large anonymous workforce of human workers can be controlled via API to complete tasks that are currently difficult to solve by a machine. The service is called MTurk in a nod to the historical automaton, with Amazon founder Jeff Bezos calling it "artificial artificial intelligence".

MegaHAL

is a module of the chatbot "Hex" by Jason Hutchens that won the 1996 Loebner-Prize. MegaHAL builds a third-order Markov chain to derive a language model from a training language corpus. It does so by extracting word pairs and evaluating their combinatorial validity and transitions to other words in a probability weight graph. Once the graph has been trained, it can be used to answer arbitrary user input text with keyword-aligned response text similar in topic while generating language where words are organized in a sensible way. Hutchens chose the name HAL in a nod to HAL 9000 from the 1968 motion picture "2001: A Space Odyssey". There is an updated Perl re-implementation of MegaHAL named FreeHAL by Tobias Schulz. Compare Eliza, Alice and Chatbot.
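The general principle of a Markov chain language model can be sketched in a few lines. Note that this is a deliberately simplified first-order chain over a two-sentence toy corpus, not MegaHAL's actual algorithm, which uses a higher-order model and keyword-driven reply selection.

```python
import random
from collections import defaultdict

def train(corpus):
    """Record which word follows which (a first-order Markov chain)."""
    transitions = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            transitions[current].append(following)
    return transitions

def generate(transitions, start, max_words=8, seed=0):
    """Walk the chain from a start word, picking observed successors at random."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words and words[-1] in transitions:
        words.append(rng.choice(transitions[words[-1]]))
    return " ".join(words)

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train(corpus)
print(generate(model, "the"))
```

Successors that were seen more often appear more often in the list and are therefore picked with higher probability, which is the "probability weight graph" in miniature.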

Model

although in the context of AI, the term "model" is used ubiquitously and often as a shorthand or abbreviation, it actually describes a defined technical construct. A model is a mathematical structure (for example an artificial neural network) of parameters, in which these parameters have been adjusted in such a way through training that the trained model is able to predict, classify, or generate sensible output for a specific input. In its initial state, a model can be thought of as a neutral structure of interlinked parts. Each part, the nodes, now forms linking branches to other nodes, carrying weight and bias values, the parameters. During training, the structure "learns" which values a link between nodes must apply in order to form a sensible "direction" within the network. Once this training is over, the model has taken on a specific state and when input is fed through this system, the model parameters are able to transform the input according to the learned weights and produce the model's output. This process, on a high level, is how both LLMs and diffusion models work. Their behaviour is characterized by emergent effects and does not follow strict programming, despite the model's technically algorithmic layout. Compare Algorithm and Heuristic.

Model files

is not a defined term but more an industry practice. It refers to sidecar files that commonly accompany weight files. Such model files, similar to a manifest or config file, usually define the core configuration of a model and add metadata in various degrees of standardization. Files like the Hugging Face model card + config.json, PyTorch's state_dict + config, or GGML-style .json/.yaml manifests tell model users and inference engines which parameters were used during model creation or which variables to set for inference and to which value. Things like temperature settings, path pointers to the actual model binaries/weights, system prompts, input prompt templates telling the user what syntax the model is able to understand, or output format guidelines can all be found in model files. Ollama literally uses a file named "Modelfile" with structured inference metadata. Model files usually answer which "roles" a model expects or if it is configured to generate "structured output", like JSON.

Model formats

AI models, or more precisely the model "weights" or model binary files, come in various formats. These formats, as it is common with computer software, are only compatible with a certain (LLM or diffusion) runner framework. Ollama, a popular inference engine for LLMs, has its own way of storing a model, and similarly other frameworks do as well. Ollama uses the GGUF format which is based on llama.cpp workings but stores this data in a custom blob-structure in a directory tree. vLLM by contrast, expects model weights to be in Hugging Face-compatible PyTorch/Transformers checkpoints (commonly suffixed with .safetensors or .bin, including a config.json sidecar file). As model files are usually quite large, it can make sense to convert between formats. There are tools to extract the GGUF file format from Ollama blobs. vLLM supports GGUF as an alternative format natively. Aside from these LLM-centric file formats, ONNX (Open Neural Network Exchange) is another open format designed to serve as an interoperable bridge across the wider machine learning landscape. It decouples model data from architecture and hardware dependencies and allows weights data to be executed on diverse software and hardware stacks. Compare GGUF.

MXNet

"Apache MXNet", sometimes stylized as "Apache mxnet", is an open-source deep learning software framework that provides tools to train and run deep neural networks. Released initially in 2015, it saw wide adoption during the AI renaissance that commenced with the deep learning boom of the mid-2010s. It was backed by Amazon and shepherded by the Apache Software Foundation. When the research community shifted focus to PyTorch, development eventually dried up and resulted in the Apache Software Foundation moving the project to the "attic" in 2023, declaring the project as mostly abandoned. Compare PyTorch and TensorFlow.

Naive Bayes Algorithm

The Naive Bayes algorithm is one of the classic machine learning algorithms. It builds on Bayes' theorem, formulated by Thomas Bayes in the 18th century, and its inverse probability approach is helpful in various fields of machine learning to this day, especially in text classification. At its core, the algorithm calculates conditional probabilities and places these values into an array or matrix, where probabilities are then multiplied. The "naive" variant here assumes that all features (i.e., words in a text) are statistically or conditionally independent of each other. While not perfect, the algorithm represents a solid baseline in output quality and is still widely used in classification where real-time results or cheap computation are mandatory. In comparison to more elaborate algorithms, Naive Bayes usually either scales better, trains faster, or consumes fewer resources.
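A compact text classifier shows the mechanics. This is a minimal sketch on a made-up four-sentence training set; it adds logarithms instead of multiplying probabilities and uses add-one (Laplace) smoothing, both common practical choices that are assumptions of this example.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest posterior score for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Work in log space: multiplying probabilities becomes adding logs.
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(samples)
print(predict("cash prize offer", model))  # spam
```

Each word contributes its own factor independently of its neighbours, which is exactly the "naive" assumption.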

Named Entity Recognition

abbreviated as "NER" and often only "Entity Recognition" is the process of locating, identifying and labeling significant pieces of information (entities) within spoken or written text. For a human, it is trivial to dissect the structure and find the relevant aspects in an utterance or sentence, such as who is speaking, about what, in relation to what or whom, meaning or intending this or that. For computers, text is just a stream of characters, and splitting this stream into useful chunks and then extracting meaning is an ongoing challenge in Natural Language Processing (NLP). In this field, Entity Recognition is a subtask of extracting information from text by identifying important parts. Entities can be recognized via dictionaries, placement within sentence structure, their location adjacent to specific words, capitalization, unusual character combinations or a combination thereof. Common entity categories are names/person, location, date/time, organizations, events, product/model or specific technical or medical terms. In information retrieval, NER is a central concept to improve search results. In NLP it is crucial to model rule-based message understanding. With the advent of deep learning models for language (LLMs), the task of NER has made big leaps, as these context-aware models can basically extract any vaguely conceptualized object from language, based on their vast real-world training data. As part of a RAG pipeline, extracting important entities is pivotal to feed them as keywords into a parallel search request, fetching context data. For example, an AI support chatbot could be fed with dynamic data from a large knowledge database when the user asks for a specific product name or service. Compare AI Search and Part of Speech.
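The classic, pre-neural recognition cues mentioned above (dictionary lookup, character patterns, capitalization) can be combined in a small sketch. The dictionary entries, labels and the date pattern are hypothetical, and the capitalization rule is intentionally crude.

```python
import re

# Hypothetical gazetteer (dictionary) of known entities.
GAZETTEER = {"Berlin": "LOCATION", "Lisbon": "LOCATION", "Micropolis": "ORGANIZATION"}
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def recognize(text):
    """Return (entity_text, label) pairs found by pattern, dictionary, and capitalization."""
    entities = [(m.group(), "DATE") for m in DATE_PATTERN.finditer(text)]
    tokens = re.findall(r"[A-Za-z]+", text)
    for position, token in enumerate(tokens):
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
        elif position > 0 and token[0].isupper():
            # Capitalized word that is not sentence-initial and not in the dictionary.
            entities.append((token, "UNKNOWN_NAME"))
    return entities

print(recognize("Yesterday Alice flew from Berlin to Lisbon on 2026-05-01"))
```

In a RAG pipeline, the recognized entity strings would be the keywords handed to the parallel search request.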

Non-Deterministic

is the contrary of deterministic. It labels, for example, a system or process where the same input does not necessarily produce the same exact output, rendering its output inherently unpredictable even for the system's operators. Non-deterministic behavior can originate from algorithmic randomness (like random seeds), deliberately probabilistic decision mechanics (e.g., LLM temperature), or from complex internal states that cannot be easily reproduced (e.g., race conditions or asynchronous processes).
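The effect of seeds and temperature can be made tangible with a toy sampler. The logit values are hypothetical, and treating temperature 0 as plain greedy selection is a convention assumed for this sketch.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample an index from logits; temperature 0 falls back to greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def sample_sequence(logits, temperature, rng, length=5):
    """Draw several tokens in a row from the same random number generator."""
    return [sample_token(logits, temperature, rng) for _ in range(length)]

logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

# Greedy decoding: the same input always yields the same choice.
greedy = sample_sequence(logits, 0, random.Random())

# Sampling with a fixed seed is reproducible; without a seed it may vary per run.
seeded_a = sample_sequence(logits, 1.0, random.Random(42))
seeded_b = sample_sequence(logits, 1.0, random.Random(42))
unseeded = sample_sequence(logits, 1.0, random.Random())

print(greedy, seeded_a == seeded_b, unseeded)
```

The unseeded line is the non-deterministic case: running the script twice can print different sequences for identical input.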

Ollama

is a runner application/framework for LLMs. It is written primarily in Go (Golang) and acts as a user-friendly wrapper for llama.cpp, an MIT-licensed open-source LLM inference library in C/C++, and helps with orchestrating different models. Ollama's signature feature is the ease of installation and how quickly users get first results. As with many things in early AI, just getting something to run or work was quite difficult. At the time when Facebook/Meta released their LLaMA AI model, many interested developers were still scratching their heads about how all the puzzle pieces needed to test-run such LLMs fit together. This is where a group of AI-savvy developers in Silicon Valley sat together and developed the Ollama framework. The idea was to give the average developer a "simple to install and get running" cross-platform testbed that is able to run an LLM locally, even without GPU, and interact with it through CLI or API. In a nod to Facebook's "LLaMA" model, it was nicknamed "Ollama" and emerged from an ecosystem of similarly named projects, like "Alpaca", "Dalai", etc. Following Ollama's big success, the project formed a proper investor-backed business and is today also offering Cloud LLMs like other AI platforms to allow developers to migrate seamlessly from a local installation to cloud-based resources.

Installing and getting Ollama to run on Linux

	Install via Shell script or snap:
	$ snap install ollama
	
	After installation, Ollama has no models, so running
	$ ollama list
	
	returns an empty list.
	The list of available models in Ollama's library is here:
	https://ollama.com/library
	
	You can pull one of the models via CLI:
	$ ollama pull llama3.2
	
	llama3.2 is a good start. Also on machines without GPU.
	For its size (~2GB download, 2GB RAM in use) it's quite smart and quick in answering.
	Other model examples: qwen2.5:1.5b (extremely efficient models by Alibaba)
	$ ollama pull qwen2.5:1.5b
	
	Once installed, you can run any of the pulled models via CLI:
	$ ollama run llama3.2

On the command line (CLI) you will then be offered an interactive chat session. In addition to the CLI, Ollama starts a server process in the background, and this server can be accessed locally via an HTTP API that also offers OpenAI-compatible endpoints:

	
	Try opening localhost on the default ollama port 11434 in browser:
	http://localhost:11434
	
	and ollama's serve process will greet you with "Ollama is running".

This server offers some endpoints, like "/api/generate" for simple text generation or "/api/chat" for conversational chats. The Ollama CLI interface is using exactly this latter endpoint. In addition, to help migrate applications easily between OpenAI cloud LLMs and local Ollama, ollama offers OpenAI-compatible endpoints under OpenAI-style URIs. For example "/v1/chat/completions" for OpenAI-style chat conversations. Anthropic Messages-API-compatible endpoints are served under the "/v1/messages" path.
Docs are at: https://docs.ollama.com/api/introduction

Example request:

	curl http://localhost:11434/api/generate -d '{
	  "model": "llama3.2",
	  "prompt": "Hello!"
	}'
	
	curl http://localhost:11434/api/chat -d '{
	  "model": "llama3.2",
	  "stream": true,
	  "messages": [{ "role": "user", "content": "Hello!" }]
	}'
	
	curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
	  "model": "llama3.2",
	  "stream": true,
	  "messages": [{ "role": "user", "content": "Hello!" }]
	}'
	
	curl http://localhost:11434/api/generate -d '{
	  "model": "llama3.2",
	  "prompt": "Why is the sky blue?"
	}'

or, without streaming and with JSON-formatted (structured) output:

	curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
	  "model": "llama3.2",
	  "messages": [{"role": "user", "content": "Tell me about Canada in one line"}],
	  "stream": false,
	  "format": "json"
	}'
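The same chat request can also be sent from a script. The following is a minimal sketch using only the Python standard library; it assumes a local Ollama server on the default port with the llama3.2 model already pulled, and it assumes the non-streaming response carries the reply under "message" and "content". The function names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_request(model, user_text):
    """Build the same JSON body as the curl example, with streaming disabled."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model, user_text):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_text), timeout=120) as response:
        body = json.load(response)
    return body["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("llama3.2", "Hello!"))
    except OSError as error:  # raised when no Ollama server is reachable
        print("Could not reach Ollama:", error)
```

Setting "stream" to false returns one complete JSON object instead of a sequence of partial chunks, which keeps the parsing trivial.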

OpenClaw

is an open-source self-hostable private AI Agent framework. In early 2026 it took many technically interested circles by storm as it offered a simple installation wizard and combined a slew of then-available but fractured tools, plugins, API endpoints and mechanisms under one unified interface. Through a combination of broad permissions and immediate execution of its own LLM-generated instructions, OpenClaw is able to produce surprising and impressive results in a matter of minutes. But as an LLM cannot think, all of this is unreflected and unfiltered by common sense. OpenClaw is a digital accelerant and an important contribution to the AI safety and ethics debate. The project was kicked off by Austrian software developer Peter Steinberger, who decidedly employed the mode of Vibe Coding to develop the OpenClaw source code. He considers it to be a bold and highly experimental undertaking and an early proof-of-concept. Initially called "Clawdbot" and "Moltbot", the software was later renamed to "OpenClaw" due to naming conflicts. The code is released under the MIT License as this license explicitly excludes any liability for damage done. In its current state, with large portions of unreviewed source code and incomplete testing, and at the same time the installer wizard urging users to give the AI agent unlimited control over a system, network resources, passwords, accounts, private data etc., computer experts consider OpenClaw highly insecure and potentially dangerous software.

Open Weights

is for artificial intelligence models what open-source is for software. When a trained model is released under a permissive license, the model files, called "weights" can be studied, run, used, fine-tuned by developers, although the actual training or generation of these weights is not (fully) disclosed. Training datasets are usually withheld, just like the exact program code and compute pipeline that was instrumental in producing the final model weights. That said, the generation architecture or partial code may have been published in papers, technical documentation or public code repos. Compare Weights.

Parameters

In AI, "parameters" refers to the internal workings of deep neural networks. Modern architectures, such as LLMs, are made up of layers of transformer blocks where processed tensors are mathematically transformed in what can be visualized as an arrangement of individual mathematical nodes. These nodes carry fixed adjustment values that were learned during training. Each node commonly transforms tensors by applying weights (multipliers) and biases (offsets), which are subsumed under the umbrella term parameter. Although the inner workings of modern LLMs are more complex and involve large-scale matrix multiplications across multiple layers, the total number of these weights and biases, the number of parameters, still serves as a useful metric for the capabilities of a model. This is why the parameter count is frequently used in model names and AI discussions, for example "7B" for a 7-billion-parameter model.
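How a parameter count comes about can be shown with simple arithmetic on fully connected layers. The layer sizes below are a hypothetical toy network, far smaller and simpler than a transformer, but the counting principle (weights plus biases) is the same.

```python
def dense_layer_params(inputs, outputs):
    """A fully connected layer has one weight per input-output pair plus one bias per output."""
    weights = inputs * outputs
    biases = outputs
    return weights + biases

# Hypothetical tiny network: 784 inputs -> 128 hidden units -> 10 outputs.
layer_sizes = [784, 128, 10]
total = sum(dense_layer_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)  # 101770

# Rough memory estimate for a 7B model stored with 2 bytes per parameter.
print(7_000_000_000 * 2 / 1e9, "GB")  # 14.0 GB
```

The second calculation is why parameter counts matter in practice: they translate almost directly into memory requirements.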

Parquet file

Apache Parquet is a file format that is frequently encountered in artificial intelligence, model training, and dataset-related fields. It evolved from the Apache Hadoop ecosystem where big data is handled for search or analytics. The name comes from the French "parquet", which is patterned wooden flooring, meant to symbolize that this file format forms an interestingly structured bottom layer of a database system. The important detail to understand about Parquet is that it uses a somewhat novel data orientation, meaning how data is written to linear memory. While row-oriented file formats store record after record serially, for example, to disk, a column-oriented layout writes the values of all rows for one column serially one after another, before moving on to the next data column. This enables Parquet files to be compressed, written, accessed and read very efficiently when insights about metrics of a single column are to be queried, for example averages. But as with any data orientation decision, this layout poses an architectural trade-off. When one record is added to a columnar store (single-row insertion), the file structure theoretically has to be "broken up" after each column stream only to add one more record. Parquet files are optimized for read, not write. Similar to ZIP or many video containers, metadata is appended at the end of a file. Parquet files are usually static. Updates are typically appended in batches, with files becoming fragmented until an optimizing rewrite is scheduled. Some database engines allow Parquet files to be plugged in unchanged. Similar file formats are RCFile and ORC ("Optimized Row Columnar"); Parquet's handling of nested data follows the approach described in Google's Dremel paper. Today, Parquet is popular for exchanging Hugging Face datasets, in machine learning training pipelines, in big data distributed compute environments like MapReduce, in cloud data warehouses or NoSQL data environments.
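The difference between row and column orientation can be sketched with plain Python lists standing in for linear storage. The three records are made up, and the sketch ignores everything a real Parquet file adds on top (row groups, encoding, compression, footer metadata).

```python
# Three hypothetical records with the fields (id, city, temperature).
records = [
    (1, "Oslo", 4.5),
    (2, "Rome", 18.0),
    (3, "Cairo", 29.5),
]

# Row-oriented layout: record after record.
row_layout = [value for record in records for value in record]

# Column-oriented layout: all values of one column, then the next column.
columns = list(zip(*records))
column_layout = [value for column in columns for value in column]

print(row_layout)
print(column_layout)

# An average over one column touches one contiguous slice in the columnar layout.
n = len(records)
temperatures = column_layout[2 * n : 3 * n]
print(sum(temperatures) / n)
```

In the row layout, the same average would require stepping through every record and skipping the unrelated fields, which is the read-side cost that columnar formats avoid.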

Part of Speech

(POS) refers to the grammatical function of a word in the context of a sentence. In Natural Language Processing (NLP), "POS-Tagging" is an automated discipline that tries to algorithmically apply category labels to words in order to analyze the syntactical structure of a text. POS-Tagging is an important base for traditional automated intent detection or text classification that does not rely on modern deep neural networks. Also compare Named Entity Recognition.

Perceptron

is the name of the simplest form of an artificial neural network, consisting of only one artificial neuron. It features inputs with weights, computes a weighted sum of these inputs and passes the result through an activation function. When Frank Rosenblatt laid out this concept in 1958, the activation function was a simple step function, allowing the perceptron to do simple binary classification into 0 or 1. As such, a perceptron can only solve linearly separable problems. This means it can solve the logical operations AND and OR, but fails on XOR. The link to modern artificial neural networks is that ANNs are a network of many artificial neurons, arranged in what is called "layers". In contrast to the original step-function perceptron, Multi-Layer Perceptrons (MLPs) and modern deep neural networks use continuous, differentiable activation functions in their nodes (often called "artificial neurons" or "units"), which is what makes gradient-based training possible. Compare Sigmoid transfer functions and GPT.
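The AND/OR versus XOR behaviour can be reproduced with a short sketch of the classic perceptron learning rule. Epoch count, learning rate and the zero initialization are arbitrary choices for this example.

```python
def step(value):
    """Step activation: fire (1) only for a positive weighted sum."""
    return 1 if value > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1):
    """Perceptron learning rule on two inputs; returns (weights, bias)."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - step(weights[0] * x1 + weights[1] * x2 + bias)
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
            bias += learning_rate * error
    return weights, bias

def accuracy(samples, weights, bias):
    """Share of samples the trained perceptron classifies correctly."""
    hits = sum(step(weights[0] * x1 + weights[1] * x2 + bias) == t for (x1, x2), t in samples)
    return hits / len(samples)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GATES = {
    "AND": [0, 0, 0, 1],
    "OR": [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}
results = {}
for name, targets in GATES.items():
    samples = list(zip(INPUTS, targets))
    weights, bias = train_perceptron(samples)
    results[name] = accuracy(samples, weights, bias)
    print(name, results[name])
```

AND and OR reach an accuracy of 1.0, while XOR stays below it no matter how long training runs, because no single straight line separates its two classes.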

Perplexity

is a technical term from information theory and a measure used in AI research. The magnitude of perplexity describes the degree of uncertainty for a discrete probability distribution. Discrete means there is a defined number of possible outcomes, for example 6 in a dice roll or 2 in a coin flip. Accordingly, the perplexity of a fair coin flip is 2 and that of a fair dice roll is 6. The technical term was first used in the article "Perplexity - a measure of the difficulty of speech recognition tasks" in The Journal of the Acoustical Society of America in 1977, written by Jelinek, Mercer, Bahl and Baker. Perplexity measures not "chance" but uncertainty or more precisely the "effective branching factor" in a situation, like deciding which way to follow at a fork in a path. The term rose to prominence in AI research as well, where it can be used to describe, simply put, the confidence of an LLM (or "its own surprise") when predicting the next token, the next word. Or rephrased: Perplexity is calculated from a logarithmic assessment of likelihood, averaged over multiple steps. Lower values describe a more confident prediction here; unlike with probability, smaller is better. By aligning perplexity with benchmark values and measuring performance of different models or model generations, it is possible to evaluate overall model quality or training progress. Perplexity in mathematical notation is often "PPL(X)" (read "perplexity of X"). Compare "Token" and "Logit".
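Both readings of perplexity can be computed in a few lines: for a single distribution it is 2 raised to the entropy in bits, and for a token stream it is the exponential of the average negative log-likelihood. The token probabilities in the second half are made-up values standing in for what a model assigned to the tokens that actually occurred.

```python
import math

def distribution_perplexity(probabilities):
    """Perplexity of a discrete distribution: 2 raised to its entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

def sequence_perplexity(token_probabilities):
    """Perplexity of a token stream: exp of the average negative log-likelihood."""
    avg_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(avg_nll)

print(distribution_perplexity([0.5, 0.5]))    # fair coin
print(distribution_perplexity([1 / 6] * 6))   # fair die
print(sequence_perplexity([0.9, 0.8, 0.95]))  # confident predictions
print(sequence_perplexity([0.2, 0.1, 0.25]))  # uncertain predictions
```

The confident sequence yields a perplexity close to 1, the uncertain one a much larger value, matching the "effective branching factor" intuition.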

Apart from the technical term, which is not so well known outside AI, the word "perplexity" is often associated with the web app of Perplexity AI, Inc. from San Francisco. The venture-backed startup rose to prominence during the AI boom of 2022/2023 and became one of the main players in this wave of companies. The company offers an LLM-supported query interface that distills colloquial answers to user input from a combination of large-scale traditional web search and language model reasoning.

Postings

is a term used in the field of databases and/or search engines. It is a term for a generic lookup structure, where entries on a list represent a one-to-one relation between two things. Postings usually contain additional data like frequency or weight to optimize similarity searches or lookup queries. In traditional lexical (keyword-based) search engines, a posting is an entry on an inverted index structure known as postings list. It maps document tokens (words or word-fragments) to a certain document that contains this token. In vector databases, one specific indexing algorithm called IVF (Inverted File Index) also uses postings lists, but for geometric vector spaces. Here, as part of segmenting the whole vector space for optimization, individual vectors are grouped into clusters and then referenced by their centroid ID. And the list of vectors per centroid ID is also labeled a "postings list", with the vectors on such a list being individual postings.
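For the lexical search case, a postings list can be sketched as follows. This is a toy inverted index, assuming whitespace tokenization and term frequency as the only additional data per posting; the sample documents are invented.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each token to its postings list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for token, freq in sorted(Counter(text.lower().split()).items()):
            index[token].append((doc_id, freq))
    return index

def search(index, query):
    """Return ids of documents that contain every query token (AND query)."""
    postings = [
        {doc_id for doc_id, _ in index.get(token, [])}
        for token in query.lower().split()
    ]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "the cat sat on the mat", 2: "the dog sat", 3: "cat and dog"}
index = build_index(docs)

print(index["the"])              # [(1, 2), (2, 1)]
print(search(index, "cat sat"))  # [1]
```

An IVF index follows the same lookup idea, only that the key is a centroid ID and the postings are the vectors assigned to that cluster.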

Power Consumption

Efficiency in artificial intelligence operations is a key factor. Few technologies in computing history have increased the demand for raw electric power as dramatically as AI. With compute-heavy workloads like inference operations, and data centers operating at peak efficiency with Power Usage Effectiveness (PUE) ratios approaching 1.0 for hyperscaler deployments, AI compute boils down to being the direct equivalent of power.

Illustration: Functional Layers of a Data Center. Down the dependency chain, it all comes down to power.

Pre-Training

Artificial neural networks start out as randomly initialized mathematical structures. Pre-training is the standard first step to train ANNs and LLMs, meaning the model is exposed to training data so its parameters align with patterns that are present in this dataset. It is called "pre" training as this initial phase leads to a foundation model that is commonly "fine-tuned" before its final application. For large language models, pre-training is a compute-intensive operation, as text processing and the required backpropagation to tune potentially billions of parameters require massive resources. Only a few organizations are able to facilitate this for the largest frontier models. Compare Fine-Tuning and Foundation Model.

Prefill

is the first phase of the main inference work done in a model. It is the phase where an input prompt is actually processed and embedding and attention are calculated for each token of the input. This is the most compute intensive phase of running a model. It is called "prefill" because the computation results are "filled" into the KV Cache (residing in RAM, usually referred to as "VRAM" on a GPU) for the following Decoding step. On a GPU, this task can be split into many parallel token calculations. Conceptually, separating prefill from decoding stems from GPU optimizations. Compare Decoding.
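The relation between prefill, KV Cache and the decoding step can be illustrated with a toy example. This is a heavily simplified sketch under loud assumptions: a single attention head, a made-up sine-based embedding, and identity projections for keys and values. Real models use learned weights, many layers and many heads.

```python
import math

def embed(token_id, dim=4):
    """Hypothetical toy embedding; real models use learned vectors."""
    return [math.sin(token_id * (i + 1)) for i in range(dim)]

def attend(query, keys, values):
    """Single-head scaled dot-product attention over cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[i] for w, v in zip(weights, values))
        for i in range(len(query))
    ]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

def prefill(prompt_ids, cache):
    """Process the whole prompt and fill the cache with one entry per token."""
    for token_id in prompt_ids:  # independent per token, hence parallelizable
        x = embed(token_id)
        cache.keys.append(x)     # toy assumption: identity key projection
        cache.values.append(x)   # toy assumption: identity value projection
    return attend(cache.keys[-1], cache.keys, cache.values)

def decode_step(token_id, cache):
    """Process ONE new token, reusing everything already in the cache."""
    x = embed(token_id)
    cache.keys.append(x)
    cache.values.append(x)
    return attend(x, cache.keys, cache.values)

cache = KVCache()
prefill([3, 1, 4], cache)
output = decode_step(1, cache)
print(len(cache.keys))  # 4 cached entries: three from prefill, one from decoding
```

The point of the cache is that the decoding step only computes one new key and value instead of reprocessing the whole prompt.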

Prompt

In the olden days of computing, a prompt was the blinking cursor on a black and white telnet terminal. The blinking urged the user to enter something, to start typing. It signalled: the computer is ready to take your commands. A prompt, traditionally, was answered by some form of rule-based syntax the computer could understand. Entered commands centered around verbs or keywords, alternated with numeric or textual values. The computer, in turn, was unforgiving with typos or the user deviating from the norm. It was asking for a specific pattern that had to be learned by the user. Such text-based prompt interfaces were opaque, syntax-heavy and the learning curve was steep. Yet, to this day, the CLI (Command Line Interface) is still the norm for low-level system control. Over the years, there were occasional experiments to make computers more human, to align the interface with what users, humans, normally do in communication. But natural language parsing advanced only slowly and everyday language continued to escape rigid parsing algorithms. When graphical user interfaces appeared, the point-and-click nature of telling a computer what to do covered up that behind these much more accessible interfaces the command nature was still the same. But as an image is worth a thousand words, the desktop metaphor and windowed user interfaces hold up to this day, and decades of research put into finding intuitive usage patterns or nudging users by assistive technologies have eased the chores of using a computer a lot. With the advent of AI technologies and the introduction of LLMs, telling a computer what to do became what engineers had dreamed of for many years. Even technical newbies can now tell a computer what to do, what their intent is, and the computer will answer such "input prompts" either with the wanted output or assist the user in getting what she or he was going after.

Prompt Caching

is a feature of many Cloud LLMs that is meant to help with latency (time to first token, TTFT) and lower processing costs for customers. Prompts issued against cloud LLM APIs often contain repetitive content. This is especially true for chat completion style prompts. With these interactions, a series of subsequent assistant and user messages is preceded by one System Prompt that gives the LLM the most important instructions on role and tone of the following conversation first. Also, it is common to provide essential facts or conversation structure to the model as a Few-Shot Learning baseline. These System Prompts thus can run quite large and typically contain static or rarely updated data, and consequently it makes sense to cache these first input tokens on the LLM vendor side. How exactly this is implemented is usually not disclosed and varies slightly from vendor to vendor. One common requirement though is that prompt tokens meant to be cached must be static and exactly the same between requests. So in order to enable early tokens to be cacheable, clients need to provide truly static content to the remote LLM first and only after this append variables like day-to-day instructions, time or locale based facts, user-specific instructions or any other variable content. Some vendors require clients to insert specific break-marks after the to-be-cached leading token content. As prompt caching improves efficiency on both ends, for vendors and customers, it is often an auto-enabled feature and can lower costs quite dramatically. Compare System Prompt and Few-Shot Learning.
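The "static content first, variables last" rule can be sketched on the client side like this. The system prompt text and the hash-based `prefix_key` function are hypothetical and only illustrate why the leading content must stay byte-identical; actual vendors do not disclose how they key their caches.

```python
import hashlib
import json

STATIC_SYSTEM_PROMPT = (
    "You are a support assistant for ACME. Answer briefly and politely."
)

def build_messages(user_question, today):
    """Static content first, variable content last, so the prefix never changes."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": f"Today is {today}.\n\n{user_question}"},
    ]

def prefix_key(messages, n_static=1):
    """Hypothetical cache key: a hash over the leading static messages."""
    prefix = json.dumps(messages[:n_static], sort_keys=True)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

monday = build_messages("Where is my order?", "2026-01-05")
tuesday = build_messages("How do I reset my password?", "2026-01-06")

# Same prefix on both days, so a vendor-side cache could be reused.
print(prefix_key(monday) == prefix_key(tuesday))  # True
```

Had the date been written into the system prompt instead, the prefix would change daily and nothing could be reused.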

Prompt Engineering

is the practice of designing and iteratively refining a user prompt submitted into an ANN in order to influence and ultimately optimize the quality of the generated output. A common characteristic of most AI systems is that overall user input quality significantly correlates with the helpfulness or intent alignment of the generated output ("garbage in, garbage out", GIGO). Thus, improving structure and wording, adding context or examples, and guiding a model through constraints or limitations usually improves model performance. Sometimes small wording changes result in big behavioral differences. This finding resulted in projects like Fabric, where users collect and exchange winning prompts. While prompt engineering used to be a major skill when modern AI arrived, its importance is now fading. Contemporary models are increasingly well fine-tuned, more peripheral systems (e.g., tool calling, MCP, RAG) interact with models, and input and output are systematically restructured as part of more elaborate orchestration. This leads to overall answer quality being on the rise - detached from the nuances of engineered prompts. Compare the short article What's in a good system prompt.

Prompt Injection

In web applications, a common security risk is the exploitation of user-facing input-mechanisms to funnel commands or partial code into backend systems. When such inputs are not properly treated, filtered, sanitized or at least tainted, an adversary or rogue actor can potentially use them to access data or influence the backend in a way that is not usually allowed for a front-end user. One common exploit is to enter SQL commands into data inputs or query interfaces, with the attacker hoping the system's internals rely on common SQL database connectors with a weakness in their implementation. In case these connectors treat user input like any other code, such an attack may lead to an outside user becoming able to trigger actions on the backend. Similar exploits rely on code being eval'ed (runtime executed) or altered code accessing wrong resources in XSS (cross-site-scripting) attacks.
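The SQL case described above can be demonstrated with Python's built-in sqlite3 module. This is a minimal sketch with an invented table and invented data; it contrasts pasting user input into the query text with passing it as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

def lookup_unsafe(name):
    # Vulnerable: user input is pasted into the SQL text and parsed as code.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def lookup_safe(name):
    # Parameterized: the driver passes the input strictly as a value.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "nobody' OR '1'='1"

print(lookup_unsafe(payload))  # leaks every row
print(lookup_safe(payload))    # [] because no user has that literal name
```

For SQL, this separation of code and data is a solved problem. For prompts, as the following paragraph explains, no equally strict separation exists.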

With the advent of interactive chat interfaces and user-entered prompts being forwarded to backend AI models, the exploit of using fabricated prompts to inject commands that generate unwanted output became a similar and very real problem. Cloud LLM providers usually have mitigations in place to identify and treat malevolent prompts, but as chatbots process natural-language-like arbitrary inputs and generate probabilistic, not fully deterministic output, an unwanted output (data exfiltration, jailbreak, prompt-injection-driven tool misuse, or accidental access beyond intended boundaries) triggered by a cleverly constructed prompt is a challenging threat. The difficulty stems from the nature of AI models to process and produce fuzzy content, turning a strength into a weakness. To mitigate such risks, usually some form of guardrail library is used as middleware between inputs and model, model and output or both. As of this writing, technical terminology for such guardrail technology is not fully set and technologies may be labeled as "AI firewall", "prompt filter", "guardrails filter", or "policy engine".

Prompt Template

With modern AI and especially with large language models, prompt templates are used on various levels of an application stack. On a very low level, where models interface with inference engines, usually model-vendor-supplied templates control how user input is translated into a syntax used during model design and pre-training. For example, it is common to send various de-facto standard messages as JSON requests to model runner APIs while models internally are often trained with a syntax that resembles XML. Templates here act as a translation layer between these interfaces. Then, common AI helper frameworks like Fabric collect user-supplied prompts for specific tasks that proved to be effective. Such prompts, usually in Markdown, can then be used to prompt LLMs with elaborate inputs while only a very few variables are actually filled in by the user. Additionally, templates are widely used in RAG where developer-defined prompt templates are used to wrap fetched context data or the central user question with additional instructions before being fed back to the model. While user-facing templates are commonly formatted in Markdown as models can digest Markdown's structure well, dynamically filled prompt templates often use the Jinja templating syntax underneath. There is no universal standard, but in AI circles the Jinja templating engine has emerged as a go-to tool due to the fact that Jinja is also used in many other areas of the AI ecosystem stemming from its strong ties to the Python programming language. Compare System Prompt and the short article Do not think of an AI chat as a "real-world human conversation".
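A developer-defined RAG template of the kind described above might look like the following sketch. To stay within Python's standard library, it uses string.Template as a stand-in for Jinja; the template wording and the function name are invented for illustration.

```python
from string import Template

# Markdown-structured prompt with two placeholders filled in at request time.
RAG_TEMPLATE = Template(
    "Answer the question using only the context below.\n\n"
    "## Context\n$context\n\n"
    "## Question\n$question\n"
)

def render_prompt(question, passages):
    """Wrap the user question and the retrieved passages with instructions."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return RAG_TEMPLATE.substitute(context=context, question=question)

prompt = render_prompt(
    "What is PUE?",
    ["PUE is short for Power Usage Effectiveness."],
)
print(prompt)
```

With Jinja, the placeholders would be written as `{{ context }}` and `{{ question }}`, and loops or conditionals could be expressed in the template itself.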

PyTorch

is an open-source software framework for deep learning, offering tools to train and run deep neural networks. Historically, PyTorch is based on Torch, an early machine-learning framework from 2002 that slowly developed into Torch7 around 2010. Using a C core and Lua bindings, the framework saw increasing decoupling of its backend from its API and frontend, and matured over the years, incorporating ideas from the important Chainer framework, and gaining influential supporters. In 2016, the project was re-released as a Python-centric rewrite named PyTorch by its main supporter then, Meta/Facebook. This momentum resulted in a merge with the Caffe2 project and support for the ONNX model format. In 2022, Meta moved the project into a dedicated foundation shepherded by The Linux Foundation. By that time, the framework surpassed competitors like MXNet and more importantly TensorFlow in developer attention. As of today (2026), PyTorch is the dominant software suite in the field of AI, with the community's preference for the Python programming language contributing to its foothold. Many AI vendors rely on PyTorch and its ecosystem as the basis for their commercial offerings.

One cultural note in relation to PyTorch is about its naming. While not officially documented by its original author, the "Torch" framework may have been named after the Greek myth of Prometheus. The advent of artificial intelligence through the research of machine learning, semantic embeddings and transformer models can be described as a turning point for mankind. Linking a seminal framework to this ancient symbol creates a strong metaphor, whether intentional or not. Prometheus brought fire to man, equipping mankind with a technology he stole from the Gods, changing fate's trajectory forever, igniting knowledge, progress, and civilization.

Quantization

With many operations in computing, numeric precision directly correlates with precision in an expected outcome. This stems from the fact that computers internally rely on mathematics, and thus this correlation between numeric and output precision is also true for machine learning and AI systems. Large Language Models work with floating point numbers ("floats") internally, for weights, for vectors (embeddings) and calculated scores. The floating point precision used internally in a model directly aligns with its precision in its calculations and thus its output.

Quantization is the process of setting a limit on how precise internal calculations may be. A quantization may limit floating-point numbers to two decimal places, or it may even truncate the number after the decimal point, effectively rendering it an integer value. Quantization in AI can be thought of as a model's "resolution" or its granularity in its inferences. Internally, quantization is usually done by switching to lower precision datatypes, like from 16-bit floats to 8-bit ints and their GPU equivalents. While embedding vectors might be quite exact in their floating-point representation, they usually lose precision after quantization. Returning to the mnemonic of thinking of quantization as the "resolution" of a model, quantizing a model means reducing the "sharpness" of what the model "can see, know, or infer".

The primary reason for quantization lies in the fact that there is a cost associated with processing floating-point numbers - especially with high precision floats. Quantizing a model reduces the overall size of the data structure and makes it easier to handle and to run. A higher level of quantization (reducing precision) makes a process more efficient or, with large models, sometimes even possible at all. Models may be pre-quantized or quantized during load. And inference engines may impose quantizing limits during runtime, for internal values computed when a model does its inferencing work.
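The loss of "resolution" can be made visible with a small sketch. It assumes one common scheme, symmetric 8-bit quantization with a single scale factor per list of weights; the weight values are invented and real quantization formats are more elaborate.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Map the integers back to approximate floats."""
    return [q * scale for q in quantized]

weights = [0.1234, -0.5, 0.3301, 1.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # small integers that fit into one byte each
print(restored)   # close to the original weights, but not identical
```

Each weight now occupies one byte instead of two or four, at the price of a rounding error of up to half a scale step.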

RAG

is short for "Retrieval-Augmented Generation". While LLMs are able to generate arbitrary "mostly helpful" responses based on their vast general knowledge pre-training data, RAG is essential to ground a model on specific knowledge. Especially when knowledge is niche, too large or too fast changing, it is in most cases prohibitively costly in time or money to fine-tune a model on this specific data. In many cases it may be impractical or actually impossible to retrain a model, for example when a stack relies on cloud LLMs alone. That's why RAG is an established solution to reduce hallucination in areas where pre-training data was deliberately or by circumstance limited.

A common RAG pipeline is defined by a second parallel query being executed when a user query comes in. This is different from Tool Calling, as RAG is usually done before the prompt reaches the model. These parallel queries may be issued against traditional relational databases based on keywords extracted from a query prompt by means of Named Entity Recognition (NER) or other NLP schemes. In modern setups, normally Vector Databases in combination with an embeddings-producing sidecar process are used to fetch related context knowledge from a vectorized store or different Content-Management (CMS) or database systems. Once the parallel context knowledge query returns data, this data is then commonly merged via a Prompt Template with the original user query and fed into the main LLM for answer generation.
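The retrieve-then-merge flow can be sketched in a few lines. Loud assumptions: a bag-of-words counter stands in for a real embedding model, a plain Python list stands in for the vector database, and the final LLM call is left out entirely.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    """Merge retrieved context and user query before anything reaches the LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

documents = [
    "PUE is total facility power divided by IT power",
    "A perceptron is a single artificial neuron",
    "SSE streams text from server to client",
]
print(build_prompt("What is a perceptron?", documents))
```

The model then answers from the supplied context instead of relying on whatever its pre-training data happened to contain.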

Multi-Modal RAG is when the RAG pipeline is able to query not only text but also images, audio and video files for contextual information. This is usually implemented by a pre-processing scheme to extract textual representations from richer media file-types. This means such media is being treated with captioning models, or speech to text models, metadata scanning etc. to extract text, semantic and/or structural relations, etc. In a Hybrid Multi-Modal RAG approach, not only is RAG done in this described way, but the main model is additionally a multi-modal model itself, adding a second layer of reasoning power to validate, enrich, condense the RAG context with its own inferencing results for a combined high quality output in answering multi-format query input. Compare Tool Calling. Also "AI Search" for a practical explanation of RAG.

Reasoning

is a fundamental capability found in humans and some animals, describing the act of using existing knowledge in a rational thought process to transfer insight or extrapolate existing knowledge onto new or similar situations or problems. Combining patterns, an often conscious cognitive act of thinking, to assess new cases by reasoning is expected to produce a rational outcome. With humans, the assumption is that "someone has a good reason" for what they do, implying rationality. From a philosophical perspective, it is highly questionable whether AI systems are capable of true reasoning. Does an AI know or understand what it is doing? ANNs are able to apply learned statistical relationships to novel inputs, but ultimately they reproduce patterns that they found in training data during pre-training. However, while reasoning may be defined as applying existing knowledge to new situations, the conscious aspect is surely lacking with AI systems. AI models emulate reasoning by reproducing logical structures and human thought that they found ingrained in their training data. Compare Case-Based Reasoning and the short article Does an A.I. "think"?.

Roles

are labels given to messages in LLM conversations when they are represented as structured requests against a local inference engine or cloud LLM API endpoint. Roles label "who is speaking", e.g. the developer, a user, or the assistant (when a message was generated by the model itself in a previous turn). Roles give priority to messages, forming a hierarchy of power, meant to control the influence of individual messages on the model. Especially with cloud LLMs the actual implementation or weight given to roles may vary. Compare System Prompt.

Sampling

is the subsequent step after decoding in the inference process and the third step of the main work a model does. Once decoding has produced the raw logit scores by passing tokens through matrix multiplications in the transformers, the sampling phase adjusts those raw logits via temperature, Top-K, Top-P (nucleus) functions to select the final token. Compare Decoding and Temperature.
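How temperature, Top-K and Top-P act on raw logits can be sketched as follows. This is one straightforward way to combine the three filters, not the implementation of any specific inference engine; the logit values are invented.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; low temperature sharpens the result."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    probs = softmax(logits, temperature)
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k best candidates
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:         # keep the smallest set reaching mass top_p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices normalizes the remaining weights implicitly.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]

logits = [2.0, 1.0, 0.1]
print(sample_token(logits, top_k=1))    # 0, greedy choice
print(sample_token(logits, top_p=0.5))  # 0, token 0 alone covers half the mass
```

Without any filter and at temperature 1.0, all three tokens remain possible and the result varies from call to call.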

Self-hosted

Using cloud LLMs, GenAI image generation or other AI services that are offered exclusively over the Web constitutes Software-as-a-Service (SaaS). With SaaS, customers have no control over how the technical backend is run and usually suffer from limited data sovereignty. The opposite paradigm is running a service "self-hosted", managing servers and software on your own terms, taking on responsibility for the overall quality. As with any software, deep neural network models can similarly be run self-hosted, locally, on-premises, completely detached from cloud APIs. As many AI vendors deliberately offer models as cloud offerings to protect IP, not every SaaS service can be matched with an open-source, open-weights, or freely available alternative. A hybrid approach is to host the user interface locally, and use this local stack to route model requests to locally running inference engines or to cloud models. Enterprises use such local interfaces to present users with a familiar yet fully internal alternative to commercially available offerings on the Web. This can help mitigate "shadow IT", where employees use arbitrary services not governed by corporate policies. Similarly, self-hosting a local UI stack enables savvy users to interact with models in a convenient way, similar to commercial services, but with added enhancements like dynamic switching between models, using a local or remote model, or finer-grained control over the backend. Many self-hosted code projects fund development by offering their code under an Open Core / Dual-License or a Freemium model. Compare Cloud LLM, LLM Runner, and Diffusion inference framework.

Self-hosted alternatives for:

Coding:

OpenCode (opencode.ai)
Cline (cline.bot)
OpenHands (openhands.dev)
Aider (aider.chat)
Kilo (kilo.ai, opencode fork)

Documents & RAG:

kotaemon (GitHub: @Cinnamon/kotaemon)
AnythingLLM (anythingllm.com)
Verba (GitHub: @weaviate/Verba)
RagFlow (GitHub: @infiniflow/ragflow)

Automation & Personal Assistance:

OpenClaw (openclaw.ai)
n8n (n8n.io)
Dify (dify.ai)
Flowise (GitHub: @FlowiseAI/Flowise)

Chat:

LibreChat (GitHub: @danny-avila/LibreChat)
Open WebUI (openwebui.com)
Chatbot UI (GitHub: @mckaywrigley/chatbot-ui)
Onyx (onyx.app)
Msty (msty.ai)
Chat UI (GitHub: @huggingface/chat-ui)
textgen (GitHub: @oobabooga/textgen)

Self-supervised

In LLM model training, "Self-supervised" means that a model, during its training process, is "only supervising itself", or technically is only extracting labels as they are stochastically inherent in the training corpus itself. As such, self-supervised is the opposite of a "supervised" approach, where humans precondition, label or tag content in a training corpus to prepare the data before model training or guide the model in applying a specific label to a certain training data trait. Self-supervised training can be useful in extracting raw traits as they appear in a dataset or to produce a foundational reference model for comparative or downstream applications. That said, it might be noted that extracted traits are not unbiased or actual, but merely align with what is found in the underlying data - which in turn might be biased or counterfactual.
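How labels can come out of the corpus itself is easiest to see with next-token prediction. The sketch below assumes whitespace tokenization and a tiny context size; it merely shows that every (context, label) pair is derived from raw text without any human annotation.

```python
def next_token_pairs(text, context_size=3):
    """Derive (context, label) training pairs from raw text alone."""
    tokens = text.split()
    return [
        (tokens[max(0, i - context_size):i], tokens[i])
        for i in range(1, len(tokens))
    ]

for context, label in next_token_pairs("the cat sat on the mat"):
    print(context, "->", label)
```

Whatever bias or error the text contains ends up in the labels as well, which is the caveat made above.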

Server-side template injection

abbreviated as SSTI, occurs when an attacker is able to inject malicious template syntax into a server-side template that is then executed by the template engine. This can happen when user input is not sanitized before being embedded into backend templates, for example in systems that accept natural language input. Compare Prompt Template.
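The mechanism can be illustrated with Python's own format strings, used here as a stand-in for a full template engine. The `Config` class and its key are hypothetical. The flaw in the first function is that user input becomes part of the template instead of being a value handed to it.

```python
class Config:
    api_key = "sk-hypothetical-secret"

def greet_unsafe(user_input, config):
    # Vulnerable: user input is concatenated INTO the template text.
    template = "Hello " + user_input + "!"
    return template.format(config=config)

def greet_safe(user_input):
    # Safe: the template is fixed, user input is only ever a value.
    return "Hello {name}!".format(name=user_input)

payload = "{config.api_key}"

print(greet_unsafe(payload, Config()))  # Hello sk-hypothetical-secret!
print(greet_safe(payload))              # Hello {config.api_key}!
```

The same rule applies to Jinja and similar engines: render a fixed template with user input as a variable, and never build the template string from user input.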

Sigmoid transfer functions

are a type of nonlinear activation function that was used in older artificial neural networks. Sigmoid functions map wide input ranges to a limited output value space, forming an S-shaped curve over the value range. LLMs and deep learning models use activation functions in order to activate artificial neurons in a nonlinear way, a characteristic that is required to effectively model complex patterns found in training data. Today, the sigmoid transfer function specifically is mostly replaced with more modern alternatives, like ReLU (Rectified Linear Unit), GELU (Gaussian Error Linear Unit), or SwiGLU (Swish-Gated Linear Unit). All of them, though, behave slightly differently. While sigmoid maps to values between zero and one, ReLU maps negative values to zero and positive values to themselves, and GELU may map to small negative numbers as well.
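The different output ranges are easy to verify numerically. The sketch below uses the standard textbook definitions of sigmoid, ReLU and the exact (erf-based) GELU; SwiGLU is left out because it is a gated layer with learned weights rather than a plain scalar function.

```python
import math

def sigmoid(x):
    """Squashes any input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{x:5.1f}  {sigmoid(x):.4f}  {relu(x):.4f}  {gelu(x):.4f}")
```

For x = -1.0 the GELU column shows a small negative value, whereas ReLU returns exactly zero.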

SSE

short for "Server-Sent Events". SSE is a Web technology and a web standard (originally published by the W3C, today maintained as part of the WHATWG HTML standard) that describes a scheme and protocol for streaming text data from a server to a client, usually a browser. In contrast to WebSockets, SSE is unidirectional (server to client only). It is a lightweight solution to push real-time updates from a server to a client. As of 2026, SSE is very well supported in browsers. Its MIME-type is "text/event-stream". SSE saw a proliferation on the Internet with the advent of Internet AI chatbots and AI search engines. Stemming from the iterative nature of AI generated content, which usually renders in "successive bursts" or token chunks, there was a need to push these incremental updates to the client. Instead of the server waiting for the AI subsystem to fully complete its output, AI providers usually align with the segmented generation and stream updates to the client as they arrive. From a UX (user experience) perspective, this is much better than letting a user wait for seconds until content is done. Instead, users can see the generation as it happens, with a first incomplete output coming in after fractions of a second. One interesting note is that cloud LLM providers like OpenAI usually use the MIME-type "text/event-stream" but deviate from how SSE is normally consumed by providing an event stream as response to a POST request. The browser's standard EventSource API only issues GET requests, but providers use the POST type of request as client requests may be very large. So effectively, providers chose to use some aspects of the SSE scheme, like record separation and its MIME type, but actually only do a simple text stream to clients. And clients, in turn, usually implement around this oddity and use XHR or fetch() operations in lower-level reads instead of a real standard-conforming SSE event consumer.
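What such a lower-level read has to do can be sketched as a small parser. It is deliberately simplified: it assumes "\n" line endings, only evaluates "data:" fields and ignores "event:", "id:" and "retry:". The "[DONE]" marker in the sample input mimics the end-of-stream convention used by OpenAI-style APIs. Requires Python 3.9 or newer for str.removeprefix.

```python
def parse_sse(chunks):
    """Yield the data payload of each event from an iterable of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A blank line terminates one event; a chunk may end mid-event.
        while "\n\n" in buffer:
            raw_event, buffer = buffer.split("\n\n", 1)
            data_lines = [
                line[5:].removeprefix(" ")  # drop "data:" and one optional space
                for line in raw_event.split("\n")
                if line.startswith("data:")
            ]
            if data_lines:
                yield "\n".join(data_lines)

# Network chunks rarely align with event boundaries, hence the buffer.
incoming = ["data: Hel", "lo\n\ndata: wor", "ld\n\n", "data: [DONE]\n\n"]
print(list(parse_sse(incoming)))  # ['Hello', 'world', '[DONE]']
```

Several "data:" lines within one event are joined with newlines, as the SSE scheme prescribes.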

Do LLMs maintain "State"?

Short answer: usually no. The idea of "state", meaning the "state of a conversation", the perspective on "what has been said", and "what might be said" in the future - these states are usually, in today's LLM-based chatbots, not actively maintained during a conversation. Looking at the OpenAI de-facto standard of how an LLM platform API is usually queried, the notion of state has been left out for simplicity of the service. In addition, LLMs do not know anything about the practicalities of a conversation and they also do not know the concept of laziness. Why laziness? Because of the way the "state of a conversation", its context, is maintained across input-output-turns. With LLM chats, the whole chat history is fed into every next query of the LLM. This may include vast amounts of texts, but it is the only way of providing the aspect of "context" to current LLM implementations. LLM input may grow to a large array (messages array) of user prompts, mixed with what the model responded (labeled as role="assistant").
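The resending of the whole history can be sketched with a stand-in model. `fake_model` is hypothetical and only reports how many messages it was handed; the point is that the client, not the model, carries the conversation.

```python
def fake_model(messages):
    """Hypothetical stand-in for an LLM call; it sees only what is passed in."""
    return f"(reply based on {len(messages)} messages)"

def chat_turn(history, user_input, model=fake_model):
    """Append the user turn, send the FULL history, append the reply."""
    history.append({"role": "user", "content": user_input})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

history = [{"role": "system", "content": "You are a helpful assistant."}]
first = chat_turn(history, "Hi")
second = chat_turn(history, "What did I just say?")

print(first)         # (reply based on 2 messages)
print(second)        # (reply based on 4 messages)
print(len(history))  # 5
```

Every turn makes the input longer, which is why long chats eventually run into the limits of the context window.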

On the application layer, in contrast, conversation state may be maintained in various ways. Commonly, chat interface implementations create a "Session" for one coherent "chat interaction". Also, user input is usually logged and is kept available to replay past turns of input and LLM output back to the user, for example when the user temporarily navigates away or network connection is lost. Also, many security or legal guidelines require chats to be archived for later review or external audits. And in addition, many technical practicalities of chat interface implementations keep state or session storage/caches to optimize data flow between the UI front-end, pre- or post-processing middlewares and local or remote model backend APIs. For completeness it may be added that OpenAI's newer Assistants API does manage state ("threads") on the server-side for convenience, but this does not fundamentally change that current LLMs, without wrapping frameworks, do not keep state. Compare "Roles" for more on defined roles in OpenAI-compatible LLM queries. Compare "Security guardrails" for more on filtering middleware. And compare "Do not think of an AI chat as a 'real-world human conversation'" for even more on the concept of "State".

Streaming responses

Client-Server implementations of GenAI, like chatbot interfaces or image generators, today usually employ some form of streaming to stream generated content back to the client user. "Streaming" means a remote backend does not wait for content to be complete or done but instead does an incremental stream of content to the user. When preparing remote data takes time, it can make sense to return and present this data to the user as soon as possible, as it arrives on the remote backend. Streaming is effective to show progress to the user as feedback that "something is actually happening". With GenAI and LLMs generating content token by token or iteratively, this is especially effective as AI content generation today is not instant. In AI image generation, an AI web interface usually shows intermediary iterations of the final content as it renders. It depends on the backend technology if this preview is actually tied to the final outcome or just some placeholder imagery that communicates an ongoing process to the user, in lieu of known "throbbers". In AI text generation, where generated output forms much more serially on the backend, chat interfaces can and usually do push generated content to the client as soon as it is available. Such "Streaming responses" usually rely on some sort of content-serving scheme that allows served content to be of undefined length at the start of the streaming process. HTTP responses normally announce their total length up front (the Content-Length header), but with streaming GenAI the final length is mostly unknown when output starts. Implementations thus either rely on custom ReadStream, XHR or fetch implementations, or on the standardized SSE (Server-Sent Events) scheme and protocol. Streaming responses try to stay away from alternatives like recurring polling or heavier technologies like WebSockets.

Interfacing with APIs in a streamed mode usually comes with added code complexity and development overhead as implementations need to accumulate streamed content in intermediary buffers and handle message chunks accordingly. Further, streaming partial content to an end-user makes filtering, safety guardrails enforcement or automated real-time output monitoring more difficult.

Structured Output

is an umbrella term for any inference-time translation of generated output according to a previously specified format. Inference is divided into multiple steps with "decoding" being one of the last ones. In "constrained decoding", the LLM runner framework imposes a specific structure onto the model and allows only specific tokens to be generated during sampling, resulting in structured output. This is also known as guided or grammar-based decoding. Such structured decoding techniques are used to force a model on the token level to produce JSON-formatted output, and also to execute tool calls, where a model is forced to answer in a defined format rather than generating unpredictable free-form text. The "Outlines" Python library aims to provide an interchangeable approach to structured output that is compatible with multiple models and multiple inference engines.
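The token-level forcing can be illustrated with a toy example. Everything here is hypothetical: an eight-token vocabulary, a hand-written table of allowed follow-up tokens in place of a real grammar, and greedy selection in place of sampling. It is not how Outlines or any specific engine implements the technique.

```python
import json
import math

VOCAB = ["{", "}", '"answer"', ":", '"yes"', '"no"', "maybe", "hello"]

# Toy grammar: after each emitted token, which tokens are allowed next.
# It only admits {"answer":"yes"} or {"answer":"no"}.
ALLOWED_NEXT = {
    None: ["{"],
    "{": ['"answer"'],
    '"answer"': [":"],
    ":": ['"yes"', '"no"'],
    '"yes"': ["}"],
    '"no"': ["}"],
}

def constrained_decode(logits_per_step):
    """Mask every token the grammar forbids, then pick the best remaining one."""
    output, last = [], None
    for logits in logits_per_step:
        allowed = ALLOWED_NEXT.get(last)
        if not allowed:  # grammar is complete, stop generating
            break
        masked = [
            logit if VOCAB[i] in allowed else -math.inf
            for i, logit in enumerate(logits)
        ]
        last = VOCAB[masked.index(max(masked))]
        output.append(last)
    return "".join(output)

# A "model" that would prefer free-form text ("hello") at every step.
stubborn_logits = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 1.5, 2.0]] * 6
result = constrained_decode(stubborn_logits)
print(result)              # {"answer":"no"}
print(json.loads(result))  # {'answer': 'no'}
```

Although the raw logits favor "hello" in every step, the mask leaves the model no choice but to produce valid JSON.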

System Prompt

A "system prompt" (or "role prompt") is a high-priority first prompt given by a developer or interface supervisor. Instead of starting with a normal message prompt, doing "role prompting" and supplying the model with a first guiding input generally improves output quality significantly. Prefacing a chat session with a role prompt like "You are a seasoned medical professional, overseeing a large department of a city hospital" is more than just silly roleplay. A system prompt helps the AI to focus and work within a more guided corridor of possibilities. This channeling elevates accuracy and overall model performance. Aside from that, a well crafted system prompt helps by defining the output tone, verbosity, wording and aligns output better with the target context of generated content ("anchoring"). Lastly, a system prompt can be used as a high-priority ruleset and supply grounding context for upcoming conversations (in-context learning). System prompts are more important in chat-style model interactions where conversations depart from a specific first message. In RAG scenarios, user input is typically embedded into a Prompt Template to have the model reason over user input in relation to template content and retrieved context or (local/remote) external data. Prompt Templates, in turn, may be provided to a model as system prompt or as regular user messages. The name "System Prompt" comes from the de-facto OpenAI API JSON protocol standard, where clients define a "role" key/value for each message in the request. Note that cloud LLM providers and model vendors sometimes rename roles, like "system" to "developer" or they introduce finer-grained role-level hierarchies. As models don't throw an error when an unknown role token is used, developers need to check which role token a model has been trained with and use the model or API accordingly. Also compare In-Context Learning, Prompt Template and Prompt Caching.
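
As a rough illustration of the role mechanism, a chat request body in the widespread "messages" style could be assembled as follows. The helper name and the model name "example-model" are placeholders for this sketch. Which role name a given model or API expects has to be taken from its documentation.

```python
import json


def build_request(system_prompt: str, user_message: str,
                  model: str = "example-model",
                  system_role: str = "system") -> dict:
    """Assemble a chat request with a system prompt as the first message."""
    return {
        "model": model,  # placeholder model name
        "messages": [
            # The role name is configurable because some vendors rename it.
            {"role": system_role, "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }


request = build_request(
    "You are a seasoned medical professional, overseeing a large "
    "department of a city hospital.",
    "How should I prepare for a routine check-up?",
)
print(json.dumps(request, indent=2))
```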

What's in a good system prompt

In deployment of chat assistants, for user support, on websites or in telephone dialog systems, a system prompt is usually one of the most important elements to set the tone, domain and rules for what an assistant is expected to answer, do and know. Here, a system prompt acts like a "configuration file" in establishing how an LLM assistant works. That's why crafting a solid system prompt is one of the first steps towards success with an LLM based chat system. That said, it is important to note that the system prompt counts against a model's context window, so a verbose text here will reduce the total "memory" left for the actual conversation. Especially with older models with smaller context windows, this may be relevant. So writing a good system prompt is a challenging task, caught between the conflicting priorities of brevity and all-encompassing instructions and guardrails. The idea is maximum effectiveness at minimum token use. Compare Context Window and Prompt Engineering.

Be explicit not implicit. Models are not humans. Drop the fluff and give clear instructions. Models tend to follow a clear rule better than a vague or implicit suggestion. No use in being polite.
Earlier tokens receive higher priority. Although message roles like "system" are a paramount concept, models also weigh information inside messages according to position, so put your most important instructions at the beginning. But read the docs: some models apply higher weighting towards the end (recency bias).
Structure and order your prompt. Mixing style and rules, for example, adds unnecessary noise. A clear structure like: Order, Rules, Constraints and Output is a solid foundation.
Use structuring markup to guide the model. From training, models already know markup like Markdown-style headings, lists or XML-style sections and nesting markers. Using these can help the model to digest the structure of your prompt. Newlines are important to convey segmentation.
Punctuation and capitalization. Some models ignore capitalization, others can be guided by some words in all-caps. Using an exclamation mark is often just as strong as using a simple full stop. Models read text differently than humans. Generally, it is better to avoid exclamation marks as they may lead to unpredictable weighting.
Use lists and clear instructions. Verbose text might be more readable, but long prose is blurry for a model in comparison with an enumeration of short sentences or bullet points.
No use in repetition. Repeating instructions is usually not needed as models don't forget things or skim over sections like humans may do. That said, repetition is effective for emphasis, especially in high verbosity prompts and for core constraints.
The model already knows things. Try to avoid reiterating things the model already knows from training. When adding knowledge, the idea is to amend the model's data where it runs thin. And try to find a balance between knowledge autonomy and grounding with added facts.
Use AI. While authoring text is a creative domain for humans, aligning an assembled instruction sheet with an LLM's requirements is not. There is much use in using AI itself to check and optimize a system prompt.
Read vendor docs. Not every AI model is the same. And vendors know their models best. Look for prompting guides and best practices in official documentation to help you fine-tune your system prompt for a specific model family or model.

Temperature

Although AI systems are intrinsically deterministic (non-probabilistic), some deliberately introduced parameters modify this behaviour. This tuning of a model is what leads to a common perception of AI models being probabilistic or "random" in their output (increased output entropy). The Temperature parameter is one key factor here. The Temperature parameter internally controls a model's leeway in finding its stochastic pathways through its knowledge encoded in its weights. A higher temperature leads to a higher degree of randomness. In combination with sampling settings, the possible path through the model can be tuned from very unpredictable to nearly fully predictable. This is conceptually similar to image rendering, where light "portals" are used to bracket forks in the light path and this way narrowing or widening the possible trajectories after a node portal. The actual process of Temperature-scaling occurs between arriving at Logits and before these values are passed into the Softmax function. By controlling the "value corridor" between these steps, the resulting diversity or "creativity" of a model's output can be effectively controlled. Compare "Logit".
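
The scaling step between Logits and Softmax can be written down compactly: each logit is divided by the temperature T before the softmax is applied. The following sketch uses made-up logit values and plain Python lists. Temperature zero is excluded here because it would mean a division by zero; in practice it is usually treated as a special case that simply picks the highest-scoring token.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float],
                             temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (treat 0 as greedy argmax)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature sharpens the distribution towards the top-scoring token, a high temperature flattens it and gives less likely tokens a better chance during sampling.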

On randomness in AI

In 1926 Albert Einstein used the phrase "God does not play dice" to describe the laws of nature, the mechanics of the Universe. Looking at generative AI and how it mirrors what human cognition seems to be, the difference between artificial and human intelligence becomes obvious. GenAI's output only varies for an unchanged input when "Temperature" variables allow a random margin at defined stages of the internal inference process. At a temperature of zero, GenAI is de-facto fully deterministic, a non-probabilistic apparatus. The perceived unpredictable nature of GenAI output is only an effect of cloud models usually operating at non-zero temperatures and input varying wildly in actual wording despite humans interpreting their input as being "the same" intent. Only randomized steps at the end of the internal inference path allow a certain degree of random walk inside a model when it forms the actual output.

Humans, on the other hand, usually don't play dice in decision-making. Situations, perception or language are not a game of statistics. Decisions are not calculated but made. Imagine an arbitrary situation where a human actor is tasked to decide. Even when one option is statistically the better choice, humans may and often will opt for the other choice - driven by instinct, intuition, out-of-the-box thinking, context, upbringing, emotion or personal style. But they will never choose randomly. And when stakes are high, this human common sense is what allows us to trust a person and makes us mistrust a machine. Fiction has taught us this lesson in numerous dystopian narratives where computers made bad decisions, where machines decided too early on what seemed to be a clever option, basing their actions on an incomplete picture of the situation. It comes as no surprise that where AI systems are deployed, responsible providers rely on supervision, automated and manual, putting human common sense back into the process. For more on AI supervision, compare HITL (Human-in-the-Loop).

Tensor

The term "tensor" is one of the ethereal terms within the field of artificial intelligence, and yet, it is only a simple umbrella term for a data structure, a mathematical container. In AI and the context of deep learning, everything is a tensor that can be described as an n-dimensional array of numbers, a multi-dimensional field of numbers. Embeddings are tensors, and hidden states, likewise the fixed length vectors and the specifically shaped matrices that are used to represent them. A simple list of values, a mathematical vector, is a one-dimensional, a 1-D tensor. Stacked token vectors in a matrix form a 2-D tensor. Adding another axis/dimension, a batch number for example, makes this then a 3-D tensor and so forth. Tensors are the structures that are transformed on their paths through a model's layers. Attention mechanisms take tensors and transform them, through linear projections and matrix multiplications. Compare Hidden States, TensorFlow, and GPT.
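
The progression from 1-D to 3-D can be made tangible with nested Python lists. Deep learning frameworks use optimized array types instead, so this is only a sketch, and the helper "shape" is written for this example and assumes regular (non-ragged) nesting.

```python
def shape(tensor) -> tuple:
    """Return the size of each axis of a regularly nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


token_vector = [0.1, 0.2, 0.3, 0.4]                   # 1-D: one token vector
sequence = [token_vector, token_vector, token_vector]  # 2-D: stacked token vectors
batch = [sequence, sequence]                           # 3-D: a batch axis added

print(shape(token_vector), shape(sequence), shape(batch))
```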

TensorFlow

is an influential framework for machine learning and deep learning. It was originally developed at Google and was released as open-source in 2015. TensorFlow provides tools for training and running neural networks. It uses tensors as its central data structure, supporting computations on CPUs, GPUs, and Google's hardware accelerators (TPUs). Within the field, TensorFlow was the dominant framework after its release, in research and practical environments, with its spiritual predecessor Theano falling behind. At the time, notable rivals were the Amazon-supported MXNet framework and the Caffe/Caffe2 frameworks. Around the year 2020, the AI community's focus shifted and began favoring PyTorch due to its increased flexibility.

Does an A.I. "think"?

Chatbots nowadays are able to simulate consciousness on an impressive level. Talking with a sophisticated chatbot or speaking with an LLM comes very close to what we humans regard as interpersonal conversation. The immediate reaction here is then to ask: do these systems actually "think"? "Thinking" here in the most common sense of human introspection. Thinking is an amalgam of intricate internal processes that work in unison to "understand" what we ask, intend or try to convey - picking up incomplete thoughts and deducing meaning. From a high-level perspective, and when artificial intelligence is regarded as a black box or covert art, people tend to judge "yes": AI does think. But digging deeper, technically and philosophically, the answer is the contrary. No, AI does not think. There is no thought, no judging, no deliberation - only statistics. Simply a defined series of mathematical steps, stochastic calculation and a defined degree of randomness. That's all. It appears as if an LLM "thinks", it seems to "understand" but, in fact, it does not.

The question of why we tend to assume there is human thinking involved stems from a number of reasons. First, humans are looking for signs of life, of other humans or a soul in their environment. We tend to animate inanimate things by the power of our will and imagination. That's why a chatbot giving a sensible answer tricks us into thinking there is a person talking to us. Having the machine transcode text into audible speech makes this even more real, but it's not. The second reason is training data. LLMs have been trained on vast corpora of textual knowledge and encoded in this knowledge is human thought, reasoning, judgment and everything we understand as common sense. An LLM on the other hand is a machine that is able to produce summaries of this training data. It is able to detect topics and produce related content that aligns with the input intent through probabilistic continuations. It mimics conversation.

The human brain has nearly 100 billion neurons, and each neuron is connected to several thousand other neurons via synapses - and yet, we are not aware of them or what they encode. Computers are machines and they are different. They lack many qualities of biological life but they are exceptionally proficient at data processing. And with this capacity, they are able to do things that are hard to imagine in human terms, such as calculating vector similarity in a high-dimensional feature space or processing large arrays of numbers without a single error. These abilities now, in AI, are used to tackle human utterances and distill a computable input from them. Part of this process is chopping words into character sequences and sorting them - a scheme that is known as Tokenization. While words may be ambiguous and unfold their meaning only in context, this meaning is hard to grasp for a machine. With tokenization, the machine takes a pragmatic approach and dissects any textual input into these tokens and it stoically registers any meaning it ever found associated with any of them. When a machine has access to any context, any meaning and character sequence thinkable, and can access all of these at once, it is able to label any input with the appropriate answer. Any input. The technical underpinning here could be loosely described as the machine simply knowing the answer to any question a human is able to ask. It is hard to not speak of magic here, as it is so unintuitive. But the reality is that there is, on an abstract level, understanding meaning as a fixed range, a closed set of possible questions and the AI simply knows the answer to any of those. At least approximately. And that's even true for long meandering inputs, with topical inconsistencies and dead ends. 
Through a process of filtering and condensing, the machine is able to identify the key areas of focus in any text, and even take side-aspects into account - and produce a matching output that tends to answer the input with a closely aligned response. Large LLMs have hundreds of billions of parameters and are trained on trillions of tokens; this is the equivalent of knowing any imaginable utterance and being able to mathematically match it with a statistically aligned answer. And returning to tokens, using such tokens to generate answers is not "how AI thinks". Tokens are just a clever way to encode meaning, it is what they were invented for. Currently, tokens are the best we have to computerize language, but as with any invention, they will eventually be replaced.

Broadening the perspective, the whole apparatus of an AI could be perceived as a "thinking machine". Could this be where thinking emerges? AI models, once trained, are loaded into special runner frameworks, the inference engines. These engines accept input, insert it into the model and execute the process of calculating the output. This is the process of inference. Like a pachinko machine, only with a defined serial, parallel and branching path through specific components. When inference happens, it could be argued that this is a form of concept formation, which is a quality of human thinking. And calculation alone is problem solving, another aspect of human thought. Then, all of this happens along a defined logic, and applying logic is a cornerstone of human reasoning. But we already see that while these concepts have their parallel in human thought, they are here applied to the process, not the content. All these qualities are outside, on the inference level, and do not find their way into the processed data, the tokens. While such a machine is able to sort these tokens in a meaningful manner, the actual knowledge, judgment, deliberation was already encoded in the training data. Nothing new was added. No spark of ingenuity, creativity or innovation was applied.

While it's natural to (falsely) assume actual thought with LLMs, looking at other types of artificial intelligence can be a sobering endeavor. Another type of generative AI is image generation, where pixels instead of characters are produced. It's a less intuitive domain. That's why looking at image generation is helpful to understand how the machine can answer any input with a matching output and how it still does not understand what it is doing. For image-generating AI, a process known as diffusion is used, where the inference is sharpening a fully random image in iterative steps towards an output that aligns with the input intent. For every pixel, the machine decides what the most probable, the best matching state is. Instead of tokens, image generation relies on latent image fragments, on "patches". Having access to the vast training data, it can distill color blobs, then structures and finally a photorealistic image that simply is what was meant - but at the same time the AI never "understood" what it was doing. There is no craft, no artistry, no creativity. AI does not "understand", it has no concept of meaning. And despite some researchers arguing that AI internally forms functional world models, their rendering of the environment is fundamentally different - a stochastic network. Artificial intelligence does not know about the world, or us, or even itself. And despite the machine being able to impressively emulate what humans can do, it is just that: a machine.

Thinking

in relation to chatbot (web) interfaces, "thinking" is a presentational feature of an API, similar to "Streaming". When a model is capable of "thinking", it provides a glimpse into the inner workings of the model by outputting reasoning traces through the backend API in real time. This allows a user to see what is happening before the final answer is ready to be presented. Like "Streaming" this feature improves user experience by bridging the time between question/prompt input and answer/model output. In addition, allowing users to glimpse into the reasoning phase can benefit auditing or debugging a final response - but only a few UI implementations allow detailed dissection of the "thinking" content.

For API developers, it may be helpful to note that the full thinking content becomes part of the conversation history. In earlier model runner frameworks, developers were asked to pass "thinking" content to any follow-up query as context, to guide the model. With newer cloud APIs, for performance, intellectual property and jailbreak-mitigation reasons, this is no longer true and thinking/reasoning context data is either stripped or encrypted, paired with storage of these traces on the remote platform. For the human cognitive understanding of "thinking" refer to the article "Does an A.I. 'think'?"

Token

Tokens, in general, emerged from the field of text analysis and "Natural Language Processing" (NLP) where tokens, the "granular units" resulting from a tokenization process, are used as segmenting "windows" on a stream of characters, of text. Tokens may be Unigrams (an arbitrary self-contained unit of text), Bigrams (a two element n-gram, for example, a unit of two words, two symbols or two tokens) or, more generally, N-grams (an ordered sequence of n characters, symbols or words). In the context of large language models (LLMs), the word "Token" usually designates arbitrarily sized chunks of characters, from single characters and text fragments to whole words. In Byte-Pair encoding (BPE), tokenization begins with single characters or bytes, and in iterations of statistical compression the most frequent adjacent pair (a Digram) is merged into a new, longer representational Token. Conceptually, this is similar to, for example, UTF-8 encoding where two or three bytes are used to encode a specific literal - but with LLM tokenization this is on a higher, more abstract level. Tokens are unique to a specific family or even generation of LLMs and each token maps a char/char-sequence to a numeric ID. Libraries like "tiktoken" for OpenAI's models can be used to tokenize arbitrary text. This separation of tokenization into a process that can be done on client side, in the same or very similar manner as internal systems at OpenAI do, makes tokenization and token count an important metric when cloud LLMs are used. This is because Cloud LLM Platforms usually use "number of tokens used" to measure "credit use". The idea here is that there is a certain cost associated with processing input tokens or generating output tokens and in metering the "token burn" by a user, an AI vendor can bill its customers for services rendered. As of 2026 and with popular models, a rule of thumb is that one token generally corresponds to about four characters of common English text, so that roughly 100 tokens correspond to 75 words.
In a simplification, sometimes a token is described as a "probabilistic token", conveying that a model is working to predict the "next probable token", i.e. answering the question "what is the most probable word or char sequence after this series of previous tokens". This mixes terminology as Tokens are not characters, not words and not Embeddings but numerically arbitrarily chunked stream units unique to a model or model family. Compare "Logit" and "Perplexity".
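
The pair-merging idea behind BPE can be shown with a toy version: start from single characters and repeatedly merge the most frequent adjacent pair into a new, longer symbol. The sketch below leaves out everything a production tokenizer adds (byte-level handling, pre-splitting, a stored merge table, numeric IDs) and uses function names invented for this example.

```python
from collections import Counter
from typing import List, Optional, Tuple


def most_frequent_pair(symbols: List[str]) -> Optional[Tuple[str, str]]:
    """Return the most common adjacent pair of symbols, if there is one."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every non-overlapping occurrence of the pair by one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def toy_bpe(text: str, num_merges: int) -> List[str]:
    symbols = list(text)  # start with single characters
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        symbols = merge_pair(symbols, pair)
    return symbols


print(toy_bpe("aaabdaaabac", 1))
print(toy_bpe("aaabdaaabac", 3))
```

Each merge shortens the symbol sequence while the concatenation of all symbols still reproduces the original text, which is why the approach can be seen as a form of statistical compression.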

Tokenization

is the process of partitioning or segmenting text into defined pieces of information. Traditionally, a token means a "lexical token" and is any single- or multi-character identifier (unit) that carries a certain defined meaning. In natural language it can be the actual word meaning, in codified syntaxes it can be any arbitrary meaning or value. Tokenization describes the deconstruction of an input string into its contained tokens for a subsequent pass of applying meaning to a sequence of tokens (e.g. part of speech tagging) or treatment of tokens for storage and retrieval (e.g., in inverted indexes). More recently, with the proliferation of AI, the label "Tokenization" is increasingly associated with machine learning, where tokens are likewise extracted from an input string. The way AI systems utilize tokens is analogous to how inverted index search engines use them. Search engines also process tokens, align them with a defined vocabulary, assign a numeric equivalent per token (an integer index from a translation table) and handle these numeric representations exclusively in subsequent processes. This comes from the fact that in storage and retrieval, processing numbers is more efficient than processing characters, as computers process only numbers internally anyway. In AI, a model only works with numbers (Tensors) after Tokenization. Compare "Token".
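
The translation-table idea, where each token gets an integer index and later steps handle only numbers, can be sketched as follows. The word-level splitting by regular expression is a simplification chosen for this example. LLM tokenizers use learned sub-word vocabularies instead.

```python
import re
from typing import Dict, List


def lexical_tokens(text: str) -> List[str]:
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer IDs, extending the vocabulary when needed."""
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free index in the table
        ids.append(vocab[tok])
    return ids


vocab: Dict[str, int] = {}
tokens = lexical_tokens("The cat saw the cat.")
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

Repeated tokens map to the same number, so the later processing steps never need to look at characters again.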

Tool Calling

also "tool use", is a model's ability to execute commands or call external tools, like a human user would. This means, when a user query is processed, the model can be granted permission to use a tool by supplying a specific array of available tools. The model then uses the descriptions of each available tool to decide if using one of these tools could fill-in or answer questions with knowledge that it so far does not possess, either from training data or through conversation context. Implementation requires a more elaborate setup within the framework calling a model and support within the inference engine running the model as tool use usually follows a three-step pattern. In a first model call the request has the "tool" parameter defined to trigger structured output in case the model decides to use a tool. This may be left to the model to decide or be explicitly requested. The model then decides which tool to use based on the tools' descriptions. For example, if a user asks about a specific device, the model decides if it knows anything about it. If not, it tries to find out if there is a tool on the list that may fill in this missing data. In this case, the resulting model output will be a structured response that states which tool the model wants to call and with which parameters (the device name, model id, etc.). The wrapping framework then executes the actual request against the tool. In a third step finally, the model is again prompted with the user's input message to generate the free form text answer, but this time amended with the added context the tool use returned. On this third step, the framework would omit the "tool" parameter in order to avoid an endless loop in case the tool call did not yield useful context. Tool Calling is related to Retrieval Augmented Generation (RAG), although in RAG retrieval is usually done before the model is queried. That's why Tool Calling is sometimes called "Agentic RAG". 
Also, the back and forth of tool use can be described as a basic agentic loop. The concept of tool calling may also be seen as a rudimentary form of a model's ability to use a computer. Compare MCP.
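
The three-step pattern can be sketched with a stand-in for the model. Everything named here is hypothetical: "fake_model" imitates an LLM call, "lookup_device" is an invented tool, and the message and tool-call shapes are simplified rather than taken from any specific API.

```python
import json


def lookup_device(name: str) -> dict:
    # Invented tool: a real one would query a database or a remote service.
    return {"name": name, "power_watts": 40}


TOOLS = {
    "lookup_device": {
        "description": "Look up technical data for a device by name.",
        "function": lookup_device,
    }
}


def fake_model(messages, tools=None) -> dict:
    """Hypothetical stand-in for an LLM call."""
    tool_results = [m["content"] for m in messages if m["role"] == "tool"]
    if tools and not tool_results:
        # The "model" decides it lacks knowledge and requests a tool call.
        return {"tool_call": {"name": "lookup_device",
                              "arguments": {"name": "X100"}}}
    context = tool_results[0] if tool_results else "no data"
    return {"content": f"Answer based on: {context}"}


def answer(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    descriptions = {name: t["description"] for name, t in TOOLS.items()}
    # Step 1: call the model with the list of available tools.
    first = fake_model(messages, tools=descriptions)
    if "tool_call" not in first:
        return first["content"]
    # Step 2: the wrapping framework executes the requested tool.
    call = first["tool_call"]
    result = TOOLS[call["name"]]["function"](**call["arguments"])
    messages.append({"role": "tool", "content": json.dumps(result)})
    # Step 3: call the model again without tools to get the free-form answer.
    return fake_model(messages, tools=None)["content"]


print(answer("How much power does the X100 draw?"))
```

Leaving out the tool list in the last call mirrors the loop protection described above: the model has to answer with whatever context it has at that point.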

TPU

is short for "Tensor Processing Unit", a custom hardware accelerator (chip), specifically tailored to do matrix and tensor operations at higher computational and energy efficiency than traditional CPUs or even GPUs. The technology was originally developed and deployed at Google for TensorFlow and proprietary frameworks the company used internally to optimize power consumption and speed. Since then, Google has opened the concept and pushed the technology to support other open-source machine learning frameworks but keeps controlling hardware tightly, as of 2026, only physically offering TPUs to select hyperscale customers.

Training

Training is an essential process in tuning a deep neural network for its application and for the model being useful at all. In its generic state, an artificial neural network is in a randomly initialized state. Using it in this configuration would result in the model only outputting gibberish or nonsensical data. In order for the model to be productive, it has to be exposed to meaningful data. So during training, neural networks are presented with vast amounts of real-world data. Some datasets used for training may have been pre-processed, curated, selected, or annotated by humans. During training, the model "learns" statistical patterns, connections, solutions, correlations that are ingrained or latently present in the training data. Learning means that the artificial neurons inside the model, its parameters, align with the realities of the training data. Models are not designed to learn individual facts, only probability structures. One of the technical curiosities of deep neural networks is that with parameters at the scale of billions, these networks exhibit emergent abilities after training and generate content that demonstrates capabilities surpassing what could be predicted beforehand. For example, learned statistical approximations result in the model showing common sense, although that is actually an emulation through stochastics. And through what is called "overfitting", models ultimately do memorize hard facts, although that stems from a network of solidified states within the large neural network, creating what is known as "subnetworks" or "highly specific parameter pathways". Compare Decision Tree.

TTFT

short for "Time To First Token". TTFT is a metric to describe the performance of an AI hardware and software stack, its inference latency. Available RAM, CPU or GPU type and its throughput in combination with efficiency of the LLM runner all determine how long it takes from submitting a query to the inference engine until the first token is emitted. It is not uncommon for an LLM stack to take 100-500ms until a first token is output. This and the nature of sequential token output has led to streamed responses or iterative placeholders to improve User Experience (UX) on AI web interfaces. Compare SSE.
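
Measuring TTFT on the client side only requires a clock and a token stream. In the sketch below the stream is simulated by a generator with an artificial delay. With a real backend, the iterable would be the streamed response, and the measured value would also include network latency.

```python
import time
from typing import Iterable, List, Optional, Tuple


def measure_ttft(token_stream: Iterable[str]) -> Tuple[Optional[float], float, List[str]]:
    """Return (time to first token, total time, tokens) for a token stream."""
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    for token in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens.append(token)
    end = time.perf_counter()
    ttft = None if first_token_at is None else first_token_at - start
    return ttft, end - start, tokens


def fake_stream(first_delay: float = 0.05):
    """Simulated backend: waits before emitting the first token."""
    time.sleep(first_delay)  # stands in for prompt processing time
    for token in ("Hel", "lo", "!"):
        yield token


ttft, total, tokens = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, tokens: {tokens}")
```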

Turing test

is a test proposed by British researcher Alan Turing in 1950. The idea is to pose a challenge in the form of a written interrogation, where an interrogator is asked to determine which of two players is a machine and which a human. The question behind the Turing test is whether a machine is able to exhibit intelligent behaviour equivalent to that of a human. The mode of answering this question is to task a human with finding out. And while the Turing test has been widely accepted as a baseline to test if a machine is able to "think", critics note that a sophisticated variant of a simplistic Eliza-style chatbot might be able to pass the Turing test through language mimicry but without actually being able to "understand". The Turing test, originally called the "imitation game", is thematically linked with the fictional "Voight-Kampff"-test from the 1982 motion picture "Blade Runner" where empathy and physiological responses are tested to determine if a person is a replicant.

User Input

is any information that a user enters, submits or transfers to an AI system with the intent of having the system process or react based on the input. User input can be text, audio, voice commands, images, videos, uploaded files or a combination thereof. In the context of modern AI systems, user input is usually referred to as "prompt", especially with language- or text-based AI models (although the raw input might be only a subset of the actual model input). A common characteristic of most AI systems is that overall user input quality significantly correlates with the helpfulness or intent alignment of the generated output ("garbage in, garbage out", GIGO). User input is part of what most vendors define in summary as "user content". Compare Prompt and also read the article "Legalities of User Content: The Shift in Ownership".

User Input processing

With AI chat systems, and users being asked to enter or submit arbitrary text and data, it is wise to preprocess and filter anything a user enters into some input element on a website or in an app. Here is a list of common "boxes" developers should check before a system is taken live, ranging from best-practice input sanitization, through AI-related checks, to basic AI pipeline preparation:

Normalize text:
Remove excessive whitespace and optionally unicode and/or control characters. Optionally remove extremely long uninterpretable noise like big binary blobs, huge base64, or try detecting if chunks of the input were copied & pasted.
Try to detect boundaries:
Keep punctuation and sentence boundaries. Keep newlines and separator characters.
Traditional safety pass:
Decide if it is ok to enter script or markup content. Optionally remove markup, like HTML. In case content is mirrored back, make sure script and code tags are properly escaped and XSS attacks are mitigated.
AI safety pass:
Check for adversarial prompts like prompt injection or jailbreaking attempts. Also check for injection vulnerabilities that target your LLM/model on the backend.
Semantic analysis:
Detect if user entered multiple distinct questions, intents or tasks. If so, split into sub-queries/etc. and decide if you want to handle each.
Length assessment:
Decide on a max length for user input. Optionally summarize raw input before proceeding.
Data masking:
Remove Personally Identifiable Information (PII) in case privacy or legal compliance on your part or for backend APIs requires you to do so.
Prompt template:
Decide if user input will be embedded in a prompt, with optional wrapper instructions.
Model selection:
Depending on your pipeline, you may want to feed the user input to different models based on query type.
Max token assessment:
Related to raw length limits, decide on the maximum tokens any user may consume, per query or in total.
Rate limit:
Prevent adversarial actors from spamming or overloading your input endpoint, in rate or size.
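
A few of the items above (text normalization, data masking and length assessment) can be sketched with the Python standard library. The length limit and the e-mail pattern are assumptions chosen for illustration. Real PII detection, prompt-injection checks and rate limiting need considerably more than this.

```python
import re
import unicodedata

MAX_CHARS = 4000  # assumed limit, tune per application

# Deliberately simple e-mail pattern; it will not catch every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Unify unicode forms, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters but keep newlines and tabs as boundaries.
    text = "".join(ch for ch in text
                   if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one empty line in a row
    return text.strip()


def mask_pii(text: str) -> str:
    """Replace e-mail addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def preprocess(text: str, max_chars: int = MAX_CHARS) -> str:
    text = mask_pii(normalize(text))
    if len(text) > max_chars:
        raise ValueError("input exceeds the maximum allowed length")
    return text


raw = "  Hi\x00   there,\n\n\n\nmail me: a.b@example.com  "
print(repr(preprocess(raw)))
```

The order of the steps matters: normalizing first makes the later pattern matching and length check operate on the same text the model will eventually see.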

Vector

A vector is a mathematical structure that describes a length (magnitude) and a direction in a virtual space (coordinate space). Its representation is an ordered list of numbers, called a Tuple. Such tuples can be of arbitrary length (n-tuples), an array of numbers, where each number may be an integer or floating-point number. In basic mathematics, a single number can be understood as a one-dimensional vector - as it describes a quantity within one spatial dimension. By adding a second number, the tuple describes a point in a plane relative to the origin, which could be illustrated as a line from the origin to that point, a vector in two-dimensional space. By adding a third number, this vector can now encode a position in three-dimensional (3D) vector space, and a 3-dimensional space is intuitive for us humans as the physical environment we all know from everyday experience is also a 3D space. If we assume the coordinate space's origin as a reference point of our exemplary 3D vector, this 3-dimensional vector could be imagined as a specific direction in 3D space.

The thing about mathematical vectors is that they can have an arbitrary number of elements. For example, adding a fourth number element (a fourth dimension) places the vector into 4D space, a fifth number means it exists in 5D space, and so on. Vectors containing many numbers, hundreds or thousands of elements, are usually described as being placed in high-dimensional vector space. An example vector such as [1.2, 3.4, 5.6, 7.8] is already difficult to imagine, even if we assume the fourth dimension here as representing "time". In math, a vector is usually treated as a single point, and a collection of high-dimensional vectors could be described as a "point cloud" within high-dimensional vector space, floating somewhere, with defined distances between individual points. Another way to imagine a single vector is to think of it as a concatenation of lines, one per dimension. Starting from the coordinate origin at [0,0,0], a first line goes in one direction, a second line is appended to the end of the previous one, going in a second direction, and so on. Such a vector forms a long concatenated "zig-zag" path through a virtual space, having a very specific "shape". The position of a point within the "point cloud", or the specific shape of the "zig-zag line", can be thought of as describing a unique "thing" with many "attributes" and "features". And it is this quality of vectors, their ability to describe multi-faceted phenomena, which makes them ideal to describe (encode) semantic meaning in semantic information retrieval. Compare Vector Database and AI Search.
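
To make magnitude and direction concrete, here is a small Python sketch. It represents vectors as plain lists of numbers; the function names and the 3D example values are assumptions for illustration. The same two functions work unchanged for any number of dimensions.

```python
import math


def magnitude(v):
    """Length of a vector: square root of the sum of squared elements."""
    return math.sqrt(sum(x * x for x in v))


def direction(v):
    """Unit vector pointing the same way as v (its length is 1)."""
    m = magnitude(v)
    if m == 0:
        raise ValueError("the zero vector has no direction")
    return [x / m for x in v]


v3 = [3.0, 4.0, 12.0]        # a point in 3D space, example values
v4 = [1.2, 3.4, 5.6, 7.8]    # the 4D example vector from the text

print(magnitude(v3))         # 13.0
print(direction(v4))         # four numbers describing a pure direction
```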

Vector Database

A Vector Database (sometimes Vector Store or ANN-Engine) is a specialized database system that is built and tuned for vector "similarity" searches, usually within high-dimensional (vector) spaces. Vector Databases either come as dedicated packages or add-on modules for traditional relational databases. Whereas traditional relational databases rely on exact matching and record scanning with indices, a vector database employs advanced retrieval techniques specifically designed to optimize the speed and efficiency of locating vectors that are "closely aligned". Relational databases usually map input queries to unique IDs or specific fields via exact match to locate specific table entries. In contrast, Vector Databases compare an input vector to those within the store to locate the matching vector itself or its associated metadata. Because vectors are able to encode multi-faceted arbitrary features in an abstract yet defined way, vector search engines are capable of returning similar, related or "fuzzy" matches where traditional retrieval struggles.

One key requirement for this to work is that both the queries and the stored data must be encoded in vector representation. So, when data is inserted into the database, whether it be textual, visual or audio content, it must be encoded as a vector, known as an Embedding, by running the content through an "embedding model". At runtime, each query then has to be encoded by that same model. Only this way is the database engine able to return matching records. It is also worth noting that all vectors in a collection share the same length (dimensionality). During the design of the vector database layout, developers define the vector length to match the output of the chosen embedding model, such as 384, 768 or 1536 dimensions. Then, regardless of query or content length, the embedding model will always return vectors of this fixed length, via internal collapsing (pooling) strategies. Dimensionality defines the "granularity" of the vector embeddings.
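
How pooling yields a fixed-length vector from inputs of varying length can be sketched as follows. The sketch assumes mean pooling, which is only one common strategy; actual embedding models may pool differently. The per-token vectors are made-up numbers, not the output of a real model.

```python
def mean_pool(token_vectors):
    """Average equal-length per-token vectors into one fixed-length vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(t) != dim for t in token_vectors):
        raise ValueError("all token vectors must share one dimensionality")
    n = len(token_vectors)
    return [sum(t[i] for t in token_vectors) / n for i in range(dim)]


# Two inputs of different length (2 and 4 tokens), 3 dimensions per token.
short_text = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
long_text = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [6.0, 6.0, 6.0]]

print(mean_pool(short_text))  # [2.0, 1.0, 1.0]
print(mean_pool(long_text))   # [3.0, 3.0, 3.0]
```

Both results have three elements, regardless of how many tokens went in.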

Most mainstream vector databases offer a set of popular or best-practice algorithms that define what "similar" means (distance functions) and hard-coded or modular storage backends to index, scan and locate entries. During a lookup, vectors can be compared using metrics like Euclidean Distance or Cosine Similarity, in various implementations or combinations thereof. One basic approach, as with any database, is to locate records via a "brute force" scan of the whole database, comparing each record with the input query. As this results in high time complexity, linear in the number of records to scan, packages usually offer several optimization schemes. In Product Quantization (PQ), vectors are broken into sub-vectors, and each sub-vector is replaced by the ID of its nearest k-means centroid. This compresses the stored vectors and allows distances to be approximated cheaply. PQ is often combined with a cluster-based partitioning of the whole dataset (an inverted file index, IVF), so that records only have to be located within a subset of pre-matched vectors. Locality-sensitive hashing (LSH) hashes vectors in such a way that similar vectors tend to land in the same bucket, which yields a first approximate region within the coordinate space before locating records. By dividing the vector space along sectional planes, the database engine is able to reduce the number of possible matches down to nearest-neighbor candidates within a sub-region of the total vector space. (Hierarchical) Navigable Small World (HNSW) indices search along the links between vector points to locate nearest neighbor records. By indexing starting points for this navigational search, this approach is typically able to locate nearest neighbors in fewer iterations than other schemes (as of 2026).
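
The idea of dividing the vector space along planes can be sketched in a few lines of Python. This is a toy version of one LSH variant (random hyperplanes), not the implementation of any particular database. The function names, the number of planes and the data are assumptions for illustration.

```python
import random


def make_hyperplanes(dim, n_planes, seed=0):
    """Random planes through the origin, each given by its normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per plane: on which side of the plane does the vector lie?"""
    return tuple(
        1 if sum(a * b for a, b in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group record numbers into buckets by their signature."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(idx)
    return buckets


def candidates(query, hyperplanes, buckets):
    """Records sharing the query's bucket; only these get compared exactly."""
    return buckets.get(lsh_signature(query, hyperplanes), [])


planes = make_hyperplanes(dim=4, n_planes=8)
data = [[1.2, 3.4, 5.6, 7.8], [-1.2, -3.4, -5.6, -7.8], [0.5, -1.0, 2.0, 0.0]]
index = build_index(data, planes)
print(candidates([1.2, 3.4, 5.6, 7.8], planes, index))
```

Vectors pointing in similar directions tend to fall on the same side of most planes and therefore share a bucket. Near neighbors can still end up in different buckets, which is why the result is approximate and why practical setups usually use several hash tables.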

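These optimization techniques fall under Approximate Nearest Neighbor (ANN) search algorithms, which trade a small amount of accuracy for speed. The popular K-nearest neighbors (KNN) search, in contrast, returns exactly the top K items with the smallest distances to the input query vector, which in its plain form requires the brute-force scan described above.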
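
A brute-force KNN lookup fits in a few lines of Python. The sketch below assumes Euclidean distance and a tiny in-memory dictionary of records; the record names and vector values are invented for illustration.

```python
import heapq
import math


def knn(query, records, k):
    """Exact KNN: compare the query with every record, keep the k closest.

    records maps a record id to its vector; cost grows linearly with
    the number of records, which is what ANN indices try to avoid.
    """
    return heapq.nsmallest(
        k, records, key=lambda rid: math.dist(query, records[rid])
    )


records = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.3, 0.0],
    "car": [0.0, 0.2, 0.9],
}

print(knn([1.0, 0.2, 0.0], records, k=2))  # ['cat', 'kitten']
```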

In semantic search, language processing, RAG pipelines, object detection and recommendation engines, Vector Databases have found wide deployment to match misspelled or vague concepts with "related" data. Real-world deployments usually combine pure vector lookups with more traditional lookups via associated metadata, forming a hybrid multi-level system. Depending on the actual query, related metadata can then be used to pre- or post-filter nearest neighbor candidates to compile a higher quality final query result. Compare "AI Search".

Illustration: The Vector Database. An embedding model is central in indexing data and locating entries.

Popular vector databases

Vector databases exist either as purpose-built implementations shipped as one standalone software suite or as a backend module for established database packages. If there is a "MySQL of vector databases", it probably is Qdrant. Among extensions for database packages, the pgvector module for PostgreSQL currently outshines the vector add-ons for MySQL and MariaDB in maturity.

Qdrant
Open-source, Rust-based, very fast, great for self-hosting or cloud use.
Redis
Ships with vector search, in-memory, extremely fast, great for caching and real-time AI workloads.
Milvus
Open-source, enterprise-grade, backed by Zilliz, built for massive scale.
Pinecone
Fully managed, fast, polished, proprietary, can get expensive at scale.
Chroma
Glues easily with Python, simple, developer-friendly, ideal for small/medium projects.
Weaviate
Open-source, feature-rich, supports hybrid search, cloud or self-hosted.
pgvector (Postgres extension)
SQL-friendly, stores vectors in Postgres, scales from small to enterprise projects.

Vector Search

is a matching scheme in vector databases that uses linear algebra to locate entries. A vector database does not rely on exact terms or specific numeric IDs to locate entries, but instead stores high-dimensional vectors, usually representing embeddings, and retrieves matching records based on similarity measures like cosine similarity, dot product or Euclidean distance. By looking at the "closeness" of vectors and collecting closely aligned vectors from the database, a vector search is able to return fuzzy matched results for a certain input. While traditional relational databases used truncation, edit distance or Soundex in order to assess similarity, vectors are superior in mapping semantic similarity. When the stored vectors are embeddings, a vector database is able to return very useful, semantically sound results for misspelled, descriptive or vague search queries. See "AI Search" for more on vector search and how it is used in combination with RAG for modern hybrid search.
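
The three similarity measures named above can be written out in plain Python. This is a minimal sketch for illustration. The function names and example vectors are assumptions, and real engines use optimized numeric libraries instead.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with alignment and with length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the normalized vectors: 1.0 means same direction."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def euclidean_distance(a, b):
    """Straight-line distance between two points: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, but twice as long

print(dot_product(a, b))         # 18.0
print(cosine_similarity(a, b))   # 1.0
print(euclidean_distance(a, b))  # 3.0
```

The example shows why the choice of measure matters: cosine similarity treats a and b as a perfect match because it ignores length, while Euclidean distance still reports them as three units apart.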

Vector Store

usually refers to a lighter, potentially in-memory or feature-reduced variant of a full-featured enterprise-grade vector database. Some vendors, like OpenAI, prefer the term vector store to emphasize the generic implementation of the vector storage layer.

Vibe Coding

is a work approach in computer programming where a developer uses mostly AI coding tools to produce software source code. It usually focuses on fast iteration and quick results instead of structured systematic design. Vibe Coding may be done in a heavily assisted mode of operation, where the developer mainly engineers prompts, then reviews the output and prompts again to form the final code. Such an approach is becoming increasingly popular among experienced developers who use AI tools for a first draft, a quick prototype or to bootstrap a project (boilerplate code) and then work traditionally from there. In an extreme way, Vibe Coding may describe a mode where a completely inexperienced, non-technical person instructs an AI to produce source code that is not understood and is only assessed by the code's ability to fulfill a certain task, yet completely ignoring the fact that the software may contain unintended features, flawed logic or senseless code. After all, such code has never been reviewed by a knowledgeable computer-savvy human.

vLLM

is a runner framework for large language models (LLMs). Besides Ollama, it is one of the more popular choices to run an LLM, but uses a different technology stack. Developed at UC Berkeley, the project is deeply rooted within the Python/PyTorch ecosystem and is primarily written in Python (roughly 90%), complemented by C++/CUDA kernels. Its signature feature is "PagedAttention", a memory management layer for the attention key-value (KV) cache that enables drastically better throughput and overall higher scalability compared to other LLM runners like Ollama. vLLM is positioned as the production-ready suite for data-center deployments of AI models, offering better resource utilization, higher concurrency and overall higher throughput. Compare LLM Runner.

Vocabulary

Within the context of AI, the term vocabulary denotes the set of all tokens a model is able to recognize, process, or generate. When a model processes a prompt, the input text stream is first broken down into tokens, standing for words, subwords, characters, or compound strings. Such tokens are then matched against the model's vocabulary of known tokens, and each is assigned its unique numeric ID. This process is also known as aligning with the vocabulary. The size and quality of a vocabulary determine a model's ability to represent ("grasp") concepts and patterns, for example in language, and how well it can process concepts the model has rarely seen during training. Vocabulary size sets a limit on how granular knowledge on the fringes of a model's latent knowledge space can be represented.
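
The mapping from text to token IDs can be illustrated with a toy vocabulary. The sketch below assumes a greedy longest-match strategy and a made-up seven-entry vocabulary. Real tokenizers use trained schemes with tens of thousands of entries, so this only demonstrates the lookup principle.

```python
# Hypothetical vocabulary: token string -> unique numeric ID.
VOCAB = {"<unk>": 0, "un": 1, "believ": 2, "able": 3, "token": 4, "s": 5, " ": 6}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    max_len = max(len(token) for token in vocab)
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No vocabulary entry matches: emit the "unknown" token.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


print(tokenize("unbelievable tokens"))  # [1, 2, 3, 6, 4, 5]
print(tokenize("xs"))                   # [0, 5]
```

The second call shows the limit a vocabulary imposes: the character "x" is not covered and collapses into the generic unknown token.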

Weights

The term weights is a vague term in the field of artificial intelligence. Not by definition, but through common usage. Originally, in AI, a weight mostly referred to the internal workings of deep neural networks. Modern architectures, such as LLMs, are made up of layers of transformer blocks where processed tensors are mathematically transformed according to a model's training. When input vectors pass through this mathematical structure, their values are modified according to the model's weight values ingrained into these structures. So "weights" is usually a pars pro toto term for the web of weights and biases that a model internally applies to generate its output.
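
How weights and biases modify an input vector can be shown with a single dense layer. This is a strong simplification of what happens inside a transformer block, and all numbers are made up for illustration. In a trained model, the values of W and b are what training has produced.

```python
def linear_layer(x, weights, bias):
    """Compute y = W x + b, with weights given as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(v):
    """A simple activation function: negative values become zero."""
    return [max(0.0, y) for y in v]


W = [[0.5, -2.0],   # 3 outputs, 2 inputs: one row of weights per output
     [2.0, 0.0],
     [1.0, 1.0]]
b = [0.0, -1.0, 0.5]   # one bias per output
x = [2.0, 1.0]         # input vector

y = linear_layer(x, W, b)
print(y)         # [-1.0, 3.0, 3.5]
print(relu(y))   # [0.0, 3.0, 3.5]
```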

The vague and colloquial usage of the term weight now comes from how neural networks are technically orchestrated. Models are data structures that are ultimately stored, in memory or on disk. And when models are exchanged, on AI hubs like Hugging Face or elsewhere, people tend to speak of these model files as what they represent: the encoded model knowledge, or its weights. Consequently, developers tend to refer to model files as weights. Compare Parameters and Model Formats.

Zero-Shot Learning

is when a model performs or is able to perform a task without having been trained on explicit demonstrations of the task itself. It is a model's ability to transfer general knowledge to a new scenario, by reasoning or deducing, from unrelated or conceptually similar knowledge from another training data domain. This means a model has seen structurally, conceptually, or semantically similar knowledge that allows it to generalize, but it hasn't seen the exact task. This ability is an important element of modern AI models, as it allows them to do things they have not been explicitly prepared for. The term zero-shot "learning", on the other hand, is a bit misleading, as it suggests the model would learn something during inference time. In reality, a trained model is static and does not alter its internal weights during generation. The ability to adapt internally stems solely from activation state shifts during inference, not plasticity. Incorporating context from prompt input does not mean the model "learns". That is why it would be more appropriate to speak of "zero-shot behavior", "zero-shot inference" or "zero-shot generalization".

The reason most sources speak of "learning" is that zero-shot behavior stems from how the model was trained, the training side of things. And while imprecise, the term then stuck. This is also a matter of perspective: seen from the model's side, all user interaction is "learning", while everything the user does is "prompting" or "instructing". Likewise, there's the term "zero-shot prompting", which looks at how users engineer their prompts, in this case giving no input-output examples to guide the model. In such a case, a model's zero-shot ability decides how well it can answer such a prompt. Popular real-world examples of this ability are solving a puzzle, although someone hasn't seen this exact puzzle before, or commonsense reasoning, where someone understands that ice will melt in the sun, usually. Zero-shot learning is the ability to deduce an abstract concept and apply it to a different, mostly unknown scenario. When a model can translate a completely new language zero-shot, then it can do so only because it has learned general language patterns from other languages during training. Zero-shot learning is conceptually the opposite of Few-Shot Learning, so compare Few-Shot Learning and In-Context Learning.
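
The difference between zero-shot and few-shot prompting is easiest to see in the prompt text itself. The following sketch only builds the prompt strings; the "Input:/Output:" layout and the function names are assumptions for illustration, not a format required by any particular model.

```python
def zero_shot_prompt(task, query):
    """Instruction plus the actual input, without any worked examples."""
    return f"{task}\n\nInput: {query}\nOutput:"


def few_shot_prompt(task, examples, query):
    """Same instruction, preceded by input-output examples ("shots")."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{task}\n\n{shots}\n\nInput: {query}\nOutput:"


task = "Classify the sentiment of the input as positive or negative."
examples = [("I love this phone.", "positive"), ("The battery died.", "negative")]

print(zero_shot_prompt(task, "The screen is gorgeous."))
print()
print(few_shot_prompt(task, examples, "The screen is gorgeous."))
```

With the first prompt, the model has to rely entirely on what it generalized during training. The second prompt gives it two demonstrations to imitate.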

Note on trademarks

Many of the designations used by manufacturers and sellers to distinguish their products or services are claimed as trademarks. Where those designations appear in this text and Micropolis and/or the authors were aware of a trademark claim, the designations are mentioned along with their owners and may be additionally marked with a trademark symbol. Their use here in this FAQ on artificial intelligence (AI) is for educational use of the reader and is covered under nominative fair use. Micropolis is in no way suggesting support, sponsorship or endorsement of the owner of these trademarks. Only as much of such marks is used as is necessary to identify the trademark owner, product, or service.