How Much Data and Cost Does It Take to Build a Large Language Model (LLM)?

We casually chat with AI like ChatGPT or Claude every day. And at some point, it is only natural to wonder:
"Could I build an AI like this on my laptop?"

The short answer: it is practically impossible. Building a large language model (LLM) goes far beyond what an individual can experiment with alone. Like constructing a massive building or developing a space launch vehicle, it requires enormous capital and large-scale infrastructure.

So exactly how much data and money go into this? Let us look at concrete figures from news reports and official company announcements.

A Scale of Data That Encompasses the Entire Internet

Training an LLM is comparable, in human terms, to "reading and memorizing every book in every library." The volume an AI actually consumes, however, is far larger than any person could read in a lifetime.

Real Data Scale: The Meta Example

In 2024, Meta (the parent company of Facebook) released Llama 3, its latest model at the time, and disclosed the scale of its training data.

"Llama 3 was trained on more than 15 trillion tokens collected from publicly available sources."
— Meta Llama 3 Official Announcement (2024)

Do you have a sense of how large 15 trillion tokens is?

If one token corresponds to approximately 0.75 English words (a common rule of thumb), then 15 trillion tokens is roughly 11 trillion words.
Converting to the average book (about 100,000 words), that is equivalent to reading more than 110 million books.
If a person read one word per second without sleeping, it would take roughly 360,000 years.
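
The conversion above can be sketched in a few lines of Python. The ratio of 0.75 words per token is a rule-of-thumb assumption rather than a figure from Meta's announcement, and the function name `reading_scale` and its parameters are made up for this illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # about 31.6 million seconds


def reading_scale(tokens, words_per_token=0.75, words_per_book=100_000):
    """Convert a token count into words, books, and nonstop reading time.

    words_per_token is an assumed rule of thumb; real tokenizers vary.
    """
    words = tokens * words_per_token
    books = words / words_per_book
    years = words / SECONDS_PER_YEAR  # one word per second, no breaks
    return {"words": words, "books": books, "years": years}


scale = reading_scale(15e12)  # 15 trillion tokens, as in Meta's announcement
print(f"words: {scale['words']:.3g}")
print(f"books: {scale['books']:.3g}")
print(f"years: {scale['years']:.3g}")
```

Every downstream figure scales linearly with `words_per_token`, so the result is best read as an order-of-magnitude estimate.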

LLM training draws on a large share of the text publicly available on the internet: Wikipedia, news, academic papers, code (GitHub), and publicly available books. This text must also be collected and cleaned, which means removing duplicates, discarding low-quality data, filtering harmful content, and more.

GPU and IT Infrastructure Costs in the Billions of Dollars

Once the data is ready, you need the "brain" to process it: GPUs. These are not ordinary computer parts but chips built specifically for AI computation, such as the NVIDIA H100.

Mark Zuckerberg's Announcement

In early 2024, Meta CEO Mark Zuckerberg announced staggering plans via Instagram Reels:

"By the end of this year we will have 350,000 NVIDIA H100s, and in total, infrastructure equivalent to 600,000 H100s."
— Mark Zuckerberg, Instagram Reels (January 2024)

Analyzing this from an economic perspective:

H100 price: As of 2024, a single H100 GPU costs approximately $25,000–$40,000.
Calculation: 350,000 units × $35,000 per unit (assumed) = GPU purchase cost alone exceeds **$12 billion**.
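
The same arithmetic can be checked across the whole quoted price range with a short sketch. It covers hardware purchase only, and the helper name `gpu_purchase_cost` is hypothetical.

```python
H100_UNITS = 350_000  # GPU count from Zuckerberg's announcement


def gpu_purchase_cost(units, unit_price_usd):
    """Return the hardware-only purchase cost in US dollars."""
    return units * unit_price_usd


# Price points: the low and high ends of the quoted range, plus the
# $35,000 per-unit figure assumed in the calculation above.
for label, price in [("low", 25_000), ("assumed", 35_000), ("high", 40_000)]:
    cost = gpu_purchase_cost(H100_UNITS, price)
    print(f"{label:>8}: ${cost / 1e9:.2f} billion")
```

Depending on the unit price, the purchase cost falls between $8.75 billion and $14 billion, with $12.25 billion at the assumed $35,000.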

Global big-tech companies are pouring amounts rivaling national budgets into building data centers.

AI Training Costs in the Hundreds of Millions to Billions of Dollars

After buying the expensive hardware, you still have to power it and actually train the model. The cost of building and running data centers, including construction, labor, and electricity, keeps climbing steeply.

Real AI Training Costs from Big-Tech CEOs

OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei have both cited specific figures related to training costs in interviews.

| Model | Cost scale (as stated) | Source |
| --- | --- | --- |
| GPT-4 | "It cost more than $100 million to train the GPT-4 model." | Sam Altman (Wired interview, 2023) |
| Claude | "The models we are currently training are at the $1 billion level." | Dario Amodei (podcast interview, 2024) |
| Future models | "In 2026–2027, models costing $10 billion will emerge." | Dario Amodei (same interview) |

The costs mentioned here refer to a "single final training run." The full project cost, including numerous failed training attempts, data labeling labor, and salaries for top-tier engineers, is far higher.

Why Do Big-Tech Companies Invest at This Enormous Scale?

This multibillion-dollar level of investment is not simply a show of financial strength. In AI research, there is an empirical tendency known as scaling laws.

Scaling laws describe the observed result that when the volume of data, model size, and computation are increased in proportion, model performance improves along a relatively predictable curve. Up to a certain point, the more resources invested, the more refined the model becomes, and the more complex the problems it can handle.
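
Published scaling-law studies typically fit this kind of curve with a power law. The sketch below uses that functional form purely as an illustration: the constants `a`, `b`, and `floor` are made up, not fitted to any real model, and the compute values are in arbitrary units.

```python
def predicted_loss(compute, a=10.0, b=0.05, floor=1.5):
    """Toy power-law curve: loss falls as compute grows, toward a floor.

    a, b, and floor are hypothetical constants for illustration only.
    """
    return floor + a * compute ** (-b)


# Each step multiplies compute by 10 (arbitrary units).
for exponent in range(20, 27):
    compute = 10.0 ** exponent
    print(f"compute=1e{exponent}  predicted loss={predicted_loss(compute):.3f}")
```

In this toy curve, each tenfold increase in compute lowers the predicted loss by a smaller absolute amount than the previous one, yet the trend stays smooth. A smooth trend is what lets companies estimate the payoff of a larger training run before committing to it.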

Big-tech companies have confirmed this pattern directly through developing multiple generations of models. As a result, they have developed a conviction that "investing more data and computation leads to better performance," forming a competitive structure in which each company races to secure more capital and GPUs. It is both a technology competition and a strategic investment toward long-term market dominance.

The company that holds the best AI model will have an advantage across many future industries, including search, productivity tools, advertising, cloud, robotics, and autonomous driving. The current large-scale investment can therefore be understood as a long-term strategy to capture future market leadership rather than short-term profit.

Because of this structure, it is very difficult for university research labs or small startups to build foundation models (large-scale models that serve as the basis for diverse services) like GPT, Claude, or Gemini from scratch. AI development has evolved from a simple competition of ideas into a capital-intensive industry that requires large-scale computational resources, infrastructure, and capital together.
