Mining LLM Pretraining Data: Topics, Skills, and Cognitive Patterns

Community Article Published April 29, 2025

[Image: AGI as imagined by Qwen]

Summary

This post details an analysis of pretraining data associated with various Large Language Models (LLMs), including GPT-2, Falcon, and Gemma2. Using text mining techniques (embeddings, clustering, and LLM-based annotation) on datasets such as OpenWebText, The Pile, and C4, the study identified several key patterns.

Findings show the data is dominated by topics like Technology, Politics, Health, Business, and Culture, originating from diverse sources including web scrapes, academic papers, code repositories, and news media. The data reflects the work of professionals primarily in Journalism/Media, Content Creation, Analysis/Research, Academia, and Tech/Engineering. Consequently, LLMs learn corresponding skills (e.g., Research, Critical Thinking, Communication, Domain Expertise) and task representations (e.g., Analysis, Content Creation, Compliance).

The analysis also uncovered distinct writing styles, underlying cognitive frameworks (beliefs, frames, schemas, memes), and common cognitive biases (like Confirmation Bias) embedded in the data. LLM capability progression appears linked to data scale and task frequency, following a power law. The study concludes that LLMs are powerful data-driven simulators whose capabilities and limitations are shaped by the composition and inherent biases of their pretraining corpora, highlighting the importance of data understanding and curation.

Introduction

Large Language Models (LLMs) demonstrate complex capabilities derived from their extensive pretraining data. This post presents a technical analysis of the pretraining corpora associated with several prominent LLMs. By employing text mining techniques, we investigate the constituent topics, professional personas, skills, tasks, and cognitive patterns embedded within these datasets. The objective is to provide empirical insights into the data-driven factors influencing LLM behavior and capability development.

Theoretical Framework: LLMs as Data-Driven Simulators

LLMs can be conceptualized as high-dimensional interpolative databases that learn statistical representations of human language and associated activities. They function by predicting subsequent tokens based on patterns learned from vast text corpora, effectively simulating the personas, tasks, and cognitive styles present in the data. Skill acquisition often follows a power-law distribution: common skills are foundational, while rarer skills emerge only with increased model scale and data exposure, sometimes manifesting as few-shot or one-shot learning because of their statistical rarity and potential orthogonality in the learned representation space. Prompts serve as queries into this learned space, activating relevant 'vector programs' (implicit task representations encoded in embedding relationships).
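The power-law intuition can be made concrete with a small sketch. Assuming, purely for illustration, that task frequency follows a Zipf-like distribution with exponent `alpha` (a hypothetical parameter, not a measured value), the expected number of training examples of a task falls off sharply with its frequency rank:

```python
import numpy as np

def expected_examples(n_docs: int, rank: int, alpha: float = 1.0) -> float:
    """Expected count of the task at frequency `rank` in a corpus of
    `n_docs` documents, under an illustrative Zipf(alpha) model."""
    ranks = np.arange(1, 10_001)          # truncate the support for normalization
    weights = ranks ** (-alpha)
    p = (rank ** -alpha) / weights.sum()  # probability a document exercises this task
    return n_docs * p

common = expected_examples(1_000_000, rank=1)     # a very frequent task
rare = expected_examples(1_000_000, rank=1000)    # a long-tail task
```

Under these assumptions, the rank-1 task appears orders of magnitude more often than a rank-1000 task in the same corpus, which is why rare skills tend to surface only at larger data and model scales.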

Methodology for Data Analysis

The analysis encompassed pretraining corpora relevant to LLMs spanning various capability levels, including GPT-2, GPT-Neo, GPT-NeoX-20b, Falcon-40b, K-65b, and Gemma2-27b. Datasets analyzed included samples from OpenWebText, The Pile, C4, RefinedWeb, and RealNews.

The analytical pipeline consisted of:

  1. Data Sampling: Initial sampling of ~300k records per dataset.
  2. Embedding and Clustering: Application of GTE and SGPT sentence embeddings (using models appropriate to the data source, e.g., T5 for C4) followed by clustering to identify semantic groups. Cross-cluster sampling was performed to ensure diversity.
  3. Dataset Refinement: Downsampling to a core dataset of 15k diverse records for intensive analysis.
  4. LLM-based Annotation: Utilization of Exaone-3.5-32B-Instruct for detailed feature extraction and labeling (topics, inferred job profiles, skills, tasks, cognitive elements like attitudes, beliefs, frames, schemas, memes, biases).
  5. Label Aggregation: Application of stella_en_1.5B_v5 for hierarchical clustering of generated labels to identify salient patterns.
  6. Synthesis: Use of NotebookLM to consolidate and summarize findings across all analyses.
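Steps 2-3 of the pipeline can be sketched as follows. This substitutes a lightweight TF-IDF vectorizer and scikit-learn KMeans for the GTE/SGPT sentence embeddings actually used in the study; the cluster count and per-cluster sample size are illustrative placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cross_cluster_sample(texts, n_clusters=3, per_cluster=2, seed=0):
    """Cluster documents, then draw an equal-sized sample from each cluster
    so the refined subset preserves semantic diversity."""
    X = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(X)
    rng = np.random.default_rng(seed)
    sample = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, idx.size)  # clusters may be smaller than requested
        sample.extend(rng.choice(idx, size=take, replace=False).tolist())
    return sorted(sample)
```

The same shape of logic scales from this toy example to the ~300k-record samples described above: embed, cluster, then sample across clusters before downsampling to the core analysis set.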

Empirical Findings

The analysis yielded several key findings regarding the composition and characteristics of the pretraining data:

1. Dominant Topic Clusters: The data exhibits broad thematic coverage. Major topics identified include:

  • Technology: Software Development (Python, JavaScript, C++, web dev), AI/ML, Cybersecurity, Cloud Computing, Blockchain.
  • Politics & Government: Governance Systems, Elections, Policy Analysis (tax, health, education), International Relations, Diplomacy.
  • Health & Medicine: Disease Pathology & Treatment, Healthcare System Analysis, Medical Research (biomedical, clinical trials), Public Health, Mental Health.
  • Business & Finance: Economic Theory & Indicators, Financial Markets, Corporate Operations & Strategy, Industry Analysis (e-commerce, sustainability, cannabis), Consumer Behavior.
  • Culture & Society: Arts & Entertainment (film, music, literature), Media & Communication, Social Issues (equality, justice), Religion & Philosophy, Education, Sports & Lifestyle.

2. Prevalent Data Sources: Analysis confirms the heterogeneity of sources, including:

  • Web Corpora: Pile-CC, OpenWebText2.
  • Academic/Research: PubMed Central, ArXiv, PhilPapers, NIH ExPorter, various journal archives.
  • Code Repositories: GitHub.
  • Legal/Patent Data: FreeLaw, USPTO Backgrounds, Patent databases.
  • Technical/Community Forums: Stack Exchange, Ubuntu IRC, HackerNews.
  • Literature/Books: BookCorpus2, Books3, Project Gutenberg.
  • News/Media Outlets: Associated with RealNews and C4 datasets.
  • Other: Wikipedia, Subtitles (OpenSubtitles, YouTube), Email corpora (Enron).

3. Dominant Authorial Personas (Inferred Job Profiles): Clustering suggests the data originates from distinct professional groups:

  • Journalism & Media: Broad representation including specialized reporters (tech, politics, health, finance), editors, critics, analysts.
  • Content Creation & Digital Media: Roles focused on digital platforms, including bloggers, social media managers, web developers, authors, podcasters.
  • Analysis & Research: Financial, data, policy, and industry analysts; researchers; scientists; academics.
  • Academia & Education: Professors, researchers, lecturers, instructional designers, educational technologists.
  • Technical & Engineering: Software developers, engineers (various disciplines), systems administrators.
  • Other Significant Roles: Legal professionals, healthcare practitioners & administrators, business management, marketing & sales, environmental specialists.

4. Prevalent Skill Representations: The analysis identified recurring skill clusters reflected in the data:

  • Language Proficiency (English): Grammar, syntax, vocabulary, writing clarity.
  • Information Processing: Research techniques, fact-checking, source evaluation, critical thinking, data analysis/interpretation, synthesis, summarization.
  • Communication: Written communication (diverse styles), verbal articulation, presentation, interviewing.
  • Technical Aptitude: Computer/digital literacy, web fundamentals (HTML/CSS), programming concepts, cybersecurity principles.
  • Domain Expertise: Knowledge specific to fields like law, finance, politics, healthcare, science, history, culture.
  • Cognitive/Interpersonal: Emotional intelligence, empathy, analytical reasoning, problem-solving, strategic planning, organization.

5. Common Task Representations: Specific tasks frequently represented in the data include:

  • Research & Analysis: Information gathering, data compilation/analysis, trend analysis, background research, legal/market/historical research, fact verification.
  • Content Generation & Structuring: Drafting text (articles, reports, summaries, emails), organizing content, structuring documents (outlines, headings), summarizing, transcribing.
  • Compliance & Standardization: Adhering to regulations/guidelines (legal, ethical, industry), ensuring data privacy, standardizing procedures.
  • Information Integration & Synthesis: Combining data from multiple sources, incorporating expert quotes/opinions, synthesizing diverse perspectives.
  • Documentation & Presentation: Formatting text/documents, creating visuals, managing document submission, publishing content.
  • Stakeholder Interaction: Conducting interviews, engaging community/stakeholders, managing communications, responding to inquiries.

6. Dominant Writing Styles: Stylistic analysis linked common writing patterns to inferred professional roles:

  • Journalistic/Reportorial: Objective, factual, structured (associated with journalists, analysts).
  • Corporate/Strategic: Formal, diplomatic, policy-oriented (associated with PR, strategists, business roles).
  • Creative/Narrative: Descriptive, storytelling, persuasive (associated with writers, marketers, bloggers).
  • Analytical/Technical/Scientific: Expository, detailed, evidence-based (associated with scientists, analysts, technical writers).
  • Advocacy/Persuasive: Action-oriented, awareness-raising (associated with activists, politicians).
  • Instructional/Explanatory: Didactic, practical guidance (associated with educators, trainers).
  • Other Styles: PR/Marketing, Reflective/Philosophical, Interactive/Social Media, and Legal/Official styles were also identified.

7. Inferred Cognitive Frameworks (Beliefs, Frames, Schemas, Memes): The analysis identified recurring cognitive patterns influencing the data:

  • Beliefs: Underlying assumptions (e.g., value of democracy, importance of evidence, skepticism towards authority, environmental responsibility, tech optimism).
  • Frames: Interpretive structures (e.g., problem-solution, social justice, economic impact, sustainability, accountability, narrative).
  • Schemas: Procedural knowledge patterns (e.g., scientific method, legal compliance procedures, investigation process, crisis management protocols).
  • Memes: Culturally transmitted ideas within groups (e.g., "Data-Driven," "Fail Fast," "Think Globally, Act Locally," "Open-Source Ethos").
  • Attitudes: Expressed stances (e.g., critical, skeptical, optimistic, pragmatic, objective, client-centric).
  • Motives: Inferred goals (e.g., consumer protection, public awareness, social justice, compliance, innovation, efficiency).
  • Dominant Mindset: Characterized by professionalism, practicality, data-reliance, clarity, structure, outcome-focus, and critical analysis.

8. Manifestation of Cognitive Biases: Human cognitive biases are present in the data, potentially influencing LLM outputs:

  • Confirmation Bias: Prevalent tendency to favor information confirming existing beliefs.
  • Anchoring Bias: Over-reliance on initial information points.
  • Availability Heuristic: Overestimation based on ease of mental recall.
  • Other Identified Biases: Status Quo Bias, Self-serving Bias, Halo Effect, Sunk Cost Fallacy, Legal Compliance Bias, In-group Favoritism, Motivated Reasoning.

9. LLM Capability Evolution: Comparative analysis suggests a progression in capabilities linked to model scale and data characteristics:

  • Emergence & Scaling: Capabilities appear linked to task frequency, with rarer tasks requiring larger models/datasets (power law scaling).
  • GPT-2 Baseline: Demonstrates task competence on high-frequency topics/pathways but exhibits fragility.
  • GPT-3 Level Advancements: Marked improvements in few-shot/zero-shot learning, task generalization, and coherence.
  • GPT-4 Level Refinements: Enhanced reasoning, improved factual accuracy, superior context handling, broader knowledge base, improved translation.
  • Capability Trajectory: Observed progression from basic language tasks -> knowledge retrieval -> comprehension/RAG -> logical reasoning -> complex instruction following -> deep expertise/long-chain reasoning.
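The link between task frequency and required data scale admits a back-of-envelope sketch. All frequencies and thresholds below are hypothetical, chosen only to show why rarer tasks demand far larger corpora:

```python
import math

def docs_needed(f: float, k: int = 100) -> int:
    """Corpus size at which a task appearing in a fraction `f` of documents
    is expected to have been seen at least `k` times."""
    return math.ceil(k / f)

common_task = docs_needed(1e-2)   # e.g. a summarization-like pattern
rare_task = docs_needed(1e-7)     # e.g. a niche legal procedure
```

Under these toy numbers, the rare task needs a corpus several orders of magnitude larger than the common one before the model has seen comparable evidence for it, consistent with the power-law scaling described above.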

Discussion and Implications

This analysis underscores the profound influence of pretraining data composition on LLM capabilities and behavior. Models act as mirrors, reflecting the distribution of topics, skills, professional norms, cognitive styles, and biases inherent in their training corpora.

  • Versatility vs. Bias: The breadth of data enables wide-ranging applications. However, the statistical nature of learning means LLMs inevitably internalize and potentially amplify prevalent viewpoints, professional jargon, cultural memes, and cognitive biases from the data. Dominant personas may disproportionately shape the model's default style or perspective.
  • Importance of Data Provenance: Understanding the sources, the inferred authorial roles, and the associated cognitive frameworks is critical for interpreting LLM outputs, anticipating limitations, and developing effective prompting strategies.
  • Efficiency of Data Curation: The power law governing skill emergence suggests diminishing returns from simply scaling undifferentiated data. Strategic curation of high-quality, diverse, and potentially more balanced datasets may offer a more efficient path toward robust and reliable capabilities.

Conclusion

Mining pretraining data reveals that LLMs learn a complex statistical model of the human knowledge, skills, and cognitive patterns represented therein. Their capabilities are a direct consequence of this data's composition, encompassing its diverse topics, the skills and tasks of its creators, their writing styles, cognitive frameworks, and inherent biases. Effective development and application of LLMs necessitate a critical understanding of this data-model relationship, promoting awareness of both their remarkable potential and their intrinsic limitations.

