Introduction
LLMs can be used to build accurate and informative first-person narratives from historical periods, mimicking the language and speech of an era. These narratives can guide listeners on a journey into a specific period in history, expanding our understanding of the culture and language of historical eras in an engaging way that suits educational settings such as schools and museums. Off-the-shelf models are not well suited to this task: because they are trained on broad, largely modern corpora and are not tailored to period-accurate writing, they are prone to anachronisms and inaccuracies in both language and historical detail. In my experiments, current models produced sub-par narratives even after many different prompt engineering and few-shot prompting methods.
To fine-tune an LLM for this task, I first picked a suitable base model: one that produced passable narratives with few-shot prompting and had few enough parameters that fine-tuning would not require massive amounts of compute. I chose Qwen2.5-14B-Instruct-1M for this purpose. I then used Project Gutenberg and other sources to collect historical documents as input data, pairing each document with a synthetically generated first-person narrative. These pairs served as the training data for LoRA, which updates the parameters most relevant to my custom task. The historical narratives generated after fine-tuning were much stronger than those from current off-the-shelf LLMs and exceeded expectations. If used in schools, this model could create engaging, creative, and informative first-person narratives that build students' knowledge of and interest in history.

Training Data
For this task, I used first-person sources and historical documents from Project Gutenberg as input data, supplemented by manual searches for certain well-known documents. Project Gutenberg's main goal is to digitize cultural and historical works, so it includes many biographies and memoirs from throughout history that are well suited to teaching an LLM to build an accurate narrative in the style of a document's era. The target output for each input document is a first-person narrative based on the events it describes. The bulk of my data wrangling consisted of synthetically generating these first-person narratives with OpenAI's GPT-4o mini. This was a tedious process, and I finished with approximately 900 document-narrative pairs, which I split into 750 for the training set and 150 for the validation set using a random seed of 42.
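The split described above can be sketched as a simple seeded shuffle. This is a minimal illustration (the function name and the dummy pair data are hypothetical), assuming the pairs are held in a plain Python list:

```python
import random

def train_val_split(pairs, n_train=750, seed=42):
    """Shuffle document-narrative pairs with a fixed seed, then split them."""
    rng = random.Random(seed)
    shuffled = pairs[:]        # copy so the original list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# With ~900 pairs this yields the 750/150 split described above.
pairs = [(f"doc_{i}", f"narrative_{i}") for i in range(900)]
train_set, val_set = train_val_split(pairs)
print(len(train_set), len(val_set))  # 750 150
```

Fixing the seed makes the split reproducible, so the same validation documents are held out across runs.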

Training Method
I chose to use LoRA for my task of creating first-person historical narratives of an era. Based on earlier results, few-shot prompting alone did not reliably capture the improvements I hoped to see in responses, while full fine-tuning would be far more computationally intensive than LoRA and did not seem necessary for my task. LoRA is a good balance between the two: it changes only a small set of task-relevant parameters, using the dataset to update them so that the model's narratives better match the prose style and historical accuracy of an era. Because of its low-rank adaptations, LoRA can also perform well without a massive training set. For my hyperparameters, I chose LORA_R = 128, LORA_ALPHA = 128, and LORA_DROPOUT = 0.1, which gave the best qualitative results of the combinations I tried. Despite my smaller dataset, this approach produced strong first-person narratives that I enjoyed: they included period prose, were historically accurate, and even contained the imagery and entertaining details I would expect from a quality response. The results from these hyperparameters exceeded any expectations I had.
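To illustrate the mechanics behind those hyperparameters, the NumPy sketch below shows how a low-rank update of rank r is added to a frozen weight matrix, scaled by alpha / r. The matrix sizes here are toy values chosen for readability (the actual training used r = 128 and alpha = 128, giving a scaling factor of 1.0), and the "update" to B is a stand-in for real gradient steps:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 32, 32, 4, 4    # toy sizes; the card uses r=128, alpha=128

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init => no change at start

def lora_forward(x):
    # Base projection plus the low-rank update, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)   # identical to the base model at init

B += rng.normal(size=(d_out, r)) * 0.1       # stand-in for a gradient update
assert not np.allclose(lora_forward(x), W @ x)
```

Because only A and B (roughly r * (d_in + d_out) values per adapted matrix) are trained while W stays frozen, the number of updated parameters stays small even at rank 128.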

Evaluation

| Model | HellaSwag | MMLU Overall | MMLU High School World History | MMLU High School US History | MMLU High School European History | TruthfulQA |
|---|---|---|---|---|---|---|
| Historical Narrative Generator Model | 64.0% | 78.9% | 91.1% | 91.2% | 86.7% | 43.1% |
| Base Qwen2.5 Instruct Model | 64.0% | 78.7% | 90.3% | 92.2% | 87.2% | 43.0% |
| DeepSeek R1 Model | 60.4% | 73.3% | 88.6% | 85.3% | 82.4% | 35.9% |
| Mistral Nemo Instruct | 63.3% | 65.6% | 84.4% | 84.8% | 74.5% | 39.5% |

I used MMLU as a benchmark both to test the model's historical knowledge and to check whether it retained general baseline knowledge after being fine-tuned for my task. MMLU's inputs are multiple-choice questions, and the output is the model's answer, which is scored for correctness, testing the model's general abilities and knowledge of history (Hendrycks et al., 2021). I also used HellaSwag to test how the model performs at reasoning through sentence completion, to make sure the model's narration ability was preserved (Zellers et al., 2019). TruthfulQA is another benchmark I used, to check for model hallucinations: its inputs are open-ended questions, and the model's answer is checked against the desired output (Lin et al., 2021). MMLU tested my model on history-related benchmarks, while all three benchmarks test the general performance of the model on domains outside of history. I chose DeepSeek R1 and Mistral Nemo as comparison models since they are of similar size to my base model, Qwen2.5, and since they appeared on Hugging Face model leaderboards for high performance relative to model size. My fine-tuned historical narrative model performed quite well compared to the other models overall. The model had no major drops in benchmark results, and it even scored highest on TruthfulQA, HellaSwag (tied with the base Qwen2.5 model), MMLU Overall, and MMLU High School World History. This demonstrates the model's ability to retain general knowledge while also excelling at providing first-person historical narratives.
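The multiple-choice scoring described above can be sketched as follows. The questions and per-choice scores here are made up for illustration; real evaluation harnesses derive the scores from the model's log-likelihood of each answer choice:

```python
# Toy MMLU-style accuracy computation: pick the choice the model scores highest,
# then compare it against the gold label. Scores below are invented for illustration.
questions = [
    {"choices": ["1215", "1066", "1492", "1776"], "gold": 0,
     "scores": [-0.2, -1.5, -2.0, -3.1]},   # model favors "1215" -> correct
    {"choices": ["London", "Paris", "Rome", "Madrid"], "gold": 1,
     "scores": [-0.4, -0.9, -1.1, -2.2]},   # model favors "London" -> incorrect
]

def accuracy(items):
    correct = 0
    for q in items:
        pred = max(range(len(q["scores"])), key=q["scores"].__getitem__)
        correct += (pred == q["gold"])
    return correct / len(items)

print(accuracy(questions))  # 0.5
```

The benchmark percentages in the table are this fraction of correct picks, computed over each benchmark's full question set.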

Usage and Intended Uses

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mahing/historical-narrative-generator"

# The tokenizer comes from the base model; the fine-tuned weights come from the hub.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct-1M")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)

event = "The Magna Carta"
prompt = f"Q: Craft a compelling first-person historical narrative that captures the significance of {event} and the essence of its era. Think step by step about the key events, historical accuracy, stylistic prose, emotions, and sensory details that define the period.\nA: "

inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=750, pad_token_id=tokenizer.pad_token_id, do_sample=True)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```

There are many use cases that can come out of this fine-tuned model, mainly related to education and entertainment. Output from the model can provide educational and fun narratives that teach students about the events from a particular era and how individuals at the time thought about the events taking place. This method makes learning history more immersive, allowing students to have a better understanding of history and to think critically about the nuanced feelings about a certain historical event. The model can be used for various forms of entertainment as well. For example, the output from the model can be voice-acted and thereby turned into a podcast or a museum audio tour. The model can even be used for research, distilling dense amounts of historical text into a quick summary and preserving the culture of an era through the documents of the time.

Prompt Format
My prompt is formatted to have the model build a first-person narrative based upon a certain event or era. The narrative should be engaging, accurate, and include prose and vivid details from the era to be entertaining and informative to the reader.
Example Prompt:

Q: Craft a compelling first-person historical narrative that captures the significance of The Magna Carta and the essence of its era.
Think step by step about the key events, historical accuracy, stylistic prose, emotions, and sensory details that define the period
as well as the thoughts a person from that era would have.

Expected Output Format
The output is a first-person historical narrative based on the requested era or event. It should demonstrate the prose style and historical accuracy of that era, run a few paragraphs long, and include vivid details and emotions.
Example Response:

A: 1215, England. I am a scribe in the court of King John, a man whose name is synonymous with tyranny and misrule. The land is in turmoil,
the people are restless, and the barons are growing ever more impatient with the king's heavy-handed rule. They demand reforms,
and they are not willing to be silenced any longer. It is a time of great uncertainty, but also one of hope for change.

The air is thick with tension as I sit at my desk, quill in hand, preparing to record the momentous events unfolding before me.
The Great Charter, or Magna Carta, is being drafted, and I am tasked with ensuring that every word is precise, every clause is clear.
This document, this promise, is the foundation upon which our future may rest. It is a testament to the power of the people,
and it is my duty to ensure that their voices are heard.

I can hear the murmurings of the barons as they gather in the grand hall, their voices rising in heated debate.
They speak of justice, of law, of the rights of free men. They speak of the king's excesses, his exorbitant taxes,
his unjust imprisonment of those who dare to challenge him. They speak of the need for a balance of power,
for a system where even the king must answer to the law.

As I listen, I feel a sense of pride and responsibility. This is not just a document for the barons;
it is for all of us. It is a declaration of our rights, our freedoms, and our dignity. It is a statement that no man, not even the king,
is above the law. And yet, there is also a sense of fear. What if the king does not honor this agreement? What if he breaks his word?

I close my eyes and take a deep breath, trying to steady my nerves. I know that this is a pivotal moment in history,
and I must remain calm and focused. I must ensure that every word is written with precision, every clause is clear and unambiguous.
For this document is not just a piece of parchment; it is a symbol of hope, of justice, and of the enduring spirit of the English people.

The ink flows smoothly as I write, each stroke deliberate and purposeful. I can feel the weight of the words as they come to life
on the page, each one carrying the hopes and dreams of a nation. As I finish the final sentence, I pause for a moment,
taking in the scene around me. The barons are gathered in a circle, their faces etched with determination and resolve.
The king stands at the center, his expression unreadable, but I can see the wheels turning in his mind.

And then, in a voice that carries across the room, the king speaks. "So be it," he says, his tone both resigned and defiant.
"Let it be so."

In that moment, I know that history has been made. The Magna Carta is signed, and with it, a new era begins.
An era where the rights of free men are enshrined in law, where justice is not just a word, but a reality.
An era where the power of the crown is tempered by the will of the people.

As I stand and stretch my cramped limbs, I feel a sense of relief and accomplishment.
The work is done, and now it is up to the people to ensure that these rights are upheld. I know that there will be challenges ahead,
that the road to true freedom will be long and difficult. But I also know that this document, this Magna Carta, is a beacon of hope,
a reminder that even in the darkest of times, the light of justice can shine through.

And so, as I leave the hall and step out into the cool evening air, I take a deep breath and look up at the stars.
I know that this night marks the beginning of something new, something better. And though I cannot see the future,
I trust that the path we have chosen will lead us to a brighter tomorrow.

Limitations
One of the primary limitations of this approach was the difficulty of generating synthetic data: it proved hard to find historical documents from certain eras, and generating the synthetic first-person narratives for those documents took a large amount of compute and time. Future work would entail creating more training data, which should improve results further. The other primary limitation is the lack of creative introductions in the model's responses. The model consistently opens with a sentence or phrase stating the year and setting; while this sets the scene, the model could be improved to give its narratives more varied and creative beginnings.

Works Cited
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J.
    (2021). Measuring Massive Multitask Language Understanding
    (arXiv:2009.03300). arXiv. https://arxiv.org/abs/2009.03300
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y.
    (2019). HellaSwag: Can a Machine Really Finish Your Sentence?
    (arXiv:1905.07830). arXiv. https://arxiv.org/abs/1905.07830
Lin, S., Hilton, J., & Evans, O. (2021). TruthfulQA: Measuring How Models
    Mimic Human Falsehoods
    (arXiv:2109.07958). arXiv. https://arxiv.org/abs/2109.07958
