---
library_name: transformers
tags:
- climate
language:
- en
pipeline_tag: text-generation
---
# Model Card for SPARK-mini-base
SPARK-mini-base is a 3.8B-parameter, domain-specific language model trained on an extensive dataset curated from documents generated by the nuclear power industry.
The model was developed by continually pretraining Microsoft's Phi-3-mini-4k-instruct on over 35B tokens of high-quality data curated from millions of public documents originating within the nuclear power domain.
SPARK-mini-base was trained by Nuclearn AI and is released as a research artifact, a demonstration tool, and a domain-specific base LLM for further fine-tuning by downstream practitioners working within, or tangential to, the nuclear industry.
SPARK-mini-base is trained with a next-token prediction objective and no alignment, so it requires multi-shot prompting to respond properly. An instruction-tuned version is available at [SPARK-mini-instruct](https://huggingface.co/NuclearnAI/SPARK-mini-instruct).
## Uses
SPARK-mini-base is a base LLM with no alignment process (SFT, RLHF, etc.) applied; like other base models, it must be multi-shot prompted for adequate performance. For a model with instruction-based alignment, please see [SPARK-mini-instruct](https://huggingface.co/NuclearnAI/SPARK-mini-instruct).
Nuclearn targets a few specific use cases with this open-source model release:
1. Accelerating the work of technical staff at national research labs or regulatory agencies by providing a domain-specific language model from which further use cases can be fine-tuned.
2. Improving the performance of systems deployed in the nuclear industry that currently use language models as feature extractors or model trunks in predictive AI systems (see the feature-extraction sketch after this list).
3. Accessibility for practitioners without hardware accelerators or cloud connectivity.
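For the feature-extraction use case (item 2), a minimal sketch might mean-pool the final hidden states into a document embedding. This is not an official API, and the mean-pooling strategy is an illustrative assumption:

```python
# Minimal sketch: using SPARK-mini-base as a feature extractor by
# mean-pooling the final hidden states. Pooling strategy is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nuclearnai/SPARK-mini-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "The emergency core cooling system injects borated water into the RCS."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[-1] has shape (batch, seq_len, hidden_dim);
# average over the token dimension to get one fixed-size vector per input
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, 3072]) for a Phi-3-mini trunk
```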
### Direct Use
SPARK-mini-base is a base model without alignment, so multi-shot prompting is required. Prompting should follow the techniques applicable to other unaligned base language models; see the Hugging Face prompting [docs](https://huggingface.co/docs/transformers/main/en/tasks/prompting#base-vs-instructchat-models) and the minimal example below.
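For example, a multi-shot prompt can simply concatenate a few demonstrations of the desired pattern before the actual query. The Q/A pairs below are illustrative, not taken from the training data:

```python
# Illustrative multi-shot prompt for a base model: no chat template,
# just repeated demonstrations of the pattern to be continued.
few_shot_prompt = """Q: What does RCS stand for?
A: Reactor Coolant System

Q: What does LOCA stand for?
A: Loss-of-coolant accident

Q: What does ECCS stand for?
A:"""
# Pass `few_shot_prompt` to the generation code shown in
# "How to Get Started with the Model" below.
```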
SPARK-mini-base is trained with 'prompt pre-training' as demonstrated in [*Galactica: A Large Language Model for Science*](https://arxiv.org/pdf/2211.09085) for steerability along dimensions important to end users.
### License
License: [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/deed.en) with exceptions made below for unrestricted use.
Exceptions permitting unrestricted (including commercial) use are made for a limited set of entities:
1. Operating nuclear utilities
2. Regulatory Bodies (Commercial or Government)
3. Research Labs and Research Focused groups (e.g. National Laboratories and Electric Power Specific Research Groups)
## Bias, Risks, and Limitations
- This model has been trained extensively on nuclear-power-related information but, like every language model, still makes factual and logical mistakes.
- The model should not be used in production without further training or applicable guardrails.
- Intentional bias has been trained into the model for steerability.
- The base model is trained without text formatting; further fine-tuning is needed for formatted responses (see SPARK-mini-instruct).
## How to Get Started with the Model
```python
# Requires transformers >= 4.41 for Phi-3 compatibility
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nuclearnai/SPARK-mini-base"
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Note that no chat template is used for a base model
prompt = """The ECCS is"""
input_ids = tokenizer.encode(
    prompt,
    return_tensors="pt",
    add_special_tokens=False,
).to("cuda")

# Generate using min_p sampling
output = model.generate(
    input_ids=input_ids,
    min_p=0.2,
    temperature=0.7,
    do_sample=True,
    max_new_tokens=100,
)
print(tokenizer.decode(output[0], skip_special_tokens=False))
```
Output:
```
The ECCS is designed to cool the reactor core and to provide additional shutdown capability following initiation of the following accident conditions: 1. Loss-of-coolant accident (LOCA) including a pipe break or a spurious relief or safety valve opening in the RCS which would result in a discharge larger than that which could be made up by the normal make-up system. 2. Loss-of-secondary-coolant accident including a pipe
```
## Training Details
### Training Data
All training data for SPARK-mini-base was obtained from publicly available sources, but the curated dataset itself is not being released.
Specific details about the training data, or access to it, may be made available on a case-by-case basis by contacting Nuclearn at [email protected]
### Training Procedure
The training procedure follows best practices for continued pretraining of base LLMs.
The model was trained in bf16 using DeepSpeed ZeRO-3 on a multi-node, private A100 server cluster.
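The training code itself is not released. Purely as a hypothetical sketch of the setup described above (continued pretraining in bf16 with DeepSpeed ZeRO-3 through the Hugging Face Trainer), where all hyperparameter values, file names, and the stand-in corpus are assumptions:

```python
# Hypothetical sketch only -- Nuclearn's actual training code is not public.
# Continued pretraining in bf16 with DeepSpeed ZeRO-3 via the HF Trainer.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_checkpoint = "microsoft/Phi-3-mini-4k-instruct"  # starting point per this card
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForCausalLM.from_pretrained(base_checkpoint)

# Stand-in corpus; the real run used ~35B tokens of curated nuclear documents
corpus = Dataset.from_dict({"text": ["The ECCS injects borated water into the RCS."]})
tokenized = corpus.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=4096),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="spark-mini-base",
    bf16=True,                         # bf16 training, as described above
    deepspeed="ds_zero3_config.json",  # path to a ZeRO stage-3 config (assumed name)
    per_device_train_batch_size=1,     # illustrative values, not the real ones
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```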
## Evaluation
SPARK-mini-base was evaluated on a set of private benchmarks created specifically to test nuclear-industry knowledge.
### Completions (HellaSWAG for Nuclear)
- Modeled after the HellaSWAG benchmark
- Various completions of complex nuclear plant operational scenarios and fact passages

### Multiple Choice QA (MMLU for Nuclear)
- Modeled after the MMLU benchmark
- Multiple-choice question answering on nuclear plant operations, systems, engineering, etc.
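These benchmarks are private and unreleased. For reference, completion-style benchmarks of this kind are commonly scored by comparing the model's log-likelihood of each candidate continuation; a generic sketch of that method (not Nuclearn's evaluation harness) is below:

```python
# Generic log-likelihood scoring for multiple-choice completions
# (a common HellaSWAG/MMLU-style method; not Nuclearn's private harness).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nuclearnai/SPARK-mini-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(context: str, option: str) -> float:
    """Sum of token log-probabilities of `option` given `context`."""
    ctx_ids = tokenizer.encode(context, add_special_tokens=False)
    opt_ids = tokenizer.encode(option, add_special_tokens=False)
    input_ids = torch.tensor([ctx_ids + opt_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits at position i predict token i+1; score only the option tokens
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    total = 0.0
    for pos, tok in enumerate(opt_ids, start=len(ctx_ids)):
        total += log_probs[pos - 1, tok].item()
    return total

# Illustrative item: pick the continuation the model finds most likely
context = "The ECCS is designed to"
options = [" cool the reactor core.", " heat the feedwater."]
best = max(options, key=lambda o: option_logprob(context, o))
print(best)
```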
## Environmental Impact
- **Hardware Type:** A100-80GB SXM4
- **Cloud Provider:** Nuclearn Training Cluster
## Technical Specifications
### Model Architecture and Objective
SPARK-mini-base is based on the Phi-3 architecture and trained with a causal (next-token prediction) language modeling objective.
### Compute Infrastructure
SPARK-mini-base was trained on the Nuclearn Training Cluster, an A100-80GB server cluster with 800 Gb/s InfiniBand connectivity.
## Model Card Authors
- Bradley Fox, Nuclearn Inc
- Jerrold Vincent, Nuclearn Inc
- Nate Irby, Nuclearn Inc