|
--- |
|
license: mit |
|
datasets: |
|
- allura-org/Celeste-Filtered |
|
- allura-org/neon-41k |
|
- EVA-UNIT-01/Lilith-v0.2 |
|
language: |
|
- en |
|
base_model: |
|
- allura-org/GLM4-9B-Neon-v2 |
|
base_model_relation: quantized |
|
quantized_by: Meggido |
|
library_name: transformers |
|
--- |
|
# ⚡ExLlamaV2 quant of : [GLM4-9B-Neon-v2](https://huggingface.co/allura-org/GLM4-9B-Neon-v2) |
|
> [!note] |
|
> ➡️ **Exl2 version :** [0.2.9](https://github.com/turboderp/exllamav2/releases/tag/v0.2.9)<br/> |
|
> ➡️ **Cal. dataset :** Default.<br/> |
|
> 📄 <a href="https://huggingface.co/Meggido/GLM4-9B-Neon-v2-6.5bpw-h8-exl2/resolve/main/measurement.json" download>Measurement.json</a> file. |
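
If you want to grab the quant programmatically, here's a minimal sketch using `huggingface_hub` (the repo id is taken from the measurement.json link above; the local directory path is just an example):

```python
# Minimal sketch: download this exl2 quant locally so an ExLlamaV2 backend
# (e.g. TabbyAPI) can load it from disk. The target directory is arbitrary.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Meggido/GLM4-9B-Neon-v2-6.5bpw-h8-exl2",
    local_dir="./models/GLM4-9B-Neon-v2-6.5bpw-h8-exl2",
)
print(f"Quant downloaded to: {local_dir}")
```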
|
|
|
<img src="image_28.png"> |
|
<small>Image by CalamitousFelicitousness</small> |
|
|
|
--- |
|
|
|
# GLM-4-9B-0414 Neon v2 |
|
|
|
RP finetune of GLM-4-9B-0414. Feels nice, lots of personality, if a bit quirky sometimes. Nice prose, not too Claude-ish or Gemini-ish. Doesn't seem to like overly long system prompts or character cards, though. Seems to like JSON-formatted system prompts.
|
|
|
Model was trained by Auri. |
|
|
|
--- |
|
|
|
**Training notes** |
|
|
|
The model was trained on a dataset of 77M tokens of synthetic RP and short story generation data for one epoch. Training took around 11 hours on a 2xRTX 3090 workstation, generously provided by [OwenArli](https://huggingface.co/OwenArli). Went with some sane defaults for the training config: QLoRA plus CCE for a nice chunk of memory-usage optimization, and 16k context fit on 48GB nicely with some room to spare. Eval/Loss seems to be broken for this run, not sure why; otherwise it trained smoothly.
|
|
|
Huge thanks to [ArliAI](https://www.arliai.com/) for providing compute and collaborating on this run! |
|
|
|
**Format** |
|
|
|
The model responds to GLM4 instruct formatting, exactly like its base model. Backends struggle to add the BOS token automatically, so you'll need to do it yourself. The Jinja template should work for chat completions.
|
|
|
``` |
|
[gMASK]<sop><|system|> |
|
{system_prompt}<|user|> |
|
{prompt}<|assistant|> |
|
``` |
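
If you build prompts yourself (e.g. for text completions), here's a minimal sketch that renders the same format through the tokenizer's Jinja chat template, assuming a recent `transformers` with native GLM4 support; the message contents are just placeholders, and the prefix check is only a safeguard for the BOS issue mentioned above:

```python
# Minimal sketch: render the GLM4 prompt via the model's own chat template
# and make sure the [gMASK]<sop> prefix is present before sending it off.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allura-org/GLM4-9B-Neon-v2")

messages = [
    {"role": "system", "content": "You are Neon, a playful roleplay partner."},
    {"role": "user", "content": "Hello! Who are you?"},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# If your backend strips the prefix, prepend it yourself.
if not prompt.startswith("[gMASK]<sop>"):
    prompt = "[gMASK]<sop>" + prompt

print(prompt)
```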
|
|
|
**Recommended Samplers** |
|
|
|
Nothing special, just classics. |
|
|
|
``` |
|
Temperature - 1 |
|
Min-P - 0.1 |
|
Repetition Penalty - 1.03 |
|
``` |
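
For reference, a minimal sketch of passing these samplers to a local OpenAI-compatible backend (TabbyAPI, KoboldCPP, llama.cpp server); the endpoint URL, port, API key, and the `min_p`/`repetition_penalty` field names are assumptions that vary by backend, so check your backend's docs:

```python
# Minimal sketch: chat completion request with the recommended samplers.
import requests

payload = {
    "model": "GLM4-9B-Neon-v2",
    "messages": [{"role": "user", "content": "Write a short scene in a neon-lit bar."}],
    "temperature": 1.0,
    "min_p": 0.1,
    "repetition_penalty": 1.03,
    "max_tokens": 512,
}

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",  # adjust host/port for your backend
    json=payload,
    headers={"Authorization": "Bearer sk-local"},  # some backends ignore the key
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```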
|
|
|
[Example master import for SillyTavern (using Shingane-v1 system prompt by Steelskull)](https://huggingface.co/allura-org/GLM4-9B-Neon-v2/blob/main/GLM-Shingane-v1.json) |
|
|
|
**Running on KoboldCPP and other backends** |
|
|
|
To run GGUFs correctly, you need the most recent version of KoboldCPP, and to pass `--overridekv glm4.rope.dimension_count=int:64` to the CLI command or put `glm4.rope.dimension_count=int:64` into the overridekv box in the GUI (under the Tokens tab at the very bottom).
|
|
|
Thanks to DaringDuck and tofumagnate for the info on how to apply this fix.
|
|
|
To run this model on vLLM, you'll need to build it from source from the git repo, since full GLM4 support hasn't reached a release yet.
|
|
|
ExLlamaV2- and v3-based backends, such as TabbyAPI, should support the model out of the box.
|
|
|
Recent versions of the llama.cpp server should also run GGUFs out of the box.
|
|
|
--- |
|
|
|
**Special Thanks** |
|
|
|
Once again, huge kudos to OwenArli for providing compute and helping with tuning along the way! |
|
|
|
Big thanks to Artus for providing free inference for pre-release showcase of this model! |
|
|
|
And big thanks to BeaverAI community for giving feedback and helping to figure out optimal settings! |
|
|
|
--- |
|
|
|
**Training config** |
|
<details><summary>See Axolotl config</summary> |
|
|
|
```yaml |
|
# Model |
|
base_model: /home/owen/models/GLM-4-9B-0414 |
|
strict: false |
|
model_type: AutoModelForCausalLM |
|
|
|
# Liger Kernels and CCE (optimization) |
|
plugins: |
|
- axolotl.integrations.liger.LigerPlugin |
|
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin |
|
liger_rope: false |
|
liger_rms_norm: false |
|
liger_glu_activation: false |
|
liger_fused_linear_cross_entropy: false |
|
cut_cross_entropy: true |
|
|
|
# Output and HuggingFace |
|
output_dir: ./GLM-9B-Neon-v2 |
|
hub_model_id: AuriAetherwiing/GLM-9B-Neon-v2-LoRA |
|
hf_use_auth_token: true |
|
hub_strategy: "all_checkpoints" |
|
|
|
# WandB |
|
wandb_project: allura-org |
|
wandb_entity: |
|
wandb_name: GLM-9B-Neon-v2 |
|
|
|
# === Data Configuration === |
|
|
|
# Data |
|
#chat_template: chatml |
|
#train_on_inputs: false |
|
group_by_length: false |
|
datasets: |
|
- path: ./Neon/neon.jsonl |
|
type: chat_template |
|
field_messages: conversations |
|
message_field_role: from |
|
message_field_content: value |
|
- path: ./Neon/S2.jsonl |
|
type: chat_template |
|
field_messages: conversations |
|
message_field_role: from |
|
message_field_content: value |
|
- path: ./Neon/SystemChat_subset_filtered_sharegpt_utf8fix.jsonl |
|
type: chat_template |
|
field_messages: conversations |
|
message_field_role: from |
|
message_field_content: value |
|
|
|
dataset_prepared_path: ./lora_last_run_prepared |
|
|
|
## Evaluation |
|
val_set_size: 0.01 |
|
evals_per_epoch: 2 |
|
eval_table_size: |
|
eval_max_new_tokens: 128 |
|
|
|
# Technical aspects |
|
sequence_len: 16384 |
|
save_safetensors: true |
|
saves_per_epoch: 2 |
|
logging_steps: 1 |
|
#special_tokens: |
|
# pad_token: <pad> |
|
# Quantization |
|
bf16: auto |
|
fp16: |
|
tf32: false |
|
## For LoRA |
|
load_in_8bit: false |
|
load_in_4bit: true |
|
|
|
# LoRA |
|
peft_use_rslora: false |
|
peft_use_dora: false # better but slower |
|
adapter: qlora # lora or qlora |
|
lora_model_dir: |
|
lora_r: 64 # 64 is optimal for most trains on instruct |
|
lora_alpha: 64 |
|
lora_dropout: 0.1 |
|
lora_target_linear: true |
|
lora_fan_in_fan_out: |
|
lora_target_modules: |
|
|
|
# loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable |
|
#loraplus_lr_embedding: |
|
|
|
# Training hyperparameters |
|
# max_steps: |
|
num_epochs: 1 |
|
|
|
# Anti Overfit and Stability |
|
weight_decay: 0.01 |
|
max_grad_norm: 1.0 |
|
|
|
## Learning Rate |
|
warmup_ratio: 0.05 |
|
learning_rate: 1e-5 |
|
lr_scheduler: rex |
|
#lr_scheduler_kwargs: |
|
# min_lr: 0.0000024 |
|
optimizer: adamw_torch # usually adamw_torch or paged_adamw_8bit |
|
|
|
## Batch Size |
|
gradient_accumulation_steps: 32 # Larger effective batch size - stabler train, usually. MBS also speeds it up.

micro_batch_size: 1 # Effective batch size per GPU = micro_batch_size * gradient_accumulation_steps
|
eval_batch_size: 1 |
|
|
|
# Optimizations |
|
pad_to_sequence_len: true |
|
sample_packing: true |
|
eval_sample_packing: false |
|
flash_attention: true |
|
xformers_attention: |
|
gradient_checkpointing: |
|
gradient_checkpointing_kwargs: |
|
use_reentrant: false |
|
|
|
# Set to a divisor (> 1) of the number of GPUs available |
|
#sequence_parallel_degree: 2 # Split sequences across 2 GPUs
|
# Optional; strides across the key dimension. Larger values use more memory but should make training faster. |
|
#heads_k_stride: 1 |
|
# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to |
|
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise. |
|
#ring_attn_func: |
|
|
|
# deepspeed: /home/owen/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json |
|
|
|
fsdp: |
|
- full_shard |
|
- auto_wrap |
|
fsdp_config: |
|
fsdp_limit_all_gathers: true |
|
fsdp_sync_module_states: true |
|
fsdp_offload_params: false |
|
fsdp_use_orig_params: false |
|
fsdp_cpu_ram_efficient_loading: true |
|
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP |
|
fsdp_transformer_layer_cls_to_wrap: Glm4DecoderLayer |
|
fsdp_state_dict_type: FULL_STATE_DICT |
|
fsdp_sharding_strategy: FULL_SHARD |
|
fsdp_activation_checkpointing: true |
|
``` |
|
|
|
</details> |