See axolotl config

axolotl version: 0.8.0

base_model: Dans-DiscountModels/7b-m-dans-personalityengine-v1.2.1-rc-2
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

trust_remote_code:

# wandb configuration
wandb_project: 7b-m-dans-optimizersweeps
wandb_watch:

wandb_run_id: repremover-1-1-ademamix-hi-lr-b1_0.9-b2_0.999-b3_0.999-a3
wandb_log_model:

# push checkpoints to hub
hub_model_id: Dans-DiscountModels/7b-m-dans-optimizersweeps-repremover-1-ademamix-hi-lr-b1_0.9-b2_0.999-b3_0.999-a3
# how to push checkpoints to hub
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
hub_strategy: "every_save"
# Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
# Required to be true when used in combination with `push_dataset_to_hub`
hf_use_auth_token: true

# where to save the finished model to
output_dir: ./7b-m-dans-optimizersweeps

# where to save the dataset to
dataset_prepared_path: ./7b-m-dans-optimizersweeps-data

save_safetensors: true

# dataset settings (local or huggingface repo)
datasets:
  - path: Dans-DiscountModels/pretokenization-test-3
    ds_type: parquet
    type:

plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

adapter:
lora_model_dir:

val_set_size: 0.01
sequence_len: 8192

sample_packing: false
eval_sample_packing: false

pad_to_sequence_len: true

gradient_checkpointing: true
# gradient_checkpointing_kwargs:
# use_reentrant: false

gradient_accumulation_steps: 1
micro_batch_size: 4

num_epochs: 3

optimizer: ademamix
optim_args: "beta1=0.9,beta2=0.999,beta3=0.999,alpha=3"

lr_scheduler: rex
learning_rate: 0.0000003
cosine_min_lr_ratio:

# weight_decay: 0.03
max_grad_norm: 0.001

train_on_inputs: false
group_by_length: true

bf16: true
fp16: false
tf32: false

early_stopping_patience:

resume_from_checkpoint:
auto_resume_from_checkpoints: false

local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.1

evals_per_epoch: 24
eval_table_size:
eval_max_new_tokens:

saves_per_epoch: 1
save_total_limit: 2

debug: false

deepspeed: deepspeed_configs/zero3_bf16.json

fsdp:
fsdp_config:

special_tokens:

7b-m-dans-optimizersweeps-repremover-1-ademamix-hi-lr-b1_0.9-b2_0.999-b3_0.999-a3

This model is a fine-tuned version of Dans-DiscountModels/7b-m-dans-personalityengine-v1.2.1-rc-2 on the Dans-DiscountModels/pretokenization-test-3 dataset. It achieves the following results on the evaluation set:

Loss: 2.0807

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-07
train_batch_size: 4
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 32
total_eval_batch_size: 32
optimizer: ademamix with optim_args beta1=0.9,beta2=0.999,beta3=0.999,alpha=3
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 41
num_epochs: 3.0
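
The derived values in this list can be checked against the config with a little arithmetic. The sketch below is only a sanity check, not the trainer's own code: the per-epoch step count is an assumption inferred from the results table (step 1 is logged at epoch 0.0072), and rounding the warmup count down is an assumption that happens to reproduce the reported 41.

```python
import math

# Values copied from the config and the hyperparameter list above.
micro_batch_size = 4
gradient_accumulation_steps = 1
num_devices = 8
sequence_len = 8192
warmup_ratio = 0.1
num_epochs = 3

# Assumption: about 139 optimizer steps per epoch, inferred from the
# results table (step 1 is logged at epoch 0.0072, and 1 / 0.0072 is about 139).
steps_per_epoch = 139

# Effective batch size across all devices.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices

# With pad_to_sequence_len: true and sample_packing: false, every sequence
# occupies sequence_len positions (padding included).
padded_positions_per_step = total_train_batch_size * sequence_len

# Assumption: warmup steps = warmup_ratio * total steps, rounded down.
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.floor(warmup_ratio * total_steps)

print(total_train_batch_size)     # 32
print(padded_positions_per_step)  # 262144
print(warmup_steps)               # 41
```

The first and last values match total_train_batch_size: 32 and lr_scheduler_warmup_steps: 41 as reported above.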

Training results

Training Loss	Epoch	Step	Validation Loss
2.0376	0.0072	1	2.1458
2.266	0.0432	6	2.1209
2.3072	0.0863	12	2.1567
2.1834	0.1295	18	2.1176
2.2331	0.1727	24	2.1274
2.0526	0.2158	30	2.1447
2.1291	0.2590	36	2.1181
2.0287	0.3022	42	2.1111
2.0806	0.3453	48	2.1281
2.1527	0.3885	54	2.1251
2.0467	0.4317	60	2.1076
2.042	0.4748	66	2.1276
2.2089	0.5180	72	2.1132
2.1008	0.5612	78	2.1159
2.0221	0.6043	84	2.1347
2.1413	0.6475	90	2.1175
2.0915	0.6906	96	2.1063
2.1075	0.7338	102	2.1235
2.073	0.7770	108	2.1222
2.0233	0.8201	114	2.1013
2.0238	0.8633	120	2.1119
1.9943	0.9065	126	2.1113
2.1516	0.9496	132	2.1001
1.9908	0.9928	138	2.1056
2.0712	1.0360	144	2.1164
1.9448	1.0791	150	2.0980
2.0915	1.1223	156	2.1211
2.0078	1.1655	162	2.1321
2.0026	1.2086	168	2.1190
1.9923	1.2518	174	2.1133
1.9858	1.2950	180	2.1243
2.0569	1.3381	186	2.1171
2.0747	1.3813	192	2.1190
2.1171	1.4245	198	2.1202
2.0104	1.4676	204	2.1201
2.0687	1.5108	210	2.1154
1.9147	1.5540	216	2.1033
2.066	1.5971	222	2.1139
2.0126	1.6403	228	2.1087
1.9889	1.6835	234	2.1063
2.0591	1.7266	240	2.1181
2.034	1.7698	246	2.1020
1.9738	1.8129	252	2.1168
1.9927	1.8561	258	2.1237
2.0525	1.8993	264	2.1123
2.0224	1.9424	270	2.0778
2.0619	1.9856	276	2.1094
2.0039	2.0288	282	2.1158
1.9935	2.0719	288	2.1052
2.0777	2.1151	294	2.0973
2.0022	2.1583	300	2.1108
1.9482	2.2014	306	2.1180
1.9782	2.2446	312	2.0977
2.033	2.2878	318	2.1206
1.9988	2.3309	324	2.1248
2.0149	2.3741	330	2.1043
2.0014	2.4173	336	2.0963
2.0494	2.4604	342	2.1087
1.9977	2.5036	348	2.0982
2.0774	2.5468	354	2.1248
2.0185	2.5899	360	2.1123
2.0085	2.6331	366	2.0941
1.9551	2.6763	372	2.1039
1.8634	2.7194	378	2.0949
1.9425	2.7626	384	2.0882
2.0172	2.8058	390	2.1258
1.9783	2.8489	396	2.0867
2.0236	2.8921	402	2.1192
1.9302	2.9353	408	2.1233
1.9088	2.9784	414	2.0807
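
The reported evaluation loss (2.0807) is the value at the last logged step, 414, while the lowest validation loss in the table is 2.0778 at step 270. A minimal sketch for pulling both out of the table follows. It uses only a few rows copied from above, the helper name `parse_rows` is hypothetical, and the perplexity conversion assumes the loss is mean cross-entropy in nats per token.

```python
import math

# A few rows copied from the table above, tab-separated:
# training loss, epoch, step, validation loss.
rows = """\
2.0376\t0.0072\t1\t2.1458
2.0224\t1.9424\t270\t2.0778
1.9783\t2.8489\t396\t2.0867
1.9088\t2.9784\t414\t2.0807
"""

def parse_rows(text):
    """Turn tab-separated table rows into a list of dicts (hypothetical helper)."""
    records = []
    for line in text.strip().splitlines():
        train_loss, epoch, step, val_loss = line.split("\t")
        records.append({
            "train_loss": float(train_loss),
            "epoch": float(epoch),
            "step": int(step),
            "val_loss": float(val_loss),
        })
    return records

records = parse_rows(rows)
best = min(records, key=lambda r: r["val_loss"])   # lowest validation loss
final = max(records, key=lambda r: r["step"])      # last logged evaluation

# Assumption: loss is mean cross-entropy in nats per token, so exp(loss) is perplexity.
final_perplexity = math.exp(final["val_loss"])

print(best["step"], best["val_loss"])    # 270 2.0778
print(final["step"], final["val_loss"])  # 414 2.0807
print(round(final_perplexity, 2))
```

Pasting the full table into `rows` gives the same best and final steps, since step 270 holds the minimum across all logged evaluations.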

Framework versions

Transformers 4.51.3
PyTorch 2.5.1+cu124
Datasets 3.5.0
Tokenizers 0.21.1
