Nova-Nox-Neural-Network
All images used are created by Rikka Botan.
Flash Technical Report (Japanese)
https://qiita.com/peony_snow/items/8ae4e83b8de5c342ab62
About
N4: Nova-Nox-Neural-Network is an architecture that combines the self-referential capability of the Attention mechanism with a simplified selective-copying mechanism inspired by S6, yielding a more expressive QK matrix and thereby improving accuracy.
The architecture uses ASGG (Adaptive Swish-GELU Gating) as the activation function in its MLP blocks, which contributes to richer representational capacity.
It also uses DyT (Dynamic Tanh) for normalization, which improves computational efficiency.
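The exact ASGG formulation is defined in this repository (model/N4_modeling.py). As a rough, non-authoritative sketch, one plausible reading of "Adaptive Swish-GELU Gating" is a gated MLP whose gate blends SiLU (Swish) and GELU through a learnable mixing coefficient:

import torch
from torch import nn
import torch.nn.functional as F

class ASGGMLP(nn.Module):
    # Hypothetical ASGG-style gated MLP (an assumption, not the repository's exact code):
    # the gate interpolates between SiLU (Swish) and GELU with a learnable coefficient.
    def __init__(self, hidden_size: int, inter_size: int, bias: bool = False):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, inter_size, bias=bias)
        self.up_proj = nn.Linear(hidden_size, inter_size, bias=bias)
        self.down_proj = nn.Linear(inter_size, hidden_size, bias=bias)
        self.mix = nn.Parameter(torch.tensor(0.5))  # learnable Swish/GELU mixing weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate_proj(x)
        gate = self.mix * F.silu(g) + (1.0 - self.mix) * F.gelu(g)  # adaptive Swish-GELU blend
        return self.down_proj(gate * self.up_proj(x))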
Key Features
A simplified Selective copying mechanism
ASGG: Adaptive Swish-GELU Gating + MLP
DyT: Dynamic Tanh Normalization
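For reference, DyT replaces the normalization layer with an elementwise tanh plus learnable scale and shift, DyT(x) = γ · tanh(α · x) + β. A minimal sketch following the published DyT formulation (the module in model/N4_modeling.py may differ in details such as the α initialization):

import torch
from torch import nn

class DyT(nn.Module):
    # Dynamic Tanh normalization: y = gamma * tanh(alpha * x) + beta,
    # with a learnable scalar alpha and per-channel gamma/beta.
    # Sketch of the published formulation, not necessarily this repository's exact code.
    def __init__(self, hidden_size: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(hidden_size))     # per-channel scale
        self.beta = nn.Parameter(torch.zeros(hidden_size))     # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta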
Training Results
Training Setting
Parameters: 127M
(vocab_size=32768, hidden_size=768, inter_size=1536, heads=6, layers=18)
Optimizer: AdamW
(lr=6e-4, betas=(0.9, 0.95), eps=1e-9, weight_decay=1e-1, warmup_steps=2000)
Batch size: 8
Gradient accumulation: 16
Dataset: fineweb (0.5B tokens, 1 epoch = 976 steps)
Max length: 512
dtype: bfloat16
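With these settings, one optimizer step covers up to 8 × 16 × 512 = 65,536 tokens (batch size × gradient accumulation × max length), assuming sequences are packed to the full max length.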
Implementation and License
This repository is the official pure PyTorch implementation.
Licensed under the MIT License.
Commercial use is permitted.
How to use
- Clone the repository
git clone https://github.com/Rikka-Botan/Nova-Nox-Neural-Network.git
- Import necessary libraries
import torch
from torch import nn
import torch.nn.functional as F
from model.N4_modeling import N4C
- Create the model
"""
Args:
    hidden_size: int - model hidden size
    inter_size: int - MLP intermediate size
    vocab_size: int - tokenizer vocabulary size
    heads: int - number of attention heads
    layers: int - number of N4D (decoder) layers
"""
hidden_size = 768
intermediate_size = 3072
vocab_size = 32064
heads = 6
layers = 6
model = N4C(
    hidden_size,
    intermediate_size,
    vocab_size,
    heads,
    layers
)
output = model(tokenized_text)
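Here tokenized_text is expected to be a tensor of token ids with shape (batch, seq_len). A hedged sketch of producing it with a Hugging Face tokenizer; set vocab_size above to match the tokenizer's vocabulary size (the inference example below pairs the Mistral tokenizer with vocab_size=32768):

from transformers import AutoTokenizer

# Any tokenizer whose vocabulary size matches vocab_size above will work;
# this is the tokenizer used in the inference example below.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenized_text = tokenizer("Nova Nox Neural Network", return_tensors="pt")["input_ids"]  # shape (1, seq_len)
output = model(tokenized_text)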
How to Train
- Training code
from torch.optim import AdamW
optimizer = AdamW(
    model.parameters(),
    lr=6.0e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1e-1
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.train()
for batch in dataloader:
    optimizer.zero_grad()
    batch = batch.to(device)
    loss = model(input=batch, labels=batch)[1]  # the second element of the output is the loss
    loss.backward()
    optimizer.step()
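The loop above omits the gradient accumulation (16) and learning-rate warmup (2,000 steps) listed under Training Setting. A hedged sketch of adding both; the linear-warmup schedule is an assumption, not necessarily the schedule used for the reported run:

from torch.optim.lr_scheduler import LambdaLR

accumulation_steps = 16  # from Training Setting
warmup_steps = 2000      # from Training Setting

# Linear warmup to the base learning rate, then constant (an assumed schedule).
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    batch = batch.to(device)
    loss = model(input=batch, labels=batch)[1]   # the second element of the output is the loss
    (loss / accumulation_steps).backward()       # average gradients over the accumulation window
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()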
How to run inference
- Inference code
# N4: Nova Nox Neural Network inference
# coding=utf-8
# Copyright 2025 Rikka Botan. All rights reserved
# Licensed under the "MIT License"
import torch
from transformers import AutoTokenizer
import os
from model.N4_modeling import N4C
model_name = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
cwd = os.path.abspath('your model path')
model = N4C(
    vocab_size=32768,
    hidden_size=768,
    inter_size=1536,
    heads=6,
    layers=18,
    bias=False
)
state_dict = torch.load(os.path.join(cwd, 'N4_test_model.bin'), weights_only=True)
model.load_state_dict(state_dict, strict=False)
model = model.to('cpu')
model.eval()
text = "Large Language Models (LLMs) are advanced artificial intelligence systems designed to"
inputs = tokenizer(text, return_tensors='pt')
output = model.generate_n4c(
    input_ids=inputs["input_ids"].to('cpu'),
    max_new_tokens=128,
    temperature=0.7,
    top_k=10,
    top_p=2,  # with the usual nucleus-sampling convention, top_p >= 1.0 applies no nucleus cutoff
    eos_token_id=2
)
# Print the prompt followed by the generated continuation
for token in inputs['input_ids']:
    print(tokenizer.decode(token), end=" ")
for token in output:
    print(tokenizer.decode(token), end=" ", flush=True)
Acknowledgements
I thank the developers of Python and PyTorch.
I thank all the researchers for their efforts to date.
I thank Japan's high standard of education.
And most of all, thank you for your interest in this repository.
Citations
I would be happy if you include a citation, but it is not required.
Feel free to use this model.
Contact Us
About Author
Rikka Botan
A Japanese independent researcher with a shy and pampered personality >_<
Twin-tail hair is a charm point :)
Interested in natural language processing.
Usually using Python and C.