Upload Qwen2MMForCausalLM

Files changed (8)

README.md +199 -0
config.json +112 -0
configuration_qwen2mm.py +201 -0
generation_config.json +9 -0
model.safetensors +3 -0
modeling_phi4mm.py +1877 -0
processing_phi4mm.py +744 -0
speech_conformer_encoder.py +0 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,112 @@

+{
+  "_name_or_path": "/home/azureuser/phi4/qwen_works/Speech-to-Text-Training/MODELS/Qwen-Nahin-3-6/checkpoint-201000",
+  "architectures": [
+    "Qwen2MMForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "audio_processor": {
+    "config": {
+      "activation": "swish",
+      "activation_checkpointing": {
+        "interval": 1,
+        "module": "transformer",
+        "offload": false
+      },
+      "attention_dim": 1024,
+      "attention_heads": 16,
+      "batch_norm": false,
+      "bias_in_glu": true,
+      "causal": true,
+      "chunk_size": -1,
+      "cnn_layer_norm": true,
+      "conv_activation": "swish",
+      "conv_glu_type": "swish",
+      "depthwise_multiplier": 1,
+      "depthwise_seperable_out_channel": 1024,
+      "dropout_rate": 0.0,
+      "encoder_embedding_config": {
+        "input_size": 80
+      },
+      "ext_pw_kernel_size": 1,
+      "ext_pw_out_channel": 1024,
+      "input_layer": "nemo_conv",
+      "input_size": 80,
+      "kernel_size": 3,
+      "left_chunk": 18,
+      "linear_units": 1536,
+      "nemo_conv_settings": {
+        "conv_channels": 1024
+      },
+      "num_blocks": 24,
+      "relative_attention_bias_args": {
+        "t5_bias_max_distance": 500,
+        "type": "t5"
+      },
+      "time_reduction": 8
+    },
+    "name": "cascades"
+  },
+  "auto_map": {
+    "AutoConfig": "configuration_qwen2mm.Qwen2MMConfig",
+    "AutoModelForCausalLM": "modeling_phi4mm.Qwen2MMForCausalLM",
+    "AutoTokenizer": "./"
+  },
+  "bos_token_id": 151644,
+  "embd_layer": {
+    "audio_embd_layer": {
+      "compression_rate": 8,
+      "downsample_rate": 1,
+      "embedding_cls": "audio",
+      "enable_gradient_checkpointing": true,
+      "projection_cls": "mlp",
+      "use_conv_downsample": false,
+      "use_qformer": false
+    },
+    "embedding_cls": "image_audio",
+    "image_embd_layer": {
+      "crop_size": 448,
+      "embedding_cls": "tune_image",
+      "enable_gradient_checkpointing": true,
+      "hd_transform_order": "sub_glb",
+      "image_token_compression_cls": "avg_pool_2d",
+      "projection_cls": "mlp",
+      "use_hd_transform": true,
+      "with_learnable_separator": true
+    }
+  },
+  "eos_token_id": 151645,
+  "hidden_act": "silu",
+  "hidden_size": 896,
+  "initializer_range": 0.02,
+  "intermediate_size": 4864,
+  "max_position_embeddings": 131072,
+  "max_window_layers": 24,
+  "model_type": "qwen2-mm",
+  "num_attention_heads": 14,
+  "num_hidden_layers": 24,
+  "num_key_value_heads": 2,
+  "pad_token_id": 151643,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 1000000.0,
+  "sliding_window": null,
+  "speech_lora": {
+    "dp": 0.01,
+    "layer": "((layers.*self_attn\\.(qkv|o)_proj)|(layers.*mlp\\.(gate_up|down)_proj))",
+    "lora_alpha": 640,
+    "r": 320
+  },
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.48.3",
+  "use_cache": false,
+  "use_sliding_window": false,
+  "vision_lora": {
+    "dp": 0.0,
+    "layer": "layers.*((self_attn\\.(qkv_proj|o_proj))|(mlp\\.(gate_up|down)_proj))",
+    "lora_alpha": 512,
+    "r": 256
+  },
+  "vocab_size": 194498
+}
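
The `speech_lora.layer` and `vision_lora.layer` values in this config are regular expressions over module names. As a rough illustration of which names they select, the sketch below applies both patterns with `re.search` to a handful of hypothetical module names. How the modeling code actually applies the patterns, and what the real module names are, is an assumption here rather than something this diff confirms.

```python
import re

# Patterns copied from config.json after JSON unescaping ("\\." becomes "\.").
SPEECH_LORA = r"((layers.*self_attn\.(qkv|o)_proj)|(layers.*mlp\.(gate_up|down)_proj))"
VISION_LORA = r"layers.*((self_attn\.(qkv_proj|o_proj))|(mlp\.(gate_up|down)_proj))"

# Hypothetical module names, chosen only for illustration.
CANDIDATES = [
    "layers.0.self_attn.qkv_proj",
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.gate_up_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.down_proj",
]


def selected(pattern: str, names: list[str]) -> list[str]:
    """Return the names that the pattern matches anywhere in the string."""
    return [name for name in names if re.search(pattern, name)]


speech_hits = selected(SPEECH_LORA, CANDIDATES)
vision_hits = selected(VISION_LORA, CANDIDATES)
print("speech_lora:", speech_hits)
print("vision_lora:", vision_hits)
```

On these sample names both patterns select the fused `qkv_proj` and `gate_up_proj` entries plus `o_proj` and `down_proj`; split names such as `q_proj` or `gate_proj` are not selected.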

configuration_qwen2mm.py ADDED

	@@ -0,0 +1,201 @@

+# coding=utf-8
+# Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Qwen2 model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class Qwen2MMConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`Qwen2Model`]. It is used to instantiate a
+    Qwen2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of
+    Qwen2-7B-beta [Qwen/Qwen2-7B-beta](https://huggingface.co/Qwen/Qwen2-7B-beta).
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 151936):
+            Vocabulary size of the Qwen2 model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`Qwen2Model`]
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 22016):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        num_key_value_heads (`int`, *optional*, defaults to 32):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by mean-pooling all the original heads within that group. For more details, check out [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `32`.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 32768):
+            The maximum sequence length that this model might ever be used with.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied.
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
+            and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
+            accordingly.
+            Expected contents:
+                `rope_type` (`str`):
+                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
+                    'llama3'], with 'default' being the original RoPE implementation.
+                `factor` (`float`, *optional*):
+                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
+                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
+                    original maximum pre-trained length.
+                `original_max_position_embeddings` (`int`, *optional*):
+                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+                    pretraining.
+                `attention_factor` (`float`, *optional*):
+                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
+                    computation. If unspecified, it defaults to value recommended by the implementation, using the
+                    `factor` field to infer the suggested value.
+                `beta_fast` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 32.
+                `beta_slow` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 1.
+                `short_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `long_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to long contexts (>
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `low_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
+                `high_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
+        use_sliding_window (`bool`, *optional*, defaults to `False`):
+            Whether to use sliding window attention.
+        sliding_window (`int`, *optional*, defaults to 4096):
+            Sliding window attention (SWA) window size. If not specified, will default to `4096`.
+        max_window_layers (`int`, *optional*, defaults to 28):
+            The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+    ```python
+    >>> from transformers import Qwen2Model, Qwen2Config
+    >>> # Initializing a Qwen2 style configuration
+    >>> configuration = Qwen2Config()
+    >>> # Initializing a model from the Qwen2-7B style configuration
+    >>> model = Qwen2Model(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "qwen2-mm"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    # Default tensor parallel plan for base model `Qwen2`
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        "layers.*.self_attn.k_proj": "colwise",
+        "layers.*.self_attn.v_proj": "colwise",
+        "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.mlp.gate_proj": "colwise",
+        "layers.*.mlp.up_proj": "colwise",
+        "layers.*.mlp.down_proj": "rowwise",
+    }
+    base_model_pp_plan = {
+        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "norm": (["hidden_states"], ["hidden_states"]),
+    }
+    def __init__(
+        self,
+        vocab_size=151936,
+        hidden_size=4096,
+        intermediate_size=22016,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=32,
+        hidden_act="silu",
+        max_position_embeddings=32768,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        use_cache=True,
+        tie_word_embeddings=False,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        use_sliding_window=False,
+        sliding_window=4096,
+        max_window_layers=28,
+        attention_dropout=0.0,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.use_sliding_window = use_sliding_window
+        self.sliding_window = sliding_window  # we check `use_sliding_window` in the modeling code
+        self.max_window_layers = max_window_layers
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.attention_dropout = attention_dropout
+        # Validate the correctness of rotary position embeddings parameters
+        # BC: if there is a 'type' field, move it to 'rope_type'.
+        if self.rope_scaling is not None and "type" in self.rope_scaling:
+            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
+        rope_config_validation(self)
+        super().__init__(
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
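
The `num_key_value_heads` docstring in this file distinguishes MHA, MQA and GQA. A minimal sketch of that rule follows, applied to the values in this commit's config.json (`hidden_size` 896, 14 attention heads, 2 key/value heads). The helper names `attention_kind` and `head_layout` are hypothetical and not part of the uploaded code, and the per-head dimension assumes the usual `hidden_size // num_attention_heads` convention.

```python
def attention_kind(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention variant following the rule in the docstring."""
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # every query head has its own key/value head
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single key/value head
    return "GQA"      # query heads are split into groups sharing a key/value head


def head_layout(hidden_size: int, num_attention_heads: int, num_key_value_heads: int) -> dict:
    """Derive per-head sizes; raises if the head counts do not divide evenly."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    return {
        "kind": attention_kind(num_attention_heads, num_key_value_heads),
        "head_dim": hidden_size // num_attention_heads,
        "queries_per_kv_head": num_attention_heads // num_key_value_heads,
    }


# Values taken from config.json in this commit.
layout = head_layout(hidden_size=896, num_attention_heads=14, num_key_value_heads=2)
print(layout)
```

With these values the layout is GQA with a head dimension of 64 and 7 query heads per key/value head, whereas the class defaults (32 heads, 32 key/value heads) describe plain MHA.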

generation_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 151644,
+  "eos_token_id": [
+    151645
+  ],
+  "pad_token_id": 151643,
+  "transformers_version": "4.48.3"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6f642731b0bdfe6f3703ef4b208b6cdf17e0c47d2b2de1a99067ed633fca06ff
+size 1950430024
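
The three added lines are a Git LFS pointer, not the weights themselves: each line is a `key value` pair giving the spec version, the SHA-256 object id, and the payload size in bytes. The sketch below parses the pointer text shown in this hunk. `parse_lfs_pointer` is a hypothetical helper, and the parameter estimate assumes every tensor is stored as 2-byte bfloat16 (per `torch_dtype` in config.json) and ignores the safetensors header, so it is at best an approximate upper bound.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6f642731b0bdfe6f3703ef4b208b6cdf17e0c47d2b2de1a99067ed633fca06ff
size 1950430024
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
size_gb = pointer["size_bytes"] / 1e9
# Assumption: 2 bytes per parameter (bfloat16), header bytes ignored.
approx_params = pointer["size_bytes"] // 2
print(f"{size_gb:.2f} GB, roughly {approx_params / 1e6:.0f}M parameters at most")
```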

modeling_phi4mm.py ADDED

	@@ -0,0 +1,1877 @@

+# coding=utf-8
+# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch Phi-4-MM model."""
+import math
+import warnings
+from typing import List, Optional, Tuple, Union
+import numpy as np
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import CrossEntropyLoss
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache, SlidingWindowCache, StaticCache
+from transformers.generation import GenerationMixin
+from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+from transformers.modeling_flash_attention_utils import _flash_attention_forward
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPast,
+    CausalLMOutputWithPast,
+    SequenceClassifierOutputWithPast,
+    TokenClassifierOutput,
+)
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    is_flash_attn_greater_or_equal_2_10,
+    logging,
+    replace_return_docstrings,
+)
+from transformers import AutoConfig, AutoModelForCausalLM, PretrainedConfig
+# from .configuration_phi4mm import Phi4MMConfig
+from .processing_phi4mm import InputMode
+# from .vision_siglip_navit import get_siglip_vision_model
+from .speech_conformer_encoder import ConformerEncoder
+logger = logging.get_logger(__name__)
+_CHECKPOINT_FOR_DOC = "TBA"
+_CONFIG_FOR_DOC = "Qwen2MMConfig"
+# Special token ids
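+# NOTE: 1516444 is larger than `vocab_size` (194498) in config.json, so no input id can equal it,
+# and 151644 equals `bos_token_id`. The '<|endoftext10|>' / '<|endoftext11|>' names below appear
+# to be carried over from Phi-4-MM; verify both ids against this model's tokenizer.
+_IMAGE_SPECIAL_TOKEN_ID = 1516444  # '<|endoftext10|>', or we can better name it (in `tokenizer_config.json`)
+_AUDIO_SPECIAL_TOKEN_ID = 151644  # '<|endoftext11|>'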
+_COMPATIBLE_IMAGE_SPECIAL_TOKEN_ID_RANGE = [-9999, -1]  # For backward compatibility
+_COMPATIBLE_AUDIO_SPECIAL_TOKEN_ID_RANGE = [float('-inf'), -10000]  # For backward compatibility
+# class Phi4MMImageEmbedding(nn.Module):
+#     """Image embedding."""
+#     def __init__(self, config: PretrainedConfig, **kwargs) -> None:
+#         super().__init__()
+#         # n_embed or hidden_size
+#         hidden_size = config.n_embd if hasattr(config, 'n_embd') else config.hidden_size
+#         if hasattr(config, 'embd_pdrop') or hasattr(config, 'embed_pdrop'):
+#             embd_drop = config.embd_pdrop if hasattr(config, 'embd_pdrop') else config.embed_pdrop
+#             self.drop = nn.Dropout(embd_drop)
+#         else:
+#             self.drop = None
+#         logger.info(f"create image tower {config.img_processor}")
+#         enable_gradient_checkpointing = kwargs.get('enable_gradient_checkpointing', False)
+#         # Load SigLIP model
+#         self.img_processor = get_siglip_vision_model(
+#             _flash_attn_2_enabled=config._attn_implementation == 'flash_attention_2'
+#         )
+#         pe_weight = self.img_processor.embeddings.position_embedding.weight
+#         L, D = pe_weight.size()
+#         H = int(math.sqrt(L))
+#         assert H**2 == L
+#         if H % 2 != 0: #and kwargs.get('image_token_compression_cls', None) is None:
+#             self.img_processor_padding = nn.ReflectionPad2d((0, 1, 0, 1))
+#             H += 1
+#         image_dim_out = D
+#         # ((448/14)//2)**2
+#         self.num_img_tokens = (H//2)**2
+#         self.base_feat_height_target = H
+#         if enable_gradient_checkpointing:
+#             self.img_processor.encoder.gradient_checkpointing = True
+#         self.image_dim_out = image_dim_out
+#         self.img_sizes = None
+#         self.image_attention_mask = None
+#         # global_gn and sub_gn for hd transform, serves as line separator
+#         self.use_hd_transform = kwargs.get('use_hd_transform', False)
+#         self.with_learnable_separator = kwargs.get('with_learnable_separator', False)
+#         self.hd_transform_order = kwargs.get('hd_transform_order', 'glb_sub')
+#         self.freeze_img_processor = kwargs.get('freeze_img_processor', False)
+#         self.crop_size = kwargs.get('crop_size', 336)
+#         logger.info(f'freeze_img_processor = {self.freeze_img_processor}')
+#         # image token compression
+#         self.image_token_compression_cls = kwargs.get('image_token_compression_cls', None)
+#         if self.image_token_compression_cls == 'avg_pool_2d':
+#             self.image_token_compression = nn.AvgPool2d(kernel_size=2, stride=2)
+#             self.base_feat_height_reduction = 1
+#             self.base_feat_height_target = self.base_feat_height_target // 2
+#         elif self.image_token_compression_cls is None:
+#             self.image_token_compression = None
+#             self.base_feat_height_reduction = 2
+#         else:
+#             raise NotImplementedError(f'image_token_compression_cls = {self.image_token_compression_cls}, not implemented')
+#         # with_hd_transform and with_learnable_separator should have same value
+#         assert self.use_hd_transform == self.with_learnable_separator, 'use_hd_transform and with_learnable_separator should have same value'
+#         if self.with_learnable_separator:
+#             assert self.use_hd_transform, 'learnable separator is only for hd transform'
+#             # 1024 * 4, merge spatial to channel dimension
+#             self.glb_GN = nn.Parameter(torch.zeros([1, 1, self.image_dim_out * self.base_feat_height_reduction**2]))
+#             self.sub_GN = nn.Parameter(torch.zeros([1, 1, 1, self.image_dim_out * self.base_feat_height_reduction**2]))
+#             logger.info(f'learnable separator enabled for hd transform, hd_transform_order = {self.hd_transform_order}')
+#         projection_cls = kwargs.get('projection_cls', 'linear')
+#         if projection_cls == 'linear':
+#             self.img_projection = nn.Linear(image_dim_out, hidden_size)
+#         elif projection_cls == 'mlp' and self.use_hd_transform:
+#             dim_projection = hidden_size
+#             depth = 2
+#             layers = [nn.Linear(image_dim_out * self.base_feat_height_reduction**2, dim_projection)]
+#             for _ in range(1, depth):
+#                 layers.extend([nn.GELU(),
+#                                 nn.Linear(dim_projection, dim_projection)])
+#             self.img_projection = nn.Sequential(*layers)
+#         elif projection_cls == 'mlp':
+#             # follow llava-v1.5's implementation
+#             # (do not use image_projection and image_proj_norm)
+#             dim_projection = hidden_size
+#             depth = 2
+#             layers = [nn.Linear(image_dim_out, dim_projection)]
+#             for _ in range(1, depth):
+#                 layers.extend([nn.GELU(),
+#                                 nn.Linear(dim_projection, dim_projection)])
+#             self.img_projection = nn.Sequential(*layers)
+#         else:
+#             raise NotImplementedError(f'projection_cls = {projection_cls}, not implemented')
+#         self.vocab_size = config.vocab_size
+#         self.img_features = None
+#         if isinstance(config.img_processor, dict):
+#             self.layer_idx = config.img_processor.get('layer_idx', -2)
+#             self.type_feature = config.img_processor.get('type_feature', 'patch')
+#         else:
+#             self.layer_idx = -2
+#             self.type_feature = 'patch'
+#     def set_img_features(self, img_features: torch.FloatTensor) -> None:
+#         self.img_features = img_features
+#     def set_img_sizes(self, img_sizes: torch.LongTensor) -> None:
+#         self.img_sizes = img_sizes
+#     def set_img_attn_mask(self, image_attention_mask: torch.FloatTensor) -> None:
+#         self.image_attention_mask = image_attention_mask
+#     def get_img_features(self, img_embeds: torch.FloatTensor, attention_mask=None) -> torch.FloatTensor:
+#         LAYER_IDX = self.layer_idx
+#         TYPE_FEATURE = self.type_feature
+#         if self.freeze_img_processor:
+#             with torch.no_grad():
+#                 if attention_mask is not None:
+#                     img_processor_output = self.img_processor(img_embeds, output_hidden_states=True, patch_attention_mask=attention_mask)
+#                 else:
+#                     img_processor_output = self.img_processor(img_embeds, output_hidden_states=True)
+#                 img_feature = img_processor_output.hidden_states[LAYER_IDX]
+#         else:
+#             if attention_mask is not None:
+#                 img_processor_output = self.img_processor(img_embeds, output_hidden_states=True, patch_attention_mask=attention_mask)
+#             else:
+#                 img_processor_output = self.img_processor(img_embeds, output_hidden_states=True)
+#             img_feature = img_processor_output.hidden_states[LAYER_IDX]
+#         if TYPE_FEATURE == "patch":
+#             patch_feature = img_feature
+#             if self.image_token_compression is not None:
+#                 # reshape to 2D tensor
+#                 width = int(math.sqrt(patch_feature.size(1)))
+#                 patch_feature = patch_feature.view(-1, width, width, patch_feature.size(-1))
+#                 # convert to NCHW
+#                 patch_feature = patch_feature.permute(0, 3, 1, 2)
+#                 if getattr(self, 'img_processor_padding', None) is not None:
+#                     patch_feature = self.img_processor_padding(patch_feature)
+#                 patch_feature = self.image_token_compression(patch_feature)
+#                 # convert to NHWC
+#                 patch_feature = patch_feature.permute(0, 2, 3, 1)
+#                 patch_feature = patch_feature.view(-1, patch_feature.size(1) * patch_feature.size(2), patch_feature.size(-1))
+#             elif getattr(self, 'img_processor_padding', None) is not None:
+#                 width = int(math.sqrt(patch_feature.size(1)))
+#                 patch_feature = patch_feature.view(-1, width, width, patch_feature.size(-1))
+#                 # convert to NCHW
+#                 patch_feature = patch_feature.permute(0, 3, 1, 2)
+#                 patch_feature = self.img_processor_padding(patch_feature)
+#                 # convert to NHWC
+#                 patch_feature = patch_feature.permute(0, 2, 3, 1)
+#                 patch_feature = patch_feature.view(-1, patch_feature.size(1) * patch_feature.size(2), patch_feature.size(-1))
+#             return patch_feature
+#         if TYPE_FEATURE == "cls_patch":
+#             if self.image_token_compression is not None:
+#                 # reshape to 2D tensor
+#                 patch_feature = img_feature[:, 1:]
+#                 cls_feature = img_feature[:, 0]
+#                 width = int(math.sqrt(patch_feature.size(1)))  # view() needs integer sizes
+#                 patch_feature = patch_feature.view(-1, width, width, patch_feature.size(-1))
+#                 patch_feature = self.image_token_compression(patch_feature)
+#                 patch_feature = patch_feature.view(-1, patch_feature.size(-2) * patch_feature.size(-1))
+#                 img_feature = torch.cat([cls_feature, patch_feature], dim=1)
+#             return img_feature
+#         logger.info(f'processed img feature size = {img_feature.size()}')
+#         raise NotImplementedError
+#     def spatiotemporal_pool(self, x, num_img_tokens, batch_size=1, T=1):
+#         if self.image_pos_embed is not None:
+#             x = x.view(batch_size * T, -1, x.shape[-1])
+#             num_tokens = x.shape[-2]
+#             h, w = int(num_tokens ** 0.5), int(num_tokens ** 0.5)
+#             assert h * w == num_tokens, 'only support square feature maps for now'
+#             x = x.view(batch_size * T, h, w, x.shape[-1])
+#             pos_embed = self.image_pos_embed(x)
+#             x = x + pos_embed
+#             x = x.view(batch_size, T * h * w, x.shape[-1])
+#         if self.visual_temporal_embed is not None:
+#             visual_temporal_embed = self.visual_temporal_embed(x.view(batch_size, T, -1, x.shape[-1])[:, :, 0])
+#             x = x.view(batch_size, T, -1, x.shape[-1]) + visual_temporal_embed.view(1, T, 1, x.shape[-1])
+#         new_x = []
+#         # [bsz, T * H' * W', C] -> [bsz, T, C]
+#         spatial_avg_pool_x = x.view(batch_size, T, -1, x.shape[-1]).mean(dim=2)
+#         new_x.append(spatial_avg_pool_x)
+#         # [bsz, T * H' * W', C] -> [bsz, H'*W', C]
+#         temporal_avg_pool_x = x.view(batch_size, T, -1, x.shape[-1]).mean(dim=1)
+#         new_x.append(temporal_avg_pool_x)
+#         x = torch.cat(new_x, dim=1).view(-1, self.image_dim_out)
+#         num_img_tokens += T
+#         return x, num_img_tokens
+#     def forward(self, input_ids: torch.LongTensor, input_embeds: torch.FloatTensor, image_sizes=None, **kwargs) -> torch.FloatTensor:
+#         if isinstance(input_ids, tuple):
+#             # # pipeline parallel
+#             input_ids, input_embeds = input_ids
+#         img_embeds = input_embeds
+#         if image_sizes is None and 'image_sizes' in kwargs:
+#             image_sizes = kwargs['image_sizes']
+#         img_sizes = image_sizes
+#         if self.img_features is not None:
+#             img_embeds = self.img_features.clone()
+#             self.img_features = None
+#         if self.img_sizes is not None:
+#             img_sizes = self.img_sizes
+#         dtype = self.img_processor.embeddings.patch_embedding.weight.dtype
+#         if img_embeds is not None:
+#             # convert to bf16
+#             img_embeds = img_embeds.to(dtype)
+#         if self.image_attention_mask is not None:
+#             image_attention_mask = self.image_attention_mask.clone()
+#             self.image_attention_mask = None
+#         elif 'image_attention_mask' in kwargs:
+#             image_attention_mask = kwargs['image_attention_mask']
+#         else:
+#             image_attention_mask = None
+#         input_shape = input_ids.size()
+#         input_ids = input_ids.view(-1, input_shape[-1])
+#         with torch.no_grad():
+#             positions = torch.nonzero(input_ids == _IMAGE_SPECIAL_TOKEN_ID, as_tuple=False)
+#             positions_tuple = torch.nonzero(input_ids == _IMAGE_SPECIAL_TOKEN_ID, as_tuple=True)
+#         # logger.info(f'position size: {positions.size()} ...')
+#         fake_image_forward = False
+#         select = False
+#         hd_transform = False
+#         if isinstance(self.img_projection, nn.Sequential):
+#             target_device = self.img_projection[0].bias.device
+#             target_dtype = self.img_projection[0].bias.dtype
+#         else:  # It's a single nn.Linear layer
+#             target_device = self.img_projection.bias.device
+#             target_dtype = self.img_projection.bias.dtype
+#         num_img_tokens = self.num_img_tokens
+#         if len(positions.tolist()) > 0:
+#             if self.use_hd_transform and img_sizes is not None and len(img_sizes):
+#                 hd_transform = True
+#                 assert img_embeds.ndim == 5, f'(branch 1) img_embeds size: {img_embeds.size()}, expect 5D tensor for hd transform'
+#                 # img_embeds: (num_images, max_num_crops, 3, H, W)
+#                 # img_sizes: (num_images, 2).view(1, -1)
+#                 bs = img_embeds.shape[0]
+#                 # Nx(HW)xC
+#                 if image_attention_mask is not None and len(image_attention_mask) > 0:
+#                     img_features = self.get_img_features(img_embeds.flatten(0, 1), attention_mask=image_attention_mask.type(torch.BoolTensor).flatten(0,1).to(target_device))
+#                 else:
+#                     img_features = self.get_img_features(img_embeds.flatten(0, 1))
+#                 base_feat_height_target = self.base_feat_height_target
+#                 base_resolution = self.crop_size
+#                 base_feat_height_reduction = self.base_feat_height_reduction
+#                 base_feat_height = base_feat_width = int(np.sqrt(img_features.shape[1]))
+#                 assert base_feat_height == base_feat_height_target and base_feat_width == base_feat_height_target, f'base_feat_height: {base_feat_height}, base_feat_width: {base_feat_width}, expect {base_feat_height_target} features for hd transform'
+#                 # bs x max_num_crops x (24x24) x C
+#                 img_features = img_features.view(bs, -1, base_feat_height * base_feat_width, self.image_dim_out)
+#                 C = self.image_dim_out
+#                 H = base_feat_height
+#                 output_imgs = []
+#                 output_len = []
+#                 # training is tensor, inference is list
+#                 if isinstance(img_sizes, torch.Tensor):
+#                     img_sizes = img_sizes.view(-1, 2)
+#                 for _bs in range(bs):
+#                     h, w = img_sizes[_bs]
+#                     h = h // base_resolution
+#                     w = w // base_resolution
+#                     B_ = h * w
+#                     # 1 x (24x24) x 1024
+#                     global_img_feature = img_features[_bs, :1]
+#                     # 1 x 12 x 12 x 4096
+#                     glb_img = global_img_feature.reshape(1,H,H,C).reshape(1,H//base_feat_height_reduction,base_feat_height_reduction,H//base_feat_height_reduction,base_feat_height_reduction,C).contiguous().permute(0,1,3,2,4,5).reshape(1,H//base_feat_height_reduction,H//base_feat_height_reduction,base_feat_height_reduction*base_feat_height_reduction*C).contiguous()
+#                     temp_glb_GN = self.sub_GN.repeat(1, H//base_feat_height_reduction, 1, 1)
+#                     # 1 x 156 x 4096
+#                     glb_img = torch.cat([glb_img, temp_glb_GN], dim=2).reshape(1,-1,base_feat_height_reduction*base_feat_height_reduction*C)
+#                     # (max_num_crops-1) x (12x12) x C
+#                     sub_img = img_features[_bs, 1:]
+#                     # 16x574x1024
+#                     # get rid of padding sub_img
+#                     sub_img = sub_img[:B_]
+#                     # (num_crops, 12, 2, 12, 2, 1024) -> (num_crops, 12, 12, 2, 2, 1024) -> (num_crops, 12*12, 4*1024)
+#                     sub_img = sub_img.reshape(B_,H,H,C).reshape(B_,H//base_feat_height_reduction,base_feat_height_reduction,H//base_feat_height_reduction,base_feat_height_reduction,C).contiguous().permute(0,1,3,2,4,5).reshape(B_,-1,base_feat_height_reduction*base_feat_height_reduction*C).contiguous()
+#                     sub_img = sub_img.reshape(1, h, w, base_feat_height // base_feat_height_reduction, base_feat_width // base_feat_height_reduction, -1).permute(0,1,3,2,4,5).reshape(1,h*base_feat_height//base_feat_height_reduction,w*base_feat_width//base_feat_height_reduction,base_feat_height_reduction*base_feat_height_reduction*C)
+#                     if image_attention_mask is not None and len(image_attention_mask) > 0:
+#                         reshaped_image_attention_mask = image_attention_mask[_bs,1:B_+1,0::2,0::2].reshape(1, h, w, base_feat_height // base_feat_height_reduction, base_feat_width // base_feat_height_reduction).permute(0,1,3,2,4).reshape(1,h*base_feat_height//base_feat_height_reduction,w*base_feat_width//base_feat_height_reduction)
+#                         useful_height = int(reshaped_image_attention_mask[0,:,0].sum().item())
+#                         useful_width = int(reshaped_image_attention_mask[0,0,:].sum().item())
+#                         sub_img = sub_img[:,:useful_height, :useful_width]
+#                         temp_sub_GN = self.sub_GN.repeat(1, useful_height, 1, 1)
+#                         temp_len = int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item()) + (useful_height+1) + base_feat_height//base_feat_height_reduction
+#                     else:
+#                         temp_sub_GN = self.sub_GN.repeat(1, h*base_feat_height//base_feat_height_reduction, 1, 1)
+#                         temp_len = int((h*w+1)*self.num_img_tokens+ 1 + (h+1)*base_feat_height//base_feat_height_reduction)
+#                     sub_img = torch.cat([sub_img, temp_sub_GN], dim=2).reshape(1,-1,base_feat_height_reduction*base_feat_height_reduction*C)
+#                     # (1, num_img_tokens, 1024*4)
+#                     # glb + sub
+#                     if self.hd_transform_order == 'glb_sub':
+#                         output_imgs.append(torch.cat([glb_img, self.glb_GN, sub_img], dim=1))
+#                     elif self.hd_transform_order == 'sub_glb':
+#                         output_imgs.append(torch.cat([sub_img, self.glb_GN, glb_img], dim=1))
+#                     else:
+#                         raise NotImplementedError(f'hd_transform_order = {self.hd_transform_order}, not implemented')
+#                     #temp_len = int((h*w+1)*144 + 1 + (h+1)*12)
+#                     assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
+#                     output_len.append(temp_len)
+#                 num_img_tokens = output_len
+#                 img_set_tensor = []
+#                 for _output_img in output_imgs:
+#                     img_feature_proj = self.img_projection(_output_img.to(target_device).to(target_dtype))
+#                     img_set_tensor.append(img_feature_proj)
+#                 #logger.info(f'img_embeds size: {img_embeds.size()}, image sizes: {img_sizes} loading time {datetime.now() - start_time}')
+#                 #assert sum(num_img_tokens) == len(g_values), f'(branch 1) sum(num_img_tokens): {sum(num_img_tokens)}, g_values size: {len(g_values)}, g_values {g_values}'
+#             else:
+#                 raise NotImplementedError
+#             select = True
+#         else:
+#             # # create a fake image tensor
+#             # # TODO: need define image size for different vision model
+#             if self.training:
+#                 img_embeds = torch.zeros(1, 3, self.crop_size, self.crop_size, dtype=target_dtype, device=input_ids.device)
+#                 tt = (
+#                     self.get_img_features(img_embeds)
+#                     .to(target_device)
+#                     .to(target_dtype)
+#                     .reshape(-1, 1024)
+#                 )
+#                 if self.use_hd_transform:
+#                     img_set_tensor = self.img_projection(tt.reshape(-1, self.image_dim_out*self.base_feat_height_reduction**2) * self.glb_GN[0] * self.sub_GN[0, 0])
+#                 else:
+#                     img_set_tensor = self.img_projection(tt)  # adapted visual features.
+#                 fake_image_forward = True
+#         # we use the token embedding layer from the huggingface model, this is REQUIRED to make sure we are using the loaded weights.
+#         hidden_states = kwargs['wte'](input_ids)
+#         if select:
+#             if hd_transform:
+#                 # new implementation without in-place operation
+#                 # Ref: https://huggingface.co/microsoft/Phi-3.5-vision-instruct/blob/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca/modeling_phi3_v.py#L233
+#                 # Ref: https://pytorch.org/docs/stable/generated/torch.Tensor.index_put.html
+#                 # Ref: https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html#torch.Tensor.index_put_
+#                 # img_set_tensor: a list of tensors, each tensor has shape (1, N_tokens, C)
+#                 assert all([_img_set_tensor.shape[0] == 1 for _img_set_tensor in img_set_tensor]), 'img_set_tensor should have shape (1, N_tokens, C)'
+#                 # Shape: (merged_N_tokens, C)
+#                 merged_img_set_tensor = torch.cat(img_set_tensor, dim=1).squeeze(0)
+#                 merged_img_set_tensor = merged_img_set_tensor.to(hidden_states.dtype).to(hidden_states.device)
+#                 # Temporarily disable autocast to avoid issue on bf16 tensors
+#                 # Ref: https://github.com/pytorch/pytorch/issues/132715
+#                 with torch.autocast(device_type=hidden_states.device.type, enabled=False):
+#                     new_hidden_states = hidden_states.index_put(
+#                         indices=positions_tuple,
+#                         values=merged_img_set_tensor,
+#                         accumulate=False
+#                     )
+#                 hidden_states = new_hidden_states
+#             else:
+#                 raise NotImplementedError
+#         if fake_image_forward and self.training:
+#             hidden_states = hidden_states + (0 * img_set_tensor[0].to(hidden_states.dtype).to(hidden_states.device)).sum()
+#         if self.drop is not None:
+#             hidden_states = self.drop(hidden_states)
+#         return hidden_states
+class Phi4MMAudioEmbedding(nn.Module):
+    """Audio embedding."""
+    def __init__(self, config: PretrainedConfig, **kwargs) -> None:
+        super().__init__()
+        self.config = config
+        # n_embed or hidden_size for text LM
+        hidden_size = config.n_embd if hasattr(config, 'n_embd') else config.hidden_size
+        if hasattr(config, 'embd_pdrop') or hasattr(config, 'embed_pdrop'):
+            embd_drop = config.embd_pdrop if hasattr(config, 'embd_pdrop') else config.embed_pdrop
+            self.drop = nn.Dropout(embd_drop)
+        else:
+            self.drop = None
+        audio_dim_out = None # Set this variable according to the actual audio processor
+        logger.info(f"create audio processor {config.audio_processor}")
+        self.layer_idx = -2
+        if isinstance(config.audio_processor, dict) and config.audio_processor.get('name', None) == "cascades":
+            encoder_config = config.audio_processor.get("config", None)
+            assert encoder_config is not None
+            self.encoder = ConformerEncoder(**encoder_config)
+            # Placeholder initialization: create only the encoder_embedding layer so that,
+            # at decoding time, from_pretrained() can load all parameters.
+            # For training, post_init() is called again after from_pretrained() to ensure correct initialization.
+            self.encoder.post_init({})
+            audio_dim_out = encoder_config["attention_dim"]
+            n_mels = encoder_config["input_size"]
+        else:
+            raise NotImplementedError
+        assert audio_dim_out is not None, "Remember to set values for audio_dim_out"
+        self.audio_dim_out = audio_dim_out
+        self.audio_dim_in = n_mels
+        self.freeze_audio_processor = kwargs.get('freeze_audio_processor', False)
+        logger.info(f'freeze_audio_processor = {self.freeze_audio_processor}')
+        self.downsample_rate = kwargs.get('downsample_rate', 1)
+        enable_gradient_checkpointing = kwargs.get('enable_gradient_checkpointing', False)
+        if enable_gradient_checkpointing:
+            self.encoder.gradient_checkpointing_enable()
+            logger.info('gradient checkpointing enabled for audio processor')
+        projection_cls = kwargs.get('projection_cls', 'linear')
+        if projection_cls == 'linear':
+            self.audio_projection = nn.Linear(audio_dim_out, hidden_size)
+        elif projection_cls == 'mlp':
+            # follow llava-v1.5's implementation
+            # (do not use image_projection and image_proj_norm)
+            dim_projection = hidden_size
+            depth = 2
+            self.linear_downsample_rate = self.downsample_rate
+            layers_for_speech = [nn.Linear(audio_dim_out * self.linear_downsample_rate, dim_projection)]
+            for _ in range(1, depth):
+                layers_for_speech.extend([nn.GELU(), nn.Linear(dim_projection, dim_projection)])
+            audio_projection_for_speech = nn.Sequential(*layers_for_speech)
+            layers_for_vision = [nn.Linear(audio_dim_out * self.linear_downsample_rate, dim_projection)]
+            for _ in range(1, depth):
+                layers_for_vision.extend([nn.GELU(), nn.Linear(dim_projection, dim_projection)])
+            # audio_projection_for_vision = nn.Sequential(*layers_for_vision)
+            self.audio_projection = nn.ModuleDict({
+                'speech': audio_projection_for_speech #,
+                # 'vision': audio_projection_for_vision
+            })
+        else:
+            raise NotImplementedError(f'projection_cls = {projection_cls}, not implemented')
+        self.vocab_size = config.vocab_size
+        self.input_embeds = None
+        self.audio_embed_sizes = None
+    def post_init(self, audio_config):
+        # execute after the from_pretrained() initialization of the phi4mm model
+        if audio_config.get('name', None) == "cascades":
+            init_model_config = audio_config.get("init_model", {})
+            self.encoder.post_init(init_model_config)
+            # remove the init model in config so it is not saved in the config.
+            # This might affect the model loading in resuming training and decoding.
+            if "init_model" in audio_config:
+                audio_config.pop("init_model")
+    def set_audio_embeds(self, input_embeds: torch.FloatTensor) -> None:
+        self.input_embeds = input_embeds
+    def set_audio_embed_sizes(self, audio_embed_sizes: torch.LongTensor) -> None:
+        self.audio_embed_sizes = audio_embed_sizes
+    def get_audio_features(self, input_embeds: torch.FloatTensor, audio_attention_mask: torch.Tensor, audio_projection_mode: str='speech'):
+        if self.freeze_audio_processor:
+            with torch.no_grad():
+                audio_features, masks = self.encoder(input_embeds, audio_attention_mask)
+        else:
+            audio_features, masks = self.encoder(input_embeds, audio_attention_mask)
+        if isinstance(self.audio_projection, (nn.Sequential, nn.Linear)):
+            # nn.Linear is what __init__ builds when projection_cls == 'linear'
+            audio_set_tensor = self.audio_projection(audio_features)
+        elif isinstance(self.audio_projection, nn.ModuleDict):
+            audio_set_tensor = self.audio_projection[audio_projection_mode](audio_features)
+        else:
+            raise NotImplementedError
+        return audio_set_tensor
+    def forward(self, input_ids: torch.LongTensor, input_embeds: torch.FloatTensor, audio_embed_sizes=None, audio_attention_mask=None, audio_projection_mode='speech', **kwargs) -> torch.FloatTensor:
+        '''
+        arguments:
+            input_ids: input text ids (B, U)
+            input_embeds: audio features (B, T, D)  B: num audios in a sequence
+        '''
+        if self.input_embeds is not None:
+            input_embeds = self.input_embeds.clone()
+        if self.audio_embed_sizes is not None:
+            audio_embed_sizes = self.audio_embed_sizes.clone()
+        input_shape = input_ids.size()
+        input_ids = input_ids.view(-1, input_shape[-1])
+        MAX_INPUT_ID = int(1e9)
+        with torch.no_grad():
+            positions = torch.nonzero(input_ids == _AUDIO_SPECIAL_TOKEN_ID, as_tuple=False)
+            positions_tuple = torch.nonzero(input_ids == _AUDIO_SPECIAL_TOKEN_ID, as_tuple=True)
+        if isinstance(self.audio_projection, nn.Sequential):
+            target_device = self.audio_projection[0].bias.device
+            target_dtype = self.audio_projection[0].bias.dtype
+        elif isinstance(self.audio_projection, nn.ModuleDict):
+            target_device = self.audio_projection[audio_projection_mode][0].bias.device
+            target_dtype = self.audio_projection[audio_projection_mode][0].bias.dtype
+        else:  # It's a single nn.Linear layer
+            target_device = self.audio_projection.bias.device
+            target_dtype = self.audio_projection.bias.dtype
+        if input_embeds is not None:
+            input_embeds = input_embeds.to(target_device).to(target_dtype)
+        if len(positions.tolist()) > 0:
+            audio_set_tensor = self.get_audio_features(input_embeds, audio_attention_mask, audio_projection_mode)
+        else:
+            # create a dummy audio tensor
+            # TODO: confirm whether this is required for text-only input
+            if self.training:
+                audio_embeds = torch.zeros(1, 500, self.audio_dim_in).to(target_device).to(target_dtype)
+                audio_attention_mask = audio_embeds.new_ones(audio_embeds.size()[:2]).long()
+                audio_set_tensor = self.get_audio_features(audio_embeds, audio_attention_mask, audio_projection_mode)
+        hidden_states = kwargs['wte'](input_ids)
+        if len(positions.tolist()) > 0:
+            assert audio_embed_sizes.sum().item() == len(positions), \
+                f"please ensure the encoder outputs have the same length as defined in input_ids! \n audio_embed_sizes.sum().item(): {audio_embed_sizes.sum().item()} \n len(positions): {len(positions)} \n audio_embed_sizes: {audio_embed_sizes} \n positions: {positions} \n input_ids.shape: {input_ids.shape}"
+            # new implementation without in-place operation
+            # Ref: https://huggingface.co/microsoft/Phi-3.5-vision-instruct/blob/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca/modeling_phi3_v.py#L233
+            # Ref: https://pytorch.org/docs/stable/generated/torch.Tensor.index_put.html
+            # Ref: https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html#torch.Tensor.index_put_
+            # audio_set_tensor: shape (N_audios, N_padded_tokens, C)
+            # Shape: (merged_N_tokens, C)
+            merged_audio_set_tensor = torch.cat([
+                audio_set_tensor[i, :audio_embed_sizes[i], :]
+                for i in range(len(audio_embed_sizes))
+            ], dim=0)
+            merged_audio_set_tensor = merged_audio_set_tensor.to(hidden_states.dtype).to(hidden_states.device)
+            # Temporarily disable autocast to avoid issue on bf16 tensors
+            # Ref: https://github.com/pytorch/pytorch/issues/132715
+            with torch.autocast(device_type=hidden_states.device.type, enabled=False):
+                new_hidden_states = hidden_states.index_put(
+                    indices=positions_tuple,
+                    values=merged_audio_set_tensor,
+                    accumulate=False
+                )
+            hidden_states = new_hidden_states
+        else:
+            if self.training:
+                hidden_states = hidden_states + (0 * audio_set_tensor[:,0].to(hidden_states.dtype).to(hidden_states.device)).sum()
+        if self.drop is not None:
+            hidden_states = self.drop(hidden_states)
+        return hidden_states
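Review note: the out-of-place `index_put` merge in `Phi4MMAudioEmbedding.forward` can be hard to follow from the diff alone. Below is a minimal sketch of the same idea on toy tensors; it is not the file's code. `AUDIO_PLACEHOLDER_ID`, `scatter_audio_features`, and all tensor values are hypothetical and chosen only for illustration; the real `_AUDIO_SPECIAL_TOKEN_ID` is defined elsewhere in the file.

```python
import torch

# Hypothetical placeholder id, for illustration only.
AUDIO_PLACEHOLDER_ID = -1


def scatter_audio_features(input_ids, text_embeds, audio_features, audio_sizes):
    """Return a copy of text_embeds whose placeholder rows hold audio features."""
    # (row_indices, col_indices) of every placeholder token, in row-major order.
    positions = torch.nonzero(input_ids == AUDIO_PLACEHOLDER_ID, as_tuple=True)
    # Drop the padded tail of each clip, then concatenate along the token axis.
    merged = torch.cat(
        [audio_features[i, : int(audio_sizes[i])] for i in range(len(audio_sizes))],
        dim=0,
    )
    assert merged.shape[0] == positions[0].numel(), "placeholder/feature count mismatch"
    # index_put (no trailing underscore) is out-of-place and returns a new tensor.
    return text_embeds.index_put(positions, merged.to(text_embeds.dtype), accumulate=False)


input_ids = torch.tensor([[5, -1, -1, 7], [-1, 8, 9, 10]])
text_embeds = torch.zeros(2, 4, 3)
# Two clips, each padded to 2 frames of width 3; the second clip has 1 valid frame.
audio_features = torch.arange(12, dtype=torch.float32).reshape(2, 2, 3) + 1.0
audio_sizes = torch.tensor([2, 1])

out = scatter_audio_features(input_ids, text_embeds, audio_features, audio_sizes)
```

Because the call returns a new tensor, `text_embeds` is left untouched, which is the property the "without in-place operation" comment in the diff relies on.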
+class Phi4MMImageAudioEmbedding(nn.Module):
+    """Image-audio embedding."""
+    def __init__(self, config: PretrainedConfig, **kwargs) -> None:
+        super().__init__()
+        self.vocab_size = config.vocab_size
+        # self.image_input_id = kwargs.get('image_input_id', -1)
+        self.audio_input_id = kwargs.get('audio_input_id', -10000)
+        # assert self.image_input_id != self.audio_input_id, 'image_input_id and audio_input_id should be different'
+        # self.image_embd_layer_kwargs = kwargs['image_embd_layer']
+        # self.image_embed = Phi4MMImageEmbedding(config, **self.image_embd_layer_kwargs)
+        self.audio_embd_layer_kwargs = kwargs['audio_embd_layer']
+        self.audio_embed = Phi4MMAudioEmbedding(config, **self.audio_embd_layer_kwargs)
+        # self.input_image_embeds = None
+        # self.image_sizes = None
+        # self.image_attention_mask = None
+        self.input_audio_embeds = None
+        self.audio_embed_sizes = None
+    def post_init(self, audio_config):
+        # post init for audio embedding
+        # ref: model.model.embed_tokens_extend.post_init(audio_config) in phyagi/getters/model.py
+        self.audio_embed.post_init(audio_config)
+    # def set_input_image_embeds(self, input_image_embeds: torch.FloatTensor) -> None:
+    #     self.input_image_embeds = input_image_embeds
+    # def set_image_sizes(self, image_sizes: torch.LongTensor) -> None:
+    #     self.image_sizes = image_sizes
+    # def set_img_attn_mask(self, image_attention_mask: torch.FloatTensor) -> None:
+    #     self.image_attention_mask = image_attention_mask
+    def set_input_audio_embeds(self, input_audio_embeds: torch.FloatTensor) -> None:
+        self.input_audio_embeds = input_audio_embeds
+    def set_audio_embed_sizes(self, audio_embed_sizes: torch.LongTensor) -> None:
+        self.audio_embed_sizes = audio_embed_sizes
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        input_embeds,
+        input_image_embeds: Optional[torch.FloatTensor]=None,
+        input_audio_embeds: Optional[torch.FloatTensor]=None,
+        image_sizes=None,
+        image_attention_mask=None,
+        audio_embed_sizes=None,
+        audio_attention_mask=None,
+        audio_projection_mode='speech',
+        wte=None,
+    ) -> torch.FloatTensor:
+        MAX_INPUT_ID = int(1e9)
+        assert -MAX_INPUT_ID < self.audio_input_id #< self.image_input_id
+        # override image and audio embeddings and sizes from object itself
+        # this is for inference
+        # ref: phyagi/eval/utils/text_generation_vision_audio_pipeline.py
+        # if self.input_image_embeds is not None:
+        #     assert input_image_embeds is None
+        #     input_image_embeds = self.input_image_embeds.clone()
+        #     # NOTE weijian: set input_image_embeds to None after first call in for eval stage
+        #     #               during evaluation, it will call model's forward() multiple times
+        #     #               the first time input_ids contains the prompt (including <|image_{}|>) and input_embeds exists
+        #     #               from the second time, the input_ids will only contain the generated text
+        #     #               thus, the input_image_embeds is no longer needed
+        #     self.input_image_embeds = None
+        # if self.image_sizes is not None:
+        #     assert image_sizes is None
+        #     image_sizes = self.image_sizes
+        if self.input_audio_embeds is not None:
+            assert input_audio_embeds is None
+            input_audio_embeds = self.input_audio_embeds.clone()
+            self.input_audio_embeds = None
+        if self.audio_embed_sizes is not None:
+            assert audio_embed_sizes is None
+            audio_embed_sizes = self.audio_embed_sizes.clone()
+        # if self.image_attention_mask is not None:
+        #     assert image_attention_mask is None
+        #     image_attention_mask = self.image_attention_mask.clone()
+        #     self.image_attention_mask = None
+        input_shape = input_ids.size()
+        input_ids = input_ids.view(-1, input_shape[-1])
+        # backward compatibility
+        with torch.no_grad():
+            new_input_ids = input_ids.clone()
+            # new_input_ids[(input_ids >= _COMPATIBLE_IMAGE_SPECIAL_TOKEN_ID_RANGE[0]) &
+            #             (input_ids <= _COMPATIBLE_IMAGE_SPECIAL_TOKEN_ID_RANGE[1])] = _IMAGE_SPECIAL_TOKEN_ID
+            new_input_ids[(input_ids >= _COMPATIBLE_AUDIO_SPECIAL_TOKEN_ID_RANGE[0]) &
+                        (input_ids <= _COMPATIBLE_AUDIO_SPECIAL_TOKEN_ID_RANGE[1])] = _AUDIO_SPECIAL_TOKEN_ID
+            input_ids = new_input_ids
+        # with torch.no_grad():
+        #     image_position_mask = input_ids == _IMAGE_SPECIAL_TOKEN_ID
+        #     non_image_position_mask = ~image_position_mask
+        assert input_embeds is None
+        # if self.training:
+        #     assert input_image_embeds is not None or input_audio_embeds is not None
+        if self.training:
+            assert input_audio_embeds is not None
+        # if input_image_embeds is not None:
+        #     image_hidden_states = self.image_embed(
+        #         input_ids=input_ids,
+        #         input_embeds=input_image_embeds,
+        #         image_sizes=image_sizes,
+        #         wte=wte,
+        #         image_attention_mask=image_attention_mask
+        #     )
+        if input_audio_embeds is not None:
+            audio_hidden_states = self.audio_embed(
+                input_ids=input_ids,
+                input_embeds=input_audio_embeds,
+                audio_embed_sizes=audio_embed_sizes,
+                audio_attention_mask=audio_attention_mask,
+                wte=wte,
+                audio_projection_mode=audio_projection_mode,
+            )
+        # merge image and audio hidden states
+        # NOTE weijian: for non-image-audio tokens, here we use audio hidden states
+        #               actually, in the debug code above, the non-image-audio tokens from image_hidden_states and audio_hidden_states should be the same
+        # if input_image_embeds is not None and input_audio_embeds is not None:
+        #     dtype = image_hidden_states.dtype
+        #     hidden_states = image_hidden_states * image_position_mask.to(dtype).unsqueeze(-1) + audio_hidden_states * non_image_position_mask.to(dtype).unsqueeze(-1)
+        # elif input_image_embeds is not None:
+        #     hidden_states = image_hidden_states
+        # elif input_audio_embeds is not None:
+        if input_audio_embeds is not None:
+            hidden_states = audio_hidden_states
+        else:
+            assert wte is not None
+            hidden_states = wte(input_ids)
+        return hidden_states
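Review note: the "backward compatibility" step in `Phi4MMImageAudioEmbedding.forward` collapses a range of legacy placeholder ids onto one canonical audio id. A small sketch of that remapping follows, assuming made-up values: `LEGACY_AUDIO_ID_RANGE`, `AUDIO_ID`, and `normalize_audio_ids` are hypothetical stand-ins, not the constants used in this file.

```python
import torch

# Hypothetical values, for illustration only.
LEGACY_AUDIO_ID_RANGE = (-10**9, -10000)
AUDIO_ID = 99


def normalize_audio_ids(input_ids, legacy_range, audio_id):
    """Map every id inside the inclusive legacy range to the canonical audio id."""
    lo, hi = legacy_range
    legacy_mask = (input_ids >= lo) & (input_ids <= hi)
    # torch.where builds a new tensor, so the caller's input_ids is not modified.
    return torch.where(legacy_mask, torch.full_like(input_ids, audio_id), input_ids)


ids = torch.tensor([[1, -10001, 42, -20000]])
normalized = normalize_audio_ids(ids, LEGACY_AUDIO_ID_RANGE, AUDIO_ID)
```

The diff achieves the same result by cloning `input_ids` and assigning through a boolean mask; `torch.where` is just a compact way to show the mapping.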
+########################################################################################################################
+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+#           This file was automatically generated from src/transformers/models/qwen2/modular_qwen2.py.
+#               Do NOT edit this file manually as any edits will be overwritten by the generation of
+#             the file from the modular. If any change should be done, please apply the change to the
+#                          modular_qwen2.py file directly. One of our CI enforces this.
+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+from typing import Callable, List, Optional, Tuple, Union
+import torch
+from torch import nn
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache, SlidingWindowCache, StaticCache
+from transformers.generation import GenerationMixin
+from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPast,
+    CausalLMOutputWithPast,
+    QuestionAnsweringModelOutput,
+    SequenceClassifierOutputWithPast,
+    TokenClassifierOutput,
+)
+from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from transformers.processing_utils import Unpack
+from transformers.utils import (
+    LossKwargs,
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+    replace_return_docstrings,
+)
+from transformers.utils.deprecation import deprecate_kwarg
+from .configuration_qwen2mm import Qwen2MMConfig
+####################################################################
+logger = logging.get_logger(__name__)
+_CHECKPOINT_FOR_DOC = "meta-qwen2/Qwen2-2-7b-hf"
+_CONFIG_FOR_DOC = "Qwen2MMConfig"
+class Qwen2MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = ACT2FN[config.hidden_act]
+    def forward(self, x):
+        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
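
Reviewer note: the rotation performed by `rotate_half` and `apply_rotary_pos_emb` can be checked without PyTorch. The sketch below is a dependency-free analogue for a single head vector, not the code in this file. The helpers `rope` and `dot` and the `base=10000.0` default are assumptions made for illustration (the real base comes from the model config). It illustrates that the attention score depends only on the offset between the query and key positions.

```python
import math


def rotate_half(x):
    """List analogue of rotate_half: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(x, position, base=10000.0):
    """Rotate one head vector for one position (hypothetical helper)."""
    half = len(x) // 2
    # One frequency per (i, i + half) pair, duplicated like torch.cat((freqs, freqs)).
    inv_freq = [base ** (-2.0 * i / len(x)) for i in range(half)]
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Both pairs have the same offset (3) at different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
```

The two scores agree up to floating-point error, and the rotation leaves the vector norm unchanged.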
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
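
Reviewer note: the head ordering produced by `repeat_kv` matters for grouped-query attention, because each query head must line up with the key/value head it shares. The following is a minimal sketch on plain lists, assuming heads are represented by labels instead of tensors; `repeat_kv_heads` is a hypothetical name, not part of this file.

```python
def repeat_kv_heads(heads, n_rep):
    """Repeat each key/value head n_rep times, keeping the copies adjacent."""
    if n_rep == 1:
        return heads
    return [head for head in heads for _ in range(n_rep)]


kv_heads = ["kv0", "kv1"]
n_rep = 3  # num_attention_heads // num_key_value_heads
expanded = repeat_kv_heads(kv_heads, n_rep)

# Query head h ends up paired with key/value head h // n_rep.
pairing = [(h, expanded[h]) for h in range(len(expanded))]
```

This interleaved order (`kv0, kv0, kv0, kv1, kv1, kv1`) is what the docstring means by equivalence with `torch.repeat_interleave` along the head dimension.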
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor],
+    scaling: float,
+    dropout: float = 0.0,
+    **kwargs,
+):
+    key_states = repeat_kv(key, module.num_key_value_groups)
+    value_states = repeat_kv(value, module.num_key_value_groups)
+    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+    if attention_mask is not None:
+        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+        attn_weights = attn_weights + causal_mask
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+    attn_output = torch.matmul(attn_weights, value_states)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+    return attn_output, attn_weights
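
Reviewer note: `eager_attention_forward` scales the query-key scores, adds the causal mask, and applies a softmax. The sketch below reproduces that flow for a single head in plain Python so the effect of the additive mask is easy to inspect. It is an illustration under simplifying assumptions (no batch dimension, no dropout, and `NEG` standing in for `torch.finfo(dtype).min`); the function names are hypothetical.

```python
import math

NEG = -1e30  # stand-in for the large negative value used to mask positions


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(query, key, value, mask, scaling):
    """query/key/value are lists of vectors; mask is additive with shape [q_len][kv_len]."""
    outputs, weights = [], []
    for i, q in enumerate(query):
        scores = [
            scaling * sum(a * b for a, b in zip(q, k)) + mask[i][j]
            for j, k in enumerate(key)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * value[j][d] for j in range(len(value))) for d in range(len(value[0]))]
        )
    return outputs, weights


seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Position i may attend to positions j <= i only.
causal = [[0.0 if j <= i else NEG for j in range(3)] for i in range(3)]
out, weights = single_head_attention(seq, seq, seq, causal, scaling=2 ** -0.5)
```

Masked positions receive exactly zero weight here, so the first token attends only to itself.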
+class Qwen2Attention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: Qwen2MMConfig, layer_idx: int):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+        self.scaling = self.head_dim**-0.5
+        self.attention_dropout = config.attention_dropout
+        self.is_causal = True
+        self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=True)
+        self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True)
+        self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=True)
+        self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=False)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_value: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        sliding_window = None
+        if (
+            self.config.use_sliding_window
+            and getattr(self.config, "sliding_window", None) is not None
+            and self.layer_idx >= self.config.max_window_layers
+        ):
+            sliding_window = self.config.sliding_window
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
+                logger.warning_once(
+                    "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
+                    'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+                )
+            else:
+                attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+        attn_output, attn_weights = attention_interface(
+            self,
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            dropout=0.0 if not self.training else self.attention_dropout,
+            scaling=self.scaling,
+            sliding_window=sliding_window,  # main diff with Llama
+            **kwargs,
+        )
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+class Qwen2RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        Qwen2RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
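
Reviewer note: `Qwen2RMSNorm` rescales each hidden vector by the reciprocal of its root mean square and then applies a learned per-channel weight. A minimal sketch on a plain list follows, assuming a single vector and skipping the float32 upcast done in the module; `rms_norm` is a hypothetical helper.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x so its mean square is about 1, then apply the per-channel weight."""
    variance = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * v * scale for w, v in zip(weight, x)]


x = [3.0, -4.0, 0.0, 0.0]
y = rms_norm(x, weight=[1.0] * len(x))
mean_square = sum(v * v for v in y) / len(y)
```

Unlike LayerNorm, no mean is subtracted and there is no bias term, so zeros in the input stay zero.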
+class Qwen2DecoderLayer(nn.Module):
+    def __init__(self, config: Qwen2MMConfig, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = Qwen2Attention(config=config, layer_idx=layer_idx)
+        self.mlp = Qwen2MLP(config)
+        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        if config.sliding_window and config._attn_implementation != "flash_attention_2":
+            logger.warning_once(
+                f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
+                "unexpected results may be encountered."
+            )
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        # Self Attention
+        hidden_states, self_attn_weights = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        outputs = (hidden_states,)
+        if output_attentions:
+            outputs += (self_attn_weights,)
+        return outputs
+class Qwen2RotaryEmbedding(nn.Module):
+    def __init__(self, config: Qwen2MMConfig, device=None):
+        super().__init__()
+        # BC: "rope_type" was originally "type"
+        if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+        else:
+            self.rope_type = "default"
+        self.max_seq_len_cached = config.max_position_embeddings
+        self.original_max_seq_len = config.max_position_embeddings
+        self.config = config
+        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.original_inv_freq = self.inv_freq
+    def _dynamic_frequency_update(self, position_ids, device):
+        """
+        dynamic RoPE layers should recompute `inv_freq` in the following situations:
+        1 - growing beyond the cached sequence length (allow scaling)
+        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
+        """
+        seq_len = torch.max(position_ids) + 1
+        if seq_len > self.max_seq_len_cached:  # growth
+            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
+            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
+            self.max_seq_len_cached = seq_len
+        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
+            # This .to() is needed if the model has been moved to a device after being initialized (because
+            # the buffer is automatically moved, but not the original copy)
+            self.original_inv_freq = self.original_inv_freq.to(device)
+            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
+            self.max_seq_len_cached = self.original_max_seq_len
+    @torch.no_grad()
+    def forward(self, x, position_ids):
+        if "dynamic" in self.rope_type:
+            self._dynamic_frequency_update(position_ids, device=x.device)
+        # Core RoPE block
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
+        device_type = x.device.type
+        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos()
+            sin = emb.sin()
+        # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
+        cos = cos * self.attention_scaling
+        sin = sin * self.attention_scaling
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+QWEN2_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+    Parameters:
+        config ([`Qwen2MMConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+@add_start_docstrings(
+    "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
+    QWEN2_START_DOCSTRING,
+)
+class Qwen2PreTrainedModel(PreTrainedModel):
+    config_class = Qwen2MMConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["Qwen2DecoderLayer"]
+    _skip_keys_device_placement = ["past_key_values"]
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _supports_cache_class = True
+    _supports_quantized_cache = True
+    _supports_static_cache = True
+    _supports_attention_backend = True
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+QWEN2_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+            it.
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+            [What are input IDs?](../glossary#input-ids)
+        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+            [What are attention masks?](../glossary#attention-mask)
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+            If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+            `past_key_values`).
+            If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+            and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+            information on the default strategy.
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
+            config.n_positions - 1]`.
+            [What are position IDs?](../glossary#position-ids)
+        past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
+            Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+            blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
+            returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
+            Two formats are allowed:
+            - a [`~cache_utils.Cache`] instance, see our
+            [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);
+            - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
+            shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
+            cache format.
+            The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
+            legacy cache format will be returned.
+            If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
+            have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
+            of shape `(batch_size, sequence_length)`.
+        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+            model's internal embedding lookup matrix.
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+        cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+            Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
+            this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
+            the complete sequence length.
+"""
+@add_start_docstrings(
+    "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
+    QWEN2_START_DOCSTRING,
+)
+class Qwen2MMModel(Qwen2PreTrainedModel):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]
+    Args:
+        config: Qwen2MMConfig
+    """
+    def __init__(self, config: Qwen2MMConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+######QWEN#################
+        self.embed_tokens_extend = None
+        if isinstance(config.embd_layer, dict):
+            embedding_config = {
+                'embedding_cls': config.embd_layer['embedding_cls'],
+                **config.embd_layer
+            }
+            self.embed_tokens_extend = Phi4MMImageAudioEmbedding(config, **embedding_config)
+        self._attn_implementation = config._attn_implementation
+############################
+        self.layers = nn.ModuleList(
+            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = Qwen2RotaryEmbedding(config=config)
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.embed_tokens
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
+    def forward(
+        # self,
+        # input_ids: torch.LongTensor = None,
+        # attention_mask: Optional[torch.Tensor] = None,
+        # position_ids: Optional[torch.LongTensor] = None,
+        # past_key_values: Optional[Cache] = None,
+        # inputs_embeds: Optional[torch.FloatTensor] = None,
+        # use_cache: Optional[bool] = None,
+        # output_attentions: Optional[bool] = None,
+        # output_hidden_states: Optional[bool] = None,
+        # return_dict: Optional[bool] = None,
+        # cache_position: Optional[torch.LongTensor] = None,
+########QWEN############
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        input_image_embeds: Optional[torch.FloatTensor] = None,
+        image_sizes: Optional[torch.LongTensor] = None,
+        image_attention_mask=None,
+        input_audio_embeds: Optional[torch.FloatTensor] = None,
+        audio_embed_sizes=None,
+        audio_attention_mask=None,
+        audio_projection_mode=None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+##########################
+        **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+        if self.gradient_checkpointing and self.training and use_cache:
+            logger.warning_once(
+                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+            )
+            use_cache = False
+        # if inputs_embeds is None:
+        #     inputs_embeds = self.embed_tokens(input_ids)
+############QWEN###########
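+        if inputs_embeds is None:
+            if self.embed_tokens_extend is None:
+                # No multimodal embedding layer was built (`config.embd_layer` is not a dict),
+                # so fall back to the plain token embedding lookup.
+                inputs_embeds = self.embed_tokens(input_ids)
+            else:
+                # Embed the tokens through the image/audio-aware embedding layer.
+                inputs_embeds = self.embed_tokens_extend(
+                    input_ids=input_ids,
+                    input_embeds=inputs_embeds,
+                    input_image_embeds=input_image_embeds,
+                    input_audio_embeds=input_audio_embeds,
+                    image_sizes=image_sizes,
+                    image_attention_mask=image_attention_mask,
+                    audio_embed_sizes=audio_embed_sizes,
+                    audio_attention_mask=audio_attention_mask,
+                    audio_projection_mode=audio_projection_mode,
+                    wte=self.embed_tokens,
+                )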
+###########################
+        if use_cache and past_key_values is None:
+            past_key_values = DynamicCache()
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+        causal_mask = self._update_causal_mask(
+            attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+        )
+        hidden_states = inputs_embeds
+        # create position embeddings to be shared across the decoder layers
+        position_embeddings = self.rotary_emb(hidden_states, position_ids)
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
+                    hidden_states,
+                    causal_mask,
+                    position_ids,
+                    past_key_values,
+                    output_attentions,
+                    use_cache,
+                    cache_position,
+                    position_embeddings,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=causal_mask,
+                    position_ids=position_ids,
+                    past_key_value=past_key_values,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                    cache_position=cache_position,
+                    position_embeddings=position_embeddings,
+                    **flash_attn_kwargs,
+                )
+            hidden_states = layer_outputs[0]
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+        hidden_states = self.norm(hidden_states)
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+        output = BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values if use_cache else None,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+        return output if return_dict else output.to_tuple()
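
Reviewer note: when `cache_position` and `position_ids` are not passed, `forward` derives them from the number of tokens already in the cache. The sketch below mirrors that bookkeeping with plain lists for a prefill step followed by one decode step; `cache_positions` is a hypothetical helper and the token counts are made up for illustration.

```python
def cache_positions(past_seen_tokens, new_tokens):
    """List analogue of torch.arange(past_seen_tokens, past_seen_tokens + new_tokens)."""
    return list(range(past_seen_tokens, past_seen_tokens + new_tokens))


# Prefill: empty cache, five prompt tokens.
prefill = cache_positions(0, 5)

# Decode: the cache now holds five tokens and one new token arrives.
decode_step = cache_positions(len(prefill), 1)

# Default position_ids add a leading batch dimension, like cache_position.unsqueeze(0).
position_ids = [decode_step]
```

With this default, the single batch row of `position_ids` is shared by every sequence in the batch, so left-padded batches generally need explicit `position_ids`.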
+    def _update_causal_mask(
+        self,
+        attention_mask: torch.Tensor,
+        input_tensor: torch.Tensor,
+        cache_position: torch.Tensor,
+        past_key_values: Cache,
+        output_attentions: bool,
+    ):
+        if self.config._attn_implementation == "flash_attention_2":
+            if attention_mask is not None and past_key_values is not None:
+                is_padding_right = attention_mask[:, -1].sum().item() != input_tensor.size()[0]
+                if is_padding_right:
+                    raise ValueError(
+                        "You are attempting to perform batched generation with padding_side='right'; "
+                        "this may lead to unexpected behaviour for the Flash Attention version of Qwen2. Make sure to "
+                        "call `tokenizer.padding_side = 'left'` before tokenizing the input."
+                    )
+            if attention_mask is not None and 0.0 in attention_mask:
+                return attention_mask
+            return None
+        # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
+        # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
+        # to infer the attention mask.
+        past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+        using_static_cache = isinstance(past_key_values, StaticCache)
+        using_sliding_window_cache = isinstance(past_key_values, SlidingWindowCache)
+        # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
+        if (
+            self.config._attn_implementation == "sdpa"
+            and not (using_static_cache or using_sliding_window_cache)
+            and not output_attentions
+        ):
+            if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                attention_mask,
+                inputs_embeds=input_tensor,
+                past_key_values_length=past_seen_tokens,
+                sliding_window=self.config.sliding_window,
+                is_training=self.training,
+            ):
+                return None
+        dtype, device = input_tensor.dtype, input_tensor.device
+        min_dtype = torch.finfo(dtype).min
+        sequence_length = input_tensor.shape[1]
+        # SlidingWindowCache or StaticCache
+        if using_sliding_window_cache or using_static_cache:
+            target_length = past_key_values.get_max_cache_shape()
+        # DynamicCache or no cache
+        else:
+            target_length = (
+                attention_mask.shape[-1]
+                if isinstance(attention_mask, torch.Tensor)
+                else past_seen_tokens + sequence_length + 1
+            )
+        # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
+        causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
+            attention_mask,
+            sequence_length=sequence_length,
+            target_length=target_length,
+            dtype=dtype,
+            device=device,
+            cache_position=cache_position,
+            batch_size=input_tensor.shape[0],
+            config=self.config,
+            past_key_values=past_key_values,
+        )
+        if (
+            self.config._attn_implementation == "sdpa"
+            and attention_mask is not None
+            and attention_mask.device.type in ["cuda", "xpu"]
+            and not output_attentions
+        ):
+            # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+            # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+            # Details: https://github.com/pytorch/pytorch/issues/110213
+            causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+        return causal_mask
+    @staticmethod
+    def _prepare_4d_causal_attention_mask_with_cache_position(
+        attention_mask: torch.Tensor,
+        sequence_length: int,
+        target_length: int,
+        dtype: torch.dtype,
+        device: torch.device,
+        cache_position: torch.Tensor,
+        batch_size: int,
+        config: Qwen2MMConfig,
+        past_key_values: Cache,
+    ):
+        """
+        Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+        `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
+        Args:
+            attention_mask (`torch.Tensor`):
+                A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
+            sequence_length (`int`):
+                The sequence length being processed.
+            target_length (`int`):
+                The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
+            dtype (`torch.dtype`):
+                The dtype to use for the 4D attention mask.
+            device (`torch.device`):
+                The device to place the 4D attention mask on.
+            cache_position (`torch.Tensor`):
+                Indices depicting the position of the input sequence tokens in the sequence.
+            batch_size (`int`):
+                Batch size.
+            config (`Qwen2MMConfig`):
+                The model's configuration class
+            past_key_values (`Cache`):
+                The cache class that is being used currently to generate
+        """
+        if attention_mask is not None and attention_mask.dim() == 4:
+            # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+            causal_mask = attention_mask
+        else:
+            min_dtype = torch.finfo(dtype).min
+            causal_mask = torch.full(
+                (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
+            )
+            diagonal_attend_mask = torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
+            if config.sliding_window is not None:
+                # if we have a sliding window, we should not attend to tokens beyond the sliding window length, so we
+                # mask them out as well
+                # the check is needed to verify whether the current checkpoint was trained with a sliding window
+                if not isinstance(past_key_values, SlidingWindowCache) or sequence_length > target_length:
+                    sliding_attend_mask = torch.arange(target_length, device=device) <= (
+                        cache_position.reshape(-1, 1) - config.sliding_window
+                    )
+                    diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
+            causal_mask *= diagonal_attend_mask
+            causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+            if attention_mask is not None:
+                causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+                if attention_mask.shape[-1] > target_length:
+                    attention_mask = attention_mask[:, :target_length]
+                mask_length = attention_mask.shape[-1]
+                padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
+                    causal_mask.device
+                )
+                padding_mask = padding_mask == 0
+                causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                    padding_mask, min_dtype
+                )
+        return causal_mask
+class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+class Qwen2MMForCausalLM(Qwen2PreTrainedModel, GenerationMixin):
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = Qwen2MMModel(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def set_decoder(self, decoder):
+        self.model = decoder
+    def get_decoder(self):
+        return self.model
+    @deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
+    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        # self,
+        # input_ids: torch.LongTensor = None,
+        # attention_mask: Optional[torch.Tensor] = None,
+        # position_ids: Optional[torch.LongTensor] = None,
+        # past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
+        # inputs_embeds: Optional[torch.FloatTensor] = None,
+        # labels: Optional[torch.LongTensor] = None,
+        # use_cache: Optional[bool] = None,
+        # output_attentions: Optional[bool] = None,
+        # output_hidden_states: Optional[bool] = None,
+        # return_dict: Optional[bool] = None,
+        # cache_position: Optional[torch.LongTensor] = None,
+        # logits_to_keep: Union[int, torch.Tensor] = 0,
+    ######QWEN###############
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        input_image_embeds: Optional[torch.FloatTensor] = None,
+        image_sizes: Optional[torch.LongTensor] = None,
+        image_attention_mask=None,
+        input_audio_embeds: Optional[torch.FloatTensor] = None,
+        audio_embed_sizes=None,
+        audio_attention_mask=None,
+        input_mode=None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        num_logits_to_keep: int = 0,
+ ####################################
+        **kwargs: Unpack[KwargsForCausalLM],
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+            num_logits_to_keep (`int` or `torch.Tensor`, *optional*):
+                If an `int`, compute logits for the last `num_logits_to_keep` tokens. If `0`, calculate logits for all
+                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+                If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
+                This is useful when using packed tensor format (single dimension for batch and sequence length).
+        Returns:
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, Qwen2ForCausalLM
+        >>> model = Qwen2ForCausalLM.from_pretrained("meta-qwen2/Qwen2-2-7b-hf")
+        >>> tokenizer = AutoTokenizer.from_pretrained("meta-qwen2/Qwen2-2-7b-hf")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+###########QWEN##########
+        if isinstance(input_mode, torch.Tensor):
+            # len(input_mode) == num_beams in beam search, and all elements of input_mode should have the same value
+            input_mode = input_mode[0].item()
+        input_mode = InputMode(input_mode)
+        if input_mode in [InputMode.VISION_SPEECH, InputMode.VISION]:
+            # self.set_lora_adapter('vision')
+            audio_projection_mode = 'vision'
+        elif input_mode == InputMode.SPEECH:
+            # self.set_lora_adapter('speech')
+            audio_projection_mode = 'speech'
+        elif input_mode == InputMode.LANGUAGE:
+            # self.unset_lora_adapter()
+            audio_projection_mode = 'speech'
+        else:
+            raise ValueError(f"Invalid input_mode: {input_mode}")
+##################################
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            input_image_embeds=input_image_embeds,
+            image_sizes=image_sizes,
+            image_attention_mask=image_attention_mask,
+            input_audio_embeds=input_audio_embeds,
+            audio_embed_sizes=audio_embed_sizes,
+            audio_attention_mask=audio_attention_mask,
+            audio_projection_mode=audio_projection_mode,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            **kwargs,
+        )
+        hidden_states = outputs[0]
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-num_logits_to_keep, None) if isinstance(num_logits_to_keep, int) else num_logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+    def prepare_inputs_for_generation(
+        self,
+        input_ids,
+        past_key_values=None,
+        attention_mask=None,
+        inputs_embeds=None,
+        input_image_embeds=None,
+        image_sizes=None,
+        image_attention_mask=None,
+        input_audio_embeds=None,
+        audio_embed_sizes=None,
+        audio_attention_mask=None,
+        input_mode=None,
+        cache_position=None,
+        position_ids=None,
+        use_cache=True,
+        num_logits_to_keep=None,
+        **kwargs
+    ):
+        # Overwritten -- this model may need to switch between short and long rope, invalidating the cache in the
+        # process.
+        # The first time the input length reaches the switching point between the short and long factors, force the
+        # cache to be recomputed. This makes that single token position slower, but it avoids the failure that
+        # would otherwise occur.
+        if (
+            past_key_values
+            and self.config.rope_scaling
+            and input_ids.shape[1] >= self.config.original_max_position_embeddings + 1
+        ):
+            past_length = cache_position[0]
+            if past_length <= self.config.original_max_position_embeddings:
+                past_key_values = None
+        model_inputs = super().prepare_inputs_for_generation(
+            input_ids=input_ids,
+            past_key_values=past_key_values,
+            attention_mask=attention_mask,
+            inputs_embeds=inputs_embeds,
+            input_image_embeds=input_image_embeds,
+            image_sizes=image_sizes,
+            image_attention_mask=image_attention_mask,
+            input_audio_embeds=input_audio_embeds,
+            audio_embed_sizes=audio_embed_sizes,
+            audio_attention_mask=audio_attention_mask,
+            input_mode=input_mode,
+            cache_position=cache_position,
+            position_ids=position_ids,
+            use_cache=use_cache,
+            num_logits_to_keep=num_logits_to_keep,
+            **kwargs,
+        )
+        return model_inputs
+#######################################################################################################
+AutoConfig.register("qwen2-mm", Qwen2MMConfig)
+AutoModelForCausalLM.register(Qwen2MMConfig, Qwen2MMForCausalLM)
+Qwen2MMConfig.register_for_auto_class()
+Qwen2MMForCausalLM.register_for_auto_class("AutoModelForCausalLM")
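
The mask built in `_prepare_4d_causal_attention_mask_with_cache_position` above combines two rules: a key position is blocked if it lies in the future of the query, or if it has fallen out of the sliding window. The following is a minimal pure-Python sketch of that boolean rule only. `blocked_positions` is a hypothetical helper introduced here for illustration and is not part of the upload. It leaves out the dtype fill value, batching, padding handling, and the `SlidingWindowCache` check that the real method performs.

```python
def blocked_positions(cache_position, target_length, sliding_window=None):
    """Return one row per query position; True marks a key position that is masked out.

    Illustrative sketch only (hypothetical helper, not part of the model code).
    """
    rows = []
    for q in cache_position:
        row = []
        for k in range(target_length):
            # Causal rule: keys after the query position are blocked.
            blocked = k > q
            if sliding_window is not None:
                # Sliding-window rule: keys at or before q - sliding_window are blocked too.
                blocked = blocked or k <= q - sliding_window
            row.append(blocked)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in blocked_positions([0, 1, 2, 3], target_length=4, sliding_window=2):
        print(row)
```

With four query positions and a window of 2, the last query can only see key positions 2 and 3. In the tensor version, blocked entries hold `torch.finfo(dtype).min` and the remaining entries hold 0.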

processing_phi4mm.py ADDED

	@@ -0,0 +1,744 @@

+# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Processor class for Phi4MM
+"""
+import re
+from typing import List, Optional, Tuple, Union
+import math
+from enum import Enum
+import numpy as np
+import scipy
+import torch
+import torchvision
+from transformers import AutoFeatureExtractor, AutoImageProcessor
+from transformers.feature_extraction_sequence_utils import SequenceFeatureExtractor
+from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
+from transformers.image_utils import (
+    ImageInput,
+    make_list_of_images,
+    valid_images,
+)
+from transformers.processing_utils import ProcessorMixin
+from transformers.tokenization_utils_base import PaddingStrategy, TextInput, TruncationStrategy
+from transformers.utils import TensorType, logging
+from torch.nn.utils.rnn import pad_sequence
+logger = logging.get_logger(__name__)
+# Special tokens
+_COMPATIBLE_IMAGE_SPECIAL_TOKEN_PATTERN = r'<\|im_start_im+\|>'  # For backward compatibility
+_COMPATIBLE_AUDIO_SPECIAL_TOKEN_PATTERN = r'<\|im_start+\|>'  # For backward compatibility
+_IMAGE_SPECIAL_TOKEN = '<|im_start_im|>'
+_AUDIO_SPECIAL_TOKEN = '<|im_start|>'
+_IMAGE_SPECIAL_TOKEN_ID = 1516444  # '<|endoftext10|>', or we can better name it (in `tokenizer_config.json`)
+_AUDIO_SPECIAL_TOKEN_ID = 151644  # '<|endoftext11|>'
+class InputMode(Enum):
+    LANGUAGE = 0
+    VISION = 1
+    SPEECH = 2
+    VISION_SPEECH = 3
+class Phi4MMImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs a Phi4MM image processor.
+    """
+    model_input_names = ["input_image_embeds", "image_sizes", "image_attention_mask"]
+    def __init__(
+        self,
+        dynamic_hd,
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        self.dynamic_hd = dynamic_hd
+    def find_closest_aspect_ratio(self, aspect_ratio, target_ratios, width, height, image_size):
+        best_ratio_diff = float('inf')
+        best_ratio = (1, 1)
+        area = width * height
+        for ratio in target_ratios:
+            target_aspect_ratio = ratio[0] / ratio[1]
+            ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+            if ratio_diff < best_ratio_diff:
+                best_ratio_diff = ratio_diff
+                best_ratio = ratio
+            elif ratio_diff == best_ratio_diff:
+                if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                    best_ratio = ratio
+        return best_ratio
+    def dynamic_preprocess(self, image, min_num=1, max_num=12, image_size=384, mask_size=27, use_thumbnail=True):
+        orig_width, orig_height = image.size
+        w_crop_num = math.ceil(orig_width/float(image_size))
+        h_crop_num = math.ceil(orig_height/float(image_size))
+        if w_crop_num * h_crop_num > max_num:
+            # calculate the existing image aspect ratio
+            aspect_ratio = orig_width / orig_height
+            target_ratios = set(
+                (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+                i * j <= max_num and i * j >= min_num)
+            target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+            # find the closest aspect ratio to the target
+            target_aspect_ratio = self.find_closest_aspect_ratio(
+                aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+            # calculate the target width and height
+            target_width = image_size * target_aspect_ratio[0]
+            target_height = image_size * target_aspect_ratio[1]
+        else:
+            target_width = image_size * w_crop_num
+            target_height = image_size * h_crop_num
+            target_aspect_ratio = (w_crop_num, h_crop_num)
+        # Calculate the ratio
+        ratio_width = target_width / orig_width
+        ratio_height = target_height / orig_height
+        if ratio_width < ratio_height:
+            new_size = (target_width, int(orig_height * ratio_width))
+            padding_width = 0
+            padding_height = target_height - int(orig_height * ratio_width)
+        else:
+            new_size = (int(orig_width * ratio_height), target_height)
+            padding_width = target_width - int(orig_width * ratio_height)
+            padding_height = 0
+        attention_mask = torch.ones((int(mask_size*target_aspect_ratio[1]), int(mask_size*target_aspect_ratio[0])))
+        if padding_width >= 14:
+            attention_mask[:, -math.floor(padding_width/14):] = 0
+        if padding_height >= 14:
+            attention_mask[-math.floor(padding_height/14):,:] = 0
+        assert attention_mask.sum() > 0
+        if min(new_size[1], target_height) < 10 or min(new_size[0], target_width) < 10:
+            raise ValueError(f'the aspect ratio is very extreme {new_size}')
+        image = torchvision.transforms.functional.resize(image, [new_size[1], new_size[0]],)
+        resized_img = torchvision.transforms.functional.pad(image, [0, 0, padding_width, padding_height], fill=[255,255,255])
+        return resized_img, attention_mask
+    def pad_to_max_num_crops(self, images, max_crops=5):
+        """
+        images: B x 3 x H x W, B<=max_crops
+        """
+        B, _, H, W = images.shape
+        if B < max_crops:
+            pad = torch.zeros(max_crops - B, 3, H, W, dtype=images.dtype, device=images.device)
+            images = torch.cat([images, pad], dim=0)
+        return images
+    def pad_mask_to_max_num_crops(self, masks, max_crops=5):
+        B, H, W = masks.shape
+        if B < max_crops:
+            pad = torch.ones(max_crops - B, H, W, dtype=masks.dtype, device=masks.device)
+            masks = torch.cat([masks, pad], dim=0)
+        return masks
+    def preprocess(
+        self,
+        images: ImageInput,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+    ):
+        """
+        Args:
+            images (`ImageInput`):
+                Image to preprocess. Expects a single image or a batch of images. Each image is converted to RGB,
+                scaled to `[0, 1]` by `ToTensor`, and then normalized with mean 0.5 and std 0.5 per channel.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                - Unset: Return a list of `np.ndarray`.
+                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+        """
+        images = make_list_of_images(images)
+        if not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
+        # Basic settings.
+        img_processor = torchvision.transforms.Compose([
+            torchvision.transforms.ToTensor(),
+            torchvision.transforms.Normalize(
+                (0.5, 0.5, 0.5),
+                (0.5, 0.5, 0.5)
+            ),
+        ])
+        dyhd_base_resolution = 448
+        # Dynamic HD
+        base_resolution = dyhd_base_resolution
+        images = [image.convert('RGB') for image in images]
+        # cover 384 and 448 resolution
+        mask_resolution = base_resolution // 14
+        elems, image_attention_masks = [], []
+        for im in images:
+            elem, attention_mask = self.dynamic_preprocess(im, max_num=self.dynamic_hd, image_size=base_resolution, mask_size=mask_resolution)
+            elems.append(elem)
+            image_attention_masks.append(attention_mask)
+        hd_images = [img_processor(im) for im in elems]
+        global_image = [torch.nn.functional.interpolate(im.unsqueeze(0).float(), size=(base_resolution, base_resolution), mode='bicubic',).to(im.dtype) for im in hd_images]
+        shapes = [[im.size(1), im.size(2)] for im in hd_images]
+        mask_shapes = [[mask.size(0), mask.size(1)] for mask in image_attention_masks]
+        global_attention_mask = [torch.ones((1, mask_resolution, mask_resolution)) for _ in hd_images]
+        hd_images_reshape = [im.reshape(1, 3,
+                                            h//base_resolution,
+                                            base_resolution,
+                                            w//base_resolution,
+                                            base_resolution
+                                            ).permute(0,2,4,1,3,5).reshape(-1, 3, base_resolution, base_resolution).contiguous() for im, (h, w) in zip(hd_images, shapes)]
+        attention_masks_reshape = [mask.reshape(1,
+                                            h//mask_resolution,
+                                            mask_resolution,
+                                            w//mask_resolution,
+                                            mask_resolution
+                                            ).permute(0,1,3,2,4).reshape(-1, mask_resolution, mask_resolution).contiguous() for mask, (h, w) in zip(image_attention_masks, mask_shapes)]
+        downsample_attention_masks = [mask[:,0::2,0::2].reshape(1,
+                                            h//mask_resolution,
+                                            w//mask_resolution,
+                                            mask_resolution//2+mask_resolution%2,
+                                            mask_resolution//2+mask_resolution%2
+                                            ).permute(0,1,3,2,4) for mask, (h,w) in zip(attention_masks_reshape, mask_shapes)]
+        downsample_attention_masks = [mask.reshape(mask.size(1)*mask.size(2), mask.size(3)*mask.size(4)) for mask in downsample_attention_masks]
+        num_img_tokens = [256 + 1 + int(mask.sum().item()) + int(mask[:,0].sum().item()) + 16 for mask in downsample_attention_masks]
+        hd_images_reshape = [torch.cat([_global_image] + [_im], dim=0) for _global_image, _im in zip(global_image, hd_images_reshape)]
+        hd_masks_reshape = [torch.cat([_global_mask] + [_mask], dim=0) for _global_mask, _mask in zip(global_attention_mask, attention_masks_reshape)]
+        max_crops = max([img.size(0) for img in hd_images_reshape])
+        image_transformed = [self.pad_to_max_num_crops(im, max_crops) for im in hd_images_reshape]
+        image_transformed = torch.stack(image_transformed, dim=0)
+        mask_transformed = [self.pad_mask_to_max_num_crops(mask, max_crops) for mask in hd_masks_reshape]
+        mask_transformed = torch.stack(mask_transformed, dim=0)
+        returned_input_image_embeds = image_transformed
+        returned_image_sizes = torch.tensor(shapes, dtype=torch.long)
+        returned_image_attention_mask = mask_transformed
+        returned_num_img_tokens = num_img_tokens
+        data = {
+            "input_image_embeds": returned_input_image_embeds,
+            "image_sizes": returned_image_sizes,
+            "image_attention_mask": returned_image_attention_mask,
+            "num_img_tokens": returned_num_img_tokens,
+        }
+        return BatchFeature(data=data, tensor_type=return_tensors)
+AudioInput = Tuple[Union[np.ndarray, torch.Tensor], int]
+AudioInputs = List[AudioInput]
+def speechlib_mel(sample_rate, n_fft, n_mels, fmin=None, fmax=None):
+    """Create a Mel filter-bank the same as SpeechLib FbankFC.
+    Args:
+        sample_rate (int): Sample rate in Hz. number > 0 [scalar]
+        n_fft (int): FFT size. int > 0 [scalar]
+        n_mels (int): Mel filter size. int > 0 [scalar]
+        fmin (float): lowest frequency (in Hz). If None use 0.0.
+            float >= 0 [scalar]
+        fmax (float): highest frequency (in Hz). If None use sample_rate / 2.
+            float >= 0 [scalar]
+    Returns:
+        out (numpy.ndarray): Mel transform matrix
+            [shape=(n_mels, 1 + n_fft/2)]
+    """
+    bank_width = int(n_fft // 2 + 1)
+    if fmax is None:
+        fmax = sample_rate / 2
+    if fmin is None:
+        fmin = 0
+    assert fmin >= 0, "fmin cannot be negative"
+    assert fmin < fmax <= sample_rate / 2, "fmax must be in (fmin, sample_rate / 2]"
+    def mel(f):
+        return 1127.0 * np.log(1.0 + f / 700.0)
+    def bin2mel(fft_bin):
+        return 1127.0 * np.log(1.0 + fft_bin * sample_rate / (n_fft * 700.0))
+    def f2bin(f):
+        return int((f * n_fft / sample_rate) + 0.5)
+    # Spec 1: FFT bin range [f2bin(fmin) + 1, f2bin(fmax) - 1]
+    klo = f2bin(fmin) + 1
+    khi = f2bin(fmax)
+    khi = max(khi, klo)
+    # Spec 2: SpeechLib uses triangles in Mel space
+    mlo = mel(fmin)
+    mhi = mel(fmax)
+    m_centers = np.linspace(mlo, mhi, n_mels + 2)
+    ms = (mhi - mlo) / (n_mels + 1)
+    matrix = np.zeros((n_mels, bank_width), dtype=np.float32)
+    for m in range(0, n_mels):
+        left = m_centers[m]
+        center = m_centers[m + 1]
+        right = m_centers[m + 2]
+        for fft_bin in range(klo, khi):
+            mbin = bin2mel(fft_bin)
+            if left < mbin < right:
+                matrix[m, fft_bin] = 1.0 - abs(center - mbin) / ms
+    return matrix
+class Phi4MMAudioFeatureExtractor(SequenceFeatureExtractor):
+    model_input_names = ["input_audio_embeds", "audio_embed_sizes", "audio_attention_mask"]
+    def __init__(self, audio_compression_rate, audio_downsample_rate, audio_feat_stride, **kwargs):
+        feature_size = 80
+        sampling_rate = 16000
+        padding_value = 0.0
+        super().__init__(feature_size, sampling_rate, padding_value, **kwargs)
+        self.compression_rate = audio_compression_rate
+        self.qformer_compression_rate = audio_downsample_rate
+        self.feat_stride = audio_feat_stride
+        self._eightk_method = "fillzero"
+        self._mel = speechlib_mel(16000, 512, 80, fmin=None, fmax=7690).T
+        self._hamming400 = np.hamming(400)  # for 16k audio
+        self._hamming200 = np.hamming(200)  # for 8k audio
+    def duration_to_frames(self, duration):
+        """Estimate the number of frames for a duration given in seconds, assuming a 10 ms frame shift."""
+        frame_rate = 10
+        num_frames = duration * 1000 // frame_rate
+        return num_frames
+    def __call__(
+        self,
+        audios: List[AudioInput],
+        return_tensors: Optional[Union[str, TensorType]] = None,
+    ):
+        # Ref: https://github.com/huggingface/transformers/blob/v4.47.0/src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py#L161
+        returned_input_audio_embeds = []
+        returned_audio_embed_sizes = []
+        audio_frames_list = []
+        for audio_data, sample_rate in audios:
+            audio_embeds = self._extract_features(audio_data, sample_rate)
+            audio_frames = len(audio_embeds) * self.feat_stride
+            audio_embed_size = self._compute_audio_embed_size(audio_frames)
+            returned_input_audio_embeds.append(torch.tensor(audio_embeds))
+            returned_audio_embed_sizes.append(torch.tensor(audio_embed_size).long())
+            audio_frames_list.append(audio_frames)
+        returned_input_audio_embeds = pad_sequence(
+            returned_input_audio_embeds, batch_first=True
+        )
+        returned_audio_embed_sizes = torch.stack(returned_audio_embed_sizes, dim=0)
+        audio_frames = torch.tensor(audio_frames_list)
+        returned_audio_attention_mask = torch.arange(0, audio_frames.max()).unsqueeze(0) < audio_frames.unsqueeze(1) if len(audios) > 1 else None
+        data = {
+            "input_audio_embeds": returned_input_audio_embeds,
+            "audio_embed_sizes": returned_audio_embed_sizes,
+        }
+        if returned_audio_attention_mask is not None:
+            data["audio_attention_mask"] = returned_audio_attention_mask
+        return BatchFeature(data=data, tensor_type=return_tensors)
+    def _extract_spectrogram(self, wav, fs):
+        """Extract a magnitude spectrogram from waveform.
+        Args:
+            wav (1D array): waveform of the input
+            fs (int): sampling rate of the waveform, 16000 or 8000.
+                If fs=8000 and the 8 kHz method is "resample", the waveform is
+                resampled to 16000Hz; otherwise the 4 to 8 kHz bins are zero-filled.
+        Output:
+            spec (2D array): a TxF matrix of STFT magnitudes.
+                F=257 (n_fft // 2 + 1 for n_fft=512), and T is the number of frames.
+        """
+        if wav.ndim > 1:
+            wav = np.squeeze(wav)
+        # by default, we extract the mean if stereo
+        if len(wav.shape) == 2:
+            wav = wav.mean(1)
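+        # Resample to 16000 or 8000 if needed. resample_poly takes the target and
+        # source rates as its up/down factors (it reduces them by their GCD), so
+        # non-integer ratios such as 44100 -> 16000 are handled correctly.
+        if fs > 16000:
+            wav = scipy.signal.resample_poly(wav, 16000, fs)
+            fs = 16000
+        elif 8000 < fs < 16000:
+            wav = scipy.signal.resample_poly(wav, 8000, fs)
+            fs = 8000
+        elif fs < 8000:
+            raise RuntimeError(f"Unsupported sample rate {fs}")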
+        if fs == 8000:
+            if self._eightk_method == "resample":
+                # Input audio is 8 kHz. Convert to 16 kHz before feature
+                # extraction
+                wav = scipy.signal.resample_poly(wav, 2, 1)
+                fs = 16000
+            # Do nothing here for fillzero method
+        elif fs != 16000:
+            # Input audio is not a supported sample rate.
+            raise RuntimeError(f"Input data using an unsupported sample rate: {fs}")
+        preemphasis = 0.97
+        if fs == 8000:
+            n_fft = 256
+            win_length = 200
+            hop_length = 80
+            fft_window = self._hamming200
+        elif fs == 16000:
+            n_fft = 512
+            win_length = 400
+            hop_length = 160
+            fft_window = self._hamming400
+        # Spec 1: SpeechLib cut remaining sample insufficient for a hop
+        n_batch = (wav.shape[0] - win_length) // hop_length + 1
+        # Here we don't use stride_tricks since the input array may not satisfy
+        # the memory layout requirement and we need writeable output.
+        # We only build a list of views before copying to the destination,
+        # so it is more efficient than broadcasting.
+        y_frames = np.array(
+            [wav[_stride : _stride + win_length] for _stride in range(0, hop_length * n_batch, hop_length)],
+            dtype=np.float32,
+        )
+        # Spec 2: SpeechLib applies preemphasis within each batch
+        y_frames_prev = np.roll(y_frames, 1, axis=1)
+        y_frames_prev[:, 0] = y_frames_prev[:, 1]
+        y_frames = (y_frames - preemphasis * y_frames_prev) * 32768
+        S = np.fft.rfft(fft_window * y_frames, n=n_fft, axis=1).astype(np.complex64)
+        if fs == 8000:
+            # Need to pad the output to look like 16 kHz data but with zeros in
+            # the 4 to 8 kHz bins.
+            frames, bins = S.shape
+            padarray = np.zeros((frames, bins))
+            S = np.concatenate((S[:, 0:-1], padarray), axis=1)  # Nyquist bin gets set to zero
+        spec = np.abs(S).astype(np.float32)
+        return spec
+    def _extract_features(self, wav, fs):
+        """Extract log filterbank features from waveform.
+        Args:
+            wav (1D array): waveform of the input
+            fs (int): sampling rate of the waveform, 16000 or 8000.
+                If fs=8000, the waveform will be resampled to 16000Hz.
+        Output:
+            log_fbank (2D array): a TxD matrix of log Mel filterbank features.
+                D=80, and T is the number of frames.
+        """
+        spec = self._extract_spectrogram(wav, fs)
+        spec_power = spec**2
+        fbank_power = np.clip(spec_power.dot(self._mel), 1.0, None)
+        log_fbank = np.log(fbank_power).astype(np.float32)
+        return log_fbank
+    def _compute_audio_embed_size(self, audio_frames):
+        integer = audio_frames // self.compression_rate
+        remainder = audio_frames % self.compression_rate
+        result = integer if remainder == 0 else integer + 1
+        integer = result // self.qformer_compression_rate
+        remainder = result % self.qformer_compression_rate
+        result = integer if remainder == 0 else integer + 1  # qformer compression
+        return result
+class Phi4MMProcessor(ProcessorMixin):
+    r"""
+    Constructs a Phi4MM processor which wraps an image processor, an audio processor, and a tokenizer into a single processor.
+    [`Phi4MMProcessor`] offers all the functionalities of [`Phi4MMImageProcessor`], [`Phi4MMAudioFeatureExtractor`] and the
+    tokenizer (`Qwen2Tokenizer`, per `tokenizer_class`). See the
+    [`~Phi4MMProcessor.__call__`] and [`~Phi4MMProcessor.decode`] for more information.
+    Args:
+        image_processor ([`Phi4MMImageProcessor`]):
+            The image processor is a required input.
+        audio_processor ([`Phi4MMAudioFeatureExtractor`]):
+            The audio processor is a required input.
+        tokenizer ([`Qwen2Tokenizer`]):
+            The tokenizer is a required input.
+    """
+    attributes = ["image_processor", "audio_processor", "tokenizer"]
+    tokenizer_class = "Qwen2Tokenizer"
+    image_processor_class = "AutoImageProcessor"  # Phi4MMImageProcessor will be registered later
+    audio_processor_class = "AutoFeatureExtractor"  # Phi4MMAudioFeatureExtractor will be registered later
+    def __init__(self, image_processor, audio_processor, tokenizer):
+        self.image_processor = image_processor
+        self.audio_processor = audio_processor
+        self.tokenizer = tokenizer
+    def __call__(
+        self,
+        text: Union[TextInput, List[TextInput]],
+        images: Optional[ImageInput] = None,
+        audios: Optional[AudioInputs] = None,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Optional[Union[bool, str, TruncationStrategy]] = None,
+        max_length=None,
+        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
+    ) -> BatchFeature:
+        """
+        Main method to prepare one or several sequence(s), image(s) and audio clip(s) for the model. This method forwards
+        the `text` to the tokenizer (`Qwen2Tokenizer`, per `tokenizer_class`) to encode it. To prepare the image(s), this
+        method forwards `images` to Phi4MMImageProcessor's [`~Phi4MMImageProcessor.__call__`] if `images` is not `None`;
+        likewise, `audios` is forwarded to the audio processor's `__call__` if `audios` is not `None`. Please refer to
+        the docstrings of these methods for more information.
+        Args:
+            text (`str`, `List[str]`, `List[List[str]]`):
+                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
+                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
+                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
+            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
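+                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
+                tensor. Both channels-first and channels-last formats are supported.
+            audios (`AudioInputs`, *optional*):
+                The audio clips to be prepared, given as a list of `(waveform, sample_rate)` pairs.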
+            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
+                Select a strategy to pad the returned sequences (according to the model's padding side and padding
+                index) among:
+                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+                  sequence is provided).
+                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
+                  acceptable input length for the model if that argument is not provided.
+                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
+                  lengths).
+            max_length (`int`, *optional*):
+                Maximum length of the returned list and optionally padding length (see above).
+            truncation (`bool`, *optional*):
+                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors of a particular framework. Acceptable values are:
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return NumPy `np.ndarray` objects.
+                - `'jax'`: Return JAX `jnp.ndarray` objects.
+        Returns:
+            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
+            - **input_ids** -- List of token ids to be fed to a model.
+            - **input_image_embeds** -- Pixel values to be fed to a model.
+            - **image_sizes** -- List of tuples specifying the size of each image in `input_image_embeds`.
+            - **image_attention_mask** -- List of attention masks for each image in `input_image_embeds`.
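+            - **input_audio_embeds** -- Audio embeddings to be fed to a model.
+            - **audio_embed_sizes** -- List of integers specifying the size of each audio in `input_audio_embeds`.
+            - **audio_attention_mask** -- Mask over audio frames; `None` unless more than one audio is passed.
+            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model.
+            - **input_mode** -- Tensor holding the `InputMode` value (language, vision, speech, or vision-speech).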
+        """
+        image_inputs = self.image_processor(images, return_tensors=return_tensors) if images is not None else {}
+        audio_inputs = self.audio_processor(audios, return_tensors=return_tensors) if audios is not None else {}
+        inputs = self._convert_images_audios_text_to_inputs(
+            image_inputs,
+            audio_inputs,
+            text,
+            padding=padding,
+            truncation=truncation,
+            max_length=max_length,
+            return_tensors=return_tensors,
+        )
+        # identify the input mode
+        if len(image_inputs) > 0 and len(audio_inputs) > 0:
+            input_mode = InputMode.VISION_SPEECH
+        elif len(image_inputs) > 0:
+            input_mode = InputMode.VISION
+        elif len(audio_inputs) > 0:
+            input_mode = InputMode.SPEECH
+        else:
+            input_mode = InputMode.LANGUAGE
+        inputs["input_mode"] = torch.tensor([input_mode.value], dtype=torch.long)
+        return inputs
+    @property
+    def special_image_token_id(self):
+        # `self.special_image_token` is never set on this class, so look up the
+        # module-level image placeholder string instead.
+        return self.tokenizer.convert_tokens_to_ids(_IMAGE_SPECIAL_TOKEN)
+    def get_special_image_token_id(self):
+        return self.special_image_token_id
+    @property
+    def chat_template(self):
+        return self.tokenizer.chat_template
+    def _convert_images_audios_text_to_inputs(
+        self, images, audios, text, padding=False, truncation=None, max_length=None, return_tensors=None
+    ):
+        # prepare image id to image input ids
+        if len(images) > 0:
+            input_image_embeds = images["input_image_embeds"]
+            image_sizes = images["image_sizes"]
+            image_attention_mask = images["image_attention_mask"]
+            num_img_tokens = images['num_img_tokens']
+        else:
+            input_image_embeds = torch.tensor([])
+            image_sizes = torch.tensor([])
+            image_attention_mask = torch.tensor([])
+            num_img_tokens = []
+        # prepare audio id to audio input ids
+        if len(audios) > 0:
+            input_audio_embeds = audios["input_audio_embeds"]
+            audio_embed_sizes = audios["audio_embed_sizes"]
+            audio_attention_mask = audios.get("audio_attention_mask", None)
+        else:
+            input_audio_embeds = torch.tensor([])
+            audio_embed_sizes = torch.tensor([])
+            audio_attention_mask = None
+        # Replace certain special tokens for compatibility
+        # Ref: https://stackoverflow.com/questions/11475885/python-replace-regex
+        if isinstance(text, str):
+            text = [text]
+        assert isinstance(text, list)
+        processed_text = [re.sub(_COMPATIBLE_IMAGE_SPECIAL_TOKEN_PATTERN, _IMAGE_SPECIAL_TOKEN, t) for t in text]
+        processed_text = [re.sub(_COMPATIBLE_AUDIO_SPECIAL_TOKEN_PATTERN, _AUDIO_SPECIAL_TOKEN, t) for t in processed_text]
+        input_ids_list = [self.tokenizer(t).input_ids for t in processed_text]
+        img_cnt, audio_cnt = 0, 0  # only needed for later assertion
+        image_token_count_iter = iter(num_img_tokens)
+        audio_embed_size_iter = iter(audio_embed_sizes.tolist())
+        new_input_ids_list = []
+        for input_ids in input_ids_list:
+            i = 0
+            while i < len(input_ids):
+                token_id = input_ids[i]
+                if token_id == _AUDIO_SPECIAL_TOKEN_ID:
+                    token_count = next(audio_embed_size_iter)
+                    audio_cnt += 1
+                elif token_id == _IMAGE_SPECIAL_TOKEN_ID:
+                    token_count = next(image_token_count_iter)
+                    img_cnt += 1
+                else:
+                    i += 1
+                    continue
+                tokens = [token_id] * token_count
+                input_ids = input_ids[:i] + tokens + input_ids[i + 1:]
+                i += token_count
+            input_ids = torch.tensor(input_ids, dtype=torch.long)
+            new_input_ids_list.append(input_ids)
+        lengths = torch.tensor([len(input_ids) for input_ids in new_input_ids_list])
+        max_len = lengths.max()
+        input_ids = input_ids.new_full((len(new_input_ids_list), max_len), self.tokenizer.pad_token_id)
+        # batched inference requires left padding (Qwen-specific change):
+        # right-align each sequence within its row
+        for i in range(len(new_input_ids_list)):
+            input_ids[i, max_len - len(new_input_ids_list[i]):] = new_input_ids_list[i]
+        # If the below assertion fails, it might be that input pure-text
+        # messages contain image/audio special tokens literally
+        # (<|endoftext10|>, <|endoftext11|>).
+        assert (
+            img_cnt == len(num_img_tokens)
+        ), (
+            f"Number of image tokens in prompt_token_ids ({img_cnt}) "
+            f"does not match number of images ({len(num_img_tokens)})"
+        )
+        assert (
+            audio_cnt == len(audio_embed_sizes)
+        ), (
+            f"Number of audio tokens in prompt_token_ids ({audio_cnt}) "
+            f"does not match number of audios ({len(audio_embed_sizes)})"
+        )
+        # prepare attention mask (Qwen-specific change): with left padding, the
+        # last `lengths[i]` positions of row i are real tokens
+        seq_range = torch.arange(max_len - 1, -1, -1)
+        attention_mask = seq_range.unsqueeze(0) < lengths.unsqueeze(1)
+        # prepare batch feature
+        data = {
+            "input_ids": input_ids,
+            "input_image_embeds": input_image_embeds,
+            "image_sizes": image_sizes,
+            "image_attention_mask": image_attention_mask,
+            "input_audio_embeds": input_audio_embeds,
+            "audio_embed_sizes": audio_embed_sizes,
+            "audio_attention_mask": audio_attention_mask,
+            "attention_mask": attention_mask,
+        }
+        return BatchFeature(
+            data=data
+        )
+    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
+    def batch_decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to GPT2Tokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
+        refer to the docstring of this method for more information.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
+    def decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to GPT2Tokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
+        the docstring of this method for more information.
+        """
+        return self.tokenizer.decode(*args, **kwargs)
+    @property
+    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
+    def model_input_names(self):
+        tokenizer_input_names = self.tokenizer.model_input_names
+        image_processor_input_names = self.image_processor.model_input_names
+        audio_processor_input_names = self.audio_processor.model_input_names
+        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names + audio_processor_input_names))
+AutoImageProcessor.register("Phi4MMImageProcessor", Phi4MMImageProcessor)
+AutoFeatureExtractor.register("Phi4MMAudioFeatureExtractor", Phi4MMAudioFeatureExtractor)
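
Reviewer note on `processing_phi4mm.py`: the processor does three things before building the batch. It computes an audio embedding count with two rounded-up divisions, repeats each image/audio placeholder token once per embedding, and left-pads the batch with a matching attention mask. The sketch below reproduces that flow in plain Python so it can be checked by hand. It is an illustration only, not the code in this commit: the token ids, the pad id, and the two compression rates (8 and 2) are hypothetical values, and the helper names are made up for this example.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical ids for illustration only; the real ones come from the tokenizer.
IMAGE_ID, AUDIO_ID, PAD_ID = 900, 901, 0


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, as in _compute_audio_embed_size."""
    return -(-a // b)


def audio_embed_size(frames: int, compression_rate: int, qformer_rate: int) -> int:
    """Two-stage rounded-up compression of the audio frame count."""
    return ceil_div(ceil_div(frames, compression_rate), qformer_rate)


def expand_placeholders(ids: List[int], sizes: Dict[int, Iterator[int]]) -> List[int]:
    """Repeat each placeholder id once per embedding of its media item.

    `sizes` maps a placeholder id to an iterator shared across the whole
    batch, so items are consumed in prompt order.
    """
    out: List[int] = []
    for tok in ids:
        if tok in sizes:
            out.extend([tok] * next(sizes[tok]))
        else:
            out.append(tok)
    return out


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[bool]]]:
    """Right-align every sequence; the mask is True on real tokens."""
    max_len = max(len(seq) for seq in batch)
    padded = [[pad_id] * (max_len - len(seq)) + seq for seq in batch]
    mask = [[False] * (max_len - len(seq)) + [True] * len(seq) for seq in batch]
    return padded, mask


# One image needing 3 slots, one audio clip of 100 frames (rates are assumed).
audio_slots = audio_embed_size(100, compression_rate=8, qformer_rate=2)
sizes = {IMAGE_ID: iter([3]), AUDIO_ID: iter([audio_slots])}
prompts = [[5, IMAGE_ID, 6], [7, AUDIO_ID]]
expanded = [expand_placeholders(p, sizes) for p in prompts]
input_ids, attention_mask = left_pad(expanded, PAD_ID)
print(audio_slots)
print(input_ids)
print(attention_mask)
```

With these assumed values the audio clip expands to 7 placeholder slots, so the second prompt becomes the longest row and the first prompt receives three pad tokens on the left. Left padding keeps the final position of every row a real token, which is what batched generation needs.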

speech_conformer_encoder.py ADDED Viewed

The diff for this file is too large to render. See raw diff