Update README.md
The proposed methodology involves the following steps:
1. Generate a response \\( y \\) for a given prompt \\( x \\) from the policy to be optimized, denoted as \\( \pi(y/x) \\).
2. Ask a teacher to produce a corrected version \\( y' \\) of \\( y \\), denoted as \\( \tau(y'/y, x) \\). The teacher can be a human or a more capable LLM.
3. Assume a distribution function \\( f \\) that gives the likelihood of a given text. For a prompt \\( x \\) and a corresponding response \\( y \\), \\( f(y, x) \\) can be written as \\( P(x) \cdot P(y/x) \\), so the log-likelihood is \\( \log(P(x)) + \log(P(y/x)) \\).
4. Calculate the amount of correction required as the difference between the log-likelihoods of the two responses: \\( \log f(y, x) - \log f(y', x) = \log(P(y/x)) - \log(P(y'/x)) \\), since the \\( \log(P(x)) \\) term cancels.
5. The distribution \\( P \\) can be parameterized through the policy \\( \pi \\). For training, \\( P \\) is set directly to \\( \pi \\).
6. Optimize the policy \\( \pi \\) to minimize the necessary correction: \\( \min_{\pi} ( \log(\pi(y/x)) - \log(\pi(y'/x)) ) \\). A minimal code sketch of this objective follows the list.
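Below is a minimal sketch of the correction objective in step 6, assuming the policy is a Hugging Face causal LM. The model name (`gpt2`) and the helpers `sequence_log_prob` / `correction_loss` are illustrative placeholders rather than part of the method, and the prompt/response tokenization boundary is handled in a simplified way.

```python
# Sketch of the correction objective: loss = log pi(y | x) - log pi(y' | x).
# Assumes a Hugging Face causal LM as the policy; names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM can stand in for the policy pi
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)


def sequence_log_prob(prompt: str, response: str) -> torch.Tensor:
    """Return log pi(response | prompt): sum of per-token log-probs of the response."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = policy(full_ids).logits  # shape: (1, seq_len, vocab)
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens (simplified: assumes the prompt tokenizes
    # to the same ids inside the concatenated string).
    return token_log_probs[:, prompt_ids.shape[1] - 1 :].sum()


def correction_loss(prompt: str, y: str, y_corrected: str) -> torch.Tensor:
    # Minimizing this pushes probability mass away from the policy's own
    # response y and toward the teacher-corrected response y'.
    return sequence_log_prob(prompt, y) - sequence_log_prob(prompt, y_corrected)


loss = correction_loss("Question: 2 + 2 = ? Answer:", " 5", " 4")
loss.backward()  # gradients flow into the policy parameters
```

In practice one would batch the \\( (x, y, y') \\) triples and take an optimizer step on this loss; the sketch only shows the per-pair term.
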
## Domain-specific custom objective
This framework leaves the choice of \\(P\\) open, which offers flexibility: if additional prior assumptions are available, they can be integrated. For instance, a prior over the distribution of response lengths could be included, steering the model toward responses of a certain length. If \\( P(y) = \pi(y) \cdot l(y) \\), where \\(l(y)\\) is a prior specific to a target domain, the optimization objective becomes \\( \min_{\pi} ( \log(\pi(y/x)) - \log(\pi(y'/x)) + \log(l(y)) - \log(l(y')) ) \\), so the policy is also trained to minimize the extra, domain-specific loss term.
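Continuing the sketch above (it reuses `tokenizer` and `sequence_log_prob` from the earlier block), one possible instantiation of the prior \\(l(y)\\) is a Gaussian penalty on response length in tokens; the target length and width below are arbitrary example values, not part of the original formulation.

```python
# Illustrative domain prior l(y): an (unnormalized) Gaussian log-density over
# response length in tokens. Target length and width are example values.
def length_log_prior(response: str, target_len: int = 50, sigma: float = 10.0) -> float:
    n_tokens = len(tokenizer(response).input_ids)
    return -((n_tokens - target_len) ** 2) / (2 * sigma**2)


def domain_correction_loss(prompt: str, y: str, y_corrected: str) -> torch.Tensor:
    # min over pi of: log pi(y/x) - log pi(y'/x) + log l(y) - log l(y')
    return (
        sequence_log_prob(prompt, y)
        - sequence_log_prob(prompt, y_corrected)
        + length_log_prior(y)
        - length_log_prior(y_corrected)
    )
```
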
## Connection with Direct Preference Optimization (DPO) and Contrastive Preference Learning (CPL)
The proposed approach has a direct connection to the [DPO](https://arxiv.org/pdf/2305.18290) and [CPL](https://arxiv.org/pdf/2310.13639) frameworks.
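For reference, the DPO objective from the linked paper is, with \\( y_w \\) the preferred response, \\( y_l \\) the dispreferred one, \\( \pi_{\mathrm{ref}} \\) a frozen reference policy, and \\( \beta \\) a temperature:

\\[ \mathcal{L}_{\mathrm{DPO}}(\pi) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi(y_w/x)}{\pi_{\mathrm{ref}}(y_w/x)} - \beta \log \frac{\pi(y_l/x)}{\pi_{\mathrm{ref}}(y_l/x)} \right) \right] \\]

Reading the corrected response \\( y' \\) as \\( y_w \\) and the original response \\( y \\) as \\( y_l \\), the policy-dependent term inside the sigmoid closely mirrors the correction objective above, up to the reference-policy normalization, the \\( \beta \\) scaling, and the logistic link.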