I made Phi-14b into a (primitive) reasoner using a prototype MLX-GRPO trainer

https://preview.redd.it/4ldnwck141he1.png?width=962&format=png&auto=webp&s=8c64e19a26652495913675a71003e62b013c9526

Hey everyone! Quick post today, since it's 2am here and I really should be getting to bed 😂

So someone (@ActuallyIsaak) managed to get a prototype of the GRPO algo working in MLX, which you can check out the draft of here. I spent a bit of time messing around with it, and with only three hand-written samples I managed to get Phi-14b to implement this CoT! Not only that, but it managed to perfectly recall some factual information and generalise from it. Interestingly, I didn't get this generalisation behaviour with my first version, where I didn't include a CoT (more often than not it started hallucinating some other Mark Lord; often, for some reason, a 60-year-old businessman lol)
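For anyone wondering what makes GRPO so lightweight: instead of training a separate value model as a baseline (like PPO does), it samples a group of completions per prompt, scores them, and normalises each reward against the group's mean and std. Here's a minimal sketch of that group-relative advantage step in plain Python (the function name is mine, and this is obviously not the actual MLX trainer):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages, as in GRPO: normalise each
    completion's reward against its group's mean and std.
    One group = several sampled completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. four completions for one prompt, scored 1.0 (good) / 0.0 (bad):
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# advantages are positive for above-average completions,
# negative for below-average ones, and sum to ~0
```

These advantages then weight the policy-gradient update, so the model is pushed towards its above-average samples and away from its below-average ones without ever needing a critic network.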