Phi-4 Llamafied + 4 Bug Fixes + GGUFs, Dynamic 4bit Quants

Hey r/LocalLLaMA ! I've uploaded fixed versions of Phi-4, including GGUF + 4-bit + 16-bit versions on HuggingFace!

We’ve fixed over 4 bugs (3 major ones) in Phi-4, mainly related to tokenizers and chat templates which affected inference and finetuning workloads. If you were experiencing poor results, we recommend trying our GGUF upload. A detailed post on the fixes will be released tomorrow.

We also Llamafied the model meaning it should work out of the box with every framework including Unsloth. Fine-tuning is 2x faster, uses 70% VRAM & has 9x longer context lengths with Unsloth.

View all Phi-4 versions with our bug fixes: https://huggingface.co/collections/unsloth/phi-4-all-versions-677eecf93784e61afe762afa

Phi-4 Uploads (with our bug fixes)
GGUFs including 2, 3, 4, 5, 6, 8, 16-bit
Unsloth Dynamic 4-bit
4-bit Bnb
Original 16-bit

I uploaded Q2_K_L quants which works well as well - they are Q2_K quants, but leaves the embedding as Q4 and lm_head as Q6 - this should increase accuracy by a bit!

To use Phi-4 in llama.cpp, do:

./llama.cpp/llama-cli
    --model unsloth/phi-4-GGUF/phi-4-Q2_K_L.gguf
    --prompt '<|im_start|>user<|im_sep|>Provide all combinations of a 5 bit binary number.<|im_end|><|im_start|>assistant<|im_sep|>'
    --threads 16

Which will produce:

A 5-bit binary number consists of 5 positions, each of which can be either 0 or 1. Therefore, there are \(2^5 = 32\) possible combinations. Here they are, listed in ascending order:
1. 00000
2. 00001
3. 00010

I also uploaded Dynamic 4bit quants which don't quantize every layer to 4bit, and leaves some in 16bit - by using only an extra 1GB of VRAM, you get superior accuracy, especially for finetuning! - Head over to https://github.com/unslothai/unsloth to finetune LLMs and Vision models 2x faster and use 70% less VRAM!

Dynamic 4bit quants leave some layers as 16bit and not 4bit