In this workshop, we’ll use distilled models. What are they, and how are they different from full-sized models like Claude or ChatGPT?
In essence, model distillation is a technique where a smaller model (the student) is trained to imitate the behavior of a larger, more powerful model (the teacher), typically by learning from the teacher’s outputs (that is, from how it responds to different prompts and examples) rather than directly from the original training data. This allows the student model to retain much of the teacher’s performance while being faster and more lightweight.
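To make that concrete, here is a minimal sketch (in PyTorch, assuming the `torch` library) of the classic knowledge-distillation objective: the student is trained on a mix of the teacher’s “soft” predictions and the true labels. Real distillation pipelines for large language models are more elaborate than this, but the core idea is the same.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (a) KL divergence between the teacher's and the student's
    softened output distributions and (b) ordinary cross-entropy on the
    true labels. T is the softening temperature, alpha the mixing weight."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)      # teacher's "soft" predictions
    soft_student = F.log_softmax(student_logits / T, dim=-1)  # student's log-probabilities
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: a batch of 4 examples over a 10-class output space.
student_logits = torch.randn(4, 10, requires_grad=True)  # produced by the small student model
teacher_logits = torch.randn(4, 10)                       # produced by the frozen teacher model
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student
```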
Distilled models are smaller
ChatGPT runs on massive servers with specialized hardware (like GPUs), often distributed across data centers (which require a ton of fresh water for cooling the servers—yikes). These models have hundreds of billions of parameters, which require gigabytes (or terabytes) of memory and powerful parallel computation to function. By contrast, distilled models are compressed, smaller versions of larger models. They might have millions or a few billion parameters, and they’re designed to be efficient enough to run on a personal laptop with 8GB to 32GB of RAM.
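To get a feel for those numbers, a rough back-of-the-envelope for the memory needed just to hold a model’s weights is parameters × bytes per parameter. The parameter counts below are illustrative rather than the actual sizes of any particular model, and real memory use also depends on things like context length and runtime overhead, but the arithmetic shows why one fits on a laptop and the other needs a data center.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Rough memory needed to hold a model's weights alone, in gigabytes.
    2 bytes per parameter assumes 16-bit floating-point weights."""
    return n_params * bytes_per_param / 1e9

# An illustrative distilled model with ~2 billion parameters:
print(weight_memory_gb(2e9))    # ~4 GB: within reach of a laptop with 8-32GB of RAM

# An illustrative frontier-scale model with ~500 billion parameters:
print(weight_memory_gb(500e9))  # ~1000 GB (a terabyte): needs racks of specialized hardware
```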
Distilled models are better for specific tasks
What this means in terms of capabilities is that full-sized models, being bigger (having more parameters), tend to have broader capabilities and can handle complex, open-ended tasks. Distilled models are more lightweight and may be less fluent and creative for general-purpose tasks. That being said, they are often good enough for specific tasks, especially if customized and fine-tuned on your own data.
<aside> 📌
This is important from a “usability” standpoint: to get good answers from distilled models, prompts need to be much more explicit and give a lot of context. For example, instead of “Summarize this,” a prompt like “Summarize the following meeting notes in three bullet points for a colleague who missed the meeting,” followed by the notes themselves, will usually get better results.
</aside>
Distilled models are more private
So the “trade-off” of distilled models is some loss of quality in exchange for more privacy and control over the data you share with these models. They are not cloud-based like Claude or ChatGPT; the inputs and outputs never leave your machine, which means you can work with more personal material, or simply material you’re afraid might get used in ChatGPT’s next training cycle.
Finally, distilled models are more “hackable”: you can retrain them on custom datasets or embed them into your own tools. You can also mess with the settings (more on that later). While you can fine-tune OpenAI’s GPTs with their API or custom GPT builder, those options are costly (financially and environmentally).
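As a small taste of that hackability, here is a minimal sketch assuming the Hugging Face `transformers` library (with PyTorch installed) and `distilgpt2`, a distilled version of GPT-2; your own setup may use different tooling, but the idea of loading a small model locally and turning its knobs yourself is the same.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small distilled model locally (downloads the weights on first run).
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Distilled models are useful because", return_tensors="pt")

# These are the kinds of settings you can mess with on a local model:
# temperature (randomness), top_p (nucleus sampling), max_new_tokens (length).
outputs = model.generate(
    **inputs,
    do_sample=True,         # enable sampling so temperature and top_p take effect
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```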