
Introducing DeepThought-8B: A small, capable reasoning model

Our first release: a small reasoning model built on LLaMA-3.1 8B that breaks down its thinking into transparent, structured steps.

Nov 27, 2024

Today we're releasing DeepThought-8B, a small, capable AI reasoning model built on LLaMA-3.1 8B. This release is our first step toward making AI reasoning more transparent and controllable, and it demonstrates that smaller, more efficient models can achieve sophisticated reasoning capabilities rivaling models many times their size. DeepThought-8B makes test-time compute scaling at inference available to everyone, taking as many reasoning steps as needed to solve complex problems.

We're excited to make DeepThought-8B available today, with powerful features that let you modulate the way the model reasons. In the coming weeks, we will be opening up our developer API (currently in closed beta) and releasing regular updates to the open-source model weights.

What makes DeepThought different?

DeepThought-8B approaches tasks differently from traditional language models. Given a problem, it breaks down its thinking into clear steps until it reaches a conclusion. Here's what this looks like in practice:

```json
{
  "step": 1,
  "type": "problem_understanding",
  "thought": "The user is asking how many Rs there are in the word 'strawberry'"
}
```

Each step is documented, making it easier to understand how the model arrives at its answers.
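Because each step is plain JSON, a reasoning chain is easy to inspect programmatically. Here is a minimal sketch that parses a chain and prints one line per step; the multi-step chain and the step types beyond `problem_understanding` are illustrative, not the model's exact output:

```python
import json

# A hypothetical reasoning chain in the documented step format.
# Step types other than "problem_understanding" are assumptions here.
chain = """[
  {"step": 1, "type": "problem_understanding",
   "thought": "The user is asking how many Rs there are in the word 'strawberry'"},
  {"step": 2, "type": "reasoning",
   "thought": "Spelling it out: s-t-r-a-w-b-e-r-r-y has an R at positions 3, 8, and 9"},
  {"step": 3, "type": "conclusion",
   "thought": "There are 3 Rs in 'strawberry'"}
]"""

def summarize_chain(raw: str) -> list[str]:
    """Parse a JSON reasoning chain and return one readable line per step."""
    steps = json.loads(raw)
    return [f"{s['step']}. [{s['type']}] {s['thought']}" for s in steps]

for line in summarize_chain(chain):
    print(line)
```

This kind of post-hoc inspection is the practical payoff of structured output: you can log, audit, or display the chain without any model-specific parsing code.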

Key Features

  1. Transparent Reasoning: The model shows its work, step by step.
  2. Programmable Approach: Through our API, you can guide how the model reasons without retraining it.
  3. Test-time Compute: The model can take as many steps as needed to solve a problem.
  4. Small but Mighty: At 8B parameters, DeepThought runs on consumer GPUs with 16GB+ VRAM, making sophisticated AI reasoning accessible without requiring enterprise-grade hardware.
  5. Structured Output: Consistent JSON-formatted reasoning chains for easy integration.
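The test-time compute idea above can be sketched as a simple generation loop: keep requesting reasoning steps until the model emits a terminal step, up to a budget. The `next_step` callable below is a stand-in for an actual model call, and the `"conclusion"` step type is an assumption for illustration:

```python
from typing import Callable

def run_until_conclusion(next_step: Callable[[list], dict],
                         max_steps: int = 32) -> list[dict]:
    """Request reasoning steps until a terminal step appears (or the budget runs out).

    `next_step` stands in for a call to the model: given the chain so far,
    it returns the next JSON reasoning step.
    """
    chain: list[dict] = []
    for _ in range(max_steps):
        step = next_step(chain)
        chain.append(step)
        if step["type"] == "conclusion":
            break
    return chain

# A toy stand-in "model" that concludes after two working steps.
def toy_model(chain):
    n = len(chain) + 1
    kind = "conclusion" if n >= 3 else "reasoning"
    return {"step": n, "type": kind, "thought": f"step {n}"}

result = run_until_conclusion(toy_model)
print(len(result))  # 3
```

The `max_steps` budget is the knob that trades compute for reasoning depth: harder problems get more steps, easy ones terminate early.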

Performance

While we're excited about DeepThought-8B's capabilities, we believe in transparency and community-driven evaluation. Our internal testing shows promising results across reasoning, math and coding benchmarks, but we encourage you to test these capabilities yourself.

Overall benchmarks

| Model | All | Math | Reasoning | Coding | IF |
|---|---|---|---|---|---|
| DeepThought-8B | 55.91 | 55.68 | 69.34 | 25.61 | 84.03 |
| Qwen-2-72B-Ins | 56.24 | 72.21 | 57.17 | 31.23 | 66.39 |
| Qwen-2.5-72B-Ins | 52.53 | 69.30 | 51.12 | 23.35 | 69.15 |
| Llama-3.1-70B-Ins | 47.34 | 57.05 | 39.13 | 24.58 | 62.34 |
| Llama-3.1-8B-Ins | 45.90 | 53.51 | 36.78 | 23.07 | 61.86 |
| Claude-3.5-Sonnet | 69.17 | 70.48 | 93.18 | 39.91 | 88.82 |
| GPT-4o | 70.99 | 74.78 | 89.22 | 46.75 | 88.37 |
| o1-mini | 71.70 | 91.04 | 91.67 | 42.89 | 85.67 |

(IF = instruction following.)

Insight

DeepThought-8B, at just 8B parameters, achieves reasoning and instruction-following scores competitive with models 9x its size, and outperforms Llama-3.1-70B on both reasoning and IF benchmarks.

Some early findings

  • Strong performance in step-by-step problem-solving
  • Competitive results on coding and mathematical tasks
  • Reliable instruction following with transparent reasoning chains
  • Performance scales with test-time compute, allowing deeper reasoning on complex tasks

Limitations

Like all models, DeepThought-8B has its limitations. We're actively working on:

  • Improving mathematical reasoning for complex problems
  • Enhancing long-context processing
  • Improving robustness on edge cases

Rather than hyping benchmark scores that might not reflect real-world usage, we invite you to:

  1. Test the model in your specific use cases
  2. Share your findings with our community
  3. Help us identify areas for improvement

You can report your findings and discuss the model's performance by tagging or DMing @ruliad on X. You can also send us an email at feedback@ruliad.co.

Download

Download DeepThought-8B-Llama-v0.01-alpha now on our Hugging Face repo.

This is just the beginning. We'll be updating the model regularly based on your feedback and our ongoing research.

What's next

Interested in this research or want to collaborate? We'd love to hear from you.

Get in touch →