EAGLE Draft Models
The following code configures vLLM to use speculative decoding where proposals are generated by an EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)-based draft model. A more detailed example for offline mode, including how to extract the request-level acceptance rate, can be found in examples/offline_inference/spec_decode.py.
Eagle Drafter Example
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Target model runs with tensor parallel size 4; the EAGLE draft model
# runs with TP=1 and proposes 2 speculative tokens per step.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
        "method": "eagle",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Eagle3 Drafter Example
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Target model and the EAGLE-3 draft model both run with tensor parallel size 2.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "draft_tensor_parallel_size": 2,
        "num_speculative_tokens": 2,
        "method": "eagle3",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
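Continuing from either example above, a rough way to estimate the acceptance rate offline is to read vLLM's speculative-decoding counters after generation. The sketch below is an assumption-laden outline: it presumes a recent vLLM version where LLM.get_metrics() is available, and the counter names vllm:spec_decode_num_draft_tokens and vllm:spec_decode_num_accepted_tokens are assumptions; examples/offline_inference/spec_decode.py shows the exact, supported way to do this, including per-request acceptance rates.

# Sketch only: reuses the `llm` object from the example above and assumes
# the metric names below; see examples/offline_inference/spec_decode.py
# for the authoritative version.
num_draft_tokens = 0
num_accepted_tokens = 0
for metric in llm.get_metrics():
    if metric.name == "vllm:spec_decode_num_draft_tokens":
        num_draft_tokens = metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens":
        num_accepted_tokens = metric.value

if num_draft_tokens:
    print(f"Draft token acceptance rate: {num_accepted_tokens / num_draft_tokens:.2%}")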
Pre-Trained Eagle Draft Models
A variety of EAGLE draft models are available on the Hugging Face hub, such as the yuhuili/EAGLE-LLaMA3-Instruct-8B and RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3 checkpoints used in the examples above.
Warning
If you are using vllm<0.7.0, please use this script to convert the speculative model and specify "model": "path/to/modified/eagle/model" in speculative_config.
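For illustration, a converted checkpoint is referenced from speculative_config in the same way as the hub checkpoints above; the path below is the placeholder from the warning, not a real model location.

speculative_config = {
    # Placeholder path: point this at the directory written by the conversion script.
    "model": "path/to/modified/eagle/model",
    "num_speculative_tokens": 2,
    "method": "eagle",
}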