Hi team,
I've been running lmms_eval on the ChartQA benchmark with the Qwen/Qwen2-VL-2B-Instruct model under two different backends, and I'm seeing a significant discrepancy in scores despite using the same model.
- Standard Python version
```bash
python3 -m lmms_eval \
  --model=qwen2_vl \
  --model_args=pretrained=Qwen/Qwen2-VL-2B-Instruct,device_map=cuda \
  --tasks=chartqa \
  --batch_size=1 \
  --log_samples \
  --log_samples_suffix=qwen2_vl \
  --output_path="./logs1"
```
Results:
| Tasks   | Version | Filter | n-shot | Metric                  |   | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------------|---|--------|---|--------|
| chartqa | Yaml    | none   | 0      | relaxed_augmented_split | ↑ | 0.8912 | ± | 0.0088 |
| chartqa | Yaml    | none   | 0      | relaxed_human_split     | ↑ | 0.5552 | ± | 0.0141 |
| chartqa | Yaml    | none   | 0      | relaxed_overall         | ↑ | 0.7232 | ± | 0.0090 |
- vLLM version
```bash
python3 -m lmms_eval \
  --model vllm \
  --model_args model_version=Qwen/Qwen2-VL-2B-Instruct,tensor_parallel_size=4 \
  --tasks chartqa \
  --batch_size 300 \
  --log_samples \
  --log_samples_suffix vllm \
  --output_path ./logs2
```
Results:
| Tasks   | Version | Filter | n-shot | Metric                  |   | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------------|---|--------|---|--------|
| chartqa | Yaml    | none   | 0      | relaxed_augmented_split | ↑ | 0.6752 | ± | 0.0133 |
| chartqa | Yaml    | none   | 0      | relaxed_human_split     | ↑ | 0.3208 | ± | 0.0132 |
| chartqa | Yaml    | none   | 0      | relaxed_overall         | ↑ | 0.4980 | ± | 0.0100 |
Issue:
Even though both configurations use Qwen/Qwen2-VL-2B-Instruct, the vLLM backend scores much lower (relaxed_overall 0.4980 vs. 0.7232).
Is there any known limitation or additional configuration needed when using the vllm backend?
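In case it helps with triage: since both runs were launched with `--log_samples`, I've been diffing the per-sample outputs written under `./logs1` and `./logs2` with a sketch like the one below. It assumes the sample logs are JSONL files containing `doc_id`, `target`, and `filtered_resps` fields; those field names are my assumption, so adjust them to whatever the log files actually contain.

```python
# Hypothetical sketch: diff the per-sample logs from the two runs to see
# where the vLLM backend diverges. Field names ("doc_id", "filtered_resps",
# "target") are assumptions about the --log_samples JSONL format; adjust
# them to match the files actually written under ./logs1 and ./logs2.
import json
from pathlib import Path

def load_samples(log_dir):
    """Load every JSONL sample log under log_dir, keyed by doc_id."""
    samples = {}
    for path in Path(log_dir).rglob("*.jsonl"):
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                rec = json.loads(line)
                samples[rec["doc_id"]] = rec
    return samples

hf_run = load_samples("./logs1")    # standard Python (HF) run
vllm_run = load_samples("./logs2")  # vLLM run

mismatches = 0
for doc_id, hf_rec in hf_run.items():
    vllm_rec = vllm_run.get(doc_id)
    if vllm_rec is None:
        continue
    if hf_rec["filtered_resps"] != vllm_rec["filtered_resps"]:
        mismatches += 1
        if mismatches <= 5:  # print a handful of differing examples for inspection
            print(f"doc {doc_id}")
            print(f"  target : {hf_rec.get('target')}")
            print(f"  hf     : {hf_rec['filtered_resps']}")
            print(f"  vllm   : {vllm_rec['filtered_resps']}")

print(f"{mismatches} / {len(hf_run)} answers differ between backends")
```

My rough read so far: if the answers differ mostly in formatting (extra units or trailing text), that would point at prompt/generation settings; if they are entirely different answers, an image-preprocessing difference between the backends seems more likely. Happy to share sample logs if that helps.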