Hi team,
I've been running lmms_eval on the ChartQA benchmark with the Qwen/Qwen2-VL-2B-Instruct model under two different backends, and I'm seeing a significant discrepancy in scores despite using the same model.
- Standard Python version
```bash
python3 -m lmms_eval \
  --model=qwen2_vl \
  --model_args=pretrained=Qwen/Qwen2-VL-2B-Instruct,device_map=cuda \
  --tasks=chartqa \
  --batch_size=1 \
  --log_samples \
  --log_samples_suffix=qwen2_vl \
  --output_path="./logs1"
```
Results:
| Tasks   | Version | Filter | n-shot | Metric                  |   | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------------|---|--------|---|--------|
| chartqa | Yaml    | none   | 0      | relaxed_augmented_split | ↑ | 0.8912 | ± | 0.0088 |
| chartqa | Yaml    | none   | 0      | relaxed_human_split     | ↑ | 0.5552 | ± | 0.0141 |
| chartqa | Yaml    | none   | 0      | relaxed_overall         | ↑ | 0.7232 | ± | 0.0090 |
- vLLM version
```bash
python3 -m lmms_eval \
  --model vllm \
  --model_args model_version=Qwen/Qwen2-VL-2B-Instruct,tensor_parallel_size=4 \
  --tasks chartqa \
  --batch_size 300 \
  --log_samples \
  --log_samples_suffix vllm \
  --output_path ./logs2
```
Results:
| Tasks   | Version | Filter | n-shot | Metric                  |   | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------------|---|--------|---|--------|
| chartqa | Yaml    | none   | 0      | relaxed_augmented_split | ↑ | 0.6752 | ± | 0.0133 |
| chartqa | Yaml    | none   | 0      | relaxed_human_split     | ↑ | 0.3208 | ± | 0.0132 |
| chartqa | Yaml    | none   | 0      | relaxed_overall         | ↑ | 0.4980 | ± | 0.0100 |
Issue:
Even though both configurations use Qwen/Qwen2-VL-2B-Instruct, the vLLM backend scores much lower (relaxed_overall 0.4980 vs. 0.7232).
Is there any known limitation or additional configuration needed when using the vllm backend?
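In case it helps with triage: since both runs were launched with `--log_samples`, I've been diffing the per-sample outputs written under `./logs1` and `./logs2` with a sketch like the one below. It assumes the sample logs are JSONL files containing `doc_id`, `target`, and `filtered_resps` fields; those field names are my assumption, so adjust them to whatever the log files actually contain.

```python
# Hypothetical sketch: diff the per-sample logs from the two runs to see
# where the vLLM backend diverges. Field names ("doc_id", "filtered_resps",
# "target") are assumptions about the --log_samples JSONL format; adjust
# them to match the files actually written under ./logs1 and ./logs2.
import json
from pathlib import Path

def load_samples(log_dir):
    """Load every JSONL sample log under log_dir, keyed by doc_id."""
    samples = {}
    for path in Path(log_dir).rglob("*.jsonl"):
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                rec = json.loads(line)
                samples[rec["doc_id"]] = rec
    return samples

hf_run = load_samples("./logs1")    # standard Python (HF) run
vllm_run = load_samples("./logs2")  # vLLM run

mismatches = 0
for doc_id, hf_rec in hf_run.items():
    vllm_rec = vllm_run.get(doc_id)
    if vllm_rec is None:
        continue
    if hf_rec["filtered_resps"] != vllm_rec["filtered_resps"]:
        mismatches += 1
        if mismatches <= 5:  # print a handful of differing examples for inspection
            print(f"doc {doc_id}")
            print(f"  target : {hf_rec.get('target')}")
            print(f"  hf     : {hf_rec['filtered_resps']}")
            print(f"  vllm   : {vllm_rec['filtered_resps']}")

print(f"{mismatches} / {len(hf_run)} answers differ between backends")
```

My rough read so far: if the answers differ mostly in formatting (extra units or trailing text), that would point at prompt/generation settings; if they are entirely different answers, an image-preprocessing difference between the backends seems more likely. Happy to share sample logs if that helps.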