
Performance gap between Python and vLLM backends using the same Qwen2-VL-2B-Instruct model on ChartQA task #698

Hi, team,

I’ve been running lmms_eval on the ChartQA benchmark with Qwen/Qwen2-VL-2B-Instruct under two different backends, and I’m seeing a significant performance discrepancy despite using the same model.

1. Standard Python backend (`qwen2_vl`)
```bash
python3 -m lmms_eval \
    --model=qwen2_vl \
    --model_args=pretrained=Qwen/Qwen2-VL-2B-Instruct,device_map=cuda \
    --tasks=chartqa \
    --batch_size=1 \
    --log_samples \
    --log_samples_suffix=qwen2_vl \
    --output_path="./logs1"
```

Results:

| Tasks   | Version | Filter | n-shot | Metric                  | Value  | Stderr   |
|---------|---------|--------|--------|-------------------------|--------|----------|
| chartqa | Yaml    | none   | 0      | relaxed_augmented_split | 0.8912 | ± 0.0088 |
| chartqa | Yaml    | none   | 0      | relaxed_human_split     | 0.5552 | ± 0.0141 |
| chartqa | Yaml    | none   | 0      | relaxed_overall         | 0.7232 | ± 0.0090 |
2. vLLM backend
```bash
python3 -m lmms_eval \
    --model vllm \
    --model_args model_version=Qwen/Qwen2-VL-2B-Instruct,tensor_parallel_size=4 \
    --tasks chartqa \
    --batch_size 300 \
    --log_samples \
    --log_samples_suffix vllm \
    --output_path ./logs2
```

Results:

| Tasks   | Version | Filter | n-shot | Metric                  | Value  | Stderr   |
|---------|---------|--------|--------|-------------------------|--------|----------|
| chartqa | Yaml    | none   | 0      | relaxed_augmented_split | 0.6752 | ± 0.0133 |
| chartqa | Yaml    | none   | 0      | relaxed_human_split     | 0.3208 | ± 0.0132 |
| chartqa | Yaml    | none   | 0      | relaxed_overall         | 0.4980 | ± 0.0100 |

Issue:

Even though both configurations use Qwen/Qwen2-VL-2B-Instruct, the vLLM backend scores substantially lower (relaxed_overall drops from 0.7232 to 0.4980).

Is there any known limitation or additional configuration needed when using the vllm backend?
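
To help narrow this down, here is a minimal sketch (not tested against a specific lmms_eval version) that diffs the per-sample logs written by `--log_samples` into `./logs1` and `./logs2`. The file-name pattern and the record fields (`doc_id`, `target`, `filtered_resps`) are assumptions on my side and may need adjusting to match the actual log format:

```python
import json
from pathlib import Path


def load_samples(log_dir: str) -> dict:
    """Load per-sample records keyed by doc_id from an lmms_eval log directory.

    Assumes --log_samples wrote a *chartqa*.json(l) file whose records carry
    'doc_id', 'target', and 'filtered_resps' fields; adjust the glob pattern
    and keys for your lmms_eval version (some versions nest records under a
    'logs' key).
    """
    samples = {}
    for path in Path(log_dir).rglob("*chartqa*.json*"):
        with open(path) as f:
            if path.suffix == ".json":
                records = json.load(f)
            else:  # .jsonl: one record per line
                records = [json.loads(line) for line in f]
        for rec in records:
            if isinstance(rec, dict) and "doc_id" in rec:
                samples[rec["doc_id"]] = rec
    return samples


hf_run = load_samples("./logs1")    # qwen2_vl (HF) backend
vllm_run = load_samples("./logs2")  # vllm backend

# doc_ids whose post-processed answers differ between the two backends
diverging = [
    doc_id
    for doc_id, rec in hf_run.items()
    if doc_id in vllm_run
    and rec.get("filtered_resps") != vllm_run[doc_id].get("filtered_resps")
]
print(f"{len(diverging)} / {len(hf_run)} answers differ between the two backends")

# Inspect a few diverging pairs to see whether the generations themselves
# differ or only the parsed answers do.
for doc_id in diverging[:5]:
    print("doc_id:", doc_id)
    print("  target:   ", hf_run[doc_id].get("target"))
    print("  hf resp:  ", hf_run[doc_id].get("filtered_resps"))
    print("  vllm resp:", vllm_run[doc_id].get("filtered_resps"))
```

If most answers already differ at the raw generation level, the gap presumably comes from decoding or image-preprocessing differences between the two backends rather than from the ChartQA scoring itself.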
