A higher-performance OpenAI LLM service than vLLM serve: a pure C++ OpenAI-compatible LLM service built with GRPS + TensorRT-LLM + Tokenizers.cpp, supporting chat and function call, AI agents, distributed multi-GPU inference, multimodal input, and a Gradio chat interface.
# GRPS + TensorRT-LLM

git clone https://github.com/NetEase-Media/grps_trtllm
A pure C++ implementation of an OpenAI LLM service with better performance than vLLM serve, supporting chat, AI agents, multimodal models, multi-GPU inference, and more.
GRPS integrates TensorRT-LLM to provide a higher-performance LLM service with OpenAI-style access, AI-agent support, and multimodal capabilities:
- A complete LLM service implemented in C++, including the tokenizer (HuggingFace and SentencePiece tokenizers are supported), LLM inference, ViT, and other components.
- Implements the OpenAI interface protocol through GRPS's custom HTTP feature, supporting both chat and function-call modes.
- Extensible prompt-building styles and generation-result parsing styles, so that the chat, function-call, and AI-agent modes of different LLMs can be supported.
- Integrates the TensorRT inference backend and the OpenCV library to support multimodal LLMs.
- Supports TensorRT-LLM acceleration techniques such as inflight batching, multi-GPU inference, paged attention, kv-cache reuse, and lookahead decoding.
- Compared with Triton's triton_server <--> tokenizer_backend <--> trtllm_backend inter-process communication, the pure C++ implementation delivers a stable performance improvement.
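Because the service speaks the OpenAI chat-completions protocol, a standard OpenAI-style request body should work against it. Below is a minimal sketch of building such a payload with a function-calling tool and parsing a tool-call response; the model name `qwen2-instruct`, the tool `get_weather`, and the sample response are hypothetical illustrations (the payload and response shapes follow the public OpenAI chat-completions schema, not anything specific to this repo):

```python
import json

def build_chat_request(model, user_message, tools=None, stream=False):
    """Build an OpenAI-style /v1/chat/completions request body.

    The model name and endpoint are deployment-specific assumptions;
    the payload shape follows the OpenAI chat-completions schema.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }
    if tools:
        body["tools"] = tools  # OpenAI function-calling tool list
    return body

# A tool definition in OpenAI's function-calling format (hypothetical tool).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request = build_chat_request("qwen2-instruct",
                             "What's the weather in Hangzhou?",
                             tools=[weather_tool])
print(json.dumps(request, indent=2))

# A sample assistant reply in function-call mode (illustrative, not a real
# server response): the model asks the client to invoke the tool.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather",
                             "arguments": "{\"city\": \"Hangzhou\"}"},
            }],
        },
        "finish_reason": "tool_calls",
    }]
}
call = sample_response["choices"][0]["message"]["tool_calls"][0]
args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
```

The request body would be POSTed to the service's chat-completions endpoint with any HTTP client, or by pointing an OpenAI SDK's `base_url` at the deployment.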