Scenario 4: Chatbot Benchmarks in Generative AI
The chatbot scenario covers chatbot and dialog use cases where prompts and responses are short; a minimal sketch of this traffic shape follows the list.
- The prompt length is fixed at 100 tokens.
- The response length is fixed at 100 tokens.
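The following minimal Python sketch (illustration only, not the actual benchmark harness) reproduces this traffic shape: a fixed-size prompt, a fixed response budget, and a configurable number of requests in flight. The `call_model` helper and the commented-out `client.chat` call are hypothetical placeholders for whatever SDK the cluster is accessed through.

```python
# Illustrative sketch of the chatbot traffic shape: ~100-token prompts,
# a 100-token response cap, and N requests in flight at once.
# `call_model` is a hypothetical stand-in; swap in your SDK's chat call.
import time
from concurrent.futures import ThreadPoolExecutor

PROMPT_TOKENS = 100    # fixed prompt length in this scenario
RESPONSE_TOKENS = 100  # fixed response length in this scenario
CONCURRENCY = 8        # one of the benchmarked concurrency levels
TOTAL_REQUESTS = 64

def call_model(prompt: str, max_tokens: int) -> float:
    """Send one chat request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    # response = client.chat(message=prompt, max_tokens=max_tokens)  # hypothetical
    time.sleep(0.05)  # placeholder for network + generation time
    return time.perf_counter() - start

prompt = " ".join(["word"] * PROMPT_TOKENS)  # crude ~100-token prompt
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(lambda _: call_model(prompt, RESPONSE_TOKENS),
                              range(TOTAL_REQUESTS)))
print(f"mean request-level latency: {sum(latencies) / len(latencies):.2f} s")
```

Varying `CONCURRENCY` across the values in the tables below (1 through 256) reproduces the benchmark's load levels.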
Important
The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenarios going through the model that it's hosting. Traffic scenarios depend on the following factors; a sketch of how the reported metrics are computed follows the list.
- The number of concurrent requests.
- The number of tokens in the prompt.
- The number of tokens in the response.
- The variance of prompt and response token counts across requests.
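As a concrete illustration of how the four reported metrics relate to these factors, here is a hedged sketch of deriving each table column from per-request measurements. The field names are assumptions for illustration, not the benchmark's actual schema.

```python
# Hedged sketch: deriving the four benchmark columns from per-request
# measurements. Field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RequestResult:
    output_tokens: int         # tokens generated for this request
    generation_seconds: float  # time spent generating those tokens
    total_seconds: float       # end-to-end request latency

def summarize(results: list[RequestResult], wall_clock_seconds: float):
    # Token-level inference speed: per-request generation rate, averaged.
    speed = sum(r.output_tokens / r.generation_seconds for r in results) / len(results)
    # Token-level throughput: aggregate tokens produced over the whole run.
    throughput = sum(r.output_tokens for r in results) / wall_clock_seconds
    # Request-level latency: mean end-to-end time per request.
    latency = sum(r.total_seconds for r in results) / len(results)
    # Request-level throughput: completed requests per minute (RPM).
    rpm = len(results) / wall_clock_seconds * 60
    return speed, throughput, latency, rpm

# Example: four identical requests completing within a 2-second window
# gives roughly (100.0, 200.0, 1.2, 120.0).
results = [RequestResult(100, 1.0, 1.2) for _ in range(4)]
print(summarize(results, wall_clock_seconds=2.0))
```

Note how concurrency pulls the two token-level numbers apart in the tables below: per-request inference speed drops as concurrency rises, while aggregate throughput climbs.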
Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The chatbot scenario is benchmarked in the following regions.
Brazil East (Sao Paulo)
- Model: `cohere.command-r-08-2024` (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |

- Model: `cohere.command-r-plus-08-2024` (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |

- Model: `meta.llama-3.2-90b-vision-instruct` (Meta Llama 3.2 90B Vision), text input only, hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |

- Model: `meta.llama-3.2-11b-vision-instruct` (Meta Llama 3.2 11B Vision), text input only, hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |

- Model: `meta.llama-3.1-405b-instruct` (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |

- Model: `meta.llama-3.1-70b-instruct` (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |

- Model: `meta.llama-3-70b-instruct` (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |

- Model: `cohere.command-r-16k` v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |

- Model: `cohere.command-r-plus` (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
Germany Central (Frankfurt)
- Model: `cohere.command-r-08-2024` (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |

- Model: `cohere.command-r-plus-08-2024` (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |

- Model: `meta.llama-3.1-405b-instruct` (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |

- Model: `meta.llama-3.1-70b-instruct` (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |

- Model: `meta.llama-3-70b-instruct` (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |

- Model: `cohere.command-r-16k` v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |

- Model: `cohere.command-r-plus` (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
UK South (London)
- Model: `cohere.command-r-08-2024` (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |

- Model: `cohere.command-r-plus-08-2024` (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |

- Model: `meta.llama-3.2-90b-vision-instruct` (Meta Llama 3.2 90B Vision), text input only, hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |

- Model: `meta.llama-3.2-11b-vision-instruct` (Meta Llama 3.2 11B Vision), text input only, hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |

- Model: `meta.llama-3.1-405b-instruct` (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |

- Model: `meta.llama-3.1-70b-instruct` (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |

- Model: `meta.llama-3-70b-instruct` (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |

- Model: `cohere.command-r-16k` v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |

- Model: `cohere.command-r-plus` (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
US Midwest (Chicago)
- Model: `cohere.command-r-08-2024` (Cohere Command R 08-2024), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |

- Model: `cohere.command-r-plus-08-2024` (Cohere Command R+ 08-2024), hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |

- Model: `meta.llama-3.2-90b-vision-instruct` (Meta Llama 3.2 90B Vision), text input only, hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |

- Model: `meta.llama-3.2-11b-vision-instruct` (Meta Llama 3.2 11B Vision), text input only, hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |

- Model: `meta.llama-3.1-405b-instruct` (Meta Llama 3.1 405B), hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |

- Model: `meta.llama-3.1-70b-instruct` (Meta Llama 3.1 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |

- Model: `meta.llama-3-70b-instruct` (Meta Llama 3 70B), hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 31.07 | 31.12 | 3.28 | 18.29 |
| 2 | 30.33 | 59.43 | 3.40 | 34.88 |
| 4 | 29.39 | 113.76 | 3.51 | 66.48 |
| 8 | 27.14 | 210.00 | 3.77 | 123.22 |
| 16 | 24.04 | 351.38 | 4.24 | 205.78 |
| 32 | 19.40 | 523.68 | 5.23 | 306.44 |
| 64 | 16.12 | 837.45 | 6.28 | 491.00 |
| 128 | 9.48 | 920.97 | 10.63 | 541.91 |
| 256 | 5.73 | 1,211.95 | 17.79 | 713.19 |

- Model: `cohere.command-r-16k` v1.2 (Cohere Command R), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |

- Model: `cohere.command-r-plus` (Cohere Command R+), hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |

- Model: `cohere.command` (Cohere Command 52B), hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 34.98 | 28.85 | 3.21 | 17.30 |
| 8 | 29.51 | 119.83 | 5.34 | 71.62 |
| 32 | 27.44 | 293.58 | 5.91 | 177.09 |
| 128 | 25.56 | 482.88 | 6.67 | 291.95 |

- Model: `cohere.command-light` (Cohere Command Light 6B), hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 71.85 | 54.49 | 1.74 | 30.21 |
| 8 | 41.91 | 191.52 | 2.87 | 105.63 |
| 32 | 31.37 | 395.49 | 3.55 | 216.87 |
| 128 | 28.27 | 557.57 | 3.90 | 302.44 |

- Model: `meta.llama-2-70b-chat` (Llama 2 70B), hosted on one Llama2 70 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute) |
|---|---|---|---|---|
| 1 | 17.65 | 15.92 | 5.88 | 9.76 |
| 8 | 14.95 | 91.02 | 6.44 | 59.32 |
| 32 | 12.14 | 238.73 | 8.33 | 148.11 |
| 128 | 7.81 | 411.52 | 12.44 | 259.44 |