Scenario 4: Chatbot Benchmarks in Generative AI

The chatbot scenario covers chatbot and dialog use cases in which prompts and responses are short; a sketch of generating this traffic pattern follows the list below.

  • The prompt length is fixed at 100 tokens.
  • The response length is fixed at 100 tokens.
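
As a rough illustration of this traffic pattern, the following Python sketch drives a hosted endpoint with fixed 100-token prompts and a 100-token response cap at a chosen concurrency. The `send_chat_request` stub and its parameters are placeholders for illustration, not part of any real SDK; wire it to your model's inference API.

```python
import concurrent.futures
import time

PROMPT_TOKENS = 100    # the scenario fixes the prompt length at 100 tokens
RESPONSE_TOKENS = 100  # the scenario fixes the response length at 100 tokens

# Placeholder for a real inference call (for example, through your
# endpoint's SDK). Here it only sleeps to stand in for generation time.
def send_chat_request(prompt: str, max_tokens: int) -> None:
    time.sleep(0.01 * max_tokens)  # simulate ~10 ms per generated token

# Crude fixed-length prompt: one repeated word per "token", sketch only.
PROMPT = " ".join(["hello"] * PROMPT_TOKENS)

def timed_request(_: int) -> float:
    """Issue one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    send_chat_request(PROMPT, max_tokens=RESPONSE_TOKENS)
    return time.perf_counter() - start

def run_at_concurrency(concurrency: int, total_requests: int) -> list[float]:
    """Keep up to `concurrency` requests in flight until all complete."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_request, range(total_requests)))

if __name__ == "__main__":
    latencies = run_at_concurrency(concurrency=8, total_requests=32)
    print(f"mean latency: {sum(latencies) / len(latencies):.2f} s")
```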
Important

The performance (inference speed, throughput, and latency) of a hosting dedicated AI cluster depends on the traffic scenario going through the model that it hosts. A traffic scenario is characterized by:

  1. The number of concurrent requests.
  2. The number of tokens in the prompt.
  3. The number of tokens in the response.
  4. The variance of (2) and (3) across requests.

Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. The chatbot scenario benchmarks were performed in the following regions.
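
As a rough guide to how the metric columns in the tables below relate to raw measurements, this sketch aggregates per-request records into the four reported quantities. The formulas are one plausible reading of the column names, not necessarily the exact methodology behind the published numbers.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    latency_s: float    # end-to-end latency of one request, in seconds
    output_tokens: int  # tokens generated for that request

def summarize(records: list[RequestRecord], wall_clock_s: float) -> dict[str, float]:
    """Aggregate a load-test run into benchmark-style metrics.

    wall_clock_s is the duration of the whole measurement window,
    during which the recorded requests ran concurrently.
    """
    n = len(records)
    total_tokens = sum(r.output_tokens for r in records)
    return {
        # average per-request generation speed (tokens/second)
        "token_inference_speed": sum(r.output_tokens / r.latency_s for r in records) / n,
        # aggregate tokens emitted per second across all concurrent requests
        "token_throughput": total_tokens / wall_clock_s,
        # mean end-to-end request latency (seconds)
        "request_latency": sum(r.latency_s for r in records) / n,
        # completed requests per minute (RPM) over the window
        "rpm": n / wall_clock_s * 60.0,
    }
```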

Brazil East (Sao Paulo)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) model hosted on one Large Generic 4 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |
Model: cohere.command-r-16k v1.2 (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |

Germany Central (Frankfurt)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) model hosted on one Large Generic 4 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |
Model: cohere.command-r-16k v1.2 (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |

UK South (London)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) model hosted on one Large Generic 4 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 52.05 | 52.57 | 1.95 | 30.80 |
| 2 | 50.70 | 100.90 | 2.00 | 59.19 |
| 4 | 49.96 | 192.32 | 2.06 | 112.89 |
| 8 | 47.75 | 369.74 | 2.15 | 216.13 |
| 16 | 44.36 | 643.94 | 2.30 | 377.65 |
| 32 | 36.74 | 982.39 | 2.74 | 576.42 |
| 64 | 31.27 | 1,605.80 | 3.23 | 942.49 |
| 128 | 20.59 | 1,841.44 | 4.96 | 1,082.95 |
| 256 | 11.49 | 2,333.32 | 8.88 | 1,368.63 |
Model: cohere.command-r-16k v1.2 (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |

US Midwest (Chicago)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 134.80 | 126.97 | 1.56 | 36.46 |
| 2 | 128.71 | 235.26 | 1.57 | 70.05 |
| 4 | 122.01 | 436.12 | 1.63 | 131.04 |
| 8 | 113.84 | 762.01 | 1.81 | 222.59 |
| 16 | 101.20 | 1,177.66 | 1.99 | 347.43 |
| 32 | 83.96 | 2,021.49 | 2.31 | 610.16 |
| 64 | 64.47 | 3,191.72 | 3.07 | 950.61 |
| 128 | 43.12 | 3,772.60 | 4.92 | 1,120.64 |
| 256 | 21.76 | 4,094.46 | 8.56 | 1,212.42 |
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) model hosted on one Large Cohere V2_2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 94.04 | 87.41 | 1.95 | 29.44 |
| 2 | 88.13 | 163.85 | 1.93 | 58.04 |
| 4 | 86.49 | 315.44 | 2.03 | 108.02 |
| 8 | 80.10 | 550.10 | 2.39 | 171.44 |
| 16 | 70.13 | 861.65 | 2.56 | 288.47 |
| 32 | 62.39 | 1,517.61 | 3.06 | 476.62 |
| 64 | 42.36 | 2,139.38 | 3.76 | 753.58 |
| 128 | 29.22 | 3,137.09 | 5.74 | 1,023.88 |
| 256 | 17.13 | 3,229.42 | 9.78 | 1,117.58 |
Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision) model (text input only) hosted on one Large Generic V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 50.20 | 48.67 | 2.05 | 29.20 |
| 2 | 49.53 | 96.67 | 2.06 | 58.00 |
| 4 | 49.08 | 188.00 | 2.12 | 112.80 |
| 8 | 48.40 | 356.00 | 2.23 | 213.60 |
| 16 | 47.26 | 645.33 | 2.44 | 387.20 |
| 32 | 42.22 | 1,077.33 | 2.90 | 646.40 |
| 64 | 44.95 | 1,162.65 | 5.41 | 697.59 |
| 128 | 44.92 | 1,162.64 | 10.84 | 697.58 |
| 256 | 45.02 | 1,162.21 | 21.58 | 697.32 |
Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision) model (text input only) hosted on one Small Generic V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 111.04 | 109.67 | 0.91 | 65.80 |
| 2 | 108.57 | 212.33 | 0.91 | 127.40 |
| 4 | 105.67 | 408.00 | 0.91 | 244.80 |
| 8 | 102.65 | 408.00 | 1.02 | 461.60 |
| 16 | 96.48 | 1,370.66 | 1.13 | 822.40 |
| 32 | 78.96 | 2,110.49 | 1.42 | 822.40 |
| 64 | 89.80 | 2,522.64 | 2.41 | 1,513.58 |
| 128 | 89.69 | 2,516.96 | 4.94 | 1,510.17 |
| 256 | 90.27 | 2,517.19 | 9.96 | 1,510.31 |
Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) model hosted on one Large Generic 4 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 28.93 | 21.65 | 4.60 | 13.01 |
| 2 | 31.72 | 50.89 | 3.90 | 30.54 |
| 4 | 30.86 | 91.23 | 4.17 | 54.74 |
| 8 | 29.61 | 163.06 | 4.33 | 97.84 |
| 16 | 27.66 | 277.48 | 4.49 | 166.49 |
| 32 | 26.01 | 615.83 | 4.77 | 369.50 |
| 64 | 22.49 | 1,027.87 | 5.67 | 616.77 |
| 128 | 17.22 | 1,527.06 | 7.37 | 616.77 |
| 256 | 10.67 | 1,882.65 | 11.44 | 1,131.71 |
Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 97.11 | 51.67 | 1.98 | 30.14 |
| 2 | 95.38 | 99.17 | 2.04 | 57.87 |
| 4 | 93.91 | 183.96 | 2.10 | 107.50 |
| 8 | 89.79 | 318.53 | 2.23 | 186.09 |
| 16 | 81.05 | 506.12 | 2.47 | 294.03 |
| 32 | 64.15 | 909.40 | 3.18 | 530.15 |
| 64 | 50.35 | 1,405.67 | 4.08 | 818.96 |
| 128 | 33.59 | 1,786.60 | 6.26 | 1,040.74 |
| 256 | 18.77 | 1,866.83 | 11.43 | 1,086.94 |
Model: meta.llama-3-70b-instruct (Meta Llama 3 70B) model hosted on one Large Generic unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 31.07 | 31.12 | 3.28 | 18.29 |
| 2 | 30.33 | 59.43 | 3.40 | 34.88 |
| 4 | 29.39 | 113.76 | 3.51 | 66.48 |
| 8 | 27.14 | 210.00 | 3.77 | 123.22 |
| 16 | 24.04 | 351.38 | 4.24 | 205.78 |
| 32 | 19.40 | 523.68 | 5.23 | 306.44 |
| 64 | 16.12 | 837.45 | 6.28 | 491.00 |
| 128 | 9.48 | 920.97 | 10.63 | 541.91 |
| 256 | 5.73 | 1,211.95 | 17.79 | 713.19 |
Model: cohere.command-r-16k v1.2 (Cohere Command R) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 42.36 | 38.82 | 2.23 | 26.07 |
| 2 | 42.49 | 77.95 | 2.18 | 52.86 |
| 4 | 42.15 | 155.04 | 2.15 | 106.28 |
| 8 | 39.72 | 274.21 | 2.33 | 192.82 |
| 16 | 37.28 | 527.72 | 2.36 | 366.20 |
| 32 | 32.87 | 828.91 | 2.88 | 538.91 |
| 64 | 24.48 | 1,175.93 | 3.40 | 816.00 |
| 128 | 19.21 | 1,522.53 | 5.38 | 1,023.93 |
| 256 | 10.11 | 1,668.07 | 8.49 | 1,127.35 |
Model: cohere.command-r-plus (Cohere Command R+) model hosted on one Large Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 112.29 | 95.11 | 1.82 | 31.65 |
| 2 | 109.27 | 186.61 | 1.91 | 60.55 |
| 4 | 104.19 | 350.17 | 1.98 | 115.70 |
| 8 | 93.66 | 625.10 | 2.24 | 200.55 |
| 16 | 84.60 | 1,087.14 | 2.46 | 354.44 |
| 32 | 68.80 | 1,718.20 | 2.96 | 557.70 |
| 64 | 53.25 | 2,455.21 | 3.53 | 827.78 |
| 128 | 38.02 | 3,366.97 | 5.48 | 1,113.31 |
| 256 | 25.19 | 3,983.61 | 8.35 | 1,322.15 |
Model: cohere.command (Cohere Command 52B) model hosted on one Small Cohere V2 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 34.98 | 28.85 | 3.21 | 17.30 |
| 8 | 29.51 | 119.83 | 5.34 | 71.62 |
| 32 | 27.44 | 293.58 | 5.91 | 177.09 |
| 128 | 25.56 | 482.88 | 6.67 | 291.95 |
Model: cohere.command-light (Cohere Command Light 6B) model hosted on one Small Cohere unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 71.85 | 54.49 | 1.74 | 30.21 |
| 8 | 41.91 | 191.52 | 2.87 | 105.63 |
| 32 | 31.37 | 395.49 | 3.55 | 216.87 |
| 128 | 28.27 | 557.57 | 3.90 | 302.44 |
Model: meta.llama-2-70b-chat (Meta Llama 2 70B) model hosted on one Llama2 70 unit of a dedicated AI cluster
| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---|---|---|---|---|
| 1 | 17.65 | 15.92 | 5.88 | 9.76 |
| 8 | 14.95 | 91.02 | 6.44 | 59.32 |
| 32 | 12.14 | 238.73 | 8.33 | 148.11 |
| 128 | 7.81 | 411.52 | 12.44 | 259.44 |