Scenario 1: Stochastic Length Benchmarks in Generative AI

This scenario mimics text generation use cases where the sizes of the prompt and response are unknown ahead of time. Because both lengths are unknown, we've used a stochastic approach in which each follows a normal distribution (see the sampling sketch after this list):

  • The prompt length follows a normal distribution with a mean of 480 tokens and a standard deviation of 240 tokens.
  • The response length follows a normal distribution with a mean of 300 tokens and a standard deviation of 150 tokens.
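
As a rough illustration, the following Python sketch draws request shapes the way this scenario describes them. It is a minimal sketch of the sampling step, not the benchmark harness itself; rounding to whole tokens and clipping at a 1-token minimum are our assumptions, since a normal distribution can produce non-positive draws.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def sample_request_shapes(n_requests: int):
    """Sample (prompt_tokens, response_tokens) pairs for this scenario.

    Prompt length   ~ Normal(mean=480, sd=240) tokens
    Response length ~ Normal(mean=300, sd=150) tokens
    """
    prompts = rng.normal(loc=480, scale=240, size=n_requests)
    responses = rng.normal(loc=300, scale=150, size=n_requests)
    # Assumption: round to whole tokens and clip to at least 1 token,
    # because a normal distribution can yield non-positive values.
    prompts = np.clip(np.rint(prompts), 1, None).astype(int)
    responses = np.clip(np.rint(responses), 1, None).astype(int)
    return prompts, responses

prompt_lens, response_lens = sample_request_shapes(1000)
print(prompt_lens.mean(), response_lens.mean())  # close to 480 and 300
```
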
Important

The performance (inference speed, throughput, latency) of a hosting dedicated AI cluster depends on the traffic scenario going through the model that it hosts. A traffic scenario depends on the following factors (a rough sanity check relating them to the benchmark columns follows this list):

  1. The number of concurrent requests.
  2. The number of tokens in the prompt.
  3. The number of tokens in the response.
  4. The variance of (2) and (3) across requests.
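
For example, at steady state Little's law (L = λ·W) ties the number of concurrent requests (factor 1) to request-level latency and request-level throughput. The sketch below is a back-of-the-envelope check against one row of the first Brazil East table, not a description of how the published numbers were measured; the steady-state assumption is ours.

```python
# Little's law at steady state: L = lambda * W, where L is the number of
# requests in flight, lambda the request completion rate, and W the mean
# latency. Rearranged: lambda = L / W, converted to requests per minute.
# Values are from the Cohere Command R 08-2024 table (concurrency 8).
concurrency = 8            # L: concurrent requests
latency_s = 4.97           # W: request-level latency in seconds
rpm_estimate = concurrency / latency_s * 60
print(f"~{rpm_estimate:.0f} RPM")  # ~97 RPM vs. the 84.62 RPM reported,
                                   # so the approximation is rough but in range
```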

Review the terms used in the hosting dedicated AI cluster benchmarks. For a list of scenarios and their descriptions, see Chat and Text Generation Scenarios. This scenario is performed in the following regions.

Brazil East (Sao Paulo)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

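One way to read these columns (our illustration, not part of the published benchmark): token-level throughput is roughly token-level inference speed multiplied by concurrency until the unit saturates, so the ratio of the two indicates how well a unit scales with load. A quick check against three rows of the table above:

```python
# Scaling efficiency: measured token-level throughput relative to the
# ideal of (per-request inference speed x concurrency). Rows are taken
# from the Cohere Command R 08-2024 table above.
rows = [(1, 143.82, 142.16), (32, 88.15, 2126.25), (256, 24.14, 4784.32)]
for concurrency, speed_tps, throughput_tps in rows:
    efficiency = throughput_tps / (speed_tps * concurrency)
    print(f"concurrency {concurrency:>3}: {efficiency:.0%} of ideal")
# concurrency   1: 99% of ideal
# concurrency  32: 75% of ideal
# concurrency 256: 77% of ideal
```
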
Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision), text input only, hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 48.75 | 47.98 | 6.37 | 9.40 |
| 2 | 47.28 | 92.89 | 6.63 | 18.00 |
| 4 | 45.10 | 176.53 | 6.65 | 35.80 |
| 8 | 42.53 | 333.45 | 7.04 | 67.80 |
| 16 | 38.39 | 597.84 | 7.95 | 119.70 |
| 32 | 29.86 | 929.18 | 10.12 | 187.40 |
| 64 | 30.00 | 933.09 | 20.11 | 187.20 |
| 128 | 30.03 | 934.30 | 39.85 | 186.00 |
| 256 | 30.05 | 932.61 | 76.19 | 187.79 |

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision), text input only, hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 105.74 | 104.30 | 2.75 | 21.70 |
| 2 | 103.21 | 204.22 | 2.82 | 42.40 |
| 4 | 99.41 | 393.69 | 3.10 | 77.10 |
| 8 | 93.98 | 745.29 | 3.26 | 146.70 |
| 16 | 81.62 | 1,294.14 | 3.64 | 262.60 |
| 32 | 60.55 | 1,924.74 | 4.97 | 384.40 |
| 64 | 60.54 | 1,928.70 | 10.03 | 379.40 |
| 128 | 62.57 | 1,912.53 | 19.68 | 383.09 |
| 256 | 60.00 | 1,911.45 | 38.36 | 386.14 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 49.76 | 49.58 | 6.42 | 9.33 |
| 2 | 48.04 | 95.38 | 6.80 | 17.53 |
| 4 | 46.09 | 181.21 | 6.99 | 33.60 |
| 8 | 44.19 | 330.46 | 7.43 | 60.67 |
| 16 | 40.56 | 591.52 | 8.40 | 104.42 |
| 32 | 31.35 | 869.36 | 9.68 | 168.46 |
| 64 | 23.87 | 1,062.52 | 12.57 | 201.11 |
| 128 | 16.86 | 1,452.66 | 17.64 | 276.09 |
| 256 | 9.84 | 1,792.81 | 30.08 | 347.26 |

Model: cohere.command-r-16k v1.2 (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

Germany Central (Frankfurt)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 49.76 | 49.58 | 6.42 | 9.33 |
| 2 | 48.04 | 95.38 | 6.80 | 17.53 |
| 4 | 46.09 | 181.21 | 6.99 | 33.60 |
| 8 | 44.19 | 330.46 | 7.43 | 60.67 |
| 16 | 40.56 | 591.52 | 8.40 | 104.42 |
| 32 | 31.35 | 869.36 | 9.68 | 168.46 |
| 64 | 23.87 | 1,062.52 | 12.57 | 201.11 |
| 128 | 16.86 | 1,452.66 | 17.64 | 276.09 |
| 256 | 9.84 | 1,792.81 | 30.08 | 347.26 |

Model: cohere.command-r-16k v1.2 (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

UK South (London)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision), text input only, hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 48.75 | 47.98 | 6.37 | 9.40 |
| 2 | 47.28 | 92.89 | 6.63 | 18.00 |
| 4 | 45.10 | 176.53 | 6.65 | 35.80 |
| 8 | 42.53 | 333.45 | 7.04 | 67.80 |
| 16 | 38.39 | 597.84 | 7.95 | 119.70 |
| 32 | 29.86 | 929.18 | 10.12 | 187.40 |
| 64 | 30.00 | 933.09 | 20.11 | 187.20 |
| 128 | 30.03 | 934.30 | 39.85 | 186.00 |
| 256 | 30.05 | 932.61 | 76.19 | 187.79 |

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision), text input only, hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 105.74 | 104.30 | 2.75 | 21.70 |
| 2 | 103.21 | 204.22 | 2.82 | 42.40 |
| 4 | 99.41 | 393.69 | 3.10 | 77.10 |
| 8 | 93.98 | 745.29 | 3.26 | 146.70 |
| 16 | 81.62 | 1,294.14 | 3.64 | 262.60 |
| 32 | 60.55 | 1,924.74 | 4.97 | 384.40 |
| 64 | 60.54 | 1,928.70 | 10.03 | 379.40 |
| 128 | 62.57 | 1,912.53 | 19.68 | 383.09 |
| 256 | 60.00 | 1,911.45 | 38.36 | 386.14 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 49.76 | 49.58 | 6.42 | 9.33 |
| 2 | 48.04 | 95.38 | 6.80 | 17.53 |
| 4 | 46.09 | 181.21 | 6.99 | 33.60 |
| 8 | 44.19 | 330.46 | 7.43 | 60.67 |
| 16 | 40.56 | 591.52 | 8.40 | 104.42 |
| 32 | 31.35 | 869.36 | 9.68 | 168.46 |
| 64 | 23.87 | 1,062.52 | 12.57 | 201.11 |
| 128 | 16.86 | 1,452.66 | 17.64 | 276.09 |
| 256 | 9.84 | 1,792.81 | 30.08 | 347.26 |

Model: cohere.command-r-16k v1.2 (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

US Midwest (Chicago)

Model: cohere.command-r-08-2024 (Cohere Command R 08-2024) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 143.82 | 142.16 | 3.89 | 15.07 |
| 2 | 141.16 | 276.64 | 4.28 | 27.37 |
| 4 | 136.15 | 517.89 | 4.98 | 45.85 |
| 8 | 121.71 | 858.28 | 4.97 | 84.62 |
| 16 | 105.84 | 1,243.61 | 5.53 | 122.45 |
| 32 | 88.15 | 2,126.25 | 6.53 | 210.29 |
| 64 | 67.40 | 3,398.12 | 8.63 | 319.28 |
| 128 | 45.86 | 4,499.76 | 13.96 | 427.76 |
| 256 | 24.14 | 4,784.32 | 25.79 | 453.83 |

Model: cohere.command-r-plus-08-2024 (Cohere Command R+ 08-2024) hosted on one Large Cohere V2_2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 119.49 | 118.18 | 4.50 | 13.08 |
| 2 | 115.14 | 225.40 | 4.90 | 23.69 |
| 4 | 109.71 | 404.66 | 4.63 | 48.83 |
| 8 | 95.83 | 702.76 | 5.03 | 85.92 |
| 16 | 81.12 | 1,029.98 | 6.07 | 125.54 |
| 32 | 70.92 | 1,819.24 | 7.02 | 182.65 |
| 64 | 52.10 | 2,778.58 | 8.79 | 313.12 |
| 128 | 35.58 | 3,566.59 | 13.80 | 438.64 |
| 256 | 20.75 | 4,065.93 | 24.69 | 481.11 |

Model: meta.llama-3.2-90b-vision-instruct (Meta Llama 3.2 90B Vision), text input only, hosted on one Large Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 48.75 | 47.98 | 6.37 | 9.40 |
| 2 | 47.28 | 92.89 | 6.63 | 18.00 |
| 4 | 45.10 | 176.53 | 6.65 | 35.80 |
| 8 | 42.53 | 333.45 | 7.04 | 67.80 |
| 16 | 38.39 | 597.84 | 7.95 | 119.70 |
| 32 | 29.86 | 929.18 | 10.12 | 187.40 |
| 64 | 30.00 | 933.09 | 20.11 | 187.20 |
| 128 | 30.03 | 934.30 | 39.85 | 186.00 |
| 256 | 30.05 | 932.61 | 76.19 | 187.79 |

Model: meta.llama-3.2-11b-vision-instruct (Meta Llama 3.2 11B Vision), text input only, hosted on one Small Generic V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 105.74 | 104.30 | 2.75 | 21.70 |
| 2 | 103.21 | 204.22 | 2.82 | 42.40 |
| 4 | 99.41 | 393.69 | 3.10 | 77.10 |
| 8 | 93.98 | 745.29 | 3.26 | 146.70 |
| 16 | 81.62 | 1,294.14 | 3.64 | 262.60 |
| 32 | 60.55 | 1,924.74 | 4.97 | 384.40 |
| 64 | 60.54 | 1,928.70 | 10.03 | 379.40 |
| 128 | 62.57 | 1,912.53 | 19.68 | 383.09 |
| 256 | 60.00 | 1,911.45 | 38.36 | 386.14 |

Model: meta.llama-3.1-405b-instruct (Meta Llama 3.1 405B) hosted on one Large Generic 4 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 32.66 | 25.79 | 10.78 | 5.56 |
| 2 | 31.36 | 50.81 | 10.06 | 11.68 |
| 4 | 29.86 | 96.01 | 10.87 | 21.52 |
| 8 | 27.89 | 170.45 | 10.87 | 34.09 |
| 16 | 24.74 | 282.52 | 13.51 | 60.35 |
| 32 | 21.51 | 457.24 | 16.73 | 91.42 |
| 64 | 17.68 | 676.90 | 18.29 | 152.47 |
| 128 | 13.06 | 1,035.08 | 25.59 | 222.67 |
| 256 | 7.82 | 1,302.71 | 41.88 | 289.08 |

Model: meta.llama-3.1-70b-instruct (Meta Llama 3.1 70B) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 95.50 | 51.58 | 6.12 | 9.78 |
| 2 | 92.25 | 98.89 | 6.44 | 18.53 |
| 4 | 90.51 | 184.54 | 7.37 | 30.67 |
| 8 | 83.38 | 326.71 | 7.64 | 57.06 |
| 16 | 71.45 | 509.03 | 8.77 | 90.02 |
| 32 | 58.48 | 724.23 | 10.00 | 138.82 |
| 64 | 44.74 | 1,146.92 | 14.07 | 206.58 |
| 128 | 27.00 | 1,434.57 | 22.48 | 268.58 |
| 256 | 18.03 | 1,635.95 | 41.06 | 309.97 |

Model: meta.llama-3-70b-instruct (Meta Llama 3) hosted on one Large Generic unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 30.51 | 30.36 | 10.47 | 5.73 |
| 2 | 28.85 | 57.37 | 11.09 | 10.68 |
| 4 | 27.99 | 108.49 | 11.13 | 21.08 |
| 8 | 25.61 | 196.68 | 13.27 | 34.65 |
| 16 | 21.97 | 318.82 | 15.36 | 56.37 |
| 32 | 16.01 | 428.45 | 18.55 | 82.88 |
| 64 | 11.60 | 563.70 | 24.31 | 108.58 |
| 128 | 7.50 | 650.40 | 40.64 | 40.64 |
| 256 | 4.58 | 927.31 | 67.42 | 172.42 |

Model: cohere.command-r-16k v1.2 (Cohere Command R) hosted on one Small Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 51.30 | 50.46 | 4.63 | 12.75 |
| 2 | 51.06 | 97.86 | 5.07 | 23.14 |
| 4 | 47.52 | 186.75 | 5.30 | 44.48 |
| 8 | 43.55 | 305.45 | 5.68 | 75.18 |
| 16 | 36.49 | 505.11 | 6.71 | 127.88 |
| 32 | 29.02 | 768.40 | 8.84 | 177.03 |
| 64 | 18.57 | 735.37 | 14.55 | 168.00 |
| 128 | 12.59 | 809.50 | 21.27 | 186.76 |
| 256 | 6.54 | 859.45 | 38.69 | 200.42 |

Model: cohere.command-r-plus (Cohere Command R+) hosted on one Large Cohere V2 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 122.46 | 101.28 | 4.31 | 13.21 |
| 2 | 114.38 | 177.67 | 5.70 | 17.78 |
| 4 | 107.48 | 367.88 | 5.09 | 45.22 |
| 8 | 95.32 | 644.56 | 7.23 | 62.61 |
| 16 | 82.42 | 1,036.84 | 7.91 | 62.61 |
| 32 | 66.46 | 1,529.28 | 10.12 | 145.82 |
| 64 | 45.70 | 1,924.84 | 12.43 | 206.26 |
| 128 | 33.96 | 2,546.35 | 18.22 | 272.53 |
| 256 | 23.86 | 2,914.77 | 30.75 | 298.88 |

Model: cohere.command (Cohere Command 52B) hosted on one Large Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 36.32 | 31.29 | 8.15 | 7.12 |
| 8 | 30.15 | 106.03 | 13.19 | 23.86 |
| 32 | 23.94 | 204.41 | 23.90 | 45.84 |
| 128 | 14.36 | 254.54 | 65.26 | 56.58 |

Model: cohere.command-light (Cohere Command Light 6B) hosted on one Small Cohere unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 69.17 | 69.19 | 3.57 | 15.69 |
| 8 | 38.75 | 208.22 | 6.54 | 45.08 |
| 32 | 17.98 | 337.35 | 13.49 | 75.50 |
| 128 | 4.01 | 397.36 | 37.69 | 92.17 |

Model: meta.llama-2-70b-chat (Llama 2 70B) hosted on one Llama2 70 unit of a dedicated AI cluster

| Concurrency | Token-level Inference Speed (tokens/second) | Token-level Throughput (tokens/second) | Request-level Latency (seconds) | Request-level Throughput (requests per minute, RPM) |
|---:|---:|---:|---:|---:|
| 1 | 17.86 | 17.18 | 13.60 | 4.32 |
| 8 | 14.48 | 68.62 | 16.63 | 16.58 |
| 32 | 9.82 | 174.40 | 20.78 | 44.58 |
| 128 | 3.89 | 319.34 | 43.87 | 85.33 |