Cohere Rerank 3.5

Review performance benchmarks for the cohere.rerank.3-5 (Cohere Rerank 3.5) model hosted on one RERANK_COHERE unit of a dedicated AI cluster in OCI Generative AI.

A rerank model takes a query and a list of texts as input and ranks the texts by their relevance score, that is, by how well each text matches the query.
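The input and output shape described above can be illustrated with a toy sketch. The scoring function here is a simple word-overlap stand-in chosen for illustration only; it is not how Rerank 3.5 computes relevance, and the names `toy_relevance` and `rerank` are hypothetical.

```python
def toy_relevance(query: str, text: str) -> float:
    """Stand-in score: fraction of query words that also appear in the text.

    This is NOT the model's scoring method, only a placeholder.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query: str, texts: list[str]) -> list[tuple[int, float]]:
    """Return (original index, score) pairs, highest score first."""
    scored = [(i, toy_relevance(query, t)) for i, t in enumerate(texts)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


texts = [
    "weather report for today",
    "a dedicated ai cluster hosts the model",
    "cluster sizing notes",
]
for index, score in rerank("dedicated ai cluster", texts):
    print(f"{score:.2f}  {texts[index]}")
```

A real rerank model replaces the stand-in score with a learned relevance score, but the contract is the same: one query and a list of texts go in, and an ordering of those texts comes out.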

Rerank 3.5 Benchmark Scenarios
  • The query is 100 tokens for all scenarios.
  • All scenarios have only one supporting document that's 10,000 tokens long.
  • Each scenario chunks this 10,000-token document based on a max_tokens_per_doc parameter. The values tested are 64, 128, 256, 512, 1024, 2048, and 4096.
  • The maximum chunk size is 4096 tokens, which is the maximum number of tokens that a Rerank 3.5 model can process in one pass.
  • Because the document is 10,000 tokens long and the model's context length is 4096 tokens, the document is broken into chunks in every scenario.
  • Each chunk includes:
    • Padding tokens: To ensure the input fits the model's expected format.
    • The query: 100 tokens.
    • A document section: For example, for a max_tokens_per_doc of 4096 tokens, each chunk includes one of the following document sections:
      • Document section 1: Tokens 0 to 3,992 of the document.
      • Document section 2: Tokens 3,993 to 7,985 of the document.
      • Document section 3: Tokens 7,986 to 9,999 of the document. This section is shorter than the other two because the document is only 10,000 tokens long.
  • Each benchmark scenario is defined by R(max_tokens_per_doc, 100).
  • See details for the model and review the following sections:
    • Available regions for this model.
    • Dedicated AI clusters for hosting this model.
  • Review the metrics.
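The document sections listed for the max_tokens_per_doc of 4096 example can be reproduced with a short sketch. It assumes a section length of 3,993 tokens, which is read off the example boundaries (tokens 0 to 3,992); the source does not state how section lengths are derived for the other max_tokens_per_doc values, so the sketch does not attempt that. The function name `document_sections` is hypothetical.

```python
def document_sections(doc_tokens: int, section_tokens: int) -> list[tuple[int, int]]:
    """Split a document into consecutive sections.

    Returns inclusive (start, end) token offsets. The last section may be
    shorter than the others when the document length is not a multiple of
    the section length.
    """
    if doc_tokens <= 0 or section_tokens <= 0:
        raise ValueError("doc_tokens and section_tokens must be positive")
    sections = []
    start = 0
    while start < doc_tokens:
        end = min(start + section_tokens, doc_tokens) - 1
        sections.append((start, end))
        start = end + 1
    return sections


# Assumption: 3,993-token sections, taken from the 4096 example above.
for number, (start, end) in enumerate(document_sections(10_000, 3_993), start=1):
    print(f"Document section {number}: tokens {start} to {end} ({end - start + 1} tokens)")
```

Under that assumption, the first two sections hold 3,993 tokens each and the third holds the remaining 2,014 tokens.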

R(64,100)

Batch Size    Request-level Latency (seconds)    Request-level Throughput (requests per second, RPS)
1             0.13                               7.64
2             0.11                               8.96
4             0.11                               9.12
8             0.11                               9.06
24            0.12                               8.33
48            0.14                               7.19
96            0.17                               5.86

R(128,100)

Batch Size    Request-level Latency (seconds)    Request-level Throughput (requests per second, RPS)
1             0.11                               9.15
2             0.11                               9.12
4             0.11                               9.00
8             0.11                               8.81
24            0.13                               7.71
48            0.16                               6.34
96            0.20                               4.81

R(256,100)

Batch Size    Request-level Latency (seconds)    Request-level Throughput (requests per second, RPS)
1             0.11                               9.10
2             0.11                               9.03
4             0.11                               8.73
8             0.12                               8.14
24            0.15                               6.47
48            0.20                               4.91
96            0.28                               3.52

R(512,100)

Batch Size    Request-level Latency (seconds)    Request-level Throughput (requests per second, RPS)
1             0.11                               8.94
2             0.11                               8.61
4             0.12                               7.91
8             0.14                               6.85
24            0.20                               4.87
48            0.30                               3.22
96            0.54                               1.83

R(1024,100)

Batch Size    Request-level Latency (seconds)    Request-level Throughput (requests per second, RPS)
1             0.12                               8.11
2             0.13                               7.22
4             0.15                               6.24
8             0.19                               4.99
24            0.45                               2.20
48            0.73                               1.34
96            1.38                               0.72

R(2048,100)

Batch Size    Request-level Latency (seconds)    Request-level Throughput (requests per second, RPS)
1             0.15                               6.13
2             0.18                               5.14
4             0.25                               3.84
8             0.38                               2.52
24            1.05                               0.94
48            2.01                               0.49
96            3.77                               0.26

R(4096,100)

Batch Size    Request-level Latency (seconds)    Request-level Throughput (requests per second, RPS)
1             0.19                               4.65
2             0.25                               3.71
4             0.39                               2.43
8             0.78                               1.24
24            1.98                               0.49
48            3.80                               0.26
96            7.35                               0.14