Cohere Rerank 3.5

Review performance benchmarks for the cohere.rerank.3-5 (Cohere Rerank 3.5) model hosted on one RERANK_COHERE unit of a dedicated AI cluster in OCI Generative AI.

A rerank model takes a query and a list of texts as input and ranks the texts by relevance score, that is, by how well each text matches the query.

Rerank 3.5 Benchmark Scenarios

The query is 100 tokens for all scenarios.
All scenarios have only one supporting document that's 10,000 tokens long.
Each scenario chunks this 10,000-token document based on a max_tokens_per_doc parameter. The tested values are 64, 128, 256, 512, 1024, 2048, and 4096.
The maximum chunk size is 4096 tokens, which is the maximum number of tokens that a Rerank 3.5 model can process in one pass.
Because the document is 10,000 tokens long and the model's context length is 4096 tokens, the document is broken into chunks in every scenario.
Each chunk includes:
- Padding tokens: To ensure the input fits the model's expected format.
- The query: 100 tokens.
- A document section: For example, for a max_tokens_per_doc of 4096 tokens, each chunk includes one of the following document sections:
  - Document section 1: Tokens 0 to 3,992 of the document.
  - Document section 2: Tokens 3,993 to 7,985 of the document.
  - Document section 3: Tokens 7,986 to 9,999 of the document. This section is smaller than the other two because the document is only 10,000 tokens long.
Each benchmark scenario is named R(max_tokens_per_doc, 100), where 100 is the query length in tokens.
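
The section boundaries in the 4096-token example can be reproduced with a little arithmetic. The following is a minimal sketch, not the service's implementation: the function name `document_sections` is hypothetical, and the 3-token overhead is an assumption inferred from the example (4096 minus the 100-token query minus the 3,993-token sections). The source does not spell out how the smaller max_tokens_per_doc values divide the chunk, so the sketch covers only the documented 4096 case.

```python
def document_sections(doc_tokens, max_tokens_per_doc, query_tokens=100, overhead_tokens=3):
    """Return inclusive (start, end) token ranges for each document section.

    overhead_tokens is an assumed padding/special-token count, inferred from
    the documented example; it is not stated in the source.
    """
    # Tokens left for the document after the query and the assumed overhead.
    section_len = max_tokens_per_doc - query_tokens - overhead_tokens
    if section_len <= 0:
        raise ValueError("chunk budget leaves no room for document tokens")
    return [
        (start, min(start + section_len, doc_tokens) - 1)
        for start in range(0, doc_tokens, section_len)
    ]


if __name__ == "__main__":
    print(document_sections(10_000, 4096))
    # [(0, 3992), (3993, 7985), (7986, 9999)]
```

With these assumptions, the output matches the three document sections listed above.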

See details for the model and review the following sections:
- Available regions for this model.
- Dedicated AI clusters for hosting this model.
Review the metrics.

R(64,100)


Batch Size	Request-level Latency (seconds)	Request-level Throughput (requests per second, RPS)
1	0.13	7.64
2	0.11	8.96
4	0.11	9.12
8	0.11	9.06
24	0.12	8.33
48	0.14	7.19
96	0.17	5.86

R(128,100)


Batch Size	Request-level Latency (seconds)	Request-level Throughput (requests per second, RPS)
1	0.11	9.15
2	0.11	9.12
4	0.11	9.00
8	0.11	8.81
24	0.13	7.71
48	0.16	6.34
96	0.20	4.81

R(256,100)


Batch Size	Request-level Latency (seconds)	Request-level Throughput (requests per second, RPS)
1	0.11	9.10
2	0.11	9.03
4	0.11	8.73
8	0.12	8.14
24	0.15	6.47
48	0.20	4.91
96	0.28	3.52

R(512,100)


Batch Size	Request-level Latency (seconds)	Request-level Throughput (requests per second, RPS)
1	0.11	8.94
2	0.11	8.61
4	0.12	7.91
8	0.14	6.85
24	0.20	4.87
48	0.30	3.22
96	0.54	1.83

R(1024,100)


Batch Size	Request-level Latency (seconds)	Request-level Throughput (requests per second, RPS)
1	0.12	8.11
2	0.13	7.22
4	0.15	6.24
8	0.19	4.99
24	0.45	2.20
48	0.73	1.34
96	1.38	0.72

R(2048,100)


Batch Size	Request-level Latency (seconds)	Request-level Throughput (requests per second, RPS)
1	0.15	6.13
2	0.18	5.14
4	0.25	3.84
8	0.38	2.52
24	1.05	0.94
48	2.01	0.49
96	3.77	0.26

R(4096,100)


Batch Size	Request-level Latency (seconds)	Request-level Throughput (requests per second, RPS)
1	0.19	4.65
2	0.25	3.71
4	0.39	2.43
8	0.78	1.24
24	1.98	0.49
48	3.80	0.26
96	7.35	0.14
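
One way to use these tables is to pick a batch size that meets a latency target. The sketch below copies the R(4096,100) rows and returns the row with the largest batch size whose request-level latency stays within a budget. The function name `largest_batch_within` and the 1.0-second budget are hypothetical choices for illustration only.

```python
# Rows copied from the R(4096,100) table:
# (batch size, request-level latency in seconds, request-level throughput in RPS).
R_4096_100 = [
    (1, 0.19, 4.65),
    (2, 0.25, 3.71),
    (4, 0.39, 2.43),
    (8, 0.78, 1.24),
    (24, 1.98, 0.49),
    (48, 3.80, 0.26),
    (96, 7.35, 0.14),
]


def largest_batch_within(rows, latency_budget_s):
    """Return the row with the largest batch size whose latency fits the budget, or None."""
    fitting = [row for row in rows if row[1] <= latency_budget_s]
    return max(fitting, key=lambda row: row[0]) if fitting else None


if __name__ == "__main__":
    print(largest_batch_within(R_4096_100, 1.0))
    # (8, 0.78, 1.24)
```

The same helper works for any of the other scenario tables if their rows are copied in the same tuple layout.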

Oracle Cloud Infrastructure Documentation