I am 100% sure all the LLM benchmarks are, well, let's just say incomplete. They just don't work in real-world scenarios; they only do well hypothetically.
We need domain- and industry-specific benchmarks, and we need them now.
Anyone creating anything lik