Very interested in getting a discussion going here. I work extensively with LLMs in my day job. We record things like tokens in, tokens out, and a score for each response (using a different LLM and prompt to evaluate the first output), then do human QA on low-scoring responses and a sample of the others, where we again record the number of edits and what was edited (rough sketch further down).

Are you able to give examples of the metrics you record?
Are they traditional hardware stuff like CPU, GPU, memory usage, I/O, etc., or LLM-specific?
We are mostly using the OpenAI API and externally hosted, fine-tuned Databricks DBRX models. We host and train our own traditional NLP and clustering models in AWS SageMaker as well.

It is a great topic. I have realized that, whether or not my employer invests in it, I need to spend a bit more time on these APIs. What tools and hardware are you using?
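To make the evaluation loop described above concrete, here is a minimal sketch of what that kind of logging could look like. It uses the OpenAI Python SDK since that API is mentioned in the thread, but the model names, judge prompt, JSONL log file, score threshold, and 5% sampling rate are all assumptions for illustration, not details from the original comments.

```python
# Minimal sketch of the logging/eval loop described above.
# Assumptions (not from the thread): OpenAI Python SDK v1, a local JSONL log,
# a made-up judge prompt, and an arbitrary threshold for routing to human QA.
import json
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "Rate the following answer from 1 (unusable) to 10 (excellent). "
    "Reply with the number only.\n\nQuestion: {question}\n\nAnswer: {answer}"
)

def generate_and_score(question: str) -> dict:
    # 1) Primary completion: record tokens in / tokens out.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = resp.choices[0].message.content

    # 2) Second LLM call with a different prompt to score the first output.
    #    (Assumes the judge replies with a bare number, as instructed.)
    judge = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    score = int(judge.choices[0].message.content.strip())

    record = {
        "question": question,
        "answer": answer,
        "tokens_in": resp.usage.prompt_tokens,
        "tokens_out": resp.usage.completion_tokens,
        "judge_score": score,
        # 3) Route low scores, plus a random sample of the rest, to human QA,
        #    where the number of edits and what was edited get recorded later.
        "needs_human_qa": score < 6 or random.random() < 0.05,
    }
    with open("llm_metrics.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

In a production setup the records would presumably land in a proper store (a warehouse table or observability platform) rather than a local JSONL file, but the shape of the data is the same.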