Please make sure your PR follows these guidelines. Failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee that the PR will be merged or that GPT-4 access will be granted.

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now users do not have access, so you will not be able to tell whether the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind that when we run the eval, if GPT-4 scores higher than 90% on it, we will likely reject it, since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PRs with ~5-10 examples that we can then run the evals on and share the results with you, so you know how your eval does with GPT-4 before writing all 100 examples.

This eval tests the model's ability to solve complex differential equations.

What makes this a useful eval?

ChatGPT has always struggled with mathematics, particularly with solving complex equations such as differential equations. As a university student who uses ChatGPT on a daily basis, I have observed that it is incapable of solving such equations. I've had to use applications such as Symbolab to solve them.

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
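The 90% threshold mentioned in the guidelines above can be illustrated with a small sketch. This is not part of any official tooling: the function names and the list-of-booleans layout for graded samples are assumptions made purely for illustration, and the only figure taken from the guidelines is the 90% cutoff.

```python
# Hypothetical helper: the names and data layout here are assumptions,
# not part of any official eval tooling.
REJECTION_THRESHOLD = 0.90  # from the guidelines: higher than 90% likely means rejection


def accuracy(results):
    """Return the fraction of samples graded correct (True)."""
    if not results:
        raise ValueError("need at least one graded sample")
    return sum(results) / len(results)


def likely_rejected(results, threshold=REJECTION_THRESHOLD):
    """True when accuracy is strictly higher than the threshold."""
    return accuracy(results) > threshold


# A partial submission of 10 examples, 4 of them answered correctly.
partial_run = [True] * 4 + [False] * 6
print(accuracy(partial_run))         # 0.4
print(likely_rejected(partial_run))  # False
```

A run on a small partial submission like this only gives a rough signal, which is why the guidelines suggest sharing 5-10 examples before writing all 100.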