The study trains on the Python subset of CodeContests and uses the reward annotations to judge the correctness of generated solutions. During evaluation, 1000 samples per problem are generated across a range of temperatures, and the models' pass@k performance is assessed. The findings show a correlation between the pretraining methodology and the optimal temperature setting, demonstrating that multi-token prediction pretraining yields models that solve more tasks while fostering diversity in their outputs.
Using the Python subset of the CodeContests train split, together with its reward annotations, allows model outputs to be checked precisely for correctness during evaluation.
Generating 1000 samples per problem at sampling temperatures from 0.5 to 0.9 enables a thorough analysis of problem-solving performance, measured with pass@k as sketched below.
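For reference, a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021) that this kind of evaluation typically relies on; the function name and the use of NumPy are illustrative choices, not details taken from the study.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (of which c are correct) passes.

    Equivalent to 1 - C(n-c, k) / C(n, k), computed as a numerically
    stable product to avoid overflow for large n (e.g. n = 1000)."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 1000 samples for one problem, 37 of which pass the tests.
print(pass_at_k(1000, 37, 1), pass_at_k(1000, 37, 100))
```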
The results indicate that multi-token prediction pretraining significantly enhances the finetuned models' ability to solve tasks while promoting diversity in outputs.
Our experiments reveal that different pretraining losses lead to different optimal sampling temperatures for pass@k, underscoring the impact of temperature selection on model evaluation.
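A rough sketch of how such a temperature sweep might be scored, reusing the pass_at_k estimator sketched above; the correct_counts structure, the best_temperature helper, and the exact temperature grid are hypothetical illustrations of the procedure described here, not the study's code.

```python
# correct_counts[t] is assumed to be a list with, for each problem,
# the number of passing samples out of n_samples generated at temperature t.
temperatures = [0.5, 0.6, 0.7, 0.8, 0.9]
n_samples = 1000

def best_temperature(correct_counts: dict, k: int):
    """Return the temperature maximizing mean pass@k, plus all scores."""
    scores = {
        t: sum(pass_at_k(n_samples, c, k) for c in correct_counts[t])
           / len(correct_counts[t])
        for t in temperatures
    }
    return max(scores, key=scores.get), scores
```

Running this per k value makes the reported correlation visible: models trained with different pretraining losses end up with different temperatures at the top of the sweep.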
#machine-learning #model-evaluation #natural-language-processing #code-contests #multi-token-prediction