
"The phi-3-mini model is a transformer decoder architecture with a default context length of 4K, extending to 128K in the long context version, phi-3-mini-128K."
"The phi-3-small model, with 7 billion parameters, utilizes a tiktoken tokenizer, 32 heads, and a hidden size of 4096, ensuring better multilingual tokenization."
"We switched from GELU activation to GEGLU for improved performance and training stability in the phi-3-small model, optimizing hyperparameters using Maximal Update Parametrization."
"A novel blocksparse attention module was designed to optimize training and inference speed by applying different sparsity patterns over KV cache for each attention head."
The phi-3-mini model is a transformer decoder architecture with a default context length of 4K, extended to 128K in its long-context variant, phi-3-mini-128K. It follows a structure similar to Llama-2 so that it stays compatible with existing open-source tooling. The phi-3-small model, with 7 billion parameters, uses the tiktoken tokenizer for better multilingual tokenization and adopts a standard decoder architecture, pairing GEGLU activation with hyperparameters tuned via Maximal Update Parametrization for improved performance and training stability. A blocksparse attention module was also introduced to speed up training and inference.
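Because the summary stresses structural similarity to Llama-2 for open-source compatibility, a typical way to experiment with phi-3-mini is through the Hugging Face transformers library. The checkpoint identifier below refers to Microsoft's publicly released 4K-context instruct model; its name and availability are assumptions about hosting rather than something stated in this excerpt.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: checkpoint id is not given in this excerpt; this is the published
# 4K-context instruct release on the Hugging Face Hub.
model_id = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Older transformers releases may need trust_remote_code=True for the Phi-3 architecture.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Explain block-sparse attention in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```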