While experimenting with different AI models from the Ollama library, I discovered that my CPU or GPU resources were sometimes used surprisingly inefficiently.
Looking at how Ollama models are packaged, we see that they are conceptually very similar to Docker images: a full environment is specified in a single config file.
For example, we can dump the Modelfile for qwen2.5-coder:
✦ ❯ ollama show qwen2.5-coder:32b --modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM qwen2.5-coder:32b
FROM /usr/share/ollama/.ollama/models/blobs/sha256-ac3d1ba8aa77755dab3806d9024e9c385ea0d5b412d6bdf9157f8a4a7e9fc0d9
TEMPLATE """{{- if .Suffix }}<|fim_prefix|>{{ .Prompt }}<|fim_suffix|>{{ .Suffix }}<|fim_middle|>
{{- else if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""
SYSTEM You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
LICENSE """
Apache License
Version 2.0, January 2004
...
As you can see, we can very easily extend this model by dumping the above config into our own file and changing the FROM directive.
Let's create a new file called custom-qwen2.5-coder and dump the Modelfile into it:
ollama show qwen2.5-coder:32b --modelfile > custom-qwen2.5-coder
Then edit it to look like this:
...
FROM qwen2.5-coder:32b
...
SYSTEM You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
PARAMETER num_gpu 0
PARAMETER num_thread 18
LICENSE """
Apache License
...
Note the two new PARAMETER directives that configure how the model is run.
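In fact, since FROM now points at a model that is already installed locally, Ollama should inherit the base model's TEMPLATE, SYSTEM and LICENSE, so in my understanding a minimal Modelfile containing only our overrides ought to work just as well:
FROM qwen2.5-coder:32b
PARAMETER num_gpu 0
PARAMETER num_thread 18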
There are good tutorials covering these options in depth, but to sum up, we can set the following parameters (a combined example follows the list):
- mirostat
What It Does: Enables Mirostat sampling, a feedback loop that keeps the "surprise level" of the output near a target instead of letting it drift. Example: PARAMETER mirostat 1 (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).
- mirostat_eta
What It Does: Adjusts how quickly the machine learns from what it’s currently talking about. Turning it down makes it more cautious; turning it up makes it adapt faster. Example: PARAMETER mirostat_eta 0.1
- mirostat_tau
What It Does: Helps decide if the machine should stick closely to the topic (lower values) or explore a bit more creatively (higher values). Example: PARAMETER mirostat_tau 5.0
- num_ctx
What It Does: Determines how much of the previous conversation the machine can remember at once. Larger numbers mean it can remember more of what was said earlier. Example: PARAMETER num_ctx 4096
- num_gqa
What It Does: Sets the number of GQA (grouped-query attention) groups in the transformer layers. Required for some models, for example 8 for llama2:70b. Example: PARAMETER num_gqa 8
- num_gpu
What It Does: Sets how many of the model's layers are offloaded to the GPU. More layers on the GPU generally means faster inference, as long as they fit in VRAM; 0 disables GPU use entirely. Example: PARAMETER num_gpu 50
- num_thread
What It Does: Sets how many CPU threads are used during computation, like opening more lanes on a highway. A common recommendation is the number of physical CPU cores. Example: PARAMETER num_thread 8
- repeat_last_n
What It Does: This is like telling the machine how much of the last part of the conversation to try not to repeat, keeping the chat fresh. Example: PARAMETER repeat_last_n 64
- repeat_penalty
What It Does: If the machine starts repeating itself, this is like a nudge to encourage it to come up with something new. Example: PARAMETER repeat_penalty 1.1
- temperature
What It Does: Controls how “wild” or “safe” the machine’s responses are. Higher temperatures encourage more creative responses. Example: PARAMETER temperature 0.7
- seed
What It Does: Sets up a starting point for generating responses. Using the same seed number with the same prompt will always give the same response. Example: PARAMETER seed 42
- stop
What It Does: Tells the machine when to stop talking, based on certain cues or keywords. It’s like saying “When you hear this word, it’s time to wrap up.” Example: PARAMETER stop "AI assistant:"
- tfs_z
What It Does: Aims to reduce randomness in the machine’s responses, keeping its “thoughts” more focused. Example: PARAMETER tfs_z 2.0
- num_predict
What It Does: Limits how much the machine can say in one go. Setting a limit helps keep its responses concise. Example: PARAMETER num_predict 128
- top_k
What It Does: Limits the machine’s word choices to the top contenders, which helps it stay on topic and make sense. Example: PARAMETER top_k 40
- top_p
What It Does: Works with top_k to fine-tune the variety of the machine’s responses, balancing between predictable and diverse. Example: PARAMETER top_p 0.9
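For illustration, here is how several of these could be combined in one Modelfile. The values below are arbitrary examples for a hypothetical coding setup, not tuned recommendations:
FROM qwen2.5-coder:32b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
The same parameters can also be set interactively inside an ollama run session with /set parameter temperature 0.7, or per request through the options field of the REST API, if you prefer not to bake them into the model.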
Note that basically we changed only the GPU layer offloading and the CPU thread count.
PARAMETER num_gpu 0 tells Ollama not to offload any layers to the GPU (I do not have a good GPU on my test machine). Conversely, you can raise this value to maximize the use of your GPU.
PARAMETER num_thread 18 tells Ollama to use 18 threads, making better use of the CPU. Note that models are usually configured conservatively: qwen2.5 was using at most 6 CPU threads even though my machine has 20 cores.
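If you are unsure which values make sense for your machine, you can check what the hardware offers and where a loaded model actually runs (assuming a Linux box and a reasonably recent Ollama version):
nproc        # logical CPU cores available
ollama ps    # the PROCESSOR column shows the CPU/GPU split for loaded models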
To create the new model image, just run the command:
ollama create custom-qwen2.5-coder:32b -f ./custom-qwen2.5-coder
Check if the new model was created:
✦ ❯ ollama list
NAME                       ID              SIZE     MODIFIED
custom-qwen2.5-coder:32b   353407281410    19 GB    14 hours ago
qwen2.5-coder:32b          4bd6cbf2d094    19 GB    15 hours ago
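A quick smoke test shows whether the new settings are picked up; for example (the prompt is arbitrary):
ollama run custom-qwen2.5-coder:32b "Write a hello world HTTP server in Go"
While it runs, watching CPU usage in a tool like htop should show many more cores busy with num_thread 18 than with the stock model.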
In the end, do not forget to point your tools and tests at the custom model instead of the original one.
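For example, if you talk to Ollama through its REST API, just reference the new model name in the request (localhost:11434 is the default endpoint; the prompt is again arbitrary):
curl http://localhost:11434/api/generate -d '{
  "model": "custom-qwen2.5-coder:32b",
  "prompt": "Write a hello world HTTP server in Go",
  "stream": false
}'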