At the end of last week I got a little into the weeds with Mojo. Since then I have climbed back out with some useful lessons learned, and I also ran a comparison with Julia whose result surprised me.
Mojo could do it
What got me in a funk was frustration with creating a vector of Mojo structs. What solved my problems was rereading the Mojo Manual, especially the section on “Value lifecycle”, where I began to suspect that the move constructor, “__moveinit__”, was part of the trouble. This led to reading about the “Movable” trait (and rereading about traits in general was important too), which is required by AnyPointer - and AnyPointer is what you want when building a collection of Mojo structs.
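For readers who have not hit this wall, the contrast with Python is worth making concrete. In Python every object lives behind a reference, so putting instances in a list never copies or moves anything; Mojo structs are values, so the compiler must know how to copy or move them before a collection can manage their storage. A minimal Python illustration of the reference behavior Mojo deliberately does not give you by default (the class and names here are my own, not from the Mojo code):

```python
# In Python, a list stores references; appending never copies the object.
class Point:
    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y

p = Point(1.0, 2.0)
points = [p]        # no copy constructor, no move constructor - just a reference
points[0].x = 9.0
print(p.x)          # the list entry and `p` are the same object
```

In Mojo the struct itself would live inside the collection's storage, which is why the lifecycle methods matter there and are invisible here.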
It certainly helped to read the manual a second time with a goal in mind - the conviction that surely there must be a way to do this - and with a background in other languages like C++, C#, and Rust that helped make sense of the pithy content. I stand by the remark that if you expect Mojo to be easy like Python, you will be surprised. Python makes Mojo easier to approach, but Mojo is not as easy as Python - it extends Python to allow more controlled and powerful code, and it demands a more detailed understanding of how the underpinnings work.
After practicing a bit and cleaning up, the code looked good. Mojo has a style where nothing is automatic - you declare things about structs that other languages generate automatically. This gives you a level of control - you get to choose what “automatic” means - which is fun, but can also be tedious. Mojo has decorators like @value which bridge the gap: you confirm that the compiler should fill in the default behaviors with the simple and obvious member-wise functions for traits such as Copyable and Movable, making the process much less tedious while clarifying intent. It is an interesting tradeoff between control and productivity, and it reflects Mojo’s philosophy consistently.
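The closest Python analogy I can offer - an analogy only, not Mojo semantics - is the standard dataclass decorator, which likewise asks you to opt in and then synthesizes the obvious member-wise boilerplate:

```python
from dataclasses import dataclass

# @dataclass synthesizes __init__, __repr__, and __eq__ from the field
# declarations, much as a Mojo decorator fills in default lifecycle methods.
@dataclass
class Vec3:
    x: float
    y: float
    z: float

a = Vec3(1.0, 2.0, 3.0)
b = Vec3(1.0, 2.0, 3.0)
print(a == b)  # member-wise equality was generated for us
```

In both languages the point is the same: the boilerplate exists conceptually either way, and the decorator records your intent while sparing you the typing.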
Mojo and Julia - Race!
Well, that was nice. And I decided to finish the AV calculations with Mojo, which after the above rethink went forward without further problems. I need to render the geometry and write up the blog post.
But I did say I would compare Julia to Mojo for speed, and that still seemed a good idea. So I started with the Mojo matrix multiplication benchmark and put it into a Jupyter notebook for convenience, then translated the code into another notebook in Julia. I ran both on my compact Linux desktop with an 8-core Intel Coffee Lake at 2.4 GHz, which gives me up to 16 threads and AVX2 for vector arithmetic; the GPU is not available to either Mojo or Julia.
The Mojo code scaled from 1.5 GFLOPs of Float32 to around 130 GFLOPs once vectorization and parallel processing were enabled. There was something odd in how it optimized code generation for a badly written “naive” baseline better than for the normal baseline, but both worked. The vectorized version used a loop that starts from the normal baseline, with decorators added to obtain the vectorization. The transform to parallel was pretty easy: the outer for-loop is run from the scheduler while the two inner for-loops are vectorized.
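For reference, the shape of the computation being benchmarked, and the way GFLOPs are counted (2·M·N·K floating-point operations per multiply, since each inner step is one multiply plus one add), can be sketched in plain Python. This is my own sketch of a triple-loop baseline, not the Mojo benchmark code itself:

```python
import time

def matmul_baseline(A, B, M, N, K):
    """Triple-loop matmul on nested lists: C[m][n] = sum over k of A[m][k] * B[k][n]."""
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for k in range(K):          # k in the middle keeps B's row access contiguous
            a = A[m][k]
            for n in range(N):
                C[m][n] += a * B[k][n]
    return C

M = N = K = 64
A = [[1.0] * K for _ in range(M)]
B = [[1.0] * N for _ in range(K)]
t0 = time.perf_counter()
C = matmul_baseline(A, B, M, N, K)
elapsed = time.perf_counter() - t0
gflops = 2 * M * N * K / elapsed / 1e9   # two flops (mul + add) per inner step
print(C[0][0], gflops)
```

The two inner loops are the ones a vectorizer can turn into SIMD code, and the outer m loop is the natural place to hand iterations to a thread scheduler - which matches the structure described above.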
The Julia version ran the naive code slower than Mojo but ran the normal baseline at a speed identical to Mojo, around 4 GFLOPs. For vectorization, Julia just needed an added “@turbo” prefix (from the LoopVectorization package), taking it to around 80 GFLOPs. I started adding scheduling by hand using the Threads library, but then realized that is not the idiomatic Julia approach. Instead I imported the LinearAlgebra library and tried it with 1 thread and with 8 threads (the fastest setting, matching the core count). That was enlightening.
A single thread ran at 128 GFLOPs (the same as multithreaded Mojo) while 8 threads ran at 750 GFLOPs. Yes, almost 6x faster than Mojo.
The notebooks are at my GitHub site. There are a few older projects public there, feel free to browse!
Was using the tuned BLAS library cheating? No. Julia is a mature project with a large ecosystem, building on the same numerical ecosystem as Python, just as Mojo builds on Python as a language baseline. It is perfectly normal Julia usage to import LinearAlgebra and call “BLAS.set_num_threads(8)” to get multithreading. This code is MUCH simpler to write than the Mojo version.
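The Python side of that same ecosystem works the same way: NumPy dispatches the `@` operator to a tuned BLAS, and you get the multithreaded kernel for free (the BLAS thread count is typically controlled through an environment variable such as OMP_NUM_THREADS set before import, rather than a call like Julia's BLAS.set_num_threads). A quick timing sketch, with the GFLOPs arithmetic as above:

```python
import time
import numpy as np

n = 512
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n), dtype=np.float32)
B = rng.standard_normal((n, n), dtype=np.float32)

t0 = time.perf_counter()
C = A @ B                           # dispatched to the BLAS sgemm kernel
elapsed = time.perf_counter() - t0
gflops = 2 * n**3 / elapsed / 1e9
print(f"{gflops:.1f} GFLOPs")

# Sanity check one entry against a directly computed dot product
assert np.allclose(C[0, 0], np.dot(A[0], B[:, 0]), atol=1e-2)
```

That one-liner matmul is the bar a new language's hand-written kernels are implicitly competing against.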
This just adds some perspective as to how much further Mojo may go. It would not be entirely surprising if they put a low priority on raising CPU perf to match BLAS, since GPUs are more important to really large projects these days. But if they do want to run at top speed on CPUs, this benchmark left a significant gap for improvement.
Summarizing, and actions
Mojo is a work in progress. It can use GPUs as well as CPUs, and its control of memory is really important for big applications that move a lot of data, like some AI programs. However, on my test it is not nearly as fast as a mature BLAS library with its hand-tuned kernels, and it is not more productive than Julia.
I will finish this blog chapter using Mojo. The code was done before I started comparing to Julia. Mojo works fine, it was interesting to learn, and I will check in on it every now and then. I’m looking at getting a new machine later this year with CPU and GPU in a unified memory space - not sure yet whether that will be the latest Apple, Intel, or AMD - and when I do, I will certainly see if Mojo can use the GPU. Actually, I will probably verify that Mojo can use the GPU before choosing what to buy.
I will go back to Julia for my wavefront optics blog, which is likely to be the topic in June. It is productive and fast on the CPU I have now.
Next week
Actually, this week. I expect to have rendered the 3D AV projector tomorrow and to be ready to post the description on Friday (I allow 24 hours from writing to sending so I can re-read and more clearly see what needs editing). There will be a third post on this topic around April 25th.