Table of Contents
- WHIZARD: Bootstrap the environment
- Maxwell Cluster: PyTorch, CUDA and the correct selection of a GPU
- Numerics: Welford's online algorithm
- WHIZARD: Assessing build and link time
- WHIZARD: Hunting a random number (non-)bug
- Org-mode: Update on Publish
- Org-mode: Publish
- Org-mode: Agenda
- Org-mode: Crypt
- Theory Cluster: theoc
- Theory Cluster: Crontab and Mail Delivery
- WHIZARD: UFO restrictions
WHIZARD: Bootstrap the environment
How would you provide a user with your own program and some required tools?
I would write a simple bootstrap script, but, obviously, every developer seems to have a different opinion on user interaction.
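As a rough illustration (purely my own sketch, not WHIZARD's actual setup; the tool list is hypothetical), such a bootstrap script could at least verify the prerequisites before any build step:

#!/usr/bin/env python3
# Minimal bootstrap sketch: verify that the required tools are on PATH.
import shutil
import sys

REQUIRED = ["gfortran", "make", "autoconf"]  # hypothetical tool list

missing = [tool for tool in REQUIRED if shutil.which(tool) is None]
if missing:
    sys.exit("bootstrap: missing required tools: " + ", ".join(missing))
print("bootstrap: all required tools found")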
Maxwell Cluster: PyTorch, CUDA and the correct selection of a GPU
In the last few days, I tinkered with PyTorch and CUDA on DESY's Maxwell cluster, where the software module system provides us with a wonderfully working default setup of PyTorch and CUDA - many thanks to the IT! But, at some point during my playtime, I stumbled upon some NaNs (Not-a-Number), whose appearance I wanted to understand - however, without debugging a complete external code base.
After some quick digressions on floating-point exceptions in Python, which did not prove to be helpful at all, I found that PyTorch implements hooks for (such) numerical issues (and more).[1]
Of course, I tried immediately to implement such a hook, i.e. a module hook with register_module_forward_hook.
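What I had in mind was something like the following minimal sketch (my own illustration, assuming a recent PyTorch; the hook just raises on the first NaN it sees in any module output):

import torch

# Global forward hook: inspect every module's output for NaNs.
def nan_check_hook(module, inputs, output):
    if isinstance(output, torch.Tensor) and torch.isnan(output).any():
        raise RuntimeError("NaN in output of " + module.__class__.__name__)

# Only available in recent PyTorch versions (hence the update odyssey below).
handle = torch.nn.modules.module.register_module_forward_hook(nan_check_hook)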
And - to my misfortune - there was no such global module variant in the default PyTorch installation at Maxwell.
A quick lookup revealed that I had to use a more recent version of PyTorch than provided by the Maxwell system.
So, I committed myself to an update (process).
After some time investment, I ended up with a new Python environment and the "Get started" ("Do not care"-)installation instruction from PyTorch itself. And…
RuntimeError: CUDA error: no kernel image is available for execution on the device
I then decided to torture myself and my favorite search engine with the above error message, and, not much to my surprise, I did not like the completely unhelpful questions/answers (try it out for yourself).
After a prolonged search, fighting a plethora of nonsense, I came across this particular answer - thanks to ptrblck on https://discuss.pytorch.org. Hallelujah.
Long story short: PyTorch provides (at least for the conda installer) its binaries with a minimum requirement on the GPU compute capability of, lo and behold, 3.7.[2]
The Maxwell cluster offers different GPUs, including nodes with Tesla K20X cards, which have only a 3.5 rating.
And this is then the reason for the above error message: the kernel image for the Tesla cards is simply not built into the conda binaries of PyTorch.
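This mismatch can be checked quickly from Python itself (a small sketch, assuming a recent PyTorch and a visible CUDA device):

import torch

# Architectures the installed binary was compiled for, e.g. ['sm_37', 'sm_50', ...].
print(torch.cuda.get_arch_list())
# Compute capability of the allocated GPU, e.g. (3, 5) for a Tesla K20X.
print(torch.cuda.get_device_capability(0))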
Finally, the good message is that I could solve my CUDA-no-kernel-image problem with a simple --constraint=V100 added to my SLURM resource allocation.
Back to business: J'ai bien d'autres chats à fouetter (I have other fish to fry).
Numerics: Welford's online algorithm
I stumbled upon Welford's online algorithm for computing (correct) variances within the C implementation of VEGAS in the GNU Scientific Library.
And I was completely confused by a few lines of code.
In the end, I successfully reproduced Welford's calculations and wrote my notes down in an article, before my paper notes vanish into my paper tower of oblivion.
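For reference, here is a minimal Python sketch of the update rule (the standard textbook recurrence, not the exact GSL/VEGAS code):

def welford(samples):
    # Running count, mean and sum of squared deviations (M2).
    n, mean, m2 = 0, 0.0, 0.0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the already-updated mean
    # Unbiased sample variance; undefined for fewer than two samples.
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return mean, variance

print(welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # (5.0, 4.571...)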
WHIZARD: Assessing build and link time
WHIZARD: Hunting a random number (non-)bug
One of my colleagues - many thanks to P. S. - found an interesting trait in WHIZARD with regard to the assignment and usage of random numbers.
I accepted the challenge and spun up my GDB.
Org-mode: Update on Publish
It took me quite some time to revisit my publishing setup inside Org-mode.
Sigh.
From the beginning, I was not really happy how I had ended up halfway between a working personal page and a blog, due to a lack of time and will - maybe. I had scattered some unfinished articles between my personal information, and vice versa. The sitemap, which I auto-generated, was a total mess. However, I liked the article layout quite well, but I wanted to have the creation and last-modified dates correctly side by side (and formatted in a specific manner). But my lack of knowledge regarding Emacs and Elisp slowed me down massively.
The good message is that I have fixed the directory structure, removed all unfinished articles, refined the navigation bar and unified the postambles on all pages. However, I messed up the documentation; I mean, I did not document anything…
Hence, I will try my best to come up with a note on all my tweaks and explanations, but for now I am happy that I got this far…
Org-mode: Publish
Hey, I'm starting my own blog with the magnificent org-mode, and a little help from Spacemacs.
I will try to give some insight into my computing issues regarding my research at the DESY Theory Group, and how I try to find and manage my solutions. The presented issues and solutions may be neither very user-friendly nor well-documented (or even written up as such). However, I hope to shed some light on the wondrous world of tools I encounter in high-energy particle physics. But be reminded that I can only show a tiny subset…
Org-mode: Agenda
Hey, I'm starting an article about the organizational power of Org-mode: agenda.
I will update the article on my agenda design over time - once I know that my design is actually worth noting down. For now, you can find my thoughts on the motivation and a short introduction in the article.
Org-mode: Crypt
I've started searching for a way to store my sensitive data, "secrets", with the help of Emacs and GnuPG.
And, of course, I came across the EasyPG package within Emacs and org-crypt.
If you are looking for a good way to store your passwords with GnuPG, without bothering yourself extensively with the GnuPG CLI, then I can recommend Unix pass.
Theory Cluster: theoc
I've started a cumulative article on the theory cluster and how I perform my computations on it.
I begin with the introduction of krenew to handle Kerberos and AFS during elongated and hup'ed computations.
Theory Cluster: Crontab and Mail Delivery
- Update
I'm a heavy user of scratch partitions - the place where we should point all I/O-heavy programs, thus avoiding unnecessary network load on shared filesystems. However, the nature of a scratch partition is volatile - there will be no backups of it, no warnings about its status, and so on. Thus, our data there are only ephemeral. Therefore, it is important to have backups in other places, in my case the Theory-wide network filesystem (NFS). And we need automation. Backups always need automation and an alert that something happened (or did not)!
First, we have two choices to automate our backups:
- Crontab,
- Systemd timer.
Although I prefer systemd, in this case I will go for simplicity, i.e. crontab (see info crontab).
Second, we want to be notified that something happened - after some time, once we know that everything works fine, we can change this to notify only when something bad happened.
crontab -e
> 0 5 * * * rsync -a --delete --stats <data> <backup>
Then, checking on info sendmail, we see that the cluster has Postfix installed. Yay, we can sendmail.
We only need to create a forward file ~/.forward for our user (see info local) that contains only a single line with our forward email address.
We can then verify that it works with: sendmail -t [ENTER] Hello World!!! [Ctrl+D].
Update: I'm not entirely sure whether we need to add a ~/.forward at all, as the mail and user accounts are connected via LDAP.
And it's even a little bit more complicated: AFS provides the home directory.
However, it seems to work without (access to) the local forward.
WHIZARD: UFO restrictions
WHIZARD has the option to pass model restrictions to the tree-level matrix element generator O'Mega.
This option allows us to manipulate the production of the amplitudes beyond the simple process definition of WHIZARD.
The following statements are allowed (including a logical and-operator && to combine several restrictions):
- Explicit selection of a propagator: 3 + 4 ~ Z,
- Exclusion of a propagator or a list of propagators: !A or !e:nue,
- Exclusion of a coupling constant or a list of coupling constants: ^qlep:gnclep,
- Exclusion of a specific vertex or a list of vertices: ^[A:Z,W+,W-],
- Exclusion of a specific vertex or a list of vertices with a coupling constant or a list of couplings: ^c1:c2:c3[H, H, H].
The examples are taken from the WHIZARD manual.
The restrictions feature comes in quite handy for UFO models, where we can remove unnecessary amplitude terms from our computation and spare computational resources.
I want to note that setting a coupling constant to zero also avoids the computation of a term (but not the code production), however, at the cost of an additional if-condition at each term execution (at least for O'Mega).
We require the coupling constant name for the restrictions.
But, what are the coupling constants in a UFO model?
I can already state that a coupling constant does not equal an independent model parameter!
But then, how are the coupling constants connected to the model parameters?
Fortunately, each UFO model provides a couplings.py file with the following repeating structure:
GC_1 = Coupling(name = 'GC_1', value = '-(ee*complex(0,1))/3.', order = {'QED':1})
As the number of couplings can be quite huge, I want to have an automated solution:
First, I scan couplings.py with grep for my model parameter (in value), then select the line before each match (of our grep) and massage the output a little bit to get a list of the form a:b:c:d....
The result is then:
grep --before-context 1 FT0 SM_Ltotal_Ind5v2020v2_UFO/couplings.py | \
  grep Coupling | \
  cut -d' ' -f1 | \
  tr '\n' ':'
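Alternatively, here is a small Python sketch of the same extraction (my own illustration, not part of WHIZARD; it simply pattern-matches the text of couplings.py):

import re

def couplings_with(parameter, path="SM_Ltotal_Ind5v2020v2_UFO/couplings.py"):
    # Match "GC_n = Coupling(name = 'GC_n', value = '...')" and keep the name
    # whenever the given parameter appears in the value expression.
    pattern = re.compile(
        r"(GC_\d+)\s*=\s*Coupling\(\s*name\s*=\s*'GC_\d+',\s*value\s*=\s*'([^']*)'")
    with open(path) as f:
        text = f.read()
    return ":".join(m.group(1) for m in pattern.finditer(text)
                    if parameter in m.group(2))

print(couplings_with("FT0"))  # prints a colon-separated list of coupling names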
Next, I just need to append this to a file and include it (with some manual edit) into my Sindarin files.
Back to business: J'ai bien d'autres chats à fouetter (I have other fish to fry).
Footnotes:
[2] Conda PyTorch build.sh; see TORCH_CUDA_ARCH_LIST.