Planet Linux Australia
Celebrating Australians & Kiwis in the Linux and Free/Open-Source community...

December 15, 2019

Are you Fans of the Blackbird? Speak up, I can’t hear you over the fan.

So, as of yesterday, I started running a pretty-close-to-upstream op-build host firmware stack on my Blackbird. Notable yak-shaving has included:

Apart from that, I was as happy as Larry. Except then I went into the room with the Blackbird in it and went “huh, that’s loud”, and since it was bedtime, I decided it could all wait until the morning.

It is now the morning. Checking fan speeds over IPMI, one fan stood out: fan2, sitting at 4300RPM. This was a bit of a surprise, as the board silkscreen says the rear case fan is hooked up to “fan2”, and if we had a “start from 0/1” mix-up, it’d be the front case fan. I had just assumed it’d be maybe OCC firmware dying or something, but this wasn’t the case (I checked – thanks occtoolp9!)

After a bit of digging around, I worked out this mapping:

IPMI fan0 -> Rear Case Fan  (Motherboard Fan 2)
IPMI fan1 -> Front Case Fan (Motherboard Fan 3)
IPMI fan2 -> CPU Fan        (Motherboard Fan 1)

Which is about as surprising and confusing as you’d think.

After a bunch of digging around the Raptor ports of OpenBMC and Hostboot, it seems that the IPL Observer (which is custom to Raptor) controls whether the BMC decides to do fan control or not.

You can get its view of the world from the BMC via some (incredibly user-friendly) poking at DBus:

busctl get-property org.openbmc.status.IPL /org/openbmc/status/IPL org.openbmc.status.IPL current_status
busctl get-property org.openbmc.status.IPL /org/openbmc/status/IPL org.openbmc.status.IPL current_istep

Which, if you just have the Hostboot patch in (like I first did), leaves you with:

s "IPL_RUNNING"
s "21,3"

Which is where Hostboot exits the IPL process (as you see on the screen) and hands over to skiboot. But if you start digging through their op-build tree, you find that there’s a signal_linux_start_complete script which calls pnv-lpc to write two values to LPC ports 0x81 and 0x82. The pnv-lpc utility is the external/lpc/ binary from skiboot, and these two ports are the “extended lpc port 80h” state.

So, to get back fan control? First, build the lpc utility:

git clone git@github.com:open-power/skiboot.git
cd skiboot/external/lpc
make

and then poke in the magic values for “IPL complete and Linux running”:

$ sudo ./lpc io 0x81.b=254
[io] W 0x00000081.b=0xfe
$ sudo ./lpc io 0x82.b=254
[io] W 0x00000082.b=0xfe

You get a friendly beep, and then your fans return to sanity.

Of course, for that to work you need to have debugfs mounted, as this pokes OPAL debugfs to do direct LPC operations.
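If it isn’t mounted already, that’s a standard one-liner (nothing Blackbird specific):

sudo mount -t debugfs debugfs /sys/kernel/debug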

Next up: think of a smarter way to trigger that than “stewart runs it on the command line”. Also next up: work out a better way to determine that fan control should be on, and patch the BMC.
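In the meantime, a crude stopgap could be a oneshot systemd unit on the host that does the poke late in boot. This is only a sketch under my own assumptions (the lpc binary copied to /usr/local/sbin, the unit name is made up, and debugfs needs to be mounted as above) rather than how Raptor intend this to be solved:

cat << EOF | sudo tee /etc/systemd/system/blackbird-ipl-complete.service
[Unit]
Description=Poke the extended LPC port 80h state to say Linux is running

[Service]
Type=oneshot
# Same two writes as the manual lpc invocations above
ExecStart=/usr/local/sbin/lpc io 0x81.b=254
ExecStart=/usr/local/sbin/lpc io 0x82.b=254

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable blackbird-ipl-complete.service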

Automatically updating containers with Docker

Running something in a container using Docker or Podman is cool, but maybe you want an automated way to always run the latest container? Using the :latest tag alone does not do this; that just pulls whatever is the latest image at the time. You could have a cronjob that always pulls the latest images and restarts the container, but then if there’s no update you have an outage for no reason.

It’s not too hard to write a script to pull the latest container and restart the service only if required, then tie that together with a systemd timer.

To restart a container you need to know how it was started. If you have only one container then you could just hard-code it, however it gets more tricky to manage if you have a number of containers. This is where something like runlike can help!

First, start up your container however you need to (OwnTracks recorder, for example).

Next, let’s install runlike with pip.

sudo pip install runlike

Now, let’s create a simple script that takes one optional argument, the name of a running container. If the argument is omitted, it will default to all containers. The script will check if the latest image is different to the running image, and if so, restart the container using the new image with the same arguments as before (determined by runlike). If there is no newer image, then it will just leave the running container alone.

Create the script.

cat << \EOF | sudo tee /usr/local/bin/update-containers.sh
#!/bin/bash

# Abort on all errors (set -e)
set -o errexit

# Get the containers from first argument, else get all containers
CONTAINER_LIST="${1:-$(docker ps -q)}"

for container in ${CONTAINER_LIST}; do
  # Get the image and hash of the running container
  CONTAINER_IMAGE="$(docker inspect --format "{{.Config.Image}}" --type container "${container}")"
  RUNNING_IMAGE="$(docker inspect --format "{{.Image}}" --type container "${container}")"

  # Pull in latest version of the container and get the hash
  docker pull "${CONTAINER_IMAGE}"
  LATEST_IMAGE="$(docker inspect --format "{{.Id}}" --type image "${CONTAINER_IMAGE}")"

  # Restart the container if the image is different
  if [[ "${RUNNING_IMAGE}" != "${LATEST_IMAGE}" ]]; then
    echo "Updating ${container} image ${CONTAINER_IMAGE}"
    DOCKER_COMMAND="$(runlike "${container}")"
    docker rm --force "${container}"
    eval ${DOCKER_COMMAND}
  fi
done
EOF

Make the script executable.

sudo chmod a+x /usr/local/bin/update-containers.sh

You can test the script by just running it.

/usr/local/bin/update-containers.sh
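You can also limit it to specific containers by passing their names as arguments (here using a hypothetical container named recorder):

/usr/local/bin/update-containers.sh recorder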

Now that you have a script which will check for new images and update containers, let’s make a systemd service and timer for it. This way you can schedule regular update checks whenever you like. If you only want to manage specific containers, just add their names as arguments to the script.

First, create the service.

cat << EOF | sudo tee /etc/systemd/system/update-containers.service 
[Unit]
Description=Update containers
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/update-containers.sh
EOF

Next, create the matching timer service (note that the service and timer names need to match).

cat << EOF | sudo tee /etc/systemd/system/update-containers.timer 
[Unit]
Description=Timer for updating containers
Wants=network-online.target

[Timer]
OnActiveSec=24h
OnUnitActiveSec=24h

[Install]
WantedBy=timers.target
EOF

Reload systemd to pick up the new service and enable the timer.

sudo systemctl daemon-reload
sudo systemctl start update-containers.timer
sudo systemctl enable update-containers.timer

You can check the status of the timer and the service using standard systemd tools.

sudo systemctl status update-containers.timer
sudo systemctl status update-containers.service
sudo journalctl -u update-containers.service

That’s it! Sit back and let your containers be automatically updated for you. If you want to manually update a container, you could just use version tags and manage them separately.


Enabling Docker in Fedora 31 by reverting to cgroups v1

Fedora has switched to cgroups v2 by default now, but Docker doesn’t yet support it and so fails to start. If you want to use Docker then you need to revert cgroups to v1 by adding the systemd.unified_cgroup_hierarchy=0 kernel argument.

Add systemd.unified_cgroup_hierarchy=0 to the default GRUB config with sed.

sudo sed -i '/^GRUB_CMDLINE_LINUX/ s/"$/ systemd.unified_cgroup_hierarchy=0"/' /etc/default/grub
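You can confirm the argument was appended before going any further:

grep ^GRUB_CMDLINE_LINUX /etc/default/grub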

Now rebuild your GRUB config.

If you’re using BIOS boot then it’s this.

sudo grub2-mkconfig -o /boot/grub2/grub.cfg

If you’re running EFI, then it’s this.

sudo grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

Now reboot and make sure Docker can start!
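After rebooting, a quick way to confirm which hierarchy you got (generic commands, nothing Fedora specific) is to check the kernel command line and the filesystem type mounted at /sys/fs/cgroup, which is tmpfs for cgroups v1 and cgroup2fs for cgroups v2:

cat /proc/cmdline
stat -fc %T /sys/fs/cgroup/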

OwnTracks recorder in a container on Fedora with Let’s Encrypt and nginx

OwnTracks Recorder is a web application which maps locations over time. Generally, it connects to an MQTT server and subscribes to owntracks/+ topics for any location updates, but it also has a built in function to receive updates over HTTP.

I have been using OwnTracks with MQTT for a while, but found it to be too unreliable on Android (it disconnects in the background and doesn’t reconnect nicely). Using HTTP is supposed to be more reliable, so this is how I set it up. The idea is to use OwnTracks on Android to post directly to the OwnTracks recorder over HTTP instead of MQTT, and have recorder post the MQTT messages on our behalf using Lua scripts (for Home Assistant).

Friends is an important feature (it lets members of the family see where each other is) and fortunately it is supported in HTTP mode, although it requires a little more configuration.

nginx and base configuration

We will use nginx as a reverse proxy in front of the recorder to provide both TLS and authentication to keep the service private and secure.

sudo dnf install nginx httpd-tools

Configure nginx to proxy to OwnTracks recorder by creating a new config file for the domain you are hosting on. For example, if your domain is owntracks.yourdomain.com then create a file at /etc/nginx/conf.d/owntracks.yourdomain.com.conf. Later, certbot will update this to add TLS configuration.

cat << \EOF |sudo tee /etc/nginx/conf.d/owntracks.yourdomain.com.conf
server {
  server_name owntracks.yourdomain.com;
  root /var/www/html;

  auth_basic "OwnTracks";
  auth_basic_user_file /etc/nginx/owntracks.htpasswd;
  proxy_set_header X-Limit-U $remote_user;

  location / {
    proxy_pass http://127.0.0.1:8083/;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Real-IP $remote_addr;
  }

  location /ws {
    rewrite ^/(.*) /$1 break;
    proxy_pass http://127.0.0.1:8083;
    proxy_http_version  1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }

  location /view/ {
    proxy_buffering off;
    proxy_pass http://127.0.0.1:8083/view/;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Real-IP $remote_addr;
  }
  location /static/ {
    proxy_pass http://127.0.0.1:8083/static/;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Real-IP $remote_addr;
  }

  location /pub {
    proxy_pass http://127.0.0.1:8083/pub;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Real-IP $remote_addr;
  }

  error_page 404 /404.html;
  location = /40x.html {
  }

  error_page 500 502 503 504 /50x.html;
  location = /50x.html {
  }
}
EOF

Now that we have a web server, let’s open up the firewall to allow traffic on ports 80 and 443.

sudo firewall-cmd --zone=FedoraServer --add-service=http
sudo firewall-cmd --zone=FedoraServer --add-service=https
sudo firewall-cmd --runtime-to-permanent

SELinux will block nginx from acting as a proxy and connecting to our other services, so we need to tell it that it’s OK.

sudo setsebool -P httpd_can_network_connect 1
sudo setsebool -P httpd_can_network_relay 1

Note that in the config we set a password file for nginx to protect recorder; now we need to create that file.

Let’s pretend that we have three users: Alice, Bob and Charlie. Create the nginx password file while adding the password for Alice, then add passwords for the other two.

sudo htpasswd -c /etc/nginx/owntracks.htpasswd alice
sudo htpasswd /etc/nginx/owntracks.htpasswd bob
sudo htpasswd /etc/nginx/owntracks.htpasswd charlie

That’s the core nginx config done; next we will use certbot to get a certificate and re-configure nginx to use TLS.

Certbot

Install certbot and the nginx plugin, which will let us get signed certificates from Let’s Encrypt. Using the plugin means it will configure nginx to handle the challenge and write the config file automatically. You will need to make sure that port 80 on your nginx server is available over the Internet (and probably also port 443 so that we can connect securely to recorder remotely) as well as a DNS entry pointing to your external IP (I’ll use owntracks.yourdomain.com as an example).

sudo dnf install certbot python3-certbot-nginx

Next, use certbot to get TLS certificates from Let’s Encrypt. Follow the prompts and be sure to enable TLS redirection so that all traffic will be encrypted.

sudo certbot --agree-tos \
--redirect \
--rsa-key-size 4096 \
--nginx \
-d owntracks.yourdomain.com

Now that we have a certificate, let’s enable auto renewals.

sudo systemctl enable --now certbot-renew.timer

OK, nginx should now be configured with TLS and managed by certbot.

Recorder with Docker

Now let’s get the recorder container going! First install and prepare Docker. Note that if you’re running on Fedora 31 or later, you need to revert to cgroup v1 first.

sudo groupadd -r docker
sudo gpasswd -a ${USER} docker
newgrp docker
sudo dnf install -y cockpit-docker docker
sudo systemctl start docker
sudo systemctl enable docker

Next let’s prepare the configuration and scripts for the container.

sudo mkdir -p /var/lib/owntracks/{config,scripts,logs}

Generally we pass variables into containers, but recorder also supports a config file so we’ll use that instead (OTR_LUASCRIPT is not supported as a variable, anyway). Replace the values for your MQTT server below.

NOTE: OTR_PORT must be a number, not a string, otherwise it will be ignored.

OTR_HOST="mqtt-broker"
OTR_PORT=mqtt-port
OTR_USER="mqtt-user"
OTR_PASS="mqtt-user-password"

cat << EOF | sudo tee /var/lib/owntracks/config/recorder.conf
OTR_TOPICS = "owntracks/#"
OTR_HTTPHOST = "0.0.0.0"
OTR_STORAGEDIR = "/store"
OTR_HTTPLOGDIR = "/logs"
OTR_LUASCRIPT = "/scripts/hook.lua"
OTR_HOST = "${OTR_HOST}"
OTR_PORT = ${OTR_PORT}
OTR_USER = "${OTR_USER}"
OTR_PASS = "${OTR_PASS}"
OTR_CLIENTID = "owntracks-recorder"
EOF

If you’re using TLS on your MQTT server, then copy over the CA (for example, /etc/pki/tls/certs/ca-bundle.crt) and set the OTR_CAFILE config option to point to the file as it will be inside the container. This will automatically enable TLS connection to your MQTT server.

sudo cp /etc/pki/tls/certs/ca-bundle.crt /var/lib/owntracks/config/ca.crt
echo 'OTR_CAFILE="/config/ca.crt"' | sudo tee -a /var/lib/owntracks/config/recorder.conf

Next get the Lua scripts ready which will allow recorder to forward HTTP events on to MQTT. We will write a file called hook.lua to run the script, which is referenced in the config above. It has a JSON dependency, which we will download from the Internet.

wget http://regex.info/code/JSON.lua
sudo mv JSON.lua /var/lib/owntracks/scripts/JSON.lua
cat << EOF | sudo tee /var/lib/owntracks/scripts/hook.lua
JSON = (loadfile "/scripts/JSON.lua")()

function otr_init()
end

function otr_exit()
end

function otr_hook(topic, _type, data)
    otr.log("DEBUG_PUB:" .. topic .. " " .. JSON:encode(data))
    if(data['_http'] == true) then
        if(data['_repub'] == true) then
           return
        end
        data['_repub'] = true
        local payload = JSON:encode(data)
        otr.publish(topic, payload, 1, 1)
    end
end

function otr_putrec(u, d, s)
        j = JSON:decode(s)
        if (j['_repub'] == true) then
                return 1
        end
end
EOF

Next we can run the container for recorder. We will map in all of the directories we created earlier, and the configuration we created should be read in when the program in the container starts. Note that the :Z option sets the SELinux context on those volumes.

docker run -dit --name recorder \
--restart always \
-p 8083:8083 \
-v /var/lib/owntracks/store:/store:Z \
-v /var/lib/owntracks/config:/config:Z \
-v /var/lib/owntracks/scripts:/scripts:Z \
-v /etc/localtime:/etc/localtime:ro \
owntracks/recorder

OwnTracks should now be listening on port 8083, waiting for connections to come in through nginx!
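Before pointing a phone at it, you can sanity check that both the container and the nginx proxy answer (the hostname and user are the example values from above; any HTTP status line is fine here, what you don’t want is a connection error):

curl -si http://127.0.0.1:8083/ | head -n 1
curl -si -u alice https://owntracks.yourdomain.com/ | head -n 1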

Friends with OwnTracks

To set up friends in HTTP mode we need to get a shell on the container and load friends data into the database.

docker exec -it recorder /bin/sh

Inside the container we load friends data into the database. Let’s use our three friends as an example, Alice with her phone pixel3xl, Bob with his pixel4 and Charlie with her pixel3a, to set up notifications for everyone.

ocat --load=friends << EOF
alice-pixel3xl [ "bob/pixel4", "charlie/pixel3a" ]
bob-pixel4 [ "alice/pixel3xl, "charlie/pixel3a" ]
charlie-pixel3a [ "alice/pixel3xl, "bob/pixel4" ]
EOF

We can dump the friends data to see what we’ve loaded, then exit the container.

ocat -S /store --dump=friends
exit

Now, whenever Alice, Bob or Charlie update their location, recorder will return JSON data with the location of the other two. OwnTracks will then display that information under the Friends tab. Unfortunately, the one thing HTTP mode doesn’t support is region notifications (being notified when friends enter or leave defined waypoints), but I’ve found OwnTracks to be much more reliable with HTTP so I guess that’s a small price to pay…


December 14, 2019

Audiobooks – November 2019

Exactly: How Precision Engineers Created the Modern World by Simon Winchester

Starting from the early 18th century, each chapter covers increasingly greater accuracy and the technology that needed and used it. Nice read. 8/10

The Secret Cyclist: Real Life as a Rider in the Professional Peloton by The Secret Cyclist

An okay read although I don’t follow the sport so had never heard of most of the names. It is still readable however and gives a good feel for the world. 6/10

Braving It: A Father, a Daughter, and an Unforgettable Journey into the Alaskan Wild by James Campbell

A father takes his 15 year-old daughter for two trips to a remote cabin and a 3rd trip hiking/canoeing along a remote river in Alaska. Well written and interesting. 8/10

The Left Behind: Decline and Rage in Rural America by Robert Wuthnow

Based on Interviews with small town Americans it talks about their lives and frustrations with Washington which they see as distant but interfering. 7/10

World War Z: An Oral History of the Zombie War by Max Brooks

This was the “almost” full text version. Lots of different actors reading each chapter (which are arranged as interviews). Great story and presentation works well. 9/10


Booting temporary firmware on the Raptor Blackbird

In a future post, I’ll detail how to build my ported-to-upstream Blackbird firmware. Here though, we’ll explore booting some firmware temporarily to experiment.

Step 1: Copy your new PNOR image over to the BMC.
Step 2: …
Step 3: Profit!

Okay, not really. Once you’ve copied over your image, ensure the computer is off, and then you can tell the daemon that provides firmware to the host to use a file backend rather than the PNOR chip on the motherboard (i.e. yes, you can boot your system even when the firmware chip isn’t there – although I’ve not literally tried this).

root@blackbird:~# mboxctl --backend file:/tmp/blackbird.pnor 
SetBackend: Success
root@blackbird:~# obmcutil poweron

If we look at the serial console (ssh to the BMC port 2200) we’ll see Hostboot start, realise there’s newer SBE code, flash it, and reboot:

--== Welcome to Hostboot hostboot-b284071/hbicore.bin ==--

  3.02606|secure|SecureROM valid - enabling functionality
  5.14678|Booting from SBE side 0 on master proc=00050000
  5.18537|ISTEP  6. 5 - host_init_fsi
  5.47985|ISTEP  6. 6 - host_set_ipl_parms
  5.54476|ISTEP  6. 7 - host_discover_targets
  6.56106|HWAS|PRESENT> DIMM[03]=8080000000000000
  6.56108|HWAS|PRESENT> Proc[05]=8000000000000000
  6.56109|HWAS|PRESENT> Core[07]=1511540000000000
  6.61373|ISTEP  6. 8 - host_update_master_tpm
  6.61529|SECURE|Security Access Bit> 0x0000000000000000
  6.61530|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
  6.61543|ISTEP  6. 9 - host_gard
  7.20987|HWAS|FUNCTIONAL> DIMM[03]=8080000000000000
  7.20988|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
  7.20989|HWAS|FUNCTIONAL> Core[07]=1511540000000000
  7.21299|ISTEP  6.11 - host_start_occ_xstop_handler
  8.28965|ISTEP  6.12 - host_voltage_config
  8.47973|ISTEP  7. 1 - mss_attr_cleanup
  9.07674|ISTEP  7. 2 - mss_volt
  9.35627|ISTEP  7. 3 - mss_freq
  9.63029|ISTEP  7. 4 - mss_eff_config
 10.35189|ISTEP  7. 5 - mss_attr_update
 10.38489|ISTEP  8. 1 - host_slave_sbe_config
 10.45332|ISTEP  8. 2 - host_setup_sbe
 10.45450|ISTEP  8. 3 - host_cbs_start
 10.45574|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
 10.48675|ISTEP  8. 5 - host_attnlisten_proc
 10.50338|ISTEP  8. 6 - host_p9_fbc_eff_config
 10.50771|ISTEP  8. 7 - host_p9_eff_config_links
 10.53338|ISTEP  8. 8 - proc_attr_update
 10.53634|ISTEP  8. 9 - proc_chiplet_fabric_scominit
 10.55234|ISTEP  8.10 - proc_xbus_scominit
 10.56202|ISTEP  8.11 - proc_xbus_enable_ridi
 10.57788|ISTEP  8.12 - host_set_voltages
 10.59421|ISTEP  9. 1 - fabric_erepair
 10.65877|ISTEP  9. 2 - fabric_io_dccal
 10.66048|ISTEP  9. 3 - fabric_pre_trainadv
 10.66665|ISTEP  9. 4 - fabric_io_run_training
 10.66860|ISTEP  9. 5 - fabric_post_trainadv
 10.67060|ISTEP  9. 6 - proc_smp_link_layer
 10.67503|ISTEP  9. 7 - proc_fab_iovalid
 11.10386|ISTEP  9. 8 - host_fbc_eff_config_aggregate
 11.15103|ISTEP 10. 1 - proc_build_smp
 11.27537|ISTEP 10. 2 - host_slave_sbe_update
 11.68581|sbe|System Performing SBE Update for PROC 0, side 0
 34.50467|sbe|System Rebooting To Complete SBE Update Process
 34.50595|IPMI: Initiate power cycle
 34.54671|Stopping istep dispatcher
 34.68729|IPMI: shutdown complete

One of the improvements is we now get output from the SBE! This means that when we do things like mess up secure boot and non secure boot firmware (I’ll explain why/how this is a thing later), we’ll actually get something useful out of a serial port:

--== Welcome to SBE - CommitId[0x8b06b5c1] ==--
istep 3.19
istep 3.20
istep 3.21
istep 3.22
istep 4.1
istep 4.2
istep 4.3
istep 4.4
istep 4.5
istep 4.6
istep 4.7
istep 4.8
istep 4.9
istep 4.10
istep 4.11
istep 4.12
istep 4.13
istep 4.14
istep 4.15
istep 4.16
istep 4.17
istep 4.18
istep 4.19
istep 4.20
istep 4.21
istep 4.22
istep 4.23
istep 4.24
istep 4.25
istep 4.26
istep 4.27
istep 4.28
istep 4.29
istep 4.30
istep 4.31
istep 4.32
istep 4.33
istep 4.34
istep 5.1
istep 5.2
SBE starting hostboot

And then we’re back into normal Hostboot boot (which we’ve all seen before) and end up at a newer petitboot!

Petitboot 1.11 on a Raptor Blackbird

One notable absence from that screenshot is my installed Fedora is missing. This is because there appears to be a bug in the 5.3.7 kernel that’s currently upstream, and if we drop to the shell and poke at lspci and dmesg, we can work out what could be the culprit:

Exiting petitboot. Type 'exit' to return.
You may run 'pb-sos' to gather diagnostic data
No password set, running as root. You may set a password in the System Configuration screen.
# lspci
0000:00:00.0 PCI bridge: IBM Device 04c1
0001:00:00.0 PCI bridge: IBM Device 04c1
0001:01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
0002:00:00.0 PCI bridge: IBM Device 04c1
0002:01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0003:00:00.0 PCI bridge: IBM Device 04c1
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM Device 04c1
0004:01:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0005:00:00.0 PCI bridge: IBM Device 04c1
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
# dmesg|grep -i nvme
[    2.991038] nvme nvme0: pci function 0001:01:00.0
[    2.991088] nvme 0001:01:00.0: enabling device (0140 -> 0142)
[    3.121799] nvme nvme0: Identify Controller failed (19)
[    3.121802] nvme nvme0: Removing after probe failure status: -5
# uname -a
Linux skiroot 5.3.7-openpower1 #2 SMP Sat Dec 14 09:06:20 PST 2019 ppc64le GNU/Linux

If for some reason the device didn’t show up in lspci, then I’d look at the skiboot firmware log, which is /sys/firmware/opal/msglog.
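For example, from the petitboot shell something like this will pull out the PHB and NVMe related lines:

grep -i -e phb -e nvme /sys/firmware/opal/msglog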

Looking at upstream stable kernel patches, it seems like 5.3.8 has an interesting-looking patch when you realize that ppc64le uses a 64k page size:

commit efac0f186ea654e8389f5017c7f643ef48cb4b93
Author: Kevin Hao <haokexin@gmail.com>
Date:   Fri Oct 18 10:53:14 2019 +0800

    nvme-pci: Set the prp2 correctly when using more than 4k page
    
    commit a4f40484e7f1dff56bb9f286cc59ffa36e0259eb upstream.
    
    In the current code, the nvme is using a fixed 4k PRP entry size,
    but if the kernel use a page size which is more than 4k, we should
    consider the situation that the bv_offset may be larger than the
    dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then
    cause the command can't be executed correctly.
    
    Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kevin Hao <haokexin@gmail.com>
    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

So, time to go try 5.3.8. My yaks are getting quite smooth.

Oh, and when you’re done with your temporary firmware, either fiddle with mboxctl or restart the systemd service for it, or reboot your BMC or… well, I gotta leave you something to work out on your own :)

Building OpenPOWER firmware on Fedora 31

One of the challenges with Fedora 31 is that /usr/bin/python is now Python 3 rather than Python 2. Just about every python script in existence relies on /usr/bin/python being Python 2 and not anything else. I can’t really recall, but this probably happened with the 1.5 to 2 transition as well (although IIRC that was less breaking).

What this means is that for projects that are half-way through converting to python 3, everything breaks.

op-build is one of these projects.

So, we need:

After all that, you can actually build a pnor image on Fedora 31. Even on Fedora 31 ppc64le, which is literally what I’ve just done.
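If you hit other projects with the same problem, one blunt and entirely generic workaround (my habit, not something op-build specific) is to give the build shell a PATH where python resolves to Python 2:

mkdir -p ~/python2-bin
ln -sf /usr/bin/python2 ~/python2-bin/python
export PATH=~/python2-bin:$PATH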

December 13, 2019

Upstreaming Blackbird firmware (step 1: skiboot)

Now that I can actually boot the machine, I could test and send my patch upstream for Blackbird support in skiboot. One thing I noticed with the current firmware from Raptor is that the PCIe slot names were wrong. While a pretty minor point, it’s a bit funny that there’s only two slots and the names were wrong.

The PCIe slot names are used to call out the physical location of PCIe cards in the system, so if you, say, hit a bunch of errors, OS/firmware can say “It’s this card in the slot labeled BLAH on the board”.

With my patch, the slot table from skiboot is spat out looking like this:

[   64.296743001,5] PHB#0000:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..ff SLOT=SLOT1 PCIE 4.0 X16 
 [   64.296875483,5] PHB#0001:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=SLOT2 PCIE 4.0 X8 
 [   64.297054197,5] PHB#0001:01:00.0 [EP  ] 8086 f1a8 R:03 C:010802 (  mass-storage) LOC_CODE=SLOT2 PCIE 4.0 X8
 [   64.297285067,5] PHB#0002:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin SATA 
 [   64.297411565,5] PHB#0002:01:00.0 [LGCY] 1b4b 9235 R:11 C:010601 (          sata) LOC_CODE=Builtin SATA
 [   64.297554540,5] PHB#0003:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin USB 
 [   64.297732049,5] PHB#0003:01:00.0 [EP  ] 104c 8241 R:02 C:0c0330 (      usb-xhci) LOC_CODE=Builtin USB
 [   64.297848624,5] PHB#0004:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..01 SLOT=Builtin Ethernet 
 [   64.298026870,5] PHB#0004:01:00.0 [EP  ] 14e4 1657 R:01 C:020000 (      ethernet) LOC_CODE=Builtin Ethernet
 [   64.298212291,5] PHB#0004:01:00.1 [EP  ] 14e4 1657 R:01 C:020000 (      ethernet) LOC_CODE=Builtin Ethernet
 [   64.298424962,5] PHB#0004:01:00.2 [EP  ] 14e4 1657 R:01 C:020000 (      ethernet) LOC_CODE=Builtin Ethernet
 [   64.298587848,5] PHB#0005:00:00.0 [ROOT] 1014 04c1 R:00 C:060400 B:01..02 SLOT=BMC 
 [   64.298722540,5] PHB#0005:01:00.0 [ETOX] 1a03 1150 R:04 C:060400 B:02..02 LOC_CODE=BMC
 [   64.298850009,5] PHB#0005:02:00.0 [PCID] 1a03 2000 R:41 C:030000 (           vga) LOC_CODE=BMC

If you want to give it a go, grab the patch, build skiboot, and flash it on. Alternatively, you can download a built skiboot here. To flash it, do this:

# Copy to your BMC for the Blackbird
scp skiboot-v6.5-146-g376bed3f.lid.xz.stb root@blackbird:/tmp/

# then, ssh to the BMC
$ ssh root@blackbird

# ensure the machine is off
obmcutil poweroff --wait

# Now, make a backup copy (remember to copy it off /tmp on the bmc)
pflash -P PAYLOAD -r /tmp/skiboot-backup

# and flash the new skiboot:
pflash -e -P PAYLOAD -p /tmp/skiboot.lid.xz.stb

# now, power on the box
obmcutil poweron

Black(bird) boots!

Well, after the half false start of not having RAM so really not being able to do much (yeah yeah, I hear you – I’m weak for not just running Linux in L3), my RAM arrived today. Putting the sticks in was easy (of course), although does not make for an exciting photo.

One DIMM in the Blackbird

After that, I SSH’d to the BMC and then did “obmcutil poweron” (as is traditional), and started watching the console by connecting via SSH to port 2200 on the BMC. I was then greeted by the (by this time in my life rather familiar) Hostboot:

--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--
 3.02902|secure|SecureROM valid - enabling functionality
   7.15613|Booting from SBE side 0 on master proc=00050000
   7.19697|ISTEP  6. 5 - host_init_fsi
   7.54226|ISTEP  6. 6 - host_set_ipl_parms
   8.06280|ISTEP  6. 7 - host_discover_targets
   9.19791|HWAS|PRESENT> DIMM[03]=8080000000000000
   9.19792|HWAS|PRESENT> Proc[05]=8000000000000000
   9.19794|HWAS|PRESENT> Core[07]=1511540000000000
   9.55305|ISTEP  6. 8 - host_update_master_tpm
   9.60521|SECURE|Security Access Bit> 0x0000000000000000
   9.60522|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
   9.63093|ISTEP  6. 9 - host_gard
   9.89867|HWAS|Blocking Speculative Deconfig
   9.90128|HWAS|FUNCTIONAL> DIMM[03]=8080000000000000
   9.90129|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
   9.90130|HWAS|FUNCTIONAL> Core[07]=1511540000000000
   9.90329|ISTEP  6.11 - host_start_occ_xstop_handler
  11.19092|ISTEP  6.12 - host_voltage_config
  11.30246|ISTEP  7. 1 - mss_attr_cleanup
  12.61924|ISTEP  7. 2 - mss_volt
  12.92705|ISTEP  7. 3 - mss_freq
  13.67475|ISTEP  7. 4 - mss_eff_config
  14.95827|ISTEP  7. 5 - mss_attr_update
  14.97307|ISTEP  8. 1 - host_slave_sbe_config
  15.05372|ISTEP  8. 2 - host_setup_sbe
  15.10258|ISTEP  8. 3 - host_cbs_start
  15.10381|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
  15.11144|ISTEP  8. 5 - host_attnlisten_proc
  15.11213|ISTEP  8. 6 - host_p9_fbc_eff_config
  15.13552|ISTEP  8. 7 - host_p9_eff_config_links
  15.20087|ISTEP  8. 8 - proc_attr_update
  15.20191|ISTEP  8. 9 - proc_chiplet_fabric_scominit
  15.21891|ISTEP  8.10 - proc_xbus_scominit
  15.22929|ISTEP  8.11 - proc_xbus_enable_ridi
  15.24717|ISTEP  8.12 - host_set_voltages
  15.26620|ISTEP  9. 1 - fabric_erepair
  15.42123|ISTEP  9. 2 - fabric_io_dccal
  15.42436|ISTEP  9. 3 - fabric_pre_trainadv
  15.42887|ISTEP  9. 4 - fabric_io_run_training
  15.43207|ISTEP  9. 5 - fabric_post_trainadv
  15.44893|ISTEP  9. 6 - proc_smp_link_layer
  15.45454|ISTEP  9. 7 - proc_fab_iovalid
  15.87126|ISTEP  9. 8 - host_fbc_eff_config_aggregate
  15.89174|ISTEP 10. 1 - proc_build_smp
  16.54194|ISTEP 10. 2 - host_slave_sbe_update
  18.63876|sbe|System Performing SBE Update for PROC 0, side 0
  41.69727|sbe|System Rebooting To Complete SBE Update Process
  41.72189|IPMI: Initiate power cycle
  42.40652|IPMI: shutdown complete

The first IPL updated the Self Boot Engine firmware on the chip, so it automatically applied the new firmware and rebooted to finish applying it. This is perfectly normal, it just shows itself as a longer boot time. Booting continues:

--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--
 3.02810|secure|SecureROM valid - enabling functionality
   6.07331|Booting from SBE side 0 on master proc=00050000
   6.11485|ISTEP  6. 5 - host_init_fsi
   6.60361|ISTEP  6. 6 - host_set_ipl_parms
   6.98640|ISTEP  6. 7 - host_discover_targets
   7.53975|HWAS|PRESENT> DIMM[03]=8080000000000000
   7.53976|HWAS|PRESENT> Proc[05]=8000000000000000
   7.53977|HWAS|PRESENT> Core[07]=1511540000000000
   7.79123|ISTEP  6. 8 - host_update_master_tpm
   7.79263|SECURE|Security Access Bit> 0x0000000000000000
   7.79264|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
   7.82684|ISTEP  6. 9 - host_gard
   8.26609|HWAS|Blocking Speculative Deconfig
   8.26865|HWAS|FUNCTIONAL> DIMM[03]=8080000000000000
   8.26866|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
   8.26867|HWAS|FUNCTIONAL> Core[07]=1511540000000000
   8.27142|ISTEP  6.11 - host_start_occ_xstop_handler
   9.69606|ISTEP  6.12 - host_voltage_config
   9.81183|ISTEP  7. 1 - mss_attr_cleanup
  10.95130|ISTEP  7. 2 - mss_volt
  11.39875|ISTEP  7. 3 - mss_freq
  12.15655|ISTEP  7. 4 - mss_eff_config
  13.63504|ISTEP  7. 5 - mss_attr_update
  13.65162|ISTEP  8. 1 - host_slave_sbe_config
  13.78039|ISTEP  8. 2 - host_setup_sbe
  13.78143|ISTEP  8. 3 - host_cbs_start
  13.78247|ISTEP  8. 4 - proc_check_slave_sbe_seeprom_complete
  13.79015|ISTEP  8. 5 - host_attnlisten_proc
  13.79114|ISTEP  8. 6 - host_p9_fbc_eff_config
  13.79734|ISTEP  8. 7 - host_p9_eff_config_links
  13.85128|ISTEP  8. 8 - proc_attr_update
  13.85783|ISTEP  8. 9 - proc_chiplet_fabric_scominit
  13.87991|ISTEP  8.10 - proc_xbus_scominit
  13.89056|ISTEP  8.11 - proc_xbus_enable_ridi
  13.91122|ISTEP  8.12 - host_set_voltages
  13.93077|ISTEP  9. 1 - fabric_erepair
  14.05235|ISTEP  9. 2 - fabric_io_dccal
  14.13131|ISTEP  9. 3 - fabric_pre_trainadv
  14.13616|ISTEP  9. 4 - fabric_io_run_training
  14.13934|ISTEP  9. 5 - fabric_post_trainadv
  14.14087|ISTEP  9. 6 - proc_smp_link_layer
  14.14656|ISTEP  9. 7 - proc_fab_iovalid
  14.59454|ISTEP  9. 8 - host_fbc_eff_config_aggregate
  14.61811|ISTEP 10. 1 - proc_build_smp
  15.24074|ISTEP 10. 2 - host_slave_sbe_update
  17.16022|sbe|System Performing SBE Update for PROC 0, side 1
  40.16808|ISTEP 10. 4 - proc_cen_ref_clk_enable
  40.27866|ISTEP 10. 5 - proc_enable_osclite
  40.31297|ISTEP 10. 6 - proc_chiplet_scominit
  40.55805|ISTEP 10. 7 - proc_abus_scominit
  40.57942|ISTEP 10. 8 - proc_obus_scominit
  40.58078|ISTEP 10. 9 - proc_npu_scominit
  40.60704|ISTEP 10.10 - proc_pcie_scominit
  40.66572|ISTEP 10.11 - proc_scomoverride_chiplets
  40.66874|ISTEP 10.12 - proc_chiplet_enable_ridi
  40.68407|ISTEP 10.13 - host_rng_bist
  40.75548|ISTEP 10.14 - host_update_redundant_tpm
  40.75785|ISTEP 11. 1 - host_prd_hwreconfig
  41.15067|ISTEP 11. 2 - cen_tp_chiplet_init1
  41.15299|ISTEP 11. 3 - cen_pll_initf
  41.15544|ISTEP 11. 4 - cen_pll_setup
  41.18530|ISTEP 11. 5 - cen_tp_chiplet_init2
  41.18762|ISTEP 11. 6 - cen_tp_arrayinit
  41.19050|ISTEP 11. 7 - cen_tp_chiplet_init3
  41.19286|ISTEP 11. 8 - cen_chiplet_init
  41.19553|ISTEP 11. 9 - cen_arrayinit
  41.19986|ISTEP 11.10 - cen_initf
  41.20215|ISTEP 11.11 - cen_do_manual_inits
  41.20497|ISTEP 11.12 - cen_startclocks
  41.20802|ISTEP 11.13 - cen_scominits
  41.21171|ISTEP 12. 1 - mss_getecid
  42.25709|ISTEP 12. 2 - dmi_attr_update
  42.30382|ISTEP 12. 3 - proc_dmi_scominit
  42.32572|ISTEP 12. 4 - cen_dmi_scominit
  42.32798|ISTEP 12. 5 - dmi_erepair
  42.35000|ISTEP 12. 6 - dmi_io_dccal
  42.35218|ISTEP 12. 7 - dmi_pre_trainadv
  42.35489|ISTEP 12. 8 - dmi_io_run_training
  42.37076|ISTEP 12. 9 - dmi_post_trainadv
  42.39541|ISTEP 12.10 - proc_cen_framelock
  42.40772|ISTEP 12.11 - host_startprd_dmi
  42.41974|ISTEP 12.12 - host_attnlisten_memb
  42.44506|ISTEP 12.13 - cen_set_inband_addr
  42.58832|ISTEP 13. 1 - host_disable_memvolt
  43.67808|ISTEP 13. 2 - mem_pll_reset
  43.75070|ISTEP 13. 3 - mem_pll_initf
  43.85043|ISTEP 13. 4 - mem_pll_setup
  43.87372|ISTEP 13. 6 - mem_startclocks
  43.88970|ISTEP 13. 7 - host_enable_memvolt
  43.89177|ISTEP 13. 8 - mss_scominit
  45.10013|ISTEP 13. 9 - mss_ddr_phy_reset
  45.38105|ISTEP 13.10 - mss_draminit
  45.95447|ISTEP 13.11 - mss_draminit_training
  47.20963|ISTEP 13.12 - mss_draminit_trainadv
  47.32161|ISTEP 13.13 - mss_draminit_mc
  47.49186|ISTEP 14. 1 - mss_memdiag
  69.53224|ISTEP 14. 2 - mss_thermal_init
  69.66891|ISTEP 14. 3 - proc_pcie_config
  69.71959|ISTEP 14. 4 - mss_power_cleanup
  69.72385|ISTEP 14. 5 - proc_setup_bars
  69.83889|ISTEP 14. 6 - proc_htm_setup
  69.84748|ISTEP 14. 7 - proc_exit_cache_contained
  69.89430|ISTEP 15. 1 - host_build_stop_image
  73.08679|ISTEP 15. 2 - proc_set_pba_homer_bar
  73.12352|ISTEP 15. 3 - host_establish_ex_chiplet
  73.13714|ISTEP 15. 4 - host_start_stop_engine
  73.19059|ISTEP 16. 1 - host_activate_master
  74.44590|ISTEP 16. 2 - host_activate_slave_cores
  74.53820|ISTEP 16. 3 - host_secure_rng
  74.54651|ISTEP 16. 4 - mss_scrub
  74.56565|ISTEP 16. 5 - host_load_io_ppe
  74.78752|ISTEP 16. 6 - host_ipl_complete
  75.50085|ISTEP 18.11 - proc_tod_setup
  75.94190|ISTEP 18.12 - proc_tod_init
  75.97575|ISTEP 20. 1 - host_load_payload
  77.12340|ISTEP 20. 2 - host_load_hdat
  78.05195|ISTEP 21. 1 - host_runtime_setup
  83.87001|htmgt|OCCs are now running in ACTIVE state
  89.72649|ISTEP 21. 2 - host_verify_hdat
  89.77252|ISTEP 21. 3 - host_start_payload
 [   90.400516933,5] OPAL skiboot-c81f9d6 starting…

The rest of the skiboot log was also spat out, and then the familiar Petitboot screen:

Welcome to Petitboot!

It lives! I even had a bit of a look at the sensors to see power consumption and temperatures. All looks good:

ipmitool sdr|grep -v ns
 occ0             | 0x00              | ok
 occ1             | 0x00              | ok
 p0_core3_temp    | 51 degrees C      | ok
 p0_core5_temp    | 49 degrees C      | ok
 p0_core7_temp    | 50 degrees C      | ok
 p0_core11_temp   | 49 degrees C      | ok
 p0_core15_temp   | 50 degrees C      | ok
 p0_core17_temp   | 50 degrees C      | ok
 p0_core19_temp   | 50 degrees C      | ok
 p0_core21_temp   | 50 degrees C      | ok
 dimm0_temp       | 36 degrees C      | ok
 dimm4_temp       | 39 degrees C      | ok
 fan0             | 1300 RPM          | ok
 fan1             | 1200 RPM          | ok
 fan2             | 1000 RPM          | ok
 p0_power         | 60 Watts          | ok
 p0_vdd_power     | 31 Watts          | ok
 p0_vdn_power     | 10 Watts          | ok
 cpu_1_ambient    | 30.90 degrees C   | ok
 pcie             | 27 degrees C      | ok
 ambient          | 25.40 degrees C   | ok

Next up? I guess I should install an OS.

Coming to grips with Kubernetes in 2020: podcasts


It has become clear to me that it is time to care about Kubernetes more. I’m sure many people have cared for ages, but the things I want to build at the moment are starting to be more container based now that I am thinking more at the application layer than the cloud infrastructure layer. So how to do that? I thought I’d write down some notes on what has worked (or not) for me, in the hope it will help others. In this post, podcasts.

I thought podcasts would be an interesting way to get started with some nice overviews. This is especially true because I’m already a pretty heavy podcast user, so it was easy to slot into my existing routine. Unfortunately this hasn’t really worked out. I started with the podctl podcast, but they only ever talk about Red Hat stuff. It is very rare for a guest to not be a Red Hat employee for example. The presenters of this podcast seem to also really dislike OpenStack for reasons they never explain, which is annoying.

Then I figured maybe the Google Kubernetes podcast would be better, but it often lacks the depth I am interested in.

I am yet to find a good podcast which deep dives into technology instead of just talking about what is in the latest release. So maybe these podcasts are useful if you’re interested in what things dropped in the most recent release, but they’re not a good nor systematic way to get introduced to Kubernetes.

That said, I only just discovered the TGI Kubernetes YouTube channel yesterday. It is not really what I wanted in a podcast given it’s a video blog, but I think it has prospects to be interesting. I will update this post when I’ve had a chance to check it out in more depth.

Have you found a good Kubernetes podcast? Am I being wildly unfair?


December 12, 2019

Audiobooks – October 2019

The Story of the British Isles in 100 Places by Neil Oliver

Covers what you’d expect, with a good attempt not just to hit the “history 101” places. The author has an accent that takes a while to get used to. 7/10

Death’s End – Cixin Liu

3rd in Trilogy wrapping things mostly up. Just a few characters so easy to keep track of them. If you liked the previous books you’ll like this one. 7/10

Building the Cycling City: The Dutch Blueprint for Urban Vitality by Melissa & Chris Bruntlett

Talking about Dutch Cycling culture. Compares 5 different cities (some car orientated) and how they differ in their cycling journey. 7/10

Scrappy Little Nobody by Anna Kendrick

A general memoir by the actress. A bit disjointed & unsystematic and by no means a tell-all. A few good stories sprinkled in. 6/10

The $100 Startup by Chris Guillebeau

Lots of case studies of businesses built off relatively little capital (and usually staying small). Plenty of good advice although lists don’t translate well in audio. 7/10

Atomic Adventures: Secret Islands, Forgotten N-Rays, and Isotopic Murder-A Journey into the Wild World of Nuclear Science by James Mahaffey

A bunch of really good stories from the Atomic age (not just the usual ones) including a view from inside of the Cold Fusion fiasco. 8/10


December 10, 2019

Looking at the state of Blackbird firmware

Having been somewhat involved in OpenPOWER firmware, I have a bunch of experience and opinions on maintaining firmware trees for products, what working with upstream looks like and all that.

So, with my new Blackbird system I decided to take a bit of a look as to what the firmware situation was like.

There’s two main parts of firmware: BMC and Host. The BMC firmware runs purely on the ASPEED AST2500 and is based on OpenBMC while the host firmware is what runs on the POWER9 and is based off of OpenPOWER Firmware as assembled by op-build.

Initial impressions of the BMC are that there doesn’t seem to be any web-based UI for it, which is kind of disappointing, as the web UI being developed upstream has some nice qualities, and I’d say I even enjoyed using it when it was built into BMC firmware for systems we had when I was at IBM.

Looking at the git trees, the raptor-v1.00 tag is OpenBMC 2.7.0-dev-533-g386e5602e while current master is 2.8.0-dev-960-g10f7830bd. The spot where it split off was 2.7.0-dev-430-g7443ee80b, from April 2019 – so it’s not too old, but I’m also not convinced it has picked up whatever security patches have landed since then.

I’m not sure if any of the OpenBMC code is upstream, I haven’t looked.

Unfortunately, none of the host firmware is upstream.

On the host firmware side, v2.3-rc2-67-ga6a5f142 is the Raptor tag, and that compares with current master of v2.4-305-g54d8daf4, the place where Raptor forked was v2.3-rc2-9-g7b556015, again in April of 2019. Considering there was an upstream release in May of 2019 (v2.3), and again in July (v2.4), it could have easily have made it into an upstream release.

Unfortunately, there doesn’t seem to have been an upstream op-build release since v2.4 back in July (when I made it shortly before leaving IBM).

The skiboot component of host firmware has had an upstream release since I left (v6.5 in mid-August 2019), so the (rather trivial) platform support could have easily made it. I have a cleaned up and ready to upstream patch for it, I just need some DIMMs to actually test with before I send the patch.

As the current firmware situation stands, producing another build with updated upstream code is tricky due to the out-of-tree nature of the Blackbird patches, and a straight “git merge” is probably doable by some people, but not everybody.

On my TODO list is to get all the code into a state I can upstream it, assess vulnerability to CVE-2019-6260, and work out how I want to make it do Secure Boot (something that isn’t in upstream firmware yet, and currently would require a TPM, which I do not have).

Blackbird (singing in the dead of night..)

Way back when Raptor Computer Systems was doing pre-orders for the microATX Blackbird POWER9 system, I put in a pre-order. Since then, I’ve had a few life changes (such as moving to the US and starting to work for Amazon rather than IBM), but I’ve finally gone and done (most of) the setup for my own POWER9 system on (or under) my desk.

An 8 core POWER9 CPU, in bubble wrap and plastic packaging.

Everything came in a big brown box, all rather well packed. I had the board, CPU, heatsink assembly and the special tool to attach the heatsink to the board. Although unique to POWER9, the heatsink/fan assembly was one of the easier ones I’ve ever attached to a board.

The board itself looks pretty much as you’d expect – there’s a big spot for the CPU, a couple of PCI slots, a couple of DIMM slots and some SATA connectors.

The bits that are a bit unusual for a micro-ATX board are the big space reserved for FlexVer, the ASPEED BMC chip and the socketed flash. FlexVer is something I’m not ever going to use, and I instead wish there was an on-board M.2 SSD slot, even if it was just PCIe. Having to sacrifice a PCIe slot just for an SSD is kind of a bummer.

The Blackbird POWER9 board
The POWER9 chip in socket

One annoying thing is my DIMMs are taking their sweet time in getting here, so I couldn’t actually populate the board with any memory.

Even without memory though, you can start powering it on and see that everything else works okay (i.e. it’s not completely boned). So, even without DIMMs, I could plug it in, and observe the Hostboot firmware complaining about insufficient hardware to IPL the box.

It Lives!

Yep, on the console (via SSH) you can clearly see where things fail:

--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--

  3.03104|secure|SecureROM valid - enabling functionality
  6.67619|Booting from SBE side 0 on master proc=00050000
  6.85100|ISTEP  6. 5 - host_init_fsi
  7.23753|ISTEP  6. 6 - host_set_ipl_parms
  7.71759|ISTEP  6. 7 - host_discover_targets
 11.34738|HWAS|PRESENT> Proc[05]=8000000000000000
 11.34739|HWAS|PRESENT> Core[07]=1511540000000000
 11.69077|ISTEP  6. 8 - host_update_master_tpm
 11.73787|SECURE|Security Access Bit> 0x0000000000000000
 11.73787|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
 11.76276|ISTEP  6. 9 - host_gard
 11.96654|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
 11.96655|HWAS|FUNCTIONAL> Core[07]=1511540000000000
 12.07554|================================================
 12.07554|Error reported by hwas (0x0C00) PLID 0x90000007
 12.10289|  checkMinimumHardware found no functional dimm cards.
 12.10290|  ModuleId   0x03 MOD_CHECK_MIN_HW
 12.10291|  ReasonCode 0x0c06 RC_SYSAVAIL_NO_MEMORY_FUNC
 12.10292|  UserData1  HUID of node : 0x0002000000000000
 12.10293|  UserData2  number of present, non-functional dimms : 0x0000000000000000
 12.10294|------------------------------------------------
 12.10417|  Callout type             : Procedure Callout
 12.10417|  Procedure                : EPUB_PRC_FIND_DECONFIGURED_PART
 12.10418|  Priority                 : SRCI_PRIORITY_HIGH
 12.10419|------------------------------------------------
 12.10420|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.10421|================================================
 12.51718|================================================
 12.51719|Error reported by hwas (0x0C00) PLID 0x90000007
 12.51720|  Insufficient hardware to continue.
 12.51721|  ModuleId   0x03 MOD_CHECK_MIN_HW
 12.51722|  ReasonCode 0x0c04 RC_SYSAVAIL_INSUFFICIENT_HW
 12.54457|  UserData1   : 0x0000000000000000
 12.54458|  UserData2   : 0x0000000000000000
 12.54458|------------------------------------------------
 12.54459|  Callout type             : Procedure Callout
 12.54460|  Procedure                : EPUB_PRC_FIND_DECONFIGURED_PART
 12.54461|  Priority                 : SRCI_PRIORITY_HIGH
 12.54462|------------------------------------------------
 12.54462|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.54463|================================================
 12.73660|System shutting down with error status 0x90000007
 12.75545|================================================
 12.75546|Error reported by istep (0x1700) PLID 0x90000007
 12.77991|  IStep failed, see other log(s) with the same PLID for reason.
 12.77992|  ModuleId   0x01 MOD_REPORTING_ERROR
 12.77993|  ReasonCode 0x1703 RC_FAILURE
 12.77994|  UserData1  eid of first error : 0x9000000800000c04
 12.77995|  UserData2  Reason code of first error : 0x0000000100000609
 12.77996|------------------------------------------------
 12.77996|  host_gard
 12.77997|------------------------------------------------
 12.77998|  Callout type             : Procedure Callout
 12.77998|  Procedure                : EPUB_PRC_HB_CODE
 12.77999|  Priority                 : SRCI_PRIORITY_LOW
 12.78000|------------------------------------------------
 12.78001|  Hostboot Build ID: hostboot-3beba24/hbicore.bin
 12.78002|================================================

Looking forward to getting some DIMMs to show/share more.

December 09, 2019

systemd-nspawn and Private Networking

Currently there are two things I want to do with my PC at the same time: one is watching streaming services like ABC iView (which won’t run from non-Australian IP addresses) and another is torrenting over a VPN. I had considered doing something ugly with iptables to try and get routing done on a per-UID basis but that seemed too difficult. At the time I wasn’t aware of the ip rule add uidrange [1] option. So setting up a private networking namespace with a systemd-nspawn container seemed like a good idea.

Chroot Setup

For the chroot (which I use as a slang term for a copy of a Linux installation in a subdirectory) I used a btrfs subvol that’s a snapshot of the root subvol. The idea is that when I upgrade the root system I can just recreate the chroot with a new snapshot.
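Creating the chroot itself is then a one-liner (a sketch assuming / is the root subvol and /subvols is a directory on the same btrfs filesystem, matching the /subvols/torrent path used below):

sudo btrfs subvolume snapshot / /subvols/torrent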

To get this working I created files in the root subvol which are used for the container.

I created a script like the following, named /usr/local/sbin/container-sshd, to launch the container. It sets up the networking and executes sshd. The systemd-nspawn program is designed to launch init but that’s not required; I prefer to just launch sshd so there’s only one running process in a container that’s not being actively used.

#!/bin/bash

# restorecon commands only needed for SE Linux
/sbin/restorecon -R /dev
/bin/mount none -t tmpfs /run
/bin/mkdir -p /run/sshd
/sbin/restorecon -R /run /tmp
/sbin/ifconfig host0 10.3.0.2 netmask 255.255.0.0
/sbin/route add default gw 10.2.0.1
exec /usr/sbin/sshd -D -f /etc/ssh/sshd_torrent_config

How to Launch It

To set up the container I used a command like “/usr/bin/systemd-nspawn -D /subvols/torrent -M torrent --bind=/home -n /usr/local/sbin/container-sshd“.

First I had tried the --network-ipvlan option which creates a new IP address on the same MAC address. That gave me an interface iv-br0 on the container that I could use normally (br0 being the bridge used in my workstation as its primary network interface). The IP address I assigned to that was in the same subnet as br0, but for some reason that’s unknown to me (maybe an interaction between bridging and network namespaces) I couldn’t access it from the host; I could only access it from other hosts on the network. I then tried the --network-macvlan option (to create a new MAC address for virtual networking), but that had the same problem with accessing the IP address from the local host outside the container as well as problems with MAC redirection to the primary MAC of the host (again maybe an interaction with bridging).

Then I tried just the “-n” option which gave it a private network interface. That created an interface named ve-torrent on the host side and one named host0 in the container. Using ifconfig and route to configure the interface in the container before launching sshd is easy. I haven’t yet determined a good way of configuring the host side of the private network interface automatically.
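For the record, the manual version of the host-side setup looks something like this (my sketch, assuming the ve-torrent name and 10.3.0.0/16 range from above and br0 as the outgoing interface; the container’s default route then needs to point at this host-side address):

sudo ip addr add 10.3.0.1/16 dev ve-torrent
sudo ip link set ve-torrent up
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 10.3.0.0/16 -o br0 -j MASQUERADE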

I had to use a bind for /home because /home is a subvol and therefore doesn’t get included in the container by default.

How it Works

Now when it’s running I can just “ssh -X” to the container and then run graphical programs that use the VPN while at the same time running graphical programs on the main host that don’t use the VPN.

Things To Do

Find out why --network-ipvlan and --network-macvlan don’t work with communication from the same host.

Find out why --network-macvlan gives errors about MAC redirection when pinging.

Determine a good way of setting up the host side after the systemd-nspawn program has run.

Find out if there are better ways of solving this problem, this way works but might not be ideal. Comments welcome.

December 06, 2019

Audiobooks – September 2019

Off the Rails: A Train Trip Through Life by Beppe Severgnini

A collection of train journey articles (written over about 20 years). A good selection on interesting and amusing. 7/10

Exoplanets: Hidden Worlds and the Quest for Extraterrestrial Life by Donald Goldsmith

A history of the discovery of exoplanets, covering the different groups, techniques and rivalries. Good although I got the people mixed up sometimes. 7/10

Save the Cat! : The Last Book on Screenwriting You’ll Ever Need by Blake Snyder

A guide to screenwriting with a few stories and observations on movies thrown in. Good even if you are just reading it for fun. 7/10

Being Mortal: Medicine and What Matters in the End by Atul Gawande

A book about geriatric and end-of-life care and choices. Lots of points about how risking all for aggressive treatment is often a very bad idea. Thought-provoking. 9/10

Ancient Alexandria: The History and Legacy of Egypt’s Most Famous City by Charles River Editors

Just a two hour long overview of the history. Covered the basic stuff and maybe worth skimming before you hit something meatier. 6/10

Vulcan 607 by Rowland White

The story of the long-distant bombing raids during the Falkland’s war. Lots of details on the history of the Vulcan, the crews, background and the actual missions. 9/10

101 Secrets For Your Twenties by Paul Angone

I really can’t remember this book well. I think it was okay but serves me right for getting months behind on reviews. On list for completeness. ?/10


December 02, 2019

LUV December 2019 Main Meeting: A review of Linux and Open Source in 2019

Dec 3 2019 19:00
Dec 3 2019 21:00
Location: 
Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

NOTE: The library closes at 7pm so arrivals after that time will need to contact Andrew on (0421) 775 358 or any other attendee for admission.

Speaker:  Alexar Pendashteh

A review of Linux and Open Source in 2019

This is the last main meeting of LUV in 2019!
In this meeting we are going to have a look at what 2019 had for Linux and Open Source and have a peek into what's coming up.
This event will be mainly a social event, with group discussion followed by dinner in a nearby restaurant or cafe!

Many of us like to go for dinner nearby in Lygon St. after the meeting. Please let us know if you'd like to join us!

Linux Users of Victoria is a subcommittee of Linux Australia.


November 25, 2019

Fixing Turris Omnia WiFi Quality

I was recently hoping to replace an aging proprietary router (upgraded to a Gargoyle FOSS firmware). After rejecting a popular brand with a disturbing GPL violation habit, I settled on the Turris Omnia router, built on free software. Overall, I was pretty satisfied with the fact that it is free and comes with automatic updates, but I noticed a problem with the WiFi. Specifically, the 5 GHz access point was okay but the 2.4 GHz was awful.

False lead

I initially thought that the 2.4 GHz radio wasn't working, but then I realized that putting my phone next to the router would allow it to connect and exchange data at a slow-but-steady rate. If I moved the phone more than 3-4 meters away though, it would disconnect for lack of signal. To be frank, the wireless performance was much worse than my original router, even though the wired performance was, as expected, amazing.

I looked on the official support forums and found this intriguing thread about interference between USB3 and 2.4 GHz radios. This sounded a lot like what I was experiencing (working radio but terrible signal/interference) and so I decided to see if I could move the radios around inside the unit, as suggested by the poster.

After opening the case however, I noticed that radios were already laid out in the optimal way:

and that USB3 interference wasn't going to be the reason for my troubles.

Real problem

So I took a good look at the wiring and found that while the larger radio (2.4 / 5 GHz dual-bander) was connected to all three antennas, the smaller radio (2.4 GHz only) was connected to only 2 of the 3 antennas:

To make it possible for antennas 1 and 3 to carry the signal from both radios, a duplexer got inserted between the radios and the antenna:

On one side is the 2.4 antenna port and on the other side is the 5 GHz port.

Looking at the wiring though, it became clear that my 2.4 GHz radio was connected to the 5 GHz ports of the two duplexers and the 5 GHz radio was connected to the 2.4 GHz ports of the duplexers. This makes sense considering that I had okay 5 GHz performance (with one of the three chains connected to the right filter) and abysmal 2.4 GHz performance (with neither of the two chains connected to the right filter).

Solution

Swapping the antenna connectors around completely fixed the problem. With the 2.4 GHz radio connected to the 2.4 side of the duplexer and the dual-bander connected to the 5 GHz side, I was able to get the performance I would expect from such a high-quality router.

Interestingly enough, I found the solution to this problem the same weekend as I passed my advanced amateur radio license exam. I guess that was a good way to put the course material into practice!

November 18, 2019

4K Monitors

A couple of years ago a relative who uses a Linux workstation I support bought a 4K (4096*2160 resolution) monitor. That meant that I had to get 4K working, which was 2 years of pain for me and probably not enough benefit for them to justify it. Recently I had the opportunity to buy some 4K monitors at a low enough price that it didn’t make sense to refuse so I got to experience it myself.

The Need for 4K

I’m getting older and my vision is decreasing as expected. I recently got new glasses and got a pair of reading glasses as a reduced ability to change focus is common as you get older. Unfortunately I made a mistake when requesting the focus distance for the reading glasses and they work well for phones, tablets, and books but not for laptops and desktop computers. Now I have the option of either spending a moderate amount of money to buy a new pair of reading glasses or just dealing with the fact that laptop/desktop use isn’t going to be as good until the next time I need new glasses (sometime 2021).

I like having lots of terminal windows on my desktop. For common tasks I might need a few terminals open at a time and if I get interrupted in a task I like to leave the terminal windows for it open so I can easily go back to it. Having more 80*25 terminal windows on screen increases my productivity. My previous monitor was 2560*1440 which for years had allowed me to have a 4*4 array of non-overlapping terminal windows as well as another 8 or 9 overlapping ones if I needed more. 16 terminals allows me to ssh to lots of systems and edit lots of files in vi. Earlier this year I had found it difficult to read the font size that previously worked well for me so I had to use a larger font that meant that only 3*3 terminals would fit on my screen. Going from 16 non-overlapping windows and an optional 8 overlapping to 9 non-overlapping and an optional 6 overlapping is a significant difference. I could get a second monitor, and I won’t rule out doing so at some future time. But it’s not ideal.

When I got a 4K monitor working properly I found that I could go back to a smaller font that allowed 16 non overlapping windows. So I got a real benefit from a 4K monitor!

Video Hardware

Version 1.0 of HDMI, released in 2002, only supports 1920*1080 (FullHD) resolution. Version 1.3, released in 2006, supported 2560*1440. Most of my collection of PCIe video cards have a maximum resolution of 1920*1080 over HDMI, so it seems that they only support HDMI 1.2 or earlier. When investigating this I wondered what version of PCIe they were using; the command “dmidecode |grep PCI” gives that information. It seems that at least one PCIe video card supports PCIe 2 (released in 2007) but not HDMI 1.3 (released in 2006).
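
Another way to check this on your own hardware is lspci, which shows the PCIe link capability and the link that was actually negotiated; a rough sketch (the 01:00.0 bus address is only illustrative, use whatever the first command reports for your card):

# find the video card's PCI address
lspci | grep -i vga
# then look at its PCIe link capability and the speed/width it actually negotiated
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'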

Many video cards in my collection support 2560*1440 with DVI but only 1920*1080 with HDMI. As 4K monitors don’t support DVI input, that meant that when I initially used a 4K monitor I was running at 1920*1080, instead of the 2560*1440 I had with my old monitor.

I found that one of my old video cards supported 4K resolution, it has an NVidia GT630 chipset (here’s the page with specifications for that chipset [1]). It seems that because I have a video card with 2G of RAM I have the “Kepler” variant which supports 4K resolution. I got the video card in question because it uses PCIe*8 and I had a workstation that only had PCIe*8 slots and I didn’t feel like cutting a card down to size (which is apparently possible but not recommended); it is also fanless (quiet), which is handy if you don’t need a lot of GPU power.

A couple of months ago I checked the cheap video cards at my favourite computer store (MSY) and none of the cheap ones supported 4K resolution. Now it seems that all the video cards they sell could support 4K; by “could” I mean that a Google search of the chipset says that it’s possible, but of course some surrounding chips could fail to support it.

The GT630 card is great for text, but the combination of it with an i5-2500 CPU (rating 6353 according to cpubenchmark.net [3]) doesn’t allow playing Netflix full-screen, and playing 1920*1080 videos scaled to full-screen sometimes gets mplayer messages about the CPU being too slow. I don’t know how much of this is due to the CPU and how much is due to the graphics hardware.

When trying the same system with an ATI Radeon R7 260X/360 graphics card (16* PCIe and draws enough power to need a separate connection to the PSU) the Netflix playback appears better but mplayer seems no better.

I guess I need a new PC to play 1920*1080 video scaled to full-screen on a 4K monitor. No idea what hardware will be needed to play actual 4K video. Comments offering suggestions in this regard will be appreciated.
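
One cheap check (a sketch, assuming mpv is installed; the video path is just a placeholder) is whether hardware decoding is being used at all. mpv, a descendant of MPlayer, will report which hardware decoder, if any, it manages to pick:

# --hwdec=auto asks mpv to try VDPAU/VAAPI/etc.; the terminal output names the backend used
mpv --hwdec=auto /path/to/some-1080p-video.mkv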

Software Configuration

For GNOME apps (which you will probably run even if like me you use KDE for your desktop) you need to run commands like the following to scale menus etc:

gsettings set org.gnome.settings-daemon.plugins.xsettings overrides "[{'Gdk/WindowScalingFactor', <2>}]"
gsettings set org.gnome.desktop.interface scaling-factor 2
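
You can read the values back to confirm they stuck:

gsettings get org.gnome.settings-daemon.plugins.xsettings overrides
gsettings get org.gnome.desktop.interface scaling-factor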

For KDE run the System Settings app, go to Display and Monitor, then go to Displays and Scale Display to scale things.

The Arch Linux Wiki page on HiDPI [2] is good for information on how to make apps work with high DPI (or regular screens for people with poor vision).

Conclusion

4K displays are still rather painful, both in hardware and software configuration. For serious computer use it’s worth the hassle, but it doesn’t seem to be good for general use yet. 2560*1440 is pretty good and works with much more hardware and requires hardly any software configuration.

November 17, 2019

Use swap on NVMe to run more dev KVM guests, for when you run out of RAM

I often spin up a bunch of VMs for different reasons when doing dev work and unfortunately, as awesome as my little mini-itx Ryzen 9 dev box is, it only has 32GB RAM. Kernel Samepage Merging (KSM) definitely helps, however when I have half a dozen or so VMs running and chewing up RAM, the Kernel’s Out Of Memory (OOM) killer will start executing them, like this.

[171242.719512] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/machine.slice/machine-qemu\x2d435\x2dtest\x2dvm\x2dcentos\x2d7\x2d00.scope,task=qemu-system-x86,pid=2785515,uid=107
[171242.719536] Out of memory: Killed process 2785515 (qemu-system-x86) total-vm:22450012kB, anon-rss:5177368kB, file-rss:0kB, shmem-rss:0kB
[171242.887700] oom_reaper: reaped process 2785515 (qemu-system-x86), now anon-rss:0kB, file-rss:68kB, shmem-rss:0kB

If I had more slots available (which I don’t) I could add more RAM, but that’s actually pretty expensive, plus I really like the little form factor. So, given it’s just dev work, a relatively cheap alternative is to buy an NVMe drive and add a swap file to it (or dedicate the whole drive). This is what I’ve done on my little dev box (actually I bought it with an NVMe drive so adding the swapfile came for free).
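
The swap file setup itself is the standard one; a minimal sketch, assuming the NVMe filesystem is mounted at /nvme (the path and size are illustrative, and fallocate works for swap files on ext4 but fall back to dd if your filesystem doesn't support it):

sudo fallocate -l 100G /nvme/swapfile   # or: sudo dd if=/dev/zero of=/nvme/swapfile bs=1M count=102400
sudo chmod 600 /nvme/swapfile           # swap files must not be world-readable
sudo mkswap /nvme/swapfile
sudo swapon /nvme/swapfile
echo '/nvme/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # make it survive reboots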

Of course the number of VMs you can run depends on the amount of RAM each VM actually needs for what you’re running on it. But whether I’m running 100 small VMs or 10 large ones, it doesn’t matter.

To demonstrate this, I spin up a bunch of CentOS 7 VMs at the same time and upgrade all packages. Without swap I could comfortably run half a dozen VMs, but more than that and they would start getting killed. With 100GB swap file I am able to get about 40 going!

Even with pages swapping in and out, I haven’t really noticed any performance decrease and there is negligible CPU time wasted waiting on disk I/O when using the machines normally.

The main advantage for me is that I can keep lots of VMs around (or spin up dozens) in order to test things, without having to juggle active VMs or hoping they won’t actually use their memory and have the kernel start killing my VMs. It’s not as seamless as extra RAM would be, but that’s expensive and I don’t have the slots for it anyway, so this seems like a good compromise.

November 16, 2019

DrupalSouth Diversity Scholarship Winner Announced

A few weeks ago we announced our diversity scholarship for DrupalSouth. Before announcing the winner I want to talk a bit about our experience doing this for the first time.

DrupalSouth is the largest Drupal event held in Oceania every year. It provides a great marketing opportunity for businesses wanting to promote their products and services to the Drupal community. Dave Hall Consulting planned to sponsor DrupalSouth to promote our new training business - Getting It Live training. By the time we got organised all of the (affordable) sponsorship opportunities had gone. After considering various opportunities around the event we felt the best way of investing a similar amount of money and giving something back to the community was through a diversity scholarship.

The community provided positive feedback about the initiative. However despite the enthusiasm and working our networks to get a range of applicants, we only ended up with 7 applicants. They were all guys. One applicant was from Australia, the rest were from overseas. About half the applicants dropped out when contacted to confirm that they could cover their own travel and visa expenses.

We are likely to offer other scholarships in the future. We will start earlier and explore other channels for promoting the program.

The scholarship has been awarded to Yogesh Ingale, from Mumbai, India. Over the last 3 years Yogesh has been employed by Tata Consultancy Services’ digital operations team as a DevOps Engineer. During this time he has worked with Drupal, Cloud Computing, Python and Web Technologies. Yogesh is interested in automating processes. When he’s not working, Yogesh likes to travel, automate things and write blog posts. Disclaimer: I know Yogesh through my work with one of my clients. Sometimes the Drupal community feels pretty small.

Congratulations Yogesh! I am looking forward to seeing you in Hobart.

If you want to meet Yogesh before DrupalSouth, we still have some seats available for our 2 day git training course that’s running on 25-26 November. If you won’t be in Hobart, contact us to discuss your training needs.

November 10, 2019

Database Tab Sweep

I miss a proper database related newsletter for busy people. There’s so much happening in the space, from tech, to licensing, and even usage. Anyway, quick tab sweep.

Paul Vallée (of Pythian fame) has been working on Tehama for some time, and now he gets to do it full time, as a PE firm has bought control of Pythian’s services business. Pythian has more than 350 employees and 250 customers, and has raised capital before. More at Ottawa’s Pythian spins out software platform Tehama.

Database leaks data on most of Ecuador’s citizens, including 6.7 million children – ElasticSearch.

Percona has launched Percona Distribution for PostgreSQL 11. This means they have servers for MySQL, MongoDB, and now PostgreSQL. Looks very much like a packaged server with tools from 3rd parties (source).

Severalnines has launched Backup Ninja, an agent-based SaaS service to backup popular databases in the cloud. Backup.Ninja (cool URL) supports MySQL (and variants), MongoDB, PostgreSQL and TimeScale. No pricing available, but it is free for 30 days.

Comparing Database Types: How Database Types Evolved to Meet Different Needs

New In PostgreSQL 12: Generated Columns – anyone doing a comparison with MariaDB Server or MySQL?

Migration Complete – Amazon’s Consumer Business Just Turned off its Final Oracle Database – a huge deal as they migrated 75 petabytes of internal data to DynamoDB, Aurora, RDS and Redshift. Amazon, powered by AWS, and a big win for open source (a lot of these services are built-on open source).

MongoDB and Alibaba Cloud Launch New Partnership – I see this as a win for the SSPL relicense. It is far too costly to maintain a drop-in compatible fork, in a single company (Hi Amazon DocumentDB!). Maybe if the PostgreSQL layer gets open sourced, there is a chance, but otherwise, all good news for Alibaba and MongoDB.

MySQL 8.0.18 brings hash join, EXPLAIN ANALYZE, and more interestingly, HashiCorp Vault support for MySQL Keyring. (Percona has an open source variant).

Some thoughts on Storytelling as an engineering teaching tool

Every week at work on Wednesday afternoons we have the SRE ops review, a relaxed two-hour affair where SREs (& friends, not all of whom are engineers) share interesting tidbits that have happened over the last week or so: this might be a great success, an outage, a weird case, or even a thorny unsolved problem. Usually these relate to a service the speaker is on call for, or perhaps a dependency or customer service, but we also discuss major incidents both internal & external. Sometimes a recent issue will remind one of the old guard (of which I am very much now a part) of a grand old story, and we share those too.

Often the discussion continues well into the evening as we decant to one of the local pubs for dinner & beer, sometimes chatting away until closing time (probably quite regularly actually, but I'm normally long gone).

It was at one of these nights at the pub two months ago (sorry!), that we ended up chatting about storytelling as a teaching tool, and a colleague asked an excellent question, that at the time I didn't have a ready answer for, but I've been slowly pondering, and decided to focus on over an upcoming trip.

As I start to write the first draft of this post I've just settled in for the cruise on my first international trip in over six months[1], popping over to Singapore for the Melbourne Cup weekend. Whilst I'd intended this to be a holiday, I'm so terrible at actually having a holiday[2] that I've ended up booking two sessions of storytelling time, where I present the history of Google's production networks (for those of you reading this who are current or former engineering Googlers, similar to Traffic 101). It's with this perspective of planning, and having run those sessions, that I'm going to try and answer the question that I was asked.

Or at least, I'm going to split up the question I was asked and answer each part.

"What makes storytelling good"

On its own this is hard to answer, there are aspects that can help, such as good presentation skills (ideally keeping to spoken word, but simple graphs, diagrams & possibly photos can help), but a good story can be told in a dry technical monotone and still be a good story. That said, as with the rest of these items charisma helps.

"What makes storytelling interesting"

In short, a hook or connection to the audience, for a lot of my infrastructure related outage stories I have enough context with the audience to be able to tie the impact back in a way that resonates with a person. For larger disparate groups shared languages & context help ensure that I'm not just explaining to one person.

In these recent sessions one was with a group of people who work in our Singapore data centre, in that session I focused primarily on the history & evolution of our data centre fabrics, giving them context to understand why some of the (at face level) stranger design decisions have been made that way.

The second session was primarily people involved in the deployment side of our backbone networks, and so I focused more on the backbones, again linking with knowledge the group already had.

"What makes storytelling entertaining"

Entertaining storytelling is a matter of style, skills and charisma, and while many people can prepare (possibly with help) an entertaining talk, the ability to tell an entertaining story off the cuff is more of a skill; luckily for me, it's one I seem to do ok with. Two things that can work well are dropping in surprises, and where relevant some level of self-deprecation, however both need to be done very carefully.

Surprises can work very well when telling a story chronologically "I assumed X because Y, <five minutes of waffling>, so it turned out I hadn't proved Y like I thought, so it wasn't X, it was Z", they can help the audience to understand why a problem wasn't solved so easily, and explaining "traps for young players" as Dave Jones (of the EEVblog) likes to say can themselves be really helpful learning elements. Dropping surprises that weren't surprises to the story's protagonist generally only works if it's as a punchline of a joke, and even then it often doesn't.

Self-deprecation is an element that I've often used in the past, however more recently I've called others out on using it, and have been trying to reduce it myself. Depending on the audience you might appear as a bumbling success, or stupid, when the reality may be that nobody understood the situation properly, even if someone should have. In the ops review style of storytelling, it can also lead to a less experienced audience feeling much less confident in general than they should, which itself can harm productivity and careers.

If the audience already had relevant experience (presenting a classic SRE issue to other SREs for example, a network issue to network engineers, etc.) then audience interaction can work very well for engagement. "So the latency graph for database queries was going up and to the right, what would you look at?" This is also similar to one of the ways to run a "wheel of misfortune" outage simulation.

"What makes storytelling useful & informative at the same time"

As with making it interesting, making storytelling useful & informative involves consideration for the audience; as a presenter it helps if you know the audience, at least in broad strokes. As I mentioned above, when I presented my talk to a group of datacenter-focused people I focused on the DC elements, connecting history to the current incarnations; when I presented to a group of more general networking folk a few days later, I focused more on the backbones and other elements they'd encountered.

Don't assume that a story will stick wholesale; just leaving a few keywords, or even just a vague memory with a few key words they can go digging for, can make all the difference in the world. Repetition works too: sharing many interesting stories that share the same moral (for example, one of the ops review classics is a demonstration of how lack of exponential backoff can make recovery from outages hard) means that, after hearing it over dozens of different stories over weeks (or months, or years...), it eventually seeps in as something not even questioned, having been demonstrated as such an obvious foundation of good systems.

When I'm speaking to an internal audience I'm happy if they simply remember that I (or my team) exist and might be worth reaching out to in future if they have questions.

Lastly, storytelling is a skill you need to practice. Whether it's a keynote presentation in front of a few thousand people, or just telling tall tales to some mates at the pub, practice helps, and eventually many of the elements I've mentioned above become almost automatic. As can probably be seen from this post, I could do with some more practice on the written side.

1: As I write these words I'm aboard a Qantas A380 (QF1) flying towards Singapore, the book I'm currently reading, of all things about mechanical precision ("Exactly: How Precision Engineers Created the Modern World" or as it has been retitled for paperback "The Perfectionists"), has a chapter themed around QF32, the Qantas A380 that notoriously had to return to Singapore after an uncontained engine failure. Both the ATSB report on the incident and the captain Richard de Crespigny's book QF32 are worth reading. I remember I burned through QF32 one (very early) morning when I was stuck in GlobalSwitch Sydney waiting for approval to repatch a fibre, one of the few times I've actually dealt with the physical side of Google's production networks, and to date the only time the fact I live just a block from that facility has been used at all sensibly.

2: To date, I don't think I've ever actually had a holiday that wasn't organised by family, or attached to some conference, event or work travel I'm attending. This trip is probably the closest I've ever managed (roughly equal to my burnout trip to Hawaii in 2014), and even then I've ruined it by turning two of the three weekdays into work. I'm much better at taking breaks that simply involve not leaving home or popping back to stay with family in Melbourne.

November 04, 2019

Audiobooks – August 2019

Periodic Tales: The Curious Lives of the Elements by Hugh Aldersey-Williams

Various depths of coverage (usually by interest of the story) of the discovery, usage and literature/cultural impact around each of the elements. 8/10

Born to Run by Bruce Springsteen

Autobiography read by the author. Covers his whole career and personal life. Well written and lots of details and insight. Well read too. 9/10

The Admirals: Nimitz, Halsey, Leahy, and King – The Five-Star Admirals Who Won the War at Sea by Walter R. Borneman

A Biography of the 5 Admirals and the interactions of their careers before and during World War 2. 7/10

Because Internet: Understanding the New Rules of Language by Gretchen McCulloch

I really can’t remember this book (serves me right for delaying reviews). I think it was okay though. [67]/10

The 4% Universe: Dark Matter, Dark Energy, and the Race to Discover the Rest of Reality by Richard Panek

Pretty much what the subtitle says. Worked fairly well at keeping the different people distinct, and the technical explanations made sense. 7/10

The Unopened casebook of Sherlock Holmes written by John Taylor with Simon Callow as Sherlock Holmes and Nicky Henson as Dr Watson

6 audioplay stories. Quality is okay although I detected a theme with the villains. 7/10

Best. Movie. Year. Ever: How 1999 Blew Up the Big Screen by Brian Raftery

A run though of the great (and a few not) movies that came out in 1999. Some backstories on many with industry and world news from the year. 8/10



November 03, 2019

KMail Crashing and LIBGL

One problem I’ve had recently on two systems with NVidia video cards is KMail crashing (SEGV) while reading mail. Sometimes it goes for months without having problems, and then it gets into a state where reading a few messages (or sometimes reading one particular message) causes a crash. The crash happens somewhere in the Mesa library stack.

In an attempt to investigate this I tried running KMail via ssh (as that precludes a lot of the GL stuff), but that crashed in a different way (I filed an upstream bug report [1]).

I have discovered a workaround for this issue, I set the environment variable LIBGL_ALWAYS_SOFTWARE=1 and then things work. At this stage I can’t be sure exactly where the problems are. As it’s certain KMail operations that trigger it I think that’s evidence of problems originating in KMail, but the end result when it happens often includes a kernel error log so there’s probably a problem in the Nouveau driver. I spent quite a lot of time investigating this, including recompiling most of the library stack with debugging mode and didn’t get much of a positive result. Hopefully putting it out there will help the next person who has such issues.
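
For reference, the workaround can be applied per invocation, or exported somewhere that runs before your session starts (the ~/.profile suggestion is an assumption about your setup, not universal):

# one-off, from a terminal
LIBGL_ALWAYS_SOFTWARE=1 kmail
# or export it for the whole session, e.g. from ~/.profile if your display manager sources it
export LIBGL_ALWAYS_SOFTWARE=1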

Here is a list of environment variables that can be set to debug LIBGL issues (strangely, I couldn’t find documentation on this when Googling it). If you are stuck with a problem related to LIBGL you can try setting each of these to “1” in turn and see if it makes a difference, either to debug a problem or to create a workaround that allows you to run the programs you need to run (there’s a quick loop sketched after the list). I don’t know why GL is required to read email.

LIBGL_DIAGNOSTIC
LIBGL_ALWAYS_INDIRECT
LIBGL_ALWAYS_SOFTWARE
LIBGL_DRI3_DISABLE
LIBGL_NO_DRAWARRAYS
LIBGL_DEBUG
LIBGL_DRIVERS_PATH
LIBGL_DRIVERS_DIR
LIBGL_SHOW_FPS
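
A quick way to work through the boolean-style ones is a loop like this rough sketch (glxinfo from mesa-utils is just a stand-in for whichever program is misbehaving; LIBGL_DRIVERS_PATH and LIBGL_DRIVERS_DIR take paths rather than “1”, so they’re skipped here):

for v in LIBGL_DIAGNOSTIC LIBGL_ALWAYS_INDIRECT LIBGL_ALWAYS_SOFTWARE \
         LIBGL_DRI3_DISABLE LIBGL_NO_DRAWARRAYS LIBGL_DEBUG LIBGL_SHOW_FPS; do
    echo "=== $v=1 ==="
    env "$v=1" glxinfo | head -n 3   # replace glxinfo with the program you're debugging
done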

November 01, 2019

LUV November 2019 Main Meeting: nfq - an ad blocker that runs on the router

Nov 6 2019 19:00
Nov 6 2019 21:00
Location: 
Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

NOTE: This month's meeting will be on WEDNESDAY night due to the Melbourne Cup public holiday.  The library closes at 7pm so arrivals after that time will need to contact Andrew on (0421) 775 358 or any other attendee for admission.

Speaker:  Duncan Roe, nfq - an ad blocker that runs on the router

Many of us like to go for dinner nearby after the meeting, typically at Brunetti's or Trotters Bistro in Lygon St.  Please let us know if you'd like to join us!

Linux Users of Victoria is a subcommittee of Linux Australia.

November 6, 2019 - 19:00


LUV November 2019 Workshop: Replacing Windows 7 with Linux

Nov 16 2019 12:30
Nov 16 2019 16:30
Location: 
Infoxchange, 33 Elizabeth St. Richmond

Replacing Windows 7 with Linux

What to do with your Windows 7 PC when its EOL arrives in January next year?  Install Linux of course!  Wen Lin will lead this talk with an intro, then get everyone to join in for a Q&A - let's share all the great ideas (and personal experience) on how to install a variety of Linux Distros to replace one's obsolete Win7 - and breathe new life into one's PC.

The meeting will be held at Infoxchange, 33 Elizabeth St. Richmond 3121.  Late arrivals please call (0421) 775 358 for access to the venue.

LUV would like to acknowledge Infoxchange for the venue.

Linux Users of Victoria is a subcommittee of Linux Australia.

November 16, 2019 - 12:30


October 29, 2019

Buying an Apple Watch for 7USD

For DrupalCon Amsterdam, Srijan ran a competition with the prize being an Apple Watch 5. It was a fun idea. Try to get a screenshot of an animated GIF slot machine showing 3 matching logos and tweet it.

Try your luck at @DrupalConEur Catch 3 in a row and win an #AppleWatchSeries5. To participate, get 3 of the same logos in a series, grab a screenshot and share it with us in the comment section below. See you in Amsterdam! #SrijanJackpot #ContestAlert #DrupalCon

I entered the competition.

I managed to score 3 of the no logo logos. That's gotta be worth something, right? #srijanJackpot

The competition had a flaw. The winner was selected based on likes.

After a week I realised that I wasn’t going to win. Others were able to garner more likes than I could. Then my hacker mindset kicked in.

I thought I’d find out how much 100 likes would cost. A quick search revealed likes cost pennies apiece. At this point I decided that instead of buying an easy win, I’d buy a ridiculous number of likes. 500 likes only cost 7USD. Having a blog post about gaming the system was a good enough prize for me.

Receipt: 500 likes for 7USD

I was unsure how things would go. I was supposed to get my 500 likes across 10 days. For the first 12 hours I got nothing. I thought I’d lost my money on a scam. Then the trickle of likes started. Every hour I’d get 2-3 likes, mostly from Eastern Europe. Every so often I’d get a retweet or a bonus like on a follow up comment. All up I got over 600 fake likes. Great value for money.

Today Srijan awarded me the watch. I waited until after they’d finished taking photos before coming clean. Pics or it didn’t happen and all that. They insisted that I still won the competition without the bought likes.

The prize being handed over

Think very carefully before launching a competition that involves social media engagement. There’s a whole fake engagement economy.

October 27, 2019

FreeDV between Argentina and the UK

Jose (LU5DKI) has been in daily contact with a group of UK Hams including Eric (GW8LJJ), Cess (GW3OAJ) and Steve (G7HZI). They are using FreeDV 700D over a novel combination of HF radio channels and the Internet via SDRs.

Jose transmits from his station in Argentina to a KiwiSDR in Santiago, Chile, around 1500km away. The UK hams listen to this SDR over the Internet. To receive, Jose listens to a KiwiSDR in the UK. The combination of the Internet and HF radio gives them reliable communications at a time where long distance band conditions are poor.

Thanks Jose for the video. You can see the “barber pole” HF fading on the signal from the UK.

Several of the UK Hams are using SM1000s running the new v2 firmware that includes FreeDV 700D. Good to see that working well in the field.

FreeDV 1.4 includes 700C/700D improvements, and the new FreeDV 2020 mode. I hope to release FreeDV 1.4 later this year. However it’s already working quite well (just a few small issues to go), so if you would like to try a Windows development version of FreeDV 1.4, please contact me. For Linux users, it’s quite easy to compile from source.
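
As a rough sketch of the Linux build (my assumptions, not official instructions: the GUI lives in the freedv-gui repository and ships a helper build script; check its README if the script isn't there or the dependencies differ):

git clone https://github.com/drowe67/freedv-gui.git
cd freedv-gui
# the helper script (assumption) fetches and builds the codec2 dependency, then the GUI itself
./build_linux.sh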

October 22, 2019

DevOpsDays NZ 2019 – Day 2 – Session 3

Everett Toews – Is GitOps worthy of the [BuzzWord]Ops moniker?

  • Usual Git workflow
  • But it takes some action
  • Applying desired state from Git
  • Example: Infrastructure as code
    • DNS
    • Onboarding and offboarding
  • Git is now a SPOF
  • Change Management Dept is now a barrier
  • Integrate with ITSM
  • Benefits: Self-service, Compliance

Joel Wirāmu Pauling – Why Bare Metal still matters

  • Cloud Native Dev doesn’t exist as a closed system
  • IoT is all hardware
  • AI/ML is using special hardware
  • Networking is all hardware offloads
  • FPGAs and ASICs need a more standard, open way to access them
  • You’ll always have weird stuff on your network
  • Virtualization has abstracted away the real
  • We should care about vendor lock-in with cloud APIs, and Aus electricity isn’t all that green

Steven Ensslen – Do you have a data quality problem?

  • What is data ops and why do we want it?
  • People think they have a data quality problem but they don’t actually measure it to see how bad it is.
  • Causes all sorts of problems.
  • 3 Easy steps to fix data quality
  • 1 – Document data characteristics and train people to know them
  • 2 – Monitor data as if it is infrastructure
    • Test data like it is code
  • 3 – Professionalize your support of data professionals
    • Bring in the spreadsheet experts
    • Support reporting and analytics people too

Mandi Buswell – What are Kubernetes Operators and Why do I care

  • Like an App Store on your kubernetes cluster
  • Like a Kubernetes robot doing the hard work for you. Lifecycle management
  • Operators run as microservices on the kubernetes cluster
  • operatorhub.io
  • Work on any kubernetes cluster
  • You can even write your own

Laura Bell – Securing the systems of the future

  • Fear and Loathing
    • It is an old problem because “People are Jerks”
  • All organizations try either Fight, Flight, or Freeze
  • Trying to protect: Confidentiality, Integrity, Availability
  • Protect, Detect, Respond
  • Monolith
    • A big wall around
    • Layered defense is better but not the final solution
    • Defensive software architecture is not just prevention
    • Castles had lots of layers of defenses. Some prevention, Some Detection, Some response
  • Microservices
    • Look at something in the middle of a star and erase it
    • Push malicious code into deployment pipelines
  • Avoid scar tissue: stuff put in just to avoid specific previous problems. It makes you feel safe but without any real evidence.
  • Fearless security patterns and approaches
  • Technology is changing but the basics are still the same
  • Lots of techniques in computer security.
  • Prevention and Detection are interchangeable
  • Batman vs Meerkat model
  • Be Aware and challenge your own bubble
  • Supply Chains are vulnerable: Integrations, dependencies, Data Sources
  • Determinate threat vs Dynamic Threat
    • Can’t predict which steps in which order are going to get the result
    • Compromise the data and the engine will return bad results
  • Plug for opensecurity.nz


DevOpsDays NZ 2019 – Day 2 – Session 2

Jacob Ivester – Diagnose DevOps: The work behind the work

  • Unhappy DevOps Family
    • Unsupported Software
    • Releases outside of primetime
    • etc
  • Focus on Process as a common problem
    • Manage Change that Affects Multiple teams
    • Throughputs vs Outputs
  • Repeatability
  • Extensibility
  • Visibility
  • Safety

Cameron Huysmans – Designing an Enterprise Secrets Management Service using HashiCorp Vault

  • Australian based Bank
  • Transition over the last 30 years for the bank to a layered security model (all the way down to the server in the datacentre)
  • In 2017 moved to the cloud and infrastructure in the cloud
  • What makes a bank – licensed to operate
    • Must demonstrate control of the process
    • Reports problems to regulator
    • Identifiable business processes
    • All Humans
  • If you use a pipeline there are no humans in the process. These machine processes need to conform to the same controls
    • Architecture naturally resistant to change. Change requires a complex process
    • ITIL
    • 2FA required for everything
    • Secrets everywhere
  • Disruption
    • Dynamic Systems with constant updates
    • Immutable containers
    • Changes done via code
    • Live system changes
    • Code and automation drives things
    • Dynamic CMDB – High Levels of abstraction
    • But you still have a secrets problem
  • Secrets Management
    • Not just a place to store passwords
    • But also a Chain of Trust
  • If Pipelines make the change who owns it, who audits it?
  • Vault provides a bit of an audit trail by saying who used something (person or process)
  • Why another tool?
  • Created a pattern for how things will be deployed. Got Security to okay it. Built it in a pipeline
  • Vault placed in the highest security area
    • But less-secure areas needed to talk to it.
    • Lots of zones internally. Some in Cloud, DMZ
    • Some talk via API gateway to main vault
    • Had a Vault replica that had a copy of some secrets and could be used by those zones that were not allowed to talk to the secrets zone
  • Learnings
    • This is hard, especially in the cloud
    • If Pipelines are doing the change, that must be kept secure. Attribution, notification and real-time analytics
    • Declarative manifests of change (code, scripts, tools) require more strict access controls
    • Avoid direct point-to-point connections


October 21, 2019

DevOpsDays NZ 2019 – Day 2 – Session 1

Cath Jones – The Myth of the Senior Engineer

  • They won’t be able to hit the ground running on Day 1
    • The myth is assuming they know everything about how things work at your organisation that is organisation- or industry-specific
    • If you don’t account for this you will see problems, stress, high turnover
  • Example: Trial by Fire
    • You get shown the basic stuff and then given your first ticket
  • How do you take organisation knowledge and empower people?
  • Employee Socialisation
    • Helps mitigate problems and assumptions
    • Facilitates communication and networking
    • Allows people to begin contributing sooner
  • Pre-Arrival Stage
    • Let people know what is expected
    • Let existing people know who is starting and your expectations for them
    • Example: Automattic (WordPress)
      • Asked people in the final stages to complete some (paid) work.
      • Candidates get a better understanding of the company
  • Preparing for Transition
    • Culture-shock
    • What are you like compared to where they came from?
    • The new role compared to their previous one?
    • Come from a place where they were an expert and had lots of domain-specific knowledge to being a newbie
  • The Encounter Stage
    • Mentoring, Communication, Technical onboarding
    • Example: Cohorts of new hires
    • Mentoring: Proven way to socialise senior engineers. Can be labour-intensive but helps when documentation is lacking
    • Share mentorship responsibilities: e.g. separate technical and organisational mentors
    • Communication: Expectations that the company places; how privileged and how transparent?
    • Authenticity: Can people be themselves. Reduces stress
  • Technical onboarding: Needs to take time and do it properly. Allow new people to contribute back to it and make it better.
    • Pick out easy wins or low-hanging fruit so people can contribute sooner
    • Have Style Guides and good docs
  • Metamorphosis
    • Senior Engineers are fully Contributing

Katie McLaughlin – Being kind to 3am you


DevOpsDays NZ 2019 – Day 1 – Session 3

Gleidson Nascimento – Packaging OpenShift Origin Kubernetes Distribution (OKD)

  • Centos SIG
  • Based on latest upstream

Joshua King – Don’t Reinvent the Wheel, Just Realign It

  • Project: Let notifications work for powershell users
  • Then he found the UWP community toolkit
  • Which had notifications built-in
  • These days looks around first, asks for APIs rather than scraping
  • Look around for open-source tools and give back
  • Sometimes your implementation might be fun or even better than the original

Srdan Dukic – Implicit trust agreement in Learning Organizations

  • Sysadmin shell -> ansible -> APIs -> automate everything
  • Programmers coded themselves out of a job
  • Follow instructions or achieve results?
  • A bit of both – tension between the two
  • Money today or Money tomorrow?
  • Employee – Expected to make things better
  • Employer – Support things getting better, not fire people when they automate themselves out of a job

Julie Gunderson – You Can’t Buy DevOps

  • Lots of companies talking about DevOps are trying to sell you a solution
  • What doesn’t make you a DevOps company
    • Be in the Cloud
    • Have a DevOps team
    • Get rid of the Ops Team
    • A checklist you can tick off
    • Easy
  • Westrum 3 Cultures Model
  • We want the generative model
  • Keeping information flowing between teams is prerequisite for high performance teams
  • Psychological Safety to make decisions. Lets employees focus on problems and getting work done rather than politics
  • Practices
    • Configuration management
    • CICD Pipelines
    • Work in small batches
    • Test every commit and everything else (look at Chaos engineering)
  • Tools
    • Let the teams who are using the tools decide on what tools they will use
    • XebiaLabs Periodic table of DevOps tools
  • Getting there
    • Start with one team and a POC


DevOpsDays NZ 2019 – Day 1 – Session 2

Allen Geer, Michael Harrod – Kiwi Ingenuity – Kiwi’s can Overcome Tough Problems In DevOps

  • Contrast – US vs NZ
    • In the US companies are bigger, lots more people, lots more money to throw at problems.
    • Contrast with aerial topdressing pioneered in NZ using surplus WW2 aircraft
    • Since the problems are up to 100x bigger in the US the tools are designed for that scale. ROI might not be there for smaller companies.
    • Dealing with Scale
      • Avoid “shiny new thing” syndrome, plan for keeping things for at least 5 years.
      • Ramp up slowly with the tool, push it into other areas.
      • Avoid Single Person Silo.
      • Bring up some Kiwi ingenuity (look at open source, use the free tiers or cheap tiers).
      • Out-Innovate the US companies rather than trying to out-scale
    • Infrastructure: Monetization of Toil
      • Spending time and money on stuff you can automate
      • Lots of manual creating of infrastructure, servers, firewalls.
      • Lack of incentive for providers who charge for changes to automate stuff
      • Other Providers will automate (especially overseas ones that will come into NZ)
      • People take risks (eg no DR) in order to save money.
      • Innovator’s Dilemma
    • Solutions
      • More vocal customers
      • Providers should provide a platform with lots more self-service; hand-holding for the hard stuff, not the day-to-day
      • Charge for outcomes not person-hours
      • Begin Small
      • It’s an experiment – Freedom to Fail
    • Inattentive Customer Service
      • Overseas companies have a lot more forums, helpdesks, quick responses.
      • “Kiwis reluctant to make a fuss”, companies not used to people making a fuss
      • Apply “American ingenuity” – strive to increase customer satisfaction.
      • Build a healthy community (eg online forums) around your service.
      • Gather insights from customers
      • Bezos – “When a customer contacts us, we see this as a defect” . Focus on the source of problems
    • Evolving Kiwi Workforce
      • NZ has an older and aging workforce. 2nd oldest in the OECD
      • Slightly fewer people with degrees
      • 11% of workforce 65+ by 2038
    • Learning in the workplace
      • Leverage senior Knowledge
      • Telco – Older customers didn’t want to approach young workers in mall. Brought in retired engineers to work in stores.
      • Mentoring and reverse-mentoring. The mentor learns insights from the mentee too (eg about younger people’s habits)
    • Introducing people to DevOps
      • Kiwi DevOps models

Craig Box – Teaching Old Servers New Tricks: extending the service mesh outside the cluster

  • Service Mesh
    • Managing a service is hard
    • metrics, monitoring, logging, tracing
    • AAA encryption, certs
    • load balancing, routing, network policy
    • quota
    • Failure handling, fault injection
  • Microservices
    • Not just for hipsters
    • Works best at scale. Lots of devs
  • Now introduce a network in between everything. Lots of hard stuff; distributed systems are hard
  • Leaky abstractions
    • Have to build stuff into microservice to deal with problems of the network
    • In multiple libraries and languages
    • Can we fix it?
  • Sidecar Pattern
    • The sidecar does all the hard stuff instead of making the microservice itself do it.
    • Talks TCP. Able to work with all languages
  • Proxies as sidecars
    • SPOF
    • Sidecar is attached to each MS
  • Flexibility and Power
    • Single place where we can do everything
    • Traffic going in: TLS termination, metrics, quota
    • Traffic out of workloads: Authentication, TLS connections
  • Istio
    • Open platform
    • Not always microservices
    • Uniform observability
    • Operational Agility
    • Policy driven Security
  • How istio works
    • Proxies + control plane
    • Pilot in control plane pushes config to proxies, keeps track of them, looks up stuff in k8s cluster
    • Mixer – policy check and telemetry
    • Citadel – cert authority for proxies
    • Control plane has to run on k8s
    • Proxies run using envoy
    • Zipkin built in
    • All done automatically for kubernetes environments ( admission controller adds sidecar )
  • Adding a VM to a service mesh
    • Enable the mesh expansion, connect the networks
    • Add the gateway IP to the VM
    • Get a cert and copy to the VM
    • Install proxy and node agent
    • Traffic from cluster -> VM .
      • Add the service to DNS in the cluster,
      • Create a ServiceEntry on the cluster
    • Traffic from VM -> Cluster service
      • Add Service and IP to /etc/hosts on the VM
  • Sample Application – Hipster Shop
    • productcatalogservice is outside of kubernetes
    • headless service in kubernetes
    • manually created service entry in k8s
    • Experimental istio commands to simplify process to single command


October 20, 2019

DevOpsDays NZ 2019 – Day 1 – Session 1

Brooke Treadgold – Back to Basics

  • Transformation Lead ANZ Bank
  • Not originally from a Tech background
  • Tech has a lot of buzzwords and acronyms that make it an exclusive club. Improvements rely on people from other parts of the business who aren’t in that club
  • These people have to care about it and understand it.
  • Had to use terms that everybody in the business understood and related to.
  • Case for change – What top orgs do:
    • 208 times more frequent deployments
    • 2604 times faster to recover from incidents
    • 7 times lower change failure rate
  • What you need
    • High Priority -> Access to people to do the work
    • Needed tangible goal (weekly releases) to get people to focus (and pay)
  • Making change a reality
    • Risk Management
      • You can’t just stop doing the reports
      • You need to gain their trust in order to get influence
      • Have to take them along the way with the changes
    • Empathy
    • Influence
  • History at ANZ
    • First pipeline replace just one document
      • Explained to the change management team how the pipeline could replace the traditional plan
    • Rethink of Change Plan and Outcome Reports
      • Other teams needed these for confidence in the change
      • Found out what people actually cared about, found better ways to provide that information (confidence) in an automated way
    • Security Assessment
      • Traditionally required a big document filled in and signed off
      • Found that this was only required for “Significant” changes
      • Got a definition of what significant means so didn’t need to do this.
    • High Risk Change Records
      • Lots of paperwork for High Risk changes
      • Decided that these are not high risk changes so lots less work
      • Templated them so a lot easier to do

Charles Korn – Dockerised local build and testing environments made easy

  • Go Script – a single script in a consistent place in all your repos that does the basic functions: install, help, run, deploy (a rough sketch of one is included after these notes)
  • batect – tool he wrote
    • dockerized dev environment plus a Go Script
  • Dev environment
    • Build env: code to an artifact
    • Testing Environments. Fake stuff, lots of different levels
  • Build Environment
    • Container with the build tools. Mount our code directory into this
    • Isolation brings consistency and repeatability. No more “works on my machine”
    • Clean container every single time we run a build
    • CI agents just need docker since teams will provide the container
    • Ease of Onboarding. Just get git and docker installed
    • Ease of change. Environment and tasks defined in yaml and versioned like everything else. New version downloaded. Kept in sync with actual code
  • Test Environments
    • You can run local tests
    • Consistently runs test on CI
    • Have to launch multiple containers for more complex tests, using built in docker definitions and health checks and networking
  • Path to Production
    • If deploying docker then can use same image
    • But works with stuff that isn’t deployed as docker too
  • What about docker compose?
    • Better performance
    • Model – tasks are a first class citizen – Doesn’t feel like you are fighting the tool.
    • Better UI and developer experience. Updates managed automatically
    • Cleans up better after each run
    • It just works. Works with proxies better. Works with file permissions better.
  • How to get started?
    • start small, work incrementally
    • Start with the build environment
    • With the Test env, work through one piece at a time.
    • Reuse components
    • Take advantage of other people’s images. Lots of mocks for cloud services.
    • Docker has library of health check scripts
    • Bunch of sample scripts for batect
  • github.com/charleskorn/batect
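
For anyone who hasn’t met the “go script” idea, a rough sketch of one (the task names and the gradle image are purely illustrative, not from the talk) might look like:

#!/usr/bin/env bash
# ./go - single entry point for the repo; each task runs in a throwaway container
set -euo pipefail

case "${1:-help}" in
  build) docker run --rm -v "$PWD:/code" -w /code gradle:jdk11 gradle build ;;
  test)  docker run --rm -v "$PWD:/code" -w /code gradle:jdk11 gradle test ;;
  *)     echo "usage: ./go [build|test]" ;;
esac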


October 14, 2019

Network Maintenance

To my intense amazement, it seems that NBN Co have finally done sufficient capacity expansion on our local fixed wireless tower to actually resolve the evening congestion issues we’ve been having for the past couple of years. Where previously we’d been getting 22-23Mbps during the day and more like 2-3Mbps (or worse) during the evenings, we’re now back to 22-23Mbps all the time, and the status lights on the NTD remain a pleasing green, rather than alternating between green and amber. This is how things were way back at the start, six years ago.

We received an email from iiNet in early July advising us of the pending improvements. It said:

Your NBN™ Wireless service offers maximum internet speeds of 25Mbps download and 5Mbps upload.

NBN Co have identified that your service is connected to a Wireless cell that is currently experiencing congestion, with estimated typical evening speeds of 3~6 Mbps. This congestion means that activities like browsing, streaming or gaming might have been and could continue to be slower than promised, especially when multiple people or devices are using the internet at the same time.

NBN Co estimates that capacity upgrades to improve the speed congestion will be completed by Dec-19.

At the time we were given the option of moving to a lower speed plan with a $10 refund because we weren’t getting the advertised speed, or to wait it out on our current plan. We chose the latter, because if we’d downgraded, that would have reduced our speed during the day, when everything was otherwise fine.

We did not receive any notification from iiNet of exactly when works would commence, nor was I ever able to find any indication of planned maintenance on iiNet’s status page. Instead, I’ve come to rely on notifications from my neighbour, who’s with activ8me. He receives helpful emails like this:

This is a courtesy email from Activ8me, Letting you know NBN will be performing Fixed Wireless Network capacity work in your area that might affect your connectivity to the internet. This activity is critical to the maintenance and optimisation of the network. The approximate dates of this maintenance/upgrade work will be:

Impacted location: Neika, TAS & Downstream Sites & Upstream Sites
NBN estimates interruption 1 (Listed Below) will occur between:
Start: 24/09/19 7:00AM End: 24/09/19 8:00PM
NBN estimates interruption 2 (Listed Below) will occur between:
Start: 25/09/19 7:00AM End: 25/09/19 8:00PM
NBN estimates interruption 3 (Listed Below) will occur between:
Start: 01/10/19 7:00AM End: 01/10/19 8:00PM
NBN estimates interruption 4 (Listed Below) will occur between:
Start: 02/10/19 7:00AM End: 02/10/19 8:00PM
NBN estimates interruption 5 (Listed Below) will occur between:
Start: 03/10/19 7:00AM End: 03/10/19 8:00PM
NBN estimates interruption 6 (Listed Below) will occur between:
Start: 04/10/19 7:00AM End: 04/10/19 8:00PM
NBN estimates interruption 7 (Listed Below) will occur between:
Start: 05/10/19 7:00AM End: 05/10/19 8:00PM
NBN estimates interruption 8 (Listed Below) will occur between:
Start: 06/10/19 7:00AM End: 06/10/19 8:00PM

Change start
24/09/2019 07:00 Australian Eastern Standard Time

Change end
06/10/2019 20:00 Australian Eastern Daylight Time

This is expected to improve your service with us however, occasional loss of internet connectivity may be experienced during the maintenance/upgrade work.
Please note that the upgrades are performed by NBN Co and Activ8me has no control over them.
Thank you for your understanding in this matter, and your patience for if it does affect your service. We appreciate it.

The astute observer will note that this is pretty close to two weeks of scheduled maintenance. Sure enough, my neighbour and I (and presumably everyone else in the area) enjoyed major outages almost every weekday during that period, which is not ideal when you work from home. But, like I said at the start, they did finally get the job done.

Interestingly, according to activ8me, there is yet more NBN maintenance scheduled from 21 October 07:00 ’til 27 October 21:00, then again from 28 October 07:00 ’til 3 November 21:00 (i.e. another two whole weeks). The only scheduled upgrade I could find listed on iiNet’s status page is CM-177373, starting “in 13 days” with a duration of 6 hours, so possibly not the same thing.

Based on the above, I am convinced that there is some problem with iiNet’s status page not correctly reporting NBN incidents, but of course I have no idea whether this is NBN Co not telling iiNet, iiNet not listening to NBN Co, or if it’s just that the status web page is busted.

LUV October 2019 Workshop: Ubuntu 19.10 Eoan Ermine

Oct 19 2019 12:30
Oct 19 2019 16:30
Location: 
Infoxchange, 33 Elizabeth St. Richmond

Ubuntu 19.10 Eoan Ermine

The latest version of Ubuntu Linux has been released!  Come along to learn what's new and try it out, or get help upgrading.  This version adds Raspberry Pi 4 support and experimental support for ZFS as a root filesystem.

The meeting will be held at Infoxchange, 33 Elizabeth St. Richmond 3121.  Late arrivals please call (0421) 775 358 for access to the venue.

LUV would like to acknowledge Infoxchange for the venue.

Linux Users of Victoria is a subcommittee of Linux Australia.

October 19, 2019 - 12:30


October 09, 2019

AWS Welcomes Stewart

A little over a month ago now, I started a new role at Amazon Web Services (AWS) as a Principal Engineer with Amazon Linux. Everyone has been wonderfully welcoming and helpful. I’m excited about the future here, the team, and our mission.

Thanks to all my IBM colleagues over the past five and a half and a bit years too, I really enjoyed working with you on OpenPOWER and hope it continues to gain traction. I have my Blackbird now and am eagerly waiting for a spare 20 minutes to assemble it.

October 04, 2019

Linux Security Summit North America 2019: Videos and Slides

LSS-NA for 2019 was held in August in San Diego.  Slides are available at the Schedule, and videos of the talks may now be found in this playlist.

LWN covered the following presentations:

The new 3-day format (as previously discussed) worked well, and we’re expecting to continue this next year for LSS-NA.

Details on the 2020 event will be announced soon!

Announcements may be found on the event twitter account @LinuxSecSummit, on the linux-security-module mailing list, and via this very blog.

Announcing the DrupalSouth Diversity Scholarship

Over the years I have benefited greatly from the generosity of the Drupal Community. In 2011 people sponsored me to write lines of code to get me to DrupalCon Chicago.

Today Dave Hall Consulting is a very successful small business. We have contributed code, time and content to Drupal. It is time for us to give back in more concrete terms.

We want to help someone from an under represented group take their career to the next level. This year we will provide a Diversity Scholarship for one person to attend DrupalSouth, our 2 day Gettin’ Git training course and 5 nights at the conference hotel. This will allow this person to attend the premier Drupal event in the region while also learning everything there is to know about git.

To apply for the scholarship, fill out the form by 23:59 AEST 19 October 2019 to be considered. (Extended from 12 October)

The winner has been announced.

October 03, 2019

Installing LineageOS 16 on a Samsung SM-T710 (gts28wifi)

  0. Check the prerequisites
  1. Backup any files you want to keep
  2. Download LineageOS ROM and optional GAPPS package
  3. Copy LineageOS image & additional packages to the SM-T710
  4. Boot into recovery mode
  5. Wipe the existing installation.
  6. Format the device
  7. Install LineageOS ROM and other optional ROMs.

0 - Check the Prerequisites

  • The device already has the latest TWRP installed.
  • Android debugging is enabled on the device
  • ADB is installed on your workstation.
  • You have a suitably configured SD card as a back up handy.

I use this android.nix to ensure my NixOS environment has the prerequisites installed and configured for its side of the process.

1 - Backup any Files You Want to Keep

I like to use adb to pull the files from the device. There are other methods available too.

$ adb pull /sdcard/MyFolder ./Downloads/MyDevice/

Usage of adb is documented at Android Debug Bridge
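
Before pulling anything, it is worth confirming the tablet is actually visible to adb; a quick check (just a verification step, not part of the original guide) is:

$ adb devices

The SM-T710 should appear in the device list before you proceed.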

2 - Download LineageOS ROM and optional GAPPS package

I downloaded lineage-16.0-20191001-UNOFFICIAL-gts28wifi.zip from gts28wifi.

I also downloaded Open GApps ARM, nano to enable Google Apps.

I could have also downloaded and installed LineageOS addonsu and addonsu-remove but opted not to at this point.

3 - Copy LineageOS image & additional packages to the SM-T710

I use adb to copy the files across:

$ adb push ./lineage-16.0-20191001-UNOFFICIAL-gts28wifi.zip /sdcard/
./lineage-16.0-20191001-UNOFFICIAL-gts28wifi.zip: 1 file pushed. 12.1 MB/s (408677035 bytes in 32.263s)
$ adb push ./open_gapps-arm-9.0-nano-20190405.zip /sdcard/
./open_gapps-arm-9.0-nano-20190405.zip: 1 file pushed. 11.1 MB/s (185790181 bytes in 15.948s)

I also copy both to the SD card at this point as the SM-T710 is an awful device to work with and in many random cases will not work with ADB. When this happens, I fall back to the SD card.

4 - Boot into recovery mode

I power the device off, then power it back into recovery mode by holding down [home]+[volume up]+[power].
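
If holding down the key combination proves fiddly, adb can usually reboot the tablet straight into recovery while it is still booted normally (assuming USB debugging is enabled, as per the prerequisites):

$ adb reboot recovery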

5 - Wipe the existing installation

Press Wipe then Advanced Wipe.

Select:

  • Dalvik / Art Cache
  • System
  • Data
  • Cache

Swipe the Swipe to Wipe slider at the bottom of the screen.

Press Back to return to the Advanced Wipe screen.

Press the triangular "back" button once to return to the Wipe screen.

6 - Format the device

Press Format Data.

Type yes and press the blue check mark at the bottom-right corner to commence the format process.

Press Back to return to the Advanced Wipe screen.

Press the triangular "back" button twice to return to the main screen.

7 - Install LineageOS ROM and other optional ROMs

Press Install, select the images you wish to install and swipe to make it go.

Reboot when it's completed and you should be off and running with a brand new LineageOS 16 on this tablet.

October 02, 2019

Percona Live Europe Amsterdam Day 1 notes

Percona Live Europe Amsterdam Day 1 was a bunch of fun, especially since I didn’t have to give a talk or anything since my tutorial was over on Day 0.

At lunch, I realised that there are a lot more fringe events happening around Percona Live… and if you’ve seen how people do “tech weeks”, maybe this is what the event ends up being – a show, plus plenty of focused satellite events. FOSDEM in the open source world totally gets this, and best of all, also lists fringe events (see example from 2019).

So, Thursday evening gets a few fringe events, a relatively short train ride away:

Anyway, what was Day 1 like? Keynotes started the morning, and I did make a Twitter thread. It is clear that there is a lot of talk amongst companies that make open source software, and companies in the ecosystem that somehow also derive great value from it. Some look at this as the great cloud vendors vs open source software vendors debate, but this isn’t necessarily always the case – we’ve seen this as Percona’s model too. And we’ve seen cloud companies contribute back (again, just like Percona). Guess this is a topic for a different post, because there are always two sides to this situation…

It is also clear that people want permissive open source licenses over anything source available. If you’re a CxO looking for software, it would be almost irresponsible to be using critical software that is just source available with a proprietary license. After all, what happens when the company decides to ask for more money? (Companies change ownership too, you know).

It is probably clear the best strategies are the “multi” (or hybrid) strategies. Multiple databases, multiple clouds, and going all in on open source to avoid vendor lock-in. Of course, don’t forget that open source software also can have “vendor lock-in” – always look at the health metrics of a project, vs. a product. We’re lucky in the MySQL ecosystem that we have not just the excellent work of Oracle, but also MariaDB Corporation / MariaDB Foundation and also Percona.

MySQL 8.0 adoption is taking off, with about 26% of the users on it. Those on MySQL 5.6 still seem to be on it, and there has been a decrease in 5.7 use to grow that 8.0 pie. It isn’t clear how these stats are generated (since there is no “phone home” functionality in MySQL; also the MariaDB Server variant doesn’t get as much usage as one would like), but maybe it is via download numbers?

Anyone paying any attention to MySQL 8 will know that they have switched to a “continuous delivery model”, also known as, you get new features in every point release. So the latest 8.0.18 gets EXPLAIN ANALYZE, and while we can’t try it yet (not released, and the documentation isn’t updated), I expect it will be fairly soon. I am eager to try this, because MariaDB Server has had ANALYZE since 10.1 (GA – Oct 2015). And it wasn’t long ago that MySQL received CHECK constraints support (8.0.16). Also the CLONE plugin in 8.0.17 warrants some checking/usage!

Besides all the hallway chats and meetings I did manage to get into a few sessions… Rakuten Intelligence talked about their usage of ProxySQL, and one thing was interesting with regard to their future plans slide – they do consider group replication but they wonder what would replace their custom HA software? But most importantly they wonder if it is stable and which companies have successfully deployed it, because they don’t want to be the first. Question from the floor about Galera Cluster came up, and they said they had one app that required XA support – looks like something to consider once Galera 4 is fully baked!

The PXC–8 talk was also chock full of information, delivered excellently, and something to try soon (it wasn’t quite available yesterday, but today I see a release announcement: Experimental Binary of Percona XtraDB Cluster 8.0).

I enjoyed the OpenCorporates use case at the end too: for them, being on-premise works out cheaper than the cloud, and they run ProxySQL, the Galera Cluster branch Percona XtraDB Cluster (PXC), and ZFS. ZFS is not the most common filesystem for MySQL deployments, so it was interesting to see what could be achieved.

Then there was the Booking.com party and boy, did they outdo themselves. We had a menu, multi-course meal with wine pairings, and a lot of good conversation. A night wouldn’t be complete without some Salmiakkikossu, and Monty sent some over for us to enjoy.

Food at the Hilton has been great too (something I would never really want to say, considering I’m not a fan of the hotel chain) – even the coffee breaks are well catered for. I think maybe this has been the best Percona Live in terms of catering, and I’ve been to a lot of them (maybe all…). I have to give much kudos to Bronwyn and Lorraine at Percona for the impeccable organisation. The WiFi works a charm as well. On towards Day 2!

September 30, 2019

ProxySQL Technology Day Ghent 2019

Just delivered a tutorial on MariaDB Server 10.4. Decided to take a closer look at the schedule for Percona Live Europe Amsterdam 2019 and one thing is clear: feels like there should also be a ProxySQL tutorial, largely because at mine, I noticed like 20% of the folk saying they use it.

Seems like there are 2 talks about it though, one about real world usage on Oct 1, and one about firewall usage with AWS, given by Marco Tusa on Oct 2.

Which led me to the ProxySQL Technology Day 2019 in Ghent, Belgium. October 3 2019. 2 hour train ride away from Amsterdam Schiphol (the airport stop). It is easy to grab a ticket at Schiphol Plaza, first class is about €20 more per way than second class, and a good spot to stay could be the Ibis Budget Dampoort (or the Marriott Ghent). Credit card payments accepted naturally, and I’m sure you can also do this online. Didn’t take me longer than five minutes to get all this settled.

So, the ProxySQL Technology Day is free, seems extremely focused and frankly is refreshing because you just learn about one thing! I feel like the MySQL world misses out on this tremendously as we lost the users conference… Interesting to see if this happens more in our ecosystem!

September 28, 2019

Using pipefail with shell module in Ansible

If you’re using the shell module with Ansible and piping the output to another command, it might be a good idea to set pipefail. This way, if the first command fails, the whole task will fail.

For example, let’s say we’re running this silly task to look for /tmp directory and then trim the string “tmp” from the result.

ansible all -i "localhost," -m shell -a \
'ls -ld /tmp | tr -d tmp'

This will return something like this, with a successful return code.

localhost | CHANGED | rc=0 >>
drwxrwxrw. 26 roo roo 640 Se 28 19:08 /

But, let’s say the directory doesn’t exist, what would the result be?

ansible all -i "localhost," -m shell -a \
'ls -ld /tmpnothere | tr -d tmp'

Still success, because the pipe to tr was successful, even though we can see the ls command failed.

localhost | CHANGED | rc=0 >>
ls: cannot access ‘/tmpnothere’: No such file or directory

This time, let’s set pipefail first.

ansible all -i "localhost," -m shell -a \
'set -o pipefail && ls -ld /tmpnothere | tr -d tmp'

This time it fails, as expected.

localhost | FAILED | rc=2 >>
ls: cannot access ‘/tmpnothere’: No such file or directorynon-zero return code

If /bin/sh on the remote node does not point to bash then you’ll need to pass in an argument specifying bash as the executable to use for the shell task.

  - name: Silly task
    shell: set -o pipefail && ls -ld /tmp | tr -d tmp
    args:
      executable: /usr/bin/bash

Ansible lint will pick these things up for you, so why not run it across your code 😉
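
If you haven't tried it, ansible-lint is a quick install and runs straight over a playbook; a minimal sketch (the playbook name here is just an example):

pip install ansible-lint
ansible-lint site.yml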

September 27, 2019

If I Understood You, Would I Have This Look on My Face?

This book discusses science and technical communication from the perspective of someone who comes from professional theatre and acting. Alan explains how his accidental discovery of the application of theatre sports to communication created an opportunity to teach technical communicators how to be more effective. Essentially, the argument is that empathy is essential to communication — you need to be able to understand where your audience is starting from and where they’re likely to get stuck before you can take them on the journey.

Unsurprisingly given the topic of the book, this is a well written and engaging read. The book is nicely structured and uses regular anecdotes (some of them humorous) to get its message across.

A detailed and fun read.

If I Understood You, Would I Have This Look on My Face?
Author: Alan Alda
Genre: Self-Help
Publisher: Random House
Published: June 6, 2017
Pages: 240

NEW YORK TIMES BESTSELLER • Award-winning actor Alan Alda tells the fascinating story of his quest to learn how to communicate better, and to teach others to do the same. With his trademark humor and candor, he explores how to develop empathy as the key factor. “Invaluable.”—Deborah Tannen, #1 New York Times bestselling author of You’re the Only One I Can Tell and You Just Don’t Understand Alan Alda has been on a decades-long journey to discover new ways to help people communicate and relate to one another more effectively. If I Understood You, Would I Have This Look on My Face? is the warm, witty, and informative chronicle of how Alda found inspiration in everything from cutting-edge science to classic acting methods. His search began when he was host of PBS’s Scientific American Frontiers, where he interviewed thousands of scientists and developed a knack for helping them communicate complex ideas in ways a wide audience could understand—and Alda wondered if those techniques held a clue to better communication for the rest of us. In his wry and wise voice, Alda reflects on moments of miscommunication in his own life, when an absence of understanding resulted in problems both big and small. He guides us through his discoveries, showing how communication can be improved through learning to relate to the other person: listening with our eyes, looking for clues in another’s face, using the power of a compelling story, avoiding jargon, and reading another person so well that you become “in sync” with them, and know what they are thinking and feeling—especially when you’re talking about the hard stuff. Drawing on improvisation training, theater, and storytelling techniques from a life of acting, and with insights from recent scientific studies, Alda describes ways we can build empathy, nurture our innate mind-reading abilities, and improve the way we relate and talk with others. Exploring empathy-boosting games and exercises, If I Understood You is a funny, thought-provoking guide that can be used by all of us, in every aspect of our lives—with our friends, lovers, and families, with our doctors, in business settings, and beyond. “Alda uses his trademark humor and a well-honed ability to get to the point, to help us all learn how to leverage the better communicator inside each of us.”—Forbes “Alda, with his laudable curiosity, has learned something you and I can use right now.”—Charlie Rose

September 26, 2019

Talking with WP&UP

At WordCamp Europe this year, I had the opportunity to chat with the folks at WP&UP, who are doing wonderful work providing mental health support in the WordPress community.

Listen to the podcast, and check out the services that WP&UP provide!

September 21, 2019

Restricting third-party iframe widgets using the sandbox attribute, referrer policy and feature policy

Adding third-party embedded widgets on a website is a common but potentially dangerous practice. Thankfully, the web platform offers a few controls that can help mitigate the risks. While this post uses the example of an embedded SurveyMonkey survey, the principles can be used for all kinds of other widgets.

Note that this is by no means an endorsement of SurveyMonkey's proprietary service. If you are looking for a survey product, you should consider a free and open source alternative like LimeSurvey.

SurveyMonkey's snippet

In order to embed a survey on your website, the SurveyMonkey interface will tell you to install the following website collector script:

<script>(function(t,e,s,n){var
o,a,c;t.SMCX=t.SMCX||[],e.getElementById(n)||(o=e.getElementsByTagName(s),a=o[o.length-1],c=e.createElement(s),c.type="text/javascript",c.async=!0,c.id=n,c.src=["https:"===location.protocol?"https://":"http://","widget.surveymonkey.com/collect/website/js/tRaiETqnLgj758hTBazgd9NxKf_2BhnTfDFrN34n_2BjT1Kk0sqrObugJL8ZXdb_2BaREa.js"].join(""),a.parentNode.insertBefore(c,a))})(window,document,"script","smcx-sdk");</script><a
style="font: 12px Helvetica, sans-serif; color: #999; text-decoration:
none;" href=https://www.surveymonkey.com> Create your own user feedback
survey </a>

which can be rewritten in a more understandable form as:

(
function (s) {
  var scripts, last_script, new_script;
  window.SMCX = window.SMCX || [],
  document.getElementById("smcx-sdk") ||
    (
      scripts = document.getElementsByTagName("script"),
      last_script = scripts[scripts.length - 1],
      new_script = document.createElement("script"),
      new_script.type = "text/javascript",
      new_script.async = true,
      new_script.id = "smcx-sdk",
      new_script.src =
        [
          "https:" === location.protocol ? "https://" : "http://",
          "widget.surveymonkey.com/collect/website/js/tRaiETqnLgj758hTBazgd9NxKf_2BhnTfDFrN34n_2BjT1Kk0sqrObugJL8ZXdb_2BaREa.js"
        ].join(""),
      last_script.parentNode.insertBefore(new_script, last_script)
    )
  }
)();

The fact that this adds a third-party script dependency to your website is problematic because it means that a security vulnerability in their infrastructure could lead to a complete compromise of your site, thanks to third-party scripts having full control over your website. Security issues aside though, this could also enable this third-party to violate your users' privacy expectations and extract any information displayed on your site for marketing purposes.

However, if you embed the snippet on a test page and inspect it with the developer tools, you will find that it actually creates an iframe:

<iframe
    width="500"
    height="500"
    frameborder="0"
    allowtransparency="true"
    src="https://www.surveymonkey.com/r/D3KDY6R?embedded=1"
></iframe>

and you can use that directly on your site without having to load their script.

Mixed content anti-pattern

As an aside, the script snippet they propose makes use of a common front-end anti-pattern:

"https:"===location.protocol?"https://":"http://"

This is presumably meant to avoid inserting an HTTP script element into an HTTPS page, since that would be considered mixed content and get blocked by browsers; however, this is entirely unnecessary. One should only ever use the HTTPS version of such scripts anyway, since an HTTP page never prohibits embedding HTTPS content.

In other words, the above code snippet can be simplified to:

"https://"

Restricting iframes

Thanks to defenses which have been added to the web platform recently, there are a few things that can be done to constrain iframes.

Firstly, you can choose to hide your full page URL from SurveyMonkey using the referrer policy:

referrerpolicy="strict-origin"

This may seem harmless, but page URLs sometimes include sensitive information in the URL path or query string, for example, search terms that a user might have typed. The strict-origin policy will limit the referrer to your site's hostname, port and protocol.

Secondly, you can prevent the iframe from being able to access anything about its embedding page or to trigger popups and unwanted downloads using the sandbox attribute:

sandbox="allow-scripts allow-forms"

Ideally, the contents of this attribute would be empty so that all restrictions would be active, but SurveyMonkey is a JavaScript application and it of course needs to submit a form since that's the purpose of the widget.

Finally, a new experimental capability is making its way into browsers: feature policy. In the context of untrusted iframes, it enables developers to explicitly disable certain powerful features:

allow="accelerometer 'none';
       ambient-light-sensor 'none';
       camera 'none';
       display-capture 'none';
       document-domain 'none';
       fullscreen 'none';
       geolocation 'none';
       gyroscope 'none';
       magnetometer 'none';
       microphone 'none';
       midi 'none';
       payment 'none';
       usb 'none';
       vibrate 'none';
       vr 'none';
       webauthn 'none'"

Putting it all together, we end up with the following HTML snippet:

<iframe
    width="500"
    height="500"
    frameborder="0"
    allowtransparency="true"
    allow="accelerometer 'none'; ambient-light-sensor 'none';
           camera 'none'; display-capture 'none';
           document-domain 'none'; fullscreen 'none';
           geolocation 'none'; gyroscope 'none'; magnetometer 'none';
           microphone 'none'; midi 'none'; payment 'none'; usb 'none';
           vibrate 'none'; vr 'none'; webauthn 'none'"
    sandbox="allow-scripts allow-forms"
    referrerpolicy="strict-origin"
    src="https://www.surveymonkey.com/r/D3KDY6R?embedded=1"
></iframe>

Content Security Policy

Another advantage of using the iframe directly is that instead of loosening your site's Content Security Policy by adding all of the following:

  • script-src https://www.surveymonkey.com
  • img-src https://www.surveymonkey.com
  • frame-src https://www.surveymonkey.com

you can limit the extra directives to just the frame controls:

  • frame-src https://www.surveymonkey.com
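
If you want to see what CSP a third party currently publishes before relying on it, a quick (if rough) check of its response headers works, assuming curl and grep are available:

curl -sI https://www.surveymonkey.com/ | grep -i content-security-policy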

CSP Embedded Enforcement would be another nice mechanism to make use of, but looking at SurveyMonkey's CSP policy:

Content-Security-Policy:
  default-src https: data: blob: 'unsafe-eval' 'unsafe-inline'
      wss://*.hotjar.com 'self';
  img-src https: http: data: blob: 'self';
  script-src https: 'unsafe-eval' 'unsafe-inline' http://www.google-analytics.com http://ajax.googleapis.com
      http://bat.bing.com http://static.hotjar.com http://www.googleadservices.com
      'self';
  style-src https: 'unsafe-inline' http://secure.surveymonkey.com 'self';
  report-uri https://csp.surveymonkey.com/report?e=true&c=prod&a=responseweb

it allows the injection of arbitrary Flash files, inline scripts, evals and any other scripts hosted on an HTTPS URL, which means that it doesn't really provide any meaningful security benefits.

Embedded enforcement is therefore not a usable security control in this particular example until SurveyMonkey gets a stricter CSP policy.

September 20, 2019

Running a non-root container on Fedora with podman and systemd (Home Assistant example)

Similar to my post about running Home Assistant on Fedora in Docker, this is about using podman instead and integrating the container as a service with systemd. One of the major advantages to me is the removal of the Docker daemon and the integration with the rest of the system, including management of dependencies like regular services.

This assumes you’ve just installed Fedora server and have a local user with sudo privileges. Let’s also install some SELinux tools.

sudo dnf install -y /usr/sbin/semanage

Create non-root user

Let’s create a specific user to run the Home Assistant service.

We could create a regular user (and remove password expiry settings), but as this is a service let’s create a system account even though it’s a bit more tricky.

sudo useradd -r -m -d /var/lib/hass hass

As this is a system account, we’ll need to manually specify sub user and group ids that the account is allowed to use inside the container. We work out what range is available by looking at the /etc/subuid and /etc/subgid files on the host; ideally the UID and GID ranges should be the same.

NEW_SUBUID=$(($(tail -1 /etc/subuid |awk -F ":" '{print $2}')+65536))
NEW_SUBGID=$(($(tail -1 /etc/subgid |awk -F ":" '{print $2}')+65536))
sudo usermod \
--add-subuids  ${NEW_SUBUID}-$((${NEW_SUBUID}+65535)) \
--add-subgids  ${NEW_SUBGID}-$((${NEW_SUBGID}+65535)) \
hass
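
To sanity check this, the new ranges should now show up against the hass user in both files (just a verification step, not from the original write-up):

grep hass /etc/subuid /etc/subgid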

Inside the hass user’s home directory, create a config directory to store configuration files and ssl to store SSL certificates. These will be mapped into the container as /config and /ssl respectively. We will also set the appropriate SELinux context so that the directories can be accessed in the container.

sudo -H -u hass bash -c "mkdir ~/{config,ssl}"
sudo semanage fcontext -a -t user_home_dir_t "/var/lib/hass(/.+)?"
sudo semanage fcontext -a -t svirt_sandbox_file_t \
"/var/lib/hass/((config)|(ssl))(/.+)?"
sudo restorecon -Frv /var/lib/hass
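
If you want to confirm the SELinux labels were applied, something like this should show the svirt_sandbox_file_t context on the two mapped directories (again, just a check):

ls -dZ /var/lib/hass /var/lib/hass/config /var/lib/hass/ssl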

Pull the container image

Now that we have the basic home directory in place, we can switch to the hass user with sudo.

sudo su - hass

As the hass user, let’s use podman to download and run the official Home Assistant image in a container.

First, pull the container which is stored under the non-root user’s ~/.local/share/containers/ directory. Note the latest tag on the end of the image name specifies the version to run. While this is not necessary if you’re getting the latest (as it’s the default), if you want a specific release simply replace latest with the version you want (see their Docker hub page for available releases). Specifying latest means we’ll get the latest release of the container at the time.

podman pull \
docker.io/homeassistant/home-assistant:latest

Manually start the container

Now we can spin up a container using that image. Note that we’re passing in the config and ssl (as read only) directories we created earlier and using host networking to open the required ports on the host.

podman run -dt \
--name=hass \
-v /var/lib/hass/config:/config \
-v /var/lib/hass/ssl:/ssl:ro \
-v /etc/localtime:/etc/localtime:ro \
--net=host \
docker.io/homeassistant/home-assistant:latest

Similar to Docker, you can look at the status of the container and manage it with podman, including getting the logs if you require them.

podman ps -a
podman logs hass
podman restart hass

To get a temporary shell on the container, execute bash.

podman exec -it hass /bin/bash

Inside the container, take a look at the passed-in /config directory (do anything else you want) and then exit when you’re done.

ls -l /config
id
echo "I am root in the container"
exit

Once the container is up and running, the Home Assistant port should be listening on the host.

ss -ltn |grep 8123
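
For an end-to-end check you could also ask for the HTTP status code directly (Home Assistant can take a few minutes to finish its first start-up, so be patient):

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8123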

Manually destroy the container

Next we’ll create a service to manage this, so for now you can stop and delete this container (this does not delete the image we downloaded). Do this as the hass user still, then exit to return to your regular user.

podman stop hass
podman rm hass
podman ps -a
exit

Configuring the firewall

Home Assistant runs on port 8123, so we will need to open this port on the firewall (back as your regular user).

sudo firewall-cmd --get-active-zones
sudo firewall-cmd --zone=FedoraServer --add-port=8123/tcp

You can test this by using your web browser to connect to the IP address of your machine on port 8123 from another machine on your network.

If that works, make the firewall change permanent.

sudo firewall-cmd --runtime-to-permanent

Create service for the container

Now that we have the container that works, let’s create a systemd service to manage it. This will auto start the container on boot and allow us to manage it as a regular service, including any dependencies. This service stops, removes and starts a new container every time.

Note the Exec lines which will delete and restart the container from the image. As per the manual command above, to run a specific version replace latest with an available tagged release.

cat << EOF | sudo tee /etc/systemd/system/hass.service
[Unit]
Description=Home Assistant in Container
After=network.target

[Service]
User=hass
Group=hass
Type=simple
TimeoutStartSec=5m
ExecStartPre=-/usr/bin/podman rm -f "hass"
ExecStart=/usr/bin/podman run --name=hass -v /var/lib/hass/ssl:/ssl:ro -v /var/lib/hass/config:/config -v /etc/localtime:/etc/localtime:ro --net=host docker.io/homeassistant/home-assistant:latest
ExecReload=-/usr/bin/podman stop "hass"
ExecReload=-/usr/bin/podman rm "hass"
ExecStop=-/usr/bin/podman stop "hass"
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target
EOF

Reloading the systemd daemon is required to pick up the new file.

sudo systemctl daemon-reload

Manage the container with systemd

Let’s see if we can restart the container and check its status. Because it is now managed by systemd, we can check the log with journalctl.

sudo systemctl restart hass
sudo systemctl status hass
sudo journalctl -u hass

Once you’re happy, we can enable the service.

sudo systemctl enable hass

Now is probably a good time to reboot your machine and make sure that the service comes up fine on boot.

Configuring Home Assistant

After rebooting, you should be able to browse to the Home Assistant port on your machine.

Now that you have Home Assistant running, modify the configuration as you please by editing the configuration file under the hass user home directory.

If you make a change, you can simply restart the service.

Updating the container

To update the container, switch to the hass user again and pull a newer version of the container. We can see the newer version of the image with podman and if you want to you can inspect the image for more details.

podman pull docker.io/homeassistant/home-assistant:latest
podman images -a
podman inspect docker.io/homeassistant/home-assistant:latest

Now you can restart the container as your regular user.

sudo systemctl restart hass
sudo journalctl -uf hass.service

Conclusion

Anyway, that’s an example of how you could do it with something like Home Assistant. It can be modified accordingly for any other container you might want to run.

September 18, 2019

Sadly leaving the NSW Government

This week was sadly my last week with the NSW Government, Department of Customer Service, formerly the Department of Finance, Services and Innovation. I am sad to be leaving such an exciting place at such an exciting time, but it comes after 12 months of commuting from Canberra to Sydney. The hardest part of working in the NSW Government has been, by far, the commute. I have been leaving my little family every week for 3, 4 or 5 days, and although we have explored possibilities to move, my family and I have to continue living in Canberra for the time being. It has got to the point where my almost 4 year old has asked me to choose her over work, a heartbreaking scenario as many will understand.

I wanted to publicly thank everyone I worked with, particularly my amazing teams who have put their heart, soul and minds to the task of making exceptional public services in an exceptional public sector. I am really proud of the two Branches I had the privilege and delight to lead, and I know whatever comes next, that those 160 or so individuals will continue to do great things wherever they go. 

I remain delighted and amazed at the unique opportunity in NSW Government to lead the way for truly innovative, holistic and user centred approaches to government. The commitment and leadership from William Murphy, Glenn King, Greg Wells, Damon Rees, Emma Hogan, Tim Reardon, Annette O’Callaghan, Michael Coutts-Trotter (and many others across the NSW Government senior executive) has, genuinely to my mind, created the best conditions anywhere in Australia (and likely the world!) to make great and positive change in the public service.

I want to take a moment to also directly thank Martin Hoffman, Glenn, Greg, William, Amanda Ianna and all those who have supported me in the roles, as well as everyone from my two Branches over that 12 months for their support, belief and commitment. It has been a genuine privilege and delight to be a part of this exceptional department, and to see the incredible work across our Branches.

I have only been in the NSW Government for 12 months, and in that time was the ED for Digital Government Policy and Innovation for 9 months, and then ED Data, Insights and Transformation for a further 3 months.

In just 9 months, the Digital Government Policy and Innovation team achieved a lot in the NSW Government digital space, including:

  • Australia’s first Policy Lab (bringing agile test driven and user centred design methods into a traditional policy team),
  • the Digital Government Policy Landscape (mapping all digital gov policies for agencies) including IoT & a roadmap for an AI Ethics Framework and AI Strategy,
  • the NSW Government Digital Design Standard and a strong community of practice to contribute and collaborate
  • evolution of the Digital NSW Accelerator (DNA) to include delivery capabilities,
  • the School Online Enrolment system,
  • an operational and cross government Life Journeys Program (and subsequent life journey based navigators),
  • world-leading Rules as Code exemplars and early exploration of developing human and machine readable legislation from scratch (Better Rules),
  • establishment of a digital talent pool for NSW Gov,
  • great improvements to data.nsw and whole of government data policy and the Information Management Framework,
  • capability uplift across the NSW public sector including the Data Champions network and digital champions,
  • a prototype whole of government CX Pipeline,
  • the Innovation NSW team were recognised as one of Apolitical’s 100+ teams teaching government the skills of the future with a range of Innovation NSW projects including several Pitch to Pilot events, Future Economy breakfast series,
  • and the improvements to engagement/support we provided across whole of government.

For the last 3 months I was lucky to lead the newly formed and very exciting Data, Insights and Transformation Branch, which included the Data Analytics Centre, the Behavioural Insights Unit, and a new Transformation function to explore how we could design a modern public service fit for the 21st century. In only 3 months we

  • established a strong team culture, developed a clear cohesive work program, strategic objectives and service offerings,
  • chaired the ethics board for behavioural insights projects, which was a great experience, and
  • were seeing new interest, leads and engagement from agencies who wanted to engage with the Data Analytics Centre, Behavioural Insights Unit or our new Transformation function.

It was wonderful to work with such a fantastic group of people and I learned a lot, including from the incredible leadership team and my boss, William Murphy, who shared the following kind words about my leaving:

As a passionate advocate for digital and transformative approaches to deliver great public services, Pia has also been working steadily to deliver on whole-of-government approaches such as Government as a Platform, service analytics and our newly formed Transformation agenda to reimagine government.

Her unique and effective blend of systems thinking, technical creativity and vision will ensure the next stage in her career will be just as rewarding as her time with Customer Service has been.

Pia has made the difficult decision to leave Customer Service to spend more time with her Canberra-based family.

The great work Pia and her teams have done over the last twelve months has without a doubt set up the NSW digital and customer transformation agenda for success.

I want to thank her for the commitment and drive she has shown in her work with the NSW Government, and wish her well with her future endeavours. I’m confident her focus on building exceptional teams, her vision for NSW digital transformation and the relationships she has built across the sector will continue.

For my part, I’m not sure what will come next, but I’m going to have a holiday first to rest, and probably spend October simply writing down all my big ideas and doing some work on rules as code before I look for the next adventure.

Deploying TT-RSS on NixOS

NixOS Gears by Craige McWhirter

Deploying a vanilla Tiny Tiny RSS server on NixOS via NixOps is fairly straight forward.

My preferred method is to craft a tt-rss.nix file that describes the configuration of the TT-RSS server.

tt-rss.nix:

{ config, pkgs, lib, ... }:

{

  services.tt-rss = {
    enable = true;                                # Enable TT-RSS
    database = {                                  # Configure the database
      type = "pgsql";                             # Database type
      passwordFile = "/run/keys/tt-rss-dbpass";   # Where to find the password
    };
    email = {
      fromAddress = "news@mydomain";              # Address for outgoing email
      fromName = "News at mydomain";              # Display name for outgoing email
    };
    selfUrlPath = "https://news.mydomain/";       # Root web URL
    virtualHost = "news.mydomain";                # Setup a virtualhost
  };

  services.postgresql = {
    enable = true;                # Ensure postgresql is enabled
    authentication = ''
      local tt_rss all ident map=tt_rss-users
    '';
    identMap =                    # Map the tt-rss user to postgresql
      ''
        tt_rss-users tt_rss tt_rss
      '';
  };

  services.nginx = {
    enable = true;                                          # Enable Nginx
    recommendedGzipSettings = true;
    recommendedOptimisation = true;
    recommendedProxySettings = true;
    recommendedTlsSettings = true;
    virtualHosts."news.mydomain" = {                        # TT-RSS hostname
      enableACME = true;                                    # Use ACME certs
      forceSSL = true;                                      # Force SSL
    };
  };

  security.acme.certs = {
      "news.mydomain".email = "email@mydomain";
  };

}

This line from the above file should stand out:

              passwordFile = "/run/keys/tt-rss-dbpass";   # Where to find the password

The passwordFile option requires that you use a secrets file with NixOps.

Where does that file come from? It's pulled from a secrets.nix file (example) that for this example, could look like this:

secrets.nix:

{ config, pkgs, ... }:

{
  deployment.keys = {
    # Database key for TT-RSS
    tt-rss-dbpass = {
      text        = "vaetohH{u9Veegh3caechish";   # Password, generated using pwgen -yB 24
      user        = "tt_rss";                     # User to own the key file
      group       = "wheel";                      # Group to own the key file
      permissions = "0640";                       # Key file permissions
    };

  };
}

The file's path /run/keys/tt-rss-dbpass is determined by the elements of the stanza: deployment.keys determines the initial path of /run/keys, and the next element, tt-rss-dbpass, is a descriptive name provided by the stanza's author to describe the key's use and also provide the final file name.
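
After a deploy, you can confirm the key landed on the target with the requested ownership and permissions; one way (using the deployment and host names from the example below) is via nixops itself:

    $ nixops ssh -d MyDeployment myhost -- ls -l /run/keys/tt-rss-dbpass

Keep in mind that /run/keys is not persistent across reboots, so the keys may need to be re-sent with nixops send-keys after the target restarts.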

Now that we have described the TT-RSS service in tt-rss_for_NixOps.nix and the required credentials in secrets.nix we need to pull it all together for deployment. We achieve that in this case by importing both these files into our existing host definition:

myhost.nix:

    {
      myhost =
        { config, pkgs, lib, ... }:

        {

          imports =
            [
              ./secrets.nix                               # Import our secrets
              ./servers/tt-rss_for_NixOps.nix              # Import TT-RSS description
            ];

          deployment.targetHost = "192.168.132.123";   # Target's IP address

          networking.hostName = "myhost";              # Target's hostname.
        };
    }

To deploy TT-RSS to your NixOps managed host, you merely run the deploy command for your already configured host and deployment, which would look like this:

    $ nixops deploy -d MyDeployment --include myhost

You should now have a running TT-RSS server and be able to login with the default admin user (admin: password).

In my nixos-examples repo I have a servers directory with some example files and a README with information and instructions. You can use two of the files to generate a TT-RSS VM to take a quick poke around. There is also an example of how you can deploy TT-RSS in production using NixOps, as per this post.

If you wish to dig a little deeper, I have my production deployment over at mio-ops.

September 17, 2019

LUV October 2019 Main Meeting and AGM

Oct 1 2019 19:00
Oct 1 2019 21:00
Location: 
Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

NOTE: The library closes at 7pm so arrivals after that time will need to contact Andrew on (0421) 775 358 or any other attendee for admission.

Speakers:

Many of us like to go for dinner nearby after the meeting, typically at Brunetti's or Trotters Bistro in Lygon St.  Please let us know if you'd like to join us!

Linux Users of Victoria is a subcommittee of Linux Australia.

Software Freedom Day 2019

Sep 21 2019 13:00
Sep 21 2019 18:00
Location: 
Electron Workshop, 31 Arden Street North Melbourne 3051

It's time once again to get excited about all the benefits that Free and Open Source Software have given us over the past year and get together to talk about how Freedom and Openness can improve our human rights, our privacy, our security and our communities. It's Software Freedom Day!

Linux Users of Victoria is a subcommittee of Linux Australia.

September 16, 2019

FreeDV 2020 over the QO-100 Satellite

Gerhard (OE3GBB), Steve (K5OKC) and I have been working on FreeDV 2020 over the Es’hail 2/QO-100 satellite. This satellite is in geosynchronous orbit and has a linear transponder. It’s designed for SSB so has a narrow bandwidth, which rules out most digital voice modes – except FreeDV. For example, FreeDV 2020 can send 8 kHz wide speech over just 1600 Hz of RF bandwidth. A linear amplifier also means the OFDM waveforms used by FreeDV will pass OK, as long as your transmit system is linear.

Modem Mods

Gerhard’s initial experiments showed that FreeDV 1600 worked well, but FreeDV 2020 was breaking up and losing sync. We guessed that this was due to significant phase noise on the channel, from the many up and down conversion steps and the transponder itself. Fortunately the SNR was quite high.

Steve and I modified the OFDM modem used for FreeDV 2020 to handle this. This modem had been designed for coherent demodulation on very low SNR HF fading channels. The phase tracking was designed for HF channels with a bandwidth of a few Hz. As a first step we added a high bandwidth option, then moved to differential demodulation. This allows us to handle faster phase shifts (i.e. more phase noise), at the expense of reduced low SNR performance. This is an acceptable trade off for this channel as we have plenty of SNR.

Gerhard also had some set-up problems getting everything to run on one machine – FreeDV 2020 likes a powerful, modern CPU due to the LPCNet codec.

I now have the rather complex Windows build process for FreeDV fully scripted (thanks Richard and Danilo). This means I can develop on Linux, then run a Docker script that rebuilds everything for Windows, packages it in an installer, and pops it up on my web site. Remarkably, it then produces the same results as the Linux version. This takes a lot of pain out of my life, and makes it easy for others to test innovations rapidly.

Here is a sample of the decoded audio, from a QSO between Gerhard and Dani (EA4GPZ):

The quality is quite high, and through a nice set of speakers the wide, 8kHz audio bandwidth is very pleasant. However I can hear some frame rate modulation, and I’ve heard similar on some other 2020 samples over HF channels from the US test team. I’ll explore that at some stage.

Gerhard’s QO100 Station

Team Work

I am enjoying working with Gerhard and Steve on this project. We are roughly equi-distant around the globe but the time zone shift allows us to bounce new software versions around for testing on 12 hour cycles. Working as a team, we investigated the problem and greatly improved the performance of FreeDV 2020 over the QO-100 satellite. We worked very carefully, debugging tricky problems, collecting and comparing samples, and discussing our results via email.

We have applied new speech coding technology (the neural net/machine learning based LPCNet), modified and optimised a HF modem, and sent our signals through a new satellite transponder. This is real experimental radio!

Our next step is to look at improving modem acquisition, which is also likely to require tuning for this channel.

Reading Further

Es’hail 2/QO-100 satellite
WebSDR for QO-100 satellite
FreeDV 2020 First On Air Tests
Steve Ports an OFDM modem from Octave to C

September 15, 2019

Prometheus 2.12, query logging, and startup failures on macos

Prometheus v2.12 added active query logging. The basic idea is that there is a mmaped JSON file that contains all of the queries currently running. If prometheus was to crash, that file would therefore be a list of the queries running at the time of the crash. Overall, not a bad idea.

Some friends had recently added prometheus to their development environments. This is wired up to grafana dashboards for their microservices, and prometheus is configured to store 14 days worth of time series data via a persistent volume from the developer desktops. We did this because it is valuable for the developers to be able to see the history of metrics before and after their changes.

Now we have a developer using macos as their primary development platform, and since prometheus 2.12 it hasn’t worked. Specifically this developer is using parallels to provide the docker virtual machine on his mac. You can summarise the startup for prometheus in the dev environment like this:

$ docker run ...stuff...
...snip...
level=error ts=2019-09-15T02:20:23.520Z caller=query_logger.go:94 component=activeQueryTracker msg="Failed to mmap" file=/prometheus-data/data/queries.active Attemptedsize=20001 err="invalid argument"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7fff9917af38, 0x15, 0x14, 0x2a6b7c0, 0xc00003c7e0, 0x2a6b7c0)
	/app/promql/query_logger.go:112 +0x4d2
main.main()
	/app/cmd/prometheus/main.go:361 +0x52bd

And here’s the underlying problem — because of the way the persistent data is mapped into this container (via parallels sharing in this case), the mmap of the active queries file fails and prometheus fails to start.

In other words, since prometheus 2.12 your prometheus data files have to be stored on a filesystem which supports mmap. Additionally, there is no flag to just disable the active query logger.

So how do we work around this? Well, here’s a horrible workaround: in the data directory that is volume mapped into the container, create a symlink to a path that is mmapable inside the docker container, even if that path doesn’t exist outside the container. For example, given that we store the prometheus time series at $CONFIG/prometheus-data:

$ ln -s /tmp/queries.active "$CONFIG/prometheus-data/queries.active"
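
The symlink will look dangling from the mac side, which is fine; a quick look confirms where it points before starting the container again:

$ ls -l "$CONFIG/prometheus-data/queries.active"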

Note that /tmp/queries.active does not exist on the developer’s mac. Prometheus now starts and it’s puppies and kittens the whole way down.

September 14, 2019

VDSL versus HF Radio

I’m putting up a 40M dipole. When I Tx on 40m (50W peak) my Internet drops out. Sometimes it comes back, other times the modem loses sync. The dipole has a balun, and is nicely tuned.

I tried some ferrites with several turns on the modem VDSL and power cables which improved the situation a little. But I still get a momentary drop out of Internet on PTT, and if I try hard enough I can still lose sync on the modem.

Now I have NBN (Australian National Broadband Network) with a VDSL link over traditional copper phone lines to a “node” several hundred metres away. Turns out VDSL uses bandwidth up to 30 MHz … so I guess I’m getting right into its passband. Old school ADSL only used a few MHz. The phone line used for this service is 50 years old and has significant differential to common mode conversion. It’s not much of a transmission line. But probably a pretty good antenna!

I built a little jig with a transformer to couple the differential signal to my spectrum analyser and take a look:

Lotsa turns on the primary, one turn on the secondary, some core I found in the junk box. I adjusted the coupling capacitors in line with both arms of the primary so that the modem didn’t lose sync when I plugged it in (about 5pF). Also in this photo is the series LC circuit, but disconnected (open) at this stage.

Sure enough, I could see Rx energy from the node to my modem at around 7MHz, and other energy out to 12MHz. In the 7MHz region, I could see the Rx signal from the “node” at -60dBm. When I Tx SSB on 7.18 MHz my SSB signal was -30dBm. No wonder the modem is choking.

After some experimentation, I came up with a 7MHz LC series resonant circuit connected across the phone line. When the modem does its training thing, it sees a short circuit around 7MHz and ignores that region as no good. So when I transmit in that region, there is no modem signal to interfere with.

I started with an 800nH/600pF filter. Xc and Xl are a rather low 37 ohms of reactance at resonance, and just a bit higher than that above resonance (e.g. at 8-12 MHz), attenuating a lot of the HF energy. So it was basically a LPF, killing anything above 6 MHz. This stopped the drop outs, but my Internet downstream bandwidth dropped from 55 to 24 Mbps.

Final Design

After some fiddling with a spreadsheet I came up with a 5uH/100pF series LC notch filter. This is simply a 5uH inductor in series with a 100pF capacitor, connected across the VDSL phone line.
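
As a quick sanity check on those values (my arithmetic, not from the original write-up), the series resonance of the notch lands right on the 40m band:

f_0 = \frac{1}{2\pi\sqrt{LC}} = \frac{1}{2\pi\sqrt{5\,\mu\mathrm{H} \times 100\,\mathrm{pF}}} \approx 7.1\ \mathrm{MHz}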

This has a few hundred ohms of Xc above resonance, which results in just a few dB attenuation at 8-12 MHz. This obtained 38 Mbps downstream. Upstream was the same (24 Mbps) as with no filter. Good enough.

The inductor is 9 turns on a F37-61 core. Make sure you use a material suitable for high Q inductors. I initially used the wrong core material and couldn’t get a decent notch. I recommend you check the tuning – it should have a deep notch at around your operating frequency.

Here is a sweep of my filter:

I put 560 ohm resistors in series with the tracking generator output and spectrum analyser input to approximate the line impedance using this jig:

Here is a plot of the system in action:

The yellow plot is the original, unfiltered VDSL signal. At the same time I’m transmitting SSB. You can see my SSB signal on 7.18 MHz (yellow peak above the “1”).

Purple is with the series LC notch filter installed. You can see the notch left of the “1” at 7MHz. The “node” has worked out 7-8MHz is a dud band so isn’t sending any information. So nothing to interfere with when I PTT SSB. I’m not sending a SSB signal in this plot.

Note also the 8-12 MHz purple (filtered) is just a few dB lower than the yellow (unfiltered). So the notch filter isn’t wiping out the HF signals.

These plots show a mixture of Tx (-10dBm) from my modem, and Rx (underlying gentle downwards slope) – the signal from the “node”. I assume it’s full duplex, we just can’t see the Rx signal most of the time. I am sampling the combined signal next to the modem, so Tx dominates. You can see the Rx signal better when the modem is training.

For some reason my modem doesn’t Tx in the 6-8MHz band. Probably a good thing for RFI.

Results

Without the filter I get immediate interruptions to pings and loss of modem sync after 20 seconds. With the filter I’ve hammered it for the last few weeks with SSB and FreeDV signals, with no interruptions in pings or the received audio and waterfall from a local KiwiSDR.

There is a hit on my downstream bandwidth, but it’s not significant for me. Much nicer to be able to transmit on 40m and not have the Internet break!

Here is the finished filter, installed near the modem in some heat shrink:

I’d be interested to see if this idea will work at other sites. Due to the random nature of the phone lines no two VDSL installations are the same.

The design is simply a 5uH inductor in series with a 100pF capacitor, connected across the VDSL phone lines. You need a high Q inductor, I used 9 turns on a F37-61 core. If possible, carefully check the tuning of the notch filter.

I’ve also seen suggestions of using a quarter wave stub (about 10m of phone cable) to get the same effect. This is a neat idea, as you could just buy a 10m phone extension lead, and plug it in parallel with your VDSL line. However once again – carefully check the tuning of the stub – phone cable is messy, uncalibrated stuff!

This was an interesting little project, with a satisfying result. I quite like learning about RF, and (re) learnt about the trade-offs around reactance at resonance, transmission lines, and inductor core material.

Thanks for help and useful comments from AREG members on their mailing list. Several other AREG members are also suffering from the same problem, so I imagine it’s wide spread in other countries that use VDSL.

September 13, 2019

On the airwaves

As of this year, I’m now an amateur radio operator! Callsign VK2FAAS, foundation licence. It’s something I’ve always had an interest in doing. As a kid, I had some toy 27 MHz radios with barely 20 metres of range. Then, I got a job working as a sysadmin at a wireless ISP where we built long-distance wireless networks. And, while at LCA2013 I attended a ham radio BoF (“birds of a feather”) session, where some operators made a DX (long distance) contact by fashioning an antenna out of some wire tied to a tree.

September 11, 2019

Deploying and Configuring Vim on NixOS

NixOS Gears by Craige McWhirter

I had a need to deploy vim and my particular preferred configuration both system-wide and across multiple systems (via NixOps).

I started by creating a file named vim.nix that would be imported into either /etc/nixos/configuration.nix or an appropriate NixOps Nix file. This example is a stub that shows a number of common configuration items:

vim.nix:

with import <nixpkgs> {};

vim_configurable.customize {
  name = "vim";   # Specifies the vim binary name.
  # Below you can specify what usually goes into `~/.vimrc`
  vimrcConfig.customRC = ''
    " Preferred global default settings:
    set number                    " Enable line numbers by default
    set background=dark           " Set the default background to dark or light
    set smartindent               " Automatically insert extra level of indentation
    set tabstop=4                 " Default tabstop
    set shiftwidth=4              " Default indent spacing
    set expandtab                 " Expand [TABS] to spaces
    syntax enable                 " Enable syntax highlighting
    colorscheme solarized         " Set the default colour scheme
    set t_Co=256                  " use 256 colors in vim
    set spell spelllang=en_au     " Default spell checking language
    hi clear SpellBad             " Clear any unwanted default settings
    hi SpellBad cterm=underline   " Set the spell checking highlight style
    hi SpellBad ctermbg=NONE      " Set the spell checking highlight background
    match ErrorMsg '\s\+$'        " Highlight trailing whitespace as an error

    let g:airline_powerline_fonts = 1   " Use powerline fonts
    let g:airline_theme='solarized'     " Set the airline theme

    set laststatus=2   " Set up the status line so it's coloured and always on

    " Add more settings below
  '';
  # store your plugins in Vim packages
  vimrcConfig.packages.myVimPackage = with pkgs.vimPlugins; {
    start = [               # Plugins loaded on launch
      airline               # Lean & mean status/tabline for vim that's light as air
      solarized             # Solarized colours for Vim
      vim-airline-themes    # Collection of themes for airline
      vim-nix               # Support for writing Nix expressions in vim
    ];
    # manually loadable by calling `:packadd $plugin-name`
    # opt = [ phpCompletion elm-vim ];
    # To automatically load a plugin when opening a filetype, add vimrc lines like:
    # autocmd FileType php :packadd phpCompletion
  };
}

I then needed to import this file into my system packages stanza:

  environment = {
    systemPackages = with pkgs; [
      someOtherPackages   # Normal package listing
      (
        import ./vim.nix
      )
    ];
  };

This will then install and configure Vim as you've defined it.
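
If this is going into a standalone machine's /etc/nixos/configuration.nix rather than a NixOps deployment, applying it is the usual rebuild (a sketch, assuming you have root on the machine):

$ sudo nixos-rebuild switch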

If you'd like to give this build a run in a non-production space, I've written vim_vm.nix with which you can build a VM, ssh into it afterwards and test the Vim configuration:

$ nix-build '<nixpkgs/nixos>' -A vm --arg configuration ./vim_vm.nix
...
$ export QEMU_OPTS="-m 4192"
$ export QEMU_NET_OPTS="hostfwd=tcp::18080-:80,hostfwd=tcp::10022-:22"
$ ./result/bin/run-vim-vm-vm

Then, from another terminal:

$ ssh nixos@localhost -p 10022

And you should be in a freshly baked NixOS VM with your Vim config ready to be used.

There's an always current example of my production Vim configuration in my mio-ops repo.

September 10, 2019

Deploying Gitea on NixOS

NixOS Gitea by Craige McWhirter

I've been using GitLab for years but recently opted to switch to Gitea, primarily because of timing and I was looking for something more lightweight, not because of any particular problems with GitLab.

To deploy Gitea via NixOps I chose to craft a Nix file (example) that would be included in a host definition. The linked definition, shown below, provides a deployment of Gitea, using Postgres, Nginx, ACME certificates and ReStructured Text rendering with syntax highlighting.

version-management/gitea_for_NixOps.nix:

    { config, pkgs, lib, ... }:

    {

      services.gitea = {
        enable = true;                               # Enable Gitea
        appName = "MyDomain: Gitea Service";         # Give the site a name
        database = {
          type = "postgres";                         # Database type
          passwordFile = "/run/keys/gitea-dbpass";   # Where to find the password
        };
        domain = "source.mydomain.tld";              # Domain name
        rootUrl = "https://source.mydomain.tld/";    # Root web URL
        httpPort = 3001;                             # Provided unique port
        extraConfig = let
          docutils =
            pkgs.python37.withPackages (ps: with ps; [
              docutils                               # Provides rendering of ReStructured Text files
              pygments                               # Provides syntax highlighting
          ]);
        in ''
          [mailer]
          ENABLED = true
          FROM = "gitea@mydomain.tld"
          [service]
          REGISTER_EMAIL_CONFIRM = true
          [markup.restructuredtext]
          ENABLED = true
          FILE_EXTENSIONS = .rst
          RENDER_COMMAND = ${docutils}/bin/rst2html.py
          IS_INPUT_FILE = false
        '';
      };

      services.postgresql = {
        enable = true;                # Ensure postgresql is enabled
        authentication = ''
          local gitea all ident map=gitea-users
        '';
        identMap =                    # Map the gitea user to postgresql
          ''
            gitea-users gitea gitea
          '';
      };

      services.nginx = {
        enable = true;                                          # Enable Nginx
        recommendedGzipSettings = true;
        recommendedOptimisation = true;
        recommendedProxySettings = true;
        recommendedTlsSettings = true;
        virtualHosts."source.mydomain.tld" = {                  # Gitea hostname
          enableACME = true;                                    # Use ACME certs
          forceSSL = true;                                      # Force SSL
          locations."/".proxyPass = "http://localhost:3001/";   # Proxy Gitea
        };
      };

      security.acme.certs = {
          "source.mydomain".email = "anEmail@mydomain.tld";
      };

    }

This line from the above file should stand out:

              passwordFile = "/run/keys/gitea-dbpass";   # Where to find the password

Where does that file come from? It's pulled from a secrets.nix file (example) that, for this example, could look like this:

secrets.nix:

    { config, pkgs, ... }:

    {
      deployment.keys = {
        # An example set of keys to be used for the Gitea service's DB authentication
        gitea-dbpass = {
          text        = "uNgiakei+x>i7shuiwaeth3z";   # Password, generated using pwgen -yB 24
          user        = "gitea";                      # User to own the key file
          group       = "wheel";                      # Group to own the key file
          permissions = "0640";                       # Key file permissions
        };
      };
    }

The file's path /run/keys/gitea-dbpass is determined by the attribute names: deployment.keys determines the base path of /run/keys, and the next attribute, gitea-dbpass, is a descriptive name provided by the stanza's author that describes the key's use and also provides the final file name.

Now that we have described the Gitea service in gitea_for_NixOps.nix and the required credentials in secrets.nix we need to pull it all together for deployment. We achieve that in this case by importing both these files into our existing host definition:

myhost.nix:

    {
      myhost =
        { config, pkgs, lib, ... }:

        {

          imports =
            [
              ./secrets.nix                               # Import our secrets
              ./version-management/gitea_for_NixOps.nix   # Import Gitea
            ];

          deployment.targetHost = "192.168.132.123";   # Target's IP address

          networking.hostName = "myhost";              # Target's hostname.
        };
    }

To deploy Gitea to your NixOps managed host, you merely run the deploy command for your already configured host and deployment, which would look like this:

    $ nixops deploy -d MyDeployment --include myhost

You should now have a running Gitea server and be able to create an initial admin user.

In my nixos-examples repo I have a version-management directory with some example files and a README with information and instructions. You can use two of the files to generate a Gitea VM to take a quick poke around. There is also an example of how you can deploy Gitea in production using NixOps, as per this post.

If you wish to dig a little deeper, I have my production deployment over at mio-ops.

September 09, 2019

Monitoring OpenWrt with collectd, InfluxDB and Grafana

In my previous blog post I showed how to set up InfluxDB and Grafana (and Prometheus). This is how I configured my OpenWrt devices to provide monitoring and graphing of my network.

OpenWrt includes support for collectd (and even graphing inside Luci web interface) so we can leverage this and send our data across the network to the monitoring host.

OpenWrt stats in Grafana

Install and configure packages on OpenWrt

Log into your OpenWrt devices and install the required packages.

opkg update
opkg install luci-app-statistics collectd collectd-mod-cpu \
collectd-mod-interface collectd-mod-iwinfo \
collectd-mod-load collectd-mod-memory collectd-mod-network collectd-mod-uptime
/etc/init.d/luci_statistics enable
/etc/init.d/collectd enable

Next, log into your device’s OpenWrt web interface and you should see a new Statistics menu at the top. Hover over this and click on Setup so that we can configure collectd.

Add the Hostname field and enter in the device’s hostname (or some name you want).

Click on General plugins and make sure that Processor, System Load, Memory and Uptime are all enabled. Hit Save & Apply.

Under Network plugins, ensure Interfaces is enabled and select the interfaces you want to monitor (lan, wan, wifi, etc).

Still under Network plugins, also ensure Wireless is enabled but don’t select any interfaces (it will work it out). Hit Save & Apply (I don’t bother with the Ping plugin).

Click on Output plugins and ensure Network is enabled so that we can stream metrics to InfluxDB. All you need to do is add an entry under server interfaces that points to the IP address of your monitor server (which is running InfluxDB with the collectd listener enabled). Hit Save & Apply.

Finally, you can leave the RRDTool plugin as it is, or disable it if you want to (it will stop showing graphs in Luci if you do, but we’re using Grafana anyway and you’ll have less load on your router). If you do enable it, make sure it is writing data to tmpfs to avoid wearing out your flash (this is the default configuration).

That’s your OpenWrt configuration done!

Loading a dashboard in Grafana

Still in your web browser, log into Grafana on your monitor node (port 3000 by default).

Import a new dashboard.

We will use an existing dashboard by contributor vooon341, so simply type in the number 3484 and hit Load.

This will download the dashboard from Grafana and prompt for settings. Enter whatever Name you like, select InfluxDB as your data source (configured in the previous blog post), then hit Import.

Grafana will now go and query InfluxDB and present your dashboard with all of your OpenWrt devices.

OpenWrt also supports a LUA Prometheus node exporter, so if you wanted to add those as well, you could. However, I think collectd does a reasonable job.

September 08, 2019

Setting up a monitoring host with Prometheus, InfluxDB and Grafana

Prometheus and InfluxDB are powerful time series database monitoring solutions, both of which are natively supported by the graphing tool Grafana.

Setting up these simple but powerful open source tools gives you a great base for monitoring and visualising your systems. We can use agents like node-exporter to publish metrics on remote hosts which Prometheus will scrape, and other tools like collectd which can send metrics to InfluxDB’s collectd listener (as per my post about OpenWRT).

Prometheus’ node exporter metrics in Grafana

I’m using CentOS 7 on a virtual machine, but this should be similar to other systems.

Install Prometheus

Prometheus is the trickiest to install, as there is no Yum repo available. You can either download the pre-compiled binary or run it in a container, I’ll do the latter.

Install Docker and pull the image (I’ll use Quay instead of Dockerhub).

sudo yum install docker
sudo systemctl start docker
sudo systemctl enable docker
sudo docker pull quay.io/prometheus/prometheus

Let’s create a directory for Prometheus configuration files which we will pass into the container.

sudo mkdir /etc/prometheus.d

Let’s create the core configuration file. This file will set the scraping interval (under global) for Prometheus to pull data from client endpoints and is also where we configure those endpoints (under scrape_configs). As we will enable node-exporter on the monitoring node itself later, let’s add it as a localhost target.

cat << EOF | sudo tee /etc/prometheus.d/prometheus.yml
global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
    - targets:
      - localhost:9100
EOF

Now we can start a persistent container. We’ll pass in the config directory we created earlier but also a dedicated volume so that the database is persistent across updates. We use host networking so that Prometheus can talk to localhost to monitor itself (not required if you want to configure Prometheus to talk to the host’s external IP instead of localhost).

Pass in the path to any custom CA Certificate as a volume (example below) for any end points you require. If you want to run this behind a reverse proxy, then set web.external-url to the hostname and port (leave it off if you don’t).

Note that enabling the admin API and lifecycle endpoints will allow anyone on your network to perform those functions, so you may want to only allow that if your network is trusted. Otherwise you should probably put those behind an SSL enabled, password protected webserver (out of scope for this post).

Note also that some volumes have either :z or :Z option appended to them, this is to set the SELinux context for the container (shared vs exclusive, respectively).

sudo docker run \
--detach \
--interactive \
--tty \
--network host \
--name prometheus \
--restart always \
--publish 9090:9090 \
--volume prometheus:/prometheus \
--volume /etc/prometheus.d:/etc/prometheus.d:Z \
--volume /path/to/ca-bundle.crt:/etc/ssl/certs/ca-certificates.crt:z \
quay.io/prometheus/prometheus \
--config.file=/etc/prometheus.d/prometheus.yml \
--web.external-url=http://$(hostname -f):9090 \
--web.enable-lifecycle \
--web.enable-admin-api

Check that the container is running properly; the log should say that it is ready to receive web requests. You should also be able to browse to the endpoint on port 9090 (you can run queries here, but we’ll use Grafana).

sudo docker ps
sudo docker logs prometheus

Updating Prometheus config

Updating and reloading the config is easy, just edit /etc/prometheus.d/prometheus.yml and restart the container. This is useful when adding new nodes to scrape metrics from.

sudo docker restart prometheus

You can also send a message to Prometheus to reload (if you enabled this via the web.enable-lifecycle option).

curl -s -XPOST localhost:9090/-/reload

In the container log (as above) you should see that it has reloaded the config.

Installing Prometheus node exporter

You’ll notice in the Prometheus configuration above we have a job called node and a target for localhost:9100. This is a simple way to start monitoring the monitor node itself! Installing the node exporter in a container is not recommended, so we’ll use the Copr repo and install with Yum.

sudo curl -Lo /etc/yum.repos.d/_copr_ibotty-prometheus-exporters.repo \
https://copr.fedorainfracloud.org/coprs/ibotty/prometheus-exporters/repo/epel-7/ibotty-prometheus-exporters-epel-7.repo

sudo yum install node_exporter
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

It should be listening on port 9100 and Prometheus should start getting metrics from http://localhost:9100/metrics automatically (we’ll see them later with Grafana).

Install InfluxDB

Influxdata provides a yum repository so installation is easy!

cat << \EOF | sudo tee /etc/yum.repos.d/influxdb.repo
[influxdb]
name=InfluxDB
baseurl=https://repos.influxdata.com/centos/$releasever/$basearch/stable
enabled=1
gpgcheck=1
gpgkey=https://repos.influxdata.com/influxdb.key
EOF
sudo yum install influxdb

The defaults are fine, other than enabling collectd support so that other clients can send metrics to InfluxDB. I’ll show you how to use this in another blog post soon.

sudo sed -i 's/^\[\[collectd\]\]/#\[\[collectd\]\]/' /etc/influxdb/influxdb.conf
cat << EOF | sudo tee -a /etc/influxdb/influxdb.conf
[[collectd]]
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  typesdb = "/usr/local/share/collectd"
  security-level = "none"
EOF

Restart InfluxDB (sudo systemctl restart influxdb) so the new collectd settings take effect. This should open a number of ports, including InfluxDB itself on TCP port 8086 and the collectd receiver on UDP port 25826.

sudo ss -ltunp |egrep "8086|25826"

Create InfluxDB collectd database

Finally, we need to connect to InfluxDB and create the collectd database. Just run the influx command.

influx

And at the prompt, create the database and exit.

CREATE DATABASE collectd
exit

Install Grafana

Grafana has a Yum repository so it’s also pretty trivial to install.

cat << EOF | sudo tee /etc/yum.repos.d/grafana.repo
[grafana]
name=Grafana
baseurl=https://packages.grafana.com/oss/rpm
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF
sudo yum install grafana

Grafana pretty much works out of the box and can be configured via the web interface, so simply start and enable it. The server listens on port 3000 and the default username is admin with password admin.

sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo ss -ltnp |grep 3000

Now you’re ready to log into Grafana!

Configuring Grafana

Browse to the IP of your monitoring host on port 3000 and log into Grafana.

Now we can add our two data sources. First, Prometheus, pointing to localhost on port 9090…

…and then InfluxDB, pointing to localhost on port 8086 and to the collectd database.

Adding a Grafana dashboard

Make sure both data sources test OK and we’re well on our way. Next we just need to create some dashboards, so let’s get a dashboard to show node exporter and we’ll hopefully at least see the monitoring host itself.

Go to Dashboards and hit import.

Type the number 1860 in the dashboard field and hit load.

This should automatically download and load the dash, all you need to do is select your Prometheus data source from the Prometheus drop down and hit Import!

Next you should see the dashboard with metrics from your monitor node.

So there you go, you’re on your way to monitoring all the things! For anything that supports collectd, you can forward metrics to UDP port 25826 on your monitor node. More on that later…

September 05, 2019

Why Computers Lie Badly At Alarming Speed and the unum Promise

The translation of arithmetic to physical hardware using the IEEE standard numerical representation is fraught with difficulty. As is well known by anyone who has used even a pocket calculator, computer processors are imprecise, with dangerous rounding errors which vary on different systems. Further, the standard representation method, IEEE 754 "Standard for Floating-Point Arithmetic" (1985, revised 2008), is extremely inefficient from an engineering perspective, with increasing physical cost when additional precision is sought.

The basic issue is the limitations in converting decimal or floating point notation into binary form. The IEEE standard suggests that when a calculation overflows the value +inf should be used instead, and when a number is too small the standard says to use 0 instead. Inserting infinity to represent "a very big number" or 0 to represent a "very small number" will certainly cause computational issues. Floating point operations have additional issues when employed in parallel, breaking the logic of associative properties. The equation (a + b) + (c + d) computed in parallel will not necessarily equal the equation ((a + b) + c) + d when run in serial.
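
Both problems are easy to reproduce with a few lines of Python, assuming ordinary 64-bit IEEE 754 doubles (the values below are chosen purely for illustration):

#!/usr/bin/python3
# Two consequences of IEEE 754 rounding.

# Decimal fractions cannot be represented exactly in binary:
print(0.1 + 0.2)              # prints 0.30000000000000004, not 0.3

# Addition is not associative, so regrouping a sum (as a parallel
# reduction would) can change the result:
a, b, c, d = 0.1, 0.2, 0.3, 0.3
pairwise = (a + b) + (c + d)  # the grouping a parallel sum might use
serial = ((a + b) + c) + d    # strict left-to-right evaluation
print(pairwise == serial)     # False: the two groupings round differently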

These issues have been known in computer science for some decades (Goldberg, 1991). In recent years an attempt has been made to reconstruct the physical implementation of arithmetic by providing a superset of IEEE's 754 standard and IEEE 1788, the Standard for Interval Arithmetic. This number format, the Unum (Gustafson, 2015), consists of a bit string of variable length with six sub-fields: a sign bit, exponent, fraction, uncertainty bit, exponent size, and fraction size. The uncertainty bit, or ubit, specifies whether or not there are additional bits after the fraction instead of rounding; in other words, a precise interval. This means that numbers that are close to zero or infinity are treated as such and are never represented as zero or infinity. To date, Unums have not been translated into hardware as they require more logic than floating-point numbers, but software implementations have been provided.

Why Computers Lie Badly At Alarming Speed and the unum Promise
Challenges in High Performance Computing Conference
2-6 Sept, 2019 Mathematical Sciences Institute, Australian National University
http://levlafayette.com/files/2019ChallHPC-unums.pdf

September 04, 2019

Install newer git from software collections and enable globally

Work on Linux almost always means git for me, but the version provided by CentOS and RHEL is too old. Software collections is a convenient way to get a newer version and enable it for everyone by default.

First, enable software collections (different for RHEL and CentOS).

# CentOS
sudo yum install centos-release-scl
# RHEL
sudo yum-config-manager --enable rhel-server-rhscl-7-rpms

Install the newer version of git you want (e.g. git 2.18).

sudo yum install rh-git218

Enable it for everyone for any new sessions.

cat << EOF | sudo tee /etc/profile.d/git-scl.sh
source scl_source enable rh-git218
EOF

Test with a new session.

git --version
$SHELL
git --version

September 02, 2019

Replacing a NixOS Service with an Upstream Version

NixOS Hydra Gears by Craige McWhirter

It's fairly well documented how to replace a NixOS service in the stable channel with one from the unstable channel.

What if you need to build from an upstream branch that's not in either of stable or unstable channels? This is how I go about it, including building a VM in which to test the result.

I specifically wanted to test the new hydra-notify service, so to test that, I need to replace the existing Hydra module in nixpkgs with the one from upstream source. Start by checking out the hydra source:

$ git clone https://github.com/NixOS/hydra.git

We can configure Nix to replace the nixpkgs version of Hydra with a build from hydra/master.

You can see a completed example in hydra_notify.nix but the key points are that we need to disable Hydra in the standard Nix packages:

  disabledModules = [ "services/continuous-integration/hydra/default.nix" ];

as well as import the module definition from the Hydra source we downloaded:

  imports =
    [
      "/path/to/source/hydra/hydra-module.nix"
    ];

and we need to switch services.hydra to services.hydra-dev in two locations:

  networking.firewall.allowedTCPPorts = [ config.services.hydra-dev.port 80 443 ];

  services.hydra-dev = {
    ...
  };

With these three changes, we have swapped out the Hydra in nixpkgs for one to be built from the upstream source in hydra_notify.nix.

Next we need to build a configuration for our VM that uses the replaced Hydra module declared in hydra_notify.nix. This is hydra_vm.nix, which is a simple NixOS configuration, which importantly includes our replaced Hydra module:

  imports =
    [
      ./hydra_notify.nix
    ];

To give this a run yourself, check out nixos-examples and change to the services/hydra_upstream directory:

$ git clone https://code.mcwhirter.io/craige/nixos-examples.git
$ cd  nixos-examples/services/hydra_upstream

After updating the path to Hydra's source, we can then build the VM with:

$ nix-build '<nixpkgs/nixos>' -A vm --arg configuration ./hydra_vm.nix

Before launching the VM, I like to make sure that it is provided with enough RAM and that both Hydra's web UI and SSH are available, by exporting the below QEMU options:

$ export QEMU_OPTS="-m 4192"
$ export QEMU_NET_OPTS="hostfwd=tcp::10443-:443,hostfwd=tcp::10022-:22"

So now we're ready to launch the VM:

./result/bin/run-hydra-notications-vm

Once it has booted, you should be able to ssh nixos@localhost -p 10022 and hit the Hydra web UI at localhost:10443.

Once you've logged into the VM you can run systemctl status hydra-notify to check that you're running upstream Hydra.

August 29, 2019

NixOS Appears to be Always Building From Source

NixOS Gears by Craige McWhirter

One of the things that NixOS and Hydra make easy is running your own custom cache of packages. A number of projects and companies make use of this.

A NixOS or Nix user can then make use of these caches by adding them to nix.conf for Nix users or /etc/nixos/configuration.nix for NixOS users.

What most people will want, is for their devices to have access to both caches.

If you add the new cache "incorrectly", you may suddenly find your device building almost everything from source, as I did.

The default /etc/nix/nix.conf for NixOS users has these default lines:

substituters = https://cache.nixos.org
...
trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=

Many projects running custom caches will advise NixOS users to add a stanza like this to /etc/nixos/configuration.nix:

{
  nix = {
    binaryCaches = [
      "https://cache.my-project.org/"
    ];
    binaryCachePublicKeys = [
      "cache.my-project.org:j/Kb+r+tGeM+4YZH+ECfTr+b4OFViKHaciuIOHw1/DP="
    ];
  };
}

If you add this stanza to your NixOS configuration, you will end up with a nix.conf that looks like this:

...
substituters = https://cache.my-project.org/
...
trusted-public-keys = cache.my-project.org:j/Kb+r+tGeM+4YZH+ECfTr+b4OFViKHaciuIOHw1/DP=
...

Which will result in your systems only pulling cached packages from that cache and building everything else that's missing.

If you want to take advantage of what a custom cache is providing but not lose the advantages of the primary NixOS cache, your stanza in configuration.nix needs to looks like this:

{
  nix = {
    binaryCaches = [
      "https://cache.nixos.org"
      "https://cache.my-project.org/"
    ];
    binaryCachePublicKeys = [
      "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
      "cache.my-project.org:j/Kb+r+tGeM+4YZH+ECfTr+b4OFViKHaciuIOHw1/DP="
    ];
  };
}

You will now get the benefit of both caches and your nix.conf will now look like:

...
substituters = https://cache.nixos.org https://cache.my-project.org/
...
trusted-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY= cache.my-project.org:j/Kb+r+tGeM+4YZH+ECfTr+b4OFViKHaciuIOHw1/DP=
...

The order in the configuration does not matter; I just feel more comfortable putting the cache I consider "primary" first. The order in which the caches are queried is determined by Nix, using the Priority field in the nix-cache-info file served by each Hydra cache:

$ curl https://cache.nixos.org/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 40
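
If you want to compare what each cache advertises, a small Python sketch like this will print the Priority line from each cache's nix-cache-info (the commented-out URL is a placeholder for your own cache):

#!/usr/bin/python3
# Print the Priority line from each binary cache's nix-cache-info.

from urllib.request import urlopen

caches = [
    "https://cache.nixos.org",
    # "https://cache.my-project.org",   # add your own cache here
]

for cache in caches:
    with urlopen(cache + "/nix-cache-info") as response:
        for line in response.read().decode().splitlines():
            if line.startswith("Priority:"):
                print(cache, line)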

If you were experiencing excessive building from source and your intention was to draw from two caches, this should resolve it for you.

August 27, 2019

International HPC Certification Forum and AU-NZ Contributions

High Performance Computing (HPC) is the most effective method to process increasingly large and complex datasets, making them increasingly critical for research organisations. Researchers wanting to use HPC resources often start with low levels of skills in using those systems. Despite this situation, educational programmes coming out of well-informed user needs analysis and/or a widely acknowledged set of required skills, capabilities and knowledge are rare. As a result, the training of researchers is typically left to individual HPC sites, such as the Pawsey Supercomputing Centre, National Compute Infrastructure Australia, and the University of Melbourne.

With different sites providing their own training with varied content and delivery there is a lack of consistency in skills and knowledge among HPC users, despite the fact that there is a recognised high level of homogeneity in HPC skills (e.g., UNIX-like environments, cluster architecture, job submission principles, parallel programming techniques).

One group trying to address this challenge on an international level is the International HPC Certification Forum ("the Forum"). The Forum was established by a global collection of individuals committed to identifying competency areas, skills and measurable outcomes per identified HPC user roles. The Forum plans to provide examination and certification of users in fine-grained competencies. The Forum has purposefully not taken ownership for training content, separating the definition of skills and certificates from education content and delivery, but allows for the option of delivery agents to be recognized as including examinable content. Australia has been involved from the start of this effort toward a global curriculum with two members of the governing Board.

Among Australian and New Zealand HPC educators and trainers there is a desire for collaborative development of course content; this is a rational allocation of scarce temporal and financial resources. This has generated ongoing interest in establishing collaboration among HPC educators to develop a programme suitable for Forum Certification. With a lead from the Pawsey Supercomputing Centre, the University of Melbourne, Adelaide University, and NCI, nationally coordinated HPC educators in Australia and New Zealand are developing a repository of knowledge for content, delivery, and assessment, with the objective of increasing regional research output.

This is a presentation initially given at the ARDC eResearch Skilled Workforce Summit, 29-30 July, 2019, in Sydney then in an expanded version at the HPC-AI Advisory Council, August 29, 2019, in Perth, and finally in a reduced version for a ARDC Tech-Talk, 6 Sept, 2019.

http://levlafayette.com/files/2019ARDCsummit.pdf

August 20, 2019

Installing Your First Hydra

NixOS Hydra Gears by Craige McWhirter

Hydra is a Nix-based continuous build system. My method for configuring a server to be a Hydra build server is to create a hydra.nix file like this:

# NixOps configuration for machines running Hydra

{ config, pkgs, lib, ... }:

{

  services.postfix = {
    enable = true;
    setSendmail = true;
  };

  services.postgresql = {
    enable = true;
    package = pkgs.postgresql;
    identMap =
      ''
        hydra-users hydra hydra
        hydra-users hydra-queue-runner hydra
        hydra-users hydra-www hydra
        hydra-users root postgres
        hydra-users postgres postgres
      '';
  };

  networking.firewall.allowedTCPPorts = [ config.services.hydra.port ];

  services.hydra = {
    enable = true;
    useSubstitutes = true;
    hydraURL = "https://my.website.org";
    notificationSender = "my.website.org";
    buildMachinesFiles = [];
    extraConfig = ''
      store_uri = file:///var/lib/hydra/cache?secret-key=/etc/nix/my.website.org/secret
      binary_cache_secret_key_file = /etc/nix/my.website.org/secret
      binary_cache_dir = /var/lib/hydra/cache
    '';
  };

  services.nginx = {
    enable = true;
    recommendedProxySettings = true;
    virtualHosts."my.website.org" = {
      forceSSL = true;
      enableACME = true;
      locations."/".proxyPass = "http://localhost:3000";
    };
  };

  security.acme.certs = {
      "my.website.org".email = "my.email@my.website.org";
  };

  systemd.services.hydra-manual-setup = {
    description = "Create Admin User for Hydra";
    serviceConfig.Type = "oneshot";
    serviceConfig.RemainAfterExit = true;
    wantedBy = [ "multi-user.target" ];
    requires = [ "hydra-init.service" ];
    after = [ "hydra-init.service" ];
    environment = builtins.removeAttrs (config.systemd.services.hydra-init.environment) ["PATH"];
    script = ''
      if [ ! -e ~hydra/.setup-is-complete ]; then
        # create signing keys
        /run/current-system/sw/bin/install -d -m 551 /etc/nix/my.website.org
        /run/current-system/sw/bin/nix-store --generate-binary-cache-key my.website.org /etc/nix/my.website.org/secret /etc/nix/my.website.org/public
        /run/current-system/sw/bin/chown -R hydra:hydra /etc/nix/my.website.org
        /run/current-system/sw/bin/chmod 440 /etc/nix/my.website.org/secret
        /run/current-system/sw/bin/chmod 444 /etc/nix/my.website.org/public
        # create cache
        /run/current-system/sw/bin/install -d -m 755 /var/lib/hydra/cache
        /run/current-system/sw/bin/chown -R hydra-queue-runner:hydra /var/lib/hydra/cache
        # done
        touch ~hydra/.setup-is-complete
      fi
    '';
  };
  nix.trustedUsers = ["hydra" "hydra-evaluator" "hydra-queue-runner"];
  nix.buildMachines = [
    {
      hostName = "localhost";
      systems = [ "x86_64-linux" "i686-linux" ];
      maxJobs = 6;
      # for building VirtualBox VMs as build artifacts, you might need other
      # features depending on what you are doing
      supportedFeatures = [ ];
    }
  ];
}

From there it can be imported in your configuration.nix or NixOps files like this:

{ config, pkgs, ... }:

{

  imports =
    [
      ./hydra.nix
    ];

...
}

To deploy hydra, you will then need to either run nixos-rebuild switch on the server or use nixops deploy -d my.network.

The result of this deployment, via NixOps can be seen at hydra.mcwhirter.io.

August 18, 2019

LUV September 2019 Main Meeting

Sep 3 2019 19:00
Sep 3 2019 21:00
Location: 
Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

NOTE: The library closes at 7pm so arrivals after that time will need to contact Andrew on (0421) 775 358 or any other attendee for admission.

Speakers:

  • To be announced

 

Many of us like to go for dinner nearby after the meeting, typically at Brunetti's or Trotters Bistro in Lygon St.  Please let us know if you'd like to join us!

Linux Users of Victoria is a subcommittee of Linux Australia.


August 16, 2019

Passwordless restricted guest account on Ubuntu

Here's how I created a restricted but not ephemeral guest account on an Ubuntu 18.04 desktop computer that can be used without a password.

Create a user that can login without a password

First of all, I created a new user with a random password (using pwgen -s 64):

adduser guest

Then following these instructions, I created a new group and added the user to it:

addgroup nopasswdlogin
adduser guest nopasswdlogin

In order to let that user login using GDM without a password, I added the following to the top of /etc/pam.d/gdm-password:

auth    sufficient      pam_succeed_if.so user ingroup nopasswdlogin

Note that this user is unable to ssh into this machine since it's not part of the sshuser group I have setup in my sshd configuration.

Privacy settings

In order to reduce the amount of digital traces left between guest sessions, I logged into the account using a GNOME session and then opened gnome-control-center. I set the following in the privacy section:

Then I replaced Firefox with Brave in the sidebar, set it as the default browser in gnome-control-center:

and configured it to clear everything on exit:

Create a password-less system keyring

In order to suppress prompts to unlock gnome-keyring, I opened seahorse and deleted the default keyring.

Then I started Brave, which prompted me to create a new keyring so that it can save the contents of its password manager securely. I set an empty password on that new keyring, since I'm not going to be using it.

I also made sure to disable saving of passwords, payment methods and addresses in the browser too.

Restrict user account further

Finally, taking an idea from this similar solution, I prevented the user from making any system-wide changes by putting the following in /etc/polkit-1/localauthority/50-local.d/10-guest-policy.pkla:

[guest-policy]
Identity=unix-user:guest
Action=*
ResultAny=no
ResultInactive=no
ResultActive=no

If you know of any other restrictions that could be added, please leave a comment!

August 15, 2019

Setting Up Wireless Networking with NixOS

NixOS Gears by Craige McWhirter

The current NixOS Manual is a little sparse on details for different options to configure wireless networking. The version in master is a little better but still ambiguous. I've made a pull request to resolve this but in the interim, this documents how to configure a number of wireless scenarios with NixOS.

If you're going to use NetworkManager, this is not for you. This is for those of us who want reproducible configurations.

To enable a wireless connection with no spaces or special characters in the name that uses a pre-shared key, you first need to generate the raw PSK:

$ wpa_passphrase exampleSSID abcd1234
network={
        ssid="exampleSSID"
        #psk="abcd1234"
        psk=46c25aa68ccb90945621c1f1adbe93683f884f5f31c6e2d524eb6b446642762d
}

Now you can add the following stanza to your configuration.nix to enable wireless networking and this specific wireless connection:

networking.wireless = {
  enable = true;
  userControlled.enable = true;
  networks = {
    exampleSSID = {
      pskRaw = "46c25aa68ccb90945621c1f1adbe93683f884f5f31c6e2d524eb6b446642762d";
    };
  };
};

If you had another WiFi connection that had spaces and/or special characters in the name, you would configure it like this:

networking.wireless = {
  enable = true;
  userControlled.enable = true;
  networks = {
    "example's SSID" = {
      pskRaw = "46c25aa68ccb90945621c1f1adbe93683f884f5f31c6e2d524eb6b446642762d";
    };
  };
};

If you need to connect to a hidden network, you would do it like this:

networking.wireless = {
  enable = true;
  userControlled.enable = true;
  networks = {
    myHiddenSSID = {
      hidden = true;
      pskRaw = "46c25aa68ccb90945621c1f1adbe93683f884f5f31c6e2d524eb6b446642762d";
    };
  };
};

The final scenario that I have, is connecting to open SSIDs that have some kind of secondary method (like a login in web page) for authentication of connections:

networking.wireless = {
  enable = true;
  userControlled.enable = true;
  networks = {
    FreeWiFi = {};
  };
};

This is all fairly straightforward but it was non-trivial to find the answers to.

August 14, 2019

An animated GIF resume


The graphic designer at work and I were talking, and I challenged him to come up with a resume as an animated GIF. This is where he landed…

I think it's quite clever. Need a graphic designer or video team? Consider onefishsea.


August 11, 2019

Audiobooks – July 2019

The Return of the King by J.R.R Tolkien. Narrated by Rob Inglis. Excellent although I should probably listen slower next time. 10/10

Why Superman Doesn’t Take Over the World: What Superheroes Can Tell Us About Economics by J. Brian O’Roark

A good idea for a theme but the author didn’t quite nail it. Further let down in audiobook format when the narrator talked to invisible diagrams. 6/10

A Fabulous Creation: How the LP Saved Our Lives by David Hepworth

Covers the years 1967 (Sgt Peppers) to 1982 (Thriller) when the LP dominated music. Lots of information all delivered in the author’s great style. 8/10

The Front Runner by Matt Bai

Nominally a story about the downfall of Democratic presidential front-runner Gary Hart in 1987. Much of the book is devoted to how norms of political coverage changed at that moment due to changes in technology & culture. 8/10

A race like no other: 26.2 Miles Through the Streets of New York by Liz Robbins

Covering the 2007 New York marathon it follows the race with several top & amateur racers. Lots of digressions into the history of the race and the runners. Worked well 8/10

1983: Reagan, Andropov, and a World on the Brink by Taylor Downing

An account of how escalations in the Cold War in 1983 nearly led to nuclear war, with the Americans largely being unaware of the danger. Superb 9/10


The High cost of Free Parking (2011 edition) by Donald Shoup.

One of the must-read books in the field although not a revelation for today’s readers. Found it a little repetitive (23 hours) and talking to diagrams and equations doesn’t work in audiobook format. 6/10




August 08, 2019

The wonderful world of machine learning automated lego sorting


Inspired by Alastair D’Silva‘s cunning plans for world domination, I’ve been googling around for automated lego sorting systems recently. This seems like a nice tractable machine learning problem with some robotics thrown in for fun.

Some cool projects if you’re that way inclined:

This sounds like a great way to misspend some evenings to me…


August 03, 2019

LUV August 2019 Workshop: Drupal

Aug 17 2019 12:30
Aug 17 2019 16:30
Location: 
Infoxchange, 33 Elizabeth St. Richmond

Alexar Pendashtash will run a workshop about Drupal, a web application framework he has extensively used throughout the past ten years for a variety of applications.

You will not need any prior knowledge of programming or web development to benefit from the talk.

Alexar is a technologist and a social entrepreneur.

The meeting will be held at Infoxchange, 33 Elizabeth St. Richmond 3121.  Late arrivals please call (0421) 775 358 for access to the venue.

LUV would like to acknowledge Infoxchange for the venue.

Linux Users of Victoria is a subcommittee of Linux Australia.


LUV August 2019 Main Meeting: Open Source Hardware and Software for Assistive Technologies

Aug 6 2019 19:00
Aug 6 2019 21:00
Location: 
Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

NOTE: The library closes at 7pm so arrivals after that time will need to contact Andrew on (0421) 775 358 or any other attendee for admission.

Speakers:

  • Jonathan Oxer: Open Source Hardware and Software for Assistive Technologies

 

Open Source Hardware and Software for Assistive Technologies

Learn how Open Source hardware and software can be used in a wide variety of assistive technologies, including wheelchair control, environmental systems, robotics, communication, prosthetics, games, and telepresence.

Many of us like to go for dinner nearby after the meeting, typically at Brunetti's or Trotters Bistro in Lygon St.  Please let us know if you'd like to join us!

Linux Users of Victoria is a subcommittee of Linux Australia.


What is the Spotify model for Agile?


The other day someone said to me  that “they use the Spotify development model”, and I said “you who the what now?”. It was a super productive conversation that I am quite proud of.

So… in order to look like less of a n00b in the next conversation, what is the “Spotify development model”? Well, it turns out that Spotify came up with a series of tweaks to the standard Agile process in order to scale their engineering teams. If you google for “spotify development model” or “spotify agile” you’ll get lots and lots of third party blog posts about what Spotify did (I guess a bit like this one), but it’s surprisingly hard to find primary sources. The best I’ve found so far is this Quora answer from a former VP of Engineering at Spotify, although some of the resources he links to no longer exist.

Here’s a quick summary though:

Squads: the basic unit of a development team. Squads sit together, and have all of the tools and skills to release a feature to production. Squads self-organize and can choose in what way they work. For example, some might use Scrum, while others might choose Kanban. Squads have a long term mission, but it’s usually a shared belief — “we will improve the search interface” for example. Squads don’t have a designated leader, but they do have a Product Owner.

An aside, squad interdependencies: Squads are meant to feel like independent small startups, so the dependencies between Squads are closely managed. You can’t completely eliminate dependencies if you’re going to build big things, but you can actively track those dependencies and make sure that they’re consciously managed.

Product Owners: each Squad has a Product Owner, who is responsible for prioritising the work of the Squad. The Product Owner also keeps in touch with other Product Owners in associated areas to ensure that a coherent overall product is being built.

Tribes: a Tribe is a set of Squads which work in related areas. So for example it might be all of the Squads which work on the mobile client, or all of the Squads dealing with backend infrastructure. Tribes are usually co-located (the same building), and are kept to less than 100 members to ensure that everyone knows each other. Tribes hold regular gatherings where they show off what they are working on, what they have delivered, and what others can learn from them.

Chapters: if you’re the only operations person in your Squad, then that can be an isolating experience and stops you from learning from other operations people’s experiences. To solve this, people with similar skills are also grouped into an overlay called a Chapter. Chapters meet regularly like Tribes, but Chapters are also where you get your people management from — the Chapter Lead is your people manager and does all the usual HR stuff to ensure your career needs are being met. At Spotify Chapters are a sub-unit of a Tribe, you don’t have Chapters which cross Tribe boundaries.

Guilds: Guilds are similar to Chapters, but aren’t skills based. They’re people with common interests, and can cross Tribe boundaries. An example of a Guild might be all your web developers, or all your Prometheus aficionados.


July 29, 2019

Quick hack: extracting the contents of a Docker image to disk


For various reasons, I wanted to inspect the contents of a Docker image without starting a container. Docker makes it easy to get an image as a tar file, like this:

docker save -o foo.tar image

But if you extract that tar file you’ll find a configuration file and manifest as JSON files, and then a series of tar files, one per image layer. You use the manifest to determine in what order you extract the tar files to build the container filesystem.

That’s fiddly and annoying. So I wrote this quick python hack to extract an image tarball into a directory on disk that I could inspect:

#!/usr/bin/python3

# Call me like this:
#  docker-image-extract tarfile.tar extracted

import tarfile
import json
import os
import sys

image_path = sys.argv[1]
extracted_path = sys.argv[2]

image = tarfile.open(image_path)
manifest = json.loads(image.extractfile('manifest.json').read())

for layer in manifest[0]['Layers']:
    print('Found layer: %s' % layer)
    layer_tar = tarfile.open(fileobj=image.extractfile(layer))

    for tarinfo in layer_tar:
        print('  ... %s' % tarinfo.name)
        if tarinfo.isdev():
            print('  --> skip device files')
            continue

        dest = os.path.join(extracted_path, tarinfo.name)
        if not tarinfo.isdir() and os.path.exists(dest):
            print('  --> remove old version of file')
            os.unlink(dest)

        layer_tar.extract(tarinfo, path=extracted_path)

Hopefully that’s useful to someone else (or future me).


July 23, 2019

Codec 2 700C Equaliser Part 2

This post extends the work of Part 1. I’ve developed a new Equaliser (EQ) algorithm, and ported it to C.

EQ in front of VQ

After several days of experimentation I moved the EQ in front of the VQ. The algorithm used in Part 1 included the VQ in the loop (top), however I found this had problems removing large amounts of low frequency energy. I’m not sure why – with the VQ in the loop the operation is complex.

The new algorithm (bottom) is rather simple; we compare the average spectrum of the input speech to an ideal “mask”. The difference between the two gives the equaliser weights. However, like many DSP algorithms, it took several days of careful experimentation, trial and error, listening tests, and many backwards steps to get the results I wanted. I coded and tested a bunch of candidate algorithms in vq_700c_eq.m

The input spectrum is averaged using an IIR filter, so it takes about a second to adapt.
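
Here’s a rough sketch of that idea in Python; the band count, mask and averaging constant are made up for illustration, and this is not the actual code from vq_700c_eq.m:

#!/usr/bin/python3
# Sketch of the equaliser idea: IIR-average the per-band spectrum of the
# incoming frames, and treat any energy above an ideal "mask" as excess
# to be subtracted before vector quantisation.

import numpy as np

K = 20                    # number of spectral bands per frame (assumed)
beta = 0.05               # IIR averaging constant: small beta adapts slowly
mask = np.zeros(K)        # ideal target spectrum in dB (flat for simplicity)
avg = np.zeros(K)         # running average of the input spectrum in dB

def equalise(frame_dB):
    """Equalise one frame of band energies (in dB) before quantisation."""
    global avg
    avg = (1 - beta) * avg + beta * frame_dB   # single-pole IIR average
    weights = np.maximum(avg - mask, 0.0)      # only remove excess energy
    return frame_dB - weights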

This plot shows the EQ weights ramping up over a few frames (one frame is 40ms). It’s effectively a map of the excess spectral energy – the stuff we want to remove before the VQ:

This sample has a lot of excess LF energy. The HF spike is due to the shaping of the VQ – it expects all samples to have low energy in the last bin. This plot shows average values over an entire sample:

Dark blue is the input speech (target), and cyan the input speech after equalisation. Green is the “ideal” mask we make the EQ shoot for.

Objective Results

The following table is similar to the table in Part 1. It presents results for stage 1 (vq1) and stage 2 (vq2) of the two stage vector quantisation process. eq1 is the algorithm from Part 1, which I used as my starting point. eq2 is the latest algorithm for real time implementation, which does a comparable job. What really matters are the last three columns – the output error after quantisation for the current 700C VQ (vq2), Part 1 eq (vq2_eq1), and latest algorithm (vq_eq2).

Results Table

I placed the results in a separate text file as I had trouble getting them to display neatly in WordPress.

The first two samples are “contrived”: I took “cq_freedv_8k” and using Audacity added 12dB gain between 0 and 500Hz to get cq_freedv_8k_lfboost. I set the high frequency cut off at 3500 Hz to get cq_freedv_8k_hfcut. This was really useful – in listening tests lfboost broke the EQ design from Part 1 – it was ineffective at cleaning up the sample! Note the obj measure for this sample (vq2_eq1 column) is still OK – one reason why we back up objective results with listening tests.

Subjective Results

Sample Codec 2 700C Codec 2 700C + EQ
cq_ref Listen Listen
kristoff Listen Listen
cq_freedv_8k Listen Listen
cq_freedv_8k_lfboost Listen Listen

I listen to these samples with a small set of loudspeakers that have some low frequency response. The last sample is an extreme case. That cq_freedv_8k sample already has quite a lot of low frequency, and we have added 12dB extra. The equalised sample is much improved, but not quite as good as the original “clean” cq_freedv_8k sample. The quality improves over the first few seconds as the EQ adapts.

Discussion

An EQ has been developed, tested over a range of samples, and ported to C. Tests have been written to ensure the C port matches the Octave version, and it’s ready to try in the real world on x86 and embedded (stm32) platforms.

I’m quite pleased with this work. I’ve resolved a long standing personal mystery – why some samples/speakers code well and others don’t. I’ve made Codec 2 700C more robust to a range of different microphone inputs. This will mean an incremental improvement in the average speech quality of on-air modes like FreeDV 700C and 700D. I’ve also learned a lot about VQ, which can be applied to new codec designs.

Lessons learned:

  1. This work has reinforced how important accurate representation of the low frequencies is for speech.
  2. Using a contrived sample is a good technique, it gives you a known error to test against.
  3. This was a good exercise in learning the strengths and weaknesses of VQ.
  4. VQs are fussy about what you throw at them.
  5. Variance is a really useful objective measure.
  6. Use a decent training database (from Part 1 where I trained a new VQ).

The contrived cq_freedv_8k examples have a very big variance (column 1 of the results table) before quantisation – this suggests “more information to quantise”. However they started as the same sample, and just had their frequency response tweaked in a way that did not affect the perceptual information they contain at all. This suggests equalisation might be a neat way to lower the bit rate. We may be able to find a transformation in the spectrum that sounds more or less the same to the human ear, but is easier (requires less bits) to encode.

Makes me wonder – how many bits of other codecs are spent coding non-essential information like the arbitrary frequency responses thrown at them?

You can play around with EQ yourself

$ ./c2enc 700C ../../raw/cq_ref.raw - --eq --var | ./c2dec 700C - - | aplay -f S16_LE

Links

Codec 2 700C Equaliser
Codec 2 700C
Codec 2 Pull Request for this work

Mastermind in JavaScript


I’ve been learning JavaScript for the last few days, and I figured I’d implement Jacqui’s favourite board game as a learning exercise. Jacqui loves a simple colour guessing game called Mastermind. In the game someone picks four coloured pins and then the player has to progressively guess what those colours are.

In my JavaScript version the computer picks four colours, and you need to work out what they are. Click on the white squares to cycle through colours and then hit the “guess” button when you’re ready to see how many you got right. The gray boxes in the top row will progressively reveal their colours as you guess them.
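
The scoring step is simple enough to sketch in a few lines of Python (the real game is written in JavaScript; the colour names and helper below are made up for illustration):

#!/usr/bin/python3
# Sketch of the Mastermind scoring step: the computer picks four colours
# and each guess is scored by how many positions match exactly.

import random

COLOURS = ["red", "green", "blue", "yellow", "purple", "orange"]
secret = [random.choice(COLOURS) for _ in range(4)]

def score(guess):
    """Return the number of positions where the guess matches the secret."""
    return sum(1 for g, s in zip(guess, secret) if g == s)

print(score(["red", "green", "blue", "yellow"]))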

The code is here, and the game can be played here.


July 22, 2019

Generating an sha256 Hash for Nix Packages

NixOS Gears by Craige McWhirter

Let's say that you're working to replicate the PureOS environment for the Librem 5 phone so that you can run NixOS on it instead and need to package calls. Perhaps you just want to use Nix to package something else that isn't packaged yet.

When you start digging into Nix packaging, you'll start to see stanzas like this one:

src = fetchFromGitLab {
  domain = "source.puri.sm";
  owner = "Librem5";
  repo = pname;
  rev = "v${version}";
  sha256 = "1702hbdqhfpgw0c4vj2ag08vgl83byiryrbngbq11b9azmj3jhzs";
};

It's fairly self explanatory and merely a breakdown of a URL into its component parts so that they can be reused elsewhere in the packaging system. It was the generation of the sha256 hash that stumped me the most.

I wasn't able to guess how it was generated, nor was I able to find clear instructions in the otherwise pretty thorough Nix documentation.

Putting clues together from a variety of other blog posts, this is how I eventually came to generate the correct sha256 hash for Nix packages:

Using the above hash for libhandy, I was able to test the method I'd come up with, using nix-prefetch-url to download the tagged version and provide an sha256 hash which I could compare to one in the existing libhandy default.nix file:

$ nix-prefetch-url --unpack https://source.puri.sm/Librem5/libhandy/-/archive/v0.0.10/libhandy-v0.0.10.tar.gz
unpacking...
[0.3 MiB DL]
path is
'/nix/store/58i61w34hx06gcdaf1x0gwi081qk54km-libhandy-v0.0.10.tar.gz'
1702hbdqhfpgw0c4vj2ag08vgl83byiryrbngbq11b9azmj3jhzs

Lo and behold, I have matching sha256 hashes. As I'm wanting to create a package for "calls", I can now safely do the same against its repository on the way to crafting a Nix file for that:

$ nix-prefetch-url --unpack https://source.puri.sm/Librem5/calls/-/archive/v0.0.1/calls-v0.0.1.tar.gz
unpacking...
[0.1 MiB DL]
path is '/nix/store/3c7aifgmf90d7s60ph5lla2qp4kzarb8-calls-v0.0.1.tar.gz'
0qjgajrq3kbml3zrwwzl23jbj6y62ccjakp667jq57jbs8af77pq

That sha256 hash is what I'll drop into my nix file for "calls":

src = fetchFromGitLab {
  domain = "source.puri.sm";
  owner = "Librem5";
  repo = pname;
  rev = "v${version}";
  sha256 = "0qjgajrq3kbml3zrwwzl23jbj6y62ccjakp667jq57jbs8af77pq";
};

Now we have an sha256 hash that can be used by Nix to verify source downloads before building.

July 20, 2019

How Niantic is Killing Ingress

For the past several years, I've been an active player of Ingress, a game where two competing factions play a sort of "capture-the-flag" of public locations of note using an augmentation of Google maps. The game, the precursor to Pokémon Go, and Harry Potter: Wizards Unite, has had its fair share of issues over the past six years. But on July 19, 2019, a death knell was sounded by the very company that produces the game; they forced players (albeit temporarily) to adopt the new interface, Ingress Prime, which is passionately hated by the overwhelming majority of the game's players, and for good reason. The interface is a radical change to the old version, has distracting effects, issues with visual accessibility, and is cumbersome to use. These issues have been raised by the player community for months now, but have largely fallen on deaf ears. Why is this? Why would a game company be so inattentive to the player base?

It is perhaps not so well known, but Niantic started off as a Google project, working on the Field Trip algorithm, which would push information to users on what the algorithm thinks you might be interested in, with integration into Google Glass. There's a fascinating unlisted video on Youtube, with an astounding 22 million views, where you basically witness, in all of two and a half minutes, how a person is turned into a thoughtless robot, the ideal consumer. Of course, such an algorithm can't make such decisions randomly; it has to know where a person goes, what their habits are and so forth. Trying to find out this information by surveys and the like would be onerous to the extreme; but Google Location Services can provide that data, and players will willingly give up such privacy for the entertainment of an Augmented Reality game, whether it is Ingress, Pokémon Go, or Harry Potter: Wizards Unite.

OK, six years on one might think "big deal", as various forms of software spying are so ubiquitous that only a few seem to care anymore. But in 2013 it was still an item of some public debate. Certainly, Ingress et al. also made money from select advertising and merchandise, but that's hardly enough to cover costs. The real business, however, is the Niantic Real World Platform, which uses the Augmented Reality front end to drive a commercial back-end, with Point of Interest data collection, social analytics and CRM; the talk from Phil Kreslin at the Augmented World Expo 2018 provides a lot of the details. The games are the (often enjoyable) entry point for the data collection.

How does this relate to game development? In a nutshell, if a game isn't generating interest, it isn't collecting data, and if it isn't collecting data it's not helping Niantic's core business. Ingress was the first of its kind - it was popular for a couple of years, and still has a smaller, but dedicated, fan-base, many of whom have it as part of their daily lives. There were some issues, such as the fact it was in beta for a long period of time, with significant changes in play, leading it to be nicknamed "Calvinball". To enhance the data collection activities, one of the moves was to allow players to submit favourite hotels, cafes, etc. as portals. It quickly got out of hand of course, and Niantic had to pull back on what constituted a Point of Interest, but a lot of damage had been done. Today there are swathes of empty or barely deployed portals as a result.

Shortly afterward Pokémon Go came on to the scene, and Niantic must have thought they struck gold. With a fancier interface and simple gameplay it was enormously popular, recently claiming 150 million active players and even setting five Guinness World Records. By comparison, Ingress probably has less than 10% of that figure; one would have to dig into the Niantic data, and they're not exactly making this information publicly available. The point, of course, is that Pokémon Go was huge and they're hoping to replicate the same success with Harry Potter: Wizards Unite.

It seems to me, however, that Niantic has completely misread the reasons for their success with Pokémon Go. They think that Pokémon Go's success relative to Ingress is due to the user interface, and whilst that has a degree of truth, the reality is that Pokémon fans are deep fans. By way of analogy, after its first season Star Trek was going to be dropped due to low ratings. However, those who were viewers were deep viewers, and their campaign to keep it on led to its lasting success. Pokémon fans are a bit like this; it's the world's largest media franchise, for reasons I do not even begin to understand (my interest in Pokémon is very close to zero). To summarise, Pokémon Go succeeded because it provided an Ingress-like AR game in an extremely popular setting. Ingress Prime fails because it ignores the existing player community and believes that a resource-intensive Pokémon-style interface will improve the game's popularity. They have their market analysis absolutely upside-down. Which will make for an interesting case study in the future, I suppose.

With this in mind, can Ingress be saved? In theory yes, if Niantic concentrates on gameplay rather than massive changes to the UI. An interesting game generates new players even with an old UI (consider the continuing and lasting popularity of tabletop roleplaying games). The following is a list of suggestions which do not involve any radical changes to the core principles of the game:

  • Tweak the level and badge gains so that they fit some sort of mathematical progression. I am at a complete loss to understand why this wasn't done in the first instance, by programmers, for goodness sake. Progression on the AP-level table looks like it was designed by a drunk.
  • Allow players to increase levels above 16. The cap is absolutely unnecessary, and any sort of geometric progression will ensure that the game is still accessible to new players.
  • Likewise allow devices to increase beyond level 8. Again the cap is unnecessary, and leads to a situation where existing players who have peaked play out of habit rather than for achievement.
  • Restrict portals that are higher than level 8 to genuine points of notable public interest. My favourite Indian restaurant, as charming as it is, is not a wonder of the world like the Eiffel Tower or the Pyramid of Giza.
  • Keep the familiar interface of Scanner Redacted beyond the end-of-September end-date and open it up to new players; at the moment only Agents who made accounts using the original 1.0 Scanner can access Scanner Redacted.

However, I have next to zero confidence that Niantic will introduce any of these changes, even though they would improve Niantic's core business of data collection. Which will mean that after many years of game-play, Ingress will simply die with an (overwhelmingly graphic UI) whimper. Maybe then we'll be lucky and Niantic will release the original source code and the community will be able to build something from the cadaver. But I doubt it.

July 18, 2019

SM1000 V2 Firmware

After a year of development we have released version 2 of the SM1000 firmware. Major new features include:

  • FreeDV 700D (thanks Don W7DMR)
  • A Morse menu system (thanks Stuart VK4MSL)
  • The SM1000 Manual which tells you how to upgrade to the new firmware

I’d also like to thank Steve (K5OKC) for his fine work on the OFDM modem; Danilo (DB4PLE) and Richard (KF5OIM) for their work on the build system, GitHub/Travis integration and testing; and Walter (K5WH) and George (AC6RB) for helping us sort out the Windows flashing procedure.

The software engineering side of Codec 2 has come a long way in 2019 – we now have about 50 automated tests and a very nice build system for the x86 and embedded stm32 ports. About half the tests run on a stm32 Discovery card and ensure the stm32 port keeps running as we push the x86 side of the project forward.

The SM1000 hardware was developed by myself and Rick Barnich KA8BMA a few years ago. It is being manufactured, tested and shipped by our good friend Edwin at Dragino in Shenzhen, China.

The FreeDV 700D port includes some sophisticated signal processing:

  • The Codec 2 700C speech codec.
  • The OFDM modem developed by Steve and me, optimised for HF radio.
  • State of the art LDPC forward error correction software designed by Bill, VK5DSP.

Largely through Don’s effort, all of that is running on a tiny 168MHz micro-controller in just 192k of RAM, with results identical to the x86 version.

I’ve really enjoyed working with the team on this project, in particular through GitHub. I like GitHub more than Git – which I still get into trouble with occasionally.

The stm32 side of the Codec 2 project is much improved with a shiny new build system, up to date documentation, and we have the automated tests to ensure we keep it that way. Thanks guys!

Here’s my SM1000 on the bench during development. I’ve hooked up a serial port to print debug messages, and am using laptops to feed it with FreeDV signals:

Links
SM1000 page and store
SM1000 Manual
Porting a LDPC Decoder to a STM32 Microcontroller Don tells us how he squeezed sophisticated LDPC error correction into a microcontroller.
Pull Request where the integration work was done for the V2 release.
2019 Codec2 and FreeDV Update
FreeDV 700D – a description of this new mode.
SM1000 Development – scan the archives around 2014/2015 for many posts on SM1000 hardware development.

July 16, 2019

OFDM Frequency Acquisition Revisited

I’ve had some anecdotal reports (from the UK and VK5) and a sample suggesting 700D sync is very slow (around 10s) on some fading HF channels, with problems even at high SNRs. This was puzzling to me, as I carefully developed the 700D modem against models of fast fading (1 Hz Doppler bandwidth) HF channels.

So I took a look at the samples, and revisited the OFDM modem acquisition simulations. I ran a bunch of tests using the core timing and freq offset estimators, for example:

octave:12> ofdm_dev; acquisition_test(Ntests=100, EbNodB=10, foff_hz=0, hf_en=2,verbose=1);

I added a slow fading channel option (hf=2). Turns out slow fading (e.g. 0.1Hz Doppler) leads to relatively “stationary” notches in the spectrum that confuse the estimator. The net result is many seconds with a poor frequency offset and hence no sync. Curiously, it’s less of a problem on fast fading channels. The channel conditions change so fast that we rapidly get to a channel state that the estimator likes and sync proceeds.

Here is a spectrogram of a slow fading channel, note the deep notch that appears around 10 seconds. It’s chomping out a big chunk of the signal and slowly moves across the signal for a few seconds. Our challenge is to estimate the frequency offset with half the signal missing!

Beneath that is a plot of frequency offset estimate for the same channel. It’s meant to be about zero, but during the slow, frequency selective fade it gets stuck at around -3Hz for several seconds.


Once I could reproduce the problem I worked out a few improvements, and added some ctests to make sure we trap any similar problems in future. The steps I took are described in the notes of this Codec 2 Pull Request. The major changes are:

  1. A new frequency offset estimator.
  2. This still has some residual frequency offset, so I added a two-speed phase estimator to rapidly track out any residual fine frequency errors soon after sync.
  3. Increased the frequency offset estimation range to +/-60Hz, to make tuning easier for FreeDV 700D and FreeDV 2020.

The following two plots measure the probability of getting a valid frequency offset estimate (on each frame) for various Eb/No (previous algorithm, and new improved algorithm):


Steve (K5OKC) and I worked on optimising the acquisition functions, to ensure they would run in real time on the stm32 for the upcoming 700D port. I’m happy with the result: improved sync, wider frequency range, some automated tests to trap similar issues in future, and running in real time even on the stm32. Nice that we can incrementally improve this modem. It’s available now if building codec2 and freedv-gui from source.
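For reference, a from-source build of codec2 on Linux looks roughly like the following (a sketch only – the repository location and the build directory name are my assumptions, so check the project README for the current instructions):

git clone https://github.com/drowe67/codec2.git
cd codec2
mkdir build_linux && cd build_linux
cmake ..    # configure the build
make        # builds the library, modem and FreeDV command line tools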

Acquisition is hard. Much harder than many other parts of modem design. For my next modem (or FreeDV waveform) I’ll get acquisition right first, then design the rest of the waveform (e.g. number of FEC/payload bits, carriers) around that.

Running the Acquisition Simulations

Writing some notes to myself so I remember how to do this next time (and there’s always a next time with acquisition!).

Generate a bunch of curves measuring acquisition performance using simulated HF and AWGN channels:

octave:12> ofdm_dev; acquistion_curves

Test complete sync (core estimators integrated with state machine) at a certain time offset:

octave:12> ofdm_ldpc_rx("~/Desktop/700D/vk2tpm_004.wav", "700D", 1, "", 3, 5)

Test complete sync at a range of time offsets to gather stats, e.g. mean and variance of sync time. I used real, off-air signals:

octave:12> ofdm_time_sync("~/Desktop/700D/vk2tpm_004.wav", 30)
octave:13> ofdm_time_sync("~/Desktop/mike/websdr_recording_2019-04-21T09_11_56Z_3625.0kHz.wav", 30)

Links

FreeDV 700D Part 4 – Acquisition My last lap around acquisition.

July 14, 2019

Installing Debian buster on a GnuBee PC 2

Here is how I installed Debian 10 / buster on my GnuBee Personal Cloud 2, a free hardware device designed as a network file server / NAS.

Flashing the LibreCMC firmware with Debian support

Before we can install Debian, we need a firmware that includes all of the necessary tools.

On another machine, do the following:

  1. Download the latest librecmc-ramips-mt7621-gb-pc1-squashfs-sysupgrade_*.bin.
  2. Mount a vfat-formatted USB stick.
  3. Copy the file onto it and rename it to gnubee.bin.
  4. Unmount the USB stick

Then plug a network cable between your laptop and the black network port and plug the USB stick into the GnuBee before rebooting the GnuBee via ssh:

ssh 192.168.10.0
reboot

If you have a USB serial cable, you can use it to monitor the flashing process:

screen /dev/ttyUSB0 57600

otherwise keep an eye on the LEDs and wait until they are fully done flashing.

Getting ssh access to LibreCMC

Once the firmware has been updated, turn off the GnuBee manually using the power switch and turn it back on.

Now enable SSH access via the built-in LibreCMC firmware:

  1. Plug a network cable between your laptop and the black network port.
  2. Open web-based admin panel at http://192.168.10.0.
  3. Go to System | Administration.
  4. Set a root password.
  5. Disable ssh password auth and root password logins.
  6. Paste in your RSA ssh public key.
  7. Click Save & Apply.
  8. Go to Network | Firewall.
  9. Select "accept" for WAN Input.
  10. Click Save & Apply.

Finally, go to Network | Interfaces and note the IPv4 address of the WAN port since that will be needed in the next step.

Installing Debian

The first step is to install Debian jessie on the GnuBee.

Connect the blue network port into your router/switch and ssh into the GnuBee using the IP address you noted earlier:

ssh root@192.168.1.xxx

and the root password you set in the previous section.

Then use fdisk /dev/sda to create the following partition layout on the first drive:

Device       Start       End   Sectors   Size Type
/dev/sda1     2048   8390655   8388608     4G Linux swap
/dev/sda2  8390656 234441614 226050959 107.8G Linux filesystem

Note that I used a 120GB solid-state drive as the system drive in order to minimize noise levels.

Then format the swap partition:

mkswap /dev/sda1

and download the latest version of the jessie installer:

wget --no-check-certificate https://raw.githubusercontent.com/gnubee-git/GnuBee_Docs/master/GB-PCx/scripts/jessie_3.10.14/debian-jessie-install

(Yes, the --no-check-certificate is really unfortunate. Please leave a comment if you find a way to work around it.)

The stock installer fails to bring up the correct networking configuration on my network and so I have modified the install script by changing the eth0.1 blurb to:

auto eth0.1
iface eth0.1 inet static
    address 192.168.10.1
    netmask 255.255.255.0

Then you should be able to run the installer successfully:

sh ./debian-jessie-install

and reboot:

reboot

Restore ssh access in Debian jessie

Once the GnuBee has finished booting, login using the serial console:

  • username: root
  • password: GnuBee

and change the root password using passwd.

Look for the IPv4 address of eth0.2 in the output of the ip addr command and then ssh into the GnuBee from your desktop computer:

ssh root@192.168.1.xxx  # type password set above
mkdir .ssh
vim .ssh/authorized_keys  # paste your ed25519 ssh pubkey

Finish the jessie installation

With this in place, you should be able to ssh into the GnuBee using your public key:

ssh root@192.168.1.172

and then finish the jessie installation:

wget --no-check-certificate https://raw.githubusercontent.com/gnubee-git/gnubee-git.github.io/master/debian/debian-modules-install
bash ./debian-modules-install
reboot

After rebooting, I made a few tweaks to make the system more pleasant to use:

update-alternatives --config editor  # choose vim.basic
dpkg-reconfigure locales  # enable the locale that your desktop is using

Upgrade to stretch and then buster

To upgrade to stretch, put this in /etc/apt/sources.list:

deb http://httpredir.debian.org/debian stretch main
deb http://httpredir.debian.org/debian stretch-updates main
deb http://security.debian.org/ stretch/updates main

Then upgrade the packages:

apt update
apt full-upgrade
apt autoremove
reboot

To upgrade to buster, put this in /etc/apt/sources.list:

deb http://httpredir.debian.org/debian buster main
deb http://httpredir.debian.org/debian buster-updates main
deb http://security.debian.org/debian-security buster/updates main

and upgrade the packages:

apt update
apt full-upgrade
apt autoremove
reboot

Next steps

At this point, my GnuBee is running the latest version of Debian stable, however there are two remaining issues to fix:

  1. openssh-server doesn't work and I am forced to access the GnuBee via the serial interface.

  2. The firmware is running an outdated version of the Linux kernel though this is being worked on by community members.

I hope to resolve these issues soon, and will update this blog post once I do, but you are more than welcome to leave a comment if you know of a solution I may have overlooked.

July 11, 2019

Gurobi Installation and Tests on a HPC system

Gurobi is an optimisation solver which describes itself as follows, thus explaining its increasing popularity:

The Gurobi Optimizer is a state-of-the-art solver for mathematical programming. The solvers in the Gurobi Optimizer were designed from the ground up to exploit modern architectures and multi-core processors, using the most advanced implementations of the latest algorithms.

The following outlines the installation procedure on a Linux cluster, various licensing conundrums, and a sample job using Slurm.

Installation

In our case we acquired a floating academic license. A form will need to be filled out, scanned, and sent back to Gurobi. This is all sub-optimal and, like any partially proprietary software, it's damaged goods. But certainly it's not as bad as it could be; small mercies.

Having received a license file and having downloaded the software, one can install it. The following is an EasyBuild script (Gurobi-8.1.1.eb), but it's basically a tarball which contains several binaries, docs, examples etc.


name = 'Gurobi'
version = '8.1.1'
easyblock = 'Tarball'
homepage = 'http://www.gurobi.com'
description = """The Gurobi Optimizer is a state-of-the-art solver for mathematical programming. The solvers in the Gurobi Optimizer were designed from the ground up to exploit modern architectures and multi-core processors, using the most advanced implementations of the latest algorithms."""
toolchain = {'name': 'dummy', 'version': 'dummy'}
# registration is required
# source_urls = ['http://www.gurobi.com/downloads/user/gurobi-optimizer']
sources = ['%(namelower)s%(version)s_linux64.tar.gz']
moduleclass = 'math'

Licensing

As is often the case with proprietary software, the greatest pain for sysadmins will be dealing with the license. Even in those cases where this is quicker than the software install itself, at least with the software you know that the work is necessary. With licenses, it's unnecessary work, and I keep a sharp eye on the number of expected seconds I have left in my life. For anyone else reading this, hopefully I've saved a few for you.

Like most sensible HPC systems the management node is not directly accessible to the outside world. A license file will need to be created from the grbprobe command and relevant material added into a gurobi.lic file. The following is an example:


# DO NOT EDIT THIS FILE except as noted
#
# License ID XXXXXX
TYPE=TOKEN
VERSION=8
TOKENSERVER=spartan-build.hpc.unimelb.edu.au
HOSTNAME=spartan-build.hpc.unimelb.edu.au
HOSTID=XXXXXXXX
SOCKETS=2
EXPIRATION=2020-04-18
USELIMIT=4096
DISTRIBUTED=100
SPECIAL=2
KEY=XXXXXXXX
CKEY=XXXXXXXX
# Uncomment and edit the following lines as desired:
PORT=XXXXX
# # PASSWORD=YourPrivatePassword

Gurobi strongly prefers that the license is installed in a /opt/gurobi directory according to their documentation, but on an HPC system it is doubtful that this is mounted across compute nodes. Thus a path will have to be exported with the appropriate variable when run. In the meantime, the token server can be started:


(vSpartan) [root@spartan-build gurobi]# module load Gurobi
(vSpartan) [root@spartan-build gurobi]# grb_ts
..

Smoke Test

With the token server running, it should be possible to run a Gurobi task. Note however that in an HPC environment where the management node is private and running the token server, and the login node is public, you may encounter a subnet error.


[lev@spartan-login1 ~]$ module load Gurobi
[lev@spartan-login1 ~]$ export GRB_LICENSE_FILE=/usr/local/easybuild/software/Gurobi/gurobi.lic
[lev@spartan-login1 ~]$ gurobi_cl
ERROR 10009: Server must be on the same subnet

Thus in these situations it is best to run on a compute node after launching an interactive job.


[lev@spartan-login1 ~]$ sinteractive
..
[lev@spartan-rc110 ~] module load Gurobi
[lev@spartan-rc110 ~] export GRB_LICENSE_FILE=/usr/local/easybuild/software/Gurobi/gurobi.lic
[lev@spartan-rc110 ~]$ gurobi_cl
Usage: gurobi_cl [--command]* [param=value]* filename
Type 'gurobi_cl --help' for more information.

Slurm Job Script

A number of example Gurobi jobs are provided with the application, and the misc07.mps file is a good example for a speed test. Note that this task is pleasingly parallel and runs much faster as a multicore job than as a single-core job. Here is a sample Slurm script, gurobi.slurm; a submission example follows the script. For the test case, it is worth testing the sample job with a single CPU vs eight or more.


#!/bin/bash
#SBATCH -p cloud
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
module load Gurobi/7.0.1
export GRB_LICENSE_FILE=/usr/local/easybuild/software/Gurobi/gurobi.lic
time gurobi_cl misc07.mps
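To run the comparison, submit the script with sbatch and vary --cpus-per-task between runs (assuming the script is saved as gurobi.slurm in the same directory as misc07.mps; the timing output ends up in Slurm's default slurm-<jobid>.out file):

sbatch gurobi.slurm
squeue -u $USER          # wait for the job to finish
cat slurm-<jobid>.out    # the real/user/sys times from the time command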

The following are some sample results:

cpus-per-task=1

real 0m9.255s
user 0m8.220s
sys 0m1.009s

cpus-per-task=8

real 0m2.558s
user 0m14.593s
sys 0m3.778s

Restart License File

One other issue that's too easy to overlook is that if the management node is restarted for any reason (e.g., a planned outage for system upgrades), the Gurobi license server will have to be restarted. One method is to write a short script and add it to the list of services that are required on boot. It may be necessary to source a profile in order to invoke the modules system in a non-interactive shell e.g.,


#!/bin/bash
. /etc/profile.d/z00_lmod.sh
. /etc/profile.d/z01_spartan.sh
module load Gurobi/8.1.1
grb_ts
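On a systemd-based management node, one way to hook that script in is a unit along these lines (a minimal sketch only; the script path /usr/local/sbin/gurobi-ts.sh, the unit name, and Type=forking are my assumptions – check how grb_ts daemonises on your system before relying on it):

# /etc/systemd/system/gurobi-ts.service (hypothetical name and path)
[Unit]
Description=Gurobi token server
Wants=network-online.target
After=network-online.target

[Service]
# grb_ts is assumed to background itself; use Type=simple if it stays in the foreground
Type=forking
ExecStart=/usr/local/sbin/gurobi-ts.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then enable it so it starts on boot:

systemctl daemon-reload
systemctl enable gurobi-ts.service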

July 08, 2019

Red Hat, Red Heart.

kattekrab, Mon, 08/07/2019 - 22:10

July 06, 2019

SIP Encryption on VoIP.ms

My VoIP provider recently added support for TLS/SRTP-based call encryption. Here's what I did to enable this feature on my Asterisk server.

First of all, I changed the registration line in /etc/asterisk/sip.conf to use the "tls" scheme:

[general]
register => tls://mydid:mypassword@servername.voip.ms

then I enabled incoming TCP connections:

tcpenable=yes

and TLS:

tlsenable=yes
tlscapath=/etc/ssl/certs/

Finally, I changed my provider entry in the same file to:

[voipms]
type=friend
host=servername.voip.ms
secret=mypassword
username=mydid
context=from-voipms
allow=ulaw
allow=g729
insecure=port,invite
transport=tls
encryption=yes

(Note the last two lines.)

The dialplan didn't change and so I still have the following in /etc/asterisk/extensions.conf:

[pstn-voipms]
exten => _1NXXNXXXXXX,1,Set(CALLERID(all)=Francois Marier <5551234567>)
exten => _1NXXNXXXXXX,n,Dial(SIP/voipms/${EXTEN})
exten => _1NXXNXXXXXX,n,Hangup()
exten => _NXXNXXXXXX,1,Set(CALLERID(all)=Francois Marier <5551234567>)
exten => _NXXNXXXXXX,n,Dial(SIP/voipms/1${EXTEN})
exten => _NXXNXXXXXX,n,Hangup()
exten => _011X.,1,Set(CALLERID(all)=Francois Marier <5551234567>)
exten => _011X.,n,Authenticate(1234) ; require password for international calls
exten => _011X.,n,Dial(SIP/voipms/${EXTEN})
exten => _011X.,n,Hangup(16)

Server certificate

The only thing I still need to fix is to make this error message go away in my logs:

asterisk[8691]: ERROR[8691]: tcptls.c:966 in __ssl_setup: TLS/SSL error loading cert file. <asterisk.pem>

It appears to be related to the fact that I didn't set tlscertfile in /etc/asterisk/sip.conf and that it's using its default value of asterisk.pem, a non-existent file.

Since my Asterisk server is only acting as a TLS client, and not a TLS server, there's probably no harm in not having a certificate. That said, it looks pretty easy to use a Let's Encrypt cert with Asterisk.
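If I do get around to it, I'd expect it to look something like the following (a sketch only, untested; example.com stands in for the real hostname, the /etc/asterisk/keys directory is an assumption, and chan_sip wants the private key and certificate concatenated into the single PEM file named by tlscertfile):

cat /etc/letsencrypt/live/example.com/privkey.pem \
    /etc/letsencrypt/live/example.com/fullchain.pem > /etc/asterisk/keys/asterisk.pem
chmod 600 /etc/asterisk/keys/asterisk.pem

with the matching line added to the [general] section of /etc/asterisk/sip.conf:

tlscertfile=/etc/asterisk/keys/asterisk.pem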

July 03, 2019

Audiobooks – June 2019

Robot Visions by Isaac Asimov

A collection of short Robot stories and very short essays. Lots of classic stories although the essays are mostly forgettable. 7/10

Foreigner by Robert J. Sawyer

An alien counterpart of Sigmund Freud psychoanalyzes her race’s equivalent of Galileo. 3rd in the trilogy. I liked it enough. 7/10

In Your Defence: Stories of Life and Law by Sarah Langford

An English Barrister describes 11 cases she has worked on. The lives and cases are mostly tragic but the writing is very compelling. 8/10

The Unthinkable: Who Survives When Disaster Strikes and Why by Amanda Ripley

A wide tour of the various ways people react in disasters, from ignoring to freezing. Lots of interesting stories, some investigations into the psychology and some practical advice. 8/10

The Fellowship of the Ring by J.R.R Tolkien. Narrated by Rob Inglis.

The first time I’ve ever listened to this version. Excellent in every way. 10/10

Podcasting: The Ultimate Guide to Record, Produce, and Launch Your Podcast and Build Raving Fans by Martin C. Glover

A quick (40 minute) intro to podcasting, with some do’s and don’ts for prospective podcasters. Worth a listen if you are new to the topic and considering it. 6/10

Nothing is real: The Beatles Were Underrated And Other Sweeping Statements About Pop by David Hepworth

A collection of essays, many about the Beatles but covering lots of other Pop-Music topics. A lot of good ones in there. 7/10

Safely to Earth: The Men and Women Who Brought the Astronauts Home by Jack Clemons

A memoir of an engineer who worked on the Shuttle and Apollo programs, about his time there and what he worked on, including the shuttle software. 7/10


The Two Towers by J.R.R Tolkien. Narrated by Rob Inglis.

10/10


July 01, 2019

LUV July 2019 Workshop: Debian 10 "Buster"

Jul 20 2019 12:30
Jul 20 2019 16:30
Location: 
Infoxchange, 33 Elizabeth St. Richmond

 

Celebrating the release of Debian 10 "Buster":

The meeting will be held at Infoxchange, 33 Elizabeth St. Richmond 3121.  Late arrivals please call (0421) 775 358 for access to the venue.

LUV would like to acknowledge Infoxchange for the venue.

Linux Users of Victoria is a subcommittee of Linux Australia.


LUV July 2019 Main Meeting: Automated Linux Firewall Failover

Jul 2 2019 19:00
Jul 2 2019 21:00
Location: 
Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

PLEASE NOTE LATER START TIME

7:00 PM to 9:00 PM Tuesday, July 2, 2019
Kathleen Syme Library, 251 Faraday Street Carlton VIC 3053

Speakers:

  • Adrian Close: Automated Linux Firewall Failover

 

Automated Linux Firewall Failover

Many of us like to go for dinner nearby after the meeting, typically at Brunetti's or Trotters Bistro in Lygon St.  Please let us know if you'd like to join us!

Linux Users of Victoria is a subcommittee of Linux Australia.


June 29, 2019

Long-term Device Use

It seems to me that Android phones have recently passed the stage where hardware advances are well ahead of software bloat. This is the point that desktop PCs passed about 15 years ago and laptops passed about 8 years ago. For just over 15 years I’ve been avoiding buying desktop PCs; the hardware that organisations I work for throw out is good enough that I don’t need to. For the last 8 years I’ve been avoiding buying new laptops, instead buying refurbished or second hand ones which are more than adequate for my needs. Now it seems that Android phones have reached the same stage of development.

3 years ago I purchased my last phone, a Nexus 6P [1]. Then 18 months ago I got a Huawei Mate 9 as a warranty replacement [2] (I had swapped phones with my wife so the phone I was using which broke was less than a year old). The Nexus 6P had been working quite well for me until it stopped booting, but I was happy to have something a little newer and faster to replace it at no extra cost.

Prior to the Nexus 6P I had a Samsung Galaxy Note 3 for 1 year 9 months which was a personal record for owning a phone and not wanting to replace it. I was quite happy with the Note 3 until the day I fell on top of it and cracked the screen (it would have been ok if I had just dropped it). While the Note 3 still has my personal record for continuous phone use, the Nexus 6P/Huawei Mate 9 have the record for going without paying for a new phone.

A few days ago when browsing the Kogan web site I saw a refurbished Mate 10 Pro on sale for about $380. That’s not much money (I usually have spent $500+ on each phone) and while the Mate 9 is still going strong the Mate 10 is a little faster and has more RAM. The extra RAM is important to me as I have problems with Android killing apps when I don’t want it to. Also the IP67 protection will be a handy feature. So that phone should be delivered to me soon.

Some phones are getting ridiculously expensive nowadays (who wants to walk around with a $1000+ Pixel?) but it seems that the slightly lower end models are more than adequate and the older versions are still good.

Cost Summary

If I can buy a refurbished or old model phone every 2 years for under $400 that will make using a phone cost about $0.50 per day. The Nexus 6P cost me $704 in June 2016 which means that for the past 3 years my phone cost was about $0.62 per day.

It seems that laptops tend to last me about 4 years [3], and I don’t need high-end models (I even used one from a rubbish pile for a while). The last laptops I bought cost me $289 for a Thinkpad X1 Carbon [4] and $306 for the Thinkpad T420 [5]. That makes laptops about $0.20 per day.

In May 2014 I bought a Samsung Galaxy Note 10.1 2014 edition tablet for $579. That is still working very well for me today, apart from only having 32G of internal storage space and an OS update preventing Android apps from writing to the micro SD card (so I have to use USB to copy TV shows on to it) there’s nothing more than I need from a tablet. Strangely I even get good battery life out of it, I can use it for a couple of hours without the battery running out. Battery life isn’t nearly as good as when it was new, but it’s still OK for my needs. As Samsung stopped providing security updates I can’t use the tablet as a SSH client, but now that my primary laptop is a small and light model that’s less of an issue. Currently that tablet has cost me just over $0.30 per day and it’s still working well.

Currently it seems that my hardware expense for the foreseeable future is likely to be about $1 per day: 20 cents for laptop, 30 cents for tablet, and 50 cents for phone. The overall expense is about $1.66 per day, as I’m on a $20 per month pre-paid plan with Aldi Mobile (roughly another 66 cents per day).

Saving Money

A laptop is very important to me, but the amount of money I’m spending doesn’t reflect that. It seems, though, that I don’t have any option for spending more on a laptop (the Thinkpad X1 Carbon I have now is just great and there’s no real option for getting more utility by spending more). I also don’t have any option to spend less on a tablet; 5 years is a great lifetime for a device that is practically impossible to repair (a repair would cost a significant portion of the replacement cost).

I hope that the Mate 10 can last at least 2 years which will make it a new record for low cost of ownership of a phone for me. If app vendors can refrain from making their bloated software take 50% more RAM in the next 2 years that should be achievable.

The surprising thing I learned while writing this post is that my mobile phone expense is the largest of all my expenses related to mobile computing. Given that I want to get good reception in remote areas (needs to be Telstra or another company that uses their network) and that I need at least 3GB of data transfer per month it doesn’t seem that I have any options for reducing that cost.

June 26, 2019

Installing NixOS on a Headless Raspberry Pi 3

NixOS Raspberry Pi Gears by Craige McWhirter

This represents the first step in being able to build ready-to-run NixOS images for headless Raspberry Pi 3 devices. Aarch64 images for NixOS need to be built natively on aarch64 hardware, so the first Pi 3, the subject of this post, will need a keyboard and screen attached for two commands.

A fair chunk of this post is collated from NixOS on ARM and NixOS on ARM/Raspberry Pi into a coherent, flowing process with additional steps related to the goal of this being a headless Raspberry Pi 3.

Head to Hydra job nixos:release-19.03:nixos.sd_image.aarch64-linux and download the latest successful build. ie:

 $ wget https://hydra.nixos.org/build/95346103/download/1/nixos-sd-image-19.03.172980.d5a3e5f476b-aarch64-linux.img

You will then need to write this to your SD Card:

# dd if=nixos-sd-image-19.03.172980.d5a3e5f476b-aarch64-linux.img of=/dev/sdX status=progress

Make sure you replace "/dev/sdX" with the correct location of your SD card.

Once the SD card has been written, attach the keyboard and screen, insert the SD card into the Pi and boot it up.

When the boot process has been completed, you will be thrown to a root prompt where you need to set a password for root and start the ssh service:

[root@pi-tri:~]#
[root@pi-tri:~]# passwd
New password:
Retype new password:
passwd: password updated successfully

[root@pi-tri:~]# systemctl start sshd

You can now complete the rest of this process from the comfort of wherever you normally work.

After successfully ssh-ing in and examining your disk layout with lsblk, the first step is to deal with the undersized FAT32 /boot partition by moving the bootable flag to the main NixOS partition:

# fdisk -l /dev/mmcblk0
Disk /dev/mmcblk0: 7.4 GiB, 7948206080 bytes, 15523840 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2178694e

Device         Boot  Start      End  Sectors  Size Id Type
/dev/mmcblk0p1 *     16384   262143   245760  120M  b W95 FAT32
/dev/mmcblk0p2      262144 15522439 15260296  7.3G 83 Linux


# echo -e 'a\n1\na\n2\nw' | fdisk /dev/mmcblk0

Welcome to fdisk (util-linux 2.32.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): Partition number (1,2, default 2):
The bootable flag on partition 1 is disabled now.

Command (m for help): Partition number (1,2, default 2):
The bootable flag on partition 2 is enabled now.

Command (m for help): The partition table has been altered.
Syncing disks.

# fdisk -l /dev/mmcblk0
Disk /dev/mmcblk0: 7.4 GiB, 7948206080 bytes, 15523840 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2178694e

Device         Boot  Start      End  Sectors  Size Id Type
/dev/mmcblk0p1       16384   262143   245760  120M  b W95 FAT32
/dev/mmcblk0p2 *    262144 15522439 15260296  7.3G 83 Linux

Next we need to configure NixOS to boot the basic system we need with ssh enabled, root and a single user and disks configured correctly. I have this example file which at the time of writing looked like this:

# This is an example of a basic NixOS configuration file for a Raspberry Pi 3.
# It's best used as your first configuration.nix file and provides ssh, root
# and user accounts as well as Pi 3 specific tweaks.

{ config, pkgs, lib, ... }:

{
  # NixOS wants to enable GRUB by default
  boot.loader.grub.enable = false;
  # Enables the generation of /boot/extlinux/extlinux.conf
  boot.loader.generic-extlinux-compatible.enable = true;

  # For a Raspberry Pi 2 or 3):
  boot.kernelPackages = pkgs.linuxPackages_latest;

  # !!! Needed for the virtual console to work on the RPi 3, as the default of 16M doesn't seem to be enough.
  # If X.org behaves weirdly (I only saw the cursor) then try increasing this to 256M.
  boot.kernelParams = ["cma=32M"];

  # File systems configuration for using the installer's partition layout
  fileSystems = {
    "/" = {
      device = "/dev/disk/by-label/NIXOS_SD";
      fsType = "ext4";
    };
  };

  # !!! Adding a swap file is optional, but strongly recommended!
  swapDevices = [ { device = "/swapfile"; size = 1024; } ];

  hardware.enableRedistributableFirmware = true; # Enable support for Pi firmware blobs

  networking.hostName = "nixosPi";     # Define your hostname.
  networking.wireless.enable = false;  # Toggles wireless support via wpa_supplicant.

  # Select internationalisation properties.
  i18n = {
    consoleFont = "Lat2-Terminus16";
    consoleKeyMap = "us";
    defaultLocale = "en_AU.UTF-8";
  };

  time.timeZone = "Australia/Brisbane"; # Set your preferred timezone:

  # List services that you want to enable:
  services.openssh.enable = true;  # Enable the OpenSSH daemon.

  # Configure users for your Pi:
   users.mutableUsers = false;     # Remove any users not defined in here

  users.users.root = {
    hashedPassword = "$6$eeqJLxwQzMP4l$GTUALgbCfaqR8ut9kQOOG8uXOuqhtIsIUSP.4ncVaIs5PNlxdvAvV.krfutHafrxNN7KzaM7uksr6bXP5X0Sx1";
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 Voohu4vei4dayohm3eeHeecheifahxeetauR4geigh9eTheey3eedae4ais7pei4ruv4 me@myhost"
    ];
  };

  # Groups to add
  users.groups.myusername.gid = 1000;

  # Define a user account.
  users.users.myusername = {
    isNormalUser = true;
    uid = 1000;
    group = "myusername";
    extraGroups = ["wheel" ];
    hashedPassword = "$6$l2I7i6YqMpeviVy$u84FSHGvZlDCfR8qfrgaP.n7/hkfGpuiSaOY3ziamwXXHkccrOr8Md4V5G2M1KcMJQmX5qP7KOryGAxAtc5T60";
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 Voohu4vei4dayohm3eeHeecheifahxeetauR4geigh9eTheey3eedae4ais7pei4ruv4 me@myhost"
    ];
  };

  # This value determines the NixOS release with which your system is to be
  # compatible, in order to avoid breaking some software such as database
  # servers. You should change this only after NixOS release notes say you
  # should.
  system.stateVersion = "19.03"; # Did you read the comment?
  system.autoUpgrade.enable = true;
  system.autoUpgrade.channel = https://nixos.org/channels/nixos-19.03;
}
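One way to get the example file into place (a sketch; substitute the Pi's actual address for 192.168.1.xxx) is to copy it over ssh to the standard NixOS location:

scp configuration.nix root@192.168.1.xxx:/etc/nixos/configuration.nix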

Once this is copied into place, you only need to rebuild NixOS using it by running:

# nixos-rebuild switch

Now you should have a headless Pi 3 which you can use to build SD card images for other Pi 3s that are fully configured and ready to run.

June 25, 2019

Linux Security Summit North America 2019: Schedule Published

The schedule for the 2019 Linux Security Summit North America (LSS-NA) is published.

This year, there are some changes to the format of LSS-NA. The summit runs for three days instead of two, which allows us to relax the schedule somewhat while also adding new session types.  In addition to refereed talks, short topics, BoF sessions, and subsystem updates, there are now also tutorials (one each day), unconference sessions, and lightning talks.

The tutorial sessions are:

These tutorials will be 90 minutes in length, and they’ll run in parallel with unconference sessions on the first two days (when the space is available at the venue).

The refereed presentations and short topics cover a range of Linux security topics including platform boot security, integrity, container security, kernel self protection, fuzzing, and eBPF+LSM.

Some of the talks I’m personally excited about include:

The schedule last year was pretty crammed, so with the addition of the third day we’ve been able to avoid starting early, and we’ve also added five minute transitions between talks. We’re hoping to maximize collaboration via the more relaxed schedule and the addition of more types of sessions (unconference, tutorials, lightning talks).  This is not a conference for simply consuming talks, but to also participate and to get things done (or started).

Thank you to all who submitted proposals.  As usual, we had many more submissions than can be accommodated in the available time.

Also thanks to the program committee, who spent considerable time reviewing and discussing proposals, and working out the details of the schedule. The committee for 2019 is:

  • James Morris (Microsoft)
  • Serge Hallyn (Cisco)
  • Paul Moore (Cisco)
  • Stephen Smalley (NSA)
  • Elena Reshetova (Intel)
  • John Johansen (Canonical)
  • Kees Cook (Google)
  • Casey Schaufler (Intel)
  • Mimi Zohar (IBM)
  • David A. Wheeler (Institute for Defense Analyses)

And of course many thanks to the event folk at Linux Foundation, who handle all of the logistics of the event.

LSS-NA will be held in San Diego, CA on August 19-21. To register, click here. Or you can register for the co-located Open Source Summit and add LSS-NA.

 

June 23, 2019

X-Axis is now ready!

The thread plate is now mounted to the base with thread lock in select locations. The top can still come off easily so I can drill holes to mount the gantry to the alloy tongue that comes out the bottom middle (there is one on the other side too).


Without the 75mm by 50mm by 1/4 inch 6061 alloy angle brackets you could flex the steel in the middle. Now, well... it is not so easy for a human to apply enough force to do it. The thread plate is only supported by 4 columns at the left and right sides. The middle is unsupported to allow the gantry to travel 950mm along it. I think the next build will be more of a vertical mill style than a sliding gantry to avoid these rigidity challenges.


June 20, 2019

Booting a NixOS aarch64 Image in Qemu

NixOS Gears by Craige McWhirter

To boot a NixOS aarch64 image in qemu, in this example, a Raspberry Pi3 (B), you can use the following command:

 qemu-system-aarch64 -M raspi3 -drive format=raw,file=NIXOS.IMG \
 -kernel ./u-boot-rpi3.bin -serial stdio -d in_asm -m 1024

You will need to replace NIXOS.IMG with the name of the image file you downloaded ie: nixos-sd-image-18.09.2568.1e9e709953e-aarch64-linux.img

You will also need to mount the image file and copy out u-boot-rpi3.bin for the -kernel option.
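One way to get at that file (a sketch; it assumes u-boot-rpi3.bin lives on the FAT boot partition of the image, that /dev/loop0 is free, and that /mnt is an empty mount point):

sudo losetup -P /dev/loop0 nixos-sd-image-18.09.2568.1e9e709953e-aarch64-linux.img   # -P scans the partition table
sudo mount /dev/loop0p1 /mnt    # the FAT boot partition
cp /mnt/u-boot-rpi3.bin .
sudo umount /mnt
sudo losetup -d /dev/loop0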

A nerd snipe, in which I reverse engineer the Aussie Broadband usage API


I was curious about the newly available FTTN NBN service in my area, so I signed up to see what’s what. Of course, I need a usage API so that I can graph my usage in prometheus and grafana as everyone does these days. So I asked Aussie. The response I got was that I was welcome to reverse engineer the REST API that the customer portal uses.

So I did.

I give you my super simple implementation of an Aussie Broadband usage client in Python. Patches of course are welcome.

I’ve now released the library on pypi under the rather innovative name of “aussiebb”, so installing it is as simple as:

$ pip install aussiebb


June 18, 2019

TEN THOUSAND DISKS

In OpenPOWER land we have a project called op-test-framework which (for all its strengths and weaknesses) allows us to test firmware on a variety of different hardware platforms and even emulators like Qemu.

Qemu is a fantastic tool allowing us to relatively quickly test against an emulated POWER model, and of course is a critical part of KVM virtual machines running natively on POWER hardware. However the default POWER model in Qemu is based on the "pseries" machine type, which models something closer to a virtual machine or a PowerVM partition rather than a "bare metal" machine.

Luckily we have Cédric Le Goater who is developing and maintaining a Qemu "powernv" machine type which more accurately models running directly on an OpenPOWER machine. It's an unwritten rule that if you're using Qemu in op-test, you've compiled this version of Qemu!

Teething Problems

Because the "powernv" type does more accurately model the physical system some extra care needs to be taken when setting it up. In particular at one point we noticed that the pretend CDROM and disk drive we attached to the model were.. not being attached. This commit took care of that; the problem was that the PCI topology defined by the layout required us to be more exact about where PCI devices were to be added. By default only three spare PCI "slots" are available but as the commit says, "This can be expanded by adding bridges"...

More Slots!

Never one to stop at a just-enough solution, I wondered how easy it would be to add an extra PCI bridge or two to give the Qemu model more available slots for PCI devices. It turns out, easy enough once you know the correct invocation. For example, adding a PCI bridge in the first slot of the first default PHB is:

-device pcie-pci-bridge,id=pcie.3,bus=pcie.0,addr=0x0

And inserting a device in that bridge just requires us to specify the bus and slot:

-device virtio-blk-pci,drive=cdrom01,id=virtio02,bus=pcie.4,addr=3

Great! Each bridge provides 31 slots, so now we have plenty of room for extra devices.

Why Stop There?

We have three free slots, and we don't have a strict requirement on where devices are plugged in, so lets just plug a bridge into each of those slots while we're here:

-device pcie-pci-bridge,id=pcie.3,bus=pcie.0,addr=0x0 \
-device pcie-pci-bridge,id=pcie.4,bus=pcie.1,addr=0x0 \
-device pcie-pci-bridge,id=pcie.5,bus=pcie.2,addr=0x0

What happens if we insert a new PCI bridge into another PCI bridge? Aside from stressing out our PCI developers, a bunch of extra slots! And then we could plug bridges into those bridges and then..
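For example (purely illustrative, following the same pattern as above – the id, bus and addr values here are arbitrary), plugging a bridge into the bridge we added earlier looks like:

-device pcie-pci-bridge,id=pcie.6,bus=pcie.3,addr=0x1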


Thus was born "OpTestQemu: Add PCI bridges to support more devices." and the testcase "Petitboot10000Disks". The changes to the Qemu model setup fill up each PCI bridge as long as we have devices to add, but reserve the first slot to add another bridge if we run out of room... and so on..

Officially this is to support adding interesting disk topologies to test Petitboot use cases, stress test device handling, and so on, but while we're here... what happens with 10,000 temporary disks?

======================================================================
ERROR: testListDisks (testcases.Petitboot10000Disks.ConfigEditorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sam/git/op-test-framework/testcases/Petitboot10000Disks.py", line 27, in setUp
    self.system.goto_state(OpSystemState.PETITBOOT_SHELL)
  File "/home/sam/git/op-test-framework/common/OpTestSystem.py", line 366, in goto_state
    self.state = self.stateHandlers[self.state](state)
  File "/home/sam/git/op-test-framework/common/OpTestSystem.py", line 695, in run_IPLing
    raise my_exception
UnknownStateTransition: Something happened system state="2" and we transitioned to UNKNOWN state.  Review the following for more details
Message="OpTestSystem in run_IPLing and the Exception=
"filedescriptor out of range in select()"
 caused the system to go to UNKNOWN_BAD and the system will be stopping."

Yeah that's probably to be expected without some more massaging. What about a more modest 512?

I: Resetting PHBs and training links...
[   55.293343496,5] PCI: Probing slots...
[   56.364337089,3] PHB#0000:02:01.0 pci_find_ecap hit a loop !
[   56.364973775,3] PHB#0000:02:01.0 pci_find_ecap hit a loop !
[   57.127964432,3] PHB#0000:03:01.0 pci_find_ecap hit a loop !
[   57.128545637,3] PHB#0000:03:01.0 pci_find_ecap hit a loop !
[   57.395489618,3] PHB#0000:04:01.0 pci_find_ecap hit a loop !
[   57.396048285,3] PHB#0000:04:01.0 pci_find_ecap hit a loop !
[   58.145944205,3] PHB#0000:05:01.0 pci_find_ecap hit a loop !
[   58.146465795,3] PHB#0000:05:01.0 pci_find_ecap hit a loop !
[   58.404954853,3] PHB#0000:06:01.0 pci_find_ecap hit a loop !
[   58.405485438,3] PHB#0000:06:01.0 pci_find_ecap hit a loop !
[   60.178957315,3] PHB#0001:02:01.0 pci_find_ecap hit a loop !
[   60.179524173,3] PHB#0001:02:01.0 pci_find_ecap hit a loop !
[   60.198502097,3] PHB#0001:02:02.0 pci_find_ecap hit a loop !
[   60.198982582,3] PHB#0001:02:02.0 pci_find_ecap hit a loop !
[   60.435096197,3] PHB#0001:03:01.0 pci_find_ecap hit a loop !
[   60.435634380,3] PHB#0001:03:01.0 pci_find_ecap hit a loop !
[   61.171512439,3] PHB#0001:04:01.0 pci_find_ecap hit a loop !
[   61.172029071,3] PHB#0001:04:01.0 pci_find_ecap hit a loop !
[   61.425416049,3] PHB#0001:05:01.0 pci_find_ecap hit a loop !
[   61.425934524,3] PHB#0001:05:01.0 pci_find_ecap hit a loop !
[   62.172664549,3] PHB#0001:06:01.0 pci_find_ecap hit a loop !
[   62.173186458,3] PHB#0001:06:01.0 pci_find_ecap hit a loop !
[   63.434516732,3] PHB#0002:02:01.0 pci_find_ecap hit a loop !
[   63.435062124,3] PHB#0002:02:01.0 pci_find_ecap hit a loop !
[   64.177567772,3] PHB#0002:03:01.0 pci_find_ecap hit a loop !
[   64.178099773,3] PHB#0002:03:01.0 pci_find_ecap hit a loop !
[   64.431763989,3] PHB#0002:04:01.0 pci_find_ecap hit a loop !
[   64.432285000,3] PHB#0002:04:01.0 pci_find_ecap hit a loop !
[   65.180506790,3] PHB#0002:05:01.0 pci_find_ecap hit a loop !
[   65.181049905,3] PHB#0002:05:01.0 pci_find_ecap hit a loop !
[   65.432105600,3] PHB#0002:06:01.0 pci_find_ecap hit a loop !
[   65.432654326,3] PHB#0002:06:01.0 pci_find_ecap hit a loop !

(That isn't good)

[   66.177240655,5] PCI Summary:
[   66.177906083,5] PHB#0000:00:00.0 [ROOT] 1014 03dc R:00 C:060400 B:01..07 
[   66.178760724,5] PHB#0000:01:00.0 [ETOX] 1b36 000e R:00 C:060400 B:02..07 
[   66.179501494,5] PHB#0000:02:01.0 [ETOX] 1b36 000e R:00 C:060400 B:03..07 
[   66.180227773,5] PHB#0000:03:01.0 [ETOX] 1b36 000e R:00 C:060400 B:04..07 
[   66.180953149,5] PHB#0000:04:01.0 [ETOX] 1b36 000e R:00 C:060400 B:05..07 
[   66.181673576,5] PHB#0000:05:01.0 [ETOX] 1b36 000e R:00 C:060400 B:06..07 
[   66.182395253,5] PHB#0000:06:01.0 [ETOX] 1b36 000e R:00 C:060400 B:07..07 
[   66.183207399,5] PHB#0000:07:02.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 
[   66.183969138,5] PHB#0000:07:03.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 

(a lot more of this)

[   67.055196945,5] PHB#0002:02:1e.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 
[   67.055926264,5] PHB#0002:02:1f.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 
[   67.094591773,5] INIT: Waiting for kernel...
[   67.095105901,5] INIT: 64-bit LE kernel discovered
[   68.095749915,5] INIT: Starting kernel at 0x20010000, fdt at 0x3075d270 168365 bytes

zImage starting: loaded at 0x0000000020010000 (sp: 0x0000000020d30ee8)
Allocating 0x1dc5098 bytes for kernel...
Decompressing (0x0000000000000000 <- 0x000000002001f000:0x0000000020d2e578)...
Done! Decompressed 0x1c22900 bytes

Linux/PowerPC load: 
Finalizing device tree... flat tree at 0x20d320a0
[   10.120562] watchdog: CPU 0 self-detected hard LOCKUP @ pnv_pci_cfg_write+0x88/0xa4
[   10.120746] watchdog: CPU 0 TB:50402010473, last heartbeat TB:45261673150 (10039ms ago)
[   10.120808] Modules linked in:
[   10.120906] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.5-openpower1 #2
[   10.120956] NIP:  c000000000058544 LR: c00000000004d458 CTR: 0000000030052768
[   10.121006] REGS: c0000000fff5bd70 TRAP: 0900   Not tainted  (5.0.5-openpower1)
[   10.121030] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 48002482  XER: 20000000
[   10.121215] CFAR: c00000000004d454 IRQMASK: 1 
[   10.121260] GPR00: 00000000300051ec c0000000fd7c3130 c000000001bcaf00 0000000000000000 
[   10.121368] GPR04: 0000000048002482 c000000000058544 9000000002009033 0000000031c40060 
[   10.121476] GPR08: 0000000000000000 0000000031c40060 c00000000004d46c 9000000002001003 
[   10.121584] GPR12: 0000000031c40000 c000000001dd0000 c00000000000f560 0000000000000000 
[   10.121692] GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000000000 
[   10.121800] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[   10.121908] GPR24: 0000000000000005 0000000000000000 0000000000000000 0000000000000104 
[   10.122016] GPR28: 0000000000000002 0000000000000004 0000000000000086 c0000000fd9fba00 
[   10.122150] NIP [c000000000058544] pnv_pci_cfg_write+0x88/0xa4
[   10.122187] LR [c00000000004d458] opal_return+0x14/0x48
[   10.122204] Call Trace:
[   10.122251] [c0000000fd7c3130] [c000000000058544] pnv_pci_cfg_write+0x88/0xa4 (unreliable)
[   10.122332] [c0000000fd7c3150] [c0000000000585d0] pnv_pci_write_config+0x70/0x9c
[   10.122398] [c0000000fd7c31a0] [c000000000234fec] pci_bus_write_config_word+0x74/0x98
[   10.122458] [c0000000fd7c31f0] [c00000000023764c] __pci_read_base+0x88/0x3a4
[   10.122518] [c0000000fd7c32c0] [c000000000237a18] pci_read_bases+0xb0/0xc8
[   10.122605] [c0000000fd7c3300] [c0000000002384bc] pci_setup_device+0x4f8/0x5b0
[   10.122670] [c0000000fd7c33a0] [c000000000238d9c] pci_scan_single_device+0x9c/0xd4
[   10.122729] [c0000000fd7c33f0] [c000000000238e2c] pci_scan_slot+0x58/0xf4
[   10.122796] [c0000000fd7c3430] [c000000000239eb8] pci_scan_child_bus_extend+0x40/0x2a8
[   10.122861] [c0000000fd7c34a0] [c000000000239e34] pci_scan_bridge_extend+0x4d4/0x504
[   10.122928] [c0000000fd7c3580] [c00000000023a0f8] pci_scan_child_bus_extend+0x280/0x2a8
[   10.122993] [c0000000fd7c35f0] [c000000000239e34] pci_scan_bridge_extend+0x4d4/0x504
[   10.123059] [c0000000fd7c36d0] [c00000000023a0f8] pci_scan_child_bus_extend+0x280/0x2a8
[   10.123124] [c0000000fd7c3740] [c000000000239e34] pci_scan_bridge_extend+0x4d4/0x504
[   10.123191] [c0000000fd7c3820] [c00000000023a0f8] pci_scan_child_bus_extend+0x280/0x2a8
[   10.123256] [c0000000fd7c3890] [c000000000239b5c] pci_scan_bridge_extend+0x1fc/0x504
[   10.123322] [c0000000fd7c3970] [c00000000023a064] pci_scan_child_bus_extend+0x1ec/0x2a8
[   10.123388] [c0000000fd7c39e0] [c000000000239b5c] pci_scan_bridge_extend+0x1fc/0x504
[   10.123454] [c0000000fd7c3ac0] [c00000000023a064] pci_scan_child_bus_extend+0x1ec/0x2a8
[   10.123516] [c0000000fd7c3b30] [c000000000030dcc] pcibios_scan_phb+0x134/0x1f4
[   10.123574] [c0000000fd7c3bd0] [c00000000100a800] pcibios_init+0x9c/0xbc
[   10.123635] [c0000000fd7c3c50] [c00000000000f398] do_one_initcall+0x80/0x15c
[   10.123698] [c0000000fd7c3d10] [c000000001000e94] kernel_init_freeable+0x248/0x24c
[   10.123756] [c0000000fd7c3db0] [c00000000000f574] kernel_init+0x1c/0x150
[   10.123820] [c0000000fd7c3e20] [c00000000000b72c] ret_from_kernel_thread+0x5c/0x70
[   10.123854] Instruction dump:
[   10.123885] 7d054378 4bff56f5 60000000 38600000 38210020 e8010010 7c0803a6 4e800020 
[   10.124022] e86a0018 54c6043e 7d054378 4bff5731 <60000000> 4bffffd8 e86a0018 7d054378 
[   10.124180] Kernel panic - not syncing: Hard LOCKUP
[   10.124232] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.5-openpower1 #2
[   10.124251] Call Trace:

I wonder if I can submit that bug without someone throwing something at my desk.

The X Axis is growing...

The new cnc X axis will be around a meter in length. This presents some issues with material selection, as steel that is 1100mm long by 350mm wide and 5mm thick will flex when only supported by the black columns at each end. I have some brackets to shore that up so the fixture plate will not be pushed away or vibrate under cutting load.




The linear rails are longer than the ballscrew to allow the gantry to travel the full length of the ballscrew. In this case a 1 meter ballscrew allows about 950mm of tip to tip travel and thus 850mm of cutter travel. The gantry is 100mm wide, shown as just the mounting plate in the picture above.

The black columns to hold the fixture plate are 38mm square and 60mm high solid steel. They come in at about 500grams a pop. The steel plate is about 15kg. I was originally going to use 38mm solid square steel stock as the shims under the linear rails but they came in at over 8kg each and the build was starting to get heavy.

The columns are m6 tapped both ends to hold the fixture plate up above the assembly. I will likely laminate some 1.2mm alloy to the base of the fixture plate to mitigate chips falling through the screw fixture holes into the rails and ballscrew.

I have yet to work out the final order of the 1/4 inch 6061 brackets that shore up the 5mm thick fixture plate. Without edge brackets you can flex the steel when it is only supported at the ends. Yes, I can see why vertical mills are made.

I made the plate that will have the gantry attached on the cnc, but had to refixture things as the cnc can not cut something that long in any of its current axes.



It is interesting how much harder 6061 is compared to some of the more economic alloys when machining things. You can see the cnc machine facing more resistance especially on 6mm and larger holes.  It will be interesting to see if the cnc can handle drilling steel at some stage.

June 15, 2019

OpenSUSE 15 LXC setup on Ubuntu Bionic 18.04

Similarly to what I wrote for Fedora, here is how I was able to create an OpenSUSE 15 LXC container on an Ubuntu 18.04 (bionic) laptop.

Setting up LXC on Ubuntu

First of all, install lxc:

apt install lxc
echo "veth" >> /etc/modules
modprobe veth

turn on bridged networking by putting the following in /etc/sysctl.d/local.conf:

net.ipv4.ip_forward=1

and applying it using:

sysctl -p /etc/sysctl.d/local.conf

Then allow the right traffic in your firewall (/etc/network/iptables.up.rules in my case):

# LXC containers
-A FORWARD -d 10.0.3.0/24 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -s 10.0.3.0/24 -j ACCEPT
-A INPUT -d 224.0.0.251 -s 10.0.3.1 -j ACCEPT
-A INPUT -d 239.255.255.250 -s 10.0.3.1 -j ACCEPT
-A INPUT -d 10.0.3.255 -s 10.0.3.1 -j ACCEPT
-A INPUT -d 10.0.3.1 -s 10.0.3.0/24 -j ACCEPT

and apply these changes:

iptables-apply

before restarting the lxc networking:

systemctl restart lxc-net.service

Creating the container

Once that's in place, you can finally create the OpenSUSE 15 container:

lxc-create -n opensuse15 -t download -- -d opensuse -r 15 -a amd64

To see a list of all distros available with the download template:

lxc-create -n foo --template=download -- --list

Logging in as root

Start up the container and get a login console:

lxc-start -n opensuse15 -F

In another terminal, set a password for the root user:

lxc-attach -n opensuse15 passwd

You can now use this password to log into the console you started earlier.

Logging in as an unprivileged user via ssh

As root, install a few packages:

zypper install vim openssh sudo man
systemctl start sshd
systemctl enable sshd

and then create an unprivileged user:

useradd francois
passwd francois
cd /home
mkdir francois
chown francois:100 francois/

and give that user sudo access:

visudo  # uncomment "wheel" line
groupadd wheel
usermod -aG wheel francois

Now login as that user from the console and add an ssh public key:

mkdir .ssh
chmod 700 .ssh
echo "<your public key>" > .ssh/authorized_keys
chmod 644 .ssh/authorized_keys

You can now login via ssh. The IP address to use can be seen in the output of:

lxc-ls --fancy

June 13, 2019

Intersections and connections

kattekrab, Thu, 13/06/2019 - 07:07

Raspberry Pi HAT identity EEPROMs, a simple guide


I’ve been working on an RFID scanner that can best be described as an overly large Raspberry Pi HAT recently. One of the things I am grappling with as I get closer to production boards is that I need to be able to identify what version of the HAT is currently installed — the software can then tweak its behaviour based on the hardware present.

I had toyed with using some spare GPIO lines and “hard coded” links on the HAT to identify board versions to the Raspberry Pi, but it turns out others have been here before and there’s a much better way. The Raspberry Pi folks have defined something called the “Hardware Attached on Top” (HAT) specification, which defines an i2c EEPROM that can be used to identify a HAT to the Raspberry Pi.

There are a couple of good resources I’ve found that help you do this — Sparkfun have a tutorial which covers it, and there is an interesting forum post. However, I couldn’t find a simple tutorial for HAT designers that covered exactly what they need to know and nothing else. There were also some gaps in those documents compared with my experiences, and I knew I’d need to look this stuff up again in the future. So I wrote this page.

Initial setup

First off, let’s talk about the hardware. I used a 24LC256P DIL i2c EEPROM — these are $2 on eBay, or $6 from Jaycar. The pins need to be wired like this:

----------------------------------------------------------------------
24LC256P Pin   Raspberry Pi Pin          Notes
----------------------------------------------------------------------
1 (A0)         GND (pins 6, 9, 14, 20,   With all address pins tied to
               25, 30, 34, 39)           ground, the EEPROM sits at i2c
                                         address 0x50, which is the
                                         address the specification
                                         requires.
2 (A1)         GND
3 (A2)         GND
4 (VSS)        GND
5 (SDA)        27                        Add a 3.9K pullup resistor from
                                         this pin to 3.3V. You must use
                                         this pin for the Raspberry Pi
                                         to detect the EEPROM on
                                         startup!
6 (SCL)        28                        Add a 3.9K pullup resistor from
                                         this pin to 3.3V. You must use
                                         this pin for the Raspberry Pi
                                         to detect the EEPROM on
                                         startup!
7 (WP)         Not connected             Write protect. I don’t need it.
8 (VCC)        3.3V (pins 1 or 17)       The EEPROM can run at 5 volts,
                                         but must be run at 3.3 volts to
                                         work as a HAT identification
                                         EEPROM.
----------------------------------------------------------------------

The specification requires that the data pin be on pin 27, the clock pin be on pin 28, and that the EEPROM be at address 0x50 on the i2c bus, as described in the table above. There is also some mention of pullup resistors in both the data sheet and the HAT specification, but not in a lot of detail. The best I could find was a circuit diagram for a different EEPROM with the pullup resistors shown.

My test EEPROM wired up on a little breadboard looks like this:

My prototype i2c EEPROM circuit

And has a circuit diagram like this:

An ID EEPROM circuit

Next, enable i2c on your Raspberry Pi: you need to hand-edit /boot/config.txt and then reboot. The relevant line of my config.txt looks like this:

dtparam=i2c_vc=on

After reboot you should have an entry at /dev/i2c-0.

GOTCHA: you can’t probe the i2c bus that the HAT standard uses, and I couldn’t get flashing the EEPROM to work on that bus either.

The first gotcha, in more detail: the version detection i2c bus is only enabled during boot and then turned off, so an i2cdetect on bus zero won't show the device after boot. This caused an initial panic attack because I thought my EEPROM was dead, but that was just my twitchy nature showing through.

You can verify your EEPROM works by enabling bus one. To do this, add these lines to /boot/config.txt:

dtparam=i2c_arm=on
dtparam=i2c_vc=on

After a reboot you should have /dev/i2c-0 and /dev/i2c-1. You also need to move the EEPROM to bus 1 in order for it to be detected:

24LC256P Pin   Raspberry Pi Pin
5 (SDA)        3
6 (SCL)        5

You’ll need to move the EEPROM back before you can use it for HAT detection.
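Before moving it back, it’s worth a quick sanity check that the EEPROM answers on bus 1. A minimal sketch (i2cdetect comes from the i2c-tools package; with the address pins grounded the EEPROM should show up at address 0x50):

sudo apt-get install i2c-tools
sudo i2cdetect -y 1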

Programming the EEPROM

You program the EEPROM with a set of tools provided by the Raspberry Pi folks. Check those out and compile them; they’re not packaged for Raspbian as far as I can find:

pi@raspberrypi:~ $ git clone https://github.com/raspberrypi/hats
Cloning into 'hats'...
remote: Enumerating objects: 464, done.
remote: Total 464 (delta 0), reused 0 (delta 0), pack-reused 464
Receiving objects: 100% (464/464), 271.80 KiB | 119.00 KiB/s, done.
Resolving deltas: 100% (261/261), done.
pi@raspberrypi:~ $ cd hats/eepromutils/
pi@raspberrypi:~/hats/eepromutils $ ls
eepdump.c    eepmake.c            eeptypes.h  README.txt
eepflash.sh  eeprom_settings.txt  Makefile
pi@raspberrypi:~/hats/eepromutils $ make
cc eepmake.c -o eepmake -Wno-format
cc eepdump.c -o eepdump -Wno-format

The file named eeprom_settings.txt is a sample of the settings for your HAT. Fiddle with that until it makes you happy, and then compile it:

$ eepmake eeprom_settings.txt eeprom_settings.eep
Opening file eeprom_settings.txt for read
UUID=b9e3b4e9-e04f-4759-81aa-8334277204eb
Done reading
Writing out...
Done.
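For reference, here’s roughly what the interesting fields in my eeprom_settings.txt ended up looking like (a sketch only; the field names follow the sample file, and the values match the device tree output later in this post):

vendor "madebymikal.com"
product "GangScan"
product_id 0x0001
product_ver 0x0008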

And then we can flash our EEPROM, remembering that I’ve only managed to get flashing to work while the EEPROM is on bus 1 (pins 3 and 5):

$ sudo sh eepflash.sh -w -f=eeprom_settings.eep -t=24c256 -d=1
This will attempt to talk to an eeprom at i2c address 0xNOT_SET on bus 1. Make sure there is an eeprom at this address.
This script comes with ABSOLUTELY no warranty. Continue only if you know what you are doing.
Do you wish to continue? (yes/no): yes
Writing...
0+1 records in
0+1 records out
107 bytes copied, 0.595252 s, 0.2 kB/s
Closing EEPROM Device.
Done.

Now move the EEPROM back to bus 0 (pins 27 and 28) and reboot. You should end up with entries in the device tree for the HAT. I get:

$ cd /proc/device-tree/hat/
$ for item in *
> do
>   echo "$item: "`cat $item`
>   echo
> done
name: hat

product: GangScan

product_id: 0x0001

product_ver: 0x0008

uuid: b9e3b4e9-e04f-4759-81aa-8334277204eb

vendor: madebymikal.com

Now I can have my code detect whether the HAT is present and, if so, what version it is; a minimal version of that check is sketched below. Comments welcome!
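Something like this shell sketch is all that's needed (the paths are the ones shown above; this isn't the actual Gang Scan code):

if [ -d /proc/device-tree/hat ]; then
    # device tree strings are NUL terminated, so strip the trailing NUL
    product=$(tr -d '\000' < /proc/device-tree/hat/product)
    version=$(tr -d '\000' < /proc/device-tree/hat/product_ver)
    echo "Found HAT: ${product} (version ${version})"
else
    echo "No HAT EEPROM detected"
fi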

June 11, 2019

Codec 2 700C Equaliser

During the recent FreeDV QSO party, I was reminded of a problem with the codec used for FreeDV 700D. Some speakers are muffled and hard to understand, while others code quite nicely and are easy to listen to. I’m guessing the issue is around the Vector Quantiser (VQ) used to encode the speech spectrum. As I’ve been working on vector quantisation recently for the LPCNet project, I decided to have a fresh look at this problem with Codec 2.

I’ve been talking to a few people about the Codec 2 700C VQ and the idea of an equaliser. Thanks Stefan, Thomas, and Jean Marc for your thoughts.

Vector Quantiser

The FreeDV 700C and FreeDV 700D modes both use Codec 2 700C. Sorry about the confusing nomenclature – FreeDV and Codec 2 aren’t always in lock step.

Vector Quantisers are trained on speech from databases that represent a variety of speakers. These databases tend to have standardised frequency responses across all samples. I used one of these databases to train the VQ for Codec 2 700C. However when used in the real world (e.g. with FreeDV), the codec gets connected to many different microphones and sound cards with varying frequency responses.

For example, the VQ training database might be high pass filtered at 150 Hz, and start falling off at 3600 Hz. A gamer headset used for FreeDV might have low frequency energy down to 20 Hz, and have a gentle high pass slope such that energy at 3 kHz is 6 dB louder than energy at 1000 Hz. Another user of FreeDV might have a completely different frequency response on their system.

The VQ naively tries to match the input spectrum. If the input speech is shaped differently to the VQ training data, the VQ tends to expend a lot of bits matching the spectral shaping rather than concentrating on the important features of speech that make it intelligible. The result is synthesised speech that has artefacts and is muffled and harder to understand.

In contrast, commercial radios have it easier, they can control the microphone and input analog signal frequency response to neatly match the codec.

Equaliser Algorithm

I wrote an Octave script (vq_700c_eq.m) to look into the problem and try a few different algorithms. It allows me to analyse speech frame by frame and in batch mode, so I can listen to the results.

The equaliser is set to the average quantiser error for each of the K=20 bands in each vector, an algorithm similar to [1]. For these initial tests, I calculated the equaliser values as the mean quantiser error over the entire sample. This is cheating a bit to get an “early result” and test the general idea; a real-world EQ would need to adapt to the input speech.
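As a sketch of that calculation (my notation, not the script’s): with target vectors t_n(k) and first stage quantiser outputs q_n(k) over N frames, the equaliser and the equalised target are roughly

e(k) = (1/N) \sum_{n=1}^{N} ( t_n(k) - q_n(k) )
t'_n(k) = t_n(k) - e(k)

with the sign convention depending on how the quantiser error is defined.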

Here is a 3D mesh plot of the spectrum of the cq_ref sample evolving over time. For each 40ms frame, we have a K=20 element vector of samples we need to quantise. Note the high levels towards the low frequency end in this sample.

Averaging the first stage VQ error over the sample, we get the following equaliser values:

Note the large values at the start and end of the spectrum. The eq_hi curve is the mean error of just the high energy frames (ignoring silence frames).

Here is a snapshot of a single frame, showing the target vector, the first stage vector quantiser’s best effort, and the error.

Here is the same frame after the equaliser has been applied to the target vector:

In this case the vector quantiser has selected a different vector – it’s not “fighting” the static frequency response so much and can focus on the more perceptually important parts of the speech spectrum. It also means the 2nd stage can address perceptually important features of the vector rather than the static frequency response.

For this frame the variance (mean square error) was halved using the equaliser.

Results

The following table presents the results in terms of the variance (mean square error) in dB^2. The first column is the variance of the input data; samples with a wider spectral range tend to have a higher value. The idea of the quantiser is to reduce the variance (quantiser error) as much as possible. It’s a two-stage quantiser, with 9 bits (512 entries) per stage.

The two right hand columns show the results (variance after 2nd stage) without and with the equaliser, using the Codec 2 700C VQ (which I have labelled train_120 in the source code). On some samples it has quite an effect (cq_ref, cq_freedv_8k), less so on others. After the equaliser, most of the samples are in the 8dB^2 range.

--------------------------------------------------------
Sample        Initial  stg1    stg1_eq     stg2  stg2_eq
--------------------------------------------------------
hts1a         120.40   17.40   16.07       9.34    8.66
hts2a         149.13   18.67   16.85       9.71    8.81
cq_ref        170.07   34.08   20.33      20.07   11.33
ve9qrp_10s     66.09   23.15   14.97      13.14    8.10
vk5qi         134.39   21.52   14.65      12.03    8.05
c01_01_8k     126.75   18.84   14.11      10.19    7.51
ma01_01        91.22   23.96   16.26      14.05    8.91
cq_freedv_8k  118.80   29.60   16.41      17.46    8.97

In the next table, we use a different vector quantiser (all_speech), derived from a different training database. This used much more training data than train_120 above. In this case, the VQ (in general) does a better job, and the equaliser has a smaller effect. Notable exceptions are the hts1a/hts2a samples, which are poorer. They seem messed up no matter what I do. The c01_01_8k/ma01_01 samples are from within the all_speech database, so predictably do quite well.

---------------------------------------------------------
Sample        Initial  stg1    stg1_eq    stg2    stg2_eq
---------------------------------------------------------
hts1a         120.40   20.75   16.63      11.36    9.05
hts2a         149.13   24.31   16.54      12.48    7.90
cq_ref        170.07   22.41   17.54      12.11    9.26
ve9qrp_10s     66.09   15.29   13.88       8.12    7.29
vk5qi         134.39   14.66   13.25       7.95    7.17
c01_01_8k     126.75   10.64   10.17       5.19    5.07
ma01_01        91.22   14.09   13.20       7.05    6.68
cq_freedv_8k  118.80   17.25   13.23       9.13    7.04

Listening to the samples:

  1. The equaliser doesn’t mess up anything. This is actually quite important. We are modifying the speech spectrum so it’s important that the equaliser doesn’t make any samples sound worse if they don’t need equalisation.
  2. In line with the tables above, the equaliser improves samples like cq_ref, cq_freedv_8k and vk5qi on Codec 2 700C. In particular a bass artefact that I sometimes hear can be removed, and (I hope) intelligibility improved.
  3. The second (all_speech) VQ improves the quality of most samples. Also in line with the variance table for this VQ, the equaliser makes a smaller improvement.
  4. hts1a and hts2a are indeed poorer with the all_speech VQ. You can’t win them all.

Here are some samples showing the various VQ and EQ options. For listening, I used a small set of loudspeakers with some bass response, as the artefacts I am interested in often affect the bass end.

Codec 2 700C + train_120 VQ Listen
Codec 2 700C + train_120 VQ + EQ Listen
Codec 2 700C + all_speech VQ Listen
Codec 2 700C + all_speech VQ + EQ Listen

Using my loudspeakers, I can hear the annoying bass artefact being removed by the EQ (first and second samples). In the next two samples (all_speech VQ), the effect of the EQ is less pronounced, as the VQ itself does a better job.

Conclusions and Discussion

The Codec 2 700C VQ can be improved, either by training a new VQ, or adding an equaliser that adjusts the input target vector. A new VQ would break compatibility with Codec 2 700C, so we would need to release a new mode, and push that through to a new FreeDV mode. The equaliser alone could be added to the current Codec 2 700C implementation, without breaking compatibility.

Variance (mean squared error) is a pretty good objective measure of quantiser performance, and is aligned with the listening test results. Minimising variance is heading in the right direction. This is important, as listening tests are difficult and subjective in nature.

This work might be useful for LPCNet or other NN based codecs, which have a similar set of parameters that require vector quantisation, in particular if we want to use NN techniques at lower bit rates.

There is a remaining mystery over the hts1a/hts2a samples. They must have a spectrum that the equaliser can’t adjust effectively and that the VQ doesn’t address well. This suggests other equaliser algorithms might do a better job.

Another possibility is training the VQ to handle a wider variety of inputs by including static spectral shaping in the training data. This could be achieved by filtering the spectrally flat training database and appending the shaped data to the training set. However, this would increase the variance the VQ has to deal with, and possibly require more bits for a given level of VQ performance.

Reading Further

Codec 2 700C Equaliser Part 2

[1] I. H. J. Nel and W. Coetzer, “An Adaptive Homomorphic Vocoder at 550 bits/second,” IEEE South African Symposium on Communications and Signal Processing, Johannesburg, South Africa, 1990, pp. 131-136.

Codec 2 700C