It is not a big deal for me to admit that I am a huge nerd when it comes to managing baremetal infrastructure in the datacenter. I feel like a dying breed at a time when everyone just wants to swipe a credit card and go to the public cloud. Nevertheless, there are still workloads out there that are too expensive to run in a public cloud or too sensitive to be exposed in the "public" space. Proper automation and management of datacenter resources go a long way toward gaining a competitive advantage. After all, if everyone is in the public cloud, there is really no way for anyone to leap ahead; running your own infrastructure lets you construct a model that delivers services or software for, let's say, half the cost (at least from the infrastructure perspective) and twice as fast.
Disclaimer: I am not authorized to speak on behalf of Red Hat, nor am I sponsored by Red Hat or representing Red Hat and its views. The views and opinions expressed on this website are entirely my own.
One of the projects that I have been a big fan of since its original release is networking-ansible for OpenStack Ironic. There are some interesting problems that I have been able to help solve with net-ansible, including:
– proper multitenancy model for baremetal
– mix of BM and VM OpenShift IPI deployment
– ability to rapidly move expensive servers (with GPUs) between multiple projects/tenants
– agile benchmarking of hardware prior to its release, shifting it between different QA teams
– transforming a datacenter to act as a render farm at night and a pool of production workstations during the day
– ability to quickly move nodes between network leaf switches
To be totally transparent, I have hit some challenges with this net-ansible model as well:
– the networking ops team not wanting to hand over control of the network switches (job security)
– security policies preventing software from tapping into the physical network infrastructure
– lack of drivers and support for all the different types of networking equipment in the datacenter
There is really nothing I can do to solve the first two, but I decided to take a stab at solving the last point and show you the process of writing drivers for any* switch. And by any*, I really mean any switch for which there is overall Ansible support. Check out the list here -> https://docs.ansible.com/ansible/latest/modules/list_of_network_modules.html. If your switch is on this list, chances are you will be able to write a driver in no time. It took me about 2 hours to write and test a driver for the Lenovo enos switches that my employer happens to own in one of the datacenters. I can't take all the credit myself though. I would probably not have been successful without the help of my colleague Dan Radez, whom I consider the "father" of the net-ansible project.
I. Writing a net-ansible switch driver:
1. The starting point is to clone the repositories of the existing solution. There is a standalone project called network-runner that is somewhat independent from OpenStack and provides the drivers for the newer versions of OpenStack:
https://github.com/ansible-network/network-runner
2. In my case I also wanted to backport my driver to OSP13 (Queens), which is still in use in my lab. For that I cloned the queens branch of the following OpenStack repo:
https://opendev.org/x/networking-ansible/src/branch/stable/queens
3. Next, you need to write the actual driver inside the proper directory:
etc/ansible/roles/openstack-ml2/providers/ ← OpenStack repo
etc/ansible/roles/network-runner/providers/ ← Network-runner repo
4. I have used Dan’s original Junos commit as a reference. It is simple and effective:
https://github.com/ansible-network/network-runner/commit/4c37f8b51cede3754c4b05b453bef449c8a655f7
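To give a rough sense of the shape of these provider files, below is a minimal sketch of a create_network task in the style of that Junos commit, using the junos_vlan module that ships with Ansible. The variable names (vlan_id, vlan_name) are placeholders of my own; copy the exact names from the existing providers rather than from this sketch.

---
# create_network.yaml – sketch of a provider task (variable names are placeholders)
- name: create the VLAN on the switch
  junos_vlan:
    vlan_id: "{{ vlan_id }}"
    name: "{{ vlan_name }}"
    state: present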
5. I ended up writing the following for the Lenovo enos switch:
https://github.com/ansible-network/network-runner/pull/49/files
6. To backport the driver to the Queens and Rocky releases (OSP13/14), I had to change a few variables. The results can be seen here:
https://review.opendev.org/#/c/741219/
II. Testing your driver
1. There are at least two things you will need in order to test your driver:
– the switch hooked up to Ironic BM nodes
– a working OpenStack environment with net-ansible enabled
2. I have added the following configuration to my OpenStack templates:
resource_registry:
  OS::TripleO::Services::NeutronCorePlugin: OS::TripleO::Services::NeutronCorePluginML2Ansible

parameter_defaults:
  NeutronMechanismDrivers: openvswitch,ansible
  NeutronTypeDrivers: local,vlan,flat,vxlan
  ML2HostConfigs:
    sb_chassis2_sw1:
      ansible_host: '10.9.57.86'
      ansible_network_os: 'enos'
      ansible_ssh_pass: 'XXXXX'
      ansible_user: 'XXXXX'
      manage_vlans: 'false'
      mac: 'a4:8c:db:60:75:00'
    sb_chassis2_sw2:
      ansible_host: '10.9.57.87'
      ansible_network_os: 'enos'
      ansible_ssh_pass: 'XXXXX'
      ansible_user: 'XXXXXX'
      manage_vlans: 'false'
      mac: 'a4:8c:db:60:92:00'
  IronicDefaultNetworkInterface: neutron
I hope most of these are self-explanatory. I like using the mac field for the switches in order to take advantage of "zero touch" provisioning, where Ironic introspection automatically maps each node to the proper switch based on LLDP data. The ansible_network_os value needs to match the name of the driver that has just been written. The rest of the parameters are standard.
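If you want to double-check what introspection has picked up over LLDP before relying on the mac mapping, you can dump the stored introspection data for a node. The jq filter below is only an illustration of where the interface/LLDP details usually live, and the node name is a placeholder:

openstack baremetal introspection data save <node-name-or-uuid> | jq '.all_interfaces'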
3. Finally, after my overcloud was updated to include net-ansible, I manually added my freshly written driver into the neutron server container on the controllers:
[root@hextupleo-controller-0 ~]# docker exec -u root -ti neutron_api /bin/bash
()[root@hextupleo-controller-0 /]# cd /usr/share/ansible/roles/openstack-ml2/providers/
()[root@hextupleo-controller-0 providers]# ls /usr/share/ansible/roles/openstack-ml2/providers/enos/
conf_trunk_port.yaml create_network.yaml delete_network.yaml delete_port.yaml update_port.yaml
In OSP15/16 (Stein/Train) and above you would have to replace the word docker with podman.
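For example, the same entry point on a newer release would look roughly like this (assuming the container is still named neutron_api):

podman exec -u root -ti neutron_api /bin/bash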
4. You are now ready to add/discover a new BM node in Ironic and deploy an operating system to it to see if the driver is working (a minimal sketch follows the list below). I like watching the following as the nodes are getting cleaned and/or deployed:
– my switch configuration getting changed
– tail -f /var/log/containers/neutron/server.log (for errors)
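A minimal way to exercise the driver end to end is to create a VLAN tenant network and boot a baremetal instance on it. The physical network, flavor and image names below are assumptions from my lab and will differ in yours:

openstack network create --provider-network-type vlan --provider-physical-network datacentre tenant-net-test
openstack subnet create --network tenant-net-test --subnet-range 192.168.50.0/24 tenant-subnet-test
openstack server create --flavor baremetal --image overcloud-full --network tenant-net-test bm-driver-test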
5. I didn't get it right on the first try and you probably won't either .. just keep iterating on the changes until you are satisfied with the final result. Make sure to also deploy baremetal nodes with trunk ports, since this is part of the functionality (see the sketch after the next note).
Also, keep in mind that no container restart is required, since this is just an ansible playbook being executed inside the neutron container.
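As a rough sketch of the trunk test (all names and the segmentation ID below are placeholders, not from my lab), you can create a parent port, attach a subport from a second network while creating the trunk, and then boot the baremetal instance on the parent port:

openstack port create --network tenant-net-test bm-trunk-parent
openstack port create --network second-tenant-net bm-trunk-subport
openstack network trunk create --parent-port bm-trunk-parent --subport port=bm-trunk-subport,segmentation-type=vlan,segmentation-id=101 bm-trunk
openstack server create --flavor baremetal --image overcloud-full --port bm-trunk-parent bm-trunk-test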
III. Sharing is caring – submit your driver back upstream
1. Congratulations on making this work for yourself. Now it's time to share what you've done with the world. Since you originally cloned from two repositories, it would be wise to send your changes back to the same places.
2. For the OpenStack (opendev) repos I have simply followed the instructions here:
https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
3. In order to submit the driver to the standalone project (network-runner), the process is slightly different. You first need to fork the project to your own github. Then add and commit changes. Finally create a pull request back to the master branch.
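For reference, the GitHub side of that flow is just the usual fork-and-pull-request dance. The fork URL, branch and driver names below are placeholders:

git clone https://github.com/<your-github-user>/network-runner.git
cd network-runner
git checkout -b add-myswitch-driver
# add your provider directory under etc/ansible/roles/network-runner/providers/myswitch
git add etc/ansible/roles/network-runner/providers/myswitch
git commit -m "Add provider driver for myswitch"
git push origin add-myswitch-driver
# then open a pull request back to the master branch on GitHub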
Disclaimer – submitting your driver upstream does not make the driver supported by your OpenStack vendor.
You can now enjoy your baremetal infrastructure being managed in an even more dynamic way.