Thursday, December 16, 2021

wither network boot?

I thought about pre-OS networking recently after giving a talk to other engineers about the value of EQ + IQ.  For the latter, what I called "Technical IQ" in http://vzimmer.blogspot.com/2013/03/a-technical-career-path.html, I use the T-shaped model https://careeredge.bentley.edu/blog/2015/04/21/how-gaining-t-shaped-skills-will-give-you-an-edge/. In this model, networking has been in the depth, or vertical bar, of my T since evolving the EFI network stack from a simple legacy PXE & UNDI thunk wrapper in 1999 to the native mode implementation in the early 2000's, certifying a clean-room IPv6 stack, and moving further into scalable use cases with wireless and HTTP boot a half decade ago. In the late days of 2021 perhaps it's not as prominent in the vertical bar, but it's an important capability nevertheless.

As part of this survey, this blog is an elaboration upon some arbitrary points on the timeline of UEFI firmware network boot history, in the spirit of older postings like http://vzimmer.blogspot.com/2013/02/anniversary-daynext-arch-ps-and-some.html. This latter posting mentioned the 1.2 work https://github.com/vincentjzimmer/Documents/blob/master/EFIS001_100_2004.pdf, including client authentication w/ EAP https://www.semanticscholar.org/paper/Cloud-Net-Booting-Beyond-BIOS-Using-the-Unified-Zimmer/8f158dd172ca406095d051cc9bb5a7f5cc09435b and the folly of trying to create bespoke authentication methods.


This same work was described in the https://github.com/vincentjzimmer/Documents/blob/master/SAM6560.pdf paper https://dblp.uni-trier.de/rec/conf/csreaSAM/Zimmer09.html?view=bibtex in 2009, the same year as the Berkeley Cloud paper that it cited https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf. At the time it was becoming increasingly obvious that an HTTP-based boot was necessary. We in fact left provision for the same in https://tools.ietf.org/rfc/rfc5970.txt.

Page 97 of https://www.intel.com/content/dam/www/public/us/en/documents/research/2011-vol15-iss-1-intel-technology-journal.pdf notes the full move to standard network authentication protocols.

Another place I captured some history of pre-OS UEFI networking was in https://github.com/vincentjzimmer/Documents/raw/master/A-Quick-History-of-UEFI-Networking.pdf. This was just a quick background that I ultimately integrated into https://github.com/vincentjzimmer/Documents/blob/master/UEFI-Recovery-Options-002-1.pdf in 2016. This folds in advances such as UEFI wireless support and HTTP boot that I led through the UEFI Network (UNST) & Security (USST) subteams https://uefi.org/sites/default/files/resources/UEFI_Plugfest_VZimmer_Fall_2016.pdf, respectively, in the early portion of that decade.  Commercial usage of HTTP boot was described by HP Enterprise, too.

Although https://www.uefi.org/sites/default/files/resources/UEFI_Plugfest_VZimmer_Fall_2016.pdf covers the updates after wireless UEFI boot, the suggested next paradigm of pre-OS network bootstrap, 'NVMe over Fabric UEFI boot', has been evolved in the NVMe standards group rather than in my UNST in UEFI https://linuxplumbersconf.org/event/7/contributions/737/attachments/531/944/LPC_2020_-_NVMe-oF_Boot_from_Ethernet_-_Final.pdf. I discovered that working the standard closer to the subject matter experts is best, just as I created the EFI TCG Protocols http://people.eecs.berkeley.edu/~kubitron/courses/cs194-24-S14/hand-outs/SF09_EFIS001_UEFI_PI_TCG_White_Paper.pdf in the Trusted Computing Group rather than the UEFI Forum.

At other times it is satisfying to see the evolution of requirements that support important use cases. There has been a spate of activity in enabling https://csrc.nist.gov/publications/detail/sp/800-193/final over the last several years, including low level infrastructure code like https://github.com/vincentjzimmer/Documents/blob/master/A_Tour_Beyond_BIOS_Capsule_Update_and_Recovery_in_EDKII.pdf. The idea therein entails the platform having some degree of fault tolerance for the system board firmware and device firmware.


Beyond 193, though, there are scenarios where the operating system can become corrupted, such as the infamous https://www.tomshardware.com/news/malware-shamoon-virus-security-disttrack,16988.html. There are some early thoughts about options here https://www.it-scc.org/uploads/4/7/2/3/47232717/requirements_for_recoverable_systems.pdf that happen to nicely match some of the OS recovery infrastructure defined in https://github.com/vincentjzimmer/Documents/blob/master/UEFI-Recovery-Options-002-1.pdf, especially figure 7 of the latter mapped to figure 2 of the former
which not surprisingly bear resemblance to 



During that effort above on OS recovery it was noted that although we defined UEFI network support in the UEFI standard and some infrastructure for the same in https://github.com/tianocore/edk2/blob/master/MdePkg/Include/Protocol/Supplicant.h, great support for multiple devices and protocols was available in https://ipxe.org. iPXE has historically been a favorite of datacenter operators using firmware-based networking, including the iPXE scripting https://ipxe.org/scripting. In fact, after talking with some hyperscalers in the mid 2000's, I tried to create a 'safe shell' subset to use .nsh in lieu of iPXE scripting for deployment w/ the UEFI network stack, to provide coequal and, in the limit, script-compatible behavior between iPXE and UEFI. This didn't work out for various technical and community reasons.

This usage of iPXE in the pre-OS speaks to the challenge of OS-absent networking. There has always been a complaint that UEFI didn't solve one of the fundamental systems problems, namely needing to write drivers twice: once for the boot environment, and once for the OS. Vertically integrated folks, like Apple with its Intel UEFI Macs, can have an OS kernel engineer and firmware engineer who is a domain expert do both, but in the horizontal industry where the OS provider and platform provider are different business entities, this is more difficult. And historically the OS folks eschewed using firmware at runtime, even for safe mode, which led to deprecating UNDI at runtime.

In the past some of the DEC Alpha platform manufacturers wrapped Windows NT NDIS drivers in the ARC firmware, but this covered only the port-level datagram interface.  The higher level networking primitives had to be curated for each domain (pre-OS, OS runtime). Other challenges in pre-OS networking include security - packets on the wire (or on the air with wireless) present a huge attack surface since the incoming packets are ostensibly attacker-controlled data. The use of hardware VPNs can ameliorate some of this concern for deployments in the traditional enterprise, but the world of borderless/zero trust architectures moots that argument.
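
To make the attacker-controlled-data point concrete, here is a minimal sketch in Python of the kind of check a pre-OS datagram parser has to make before touching a payload. The function and exception names are hypothetical and are not taken from EDK II or any other stack; the only real-world fact assumed is the standard 8-byte UDP header layout (source port, destination port, length, checksum). Checksum verification is deliberately left out.

```python
import struct
from typing import Tuple

UDP_HEADER_LEN = 8  # source port, destination port, length, checksum (2 bytes each)


class MalformedPacket(ValueError):
    """Raised when a datagram fails a sanity check (hypothetical name)."""


def parse_udp(datagram: bytes) -> Tuple[int, int, bytes]:
    """Return (source_port, dest_port, payload) or raise MalformedPacket."""
    if len(datagram) < UDP_HEADER_LEN:
        raise MalformedPacket("shorter than a UDP header")
    src, dst, length, _checksum = struct.unpack_from("!HHHH", datagram, 0)
    # The length field comes off the wire, so it is attacker-controlled:
    # never trust it beyond the number of bytes actually received.
    if length < UDP_HEADER_LEN or length > len(datagram):
        raise MalformedPacket("length field disagrees with buffer size")
    # Checksum is not verified in this sketch.
    return src, dst, datagram[UDP_HEADER_LEN:length]
```

In C firmware the same two comparisons are what stand between a lying length field and an out-of-bounds read, which is why every parser on the receive path is part of the attack surface.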

One potential approach to more robustness may include using language-based security, such as Rust and implementations such as smoltcp https://github.com/smoltcp-rs/smoltcp. In the early 2000's I explored with Linux kernel EFI maintainer Matt Fleming some options for encapsulating Linux in the pre-OS for the network stack and then returning to the UEFI environment to boot the downloaded image. There were a lot of intricacies about the state of the platform in the absence of exit boot services, and other blockers, that put that exploration on the shelf.  At the time there were already a lot of usages of embedding Linux in the flash as the boot environment, but it was a one-way gate from UEFI: UEFI->Linux, never UEFI->Linux->UEFI->Final OS.

It is nice to see that latter use-case having been re-invigorated in the last few years with https://www.linuxboot.org/ in the datacenter. Instead of UEFI netboot->Linux or the chain loading of UEFI netboot->ipxe->Linux, you have the platform firmware, whether based on UEFI, coreboot, slimboot or something else, directly launch Linux and its integrated networking support from the system board flash.

In addition to the interesting work Trammell Hudson has done in firmware security, I am happy to see his work in this space of using Linux for pre-OS networking https://trmm.net/LinuxBoot_34c3/. This includes a recent patch to interleave UEFI and Linux execution, the design point Fleming and I explored years ago.

The other historical irony I have to mention about LinuxBoot and firmware relates to a conversation I had on the showcase floor at the Ubuntu Developer Summit in Oakland, CA in 2012. My colleague brought Mark Shuttleworth over to have a conversation, ostensibly about UEFI secure boot and some of the other recent features I had been working on in this space. Shuttleworth nodded his head in response to the overview and then left me with a single question: "Why don't you just use Linux as the firmware to boot the operating system?"  Of course supply chain, horizontal industry COTS, .... and the myriad other social, technical and business reasons from my confirmation bias vantage of working on UEFI and EDKII didn't really empower me to have a quick answer :)

So we mentioned LinuxBoot above, which obviates the need for the UEFI network stack in some use-cases, especially Cloud. That doesn't mean that UEFI networking isn't important. In fact enterprise and client devices still deploy the capability today, and it definitely has challenges, including performance and robustness. On the former, we noted some of the glass jaws of the polled UEFI driver model and its impact on perf in https://uefi.org/sites/default/files/resources/7_Maciej%20Vincent_INTEL_network%20stack%20performance.pdf and the associated implementation using lwIP and multiprocessing https://github.com/tianocore/edk2-staging/tree/MpNetworkStack. For security we can go beyond just the forklift upgrade of a Rust port and, near-term, isolate the network stack drivers in ring 3 or user mode https://github.com/jyao1/SecurityEx/tree/master/UserModePkg or put them into a VM https://www.intel.com/content/dam/develop/external/us/en/documents/a-tour-beyond-bios-launching-vmm-in-efi-developer-kit-ii-0-819978.pdf. In fact, the 2007 paper laying the groundwork for UEFI secure boot in SAM07 https://www.semanticscholar.org/paper/Platform-Trust-Beyond-BIOS-Using-the-Unified-Zimmer/0bd3bdeb6dcadf088137e13c00adc7e4390fa0de was followed in 2008 by https://www.semanticscholar.org/paper/System-Isolation-Beyond-BIOS-using-the-Unified-Zimmer/cf0261fe8d8dc078fb389dc04a56188695581949, which included a VMM and rings for firmware.
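
As a back-of-the-envelope illustration of why a polled model can cap throughput, here is a small Python sketch. It assumes a deliberately simplified model that is mine and not taken from the slides above: the receive ring is drained once per poll, so anything beyond the ring's capacity that arrives between two polls is dropped. The function name and all numbers are hypothetical.

```python
def max_lossless_rx_mbps(ring_slots: int, frame_bytes: int, poll_interval_ms: float) -> float:
    """Upper bound on sustained receive rate (Mbit/s) when the ring is drained once per poll."""
    if ring_slots <= 0 or frame_bytes <= 0 or poll_interval_ms <= 0:
        raise ValueError("all parameters must be positive")
    bits_per_poll = ring_slots * frame_bytes * 8   # most data one poll can collect
    polls_per_second = 1000.0 / poll_interval_ms   # how often the timer-driven poll runs
    return bits_per_poll * polls_per_second / 1e6


# Hypothetical numbers: a 16-slot ring of 1500-byte frames polled every 10 ms.
print(max_lossless_rx_mbps(16, 1500, 10.0))
```

Under this model the bound scales with ring size and inversely with the poll interval, so a slow poll leaves a fast link mostly idle no matter how quick the wire is. That is the shape of the problem that interrupts or a dedicated processor for the stack try to address.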

Firmware device interrupts http://vzimmer.blogspot.com/2015/06/guids-revisions-interrupts.html, ring isolation, multiprocessing, etc. all seem to argue for just diving into a full operating system, but a deft hand can still apply some of these design precepts in pre-OS firmware. Driving change is hard, though. I recall a quote from a colleague about another engineer: 'he spent a lot of time working on a problem that engineers (at the company) didn't want solved.' In other words, a good business problem isn't enough. For various historical and / or cultural reasons the existing engineering community may not want to pursue a solution. A mental model I sometimes use for this engineering reluctance is WWI/WWII. Leaders who came up in some technology era understand the guardrails and technologies for success there - mapping to WWI, they are great trench builders and trench warfare combatants. And the ultimate example of trench warfare skills was the Maginot line https://en.wikipedia.org/wiki/Maginot_Line.


Technology changes. Say going from mainframe to mini, mini to PC, PC to Cloud/post-PC, anything to mobile, .... A phase and its successor may represent a different corpus of technology or business constraints. Let's call one of these transitions going from a WWI to a WWII based technology environment. When WWII commenced, airplanes just flew over the Maginot line. Just like your real competition professionally is yourself, not others, the way to leverage this insight is to avoid leaning on your trench-digging skills when the world war precepts have changed, and more importantly to continually evolve skills and practices. As for the sentiment in the quotation above, a technologist should take a constructive approach that includes providing data on the environment change and technical options that can inform the organization and potentially evolve the bench from WWI to WWII class readiness.

OK. Enough on this topic of the past. If there are any more postings in '22 they will catch up to more recent events and commentary in the firmware space. 
