Recently we faced a strange problem on one of our x86 systems: after a few successful boots the system could no longer start the OS, but once we deleted the ‘Documents and Settings’ folder on the storage device (we included the hive-based registry in the OS) the system behaved normally again.
After some investigation we found that, whenever the system failed to boot, the USB OHCI driver was stuck in an endless loop waiting for the host controller to reset, which prevented the device manager from loading the other drivers (never write an XXX_Init routine which can spin forever!).
The code looks like:
m_portBase->HcCommandStatus.HCR = 1;
while (m_portBase->HcCommandStatus.HCR == 1)
    ;   // spins forever if the controller never clears the bit
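As an illustration of the "never spin forever" advice, here is a minimal sketch of a bounded reset wait. It is not the actual OHCI driver code: ResetHostController, ResetResult and the timeout parameter are hypothetical names, the register is modelled as a plain 32-bit word with HCR in bit 0 (as in the OHCI specification), and standard C++ timing is used in place of whatever the platform driver would really call.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical outcome codes for this sketch.
enum class ResetResult { Ok, NotPresent, TimedOut };

// Hypothetical helper: `commandStatus` points at the 32-bit HcCommandStatus
// register; bit 0 is HCR (HostControllerReset).
inline ResetResult ResetHostController(volatile std::uint32_t* commandStatus,
                                       std::chrono::milliseconds timeout)
{
    constexpr std::uint32_t kHcr = 0x1;

    // A register window with no device behind it typically reads as all ones,
    // which is what we saw in the debugger. Bail out instead of waiting.
    if (*commandStatus == 0xFFFFFFFFu)
        return ResetResult::NotPresent;

    *commandStatus = kHcr;  // request the reset

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (*commandStatus & kHcr) {
        if (std::chrono::steady_clock::now() >= deadline)
            return ResetResult::TimedOut;  // give up so other drivers can load
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return ResetResult::Ok;
}
```

With a check like this the init routine can fail cleanly and let the device manager carry on with the remaining drivers.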
With the debugger we found that the whole memory area used to map the USB controller registers read as 0xFF (!), so the bit was never cleared. This looked very strange, so we started looking at the system registry and found that the BAR (Base Address Register) of the USB host on the PCI bus was 0x80003000 while the device was actually mapped at 0xDA000 (both flat physical addresses).
One more piece of information: our BIOS implements legacy USB support. When it assigns the resources to the PCI bus, the BIOS maps the USB host at 0x80003000. If legacy USB support is enabled, the BIOS then remaps the USB host at 0xDA000 (under the 1MB limit), configures it and looks for connected USB devices it can manage (i.e. storage and keyboard). If no such device is detected the controller is remapped back to 0x80003000, otherwise it remains at 0xDA000 (this allows the BIOS to boot the OS from a USB disk by redirecting INT13 to the USB, but this is another story…).
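The BIOS behaviour described above boils down to a small decision. The sketch below is only a model of it, not BIOS code: UsbHostBaseAfterPost and both constant names are made up for illustration, and only the two addresses come from our system.

```cpp
#include <cstdint>

// Addresses observed on our system; the constant names are hypothetical.
constexpr std::uint32_t kPciAssignedBase = 0x80003000;  // set during PCI resource assignment
constexpr std::uint32_t kLegacyBase      = 0x000DA000;  // under the 1MB limit

// Hypothetical model: where the USB host ends up once the BIOS hands over
// to the OS, given the legacy USB setting and whether a manageable device
// (storage or keyboard) was detected.
inline std::uint32_t UsbHostBaseAfterPost(bool legacySupportEnabled,
                                          bool manageableDeviceFound)
{
    if (legacySupportEnabled && manageableDeviceFound)
        return kLegacyBase;       // controller stays remapped for the BIOS
    return kPciAssignedBase;      // controller is (re)mapped at the PCI address
}
```

The point to keep in mind is that the final address depends on what is plugged in at power-on, so it can change from one boot to the next.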
After accusing a colleague of mine (he wrote the BIOS after all!) we realized that you could trigger the problem this way:
- Connect the USB keyboard to the system, turn it on and boot a clean OS (no hives on the storage).
- Turn off the system, disconnect the keyboard and turn the system on: the OS does not boot anymore.
- Or the other way round: turn on the system (no USB keyboard) and boot a clean OS (no hives on the storage).
- Turn off the system, connect the keyboard and turn the system on: the OS does not boot anymore.
Looking at the PCIBUS code the problem became clear: on a cold boot the bus driver enumerates the PCI devices and builds the ‘Instance’ keys based on the ‘Template’ keys. Under the ‘Instance’ key the bus driver saves relevant information such as the SysIntr value and the BARs (look here for more information).
The PCIBUS driver is quite smart so, in subsequent boots, it enumerates the PCI devices and analyzes the ‘Instance’ keys first (with the hive-based registry they are already set up). This allows the bus driver to avoid loading a driver for a device which is not present anymore. If it finds that there is already an ‘Instance’ key for a specific device (with matching PCI IDs and location in terms of bus/device/function) it says “Ha ha! Here it is, do not bother to recreate the ‘Instance’ key or any of its values for this device, not even the BAR value”.
Do you see the problem? When you boot the first time with the USB keyboard connected the BIOS maps the USB host at 0xDA000 which is the address saved in the hive under the ‘Instance’ key. Some time later you will turn on the system without the keyboard, the BIOS will map the host at 0x80003000 but the USB host device driver will retrieve the wrong 0xDA000 BAR from the registry, thus accessing unused RAM instead of the controller registers.
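To make the failure concrete, here is a small simulation of the reuse logic. It is a sketch and not the PCIBUS source: the hive is modelled as an in-memory map, and PciDevice, InstanceKey, Hive and EnumerateDevice are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <tuple>

// What the bus driver sees on the bus at this boot (hypothetical model).
struct PciDevice {
    std::uint16_t vendorId;
    std::uint16_t deviceId;
    int bus, device, function;
    std::uint32_t bar0;  // as programmed by the BIOS at this boot
};

// What was saved under the 'Instance' key at an earlier boot.
struct InstanceKey {
    std::uint16_t vendorId;
    std::uint16_t deviceId;
    std::uint32_t savedBar;
};

// The persistent hive, keyed by bus/device/function.
using Hive = std::map<std::tuple<int, int, int>, InstanceKey>;

// Returns the BAR the device driver will later read from the registry.
inline std::uint32_t EnumerateDevice(Hive& hive, const PciDevice& dev)
{
    const auto loc = std::make_tuple(dev.bus, dev.device, dev.function);
    const auto it = hive.find(loc);
    if (it != hive.end() &&
        it->second.vendorId == dev.vendorId &&
        it->second.deviceId == dev.deviceId) {
        return it->second.savedBar;  // key reused as-is: the BAR may be stale
    }
    hive[loc] = InstanceKey{dev.vendorId, dev.deviceId, dev.bar0};
    return dev.bar0;                 // first boot: key built from current values
}
```

Enumerating the same device twice, first with bar0 set to 0xDA000 and then with 0x80003000, returns 0xDA000 both times, which is exactly the mismatch we hit.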
I cloned the PCIBUS and modified it so that it ignores the ‘Instance’ key for PCI USB hosts, deletes it and recreates it at every boot based on the ‘Template’ key.
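The idea behind the change can be sketched as follows. Again this is an illustrative model rather than the modified driver: PciFunction, SavedBars, IsUsbHost and ResolveBar are hypothetical names, and identifying USB hosts by PCI class code (base class 0x0C, subclass 0x03) is an assumption of this sketch about how one might pick them out.

```cpp
#include <cstdint>
#include <map>
#include <tuple>

// A PCI function as found on the bus at this boot (hypothetical model).
struct PciFunction {
    int bus, device, function;
    std::uint8_t baseClass;
    std::uint8_t subClass;
    std::uint32_t bar0;  // as programmed by the BIOS at this boot
};

// Saved BAR per bus/device/function, standing in for the 'Instance' keys.
using SavedBars = std::map<std::tuple<int, int, int>, std::uint32_t>;

// PCI class code: base class 0x0C (serial bus), subclass 0x03 (USB).
inline bool IsUsbHost(const PciFunction& f)
{
    return f.baseClass == 0x0C && f.subClass == 0x03;
}

// Returns the BAR the device driver will read from the registry.
inline std::uint32_t ResolveBar(SavedBars& saved, const PciFunction& f)
{
    const auto loc = std::make_tuple(f.bus, f.device, f.function);
    if (IsUsbHost(f))
        saved.erase(loc);  // never trust a saved entry for USB hosts

    // emplace keeps an existing entry and only inserts when none is present.
    const auto result = saved.emplace(loc, f.bar0);
    return result.first->second;
}
```

Other devices keep the fast path through the saved values, while a USB host always gets the address the BIOS programmed at the current boot.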