Sunday, July 12, 2015

Virtualizing Render Farm Recap

I should have started a work journal on virtualizing our render farm infrastructure much earlier. To recap what I have been doing for the past three months regarding our render farm:

Renderfarm Specs

16X HP ProLiant WS460c Gen8 WS Blade, 2 CPUs @ 10 cores @ 3.00GHz each, with a Quadro K4000 and 64GB RAM
23X workstations, mainly i7-4770, all with 32GB RAM
4X Dell PowerEdge R210 II E3-1230 V2 1 CPU@4Cores@3.30GHz with 32GB RAM
1X 30TB file server running Windows Server 2008, shared as the main work directory
1X QNAP 120TB NAS for backups of the 30TB file server, public, and user directories
1X QNAP 40TB NAS for portable data transfer
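For reference, a quick back-of-the-envelope tally of the farm's aggregate capacity from the specs above (a sketch; the i7-4770 is 4 cores/8 threads, and the counts here are physical cores with hyper-threading ignored):

```python
# Rough aggregate capacity of the render farm described above.
# Core counts are physical cores (hyper-threading ignored).
machines = [
    # (count, cores per machine, RAM GB per machine, label)
    (16, 2 * 10, 64, "HP WS460c Gen8 blade"),
    (23, 4, 32, "i7-4770 workstation"),   # i7-4770 is 4C/8T
    (4, 4, 32, "Dell R210 II (E3-1230 V2)"),
]

total_cores = sum(n * cores for n, cores, _, _ in machines)
total_ram = sum(n * ram for n, _, ram, _ in machines)
print(f"{total_cores} physical cores, {total_ram} GB RAM across the farm")
```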

ESXi Install Notes

1. Install ESXi 6.0 on all of our HP blades
2. Install ESXi on all of our older workstations that will also serve as render nodes
3. Inject custom drivers into the ESXi depot image using ESXi-Customizer and build custom ESXi installer images
  • Adding ALL drivers to the depot image gives me a 100% success rate across various workstations
  • Install ESXi on the older Dell blades
  • My custom ESXi image works with the Dell blades as well; the NIC drivers are included in the custom package

Setting up vCenter Server

  1. Install vCenter Server on one of the older Dell blades
    1. After install, the administrator cannot sign in; see "Unlocking and resetting the VMware vCenter Single Sign-On administrator password (2034608)".
    2. Reset the administrator@vsphere.local password with the utility detailed above and we should be good to go.
    3. A lot of these issues could have been sidestepped IF ONLY the render farm administrator were also the Domain Administrator. This would also allow a straightforward installation of Horizon View.
    4. With Horizon View, our effects artists could/should use alternate machines for simulations aside from their main workstations. Using Remote Desktop Connection is a NO, since RDC is not well suited for 3D graphics. Horizon View would allow remote desktops with access to the blades' K4000 graphics cards using DirectPath I/O. But since I do not have Domain Administrator rights, nor do I want to risk bringing down our domain by setting up my own, this is a no-go.
  2. We started with 4 datacenters since we need to divvy up our resources among several departments: one in Taipei, plus 3ds Max/V-Ray, lighting/rendering, and effects. 8 blades are reserved for Taipei and the rest for Taichung; 8 blades go to lighting/compositing, and the workstations go to 3ds Max/V-Ray.
  3. For render farm management we're running Virtual Vertex Muster 7.0.7.
  4. V-Ray uses its own distributed render node utility, so it is harder to monitor/manage usage; thus we're giving V-Ray users the workstation nodes.
  5. Each VM should be thin provisioned to save datastore space; otherwise, one needs to migrate VMs to another datastore and back.
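To illustrate why thin provisioning matters here, a toy calculation (the VM count, disk size, and guest usage below are hypothetical numbers for illustration, not our actual figures):

```python
# Thick provisioning reserves the full virtual disk up front;
# thin provisioning only consumes what the guest has actually written.
vm_count = 40          # hypothetical number of render-node VMs
disk_size_gb = 100     # provisioned virtual disk per VM
actual_usage_gb = 35   # space each Windows guest actually uses

thick_total = vm_count * disk_size_gb
thin_total = vm_count * actual_usage_gb
print(f"thick: {thick_total} GB, thin: {thin_total} GB, "
      f"saved: {thick_total - thin_total} GB")
```

The gap only widens as VMs are added, which is why thick provisioning forces the migrate-to-another-datastore-and-back shuffle.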

Setting up Virtual Render Nodes Notes

For each physical host, we could run one Windows instance and have Muster run multiple render instances. This SHOULD be the most efficient setup, but we ended up with two Windows instances, each running two render instances; render performance is curiously better this way. One minor issue is that our render manager server (Muster) could not be virtualized, since its license is tied to the MAC address (it's also the 30TB file server, so we can't fake the MAC address either).
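The two configurations above can be sketched with some quick arithmetic (a sketch only, assuming the render instances split a blade's physical cores evenly):

```python
# Each blade: 2 CPUs x 10 cores = 20 physical cores, 64 GB RAM.
blade_cores, blade_ram_gb = 20, 64

def per_instance(windows_vms, instances_per_vm):
    """Cores per Muster render instance, RAM per Windows VM."""
    total_instances = windows_vms * instances_per_vm
    return blade_cores // total_instances, blade_ram_gb // windows_vms

# Option A: one Windows VM running several render instances.
print(per_instance(1, 4))   # -> (5, 64)
# Option B (what we use): two Windows VMs, two render instances each.
print(per_instance(2, 2))   # -> (5, 32)
```

Per-instance core counts come out the same either way, which makes the observed performance difference between the two layouts all the more curious.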

Issues in Production

  1. V-Ray renders will almost certainly eat up all available resources, so it is advisable to separate them from the Muster-managed render nodes.
  2. VMs should be put in clusters instead of datacenters, as clusters provide better resource management and monitoring. For example, by grouping all physical hosts in a cluster, I can view aggregate performance data for all machines instead of going through each host in a datacenter one by one.
  3. Memory overcommitment will crash the VM. It is advisable to cap memory allocation at around 80% of host RAM.
  4. High Availability and DRS in conjunction with iLO are interesting. I'm not sure if vMotion is required for HA. Using DRS and iLO to enable power savings is also interesting: it could let us put the render farm to sleep overnight with some automatic sleep/wake functionality. Each blade uses around 1400W idle and each workstation around 140W idle, so there's definitely a lot of saving to be done here.
  5. If HA and DRS are not up and running, there need to be users with limited access who can power on/off and reset machines. This is done through the vSphere web interface by adding roles and assigning users to them. Also, turn off the password restrictions, as render farms aren't that security-conscious.
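To put the idle power numbers above in perspective, here's a rough estimate of what overnight sleep could save (the sleep window and month length are assumptions; the wattages are the idle figures quoted above):

```python
# Idle draw from the notes above: ~1400 W per blade, ~140 W per workstation.
blades, blade_idle_w = 16, 1400
workstations, ws_idle_w = 23, 140

idle_kw = (blades * blade_idle_w + workstations * ws_idle_w) / 1000
sleep_hours_per_day = 10   # assumed overnight window
days_per_month = 30        # assumed month length

kwh_saved = idle_kw * sleep_hours_per_day * days_per_month
print(f"idle draw: {idle_kw:.2f} kW; "
      f"~{kwh_saved:.0f} kWh/month saved by sleeping overnight")
```

Even ignoring the workstations, sleeping the blades alone would reclaim the bulk of that, which is why the DRS/iLO power management angle seems worth chasing.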