Building Vision and Language Models with Implicit Supervision and Increased Efficiency