2015 update — Code samples updated to use latest Ruby (2.2.3).

Following my post on hooks, I thought it would be interesting to share the details of object creation in Ruby. This post summarizes the relevant pieces of MRI (Matz's Ruby Interpreter) that implement that core process.

I will be using this piece of code as my starting point:

class Foo
  def initialize
    puts "hello!"
  end
end

Foo.new # Let's go!

Calling Foo.new

We are calling #new on Foo, which is a "class object", i.e. an instance of the class Class. You may verify that Foo.class indeed returns Class. Since we did not define a Foo.new method ourselves, the call is handled by an instance method of Foo's class: Class#new. That is where we need to look to get started.
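A quick sanity check of those claims, as a small sketch run against the Foo class from above (plain reflection calls, nothing specific to MRI internals):

```ruby
class Foo
  def initialize
    puts "hello!"
  end
end

# Foo is itself an object, an instance of Class.
p Foo.class                      # => Class

# Foo defines no singleton method of its own, so .new comes from Class.
p Foo.singleton_methods(false)   # => []
p Foo.method(:new).owner         # => Class
```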

To quickly review a method's implementation, you may install the pry and pry-doc gems and use the "show-method" or "show-source" commands. The "show-doc" command is useful as well. To find your way around Ruby's source code, The Ruby Cross Reference is pretty useful. MRI is hosted on GitHub as well.

// [1] pry(main)> show-method Class#new

// From: object.c (C Method):
// Owner: Class
// Visibility: public
// Number of lines: 10

VALUE
rb_class_new_instance(int argc, const VALUE *argv, VALUE klass)
{
  VALUE obj;

  obj = rb_obj_alloc(klass);
  rb_obj_call_init(obj, argc, argv);

  return obj;
}

How does Pry know where to look in the C source? MRI maps Ruby method names to C function definitions, using rb_define_method. For instance, Class#new is bound to a C function named rb_class_new_instance, thanks to the instruction rb_define_method(rb_cClass, "new", rb_class_new_instance, -1). Calling #new in Ruby basically proxies to rb_class_new_instance in MRI.

If you are wondering about that -1 parameter, it is the method's arity as seen from C. A non-negative value means the C function receives exactly that many arguments, each as a separate VALUE parameter. -1 means it receives them as a C array instead (int argc, VALUE *argv, plus the receiver), which is exactly how rb_class_new_instance is declared above. This is what lets #new accept any number of arguments and forward them untouched to #initialize.
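From the Ruby side, that -1 shows up as the method's arity: a negative arity means the method takes a variable number of arguments. A minimal sketch, where Kernel#instance_variable_set is only my pick of a method with a fixed number of parameters:

```ruby
# Class#new accepts any number of arguments.
p Class.instance_method(:new).arity                     # => -1

# A method registered with a fixed number of parameters, for comparison.
p Kernel.instance_method(:instance_variable_set).arity  # => 2
```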

As its name implies, rb_class_new_instance is responsible for creating a new instance of a given class, named klass (in my example, it would be a reference to Foo). It first takes care of allocating memory for a new object of class klass, hence creating a blank object, using rb_obj_alloc.
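Seen from Ruby, the two calls made by rb_class_new_instance can be approximated as below. This is only a sketch: my_new is a made-up name, and the real Class#new is the C function shown above, not Ruby code.

```ruby
class Foo
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Hypothetical stand-in for Class#new, for illustration only.
  def self.my_new(*args, &block)
    obj = allocate                        # step 1: blank object (rb_obj_alloc)
    obj.send(:initialize, *args, &block)  # step 2: run the initializer (rb_obj_call_init)
    obj                                   # the return value of #initialize is discarded
  end
end

foo = Foo.my_new("bar")
p foo.class   # => Foo
p foo.name    # => "bar"
```

Note that send is needed because #initialize is private.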

1st sub-call: rb_obj_alloc

/*
 *  call-seq:
 *     class.allocate()   ->   obj
 *
 *  Allocates space for a new object of <i>class</i>'s class and does not
 *  call initialize on the new instance. The returned object must be an
 *  instance of <i>class</i>.
 *
 *      klass = Class.new do
 *        def initialize(*args)
 *          @initialized = true
 *        end
 *
 *        def initialized?
 *          @initialized || false
 *        end
 *      end
 *
 *      klass.allocate.initialized? #=> false
 *
 */

VALUE
rb_obj_alloc(VALUE klass)
{
  VALUE obj;
  rb_alloc_func_t allocator;

  if (RCLASS_SUPER(klass) == 0 && klass != rb_cBasicObject) {
    rb_raise(rb_eTypeError, "can't instantiate uninitialized class");
  }
  if (FL_TEST(klass, FL_SINGLETON)) {
    rb_raise(rb_eTypeError, "can't create instance of singleton class");
  }
  allocator = rb_get_alloc_func(klass);
  if (!allocator) {
    rb_raise(rb_eTypeError, "allocator undefined for %"PRIsVALUE,
             klass);
  }

#if !defined(DTRACE_PROBES_DISABLED) || !DTRACE_PROBES_DISABLED
  if (RUBY_DTRACE_OBJECT_CREATE_ENABLED()) {
    const char * file = rb_sourcefile();
    RUBY_DTRACE_OBJECT_CREATE(rb_class2name(klass),
                              file ? file : "",
                              rb_sourceline());
  }
#endif

  obj = (*allocator)(klass);

  if (rb_obj_class(obj) != rb_class_real(klass)) {
    rb_raise(rb_eTypeError, "wrong instance allocation");
  }
  return obj;
}

Not much to comment on here: sanity checks, DTrace handling, and finally the allocation of the object's footprint in memory. The final result is an empty Ruby object bound to klass, its blueprint.
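Class#allocate exposes this step to Ruby code, so the "blank object" can be observed directly. A small sketch (the ready? helper is mine, modeled on the example in the comment above):

```ruby
class Foo
  def initialize
    @ready = true
  end

  def ready?
    @ready || false
  end
end

# allocate builds the object but never calls #initialize.
p Foo.allocate.ready?   # => false
p Foo.new.ready?        # => true

# The FL_SINGLETON sanity check is reachable from Ruby as well.
begin
  Foo.singleton_class.allocate
rescue TypeError => e
  puts "TypeError: #{e.message}"
end
```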

2nd sub-call: rb_obj_call_init

The next job of rb_class_new_instance is to actually initialize the object which has just been allocated. It does so by calling rb_obj_call_init with a first argument of obj, the very object being created, forwarding any parameter(s) initially provided to #new.

rb_obj_call_init is implemented like this:

// rb_obj_call_init
void
rb_obj_call_init(VALUE obj, int argc, VALUE *argv)
{
  PASS_PASSED_BLOCK();
  rb_funcall2(obj, idInitialize, argc, argv);
}

This function first ensures that any block attached to #new is passed on to the method about to be called. It then calls "something" labeled idInitialize, with obj as the receiver. In MRI, idInitialize is a predefined ID, the internal identifier for the method name "initialize" (cf. template/id.c.tmpl and defs/id.def), and the actual function it maps to depends on the receiver. Here, because the receiver is obj, an instance of the class Foo, Ruby ends up looking for an instance method Foo#initialize, which may or may not exist (Ruby has no clue at this stage).
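Both points can be checked from Ruby: a block given to #new does reach #initialize, and #initialize is an ordinary instance method of the class (Ruby just makes it private automatically). A minimal sketch, with a Foo variant of my own that takes an argument:

```ruby
class Foo
  attr_reader :message

  def initialize(greeting)
    # The block given to Foo.new is available here.
    @message = block_given? ? yield(greeting) : greeting
  end
end

plain = Foo.new("hello")
shout = Foo.new("hello") { |g| g.upcase }

p plain.message   # => "hello"
p shout.message   # => "HELLO"

# initialize is a private instance method defined on Foo itself.
p Foo.private_instance_methods(false).include?(:initialize)   # => true
```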

We do have an #initialize instance method available on Foo though, since we defined it; back in the Ruby world, we have the pleasure of witnessing the following expected behaviour:

Foo.new

# hello!
# => #<Foo:0x007fc439e3a238>

Calling new without an explicit initialize defined

Let's now consider a slightly different example:

class Foo
  # No initialize method defined anymore!!!
end

Foo.new
# => #<Foo:0x007fc439e62fd0>

When creating a new Foo object, in the absence of a custom implementation for #initialize, notice how Ruby does not complain about it (in particular, it does not raise an undefined method error), and simply returns the new object. The secret behind this handy behaviour lies in BasicObject, the root class of all objects in the system, and in the way MRI handles "missing" methods.

A journey from rb_funcall2 to rb_call0

Here's the implementation of rb_funcall2, which is called upon creating an object:

#define rb_funcall2 rb_funcallv

It is basically an alias for rb_funcallv. The v suffix hints that the function takes an argv, a pointer to a C array of method arguments whose length is given by argc (its sibling rb_funcall takes the arguments as a C variable-length argument list instead). Long story short, no matter what the initial (Ruby) method call looks like and no matter how many arguments were passed, we end up calling an internal function named rb_call0:

/*!
 * \internal
 * calls the specified method.
 *
 * This function is called by functions in rb_call* family.
 * \param recv   receiver of the method
 * \param mid    an ID that represents the name of the method
 * \param argc   the number of method arguments
 * \param argv   a pointer to an array of method arguments
 * \param scope
 * \param self   self in the caller. Qundef means no self is considered and
 *               protected methods cannot be called
 *
 * \note \a self is used in order to controlling access to protected methods.
 */
static inline VALUE
rb_call0(VALUE recv, ID mid, int argc, const VALUE *argv,
         call_type scope, VALUE self)
{
  VALUE defined_class;
  rb_method_entry_t *me =
    rb_search_method_entry(recv, mid, &defined_class);
  rb_thread_t *th = GET_THREAD();
  int call_status = rb_method_call_status(th, me, scope, self);

  if (call_status != NOEX_OK) {
    return method_missing(recv, mid, argc, argv, call_status);
  }
  stack_check();
  return vm_call0(th, recv, mid, argc, argv, me, defined_class);
}

The interesting part is the call_status check. If Ruby is unable to find a callable method matching mid for the receiving object (recv), rb_call0 returns whatever method_missing computes when provided with the current context (initial receiver and arguments). Let's then have a look at this new beast:
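The control flow of rb_call0 can be paraphrased in Ruby as follows. This is a rough sketch under my own simplifications: call0_sketch and Recorder are made-up names, visibility checks and singleton methods are ignored, and the lookup is delegated to Ruby's reflection API rather than to rb_search_method_entry.

```ruby
# Hypothetical helper mirroring the shape of rb_call0 (not MRI's actual logic).
def call0_sketch(recv, mid, *args)
  klass = recv.class
  found = klass.method_defined?(mid) || klass.private_method_defined?(mid)

  if found
    recv.send(mid, *args)                   # roughly vm_call0
  else
    recv.send(:method_missing, mid, *args)  # roughly method_missing
  end
end

# Made-up class that answers any message through method_missing.
class Recorder
  def method_missing(name, *args)
    [name, args]
  end

  def respond_to_missing?(name, include_private = false)
    true
  end
end

p call0_sketch("abc", :upcase)           # => "ABC"
p call0_sketch(Recorder.new, :nope, 1)   # => [:nope, [1]]
```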

static inline VALUE
method_missing(VALUE obj, ID id, int argc, const VALUE *argv, int call_status)
{
  VALUE *nargv, result, work;
  rb_thread_t *th = GET_THREAD();
  const rb_block_t *blockptr = th->passed_block;

  th->method_missing_reason = call_status;
  th->passed_block = 0;

  if (id == idMethodMissing) {
    raise_method_missing(th, argc, argv, obj, call_status | NOEX_MISSING);
  }

  nargv = ALLOCV_N(VALUE, work, argc + 1);
  nargv[0] = ID2SYM(id);
  MEMCPY(nargv + 1, argv, VALUE, argc);

  if (rb_method_basic_definition_p(CLASS_OF(obj) , idMethodMissing)) {
    raise_method_missing(th, argc+1, nargv, obj, call_status | NOEX_MISSING);
  }
  th->passed_block = blockptr;
  result = rb_funcall2(obj, idMethodMissing, argc + 1, nargv);
  if (work) ALLOCV_END(work);
  return result;
}

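Things get a little more complicated here. By the time method_missing runs, the lookup done in rb_call0 (rb_search_method_entry) has already searched the receiver's whole ancestry chain, from its class up to BasicObject, without finding a callable match. What method_missing does is build a new argument list, with the name of the missing method in first position, and call the Ruby-level #method_missing on the receiver (notice the final call to rb_funcall2). If the receiver only has the default implementation inherited from BasicObject, a NoMethodError is raised right away. So whether we ever get here for initialize depends on what that ancestry chain contains.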
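The Ruby-visible side of this function can be observed with two small classes. A sketch with made-up names (Plain, Ghost, and the "boo" prefix are mine); note that a call written in Ruby goes through the VM's own dispatch rather than rb_call0, but the outcome is the same:

```ruby
# No custom method_missing: the default one raises NoMethodError.
class Plain; end

# Custom method_missing: receives the method name first, then the arguments.
class Ghost
  def method_missing(name, *args)
    return super unless name.to_s.start_with?("boo")
    "#{name} handled with #{args.inspect}"
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("boo") || super
  end
end

begin
  Plain.new.bar(1, 2)
rescue NoMethodError => e
  p e.name   # => :bar
  p e.args   # => [1, 2]
end

puts Ghost.new.boo(1, 2)   # boo handled with [1, 2]
```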

Let's inspect the ancestry chain for a new instance of Foo:

f = Foo.new

# f's first ancestor is the class it belongs to, Foo.
# After that, MRI will consider Foo's ancestors:

Foo.ancestors
# => [Foo, Object, PP::ObjectMixin, Kernel, BasicObject]

At the end of the road, we find BasicObject, which actually knows about an initialize method:

rb_define_private_method(rb_cBasicObject, "initialize", rb_obj_dummy, 0);

That rb_obj_dummy function has a very straightforward implementation:

static VALUE
rb_obj_dummy(void)
{
    return Qnil;
}

It is a no-op method returning nil… which is the signature of a Ruby hook!
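Ruby's reflection API confirms where that no-op lives. A small sketch, using send only because #initialize is private:

```ruby
class Foo
  # No initialize method defined.
end

# The method Foo instances respond to is the one defined on BasicObject.
p Foo.instance_method(:initialize).owner   # => BasicObject

# Calling it by hand does nothing and returns nil.
p Foo.new.send(:initialize)                # => nil
```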