2015 update — Code samples updated to use latest Ruby (2.2.3).
Following my post on hooks, I thought it would be interesting to share the details of object creation in Ruby. This post summarizes the relevant pieces of MRI (Matz's Ruby Interpreter) that implement that core process.
I will be using this piece of code as my starting point:
class Foo
  def initialize
    puts "hello!"
  end
end
Foo.new # Let's go!
Foo.new
We are calling the instance method #new on Foo, which is a "class object", i.e. an instance of the class Class. You can verify that Foo.class does return Class. Since we did not define our own Foo.new, the call is handled by the class Foo is an instance of, namely Class. We thus need to have a look at Class#new to get started.
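The claims above are easy to check from a Ruby session. The following is a minimal sketch that relies only on core reflection methods (Object#class, Object#singleton_methods, Object#method and Method#owner); the trailing comments state what MRI is expected to print.

```ruby
class Foo
  def initialize
    puts "hello!"
  end
end

# Foo is itself an object, and its class is Class.
puts Foo.class                             # Class

# Foo has no singleton method named new of its own...
puts Foo.singleton_methods(false).inspect  # []

# ...so the lookup for Foo.new ends up at Class#new.
puts Foo.method(:new).owner                # Class
```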
To quickly review a method's implementation, you may install the pry and pry-doc gems and use the "show-method" or "show-source" commands. The "show-doc" command is useful as well. To find your way around Ruby's source code, The Ruby Cross Reference is pretty useful. MRI is hosted on GitHub as well.
// [1] pry(main)> show-method Class#new
// From: object.c (C Method):
// Owner: Class
// Visibility: public
// Number of lines: 10
VALUE
rb_class_new_instance(int argc, const VALUE *argv, VALUE klass)
{
    VALUE obj;
    obj = rb_obj_alloc(klass);
    rb_obj_call_init(obj, argc, argv);
    return obj;
}
How does Pry know where to look in the C source? MRI maps Ruby method names to C function definitions, using rb_define_method.
For instance, Class#new is bound to a C function named rb_class_new_instance, thanks to the instruction:
rb_define_method(rb_cClass, "new", rb_class_new_instance, -1)
Calling #new in Ruby basically proxies to rb_class_new_instance in MRI.
If you are wondering about that -1 parameter, it is the arity given to rb_define_method. A non-negative value declares a fixed number of arguments, each of which is passed to the C function as its own VALUE parameter. A value of -1 means the method accepts any number of arguments, and the C function receives them as a count and a C array (argc and argv) followed by the receiver, which is exactly the signature of rb_class_new_instance. #new needs this because it must accept any number of arguments and forward them to #initialize.
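The same arity is visible from Ruby through UnboundMethod#arity. The sketch below uses a hypothetical Pair class (not part of the original example) to contrast the variable arity of Class#new with the fixed arity of a user-defined #initialize.

```ruby
# Pair is a hypothetical class used only for this illustration.
class Pair
  def initialize(left, right)
    @left = left
    @right = right
  end
end

# Class#new is registered with -1, and Ruby reports the same arity.
puts Class.instance_method(:new).arity        # -1

# Pair#initialize takes exactly two arguments.
puts Pair.instance_method(:initialize).arity  # 2

# new forwards whatever it receives; the argument count is only
# checked once initialize is called.
begin
  Pair.new(1)
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```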
As its name implies, rb_class_new_instance is responsible for creating a new instance of a given class, named klass (in my example, it would be a reference to Foo). It first takes care of allocating memory for a new object of type klass, hence creating a blank object, using rb_obj_alloc.
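The two steps performed by rb_class_new_instance can be approximated in plain Ruby. This is only a sketch: sketch_new and Point are hypothetical names introduced for illustration, and the real work happens in C, not through Class#allocate and Object#send.

```ruby
# Point is a hypothetical class used only for this illustration.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
    :ignored # the return value of initialize is discarded
  end
end

# Hypothetical helper, not part of Ruby: mimics rb_class_new_instance.
def sketch_new(klass, *args, &block)
  obj = klass.allocate                  # rb_obj_alloc(klass)
  obj.send(:initialize, *args, &block)  # rb_obj_call_init(obj, argc, argv)
  obj                                   # return obj
end

point = sketch_new(Point, 1, 2)
puts point.class  # Point
puts point.x      # 1
```

Note that send is needed because #initialize is private, and that the helper returns obj rather than whatever #initialize returned, just like the C function.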
rb_obj_alloc
/*
 *  call-seq:
 *     class.allocate()   ->   obj
 *
 *  Allocates space for a new object of <i>class</i>'s class and does not
 *  call initialize on the new instance. The returned object must be an
 *  instance of <i>class</i>.
 *
 *      klass = Class.new do
 *        def initialize(*args)
 *          @initialized = true
 *        end
 *
 *        def initialized?
 *          @initialized || false
 *        end
 *      end
 *
 *      klass.allocate.initialized? #=> false
 *
 */
VALUE
rb_obj_alloc(VALUE klass)
{
    VALUE obj;
    rb_alloc_func_t allocator;
    if (RCLASS_SUPER(klass) == 0 && klass != rb_cBasicObject) {
        rb_raise(rb_eTypeError, "can't instantiate uninitialized class");
    }
    if (FL_TEST(klass, FL_SINGLETON)) {
        rb_raise(rb_eTypeError, "can't create instance of singleton class");
    }
    allocator = rb_get_alloc_func(klass);
    if (!allocator) {
        rb_raise(rb_eTypeError, "allocator undefined for %"PRIsVALUE,
                 klass);
    }
#if !defined(DTRACE_PROBES_DISABLED) || !DTRACE_PROBES_DISABLED
    if (RUBY_DTRACE_OBJECT_CREATE_ENABLED()) {
        const char * file = rb_sourcefile();
        RUBY_DTRACE_OBJECT_CREATE(rb_class2name(klass),
                                  file ? file : "",
                                  rb_sourceline());
    }
#endif
    obj = (*allocator)(klass);
    if (rb_obj_class(obj) != rb_class_real(klass)) {
        rb_raise(rb_eTypeError, "wrong instance allocation");
    }
    return obj;
}
Not much to comment on here: sanity checks, DTrace handling, and eventually allocation of the object's footprint in memory. The final result is an empty Ruby object bound to klass, its blueprint.
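Both the "blank object" result and the sanity checks can be observed from Ruby through Class#allocate. In this sketch, Widget and allocation_error are hypothetical names introduced for illustration; the helper simply returns the TypeError message, if any.

```ruby
# Widget is a hypothetical class used only for this illustration.
class Widget
  attr_reader :ready

  def initialize
    @ready = true
  end
end

# allocate builds the object but never runs initialize,
# so the instance variable is still unset.
blank = Widget.allocate
puts blank.ready.inspect  # nil

# Hypothetical helper: returns the TypeError message raised by
# allocate, or nil when the allocation succeeds.
def allocation_error(klass)
  klass.allocate
  nil
rescue TypeError => e
  e.message
end

puts allocation_error(Widget).inspect            # nil
puts allocation_error(Widget.singleton_class)    # rejected: singleton class
puts allocation_error(Integer)                   # rejected: no allocator
```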
rb_obj_call_init
The next job of rb_class_new_instance is to actually initialize the object which has just been allocated. It does so by calling rb_obj_call_init with obj, the very object being created, as its first argument, forwarding any parameters initially provided to #new.
rb_obj_call_init is implemented like this:
// rb_obj_call_init
void
rb_obj_call_init(VALUE obj, int argc, VALUE *argv)
{
    PASS_PASSED_BLOCK();
    rb_funcall2(obj, idInitialize, argc, argv);
}
This function first ensures that any block given to #new is handed over to the current execution thread in C, so that the upcoming call can receive it. It then calls "something" labeled idInitialize, with obj as the receiver. In MRI, idInitialize is a constant holding the ID (the internal, interned form of a method name) for "initialize" (cf. template/id.c.tmpl and defs/id.def), and the actual function it maps to depends on the receiver. Here, because the receiver is obj, an instance of the class Foo, Ruby ends up looking for an instance method Foo#initialize, which may or may not exist (Ruby has no clue at this stage).
We do have an #initialize instance method available on Foo though, since we defined it; back in the Ruby world, we get the expected behaviour:
Foo.new
# hello!
# => #<Foo:0x007fc439e3a238>
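The block forwarding performed by rb_obj_call_init is also observable from Ruby. The sketch below uses a hypothetical Greeter class to show that a block given to new is available inside #initialize, and that new returns the object rather than the value returned by #initialize.

```ruby
# Greeter is a hypothetical class used only for this illustration.
class Greeter
  attr_reader :greeting

  def initialize(name)
    # The block given to Greeter.new is visible here.
    @greeting = block_given? ? yield(name) : "hello, #{name}"
    :ignored # new returns the object, not this value
  end
end

with_block = Greeter.new("Matz") { |name| "bonjour, #{name}" }
puts with_block.greeting  # bonjour, Matz

without_block = Greeter.new("Matz")
puts without_block.greeting  # hello, Matz
```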
new without an explicit initialize defined
Let's now consider a slightly different example:
class Foo
  # No initialize method defined anymore!!!
end
Foo.new
# => #<Foo:0x007fc439e62fd0>
When creating a new Foo object in the absence of a custom implementation for #initialize, notice how Ruby does not complain about it (in particular, it does not raise an undefined method error) and simply returns the new object. The secret behind this handy behaviour lies in BasicObject, the mother class of all objects in the system, and in the way MRI looks up methods that the receiver's class does not define itself.
rb_funcall2 to rb_call0
Here's the implementation of rb_funcall2, which is called upon creating an object:
#define rb_funcall2 rb_funcallv
It is basically an alias for rb_funcallv. The v suffix tells us that the function takes its arguments as an argc count and an argv pointer to a C array of that length, as opposed to rb_funcall, which takes them as a C variable-length argument list. Long story short, no matter what the initial (Ruby) method call looks like and no matter how many arguments were passed, we end up calling an internal function named rb_call0:
/*!
 * \internal
 * calls the specified method.
 *
 * This function is called by functions in rb_call* family.
 * \param recv   receiver of the method
 * \param mid    an ID that represents the name of the method
 * \param argc   the number of method arguments
 * \param argv   a pointer to an array of method arguments
 * \param scope
 * \param self   self in the caller. Qundef means no self is considered and
 *               protected methods cannot be called
 *
 * \note \a self is used in order to controlling access to protected methods.
 */
static inline VALUE
rb_call0(VALUE recv, ID mid, int argc, const VALUE *argv,
         call_type scope, VALUE self)
{
    VALUE defined_class;
    rb_method_entry_t *me =
        rb_search_method_entry(recv, mid, &defined_class);
    rb_thread_t *th = GET_THREAD();
    int call_status = rb_method_call_status(th, me, scope, self);
    if (call_status != NOEX_OK) {
        return method_missing(recv, mid, argc, argv, call_status);
    }
    stack_check();
    return vm_call0(th, recv, mid, argc, argv, me, defined_class);
}
The interesting part is the check on call_status. If Ruby is unable to find a callable method matching mid for the receiving object (recv), that is, neither in its class nor in any of its ancestors, rb_call0 returns whatever method_missing computes when provided with the current context (initial receiver and arguments). Let's then have a look at this new beast:
static inline VALUE
method_missing(VALUE obj, ID id, int argc, const VALUE *argv, int call_status)
{
    VALUE *nargv, result, work;
    rb_thread_t *th = GET_THREAD();
    const rb_block_t *blockptr = th->passed_block;
    th->method_missing_reason = call_status;
    th->passed_block = 0;
    if (id == idMethodMissing) {
        raise_method_missing(th, argc, argv, obj, call_status | NOEX_MISSING);
    }
    nargv = ALLOCV_N(VALUE, work, argc + 1);
    nargv[0] = ID2SYM(id);
    MEMCPY(nargv + 1, argv, VALUE, argc);
    if (rb_method_basic_definition_p(CLASS_OF(obj) , idMethodMissing)) {
        raise_method_missing(th, argc+1, nargv, obj, call_status | NOEX_MISSING);
    }
    th->passed_block = blockptr;
    result = rb_funcall2(obj, idMethodMissing, argc + 1, nargv);
    if (work) ALLOCV_END(work);
    return result;
}
Things get a little more complicated here, but keep in mind that this function is only reached once the method lookup itself has failed. The walk along the ancestry chain (the class of the receiver, then each of its ancestors in turn) happens earlier, in rb_call0, through rb_search_method_entry. What method_missing does is dispatch to the Ruby-level #method_missing hook: it prepends the name of the missing method to the arguments and, through the final call to rb_funcall2, calls #method_missing on the same receiver. If the receiver only has the default BasicObject#method_missing, a NoMethodError is raised instead.
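Seen from Ruby, the hook that the C method_missing function ends up calling looks like the following sketch. Ghost and the find_ prefix are hypothetical, chosen only to show one call that the hook handles and one that falls through to the default behaviour.

```ruby
# Ghost is a hypothetical class used only for this illustration.
class Ghost
  # Receives the missing method's name first, then the original arguments,
  # mirroring how the C code builds nargv.
  def method_missing(name, *args, &block)
    return [name, args] if name.to_s.start_with?("find_")
    super # default BasicObject#method_missing raises NoMethodError
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("find_") || super
  end
end

ghost = Ghost.new
puts ghost.find_user(42).inspect  # [:find_user, [42]]

begin
  ghost.unknown
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end
```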
Let's inspect the ancestry chain for a new instance of Foo:
f = Foo.new
# f's first ancestor is the class it belongs to, Foo.
# After that, MRI will consider Foo's ancestors:
Foo.ancestors
# => [Foo, Object, PP::ObjectMixin, Kernel, BasicObject]
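To see how the ancestry chain drives method lookup, here is a small sketch with hypothetical Animal, Loud and Dog definitions. The first entry in the chain that defines the method wins, and UnboundMethod#owner reports which one it was.

```ruby
# Animal, Loud and Dog are hypothetical names used only for this illustration.
class Animal
  def speak
    "..."
  end
end

module Loud
  def speak
    "WOOF"
  end
end

class Dog < Animal
  include Loud
end

# An included module sits between the class and its superclass.
puts Dog.ancestors.first(3).inspect       # [Dog, Loud, Animal]

# Dog does not define speak, so the lookup stops at Loud.
puts Dog.new.speak                        # WOOF
puts Dog.instance_method(:speak).owner    # Loud
```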
At the end of the road, we find BasicObject, which actually knows about an initialize method:
rb_define_private_method(rb_cBasicObject, "initialize", rb_obj_dummy, 0);
That rb_obj_dummy function has a very straightforward implementation:
static VALUE
rb_obj_dummy(void)
{
    return Qnil;
}
It is a no-op method returning nil… which is the signature of a Ruby hook!
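This default can be checked from Ruby as well. The sketch below uses a hypothetical Bare class that defines nothing, and asks Ruby where its #initialize comes from and what it returns.

```ruby
# Bare is a hypothetical class used only for this illustration.
class Bare
  # No initialize method defined.
end

# The only initialize visible from Bare is the private one on BasicObject.
puts Bare.instance_method(:initialize).owner                            # BasicObject
puts BasicObject.private_instance_methods(false).include?(:initialize)  # true

# Calling it by hand (send bypasses its private visibility)
# does nothing and returns nil.
puts Bare.allocate.send(:initialize).inspect                            # nil
```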