Simple thoughts about recursion in Ruby

Beware: Ruby 1.9 code below.

Ruby is a popular procedural language with some functional aspects, most of them being underused. Recursion, for instance, is a standard mathematical pattern you either know and might use, or will simply ignore. I believe the former strategy to be slightly more efficient, hence I wanted to gather some thoughts about recursion in Ruby, and share them with newcomers.

Recursion is a powerful computation pattern mostly found in functional languages. Ruby is no functional language, but one may still make use of recursive algorithms: it actually especially makes sense to use them in conjunction with blocks!

Recursion and blocks

Let’s study a specific example. The Pathname core library is a great tool when it comes to manipulating paths and associated resources (files and directories) — and it’s portable. Yet, not unlike many other filesystem-related libraries of the Ruby Standard library (fileutils, etc.), it lacks recursive visitors. Simply put, it does not provide methods to visit a given directory and each of its sub-directories, recursively.

Such a feature comes in handy from time to time, so you may wonder how to perform this in a cool, rubyish way. Well, you could use File.findFile.find or even Pathname.findPathname.find which relies on the former, but it is certainly more fun to build a tailored version with just the Pathname helpers :)

A typical use case could be the following:

# Let's remove all files but HAML ones within an app/ directory.
Pathname.new("app").visit do |f|
  FileUtils.rm_f(f) unless f.to_s =~ /\.haml$/
  puts "!!! deleting \#{f}"
end

# Let's remove all files but HAML ones within an app/ directory.
Pathname.new("app").visit do |f|
  FileUtils.rm_f(f) unless f.to_s =~ /\.haml$/
  puts "!!! deleting \#{f}"
end

Here is my foolish implementation:

class Pathname
  def visit(options = {:all => false, :hidden => false})
    if self.directory?
      children.each do |entry|
        next if entry.basename.to_s[0] == "." && !options[:hidden]
        yield entry unless entry.directory? && !options[:all]
        entry.visit(:all => options[:all]) { |sub_entry| yield sub_entry } if entry.directory?
      end
    else
      yield self
    end
  end
end

class Pathname
  def visit(options = {:all => false, :hidden => false})
    if self.directory?
      children.each do |entry|
        next if entry.basename.to_s[0] == "." && !options[:hidden]
        yield entry unless entry.directory? && !options[:all]
        entry.visit(:all => options[:all]) { |sub_entry| yield sub_entry } if entry.directory?
      end
    else
      yield self
    end
  end
end

For seasoned rubyists, this code should be quite straightforward, but for the younger ones, it may look plain awkward.

The idea for this blog post came to me while I was coding on a similar feature (a recursive visitor), and realized it may actually be a good use-case for a combination of blocks and recursion. The code above is by not perfect; it merely illustrates a very common Ruby pattern: a method defined with some arguments, expecting to be passed a closure (a block) which will have some kind of dependency over the aforementioned arguments, so as to perform step-wise side effects. It leads to this typical, expressive ruby style you may be familiar with.

The interesting feature here is that visit is a recursive method. I guess it is worth explaining the inner logic of the process at play:

def visit(options = {:all => false, :hidden => false})
  # The :all option means both files *and directories* will be visited.
  # Use case: Pathname.new("mydir").visit(all: true) { |f| puts f }
  # The :hidden option allows you to include hidden directories and files.
 
  # Are we visiting a directory, at least?
  if self.directory?
    # Iterate over the sub-entries available at this level.
    children.each do |entry|
      # Should hidden files be ignored?
      next if entry.basename.to_s[0] == "." && !options[:hidden]
      # If the entry is not a directory, it should be yielded to the block.
      yield entry unless entry.directory? && !options[:all]
 
      # Here comes a sub directory! Let's visit it.
      # Remember, #visit takes a block.
      entry.visit(:all => options[:all]) { |sub_entry| yield sub_entry } if entry.directory?
      # This previous line does the recursion: "let's visit the
      # sub directory, known as 'entry', and yield any
      # of the resources we'll find within it to the caller";
      # that is, to our initial #visit call, which
      # in turns will yield the resource to the main block.
      # This process can go over and over again, as long as
      # there are some sub-entries to visit.
    end
  else
    # If what's visited is not a directory, there's nothing to
    # visit but the current resource, so just yield it!
    yield self
  end
end

def visit(options = {:all => false, :hidden => false})
  # The :all option means both files *and directories* will be visited.
  # Use case: Pathname.new("mydir").visit(all: true) { |f| puts f }
  # The :hidden option allows you to include hidden directories and files.
 
  # Are we visiting a directory, at least?
  if self.directory?
    # Iterate over the sub-entries available at this level.
    children.each do |entry|
      # Should hidden files be ignored?
      next if entry.basename.to_s[0] == "." && !options[:hidden]
      # If the entry is not a directory, it should be yielded to the block.
      yield entry unless entry.directory? && !options[:all]
 
      # Here comes a sub directory! Let's visit it.
      # Remember, #visit takes a block.
      entry.visit(:all => options[:all]) { |sub_entry| yield sub_entry } if entry.directory?
      # This previous line does the recursion: "let's visit the
      # sub directory, known as 'entry', and yield any
      # of the resources we'll find within it to the caller";
      # that is, to our initial #visit call, which
      # in turns will yield the resource to the main block.
      # This process can go over and over again, as long as
      # there are some sub-entries to visit.
    end
  else
    # If what's visited is not a directory, there's nothing to
    # visit but the current resource, so just yield it!
    yield self
  end
end

Note how we have several yieldyield statements, covering each edge case. For we want the user to be able to visit a file as well, we end up with two nested levels of testing/yielding, and some redundancy, but that is fine as a first, non-optimized version (KISS).

About performance

What is pretty cool here is that the recursion process is free of any (explicit) object instantiation, so it will be reasonably fast ( modulo the speed of the processing performed within the block). Of course, the handling of the optionsoptions hash does trigger an object instantiation each time the visitvisit message is sent to a receiver, but it is that of a light, empty hash object.

Ruby is often seen as being "inefficient", "slow", but as with every computational process… "it depends". Most of the time, one of the benefits of using a high-level language such as Ruby is actually to be able not to worry about frenzied object instantiations, and rely on the GC to perform.

Performing memoization may help. If the recursive pattern relies on well-defined ("pure") functions, that is functions returning the same result when provided with the same arguments, you may be able to avoid costly computations by relying on cache. Use the ||=||= operator wisely!

What more about recursion in Ruby? Take a look at this line:

entry.visit { |sub_entry| yield sub_entry } if entry.directory?

entry.visit { |sub_entry| yield sub_entry } if entry.directory?

The purpose of this line is to yield any file accessed within the nested sub-directories, back to the initial block. The visit method does nothing with this piece of data by itself, so instead of yielding from nested visit process to nested visit process until the main, initial one is reached, one could simply yield directly to that initial context. In practice, it meant doing some kind of tail call optimization (TCO), something you may be familiar with if you ever tried a truly functional language. The visitvisit code above could be refactored to use CTO: the Ruby implementation known as YARV (the “next” Ruby) has built-in support for TCO, but the feature is disabled by default for it currently messes the stack order (not so funny while debugging recursions!).

Options vs performance

While I insisted on avoiding object instantiations inside a recursive pattern, I mentioned the fact that the optionsoptions hash handling was not really time-consuming in this context. Here’s a little benchmark:

N = 10000
@path = Pathname.new(".") # some 20 files under a dozen of nested dir
                          # with one hidden dir
 
Benchmark.bmbm do |x|
  x.report("visit w/ options") do
    N.times { @path.visit { |f| } }
  end
 
  x.report("visit w/ options, changed") do
    N.times { @path.visit(all: true, hidden: true) { |f| } }
  end
 
  x.report("visit wo options") do
    # def visit_wo_options
    #   if self.directory?
    #     children.each do |entry|
    #       yield entry unless entry.directory?
    #       entry.visit { |sub_entry| yield sub_entry } if entry.directory?
    #     end
    #   else
    #     yield self
    #   end
    # end
    N.times { @path.visit_wo_options { |f| } }
  end
end
 
# =>
# Rehearsal -------------------------------------------------------------
# visit w/ options            4.950000   1.390000   6.340000 (  6.347853)
# visit w/ options, changed   5.340000   1.670000   7.010000 (  7.006673)
# visit wo options            4.890000   1.560000   6.450000 (  6.449474)
# --------------------------------------------------- total: 19.800000sec
 
#                                 user     system      total        real
# visit w/ options            5.010000   1.300000   6.310000 (  6.302758)
# visit w/ options, changed   5.200000   1.810000   7.010000 (  7.012919)
# visit wo options            4.770000   1.670000   6.440000 (  6.439384)

N = 10000
@path = Pathname.new(".") # some 20 files under a dozen of nested dir
                          # with one hidden dir
 
Benchmark.bmbm do |x|
  x.report("visit w/ options") do
    N.times { @path.visit { |f| } }
  end
 
  x.report("visit w/ options, changed") do
    N.times { @path.visit(all: true, hidden: true) { |f| } }
  end
 
  x.report("visit wo options") do
    # def visit_wo_options
    #   if self.directory?
    #     children.each do |entry|
    #       yield entry unless entry.directory?
    #       entry.visit { |sub_entry| yield sub_entry } if entry.directory?
    #     end
    #   else
    #     yield self
    #   end
    # end
    N.times { @path.visit_wo_options { |f| } }
  end
end
 
# =>
# Rehearsal -------------------------------------------------------------
# visit w/ options            4.950000   1.390000   6.340000 (  6.347853)
# visit w/ options, changed   5.340000   1.670000   7.010000 (  7.006673)
# visit wo options            4.890000   1.560000   6.450000 (  6.449474)
# --------------------------------------------------- total: 19.800000sec
 
#                                 user     system      total        real
# visit w/ options            5.010000   1.300000   6.310000 (  6.302758)
# visit w/ options, changed   5.200000   1.810000   7.010000 (  7.012919)
# visit wo options            4.770000   1.670000   6.440000 (  6.439384)

So, it seems adding some more options will not turn into a critical performance issue.

Do we really want adding extra options though? This question is not related to recursion or blocks, but it is part of writing solid code, so here are my two cents. Some options seems to make sense for they provide the user with some control over the behavior of the method. Certainly, many users will be glad to be able to include or exclude directories. What about excluding hidden resources? It may not be an as unequivocal feature as the directory filtering behavior, so it could probably be left for the block to handle.

What about an option so one can get the nesting depth passed to the block? We could then write stuff like:

Pathname.new(".").visit do |file, depth|
  puts "The file \#{file.basename} is \#{depth} level\#{"s" if depth > 1} under \#{self.basename.to_s}"
end

Pathname.new(".").visit do |file, depth|
  puts "The file \#{file.basename} is \#{depth} level\#{"s" if depth > 1} under \#{self.basename.to_s}"
end

But it would mean handling two ways of yielding back to block inside visitvisit, which could quickly get messy. For all the metadata required to calculate the nesting depth is available through the yielded pathname, (it is a relative path from the visited resource), devising this kind of insight really belongs to the associated block, not to the visitvisit method. The following code seems definitely better to me:

# À la tree output
Pathname.new(".").visit(all: true) do |f|
  puts "\#{"  " * (f.dirname.to_s.split(/\//).length)} \#{f.basename}\#{"/" if f.directory?}"
end
# =>
# test.rb
# app/
#   README
#   views/
#     article/
#       test1.html.erb
#       test2.html.haml
#     yet/
#       again/
#         super.haml
# foobar.rb
# anti-pattern/
#   nop.html

# À la tree output
Pathname.new(".").visit(all: true) do |f|
  puts "\#{"  " * (f.dirname.to_s.split(/\//).length)} \#{f.basename}\#{"/" if f.directory?}"
end
# =>
# test.rb
# app/
#   README
#   views/
#     article/
#       test1.html.erb
#       test2.html.haml
#     yet/
#       again/
#         super.haml
# foobar.rb
# anti-pattern/
#   nop.html

Complete implementation

As a final example, if you really want to manipulate a tree, you could just build your own tree within the block (you may want to skip reading that):

class Pathname
  def to_tree(options = {:all => false, :hidden => false})
    require 'tree' # rubytree library: http://rubytree.rubyforge.org/rdoc/
 
    raise ArgumentError, "Won't handle '..' paths" if self.to_s =~ /\.\./
 
    # relative paths make it harder to handle structure indexes,
    # see below, so this is a simple workaround
    # TODO: root as complex relative path (relative/path/to/resource)
    #       with Pathname.relative_path_from
    if self.to_s =~ /^[^\/]/ # is a relative path
      relative_path = true
      # we're going to branch subdirs from a fake local root and
      # detached the app node in the end
      tree_root = Pathname.new("./" + self.basename.to_s)
    else
      relative_path = false
      tree_root = self
    end
 
    root_node = Tree::TreeNode.new(self.basename.to_s, self)
 
    tree_root.visit(options.merge(all: false)) do |entry|
      # Each entry is a tree node.
      # Since entry is merely a string, the tree structure must be
      # extracted from the pathname object.
      structure = entry.dirname.to_s.split(/\//) - [""]
      name = entry.basename.to_s
 
      # for the sake of demonstration, force all: false
      # and handle sub-directories implicitly
      unless root_node[structure.first]
        unless structure.first == tree_root.to_s # avoid self-branching
          root_node.root << Tree::TreeNode.new(structure.first,
                                               Pathname.new(structure.first))
        end
      end
 
      unless structure.length == 1
        (1..structure.length-1).each do |index|
          unless root_node.child_from_path(structure[0..index])
            parent_node = root_node.child_from_path(structure[0..index-1])
            parent_node << Tree::TreeNode.new(structure.at(index),
                                              Pathname.new(structure[1..index].join("/")))
          end
        end
      end
 
      if structure == [root_node.name]
        # root leaf
        parent_node = root_node.child_from_path(structure[1..-1])
      else
        # inner leaf
        parent_node = root_node.child_from_path(structure[0..-1])
      end
      parent_node << Tree::TreeNode.new(name, entry)
    end
 
    if relative_path
      return root_node.child_from_path(".", self.basename.to_s).removeFromParent!
    else
      return root_node
    end
  end
end
 
class Tree::TreeNode
  def child_from_path(*args)
    args.flatten!
    child = self
    args.each do |name|
      child = child.send(:[], name) rescue nil
    end
    return child
  end
end
 
[".", "tmp", "./app", ".hidden"].each do |root|
  Pathname.new(root).to_tree(hidden: true).printTree
end

class Pathname
  def to_tree(options = {:all => false, :hidden => false})
    require 'tree' # rubytree library: http://rubytree.rubyforge.org/rdoc/
 
    raise ArgumentError, "Won't handle '..' paths" if self.to_s =~ /\.\./
 
    # relative paths make it harder to handle structure indexes,
    # see below, so this is a simple workaround
    # TODO: root as complex relative path (relative/path/to/resource)
    #       with Pathname.relative_path_from
    if self.to_s =~ /^[^\/]/ # is a relative path
      relative_path = true
      # we're going to branch subdirs from a fake local root and
      # detached the app node in the end
      tree_root = Pathname.new("./" + self.basename.to_s)
    else
      relative_path = false
      tree_root = self
    end
 
    root_node = Tree::TreeNode.new(self.basename.to_s, self)
 
    tree_root.visit(options.merge(all: false)) do |entry|
      # Each entry is a tree node.
      # Since entry is merely a string, the tree structure must be
      # extracted from the pathname object.
      structure = entry.dirname.to_s.split(/\//) - [""]
      name = entry.basename.to_s
 
      # for the sake of demonstration, force all: false
      # and handle sub-directories implicitly
      unless root_node[structure.first]
        unless structure.first == tree_root.to_s # avoid self-branching
          root_node.root << Tree::TreeNode.new(structure.first,
                                               Pathname.new(structure.first))
        end
      end
 
      unless structure.length == 1
        (1..structure.length-1).each do |index|
          unless root_node.child_from_path(structure[0..index])
            parent_node = root_node.child_from_path(structure[0..index-1])
            parent_node << Tree::TreeNode.new(structure.at(index),
                                              Pathname.new(structure[1..index].join("/")))
          end
        end
      end
 
      if structure == [root_node.name]
        # root leaf
        parent_node = root_node.child_from_path(structure[1..-1])
      else
        # inner leaf
        parent_node = root_node.child_from_path(structure[0..-1])
      end
      parent_node << Tree::TreeNode.new(name, entry)
    end
 
    if relative_path
      return root_node.child_from_path(".", self.basename.to_s).removeFromParent!
    else
      return root_node
    end
  end
end
 
class Tree::TreeNode
  def child_from_path(*args)
    args.flatten!
    child = self
    args.each do |name|
      child = child.send(:[], name) rescue nil
    end
    return child
  end
end
 
[".", "tmp", "./app", ".hidden"].each do |root|
  Pathname.new(root).to_tree(hidden: true).printTree
end

Final review

It generates nice nested tree ascii-arts like this:

# * tmp
# |---+ deux
# |    +---+ test
#         +---> super.rb
# +---+ un
#     |---> test.rb
#     +---> test2.rb

It's then easy to build a tree-like CLI.

That is an example of how one can build tailored helpers with a simple, straightforward API, using blocks and recursion.

So, let’s review what we (possibly) learned:

recursive patterns are easy to implement in Ruby
blocks work well with recursion, allowing for custom end criteria
the problem with doing recursion in Ruby is not about recursion itself, but about efficiency (beware massive object instantiations)
memoization or tail-call optimization can help when using a well-defined recursive function

Happy recursions!