Navigating large codebases and contributing to Open Source

(I wrote the following as an email to someone who asked me for advice, so it’s verbose and written in a personal tone.)

I. Set up your system and build the project code

Pick any good Integrated Development Environment (IDE). I like VS Code. You can work in any language in VS Code, be it C++, JS or Python, and you won’t have to learn the interface of a new editor every time you switch languages. Vim is a good choice too. Then of course there’s Emacs and more. I recommend you pick a generic, widely used editor instead of something like, say, Dev-C++ that’s restricted to one language.

Under the hood, VS Code has something called “IntelliSense”, which creates a sort of index of the whole project when you load it. (So when you right-click on a variable/function, you can click on “Go to Definition” to jump to where it is defined/declared, etc.) This index is super useful for navigating a large codebase: a good analogy I read long back is to think of this ‘index’ as a torch for navigating a dark maze. That is basically the difference between an IDE and a more basic editor.
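To make the idea of an ‘index’ concrete, here is a toy sketch in Python. It is not how IntelliSense works internally; it only assumes a project whose files are Python, and it uses the standard ast module to record where every function and class is defined. The file names and contents in SAMPLE_PROJECT are made up for the example.

```python
import ast
from collections import defaultdict


def build_symbol_index(sources):
    """Map each function/class name to the (filename, line) pairs where it is defined.

    `sources` is a dict of {filename: source_text}. A real tool would walk
    the project on disk and handle many more kinds of symbols.
    """
    index = defaultdict(list)
    for filename, text in sources.items():
        tree = ast.parse(text, filename=filename)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index[node.name].append((filename, node.lineno))
    return dict(index)


# Hypothetical two-file project, kept in memory for the example.
SAMPLE_PROJECT = {
    "shapes.py": "class Circle:\n    def area(self):\n        return 3.14\n",
    "main.py": "def run():\n    return 1\n",
}

index = build_symbol_index(SAMPLE_PROJECT)
print(index["Circle"])  # a crude "Go to Definition" lookup for Circle
```

Once such a table exists, jumping to a definition is just a dictionary lookup, which is why an indexed project feels so much faster to explore than plain text search.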

Also figure out how to use the terminal effectively, and modify your terminal’s configuration file to your liking. In the older days, when IDEs were buggy and slow, people would use tools like ctags, cscope and gdb to achieve the same from the terminal. Command-line tools like ctags and cscope help you jump to definitions/functions etc., just like IDEs. It’s interesting to learn these too, in particular if you’re working on C/C++.

Then, get a local copy of the code you want to check out and compile it. Usually there’s a “how to build/run the project” section in the README file.

You can’t do anything if the project doesn’t build on your system! Look at the build script (makefile, requirements.txt, webpack etc., whatever is used on the project) to see if you can find some hints on how the code is actually “booted”. See the list of other libraries the project depends on.
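As a rough sketch of that first look, the Python snippet below walks a directory and reports which well-known build files it contains. The BUILD_FILE_HINTS table is only a small sample I picked for illustration, and the demo project is created in a temporary folder, so treat it as a starting point rather than a complete detector.

```python
import tempfile
from pathlib import Path

# A few common build/dependency files and the tool each one usually belongs to.
# This list is illustrative, not exhaustive.
BUILD_FILE_HINTS = {
    "Makefile": "make",
    "CMakeLists.txt": "CMake",
    "requirements.txt": "pip (Python dependencies)",
    "pyproject.toml": "Python packaging",
    "package.json": "npm / Node.js",
    "webpack.config.js": "webpack",
    "pom.xml": "Maven",
    "build.gradle": "Gradle",
    "Cargo.toml": "Cargo (Rust)",
}


def find_build_files(root):
    """Return (relative_path, tool) pairs for every known build file under root."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name in BUILD_FILE_HINTS:
            found.append((path.relative_to(root).as_posix(), BUILD_FILE_HINTS[path.name]))
    return found


# Demo on a throwaway project layout (hypothetical files).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "Makefile").write_text("all:\n\techo build\n")
    (project / "web").mkdir()
    (project / "web" / "package.json").write_text("{}")
    (project / "notes.txt").write_text("not a build file")
    found = find_build_files(project)

for relative_path, tool in found:
    print(relative_path, "->", tool)
```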

Always read the output of the build process: you’ll find helpful hints on what commands were executed and how the build went.

II. Understand the business logic for the code base

Get a high level understanding of the project. Coding usually doesn’t just involve punching keys. You have to understand what the project is trying to achieve, otherwise you would remain confused.

You’ll have to read the project’s documentation or some literature on the industry. For example, if you’re working on an HTTP networking library, you’ll need to know the HTTP specification fairly well. If you’re working on a library management system, you’ll have to understand how books are categorised in libraries, ISBN format etc.
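To show what I mean by domain knowledge turning into code, here is a small Python sketch of the ISBN-13 check-digit rule: the 13 digits are weighted 1, 3, 1, 3, ... and the weighted sum must be divisible by 10. The function name is my own, and it deliberately ignores the older ISBN-10 format.

```python
def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit; hyphens and spaces are ignored."""
    compact = isbn.replace("-", "").replace(" ", "")
    if len(compact) != 13 or not (compact.isascii() and compact.isdigit()):
        return False
    # Digits in even positions (0-based) get weight 1, odd positions weight 3.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(compact))
    return total % 10 == 0


print(is_valid_isbn13("978-0-306-40615-7"))  # True
print(is_valid_isbn13("978-0-306-40615-8"))  # False (wrong check digit)
```

Without knowing the rule, a function like this in the codebase would look like arbitrary arithmetic; with it, the code is obvious.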

If it’s an open source project, you might even find some talks by the project developers online, or some project slides. So search on the web. See if you can find some architecture diagrams, viz. flowcharts, sequence diagrams, UML class diagrams, or component (block) diagrams. If you can talk to other developers on the project, that would be useful too.

III. Set an initial aim

Decide what you want to do with the code base, i.e. do you just want to learn the code flow, fix some issue, enhance some feature, etc. Try to aim high. Setting audacious goals generally has positive outcomes.

This essentially is a Bottom-Up approach. You isolate a module/component in the codebase, and work on it.

Make small changes and see how they affect the code flow. You can always revert to the original version.

The Top-Down approach would be something like grasping the whole source code before editing anything. While that theoretically sounds like a good idea, it could take too much time and you’ll tire yourself out before modifying even a line of code. Learning the whole codebase is overwhelming for everyone.

New developers at Facebook are encouraged to commit their first code on the first day of joining! This bottom-up approach is a much better start than trying to understand the whole FB architecture.

IV. Make notes as you study the code

Understand the different classes or abstractions, and try to estimate what each module/class/abstraction is doing. Look at the Folders/Files and try to make sense of what they are doing by their names.

I usually like to make a lot of rough notes on paper, noting down the classes, how they interact, and the inheritance structure.

Draw rough class diagrams on the fly. Also draw a call graph, as in which function calls which other function, etc. Diagrams could be especially useful if you’re a visual person. At times, I have even taken printouts of some code and annotated them.
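If the code happens to be Python, you can get a first draft of such a call graph mechanically. The sketch below is a simplification that uses the standard ast module: it only looks at direct calls by name, it lumps calls made in nested functions in with the outer function, and the SAMPLE source is invented for the example.

```python
import ast


def build_call_graph(source):
    """Return {function_name: sorted list of names it calls directly}."""
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):         # plain call: foo()
                        callees.add(func.id)
                    elif isinstance(func, ast.Attribute):  # method call: obj.foo()
                        callees.add(func.attr)
            graph[node.name] = sorted(callees)
    return graph


# Hypothetical source code to analyse.
SAMPLE = '''
def load():
    return read_file("data.txt")

def process():
    data = load()
    return clean(data)

def main():
    process()
    print("done")
'''

graph = build_call_graph(SAMPLE)
for caller, callees in graph.items():
    print(caller, "->", callees)
```

I would still redraw the interesting part by hand afterwards; the point of the diagram is what you learn while drawing it.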

Big codebases usually have a lot of asynchronous calls with multiple processes and threads. You might have to understand how the processes/threads pass data between themselves.
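One very common way threads hand data to each other is a shared queue. Here is a minimal Python sketch of that pattern, with one producer thread and one consumer thread; the function names and the use of None as an end-of-data marker are my own choices for the example, and real projects may use quite different mechanisms.

```python
import queue
import threading


def producer(out_queue, items):
    """Put every item on the queue, then a None sentinel to signal the end."""
    for item in items:
        out_queue.put(item)
    out_queue.put(None)


def consumer(in_queue, results):
    """Take items off the queue until the sentinel arrives, doubling each one."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        results.append(item * 2)


def run_pipeline(items):
    shared_queue = queue.Queue()  # thread-safe channel between the two threads
    results = []
    worker = threading.Thread(target=consumer, args=(shared_queue, results))
    feeder = threading.Thread(target=producer, args=(shared_queue, items))
    worker.start()
    feeder.start()
    feeder.join()
    worker.join()
    return results


print(run_pipeline([1, 2, 3]))  # [2, 4, 6]
```

When you read a big codebase, finding these queues (or pipes, sockets, shared memory) tells you where data crosses from one thread or process to another.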

V. Look for an entry point into the code

Usually, projects keep the core code in “src”, while the built binaries usually go into “bin”. There are also other folders like lib, docs, test etc.

Your IDE will allow you to set up breakpoints in various files to debug the code. So set up a few breakpoints on functions that you think will be ‘hit’ when the code is run. When a breakpoint is hit, the code execution stops and you can see a call hierarchy, i.e. a trace of all the functions that were called to reach the breakpoint.
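You can see the same call hierarchy without an IDE. The Python sketch below uses the standard traceback module to list the functions currently on the stack, which is roughly what the debugger’s call-stack panel shows when a breakpoint is hit. The serve/handle_request/parse_request functions are hypothetical stand-ins for real project code.

```python
import traceback


def current_call_chain():
    """Return the names of the functions on the stack, outermost first."""
    # Drop the last frame, which is current_call_chain itself.
    return [frame.name for frame in traceback.extract_stack()[:-1]]


def parse_request():
    # Imagine a breakpoint here: how did execution get to this point?
    return current_call_chain()


def handle_request():
    return parse_request()


def serve():
    return handle_request()


chain = serve()
print(" -> ".join(chain))
```

The start of the printed chain depends on how the script is launched, but it always ends with serve -> handle_request -> parse_request, which is the path you would read off in the debugger.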

VI. Look for the data models

There are two broad types of coding projects: data driven and functionality driven.

If your code base deals with large amounts of structured data, then reading the DB tables/data structures would be super useful. Here’s a quote that puts it so well:

“Show me your [code] and conceal your [data structures], and I shall continue to be mystified. Show me your [data structures], and I won’t usually need your [code]; it’ll be obvious.” – Fred Brooks
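As a small illustration of the quote, consider a hypothetical data model for the library management example, sketched with Python dataclasses. None of these class or field names come from a real project; the point is how much of the program you can guess just from reading them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Book:
    isbn: str
    title: str


@dataclass
class Member:
    member_id: int
    name: str


@dataclass
class Loan:
    book: Book
    member: Member
    due_on: date
    returned_on: Optional[date] = None  # None means the book is still out


def is_overdue(loan, today):
    """With the data model in view, this logic almost writes itself."""
    return loan.returned_on is None and today > loan.due_on


# Made-up sample records.
loan = Loan(
    book=Book(isbn="978-0-306-40615-7", title="Example Title"),
    member=Member(member_id=1, name="Asha"),
    due_on=date(2024, 1, 15),
)
print(is_overdue(loan, date(2024, 1, 20)))  # True
```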

VII. Logs, Debug msgs, Test folder

When you run/build the code, it shows a lot of info on the output terminal or in Log files.

It might seem daunting initially, but read the output carefully and try to follow what the messages are trying to tell you. You’ll understand a lot about the code flow from here! Eventually, you’ll develop a knack for how to read logfiles and debug messages and this is a super helpful skill.
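Here is a small Python sketch of why log output reveals the code flow. It uses the standard logging module with a format that includes the name of the function that emitted each message. The logger name, the functions, and the messages are all invented for the example.

```python
import io
import logging


def make_logger(stream):
    """Create a logger that writes 'LEVEL function: message' lines to stream."""
    logger = logging.getLogger("demo_flow")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(funcName)s: %(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for a terminal or a log file
log = make_logger(stream)


def load_config():
    log.debug("reading settings")


def start_server():
    log.info("starting")
    load_config()
    log.warning("no port given, using default")


start_server()
print(stream.getvalue())
```

Reading the three lines top to bottom tells you that start_server ran first, called load_config, and then carried on, all without opening the source.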

If you just can’t figure it out, look into the tests folder; you might find some easier/helpful code there that shows how to interact with the main source code.
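To show why tests are such good reading material, here is a sketch of what a small test file might look like, using Python’s standard unittest module. The parse_version function is a hypothetical piece of project code; the three tests tell you what input it takes, what it returns, and how it fails, without any other documentation.

```python
import unittest


def parse_version(text):
    """Hypothetical project function: turn '1.2.3' into (1, 2, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


class ParseVersionTest(unittest.TestCase):
    def test_plain_version(self):
        self.assertEqual(parse_version("1.2.3"), (1, 2, 3))

    def test_surrounding_whitespace_is_ignored(self):
        self.assertEqual(parse_version(" 2.0 \n"), (2, 0))

    def test_non_numeric_part_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_version("1.x")


# Run the tests programmatically so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseVersionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```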

VIII. Language basics are important

Big projects could be using very advanced language features, complex design patterns, clever hacks etc. So a good understanding of the language syntax itself, design patterns and OS fundamentals is super useful.

For example, if you’re working on some mobile app, then understanding the MVP design pattern is almost necessary before even getting started. If the code uses frameworks, you’ll need to understand the framework itself. For example, to build a web application on Django, you’ll need to have gone through the basics of the Django framework at the very least.

You might need to understand how the language interacts with the OS (Android, iOS, Linux, macOS, embedded systems etc.). For example, how does file I/O happen, or how are threads spawned?

IX. Be patient

It takes time, so be patient. Navigating a code base isn’t easy even for experienced developers, especially when many people have contributed to the code.

Additional

You’ll also need to understand:

How code versioning systems work. Most probably git is being used.
How bugs are reported and handled on the project.
Coding standards: how variables/functions are to be named, how many spaces are used for indentation, etc. Be sure to follow the guidelines set up by the project’s maintainers.

More Tools

Tools like grep and find are super helpful.
tree, or tree | less for long output (you can do brew install tree to get tree on Macs).
For C/C++, callgrind + kcachegrind are interesting for viewing call graphs.
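If you are on a machine where grep and find are not available, the same kind of search is easy to approximate. The Python sketch below is a rough stand-in, not a replacement for the real tools: it searches every file matching a glob under a folder for a regular expression. The grep_tree name and the demo files are made up for the example.

```python
import re
import tempfile
from pathlib import Path


def grep_tree(root, pattern, glob="*.py"):
    """Return (relative_path, line_number, line) for every matching line under root."""
    root = Path(root)
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


# Demo on a throwaway project (hypothetical files).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "app.py").write_text("def main():\n    start_server()\n", encoding="utf-8")
    (project / "pkg").mkdir()
    (project / "pkg" / "server.py").write_text("def start_server():\n    pass\n", encoding="utf-8")
    hits = grep_tree(project, r"start_server")

for relative_path, lineno, line in hits:
    print(f"{relative_path}:{lineno}: {line}")
```

Searching for a function name this way quickly shows both where it is defined and where it is called from.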