Tap, tap, tap! Your software engineering team is typing away, day after day, week after week. Are you going to have an asset that is of marketable quality, competitively updatable, and responsive to changing customer needs? Or are you going to have an archaic behemoth with undesirable, unfixable bad behavior while competing products leave you in the dust? It depends on the quality of your software.
“But it works!” you say. “And we don’t have time for fancy details! My team is already behind schedule!” But software quality best practices exist to save you time and money. There’s a terribly thin line between working software and software that’s so hopelessly broken it must be thrown out and rewritten from scratch. Development best practices are lifelines to keep you on the “working software” side of the division. Here are the basic practices your engineers should be using.
Documentation comes in two types: descriptions inside the code files (the coder uses special characters to tell the computer to ignore those lines), and separate documents with diagrams and descriptions of how your system works. Both are equally important. Think of documentation as the fire plan of your large factory complex: when part of your building is on fire, you want the firefighters (who you’ve never met before) to be able to look at a map to figure out how and where to fight the fire, and also how to save important assets. This isn’t “one and done”: documentation needs to be updated every time the code changes. Emphasis on documentation is a cultural value to build into your team.
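For a sense of what the first type looks like, here is a minimal sketch in Python. The function and its names are hypothetical, made up for illustration. The text in triple quotes and the lines starting with `#` are the descriptions the computer ignores; they exist only for the next human reader.

```python
def monthly_payment(principal, annual_rate, months):
    """Return the fixed monthly payment for a loan.

    principal:   amount borrowed, in dollars
    annual_rate: yearly interest rate as a fraction (0.06 means 6%)
    months:      number of monthly payments
    """
    # A zero-interest loan is just the principal split evenly.
    if annual_rate == 0:
        return principal / months

    monthly_rate = annual_rate / 12
    # Standard fixed-payment (annuity) formula.
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)
```

Without those few lines of description, the next engineer has to reverse-engineer what `annual_rate` means and why the zero case is handled separately.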
Project folder structure
Unless your projects are very short scripts (“fetch files X, Y, and Z”, “change these file names to that”, etc.), the code should be organized into multiple files. The files should be organized into folders, following the conventions of your programming language and framework. This makes the relevant code faster to find and reduces bugs and errors.
You have contracts to hire your employees. You have contracts to rent your office. You have contracts to sell to clients. Guess what: you also have contracts, called licenses, that govern how you build and use your code. Make sure you aren’t breaching them. Each of the following is licensed unless you built it yourself from scratch: your dataset; your images; your model; your software libraries and dependencies; your IDE (software to help you write software); your third-party cloud ML; all cloud services; your webhost. Even your usage of other people’s websites! (Here’s looking at you, “I’ll scrape it myself” reader...)
When you are writing a document, you can revert to the last saved version. This is important, in case you screw up something, right? Well, software is written like this, except it uses a special save function that saves EVERYTHING. And prior versions are organized systematically, so it’s super easy to choose which version you want and to identify edits and changes to the code. This is important when it’s Friday night and your live system just crashed and you don’t know why. You can easily identify the 2 lines of code that Fred just added, or you can revert the system to the last working version before the crash: version control enables both options.
But in order for version control to save your hiney, your team must be in the habit of using it properly! Each member should be committing their changes every few hours, and separate “branches” hold in-progress versions of the code while they are being worked on. When a feature is finished and appears to be bug-free, it is merged into the main, live version that your clients see.
Model and data version control
Standard version control (such as Git) is designed for code and handles large data files poorly. But your data and your ML models need to be versioned too, so if something unexpectedly breaks or fails to work, you can always revert to a working version.
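One building block behind data and model versioning is a content fingerprint: a short code computed from the bytes of a file, so that any change to the file produces a different code. The sketch below is an illustration only, not a full versioning system; the function name `fingerprint` is hypothetical, and dedicated tools handle storage and history on top of this idea.

```python
import hashlib


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path.

    The file is read in chunks so that large datasets or model files
    do not have to fit in memory all at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the fingerprint of the dataset and model next to each code version lets your team confirm later that they are looking at exactly the same files that were used when things last worked.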
One way to find out if your software has a bug is to sell it to a client and wait for the client to complain. Alternatively, we use unit testing and integration testing. Every time you write a new chunk of code, you write one or more unit tests which, as specifically as possible, test just the new code you wrote. If you wrote a new code method to read in a piece of data and create a code object (a collection of numbers and strings structuring that data), you might write a unit test that contains a fake piece of data, runs the method to read it and create an object, then compares the expected fields in the object with the actual fields. If they don’t match, your test fails. Tests run automatically. When your test fails, your system will loudly complain until you fix it. Integration tests are like unit tests, except they run most/all of your code at once. Integration tests take longer and use more resources, so you want to catch as many problems as possible with unit tests.
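The unit test described above might look roughly like this in Python, using the standard `unittest` module. The `Customer` object and the `parse_customer` method are hypothetical stand-ins for whatever your own code reads and builds.

```python
import unittest
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    age: int


def parse_customer(line):
    """Turn a line such as 'Ada, 36' into a Customer object."""
    name, age = line.split(",")
    return Customer(name=name.strip(), age=int(age))


class ParseCustomerTest(unittest.TestCase):
    def test_fields_match_fake_data(self):
        # Fake piece of data, invented just for the test.
        customer = parse_customer("Ada, 36")
        # Compare expected fields with actual fields.
        self.assertEqual(customer.name, "Ada")
        self.assertEqual(customer.age, 36)
```

Saved in a file, this can be run with `python -m unittest`. If someone later changes `parse_customer` so that the age comes back as text instead of a number, the test fails immediately, long before a client sees the problem.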
Beginning coders quickly learn the fastest way to debug their code: they insert “print statements” to print pieces of data to the console. (Sometimes every other line is a print statement!) When something doesn’t print, or prints wrong, you know the bug occurs before that point in the code.
But print statements aren’t the professional approach. Each major programming language has one or more logging systems to systematically record important information, complete with timestamps, code locations, custom messages, etc., all saved in a special file for easy reference. This organization reduces errors and is critical for quickly diagnosing real-time crashes and disasters. It can also be useful for analytics.
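As a rough illustration, here is what that looks like with Python’s built-in `logging` module. The logger name `orders`, the helper `build_logger`, and the `process_order` step are assumptions made up for this sketch; the format string uses standard `logging` fields for the timestamp, severity, file name, and line number.

```python
import logging


def build_logger(log_path):
    """Create a logger that appends timestamped records to log_path."""
    logger = logging.getLogger("orders")  # hypothetical component name
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(filename)s:%(lineno)d %(message)s"
    ))
    logger.addHandler(handler)
    return logger


def process_order(logger, order_id, quantity):
    """Hypothetical business step that records what happened."""
    logger.info("Processing order %s", order_id)
    if quantity <= 0:
        logger.error("Order %s has invalid quantity %s", order_id, quantity)
        return False
    return True
```

Unlike a print statement, each record lands in a file with the time and the exact code location attached, and the severity levels (`INFO`, `ERROR`) make it possible to filter for the lines that matter during an outage.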
There are always going to be bugs in your software. Some are huge, some are small, some are side effects of disagreements in architecture design, some are careless mistakes, some disappear and reappear over months and are hard to track down. To manage all these different kinds of bugs, use bug tracking software. There are different options available, such as the issue trackers in GitHub and Jira, but they all systematically organize your bug reports along with comments, notes, and code versions for diagnosing and fixing the bug, plus statuses such as fixed/unfixable/being-fixed. And you can search your history of bug reports, so when that annoying bug reappears every 6-12 months, you can find the previous bug report for more context in trying to fix the problem.
Libraries, dependencies, and compartmentalized environment
When we write code, we don’t just use one pre-defined set of code words and write our new code in a vacuum. We reference libraries, which are other pieces of already-written code, and dependencies, which are third-party libraries we automatically download from the cloud. Our code may also rely on specific operating system settings and locally installed software, and it might break if those differ. So we use compartmentalized environments to control which local software and operating system options our code has access to.
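To make this a little more concrete, here is a small sketch of how Python code can report the environment it is running in. The function name `describe_environment` is hypothetical, and the virtual-environment check assumes Python’s standard `venv` style of isolation; other environment tools may need a different check.

```python
import platform
import sys


def describe_environment():
    """Summarize the interpreter, operating system, and isolation status."""
    return {
        "python_version": platform.python_version(),
        "operating_system": platform.system(),
        # Inside a venv-style environment, sys.prefix points at the
        # environment folder rather than the base Python installation.
        "in_virtual_environment": sys.prefix != sys.base_prefix,
    }
```

Writing this summary out at startup gives your team a quick answer to “what was this actually running on?” when code works on one machine and breaks on another.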
There’s more, of course, but that’s enough to get you started in the right direction!