How I built This -

1. API Clients

A small illustration of the broad range of clients that use the API

There’s a never-ending list of weird, wonderful, and always interesting uses that thousands of developers have found for the Cat API: being baked into the Kubernetes tests, Tinder for cats, Discord bots, weekend hacks with Raspberry Pis, games, IoT kiosks, office dashboards, and frequently teaching classrooms full of students, young and old, to code.

The API needs to accept requests from them all reliably under any load, with a predictable response, and yet be nimble enough to innovate upon.

Their requests hit the API’s endpoint, and from there move on to the ….

2. Load Balancer

Shows load balancer automatically triggering new servers to start as the load increases

This allows the API to scale based on the load - how many requests it receives in a given period. It spreads the requests across all the API Servers, and if the load gets too high it will trigger the automatic creation of another Server - scaling horizontally. If any Servers become unavailable it will reroute traffic to the others.
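The routing side of this can be sketched as a simple round-robin over healthy servers. This is an illustrative model only - the real balancing is handled by AWS, and the server names and `healthy` flag here are assumptions for the sketch:

```javascript
// Round-robin across servers, skipping any marked unhealthy.
// Adding a server to `servers` is the 'scale horizontally' step;
// flipping `healthy` to false is the 'reroute traffic' step.
class RoundRobinBalancer {
  constructor(servers) {
    this.servers = servers; // e.g. [{ name: 'api-1', healthy: true }, ...]
    this.index = 0;
  }

  // Return the next healthy server, wrapping around the list.
  next() {
    for (let i = 0; i < this.servers.length; i++) {
      const server = this.servers[this.index];
      this.index = (this.index + 1) % this.servers.length;
      if (server.healthy) return server;
    }
    throw new Error('No healthy servers available');
  }
}
```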

3. API Servers

Showing function of API server, resolving requests into responses

These run the main application, which is written in NodeJS. They receive API requests via the Load Balancer, process them according to the business logic, communicate with backend services & data storage, then send a response back to the client.

The business logic might be:

- ‘search for a random image of cats wearing hats’ - find a random image from the Data Store with category_id=4

- ‘save an image as a favourite with a custom sub_id’ - validate the image exists, then save the data to the Write DB.

Any tasks that will take a long, or unknown amount of time are turned into Jobs and added to the Queue so they don’t hold up other requests.
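The two example routes above can be sketched as plain functions. The helper shapes here (an in-memory `images` array standing in for the Data Store, a `writeDb` array standing in for the Write DB) are assumptions for illustration, not the real data layer:

```javascript
// 'search for a random image of cats wearing hats':
// pick a random image with category_id=4 (the hats category from the text).
function randomHatImage(images) {
  const hats = images.filter((img) => img.category_id === 4);
  if (hats.length === 0) return null;
  return hats[Math.floor(Math.random() * hats.length)];
}

// 'save an image as a favourite with a custom sub_id':
// validate the image exists first, then persist to the Write DB.
function saveFavourite(images, writeDb, { image_id, sub_id }) {
  const exists = images.some((img) => img.id === image_id);
  if (!exists) return { status: 400, error: 'image not found' };
  writeDb.push({ image_id, sub_id });
  return { status: 200 };
}
```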

I’ve opted for AWS Elastic Beanstalk instead of GKE (Kubernetes) - GKE would be cheaper and more powerful, however Elastic Beanstalk is quicker to get going and maintain for a tightly scoped project such as this.

4. Data Storage

Shows all the components that make up the data store

**Object Storage** - Stores files like uploaded Images, and Logs from the servers.

**Job Queue** - Temporary queue of jobs to be picked up by backend workers e.g. ‘analyse an uploaded image’, ‘roll up analytics from log files’, ‘webhook some data’, ‘create & email a report’. Some tasks like ‘send a welcome email’ skip to the front of the queue.
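The ‘skip to the front’ behaviour can be sketched as a two-level queue where certain job types are always dequeued first. The job type names are illustrative assumptions:

```javascript
// Job types that jump the queue (e.g. welcome emails to new signups).
const PRIORITY_TYPES = new Set(['send_welcome_email']);

class JobQueue {
  constructor() {
    this.priority = [];
    this.normal = [];
  }

  enqueue(job) {
    (PRIORITY_TYPES.has(job.type) ? this.priority : this.normal).push(job);
  }

  // Workers pull priority jobs first, then normal jobs in FIFO order.
  dequeue() {
    return this.priority.shift() || this.normal.shift();
  }
}
```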

**Data Cache** - A Redis in-memory (RAM) data store that provides faster read & write access than a Database (disk). Data is only kept here for a short time, or until a change is made to it, e.g. if a favourite is deleted then it is ‘invalidated’ (removed) from the Cache. It saves the response sent to the user, rather than the raw data from the DB, so the same business logic doesn’t need doing again.
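This is the cache-aside pattern, sketched below with a `Map` standing in for Redis and a callback standing in for the DB read plus business logic - both stand-ins are assumptions for the sketch. Note that what gets cached is the finished response, as described above:

```javascript
class ResponseCache {
  constructor(buildResponse) {
    this.cache = new Map(); // stand-in for Redis
    this.buildResponse = buildResponse; // runs business logic + DB read
  }

  get(key) {
    if (!this.cache.has(key)) {
      // Cache miss: do the expensive work once and keep the result.
      this.cache.set(key, this.buildResponse(key));
    }
    return this.cache.get(key);
  }

  invalidate(key) {
    this.cache.delete(key); // e.g. when a favourite is deleted
  }
}
```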

**Read DB(s)** - These are replicas of the Write DB. If the data is not found in the Cache then it is read from one of the Read Databases. The replicas provide redundancy if the Write (Master) database becomes unavailable, and a way to read data without holding up the saving of new data.

**Write DB** - This master database saves any new data (Images, Votes, Favourites etc.), and quickly syncs the new data across to all the Read (replica) databases. Data is typically saved in batches to prevent it becoming a bottleneck during heavy write traffic.
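The batching idea can be sketched as a buffer that flushes once it reaches a batch size. The `flush` callback here just receives the batch; in the real system it would run one multi-row INSERT against MySQL:

```javascript
class BatchWriter {
  constructor(batchSize, flush) {
    this.batchSize = batchSize;
    this.flush = flush; // e.g. one multi-row INSERT per batch
    this.buffer = [];
  }

  write(row) {
    this.buffer.push(row);
    if (this.buffer.length >= this.batchSize) {
      // Hand the full batch off and clear the buffer in one step.
      this.flush(this.buffer.splice(0));
    }
  }
}
```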

The main application uses MySQL as the database - it’s safe and ‘boring’, which is perfectly fine for storing structured relational data, i.e. data that relates to other data, e.g. Votes/Favourites to Images.

Some of the job workers use NoSQL databases as they communicate with external services which might return data of differing sizes, formats & types and need storing as documents.

5. Serverless Job Workers

Serverless - someone else's servers up in the cloud, on demand, running a Function

Serverless (Lambda) functions are perfect for scheduled, or ad-hoc short tasks because they can scale up instantly to meet demand, automatically retry failed jobs (e.g. communicating with an external email service), and don’t hang around racking up bills like a Server would.

I use them to:

  • Send uploaded images to image analysis to check for inappropriate content, categorise them, and confirm they actually contain a kitty.
  • Delete any images that fail analysis.
  • Resize images into the different sizes available as query parameters (thumb, small, med, full e.g. size=full).
  • Create ‘rollups’ from the Log files with Athena for analytics.
  • Send welcome emails to new signups.
  • Analyse the user Votes for unpopular images to take out of rotation.
  • Send me emails about potential issues before the Cloudwatch alarms kick in.
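As one concrete example, the resize task has to resolve the `size` query parameter into a target variant. A minimal sketch, assuming hypothetical pixel widths (the real values aren’t stated in this article):

```javascript
// Hypothetical widths per named size; 'full' means the original file.
const SIZE_WIDTHS = { thumb: 100, small: 300, med: 600 };

function variantForSize(size = 'full') {
  if (size === 'full') return { resize: false };
  const width = SIZE_WIDTHS[size];
  if (width === undefined) throw new Error(`unknown size: ${size}`);
  return { resize: true, width };
}
```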

6. CDN

When an image has been approved for distribution it gets added into the S3 Object Storage bucket that is served to users.

This bucket acts as the source location for all images, but as it is in one location, users further away would take longer to load an image than users nearby. To get around this I use AWS Cloudfront & Cloudflare as the Content Delivery Network (CDN) - a network of storage servers around the world that each hold a copy of the Image and serve it to nearby users, instead of from the source location.

Cloudflare CDN locations around the world

There are of course additional costs, and I’ll break these down fully in an upcoming article so people can see a comparison.

7. External Services

There are some things that I don’t use the AWS or GCP stacks for:

SendGrid: Sending emails to users when they signup containing their API Key - it’s secure and reliable.

Slack: Sends me real-time telemetry updates like ‘XXX number of signups today’, ‘Image XXX is popular today’, etc.

Rollbar: Error logging & alerting service, with stack traces for debugging.

Trello: Ticket & task management. The Trello board is public so anyone can see it, and if anyone posts a bug on the forum I create a ticket in Trello so they can follow updates.

8. Image Analysis

Image showing results from AWS Rekognition for a Cat picture

I had to build a basic image analysis engine years ago when the Cat API first launched in 2012, as there weren’t any on the market - it was far from perfect. The AWS & GCP versions of today are a vast improvement, although neither handles .gif files.

AWS Rekognition has different services (each chargeable):

  • Labels - a list of category style objects it has found in the image, along with a 0-100 confidence score e.g. Mammal - Confidence 80.123, Cat - Confidence 80.67. As you can see from the image above, these are of mixed usefulness.
  • Moderation Labels - a list of anything that would mark the image as ‘Unsafe Content’ like nudity or suggestive content. As a use-case for the API is in classrooms anything here would cause the image to be rejected.
  • Text - any machine readable text in the image. This generally causes an image to be rejected to be on the safe side.
  • Faces - any human faces in the image. This generally causes the image to be rejected too; the API’s about Cat images after all, not humans.
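The rejection rules above can be sketched as one decision function over a Rekognition-style result. The field names and the 70% confidence threshold are illustrative assumptions, not the API’s real values:

```javascript
function shouldRejectImage(analysis) {
  // Any 'Unsafe Content' moderation label rejects outright.
  if (analysis.moderationLabels.length > 0) return true;
  // Machine-readable text rejects, to be on the safe side.
  if (analysis.textDetections.length > 0) return true;
  // Human faces reject too - the API is about cats, not humans.
  if (analysis.faces.length > 0) return true;
  // Finally, require a reasonably confident 'Cat' label.
  const cat = analysis.labels.find((l) => l.name === 'Cat');
  return !(cat && cat.confidence >= 70);
}
```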

GCP Vision is different, and in some ways superior due to Google’s vast index of search results. I’ll go into more detail about it, how I use the results to validate images, and how .gifs are moderated in a full article.

All this data is available via the API when requesting an Image via ‘/images/{image_id}/analysis’.

9. API Event Logging & Analysis

As CTO at both my own and other companies, I’ve spent millions of dollars over the years on data pipelines, warehouses, 3rd party vendors, and specialist contractors. The price of doing the same thing with mature stacks has thankfully come down by orders of magnitude.

The closer to real-time the data needs to be, the more expensive it is, and in the vast majority of cases hourly data is fine. Being pragmatic about this use-case, there is no need to use BigQuery or Redshift - the respective GCP & AWS platforms - for storing per-request logs. If the output is well defined and simple then ‘timeboxed’ totals - or ‘rollups’ - can be used instead, and the raw logs backed up. This has the added benefit of not storing more data than needed, like raw user-agents. It’s all too tempting in most companies to simply store everything “just in case it’s needed later” - that’s simply not acceptable in today’s world.

By accepting a delay in serving analytics, I can also do away with a data firehose via Kinesis & real-time ETL, and instead rotate the logs from the servers themselves into S3 and use AWS Athena to run queries against them. Creating an AWS Kinesis/Glue/Redshift stack to process the same amount of data would eventually cost thousands vs Athena’s pennies.
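The rollup itself is simple aggregation - collapse per-request entries into hourly totals so only the totals need long-term storage. A minimal sketch in plain JavaScript, with illustrative field names (the real rollups run as Athena queries over the S3 logs):

```javascript
// Collapse per-request log entries into hourly totals per endpoint.
function hourlyRollup(logEntries) {
  const totals = {};
  for (const entry of logEntries) {
    // Truncate the ISO timestamp to the hour, e.g. '2020-01-01T13'.
    const hour = entry.timestamp.slice(0, 13);
    const key = `${hour} ${entry.endpoint}`;
    totals[key] = (totals[key] || 0) + 1;
  }
  return totals;
}
```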

10. BI Tools

My aim is to provide as much value to as many people as possible. To know whether I’m successful in that, and where to do better, visualisation and reporting tools are essential.

For this I picked Google Data Studio - it’s free, and enables me to simply connect a Read DB and start gathering insights.

11. Alerts & Issue Management

Things break. The key is to know about it quickly (ideally beforehand), and to track the progress made to diagnose & fix it. Rollbar is excellent as an external service.

Cloudwatch, along with some Lambda functions, lets me know if there are any spikes in traffic that aren’t being handled well, or if any Job workers are having recurring errors. These are added to the internal Trello board, along with any related reports or log files.

If any bugs crop up that I should make the public aware of, the Trello ticket is mirrored across to the public Roadmap board.