I wanted to provide a round-up of some changes to the vmst.io infrastructure since it went live last month.
No More Cloudflare
Our domain had always been registered through Cloudflare, which also provided DNS. For a brief period I tested their WAF (web application firewall) service with the site, but it led to more problems than perceived benefits. The sentiment within the Fediverse is generally negative towards Cloudflare, although many other instances use them.
When attacks were launched against various ActivityPub instances by a bad actor protected by Cloudflare, I decided it was time to stop using their services. The domain is now registered through a different service but is in the process of being transferred to Gandi. DNS services are provided through DNSimple. I intentionally broke up these two components. DNSSEC is currently not enabled for the domain, but will be as soon as the transfer work is completed.
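Once the transfer is done and DNSSEC is on, anyone can verify it from the outside with a couple of standard dig queries:

```shell
# The "ad" (authenticated data) flag in the response header means the
# resolver validated the answer via DNSSEC
dig +dnssec vmst.io SOA

# A signed zone also publishes a DS record at the parent zone
dig vmst.io DS +short
```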
I may look for an alternative to provide a level of DDoS/WAF protection for the site as we grow. For the time being, your secure connections terminate directly at the Digital Ocean managed load balancers and CDN.
We launched with free Let's Encrypt digital certificates for the site and CDN. Let's Encrypt is designed to be a fully automated certificate authority. I love Let's Encrypt and everything they stand for. Unfortunately due to the way our web servers, CDN, and load balancers are configured, automation was easier said than done.
While I could have continued to manually generate the certificate and apply it to the various components every 90 days, I decided to take that responsibility off my plate and purchase a certificate through Sectigo instead. Not only does this extend the renewal interval to a year, the generation and application is simpler for me on the backend.
Additionally, docs.vmst.io has been moved from the Digital Ocean static site generator to Netlify. The major reason was to allow the use of customer-provided certificates. Digital Ocean only uses certificates issued by ... Cloudflare.
Our status.vmst.io page will continue to serve a Let's Encrypt certificate, as there is no mechanism to provide a customer certificate on that service through UptimeRobot.
No one wants to think about backups.
Backing up the instance on Masto.host was provided by the service.
Until recently, with the focus on just getting things established, I was backing up only the database, manually and infrequently. I liked nothing about this.
I'm currently trialing a service called SimpleBackups that integrates with Digital Ocean to connect to the Redis and Postgres databases and the CDN/Object Store, as well as use native connections to GitHub, to automate and perform regular backups of the infrastructure on a daily basis.
Once I have a handle on size, load, and timing, we'll take more frequent backups. Backups are done to locations outside of Digital Ocean, so in the event of a disaster that impacts the Toronto or NYC datacenters where our data lives, or if Digital Ocean decided to go evil and delete our account, we'll be able to recover the data from a combination of AWS and Backblaze.
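For anyone curious what the shape of an off-site database backup looks like, here's a minimal dry-run sketch. The hostname and bucket are hypothetical, and SimpleBackups handles all of this for us; this just echoes the commands it would boil down to:

```shell
#!/bin/sh
# Sketch of a nightly off-site database backup (hypothetical endpoint and
# bucket names; this is a dry run that only echoes the commands)
DB_HOST="db.example.internal"      # assumption: managed Postgres endpoint
BUCKET="s3://vmstio-backups"       # assumption: off-provider object store
STAMP=$(date -u +%Y-%m-%dT%H%M%SZ)
DUMP_FILE="mastodon_production-${STAMP}.pgdump"

# Echo the commands rather than executing them
echo "pg_dump -h ${DB_HOST} -Fc mastodon_production -f ${DUMP_FILE}"
echo "aws s3 cp ${DUMP_FILE} ${BUCKET}/${DUMP_FILE}"
```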
The configurations for all of the Docker and Mastodon components necessary to reconstitute the site (even on a totally different provider) are all stored in a private GitHub repository, also backed up (in multiple locations) to allow quick recovery of any critical component.
Reduced Frontend Count
We originally launched with three frontend servers. After spending the last month tweaking Sidekiq, Puma, database pools, and various other Mastodon-related services, I decided to vertically scale the frontend systems but reduce the count to two. This is actually cheaper than running three smaller ones.
Should we experience a need to scale back up to three, it will be trivial to do so as the front end servers are actually an image on Digital Ocean that can be deployed within a few minutes. I originally wanted three to allow flexibility during maintenance operations, if a server was down for updates and we experienced a load spike or other event. Because of the ease of image deployment and the centralized configuration I have put in place, I can temporarily deploy an additional front end system while another is out of service with just a few clicks.
Additionally, after making adjustments post-Cloudflare, the load balancer should serve up connections via the HTTP/2 protocol. While mostly transparent to users, this has the effect of drastically reducing load on the web frontend.
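The Digital Ocean load balancer handles this for us, but for anyone terminating TLS with nginx in front of Mastodon directly, enabling HTTP/2 is a one-word change to the listen directive. A sketch (certificate paths are illustrative):

```nginx
server {
    listen 443 ssl http2;    # "http2" enables multiplexed connections
    server_name vmst.io;

    ssl_certificate     /etc/ssl/vmstio/fullchain.pem;
    ssl_certificate_key /etc/ssl/vmstio/privkey.pem;

    # proxy_pass blocks for the Puma and streaming upstreams omitted
}
```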
I’ve had a few thoughts rattling around about registration numbers on vmst.io versus instances that are generally wide open. The instance officially went live on October 6, but I didn’t let anyone else join until a few weeks later. We are listed on joinmastodon.org, so we definitely see more random user signups than instances that just rely on word-of-toot.
Because we ask folks to "apply" for their account here rather than just sign up without moderation, we get maybe 1/10th the number of registrations you’d see somewhere else. The bulk of folks coming from Twitter en masse were scared and impatient, unwilling to wait. I base this on my observations of registration activity during the mass migration of folks coming over after the Twitter layoffs.
Some sites that had open registration with fewer than 100 users at the start of November ended that week with over 30,000 accounts. We had probably 3,000 applications during that two-day period, before the decision was made to close registrations temporarily.
All of this is to say, we have just over 1,600 members right now. We could have had a lot more if we wanted to.
Of those who apply, we have a method of deciding who to let in:
- We look at the username and display name and reject anything that’s obviously distasteful given the type of community we are seeking to build (xxxlol69, etc)
But specifically we ask folks to give a reason why they want to join:
- If people put nothing there, it’s rejected.
- If they put “idk” it’s rejected.
- If they just tell me that “Elon sucks” it’s rejected.
- If it just looks kinda sus... it’s rejected.

We also decided that if the application isn’t in English it will be rejected.
We are clear in our site description that we are English speaking. This isn't done out of some desire to limit interaction of folks in other languages. We do this only because of our current inability to moderate non-English posts.
We try never to approve a user until they’ve confirmed their email account. We’ll manually trigger at least one reminder email for confirmation but if after a few days the account remains unconfirmed, we remove it from the queue.
That basic level of filtering probably means about 1/5 of the people who apply are accepted. That isn't some target/goal, that's just the rough estimate based on the facts above.
As I was writing this, there were 9 people in the queue.
I approved 2.
There has been a noticeable uptick in the amount of junk registrations. On Tuesday I rejected probably 20 in a row that were obviously just spamming our registration page. We’ve periodically closed registrations when we needed to, and had an extended period as we migrated from Masto.host to running on our own infrastructure.
With that exception, we’re not keeping things small because of major infrastructure considerations at this point. We want things to be performant for the folks who are here, but we can scale a lot higher if we choose to.
We do this mostly because we want to scale the community here in a responsible way.
This post is a rollup and expansion of a set of Toots around the new vmst·io infrastructure that went live on the morning of Wednesday, November 23, 2022.
When I launched vmst·io at the start of October, it was intended to just be a personal instance. mastodon.technology had just announced its pending closure, and I wanted to see what it was like to own my own corner of the Fediverse.
I signed up for a $6 plan with masto.host and migrated my account. Everything was great, except it was kinda boring. Being on a single-user instance means your local timeline is just you, and your federated timeline is just the same people you follow.
So I invited a few friends to join me, and upped the hosting to the $9 plan. Then Elon officially bought Twitter and suddenly a few more friends wanted to join me, so I went to $39. Then Elon purged Twitter staff and suddenly I needed about $79 worth of hosting.
Even before I went to the $39 plan, I started wondering if I could run this myself. So I started digging into documentation, testing various providers, and building an architecture. That is what we moved into on November 23. Now that things have settled, want to take a peek behind the firewall?
Horizontal or Vertical
When we talk about scaling any platform, there are generally two directions you can go. Horizontal or vertical.
Vertical scaling is generally easy: if your app needs more memory than the host has, add more. If it needs more CPU and it's multi-threaded, add more. Horizontal scaling is sometimes a little more tricky: it means adding more instances of your application. Even though we're a small instance in comparison to places like hachyderm.io or infosec.exchange, my goal was to build us from the start to be able to go in both directions.
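As a concrete sketch, with Digital Ocean's doctl CLI either direction is a single command. The droplet IDs, image ID, and size slugs below are hypothetical:

```shell
# Vertical: give an existing droplet more CPU/memory
doctl compute droplet-action resize 123456789 --size s-4vcpu-8gb --resize-disk

# Horizontal: stamp out another frontend from a saved image
doctl compute droplet create vmstio-worker-3 \
  --image 987654321 --size s-2vcpu-4gb --region tor1
```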
Almost all of our new infrastructure lives in the Toronto and New York data centers of Digital Ocean. Email notifications are handled by Sendgrid. DNS resolution comes through Cloudflare.
All public traffic is encrypted, and what isn’t encrypted happens on private networks. We are using managed load balancing and database services. The various self-managed services run on Debian-based Docker hosts.
Why Digital Ocean?
While the company I work for is a major partner of the major public cloud players, I like to support the littler/independent folks.
I’ve hosted many things in Linode and Digital Ocean over the years, and in comparing the two it’s really a toss-up in price and features. The managed database offering is what finally pushed me to Digital Ocean. The uncertainty around Linode's recent acquisition by Akamai also weighed in.
One thing I wanted to do was put the backend databases (PostgreSQL and Redis) into managed instances, because I'm not a DBA or an expert in either platform. Linode offered only managed PostgreSQL, and to get no-downtime upgrades you had to purchase their high availability option which was a minimum 3x increase in price for a similarly spec'd platform.
Digital Ocean also had support for pgBouncer built in. More on that later.
I’m in the central US, so response times to any DC offering in North America are usually pretty good. I had our moderators and some friends in other countries (Europe) test and they came back with Toronto as the lowest on average. Also, I figured putting the servers in Canada would make sure they were polite. 🇨🇦
The object store is in New York because it was the closest geographical DC where DO offered the service. The speed of light is only so fast.
I tried Mailjet first and found it finicky. I tried Sendgrid and it worked the first time and every time since. What I discovered later was that by default Sendgrid includes tracking links/images in messages. I have zero interest in knowing if you’re opening your notifications and I personally run blockers to disable all this junk in my own email.
So while it’s relatively benign, and is part of the service offering, it’s not consistent with our privacy policies, so it has been disabled going forward.
Cloudflare is our domain registrar, and also DNS provider.
What does vmstio-exec do?
vmstio-exec is essentially the master Mastodon node, holding the configuration that is presented to the worker nodes. Also, unlike on the workers, Mastodon is not in a container, so I can do things like have direct database access and use utilities like `tootctl` without impacting frontend traffic.
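Having `tootctl` directly on the exec node means routine admin tasks never touch the workers. A few examples of the kind of thing that gets run there (these are real tootctl subcommands, though the retention value is illustrative):

```shell
# Run from the Mastodon directory on vmstio-exec
RAILS_ENV=production bin/tootctl media remove --days 7   # prune cached remote media
RAILS_ENV=production bin/tootctl accounts cull           # clean up accounts from dead servers
RAILS_ENV=production bin/tootctl cache clear             # flush the Redis cache
```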
Mastodon requires one Sidekiq node to run the "scheduler" queue, so that's where it sits. It also has more CPU and memory allocated so it can process other Sidekiq jobs while the worker nodes focus on web traffic.
The NFS share is used to make sure that all of the worker nodes always have the same/latest copy of the configuration and certificates.
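As a sketch, the mount on each worker could look something like this in /etc/fstab (the hostname and paths are hypothetical):

```
# Read-only NFS mount of the shared config from the exec node
vmstio-exec.internal:/srv/mastodon-config  /mnt/mastodon-config  nfs4  ro,noatime  0  0
```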
What do the vmstio-workers do?
These are the frontend nodes for the site. User requests (either direct or from federated instances) flow through Cloudflare to our Digital Ocean managed load balancer. This load balancer can currently handle up to 10,000 concurrent connections, and can easily scale beyond that with a few clicks.
The load balancer monitors the health of every worker node, and if they're reporting that they're available, nginx will accept user connections.
The workers run a complete deployment of Mastodon in Docker containers. Docker allows each startup of the application components to be a clean boot from the image provided by Mastodon. (I hope to move the nginx component to a Docker container before long.)
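For reference, the stock Mastodon docker-compose file defines the pieces each worker runs. Trimmed down, it looks roughly like this (the version tag is illustrative, and the real file also defines healthchecks, networks, env files, and volumes):

```yaml
version: '3'
services:
  web:
    image: tootsuite/mastodon:v4.0.2
    command: bash -c "rm -f /mastodon/tmp/pids/server.pid; bundle exec rails s -p 3000"
    ports: ['127.0.0.1:3000:3000']
  streaming:
    image: tootsuite/mastodon:v4.0.2
    command: node ./streaming
    ports: ['127.0.0.1:4000:4000']
  sidekiq:
    image: tootsuite/mastodon:v4.0.2
    command: bundle exec sidekiq
```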
Each worker has threads dedicated to handling the frontend web traffic (Mastodon Web & Mastodon Streaming), as well as processing some of the backend load (Sidekiq).
There are usually three worker nodes running. This allows at least one to be down for maintenance, without impacting user traffic. They are regularly reimaged via Digital Ocean tools, although not automatically.
What is Stunnel for?
Mastodon (specifically the Sidekiq component) cannot currently speak native TLS to Redis, meaning all of the traffic is over plaintext. While this isn't a deal breaker as the communication is happening over a private network, it's not ideal. Additionally, I wanted to use Digital Ocean's managed Redis offering instead of being responsible for this component myself. Digital Ocean does not permit you to disable TLS.
Stunnel creates a secure tunnel for the connection from Mastodon/Sidekiq to Redis, sidestepping Mastodon's lack of TLS support. Mastodon actually thinks it's talking to a Redis instance running on localhost.
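A client-mode stunnel config for this is only a few lines. Something like this sketch, where the connect endpoint is a placeholder for the managed Redis hostname and TLS port:

```
; stunnel client config: local plaintext in, TLS out to managed Redis
[redis]
client = yes
accept = 127.0.0.1:6379
connect = managed-redis.example.db.ondigitalocean.com:25061
```

Mastodon's `.env.production` then just points `REDIS_HOST` at `127.0.0.1` as if Redis were local.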
What does the Object Store do?
This is where all of the images, videos, and other media that get uploaded are stored. It also caches the media of federated instances that you interact with. There is a CDN (Content Delivery Network) managed by Digital Ocean that brings these large files closer to your location when you access them. That ability is further enhanced by Cloudflare.
What does Elastic Search provide?
When you search for content within the instance, Elasticsearch is used to index the content of your posts, and other posts you interact with, so that you don't have to go hunting for them later. Without Elasticsearch running you'd only be able to search by hashtags. Not all Mastodon instances have this available.
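If you're standing this up yourself, once Elasticsearch is reachable the index gets built with tootctl. The host and port here are the Elasticsearch defaults; adjust to taste:

```shell
# Sanity-check that the cluster is up, then build the full-text index
curl -s http://localhost:9200/_cluster/health?pretty
RAILS_ENV=production bin/tootctl search deploy
```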
What is pgBouncer?
pgBouncer pools the connections from the various worker/exec nodes (their Sidekiq and Puma web threads) to the PostgreSQL database. This provides more flexibility in scaling up and managing connection counts. It effectively acts like a reverse load balancer for the PostgreSQL database.
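Digital Ocean manages pgBouncer for us, but a self-hosted equivalent is a short ini file. A sketch with illustrative values:

```ini
; [databases] maps a logical name to the real PostgreSQL backend
[databases]
mastodon_production = host=10.0.0.5 port=5432 dbname=mastodon_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction    ; hand connections back after each transaction
default_pool_size = 20
max_client_conn = 400
```

One caveat: with transaction pooling, Mastodon needs `PREPARED_STATEMENTS=false` in its environment, since prepared statements don't survive connection reuse.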
Are you done?
Never. As we find better ways to secure, scale, or provide resiliency, we'll iterate. Even since launching last week, we've changed a few things around, like using Stunnel for connections to the managed Redis database, and added Telegraf and InfluxDB for better telemetry of the infrastructure.
Honestly just making sure my site still works 🤭
Considering the news of the day, maybe I'll just blog more.