Web dev in DC http://ross.karchner.com
796 stories · 16 followers

Linux Journal is Back

2 Comments and 3 Shares
Linux Journal

As of today, Linux Journal is back, and operating under the ownership of Slashdot Media.

As Linux enthusiasts and long-time fans of Linux Journal, we were disappointed to hear about Linux Journal closing its doors last year. It took some time, but fortunately we were able to get a deal done that allows us to keep Linux Journal alive now and indefinitely. It's important that amazing resources like Linux Journal never disappear.

We will begin publishing digital content again as soon as we can. If you're a former Linux Journal contributor or a Linux enthusiast who would like to get involved, please contact us and let us know the capacity in which you'd like to contribute. We're looking for people to cover Linux news, create Linux guides, and moderate the community and comments. We'd also appreciate any other ideas or feedback you might have. Right now, we don't have any immediate plans to resurrect the subscription/issue model, and will be publishing exclusively on LinuxJournal.com free of charge. Our immediate goal is to familiarize ourselves with the Linux Journal website and ensure it doesn't ever get shut down again.

Many of you are probably already aware of Slashdot Media, but for those who aren't, we own and operate Slashdot and SourceForge: two iconic open source software and technology websites that have been around for decades. We didn't always own SourceForge; we acquired it in 2016, immediately began improving it, and have since come a long way in restoring and growing one of the most important resources in open source. We'd like to do the same here. We're ecstatic to be able to take the helm at Linux Journal and ensure that this legendary Linux resource and community not only stays alive forever, but continues to grow and improve.

Reach out if you'd like to get involved!

Read the whole story · shared by rosskarchner (DC-ish), 3 days ago
2 public comments
fxer (Bend, Oregon), 3 days ago: They definitely cleaned up the dumpster fire of SourceForge, but that particular horse had already bolted and much better offerings were available. With Linux Journal they can probably still save the patient.
JayM (Atlanta, GA), 3 days ago: Woot!
lousyd, 3 days ago: Indeed!

On the use of a life

1 Share
In a recent discussion on Hacker News, a commenter posted the following question:
Okay, so, what do we think about TarSnap? Dude was obviously a genius, and spent his time on backups instead of solving millennium problems. I say that with the greatest respect. Is this entrepreneurship thing a trap?
I considered replying in the thread, but I think it deserves an in-depth answer — and one which will be seen by more people than would notice a reply in the middle of a 100+ comment thread.

Read the whole story · shared by rosskarchner (DC-ish), 3 days ago

“But it works”

1 Share

TL;DR – this is not nearly good enough in most cases, and it’s only a small fraction of what you are paid for.

I want this post to be the canonical place to point people who say “but it works”, because those of us who explain why this is not OK are tired of repeating the same arguments, me included.

You are paid for …

The following is not an exhaustive list, but it should give you some perspective beyond the narrow-minded “but it works”.

Typically, your Software Engineering $Job pays you for:

  1. Of course the thing must work. But also..
  2. It should continue working
    1. Gives deprecation warnings? Probably not good.
    2. Only runs on Node.js v10 LTS which is end of life in less than a year (as of writing)? Think again.
    3. Got away with invalid XML? Can you be sure that the next version of parser won’t be stricter?
  3. It should be maintainable (aka you and other people should find it easy to operate and modify, now and years later)
    1. Code quality
    2. Tests (if you don’t have tests, even your basic claim that something “works” is under suspicion; see the sketch after this list)
    3. Documentation
      1. How to use your sh*t?
      2. How to set up the development environment?
      3. Decisions
      4. Non-obvious code parts
  4. It should be production ready, not abstract “works” or even worse “works on my machine”
    1. Logs
    2. Metrics
    3. Tested in dev/qa/whatever-you-call it environment
    4. Reproducible – tomorrow they make a new environment, “qa42”, in a different AWS account in a different region. Could somebody else deploy your sh*t there without talking to you?
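
To make the tests and logs items concrete, here is a minimal, hypothetical sketch in Python; the function, its name and the numbers are illustrative, not taken from this post. The point is that a test (3.2) and a production log line (4.1) turn “it works” into a claim someone else can check.

```python
# Hypothetical example: a tiny function plus the logging and tests that back up
# the claim that it "works". Names and numbers are illustrative only.
import logging

logger = logging.getLogger("billing")


def monthly_price_cents(annual_price_cents: int) -> int:
    """Split an annual price into a monthly price, rounding up so cents are never lost."""
    monthly = -(-annual_price_cents // 12)  # ceiling division
    # A production-facing log line (point 4.1), not a stray print statement.
    logger.info("computed monthly price: annual=%d monthly=%d", annual_price_cents, monthly)
    return monthly


# Tests (point 3.2) pin "it works" to concrete, repeatable cases.
def test_divides_evenly():
    assert monthly_price_cents(1200) == 100


def test_rounds_up_rather_than_undercharging():
    assert monthly_price_cents(1000) == 84  # 1000 / 12 = 83.33..., rounded up


# Run with: pytest this_file.py
```

None of this makes the code cleverer; it makes it checkable by someone other than its author, which is most of the gap between a prototype and a deliverable.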

If you claim that you are “done” because “it works”, congratulations, you have a (probably) working prototype. That’s typically a small part of a project.


Related term: “Tactical Tornado” – look it up.



Read the whole story · shared by rosskarchner (DC-ish), 13 days ago

The smell of boiling frog

2 Shares

I just got this email today:

Which tells me, from a sample of one (after another, after another) that Zoom is to video conferencing in 2020 what Microsoft Windows was to personal computing in 1999. Back then one business after another said they would only work with Windows and what was left of DOS: Microsoft’s two operating systems for PCs.

What saved the personal computing world from being absorbed into Microsoft was the Internet—and the Web, running on the Internet. The Internet, based on a profoundly generative protocol, supported all kinds of hardware and software at an infinitude of end points. And the Web, based on an equally generative protocol, manifested on browsers that ran on Mac and Linux computers, as well as Windows ones.

But video conferencing is different. Yes, all the popular video conferencing systems run in apps that work on multiple operating systems, and on the two main mobile device OSes as well. And yes, they are substitutable. You don’t have to use Zoom (unless, as in my case, talking to my doctors requires it). There’s still Skype, Webex, Microsoft Teams, Google Hangouts and the rest.

But all of them have a critical dependency through their codecs. Those are the ways they code and decode audio and video. While there are some open source codecs, all the systems I just named use proprietary (patent-based) codecs. The big winner among those is H.264, aka AVC-1, which Wikipedia says “is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019.” Also,

H.264 is perhaps best known as being the most commonly used video encoding format on Blu-ray Discs. It is also widely used by streaming Internet sources, such as videos from Netflix, Hulu, Prime Video, Vimeo, YouTube, and the iTunes Store, Web software such as the Adobe Flash Player and Microsoft Silverlight, and also various HDTV broadcasts over terrestrial (ATSC, ISDB-T, DVB-T or DVB-T2), cable (DVB-C), and satellite (DVB-S and DVB-S2) systems.

H.264 is protected by patents owned by various parties. A license covering most (but not all) patents essential to H.264 is administered by a patent pool administered by MPEG LA.[9]

The commercial use of patented H.264 technologies requires the payment of royalties to MPEG LA and other patent owners. MPEG LA has allowed the free use of H.264 technologies for streaming Internet video that is free to end users, and Cisco Systems pays royalties to MPEG LA on behalf of the users of binaries for its open source H.264 encoder.

This is generative, clearly, but not as generative as the Internet and the Web, which are both end-to-end by design.

More importantly, AVC-1 in effect slides the Internet and the Web into the orbit of companies that have taken over what used to be telephony and television, which are now mooshed together. In the Columbia Doctors example, Zoom is the new PBX. The new classroom is every teacher and kid on her or his own rectangle, “zooming” with each other through the new telephony. The new TV is Netflix, Disney, Comcast, Spectrum, Apple, Amazon and many others, all competing for wedges of our Internet access and entertainment budgets.

In this new ecosystem, you are less the producer than you were, or would have been, in the early days of the Net and the Web. You are the end user, the consumer, the audience, the customer. Not the producer, the performer. Sure, you can audition for those roles, and play them on YouTube and TikTok, but those are somebody else’s walled gardens. You operate within them at their grace. You are not truly free.

And maybe none of us ever were, in those early days of the Net and the Web. But it sure seemed that way. And it does seem that we have lost something.

Or maybe just that we are slowly losing it, in the manner of boiling frogs.

Do we have to? I mean, it’s still early.

The digital world is how old? Decades, at most.

And how long will it last? At the very least, more than that. Centuries or millennia, probably.

So there’s hope.

[Later…] For some of that, dig OBS (Open Broadcaster Software’s OBS Studio): free and open source software for video recording and live streaming. HT: Joel Grossman (@jgro).

Also, though unrelated, why is Columbia Doctors’ Telehealth leaking patient data to advertisers? See here.

Read the whole story · shared by rosskarchner (DC-ish), 14 days ago
tingham, 14 days ago: Hard no.

Inside a CODE RED: Network Edition

1 Share

I wanted to follow up on Jeremy’s post about our recent outages with a deeper, more personal look behind the scenes. We call our major incident response efforts “CODE REDs” to signify that they are all-hands-on-deck events, and this one definitely qualified. I want to go beyond the summary and help you see how an event like this unfolds over time. This post is meant both for people who want a deeper, technical understanding of the outage and for those who want some insight into the human side of incident management at Basecamp.

The Prologue

The seeds of our issues this week were planted a few months ago. Two unrelated events started the ball rolling. The first event was a change in our networking providers. We have redundant metro links between our primary datacenter in Ashburn, VA and our other DC in Chicago, IL. Our prior vendor had been acquired and the new owner wanted us to change our service over to their standard offering. We used this opportunity to resurvey the market and decided to make a change. We ran the new provider alongside the other for several weeks. Then, we switched over entirely in late June.

The second event occurred around this same time when a security researcher notified us of a vulnerability. We quickly found a workaround for the issue by setting rules on our load balancers. These customizations felt sub-optimal and somewhat brittle. With some further digging, we discovered a new version of load balancer firmware that had specific support for eliminating the vulnerability and we decided to do a firmware upgrade. We first upgraded our Chicago site and ran the new version for a few weeks. After seeing no issues, we updated our Ashburn site one month ago. We validated the vulnerability was fixed and things looked good.

Incident #1

Our first incident began on Friday, August 28th at 11:59AM CDT. We received a flood of alerts from PagerDuty, Nagios and Prometheus. The Ops team quickly convened on our coordination call line. Monitoring showed we lost our newer metro link for about 20-30 seconds. Slow BC3 response times continued despite the return of the network. We then noticed chats and pings were not working at all. Chat reconnections were overloading our network and slowing all of BC3. Since the problem was clearly related to chat, we restarted the Cable service. This didn’t resolve the connection issues. We then opted to turn chat off at the load balancer layer. Our goal was to make sure the rest of BC3 stabilized. The other services did settle as hoped. We restarted Cable again with no effect. Finally, as the noise died down, we noticed a stubborn alert for a single Redis DB instance.

Initially, we overlooked this warning because the DB was not down. We probed it from the command line and it still responded. We kept looking and finally discovered replication errors on a standby server and saw the replica was stuck in a resynchronization loop. The loop kept stealing resources and slowing the primary node. Redis wasn’t down, but it was so slow that it was only responding to monitoring checks. We restarted Redis on the replica and saw immediate improvement. BC3 soon returned to normal. Our issue was not a novel Redis problem but it was new to us. You can find much more detail here.
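
The post links out for the full detail and doesn’t name the exact settings here, so the following is only an illustrative sketch of the kind of configuration change usually documented for this failure mode (a full resync repeatedly overflowing the primary’s replica output buffer, so the primary drops the replica and the loop restarts). The host name and limits below are assumptions, not Basecamp’s actual values.

```python
# Illustrative sketch only (assumed host and limits): the commonly documented
# workaround for a replica stuck in a full-resync loop is to give the primary's
# replica output buffer more headroom and to enlarge the replication backlog so
# a briefly disconnected replica can do a partial resync instead of a full one.
import redis

primary = redis.Redis(host="redis-primary.internal", port=6379)  # hypothetical host

# Raise the replica output buffer limit: hard limit 1 GB, soft limit 512 MB for 120 s.
# (Newer Redis versions also accept "replica" in place of "slave" here.)
primary.config_set("client-output-buffer-limit", "slave 1073741824 536870912 120")

# A larger replication backlog (256 MB here) makes partial resyncs more likely.
primary.config_set("repl-backlog-size", 256 * 1024 * 1024)

# Persist the change so it survives a restart (requires a writable config file).
primary.config_rewrite()
```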

The Postmortem

The big question lingering afterward was “how can a 30 second loss of connectivity on a single redundant networking link take down BC3?” It was clear that the replication problem caused the pain. But, it seemed out of character that dropping one of two links would trigger this kind of Redis failure. As we went through logs following the incident, we were able to see that BOTH of our metro links had dropped for short periods. We reached out to our providers in search of an explanation. Early feedback pointed to some sub-optimal BGP configuration settings. But, this didn’t fully explain the loss of both circuits. We kept digging.

This seems as good a time as any for the confessional part of the story. Public postmortems can be challenging because not all of the explanations look great for the people involved. Sometimes, human error contributes to service outages. In this case, my own errors in judgement and lack of focus came into play. You may recall we tripped across a known Redis issue with a documented workaround. I created a todo for us to make those configuration changes to our Redis servers. The incident happened on a Friday when all but 2 Ops team members were off for the day. Mondays are always a busy, kick-off-the-week kind of day, and it was also when I started my oncall rotation. I failed to make sure that config change was clearly assigned or finished with the sense of urgency it deserved. I’ve done this for long enough to know better. But, I missed it. As an Ops lead and active member of the team, every outage hurts. But this one is on me, and it hurts even more.

Incident #2

At 9:39AM on Tuesday, 9/01, the unimaginable happened. Clearly, it isn’t unimaginable and a repeat now seems inevitable. But, this was not our mindset on Tuesday morning. Both metro links dropped for about 30 seconds and Friday began to repeat itself. We can’t know if the Redis config changes would have saved us because they had not been made (you can be sure they are done now!). We recognized the problem immediately and sprang into action. We restarted the Redis replica and the Cable service. It looked like things were returning to normal 5 minutes after the network flap. Unfortunately, our quick response during peak load on a Tuesday had unintended consequences. We saw a “thundering herd” of chat reconnects hit our Ashburn DC and the load balancers couldn’t handle the volume. Our primary load balancer locked up under the load and the secondary tried to take over. The failover didn’t register with the downstream hosts in the DC and we were down in our primary DC. This meant BC3, BC2, basecamp.com, Launchpad and supporting services were all inaccessible. We attempted to turn off network connections into Ashburn, but our chat ops server was impacted and we had to manually reconfigure the routers to disable anycast. Managing a problem at peak traffic on a Tuesday is much different from managing one on a Friday.
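
As an aside for readers new to the term: the “thundering herd” here is every dropped chat client trying to reconnect at essentially the same instant, so the reconnect wave itself becomes the overload. The post doesn’t describe Basecamp’s client code, but the standard client-side defense is exponential backoff with jitter; a minimal, hypothetical sketch:

```python
# Illustrative only: this is not Basecamp's actual client code, just the usual
# way chat clients avoid reconnecting as one synchronized spike.
import random
import time


def reconnect_with_jittered_backoff(connect, max_attempts: int = 8) -> bool:
    """Retry `connect` with exponential backoff plus random jitter.

    Without the jitter, every client dropped at the same moment retries at the
    same moment too, and the reconnect wave hits the load balancers all at once.
    """
    for attempt in range(max_attempts):
        try:
            connect()
            return True
        except OSError:
            # Cap the exponential delay, then pick a random point inside it so
            # clients spread themselves out instead of stampeding together.
            delay = min(60, 2 ** attempt)
            time.sleep(random.uniform(0, delay))
    return False
```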

We begin moving all of our services to our secondary DC in Chicago. We move BC3 completely. While preparing to move BC2 and Launchpad, we apply the manual router changes and the network in Ashburn settles. We decide to stop all service movement and focus on stability for the rest of the day. That night, after traffic dies down, we move all of our services back to their normal operating locations.

One new piece of the puzzle drops into place. The second round of network drops allowed our providers to watch in real time as events unfolded. We learn that both of our metro links share a physical path in Pennsylvania, and that this path had been affected by a fiber cut: a single fiber cut in the middle of Pennsylvania could still take both of our links down unexpectedly. This was as much a surprise to us as it was to our providers. At least we could now make concrete plans to remove this new problem from our environment.

Incident #3

We rotate oncall shifts across the Ops team. As 2020 would have it, this was my week. After a late night of maintenances, I hoped for a slow Wednesday morning. At 6:55AM CDT on 9/2, PagerDuty informed me of a different plan. Things were returning to normal by the time I got set up. We could see our primary load balancer had crashed and failed over to the secondary unit. This caused about 2 minutes of downtime across most of our Basecamp services. Thankfully, the failover went smoothly. We immediately ship the core dump file to our load balancer vendor and start combing logs for signs of unusual traffic. This felt the same as Incident #2, but the metrics were all different. While there had been a rise in CPU on the load balancers, it was nowhere near the 100% utilization of the day before. We wondered about Cable traffic – mostly because of the recent issues. There was no sign of a network flap. We looked for evidence of a bad load balancer device or other network problem. Nothing stood out.

At 10:49AM, PagerDuty reared its head again. We suffered a second load balancer failover. Now we are back at peak traffic and the ARP synchronization on downstream devices fails. We are hard down for all of our Ashburn-based services. We decide to disable anycast for BC3 in Ashburn and run only from Chicago. This is again a manual change that is hampered by high load, but it does stabilize our services. We send the new core file off to our vendor and start parallel work streams to get us to some place of comfort.

These separate threads spawn immediately. I stay in the middle, coordinating between them while updating the rest of the company on status. Ideas come from all directions and we quickly prioritize efforts across the Ops team. We escalate crash analysis with our load balancer vendor. We consider moving everything out of Ashburn. We expedite orders for upgraded load balancers. We prep our onsite remote hands team for action. We start spinning up virtual load balancers in AWS. We dig through logs and problem reports looking for any sign of a smoking gun. Nothing emerges … for hours.

Getting through the “waiting place” is hard. On the one hand, systems were pretty stable. On the other hand, we had been hit hard with outages for multiple days and our confidence was wrecked. There is a huge bias to want to “do something” in these moments. There was a strong pull to move out of Ashburn to Chicago. Yet, we have the same load balancers with the same firmware in Chicago. While Chicago has been stable, what if it is only because it hasn’t seen the same load? We could put new load balancers in the cloud! We’ve never done that before, and while we know what problem that might fix – what other problems might it create? We wanted to move the BC3 backend to Chicago – but this process guaranteed a few minutes of customer disruption when everyone was on shaky ground. We call our load balancer vendor every hour asking for answers. Our supplier tells us we won’t get new gear for a week. Everything feels like a growing list of bad options. Ultimately, we opt to prioritize customer stability. We prepare lots of contingencies and rules for when to invoke them. Mostly, we wait. It seemed like days.

By now, you know that our load balancer vendor confirms a bug in our firmware. There is a workaround that we can apply through a standard maintenance process. This unleashes a wave of conflicted feelings. I feel huge relief that we have a conclusive explanation that doesn’t require days of nursing our systems, alongside massive frustration over a firmware bug that shows up twice in one day after weeks of running smoothly. We set the emotions aside and plan out the remaining tasks. Our services remain stable during the day. That evening, we apply all our changes and move everything back to its normal operating mode. After some prodding, our supplier manages to air-ship our new load balancers to Ashburn. Movement feels good. The waiting is the hardest part.

The Aftermath

TL;DR: Multiple problems can chain into several painful, embarrassing incidents in a matter of days. I use those words to truly express how this feels. These events are now understandable and explainable. Some aspects were arguably outside of our control. I still feel pain and embarrassment. But we move forward. As I write this, the workarounds appear to be working as expected. Our new load balancers are being racked in Ashburn. We proved our primary metro link can go down without issues, since the vendor performed maintenance on the problematic fiber just last night. We are prepping tools and processes for handling new operations. Hopefully, we are on a path to regain your trust.

We have learned a great deal and have much work ahead of us. A couple of things stand out. While we have planned redundancy into our deployments and improved our live testing over the past year, we haven’t done enough and have had a false sense of security around that – particularly when running at peak loads. We are going to get much more confidence in our failover systems and start proving them in production at peak load. We have some known disruptive failover processes that we hope to never use and will not run during the middle of your day. But, shifting load across DCs or moving between redundant networking links should happen without issue. If that doesn’t work, I would rather know in a controlled environment with a full team at the ready. We also need to raise our sense of urgency for rapid follow up on outage issues. That doesn’t mean we just add them to our list. We need to explicitly clear room for post-incident action. I will clarify the priorities and explicitly push out other work.

I could go on about our shortcomings. However, I want to take time to highlight what went right. First off, my colleagues at Basecamp are truly amazing. The entire company felt tremendous pressure from this series of events. But, no one cracked. Calmness is my strongest recollection from all of the long calls and discussions. There were plenty of piercing questions and uncomfortable discussions, don’t get me wrong. The mood, however, remained a focused, respectful search for the best path forward. This is the benefit of working with exceptional people in an exceptional culture. Our redundancy setup did not prevent these outages, but it did give us lots of room to maneuver. Multiple DCs, a cloud presence and networking options allowed us to use and explore lots of recovery options in a scenario we had not seen before. You might have noticed that HEY was not impacted this week. If you thought that is because it runs in the cloud, you are not entirely correct: our outbound mail servers run in our DCs, so no mail actually sends from the cloud. Our redundant infrastructure isolated HEY from any of these Basecamp problems. We will keep adapting and working to improve our infrastructure. There are more gaps than I would like. But, we have a strong base.

If you’ve stuck around to the end, you are likely a longtime Basecamp customer or perhaps a fellow traveler in the operations realm. For our customers, I just want to say again how sorry I am that we were not able to provide the level of service you expect and deserve. I remain committed to making sure we get back to the standard we uphold. For fellow ops travelers, you should know that others struggle with the challenges of keeping complex systems stable and wrestle with feelings of failure and frustration. When I said there was no blaming going on during the incident, that isn’t entirely true. There was a pretty serious self-blame storm going on in my head. I don’t write this level of personal detail as an excuse or to ask for sympathy. Instead, I want people to understand that humans run Internet services. If you happen to be in that business, know that we have all been there. I have developed a lot of tools to help manage my own mental health while working through service disruptions. I could probably write an entire post on that topic. In the meantime, I want to make it clear that I am available to listen and help anyone in the business who struggles with this. We all get better by being open and transparent about how this works.

Read the whole story · shared by rosskarchner (DC-ish), 19 days ago

Orthographic media

1 Share
A view without perspective.
Read the whole story · shared by rosskarchner (DC-ish), 28 days ago