Building a Scalable Document Pre-Processing Pipeline

This article was originally posted on the Amazon Web Services Architecture blog.

In a recent customer engagement, Quantiphi, Inc., a member of the Amazon Web Services Partner Network, built a solution capable of pre-processing tens of millions of PDF documents before sending them for inference by a machine learning (ML) model. While the customer's use case--and hence the ML model--was very specific to their needs, the pipeline that does the pre-processing of documents is reusable for a wide array of document processing workloads. This post will walk you through the pre-processing pipeline architecture.

Read more →

AWS VPC Traffic Mirroring Walkthrough

I was recently playing around with the Traffic Mirroring feature in AWS. As a network geek, this is right up my alley because, as some colleagues and I used to say, "the wire never lies!" Being able to pick packets off the wire for detailed inspection has saved the day many a time. Until Traffic Mirroring came along, it wasn't possible to do that in an Amazon VPC. Below are my notes and considerations for using this feature.

Read more →

Replicating Elastic File System With AWS DataSync

I recently used AWS DataSync as part of a lab I was building. These are my notes for using DataSync to replicate an Amazon Elastic File System (EFS) share from one region to another.

AWS DataSync is a managed service that enables replication of data between AWS services and from on-prem to AWS. It automates the scheduling of transfer activities, validates copied data, and uses a purpose-built network protocol and multi-threaded architecture to achieve very high efficiency on the wire.

The use case I needed to tackle was replicating an Amazon EFS share in one region to an EFS share in a different region (a one-way replication). DataSync can also connect to Amazon S3 and Amazon FSx for Windows File Server.

Read more →

Multicast Routing in AWS

Consider for a moment that you have an application running on a server that needs to push some data out to multiple consumers and that every consumer needs the same copy of the data at the same time. The canonical example is live video. Live audio and stock market data are also common examples. At the re:Invent conference in 2019, AWS announced support for multicast routing in Amazon Virtual Private Cloud (VPC). This blog post will provide a walkthrough of configuring and verifying multicast routing in a VPC.

Read more →

AWS ABCs: Granting A Third-Party Access to Your Account

There can be times when you're working on the AWS Cloud and need to grant a third party limited access to your account. For example:

  • A contractor or a specialist needs to perform some work on your behalf
  • You're having AWS Professional Services or a partner from the AWS Partner Network do some work in your account
  • You're conducting a pilot with AWS and you want your friendly neighborhood Solutions Architect to review something

In each of these cases you likely want to grant the permissions the third party needs but no more. In other words, no attaching the AdministratorAccess policy just because it's easy and works. Instead, adhere to the principle of least privilege.

This post will describe two methods, IAM users and IAM roles, for providing limited access to third parties.
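
To make the IAM role method a little more concrete, here is a minimal sketch of the trust policy such a role might carry. It is one common shape, not necessarily the one the full post uses. The function name, the account ID `111122223333`, and the external ID value are placeholders made up for illustration. The permissions the third party actually receives would live in a separate, narrowly scoped policy attached to the role.

```python
import json


def third_party_trust_policy(third_party_account_id: str, external_id: str) -> dict:
    """Build a trust policy that lets one external account assume a role.

    Hypothetical helper for illustration. The external ID condition is a
    common safeguard when a third party assumes a role in your account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Trust the third party's account, not an individual user.
                "Principal": {"AWS": f"arn:aws:iam::{third_party_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Only honor requests that present the agreed external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values, assumed for this example only.
    policy = third_party_trust_policy("111122223333", "example-external-id")
    print(json.dumps(policy, indent=2))
```

The trust policy only controls who may assume the role. Least privilege comes from keeping the role's attached permissions policy as small as the task allows.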

Read more →

3 Tools for Getting VMs From Your Datacenter to the AWS Cloud

Here's a simple scenario: you have some Virtual Machines (VMs) in your on-premises environment, likely in VMware vSphere or Microsoft Hyper-V. You want to either fully migrate some or all of those VMs to the AWS Cloud or you want to copy a gold image to the AWS Cloud so you can launch compute instances from that image. Simple enough.

Now, how do you do it?

Can you just export an OVA of the VM, copy it up, and then boot it? Can you somehow import the VMDK files that hold the VM's virtual drive contents? Regardless of the eventual method, how do you do it at scale for dozens or hundreds of VMs? And lastly, how do you orchestrate the process so that VMs belonging to an application stack are brought over together, as a unit?

Read more →

9 Things to Consider When Estimating Time

Often in my career I have to make an estimate about the so-called "level of effort" (LoE) to do a thing.

  • What's the LoE for me to do a demo for this customer?
  • What's the LoE for me to help respond to this RFP?
  • What's the LoE for me to participate in this conference?

The critical metric by which I usually have to measure the LoE is time. People, equipment, venue, materials, and location are rarely a limiting factor. Time is always the limiting factor because no matter the circumstance, you can't just go and get more of it. The other factors are often elastic and can be obtained.

And oh how I suck at estimating time.

As soon as the question comes up, "What's the LoE for...", I immediately start to think, ok, if I am doing the work, I can do this piece and that piece, I can read up on this thing and get it done with slightly more time invested, and then yada, yada, yada... it's done!

What I don't account for is the human element. The unexpected. The fact that we're all different and team members will go about their work in their own way. In other words, the soft, non-technical aspects of doing the thing.

Along these lines, here are 9 things that I would be wise to consider when making time estimates in the future.

Read more →

Five Functional Facts About AWS Service Control Policies

Following on the heels of my previous post, Five Functional Facts about AWS Identity and Access Management, I wanted to dive into a separate, yet related way of enforcing access policies in AWS: Service Control Policies (SCPs).

SCPs and IAM policies look very similar—both being JSON documents with the same sort of syntax—and it would be easy to mistake one for the other. However, they are used in different contexts and for different purposes. In this post, I'll explain the context where SCPs are used and why they are used (and even why you'd use SCPs and IAM policies together).
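
As a rough illustration of why the two are used together, the sketch below models the relationship in Python: an SCP never grants anything by itself, it only caps what IAM policies in the account can grant, so an action has to pass both. This is a deliberately simplified model written for this summary. The function names and both example policies are made up, and it ignores `Resource`, `Condition`, `NotAction`, and the rest of the real evaluation logic.

```python
from fnmatch import fnmatchcase


def allows(policy: dict, action: str) -> bool:
    """Simplified check: an explicit Deny wins, otherwise any matching Allow."""

    def matches(statement: dict) -> bool:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        # Action names are compared case-insensitively, with * and ? wildcards.
        return any(fnmatchcase(action.lower(), a.lower()) for a in actions)

    statements = policy["Statement"]
    if any(s["Effect"] == "Deny" and matches(s) for s in statements):
        return False
    return any(s["Effect"] == "Allow" and matches(s) for s in statements)


def effectively_allowed(scp: dict, iam_policy: dict, action: str) -> bool:
    """An action must be allowed by the SCP and by the IAM policy."""
    return allows(scp, action) and allows(iam_policy, action)


# Hypothetical SCP: allow everything except leaving the organization.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Deny", "Action": "organizations:LeaveOrganization", "Resource": "*"},
    ],
}

# Hypothetical IAM policy attached to a user or role in the account.
iam_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:*", "organizations:LeaveOrganization"],
            "Resource": "*",
        }
    ],
}

if __name__ == "__main__":
    print(effectively_allowed(scp, iam_policy, "s3:GetObject"))                     # True
    print(effectively_allowed(scp, iam_policy, "organizations:LeaveOrganization"))  # False
    print(effectively_allowed(scp, iam_policy, "ec2:RunInstances"))                 # False
```

The second result is the guardrail at work: the IAM policy asks for the action, but the SCP denies it. The third shows the other direction: the SCP permits the action, yet nothing is granted until an IAM policy allows it.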

Read on, dear reader!

Read more →

Amazon CloudFront with WordPress as Infrastructure as Code

There are roughly a GAJILLION articles, blogs, and documents out there that explain how to set up Amazon CloudFront to work with WordPress.

Most of them are wrong in one or more ways.

Read more →

Five Functional Facts about AWS Identity and Access Management

This post is part of an open-ended series I'm writing where I take a specific protocol, app, or whatever-I-feel-like and focus on five functional aspects of that thing in order to expose some of how that thing really works.

The topic in this post is the AWS Identity and Access Management (IAM) service. The IAM service holds a unique position within AWS: it doesn't get the attention that the machine learning or AI services get, and it doesn't come to mind when buzzwords like "serverless" or "containers" are brought up, yet it's used by, or should be used by, every single AWS customer (and if you're not using it, you're not following best practice, tsk, tsk). It's worthwhile to take the time to really get to know this service.

Let's begin!

Read more →