Don’t Get Tangled in Your Database Constraints

Database constraints are a great tool for preventing bad data. Here’s how to use them without causing frustration for yourself or your users.

In the last post on database constraints, we made a case that database constraints—unique indexes, foreign keys and more advanced tools like exclusion constraints—fill a necessary role that application-level validations can’t.

For example, when two users send simultaneous requests to claim the same username or rent the same property on a holiday weekend, application code alone can’t prevent conflicting data from being inserted. Only the database can guarantee consistent data, and constraints are the simplest way to have it do so.

But if they’re misused, constraints can cause a lot of frustration. For best results, keep these guidelines in mind:

1. Constrain For Certainty

  • If users should never be missing an email address, make the column NOT NULL to guarantee it.
  • If users should never have duplicate email addresses, add a unique index to guarantee it.
  • If product prices should never be negative, add a CHECK constraint to guarantee it.
  • If reservations should never overlap, add an EXCLUDE constraint to guarantee it.
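
In PostgreSQL, those guarantees look something like this (a sketch; the table, column, and constraint names are assumed, not taken from a real schema):

ALTER TABLE users ALTER COLUMN email SET NOT NULL;

CREATE UNIQUE INDEX index_users_on_email ON users (email);

ALTER TABLE products
  ADD CONSTRAINT nonnegative_price CHECK (price >= 0);

-- the scalar = comparison here requires the btree_gist extension
ALTER TABLE reservations
  ADD CONSTRAINT no_overlaps
  EXCLUDE USING gist (
    property_id WITH =,
    daterange(start_date, end_date, '[]') WITH &&
  );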

These guarantees greatly reduce the number of cases your code has to handle and the number of bugs you could create. And in cases like uniqueness or overlap, they protect you from race conditions that your application code can’t prevent.

2. Choose Ironclad Rules

Once in place, constraints are, by design, hard to bypass. If you set a column to NOT NULL and find you need to insert a NULL value, you’ll have to change the schema first.

So try to distinguish “ironclad” rules from context-dependent ones. “Two users can’t rent the same cabin simultaneously” is an ironclad rule, whereas “we need 3 hours between check out and check in for cleaning” may not apply in special cases, or may change based on staffing.

If you think a rule won’t always apply, it is best to keep it out of the database.

3. Consider the Consequences Of Being Wrong

What if you’re unsure whether to add a particular constraint? There are some tradeoffs to consider.

On one hand, it’s always possible to remove constraints, but it may be impossible to add them if you’ve allowed messy data in your system. That suggests you should err on the side of over-constraining your data.

On the other hand, while a missing constraint may force you to do cleanup work, an ill-considered one may prevent users from doing something reasonable, like omitting a “last name” because they actually don’t have one. That suggests you should err on the side of under-constraining (and under-validating) your data.

In the end, you have to decide what’s likely to cause the fewest, mildest problems for the specific data in your application.

Constraints and Your Code

OK, let’s say you have some use cases for constraints. Depending on your language and tools, this may present some challenges in your codebase.

Suppose we run a concert hall, and we have an application for managing events. Every event must have a unique name, and no two events may overlap dates. We’ll check both of these conditions with validations, and enforce them with database constraints.

Let’s compare the challenges when using Ruby’s Active Record and Elixir’s Ecto.

Active Record Challenge: Handle Constraint Errors Gracefully

Using ActiveRecord (v5.0), it’s straightforward to validate uniqueness: validates :name, uniqueness: true will do it. Checking that events don’t overlap requires a custom validation, but it’s not very hard (see our example code).
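
A minimal sketch of such a validation (the starts_on and ends_on column names are my assumption, not taken from the example app):

# in app/models/event.rb
validate :does_not_overlap_other_events

def does_not_overlap_other_events
  overlapping = Event.where.not(id: id)
                     .where("starts_on <= ? AND ends_on >= ?", ends_on, starts_on)
  errors.add(:base, "cannot overlap existing events") if overlapping.exists?
end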

Constraint violations, on the other hand, cause exceptions, and having to rescue multiple exceptions in a method gets ugly fast.

For example, you wouldn’t want to do this:

def create
  @event = Event.new(event_params)
  if @event.save
    redirect_to events_path
  else
    render :new
  end
rescue ActiveRecord::RecordNotUnique => e
  if e.cause.message.match('unique constraint "index_events_on_name"')
    @event.errors.add(:name, "has been taken")
    render :new
  else
    # not something we expected
    raise
  end
rescue ActiveRecord::StatementInvalid => e
  if e.cause.is_a?(PG::ExclusionViolation) && e.message.match("no_overlaps")
    @event.errors.add(:base, "cannot overlap existing events")
    render :new
  else
    # not something we expected
    raise
  end
end

That’s pretty ugly even for one controller action. Repeating all that in the update action would make it even worse.

The default path in Rails is “don’t worry about those exceptions”, and it’s not unreasonable. After all, if you’re validating uniqueness of name, the race-condition case where two users try to claim the same name nearly simultaneously should be rare. You could just return an HTTP 500 in that case and be done with it.

That’s especially true if there’s nothing the user could do to fix the error anyway, as Derek Prior has pointed out. For example, if your code encrypts users’ passwords before saving to the database, there’s no point validating the presence of encrypted_password or rescuing a NOT NULL exception if your code doesn’t set the field. You don’t need an error message for the user; you need an exception to alert you of buggy code.

But if you do decide to provide friendly user feedback for race-condition data conflicts, try to keep the controller code minimal and clear.

First, since a violated constraint can generate several different exceptions, we need a nice way to catch them all. Starr Horne blogged about an interesting technique: define a custom class for use in the rescue clause that knows which exceptions we care about.

For example, if we had an Event model, we could nest a ValidationRaceCondition class and override its === class method:

class Event::ValidationRaceCondition
  # returns true if this is something we should rescue
  def self.===(exception)
    return true if exception.is_a?(ActiveRecord::RecordNotUnique) && exception.cause.message.match('unique constraint "index_events_on_name"')
    return true if exception.cause.is_a?(PG::ExclusionViolation) && exception.message.match("no_overlaps")
    false
  end
end

There’s some ugly digging around in there, but at least it’s contained in one place.

We can then define Event#save_with_constraints, using Event::ValidationRaceCondition as a stand-in for “any of the errors we’d expect if one of our constraints were violated”:

# like a normal save, but also returns false if a constraint failed
def save_with_constraints
  save
rescue Event::ValidationRaceCondition
  # re-run validations to set a user-friendly error message for whatever the
  # validations missed the first time but the constraints caught
  valid?
  false
end

The rescue clause will catch only the constraint-related exceptions that Event::ValidationRaceCondition describes. At that point, we can re-run our validations and this time, they’ll see the conflicting data and set a helpful error message for the user.

With this all wrapped up in save_with_constraints, the controller code is as simple as usual:

def create
  @event = Event.new(event_params)
  if @event.save_with_constraints
    flash[:notice] = "Event created successfully"
    redirect_to events_path
  else
    flash[:error] = "There was a problem creating this event"
    render :new
  end
end

And there you have it! Our validations catch all input errors in one pass, our constraints ensure that we don’t allow last-second conflicts, and our users get friendly error messages if constraints are violated, all with minimal fuss. See my example Rails code if you want more details.

Ecto Challenge: Provide Good Feedback to Users

Elixir’s database library Ecto (v2.1) presents different challenges.

Unlike Active Record, Ecto makes constraints the easiest way to guard against conflicting data. For instance, it has built-in support for creating a unique index in its migrations.
Calling Ecto.Changeset.unique_constraint(changeset, :name) marks the changeset, signaling that if this constraint is violated, we want to parse the database error into a friendly user-facing message. Ecto has similar functions to work with check constraints and exclusion constraints. That’s great!
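
For example, a migration and changeset might look roughly like this (a sketch; the module and field names are assumed):

# in a migration
create unique_index(:events, [:name])

# in the Event schema module (assumes import Ecto.Changeset)
def changeset(event, params) do
  event
  |> cast(params, [:name, :starts_on, :ends_on])
  |> validate_required([:name])
  |> unique_constraint(:name)
end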

However, Ecto doesn’t provide a uniqueness validation; the documentation specifically says that Ecto validations “can be executed without a need to interact with the database”, which leaves all checks for conflicting data to be done exclusively by constraints.

This is unfortunate, because even if an INSERT would violate 3 constraints, PostgreSQL will only display an error message for the first one it notices. In the worst case, a user might have to submit a form once, fix the validation errors, submit again, fix the first constraint error and then submit two more times to fix the remaining constraint errors!

So although it runs counter to the documentation, I recommend that for the best user experience, you layer your Ecto validations and constraints. That is, let your validations check both for intrinsic errors like “email can’t be blank” and for conflicting-data errors like “username is taken”.

That’s because in the vast majority of cases, the conflicting data was inserted long before the current request, not milliseconds before. And catching conflicts with a validation lets you inform the user of all these conflicts at once.

To do that, you could use a custom Ecto validation like this:

# Assumes import Ecto.Query and import Ecto.Changeset, plus aliases for Repo and User
def validate_no_conflicting_usernames(%Ecto.Changeset{changes: %{username: username}} = changeset) when not is_nil(username) do
  dups_query = from e in User, where: e.username == ^username
  # For updates, don't flag user record as a dup of itself
  id = get_field(changeset, :id)
  dups_query = if is_nil(id) do
    dups_query
  else
    from e in dups_query, where: e.id != ^id
  end

  exists_query = from q in dups_query, select: true, limit: 1
  case Repo.one(exists_query) do
    true -> add_error(
      changeset, :username, "has already been taken", [validation: :validate_no_conflicting_usernames]
    )
    nil  -> changeset
  end
end

# If changeset has no username or a nil username, it isn't a conflict
def validate_no_conflicting_usernames(changeset), do: changeset

With this validation in place, users will get faster feedback than with a uniqueness constraint alone. See my example Phoenix code if you want more details.

Keep It Friendly, Keep It Clean

Whatever tools you’re using, the rule suggested by Derek Prior is a good one: use constraints to prevent bad data and validations to provide user feedback. Let each do the job it is best at doing.

And as always, strive to keep your code DRY and clear. Feel free to compare the Rails and Phoenix example apps from this post to explore further.

Guaranteed Consistency: The Case for Database Constraints

Capturing valid data—whether usernames, purchases, or reservations—is a key part of writing good web applications. Here’s how to make your database work for you.

If your data validation doesn’t involve your database, you’re asking for conflicts.

Many choices in programming are matters of opinion: which framework is better, which code is clearer, etc. But this one is different.

Application code alone cannot prevent conflicting data. To explain why, let’s start with a familiar example.

Here Come the Conflicts

Say you’re writing a web app where every user needs a unique username. How do you prevent duplicates?

If you’re using Active Record or a Rails-inspired ORM, you might write something like validates :username, uniqueness: true. But you probably know that’s not enough.

Your app probably has a way to process more than one request at a time, using multiple processes, threads, servers, or some combination of the three. If so, multiple simultaneous requests could claim the same username. You’ll have to add a unique index on the username column to guarantee there are no duplicates.
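
In PostgreSQL, that guarantee is a single statement (the index name is assumed):

CREATE UNIQUE INDEX index_users_on_username ON users (username);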

This is standard advice. But there’s an underlying principle that you may have missed: any validation that depends on what’s currently in the database is unreliable without a guarantee from the database itself.

Such database-dependent checks include:

  • Does any user have this username right now?
  • Does this blog post still exist for me to add a comment right now?
  • Does this user’s account have enough money to cover this purchase right now?
  • Does this rental property have an existing reservation whose date range overlaps the one I’m requesting right now?
  • Is there an employee with the job title “manager” assigned to this store right now?

In each of these cases, your application code can check what’s in the database before writing. But in each case, there’s a race condition: things may change between the time it checks and the time it writes. Other requests are being handled simultaneously, and your multiple application processes do not coordinate their work. Only the database is involved in every modification of data, so only it can prevent conflicts.

There are multiple ways you can enlist your database’s help here, but the simplest and most performant option is to use database constraints.

Why Transactions Aren’t Enough

Before I explain constraints, let’s look at some of the alternatives.

One alternative is to use database transactions. That’s what ActiveRecord is doing when I call user.save with a uniqueness validation in place, according to log/development.log:

 BEGIN
  SELECT 1 AS one
    FROM "users"
    WHERE "users"."username" = $1
    LIMIT $2  [["username", "doofenshmirtz"], ["LIMIT", 1]]
  -- application code checks the result of the select
  -- and decides to proceed
  INSERT INTO "users"
   ("username")
   VALUES ($1)
   RETURNING "id" [["username", "doofenshmirtz"]]
 COMMIT

A transaction guarantees that all its statements will be executed successfully, or else none of them will be executed at all. And in this case, if Active Record finds an existing user with this username, it aborts the transaction. So why isn’t it enough?

Well, it could be – depending on your database settings. Transactions are run with varying isolation levels, meaning “how much can concurrent transactions affect this one?” The good news is that if you use “serializable isolation” for your transactions, the database will guarantee that no other users are inserted between the SELECT and INSERT above. If two concurrent transactions try to insert a user with the username doofenshmirtz, the second one to complete will be rolled back, because the database knows that the second SELECT would be affected by the first INSERT.

The bad news is that you’re probably not using serializable isolation. And the worse news is that (at least in the case of PostgreSQL 9.6.1) using serializable isolation can produce false positives – rolling back your transaction when a different username was inserted. The PostgreSQL error message acknowledges this by saying “HINT: The transaction might succeed if retried.”
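
If you want to try this yourself, you can opt in per transaction; here’s a sketch following the log above:

BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT 1 FROM users WHERE username = 'doofenshmirtz';
-- application code checks the result and decides to proceed
INSERT INTO users (username) VALUES ('doofenshmirtz');
COMMIT;
-- a concurrent duplicate fails with SQLSTATE 40001
-- (serialization_failure) and must be retried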

Writing application code to retry insertions sounds far too messy to me. And if concurrent requests have to wait in line to add a user, that’s a serious performance bottleneck – especially if the same request has to return to the back of the “retry line” repeatedly.

Lock It Up

Another alternative is to use database locking. Essentially, you say “nobody else can use this data until I’ve finished.” There are various levels of locking – prohibiting reads or only writes, and locking a single row or an entire table.

While locking a row may not be a problem, the “no duplicate usernames” case would require locking the entire table for writes while your application code fetches and examines the results of the SELECT. Like serializable isolation, that creates a performance bottleneck. And if you ever use multiple locks, you have to be careful that you don’t create a deadlock.

Using Constraints

Constraints are a much more targeted tool: rules you set in the database about what’s valid for a row, a table, or the relationships between them. The database will reject any data that violates those rules.

We’ve already seen unique indexes (which are a kind of uniqueness constraint). They reliably solve the “no duplicates” problem.

You may also know about foreign key constraints, which say “we can’t have a row in the comments table with post_id of 5 unless the posts table has a row with id of 5”. This guarantees that even if a post is deleted while a comment is being submitted, you won’t create an orphan comment. (It also forces you to decide whether to allow someone to delete a post that already has comments, and if so, what should happen to those comments. Without foreign keys, you’ll likely be cleaning up that mess later.)
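
In SQL, that rule looks like this (a sketch; the constraint name is assumed):

ALTER TABLE comments
  ADD CONSTRAINT comments_post_id_fkey
  FOREIGN KEY (post_id) REFERENCES posts (id);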

But if you’re using PostgreSQL, there are other kinds of constraints available.

Check Constraints

For example, a check constraint could be used to say “an account balance may never be less than zero”.

ALTER TABLE
  accounts
ADD CONSTRAINT
  nonnegative_balance CHECK (balance >= 0);

This rule applies to each row in the accounts table. If you don’t see a need for this, you may have another race condition you haven’t thought about.

Suppose a business makes a purchase of $20. Does your code read the current $100 balance, subtract $20, and UPDATE the balance to $80? If so, there’s a race condition between the read and write. If two users make simultaneous $20 purchases, the second request to finish processing would set the balance to $80 when it should be $60.

You could prevent that by locking the row (a reasonable solution in this case), or you could ask your database to calculate the balance:

UPDATE
  accounts
SET
  balance = balance - 20
WHERE
  id = 1

With this approach, both purchases would subtract $20 correctly, and the CHECK constraint would ensure that if there aren’t enough funds in the account, the purchase fails[1].

Exclusion Constraints

Exclusion constraints are less widely-known, but they could save your bacon.

Imagine a rental reservation system with properties and reservations. Each reservation has a property_id, a start_date and an end_date.

Clearly, you can’t allow conflicting reservations if you want happy customers. If Alice has rented a cabin from June 15-20, Bob must not be allowed to rent it from June 12-21 or any other overlapping period. But how do you guarantee this doesn’t happen?

You can check for overlaps in your application code, but if both rentals are submitted simultaneously, you might accidentally book them both. You could lock the table for writes while performing this check, but that’s not very performant. You could go ahead and book both, but have a cron job to check for messes and a customer service team to apologize…

The cleanest solution would be to set this rule in the database: “no two reservations for the same property_id can have overlapping date ranges.”

ALTER TABLE
  reservations
ADD CONSTRAINT
  no_overlapping_rentals
EXCLUDE USING
  gist (
    property_id WITH =,
    daterange("start_date", "end_date", '[]') WITH &&
  );

This says “if the property_id of an existing reservation is = to mine, and its inclusive ([]) date range overlaps mine (&&), my reservation is invalid.” (Constraints are implemented via indexes, and this one uses a GiST index; comparing a plain integer column with = also requires the btree_gist extension.) With this in place, we’re guaranteed not to get conflicting reservations.

You can do even fancier things, like restricting this rule to reservations WHERE status = active, so that cancelled reservations don’t prevent new ones.
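
That variation just adds a WHERE clause to the constraint (a sketch, reusing the definition above):

ALTER TABLE
  reservations
ADD CONSTRAINT
  no_overlapping_active_rentals
EXCLUDE USING
  gist (
    property_id WITH =,
    daterange("start_date", "end_date", '[]') WITH &&
  )
WHERE
  (status = 'active');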

If you want more detail about exclusion constraints, see my write-up here. But before we wrap up, let’s talk about why you might not be using constraints already.

Logic in the Database?

One objection to techniques like these is that they put “business logic” in the database.

Way back in 2005, David Heinemeier Hansson, creator of Ruby on Rails, wrote a post called “Choose a Single Layer of Cleverness”, in which he dismissed “stored procedures and constraints” as destroying the “coherence” of a system.

I want a single layer of cleverness: My domain model. Object-orientation is all about encapsulating clever. Letting it sieve half ways through to the database is a terrible violation of those fine intentions.

I don’t know what systems DHH was reacting to, but I can imagine being frustrated to find that my code only told half the story about how the application behaves, or that the logic in SQL wasn’t deployed in lockstep with my application code.

I also don’t know to what extent DHH has changed his mind. Rails appears to have supported unique indexes since version 1.0, and finally added foreign key support in Rails 4.2 in December 2014.

But Rails, like the frameworks that sprang up after it in other languages, still doesn’t support database constraints nearly as well as it supports validations. This may be partly due to its early “database agnostic” approach; MySQL, for example, doesn’t even have CHECK or EXCLUDE constraints. But even if you use supported constraints (like unique indexes and foreign keys), violations will result in an ActiveRecord exception; it’s up to you to rescue it and provide meaningful user feedback.

And DHH’s view is still echoed in The Rails Guides:

Database constraints and/or stored procedures make the validation mechanisms database-dependent and can make testing and maintenance more difficult. However… database-level validations can safely handle some things (such as uniqueness in heavily-used tables) that can be difficult to implement otherwise…. [but] it’s the opinion of the Rails team that model-level validations are the most appropriate in most circumstances.

They have a point. Clearly, validations for things like “products must have a price” or “email addresses must contain an ‘@’” can be safely handled by application code. And even when validations are insufficient to guarantee correctness, your application might not get heavy enough traffic for you to notice.

As the authors of the paper “Feral Concurrency Control” wrote:

Empirically, even under worst-case workloads, these validations [for things like uniqueness] result in order-of-magnitude reductions in inconsistency. Under less pathological workloads, they may eliminate it. It is possible that, in fact, the degree of concurrency and data contention within Rails-backed applications simply does not lead to these concurrency races – that, in some sense, validations are “good enough” for many applications.

The reason validations help so much is this: if you’re trying to reserve a username that was already taken, it’s much more likely that it was claimed in a request before yours than in a request nearly simultaneous to yours. And if it was taken, say, yesterday, the validation will prevent your duplicate.

Don’t Leave It To Chance

Fundamentally, your multiple web application instances do not coordinate their actions, so they can’t prevent inconsistent data completely without leaning on the database via locks, constraints, or serializable transactions. And in my view, constraints are the most straightforward of these options: they don’t require retry loops like serializable transactions do, they can’t create deadlocks, and they have very little impact on performance.

Bottom line: I’m not happy with leaving data integrity to chance, especially when – as in the case of a doubly-booked property – an error would lead to very unhappy customers.

In my next post, I’ll give some thoughts on when not to use constraints. I’ll also suggest how to provide a good user and developer experience when you do use them.


[1] Example borrowed from Kevin Burke’s post, “Weird Tricks to Write Faster, More Correct Database Queries”

Elixir and IO Lists, Part 2: IO Lists in Phoenix

Elixir’s IO lists are one of the secrets to blazing response times in the Phoenix framework. Here’s how.

This post was adapted from a talk called “String Theory”, which I co-presented with James Edward Gray II at Elixir & Phoenix Conf 2016. My posts on Elixir and Unicode were also part of that talk.


IO Lists and Phoenix

In my last post, I showed how Elixir’s IO lists enable us to build and write output with minimal work and memory usage.

This is nice enough for writing files, but it’s absolutely killer for web applications.
On any given web page, you probably have some elements that are dynamic, like the current user’s name, or a list of recent posts, or some items in a shopping cart.

But those dynamic bits of data are wrapped in chunks of markup that always look the same: maybe each product is wrapped in a <div class="product">, for example.
And there are probably big chunks of HTML in the headers, footers, and menus that look the same on every request.

Suppose you have a template called users/index.html.eex that looks like this (reconstructed here from the compiled output shown below):

<h1>Listing Users</h1>

<ul>
  <%= for user <- @users do %>
    <li> <%= user.first_name %> (<%= user.id %>)</li>
  <% end %>
</ul>

That's all!

These strings are going to be needed over and over again, and everything outside the <%= %> tags never changes.

Most web frameworks concatenate the static markup and the dynamic data into one big response string.
That concatenation is a lot of work, and it creates a lot of work for the garbage collector, too.

By contrast, Phoenix does the following:

  • At compile time, Phoenix runs through the templates directory and loads them all
  • It finds users/index.html.eex (among others)
  • Phoenix uses EEx to compile it, along with its layout and partials, into a function
  • That function will build and return IO lists, not strings (because Phoenix specifically asks EEx to do that)
  • The template rendering function is stored as something like UsersView.render("index.html", assigns)

The generated function will look something like this (don’t worry about the details):

defmodule MyApp.SomeView do
  defp(index.html(var!(assigns))) do
    _ = var!(assigns)
    {:safe, [(
      tmp1 = ["" | "<h1>Listing Users</h1>nn<ul>n  "]
      [tmp1 | case(for(user <- Phoenix.HTML.Engine.fetch_assign(var!(assigns), :users)) do
        {:safe, [(
          tmp1 = [(
            tmp1 = ["" | "n    <li> "]
            [tmp1 | case(user.first_name()) do
              {:safe, data} ->
                data
              bin when is_binary(bin) ->
                Plug.HTML.html_escape(bin)
              other ->
                Phoenix.HTML.Safe.to_iodata(other)
            end]
          ) | " ("]
          [tmp1 | case(user.id()) do
            {:safe, data} ->
              data
            bin when is_binary(bin) ->
              Plug.HTML.html_escape(bin)
            other ->
              Phoenix.HTML.Safe.to_iodata(other)
          end]
        ) | ")</li>n  "]}
      end) do
        {:safe, data} ->
          data
        bin when is_binary(bin) ->
          Plug.HTML.html_escape(bin)
        other ->
          Phoenix.HTML.Safe.to_iodata(other)
      end]
    ) | "n</ul>nnThat's all!n"]}
  end
  def(render("index.html", assigns)) do
    index.html(assigns)
  end
end

The thing to notice in that function is that it contains some string literals, like "<h1>Listing Users</h1>\n\n<ul>\n  " and "\n    <li> ".
These will be the same immutable strings, at the same locations in memory, every time they appear in the returned IO list, request after request.

When the function runs, it will return an IO list like this:

[[["" | "<h1>Listing Users</h1>nn<ul>n  "],
  [[[[["" | "n    <li>"] | "Jane"] | " ("] | "1"] | ")</li>n  "],
  [[[[["" | "n    <li>"] | "Joe"] | " ("] | "2"] | ")</li>n  "]] |
 "n</ul>nnThat's all!n"]

You may notice that this IO list looks odd.
That’s because it’s not a “proper” list.

We often build lists by prepending items, like this:

list = []           # => []
list = ["C" | list] # => ["C"]
list = ["B" | list] # => ["B", "C"]
list = ["A" | list] # => ["A", "B", "C"]

Every item in this list is a cell whose head points to a string and whose tail points to another list.
The final tail is an empty list (though it’s not shown).
That’s a lot of lists!

But it’s also possible to do this:

list = ["A" | "B"]  # => ["A" | "B"]

This is an “improper list” because the head points to “A”, and the tail does not point to a list, but to “B”.

Many functions that expect lists will break if given an improper one, so it’s not a good idea to use them, generally speaking.
But in this case, because the only thing we’ll be doing with these lists is wrapping them in other lists and ultimately writing them to a socket, making them improper removes the need to allocate quite so many lists.
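
(And if you ever do need the final string, say in a test, IO.iodata_to_binary/1 will flatten any IO list, improper or not.)

io_list = [["A" | "B"] | "C"]
IO.iodata_to_binary(io_list) # => "ABC"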

OK, moving on.

The really interesting thing is what happens next: the IO list is handed to the web server process, which sends it to the user by calling writev on the network socket. The only place the full response is created is in the socket buffer.

Remember, the minimum requirement for sending a web response is to copy each byte of the response to the socket. With this view rendering strategy, that’s basically all Phoenix is doing.

Cache Flow

There’s one other great thing about this way of building responses.
Remember this template, where some of the strings will be needed over and over again?

[The users/index.html.eex template again, with its static, never-interpolated parts highlighted]

As we saw, the function that Phoenix compiles to render that template will use the same strings in memory every time, simply adding the dynamic parts as needed to the IO list.
This is, in effect, a kind of view caching.

You may have used web frameworks with slower, concatenation-based view rendering, which try to compensate by adding ways to cache views and partials.
But they make you solve one of the hard problems in computer science: cache invalidation.
You have to take into account the fact that content changes over time.

For example, a blog post may get updated in the database.
You might handle this with a simple expiration time: “cache this post partial for 1 hour.”
Or you might tie the view cache to the state of your database.
For example, “the HTML for a blog post and its comments can be cached, but we have to re-render it if the post changes, or if its comments change, or if the name of the post’s author changes, or if the name of any of the comments’ authors change.”
These kinds of rules may be needed for Rails’ Russian Doll Caching.

Also, you have to deal with the fact that parts of the page vary per user; for example, they might display the current user’s email address, or recommended products.
You might solve this by having a per-user cache key, as the Django docs show, or by caching some generic version and using JavaScript to add personalized content after the page loads, as was suggested years ago.
Or you might just forego caching for that part of the page.

In any case, you’re responsible for specifying which parts of your view can be cached and under what circumstances the cache is valid.
The more frequently-updated your content is, the less these caches help you, and the more personalized it is, the more space they occupy in your cache store.

Did I mention that you have to have a cache store?
You have to decide whether that should be in RAM, or the file system, or an external database, with the tradeoffs those choices imply.

I don’t mean to be too harsh.
Smart developers have put lots of hard work into making those view caches easy to use.
But there’s still a lot to think about.

By contrast, by compiling templates to functions, Phoenix automatically and universally applies this simple view caching strategy: the static parts of our template are always cached.
The dynamic parts are never cached.
The cache is invalidated if the template file changes.
The end.

Now, since the dynamic parts of a view (like the list of blog post titles from your database) are never cached in the view functions, they need to be given that data at render time.
If database queries are your performance bottleneck, you could cache their results; you might, for example, use a materialized view for that, and have a process responsible for periodically refreshing it.
But that would be a completely separate concern from your templates.
And Phoenix’s view rendering will almost certainly be too fast for you to care to cache its final product.

A Disclaimer

Before we wrap up, I want to add a small disclaimer about system calls.

As we saw in my last post, small strings (less than 64 bytes long) are generally combined by the BEAM in the writev arguments.
So if you trace a Phoenix app, you probably won’t see every little <li> handed to writev separately.
But it does use writev on the socket.

For example, here’s a trace I ran of a Phoenix app with some very large strings in its templates.

[Screenshot of writev system calls generated by Cowboy for a Phoenix application]

The two strings highlighted in blue were large ones that appeared multiple times in the template.
In writing the response to the socket, Cowboy referenced those same strings in memory each time.
The string highlighted in red is the opening <html> tag and a bunch of unchanging data after it.
You only see it once in this screenshot, but it was written from the same memory location in subsequent requests.

But regardless of how the BEAM writes Phoenix’s responses to the socket, we certainly benefit from the fast, memory-efficient way that they’re built in the first place.

So the next time you see Phoenix render your web page in less than a millisecond, take a moment to appreciate the lovely little IO lists that help make that possible.

Elixir and IO Lists, Part 1: Building Output Efficiently

Erlang’s IO lists enable Elixir to write to files and sockets very efficiently. But do you know how to use them?

This post was adapted from a talk called “String Theory”, which I co-presented with James Edward Gray II at Elixir & Phoenix Conf 2016. My posts on Elixir and Unicode were also part of that talk.


It’s been said that “the key to making programs fast is to make them do practically nothing.”

Imagine you’re going to write some data to a file, or send a web response to a browser.
What’s the minimum amount of work you can possibly do?

It’s this: copy each byte of data to the file or socket.

Now, if you’re going to get microsecond response times like Elixir’s Phoenix framework does, you can’t do much more than that.
And Phoenix doesn’t, thanks to a very interesting data structure called an “IO list”.

You can use IO lists to make your own code more efficient. To understand how, let’s first consider something that developers do all the time: string concatenation.

Strings and IO Lists

Here’s a concatenation example in Elixir:

name = "James"
IO.puts "Hi, " <> name  # => "Hi, James"

Interpolation is just a prettier way of writing the same thing.

name = "James"
IO.puts "Hi, #{name}"  # => "Hi, James"

In order to execute this code, the BEAM has to:

  • Allocate the string “James”
  • Allocate the string “Hi, “
  • Allocate a third string and copy the others into it, producing “Hi, James”

Copying the string data is work.
Not only that, but the more strings we allocate, the more memory we use, and the more work we create for the garbage collector.

We can do this more efficiently in Elixir by using an IO list.

An IO list just means “a list of things suitable for input/output operations”, like strings or codepoints.
Functions like IO.puts/1 and File.write/2 accept “IO data”, which can be either a simple string or an IO list.

name = "James"
IO.puts ["Hi, ", name]                            # => "Hi, James"
IO.puts ["Hi, ", 74, 97, 109, 101, 115]           # => "Hi, James"

IO lists can be nested, and they’re still handled by IO functions as if they were flat lists.

IO.puts ["Hi, ", [74, [97, [109, [101, [115]]]]]] # => "Hi, James"

At first glance, this may seem like no big deal; the output is the same.
But the ability to structure our output data like this provides some real performance benefits.

First, because we’re using lists, we can efficiently repeat items.

users = [%{name: "Amy"}, %{name: "Joe"}]

response = Enum.map(users, fn (user) ->
  ["<li>", user.name, "</li>"]
end)

IO.puts response

In this example, the "<li>" string is created only once; every time it appears in the list, it’s just a pointer to the original.
That means we can represent our final output with a lot fewer string allocations.
The more repetitive the content, the more allocations we save.

Also, the fact that we can nest these IO lists makes a huge difference in how quickly we can build them.
Normally, appending to a linked list is an O(N) operation: we have to walk through every item in the list, find the last one, and point its tail to a new item.
With immutable data, it’s even worse: we can’t modify the last item, so we have to copy it, which means we have to copy the previous item, and the previous one, all the way back to the start of the list.

However, with nesting, we can append to a list simply by wrapping it in a new list.

names = ["A", "B"]    # => ["A", "B"]
names = [names, "C"]  # => [["A", "B"], "C"]

This operation is O(1) and doesn’t require copying anything.
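
Compare that to appending with ++, which must copy every cell of the left-hand list:

names = ["A", "B"]
names ++ ["C"]     # copies the cells for "A" and "B" to build ["A", "B", "C"]
[names, "C"]       # O(1): reuses names, allocating only one new list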

IO lists also have big implications for system calls.

System Calls

Most programs can’t work directly with things like files and sockets.
Instead, they have to use “system calls” to ask the operating system to do things on their behalf.
It’s up to the OS to enforce file permissions and handle the details of working with specific hardware.

Consider the following Elixir script:

# Here I'm calling an Erlang function. I'll explain why later.
{:ok, file} = :file.open("/tmp/tmp.txt", [:write, :raw])

foo = "foo"
bar = "bar"

output = [foo, bar, foo]
output = Enum.join(output)

# Another Erlang function call
:file.write(file, output)

It’s fairly simple: we open a file, create some strings, concatenate them, and write the result to the file.

When executing the last line, the BEAM makes a system call to write the file. Using Evan Miller’s dtrace script (linked from his excellent article), we can see what that looks like:

write:return Write data (9 bytes): 0x00000000146007e2

Here the BEAM uses the system call write and says “please write 9 bytes from memory address 0x00000000146007e2”. That 9-byte string contains 3 bytes for “foo”, 3 for “bar”, and 3 for “foo” again.

Now watch what happens if we comment out the line where we join the strings:

{:ok, file} = :file.open("/tmp/tmp.txt", [:write, :raw])

foo = "foo"
bar = "bar"

output = [foo, bar, foo]
# output = Enum.join(output)

:file.write(file, output)

This time, we’re passing an IO list to :file.write/2.
It’s a small change, but look at the system call:

writev:return Writev data 1/3: (3 bytes): 0x0000000014600430
writev:return Writev data 2/3: (3 bytes): 0x0000000014600120
writev:return Writev data 3/3: (3 bytes): 0x0000000014600430

What we’re seeing is actually one call to writev asking to write 3 chunks of data: “foo” from one address, “bar” from another, then “foo” from the same address as the first time.

This is very exciting.
Nowhere in our program is the final string, “foobarfoo”, produced.
The only place where the concatenation happens is in the file itself.

When we concatenated in our own code, we took two strings in BEAM memory, copied their contents into a third string, and asked the OS to write that new string to a file.

When we used the IO list, we skipped the work of concatenation, saved ourselves from allocating the full 9-byte string, and removed the need to garbage collect that final string.

All the BEAM had to do was to ask the OS to copy each byte of data to the file.

Caveats for writev

Now, I said that the BEAM doesn’t concatenate the strings in an IO list when performing I/O.
And if you run this in IEx and trace the system calls, you’ll see each item in the IO list as a separate argument to writev.

{:ok, file} = :file.open("/tmp/tmp.txt", [:write, :raw])
:file.write(file, some_iolist_of_your_own)

However, I made two important choices in that snippet to ensure that writev was used.

First, I specified that the file be opened in raw mode. This tells the BEAM not to create a separate process to handle the file, as it normally would.
You can read about this in the Erlang docs for file:open/2 and in the “Performance” section at the bottom of that page.

Second, I used two Erlang function calls, not the Elixir code you’d expect:

File.write("/tmp/tmp.txt", some_iodata, [:raw])

That Elixir function delegates to a similar Erlang function, file:write_file/3, which you could call like this:

:file.write_file("/tmp/tmp.txt", some_iodata, [:raw])

This function takes care of opening and closing the file handle for you.
In the current release of Erlang/OTP (19.1.2), :file.write_file/3 has a bug: it always combines the IO data into a single string, even if told to use raw mode.
I recently fixed that, so once the fix gets released, you’ll be able to just use the Elixir code shown above.
But for now, if you want to use writev when writing files from Elixir, you’ll have to use Erlang’s :file.open/3.

Another caveat about writev: you may see strings combined if they are less than 64 bytes long.
This has to do with the way the BEAM tracks strings in memory and copies data sent between processes.
If your IO list contains larger, “refc” strings (longer than 64 bytes), you will see them show up in the writev vector as their own entries.

Use IO lists for I/O

Caveats aside, here’s some simple advice: if you’re going to build up some output and then write it to a file, write it to a socket, or return it from a plug, don’t concatenate strings.
Use an IO list.

It will help you build your output quickly and with minimal memory usage.
And by doing this, you allow the BEAM to use writev to reduce copying.

As I mentioned, the Phoenix framework takes advantage of this technique.
In my next post, I’ll show you how.

Elixir and Unicode, Part 2: Working with Unicode Strings

Unicode strings are multi-layered things, which makes comparing, traversing, and transforming them a bit more complicated than with ASCII. Learn how to wrangle them correctly in Elixir.

This post was adapted from a talk called “String Theory”, which I co-presented with James Edward Gray II at Elixir & Phoenix Conf 2016.


In my post on Unicode and UTF-8, I showed you the basis of Elixir’s great Unicode support: every string in Elixir is a series of codepoints, encoded in UTF-8.
I explained what Unicode is, and we walked through the encoding process and saw the exact bits it produces.

For this post, what’s important to know is that UTF-8 represents codepoints using three kinds of bytes. Larger codepoints get a leading byte followed by one, two or three continuation bytes, and the leading byte tells how many continuation bytes we should expect. Smaller codepoints just get a single solo byte.

Each kind of byte has a distinct pattern, and by using those patterns, Elixir can do a lot of things correctly that some other languages mess up, like reverse a string without breaking up its characters. Let’s see how.

Reversing a UTF-8 String

Suppose we wanted to reverse the string "a™". Elixir represents that string as a binary with four bytes: the "a" gets a solo byte, and the "™" gets three bytes (a leading byte and two continuation bytes).

For simplicity’s sake, we can picture "a™" like this:

[ Solo ][ First of 3 ][ Continuation ][ Continuation ]

You wouldn’t want to reverse it like this, scrambling the multi-byte "™":

[ Continuation ][ Continuation ][ First of 3 ][ Solo ]

Instead, you’d want to reverse it like this, keeping the bytes for "™" intact:

[ First of 3 ][ Continuation ][ Continuation ][ Solo ]

Elixir does this correctly because, thanks to using UTF-8, it can tell which bytes should go together. That’s also what lets it correctly measure the length of a string, or get substrings by index: because it knows which bytes go together, it knows whether (for example) the first three bytes express one character or three.
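
You can verify all of this in IEx:

String.reverse("a™") # => "™a"
String.length("a™")  # => 2
byte_size("a™")      # => 4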

Well, that’s part of it. There’s one more layer to consider.

Grapheme Clusters

Not only can we have multiple bytes in one codepoint; we can also have multiple codepoints in one “grapheme”. A grapheme is what most people would consider a single visible character, and in some cases, what looks like “a letter with an accent mark” may be composed of a “plain” letter followed by a “combining diacritical mark” – which says, “hey, put this mark on the previous letter”.
A series of codepoints that represent a single grapheme is called a “grapheme cluster.”

noel = "noeu0308l" # => "noël"
String.codepoints(noel) # => ["n", "o", "e", "̈", "l"]
String.graphemes(noel)  # => ["n", "o", "ë", "l"]

Notice that Elixir lets us ask for either the codepoints or the graphemes in that string.

Now, you might wonder: if I can add one mark to a letter, can I add two? How many can I add?

[Image: An ñ written with more and more diacritical marks on it]

The answer is: you can add a boatload of them!

[Image: “zalgo text”, where meaningless diacritical marks make the text almost unreadable]

You may have seen “zalgo text,” where some poor website’s text box is overflowed with horrible-looking characters, and they think they’ve been hacked. You may have seen questions on Stack Overflow asking how to keep people from putting this junk on your web site. And you may have seen snarky responses that break the Stack Overflow comment section.

[Image: A comment on Stack Overflow where the diacritical marks overflow the comment box]

Because unfortunately, there’s no easy way to prevent people from putting this on your site. Zalgo text is not a bug. It’s a (misused) feature.

Remember, Unicode is trying to cover all of human language. It supports writing from left to right or right to left, and it supports special Javanese punctuation which “introduces a letter to a person of older age or higher rank”.

And it turns out that some languages need multiple combining marks per character – like the Ticuna language of Peru, which is a tonal language and uses the marks to indicate tones.

[Image: An example Ticuna word which requires three diacritical marks on a single grapheme]

I mean, you could screen out all combining marks, but that would break a lot of Unicode text. You could screen out any characters with more than one combining mark, but you don’t want to outlaw the Ticuna in an attempt to control your page layout. Could you limit it to 5? Maybe, if you don’t mind alienating users from Tibet, who may use 8 or more combining marks.

A simpler solution would be to just declare “this text isn’t allowed to overflow its container.”

[Image: Using the CSS property ‘overflow: hidden’ prevents diacritical marks from visually overflowing their container]

Anyway, the fact that multiple codepoints can be one grapheme—like an “e” codepoint followed by an “accent” codepoint—means that the string reversal I showed earlier wasn’t quite right. To properly reverse a string, you have to first group the bytes into codepoints, then group the codepoints into graphemes, and then reverse those.

"noël" |> String.codepoints |> Enum.reverse # => "l̈eon"
"noël" |> String.graphemes  |> Enum.reverse # => "lëon"

The latter is essentially what String.reverse does; the former is a common way that programming languages do it wrong.

By the way, emoji can also be created using multiple codepoints. One example of this is skin tone modifiers:

👍🏿 👍🏾 👍🏽 👍🏼 👍🏻

Each of those is a “thumbs up” codepoint (👍) , followed by a skin tone codepoint (like 🏿 ). This kind of thing is supported for various human codepoints.

[Image from the Unicode site, showing that a face plus a skin tone modifier = a face with that skin tone]

There are plans to similarly interpret things like “police officer + female sign” as “female police officer,” which is nice.

Unicode also supports drawing a family by combining people emojis with the “zero width joiner” character, but in my opinion, this is insane. No font can support every possible configuration of a family—number, age and gender of adults and children, and skin tone of each family member—and give each combination its own character. If a group of individual emoji doesn’t cut it, it looks to me like a job for the <img> tag.

Anyway, to recap: we can have multiple UTF-8 bytes for one Unicode codepoint, and multiple codepoints can form one grapheme.

Gotchas

All this leads to a few little gotchas for programmers.

First, when checking the “length” of a string, you have to decide what you mean. Do you want the number of graphemes, codepoints, or bytes? String.length in Elixir will give you the number of graphemes, which is how long the string “looks”. If you want byte_size or a count of codepoints, you’ll have to be explicit about that.

And while byte_size is an O(1) operation, String.length has to walk through the whole string, grouping bytes into codepoints and codepoints into graphemes, so it’s O(N). (If you don’t understand these terms, see my article on Big O notation.)
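
Here’s the difference on the two-codepoint spelling of “noël”:

noel = "noe\u0308l"              # "ë" as two codepoints
String.length(noel)              # => 4 (graphemes)
length(String.codepoints(noel))  # => 5 (codepoints)
byte_size(noel)                  # => 6 (bytes)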

Second, string equality is tricky. This is because there can be more than one way to build a given visible character; for example, the accented “e” in “noël” can be written as an “e” followed by a combining diacritical mark. This way of building it has to exist, because Unicode supports adding dots to any character. However, because “ë” is a common grapheme, Unicode also provides a “precomposed character” for it—a single codepoint that represents the whole grapheme.

A human can’t see any difference between “noël” and “noël”, but if you compare those two strings with ==, which checks for identical bytes, you will find that they’re not equal. They have different bytes, but the resulting graphemes are the same. In Unicode terms, they are “canonically equivalent.”

If you don’t care about the difference between “e with an accent” spelled as one codepoint or two, use String.equivalent?/2 to ignore it – that function will normalize the strings before comparing them.

{two_codepoints, one_codepoint} = {"e\u0308", "\u00EB"} # => {"ë", "ë"}
two_codepoints == one_codepoint # => false
String.equivalent?(two_codepoints, one_codepoint) # => true

Finally, changing case can be tricky, even though it’s essentially a (heh heh) case statement that’s run once per grapheme: “if you get ‘A’, downcase it as ‘a’, if ‘B’, as ‘b’…”.

Elixir mostly gets this right:

String.downcase("MAÑANA") == "mañana"

But human language is complicated. You think you have a simple thing like downcasing covered, then you learn that the Greek letter sigma (“Σ”) is supposed to be “ς” at the end of a word and “σ” otherwise.

Even Elixir doesn’t bother to support that; it downcases one grapheme at a time, so it can’t consider what comes after the sigma. If it really matters to you, you can write your own GreekAware.downcase/1 function.
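
Here’s a minimal sketch of such a function (the regex approach is my own illustration; only the GreekAware name comes from the post):

defmodule GreekAware do
  # Downcase everything, then turn any sigma that isn't followed
  # by another letter (i.e., a word-final sigma) into "ς".
  def downcase(string) do
    string
    |> String.downcase()
    |> String.replace(~r/σ(?!\p{L})/u, "ς")
  end
end

GreekAware.downcase("ΟΔΥΣΣΕΥΣ") # => "οδυσσευς"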

But How Do I Type It?

So, suppose you’re browsing the Unicode code charts (as one does) and you come across something you want to put in an Elixir string (like “🂡”). How do you do it?

Well, you can just type or paste it directly into an Elixir source file. That’s not true for every programming language.

But if you want or need another way, you can use the codepoint value like this:

# hexadecimal codepoint value
"🂡" == "u{1F0A1}"
# decimal codepoint value
"🂡" == <<127_137::utf8>>

If you read my first article on Unicode, you can understand what the ::utf8 modifier does: it encodes the number as one or more UTF-8 bytes in the binary, just like we did there with ⏰.

Congratulations! You’re a 💪er 🤓 than ever!

Elixir and Unicode, Part 1: Unicode and UTF-8 Explained

You may know that Elixir has great Unicode support. But how? Let’s see how Unicode and UTF-8 actually work and how Elixir is able to handle any string you give it.

This post was adapted from a talk called “String Theory”, which I co-presented with James Edward Gray II at Elixir & Phoenix Conf 2016.


You may have heard that Elixir has great Unicode support. This makes it a great language for distributed, concurrent, fault-tolerant apps that send poo emoji! 💩

Specifically, Elixir passes all the checks suggested in The String Type is Broken.

The article says that most languages fail at least some of its tests, and mentions C#, C++, Java, JavaScript and Perl as falling short (it doesn’t specify which versions). But here I’ll compare the languages I use most: Elixir (version 1.3.2), Ruby (version 2.4.0-preview1) and JavaScript (run in v8 version 4.6.85.31).

(By the way, the test descriptions use terms like “codepoints” and “normalized”—I’ll explain those later.)

1. Reverse of “noël” (e with accent is two codepoints) is “lëon”

  • Elixir: String.reverse("noël") == "lëon"
  • Ruby: "noël".reverse == "l̈eon"
  • JS: (no built-in string reversal)

2. First three chars of “noël” are “noë”

  • Elixir: String.slice("noël", 0..2) == "noë"
  • Ruby: "noël"[0..2] == "noe"
  • JS: "noël".substring(0,3) == "noe"

4. Length of “😸😾” is 2

  • Elixir: String.length("😸😾") == 2
  • Ruby: "😸😾".length == 2
  • JS: "😸😾".length == 4

5. Substring after the first character of “😸😾” is “😾”

  • Elixir: String.slice("😸😾", 1..-1) == "😾"
  • Ruby: "😸😾"[1..-1] == "😾"
  • JS: "😸😾".substr(1) == "😾"

6. Reverse of “😸😾” is “😾😸”

  • Elixir: String.reverse("😸😾") == "😾😸"
  • Ruby: "😸😾".reverse == "😾😸"
  • JS: (no built-in string reversal)

7. “baﬄe” (“baffle” with a ligature – “ffl” as a single codepoint) upcased should be “BAFFLE”

  • Elixir: String.upcase("baﬄe") == "BAFFLE"
  • Ruby: "baﬄe".upcase == "BAﬄE"
  • JS: "baﬄe".toUpperCase() == "BAFFLE"

8. “noël” (this time the e with accent is one codepoint) should equal “noël” if normalized

  • Elixir: String.equivalent?("noël", "noël") == true
  • Ruby: ("noël".unicode_normalize == "noël".unicode_normalize) == true
  • JS: ("noël".normalize() == "noël".normalize()) == true

OK, but how does Elixir support Unicode so well? I’m glad you asked! (Ssssh, pretend you asked.) To find out, we need to explore the concepts behind Unicode.

What is Unicode?

Unicode is pretty awesome, but unfortunately, my first exposure to it was “broken characters on the web.”

[Image: “I ♥ Unicode” mug with the heart rendered incorrectly as a square (from Zazzle)]

To understand Unicode, let’s talk first about ASCII, which is what English-speaking Americans like me might think of as “plain old text.” Here’s what I get when I run man ascii on my machine:

[Image: screenshot of the output from running “man ascii”]

ASCII is just a mapping from characters to numbers. It’s an agreement that capital A can be represented by the number 65, and so on. (Why 65? There are reasons for the numeric choices.) The number assigned to a character is called its “codepoint.”

To “encode” ASCII—to represent it in a way that can be stored or transmitted—is simple. You just convert the codepoint to base 2 and pad it with zeros up to a full 8-bit byte. Here’s how to do that in Elixir:

base_2 = fn (i) ->
  Integer.to_string(i, 2)
end

# ? before a character gives us its codepoint
?a == 97
?a |> base_2.() |> String.pad_leading(8, ["0"]) == "01100001"

Since there are only 128 ASCII characters, their actual data is never more than 7 bits long, hence the leading 0 when we encode ‘a’.

Character   UTF-8 bytes
a           01100001

And that’s fine as far as it goes. But we want to be able to type more than just these characters.

We want to type accented letters.

á é í ó ú ü ñ ź

And Greek letters.

λ φ θ Ω

And this Han character that means “to castrate a fowl.”

𠜎

And sometimes we want to type more than just words. Sometimes we want to type pictures.

𠜎 = 🐓 + 🗡

We want to type emoji for laughing, and crying, and kissing, and being upside down, and having dollars in our mouths.

😆  😭  😘  🙃  🤑

Unicode lets us type them all. Unicode lets us type anything in all of human language. In theory.

In practice, Unicode is made by a standards body, so it’s a political process, and some people say that their language isn’t getting a fair shake. For example, in an article called I Can Text You A Pile of Poo, But I Can’t Write My Name, Aditya Mukerjee explains that Bengali, with about 200 million native speakers (more than Russian), can’t always be properly typed on a computer.

Similarly, through something called the “Han unification,” people who type Chinese, Japanese and Korean have been told, “hey listen, y’all are gonna have to share some of your characters to save space.”

[Image: a single Han codepoint and the different written characters it corresponds to in different languages (from The Sorry State of Japanese on the Internet)]

Or at least, that’s how some of them interpret it. There is an article on the Unicode site explaining the linguistic, historical and technical rationale, which also says:

This process was well-understood by the actual national standards participants from Japan, Korea, China, and other countries, who all along have been doing the major work involved in minimizing the amount of duplicate encoding of what all the committee members fully agree is the same character.

As someone who doesn’t write any of the languages in question, I can’t really weigh in. But if saving space was part of the rationale, it does seem odd that Unicode has seen fit to include playing cards

[Image: playing cards in Unicode]

…and alchemical symbols

[Image: alchemical symbols in Unicode]

…and ancient Greek musical notation

[Image: ancient Greek musical notation in Unicode]

…oh, and Linear B, which nobody has used for anything for several thousand years.

But for the purposes of this article, what’s important is that Unicode can theoretically support anything we want to type in any language.

At its core, Unicode is like ASCII: a list of characters that people want to type into a computer. Every character gets a numeric codepoint, whether it’s capital A, lowercase lambda, or “man in business suit levitating.”

A = 65
λ = 955
🕴 = 128,372 # <- best emoji ever

So Unicode says things like, “All right, this character exists, we assigned it an official name and a codepoint, here are its lowercase or uppercase equivalents (if any), and here’s a picture of what it could look like. Font designers, it’s up to you to draw this in your font if you want to.”

Just like ASCII, Unicode strings (imagine “codepoint 121, codepoint 111…”) have to be encoded to ones and zeros before you can store or transmit them. But unlike ASCII, Unicode has more than a million possible codepoints, so they can’t possibly all fit in one byte. And unlike ASCII, there’s no One True Way to encode it.

What can we do? One idea would be to always use, say, 3 bytes per character. That would be nice for string traversal, because the third codepoint in a string would always start at the seventh byte. But it would be inefficient when it comes to storage space and bandwidth.

Instead, the most common solution is an encoding called UTF-8.

UTF-8

UTF-8 gives you four templates to choose from: a one-byte template, a two-byte template, a three-byte template, and a four-byte template.

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Each of those templates has some header bits which are always the same (the literal 0s and 1s shown above) and some slots where your codepoint data can go (shown as “x”s).

The four-byte template gives us 21 bits for our data (3 in the leading byte, plus 6 in each of the three continuation bytes), which would let us represent 2,097,152 different values. There are only about 128,000 assigned codepoints right now, so UTF-8 can easily encode any Unicode codepoint for the foreseeable future.

To use these templates, you first take the codepoint you want to encode and represent it as bits.

For example, ⏰ is codepoint 9200, as we can see by using ? in iex.

?⏰  # => 9200

Now let’s see that number in base 2:

base_2.(?⏰) == "10001111110000"

That’s 14 bits long—too many to fit into the UTF-8 2-byte template, but not too many for the 3-byte template. We insert them into it right to left, and pad with leading zeros.

[Image: diagram showing how the raw 1s and 0s of a codepoint are inserted into a UTF-8 template’s data slots]
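
Just as a sketch (this isn’t how the standard library is implemented), we can do that insertion ourselves with Elixir’s bitstring syntax: split the codepoint’s bits across the three-byte template’s data slots, then interleave the header bits.

# Pad the 14 bits to 16 and split them into the template's 4-, 6- and 6-bit slots
<<hi::4, mid::6, lo::6>> = <<9200::16>>

# Interleave the 1110 and 10 headers with the data bits
encoded = <<0b1110::4, hi::4, 0b10::2, mid::6, 0b10::2, lo::6>>

encoded == <<226, 143, 176>> # true
encoded == "⏰"               # true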

Is this what Elixir actually does? Let’s use the handy IEx.Helpers.i/1 function to inspect a string containing ⏰ in iex:

i "⏰"
....
Raw representation
  <<226, 143, 176>>

This shows us that the string is actually a binary containing three bytes. In Elixir, a “bitstring” is anything between << and >> markers, and it contains a contiguous series of bits in memory. If there happen to be 8 of those bits, or 16, or any other number divisible by 8, we call that bitstring a “binary” – a series of bytes. And if those bytes are valid UTF-8, we call that binary a “string”.
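
You can check those distinctions in iex:

is_bitstring(<<3::2>>)           # true: 2 bits is a fine bitstring
is_binary(<<3::2>>)              # false: not a whole number of bytes
is_binary(<<226, 143, 176>>)     # true: three whole bytes
String.valid?(<<226, 143, 176>>) # true: those bytes are valid UTF-8
String.valid?(<<255>>)           # false: 11111111 isn't valid UTF-8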

The three numbers shown here are decimal representations of the three bytes in this binary. What if we convert them to base 2?

[226, 143, 176] |> Enum.map(base_2)
# => ["11100010", "10001111", "10110000"]

Yep, that’s the UTF-8 we expected!

Three Kinds of Bytes

UTF-8 is cool because you can look at a byte and tell immediately what kind it is, based on what it starts with. There are “solo” bytes (as in, “this byte contains the whole codepoint”) which start with 0, “leading” bytes (the first of several in a codepoint) which start with 11 (and possibly some more 1s after that), and “continuation” bytes (additional bytes in a codepoint) which start with 10. The leading byte tells you how many continuation bytes to expect: if it starts with 110, you know there are two bytes in the codepoint; if 1110, there are three bytes in the codepoint, etc.

Starts with             Kind
0                       Solo
10                      Continuation
110, 1110, or 11110     First of N (count the 1s)
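
Elixir’s binary pattern matching makes this classification pleasant to write. Here’s a little sketch (classify_byte is mine, not part of any library):

classify_byte = fn
  <<0::1, _::7>>       -> :solo
  <<0b10::2, _::6>>    -> :continuation
  <<0b110::3, _::5>>   -> {:leading, 2}
  <<0b1110::4, _::4>>  -> {:leading, 3}
  <<0b11110::5, _::3>> -> {:leading, 4}
end

"⏰" |> :binary.bin_to_list() |> Enum.map(fn byte -> classify_byte.(<<byte>>) end)
# => [{:leading, 3}, :continuation, :continuation]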

Here’s an example character for each of the UTF-8 templates.

Character   UTF-8 bytes
a           01100001
ë           11000011 10101011
™           11100010 10000100 10100010
🍠          11110000 10011111 10001101 10100000

The letter ‘a’ is encoded with a solo byte—a single byte starting with 0. The “roasted sweet potato” symbol has a leading byte that starts with four 1s, which tells us that it’s four bytes long, then three continuation bytes that each start with 10.

Also, notice that the encoding for ‘a’ looks exactly like ASCII. In fact, any valid ASCII text can also be interpreted as UTF-8, which means that if you have some existing ASCII text, you can just declare “OK, this is UTF-8 now” and start adding Unicode characters.

The fact that each kind of byte looks different means you could start reading some UTF-8 in the middle of a file or a stream of data, and if you landed in the middle of a character, you’d know to skip ahead to the next leading or solo byte.
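
For instance, here’s a hypothetical resync function: given a list of bytes, it skips continuation bytes until it reaches one that could start a character.

resync = fn bytes ->
  # Continuation bytes are 10xxxxxx: decimal 128 through 191
  Enum.drop_while(bytes, fn byte -> byte in 128..191 end)
end

# Start reading "⏰a" one byte in, mid-character...
"⏰a" |> :binary.bin_to_list() |> Enum.drop(1) |> resync.()
# => [97] - we skipped the rest of ⏰ and landed on "a"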

It also means that you can reliably do things like reverse a string without breaking up its characters, measure the length of a string, or get substrings by index. In my next post, we’ll see how Elixir does those things. We’ll also learn why “noël” isn’t the same as “noël”, how to write comments that break a web site’s layout, and why even Elixir won’t properly downcase “ΦΒΣ”.

The post Elixir and Unicode, Part 1: Unicode and UTF-8 Explained appeared first on Big Nerd Ranch.
