Phil Whelan, February 4, 2014
In distributed systems, it makes your life much simpler if you stream all your logs in real-time to a central location. ActiveState's PaaS solution, Stackato, has had log streaming as a key feature, with Logyard, since 2012 and core Cloud Foundry now includes Loggregator.
Hootsuite is not a customer of ActiveState or a user of Cloud Foundry, but I often chat with them about how they are scaling their architecture as the company grows. We are big proponents of streaming logs, whether application logs or system logs, so I sat down with Beier Cai, Director of Technology at Hootsuite, to find out more about what Hootsuite is doing with log streaming at scale.
Streams Don't Stop
Hootsuite works heavily in streaming data, but not just logs. Their focus is on social networks, where there is a constant feed of new tweets, Facebook messages, LinkedIn posts and many more. These streams never stop. If anything, the volume of social network data is increasing every day. Every user's feed is different from the next. No two users share the same view, and the same users never view the same content twice. In short, it is a large and complex moving target.
When Hootsuite hits a bug in production, it is very hard to pull that sequence of events back to staging servers and reproduce it. With so many variables, the complexity and volume of data flowing through the system, and a vague description of the issue from the customer, where do you start? For their Enterprise customers, Beier tells me, the logic becomes very complex.
So, short of a time machine, how can you debug issues?
Hootsuite logs everything. This is their time machine.
When an error occurs in production or a user reports a problem, they can dive into the logs and see what their systems were doing at, and around, the time that the user reported the issue. They can drill down to see details specific to that user, their actions, and their content feeds.
One Log Level
What log level would you run your systems at if you were running a system on the scale of Hootsuite? Would it be "warn"? Maybe "info"? How about "debug"? At Hootsuite, they have done away with all other logging levels and made everything “debug”.
Beier does see the value in having different log levels, that you can toggle up and down, in other systems, but for Hootsuite it makes no sense. They would rarely be able to go back and repeat the exact same actions at a different log level. Having everything logged the first time around makes the most sense to them.
Beier also made a good point about logging levels in general. Who decides which level a particular log line should be at? It is hard to define and coordinate across a large system. Code-bases become increasingly large and complex, and managing the correct level for each log line becomes impossible. Then, when you do hit an issue, you do not have the right logs because somebody set the wrong level for that particular scenario. This is another reason that Hootsuite chose to go with their "Log everything. One log level" mentality.
A Lot Of Logs
As you can probably imagine, there is a cost for logging everything. You generate a lot of log data!
Hootsuite generates about 200GB of logs per day. They keep 2 weeks' worth of recent "hot" data, totaling about 2.8TB. After that, it is archived to Amazon S3 and Amazon Glacier. Glacier is similar to S3 but cheaper, at the cost of taking longer to recover the logs from storage.
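The arithmetic behind those figures is easy to check. The snippet below is only a back-of-the-envelope sketch using the numbers above (200GB per day, 14 days of hot data). The function name is made up for illustration, and it assumes decimal units (1TB = 1000GB).

```python
def hot_storage_tb(gb_per_day: float, retention_days: int) -> float:
    """Return the size of the hot window in terabytes (decimal units)."""
    return gb_per_day * retention_days / 1000


print(hot_storage_tb(200, 14))  # 2.8
```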
Streaming Via LogStash
LogStash provides "logging channels", which receive everything, filter and then output the data. In Beier's description, a LogStash channel consists of "input", "filter" and "output". It is very simple. There are many open-source filters, which you can just use or customize. One such filter they use converts Apache logs to JSON.
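To make the input, filter and output idea concrete, here is a rough sketch of the same shape in plain Python. This is not LogStash code, and not Hootsuite's filter. LogStash is set up through its own configuration files, and the regular expression, field names and function names below are my own assumptions for illustration. It parses a line in Apache's Common Log Format and emits it as JSON.

```python
import json
import re
from typing import Iterable, Iterator, Optional

# Apache Common Log Format: host ident user [time] "request" status bytes
APACHE_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def apache_to_dict(line: str) -> Optional[dict]:
    """Filter stage: parse one access-log line, or return None if it does not match."""
    match = APACHE_LINE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    # Apache writes "-" when no bytes were sent.
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


def run_channel(lines: Iterable[str]) -> Iterator[str]:
    """Input -> filter -> output: yield one JSON string per parseable line."""
    for line in lines:                   # input
        event = apache_to_dict(line)     # filter
        if event is not None:
            yield json.dumps(event)      # output


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
for out in run_channel([sample]):
    print(out)
```

Lines that do not match are silently dropped here. A real pipeline would more likely tag them and pass them through, so that nothing is lost.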
JSON Log Lines
Ultimately, all log lines throughout Hootsuite become JSON. For their applications and custom systems, it is easy. They can simply write out JSON logs at the application level. For system logs, and things like Apache logs, they rely on LogStash to rewrite them.
In Hootsuite’s setup, not everything that enters LogStash is JSON, but everything that leaves LogStash is.
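Writing JSON at the application level can be as small as a custom formatter. The following is a minimal sketch using Python's standard logging module. It is not Hootsuite's code, and the "fields" convention for attaching structured data is a hypothetical choice of mine. The logger is set to debug, in the spirit of "one log level".

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Hypothetical convention: structured data arrives via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that emits everything, formatted as JSON, to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # one log level: nothing is filtered out
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


log = build_logger("signup")
log.debug("user signed up", extra={"fields": {"user_id": 42}})
```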
Log streams that come out of LogStash are persisted to two places, depending on what type of logs they are. These two places focus separately on debugging and BI (business intelligence). All BI data is sent to Hadoop. All debugging data is sent to ElasticSearch. All user events are sent to both. BI data is no use for debugging, so it is excluded from the data sent to ElasticSearch. Similarly, debugging data, such as syslog, does not help with BI analytics, so it is not sent to Hadoop.
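The routing rule itself is simple enough to write down as a lookup table. This is a sketch only: the "type" field, the type names and the destination labels are assumptions of mine, not Hootsuite's actual configuration, and sending unknown types nowhere is an arbitrary choice.

```python
from typing import List

# Destination names are plain labels for this sketch, not real client objects.
ROUTES = {
    "bi": ["hadoop"],
    "debug": ["elasticsearch"],
    "user_event": ["hadoop", "elasticsearch"],
}


def destinations(event: dict) -> List[str]:
    """Return the stores an event should be written to, based on its "type" field."""
    return ROUTES.get(event.get("type"), [])


print(destinations({"type": "user_event", "action": "login"}))  # ['hadoop', 'elasticsearch']
```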
Hootsuite uses "6 beefy servers" for their 2 weeks of hot data, on which they run an ElasticSearch cluster. This is their key tool for searching and digging into their debug log data.
They define a schema on this indexed data which exposes key fields on the web interface. This enables engineers to see what they can search on. Even though they define this schema, ElasticSearch is still essentially schema-less. This means that a developer can add any hierarchical data fields to their log lines and it will still be searchable.
The way this schema-plus-schemaless setup works, is that an engineer will search on the fields they know exist, but in the results they will see additional searchable fields that will enable them to dig deeper into the logged data.
Hierarchical field names will look like "web.signup.email" or "errors.signup.email". In this example, "web" and "errors" would be categories you can search on. I asked Beier if this type of JSON indexing and searching was something that Hootsuite built on top of ElasticSearch, but he told me that this functionality comes out of the box with ElasticSearch.
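To show how a nested JSON log line maps to those dotted names, here is a small sketch. It only illustrates the naming, not how ElasticSearch indexes documents internally, and the sample event and function name are invented for the example.

```python
from typing import Any, Dict


def dotted_fields(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into {"a.b.c": value} pairs."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(dotted_fields(value, path))
        else:
            flat[path] = value
    return flat


event = {
    "web": {"signup": {"email": "a@example.com"}},
    "errors": {"signup": {"email": "invalid"}},
}
print(sorted(dotted_fields(event)))  # ['errors.signup.email', 'web.signup.email']
```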
ElasticSearch is expensive, Beier tells me. It will work 99% of the time, but will sometimes break on spikes. If a server drops from the cluster, they will see a large spike in I/O as it rebalances the cluster.
The searching functionality ElasticSearch gives you is great when you are debugging an issue, said Beier, but it is only useful when you know what you are looking for. It is of little help for BI queries. For instance, "tell me all the users that signed-up, but stopped using the product 15 minutes later". This is where Hadoop comes in.
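That sign-up question is a good example of why search alone falls short: it needs every event for every user, grouped and compared. Below is a small in-memory sketch of the logic, reading "stopped using the product" as "no activity recorded more than 15 minutes after sign-up". The event shape and function name are hypothetical, and on Hadoop this would be a distributed job rather than a loop over a list.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, datetime]  # (user_id, action, timestamp): hypothetical shape


def quick_dropouts(
    events: Iterable[Event], window: timedelta = timedelta(minutes=15)
) -> List[str]:
    """Users whose last recorded activity falls within `window` of their signup."""
    signup: Dict[str, datetime] = {}
    last_seen: Dict[str, datetime] = {}
    for user, action, ts in events:
        if action == "signup":
            signup[user] = ts
        # Track the most recent event of any kind for each user.
        if user not in last_seen or ts > last_seen[user]:
            last_seen[user] = ts
    return sorted(u for u, t in signup.items() if last_seen[u] - t <= window)
```

Users who never signed up within the data set are ignored, since there is no sign-up time to measure from.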
BI specific events and user events are all forwarded to Hadoop, which Hootsuite uses to gain intelligence on their user-base.
One thing that ActiveState does with the streaming logs coming from our sandbox Stackato cluster is send them to a service called Papertrail. Papertrail is able to do real-time searches and send us alert emails if it sees pre-defined error conditions or administrative actions. I was curious if Hootsuite was doing anything similar with their log data.
Beier said that this is something they are not doing. When it comes to monitoring, they are only interested in the aggregate numbers and less so in the individual instances of an event. For instance, did usage drop 20% after a deployment?
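An aggregate check like that boils down to comparing two numbers. Here is a minimal sketch, assuming you already have a usage count for the period before and after the deployment. The function name and the default threshold are illustrative and not taken from Hootsuite's monitoring.

```python
def usage_dropped(before: float, after: float, threshold: float = 0.20) -> bool:
    """True if `after` is lower than `before` by at least `threshold` (a fraction)."""
    if before <= 0:
        return False  # no baseline to compare against
    return (before - after) / before >= threshold


print(usage_dropped(before=1000, after=750))  # True
```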
Beier told me an interesting story about an event that happened just a few weeks after they rolled out their ElasticSearch cluster in production. They fell victim to a spam attack.
Luckily, Hootsuite were able to dig into the data of the compromised accounts, via their newly rolled-out ElasticSearch cluster, find patterns and stop the attackers in their tracks.
I am pretty sure that whoever came up with the idea to stream log data into ElasticSearch felt pretty good that week. This is a classic case of fire prevention beating fire fighting.
Prior to sitting down with Beier I thought we would be talking mostly about LogStash, which is the central hub of all their logging systems. It occurred to me that we actually spoke little about it. I pointed this out to Beier.
"There's not much to talk about. It's a great system! It works perfectly, but it is very simple", he told me. "Out-the-box, LogStash accepts tens of different inputs, hundreds of filters and provides lots of outputs. The filters are very flexible". That’s LogStash.
Streaming all manner of logs from across your distributed architecture is a powerful concept. With open-source tools like ElasticSearch and Hadoop, the number of ways you can quickly utilize your log data is becoming very interesting. Hootsuite shows us that this is not only powerful for developers digging into technical issues in production, but also for answering business questions that come all the way from the top.
ActiveState continues to invest heavily in this area and was the first PaaS solution to provide log streaming across the entire platform, from syslog to service and application logs.
Are you streaming your logs? Using LogStash? ElasticSearch? Hadoop? Something else? Please leave a comment below.
Image courtesy of hdworld@flickr