Unlocking Innovation: A Comprehensive Guide to Trustworthy Online Controlled Experiments (Ron Kohavi, Diane Tang, Ya Xu)

In “Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing,” Ron Kohavi, Diane Tang, and Ya Xu, leaders in experimentation from Google, LinkedIn, and Microsoft, offer an indispensable roadmap for accelerating innovation through the scientific rigor of A/B tests. This book serves as the definitive text for anyone seeking to leverage data-driven decisions in the digital age, moving beyond intuition to verifiable, repeatable experiments. Throughout this summary, we will break down every important idea, example, and insight from the book, ensuring comprehensive coverage in clear, accessible language. From foundational concepts to advanced pitfalls and solutions, we promise to leave nothing significant out, helping you understand and apply the wisdom of trustworthy experimentation.

The authors draw upon their extensive practical experiences at companies that collectively run tens of thousands of controlled experiments annually. They highlight that “getting numbers is easy; getting numbers you can trust is hard,” underscoring the book’s core message. This guide is designed for both students and industry professionals taking their first steps in experimentation, as well as experienced practitioners aiming to refine their organizational decision-making processes. It emphasizes using the scientific method, defining robust metrics, testing for trustworthiness, iterating quickly based on results, implementing protective guardrails, and building scalable platforms that drive the marginal cost of experiments toward zero. By delving into advanced topics like carryover effects, Twyman’s law, Simpson’s paradox, and network interactions, the book offers a complete picture of how to achieve reliable and impactful experimentation.

Chapter 1: Introduction and Motivation

This introductory chapter sets the stage for understanding why online controlled experiments are crucial for innovation and data-driven decision-making. It introduces fundamental terminology and emphasizes the critical distinction between correlation and causation.

The Power of Experimentation: A Bing Success Story

The chapter opens with a compelling anecdote from Microsoft’s Bing search engine in 2012, illustrating the surprising impact of a seemingly minor change. An engineer suggested lengthening the title line of ads by combining it with the first line of text below the title. This idea was initially low-priority and languished for months, but when implemented and A/B tested, it led to a 12% increase in Bing’s annual revenue, translating to over $100 million annually in the US alone, without negatively impacting user experience metrics. This unexpected success highlights several key themes: the difficulty of assessing an idea’s true value, how small changes can yield massive returns, the rarity of such large impacts, the necessity of a low overhead for running experiments, and the importance of a clear Overall Evaluation Criterion (OEC) that balances revenue with user experience.

Understanding Online Controlled Experiments Terminology

The authors establish a common language for discussing controlled experiments, which are also known as A/B tests, A/B/n tests, field experiments, randomized controlled experiments, split tests, or bucket tests. These experiments are widely used by leading tech companies like Airbnb, Amazon, and Google, often involving millions of users.

Here are the key terms defined:

  • Overall Evaluation Criterion (OEC): A quantitative measure of the experiment’s objective, designed to be measurable in the short term (experiment duration) and causally linked to long-term strategic objectives. It can be a single metric or a weighted combination of multiple metrics (e.g., sessions-per-user, relevance, ad revenue).
  • Parameter: A controllable experimental variable (also called a factor or variable) that influences the OEC or other metrics. Values assigned to parameters are called levels.
  • Variant: A specific user experience being tested, typically by assigning values to parameters. In A/B tests, A is the Control (existing version) and B is the Treatment (new version).
  • Randomization Unit: The unit (e.g., users, pages, sessions) to which variants are pseudo-randomly assigned in a persistent and independent manner. The authors strongly recommend using users as the randomization unit for online audiences. Proper randomization is crucial for ensuring statistically similar populations across variants, enabling causal inference.
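The persistent, independent assignment described above is typically implemented by hashing the user ID together with an experiment ID. A minimal sketch, with illustrative function and variant names (not from the book):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant.

    Hashing (experiment_id, user_id) makes the assignment persistent
    (the same user always sees the same variant) and independent across
    experiments that use different experiment IDs.
    """
    key = f"{experiment_id}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]
```

Because the hash is deterministic, no per-user assignment state needs to be stored, and a uniform hash yields a statistically balanced split across variants.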

The Imperative of Experimentation: Beyond Correlation to Causality

The chapter strongly argues for experimentation as the gold standard for establishing causality. It debunks the common fallacy that correlation implies causation, citing examples like Microsoft Office 365 users experiencing more errors having lower churn rates – a correlation driven by higher overall usage, not by errors themselves.
The authors present a hierarchy of evidence, inspired by medical literature, which positions randomized controlled experiments at the top for reliability, followed by systematic reviews (meta-analysis), other controlled experiments, observational studies, and finally, case studies and anecdotes.
Online controlled experiments offer:

  • The best scientific way to establish causality with high probability.
  • The ability to detect small, subtle changes in metrics (sensitivity).
  • The capacity to uncover unexpected impacts on other metrics, such as performance degradation or cannibalization.

A central theme is the emphasis on the trustworthiness of results, achieved through rigorous methodology and the detection of pitfalls.

Essential Ingredients for Effective Controlled Experiments

For controlled experiments to be useful and impactful, certain technical prerequisites must be met:

  1. Experimental units that can be assigned to different variants with minimal interference: Users should ideally not influence each other’s behavior across variants.
  2. Sufficient experimental units: Thousands of units are generally recommended for statistical power, with larger numbers enabling the detection of smaller effects.
  3. Agreed-upon and measurable key metrics (ideally an OEC): Goals must be quantifiable, even if surrogate metrics are used for hard-to-measure concepts. Reliable and cost-effective data collection (instrumentation) is vital.
  4. Ease of making changes: Software’s inherent flexibility makes it well-suited for rapid iteration and experimentation, especially server-side modifications.

These conditions are met by most non-trivial online services, allowing for agile development driven by experimentation.

Foundational Tenets for Experimentation Success

Beyond technical ingredients, three organizational tenets are critical for implementing online controlled experiments:

  1. Commitment to data-driven decisions and a formalized OEC: Organizations must genuinely want to use data, not just say they do. This involves defining an OEC that is short-term measurable yet drives long-term strategic objectives, fostering a “data-informed” culture rather than solely deferring to the Highest Paid Person’s Opinion (HiPPO).
  2. Willingness to invest in infrastructure and trustworthiness: Building robust platforms and implementing tests (like A/A tests) to ensure reliable results is paramount. This investment acknowledges that getting numbers you can trust is hard, but makes it achievable.
  3. Recognition of poor intuition in assessing ideas: Most ideas, even seemingly good ones, often fail to improve key metrics. Data from Microsoft, Netflix, Etsy, and others show that only 10-33% of new features positively impact desired metrics. Embracing this humbling reality through experimentation is crucial for fostering a “fail fast” culture and driving genuine innovation.

The Gradual Path to Improvement: Inch by Inch Wins

Significant improvements to key metrics are typically achieved through an accumulation of many small changes, often ranging from 0.1% to 2% per experiment. Examples include:

  • Google Ads: Over a year of incremental experiments on ad ranking led to improved user experience and advertiser ROI.
  • Bing Relevance: The team aims for a 2% annual improvement in a single OEC metric, achieved through thousands of experiments. Success is often certified by replication experiments.
  • Bing Ads: Consistent 15-25% annual revenue growth was driven by monthly “packages” of small, incremental improvements from many experiments, highlighting the power of sustained, iterative testing.

Surprising Outcomes: Why Experiments Are Essential

The chapter provides several compelling examples of online controlled experiments with unexpected and significant results, reinforcing how difficult it is to predict the value of ideas:

  • UI Example: 41 Shades of Blue: Google’s testing of 41 blue gradations for search results pages led to substantial positive impacts on user engagement. Similarly, Microsoft Bing’s color tweaks improved task completion, time-to-success, and generated over $10 million annually in the US.
  • Making an Offer at the Right Time (Amazon): Moving a credit-card offer from the homepage to the shopping cart page, with simple math highlighting savings, increased Amazon’s annual profit by tens of millions of dollars.
  • Personalized Recommendations (Amazon): Greg Linden’s prototype for personalized recommendations based on shopping cart items, initially resisted by management, proved immensely profitable through a controlled experiment, leading to its widespread adoption.
  • Speed Matters a LOT (Bing, Amazon, Google): Experiments consistently showed that even tiny performance improvements (e.g., 10 milliseconds) significantly increased user engagement and revenue. For Bing, every four milliseconds of improvement could fund an engineer for a year.
  • Malware Reduction (Microsoft Bing): An experiment to restrict DOM modifications due to malware, which had been polluting ad spaces and degrading user experience, led to improvements across all key metrics for Bing, including sessions per user, search success, click speed, and several million dollars in annual revenue.
  • Backend Changes (Amazon Search): Implementing an algorithm for “People who searched for X bought item Y” for underspecified queries like “24” (for the TV show) led to a 3% increase in Amazon’s overall revenue, despite initial concerns about showing irrelevant items.

Strategy, Tactics, and the Role of Experiments

Controlled experiments are synergistic with business strategy, product design, and operational effectiveness.

  • Synergy with Strategy: Experiments can help hill-climb to local optima based on existing strategy, identify high ROI areas, optimize non-obvious elements (like color or spacing), and enable continuous iteration for site redesigns. They are critical for optimizing backend algorithms. Strategy drives the choice of OEC, and experiments provide feedback on its effectiveness. If many tactical experiments fail, it may signal a need to reconsider the overall strategy (pivot).
  • Scenario 1: Strategy and Product with Enough Users: Experiments help optimize “near” the current position, identify high ROI areas, make subtle optimizations, enable continuous site redesigns (avoiding “big bang” redesign failures), and optimize backend algorithms.
  • Scenario 2: Strategy Suggests a Pivot: When results indicate the current strategy might be flawed, experiments can help evaluate radical ideas or “jumps” to potentially bigger “hills.” This might involve longer experiment durations, country-level experiments, or a portfolio approach (most investments near current location, a few radical ideas). The ability to run controlled experiments significantly reduces uncertainty by allowing the testing of Minimum Viable Products (MVPs) and iterative learning. The concept of Expected Value of Information (EVI) underscores the value of gathering data to inform decision-making.

This chapter concludes by strongly advocating for widespread adoption of online controlled experiments, framing them as the scientific foundation for success in the digital realm.

Chapter 2: Running and Analyzing Experiments: An End-to-End Example

This chapter offers a practical, step-by-step guide to designing, running, and analyzing an online controlled experiment, using a concrete example of a fictional e-commerce site. It highlights the essential principles applicable across various software platforms.

Setting up the Experiment Example

The example focuses on a fictional online commerce site selling widgets. The marketing department wants to increase sales by sending promotional emails with coupon codes. However, the company has concerns that simply adding a coupon code field to the checkout page could degrade revenue, even if no coupons are actually available (a “fake door” or “painted door” approach). The goal is to assess this specific impact.

The experiment aims to test three variants:

  • Control: The original checkout page (no coupon field).
  • Treatment 1: Checkout page with a coupon/gift code field below credit card information.
  • Treatment 2: Checkout page with a coupon/gift code field as a popup.

The refined hypothesis is: “Adding a coupon code field to the checkout page will degrade revenue-per-user for users who start the purchase process.”

Key considerations for metric definition:

  • Revenue-per-user is chosen as the Overall Evaluation Criterion (OEC).
  • The denominator for this metric should be only users who start the purchase process, as these are the only users potentially impacted by the change in the checkout flow. Including all site visitors would add noise, while only including purchasers would miss the impact on conversion rates.
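The triggered-denominator idea can be sketched in a few lines; the data shape here is hypothetical, not from the book:

```python
def revenue_per_user(events: dict) -> float:
    """Compute revenue-per-user over the triggered population only.

    `events` maps user_id -> dict with 'started_checkout' (bool) and
    'revenue' (float). Only users who start the purchase process enter
    the denominator; other visitors would just add noise, while counting
    only purchasers would miss the impact on conversion.
    """
    triggered = [e for e in events.values() if e["started_checkout"]]
    if not triggered:
        return 0.0
    return sum(e["revenue"] for e in triggered) / len(triggered)
```

A user who never reaches checkout contributes neither revenue nor a denominator slot, so the metric isolates the population the change can actually affect.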

Hypothesis Testing: Establishing Statistical Significance

Before designing the experiment, it’s crucial to understand basic statistical hypothesis testing concepts:

  • Baseline Mean and Standard Error: Characterize the chosen metric (revenue-per-user) by understanding its baseline mean value and the standard error of the mean. This helps in sizing the experiment and calculating significance.
  • Statistical Significance: The core of hypothesis testing. We start with a Null hypothesis (e.g., no difference in revenue-per-user between Control and Treatment). If the observed difference is unlikely under the Null hypothesis, we reject it and declare the result statistically significant.
  • P-value: The probability of observing a difference as extreme as, or more extreme than, the one measured, assuming the Null hypothesis is true. A p-value less than 0.05 is the conventional threshold for statistical significance, meaning there’s a less than 5% chance the observed difference occurred randomly if no true difference exists.
  • Confidence Interval: A range that, if the experiment were repeated many times, would contain the true difference 95% of the time (for a 95% confidence interval). If the 95% confidence interval for the difference between Treatment and Control does not overlap with zero, the result is statistically significant at the 0.05 level.
  • Statistical Power: The probability of correctly detecting a meaningful difference when one truly exists. Experiments are usually designed for 80-90% power. Larger sample sizes generally lead to more power.
  • Practical Significance: Beyond statistical significance, it’s vital to define the minimum difference that matters from a business perspective. For the example, a 1% or larger increase in revenue-per-user is considered practically significant.
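For large samples, the p-value and confidence interval for the difference in means can be computed with a normal approximation. This is a sketch of the standard calculation, not the book’s exact procedure:

```python
import math

def two_sample_ztest(mean_t: float, se_t: float,
                     mean_c: float, se_c: float):
    """Large-sample z-test for the difference Treatment - Control.

    Returns (delta, two-sided p-value, ci_low, ci_high) at the 95% level.
    se_t and se_c are the standard errors of each variant's mean.
    """
    delta = mean_t - mean_c
    se = math.sqrt(se_t ** 2 + se_c ** 2)     # SE of the difference
    z = delta / se
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail
    half = 1.96 * se                          # 95% half-width
    return delta, p, delta - half, delta + half
```

If the returned confidence interval excludes zero, the result is statistically significant at the 0.05 level; practical significance is then judged against the business threshold separately.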

Designing the Experiment

The design phase involves making concrete decisions about how the experiment will be set up:

  1. Randomization Unit: User is the chosen randomization unit, meaning each user consistently sees the same variant across multiple visits. This is the most common and recommended choice.
  2. Target Population: All users are targeted, but analysis will focus on those who visit the checkout page, as defined in the refined hypothesis.
  3. Experiment Size: A power analysis is conducted to determine the number of users needed to detect at least a 1% change in revenue-per-user with 80% power. Factors influencing size include:
    • Metric Choice: A binary purchase indicator (yes/no) would require a smaller sample size than revenue-per-user due to lower variance.
    • Practical Significance Level: Accepting larger changes as significant reduces the required sample size.
    • P-value Threshold: A lower p-value (e.g., 0.01 for higher certainty) requires a larger sample size.
    • Safety/Ramp-up: Starting with a smaller proportion of users is prudent for uncertain changes.
    • Traffic Sharing: If multiple experiments run concurrently, traffic division affects individual experiment sizes.
  4. Experiment Duration: Several factors influence how long to run the experiment:
    • User Accumulation: Longer durations generally mean more users, increasing power, though accumulation rate is sub-linear due to repeat users.
    • Day-of-Week Effect: Running for at least one full week captures typical weekly user behavior cycles.
    • Seasonality: Considering holidays or other periodic shifts in user behavior (e.g., Christmas for gift cards).
    • Primacy and Novelty Effects: Longer durations help determine if initial effects (positive or negative) stabilize over time.
    • Overpowering: It’s often beneficial to slightly overpower an experiment to allow for segmented analysis (e.g., by geographic region) or to detect impacts on multiple key metrics.

For the coupon code experiment, the final design is:

  • Randomization Unit: User.
  • Targeting: All users, analyzing those who visit the checkout page.
  • Size: Determined by power analysis to detect a 1% change in revenue-per-user with 80% power; given the site’s traffic, this implies running for a minimum of four days.
  • Duration: At least one full week for day-of-week effects, potentially longer for novelty/primacy. The traffic split is 34% Control, 33% Treatment 1, 33% Treatment 2.

Running the Experiment and Getting Data

Executing the experiment involves two key components:

  • Instrumentation: Monitoring and logging user interactions with the site and associating these interactions with their assigned experiment variant (see Chapter 13).
  • Infrastructure: The system that configures and manages experiment variants, assigns users, and processes the logged data (see Chapter 4).

Once the experiment is deployed and data has been collected, the raw data is processed, summary statistics are computed, and the results are visualized.

Interpreting the Results

Before diving into the OEC, it’s crucial to run sanity checks using guardrail metrics (also called invariants):

  • Trust-related guardrail metrics: Metrics that should not change between variants due to the experiment design itself (e.g., ratio of users in each variant, cache-hit rates). A significant deviation often indicates a bug.
  • Organizational guardrail metrics: Metrics important to the business that are expected to remain constant or not degrade (e.g., latency, crash rates).

If these checks fail (e.g., a Sample Ratio Mismatch), the experiment results are likely invalid and require debugging.
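A Sample Ratio Mismatch check can be implemented as a chi-squared goodness-of-fit test on the observed user counts; a sketch (threshold and example counts are illustrative):

```python
import math

def srm_pvalue(count_a: int, count_b: int, expected_ratio: float = 1.0) -> float:
    """Chi-squared test (1 degree of freedom) that the observed split
    matches the designed ratio (expected count_a / count_b).

    A tiny p-value (e.g., < 0.001) signals a Sample Ratio Mismatch,
    and the experiment scorecard should not be trusted.
    """
    total = count_a + count_b
    exp_a = total * expected_ratio / (1 + expected_ratio)
    exp_b = total - exp_a
    chi2 = (count_a - exp_a) ** 2 / exp_a + (count_b - exp_b) ** 2 / exp_b
    # For 1 dof, the chi-squared tail equals the two-sided normal tail:
    return math.erfc(math.sqrt(chi2 / 2))
```

A perfectly balanced split yields a p-value of 1.0, while a split like 821,588 vs. 815,482 under a designed 50/50 allocation produces a p-value on the order of 1e-6, far below any reasonable SRM threshold.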

In the example, after passing sanity checks, the results for revenue-per-user are:

  • Treatment 1 vs. Control: Difference of -$0.09 (-2.8%), with a p-value of 0.0003 and a confidence interval of [-4.3%, -1.3%].
  • Treatment 2 vs. Control: Difference of -$0.25 (-7.8%), with a p-value of 1.5e-23 and a confidence interval of [-9.3%, -6.3%].

Both treatments show statistically significant decreases in revenue-per-user (p-value < 0.05). This confirms the hypothesis: adding a coupon code field degraded revenue. The decline was attributed to fewer users completing the purchase process. This “painted door” A/B test effectively saved significant resources by preventing the full implementation of a potentially harmful feature.

From Results to Decisions

The ultimate goal of A/B tests is to drive decision-making. This involves considering not just statistical results but also broader context:

  • Tradeoffs between metrics: How do engagement, revenue, and other metrics balance out? (e.g., increased engagement vs. decreased revenue).
  • Cost of launching: This includes full feature build-out, ongoing maintenance, and potential future complexity. Higher costs necessitate a higher practical significance threshold.
  • Downside of wrong decisions: Not all mistakes are equal; the impact of launching a bad feature or missing a good one varies. Short-lived changes may have a lower bar for statistical and practical significance.

The authors provide a framework for decision-making based on statistical and practical significance, illustrated with six scenarios:

  1. Not statistically or practically significant: Idea should be iterated on or abandoned.
  2. Statistically and practically significant: Launch the change.
  3. Statistically significant, but not practically significant: Confident about the magnitude, but it’s too small to justify costs. Consider not launching.
  4. Not statistically significant, and confidence interval outside practical significance: Insufficient power to draw a conclusion. Rerun with more users.
  5. Practically significant, but not statistically significant: Best guess is impactful, but high chance of no impact. Rerun with more power.
  6. Statistically significant, likely practically significant: Reasonable to launch, but rerunning with more power would be ideal for greater certainty.
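The six scenarios can be approximated as a decision rule over the confidence interval of the delta and the practical-significance threshold. This mapping is a simplified sketch of the framework, not the book’s exact logic:

```python
def decide(ci_low: float, ci_high: float, threshold: float) -> str:
    """Map a 95% CI on the treatment delta (in the metric's units or %)
    plus a practical-significance threshold to a decision archetype.
    Assumes positive deltas are improvements.
    """
    if ci_high < 0:
        return "do not launch"                    # significant regression
    if ci_low > 0:                                # statistically significant win
        if ci_low >= threshold:
            return "launch"                       # clearly practically significant
        if ci_high < threshold:
            return "consider not launching"       # real effect, too small to matter
        return "launch, but rerun with more power"
    # CI contains zero: not statistically significant
    if ci_low > -threshold and ci_high < threshold:
        return "iterate or abandon"               # powered, no meaningful effect
    return "rerun with more power"                # underpowered for the threshold
```

For instance, a CI of [0.2%, 0.5%] against a 1% threshold maps to “consider not launching” (scenario 3), while a wide CI of [-1.5%, 2.5%] maps to “rerun with more power” (scenarios 4 and 5).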

The chapter emphasizes that decision-making should be explicit about the factors considered, including practical and statistical significance thresholds, which ideally are defined before the experiment begins.

Chapter 3: Twyman’s Law and Experimentation Trustworthiness

This chapter delves into the crucial concept of Twyman’s Law, which states that “Any figure that looks interesting or different is usually wrong.” It underscores the importance of skepticism towards extreme results and outlines various threats to the trustworthiness of online controlled experiments.

Misinterpretation of Statistical Results

The authors highlight common pitfalls in interpreting the statistics from A/B tests:

  • Lack of Statistical Power: A non-statistically significant result (p-value > 0.05) does not mean there’s no difference. It often means the experiment was underpowered to detect the effect size that truly exists. It’s crucial to define practical significance and ensure sufficient power to detect changes of that magnitude or smaller.
  • Misinterpreting P-values: The p-value is not the probability that the Null hypothesis is true. It is the probability of observing data as extreme as, or more extreme than, what was observed, assuming the Null hypothesis is true. Misinterpretations can lead to false conclusions about the likelihood of a true effect or false positive rates.
  • Peeking at P-values: Continuously monitoring p-values during an experiment and stopping early once significance is reached inflates the rate of falsely declared statistically significant results by 5-10x, producing many false positives. Solutions include using sequential tests with always-valid p-values or predetermining the experiment duration.
  • Multiple Hypothesis Tests: The problem of making multiple comparisons (across metrics, time, segments, or iterations of an experiment) increases the likelihood of finding a statistically significant result purely by chance. This is why a False Discovery Rate (FDR) framework is important, and why unexpected significant results for “third-order” (unlikely to be impacted) metrics should be viewed with skepticism.
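One standard way to control the False Discovery Rate across many metrics is the Benjamini-Hochberg procedure; the book invokes FDR control, but this particular implementation is an illustrative sketch:

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns the (sorted) indices of hypotheses that can be rejected
    while controlling the expected false discovery rate at `fdr`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    k = 0
    for rank, i in enumerate(order, start=1):
        # Find the largest rank whose p-value clears its BH threshold
        if p_values[i] <= rank * fdr / m:
            k = rank
    return sorted(order[:k])
```

With metric p-values [0.01, 0.04, 0.03, 0.5], only the first survives the correction, even though two of them are below the naive 0.05 cutoff.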

Understanding Confidence Intervals

Confidence intervals quantify the uncertainty around the observed Treatment effect. A 95% confidence interval (CI) for the Treatment effect that does not cross zero implies statistical significance at the 0.05 p-value level. However, a common mistake is assuming that if the individual CIs for Control and Treatment overlap, the difference is not statistically significant; CIs can overlap by as much as 29% while still indicating a significant difference in the delta. It’s also critical to remember that a 95% CI means that if the experiment were repeated many times, the interval would contain the true Treatment effect 95% of the time, not that the specific observed interval has a 95% chance of containing the true effect.
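The overlap point can be illustrated with hypothetical means and standard errors: the two individual CIs overlap, yet the CI on the difference excludes zero, because the standard error of the delta is sqrt(se1² + se2²), not se1 + se2:

```python
import math

# Hypothetical means and standard errors for Control and Treatment
mean_c, se_c = 0.0, 1.0
mean_t, se_t = 3.0, 1.0

ci_c = (mean_c - 1.96 * se_c, mean_c + 1.96 * se_c)   # about [-1.96, 1.96]
ci_t = (mean_t - 1.96 * se_t, mean_t + 1.96 * se_t)   # about [ 1.04, 4.96]
overlap = ci_c[1] > ci_t[0]                            # True: individual CIs overlap

se_delta = math.sqrt(se_c ** 2 + se_t ** 2)            # ~1.414, smaller than 2.0
ci_delta = (mean_t - mean_c - 1.96 * se_delta,
            mean_t - mean_c + 1.96 * se_delta)         # about [0.23, 5.77]
significant = ci_delta[0] > 0                          # True: delta CI excludes zero
```

Because the delta’s standard error grows with the root of the sum of squares, overlapping individual intervals are compatible with a statistically significant difference.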

Threats to Internal Validity

Internal validity refers to whether the experimental results are correct for the specific study, without generalizing to other populations or times. Key threats include:

  • Violations of SUTVA (Stable Unit Treatment Value Assumption): This assumption states that units (e.g., users) do not interfere with each other. Violations occur in:
    • Social networks: Features can spill over to a user’s network regardless of variant assignment.
    • Communication tools (e.g., Skype): A call involves two users, potentially in different variants.
    • Co-authoring tools: Changes affect collaborators.
    • Two-sided marketplaces (e.g., Airbnb, ad auctions): One side’s behavior (e.g., lower prices for Treatment) can affect the other side (e.g., less inventory for Control).
    • Shared resources (e.g., CPU, storage): A bug in Treatment (e.g., memory leak) can degrade performance for all variants sharing the same resources.
  • Survivorship Bias: Analyzing only users who “survive” (e.g., stay active for two months) can lead to biased conclusions, as seen in the WWII bomber armor example (where armor was incorrectly added to areas with bullet holes, missing the fact that planes hit in other areas never returned).
  • Intention-to-Treat: When there’s non-random attrition or participation, analysis should stick to the initial assignment (intention-to-treat) to avoid selection bias, which commonly overstates the Treatment effect.
  • Sample Ratio Mismatch (SRM): A critical trust-related guardrail. If the actual ratio of users in variants deviates significantly from the designed ratio (e.g., a 50/50 split resulting in 0.993:1, with a p-value of 1.8E-6), it indicates a serious problem. The SRM check should trigger strong warnings and hide scorecards if the p-value is below a threshold (e.g., 0.001). Common causes include:
    • Browser redirects: Redirecting Treatment to a new page can cause performance differences, affect how bots handle pages, and lead to contamination from shared links/bookmarks.
    • Lossy instrumentation: How clicks or other interactions are recorded can differ between variants, leading to skewed user counts.
    • Residual or carryover effects: Bugs fixed mid-experiment can cause users previously impacted to show different behaviors.
    • Bad hash function for randomization: Flawed hashing can lead to non-random user distribution.
    • Triggering impacted by Treatment: If the criteria for including users in an experiment (triggering) are affected by the Treatment itself, it can cause an SRM.
    • Time-of-day effects: If variants are rolled out sequentially, user behavior can differ based on the time of day they enter the experiment.
    • Data pipeline impacted by Treatment: Bot filtering, for example, can classify highly engaged Treatment users as bots, artificially showing negative results. In an MSN example with a 0.992 SRM ratio, correcting the mismatch inverted an apparently negative result into a 3.3% improvement in user engagement.

Threats to External Validity

External validity refers to the generalizability of experiment results to other populations (e.g., different countries, websites) or over time.

  • Generalizations across populations are often questionable; experiments should be re-run in new contexts.
  • Generalizations across time are harder. Key threats include:
    • Primacy Effects: Users need time to adopt a new feature, or machine learning models may need time to learn. Initial impact can be larger or smaller than long-term.
    • Novelty Effects (Newness Effect): A new feature might initially attract users but then see declining usage if it’s not truly useful (e.g., the “operators are busy, please call again” trick, or the fake hair on an Instagram ad). The MSN example of the Outlook Mail app link showing a 28% CTR increase, then declining, indicated user confusion, not engagement.
    • Detecting Primacy and Novelty Effects: Plotting usage over time and looking for increasing or decreasing trends is a good check. These trends are “red flags” that experiments need to run longer to stabilize, or that the idea itself is flawed.

Segment Differences and Simpson’s Paradox

Analyzing metrics by segments (e.g., market, device, time of day, user type) can provide deep insights, but also introduce pitfalls.

  • Segmented View of a Metric: Comparing metrics across segments (e.g., mobile OS) can reveal issues in data collection or bugs, as seen with Bing’s mobile ad CTRs that varied dramatically by OS due to different click tracking methodologies and a bug in Windows Phone. This emphasizes the need to invoke Twyman’s Law for anomalous data.
  • Segmented View of the Treatment Effect (Heterogeneous Treatment Effect): When the Treatment effect differs significantly across segments (e.g., a UI change causing a negative impact only for Internet Explorer 7 users due to JavaScript incompatibility), it points to specific issues. Machine learning techniques like Decision Trees can help identify interesting segments.
  • Analysis by Segments Impacted by Treatment Can Mislead: If user migration between segments occurs due to the Treatment, seeing positive impacts in all segments does not guarantee an overall positive impact. For example, if a feature’s users and non-users both show increased sessions-per-user in Treatment, the overall average might still be flat or decrease if users with below-average sessions move into the “users of feature” segment. The non-segmented (aggregate) metric should be the primary focus.
  • Simpson’s Paradox: This unintuitive phenomenon occurs when combining results from different periods (e.g., ramp-up phases with varying traffic allocations) or different subpopulations (e.g., countries, browser types). A Treatment can appear better than Control in every subgroup and in every period, but worse overall when data is aggregated. This happens due to weighted averages and disproportionate representation of certain subgroups. It’s mathematically possible but seems absurd, as highlighted by Pearl’s “Sure-Thing Principle” – if an action increases an event’s probability in each subpopulation, it must increase it in the whole population, implying that Simpson’s paradox in causal studies often points to an unacknowledged confounding factor.
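Simpson’s paradox can be reproduced with a small hypothetical two-day ramp, mirroring the varying-traffic-allocation scenario: Treatment gets 1% of traffic on day 1 and 50% on day 2 (all counts below are invented for illustration):

```python
# (conversions, users) per variant per day; purely illustrative numbers
control   = {"day1": (20_000, 990_000), "day2": (5_000, 500_000)}
treatment = {"day1": (230, 10_000),     "day2": (6_000, 500_000)}

def rate(conv: int, users: int) -> float:
    return conv / users

# Treatment wins on each day...
assert rate(*treatment["day1"]) > rate(*control["day1"])   # 2.30% > 2.02%
assert rate(*treatment["day2"]) > rate(*control["day2"])   # 1.20% > 1.00%

# ...but loses overall, because its users are concentrated on the
# low-converting day 2 -- Simpson's paradox via unequal weighting.
c_total = rate(sum(c for c, _ in control.values()),
               sum(u for _, u in control.values()))        # ~1.68%
t_total = rate(sum(c for c, _ in treatment.values()),
               sum(u for _, u in treatment.values()))      # ~1.22%
assert t_total < c_total
```

The aggregate comparison is a weighted average, and Treatment’s weights sit almost entirely in the weaker period; this is why the chapter warns against naively pooling ramp-up phases.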

Fostering Healthy Skepticism

The chapter concludes by stressing that trustworthy experimentation requires investing in skepticism. Data scientists should embrace anomalies, question results, and invoke Twyman’s Law, especially for “too good to be true” findings. Many initial “winning” results fail to translate into actual user acquisition or business improvement, emphasizing the need for rigorous validation. The goal is to learn from failures and ensure that decisions are based on reliable data, not false confidence.

Chapter 4: Experimentation Platform and Culture

This chapter delves into the practicalities of building and scaling an experimentation platform and fostering a culture that embraces data-driven decision-making. It outlines a maturity model for experimentation and explores the technical components of a robust platform.

Experimentation Maturity Models

The authors define four phases of experimentation maturity that organizations typically progress through:

  1. Crawl (Monthly Experiments): Focus on building foundational prerequisites: instrumentation and basic data science capabilities to compute summary statistics. The goal is to design, run, and analyze a few experiments successfully to generate momentum. Roughly 10 experiments per year.
  2. Walk (Weekly Experiments): Shift from prerequisites to defining standard metrics and increasing the volume of experiments. Emphasis on improving trust through instrumentation validation, A/A tests, and Sample Ratio Mismatch (SRM) checks. Roughly 50 experiments per year.
  3. Run (Daily Experiments): Focus on scaling experiments. Metrics become comprehensive, ideally leading to a codified Overall Evaluation Criterion (OEC). Experimentation is used to evaluate most new features and changes. Roughly 250 experiments per year.
  4. Fly (Thousands of Experiments Annually): A/B experiments are the norm for every change. Feature teams are adept at analyzing most experiments independently. Focus shifts to automation, establishing institutional memory, and improving the culture of experimentation through shared learnings and best practices. Over 1,000 experiments per year.

The Role of Leadership in Cultivating Experimentation

Leadership buy-in is paramount for embedding experimentation deeply within an organization’s culture:

  • Shifting Mindsets: Moving from “hubris” (intuition-driven) to “measurement and control,” and eventually to “fundamental understanding” where causes are truly understood. This involves overcoming resistance to new knowledge that contradicts existing beliefs.
  • Goal Alignment: Leaders must establish shared goals, agree on high-level metrics and guardrails, and ideally codify tradeoffs into an OEC.
  • Metric-Driven Goals: Shift from “shipping features X and Y” to “improving metric Z.” Adopting experiments as a guardrail (features ship only if they improve the metrics) is a difficult cultural change, especially for established teams.
  • Empowerment and Humility: Empower teams to innovate, expect ideas to be evaluated, and show humility when ideas fail to move metrics. Foster a “fail fast” culture.
  • Data Quality and Interpretation: Demand proper instrumentation and high data quality. Review experiment results, enforce interpretation standards (minimizing p-hacking), and ensure transparency in decision-making.
  • Strategic Informing: Recognize that experiments, while often for optimization, can also inform overall strategy (e.g., Bing abandoning social integration after two years of no value).
  • Portfolio Management: Support a mix of high-risk/high-reward projects and incremental gain projects, understanding that many will fail but learning from them is crucial.
  • Long-term Learning: Advocate for experiments run solely to collect data or establish ROI (e.g., the long-term impact of performance improvements).
  • Agility: Promote short release cycles and sensitive surrogate metrics to enable quick feedback loops.

Essential Processes for Trustworthy Experimentation

As an organization matures, formalizing educational processes and cultural norms is critical:

  • Just-in-Time Education:
    • Checklists for Experiment Design (e.g., Google’s search experiment checklist): Guides experimenters through key questions (hypothesis, detectable change, power analysis) to ensure proper setup.
    • Regular Experiment Review Meetings: Experts examine results for trustworthiness (often finding instrumentation issues), discuss launch/no-launch recommendations, broaden understanding of various metric types (goal, guardrail, quality, debug), and establish metric tradeoffs for OEC codification. These forums also foster cross-team learning and celebrate failures as learning opportunities.
  • Sharing Learnings Broadly: Using newsletters, internal feeds, or a “social network” attached to the platform to share surprising results, meta-analyses, and best practices. This builds institutional memory.
  • Cultivating Intellectual Integrity:
    • Transparency: Compute many metrics, ensure important ones are highly visible on dashboards to prevent cherry-picking.
    • Communication: Send newsletters about surprising results, meta-analyses, and how teams use experiments.
    • Guardrails against Negative Impact: Make it difficult to launch features that negatively impact important metrics (warnings, notifications, or even blocks, though the latter can be counterproductive if it stifles open discussion).
    • Embrace Learning from Failure: Acknowledge that most ideas fail and focus on extracting insights for future iterations.

Build vs. Buy: The Experimentation Platform Decision

The authors, having built in-house platforms at major tech companies, acknowledge the “build vs. buy” dilemma for experimentation platforms. Growth figures (Bing, Google, LinkedIn, Microsoft Office all showing order-of-magnitude growth over years) indicate that while early growth might be limited by platform capabilities, later growth is limited by the ability to convert ideas into code.

Key questions to consider:

  • Can an external platform provide the functionality you need?
    • Versatility: Does it support frontend/backend, server/client, mobile/web experiments?
    • Performance: Does it introduce latency (e.g., JavaScript snippets)?
    • Metrics: Does it support complex metrics (e.g., sessionization, percentiles) and integrate with broader business reporting?
    • Data Privacy & Access: Are there restrictions on data sharing? Is logged data easily accessible? Can it integrate external data sources?
    • Real-time monitoring: Does it offer near real-time (NRT) results for quick detection of issues?
    • Institutional Memory: Does it have features to capture and share historical learnings?
    • Final Version Implementation: Does it require re-implementing features after testing?
  • What would the cost be to build your own? Building a scalable and trustworthy system is hard and expensive.
  • What’s the trajectory of your experimentation needs? If you anticipate significant growth, an in-house solution might be better in the long run despite initial cost.
  • Do you need to integrate into your system’s configuration and deployment methods? Deep integration is easier with in-house solutions.

The authors suggest that external solutions can be useful in the “Walk” phase to demonstrate impact before committing to an internal build.

Infrastructure and Tools: The Backbone of Trustworthy Experimentation

A robust experimentation platform aims to make experimentation self-service, minimize marginal costs, and ensure trustworthiness. It typically comprises four high-level components:

  1. Experiment Definition, Setup, and Management:
    • User Interface (UI) or API: For easy definition, setup, and management of experiments (owner, name, dates, description).
    • Multiple Iterations: Support for evolving features (bug fixes, progressive rollouts) under the same experiment ID.
    • Management Features: Drafting, comparing iterations, viewing history, automatic ID assignment, validation checks (e.g., conflicts, invalid targeting), status checks (start/stop).
    • Safety Features (Run & Fly phases): Automated release/ramp-up, near-real-time monitoring/alerting, automated detection and shutdown of bad experiments.
    • Pre-Live Checks: Test code execution or approval workflows.
  2. Experiment Deployment:
    • Experimentation Infrastructure: Provides experiment definitions, variant assignments, and other info.
    • Production Code Changes: Implements variant-specific behavior.
    • Variant Assignment Service: Returns the variant assignment (e.g., based on a hash of the user ID) and configuration parameters. Assignment must be consistent (the same user always receives the same variant) and independent across experiments; a single shared implementation is crucial.
    • Atomicity: Ensuring all servers simultaneously switch to a new iteration, crucial for consistent user experience in distributed systems.
    • Location of Variant Assignment: Early in the flow (e.g., traffic front door, client-side) or later (e.g., server-side after lookups). Earlier assignment can improve performance but complicates triggering.
    • Architecture Choices for Variant Behavior:
      1. Code Forks: if (variant == Treatment) then buttonColor = red else buttonColor = blue. Simple for small changes, but leads to escalating technical debt.
      2. Parameterized System: buttonColor = variant.getParam("buttonColor"). Reduces code debt.
      3. Full Configuration Pass-Down: buttonColor = config.getParam("buttonColor"). Variant assignment done early, passing all parameters, more performant for many parameters. Google and Bing use this.
    • Platform Overhead Measurement: Run some traffic outside the experimentation platform to measure its impact on latency, CPU, and cost.
  3. Experiment Instrumentation:
    • Basic Instrumentation: Logging user actions and system performance (views, clicks, hovers, time-to-click, errors, crashes, latency, system response rates, cache hit rates). Critical for baseline understanding.
    • Experiment-Specific Instrumentation: Logging which variant/iteration every user request and interaction belongs to.
    • Counterfactual Logging: Logging what would have happened for a Treatment user if they were in Control, important for some analyses but can be expensive.
    • Feedback Integration: Linking user feedback with variant IDs.
  4. Experimentation Analytics:
    • Automated Analysis: Crucial for scale, ensuring solid, consistent, and scientifically founded methodology.
    • Data Processing (“Cooking the Data”):
      1. Sort and Group: Joining multiple client/server logs by user ID and timestamp to create sessions.
      2. Clean: Removing bots, fraud, duplicate events, fixing timestamps, but watching for filtering that causes SRMs.
      3. Enrich: Adding dimensions (browser family, country, platform) and measures (event duration), plus experiment-specific annotations.
    • Data Computation: Calculating OEC, guardrail, and quality metrics by segments, p-values, confidence intervals, and trustworthiness checks (like SRM).
    • Architectures for Data Computation:
      1. Materialize Per-User Statistics: Compute and store metrics for every user, then join with experiment assignment. Good for overall business reporting.
      2. Integrate Computation: Compute per-user metrics on-the-fly within the experiment analysis pipeline. More flexible per-experiment but requires consistency mechanisms.
    • Speed and Efficiency: Daily processing of terabytes of data requires near real-time (NRT) paths for anomaly detection and automated shut-off, supplemented by batch processing for more complete results.
    • Standardization: Define common metrics and definitions to ensure a shared vocabulary and consistent data intuition across the organization.
    • Change Management: Plan for evolving metrics and OECs, including schema changes and data backfilling.
    • Results Summary and Visualization:
      • Clear Trustworthiness Indicators: Highlight SRM and other key tests (e.g., Microsoft ExP hides scorecards if tests fail).
      • Comprehensive Metric Display: Show OEC, critical metrics, and other guardrails/quality/debug metrics.
      • Relative Change and Statistical Significance: Present metrics as relative changes, with clear indicators (e.g., color-coding) for statistical significance.
      • Segment Drill-Downs: Enable analysis by segments and auto-highlight interesting ones.
      • Accessibility: Dashboards should be understandable by marketers, data scientists, engineers, and product managers.
      • Per-Metric Views: Allow stakeholders to monitor global health of key metrics across experiments, fostering transparency and conversation.
      • Automated Approval/Alerts: Optional features for negatively impacted experiments (e.g., email digests, forced conversations with metric owners).
    • Scaling Metrics (thousands of metrics): Categorize metrics (company-wide, product-specific, feature-specific; or data quality, OEC, guardrail, local features/diagnostic). Use p-value thresholds smaller than 0.05 to filter to most significant metrics and address multiple testing concerns. Automatically identify “metrics of interest.”
    • Related Metrics: Show how movements in one metric relate to others (e.g., CTR up due to clicks or page views down).
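Among the trustworthiness checks the analytics component runs automatically, the Sample Ratio Mismatch (SRM) test is the most important. A minimal sketch, using a normal approximation to the binomial (reasonable at typical experiment scale); the user counts and the 0.001 threshold below are illustrative assumptions, not values from the book:

```python
from statistics import NormalDist

def srm_p_value(control_users: int, treatment_users: int,
                expected_treatment_share: float = 0.5) -> float:
    """Two-sided p-value that the observed split matches the configured
    split, via a normal approximation to the binomial distribution."""
    n = control_users + treatment_users
    expected = n * expected_treatment_share
    std = (n * expected_treatment_share * (1 - expected_treatment_share)) ** 0.5
    z = (treatment_users - expected) / std
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A 50/50 experiment that logged 50,000 vs 51,500 users: a ~1.5% imbalance
# looks harmless, but at this scale it is wildly improbable by chance.
p = srm_p_value(50_000, 51_500)
print(f"SRM p-value: {p:.2e}")
if p < 1e-3:  # assumed alerting threshold
    print("Sample Ratio Mismatch: hide or flag this scorecard")
```

This is why platforms like Microsoft’s ExP hide scorecards outright when the check fails: almost any metric movement computed on a mismatched sample is untrustworthy.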

Scaling Experimentation: Digging into Variant Assignment

Scaling the number of experiments while maintaining statistical power requires sophisticated variant assignment:

  • Single-Layer Method (Numberline): In early maturity, traffic is divided into disjoint buckets, and each experiment takes up a specified fraction of total traffic. Each user is in only one experiment. This is simple but limits the number of concurrent experiments. Bing, LinkedIn, and Google initially used manual negotiation for traffic “ranges,” which later shifted to programmatic assignment.
  • Concurrent (Overlapping) Experiments: Allows each user to be in multiple experiments simultaneously, crucial for scaling. This requires multiple experiment layers, where each layer behaves like a single-layer method. Orthogonality is ensured by adding the layer ID to the hash function for user assignment.
    • Full Factorial Platform Design: Each user is simultaneously assigned to a variant for every running experiment. Simple to scale due to decentralized nature, but risks “collisions” or interactions between Treatments (e.g., blue text in Experiment 1, blue background in Experiment 2).
    • Nested Platform Design: System parameters are partitioned into layers to prevent problematic combinations from running for the same user (e.g., a UI layer, a backend layer).
    • Constraints-Based Platform Design: Experimenters specify constraints, and the system uses algorithms (like graph coloring) to ensure no two experiments with shared concerns are exposed to the same user. Automated interaction detection can be an extension.
    • Interaction Detection: Platforms like Microsoft’s automate the detection of interactions, which is vital if using full factorial designs where interactions aren’t prevented by design.
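The layered hashing described above can be sketched in a few lines. This is an illustrative toy, not any company’s actual implementation; the bucket count, layer names, and hash choice are assumptions. The key idea is that salting the hash with the layer ID makes a user’s bucket in one layer statistically independent of their bucket in every other layer.

```python
import hashlib

NUM_BUCKETS = 1000  # buckets per layer; experiments claim bucket ranges

def bucket(user_id: str, layer_id: str) -> int:
    """Deterministic bucket for a user within a layer. Including the
    layer_id in the hash makes assignments across layers orthogonal."""
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def variant(user_id: str, layer_id: str, treatment_buckets: range) -> str:
    return ("treatment" if bucket(user_id, layer_id) in treatment_buckets
            else "control")

# Stateless and stable: the same call always yields the same assignment,
# and the same user can land in different variants in different layers.
u = "user-42"
assert variant(u, "ui-layer", range(500)) == variant(u, "ui-layer", range(500))
print(variant(u, "ui-layer", range(500)), variant(u, "backend-layer", range(500)))
```

Because assignment is a pure function of (user, layer), no lookup table is needed, which is what lets the variant assignment service scale and stay consistent across servers.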

This chapter concludes by emphasizing that a well-designed experimentation platform is not just a set of tools, but a complete ecosystem that integrates deeply with organizational culture and processes to drive trusted data-driven innovation at scale.

Chapter 5: Speed Matters: An End-to-End Case Study

This chapter presents a compelling case study illustrating the profound impact of website performance (speed) on key business metrics like revenue and user satisfaction. It demonstrates how controlled experiments, specifically slowdown experiments, can quantify this impact and provide actionable insights.

The Unseen Cost of Slowness

The chapter highlights that while “faster is better” is intuitively understood, quantifying the precise value of speed improvements is crucial for resource allocation. Leading companies like Amazon, Google, and Bing have consistently shown that even seemingly minor delays (e.g., 100 milliseconds) can lead to significant revenue loss and reduced user engagement.

Key findings from various companies:

  • Amazon: A 100 msec slowdown decreased sales by 1%.
  • Bing/Google Joint Talk: Significant impact of performance on distinct queries, revenue, clicks, satisfaction, and time-to-click.
  • Microsoft Bing (2012): Every 100 msec speedup improved revenue by 0.6%.
  • Microsoft Bing (2015): Even as Bing became faster (95th percentile under 1 second), further improvements still mattered; every four milliseconds of improvement could fund an engineer for a year.
    These results underscore why latency should be a crucial guardrail metric. The chapter also advises against using third-party products (e.g., personalization tools) that inject blocking JavaScript snippets, as the latency cost might outweigh any potential gains. Server-side personalization and optimization are generally preferred.

Key Assumption: Local Linear Approximation

To measure the impact of performance, slowdown experiments are employed because it’s difficult to improve performance on demand. The core assumption is that the relationship between time (performance) and a metric (e.g., CTR, revenue-per-user) is linear around today’s performance point.

  • If slowing down by X milliseconds causes a Y% drop in a metric, then speeding up by X milliseconds is assumed to cause a Y% increase.
  • This is a first-order Taylor-series approximation.
  • Validation: Bing’s experience with both 100 msec and 250 msec slowdowns yielded deltas that were approximately proportional, supporting the linearity assumption. Discontinuities are unlikely for small time changes.
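The first-order extrapolation can be written out directly. A short sketch using the Bing 2012 figure quoted in this chapter (100 msec slowdown ↔ 0.6% revenue); the 40 msec speedup scenario is an invented example:

```python
# Linear (first-order Taylor) extrapolation from a slowdown experiment.
# Measured: a 100 msec slowdown moved revenue by -0.6%, so the estimated
# slope is -0.006% per msec around today's performance point.

def estimated_metric_change(measured_delta_pct: float,
                            measured_delay_ms: float,
                            proposed_change_ms: float) -> float:
    """Assumes the metric is locally linear in latency.
    proposed_change_ms < 0 means a speedup."""
    slope = measured_delta_pct / measured_delay_ms  # % per msec
    return slope * proposed_change_ms

# A proposed 40 msec speedup is then estimated at +0.24% revenue.
print(estimated_metric_change(-0.6, 100, -40))
```

The approximation is only trusted near the current operating point, which is exactly what the 100 msec vs. 250 msec proportionality check at Bing was validating.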

The Nuances of Measuring Website Performance

Measuring latency accurately is complex:

  • Server Clock Synchronization: Critical for accurate duration measurements, as requests often span multiple servers.
  • User-Perceived Page Load Time (PLT): Approximated by T7 - T1 (time from server receiving initial request to beacon reaching server after onload event).
  • Progressive Rendering: window.onload is a poor measure of user experience; metrics like “Above the Fold Time” or “Speed Index” are better at capturing perceived performance.
  • Time-to-User Action: Measuring the time until a user’s first click or successful task completion is a robust alternative, as it doesn’t rely on heuristics about “perceived performance.”
  • “Below the Fold” Elements: Elements loaded late or in right-pane areas may have less critical performance impacts, as shown by a Bing experiment where a 250 msec delay to right-pane elements had no statistically significant impact on key metrics.

Designing the Slowdown Experiment

Careful design is essential for a reliable slowdown experiment:

  • Where to Insert Delay: Bing learned not to delay the initial Chunk1 (which provides immediate user feedback and page “chrome”), but rather to delay after Chunk2 (the URL-dependent HTML) is computed, as this is more representative of general server-side processing time.
  • Delay Duration: A trade-off between:
    • Accuracy of Slope Estimation: Larger delays (e.g., 250 msec) provide more precise measurements.
    • Accuracy of Linear Approximation: Shorter delays maintain better accuracy.
    • User Harm: Shorter delays minimize negative user impact.
  • Constant vs. Percentage Delay: A constant delay (e.g., 100 msec) is often chosen to model backend server-side delays, rather than a percentage which would account for network differences.
  • Impact on First vs. Later Pages: Consider if speedup is more important on initial page loads or subsequent pages in a session.

Extreme Results and Their Interpretation

The chapter warns against overstating or dismissing performance impacts:

  • Overstated Impact: Marissa Mayer’s claim that Google traffic and revenue dropped by 20% due to a 0.5-second increase from showing 30 search results (instead of 10) is likely an overstatement. While performance is critical, other factors (like information overload from too many results) were also at play.
  • Understated Impact: Dan McKinley’s claim that a 200 msec delay didn’t matter for Etsy users is likely due to insufficient statistical power to detect the difference. Dismissing performance can lead to significant degradation of the site over time.
  • Fake Progress Bars: In rare scenarios, too fast can reduce user trust, leading some products to add artificial delays or progress bars.

The chapter concludes by advocating for reporting replications of experiments, even (or especially) surprising ones, to build scientific consensus and improve trust in results.

Chapter 6: Organizational Metrics

This chapter emphasizes the foundational role of good metrics for any data-driven organization, whether or not it runs experiments. It introduces a comprehensive taxonomy of metrics and provides principles for their formulation, evaluation, and evolution.

The Importance of Metrics in a Data-Driven Culture

Metrics are crucial for:

  • Measuring Progress: Quantifying whether an organization is moving towards its goals.
  • Accountability: Holding teams and individuals responsible for their contributions.
  • Aligning on Goals: Metrics provide a common language and understanding of what success looks like, especially when using frameworks like Objectives and Key Results (OKRs).
    The chapter notes that the definition of success often starts qualitatively and then needs to be translated into imperfect but quantifiable metrics.

A Comprehensive Metrics Taxonomy

The authors propose a primary taxonomy for organizational metrics:

  • Goal Metrics (Success Metrics / True North Metrics):
    • Reflect what the organization ultimately cares about (e.g., long-term revenue, user happiness).
    • Often derived from a mission statement.
    • Usually a single or very small set of metrics.
    • May be hard to move in the short term, as initiatives might have small or delayed impacts.
    • Should be simple and stable.
  • Driver Metrics (Sign Post Metrics / Surrogate Metrics / Indirect or Predictive Metrics):
    • Shorter-term, faster-moving, and more sensitive than goal metrics.
    • Reflect a mental causal model of how the organization succeeds (hypotheses about what drives goal metrics).
    • Examples of frameworks: HEART (Happiness, Engagement, Adoption, Retention, Task Success) and PIRATE (Acquisition, Activation, Retention, Referral, Revenue).
    • Should be aligned with the goal (causally validated), actionable and relevant, sensitive, and resistant to gaming.
  • Guardrail Metrics:
    • Protect against violated assumptions and unintended negative consequences.
    • Organizational Guardrails: Prevent undesirable outcomes while optimizing other metrics (e.g., don’t increase registration at the cost of drastic engagement drops, don’t improve features if it significantly degrades page load time). Often more sensitive than goal/driver metrics.
    • Trustworthiness Guardrails: Assess the internal validity of experiment results (discussed in Chapter 21, e.g., Sample Ratio Mismatch).

Other useful metric taxonomies:

  • Asset vs. Engagement Metrics: Asset metrics measure accumulation (e.g., total Facebook users), while engagement metrics measure value from action/usage (e.g., sessions).
  • Business vs. Operational Metrics: Business metrics track health (e.g., revenue-per-user, DAU), while operational metrics track system performance (e.g., queries per second).
  • Data Quality Metrics: Ensure internal validity and trustworthiness of experiments.
  • Diagnosis / Debug Metrics: Provide granular information for troubleshooting (e.g., clicks on specific page areas, revenue decomposed into purchase indicator and conditional revenue).

Metric alignment is crucial at every level of the company (company, team, feature). The same metric (e.g., latency) can be a guardrail for a product team but a goal for an infrastructure team.

Formulating Metrics: Principles and Techniques

Translating qualitative goals into quantifiable definitions is a key step:

  • Principles for Goal Metrics:
    1. Simple: Easily understood and broadly accepted.
    2. Stable: No need to update with every new feature.
  • Principles for Driver Metrics:
    1. Aligned with the goal: Validated to be true drivers of success.
    2. Actionable and relevant: Teams can influence them through their work.
    3. Sensitive: Able to show impact from most initiatives.
    4. Resistant to gaming: Difficult to manipulate for superficial gains.

Helpful techniques for metric development:

  • Using Hypotheses from Less-Scalable Methods: Use qualitative insights from surveys, user experience research (UER), or focus groups (see Chapter 10) to define quantitative metrics (e.g., how to define “bounce rate” thresholds based on user dissatisfaction).
  • Incorporating Quality: Build quality concepts into definitions (e.g., a “good” click vs. a “bad” click, a “good” user signup). Human evaluation can help define quality.
  • Statistical Models: When using models (e.g., Lifetime Value – LTV), ensure interpretability and validation over time.
  • Measuring Dissatisfaction: Sometimes easier to measure what you don’t want (e.g., dissatisfaction) as guardrail or debug metrics.
  • Avoiding Proxies’ Failure Cases: Be aware that all metrics are proxies. A CTR metric might lead to clickbait; supplement with relevance metrics.
  • Limiting Key Metrics: While an OEC might combine many, try to limit directly monitored “key” metrics to about five to avoid cognitive overload and multiple comparison problems (e.g., with 10 independent metrics, there’s a roughly 40% chance that at least one is statistically significant by chance at p < 0.05).
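The 40% figure above follows from the family-wise false positive probability, which is easy to compute:

```python
# With k independent metrics each tested at significance level alpha,
# the chance that at least one is "significant" purely by chance is
# 1 - (1 - alpha)^k.

def false_positive_prob(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

print(f"10 metrics: {false_positive_prob(10):.1%}")  # roughly 40%
print(f" 5 metrics: {false_positive_prob(5):.1%}")   # roughly 23%
```

Cutting the key-metric count from ten to five roughly halves the chance of a spurious “winner,” which is the quantitative motivation behind the rule of thumb.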

Evaluating Metrics

Metrics evaluation is ongoing:

  • Initial Evaluation: Does a new metric provide additional information?
  • Continuous Monitoring: For LTV predictions, check if errors remain small. For experimentation metrics, periodically assess if they encourage gaming.
  • Causal Validation (Driver to Goal): The most challenging evaluation. Does a driver metric actually lead to long-term financial objectives? Approaches include:
    • Utilizing other data sources (surveys, UER) for triangulation.
    • Analyzing observational data (with caveats about causality).
    • Checking similar validations at other companies (e.g., site speed impact).
    • Running experiments specifically to evaluate metrics (e.g., A/B test a loyalty program to see LTV impact).
    • Using a corpus of historical experiments as “golden” samples for sensitivity and causal alignment.

Evolving Metrics Over Time

Metric definitions are not static and evolve due to:

  • Business Evolution: New business lines, shifts in focus (e.g., adoption to retention).
  • Environment Evolution: Competitive landscape, privacy concerns, new policies.
  • Improved Understanding: Discovering limitations of existing definitions, leading to refinements or new formulations.
    Changes to metrics require structured handling, including schema changes and data backfilling. The time and effort invested in refining metrics have a high Expected Value of Information (EVI).

Chapter 7: Metrics for Experimentation and the Overall Evaluation Criterion

This chapter deepens the discussion on metrics by focusing on their specific requirements for online controlled experiments and introduces the critical concept of the Overall Evaluation Criterion (OEC).

From Business Metrics to Metrics Appropriate for Experimentation

While Chapter 6 covers organizational metrics broadly, not all of them are suitable for experimentation. Metrics used in online controlled experiments must possess specific characteristics:

  • Measurable: The effect must be quantifiable (e.g., post-purchase satisfaction is hard to measure directly).
  • Attributable: Metric values must be linked directly to the experiment variant (e.g., an app crash must be attributable to the Treatment). This can be challenging with third-party data providers.
  • Sensitive and Timely: Metrics must be able to detect meaningful changes within the typical experiment duration (e.g., one to two weeks).
    • Sensitivity depends on statistical variance, effect size, and number of randomization units.
    • Examples: Stock price is too insensitive. Click-throughs on a new feature are sensitive but too localized (don’t capture overall impact or cannibalization). Whole-page click-throughs (penalized for quick-backs), “success” measures (like purchase), and time-to-success are generally good.
    • Outliers (e.g., very high cost-per-click ads) can inflate variance, making detection harder; truncated revenue (capping high values) can improve sensitivity.
    • Long-term metrics (e.g., yearly subscription renewal rate) are usually too slow; surrogate metrics (e.g., usage as an indicator of satisfaction) are needed for experiments.
      The authors emphasize that you must “think hard about what you are optimizing for.” Optimizing for “time-on-site” without quality qualifiers can lead to slow sites and interstitial pages, increasing the metric short-term but causing long-term abandonment.
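The effect of truncating (capping) revenue on sensitivity can be demonstrated with simulated data. Everything below is hypothetical: the revenue distribution and the $500 cap are invented purely to show how a handful of outliers dominates the variance of an uncapped metric.

```python
import random

random.seed(0)

# Simulated per-user revenue: mostly zero, occasional modest purchases,
# and a handful of extreme outliers (e.g., very large orders).
revenue = ([0.0] * 9_000
           + [random.expovariate(1 / 30) for _ in range(990)]
           + [random.uniform(2_000, 10_000) for _ in range(10)])

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

capped = [min(x, 500.0) for x in revenue]  # assumed $500/user cap

print(f"raw variance:    {variance(revenue):,.0f}")
print(f"capped variance: {variance(capped):,.0f}")
```

Lower variance means narrower confidence intervals for the same sample size, so a real Treatment effect on typical users is detectable sooner; the price is that effects concentrated in the capped tail become invisible.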

For experimentation, you’ll select a subset of business goal, driver, and organizational guardrail metrics that meet these criteria, then augment them with:

  • Additional surrogate metrics for business goals and drivers.
  • More granular metrics (e.g., click-through on specific features on a page).
  • Additional trustworthiness guardrails (see Chapter 21) and data quality metrics.
  • Diagnostic and debug metrics (e.g., decomposing revenue into purchase indicator and conditional revenue).

A typical experiment scorecard might have a few key metrics and hundreds to thousands of other metrics, viewable by segments. Different teams may even swap the roles of goal, driver, and guardrail metrics based on their specific focus (e.g., an infrastructure team might have performance as a goal, and business metrics as guardrails).

Combining Key Metrics into an OEC

When multiple goal and driver metrics exist, the question arises: how to combine them for decision-making?

  • The Challenge of a Single Metric: While some advocate for “One Metric That Matters (OMTM),” it’s often an oversimplification. Like a pilot’s dashboard, an online business needs multiple key metrics (engagement, monetary value).
  • Mental Models vs. Codified OEC: Organizations often have a mental model of tradeoffs. Devising a single Overall Evaluation Criterion (OEC) as a weighted combination of objectives is the “more desired solution” because it makes tradeoffs explicit, aligns the organization, enables consistent decision-making, and allows for automated optimization.
  • Avoiding Gaming: The OEC and its components must be resistant to gaming (see Chapter 6 Sidebar). For example, a simple “revenue” OEC can lead to spamming users; it needs to incorporate long-term value like user lifetime value or engagement.
  • Basketball Scoreboard Analogy: Like combining 2-point and 3-point shots into a single team score, an OEC summarizes complex performance.
  • Decision-Making Framework without an OEC: If a single OEC cannot be agreed upon, minimize the number of key metrics (e.g., limit to five to reduce cognitive overload and multiple comparison problems, since testing 10 independent metrics at p = 0.05 gives a roughly 40% chance that at least one is significant by chance).
  • Benefit of an Agreed-Upon OEC: Enables automatic shipping of changes and parameter sweeps.

Example: OEC for E-mail at Amazon

Amazon initially used a click-through revenue OEC for its programmatic email campaigns. This metric was monotonically increasing with email volume, leading to spamming users and complaints.

  • The Problem: Optimizing for short-term revenue neglected user lifetime value (LTV). Annoyed users would unsubscribe, resulting in future lost revenue opportunities.
  • The Solution: Amazon built a simple model to estimate the unsubscribe_lifetime_loss (estimated revenue loss from a user unsubscribing). Their new OEC was: OEC = (Total Revenue - s * unsubscribe_lifetime_loss) / n (where s is unsubscribes, n is users).
  • The Impact: With just a few dollars assigned to unsubscribe loss, over half of programmatic campaigns showed a negative OEC, leading to their discontinuation. This insight also led to a redesigned unsubscribe page that defaulted to unsubscribing from a “campaign family” rather than all Amazon emails, drastically reducing the cost of an unsubscribe.
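The Amazon e-mail OEC given above is simple enough to compute directly. The campaign numbers below are invented for illustration; only the formula comes from the text.

```python
# Amazon e-mail OEC from the text:
#   OEC = (total revenue - s * unsubscribe_lifetime_loss) / n
# where s = number of unsubscribes and n = users targeted.

def email_oec(total_revenue: float, unsubscribes: int,
              unsubscribe_lifetime_loss: float, users: int) -> float:
    return (total_revenue - unsubscribes * unsubscribe_lifetime_loss) / users

# A campaign that looks profitable on click-through revenue alone...
print(email_oec(total_revenue=5_000, unsubscribes=0,
                unsubscribe_lifetime_loss=4.0, users=100_000))  # positive
# ...turns negative once even a few dollars per unsubscribe are counted.
print(email_oec(total_revenue=5_000, unsubscribes=2_000,
                unsubscribe_lifetime_loss=4.0, users=100_000))  # negative
```

This mirrors the book’s point: assigning even a small dollar value to an unsubscribe flipped the sign of more than half of Amazon’s programmatic campaigns.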

Example: OEC for Bing’s Search Engine

Bing uses query share and revenue as key organizational metrics. A ranker bug that showed very poor search results caused both distinct queries per user to increase by over 10% and revenue-per-user to increase by over 30%.

  • The Paradox: If these were the OEC, Bing would intentionally degrade quality. The long-term goal of a search engine is to help users find answers quickly, which often means fewer queries per task.
  • Decomposing Query Share: Monthly query share = Users/Month * Sessions/User * Distinct Queries/Session.
    • Users/Month is determined by experiment design, not the OEC.
    • Distinct Queries/Session should ideally be minimized (users find answers faster), but only if tasks are successfully completed (not due to abandonment).
    • Sessions-per-user is the key metric to optimize (increase), as satisfied users visit more often.
  • Revenue Constraint: Revenue-per-user alone is insufficient for search engines. It needs to be balanced against engagement metrics, often by adding constraints (e.g., restricting average ad pixel usage on a page).
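The decomposition can be illustrated with made-up numbers: a ranker bug raises queries per session while the healthier long-term lever is sessions per user:

```python
# Monthly query volume decomposes multiplicatively.
def monthly_queries(users: int, sessions_per_user: float,
                    queries_per_session: float) -> float:
    return users * sessions_per_user * queries_per_session

baseline = monthly_queries(1_000_000, 8, 2.0)
degraded = monthly_queries(1_000_000, 8, 2.3)  # users retry failed queries
print(degraded > baseline)  # True: query share rises despite a worse experience
```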

Goodhart’s Law, Campbell’s Law, and the Lucas Critique

These three concepts highlight the dangers of choosing an OEC based solely on correlations:

  • Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” Optimizing solely for a metric can lead to perverse incentives and unintended consequences (e.g., increasing short-term profit by raising prices could hurt long-term LTV).
  • Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
  • Lucas Critique: Relationships observed in historical data may not be causal or structural. Policy decisions can alter economic models, making past correlations invalid for future prediction (e.g., the Phillips Curve’s historical correlation between inflation and unemployment did not hold when inflation was intentionally raised to lower unemployment).
    The authors emphasize that simply finding historical correlations is not enough; one must understand the underlying causal relationships and incentives to avoid making faulty decisions.

Chapter 8: Institutional Memory and Meta-Analysis

This chapter highlights the crucial role of institutional memory and meta-analysis as an organization matures in its experimentation practices, particularly reaching the “Fly” maturity phase.

What Is Institutional Memory?

Once a company fully embraces controlled experiments, it effectively creates a digital journal of all changes, including descriptions, screenshots, and key results. This journal is referred to as Institutional Memory.

  • It captures meta information for each experiment: owners, start/end dates, descriptions, visuals, and key results (including triggered and overall impact).
  • Crucially, it records the hypothesis, decision made, and the rationale behind that decision.
  • A centralized experimentation platform is essential for easily capturing and organizing this data.

Why Is Institutional Memory Useful?

Mining data from this institutional memory through meta-analysis offers five key benefits:

  1. Experiment Culture: Institutional memory solidifies the culture of experimentation by:
    • Quantifying Contribution: Showing how experimentation contributes to overall organizational goals (e.g., how much session-per-user improvement is attributable to launched experiments). Bing Ads’ plot of revenue gains from hundreds of incremental experiments is a powerful example.
    • Highlighting Impactful Experiments: Regularly sharing stories of big wins or surprising results (both positive and negative) to make the learning concrete and emphasize humility.
    • Revealing Success/Failure Rates: Illustrating that most ideas fail to improve metrics (Microsoft: 1/3 positive, 1/3 negative, 1/3 no impact; Bing/Google: 10-20% success). This reinforces the need for objective testing.
    • Tracking Adoption & Accountability: Measuring the percentage of features launched through experiments, which teams run the most, and which are most effective. Linking outages to non-experimented changes provides a safety net.
  2. Experiment Best Practices: Institutional memory helps enforce and improve best practices by:
    • Identifying Gaps: Analyzing data to see if experimenters follow recommended ramp periods, have sufficient power, etc.
    • Driving Automation: Insights from meta-analysis can justify investments in automation (e.g., LinkedIn building auto-ramp features after observing teams spending too much time in early ramp phases).
  3. Future Innovations: The catalog of past successes and failures is highly valuable for:
    • Avoiding Repeated Mistakes: Preventing teams from re-trying ideas that failed without understanding why.
    • Inspiring Innovation: Providing a rich source of patterns and insights that can guide new ideas (e.g., GoodUI.org summarizing winning UI patterns).
    • Predictive Power: After many experiments on a page (e.g., SERP), you can predict the impact of new elements, narrowing the search space for future tests.
    • Heterogeneity Insights: Uncovering how different countries or user segments react to features allows for customized user experiences.
  4. Metrics: Institutional memory deepens understanding of metrics by:
    • Assessing Sensitivity: Identifying which metrics are meaningfully moved by experiments (e.g., DAU is hard to move short-term). Constructing a corpus of trusted experiments to evaluate new metrics.
    • Identifying Related Metrics (Early Indicators): Discovering how metrics move together in experiments, especially identifying short-term metrics that predict long-term goals (e.g., Chen, Liu, and Xu’s work at LinkedIn on how message activity correlates with visits but doesn’t move similarly in experiments).
    • Probabilistic Priors for Bayesian Approaches: Historical experiment data can provide reasonable prior distributions for Bayesian analyses, though caution is needed for rapidly evolving product areas.
  5. Empirical Research: The vast amount of experiment data serves as rich empirical evidence for researchers:
    • Studying Innovation Productivity: Analyzing how companies best utilize experimentation (e.g., Azevedo et al.’s study on Microsoft’s platform).
    • Causal Evidence for Social Phenomena: Using experiment randomization as an instrumental variable (e.g., Saint-Jacques et al.’s study on LinkedIn’s “People You May Know” algorithm showing weak ties’ importance for job landing).
    • Correcting for Statistical Biases: Investigating and correcting selection bias when launching successful experiments (e.g., Lee and Shen’s work on Airbnb).

Institutional memory transforms an organization from simply running tests to a continuously learning, data-driven entity, driving more robust results and fostering a culture of curiosity and intellectual integrity.

Chapter 9: Ethics in Controlled Experiments

This chapter addresses the critical importance of ethics in controlled experiments, especially in the context of human subjects. It provides a framework for understanding and evaluating the ethical considerations, drawing from principles established in biomedical and behavioral research.

Background: The Ethical Imperative

Controlled experiments involve real people, necessitating a strong ethical framework. The chapter highlights recent controversies, such as:

  • Facebook’s emotional contagion study (2014): Manipulating users’ news feeds to study emotional responses.
  • OKCupid’s dating algorithm experiment (2014): Misinforming users about match percentages to observe behavior.
    These examples underscore the need for careful ethical assessment.

The authors reference the Belmont Report (1979) and the Common Rule (1991), which established three core principles for human subjects research:

  1. Respect for Persons: Treating individuals as autonomous agents, ensuring transparency, truthfulness, and voluntariness (choice and consent). Protecting those with diminished autonomy.
  2. Beneficence: Minimizing risks and maximizing benefits to participants. This means properly assessing and balancing risks and benefits.
  3. Justice: Ensuring participants are not exploited and that risks and benefits are fairly distributed.

While these guidelines originate from medical contexts with higher potential for harm, they provide a useful framework for online experiments. The fundamental question is whether a specific A/B experiment is justified.

Key Areas for Ethical Consideration

  • Risk:
    • Minimal Risk: Does the experiment’s risk exceed “that ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests?” Harm can be physical, psychological, emotional, social, or economic.
    • Equipoise: Is there genuine uncertainty among experts about which treatment is better? If one variant is known to be superior or harmful, exposing users to the inferior/harmful one raises ethical questions.
    • “A/B Illusion”: The false belief that running an A/B test is inherently riskier than just shipping a feature to 100% of users without testing. If shipping to 100% is acceptable, scientifically evaluating with 50% first should also be acceptable.
    • Knowingly Providing a Worse Experience: Experiments that intentionally slow down a user (Chapter 5) or disable features to quantify tradeoffs (e.g., value of recommendations) violate equipoise. While carrying higher risk, their benefit is establishing tradeoffs for all users, similar to drug toxicity studies.
    • Deception Experiments: Experiments focused on behavioral manipulation or deception (e.g., OKCupid) carry higher ethical risk and raise questions about respect for participants. Transparency and user expectations are key.
  • Benefits:
    • Direct Benefits: Improving the product for users in Treatment.
    • Indirect Benefits: Improving the product for all users who benefit from the experiment’s results, or ensuring a sustainable business.
  • Consent and Choice:
    • Informed Consent: Participants agree to participate after being fully informed of risks, benefits, process, and data handling. While standard in medical trials, it is usually impractical for low-risk online experiments run at large scale.
    • Presumptive Consent: Asking a smaller, representative group if they would agree to participate in a study type.
    • User Choice: Do users have alternatives (e.g., use another search engine)? Switching costs vary. These factors influence the balance of risks and benefits.

Data Collection Ethics

Given that experimentation relies on data, ethical data collection is paramount. Beyond legal compliance (e.g., GDPR), experimenters should address:

  • Transparency and Understanding:
    • Do users understand what data is collected about them? (Privacy by Design).
    • How sensitive is the data (financial, health)? Could it lead to discrimination?
    • Can the data be linked to an individual (Personally Identifiable Information – PII)?
    • What is the purpose of collection, how will it be used, and by whom?
    • Is collection necessary? How soon can it be aggregated or deleted?
  • Potential Harm from Data Leakage: What if data is made public (health, psychological, social, financial harm)?
  • Confidentiality and Safeguards:
    • What level of confidentiality can participants expect?
    • Are internal safeguards adequate (access controls, logging, auditing)?
    • What redress is offered if guarantees are not met?

Culture and Processes for Ethical Experimentation

Addressing complex ethical issues requires a company-wide commitment:

  • Cultural Norms and Education: Foster a culture where ethical questions are routinely asked in product and engineering reviews.
  • Institutional Review Boards (IRBs): Implement an internal process (analogous to IRBs) to review human subjects research, assess risks/benefits, ensure transparency, and provide guidance.
  • Secure Data Storage and Access: All data, identified or not, must be stored securely with time-limited access, clear policies, logging, and regular audits.
  • Clear Escalation Path: For cases with more than minimal risk or data sensitivity issues.

The chapter concludes by emphasizing that these ethical considerations are not just a checklist, but ongoing discussions that improve product design and experiments for the benefit of end users.

Chapter 10: Complementary Techniques

This chapter introduces various techniques that complement online controlled experiments. These methods are valuable for generating new ideas, validating metrics, and gathering evidence when A/B testing is not feasible or sufficient.

The Landscape of Complementary Techniques

To run successful A/B experiments, organizations need:

  • An “ideas funnel”: A continuous source of hypotheses to test.
  • Validated metrics: Reliable measures for success.
  • Evidence: To support or refute hypotheses when experiments are impossible or insufficient.
  • Supplementary metrics: Beyond those computed in A/B tests.

These complementary methods vary along two axes: scale (number of users) versus depth of information per user. Generally, larger scale means less depth per user.

The techniques covered are:

  • Logs-based Analysis (Retrospective Analysis): High scale, moderate depth.
  • Human Evaluation: Moderate scale, moderate depth.
  • User Experience Research (UER): Low scale, high depth.
  • Focus Groups: Low/moderate scale, moderate depth.
  • Surveys: Moderate/high scale, low/moderate depth.
  • External Data: High scale, varying depth (depending on source).

Logs-based Analysis

This involves analyzing existing instrumentation data to understand user behavior and system performance. It helps with:

  • Building Intuition: Understanding distributions of metrics (e.g., sessions-per-user, CTR), segment differences (country, platform), and how these shift over time. This provides baseline understanding for experiment design.
  • Characterizing Potential Metrics: Understanding variance, distributions, and correlations of new metrics. This helps assess if a metric is useful for decision-making or provides new information.
  • Generating Ideas for A/B Experiments: Identifying drop-offs in funnels, uncovering unusual action sequences, or sizing the potential impact of a feature before implementation (e.g., analyzing current email attachment usage to size potential impact of improving it).
  • Natural Experiments: Analyzing data from unexpected exogenous events (e.g., external company changes, bugs causing mass logouts) as a form of observational study (see Chapter 11).
  • Observational Causal Studies: Using quasi-experimental designs when controlled experiments are not possible (covered in Chapter 11).
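As one illustration of idea generation from logs, a hypothetical funnel analysis (all counts are made up) surfaces where users drop off:

```python
# Hypothetical purchase-funnel counts pulled from logs; large drop-offs
# between adjacent steps are candidates for A/B experiments.
funnel = [("view_product", 10_000), ("add_to_cart", 2_500),
          ("checkout", 1_200), ("purchase", 900)]

for (step, n), (next_step, next_n) in zip(funnel, funnel[1:]):
    print(f"{step} -> {next_step}: {1 - next_n / n:.0%} drop-off")
```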

Limitation: Logs-based analysis can only infer future behavior from past observations; it might not reveal why certain behaviors occur (e.g., low feature usage due to difficulty of use).

Human Evaluation

Paying human judges (raters) to complete tasks and provide feedback. Common in search and recommendation systems.

  • Tasks: Simple (e.g., A vs. B preference, image classification) to complex (e.g., relevance scoring for a query).
  • Calibration: Detailed instructions and multiple raters are used to ensure quality.
  • Limitations: Raters are usually not end users and may lack real-world context (e.g., the query “5/3” may mean Fifth Third Bank to a real user rather than the arithmetic expression).
  • Advantages: Can be trained to detect spam or harmful experiences users might miss. Provides calibrated labeled data.
  • Applications in Experimentation:
    • Additional Metrics: Use human evaluation results as metrics in A/B tests (e.g., raters preferring Treatment search results).
    • Debugging: Detailed examination of poorly rated results to understand algorithmic flaws.
    • Correlation with Logs: Pairing human evaluation with log analysis to link observed actions with relevant results.

User Experience Research (UER)

In-depth qualitative studies, typically with a small number of users (tens at most), often through observation in lab settings or in situ.

  • Goal: Generating ideas, spotting problems, and gaining insights from direct observation and timely questions (e.g., observing users struggling with a purchase flow).
  • Methods: Special equipment (eye-tracking), diary studies (users self-documenting longitudinal behavior, including offline activities).
  • Benefits: Helps identify “true” user intent and translate qualitative observations into quantitative metric ideas.
  • Limitation: Small scale means results need validation with methods that scale (observational analysis, controlled experiments).

Focus Groups

Guided group discussions with recruited users or potential users.

  • Scale: More scalable than UER studies, but less in-depth per user.
  • Applications: Getting feedback on ill-formed hypotheses, understanding underlying emotional reactions (e.g., for branding changes). Useful for early-stage product development.
  • Limitations: Risk of “group-think” and convergence on fewer opinions. What customers say may not match their true preferences (e.g., Philips boom box example: expressed preference for yellow, but chose black when given a free one).

Surveys

Recruiting a population to answer a series of questions (in-person, phone, online).

  • Scale: Can reach larger numbers of users than UER or focus groups.
  • Applications: Gathering data not observable from instrumentation (offline behavior, opinions, trust, satisfaction levels). Useful for observing trends over time on less-directly-measurable issues (e.g., trust, reputation).
  • Challenges:
    • Question Wording: Must be careful to avoid misinterpretation, priming, or uncalibrated answers.
    • Self-Reported Answers: Users may not be truthful.
    • Bias: Population can be biased (e.g., only unhappy users respond), making relative results (trends) more useful than absolute ones.
      Surveys can be paired with observational analysis to correlate responses with user behavior, but results need careful interpretation due to respondent bias.

External Data

Data collected and analyzed by parties outside your company. Sources include:

  • Market Research Firms: Providing granular site/user data (e.g., comScore, Hitwise) based on large user panels.
  • Data Providers: Offering user segments joinable with internal logs.
  • Custom Surveys: Hiring external companies to run surveys.
  • Published Academic Papers: Studies on user behavior, product interactions (e.g., eye-tracking studies, user-reported satisfaction vs. task duration).
  • Crowd-sourced Lessons: Websites summarizing UI design patterns (e.g., GoodUI.org).

Benefits:

  • Validation: Comparing internal metrics to external benchmarks (e.g., total visitors, shopping traffic share).
  • Supporting Evidence: Providing evidence for business metrics or generating ideas for proxy metrics.
  • Generalizability: Establishing that latency matters (Chapter 5) without needing to run your own experiment.
  • Competitive Analysis: Benchmarking against competitors.
    Caveat: Absolute numbers may not match due to different methodologies; focus on trends and correlations.

Putting It All Together

Choosing which complementary techniques to use depends on the goal:

  • Early-stage idea generation/qualitative understanding: UER studies, focus groups.
  • Validating metrics/understanding broad trends: Surveys, external data, observational analyses.
  • Identifying specific behavioral patterns: Logs-based analysis.

Tradeoffs: Number of users (generalizability/external validity) vs. depth of information per user.
Triangulation: Using multiple methods to converge on a more accurate understanding (e.g., UER + logs-based analysis + surveys + A/B tests to understand user happiness with recommendations). This establishes a hierarchy of evidence by bounding the answer through different perspectives.

Chapter 11: Observational Causal Studies

This chapter addresses scenarios where randomized controlled experiments are not possible, and instead, observational causal studies are used to assess causality. It discusses various designs and critical pitfalls associated with these methods, emphasizing their lower level of trustworthiness compared to controlled experiments.

When Controlled Experiments Are Not Possible

The goal of causal inference is to compare the outcome of a treated population to a counterfactual (what would have happened if they hadn’t been treated). Randomized controlled experiments achieve this by ensuring the “selection bias” term is zero. However, they are not always feasible in situations such as:

  • No Control Over the Causal Action: When the change isn’t controlled by the organization (e.g., a user switching phone brands).
  • Too Few Units: For single events like Mergers & Acquisitions (M&A).
  • High Opportunity Cost for Control: When withholding Treatment from a Control group is too costly (e.g., Super Bowl ads, long-term OEC).
  • Expensive Changes Relative to Perceived Value: Testing “what-if” scenarios that are too costly (e.g., forcibly signing out all users, removing all ads).
  • Randomization Unit Limitations: When proper randomization is impossible (e.g., randomizing by TV viewers for ad impact).
  • Unethical or Illegal Interventions: Withholding beneficial medical treatments.

In these situations, the best approach is often to use multiple methods lower in the hierarchy of evidence (e.g., small-scale UER, surveys, and observational causal studies). The chapter differentiates “observational causal studies” (aiming for causality with no unit manipulation) from “quasi-experimental designs” (where units are assigned to variants, but not randomly).

Designs for Observational Causal Studies

The challenge is constructing comparable Control and Treatment groups and modeling the impact:

  • Interrupted Time Series (ITS):
    • Design: Uses multiple measurements before an intervention to create a model for the counterfactual, and then multiple measurements after to estimate the Treatment effect as the deviation from the model’s prediction.
    • Extension: Introduce a Treatment, reverse it, and repeat multiple times (e.g., police helicopter surveillance and burglaries).
    • Online Example: Bayesian Structural Time Series analysis for online advertising impact.
    • Common Issue: Time-based confounds (seasonality, underlying system changes). Repeated interventions help.
    • User Experience Concern: Flipping experience back and forth might irritate users, confounding the effect.
  • Interleaved Experiments:
    • Design: Commonly used for evaluating ranking algorithms (e.g., search engines). Algorithms X and Y are mixed in a single display (e.g., x1, y1, x2, y2), with duplicate results removed. Click-through rates for results from each algorithm are compared.
    • Limitation: Only applicable when results are homogenous and don’t significantly impact other parts of the page (e.g., if x1 takes up too much space, it complicates analysis).
  • Regression Discontinuity Design (RDD):
    • Design: Used when there’s a clear threshold that defines Treatment assignment (e.g., scholarship for 80% grade, legal drinking age of 21). Assumes populations just above and below the threshold are similar.
    • Example: Mortality around the legal drinking age, comparing deaths just after the 21st birthday with those at ages 20 and 22.
    • Key Issue: Confounding factors that share the same threshold (e.g., legal gambling age also at 21).
    • Relevance to Software: Often applies when algorithms generate a score with a threshold.
  • Instrumental Variables (IV) and Natural Experiments:
    • Design: Tries to approximate random assignment by identifying an “Instrument” that influences Treatment assignment but only affects the outcome through the Treatment (e.g., Vietnam war draft lottery for veteran earnings, charter school lotteries).
    • Natural Experiments: Occur organically (e.g., monozygotic twins for twin studies, notification queues in social networks affecting message delivery order).
    • Application: Estimating impact of notifications on engagement.
  • Propensity Score Matching (PSM):
    • Design: Constructs comparable Control and Treatment populations by matching units on a single constructed propensity score (the probability of receiving Treatment given observed covariates).
    • Application: Evaluating online ad campaign impact.
    • Key Concern: Only observed covariates are accounted for; unaccounted factors can lead to hidden biases. Pearl argues that PSM is only reliable under “strong ignorability” conditions, which are hard to verify. Some claim it can increase bias and inefficiency.
  • Difference in Differences (DID):
    • Design: Assumes “common trends” – groups may differ initially but move in parallel absent Treatment. Measures the change over time in the Treatment group minus the change over time in the Control group.
    • Application: Geo-based experiments (e.g., TV ads in one DMA vs. another).
    • Example: Impact of minimum wage change in New Jersey vs. Eastern Pennsylvania.
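Of these designs, difference in differences is the simplest to compute; a sketch with hypothetical numbers under the common-trends assumption:

```python
def diff_in_diff(treat_pre: float, treat_post: float,
                 control_pre: float, control_post: float) -> float:
    """Effect = change in Treatment minus change in Control, assuming
    both groups would have moved in parallel absent the Treatment."""
    return (treat_post - treat_pre) - (control_post - control_pre)

# Hypothetical geo test: TV ads run in the Treatment DMA only.
print(diff_in_diff(treat_pre=100.0, treat_post=112.0,
                   control_pre=98.0, control_post=103.0))  # 7.0
```

The subtraction removes both the fixed level difference between the groups and the shared time trend, leaving only the Treatment effect, provided the common-trends assumption holds.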

Common Pitfalls in Observational Causal Studies

Observational causal studies carry many pitfalls, leading to lower trustworthiness:

  • Unanticipated Confounds: The main pitfall. Factors that impact both the Treatment assignment and the outcome, leading to spurious correlations.
    • Unrecognized Common Cause: A third variable that causes both the Treatment and the outcome (e.g., small palm size and longer life expectancy are both caused by gender; more errors and lower churn in Office 365 are both caused by higher usage).
  • Spurious or Deceptive Correlations:
    • Deceptive Correlations: Often caused by strong outliers, or intentional misrepresentation (e.g., energy drink sales correlating with athletic performance, implying causality when none exists).
    • Spurious Correlations: Purely coincidental correlations that can almost always be found when testing many hypotheses (e.g., people killed by venomous spiders correlating with word length in the National Spelling Bee).
  • Implicit Assumptions: Quasi-experimental methods rely on many assumptions that are often impossible to test and can easily be wrong. These incorrect assumptions lead to a lack of internal validity and can impact external validity.

The chapter concludes that while observational causal studies are sometimes the only option, they require immense care due to their inherent limitations. The scientific gold standard for establishing causality remains the controlled experiment.

Chapter 12: Client-Side Experiments

This chapter focuses on the unique considerations and challenges of running experiments on thick clients, such as native mobile or desktop applications, in contrast to thin clients like web browsers. Understanding these differences is crucial for ensuring trustworthy experiments in mobile-first or app-centric environments.

Differences between Server and Client Side

The primary distinctions lie in the release process and data communication:

Difference #1: Release Process

  • Server-Side Experiments (Webpages): New features are released continuously (multiple times/day). Changes are controlled by the organization, deployed instantaneously, and immediately visible to users without user action.
  • Client-Side Experiments (Thick Clients like Mobile Apps):
    • App Store Involvement: Release requires submission to and review by app stores (e.g., Google Play, Apple App Store), which can take days.
    • User Adoption: Users must actively update the app; they can delay or ignore updates. This means multiple app versions are simultaneously in use.
    • Staged Rollouts: App stores support staged rollouts (releasing to a percentage of users), which are essentially randomized experiments but cannot be analyzed as such by app owners directly, as they only know who adopted, not who was eligible.
    • Infrequent Updates: App owners may avoid frequent client updates due to user annoyance, network bandwidth costs, and reboot requirements (e.g., Windows updates).

Difference #2: Data Communication between Client and Server

Thick clients face unique challenges in communicating data:

  • Internet Connectivity: Connections can be unreliable (offline for days, temporary dead zones). This delays receiving new configurations and sending telemetry.
  • Cellular Data Bandwidth: Limited data plans mean telemetry is often sent over Wi-Fi only, delaying data arrival server-side.
  • Device Performance Impact: Network communication affects battery, CPU, latency, and memory/storage.
    • Battery Drain: More communication increases battery consumption.
    • CPU, Latency, Performance: Frequent data aggregation/communication can make the app less responsive.
    • Memory and Storage: Caching data to reduce network usage impacts app size, potentially leading to uninstalls.
      These tradeoffs affect both visibility into client-side behavior and user engagement.

Implications for Experiments

These differences have significant implications for how client-side experiments are designed, deployed, and analyzed:

Implication #1: Anticipate Changes Early and Parameterize

Because client code releases are slow, all experiment variants must be pre-coded and shipped with the app build.

  1. Feature Flags (Dark Features): Unfinished features are shipped “dark” (off by default), controlled by server-side configuration parameters, and turned on when ready.
  2. Server-Side Configurability: More features are built to be configurable from the server, allowing A/B testing and instant rollback if a feature performs poorly, bypassing the lengthy client release cycle.
  3. Fine-Grained Parameterization: Extensive use of parameters (e.g., number of feed items to fetch, ML model parameters) allows creating new variants without a client release (e.g., Windows 10 experiments on the search-box hint text increased engagement and added millions of dollars in revenue).
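A minimal sketch of the feature-flag pattern described above; the parameter names and structure are illustrative:

```python
# Variants ship "dark" in the client build; a server-delivered config
# activates them, so no app-store release is needed to start or stop a test.
DEFAULT_CONFIG = {
    "new_feed_ui": False,        # dark feature, off by default
    "feed_items_to_fetch": 10,   # fine-grained parameter, tunable server-side
}

def effective_config(server_config=None):
    """Apply server-delivered overrides; unknown keys are ignored and a
    missing/unreadable config falls back to the shipped defaults."""
    config = dict(DEFAULT_CONFIG)
    if server_config:
        config.update({k: v for k, v in server_config.items() if k in config})
    return config

# The server can start an experiment (or roll it back) with no app release:
print(effective_config({"new_feed_ui": True}))
```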

Implication #2: Expect Delayed Logging and Effective Starting Time

  • Configuration Delays: Devices may be offline or in low-bandwidth areas, delaying receipt of new experiment configurations.
  • Session Consistency: New assignments may not take effect until the next user session to avoid interrupting current experience. This delays experiment start for light users.
  • App Version Adoption Lag: Initial adoption of new app versions takes time (weeks).
    These delays mean:
  • Weak Initial Signals: Experiment data will initially appear weaker because samples are smaller and biased (over-representing frequent users and those on Wi-Fi).
  • Extended Duration: Experiments may need to run longer to account for delays.
  • Time Period Alignment: Careful selection of comparison periods is needed if Control and Treatment have different effective start times (e.g., shared Control already live).

Implication #3: Create a Failsafe to Handle Offline or Startup Cases

  • Cached Assignments: Experiment assignments should be cached for offline use to ensure consistency.
  • Default Variants: Have a default variant if the server is unresponsive.
  • OEM Agreements & First-Run Experience: Experiments must be set up properly for first-time app runs, ensuring stable randomization IDs before and after user sign-in.
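A sketch of the failsafe logic: cache the last known assignments and fall back to a default variant when the server cannot be reached (the path, names, and error handling are hypothetical):

```python
import json
import os

CACHE_PATH = "/tmp/experiment_assignments.json"  # hypothetical cache location
DEFAULT_VARIANT = "control"

def get_variant(experiment: str, fetch_from_server) -> str:
    """Prefer a fresh server assignment; fall back to the cached copy,
    then to a safe default, so the experience stays consistent offline."""
    try:
        assignments = fetch_from_server()        # may raise when offline
        with open(CACHE_PATH, "w") as f:         # refresh the offline cache
            json.dump(assignments, f)
    except OSError:
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH) as f:
                assignments = json.load(f)
        else:
            assignments = {}                     # first run, nothing cached
    return assignments.get(experiment, DEFAULT_VARIANT)
```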

Implication #4: Triggered Analysis May Need Client-Side Experiment Assignment Tracking

  • Over-triggering: If experiment assignment info is fetched for all active experiments at app start, relying on this for triggered analysis can lead to overcounting (not all features are used).
  • Feature-Specific Tracking: Sending assignment info only when a feature is actually used requires client-side instrumentation but can increase communication volume, potentially causing performance issues.

Implication #5: Track Important Guardrails on Device and App Level Health

Beyond user engagement, monitor device/app-level health:

  • CPU/Battery Consumption: Treatment might consume more CPU and drain battery.
  • Notification Disablement: Increased push notifications might lead users to disable them in device settings.
  • App Size: Larger app size can reduce downloads and increase uninstalls.
  • Internet Bandwidth Consumption: Impact on user data plans.
  • Crash Rate: Critical for client software. Logging clean exits helps track crashes on next app start.
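The clean-exit trick mentioned for crash-rate tracking can be sketched with a sentinel file (the path is illustrative): a crash leaves the sentinel behind, so the next start detects it.

```python
import os

SENTINEL = "/tmp/app_session.sentinel"  # illustrative sentinel path

def app_start() -> bool:
    """Return True if the previous session crashed (a clean exit never
    removed the sentinel), then arm a sentinel for this session."""
    crashed = os.path.exists(SENTINEL)
    open(SENTINEL, "w").close()
    return crashed

def app_exit() -> None:
    """A clean shutdown removes the sentinel."""
    if os.path.exists(SENTINEL):
        os.remove(SENTINEL)
```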

Implication #6: Monitor Overall App Release through Quasi-experimental Methods

  • Bundling two full app versions for true A/B testing is impractical due to app size.
  • However, the natural staggered adoption of new app versions creates a quasi-experimental setting. Techniques can be used to remove adoption bias and analyze the overall app release.

Implication #7: Watch Out for Multiple Devices/Platforms and Interactions between Them

  • Different IDs: Same user on different devices/platforms may have different IDs, leading to randomization into different variants.
  • Cross-Device Interactions: Features like “Continue on desktop” or directing traffic between mobile app and web can cause unintended interactions.
  • Holistic View: Analysis must consider user behavior holistically across all platforms, as performance on one platform (e.g., app) might impact overall engagement if traffic is shifted.

The chapter concludes that while client-side experiments present unique challenges due to release processes and data communication, understanding these differences and implementing proper designs and monitoring is crucial for trustworthy results in the evolving technological landscape.

Chapter 13: Instrumentation

This chapter emphasizes that high-quality instrumentation is a fundamental prerequisite for running any trustworthy online controlled experiment. It’s also essential for understanding a system’s baseline performance and how users interact with it.

The Foundation of Data: Instrumentation

Before any experiment can be run, the system (website, application) must be instrumented to log what is happening to users and the system itself. This includes:

  • User Actions: What users see and do (e.g., clicks, hovers, scrolls, form field errors, slideshow navigation).
  • User Interactions: The timing of these actions, especially those without a server roundtrip.
  • System Performance: How long pages/apps take to display, server response times, component performance.
  • System Responses: Number of requests received, pages served, retry handling.
  • System Information: Exceptions, errors, cache hit rates.

The terms “instrument,” “track,” and “log” are used interchangeably. Privacy (Chapter 9) is a crucial consideration for all instrumentation.

Client-Side vs. Server-Side Instrumentation

Understanding the distinctions between client-side and server-side instrumentation is vital:

Client-Side Instrumentation

  • Focus: What the user experiences, sees, and does (user actions, perceived performance, errors, crashes).
  • Example: Detecting malware that overwrites server-sent content (observable only on the client).
  • Drawbacks:
    1. Resource Consumption: Can use significant CPU, network bandwidth, and deplete battery, negatively impacting user experience (e.g., large JavaScript snippets increasing load time, affecting user retention).
    2. Lossiness: JavaScript-based instrumentation (e.g., web beacons for click tracking) can be lossy:
      • Race Conditions: Beacons may be canceled if a new page loads too quickly. Loss rate varies by browser.
      • Synchronous Redirects: Forcing beacons to send before page load reduces lossiness but increases latency and may cause user abandonment. Chosen for high-fidelity needs like ad clicks.
    3. Clock Skew: Client clocks can be inaccurate, causing discrepancies with server times. Never subtract client and server times directly.

Server-Side Instrumentation

  • Focus: What the system does and why (internal processes, detailed performance metrics, internal scores/rankings for algorithms, server locations).
  • Advantages:
    • Accuracy: Less impacted by network issues, often providing lower variance data.
    • Granularity: More detail on internal system behavior (e.g., time to generate HTML, reasons for specific search results, cache hit rates).
    • Stability: Less prone to lossiness or user-side interference.
  • Considerations: Server clocks must be synchronized. Mismatches can occur if a request is served by one server and logged by another.

Processing Logs from Multiple Sources

Modern systems collect logs from various streams (client types, servers, user states). To be useful, these logs must be combined and processed:

  1. Sort and Group:
    • Join logs from different sources (client, server) using a common identifier (e.g., user ID, randomization unit ID) to group events.
    • Sort by user ID and timestamp to create sessions or group activities within time windows.
    • Can be materialized (stored as a combined table) or virtual (joined during processing). Materialization is useful for debugging and hypothesis generation.
  2. Clean the Data:
    • Use heuristics to remove non-real users (bots, fraud) based on activity patterns (too much/too little activity, too little time between events).
    • Fix instrumentation issues (duplicate events, incorrect timestamps).
    • Be aware that data cleansing can’t fix missing events due to lossiness.
    • Crucial Warning: Some filtering can unintentionally remove more events from one variant than another, causing a Sample Ratio Mismatch (SRM) (see Chapter 3).
  3. Enrich the Data:
    • Parse raw data to create useful dimensions (e.g., browser family/version from user agent string, day of week from dates).
    • Compute useful measures (e.g., event duration, total events per session).
    • Add experiment-specific annotations (e.g., whether to include a session in results, experiment transition info like ramp-up). These annotations can incorporate business logic for performance.
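These three steps can be sketched as a toy pipeline (hypothetical event records; production systems do this at terabyte scale):

```python
from collections import defaultdict

# Hypothetical raw events from client- and server-side streams (toy data).
events = [
    {"user": "u1", "ts": 100, "source": "server", "action": "pageview"},
    {"user": "u1", "ts": 103, "source": "client", "action": "click"},
    {"user": "u2", "ts": 50, "source": "server", "action": "pageview"},
    {"user": "u2", "ts": 50, "source": "server", "action": "pageview"},  # duplicate
    {"user": "bot1", "ts": 1, "source": "client", "action": "click"},
]

SUSPECTED_BOTS = {"bot1"}  # stand-in for real bot/fraud heuristics

def cook(raw_events):
    # 1. Sort and group: order by the common identifier, then timestamp.
    by_user = defaultdict(list)
    for e in sorted(raw_events, key=lambda e: (e["user"], e["ts"])):
        by_user[e["user"]].append(e)

    cooked = {}
    for user, evs in by_user.items():
        # 2. Clean: drop suspected non-real users and exact duplicate events.
        if user in SUSPECTED_BOTS:
            continue
        seen, deduped = set(), []
        for e in evs:
            key = (e["ts"], e["source"], e["action"])
            if key not in seen:
                seen.add(key)
                deduped.append(e)
        # 3. Enrich: add computed measures (here, total events per user).
        cooked[user] = {"events": deduped, "n_events": len(deduped)}
    return cooked

data = cook(events)
```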

Fostering a Culture of Instrumentation

The most challenging aspect is getting engineers to prioritize and implement instrumentation. It’s like flying a plane with broken instruments – unsafe, even if the plane still flies.

  • Cultural Norm: “Nothing ships without instrumentation.” Make it part of the feature specification and give broken instrumentation the same priority as broken features.
  • Invest in Testing: Engineers should test their instrumentation during development, and code reviewers should check it.
  • Monitor Raw Logs for Quality:
    • Check event counts by key dimensions.
    • Validate invariants (e.g., timestamps within range).
    • Implement tools to detect outliers.
    • Ensure prompt bug fixes for instrumentation issues across the organization.

A strong instrumentation culture ensures that engineers have the data needed to understand system performance and user behavior, and that experiments can be analyzed with high trust.

Chapter 14: Choosing a Randomization Unit

This chapter underscores that the choice of randomization unit is a critical decision in experiment design, deeply impacting both the user experience and the validity of metric analysis. It provides guidance on available options and considerations for making this choice.

The Importance of the Randomization Unit

The randomization unit is the base identifier to which a variant is assigned, and it’s also often used as a join key for processing log files (Chapter 13, Chapter 16). The choice affects:

  • User Experience Consistency: How frequently a user’s experience might change within a single session or across visits.
  • Metric Analysis: Which metrics can be reliably used and how their variance is computed.

The primary axis for consideration is granularity. For websites, common granularities include:

  • Page-level: Each new webpage view is a unit.
  • Session-level: All pageviews within a single visit (e.g., defined by 30 mins inactivity).
  • User-level: All events from a single user (identified by cookies, login IDs). This is the most common for online audiences.

Other granularities mentioned:

  • Query-level: For search engines (between page and session).
  • User-day: Events from the same user on different days are different units.

Key Questions for Choosing Granularity

Two main questions guide the choice:

  1. How important is the consistency of the user experience?
    • If the change is highly noticeable (e.g., font color, new feature appearing/disappearing), a coarser granularity (e.g., user-level) is preferred to avoid inconsistent and potentially frustrating user experiences.
  2. Which metrics matter?
    • Finer granularity (e.g., page-level) creates more units, leading to smaller variance of the mean and higher statistical power to detect smaller changes. However, this has trade-offs:
      • Feature Dependence: If features act across granularity levels (e.g., personalization, inter-page dependencies), finer granularity is invalid. A user’s first query in Treatment impacting a second query in Control can confound results.
      • Metric Compatibility: Metrics computed across that level of granularity (e.g., total sessions per user) cannot be used if randomization is at a finer level (e.g., page-level). You cannot measure user-level impact over time.
      • SUTVA Violation: Exposing users to different variants within a short period (e.g., page-to-page) might cause users to notice and modify their behavior, violating the Stable Unit Treatment Value Assumption (SUTVA) (Chapter 3).

Randomization Unit and Analysis Unit

A general recommendation is that the randomization unit should be the same as (or coarser than) the analysis unit for the metrics you care about.

  • Matching Units: If randomization and analysis units match (e.g., user-level randomization for sessions-per-user), computing variance is straightforward as samples are i.i.d. (independent and identically distributed).
  • Coarser Randomization, Finer Analysis: Randomizing by user but analyzing by page-level metrics (e.g., click-through rate per page) is possible. This requires more nuanced analysis methods like bootstrap or the delta method (Chapter 18) to correctly compute variance due to within-user correlation. This can also be skewed by bots.
  • Finer Randomization, Coarser Analysis (NOT Recommended): Randomizing by page but analyzing by user-level metrics (e.g., sessions-per-user) is problematic. The user’s experience is mixed across variants, making user-level metrics meaningless for evaluating the experiment. If these are part of your OEC, you cannot use finer granularity.
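The book does not prescribe an implementation here, but user-level assignment is commonly done by hashing the randomization ID with a per-experiment salt; the function and salt below are hypothetical:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, pct_treatment: int = 50) -> str:
    """Deterministically map a randomization unit to a variant.

    Hashing the ID with a per-experiment salt keeps a user's assignment
    stable across sessions and independent across experiments."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 100 equal-probability buckets
    return "treatment" if bucket < pct_treatment else "control"

# The same user always lands in the same variant of a given experiment.
v1 = assign_variant("user-123", "exp-42")
v2 = assign_variant("user-123", "exp-42")
```

Because the hash is deterministic, a user's experience stays consistent across visits, which is exactly the property user-level randomization is chosen for.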

User-Level Randomization: The Most Common Choice

User-level randomization is the most common because it avoids inconsistent user experiences and allows for long-term measurements (e.g., user retention). Within user-level, there are choices for user identification:

  • Signed-in User ID/Login:
    • Scope: Stable across devices and platforms.
    • Longevity: Typically persistent longitudinally.
    • Best Choice: Ideal when consistency and long-term measurement are critical.
  • Pseudonymous User ID (e.g., Cookie):
    • Scope: Not persistent across devices/platforms (e.g., desktop browser vs. mobile web).
    • Longevity: Less persistent than signed-in IDs (cookies can be erased, browser settings can limit).
    • Use Case: Effective for testing processes that cut across sign-in boundaries (e.g., new user onboarding).
  • Device ID:
    • Scope: Immutable ID tied to a specific device. Not cross-device.
    • Longevity: Stable longitudinally.
    • Ethical Consideration: Considered identifiable.
  • IP Address (NOT Recommended Generally):
    • Use Case (Rare): Only for infrastructure changes (e.g., comparing hosting services) controlled at the IP level.
    • Drawbacks:
      • Inconsistency: User’s IP can change (e.g., home vs. work).
      • Granularity Variation: Many users can share one IP (e.g., large companies), leading to low statistical power and outlier issues.

The authors conclude that randomization at a sub-user level is only useful if there are no concerns about carryover or leakage from the same user (Chapter 22), and the success metrics are also at the sub-user level (e.g., clicks-per-page, not clicks-per-user). It’s often chosen for increased statistical power from larger sample sizes.

Chapter 15: Ramping Experiment Exposure: Trading Off Speed, Quality, and Risk

This chapter discusses the crucial practice of ramping experiment exposure – gradually increasing traffic to new Treatments. It highlights how this process, while designed to control risk, can also introduce inefficiency if not done in a principled way. The core challenge is balancing speed, quality, and risk (SQR) throughout the ramp.

What Is Ramping?

In practice, experiments don’t just start at a fixed traffic allocation. Instead, they go through a ramping process:

  • Gradual Increase: A new feature begins with a small percentage of users, then gradually increases traffic (e.g., 1% -> 5% -> 10% -> 50% -> 100%).
  • Purpose: To control unknown risks associated with new feature launches (e.g., preventing a disaster like Healthcare.gov’s initial 100% rollout failure).
  • Challenge: Ramping too slowly wastes time and resources; ramping too quickly increases risk.
  • Goal: Guide experimenters and automate the process to enforce principles at scale.

This chapter primarily focuses on ramping up; ramping down is typically done quickly to zero for bad Treatments. Client-side updates may exclude some users from ramps.

SQR Ramping Framework

The SQR framework balances speed, quality, and risk based on the primary goal of each ramp phase.

  • Why Run Experiments?
    • Measure Impact/ROI: Quantify the effect if launched to 100%.
    • Reduce Risk: Minimize damage/cost to users and business from negative impacts.
    • Learn: Understand user reactions, identify bugs, inform future plans.
  • Maximum Power Ramp (MPR): For measurement-only experiments, the optimal allocation is often 50% Treatment / 50% Control, providing the highest statistical sensitivity and fastest, most precise measurement.
  • Beyond MPR: Intermediate stages (e.g., 75%) might be needed for operational scaling.
  • Long-Term Holdout: A small fraction (5-10%) of users might be withheld from Treatment for an extended period (e.g., 2 months) to learn about long-term sustainability (see Chapter 23).
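Why is 50/50 the Maximum Power Ramp? The variance of the difference of means is proportional to 1/n_Treatment + 1/n_Control, which is minimized by an equal split. A small sketch with assumed numbers:

```python
import math

N = 100_000   # total users in the experiment (illustrative)
sigma = 1.0   # per-user standard deviation, assumed equal in both arms

def se_of_delta(pct_treatment: float) -> float:
    """Standard error of the Treatment-minus-Control difference in means."""
    n_t = N * pct_treatment
    n_c = N * (1 - pct_treatment)
    return math.sqrt(sigma ** 2 / n_t + sigma ** 2 / n_c)

# Sensitivity is highest (standard error smallest) at a 50/50 split;
# a 5/95 split is exactly as weak as a 95/5 split.
ses = {p: se_of_delta(p) for p in (0.05, 0.25, 0.50, 0.75, 0.95)}
```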

Four Ramp Phases

The SQR framework divides the ramp process into four phases, each with a primary goal:

Ramp Phase One: Pre-MPR (Risk Mitigation)

  • Goal: Safely determine that risk is small and ramp quickly to MPR.
  • Methods:
    1. Testing Rings: Gradually expose Treatment to successive populations:
      • Whitelisted Individuals: The immediate feature team (qualitative feedback).
      • Company Employees: More forgiving for bugs.
      • Beta Users/Insiders: Vocal, loyal, early adopters willing to give feedback.
      • Single Data Centers: Isolate and identify challenging interactions (e.g., memory leaks, resource misuse). Common for Bing.
      • Bias: Measurements from early rings can be biased (e.g., insiders).
    2. Automated Dialing: Gradually increase traffic (e.g., over an hour) to the desired allocation (e.g., 5%) to limit bad bug impact.
    3. Real-time/Near-Real-Time Monitoring: Provide quick measurements on key guardrail metrics to enable rapid decision-making for the next ramp phase.

Ramp Phase Two: MPR (Measurement)

  • Goal: Precisely measure the experiment’s impact.
  • Duration: Keep experiments at MPR for a minimum of one week, longer if novelty or primacy effects are present.
  • Purpose: Captures time-dependent factors (e.g., day-of-week effects, heavy vs. light users).
  • Precision: While longer runs reduce variance, there’s diminishing return after about a week if no trends are observed.

Ramp Phase Three: Post-MPR (Operational Concerns)

  • Goal: Address remaining operational concerns (e.g., scaling services to increasing traffic load).
  • Duration: Short ramps (a day or less), covering peak traffic periods with close monitoring.
  • Context: Assumes no end-user impact concerns remain from earlier phases.

Ramp Phase Four: Long-Term Holdout or Replication (Learning)

  • Goal: Learn about long-term impact, sustainability, and potential delayed effects.
  • Caution: Not a default step. Can be unethical if Treatment is known to be superior and customers pay equally.
  • Useful Scenarios for Long-Term Holdout:
    1. Long-term Treatment effect differs from short-term: Due to novelty/primacy, large short-term impact needing sustainability verification, or delayed effects (adoption/discoverability).
    2. Early indicator metrics show impact: But the true-north metric is long-term (e.g., one-month retention).
    3. Benefit of variance reduction: For holding longer (Chapter 22).
  • Traffic Allocation: If short-term impact is too small to detect at MPR, continue holdout at MPR; don’t dilute sensitivity by going to 90%+ Treatment.
  • Uber Holdouts: Some companies withhold a portion of traffic from any feature launch for a long term (e.g., a quarter) to measure cumulative impact. Bing has a global holdout (10% of users) to measure experimentation platform overhead.
  • Reverse Experiments: Put users back into Control weeks/months after 100% launch to measure long-term learned effects (Chapter 23).
  • Replication: Crucial for surprising results. Rerun experiments with different users or orthogonal re-randomization to build confidence and reduce multiple-testing bias.

Post Final Ramp

After an experiment is ramped to 100% (or shut down), cleanup is needed:

  • Code Fork Cleanup: If using an architecture that creates code forks (Chapter 4), remove dead code paths to prevent technical debt and accidental execution of unmaintained code.
  • Parameter System Default: If using a parameter system, make the new parameter value the default.

This cleanup ensures a healthy and maintainable production system.

Chapter 16: Scaling Experiment Analyses

This chapter delves into the practicalities of building an automated and scalable data analysis pipeline within an experimentation platform. This is crucial for organizations moving into the “Run” or “Fly” maturity phases, ensuring trustworthy results, consistency, and timely decision-making.

Data Processing: “Cooking the Data”

The first step is to transform raw instrumented data into a usable state for computation:

  1. Sort and Group:
    • Join Multiple Logs: Combine data from different client-side and server-side instrumentation streams (Chapter 13).
    • Common Identifier: Use a consistent join key (user ID, randomization unit) to link events for the same user.
    • Sorting: Order by user ID and timestamp to facilitate session creation and activity grouping.
    • Materialization: The joined data can be materialized (stored as a combined table) for debugging and hypothesis generation, or virtually joined during processing.
  2. Clean the Data:
    • Heuristics for Non-Real Users: Remove bots or fraudulent activity (Chapter 3) using rules for session length, activity intensity, or event timing.
    • Instrumentation Issue Correction: Fix errors like duplicate event detection or incorrect timestamp handling.
    • Important Caveat: Data cleansing cannot fix missing events due to lossy data collection. Be cautious if filtering disproportionately affects one variant, potentially causing a Sample Ratio Mismatch (SRM) (Chapter 3).
  3. Enrich the Data:
    • Parsing: Extract useful dimensions (e.g., browser family/version from user agent string, day of week from dates).
    • Computing Measures: Add new metrics like event duration, total events per session.
    • Experiment-Specific Annotations: Mark sessions for inclusion in experiment results based on business logic (e.g., experiment transitions like start times, ramp-up changes, version numbers). These are often added for performance reasons.
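The SRM caveat above can be checked automatically with a chi-squared goodness-of-fit test. A minimal two-variant sketch using only the standard library:

```python
import math

def srm_pvalue(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """Chi-squared goodness-of-fit test (1 degree of freedom) comparing the
    observed split against the configured traffic allocation."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 df, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# A 50.2%/49.8% split over a million users looks tiny but fails the SRM test.
p_bad = srm_pvalue(502_000, 498_000)  # far below 0.001: do not trust the scorecard
p_ok = srm_pvalue(500_300, 499_700)   # consistent with chance
```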

Data Computation

Once data is processed, the next step is calculating segments and metrics, then aggregating results into summary statistics (e.g., Treatment effect, p-value, confidence interval).

Architectural Approaches for Data Computation:

  1. Materialize Per-User Statistics:
    • Method: For every user, compute and store statistics (e.g., pageviews, impressions, clicks). This table is then joined with a user-to-experiment mapping.
    • Advantage: Per-user statistics can be used for overall business reporting, not just experiments.
    • Flexibility: Allows computing metrics/segments only needed for specific experiments.
  2. Fully Integrate Computation:
    • Method: Per-user metrics are computed on-the-fly within the experiment analysis pipeline, not materialized separately.
    • Advantage: More flexibility per-experiment, potentially saving machine/storage resources.
    • Consistency: Requires mechanisms to share metric/segment definitions across pipelines (experiment vs. business reporting).

Scaling and Efficiency:

  • Terabyte-Scale Processing: Companies like Bing, LinkedIn, and Google process terabytes of experiment data daily.
  • Near Real-Time (NRT) Paths: Essential for quick anomaly detection and stopping bad experiments (e.g., misconfigured or buggy). NRT paths often use simpler metrics and operate on raw logs.
  • Batch Processing: Handles intra-day computations and updates for trustworthy, timely results.

Ensuring Correctness and Trustworthiness:

  • Standardized Vocabulary: Define common metrics and definitions across the organization to ensure consistent understanding and reduce re-litigation of definitions.
  • Consistent Implementation: Ensure definitions are implemented uniformly, or use testing/comparison mechanisms to verify.
  • Change Management: Plan for metric evolution (OEC, segments) and how to propagate changes (e.g., backfilling historical data).

Results Summary and Visualization

The ultimate goal is to create clear, actionable visualizations for decision-makers:

  • Trustworthiness Indicators: Highlight key tests like SRM failures prominently (e.g., Microsoft ExP hides scorecards if SRM fails).
  • Metric Display: Show OEC and critical metrics first, but also include guardrails, quality, and other metrics.
  • Relative Change and Statistical Significance: Present results as relative changes, using color-coding and filters to make significant changes stand out.
  • Segment Drill-Downs: Enable exploration of results by segments, with automated highlighting of “interesting” segments.
  • Triggered Impact: If an experiment has triggering conditions, include the overall diluted impact in addition to the impact on the triggered population (Chapter 20).

Cultivating a Data-Driven Culture through Visualization:

  • Accessibility: Dashboards should be understood by all stakeholders (marketers, data scientists, engineers, product managers, executives). Some debug metrics can be hidden for less technical audiences.
  • Common Language: Promotes a shared understanding of definitions and fosters transparency and curiosity.
  • Per-Metric Views: Allow stakeholders to monitor the global health of key metrics and see which experiments are most impactful.
  • Approval Processes: If an experiment has negative impacts, the platform can initiate an approval process, forcing conversation with metric owners before ramp-up.
  • Institutional Memory Gateway: Visualization tools provide access to past experiments, decisions, and learnings (Chapter 8).

Scaling Metrics (Thousands of Metrics):

  • Categorization: Group metrics by tier (company-wide, product-specific, feature-specific) or function (Data quality, OEC, Guardrail, Local features/diagnostic).
  • Multiple Testing: As the metric count grows, the risk of false positives increases. Options include using smaller p-value thresholds (e.g., 0.01, 0.001) for less important metrics, or applying more sophisticated methods like the Benjamini-Hochberg procedure (Chapter 17).
  • Metrics of Interest: Automatically identify unexpectedly significant metrics by combining factors like company importance, statistical significance, and false positive adjustment.
  • Related Metrics: Show how a metric’s movement is explained by other related metrics (e.g., CTR up because clicks are up vs. page views are down), or provide more sensitive versions of high-variance metrics (e.g., trimmed revenue).

Chapter 17: The Statistics behind Online Controlled Experiments

This chapter provides a deeper dive into the statistical concepts fundamental to designing and analyzing controlled experiments, building upon the basic definitions introduced earlier in the book.

Two-Sample t-Test

The two-sample t-test is the most common statistical significance test used to determine if the observed difference between a Treatment and Control group is real or merely due to random noise.

  • Principle: It assesses the size of the difference between the two means (Treatment average and Control average) relative to the variance.
  • Null Hypothesis (H0): The means of the Treatment and Control groups are the same.
  • Alternative Hypothesis (HA): The means are different.
  • t-statistic (T): Calculated as Delta / sqrt(var(Delta)), where Delta is the difference between Treatment and Control averages. A larger absolute value of T suggests it’s less likely the means are the same.
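As a sketch, the t-statistic can be computed directly from per-user samples (simulated data with an assumed true effect of +0.2):

```python
import math
import random
import statistics

random.seed(7)
# Simulated per-user metric (e.g., sessions-per-user) with a true +0.2 effect.
control = [random.gauss(10.0, 2.0) for _ in range(5000)]
treatment = [random.gauss(10.2, 2.0) for _ in range(5000)]

delta = statistics.mean(treatment) - statistics.mean(control)
# var(Delta) for independent samples: s_t^2 / n_t + s_c^2 / n_c
var_delta = (statistics.variance(treatment) / len(treatment)
             + statistics.variance(control) / len(control))
t_stat = delta / math.sqrt(var_delta)  # T = Delta / sqrt(var(Delta))
```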

P-Value and Confidence Interval

  • P-value: The probability of obtaining a t-statistic as extreme as, or more extreme than, the observed one, assuming the Null hypothesis is true.
    • Common Misinterpretation: The p-value is not the probability that the Null hypothesis is true given the observed data. This requires Bayesian analysis.
    • Convention: A p-value < 0.05 is typically considered “statistically significant.”
  • Confidence Interval (CI): A range of values that, if the experiment were repeated many times, would contain the true difference 95% of the time (for a 95% CI).
    • Duality with P-value: For the Null hypothesis of no difference, a 95% CI that does not cross zero is equivalent to a p-value < 0.05.
    • Common Misunderstanding: Overlapping individual CIs for Treatment and Control do not necessarily mean the difference is not significant (they can overlap by up to 29% and still be significant). However, non-overlapping CIs do imply significance.
    • Misinterpretation: A specific 95% CI does not have a 95% chance of containing the true effect; once computed, the interval either contains the true effect or it does not. The 95% refers to the long-run frequency with which such intervals contain the true value.

Normality Assumption

The t-test assumes that the t-statistic (T) follows a normal distribution under the Null hypothesis.

  • Common Misconception: This is not an assumption that the sample distribution of the metric Y itself is normal. Many metrics in practice are not normally distributed.
  • Central Limit Theorem (CLT): For sufficiently large sample sizes (thousands in online experiments), the average of the metric (Y-bar) does approximate a normal distribution, regardless of the underlying metric’s distribution.
  • Skewness: The required sample size for the CLT to hold depends on the skewness of the metric’s distribution. Highly skewed metrics (like revenue) need larger sample sizes.
    • Rule of Thumb: Minimum samples needed ~ 355 * skewness^2.
    • Solution: Capping extreme values or transforming metrics (e.g., log transformation, binarization) can reduce skewness and the required sample size.
  • Two-Sample t-tests: The difference between two variables (Treatment and Control) with similar distributions tends to converge to normality faster, especially with equal traffic allocations.
  • Validation: For small sample sizes, offline simulations (permutation tests) can check if the p-value distribution is uniform under the Null hypothesis.
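A sketch of the skewness rule of thumb on an invented revenue-like metric, showing how capping reduces both skewness and the required sample size (the 355 * skewness^2 constant is the book's Chapter 17 rule of thumb):

```python
import math

def skewness(xs):
    """Sample skewness: third central moment over variance^1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def min_samples(xs):
    # Rule of thumb: ~355 * skewness^2 samples for the sample mean
    # to be approximately normal.
    return math.ceil(355 * skewness(xs) ** 2)

# A revenue-like metric: mostly zeros with a few very large values.
revenue = [0.0] * 950 + [20.0] * 45 + [500.0] * 5
capped = [min(x, 20.0) for x in revenue]  # capping reduces skewness

need_raw = min_samples(revenue)
need_capped = min_samples(capped)
```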

Type I/II Errors and Power

  • Type I Error: Concluding there is a significant difference when there is none (false positive). Controlled by the significance level (alpha, typically 0.05).
  • Type II Error: Concluding there is no significant difference when there is one (false negative).
  • Power: The probability of correctly detecting a difference when one truly exists (1 - Type II error).
    • Industry Standard: Typically 80% power.
    • Power Analysis: Conducted before an experiment to determine the required sample size to detect a practically significant difference (delta) with sufficient power. Formula: n ~ 16 * sigma^2 / delta^2.
    • Power is Relative: An experiment powered for a 10% difference might not detect a 1% difference. Analogized to “Spot the Difference” games.
  • Beyond Power: For small sample sizes, Gelman and Carlin suggest calculating:
    • Type S (Sign) Error: Probability of an estimate being in the wrong direction.
    • Type M (Magnitude) Error (Exaggeration Ratio): Factor by which the effect might be overestimated.
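The power formula translates directly into a sample-size calculator (a sketch; the sigma and delta values are invented):

```python
import math

def required_sample_size(sigma: float, delta: float) -> int:
    """Per-variant sample size for ~80% power at alpha = 0.05,
    using the rule of thumb n ~ 16 * sigma^2 / delta^2."""
    return math.ceil(16 * sigma ** 2 / delta ** 2)

# Detecting an absolute change of 0.1 in a metric with std dev 2
# takes 100x more users than detecting a change of 1.0.
n_small_effect = required_sample_size(sigma=2.0, delta=0.1)  # 6,400 per variant
n_large_effect = required_sample_size(sigma=2.0, delta=1.0)  # 64 per variant
```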

Bias

Bias arises when the estimate of the mean is systematically different from the true value. It can be caused by platform bugs, flawed experiment design, or unrepresentative samples (e.g., internal employees). Chapter 3 covers prevention and detection.

Multiple Testing

When many metrics are computed for an experiment (hundreds) or many experiments are run, the likelihood of finding a statistically significant result purely by chance increases. This is the “multiple testing problem.”

  • False Discoveries: If 100 independent metrics are tested at p<0.05, about 5 will appear significant by chance.
  • Solutions:
    • Bonferroni Correction: Simple but overly conservative (divides the significance threshold by the number of tests).
    • Benjamini-Hochberg Procedure: More sophisticated, uses varying p-value thresholds.
    • Rule-of-Thumb (Bayesian Interpretation):
      1. Categorize metrics into three groups: first-order (expected impact), second-order (potential impact), third-order (unlikely impact).
      2. Apply tiered significance levels (e.g., 0.05, 0.01, 0.001) based on the prior belief about the Null hypothesis for that metric.
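A sketch of the Benjamini-Hochberg procedure (the p-values below are invented):

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Indices of hypotheses rejected at false-discovery rate `fdr`.

    Sort p-values ascending; find the largest rank k with
    p_(k) <= (k / m) * fdr, and reject the k smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * fdr:
            k_max = rank
    return sorted(order[:k_max])

# Ten metrics: Bonferroni (0.05 / 10 = 0.005) would reject only the first,
# while Benjamini-Hochberg also rejects the second.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals)
```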

Fisher’s Meta-analysis

This method combines results from multiple independent experiments testing the same hypothesis.

  • Purpose: Increases overall statistical power and reduces false positives by aggregating evidence. Useful for replicating surprising results or overcoming underpowered individual experiments.
  • Method: X^2 = -2 * Sum(ln(pi)), where pi is the p-value from each of the k independent tests. Under the Null hypothesis, this combined statistic follows a chi-squared distribution with 2k degrees of freedom.
  • Extensions: Methods exist for combining non-independent p-values (e.g., Brown’s method).
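A sketch of Fisher's method using only the standard library (a closed-form chi-squared survival function works here because the degrees of freedom, 2k, are always even):

```python
import math

def chi2_sf(x, df):
    """Survival function P(X > x) for a chi-squared variable with even df.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!."""
    m = df // 2
    term, total = 1.0, 1.0
    for j in range(1, m):
        term *= (x / 2) / j
        total += term
    return math.exp(-x / 2) * total

def fishers_method(pvalues):
    """Combine k independent p-values: X^2 = -2 * sum(ln(p_i)) is
    chi-squared with 2k degrees of freedom under the Null."""
    x2 = -2.0 * sum(math.log(p) for p in pvalues)
    return x2, chi2_sf(x2, 2 * len(pvalues))

# Three individually borderline experiments combine into strong evidence.
x2, p_combined = fishers_method([0.08, 0.06, 0.09])
```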

Chapter 18: Variance Estimation and Improved Sensitivity: Pitfalls and Solutions

This chapter focuses on the crucial role of variance in experiment analysis. It details common pitfalls in estimating variance and provides techniques for variance reduction, which directly improves the sensitivity (statistical power) of hypothesis tests.

The Centrality of Variance

Variance is the core of experiment analysis, directly impacting:

  • Statistical Significance
  • P-value
  • Power
  • Confidence Interval

Incorrect variance estimation leads to wrong conclusions: overestimated variance causes false negatives, while underestimated variance causes false positives.

The standard formula for the variance of an average metric (Y_bar) from n independent and identically distributed (i.i.d.) samples is (sample variance) / n.

Common Pitfalls in Variance Estimation

Mistakes in estimating variance can invalidate experiment results:

Delta vs. Delta %

  • Relative Difference (Percent Delta): Often used for reporting, defined as Delta / Control_Average.
  • Common Mistake: Calculating variance of percent delta as var(Delta) / Control_Average^2. This is incorrect because Control_Average itself is a random variable.
  • Correct Method: Use the delta method (Equation 18.6) to estimate the variance of a ratio. This accounts for the variability of the denominator.
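The difference between the naive calculation and the delta method can be sketched on simulated data (invented numbers; the book's Equation 18.6 gives the general form):

```python
import math
import random
import statistics

random.seed(1)
control = [random.gauss(10.0, 2.0) for _ in range(5000)]
treatment = [random.gauss(10.5, 2.0) for _ in range(5000)]

mean_c, mean_t = statistics.mean(control), statistics.mean(treatment)
var_mean_c = statistics.variance(control) / len(control)
var_mean_t = statistics.variance(treatment) / len(treatment)

# Naive (incorrect): treats the Control average as a fixed constant.
var_naive = (var_mean_t + var_mean_c) / mean_c ** 2

# Delta method for the ratio T_bar / C_bar with independent groups:
# var(T/C) ~= var(T_bar) / C^2 + T^2 * var(C_bar) / C^4
var_delta_method = (var_mean_t / mean_c ** 2
                    + mean_t ** 2 * var_mean_c / mean_c ** 4)

pct_delta = mean_t / mean_c - 1  # relative (percent) difference
se_pct = math.sqrt(var_delta_method)
```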

Ratio Metrics: When Analysis Unit Differs from Experiment Unit

  • Problem: Many metrics are ratios of two underlying metrics (e.g., Click-Through Rate (CTR) = Clicks / Pageviews; Revenue-per-click = Revenue / Clicks).
  • Violation of i.i.d. Assumption: If the randomization unit is “user” but the analysis unit is “pageview” (for CTR), the assumption that samples are independent is violated (multiple pageviews come from the same user). The simple variance formula will be biased.
  • Correct Estimation:
    • Delta Method: Used for ratios of “average of user-level metrics.” (Equation 18.5)
    • Bootstrap Method: A powerful, broadly applicable technique for metrics not expressible as a ratio of user-level averages (e.g., 90th percentile of page load time). It estimates variance by repeatedly re-sampling the observed units with replacement and recomputing the metric on each re-sample. Computationally expensive but effective, especially for small sample sizes.
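A bootstrap sketch for a percentile metric (simulated page-load times):

```python
import random
import statistics

random.seed(3)
# Simulated page-load times in ms: right-skewed, and the metric of interest
# (90th percentile) is not a ratio of user-level averages.
load_times = [random.expovariate(1 / 300) for _ in range(2000)]

def p90(xs):
    """Empirical 90th percentile."""
    return sorted(xs)[int(0.9 * len(xs))]

def bootstrap_variance(xs, estimator, n_boot=500):
    """Simulate the sampling process: re-sample units with replacement,
    recompute the estimator each time, and take the variance."""
    estimates = [estimator(random.choices(xs, k=len(xs))) for _ in range(n_boot)]
    return statistics.variance(estimates)

var_p90 = bootstrap_variance(load_times, p90)
```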

Outliers

  • Impact: Outliers (e.g., bots, spam behaviors generating many clicks/pageviews) have a significant impact on both the mean and, even more so, on the variance.
  • Effect on Significance: A single large outlier can dramatically increase variance, causing a statistically significant result to become non-significant, even if the mean difference increases.
  • Solution: Remove outliers when estimating variance. A practical method is to cap observations at a reasonable threshold (e.g., capping user activity for search to 500 queries/day). Many other outlier removal techniques exist.
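
A minimal sketch of threshold capping; the 500 queries/day threshold echoes the book's search example, while the data values are invented:

```python
import numpy as np

def cap_metric(values, threshold):
    """Cap observations at a threshold before estimating the mean and
    variance, limiting the influence of extreme values such as bots."""
    return np.minimum(values, threshold)

queries_per_day = np.array([3.0, 7.0, 12.0, 4.0, 25_000.0])  # last value: a bot
capped = cap_metric(queries_per_day, 500.0)
```

Capping shrinks the variance far more than the mean, which is exactly why a single outlier can flip a result from significant to non-significant.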

Improving Sensitivity (Variance Reduction Techniques)

Improving sensitivity (or power) means increasing the ability to detect a true Treatment effect when it exists. This is primarily achieved by reducing variance:

  • Choose Metrics with Smaller Variance: Select evaluation metrics that capture similar information but naturally have lower variance (e.g., binary purchase indicator vs. actual purchase amount, number of searchers vs. number of searches).
  • Transform Metrics:
    • Capping: Limit extreme values (e.g., capping revenue-per-user to $10 reduced skewness and sample size needed for Bing).
    • Binarization: Convert continuous metrics to binary (e.g., Netflix using “user streamed more than X hours” instead of average streaming hours).
    • Log Transformation: Useful for heavy, long-tailed metrics, especially if interpretability of the logged value is not a concern.
  • Use Triggered Analysis (Chapter 20): Filter out noise from users who could not have been impacted by the Treatment, improving sensitivity by focusing on the affected population.
  • Use Stratification, Control-Variates, or CUPED:
    • Stratification: Divide sampling regions into strata (e.g., platform, browser type, day of week) and combine results. Post-stratification applies this retrospectively during analysis.
    • Control-Variates: Uses covariates as regression variables to reduce variance.
    • CUPED (Controlled-experiment Using Pre-Experiment Data): A specific application emphasizing using pre-experiment data to reduce variance (e.g., using a user’s past behavior to predict future behavior).
  • Randomize at a More Granular Unit:
    • Benefit: Increases sample size, potentially reducing variance (e.g., randomizing per pageview for page load time).
    • Disadvantages: Can lead to inconsistent user experiences for noticeable UI changes; impossible to measure user-level impact over time (e.g., user retention).
  • Design a Paired Experiment: Show the same user both Treatment and Control to remove between-user variability (e.g., interleaving ranked lists).
  • Pool Control Groups: If multiple experiments run concurrently, sharing a large Control group across all Treatments can increase power for each comparison. Considerations include instrumenting triggers on shared controls, and potential benefits of balanced variant sizes.
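
As an illustration of CUPED, here is a minimal sketch on simulated data (`cuped`, `pre`, and `post` are invented names): the adjustment preserves the metric's mean while shrinking its variance whenever pre-experiment and in-experiment values are correlated.

```python
import numpy as np

def cuped(y, x):
    """CUPED adjustment: Y_cuped = Y - theta * (X - mean(X)), where X is
    the same metric measured on the same users BEFORE the experiment and
    theta = cov(X, Y) / var(X). The mean is unchanged; the variance drops
    by a factor of (1 - corr(X, Y)^2)."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(7)
pre = rng.normal(10, 3, size=5_000)               # pre-experiment activity
post = 0.8 * pre + rng.normal(2, 3, size=5_000)   # in-experiment metric
adjusted = cuped(post, pre)
```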

Variance of Other Statistics

While the focus is often on the mean, other statistics like quantiles are crucial (e.g., 90th or 95th percentile for page-load-time).

  • Statistical Tests for Quantiles: Can use bootstrap methods (computationally expensive) or estimate variance asymptotically if the statistic follows a normal distribution.
  • Combination with Delta Method: For time-based metrics (event/page level) with user-level randomization, a combination of density estimation and the delta method is needed.

Chapter 19: The A/A Test

This chapter highlights the A/A test as a critical, perhaps the best, method for establishing trust in an experimentation platform. It’s a simple yet powerful diagnostic tool because, in practice, these tests often fail, revealing underlying issues.

What is an A/A Test?

An A/A test involves splitting users into two groups (A and A), just like an A/B test, but ensuring that both groups receive identical experiences.

  • Expected Outcome (Theoretical): If the system operates correctly, any given metric should appear statistically significant (p-value < 0.05) approximately 5% of the time, and the distribution of p-values from repeated A/A tests should be uniform.
  • Purpose: A/A tests are highly useful for:
    • Validating Type I Error Rates: Ensuring the actual false positive rate matches the expected significance level (e.g., 5%). Discrepancies point to problems with variance calculations or normality assumptions.
    • Assessing Metric Variability: Understanding how a metric’s variance changes over time and confirming expected reductions in variance.
    • Identifying Bias: Detecting systematic differences or biases between Treatment and Control groups, especially those introduced at the platform level (e.g., carry-over effects from prior experiments).
    • Comparing with System of Record: Validating that key metrics (users, revenue, CTR) from the A/A test match the overall system’s records.
    • Estimating Variances for Power Calculations: Providing empirical estimates of metric variances to determine sample sizes for future A/B tests.

The authors strongly recommend running continuous A/A tests in parallel with other experiments to proactively uncover distribution mismatches and platform anomalies.

Real-World Examples of A/A Test Failures

The chapter provides compelling examples of how A/A tests uncover critical bugs and misunderstandings:

  • Example 1: Analysis Unit Differs from Randomization Unit:
    • Problem: If randomization is by user, but analysis is for a page-level metric (e.g., CTR calculated as total clicks / total pageviews), the standard variance calculation (assuming i.i.d. samples) is incorrect because multiple pageviews come from the same user (correlated).
    • Discovery: A/A tests showed that this CTR metric was statistically significant far more often than 5%.
    • Solution: Use the delta method or bootstrapping (Chapter 18) to correctly compute variance.
  • Example 2: Optimizely Encouraged Stopping When Results Were Statistically Significant:
    • Problem: Early versions of Optimizely encouraged “peeking” (stopping an experiment early when a significant p-value appeared). This violates the assumption of a single test at the end, leading to a much higher false positive rate.
    • Discovery: Experimenters running A/A tests with Optimizely found many “false successes,” leading to articles like “How Optimizely (Almost) Got Me Fired.”
    • Solution: Optimizely subsequently updated its “Stats Engine” to account for sequential testing.
  • Example 3: Browser Redirects:
    • Problem: Implementing an A/B test by redirecting Treatment users to a new website version (A'). This approach is flawed because:
      1. Performance Differences: Redirects add latency.
      2. Bot Behavior: Bots handle redirects inconsistently.
      3. Contamination: Bookmarks and shared links cause users to bypass the initial randomization, contaminating groups.
    • Discovery: Redirects consistently fail A/A tests.
    • Solution: Implement server-side changes, or redirect both Control and Treatment symmetrically (though the added redirect latency then degrades Control as well).
  • Example 4: Unequal Percentages:
    • Problem: Running experiments with unequal traffic splits (e.g., 10% Treatment, 90% Control) can lead to issues, particularly with shared resources like Least Recently Used (LRU) caches. The larger variant gets more cache entries, potentially benefiting it.
    • Discovery: A/A tests for unequal splits often fail.
    • Solution: Run 50/50 A/A tests, and if running unequal splits in production, run A/A tests specifically for those allocations. Unequal percentages can also affect the rate of convergence to a normal distribution for highly skewed metrics.
  • Example 5: Hardware Differences:
    • Problem: Facebook ran an A/A test between an old and a new fleet of machines for a service. Even though hardware was thought to be identical, the A/A test failed.
    • Discovery: Small, unexpected hardware differences can lead to significant unmeasured differences.

How to Run A/A Tests

  • Simulation: Ideally, simulate thousands of A/A tests by:
    • Replaying Historical Data: Re-randomize users from a past period (e.g., the last week of raw data) multiple times, each with a new hash seed for user assignment.
    • P-value Distribution: For each simulated A/A test, compute p-values for all metrics of interest and plot their histograms.
    • Goodness-of-Fit: Use statistical tests (e.g., Anderson-Darling, Kolmogorov-Smirnov) to assess if p-value distributions are uniform. If not, the system is untrustworthy.
  • Benefits of Replay: Cost-effective, identifies many issues, though it won’t catch real-time performance or shared resource issues.
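
A hedged sketch of the replay idea: re-randomize one fixed set of user-level values many times and check that roughly 5% of t-test p-values fall below 0.05 (the data, seeds, and counts here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
metric = rng.lognormal(sigma=1.0, size=20_000)  # one "week" of skewed user data

pvals = []
for _ in range(500):                            # 500 simulated A/A tests
    in_a1 = rng.random(metric.size) < 0.5       # fresh re-randomization each time
    _, p = stats.ttest_ind(metric[in_a1], metric[~in_a1])
    pvals.append(p)

# In a healthy system the p-values are uniform on [0, 1], so roughly 5%
# should fall below 0.05; a goodness-of-fit test can formalize the check.
false_positive_rate = np.mean(np.array(pvals) < 0.05)
```

A histogram of `pvals` (or an Anderson-Darling / Kolmogorov-Smirnov test against the uniform distribution) is the usual diagnostic.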

When the A/A Test Fails

Common p-value scenarios that indicate A/A test failure:

  1. Skewed P-value Distribution (Not Uniform):
    • Cause: Incorrect variance estimation (Chapter 18), often due to violated independence assumptions (analysis unit differing from randomization unit), or highly skewed metric distributions for which the Normal approximation fails at the available sample size (some metrics need more than 100,000 users).
    • Solution: Deploy delta method/bootstrapping, cap metrics, or enforce minimum sample sizes.
  2. Large Mass around P-value of 0.32:
    • Cause: A single very large outlier in the data. The outlier skews the mean and, more significantly, inflates the variance, causing the t-statistic to cluster around +/- 1, which maps to a p-value around 0.32.
    • Solution: Investigate the outlier, or cap/remove the outlier (Chapter 18).
  3. Point Masses with Large Gaps:
    • Cause: Data is single-valued (e.g., mostly zeros) with very rare non-zero instances. The delta of means can only take a few discrete values, resulting in few possible p-values.
    • Severity: Less serious than other failures; if a Treatment causes the rare event to happen often, it will still show a large, significant effect.
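
The second failure mode (the mass near p = 0.32) is easy to reproduce: with one extreme value in otherwise well-behaved simulated data, the t-statistic lands near +/- 1 and the p-value near 0.32 (this example is not from the book):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10, 2, size=10_000)
b = rng.normal(10, 2, size=10_000)
a[0] = 1_000_000.0   # a single extreme outlier (e.g. a bot's click count)

t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
# The outlier inflates the mean gap and, even more so, the variance,
# pinning |t| near 1 -- which maps to p ~= 2 * (1 - Phi(1)) ~= 0.32.
```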

Even after an A/A test passes, regularly running concurrent A/A tests with A/B tests is recommended to catch regressions or new issues with metrics.

Chapter 20: Triggering for Improved Sensitivity

This chapter explores triggering, a powerful technique to enhance the sensitivity (statistical power) of experiments by filtering out users who could not have been impacted by the Treatment. It also highlights common pitfalls that can lead to incorrect results.

The Power of Triggering

Triggering involves analyzing only those users (or other randomization units) who have the potential to experience a difference between the variants. This removes noise from unaffected users, significantly improving statistical power. It’s crucial to log triggering events at runtime.

Examples of Triggering

The chapter illustrates triggering with increasing complexity:

  • Example 1: Intentional Partial Exposure:
    • Scenario: A change applied only to a segment (e.g., US users, Edge browser users, heavy users).
    • Trigger: Analyze only users within that specific segment. Users outside the segment have a zero Treatment effect and only add noise. Crucially, segment definitions must be based on data from before the experiment started to avoid bias.
  • Example 2: Conditional Exposure:
    • Scenario: A change to a specific part of the website (e.g., checkout process, Excel graph feature).
    • Trigger: Only analyze users who actually reached that part of the website or used that feature.
  • Example 3: Coverage Increase:
    • Scenario: Control offers free shipping for $35+ cart, Treatment offers it for $25+.
    • Trigger: Only users with carts between $25 and $35. Users with carts under $25 or at $35 and above have the same experience in both variants (assuming the change itself is not advertised to them). Only users for whom there was some difference between their variant and the counterfactual are triggered.
  • Example 4: Coverage Change:
    • Scenario: Control offers free shipping for $35+ cart; Treatment for $25+ cart except if they returned an item in last 60 days.
    • Trigger: More complex. Both Control and Treatment must evaluate the “other” condition (the counterfactual). Users are triggered only if their experience differs between variants.
  • Example 5: Counterfactual Triggering for Machine Learning Models:
    • Scenario: Testing a new machine learning classifier (e.g., for promotions) or recommender model.
    • Trigger: Users are triggered only if the new model’s output (classification, recommendation) differs from the old model’s output for that user.
    • Implementation: Requires both models to be executed for all users in the experiment (Control and Treatment), logging the output of both (counterfactual logging). This increases computational cost.
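
A hypothetical sketch of counterfactual triggering for Example 5 — `old_model`, `new_model`, and the log format are invented for illustration:

```python
def counterfactual_trigger(user, control_model, treatment_model, log):
    """Run BOTH models for every user in the experiment and log both
    outputs (counterfactual logging). The user is triggered only if the
    outputs differ -- i.e. their experience could actually change."""
    control_out = control_model(user)
    treatment_out = treatment_model(user)
    log.append({"user": user, "control": control_out,
                "treatment": treatment_out})
    return control_out != treatment_out

# Hypothetical classifiers: the old model shows a promotion to everyone,
# the new model only to users with a large cart.
old_model = lambda user: True
new_model = lambda user: user["cart_total"] >= 30

audit_log = []
small_cart_triggered = counterfactual_trigger(
    {"cart_total": 10}, old_model, new_model, audit_log)  # outputs differ
big_cart_triggered = counterfactual_trigger(
    {"cart_total": 50}, old_model, new_model, audit_log)  # outputs agree
```

Note the computational cost the chapter warns about: both models run for every user in every variant.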

A Numerical Example

A simple example demonstrates the power gain:

  • Baseline: An e-commerce site with a 5% purchase rate. Detecting a 5% relative change in purchase rate (0.25 percentage points) needs ~121,600 users overall.
  • Triggered (Checkout Change): If only 10% of users initiate checkout, and the purchase rate for those who initiate checkout is 50%, then detecting the equivalent change for checkout users (2.5 percentage points on the 50% conversion, still a 5% relative change) requires only ~6,400 checkout users. At a 10% trigger rate, this means the experiment needs ~64,000 users overall, roughly half the original amount, allowing for faster results.
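
These figures follow the common rule-of-thumb sample-size formula n ≈ 16·σ²/δ² (for α = 0.05 and 80% power), with δ the absolute detectable change — a quick check:

```python
def sample_size(p, delta):
    """Rule-of-thumb sample size n = 16 * sigma^2 / delta^2 for a
    conversion metric with rate p (variance p * (1 - p)) and a minimum
    detectable absolute change delta (alpha = 0.05, 80% power)."""
    return 16 * p * (1 - p) / delta ** 2

overall = sample_size(0.05, 0.0025)       # 5% rate, 0.25pp (5% relative) change
checkout_only = sample_size(0.50, 0.025)  # 50% rate, 2.5pp (5% relative) change
# overall ~= 121,600 users; checkout_only ~= 6,400 checkout users,
# i.e. ~64,000 overall users at a 10% trigger rate.
```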

Optimal and Conservative Triggering

  • Optimal Triggering: Include only users for whom there was a difference between their assigned variant and the counterfactual of the other variant. This is ideal but can be costly (e.g., requiring logging all counterfactuals).
  • Conservative Triggering: Include more users than optimal (e.g., all users who could have been impacted). This does not invalidate analysis but loses statistical power. Useful if the simplicity outweighs small power loss.

Overall Treatment Effect (Diluted Impact)

When reporting results for a triggered population, it’s crucial to also report the diluted impact on the overall user base, as the impact on the triggered group rarely translates directly to the whole.

  • Common Pitfall: Multiplying the triggered percent improvement by the triggering rate (Delta_theta * tau). This is often inaccurate if the triggered population is skewed.
  • Correct Dilution: (Absolute_Effect_on_Triggered * Triggered_Control_Users) / (Overall_Control_Metric_Value * Overall_Control_Users).
    • This can be simplified to (Delta_theta / M_theta_C) * tau * (M_theta_C / M_omega_C).
  • Ratio Metrics: Dilution formulas are more refined for ratio metrics, as Simpson’s paradox can occur (triggered ratio improves, but global impact regresses).
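
A sketch of the correct dilution versus the common pitfall, with invented numbers in which triggered users are much heavier spenders than the average user:

```python
def diluted_relative_impact(delta_triggered, m_triggered_control,
                            m_overall_control, trigger_rate):
    """Correct dilution of a triggered treatment effect into a relative
    impact on the overall metric:
    (Delta_theta / M_theta_C) * tau * (M_theta_C / M_omega_C)."""
    relative_lift_triggered = delta_triggered / m_triggered_control
    skew = m_triggered_control / m_overall_control
    return relative_lift_triggered * trigger_rate * skew

# Hypothetical numbers: triggered users spend $20 on average vs. $4
# overall, 10% trigger rate, +$1 absolute effect on triggered users.
correct = diluted_relative_impact(1.0, 20.0, 4.0, 0.10)  # 2.5% overall lift
naive = (1.0 / 20.0) * 0.10  # pitfall: percent lift * trigger rate = 0.5%
```

Because the triggered population is skewed (it contributes a disproportionate share of the metric), the naive product understates the overall impact five-fold here.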

Trustworthy Triggering

Two critical checks ensure trustworthy triggering:

  1. Sample Ratio Mismatch (SRM) on Triggered Population: If the overall experiment has no SRM, but the triggered analysis shows one, it indicates a bias (e.g., counterfactual logging issues).
  2. Complement Analysis (Never Triggered Users): Run an A/A test on users who were never triggered. This scorecard should show no statistically significant metrics (uniform p-value distribution). If it does, the trigger condition is likely incorrect, meaning unaffected users were influenced.

Common Pitfalls in Triggering

  • Pitfall 1: Experimenting on Tiny Segments That Are Hard to Generalize: Focusing on a small triggered population, even with a massive impact, results in a negligible diluted impact on the overall population (Amdahl’s Law).
    • Exception: Generalizations of small ideas (e.g., MSN UK’s “link opens in new tab” experiment, which led to a broad “search results in new tab” feature with a 5% engagement increase for 12 million users).
  • Pitfall 2: A Triggered User Is Not Properly Triggered for the Remaining Experiment Duration: Once a user triggers, all their subsequent activities must be included in the analysis, as the Treatment may impact future behavior. Analyzing by day or session can underestimate long-term effects if users abandon the product.
  • Pitfall 3: Performance Impact of Counterfactual Logging: If counterfactual logging executes multiple models, and one is significantly slower, this performance difference won’t be visible in the experiment’s Treatment effect.
    • Solution: Log individual model timings for comparison. Run an A/A’/B experiment (A=original, A’=original with counterfactual logging, B=Treatment with counterfactual logging) to check if A and A’ differ.

Open Questions (No Clear Answer)

  • Triggering Unit: Should analysis include all user activities from experiment start once triggered, or only activities after the trigger point? Including all activities simplifies computation but can slightly reduce statistical power.
  • Plotting Metrics Over Time: Triggered analyses often show false trends (e.g., decreasing Treatment effect over time) if not handled carefully. This is because the composition of triggered users changes daily.

Chapter 21: Sample Ratio Mismatch and Other Trust-Related Guardrail Metrics

This chapter focuses on Sample Ratio Mismatch (SRM), a critical trust-related guardrail metric, and other similar metrics designed to alert experimenters to violated assumptions. The core message is that if an SRM is detected, the experiment results are highly likely to be invalid and should not be trusted.

Sample Ratio Mismatch (SRM)

  • Definition: An SRM occurs when the actual ratio of users (or other randomization units) between variants significantly deviates from the designed ratio (e.g., a 50/50 split should yield a 1:1 ratio).
  • Underlying Principle: The decision to expose a user to a variant must be independent of the Treatment itself. Therefore, the distribution of users across variants should match the experimental design.
  • Detection: Use a standard t-test or chi-squared test to compute the p-value for the observed ratio. A low p-value (e.g., below 0.001) indicates an SRM.
  • Consequence: When an SRM is detected, all other metrics are probably invalid and should not be trusted.
  • Example (Scenario 1): Designed 50/50 split, but observed 821,588 vs. 815,482 users (ratio 0.993). P-value of 1.8E-6 (less than 1 in 500,000 chance). This is an extremely unlikely event, indicating a bug.
  • Example (Scenario 2 – Bing): A small ratio deviation (0.994) with a p-value of 2E-5. An actual Bing scorecard showed initial positive results for all success metrics, but after segmenting out users from an old Chrome browser version (the cause of the SRM), the remaining 96% of users showed no statistically significant movement. This illustrates how even a small SRM can massively skew results.
  • Prevalence: At Microsoft, about 6% of experiments exhibited an SRM.
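
The SRM check itself is a one-liner; running a chi-squared test on the chapter's Scenario 1 counts reproduces the stated p-value of about 1.8E-6:

```python
from scipy import stats

# Scenario 1: a designed 50/50 split that delivered these user counts.
observed = [821_588, 815_482]
chi2, p = stats.chisquare(observed)  # expected counts default to equal
# p ~= 1.8e-6, far below the usual 0.001 SRM threshold:
# flag the experiment and distrust all other metrics.
srm_detected = p < 0.001
```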

SRM Causes

Many factors can lead to an SRM:

  • Buggy Randomization: Errors in the system that assigns users to variants (e.g., complex ramp-up procedures, exclusions, attempts to balance covariates, or unexpected interactions between concurrent experiments). A real example involved internal Microsoft Office users skewing results because they were heavy users.
  • Data Pipeline Issues: Problems in collecting or processing data (e.g., bot filtering incorrectly classifying highly engaged Treatment users as bots, as seen in an MSN example where an SRM inverted a negative result to a positive one).
  • Residual Effects: If an experiment is restarted after fixing a bug, users previously impacted by the bug might exhibit different behavior, causing an SRM.
  • Bad Trigger Condition: The criteria for including users in an experiment (triggering) is flawed (e.g., website redirects where users are lost in Treatment, leading to an SRM if only those who successfully reach the redirected page are counted).
  • Triggering based on Treatment-Impacted Attributes: If the attribute used for triggering (e.g., “dormant user” status) is affected by the Treatment itself, identifying users by this attribute after the experiment starts can lead to an SRM.

Debugging SRMs

Debugging SRMs is challenging but essential. Common investigation directions:

  • Validate Upstream: Ensure no differences between variants before the randomization or trigger point.
  • Check Variant Assignment: Verify proper randomization and handling of concurrent experiments and isolation groups.
  • Inspect Data Processing Pipeline: Follow data stages to identify where the SRM is introduced (e.g., bot filtering).
  • Exclude Initial Period: Check if the SRM is due to staggered variant start times or caching issues.
  • Segment Analysis: Examine the SRM within specific segments (e.g., by day, browser, new vs. returning users) to pinpoint the source.
  • Intersection with Other Experiments: Check if other concurrent experiments are causing user “stealing” from one variant.

If an SRM is understood, it might be fixable in the analysis phase (e.g., correcting bot filtering). Otherwise, re-running the experiment may be necessary.

Other Trust-Related Guardrail Metrics

Beyond SRM, other metrics help indicate system or data quality issues:

  • Telemetry Fidelity: Metrics assessing data loss (e.g., click tracking via web beacons can be lossy). If Treatment impacts loss rate, results are distorted. Dual logging (e.g., for ad clicks) can help uncover this.
  • Cache Hit Rates: Monitoring cache hit rates can reveal unexpected interference between variants due to shared resources (e.g., LRU caches).
  • Cookie Write Rate: Monitoring the rate at which permanent cookies are written can expose issues like cookie clobbering (e.g., a Bing experiment where writing an unused random cookie with every search response led to massive user degradation and untrustworthy results).
  • Quick Queries: For search engines, an increase or decrease in quick queries (multiple queries within one second) can indicate a problem with the Treatment, as these are often associated with untrustworthy results.

These guardrail metrics, when implemented and actively monitored, are crucial for ensuring the internal validity and trustworthiness of online controlled experiments.

Chapter 22: Leakage and Interference between Variants

This chapter explores situations where the fundamental assumption of independent experimental units is violated, leading to leakage and interference between variants. This phenomenon, known as a violation of the Stable Unit Treatment Value Assumption (SUTVA), can significantly bias experiment results.

The SUTVA Assumption

  • Definition: SUTVA states that the behavior of each unit in an experiment is unaffected by the variant assignment of other units (formally, Y_i(z) = Y_i(z_i): unit i’s outcome depends only on its own assignment z_i, not on the full assignment vector z). In simpler terms, what happens to one user should not be influenced by which variant other users are assigned to.
  • Plausibility: This assumption holds true in most straightforward A/B tests (e.g., a user’s purchase on an e-commerce site is independent of others’ variant assignment).
  • Consequence of Violation (Interference/Spillover/Leakage): If SUTVA is violated, the analysis results can be incorrect or biased.

Examples of Interference

Interference can arise through direct connections or indirect connections (shared resources):

Direct Connections

  • Social Networks (Facebook/LinkedIn): User behavior is highly influenced by their social network.
    • Scenario: A new social-engagement feature (Treatment) causes users to send more connection invitations or messages.
    • Interference: Users in the Control group, who receive these invitations/messages from Treatment users, may also increase their activity (e.g., visit LinkedIn, accept invitations, send replies).
    • Bias: The observed delta between Treatment and Control is biased downwards, as the Control group benefits from the Treatment’s spillover, underestimating the true impact of the new feature.
  • Communication Tools (Skype):
    • Scenario: Improved call quality (Treatment) increases calls made by Treatment users.
    • Interference: These calls may go to Control users, causing them to use Skype more (to answer calls, or subsequently call their friends).
    • Bias: The Treatment effect is underestimated.

Indirect Connections (Shared Resources or Latent Variables)

  • Two-Sided Marketplaces (Airbnb, Uber/Lyft, eBay): Supply and demand interact.
    • Scenario (Airbnb): Improved conversion flow for Treatment users leads to more bookings.
    • Interference: This reduces available inventory for Control users.
    • Bias: The Treatment effect is overestimated, as the Control group performs worse due to resource scarcity caused by the Treatment.
    • Scenario (Uber/Lyft): A new pricing algorithm (Treatment) leads to more ride requests.
    • Interference: Fewer drivers are available for the Control group, leading to higher prices and fewer rides for Control users.
    • Bias: The Treatment effect is overestimated.
    • Scenario (eBay): A Treatment encourages higher bidding from buyers.
    • Interference: Control users competing for the same items are less likely to win auctions.
    • Bias: The Treatment effect on total transactions is overestimated.
  • Ad Campaigns:
    • Scenario: A Treatment encourages more clicks on ads.
    • Interference: This consumes shared ad campaign budgets faster. The Control group then has a smaller budget available.
    • Bias: The Treatment effect is overestimated. Ad revenue experiments can also show different results at different times of the month/quarter due to budget pacing.
  • Relevance Model Training:
    • Scenario: A new Treatment relevance model is better at predicting user clicks.
    • Interference: If both Treatment and Control models are trained on data from all users (including Treatment users’ “good” clicks), the “good” clicks from Treatment users will gradually benefit the Control model over time.
    • Bias: The Treatment effect is underestimated over longer runs.
  • Shared Infrastructure (CPU):
    • Scenario: A bug in Treatment unexpectedly consumes more CPU/memory on shared machines.
    • Interference: All requests (Treatment and Control) on those machines slow down.
    • Bias: The negative Treatment effect on latency is underestimated.
  • Sub-User Experiment Unit:
    • Scenario: Randomizing by pageview for a latency improvement (Treatment). The same user experiences both fast (Treatment) and slow (Control) pages.
    • Interference: The user’s behavior on fast pages might be influenced by the experience on slow pages, and vice versa (e.g., overall frustration impacting clicks).
    • Bias: The Treatment effect is underestimated.

Practical Solutions for Addressing Interference

The goal is to estimate the delta between two parallel universes (all Treatment vs. all Control) without bias from leakage.

  • Rule-of-Thumb: Ecosystem Value of an Action:
    • Concept: Identify actions that can spill over (e.g., messages sent) and related downstream impacts (e.g., messages responded to).
    • Method: Establish a general “ecosystem value” for an action (e.g., how much a message translates to visits from both sender and neighbors) using historical experiments or Instrumental Variables.
    • Benefit: Relatively easy to implement, more sensitive than isolation methods because it uses Bernoulli randomization.
    • Limitation: An approximation, may not capture scenario-specific complexities.
  • Isolation: The most direct way to address interference is to prevent communication between variants by identifying and separating the “medium” of interference.
    • Splitting Shared Resources: Divide shared resources (e.g., ad budget, model training data) proportionally between variants.
      • Watch-outs: Resources might not be perfectly splittable (e.g., machine heterogeneity); splitting training data can bias models against smaller-data variants.
    • Geo-Based Randomization: Randomize by geographical region to isolate interference between Treatment and Control in location-sensitive scenarios (e.g., hotels, taxis).
      • Caveat: Reduces sample size and statistical power.
    • Time-Based Randomization: Flip a coin at time t to assign all users to Treatment or all to Control for a period. Requires interference to be short-lived and no within-user cross-period effects.
    • Network-Cluster Randomization: In social networks, group users into “clusters” based on their likelihood to interfere and randomize these clusters into Treatment or Control.
      • Limitations: Difficult to achieve perfect isolation in dense networks; effective sample size is small, leading to variance-bias trade-offs.
    • Network Ego-Centric Randomization: Focuses on an “ego” (focal individual) and their “alters” (connected individuals), allowing separate variant assignment for egos and alters to achieve better isolation and smaller variance. (e.g., give all alters Treatment, half egos Treatment).
    • Combining Isolation Methods: Leverage multiple isolation techniques (e.g., network-cluster + time-based) to increase sample size.
  • Edge-Level Analysis:
    • Method: Use Bernoulli randomization on users (nodes), then classify interactions (edges) based on the variants of the two users involved (e.g., T-T, T-C, C-C, C-T).
    • Benefit: Allows unbiased delta estimation (e.g., T-T vs. C-C) and identification of network effects like “Treatment affinity” (Treatment units preferring to interact with other Treatment units).
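
A minimal sketch of the edge-classification step (the names and interaction data are invented):

```python
def classify_edges(edges, variant):
    """Edge-level analysis: under Bernoulli user randomization, label each
    interaction (edge) by the variant pair of its two endpoints."""
    buckets = {"T-T": [], "T-C": [], "C-T": [], "C-C": []}
    for sender, receiver in edges:
        buckets[f"{variant[sender]}-{variant[receiver]}"].append(
            (sender, receiver))
    return buckets

variant = {"alice": "T", "bob": "C", "carol": "T"}
messages = [("alice", "carol"), ("alice", "bob"), ("bob", "carol")]
by_pair = classify_edges(messages, variant)
# Comparing metrics on T-T vs. C-C edges gives an unbiased delta estimate;
# an excess of T-T edges would indicate "Treatment affinity".
```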

Detecting and Monitoring Interference

  • Mechanism Understanding: Key to choosing the right solution.
  • Monitoring System: Crucial to have strong monitoring and alert systems for detecting interference, even if a precise measurement isn’t always practical.
  • Ramp Phases: Can help detect really bad interference early (e.g., a Treatment consuming all CPU) by ramping to small populations first (Chapter 15).

Interference is a complex challenge, but understanding its mechanisms and applying appropriate experimental designs and analytical techniques can lead to more accurate and trustworthy causal estimates.

Chapter 23: Measuring Long-Term Treatment Effects

This chapter addresses the significant challenge of measuring long-term Treatment effects in online controlled experiments, particularly in agile, fast-paced product development environments. It differentiates short-term from long-term effects and explores various methodologies for capturing the latter.

What Are Long-Term Effects?

  • Short-Term Effect: The Treatment effect measured over a typical experiment duration (one to two weeks). This is often sufficient for most experiments.
  • Long-Term Effect: The asymptotic effect of a Treatment, theoretically years out, but practically considered 3+ months or after a sufficient number of exposures.
  • Divergence: Short-term and long-term effects can differ significantly.
    • Examples: Raising prices (short-term revenue increase, long-term decrease due to churn); showing poor search results (short-term query share increase, long-term decrease due to user abandonment); increasing ad load (short-term revenue increase, long-term decrease in ad clicks/searches).
  • Exclusion: The chapter explicitly excludes changes with very short life spans (e.g., ephemeral news headlines), focusing on features where long-term performance is meaningful.
  • OEC Connection: Measuring long-term effects provides insights to improve and devise short-term metrics that are predictive of long-term goals (the OEC challenge).

Reasons the Treatment Effect May Differ between Short-Term and Long-Term

Several factors contribute to the divergence:

  • User-Learned Effects: Users adapt to changes over time.
    • Adaptation: Users learn to avoid annoying features (e.g., frequent crashes lead to abandonment).
    • Discoverability: Users take time to discover new features, then engage heavily.
    • Primacy/Novelty: Initial effects (positive or negative) might stabilize as users reach an equilibrium point after the “newness” wears off or they get used to the “old” way (Chapter 3).
  • Network Effects: Feature adoption or usage can propagate through social or marketplace networks.
    • Propagation: If a feature encourages friends to use it, its full effect may take time to manifest as it spreads through the network (Chapter 22 discusses short-term interference).
    • Supply Constraints: In two-sided marketplaces (Airbnb, Uber), increased demand from a Treatment might outpace supply, delaying the full revenue impact. A similar dynamic applies to recommendation systems, where the “supply” of new, diverse items is limited.
  • Delayed Experience and Measurement:
    • Physical Arrival: For travel sites, user retention metrics might be affected by offline experiences months after online booking.
    • Contract Cycles: Annual contracts mean renewal decisions are made only after a year of cumulative experience.
  • Ecosystem Change: External factors change the environment over time.
    • Other Feature Launches: New features (internal) can interact with or erode the value of the tested feature.
    • Seasonality: Performance varies by time of year (e.g., gift cards during Christmas).
    • Competitive Landscape: Competitors launching similar features can reduce value.
    • Government Policies: Regulations (e.g., GDPR) change user behavior and ad targeting.
    • Concept Drift: Machine learning model performance degrades if not refreshed, as data distributions change.
    • Software Rot: Features degrade over time if not maintained.

Why Measure Long-Term Effects?

Not all long-term differences are worth measuring, but these are key reasons:

  • Attribution: For tracking team goals, financial forecasting, and understanding the true ROI of a feature over time. This requires considering both endogenous (user-learned) and exogenous (competitive landscape) factors.
  • Institutional Learning: Understanding why short-term and long-term effects differ provides insights for improving future iterations (e.g., if a strong novelty effect indicates a suboptimal user experience, or if slow adoption points to poor in-product education).
  • Generalization: Deriving general principles from specific long-running experiments (e.g., the long-term impact of search ads) or building short-term metrics that predict long-term outcomes.

Long-Running Experiments

The simplest approach is to keep an experiment running for an extended period (e.g., months or years) and compare Treatment effects at the beginning and end.

  • Challenges and Limitations:
    • Treatment Effect Dilution: Users may use multiple devices or entry points where the experiment isn’t active, or cookies may churn, leading to a mixed experience over time. The longer the run, the more diluted the measured effect.
    • Network Leakage: Interference between Treatment and Control can cascade through networks, creating larger leakage over longer periods (Chapter 22).
    • Survivorship Bias: If survival rates differ between Treatment and Control (e.g., users who dislike a feature abandon the product), the long-term measurement will be biased toward the users who remain. This differential attrition should trigger an SRM alert.
    • Interaction with Other New Features: Other features launched during the long run can interact and confound the observed long-term effect of the specific feature being tested.
    • Time-Extrapolated Effect Interpretation: The difference between initial and final measurements may be due to exogenous factors (seasonality), not just the Treatment itself. Generalizing requires caution.
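
The differential attrition described above should surface as a Sample Ratio Mismatch. As a rough illustration (this sketch is not from the book; the function names and the 0.001 alert threshold are assumptions), an SRM check is a simple proportion test of the observed split against the configured split:

```python
import math

def srm_p_value(treatment_count: int, control_count: int,
                expected_treatment_ratio: float = 0.5) -> float:
    """Two-sided p-value that the observed split matches the configured
    ratio, using a normal approximation to the binomial (fine for the
    large counts typical of online experiments)."""
    n = treatment_count + control_count
    p = expected_treatment_ratio
    z = (treatment_count - n * p) / math.sqrt(n * p * (1 - p))
    return math.erfc(abs(z) / math.sqrt(2))

def check_srm(treatment_count: int, control_count: int,
              expected_treatment_ratio: float = 0.5,
              threshold: float = 1e-3):
    """Flag a Sample Ratio Mismatch when the split is wildly unlikely.
    A low threshold (0.001 here, an illustrative choice) avoids false
    alarms when many experiments are checked automatically."""
    p_value = srm_p_value(treatment_count, control_count,
                          expected_treatment_ratio)
    return p_value < threshold, p_value

# A 50/50 experiment that drifted: 50,000 vs. 48,500 users is flagged,
# while 50,000 vs. 49,900 is within normal sampling noise.
mismatch, p = check_srm(50_000, 48_500)
```

If the check fires, the long-term readout should be discarded and the attrition investigated, rather than explained away.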

Alternative Methods for Long-Running Experiments

To address the challenges of long-running experiments, several methods are proposed, none of them fully satisfactory:

  • Method #1: Cohort Analysis:
    • Design: Construct a stable cohort of users (e.g., logged-in user IDs) at the experiment start and only analyze their short-term and long-term effects.
    • Benefits: Addresses dilution and survivorship bias if the cohort is stable and trackable.
    • Considerations:
      • Cohort Stability: Ineffective if IDs (e.g., cookies) churn frequently.
      • Representativeness/External Validity: If the cohort is not representative of the overall population (e.g., logged-in vs. non-logged-in users), results may not generalize. Weighting adjustments (stratification) can help, but they carry the limitations of observational studies (Chapter 11).
  • Method #2: Post-Period Analysis (Reverse Experiment):
    • Design: Turn off the experiment after a period T, and then measure the difference between the former Treatment and Control users in a subsequent post-period (T+1). During the post-period, both groups experience the same features.
    • “Learning Effect”: The difference measured in the post-period (when Treatment is off or ubiquitous) is attributed to what users learned or how the system learned from the Treatment.
      • User-learned effect: Users adapted behavior (e.g., learned ad quality).
      • System-learned effect: Permanent user state changes (profile updates, opt-outs), or ML models learning from Treatment-period data.
    • Benefits: Isolates impact from exogenous factors and interactions with newly launched features. Provides insights into why effects differ.
    • Limitations: Suffers from potential dilution and survivorship bias from the initial Treatment period.
    • Extrapolation: With enough experiments, this method can help extrapolate anticipated long-term effects from new short-term experiments.
  • Method #3: Time-Staggered Treatments:
    • Design: Run two versions of the same Treatment with staggered start times (e.g., T0 starts at t=0, T1 starts at t=1).
    • Convergence Check: At any time t > 1, compare T0 and T1. If the difference between them becomes statistically insignificant (or below practical significance), it indicates the Treatments have converged and the long-term effect is stable.
    • Benefit: Helps determine “how long is long enough” to run the experiment.
    • Assumptions: Assumes the difference between the two Treatments decreases over time. Requires sufficient time gap for effects to manifest.
  • Method #4: Holdback and Reverse Experiment:
    • Holdback: Keep a small percentage of users (e.g., 10%) in Control for several weeks/months after launching Treatment to the rest (90%). This is a typical long-running experiment type.
      • Cost: Control group incurs opportunity cost. Small Control group means lower power.
    • Reverse Experiment: After launching Treatment to 100%, ramp a small percentage (e.g., 10%) back into Control for several weeks/months.
      • Benefit: Allows the network/marketplace to reach new equilibrium before measuring (useful for network effects or supply constraints).
      • Disadvantage: May confuse users by reverting their experience.
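
The convergence check in Method #3 can be sketched as a simple two-sample comparison between the two staggered Treatment groups. This is an illustrative sketch, not code from the book: the normal approximation, the function name, and the practical-significance threshold are all assumptions:

```python
import math
import statistics

def treatments_converged(t0_metric: list, t1_metric: list,
                         practical_threshold: float = 0.01,
                         alpha: float = 0.05) -> bool:
    """Compare per-user metric values for the same Treatment started at
    two different times (T0 earlier, T1 later). Declare convergence when
    the T0-vs-T1 difference is both statistically insignificant and below
    the practical-significance threshold."""
    m0, m1 = statistics.fmean(t0_metric), statistics.fmean(t1_metric)
    v0 = statistics.variance(t0_metric) / len(t0_metric)
    v1 = statistics.variance(t1_metric) / len(t1_metric)
    z = (m0 - m1) / math.sqrt(v0 + v1)          # Welch-style z statistic
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha and abs(m0 - m1) < practical_threshold
```

Once this returns True, the staggered Treatments have stabilized relative to each other, suggesting the experiment has run “long enough” to read the long-term effect.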

The chapter concludes by highlighting the ongoing research in this area and the importance of evaluating limitations for each method when interpreting results.


Key Takeaways

“Trustworthy Online Controlled Experiments” is more than a technical manual; it’s a strategic guide for any organization aiming to embed data-driven decision-making into its DNA. The core lessons revolve around the scientific rigor, cultural shifts, and robust platforms necessary to transform ideas into quantifiable business impact.

The most important insights readers should remember are:

  • Causality is King, and Experiments are its Throne: Correlation is not causation. Only randomized controlled experiments can reliably establish causal links between a change and its outcome. This is the fundamental reason to invest in experimentation.
  • Trustworthiness is Paramount: Numbers are easy to get, but trustworthy numbers are hard. Twyman’s Law (“Any figure that looks interesting or different is usually wrong”) is a constant reminder to be skeptical and rigorously validate results, especially for seemingly “too good to be true” findings. Issues like Sample Ratio Mismatches (SRMs) are red flags that invalidate an experiment.
  • Metrics Drive Behavior and Strategy: Define clear, actionable, and non-gameable metrics (especially the Overall Evaluation Criterion – OEC) that align with long-term strategic objectives. Goodhart’s Law and Campbell’s Law teach that targets can become corrupted if not carefully chosen.
  • Most Ideas Fail, and That’s Okay: A significant majority (67-90%) of new features or changes will not positively impact key metrics. Embracing this “fail fast” culture, learning from failures, and iterating is crucial for innovation and continuous improvement.
  • Experimentation is a Journey of Maturity: Organizations progress through “Crawl,” “Walk,” “Run,” and “Fly” phases, each requiring increasing sophistication in platform, process, and culture. Automation, continuous A/A testing, and institutional memory are vital for scaling.
  • Context and Design Matter: The choice of randomization unit, how to handle interference (SUTVA violations), and whether to measure short-term or long-term effects profoundly impact experiment design and the validity of results. These are not trivial details.

Here are the specific next actions readers should take immediately, and why they matter:

  • Define Your OEC (or Start the Conversation): If your organization doesn’t have a clear, agreed-upon OEC that balances multiple objectives, start discussions now. This is the single most important step for aligning efforts and making consistent, data-driven decisions. Without it, your experiments lack a guiding star.
  • Run an A/A Test: If you’re starting with A/B testing or haven’t done one recently, run an A/A test on your current system. This is the fastest way to uncover fundamental flaws in your instrumentation, randomization, or statistical analysis (e.g., incorrect variance calculations). Don’t trust your A/B test results until your A/A tests pass.
  • Implement SRM Checks: Ensure every experiment automatically checks for Sample Ratio Mismatch. If an SRM is detected, reject the results and investigate immediately. This simple guardrail protects you from making decisions based on fundamentally flawed data.
  • Embrace a “Learn from Failure” Mindset: Recognize that many of your ideas won’t work as expected. Celebrate the learning that comes from failed experiments, rather than just the successes. This cultural shift fosters innovation and prevents teams from hiding negative results.
  • Review Your Instrumentation: Understand what data you’re collecting, how, and why. Ensure it’s high-quality, attributed correctly to variants, and that you have a culture that prioritizes instrumentation as much as core features. Without good data, experiments are useless.
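
The value of the A/A test recommended above rests on a simple property: when both groups receive an identical experience, roughly alpha (e.g., 5%) of tests should come out “significant” purely by chance; a materially higher rate signals broken randomization or variance calculations. A minimal sketch of that sanity check on simulated data (the function names and parameters are illustrative, not from the book):

```python
import math
import random

def two_sample_p_value(a: list, b: list) -> float:
    """Two-sided p-value for a difference in means (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))

def simulated_aa_false_positive_rate(n_tests: int = 500, n_users: int = 200,
                                     alpha: float = 0.05,
                                     seed: int = 42) -> float:
    """Run many simulated A/A tests: both groups draw from the SAME
    distribution, so about `alpha` of them should be 'significant'."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_tests):
        a = [rng.gauss(10.0, 2.0) for _ in range(n_users)]
        b = [rng.gauss(10.0, 2.0) for _ in range(n_users)]
        if two_sample_p_value(a, b) < alpha:
            significant += 1
    return significant / n_tests
```

In a real system you would run this against your production assignment and analysis pipeline, not simulated data: if the observed rate is far from alpha, distrust every A/B result the pipeline produces until the cause is found.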

Reflection Prompts:

  • Considering Twyman’s Law, what seemingly “obvious” or “too good to be true” insights or successes have you or your organization celebrated in the past? How might you re-evaluate them with a more skeptical, data-driven lens?
  • If you were to define the single most important “Overall Evaluation Criterion” for your current product or team that encapsulates its long-term success, what would it be? What are the biggest challenges in measuring it in a short-term experiment, and what surrogate metrics might you consider?
  • Think about a recent decision made based on intuition or limited data. How might a controlled experiment, or a specific observational causal study, have changed that decision or provided deeper insight?