A pleasant walk through computing

Comment for me? Send an email. I might even update the post!

Why the Repository Pattern Fails for Legacy Databases Like Yours and What to Do About It

[Image: Repo Man movie poster, fair use]


Introduction: A Classic Example (spot the flaw?)

One of the hallmarks of creating a loosely coupled architecture is using the repository pattern to return data. Each model's repository implements an interface, and that interface is injected into, typically, a service.

namespace Company.Domain.DomainModels
{
    public class Customer
    {
        public int CustomerId { get; set; }
        public string Name { get; set; }
    }
}
namespace Company.Domain.DataInterfaces
{
    public interface IRepository<T> where T: class
    {
        T Single(Expression<Func<T, bool>> where);
        IEnumerable<T> Many(Expression<Func<T, bool>> where);
    }
}

The concrete implementation, using Entity Framework 6.

namespace Company.Data
{
    public class CustomerRepository : IRepository<Customer>
    {
        CompanyDbContext _context = new CompanyDbContext();

        public Customer Single(Expression<Func<Customer, bool>> where)
        {
            return _context.Customers.SingleOrDefault(where);
        }

        public IEnumerable<Customer> Many(Expression<Func<Customer, bool>> where)
        {
            return _context.Customers.Where(where);
        }
    }
}

The concrete customer service with the repository interface injected.

namespace Company.Services
{
    public class CustomerService: ICustomerService
    {
        IRepository<Customer> _customerRepository;
        
        public CustomerService(IRepository<Customer> customerRepository)
        {
            _customerRepository = customerRepository;
        }
        
        public Customer GetCustomer(int id)
        {
            return _customerRepository.Single(a => a.CustomerId == id);
        }
        
        public List<Customer> GetCustomers(Expression<Func<Customer, bool>> where)
        {
            return _customerRepository.Many(where).ToList();
        }
    }
}

A unit test. (Note this is a lousy unit test because it's "testing" a data passthrough, but it illustrates the point.)

using Xunit;
using NSubstitute;
using FluentAssertions;

namespace Company.Tests
{
    public class CustomerService_Should
    {
        [Fact]
        public void Return_one_customer()
        {
            int id = 1;
            Customer expected = new Customer() { CustomerId = 1 };
            var customerRepository = Substitute.For<IRepository<Customer>>();
            var customerService = new CustomerService(customerRepository);
            // NSubstitute can't match on expression-tree equality, so accept any predicate
            customerRepository.Single(Arg.Any<Expression<Func<Customer, bool>>>()).Returns(expected);
            var result = customerService.GetCustomer(id);
            result.Should().BeEquivalentTo(expected);
        }
    }
}

The namespaces indicate where in the Onion Architecture the classes belong.

  1. Core: Customer represents a company customer. It has no behavior.
  2. Core: IRepository is our generic standard for getting data. By injecting this standard into services, we can swap out the database by simply implementing it in a new concrete repository.
  3. Core: ICustomerService is our contract for what a CustomerService should implement.
  4. Data: CustomerRepository is the concrete instance that will return models from the database.
  5. Service: CustomerService is the concrete instance that returns models to the client.

Hint: The flaw with the repository pattern is in #2 and #4.

[Image: Onion Architecture diagram, copyright Jeffrey Palermo]

Implicit Architecture Rules

Implicit in the Onion architecture are these rules:

  1. Domain Models represent the way the organization really works.
  2. Domain Services return Domain Models. (e.g. CustomerService returns Customer)
  3. Data Services return Data Models (e.g. CustomerRepository returns...Customer, in this case)
  4. APIs return View Models (MVC Controllers and WebApi return something like CustomerMaintModel)
  5. UIs work with View Models

This is a loosely coupled architecture. By using interfaces, it's also highly testable.

So, what's wrong with it?

The repository expects to return a domain model from the database. And that's not how most databases, especially legacy databases, are designed.

Let's make sure this is really clear, because it's the flaw in every example I've seen of this architecture. Here's how the domain and database stay loosely coupled.

  1. The domain models represent how the organization sees information.
  2. The database models represent how the data is stored.
  3. The domain should have no knowledge of the database.

The mistake I've seen is that organizations' repositories return their data models, not domain models. They then rely on the service to map to the domain models. This means the repositories are tightly coupled to the data model.

In this light, look at the classic example again. The repository returns a Customer object. Customer is a domain model. This means that, if the database tables aren't one-to-one matches to the domain models, we have two choices in mapping the table model to the domain model.

  1. Use IRepository<CustFile> and have the CustomerService do the mapping. For instance, the List<Customer> GetCustomers() method would get the data from the CustFileRepository and populate a Customer list.

    This is how I've seen most organizations try to do it (see the sketch after this list). But we've tightly coupled our data model to our domain's IRepository. The data model is now part of the domain model. If we wanted to swap out the database, the new database would have to have a CustFile table. If it doesn't, we're rewriting the repositories and injecting new interfaces into the services.

  2. Use IRepository<Customer> and have the CustomerRepository do the mapping. For instance, the IEnumerable<Customer> Many() method would get the data using a data model such as CustFile and populate a Customer list. There's no guarantee CustFile has the same properties as Customer.

    This would be "more right" since there'd be loose coupling to the data. We could supposedly swap out the database. But as you'll see it won't work the way you think, and becomes an unnecessary layer of abstraction.

Let's try the second way, but with a more real-world scenario.

The customer data is stored in two Microsoft SQL Server tables that were created over the course of ten years. The extended table was used to handle new columns for the original table. Note the different primary key names and the odd way name information is stored. Also, there's no foreign key for the one-to-zero-or-one relationship.

CustFile
=========
CustId   int PK
FullName varchar(100)

CustFileExtended
===============
CustomerNumber int PK
Zip5           int

We need data models specific to these tables.

namespace Company.Data.DataModels
{
    public class CustFile
    {
        public int CustId { get; set; }
        public string FullName { get; set; }
    }
    
    public class CustFileExtended
    {
        public int CustNumber { get; set; }
        public int Zip5 { get; set; }
    }
}

To keep the example clear, we'll use an explicit ICustomerRepository. Remember, in order to be loosely coupled, this repository returns a Customer domain model.

If it returned CustFile and CustFileExtended models, then those data models would have to be part of the Domain. And that would tightly couple the abstract domain model to the concrete data model.

namespace Company.Domain.DataInterfaces
{
    public interface ICustomerRepository
    {
        // Create
        Customer Create(Customer entity);

        // Retrieve
        Customer RetrieveSingle(int id);
        IEnumerable<Customer> RetrieveMany();
        IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where);

        // Update
        Customer Update(Customer entity);

        // Delete
        void Delete(int id);
    }
}

Note the use of Expression. That couples the interface to IQueryable semantics, which are provider-dependent.

Now we'll try to implement the repository using Entity Framework. To make the problem clear, we'll only look at the RetrieveMany methods.

namespace Company.Data
{
    public class CustomerRepository : ICustomerRepository
    {
        private CompanyDbContext _context;
        public CustomerRepository(CompanyDbContext context)
        {
            _context = context;
        }
        
        public IEnumerable<Customer> RetrieveMany()
        {
            // Now what?
        }
        
        public IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where)
        {
            // Now what?
        }
    }
}

We'll assume we can mock the context somehow. But this is where things go awry.

Many organizations end up adding IRepository and IUnitOfWork to DbSet and DbContext. Patterns on top of patterns. Abstractions on top of abstractions!
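
The result usually looks something like this hypothetical wrapper (names invented for illustration), which mostly restates what DbContext and DbSet already give you:

public interface IUnitOfWork : IDisposable
{
    // One repository property per DbSet...
    IRepository<CustFile> CustFiles { get; }
    IRepository<CustFileExtended> CustFileExtendeds { get; }

    // ...and a method that mirrors DbContext.SaveChanges.
    int SaveChanges();
}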

How do we return a collection of Customers? Remember, we're getting the data from two different tables.

Important: The calls to the database have no knowledge of what a Customer is.

public IEnumerable<Customer> RetrieveMany()
{
    // there's no filtering, so get all records. That's right. ALL OF THEM.
    var custFiles = _context.CustFiles.ToList();
    int[] ids = custFiles.Select(a => a.CustId).ToArray();
    var custFileExtendeds = _context.CustFileExtendeds.Where(a => ids.Contains(a.CustNumber)).ToList();

    List<Customer> customers = custFiles.Select(a => new Customer()
    {
        CustomerId = a.CustId,
        Name = a.FullName,
        ZipCode = custFileExtendeds.SingleOrDefault(b => b.CustNumber == a.CustId)?.Zip5 ?? 0
    }).ToList();

    return customers;
}

We're using a generic repository pattern, which is typically going to include a method such as above because sometimes we do want all the records. But not if there are a hundred million. OK, let's try filtering.

public IEnumerable<Customer> RetrieveMany(Expression<Func<Customer, bool>> where)
{
    List<Customer> customers = new List<Customer>();
    
    // And...now what?
}

We have a filter expression that can be examined for its properties and conditions. So at first it seems like all we have to do is this (a sketch of what that entails follows the list):

  1. Extract the Customer properties
  2. Map them to the CustFile and CustFileExtended properties
  3. Pull out the boolean conditions
  4. Create Expressions from the properties and conditions and apply the filters to the appropriate CustFile or CustFileExtended DbSets.
  5. Build and return the Customer list.
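
Here's a sketch of just steps 1, 2, and 4 using an ExpressionVisitor (System.Linq.Expressions). Everything here is a hypothetical illustration: it only handles simple member access on Customer, and conditions that nest or span tables are still entirely unsolved.

public class CustomerToCustFileVisitor : ExpressionVisitor
{
    private readonly ParameterExpression _custFileParam =
        Expression.Parameter(typeof(CustFile), "cf");

    // Rewrite a Customer predicate as a CustFile predicate.
    public Expression<Func<CustFile, bool>> Translate(Expression<Func<Customer, bool>> where)
    {
        return Expression.Lambda<Func<CustFile, bool>>(Visit(where.Body), _custFileParam);
    }

    protected override Expression VisitMember(MemberExpression node)
    {
        if (node.Member.DeclaringType == typeof(Customer))
        {
            if (node.Member.Name == "CustomerId")
                return Expression.Property(_custFileParam, "CustId");
            if (node.Member.Name == "Name")
                return Expression.Property(_custFileParam, "FullName");

            // ZipCode lives in CustFileExtended. Now what?
            throw new NotSupportedException(node.Member.Name + " maps to a different table.");
        }
        return base.VisitMember(node);
    }
}

Usage would be something like _context.CustFiles.Where(new CustomerToCustFileVisitor().Translate(where)), and that's before touching the extended table at all.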

That's...actually, a lot of work. And wait...that's not complete, is it? There could be a few more steps after 4.

  5. Programmatically handle a CustFile property that's filtering based on a CustFileExtended property. E.g. Customer.Discount == Customer.MaxDiscount. Except in the database it's CustFile.Discount and CustFileExtended.MaxDiscount. How's that going to work?
  6. Programmatically handle nested conditions that go across table properties. Good luck to you.
  7. Hope to heck a Customer property being filtered on isn't made up of two properties from two different tables.

and so on until...

  8. Build and return the Customer list and pray it's right.

OK, so maybe we need custom repository interfaces for each domain model. Then we can make the client supply known filters like customer ids, get our initial datasets down to a manageable size, then filter on the domain model.

// No need for an Expression here: the final filter runs in memory over the mapped Customers.
public IEnumerable<Customer> RetrieveMany(int zipCode, Func<Customer, bool> where)
{
    var custFileExtendeds = _context.CustFileExtendeds.Where(a => a.Zip5 == zipCode).ToList();
    // Build the customer list and apply the filter at the end.
    return custFileExtendeds.Select(a => new Customer()
    {
        CustomerId = a.CustNumber,
        ZipCode = a.Zip5,
        Name = _context.CustFiles.Single(b => b.CustId == a.CustNumber).FullName
    }).Where(where);
}

And the Service method would look like this:

public List<Customer> GetCustomers(int zipCode, Func<Customer, bool> where)
{
    return _customerRepository.RetrieveMany(zipCode, where).ToList();
}

Seeing that service method, you're hopefully asking some important questions. Like...

What if my domain model were being populated by data from multiple databases and maybe a web service?

What's the repository pattern gaining me?

The Unnecessary Abstraction

We've discovered a few things about having separate domain and data models.

  1. No matter what, we need to map data models to domain models.
  2. We can't have fully flexible filtering.
  3. The mapping problems using IRepository methods are the same as with CustomerService methods. Specifically, the methods need to restrict the data coming from the database in order to get good performance.

There's one other requirement: whatever data sources are injected into CustomerService need to be mockable.

Whenever you see a passthrough query such as shown above in CustomerService.GetCustomers, it's worth asking, "Can I cut out the middleman?"

The answer here is "Yes." The repository is just getting in the way. We know the concrete CustomerService will depend on concrete data. The repository is just an abstraction of that data so we can unit test. We've already determined that if we want to swap out the database, we're going to be rewriting something. Why add to our troubles by essentially writing our CustomerService methods twice?

Solution: Inject the mockable data source directly into the service.

If the data source uses an interface, great. If there are multiple data sources, do whatever it takes to make them mockable. But don't use a repository abstraction. It just adds complexity for no gain.

Cutting Out the Middleman

Here's how our CustomerService example looks with the repository removed, organized by onion layers. These could be different assemblies.

Domain

namespace Company.Domain.DomainModels
{
    public class Customer
    {
        public int CustomerId { get; set; }
        public string Name { get; set; }
        public int ZipCode { get; set; }
    }
}
namespace Company.Domain.ServiceInterfaces
{
    public interface ICustomerService
    {
        //Create
        Customer Create(Customer customer);

        //Retrieve
        Customer RetrieveSingleById(int id);
        Customer RetrieveSingleByName(string name);
        List<Customer> RetrieveManyByIds(int[] ids);
        List<Customer> RetrieveManyByPartialName(string partialName);
        List<Customer> RetrieveManyByZipCode(int zipCode, Func<Customer, bool> where);

        //Update
        Customer Update(Customer customer);

        //Delete
        void Delete(int id);

    }
}

Data

namespace Company.Data.DataModels
{
    public class CustFile
    {
        public int CustId { get; set; }
        public string FullName { get; set; }
    }
    
    public class CustFileExtended
    {
        public int CustNumber { get; set; }
        public int Zip5 { get; set; }
    }
}
namespace Company.Data.SqlDatabase
{
    public class SqlDbContext : DbContext
    {
        public IDbSet<CustFile> CustFiles { get; set; }
        public IDbSet<CustFileExtended> CustFileExtendeds { get; set; }

        public SqlDbContext(string nameOrConnectionString) : base(nameOrConnectionString) { }
        
        //This constructor is used by Effort in unit testing
        public SqlDbContext(DbConnection existingConnection) : base(existingConnection, true) { }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.Entity<CustFile>()
                .HasKey(a => a.CustId);

            modelBuilder.Entity<CustFileExtended>()
                .HasKey(a => a.CustNumber);
        }
    }
}

Services

This is only showing one method.

namespace Company.Services
{
    public class CustomerService : ICustomerService
    {
        private readonly SqlDbContext _context;

        public CustomerService(SqlDbContext context)
        {
            _context = context;
        }

        public List<Customer> RetrieveManyByZipCode(int zipCode, Func<Customer, bool> where)
        {
            var custFileExtendeds = _context.CustFileExtendeds.Where(a => a.Zip5 == zipCode).ToList();
            return GetCustomers(custFileExtendeds)
                .Where(where).ToList();
        }
        
        // Helpers for mapping
        private List<Customer> GetCustomers(List<CustFile> custFiles)
        {
            // Look up the extended records for the CustFiles we were given.
            int[] ids = custFiles.Select(a => a.CustId).ToArray();
            var custFileExtendeds = _context.CustFileExtendeds
                .Where(a => ids.Contains(a.CustNumber)).ToList();

            return custFiles.Select(a =>
            {
                var custFileExtended = custFileExtendeds.SingleOrDefault(b => b.CustNumber == a.CustId);
                return new Customer()
                {
                    CustomerId = a.CustId,
                    Name = a.FullName,
                    ZipCode = custFileExtended?.Zip5 ?? 0
                };
            }).ToList();
        }

        private List<Customer> GetCustomers(List<CustFileExtended> custFileExtendeds)
        {
            int[] ids = custFileExtendeds.Select(a => a.CustNumber).ToArray();
            var custFiles = _context.CustFiles.Where(a => ids.Contains(a.CustId)).ToList();
            return GetCustomers(custFiles);
        }
    }
}

Test

There are various ways to mock Entity Framework. I like Effort because you don't need to add interfaces to your DbContext, you just need a particular constructor.

//other usings above, these are the testing dependencies
using Company.Data;
using Company.Data.DataModels;
using Company.Domain.DomainModels;
using Company.Domain.ServiceInterfaces;
using Company.Services;
using Xunit;
using Effort;
using FluentAssertions;

namespace DealingWithData.Tests
{
    public class CustomerService_Should : IDisposable
    {
        SqlDbContext _context;
        ICustomerService _customerService;

        public CustomerService_Should()
        {
            //This is what makes Effort the in-memory database
            _context = new SqlDbContext(DbConnectionFactory.CreateTransient());
            _customerService = new CustomerService(_context);
        }

        // xUnit calls Dispose after each test
        public void Dispose()
        {
            _context.Dispose();
        }

        [Fact]
        public void Return_a_single_customer_by_id()
        {
            // arrange
            Customer expected = new Customer()
            {
                CustomerId = 1,
                Name = "Herbert",
                ZipCode = 12345
            };
            // mock the data the service works with
            _context.CustFiles.Add(new CustFile() { CustId = 1, FullName = "Herbert" });
            _context.CustFileExtendeds.Add(new CustFileExtended() { CustNumber = 1, Zip5 = 12345 });
            _context.SaveChanges();

            // act
            var actual = _customerService.RetrieveSingleById(expected.CustomerId);

            // assert
            actual.Should().BeEquivalentTo(expected);
        }
    }
}

If...when...we have changes in our data sources, we'll have to update the service and the unit tests. That's OK. Before, we'd have been updating repositories and unit tests. This removes the dead weight of the repository pattern.

Your Data Sources Will Change...

Sometimes radically. Your domain model will change, too, but it shouldn't be driven by the backend data. You should be able to code and unit test without any concrete dependencies, and it shouldn't be painful.

Do you really need the Repository pattern? Probably not.

Mock returning a List as IMongoQueryable for unit testing

The Problem

The latest MongoDb driver for .Net doesn't have a way to convert a collection such as List to IMongoQueryable. If the code depends on that interface, it needs to be mocked, but how do you supply the concrete data?

The Solution


Let's say you've settled on using MongoDb as your NoSQL data store. You write a simple repository pattern with one method to query for any concrete type.[1]

public interface IMongoRepository<T> where T : class
{
	IMongoQueryable<T> QueryAll();
}

You also have a simple Customer service that calls the repository.

public class CustomerService
{
	IMongoRepository<Customer> _customerRepository = null;
	public CustomerService(IMongoRepository<Customer> customerRepository)
	{
		_customerRepository = customerRepository;
	}

	public List<Customer> GetCustomers()
	{
		return _customerRepository.QueryAll().ToList();
	}
}

Finally, you start writing the following test. But you discover there's no way to get a concrete instance of IMongoQueryable. There used to be, but it's now legacy code.

public class CustomerService_Should
{
	[Fact]
	public void Return_customers()
	{
		var expected = new List<Customer>() { new Customer() { Id = 1 } };
		var customerRepository = Substitute.For<IMongoRepository<Customer>>();
		
		//return the mocked data. But how to convert the list into IMongoQueryable???

		customerRepository.QueryAll().Returns([argh, what goes here??]);
		var service = new CustomerService(customerRepository);
		var actual = service.GetCustomers();
		Assert.Equal(expected.Count, actual.Count);
		Assert.Equal(expected.First().Id, actual.First().Id);
	}
}

Like me, you probably try all kinds of typecasting before realizing you're attempting something impossible. Finally, you find the answer on Stack Overflow. There are two ways to mock up the data, and both work by backing the IMongoQueryable interface with a plain IQueryable.

Using NSubstitute

public class CustomerService_Should
{
	[Fact]
	public void Return_customers()
	{
		var expected = new List<Customer>() { new Customer() { Id = 1 } };
		var customerRepository = Substitute.For<IMongoRepository<Customer>>();
		
		//Mock IMongoQueryable to accept IQueryable, enabling just enough of the interface
		//to work
		var expectedQueryable = expected.AsQueryable();
		var mockQueryable = Substitute.For<IMongoQueryable<Customer>>();
		mockQueryable.ElementType.Returns(expectedQueryable.ElementType);
		mockQueryable.Expression.Returns(expectedQueryable.Expression);
		mockQueryable.Provider.Returns(expectedQueryable.Provider);
		//return a fresh enumerator on each call so the data can be enumerated more than once
		mockQueryable.GetEnumerator().Returns(_ => expectedQueryable.GetEnumerator());

		//return the mocked data
		customerRepository.QueryAll().Returns(mockQueryable);
		var service = new CustomerService(customerRepository);
		var actual = service.GetCustomers();
		Assert.Equal(expected.Count, actual.Count);
		Assert.Equal(expected.First().Id, actual.First().Id);
	}
}

Pretty slick.

Creating MongoQueryable

A simple concrete class that allows setting a List property.

public class MongoQueryable<T> : IMongoQueryable<T>
{
	public List<T> MockData { get; set; }

	public Type ElementType => MockData.AsQueryable().ElementType;

	public Expression Expression => MockData.AsQueryable().Expression;

	public IQueryProvider Provider => MockData.AsQueryable().Provider;

	public IEnumerator<T> GetEnumerator() => MockData.AsQueryable().GetEnumerator();
	IEnumerator IEnumerable.GetEnumerator() => MockData.AsQueryable().GetEnumerator();

	public QueryableExecutionModel GetExecutionModel() => throw new NotImplementedException();

	public IAsyncCursor<T> ToCursor(CancellationToken cancellationToken = default) => throw new NotImplementedException();

	public Task<IAsyncCursor<T>> ToCursorAsync(CancellationToken cancellationToken = default) => throw new NotImplementedException();

}

The test.

[Fact]
public void Return_customers2()
{
	var expected = new List<Customer>() { new Customer() { Id = 1 } };
	var customerRepository = Substitute.For<IMongoRepository<Customer>>();

	//Mock IMongoQueryable using a class
	var mockQueryable = new MongoQueryable<Customer>();
	mockQueryable.MockData = expected;

	//return the mocked data
	customerRepository.QueryAll().Returns(mockQueryable);
	var service = new CustomerService(customerRepository);
	var actual = service.GetCustomers();
	Assert.Equal(expected.Count, actual.Count);
	Assert.Equal(expected.First().Id, actual.First().Id);
}

References


  1. The comments in Stack Overflow point out that using IMongoQueryable--or IQueryable--isn't ideal because it tightly couples the code to MongoDb or to a Queryable backend. It might be better to use a truly generic repository and convert to/from MongoDb (or another database) as needed.

Git Basics With Visual Studio 2019


What problem(s) does Git solve?

  • Multi-user offline development
  • Rapid, dependable branching and merging
  • Decentralized version control

Git is a distributed version control system that was originally developed by Linus Torvalds in April 2005 to replace BitKeeper, which was withdrawn from free use due to alleged reverse engineering by SourcePuller creator Andrew Tridgell.

Torvalds needed a system that could manage the Linux kernel development and scale to hundreds of merges performed in seconds. Git has been maintained by Junio Hamano since July 2005.

Git is currently the most-used version control system by far. Microsoft recommends Git, has contributed to its development, and hosts the Windows source code using Git--the largest such repository in the world.

How does Git basically work?

A folder, .git, contains the complete repository for files and folders underneath it. A clone of a repository contains the same history as the original.

The repository is a content-addressed file system. Each object file is named with the SHA-1 hash of its content. An object file contains either the actual content of a source code file, or a tree of hash file names. In this way, Git maintains references, sometimes called "pointers"--the hash file names--to content.

There's a file that contains the reference to the root of the solution. From here, all the links to other files can be traced, so this becomes the current state.

If any file changes, Git creates a new copy of the file with a new hash reference, updates links to it, and creates a new root-level reference. This becomes the new point-in-time version.

Branches exist by creating a file with the branch name, e.g. "master", whose content points to a particular root reference. When people talk about branching in Git being "cheap," it's because, to branch from another branch, all Git has to do is create another file, e.g. "feature-1", containing the same root reference.
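
You can poke at this yourself in any repository with Git's read-only plumbing commands (cat is Unix; use type on Windows, and note that older refs may live in .git/packed-refs instead):

git cat-file -p HEAD            (the current commit: its root tree hash and parent hash)
git cat-file -p "HEAD^{tree}"   (the root tree: hashes pointing to files and folders)
cat .git/refs/heads/master      (a branch really is just a file containing a hash)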

If Charlene clones a repository from Blake, all she's doing is making a copy of the .git folder. Let's say Charlene and Blake each make a change to the same file. Each of them will get a different new reference hash. This is how Git knows to merge files: when it finds a divergence in the tree, it can trace backward in the references to find the common ancestor, then determine the changes via a three-way comparison. Once the differences are resolved, a new version of the file is created with a new reference name.
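
You don't have to trace that ancestry by hand; Git will report the common ancestor directly:

git merge-base master feature-1   (prints the hash of the common ancestor commit)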

The process is exactly the same with repositories located on remote servers. The only difference is that remote repositories are typically "bare," meaning there's no working folder.

Anyone can create a bare repository on their local filesystem. It's really just the contents of the .git folder moved to the root level.

Remote branch reference information is maintained under the .git folder in refs/remotes. Where the remotes are located is maintained in the .git/config file.

What's important to understand at this point is that your repository can contain branches for your local work, and branches for remote work. When you fetch a remote branch, the contents are merged into your local copy of that branch, e.g. origin/master. You then decide whether to merge those files into your local branch, i.e. master.
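
In commands, that fetch-then-decide flow looks like this:

git fetch origin                          (update your local copy of origin/master)
git log master..origin/master --oneline   (commits they have that you don't)
git merge origin/master                   (fold them into your checked-out master)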

When it comes to tracking local changes, there are potentially three versions of the file in play at any one time.

First, there's the version of the file you're editing. This is the working file.

Second, there's the version of the file you're going to commit. This is the staged file. (Also called being in the "index".)

Third, there's the version of the file since the last commit. This is the repository file.

How does this look in practice? Let's say the last committed file named Test.txt has one line:

one

You edit the file and add a line:

one
two

So, now your working file is changed, but Git doesn't know about the changes. You add the file to the index via git add Test.txt.

If you were to commit now, the repository file would be updated to match the staged file. But what if you don't commit, and instead add a third line?

one
two
three

If you commit, Git will still only update the repository file with the two lines from the staged file. You'd have to git add again to stage the file with all three lines.

This is a very flexible approach, letting you commit only the changes that you want. While I don't cover it in this guide, it's even possible to only commit portions of a file, called "hunks."
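
You can see all three versions at once with git diff. Continuing the Test.txt example:

git diff            (working file vs. staged file: shows the line "three")
git diff --cached   (staged file vs. repository file: shows the line "two")
git diff HEAD       (working file vs. repository file: shows both new lines)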

What's a typical daily flow?

Let's assume you've already cloned a repository from a remote server, and have configured your local Git to work easily with that remote. Your typical day will look something like this.

  1. Create and 'checkout' a local branch named 'feature-1' from the 'master' branch to work on a feature.

    This is a local feature branch that's not replicated in the remote.

  2. Make a small set of changes to the files.
  3. Add the changes to the index.
  4. Commit those changes with a short message
  5. Do this a bunch of times until the feature--or enough of the feature to make available to everyone--is complete (tests run, etc).
  6. Interactively rebase those commits, combining them and updating the messages so the changes will be clear in the remote repository history that will be shared with everyone.
  7. Checkout the master branch
  8. Pull (fetch and merge) any changes from the remote's master branch into your local master branch.
  9. Checkout the feature-1 branch.
  10. Rebase onto the master branch. This makes it seem as if all your changes are also the latest changes.
  11. Checkout master
  12. Merge feature-1 into master
  13. Push your local master branch to the remote master branch.
  14. Delete your local feature branch
  15. Create and checkout a new feature branch and do the whole thing again.

While this is many discrete steps, the flow soon becomes natural, and the result is a history of changes that's easy to follow.

Here are the commands for the above flow, which are explained in the Common Operations section.

git checkout -b feature-1
git add -A
git commit -m 'Try use existing GetAddress feature for GetCustomer'
git commit -m 'Update GetCustomer with new collection'
git commit -m 'Fix broken tests'
git rebase -i master (result is a single commit with message 'Allow GetCustomer to return addresses')
git checkout master
git pull
git checkout feature-1
git rebase master
git checkout master
git merge feature-1 --no-ff (--no-ff forces the feature to appear clearly as a merge in the log)
git push
git branch -d feature-1
git checkout -b feature-1

Doing It In Visual Studio 2019

Here are the same operations from above done in Visual Studio 2019. Using Visual Studio 2017 is very similar.

I don't recommend using Visual Studio 2015's Git features.

Create and Checkout a New Branch

git checkout -b feature-1


Stage Changes for Commit

git add -A

Note: In most cases you won't explicitly stage. You'll just Commit All.


static void Main(string[] args)
{
    Console.WriteLine("Hello Git World!");
}


Commit the Change

git commit -m 'Try use existing GetAddress feature for GetCustomer'


Make a Couple More Commits

git commit -m 'Update GetCustomer with new collection'
git commit -m 'Fix broken tests'

[Not shown]

Clean Up the Commits

git rebase -i master (result is a single commit with message 'Allow GetCustomer to return addresses')

Note: Visual Studio only gives the option to "squash" the commits. This is what you'll want to do most of the time, anyway.


Select the commits to squash, right-click, choose "Squash commits..."


Note: Visual Studio's "Squash Commits" dialog is surprisingly terrible. You effectively can't edit the text, because A) there's no way to create a soft return, and B) you can't paste more than the top line of copied text.

Until it's better, I recommend using the command line for squashing.

Checkout and Synchronize Master

git checkout master
git pull


Checkout and Rebase Feature Onto Master

git checkout feature-1
git rebase master


Checkout and Merge Feature Into Master

git checkout master
git merge feature-1 --no-ff


Note: Visual Studio doesn't appear to have a --no-ff option. Because a good history is so valuable, I recommend making the final merge commit at the command line, using an alias.

Note: It's possible to set Git to always use --no-ff via the merge.ff false config option. However, this only works in your local clone unless a repository-specific config file is used and committed.

Here's how that commit looks after using --no-ff at the command line. Notice how much clearer it is in both the command window and Visual Studio.

[Screenshots: the merge commit shown in the command window and in Visual Studio's history view]

Push Changes to the Remote Server

git push


Delete the Branch

git branch -d feature-1


Common Operations

Checkout

Checkout means to set your working folder to a particular point in time. You're always choosing a reference hash of a "root". However, you can use branch and tag names because they "point" to a particular reference.

git checkout master
git checkout feature-1
git checkout origin/master
git checkout HEAD
git checkout HEAD~3
git checkout v2.35.9
git checkout f935ea4

Branch

With the above information about how Git works, you should now understand what people mean by "a branch is a pointer to a reference."

You may not use the branch command to create a branch very often. Typically, you'll create and checkout a branch in a single step using this command:

git checkout -b feature-1

That single command runs these two commands:

git branch feature-1
git checkout feature-1

Other branch commands.

git branch -d feature-1   (delete; refuses if not merged)
git branch -D feature-1   (force delete)

Add (stage)

Adds changes to the index, to be committed.

git add -A
git add Test.txt
git add Models/Customer.cs

Commit

Commits staged changes (in the index) to the repository.

git commit
git commit -m 'Allow GetCustomer to show addresses'
git commit --amend   (changes history)

You'll often amend a commit if, for example, you realized you forgot to add a file, or misspelled something in the commit message.

For example, assume this history:

1234567 Fix spelling erors
1234568 Fix my wrong spelling fix
1234569 Add marketing copy
123456A Remove temporary files
123456B Add history file

After committing "Fix spelling erors", you realize you not only forgot to add the latest spelling file, but also mispelled "errors." You'd execute something like

git add spelling.dat
git commit --amend

The editor would open, and you could either change the message or--more likely--just close the editor. Here's how the result might look.

afgec73 Fix spelling errors
1234568 Fix my wrong spelling fix
1234569 Add marketing copy
123456A Remove temporary files
123456B Add history file

Here's the important thing to notice: The "fix spelling errors" commit's ref hash has changed.

Merge

Merges changes from another branch into the current branch. You must have the destination branch checked out. For example, in order to merge the changes from the feature-1 branch into the master branch, you must first be in the master branch.

git checkout master
git merge feature-1

Rebase (changes history)

Rebase has two main uses. The first is to modify a set of commits, usually combining ("squashing") them, so that the resulting commits are "the way it should have been". For example, let's say you have these five commits in a local feature branch.

1234567 Fix spelling errors
1234568 Fix my wrong spelling fix
1234569 Add marketing copy
123456A Remove temporary files
123456B Add history file

Using interactive rebase, you could revise this to two commits:

987ABCD Fix spelling errors
987AGF2 Add history file

Notice that the original commit hashes were not reused. New commits were created. It's as if these new commits had always been the only commits.

However, those other five commits still exist in the reflog if needed.

Here's the command:

git rebase -i master

The result is a screen that guides you through the changes.
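
If you ever need one of those pre-rebase commits back, the reflog is the place to look. Using the illustrative hashes from above (the branch name "rescue" is just an example):

git reflog                       (every commit HEAD has pointed to, including "lost" ones)
git checkout -b rescue 1234569   (resurrect a pre-rebase commit on a new branch)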

The second use is making your branch's commits seem like they were just done so that they appear as the latest commits on another branch. For example, you're working on your feature branch and have cleaned up your commits. Now you want to merge your changes into master, so you checkout master and pull.

The problem is, while you were working, someone else committed changes to master. So now your changes are behind that other person's. If you were to merge your changes, then view history, they'd be kind of jumbled in. Instead, you want your changes to appear after the other person's.

To do that, you rebase your changes onto master. Git will temporarily reset your branch to when you originally branched, merge the latest changes from master into your branch (making them identical), then replay your commits on top of this history. Each of your commits is treated as new, so it gets a new reference hash.

The command sequence to do this is:

git checkout master
git pull
git checkout feature-1
git rebase master

There's a chance you'll need to resolve conflicts between your changes and the other person's.

Check Status

Shows which files are modified, new, or deleted. Also shows which files are staged.

git status

Check History

Shows information about commits.

git log
git log --graph --oneline

Revert

The revert command "unapplies" a set of commits that appear after a commit in history, and creates a new commit at the top. Let's say you have this history.

6293daae maybe ok 
fec62970 mistake 2
96ca7600 mistake 1
bac6d253 working fine
// reverts the top three commits
git revert 96ca7600^..6293daae


// reverts the middle two commits
git revert 96ca7600^..fec62970

Each commit that's reverted gets its own new commit. If a revert runs into a conflict, resolve it and enter git revert --continue until all reversions are complete.

Reset (changes history)

The reset command undoes a set of commits starting at the latest. It does not create a new commit. It effectively throws away the changes.

Basically, the current branch's reference is updated to point to the given ref hash. In the example below, the branch would move from 6293daae to bac6d253. The only other question is whether the commit changes are retained in some way.

Given this history,

6293daae maybe ok 
fec62970 mistake 2
96ca7600 mistake 1
bac6d253 working fine

Here are the common ways to use the command to remove the top three commits.

git reset bac6d253

The default command executes in --mixed mode. The index is reset, but the working folder is left alone. This means any changes are ready to add and commit.

git reset bac6d253 --soft

The index is not reset, nor is the working folder. Any changes are already added and ready to commit.

git reset bac6d253 --hard

Both the index and working folder are reset. All changes are lost.

Tag

There are two kinds of tags: annotated and lightweight. Annotated tags are meant for release (such as pushing to the remote repository), while lightweight tags are good for local, temporary use. (Both kinds can technically be pushed, but git push --follow-tags only sends annotated tags.)

Personally, I only use annotated tags.

List all tags

git tag

Create a lightweight tag at the HEAD

git tag temp1

Create annotated tag at the HEAD, with a message of the same text

git tag -a v1.2.4 -m v1.2.4

Delete a tag

git tag -d v1.2.4

Force an existing tag onto a different commit

git tag -a -f v1.2.4 63fac95

The Rules For Teams

Git is powerful, and one of its powers comes with a risk. That is the power to "rewrite history." What this means in practice is that a user could push changes to a remote repository that conflict with existing log entries others have already pulled. I show an example below, but first here are the rules.

  1. Don't push the results of rebase, reset, commit --amend, or tag -f if you've previously pushed the affected commits.
  2. Rebase features onto master, then merge features into master.

First, here's an example of an easy mistake to make. The user is on the master branch, commits locally, pushes to the remote, then amends the commit and pushes again.

git commit -m "All done!"
git log --graph --oneline
  afea725 All done!
  7c934ag Pass all tests

git push
--Forgot to add some files!
git add --all
git commit --amend -m "All done!" <= gets a warning
git pull <= thinks this is the right thing to do, but it isn't
git commit --amend -m "All done!"
git log --graph --oneline
  024ag7d All done!
  7c934ag Pass all tests

git push

Notice that the "All done!" commit's reference hash is different after being amended. The problem above would be compounded if the user didn't amend the commit for awhile.

What if developer B pulls from the server after user A's first push? B will have a history that includes ref afea725. In the meantime, A amends and pushes. Now B pulls. Does her ref afea725 magically disappear? No. She ends up with something like this.

  024ag7d All done!
  afea725 All done!
  7c934ag Pass all tests

Or, it could be worse. User A could force-push the amended commit. This leads B to an even worse history.

The problems arising from using the other commands that change history are similar.

Second, if a user doesn't rebase features onto master, the history becomes unclear: the branch the user just finished and pushed to the remote server looks as if it was done a week ago (or whenever it was branched). According to the history, it was. But that's not what the user intended.

Solving Common Problems

Pull, Merge and Rebase Conflicts

The first time I encountered a merge conflict, I was utterly flummoxed. I didn't understand what to do, and I was definitely confused by whether my local files were actually "local" or "remote".

Here's the message you get when trying to push changes that conflict with work already on the remote.


Error: hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.


So, we do what we're told and try to pull.


We're now in the midst of resolving a merge conflict. At this point we have two options: resolve the conflict, or abort. If we abort, it's as if the pull never happened.
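
At the command line, the escape hatch is:

git merge --abort   (abandon the in-progress merge; as if the pull never happened)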

To visually merge the files, click "Conflicts: 1", then select a file that has a conflict.


There are several options for comparing and merging files. The one we'll look at is the three-way merge. Click Merge. This will run whatever merge tool you've configured Git to use.

I've used KDiff3 for years because it's very clear about where the conflict is. However, since this article is about using Visual Studio, that's what I'll demonstrate. It looks like Microsoft has improved Visual Studio's diff/merge quite a bit, which is good!


Choose which version you want to keep: the left, the right, or both. You can also type directly into the result. We'll take the left (our) change, click Accept Merge, and close the editor.


Finally, commit the merge.


And, you're ready to push.


Remote vs Local, plus Base and Backup

The "base" file is easy to understand; it's the version that's common to both of the changed versions prior to when they diverged.

If you're dealing with a merge conflict from another developer, that's easy, too. The Remote will be their file, and Local will be yours.

But what if there's a conflict locally between two branches? For example, what if you

  1. Branch from master into feature, commit a change there
  2. Go back to master and commit a change there
  3. Try to merge feature into master OR rebase feature onto master

Visual Studio makes this pretty easy by using clearer terminology.

In the case of Merge, the master branch file is the Target, and the feature file is the Source. If you used the command git mergetool, master would be LOCAL and feature would be REMOTE.

In the case of Rebase, it's the same: master is Target/LOCAL, and feature is Source/REMOTE.

Merge vs Rebase terminology is what confuses people, so let's repeat it:

  • If I'm on master branch and merge, master is Target/LOCAL
  • If I'm on feature branch and rebase, master is still Target/LOCAL

The target is whichever branch you're merging into, or rebasing onto.

If Git is configured to keep backups of the file before merging begins (mergetool.keepBackup = true), after the merge is committed there will be files with a .orig extension. These need to be cleaned up.

git clean -f

I suggest setting the config file's mergetool.keepBackup to false. Several times I've accidentally added the backup files to a commit.
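
Changing it is one command (this writes to your global ~/.gitconfig):

git config --global mergetool.keepBackup false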

Last Word

Git's powerful, at times confusing, always complex, and frequently complicated. Hopefully this article has given you a solid foundation in Git's basics.
